Support switching Y'CbCr coefficients midway, which will allow doing the Right Thing(TM) (BT.601 when you can for greater stream compatibility, BT.709 when you must for HDMI/SDI output) automatically.
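As a rough sketch of what the automatic choice amounts to (the struct and function here are hypothetical illustrations, not Nageru's actual interface; only the luma coefficients themselves are the standard values from the two specs):

    // Hypothetical illustration, not the actual Nageru interface.
    struct YCbCrCoefficients {
        float kr, kb;  // kg follows as 1 - kr - kb.
    };

    // Standard luma coefficients from the respective specs.
    constexpr YCbCrCoefficients BT601 { 0.299f,  0.114f  };
    constexpr YCbCrCoefficients BT709 { 0.2126f, 0.0722f };

    // BT.601 when you can (stream compatibility), BT.709 when the
    // HDMI/SDI sink requires it.
    YCbCrCoefficients choose_coefficients(bool hdmi_sdi_output) {
        return hdmi_sdi_output ? BT709 : BT601;
    }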
Display a copy of the Y'CbCr images instead of an RGB565 copy.
This is higher-quality (the 16-bit artifacts were getting rather
annoying), more true to what's actually being output, _and_ higher-performance
(well, at least lower memory bandwidth; I haven't benchmarked in practice),
since we can use multi-output to make extra copies on-the-fly when writing
instead of doing it explicitly. Sample calculation for a 1280x720 image; let's
say it is one megapixel for ease of calculation:
  GL_565: 2.0 MB written (565 texture) + 2.0 MB read during display = 4.0 MB used
  Y'CbCr: 1.5 MB written (1.0 MB Y' texture + 0.5 MB half-res dual-channel
          CbCr texture) + 1.5 MB read during display = 3.0 MB used
We could have reused the full-resolution CbCr texture, saving the 0.5 MB
write, but that would make the readback 3 MB instead of 1.5 MB, so it's a net loss.
Ideally, we'd avoid the copies altogether, cutting the writes away
and getting to 1.5 MB, but interactions with VA-API zerocopy seemingly
made that impossible.
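For illustration, the multi-output setup looks roughly like this (the GL calls are real, but the surrounding structure and names are hypothetical; the half-res CbCr texture would still need its own subsampling pass, since an FBO renders over the common area of its attachments):

    #include <epoxy/gl.h>

    // Sketch: one FBO with the Y' and CbCr textures as two color
    // attachments, so the conversion shader writes both in one pass.
    GLuint setup_ycbcr_fbo(GLuint y_tex, GLuint cbcr_tex) {
        GLuint fbo;
        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                               GL_TEXTURE_2D, y_tex, 0);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT1,
                               GL_TEXTURE_2D, cbcr_tex, 0);
        const GLenum bufs[] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1 };
        glDrawBuffers(2, bufs);
        return fbo;
    }

    // The fragment shader then declares two outputs:
    //   layout(location = 0) out float y;
    //   layout(location = 1) out vec2 cbcr;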
Instead of specifying that frame N always uses surface N % 16,
we allocate dynamically from a pool. This both makes a lot more
sense and allows us to hold on to surfaces for other reasons
(like that we want to render _from_ them) in a future patch.
This also necessitated explicit usage tracking of reference frames
in order to avoid display corruption (you can't reuse a surface
before its dependent frames are also done rendering); I'm unsure
whether this was actually correct before, but it's possible that the
implicit serialization ensured it was, since I've run the existing
code pretty hard without seeing reference frame corruption.
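In condensed form, the pool plus the usage tracking might look like this (VASurfaceID is the real VA-API handle type; everything else is a hypothetical sketch, not the actual implementation):

    #include <cstdlib>
    #include <map>
    #include <vector>

    using VASurfaceID = unsigned int;  // Really from <va/va.h>.

    // Sketch of a dynamically allocated surface pool with explicit
    // reference tracking.
    class SurfacePool {
    public:
        explicit SurfacePool(const std::vector<VASurfaceID> &surfaces) {
            for (VASurfaceID s : surfaces) refcounts_[s] = 0;
        }
        VASurfaceID acquire() {
            for (auto &entry : refcounts_) {
                if (entry.second == 0) { entry.second = 1; return entry.first; }
            }
            abort();  // Exhausted; real code would grow the pool or block.
        }
        // Taken while a frame still needs the surface as a reference.
        void add_ref(VASurfaceID s) { ++refcounts_.at(s); }
        // A surface only becomes reusable once every frame depending
        // on it (including as a reference frame) has released it.
        void release(VASurfaceID s) { --refcounts_.at(s); }
    private:
        std::map<VASurfaceID, int> refcounts_;
    };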
Fix an issue with the correction factor locking to 0.95.
Basically, if the master card lost or corrupted a frame, we didn't
set a timestamp on it, leaving it at steady_clock::time_point::min().
This would in turn cause us to assume a latency of trillions of
seconds, throwing off the filter and essentially locking it at 0.95
forever.
The fix is twofold: we always set timestamps, but we also make
ourselves robust to ones that are way off (negative uptime).
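A minimal sketch of the two halves (hypothetical names): stamp every frame, including the lost/corrupted ones, and reject implausible timestamps on the consumer side:

    #include <chrono>

    using std::chrono::steady_clock;

    struct Frame { steady_clock::time_point received_timestamp; };

    // 1) Always stamp, even on the drop/corruption path, so nothing
    //    is left at steady_clock::time_point::min().
    void stamp(Frame *frame) {
        frame->received_timestamp = steady_clock::now();
    }

    // 2) Be robust anyway: a timestamp from before program start
    //    would imply a negative uptime, so treat it as invalid.
    bool plausible(steady_clock::time_point t,
                   steady_clock::time_point program_start) {
        return t >= program_start && t <= steady_clock::now();
    }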
It turns out I've been misunderstanding parts of Fons' paper;
my estimation is different, and although it works surprisingly well
for something that's hardly supposed to work at all, it has some
significant problems with edge cases like frame rates being _nearly_
matched (e.g. 59.94 Hz input on a 60 Hz output); the estimated delay
under the old algorithm will be a very slow sawtooth, which isn't
nice even after being passed through the filter.
The new algorithm probably still isn't 100% identical to zita-ajbridge,
but it should be much closer to how the algorithm is intended to work.
In particular, it makes a real attempt to account for the fact that
an output frame can arrive between two input frames in time; this
makes it dependent on the system clock, but that's really the core
that was missing from the algorithm, so it's more a feature than a bug.
I've made some real attempts at making all the received timestamps
more stable; FakeCapture is still a bit odd (especially at startup),
since it delivers frames late instead of dropping them, but it
generally seems to work OK. For cases of frame rate mismatch (even
pretty benign ones), the correction factor seems to be about two
orders of magnitude more stable, i.e., its maximum deviation from 1.0
during normal operation is greatly reduced.
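For a flavor of what the correction involves, here is a much-simplified sketch (my own illustration; not the actual ResamplingQueue code, and cruder than zita-ajbridge; all constants are illustrative): the wall clock positions the output frame between input frames, the resulting delay error goes through a first-order filter, and the filtered error nudges the clamped correction factor:

    #include <chrono>

    using std::chrono::duration;
    using std::chrono::steady_clock;

    // How far into the current input frame are we, in frames? This is
    // the "output frame between two input frames" part, and the reason
    // the algorithm depends on the system clock.
    double fractional_input_pos(steady_clock::time_point last_input,
                                steady_clock::time_point now,
                                double input_frame_duration_sec) {
        return duration<double>(now - last_input).count() /
               input_frame_duration_sec;
    }

    // Sketch of the correction-factor filter.
    class CorrectionFilter {
    public:
        // err_samples: measured queue delay minus desired delay,
        // in input samples, measured at each output frame.
        double update(double err_samples) {
            filtered_err_ += alpha * (err_samples - filtered_err_);
            double ratio = 1.0 - beta * filtered_err_;
            // The clamp whose lower edge the old bug pinned us to:
            // a bogus timestamp made err_samples astronomically large.
            if (ratio < 0.95) ratio = 0.95;
            if (ratio > 1.05) ratio = 1.05;
            return ratio;
        }
    private:
        double filtered_err_ = 0.0;
        static constexpr double alpha = 0.02;  // Filter coefficient.
        static constexpr double beta = 1e-4;   // Proportional gain.
    };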
When switching output cards, do it from the mixer thread.
This is a _lot_ easier to reason about (and also much more stable
in practice), but we're still having some issues with delays
on disabling video input.
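The usual shape of this (a generic sketch, not the actual code) is to marshal the switch onto the mixer thread as a queued task, so all output-card state is touched from one thread only:

    #include <functional>
    #include <mutex>
    #include <queue>

    std::mutex task_mu;
    std::queue<std::function<void()>> mixer_thread_tasks;

    // Any thread: request the switch instead of doing it directly.
    void request_output_card_change(int card_index) {
        std::lock_guard<std::mutex> lock(task_mu);
        mixer_thread_tasks.push([card_index] {
            // Hypothetical: set_output_card(card_index), run on the
            // mixer thread between frames.
            (void)card_index;
        });
    }

    // Mixer thread: drain pending tasks once per frame.
    void run_pending_tasks() {
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> lock(task_mu);
                if (mixer_thread_tasks.empty()) return;
                task = std::move(mixer_thread_tasks.front());
                mixer_thread_tasks.pop();
            }
            task();
        }
    }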
Make the last pointers in CaptureCard into unique_ptr; the amount of manual bookkeeping was getting silly when we have a good solution already in place.
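Illustratively (stand-in type, hypothetical member name):

    #include <memory>

    struct FrameAllocator { /* ... */ };  // Stand-in for the real class.

    struct CaptureCard {
        // Before: FrameAllocator *frame_allocator; plus a manual
        // delete in every teardown path.
        // After: ownership is explicit and cleanup is automatic.
        std::unique_ptr<FrameAllocator> frame_allocator;
    };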
This is pretty raw still; audio isn't tested much, there's no
documentation, hardcoded 720p60 and no GUI control. But most
of the basic ideas are in place, so it should be a reasonable
base to build on.
Creating a context etc. can seemingly take 70+ ms, and letting the
capture cards insert frames freely during that time confuses ResamplingQueue
(plus probably our video queue length policy). Delay letting them do so
until we are actually ready to process output frames.
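Sketch of the gating (hypothetical names): frames are simply dropped until the mixer declares itself ready:

    #include <atomic>

    std::atomic<bool> ready_to_process{false};

    // Capture thread(s): called for each incoming frame.
    void frame_callback(/* const Frame &frame */) {
        if (!ready_to_process.load()) {
            return;  // Still setting up the context; drop the frame.
        }
        // ... insert into ResamplingQueue / video queue as usual ...
    }

    // Mixer thread: flip the switch once output processing starts.
    void start_processing() { ready_to_process.store(true); }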
Wait for frames in render order, not QuickSync order.
Now that we have x264 and uncompressed outputs, there's no need
to add extra latency for these cases, and the gain from going in
QuickSync order is rather marginal anyway, as long as GPUs don't
render out-of-order.
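A sketch of waiting in render order with OpenGL fences (the GL calls are real; the surrounding structure is hypothetical): each submitted frame carries a fence, and the consumer always waits on the oldest frame first:

    #include <cstdint>
    #include <deque>
    #include <epoxy/gl.h>

    struct PendingFrame {
        GLsync fence;  // Inserted right after the frame's draw calls.
        int64_t pts;
    };

    std::deque<PendingFrame> pending_frames;  // Kept in render order.

    // Producer: after rendering a frame.
    void submit_frame(int64_t pts) {
        GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        pending_frames.push_back(PendingFrame{fence, pts});
    }

    // Consumer: always take the front, i.e., wait in render order,
    // not in whichever order QuickSync happens to finish.
    PendingFrame wait_for_next_frame() {
        PendingFrame frame = pending_frames.front();
        pending_frames.pop_front();
        glClientWaitSync(frame.fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                         GL_TIMEOUT_IGNORED);
        glDeleteSync(frame.fence);
        return frame;
    }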