The issue was that the preview chains and the live chain often have
different sRGB texture needs, and since they run at the same time,
the shared texture state became a race condition. Solve this by using
a sampler object instead, which overrides the texture state.
This has a variety of uses; most notably to play video, but it can
also be used to feed a CasparCG feed into Nageru. For this reason,
it uses RGBA as a format (for now), not Y'CbCr. There are some
limitations, but hopefully we can figure them out eventually.
This is preliminary work towards supporting frames with alpha.
It generally reduces the dependency on the global --10-bit-input
flag a bit, which is also a good thing.
This again requires compute shaders; my GTX 950 needs a bit under 0.1 ms
to convert a 720p frame from the 16-bit planar representation. It replaces the
flag for 10-bit x264.
v210 is, as far as I understand, pretty much the native format for the DeckLink
cards, but I believe the conversion happens in hardware, so there shouldn't
be any significant speed gains to be had.
This makes recording entirely independent of Quick Sync Video
(or VA-API, if you wish). There's no way of running two separate
x264 encodes, though; you get the same as for the stream.
Support switching Y'CbCr coefficients midway, which will allow doing
the Right Thing(TM) (BT.601 when you can for greater stream compatibility,
BT.709 when you must for HDMI/SDI output) automatically.
Display a copy of the Y'CbCr images instead of an RGB565 copy.
This is both higher-quality (the 16-bit artifacts were getting rather
annoying), more true to what's actually being output, _and_ higher performance
(well, at least lower memory bandwidth; I haven't benchmarked in practice),
since we can use multi-output to make extra copies on-the-fly when writing
instead of doing it explicitly. Sample calculation for a 1280x720 image; let's
say it is one megapixel for ease of calculation:
GL_565: 2 MB written (565 texture), 2 MB read during display = 4 MB used
Y'CbCr: 1.0 + 0.5 MB written (Y' texture plus half-res dual-channel CbCr texture),
same amount read during display = 3 MB used
We could have reused the full-resolution CbCr texture, saving the 0.5 MB
write, but that would make the readback 3 MB instead of 1.5 MB, so it's
a net loss.
Ideally, we'd avoid the copies altogether, cutting the writes away
and getting to 1.5 MB, but interactions with VA-API zerocopy seemingly
made that impossible.
Instead of specifying that frame N always uses surface N % 16,
we allocate dynamically from a pool. This both makes a lot more
sense, and also allows us to hold onto surfaces for other reasons
(like that we want to render _from_ them) in a future patch.
This also necessitated explicit usage tracking of reference frames
in order to avoid display corruption (you can't reuse a surface
before its dependent frames are also done rendering); I'm unsure
if this actually was correct before, but it's possible that the
implicit serialization made sure it actually was, because I've
run the existing code pretty hard before without seeing reference
frame corruption.
Fix an issue with the correction factor locking to 0.95.
Basically, what was happening was that if the master card lost
or corrupted a frame, we didn't set a timestamp on it,
causing it to have steady_clock::time_point::min(). This would
in turn cause us to assume a latency of trillions of seconds,
throwing off the filter and essentially locking it at 0.95 forever.
The fix is twofold; we always set timestamps, but also make
ourselves robust to the ones that are way off (negative uptime).
It turns out I've been misunderstanding parts of Fons' paper;
my estimation is different, and although it works surprisingly well
for something that's hardly supposed to work at all, it has some
significant problems with edge cases where the frame rates _almost_
match (e.g. 59.94 Hz input on a 60 Hz output); the estimated delay
under the old algorithm becomes a very slow sawtooth, which isn't
nice even after being passed through the filter.
The new algorithm probably still isn't 100% identical to zita-ajbridge,
but it should be much closer to how the algorithm is intended to work.
In particular, it makes a real try to understand that an output frame
can arrive between two input frames in time; this makes it dependent
on the system clock, but that's really the core that was missing from
the algorithm, so it's really more a feature than a bug.
I've made some real attempts at making all the received timestamps
more stable; FakeCapture is a bit odd still (especially at startup)
since it has its thing of just doing frames late instead of dropping
them, but it generally seems to work OK. For cases of frame rate
mismatch (even pretty benign ones), the correction rate seems to be
two orders of magnitude more stable, i.e., the maximum difference
from 1.0 during normal operation is greatly reduced.
When switching output cards, do it from the mixer thread.
This is a _lot_ easier to reason about (and also much more stable
in practice), but we're still having some issues with delays
on disabling video input.
Make the last pointers in CaptureCard into unique_ptr; the amount of
manual bookkeeping was getting silly when we have a good solution
already in place.
This is pretty raw still; audio isn't tested much, there's no
documentation, hardcoded 720p60 and no GUI control. But most
of the basic ideas are in place, so it should be a reasonable
base to build on.