This is similar to what we had earlier to just reuse the VAO,
but now with correct bindings no matter what vertex attributes
are assigned to what index, so that the (new) test passes.
This is actually slightly more efficient than what we had before,
since we don't look up the attributes by text anymore, and don't
reupload the VBO for each frame anymore. In practice, the effects
should be small.
The patch triggers a bug where, if the first phase doesn't need texture
coordinates, the rest of the phases don't get it either. (Or more generally,
if the vertex shader varying indices are not predictable, the patch does
the wrong thing.) Add a unit test and revert it for now; in time, we'll find a
way that's both low-overhead (the patch fixes a real problem) _and_ correct in
these cases.
I tried a few different things before I finally settled on this, in particular
Weston's 3-field deinterlacer (w3fdif). It's not perfect (see .h comments),
but it works overall pretty well.
We multiplied by 224/219 once too many, causing some small accuracy issues.
Furthermore, we also did this for full-range Y'CbCr, which obviously is wrong.
The issue was so small that the unit tests kept on passing (its investigation
was prompted by a test that failed on AMD cards, which is a separate issue).
After this, the Rec. 601 matrices match Wikipedia exactly, both for limited
and full range. Added unit tests for this.
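For reference, a minimal sketch of where the 219 and 224 factors come from
in the 8-bit Rec. 601 conversion. The struct and function names are made up
for illustration and are not the library's API; the full-range chroma offset
of 128/255 follows the JFIF convention, which is an assumption here.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type; values are normalized so that 1.0 == code 255.
struct YCbCr { double y, cb, cr; };

// Rec. 601 luma coefficients.
const double kr = 0.299;
const double kb = 0.114;
const double kg = 1.0 - kr - kb;

// Converts gamma-encoded R'G'B' in [0,1] to 8-bit-style Y'CbCr in [0,1].
YCbCr rgb_to_ycbcr_601(double r, double g, double b, bool full_range)
{
    double y = kr * r + kg * g + kb * b;        // luma, in [0,1]
    double cb = (b - y) / (2.0 * (1.0 - kb));   // chroma, in [-0.5,0.5]
    double cr = (r - y) / (2.0 * (1.0 - kr));

    if (full_range) {
        // Full range: no 219/224 scaling at all, only the chroma offset
        // (assumed to be 128/255, as in JFIF).
        return { y, cb + 128.0 / 255.0, cr + 128.0 / 255.0 };
    }
    // Limited range: luma spans 219 codes (16..235), chroma 224 (16..240).
    // Each factor must be applied exactly once.
    return { (16.0 + 219.0 * y) / 255.0,
             (128.0 + 224.0 * cb) / 255.0,
             (128.0 + 224.0 * cr) / 255.0 };
}
```

With this, limited-range white maps to Y' = 235/255 and pure blue to
Cb = 240/255, while full-range white stays at Y' = 1.0.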
Evidently ATI drivers use the freedom the standard gives them to assign
these in another order than they are specified in the shader source,
so we need to explicitly bind them, or YCbCrConversionEffectTest will fail
in the multi-output tests.
This was never intended to be there, and we don't install headers for it
(so no API/ABI break); it is actively harmful because it has a static
ResourcePool, which we attempt to destroy during shutdown (which causes
use of uninitialized memory as we try to get the current context).
Make get_current_context_identifier() understand EGL.
If we're using EGL and not GLX (typically because we're using GLES,
but also increasingly with desktop GL), we'd always return NULL.
This could cause FBOs to be confused between contexts.
The intended use case is to have Y'CbCr for encoding output but keep
RGBA around for easier preview. This causes a few effects to need to
send arrays around; it's a bit ugly to special-case them like this,
but I'm concerned about going generic wrt. how well various shader
compilers would optimize if we went full multi-model everywhere
(without having tested, though).
In practice, we haven't _actually_ supported this since we used integers
in ResampleEffect (and ResampleEffect is a pretty central effect),
so let's be honest with ourselves. (Also, we will soon start using arrays
in some cases, which are cumbersome pre-1.30.) I don't know of any drivers
that support all the other stuff we want but not GLSL 1.30 anyway;
it came with OpenGL 3.0, in 2008.
This actually isn't an ABI break, at least not on the C++ level.
In ycbcr_conversion_effect_test, use a non-float framebuffer.
This way, we let the card convert float-to-int, which we have reasonable
control over, as opposed to glReadPixels(), which is rather unpredictable.
Fixes unit test failures on Broadwell on Linux (Mesa 10.1).
In ResampleEffect, precompute the Lanczos function into a table.
A 2048-element table (with linear interpolation between the elements)
is seemingly enough to get down to beyond float epsilon, and this
saves a lot of CPU time when computing large filter kernels.
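A rough sketch of the table idea, assuming a normalized sinc and a table
that covers [0, radius] uniformly. The class name LanczosTable and the
helper lanczos_direct() are hypothetical, and the sketch makes no attempt
to reproduce the precision figure above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double kPi = 3.14159265358979323846;

// Normalized sinc: sin(pi x) / (pi x), with the removable singularity at 0.
double sinc(double x)
{
    if (std::fabs(x) < 1e-8) return 1.0;
    return std::sin(kPi * x) / (kPi * x);
}

// Direct (slow) evaluation of the Lanczos kernel with the given radius.
double lanczos_direct(double x, double radius)
{
    if (std::fabs(x) > radius) return 0.0;
    return sinc(x) * sinc(x / radius);
}

// Precomputed kernel; size must be at least 2.
class LanczosTable {
public:
    explicit LanczosTable(double radius, size_t size = 2048)
        : radius_(radius), table_(size)
    {
        for (size_t i = 0; i < size; ++i) {
            table_[i] = lanczos_direct(radius * i / (size - 1), radius);
        }
    }

    // Linear interpolation between the two nearest table entries.
    double lookup(double x) const
    {
        x = std::fabs(x);  // The kernel is symmetric.
        if (x >= radius_) return 0.0;
        double pos = x / radius_ * (table_.size() - 1);
        // Clamp so that i + 1 stays in range even if pos rounds up.
        size_t i = std::min(static_cast<size_t>(pos), table_.size() - 2);
        double frac = pos - i;
        return table_[i] + frac * (table_[i + 1] - table_[i]);
    }

private:
    double radius_;
    std::vector<double> table_;
};
```

The table is built once per radius; each weight then costs one multiply-add
instead of two sin() calls.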
Fix a bug where combined fp16 weights would be horribly wrong.
Seemingly weights were always returned as float, and then cast
to fp16_int_t -- without proper conversion! And sum_sq_error
would be calculated based on the correct value, not the
miscast one.
It's a small miracle the unit tests didn't catch this; they didn't
until I started introducing small errors for another reason.
Most real-world testing seems to have hit fp32, and thus this
wasn't caught there either.
Also make fp16_int_t a struct so that it is not implicitly
convertible to/from numeric types, so this never ever can happen again.
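To illustrate the failure mode and the fix (a sketch only; the struct below
is a stand-in with a single hypothetical member, not necessarily the real
layout):

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// With a bare integer typedef, a float converts silently, and the result
// is the truncated numeric value, not the binary16 bit pattern.
typedef uint16_t fp16_as_typedef;

fp16_as_typedef broken_cast(float f)
{
    return static_cast<fp16_as_typedef>(f);  // 1.0f becomes 1, not 0x3C00.
}

// Wrapping the bits in a struct removes all implicit conversions.
struct fp16_int_t {
    uint16_t val;  // Raw IEEE 754 binary16 bits.
};

// The binary16 encoding of 1.0 (sign 0, biased exponent 15, mantissa 0).
const fp16_int_t fp16_one = { 0x3C00 };

static_assert(std::is_convertible<float, fp16_as_typedef>::value,
              "the typedef accepts floats without complaint");
static_assert(!std::is_convertible<float, fp16_int_t>::value,
              "float must not silently become fp16_int_t");
static_assert(!std::is_convertible<fp16_int_t, float>::value,
              "fp16_int_t must not silently become float");
```

With the struct, writing `fp16_int_t x = some_float;` is a compile error,
so every conversion has to go through an explicit function.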
In ResampleEffect, be more aggressive about giving up on saving bilinear samples.
It turns out that for some kinds of loads, we can't use bilinearity at all
to our benefit, so we spend almost all of our time trying to go through
each line to see how much we can save. Simply send in the minimum number
so far when doing this evaluation to begin with, which means we'll effectively
short-circuit the entire thing pretty fast once we find a line that can save
nothing.
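The short-circuit can be sketched roughly as below. The combining criterion
(two neighboring weights of the same sign) is a toy assumption, and the
function names are invented for the example; only the capping by the
minimum so far is the point.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Toy criterion (assumption): two neighboring weights can share one
// bilinear sample if they are nonzero and have the same sign.
bool can_combine(float w1, float w2)
{
    return w1 * w2 > 0.0f;
}

// Counts how many samples this line can save, but never counts past
// max_useful; anything beyond the minimum so far cannot change the result.
unsigned count_savable(const std::vector<float> &weights, unsigned max_useful)
{
    unsigned saved = 0;
    size_t i = 0;
    while (saved < max_useful && i + 1 < weights.size()) {
        if (can_combine(weights[i], weights[i + 1])) {
            ++saved;
            i += 2;  // Both weights are consumed by one sample.
        } else {
            ++i;
        }
    }
    return saved;
}

// All lines must use the same number of samples, so the usable saving
// is the minimum over all lines.
unsigned min_savable(const std::vector<std::vector<float>> &lines)
{
    if (lines.empty()) return 0;
    unsigned min_so_far = std::numeric_limits<unsigned>::max();
    for (const std::vector<float> &line : lines) {
        min_so_far = std::min(min_so_far, count_savable(line, min_so_far));
        if (min_so_far == 0) break;  // Nothing left to find out.
    }
    return min_so_far;
}
```

Once any line reports zero savings, every later call would return
immediately anyway, so the loop simply stops.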
For me, this was needed when I wanted to render directly into
VA-API's encoder buffers, which are always top-left origin (and FBOs
are always bottom-left origin).
Add support for Y'CbCr output split between multiple textures.
This is useful primarily for avoiding copies in later stages;
e.g., when rendering directly into a video encoder buffer.
We support both full planar and NV12-style interleaved Cb+Cr.
You still have to subsample chroma yourself, though; we don't
really support chains that diverge except in the final output node
(and changing resolution would definitely need a bounce;
and even worse, one in a non-fp16 intermediate format).
One would think something as mundane as setting a few uniforms wouldn't
really mean much for performance, but seemingly this is not always so --
I had a real-world shader that counted no less than 55 uniforms.
Of course, not all of these were actually used, but we still have to go
through looking up the name etc. for every single one, every single frame.
Thus, we introduce a new way of dealing with uniforms: Register them before
finalization time, and then EffectChain can store their numbers once and
for all, instead of this repeated lookup. The system is also set up such
that we can go to uniform buffer objects (UBOs) in the very near future.
It's a bit unfortunate that uniform declaration now is removed from the
.frag files, where it sat very nicely, but the alternative would be to
try to parse GLSL, which I'm a bit wary of right now. All effects are
converted, leaving the set_uniform_* functions without any users, but
they are kept around for now in case external effects want them.
This seems to bring 1–2% speedup for my use case; hopefully UBOs will
bring a tiny bit more.
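A GL-free sketch of the register-once idea. UniformRegistry and its methods
are invented for this example and are not the actual EffectChain interface;
the two callbacks stand in for glGetUniformLocation() and glUniform1f().

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical record for one registered float uniform.
struct RegisteredUniform {
    std::string name;
    const float *value;  // Owned by the effect; read every frame.
    int location;        // Resolved once; -1 if the shader does not use it.
};

class UniformRegistry {
public:
    // Called by effects before finalization.
    void register_float(const std::string &name, const float *value)
    {
        uniforms_.push_back({ name, value, -1 });
    }

    // Called once at finalization; does the only by-name lookups.
    void finalize(const std::function<int(const std::string &)> &lookup)
    {
        for (RegisteredUniform &u : uniforms_) {
            u.location = lookup(u.name);
        }
    }

    // Called every frame; no string handling, and unused uniforms are skipped.
    void upload_all(const std::function<void(int, float)> &upload) const
    {
        for (const RegisteredUniform &u : uniforms_) {
            if (u.location != -1) upload(u.location, *u.value);
        }
    }

private:
    std::vector<RegisteredUniform> uniforms_;
};
```

Since the registry holds pointers, an effect only has to update its own
member variables between frames; the upload picks up the new values.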
Prepare for better understanding of 10- and 12-bit Y'CbCr.
Seemingly there is trickiness in how to interpret the integer
values that is different from what you'll typically see in R'G'B'
(or just GPUs and TV standards differ on that point as well).
Add an explanatory comment, and add a data member to YCbCrFormat
to prepare for correct 10/12-bit level handling. We'll stay 8-bit
only for now, though, to avoid an API break for existing clients
for no good reason (there's no 10-bit input, really).
Minor optimization in ResampleEffect: Set less GL state.
In particular, if we can avoid it, use glTexSubImage2D instead of glTexImage2D.
This actually has a real effect, at least on Intel/Linux, where the driver seems
to stall on some mappings.
Of course, this only really helps for things like pans, not zooms.
Note that this is an API break; PaddingEffect now behaves differently
from before when it comes to fractional offsets.
But I feel this is more useful; it allows PaddingEffect to be used
more efficiently for moving things smoothly around.
Also add a concept of border offset which moves the border around
without changing the pixels; useful if you want the subpixel placement
to be done by ResampleEffect (put the integral offset into top/left
and then move the border by the fractional amount it missed).
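The split described in the parenthesis could look something like this. The
names are hypothetical, and the sign convention for the border offset is an
assumption; check it against the effect's documentation before relying on it.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: a whole-pixel offset plus the leftover fraction.
struct SplitOffset {
    int integral;         // Goes into the top/left parameter.
    float border_offset;  // Fraction in [0,1) that the integral part missed.
};

SplitOffset split_offset(float offset)
{
    // floor() rather than truncation, so negative offsets also leave a
    // non-negative fraction.
    float whole = std::floor(offset);
    return { static_cast<int>(whole), offset - whole };
}
```

For an offset of 10.25 this gives an integral part of 10 and a border
offset of 0.25; for -1.5 it gives -2 and 0.5.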
The assumption is broken whenever a non-integral top or left parameter
is specified. Instead, make an IntegralResampleEffect that enforces
these parameters to be integers, and then mark it as one-to-one sampling.
Collapse passes more aggressively in the face of size changes.
The motivating chain for this change was a case where we had
a SinglePassResampleEffect (the second half of a ResampleEffect)
feeding into a PaddingEffect, feeding into an OverlayEffect.
Currently, since the two former change output size, we'd bounce
to a temporary texture twice (output size changes would always
cause bounces).
However, this is needlessly conservative. The reason for bouncing
when changing output size is really if you want to get rid of
data by downscaling and then later upsampling, e.g. for a blur.
(It could also be useful for cropping, but we don't really use
that right now; PaddingEffect, which does crop, explicitly checks
the borders anyway to set the border color manually.) But in this case,
we are not downscaling at all, so we could just drop the bounce,
saving tons of texture bandwidth.
Thus, we add yet more parameters that effects can specify; first,
that an effect uses _one-to-one_ sampling; that is, that it
will only use its input as-is without sampling
between texels or outside the border (so the different
interpolation and border behavior will be irrelevant).
(Actually, almost all of our effects fall into this category.)
Second, a flag saying that even if an effect changes size,
it doesn't use virtual sizes (otherwise even a one-to-one effect
would de-facto be sampling between texels). If these flags
are set on the input and the output respectively, we can avoid
the bounce, at least unless there's an effect that's _not_
one-to-one further up the chain.
For my motivating case, this folded eight phases into four,
changing ~16.0 ms into ~10.6 ms rendering time. Seemingly
memory bandwidth is a really precious resource on my laptop's
GPU.
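The bounce rule described above can be sketched as a small predicate. The
struct and field names are invented for the example, and it only models
bounces caused by size changes; any other reason for bouncing is left out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of the flags an effect would report.
struct EffectTraits {
    bool changes_output_size;
    bool sets_virtual_output_size;
    bool one_to_one_sampling;
};

// producer: the effect whose output might need a temporary texture.
// consumers: the effects that would read that output, directly or
// indirectly, if everything were folded into one phase.
bool needs_bounce(const EffectTraits &producer,
                  const std::vector<EffectTraits> &consumers)
{
    if (!producer.changes_output_size) {
        return false;  // No size change, so no size-related bounce.
    }
    if (producer.sets_virtual_output_size) {
        return true;   // Virtual sizes imply sampling between texels.
    }
    for (const EffectTraits &consumer : consumers) {
        if (!consumer.one_to_one_sampling) {
            return true;  // Someone depends on interpolation or borders.
        }
    }
    return false;
}
```

In the motivating chain, the resample pass and the padding both change size
without virtual sizes, and both the padding and the overlay are one-to-one,
so neither bounce is needed under this rule.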
This is mostly theoretical; I've never been able to measure any
sort of real change from this. But according to popular cargo-culting,
it might have an effect since there are fewer edge pixels to shade.
Propagate size correctly across effects that change output size.
When propagating size information between effects in a phase,
we'd forget to check if the effect wanted to change size
and use that information instead of our own heuristics.
Fix that.
This is currently a no-op, since right now we always break a phase
when an effect changes output size, but there are very real situations
where we'd be fine with not doing so, so this patch paves the way
for that.
This is useful for debugging slow chains; it can give information
about which phase takes the most time. Right now there seems to be
~5 ms in one of my test chains that disappears into nothing
(i.e., shows up in the fps counter with vsync off, but not in any
phase), but hopefully we can eventually solve that discrepancy.
Use std::scientific when outputting floats, so we do not get issues with 0.0 being output as 0 (which is an int, which cannot always be implicitly converted to float in GLSL).
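A small sketch of the difference; the function names are made up for the
example, and imbuing the classic locale is an extra precaution assumed
here, not something the change above necessarily does.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>

// Default stream formatting: 0.0f is written without a decimal point,
// which GLSL reads as an integer literal.
std::string output_default(float f)
{
    std::stringstream ss;
    ss.imbue(std::locale::classic());
    ss << f;
    return ss.str();
}

// With std::scientific, the output always carries a decimal point and
// an exponent, so it is unambiguously a float literal.
std::string output_glsl_float(float f)
{
    std::stringstream ss;
    ss.imbue(std::locale::classic());
    ss << std::scientific << f;
    return ss.str();
}
```

Here output_default(0.0f) yields "0", while output_glsl_float(0.0f) yields
a literal with both a decimal point and an exponent.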