I found very similar workaround code for this bug in Chromium,
with the following comment:
// Workaround for Mac driver bug. In the large scheme of things setting
// glTexParamter twice for glGenerateMipmap is probably not a lage performance
// hit so there's probably no need to make this conditional. The bug appears
// to be that if the filtering mode is set to something that doesn't require
// mipmaps for rendering, or is never set to something other than the default,
// then glGenerateMipmap misbehaves.
Going back all the way to the point at which this code was written,
it is indeed true; we called glGenerateMipmap(), and only afterwards
set the mode to GL_LINEAR_MIPMAP_NEAREST. Since then, the code has been
reorganized and moved around a lot, and now we set the mode long before
the first call to glGenerateMipmap(), so we can retire the hack;
simply generate mipmaps on demand, and that's the end of it. I tested
with the Mesa 8.0.x version where I originally saw this bug, and it passes
flat_input_test without any problems (in fact, it passes all tests except
the ones for deconvolution sharpen, whose shaders are too big for it).
This is nice not only because it gives us a less hacky structure, but also
because GL_GENERATE_MIPMAP is a nightmare for the driver to handle;
several edge conditions are tricky, from what I've been told.
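As a hedged illustration of the resulting structure (not the actual Movit
code; tex is just some texture we own):

  glBindTexture(GL_TEXTURE_2D, tex);
  glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_NEAREST);
  // ... render into or upload to the texture ...
  // Later, only once the contents have changed and a mipmapped sampler
  // will actually be used:
  glGenerateMipmap(GL_TEXTURE_2D);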
Disable OpenGL dithering, just to be on the safe side.
I don't think any modern OpenGL implementation actually heeds this
flag for 8-bit rendering, but it doesn't hurt to be careful nevertheless.
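For reference, this is just the standard call (GL_DITHER is enabled by
default in OpenGL):

  glDisable(GL_DITHER);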
Round explicitly after dithering, for GPUs that don't do it properly themselves.
This was causing failures in the DitherEffect unit test on both
ATI and nVidia GPUs; Intel also rounds somewhat inaccurately, but much
less so, so the extra code won't be activated there.
I think this might be driver-dependent, but we will detect it correctly
in any case.
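A rough CPU-side illustration of the extra step (the real fix lives in the
dither fragment shader, so take this only as a sketch of the math):

  #include <math.h>

  // Snap a normalized [0, 1] value to the nearest 8-bit level explicitly,
  // instead of trusting the GPU's float-to-integer conversion to round.
  float quantize_to_8bit(float x)
  {
      return floorf(x * 255.0f + 0.5f) / 255.0f;
  }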
- GL_FLOAT FlatInput is primarily used for tests, and even more importantly,
  mostly accuracy tests. ATI's drivers appear to do the fp32 -> fp16
  conversion wrong (truncating instead of rounding), which breaks some of
  these tests.
- In case someone _would_ use GL_FLOAT inputs, they'd probably be updated
  every frame anyway, so the fp32 -> fp16 conversion step (probably on the
  CPU) would negate any performance benefit from fp16 sampling.
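As a hedged sketch of what that means for the upload path (illustrative
only, not the actual FlatInput code; width, height and data are assumed):

  // Keep GL_FLOAT data in full fp32 on the GPU instead of downconverting
  // to fp16, so no (potentially truncating) fp32 -> fp16 conversion
  // happens on upload.
  glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F,   // rather than GL_RGBA16F
               width, height, 0, GL_RGBA, GL_FLOAT, data);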
Turn off .dot and .frag file generation during unit tests.
Even though the output is sometimes useful in effect_chain_test,
disabling it makes the test suite literally five times as fast on my
machine, since it avoids all the disk I/O. We can always turn it on again
when debugging specific issues.
This is mainly to reduce the number of substantially identical shaders
we have to keep around and compile; even though two chains may be
different, some phase (and very often a large one at that) will
frequently be similar. However, in the old system, since effects had
global IDs, a change in an earlier phase would displace the identifiers
in a later one, making the shader uncacheable.
Note that this means an effect can now have multiple effect_ids
(since it could already be part of multiple phases). This is why we
can't simply keep a single, phase-locally assigned effect_id on the
node; it really needs to be able to differ between phases.
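A minimal sketch of the idea (names are illustrative, not Movit's actual
ones): number the effects relative to the phase being compiled, so that
identical phases generate byte-for-byte identical GLSL no matter what
comes earlier in the chain.

  // Hypothetical helper; Phase and Node stand in for Movit's internal types.
  std::map<Node *, int> number_effects_in_phase(const Phase *phase)
  {
      std::map<Node *, int> effect_ids;  // Phase-local; rebuilt for every phase.
      int next_id = 0;
      for (unsigned i = 0; i < phase->effects.size(); ++i) {
          effect_ids[phase->effects[i]] = next_id++;  // e.g. becomes eff0, eff1, ... in the GLSL.
      }
      return effect_ids;
  }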
Add a shared ResourcePool to share resources between EffectChains.
Right now, in this first implementation, we only share shaders,
which mainly saves compile time when multiple EffectChains are
similar (e.g., many clips that are all modified by white balance only).
However, in the future, the plan is for them to also be able to
share temporary textures.
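A hedged sketch of the shader-sharing part (the real ResourcePool interface
may differ; compile_and_link() stands in for the usual compile/attach/link
boilerplate):

  class ResourcePool {
  public:
      // Returns a program for this exact GLSL source, compiling it only
      // the first time any EffectChain asks for it.
      GLuint get_glsl_program(const std::string &vert_src, const std::string &frag_src)
      {
          const std::string key = vert_src + '\0' + frag_src;
          std::map<std::string, GLuint>::iterator it = programs.find(key);
          if (it != programs.end()) {
              return it->second;
          }
          GLuint program = compile_and_link(vert_src, frag_src);
          programs.insert(std::make_pair(key, program));
          return program;
      }

  private:
      GLuint compile_and_link(const std::string &vert_src, const std::string &frag_src);
      std::map<std::string, GLuint> programs;  // Shared between all chains using this pool.
  };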
This isn't required anymore after Gamma{Compression,Expansion}Effect
stopped using it; and after seeing how easy it was to mis-sample them
with regard to texture coordinates, I'm honestly not so sure anymore
that they were a good idea in the first place.
Creating a PBO to hold the texture data just before upload (like we
did before this patch) is pointless; if that helped at all, the driver
could just do it itself. Instead, we expose the PBOs to the application
(in such a way that applications that don't care can continue to use
the simple interface). This means that a client that needs to do
e.g. a fade can optimize its texture uploads with a process like this:
1. Decode frame from input A.
2. Upload frame from input A to GPU (by putting it into a PBO).
Texture upload starts in the background.
3. Decode frame from input B.
4. Upload frame from input B to GPU. (This time, there will be
no parallelism, though.)
5. Render.
With correct use of ping-ponging PBOs, it is also possible to overlap
steps 4 and 5 with operations from the _next_ frame in the fade.
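A hedged sketch of what step 2 looks like in raw OpenGL terms (assuming an
8-bit BGRA frame already decoded into decoded_frame, and an existing
texture tex of the right size; this is not Movit's actual interface):

  GLuint pbo;
  glGenBuffers(1, &pbo);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
  glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, NULL, GL_STREAM_DRAW);
  void *ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
  memcpy(ptr, decoded_frame, width * height * 4);   // Step 2: frame goes into the PBO.
  glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

  // ... step 3: decode input B; the upload can proceed in the background ...

  glBindTexture(GL_TEXTURE_2D, tex);
  glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                  GL_BGRA, GL_UNSIGNED_BYTE, NULL);  // Reads from the bound PBO at offset 0.
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);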
More information can be found in this presentation:
Implement GammaCompressionEffect using ALU ops instead of texture lookups.
This is the same change as for GammaExpansionEffect; however, the numerical
issues are much, much bigger, and it took forever to actually get the
coefficients right for proper accuracy.
Added accuracy tests to GammaCompressionEffect to make sure things are
fine; unsurprisingly, just as with GammaExpansionEffect, the old texture
code wasn't all that accurate. The errors were not as bad, though;
the inaccuracy primarily affects 10- and 12-bit processing, and Movit
isn't ready for 12-bit accuracy-wise.
The benchmark wins are huge; in a chain with effectively only a
GammaExpansionEffect and a GammaCompressionEffect, FPS on Sandy Bridge
(HD3000) goes up ~80%. Sampling that 4096x1 texture was seemingly
really expensive.
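For reference, the sRGB case of the curve being approximated (linear to
gamma-encoded); the actual shader replaces the pow() here with a fitted
polynomial, whose coefficients are not reproduced in this sketch.

  #include <math.h>

  float srgb_compress(float x)  // x is linear light in [0, 1].
  {
      if (x <= 0.0031308f) {
          return 12.92f * x;
      } else {
          return 1.055f * powf(x, 1.0f / 2.4f) - 0.055f;
      }
  }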
After all the worry about accuracy in the polynomial expansion,
I figured it was time to actually add tests verifying the error
bounds here in a more reasonable way. The limits are set somewhat
arbitrarily, more or less to what the new code already handles,
rounded up a bit. (The old texture-based code was _way_ worse than this,
it seems, probably due to texel center issues.)
This also shows that you can probably do 10-bit processing with
Movit without losing too much accuracy, but that 12-bit is too much
for fp16 to handle (near 1.0, fp16 can only resolve steps of about
1/2048, which is coarser than the 1/4096 a 12-bit pipeline needs).
Implement GammaExpansionEffect using ALU ops instead of texture lookups.
In a standalone benchmark (on a Sandy Bridge laptop), this is pretty much
a no-op performance-wise, but when more ops are put into the mix, it's
a ~20% FPS win, and in a more realistic situation with multiple inputs
etc., it's subjectively also a pretty clear win. The reason is probably
that we are generally way overloaded on texture operations.
Note that we had similar code before (before we started using the
texture for lookup), but it used pow(), which is markedly slower than our
fourth-degree polynomial approximation.
We should probably do the same for GammaCompressionEffect.
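For reference, the sRGB case of the curve in question (gamma-encoded to
linear); the shader replaces the pow() here with the fourth-degree
polynomial approximation mentioned above.

  #include <math.h>

  float srgb_expand(float x)  // x is a gamma-encoded value in [0, 1].
  {
      if (x <= 0.04045f) {
          return x / 12.92f;
      } else {
          return powf((x + 0.055f) / 1.055f, 2.4f);
      }
  }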
Calculate the RGB-to-XYZ matrix ourselves instead of using a “magic” one from Wikipedia.
Generally we try to derive such values from first principles when possible;
this is some of the oldest code in Movit, which explains why it had been
overlooked.
Maybe longer-term we should move this out of ColorspaceConversionEffect into a
free function in util.h or something similar, but for now, this will do.
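A hedged sketch of the derivation for Rec. 709 primaries and a D65 white
point, using Eigen (not the actual ColorspaceConversionEffect code): each
primary's XYZ column, normalized to Y = 1, is scaled so that RGB = (1, 1, 1)
lands exactly on the white point.

  #include <Eigen/Dense>

  Eigen::Matrix3d rgb_to_xyz_matrix()
  {
      // xy chromaticities for R, G, B and the D65 white point.
      const double xr = 0.640, yr = 0.330;
      const double xg = 0.300, yg = 0.600;
      const double xb = 0.150, yb = 0.060;
      const double xw = 0.3127, yw = 0.3290;

      // Columns are the primaries' XYZ values, each normalized to Y = 1.
      Eigen::Matrix3d m;
      m << xr / yr,              xg / yg,              xb / yb,
           1.0,                  1.0,                  1.0,
           (1.0 - xr - yr) / yr, (1.0 - xg - yg) / yg, (1.0 - xb - yb) / yb;

      // Per-primary scale factors that make the columns sum to the white point.
      Eigen::Vector3d white(xw / yw, 1.0, (1.0 - xw - yw) / yw);
      Eigen::Vector3d scale = m.inverse() * white;

      return m * scale.asDiagonal();
  }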
Fix a minor error in one of the color temperature constants.
Wikipedia cites http://icpr.snu.ac.kr/resource/wop.pdf/J01/2002/041/R06/J012002041R060865.pdf,
but seems to differ in the last decimal of one of the coefficients.
The error is, of course, inconsequential, but it's good to be correct
in any case, and I trust the paper more.
This fixes an issue where reusing the same EffectChain from two different GL
contexts would cause errors even if the contexts are shared, because FBOs
cannot be shared between contexts (in the ARB extension, anyway).
There might be some negative performance implications to this, but I was unable
to measure any on Intel/Mesa. If this should prove to be a problem in the future,
we could maybe keep one around per context, or have the caller invalidate the
FBO for us.
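A hedged sketch of the safer pattern (illustrative only; output_tex stands
in for the phase's output texture): since FBO names are per-context, create
the framebuffer in whatever context is current instead of caching one in
the chain.

  GLuint fbo;
  glGenFramebuffers(1, &fbo);
  glBindFramebuffer(GL_FRAMEBUFFER, fbo);
  glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                         GL_TEXTURE_2D, output_tex, 0);
  // ... render the phase ...
  glBindFramebuffer(GL_FRAMEBUFFER, 0);
  glDeleteFramebuffers(1, &fbo);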
Fix a bug where DeconvolutionSharpenEffect would forget one line of the kernel.
This manifested itself as non-identity filters changing the alpha of solid
images (since the kernel didn't sum to one), which obviously isn't good.
Added a unit test to make sure it doesn't happen again.
The intended use is for overlays, where you'd want to do e.g.
0.2*x atop y instead of just x atop y, fading the overlay in or out.
Also, give it full RGBA inputs, as that might be useful for someone.
It certainly was useful for adapting it to continue to be used in the
EffectChain unit test, at least. (It doesn't have its own unit test,
since it's so trivial.)
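As a hedged illustration of the intended math (plain C++, not the actual
shader code): scale the premultiplied overlay by the constant factor, then
apply the standard Porter-Duff "atop" operator, so factor = 0.2 gives
"0.2*x atop y".

  struct RGBA { float r, g, b, a; };  // Premultiplied alpha.

  RGBA faded_atop(RGBA x, RGBA y, float factor)
  {
      // Scaling a premultiplied pixel scales both its color and its coverage.
      x.r *= factor; x.g *= factor; x.b *= factor; x.a *= factor;

      RGBA out;
      out.r = x.r * y.a + y.r * (1.0f - x.a);
      out.g = x.g * y.a + y.g * (1.0f - x.a);
      out.b = x.b * y.a + y.b * (1.0f - x.a);
      out.a = y.a;  // "atop" keeps the background's coverage.
      return out;
  }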
Fix a bug where PaddingEffect could create assertion errors.
The rationale is explained in the comment, but in short, PaddingEffect
could convert blank alpha to premultiplied alpha without realizing that
it then needed linear light (since premultiplied alpha in our case
is defined as being in linear light). Also added a unit test.
Clip below-zero (out-of-gamut) colors in LiftGammaGainEffect.
pow(x, y) for x < 0 is undefined, and behaves differently on nVidia
and Intel. This can reasonably happen with inputs from a different gamut,
or just with a complex chain before the LGG effect; there's nothing
in Movit that prohibits out-of-sRGB-gamut colors. Thus, we need to detect
such values and clamp them.
We could in theory check for the special case of y=1 (no gamma change)
and allow negative values through in that case, but that doesn't seem
like a good solution, especially when animating gamma.
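A minimal sketch of the guard (the real code is GLSL in the effect's
shader; shown here as plain C++):

  #include <math.h>

  float lgg_apply_gamma(float x, float inv_gamma)
  {
      x = fmaxf(x, 0.0f);         // Clip out-of-gamut (below-zero) values.
      return powf(x, inv_gamma);  // Now well-defined on every GPU.
  }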
We don't use anything fancy from autoconf yet (in particular,
no config.h file), and we don't use automake or libtool.
Most likely, this will happen later.
Add a new alpha handling method, INPUT_PREMULTIPLIED_ALPHA_KEEP_BLANK.
This should fix most of the problems where you only process inputs
without alpha, but the framework still inserts an AlphaDivisionEffect
at the end because it loses track of the blank alpha state throughout
the pipeline and the output expects postmultiplied alpha.
There's one important case left, though: MixEffect will always reset
the alpha state to premultiplied. We should probably fix that too
at some later stage.
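As a hedged sketch (the exact Movit API may differ slightly): an effect
that only touches color, never alpha, can declare the new handling so that
a known-blank alpha state survives it.

  class ColorOnlyEffect : public Effect {
  public:
      // Tell EffectChain that we expect premultiplied alpha, but that a
      // known-blank (all-opaque) alpha state stays blank through us, so no
      // AlphaDivisionEffect is needed at the output on our account.
      virtual AlphaHandling alpha_handling() const {
          return INPUT_PREMULTIPLIED_ALPHA_KEEP_BLANK;
      }
      // ... the rest of the effect ...
  };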
In resizing effects, add the notion of a “virtual output size”.
This is the size the effect wants to present to the next effect in the
chain, which is useful if you e.g. have a blur followed by padding.
Even though the blur might rescale the image down from e.g. 512x512 to
128x128 before blurring (in the case of a large blur), and this smaller
size is what actually comes out in the texture at the other end, the
result still wants to be treated as a 512x512 image when adding padding.
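A hedged sketch for a blur that internally downscales 512x512 to 128x128
(the signature is illustrative, not necessarily Movit's exact one):

  void get_output_size(unsigned *width, unsigned *height,
                       unsigned *virtual_width, unsigned *virtual_height) const
  {
      *width = 128;            // What actually comes out in the texture.
      *height = 128;
      *virtual_width = 512;    // What the next effect (e.g. padding) should
      *virtual_height = 512;   // treat the output as.
  }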
Fix a bug where intermediate phase outputs could get too low height.
Basically our aspect ratio adjustment and lack of proper roundoff
could conspire to let e.g. mix(1280x720, 939x939) = 1669x938,
which is obviously wrong. I could probably have fixed it with
an extra lrintf(), but it seems more robust to simply avoid the
extra conversion (i.e., never convert height -> width -> height).
ColorspaceConversionEffect, DitherEffect, GammaExpansionEffect and GammaCompressionEffect
are all supposed to be used by EffectChain only, so make them private; I've had
reports of users trying to use these directly, leaving the framework in a confused state.