Reduce the amount of arithmetic in the BlurEffect shader a bit.
We did additions and subtractions with zero, which is sort of a waste
on scalar architectures. Helps ever so slightly on the demo app on my NVidia
card (3–4%).
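As a rough illustration of the kind of simplification meant (a sketch with made-up names, not the actual BlurEffect code), the generated coordinate expression can simply drop the components that are zero:

    // Illustrative sketch only, not the actual shader generator: emit the
    // texture coordinate for a tap, dropping additions with zero constants,
    // which are wasted instructions on scalar architectures.
    #include <cstdio>
    #include <string>

    std::string emit_tap_coord(float dx, float dy)
    {
            char buf[256];
            if (dx == 0.0f && dy == 0.0f) {
                    return "tc";  // No arithmetic at all.
            } else if (dx == 0.0f) {
                    snprintf(buf, sizeof(buf), "vec2(tc.x, tc.y + %f)", dy);  // One add instead of two.
            } else if (dy == 0.0f) {
                    snprintf(buf, sizeof(buf), "vec2(tc.x + %f, tc.y)", dx);
            } else {
                    snprintf(buf, sizeof(buf), "tc + vec2(%f, %f)", dx, dy);
            }
            return buf;
    }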
Seemingly creating and deleting them is crazy expensive on NVidia
(~3 ms for a create/delete pair), so 6dea8d2 caused a performance
regression at high frame rates. Now we instead keep one around per
context (they cannot be shared), which brings us basically back
to where we were performance-wise.
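A minimal sketch of the pattern, shown with an FBO purely for illustration (the names and the loader header are assumptions, not the actual ResourcePool code):

    // Keep one object alive per context and reuse it, instead of paying the
    // create/delete cost every frame; the objects cannot be shared between
    // contexts, so the cache is keyed on a per-context handle.
    #include <epoxy/gl.h>
    #include <map>

    std::map<void *, GLuint> fbo_per_context;

    GLuint get_fbo_for_context(void *context_handle)
    {
            auto it = fbo_per_context.find(context_handle);
            if (it != fbo_per_context.end()) {
                    return it->second;  // Reuse the existing one.
            }
            GLuint fbo;
            glGenFramebuffers(1, &fbo);
            fbo_per_context[context_handle] = fbo;
            return fbo;
    }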
Make Phase take other Phases as inputs, not Nodes.
This was a refactoring I wanted to do for a while, but actually finding
the right structure was a bit tricky. In the process, the entire phase
generation logic was rewritten, but the separation between compilation
and Phase construction is much cleaner now, and the logic in general
is easier to follow with more use of explicit recursion.
I'm still not 100% happy about what might be overuse of output_node;
we still need to link Phase and Node (the link just goes the other way
now), but I'm not sure we need to use it in all the cases we currently do.
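Roughly, the new shape is the following (a sketch with illustrative member names, not the exact declarations):

    // A Phase now takes other Phases as inputs; the only remaining link back
    // into the graph is output_node, which points the other way (Phase -> Node).
    #include <vector>

    struct Node;

    struct Phase {
            std::vector<Phase *> inputs;   // Phases whose output we read (as textures).
            std::vector<Node *> effects;   // Effects compiled into this phase's shader.
            Node *output_node;             // The node whose output this phase computes.
    };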
A lot of the later commits have been leading up to this, and I finally
got to the point where all the unit tests check out, everything seems
to work (modulo maybe some overflow issues) and we have a model that
matches what people actually expect from convolutions.
Note that this adds a dependency on FFTW3; we could probably have added
our own routines for such small needs, but like with Eigen, calling out to a
library is fine as long as it's of good quality (which FFTW certainly is) and
is widely available.
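For reference, the kind of small job FFTW3 gets used for is along these lines (a hedged sketch; the function name and data layout are made up):

    // Transform a small real-valued kernel to the frequency domain on the CPU.
    // 'out' must have room for height * (width / 2 + 1) complex values.
    // Requires linking against fftw3f.
    #include <fftw3.h>

    void fft_kernel(const float *kernel, int width, int height, fftwf_complex *out)
    {
            float *in = fftwf_alloc_real(width * height);
            for (int i = 0; i < width * height; ++i) {
                    in[i] = kernel[i];  // FFTW wants non-const, FFTW-allocated input.
            }
            fftwf_plan plan = fftwf_plan_dft_r2c_2d(height, width, in, out, FFTW_ESTIMATE);
            fftwf_execute(plan);
            fftwf_destroy_plan(plan);
            fftwf_free(in);
    }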
Revert "Support pad/crop from bottom, not just from the top."
This turned out not to be so useful after all, as we'd like a more
consistent top-left coordinate system, and changes to do that will
obsolete this patch.
Fix a bug where repeated vertical FFTs would reverse the output.
Unfortunately, the tests didn't catch this, as the Repeat test used
an even number of passes (being of size 64), which reversed things
back into place. It now tries a wider range of sizes to make sure
everything is okay.
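A tiny standalone illustration of why an even pass count hid the bug (unrelated to the actual FFT code):

    // A pass that reverses its output is its own inverse, so an even number of
    // passes composes to the identity and the bug becomes invisible to the test.
    #include <algorithm>
    #include <cassert>
    #include <vector>

    int main()
    {
            std::vector<int> data = {1, 2, 3, 4, 5, 6, 7, 8};
            const std::vector<int> orig = data;
            for (int pass = 0; pass < 6; ++pass) {  // Even number of passes, as in the old test.
                    std::reverse(data.begin(), data.end());
            }
            assert(data == orig);  // Reversed back into place; the bug goes unnoticed.
    }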
This tests a few edge cases that are not adequately covered by the
random fp32 tests; in particular, the round-to-even logic had
no test coverage, which is bad.
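One example of such an edge case (for illustration only; the actual test values may differ): fp16 has a 10-bit mantissa, so 1 + 2^-11 lies exactly halfway between 1.0 and 1 + 2^-10, and round-to-nearest-even must pick 1.0 (the even mantissa); a converter that always rounds halfway cases up gets this wrong.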
Formalize the notion of messing with sampler state.
This kills a lot of the assumptions that have been going around,
and should allow us to deal much better with the situation when
we have two or more inputs to an effect (where you basically can't
predict the sampler number used reliably); there's still an edge
case that's documented with a TODO, but this is generally much better.
This allows us to ignore the texture bounce flag when reading from a
FlatInput, and also better handles the case where a YCbCrInput is read
from multiple times (it's now bounced, which should be better for speed,
I think).
The main motivation, however, is to be able to control sampler state
in a somewhat less hackish way in the future.
This not only fixes issues with poor downconversion on ATI, but also
allows us to normalize while being aware of fp16 roundoff issues.
Seems to cut the error roughly in half in the HeavyResampleGetsSumRight
test, which as far as I can see would take us up to 10-bit accuracy.
Use the GL_RED texture format instead of GL_LUMINANCE.
Seemingly GL_LUMINANCE is also deprecated; this actually decreases
support for GLES2 somewhat, but we need GLES3 anyway, so the net
loss shouldn't be too bad.
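For reference, the upload now looks roughly like this (a sketch with illustrative parameters, assuming an OpenGL loader header):

    // GL_LUMINANCE broadcast the value to RGB when sampling; with GL_RED the
    // value ends up in the .r channel only, so the shader reads .r instead.
    #include <epoxy/gl.h>

    void upload_single_channel(GLuint tex, int width, int height, const unsigned char *data)
    {
            glBindTexture(GL_TEXTURE_2D, tex);
            glTexImage2D(GL_TEXTURE_2D, 0, GL_R8, width, height, 0,
                         GL_RED, GL_UNSIGNED_BYTE, data);
    }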
This is a pretty hard API break, but it's probably the last big API
break before 1.0, and some of the names (e.g. Effect, Input, ResourcePool)
are really so generic that they should not be allowed to pollute the global
namespace.
First, make sure we test one individual pass, and that we test it in
fp32. Second, set a limit that's actually grounded in something real,
not just a pretty power of 10.
Normalize the resample weights after bilinear combining.
We introduce a small bit of error in the combining (due to having to
compensate for lack of subpixel sampling precision), so normalize
after it rather than before it. Also, do a second normalization pass,
which seemingly helps sometimes (probably due to inaccuracies in the
float sum).
This seems to kill about half the precision loss on Intel, at least.
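The normalization itself is nothing more magical than this (a sketch, not the actual ResampleEffect code):

    // Scale the combined weights so they sum to one; run the loop twice, since
    // the float sum itself is inexact and a second pass mops up the residual.
    #include <vector>

    void normalize_weights(std::vector<float> *weights)
    {
            for (int pass = 0; pass < 2; ++pass) {
                    float sum = 0.0f;
                    for (float w : *weights) {
                            sum += w;
                    }
                    for (float &w : *weights) {
                            w /= sum;
                    }
            }
    }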
Rescale resampling weights so that the sum becomes one.
For some reason, I had forgotten this, and it showed up because Qt
has buggy handling of pixels with alpha != 0xff. Add unit test
so it doesn't happen again.
I'm a bit concerned that rounding might cause problems, in which case we
should perhaps renormalize after the bilinear conversion, but we
can deal with that later if it shows up.
I found very similar workaround code for this bug in Chromium,
with the following comment:
// Workaround for Mac driver bug. In the large scheme of things setting
// glTexParamter twice for glGenerateMipmap is probably not a lage performance
// hit so there's probably no need to make this conditional. The bug appears
// to be that if the filtering mode is set to something that doesn't require
// mipmaps for rendering, or is never set to something other than the default,
// then glGenerateMipmap misbehaves.
Going back all the way to the point at which this code was written,
it is indeed true; we called glGenerateMipmap(), and then right afterwards
set the mode to GL_LINEAR_MIPMAP_NEAREST. Since then, the code has been
reorganized and moved around a lot, and now we set the mode long before
the first call to glGenerateMipmap(), and thus we can retire the hack;
simply generate mipmaps on-demand, and that's the end of it. I tested
with the Mesa 8.0.x version where I originally saw this bug, and it passes
flat_input_test without any problems (well, actually all tests except
the tests for deconvolution sharpen, whose shaders are too big for it).
This is nice not only because it gives us a less hacky structure, but also
because GL_GENERATE_MIPMAP is a nightmare for the driver to handle;
several edge conditions are tricky, from what I've been told.
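The resulting on-demand structure is roughly this (a sketch with made-up names, not the actual code):

    // Generate mipmaps on demand; the filtering mode is assumed to have been
    // set elsewhere, long before the first glGenerateMipmap() call.
    #include <epoxy/gl.h>

    void ensure_mipmaps(GLuint tex, bool *mipmaps_dirty)
    {
            glBindTexture(GL_TEXTURE_2D, tex);
            if (*mipmaps_dirty) {
                    glGenerateMipmap(GL_TEXTURE_2D);  // Only if the contents changed since last time.
                    *mipmaps_dirty = false;
            }
    }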
Disable OpenGL dithering, just to be on the safe side.
I don't think any modern OpenGL implementation actually
heeds this flag for 8-bit rendering, but it's fine to be on the safe
side nevertheless.
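Concretely, this means a glDisable(GL_DITHER) call; the GL specification has dithering enabled by default.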
Round explicitly after dithering, for GPUs that don't do it properly themselves.
This was causing failures in the DitherEffect unit test on both
ATI and NVidia GPUs; Intel also rounds somewhat inaccurately, but much,
much better, so the extra code won't be activated for Intel.
I think this might be driver-dependent, but we will detect it correctly
in any case.
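The extra code amounts to quantizing explicitly; expressed in C++ rather than the actual shader language, the idea is:

    #include <cmath>

    // After dithering, snap the value to the nearest 8-bit level ourselves
    // instead of trusting the GPU to round correctly on output.
    float quantize_to_8bit(float x)
    {
            return std::round(x * 255.0f) / 255.0f;
    }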
- GL_FLOAT FlatInput is primarily used for tests, and even more importantly,
mostly for accuracy tests. ATI's drivers appear to round fp32 -> fp16
incorrectly (truncating instead of rounding), which breaks some of these tests.
- In case someone _would_ use GL_FLOAT inputs, they'd probably be updated
every frame anyway, so the fp32 -> fp16 conversion step (probably on the CPU)
would negate any performance benefit from fp16 sampling.