Many of the rows in the support texture are exactly the same,
so don't store the duplicates; gives a small performance boost.
In a sense, this is exactly the same property that GPUwave uses
with drawing multiple quads at the lower level.
Stop the FFTPassEffect Repeat test after FFT size 128.
The reason is that the 256 test uses texture sizes of 256*31=7936,
and above ~3900, some cards (at least both my Intel and NVidia card)
start having accuracy issues on some sizes. The test happens not to
die on this for semi-obscure reasons, but that's mostly by accident,
and in any case, requiring 8k textures for a unit test might be
a bit on the upper side.
Redo FBO association yet again, this time per-texture.
According to http://adrienb.fr/blog/wp-content/uploads/2013/04/PortingSourceToLinux.pdf,
you want an FBO per-texture, not just format. And indeed, I can measure a very slight
performance improvement on both NVidia and ATI for this.
Seemingly this _also_ costs on NVidia; the demo app is down 0.9 ms/frame or so.
This rapidly started approaching complexity worthy of the ResourcePool,
so I moved the functionality in there even though it's not context-shareable.
We support 1.10 (for OpenGL 2.1 cards), 1.30 (for OpenGL 3.2 core contexts),
and 3.00 ES (for GLES3). There's some code duplication, but thankfully
not a whole lot.
With this, we compile in core contexts without any warning from ATI's driver,
and should also in theory be GLES3 compliant (tested on NVidia's desktop driver).
Make handling of non-RGBA sRGB textures more consistent.
Previously, we'd ask the driver to convert these to RGBA, which maybe
isn't ideal, and certainly doesn't work with GLES. Now we send in
the right format for RGB and RGBA, and refuse hardware conversions with
single-channel (which GLES doesn't accept). I don't think this is optimal,
but finding a use-case for sRGB single-channel is a bit tricky anyway,
and the fallback is fast, too.
Reduce the amount of arithmetic in the BlurEffect shader a bit.
We did additions and subtractions with zero, which is sort of a waste
on scalar architectures. Helps ever so slightly on the demo app on my NVidia
card (3–4%).
Seemingly creating and deleting them is crazy expensive on NVidia
(~3 ms for a create/delete pair), so 6dea8d2 caused a performance
regression at high frame rates. Now we instead keep one around per
context (they cannot be shared), which brings us basically back
to where we were performance-wise.
Make Phase take other Phases as inputs, not Nodes.
This was a refactoring I wanted to do for a while, but actually finding
the right structure was a bit tricky. In the process, the entire phase
generation logic was rewritten, but the separation between compilation
and Phase construction is much cleaner now, and the logic in general
is easier to follow with more use of explicit recursion.
I'm still not 100% happy about what might be overuse of output_node;
we still need to link Phase and Node (the link just goes the other way
now), but I'm not sure we need to use it in all the cases we currently do.
A lot of the later commits have been leading up to this, and I finally
got to the point where all the unit tests check out, everything seems
to work (modulo maybe some overflow issues) and we have a model that
matches what people actually expects from convolutions.
Note that this adds a dependency on FFTW3; we could probably have added
our own routines for such small needs, but like with Eigen, calling out to a
library is fine as long as it's of good quality (which FFTW certainly is) and
is widely available.
Revert "Support pad/crop from bottom, not just from the top."
This turned out not to be so useful after all, as we'd like a more
consistent top-left coordinate system, and changes to do that will
obsolete this patch.
Fix a bug where repeated vertical FFTs would reverse the output.
Unfortunately, the tests didn't catch this, as the Repeat test used
an even number of passes (being of size 64), which reversed things
back into place. It now tries a wider range of sizes to make sure
everything is okay.
This tests a few edge cases that are not adequately covered by the
random fp32 tests; in particular, the round-to-even logic had
no test coverage, which is bad.
Formalize the notion of messing with sampler state.
This kills a lot of the assumptions that have been going around,
and should allow us to deal much better with the situation when
we have two or more inputs to an effect (where you basically can't
predict the sampler number used reliably); there's still an edge
case that's documented with a TODO, but this is generally much better.
This allows us to ignore the texture bounce flag when reading from a
FlatInput, and also handles better the case where an YCbCrInput is read
from multiple times (it's now bounced, which should be better for speed,
I think).
The main motivation, however, is to be able to control sampler state
a bit less hackish in the future.