Getting sprintf to format floats correctly in a portable manner is seemingly
impossible (MinGW doesn't support the per-thread locale stuff), so simply
do it a different way: stop sprintf-ing floats and use std::stringstream
instead. I dislike the iostream interface a lot, but it can do per-stream
locales, which is exactly what we want here.
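
A minimal sketch of the per-stream locale approach, not the actual Movit
code: the helper name format_float_c_locale and the precision of 8 are
assumptions made for illustration. The stream is imbued with the classic
"C" locale, so the decimal separator is always '.', whatever the process
or thread locale happens to be.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: format a float with '.' as the decimal separator,
// independent of the global C or C++ locale.
std::string format_float_c_locale(float value)
{
    std::stringstream ss;
    ss.imbue(std::locale::classic());  // Per-stream "C" locale.
    ss.precision(8);                   // Assumed precision, for illustration.
    ss << value;
    return ss.str();
}
```

Since the locale lives in the stream object, no global state is touched and
nothing needs to be restored afterwards.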
Dan Dennedy [Thu, 5 Mar 2015 07:41:39 +0000 (23:41 -0800)]
Fix build on OS X and MinGW.
OS X requires the xlocale.h header to define locale_t:
https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man3/newlocale.3.html
MinGW does not include implementations for newlocale() and uselocale().
Instead, use the previous approach using setlocale().
setlocale() affects the whole process, not just the current thread
as I assumed; uselocale() (available since glibc 2.3, so basically
forever) is per-thread, and also conveniently seems to avoid the
issue of the returned pointer being destroyed (unless the driver
uses the return value of uselocale() as a base, which I really hope
it doesn't).
I'm slightly worried that since this overrides setlocale(), buggy drivers
might get confused when they call setlocale() themselves and the per-thread
locale takes precedence over it, but hopefully this shouldn't be the case.
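
For illustration, a sketch of the per-thread pattern with newlocale() and
uselocale(), assuming a POSIX.1-2008 system such as glibc (OS X additionally
needs xlocale.h, and MinGW lacks these functions entirely). The function name
format_float_per_thread is hypothetical, and this is not the code from the
commit.

```cpp
#include <locale.h>
#include <stdio.h>
#include <string>

// Hypothetical helper: format a float under a "C" numeric locale that is
// installed for the calling thread only.
std::string format_float_per_thread(float value)
{
    // Create a locale object with LC_NUMERIC set to "C".
    locale_t c_locale = newlocale(LC_NUMERIC_MASK, "C", (locale_t)0);
    if (c_locale == (locale_t)0) {
        return "";  // Could not create the locale; give up in this sketch.
    }

    // Install it for this thread only; other threads are unaffected.
    locale_t previous = uselocale(c_locale);

    char buf[64];
    snprintf(buf, sizeof(buf), "%f", value);

    // Restore whatever the thread had before, then free our locale object.
    uselocale(previous);
    freelocale(c_locale);
    return buf;
}
```

The handle returned by uselocale() is an opaque object rather than a pointer
into static storage, which is why it survives intervening locale calls.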
Also add a unit test for locale handling while we're at it. It doesn't
test multi-threaded behavior, though, only the simple case.
For most users, this is mostly theoretical, as it requires compiling
with -march=native or similar. And these are definitely meant for
vectorizing, although it's still 2-3x as fast to use them as our own
software fallback.
These are supported starting from Haswell, and also by some AMD CPUs.
For the case where the resampling changed every frame (e.g. a zoom),
it just consumed too much CPU to be worth it, especially in memory
management; this is painful because it was an elegant solution to
a tricky problem, but it just has to go for now.
Also drop out to fp32 at the first sight of too-high error.
In ResampleEffect, optimize the bilinear weights on a global scale.
In addition to the individual weight optimization we do when combining samples,
this technique optimizes the weights as a whole, through some linear algebra.
This means it can take into account effects such as multiple bilinear samples
influencing the same coefficient (which normally should not happen, but might
nevertheless due to imprecisions in the stored texture coordinates), or
non-combined sample positions that can't hit the exact middle of the texel.
In practical tests, this is extremely effective; it often reduces the computed
sum of squared coefficient errors by as much as a factor of 1000, although I
haven't verified how often it actually saves us from having to do fp32 fallback
with the rather tight error bounds that are in place.
When combining samples, take fp16 rounding into account.
This makes us somewhat more conservative in combining samples;
when we are near the lower/right edges of the image, we are starting
to get close to 1.0, and fp16 just doesn't have enough precision
to give us the 6 or 8 bits of subpixel precision we want (it is
hardly enough to address individual pixels!). In particular, this
can affect zooming with ResampleEffect, as reported by Christophe
Thommeret.
This does not fix all cases (especially not non-power-of-two cases);
for that, we will probably need to be able to fall back to fp32
when we detect fp16 doesn't work well.
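
To put numbers on the precision problem, here is a small sketch that computes
the spacing between adjacent fp16 values. It assumes IEEE 754 binary16 (11
significant bits) and positive inputs in the normal range; the function names
are made up for this example.

```cpp
#include <cmath>

// Distance between adjacent fp16 values around x.
// Assumes x > 0 and within fp16's normal range.
double fp16_spacing(double x)
{
    int exponent;
    std::frexp(x, &exponent);  // x = m * 2^exponent, with 0.5 <= m < 1.
    // binary16 has 11 significant bits (10 stored plus the implicit one).
    return std::ldexp(1.0, exponent - 11);
}

// How many distinct fp16 positions fit within one texel at normalized
// coordinate coord, for a texture that is texture_size texels wide.
double fp16_steps_per_texel(double coord, int texture_size)
{
    return (1.0 / texture_size) / fp16_spacing(coord);
}
```

For coordinates in [0.5, 1.0) the spacing is 2^-11, so a 2048-texel-wide
texture gets exactly one representable position per texel there, whereas
6 bits of subpixel precision would need 64.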
We read about twice as many as we should have; the others were
probably just set to 0.0, which has no effect but still burns
arithmetic, unless your driver happens to optimize very aggressively
for this (which I don't think anyone does anymore).
Properly restore the LC_NUMERIC locale after finalizing.
There were two issues here:
1. setlocale(LC_NUMERIC, "C") always returns "C" (the new locale), not the
previous locale.
2. The return value of setlocale() may point into static storage,
which may be corrupted when we call into libGL, if e.g.
the shader compiler calls setlocale() on its own.
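
Both issues can be avoided along these lines. This is a sketch, not the
actual fix: the class name ScopedNumericCLocale and the helper
current_numeric_locale are hypothetical. The current locale is queried with
a null argument and copied into a std::string right away, before anything
else can overwrite the static buffer.

```cpp
#include <clocale>
#include <string>

// Hypothetical RAII guard: switch LC_NUMERIC to "C", restore on destruction.
class ScopedNumericCLocale {
public:
    ScopedNumericCLocale()
    {
        // A null second argument queries the current locale without
        // changing it. Copy the result immediately, since the pointer
        // may refer to static storage.
        const char *current = std::setlocale(LC_NUMERIC, nullptr);
        if (current != nullptr) {
            saved_ = current;
            have_saved_ = true;
        }
        std::setlocale(LC_NUMERIC, "C");
    }

    ~ScopedNumericCLocale()
    {
        if (have_saved_) {
            std::setlocale(LC_NUMERIC, saved_.c_str());
        }
    }

    ScopedNumericCLocale(const ScopedNumericCLocale &) = delete;
    ScopedNumericCLocale &operator=(const ScopedNumericCLocale &) = delete;

private:
    std::string saved_;
    bool have_saved_ = false;
};

// Hypothetical helper: name of the current LC_NUMERIC locale.
std::string current_numeric_locale()
{
    const char *name = std::setlocale(LC_NUMERIC, nullptr);
    return name != nullptr ? name : "";
}
```

Note that this still changes the locale for the whole process while the
guard is alive; it only makes the restore step reliable.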
1. If you're missing some functionality, Movit will now tell you
on stderr what you're missing. (We might suppress this later
if it turns out that people want to init_movit() but are actually
fine with it failing.)
2. Use a table instead of repeated if-then logic, since this started
to become a bit messy after we added OpenGL-version-equivalence
checks.
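
As a rough illustration of the table idea (not the real table from the
code), each row could pair an extension with the OpenGL version that makes
it core. The struct, the function name find_missing and the choice of rows
are assumptions for this sketch.

```cpp
#include <cstdio>
#include <set>
#include <string>
#include <vector>

// One row per requirement: satisfied either by the extension being present
// or by the context version being at least the given core version.
struct Requirement {
    const char *extension;
    int core_major, core_minor;
};

// Illustrative rows only; the actual requirements may differ.
static const Requirement requirements[] = {
    { "GL_ARB_framebuffer_object", 3, 0 },
    { "GL_ARB_texture_float", 3, 0 },
    { "GL_ARB_pixel_buffer_object", 2, 1 },
};

// Returns the names of all unmet requirements, and reports each on stderr.
std::vector<std::string> find_missing(int major, int minor,
                                      const std::set<std::string> &extensions)
{
    std::vector<std::string> missing;
    for (const Requirement &req : requirements) {
        bool by_version = major > req.core_major ||
            (major == req.core_major && minor >= req.core_minor);
        bool by_extension = extensions.count(req.extension) > 0;
        if (!by_version && !by_extension) {
            std::fprintf(stderr, "Missing functionality: %s\n", req.extension);
            missing.push_back(req.extension);
        }
    }
    return missing;
}
```

Adding a new requirement then means adding one row, rather than another
branch with its own version-equivalence check.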
Same rationale as with the offset; we need resampling for proper zoom.
The look at heavy zoom isn't _quite_ what I had hoped for (although it's OK),
and there's a hint of shimmering in the zoom center if there's high-contrast
material there. For now, I'll write off the latter as Lanczos ringing;
I'll need to see what it does to video eventually (only tested with stills).
This enables smooth (subpixel) panning that people frequently want for stills
and titles, but that you couldn't do in a subpixel fashion before (PaddingEffect
could only do integer pixel offsets).
The placement (ResampleEffect) might seem a bit off at first, but subpixel
offset needs resampling, and ResampleEffect already has all the logic in place
for that. We could have used the GPU's built-in bilinear resampling, of course,
but it doesn't look all that good for high-contrast situations (although working
in linear light should help some).
This is mainly a convenience so that you can change e.g. a left-to-right
wipe into a right-to-left wipe without having to add a separate inverting
effect to the luma. Suggested by Dan Dennedy.
Make the ResourcePool hold FBOs as a per-context resource.
This is an attempt to get out of the FBO shareability mess (unfortunately
we can't just stop having persistent FBOs, due to NVidia performance).
We now require the client to tell us whenever a context is going away,
and we try to be more careful about not deleting them in the wrong context.
Also, we assumed FBO names were globally unique, which isn't necessarily
true, so re-key them.
For good measure, we were deleting FBOs off the freelist from the front,
not the back as we should have -- fixed.
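
The freelist fix can be pictured with the sketch below. It assumes that
released entries are pushed onto the front, so the back holds the entry that
has been idle the longest; the class name and the size cap are hypothetical,
and plain unsigned integers stand in for FBO names.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical freelist: newest entries at the front, eviction at the back.
class Freelist {
public:
    explicit Freelist(std::size_t max_size) : max_size_(max_size) {}

    // Put a name back on the freelist. Returns the names that were evicted
    // to keep the list within its cap; the caller would delete those.
    std::vector<unsigned> release(unsigned name)
    {
        entries_.push_front(name);
        std::vector<unsigned> evicted;
        while (entries_.size() > max_size_) {
            // Evict from the back: the entry unused for the longest time.
            evicted.push_back(entries_.back());
            entries_.pop_back();
        }
        return evicted;
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::size_t max_size_;
    std::list<unsigned> entries_;
};
```

Evicting from the front instead would throw away the entry that was just
released, which is the one most likely to be reused next.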
We have a problem when trying to delete an EffectChain or ResourcePool;
we might have created FBOs or VAOs in the wrong context. Work around it
for now (unbreaking Kdenlive) by making VAOs non-persistent again,
and simply never deleting FBOs (leaking them).
A proper solution here will be hard, unfortunately, and will need some thought.
Many of the rows in the support texture are exactly the same,
so don't store the duplicates; gives a small performance boost.
In a sense, this is exactly the same property that GPUwave uses
with drawing multiple quads at the lower level.
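
A sketch of the deduplication idea, under the assumption that rows are
compared for exact equality and contain no NaNs; the struct and function
names are invented for this example and do not come from the code.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Result of deduplication: the distinct rows, plus for every original row
// the index of its representative among the distinct rows.
struct DedupedRows {
    std::vector<std::vector<float>> unique_rows;
    std::vector<std::size_t> row_index;
};

DedupedRows dedupe_rows(const std::vector<std::vector<float>> &rows)
{
    DedupedRows out;
    std::map<std::vector<float>, std::size_t> seen;
    for (const std::vector<float> &row : rows) {
        auto it = seen.find(row);
        if (it == seen.end()) {
            // First time we see this row: store it and remember its index.
            it = seen.emplace(row, out.unique_rows.size()).first;
            out.unique_rows.push_back(row);
        }
        out.row_index.push_back(it->second);
    }
    return out;
}
```

Only unique_rows would need to be stored; row_index tells each original row
where its data ended up.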
Stop the FFTPassEffect Repeat test after FFT size 128.
The reason is that the 256 test uses texture sizes of 256*31=7936,
and above ~3900, some cards (at least both my Intel and NVidia cards)
start having accuracy issues on some sizes. The test happens not to
die on this for semi-obscure reasons, but that's mostly by accident,
and in any case, requiring 8k textures for a unit test might be
a bit on the high side.
Redo FBO association yet again, this time per-texture.
According to http://adrienb.fr/blog/wp-content/uploads/2013/04/PortingSourceToLinux.pdf,
you want an FBO per texture, not just per format. And indeed, I can measure a very slight
performance improvement on both NVidia and ATI for this.
Seemingly this _also_ costs on NVidia; the demo app is down 0.9 ms/frame or so.
This rapidly started approaching complexity worthy of the ResourcePool,
so I moved the functionality in there even though it's not context-shareable.
We support 1.10 (for OpenGL 2.1 cards), 1.30 (for OpenGL 3.2 core contexts),
and 3.00 ES (for GLES3). There's some code duplication, but thankfully
not a whole lot.
With this, we compile in core contexts without any warning from ATI's driver,
and should also in theory be GLES3 compliant (tested on NVidia's desktop driver).
Make handling of non-RGBA sRGB textures more consistent.
Previously, we'd ask the driver to convert these to RGBA, which maybe
isn't ideal, and certainly doesn't work with GLES. Now we send in
the right format for RGB and RGBA, and refuse hardware conversions with
single-channel (which GLES doesn't accept). I don't think this is optimal,
but finding a use-case for sRGB single-channel is a bit tricky anyway,
and the fallback is fast, too.
Reduce the amount of arithmetic in the BlurEffect shader a bit.
We did additions and subtractions with zero, which is sort of a waste
on scalar architectures. Helps ever so slightly on the demo app on my NVidia
card (3–4%).
Seemingly creating and deleting them is crazy expensive on NVidia
(~3 ms for a create/delete pair), so 6dea8d2 caused a performance
regression at high frame rates. Now we instead keep one around per
context (they cannot be shared), which brings us basically back
to where we were performance-wise.
Make Phase take other Phases as inputs, not Nodes.
This was a refactoring I wanted to do for a while, but actually finding
the right structure was a bit tricky. In the process, the entire phase
generation logic was rewritten, but the separation between compilation
and Phase construction is much cleaner now, and the logic in general
is easier to follow with more use of explicit recursion.
I'm still not 100% happy about what might be overuse of output_node;
we still need to link Phase and Node (the link just goes the other way
now), but I'm not sure we need to use it in all the cases we currently do.