User selectable surfaces are not working correctly, if you set number of
surfaces on cmdline, it will always use minimum 32 or 48 depends on
selected resolution, but in nvenc it is not necessary to use so many
surfaces.
So from now you can define as low as 1 surface and nvenc will still
work, it will ofcourse lower GPU memory usage by 95% and async_delay to zero
That was the easy part, now littlebit more...
Next part of this patch is to always prefer rc_lookahead to be more
important for number of surfaces, than user defined surfaces value.
Maximum rc_lookahead from nvidia documentation is 32, but could increase
in future generations so there is no limit for this yet. Value
async_depth is still accepted and prefered over rc_lookahead.
There were also bug when you request more than rc_lookahead > 31, it
will always set maximum 31, because surface numbers recalculation was
after setting lookahead, which is now fixed.
Results:
If you set -rc_lookahead 32 and -bf 3 it will now use only 40 surfaces
and lower GPU memory usage by 20%, also it will now increase PSNR by 0.012dB
Two more comments:
1. from my internal test, i don't understand addition of 4 more surfaces
when lookahead is calculated, i didn't used this and everything works as
with those 4 more extra surfaces, does anybody know what is going on
there? I looks like it was used for B frames which are calculated
separately, because B frames maximum is 4.
2. rc_lookahead is defined default to -1, but in test condition if
(ctx->rc_lookahead) which sets lookahead it will be always true, i don't
know if this is intended behavior, so in default behavior is lookahead
always on!
This is default condition when rc_lokkahead is -1 (not defined on
cmdline), whis is maybe something that is not intended:
ctx->encode_config.rcParams.enableLookahead = 1;
ctx->encode_config.rcParams.lookaheadDepth = 0;
ctx->encode_config.rcParams.disableIadapt = 0;
ctx->encode_config.rcParams.disableBadapt = 0;
Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
Jun Zhao [Fri, 11 Nov 2016 06:53:49 +0000 (14:53 +0800)]
lavc/vaapi_encode_h264: fix poc incorrect issue after meeting idr frame.
when meeting IDR frame, vaapi_encode_h264 poc number don't reset, now fix
this issue based on h264 spec. Some decoder don't care this case, but this
fix will enhance the encoder action. Before this fix, poc number is
negative in some case.
Reviewed-by: Jun Zhao <jun.zhao@intel.com> Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Signed-off-by: Mark Thompson <sw@jkqxz.net>
Mark Thompson [Fri, 30 Sep 2016 15:31:49 +0000 (17:31 +0200)]
vaapi_h265: Fix slice header writing
This was not observed earlier because the only syntax element which
it normally misses with the current setup is slice_qp_delta, but that
is always going to be zero (in IDR frames QP isn't varied on the
slice) which will always exp-golomb code as a single 1 bit. The
immediately following part is the byte alignment, which is always a 1
bit followed by 0s which are ignored, so as long as the bitstream is
never aligned at that point we will never notice because the only
difference is that an ignored bit is a 1 instead of a 0.
Mark Thompson [Sun, 18 Sep 2016 15:06:55 +0000 (16:06 +0100)]
vaapi_encode: Sync to input surface rather than output
While outwardly bizarre, this change makes the behaviour consistent
with other VAAPI encoders which sync to the encode /input/ picture in
order to wait for /output/ from the encoder. It is not harmful on
i965 (because synchronisation already happens in vaRenderPicture(),
so it has no effect there), and it allows the encoder to work on
mesa/gallium which assumes this behaviour.
James Almer [Mon, 31 Oct 2016 23:04:46 +0000 (20:04 -0300)]
avformat/matroskaenc: write updated STREAMINFO metadata for FLAC streams if available
FLAC streams originating from the FLAC encoder send updated and more
complete STREAMINFO metadata as part of the last packet, so write that
to CodecPrivate instead of the incomplete one available in extradata
during init.
James Almer [Sat, 19 Nov 2016 15:38:44 +0000 (12:38 -0300)]
avcodec/avpacket: fix leak on realloc in av_packet_add_side_data()
If realloc fails, the pointer is overwritten and the previously allocated
buffer is leaked, which goes against the expected behavior of keeping the
packet unchanged in case of error.
Michael Niedermayer <michael@niedermayer.cc> Signed-off-by: James Almer <jamrial@gmail.com>
Said commit changed the behavior of the demuxer and decoder in a non
backwards compatible way.
Demuxers should make extradata available at init if possible, and send
new extradata as side data within a packet if needed.
Fixes division by 0
This is similar to how avg_frame_rate is checked elsewhere Fixes: 6d24add0455f41b1b45b7ba615cd46f3/asan_generic_dc34c3_5480_0a2ef411cae999b9871ed71a2e481b71.mov Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
avcodec/ass_split: Change order of operations in ass_split_section()
This matches the other branch
Fixes out of array read Fixes: 4d142ca76d39fe685effcf5017098723/asan_heap-oob_31ae824_8611_348fdb64f9009b63c8a8eae9a0e497c5.mkv Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Martin Storsjö [Mon, 14 Nov 2016 10:32:27 +0000 (12:32 +0200)]
aarch64: vp9: Implement NEON loop filters
This work is sponsored by, and copyright, Google.
These are ported from the ARM version; thanks to the larger
amount of registers available, we can do the loop filters with
16 pixels at a time. The implementation is fully templated, with
a single macro which can generate versions for both 8 and
16 pixels wide, for both 4, 8 and 16 pixels loop filters
(and the 4/8 mixed versions as well).
For the 8 pixel wide versions, it is pretty close in speed (the
v_4_8 and v_8_8 filters are the best examples of this; the h_4_8
and h_8_8 filters seem to get some gain in the load/transpose/store
part). For the 16 pixels wide ones, we get a speedup of around
1.2-1.4x compared to the 32 bit version.
Examples of runtimes vs the 32 bit version, on a Cortex A53:
ARM AArch64
vp9_loop_filter_h_4_8_neon: 144.0 127.2
vp9_loop_filter_h_8_8_neon: 207.0 182.5
vp9_loop_filter_h_16_8_neon: 415.0 328.7
vp9_loop_filter_h_16_16_neon: 672.0 558.6
vp9_loop_filter_mix2_h_44_16_neon: 302.0 203.5
vp9_loop_filter_mix2_h_48_16_neon: 365.0 305.2
vp9_loop_filter_mix2_h_84_16_neon: 365.0 305.2
vp9_loop_filter_mix2_h_88_16_neon: 376.0 305.2
vp9_loop_filter_mix2_v_44_16_neon: 193.2 128.2
vp9_loop_filter_mix2_v_48_16_neon: 246.7 218.4
vp9_loop_filter_mix2_v_84_16_neon: 248.0 218.5
vp9_loop_filter_mix2_v_88_16_neon: 302.0 218.2
vp9_loop_filter_v_4_8_neon: 89.0 88.7
vp9_loop_filter_v_8_8_neon: 141.0 137.7
vp9_loop_filter_v_16_8_neon: 295.0 272.7
vp9_loop_filter_v_16_16_neon: 546.0 453.7
The speedup vs C code in checkasm tests is around 2-7x, which is
pretty much the same as for the 32 bit version. Even if these functions
are faster than their 32 bit equivalent, the C version that we compare
to also became around 1.3-1.7x faster than the C version in 32 bit.
Based on START_TIMER/STOP_TIMER wrapping around a few individual
functions, the speedup vs C code is around 4-5x.
Examples of runtimes vs C on a Cortex A57 (for a slightly older version
of the patch):
A57 gcc-5.3 neon
loop_filter_h_4_8_neon: 256.6 93.4
loop_filter_h_8_8_neon: 307.3 139.1
loop_filter_h_16_8_neon: 340.1 254.1
loop_filter_h_16_16_neon: 827.0 407.9
loop_filter_mix2_h_44_16_neon: 524.5 155.4
loop_filter_mix2_h_48_16_neon: 644.5 173.3
loop_filter_mix2_h_84_16_neon: 630.5 222.0
loop_filter_mix2_h_88_16_neon: 697.3 222.0
loop_filter_mix2_v_44_16_neon: 598.5 100.6
loop_filter_mix2_v_48_16_neon: 651.5 127.0
loop_filter_mix2_v_84_16_neon: 591.5 167.1
loop_filter_mix2_v_88_16_neon: 855.1 166.7
loop_filter_v_4_8_neon: 271.7 65.3
loop_filter_v_8_8_neon: 312.5 106.9
loop_filter_v_16_8_neon: 473.3 206.5
loop_filter_v_16_16_neon: 976.1 327.8
The speed-up compared to the C functions is 2.5 to 6 and the cortex-a57
is again 30-50% faster than the cortex-a53.