Martin Storsjö [Sat, 31 Dec 2016 12:18:31 +0000 (14:18 +0200)]
aarch64: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.
This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
Martin Storsjö [Sat, 31 Dec 2016 12:05:44 +0000 (14:05 +0200)]
arm: vp9itxfm: Reorder the idct coefficients for better pairing
All elements are used pairwise, except for the first one.
Previously, the 16th element was unused. Move the unused element
to the second slot, to make the later element pairs not split
across registers.
This simplifies loading only parts of the coefficients,
reducing the difference to the 16 bpp version.
Martin Storsjö [Mon, 2 Jan 2017 20:08:41 +0000 (22:08 +0200)]
aarch64: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed d8-d15 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.
Martin Storsjö [Mon, 2 Jan 2017 20:50:38 +0000 (22:50 +0200)]
arm: vp9itxfm: Avoid reloading the idct32 coefficients
The idct32x32 function actually pushed q4-q7 onto the stack even
though it didn't clobber them; there are plenty of registers that
can be used to allow keeping all the idct coefficients in registers
without having to reload different subsets of them at different
stages in the transform.
Since the idct16 core transform avoids clobbering q4-q7 (but clobbers
q2-q3 instead, to avoid needing to back up and restore q4-q7 at all
in the idct16 function), and the lanewise vmul needs a register in
the q0-q3 range, we move the stored coefficients from q2-q3 into q4-q5
while doing idct16.
While keeping these coefficients in registers, we still can skip pushing
q7.
Martin Storsjö [Sat, 14 Jan 2017 11:22:30 +0000 (13:22 +0200)]
arm: vp9lpf: Implement the mix2_44 function with one single filter pass
For this case, with 8 inputs but only changing 4 of them, we can fit
all 16 input pixels into a q register, and still have enough temporary
registers for doing the loop filter.
The wd=8 filters would require too many temporary registers for
processing all 16 pixels at once though.
Diego Biurrun [Wed, 22 Feb 2017 10:39:21 +0000 (11:39 +0100)]
Place attribute_deprecated in the right position for struct declarations
libavcodec/vaapi.h:58:1: warning: attribute 'deprecated' is ignored, place it after "struct" to apply attribute to type declaration [-Wignored-attributes]
John Stebbins [Wed, 15 Feb 2017 22:22:40 +0000 (15:22 -0700)]
matroskaenc: factor ts_offset into block timecode computation
ts_offset was added to cluster timecode, but then effectively subtracted
back off the block timecode
When setting initial_padding for an audio stream, the timestamps are
written incorrectly to the mkv file. cluster timecode gets written
as pts0 + ts_offset which is correct, but then block timecode gets
written as pts - cluster timecode which expanded is
pts - (pts0 + ts_offset). Adding cluster and block tc back together:
cluster + block = (pts0 + ts_offset) + (pts - (pts0 + ts_offset)) = pts
But the result should be pts + ts_offset since demux will subtract the
CodecDelay element from pts and set initial_padding to CodecDelay.
This patch gives the correct result.
Diego Biurrun [Sat, 11 Feb 2017 12:09:27 +0000 (13:09 +0100)]
configure: Restructure the way check_pkg_config() operates
Have check_pkg_config() enable variables and set cflags and extralibs
instead of relegating that task to require_pkg_config. This simplifies
require_pkg_config(), is consistent with what other helper functions
like check_lib() do and allows getting rid of some manual variable
setting in places where check_pkg_config() is used.
Mark Thompson [Fri, 17 Feb 2017 23:14:19 +0000 (23:14 +0000)]
webp: Fix alpha decoding
This was broken by 4e528206bc4d968706401206cf54471739250ec7 - the webp
decoder was assuming that it could set the output pixfmt of the vp8
decoder directly, but after that change it no longer could because
ff_get_format() was used instead. This adds an internal get_format()
callback to webp use of the vp8 decoder to override the pixfmt
appropriately.
Mark Thompson [Thu, 9 Feb 2017 19:26:11 +0000 (19:26 +0000)]
vf_deinterlace_vaapi: Create filter buffer after context
The Intel proprietary VAAPI driver enforces the restriction that a
buffer must be created inside an existing context, so just ensure
this is always true.
Martin Storsjö [Thu, 16 Feb 2017 10:23:20 +0000 (12:23 +0200)]
vf_fade: Make sure to not miss the last lines of a frame
When slice_h is rounded up due to chroma subsampling, there's
a risk that jobnr * slice_h exceeds frame->height.
Prior to a638e9184d63, this wasn't an issue for the last slice
of a frame, since slice_end was set to frame->height for the last
slice.
a638e9184d63 tried to fix the case where other slices than the
last one would exceed frame->height (which can happen where the
number of slices/threads is very large compared to the frame
height).
However, the fix in a638e9184d63 instead broke other cases,
where slice_h * nb_threads < frame->height. Therefore, make
sure the last slice always ends at frame->height.
CC: libav-stable@libav.org Signed-off-by: Martin Storsjö <martin@martin.st>
Mark Thompson [Sun, 12 Feb 2017 23:47:58 +0000 (23:47 +0000)]
avconv: Move rescale to stream timebase before monotonisation
If the stream timebase is coarser than the muxing timebase then the
monotonisation process may fail because adding one to the timestamp
need not actually produce a different timestamp after the rescale.
Mark Thompson [Mon, 30 Jan 2017 19:11:28 +0000 (19:11 +0000)]
hwcontext_vaapi: Try to support the VDPAU wrapper
The driver is somewhat bitrotten (not updated for years) but is still
usable for decoding with this change. To support it, this adds a new
driver quirk to indicate no support at all for surface attributes.
wm4 [Fri, 10 Feb 2017 11:17:24 +0000 (12:17 +0100)]
hwcontext_dxva2: support D3D9Ex
D3D9Ex uses different driver paths. This helps with "headless"
configurations when no user logs in. Plain D3D9 device creation will
fail if no user is logged in, while it works with D3D9Ex.
wm4 [Thu, 2 Feb 2017 10:27:54 +0000 (11:27 +0100)]
AVFrame: add an opaque_ref field
This is an extended version of the AVFrame.opaque field, which can be
used to attach arbitrary user information to an AVFrame.
The usefulness of the opaque field is rather limited, because it can
store only up to 32 bits of information (or 64 bit on 64 bit systems).
It's not possible to set this field to a memory allocation, because
there is no way to deallocate it correctly.
The opaque_ref field circumvents this by letting the user set an
AVBuffer, which makes the user data refcounted.
Martin Storsjö [Tue, 22 Nov 2016 20:58:35 +0000 (22:58 +0200)]
aarch64: vp9itxfm: Do separate functions for half/quarter idct16 and idct32
This work is sponsored by, and copyright, Google.
This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
This gives a pretty substantial speedup for the smaller subpartitions.
The code size increases from 14740 bytes to 24292 bytes.
The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.
Martin Storsjö [Tue, 22 Nov 2016 09:07:38 +0000 (11:07 +0200)]
arm: vp9itxfm: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.
This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.
This gives a pretty substantial speedup for the smaller subpartitions.
The code size increases from 12388 bytes to 19784 bytes.
The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.
Martin Storsjö [Wed, 23 Nov 2016 08:56:12 +0000 (10:56 +0200)]
arm: vp9itxfm: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google.
This reduces the code size of libavcodec/arm/vp9itxfm_neon.o from
15324 to 12388 bytes.
This gives a small slowdown of a couple tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.
Diego Biurrun [Wed, 8 Feb 2017 17:06:34 +0000 (18:06 +0100)]
configure: Correctly recurse in do_check_deps()
Fixes all sorts of configuration problems introducec by dad7a9c7c0ae
on non-Linux or non-vanilla configs. Also removes a line made redundant
in that commit.
Martin Storsjö [Mon, 6 Feb 2017 22:25:19 +0000 (00:25 +0200)]
omx: Use the EOS flag to handle flushing at the end
This avoids having to count the number of frames sent to the codec
and the number of output packets received; instead just wait until
the encoder returns a buffer with the EOS flag set.
Diego Biurrun [Mon, 23 Jan 2017 10:57:14 +0000 (11:57 +0100)]
configure: Add name parameter to require_pkg_config() helper function
This allows distinguishing between the internal variable name for
external libraries and the pkg-config package name. Having both
names available avoids special-casing outside the helper function
when the two identifiers do not match.
Martin Storsjö [Tue, 31 Jan 2017 14:15:56 +0000 (16:15 +0200)]
rtmp: Correctly handle the Window Acknowledgement Size packets
This swaps which field is set when the Window Acknowledgement Size
and Set Peer BW packets are received, renames the fields in
order to clarify their role further and adds verbose comments
explaining their respective roles and how well the code currently
does what it is supposed to.
The Set Peer BW packet tells the receiver of the packet (which
can be either client or server) that it should not send more data
if it already has sent more data than the specified number of bytes,
without receiving acknowledgement for them. Actually checking this
limit is currently not implemented.
In order to be able to check that properly, one can send the
Window Acknowledgement Size packet, which tells the receiver of the
packet that it needs to send Acknowledgement packets
(RTMP_PT_BYTES_READ) at least after receiving a given number of bytes
since the last Acknowledgement.
Therefore, when we receive a Window Acknowledgement Size packet,
this sets the maximum number of bytes we can receive without sending
an Acknowledgement; therefore when handling this packet we should set
the receive_report_size field (previously client_report_size).
Martin Storsjö [Tue, 31 Jan 2017 13:47:00 +0000 (15:47 +0200)]
rtmp: Rename packet types to closer match the spec
Also rename comments and log messages accordingly,
and add clarifying comments for some hardcoded values.
The previous names were taken from older, reverse engineered
references.
These names match the official public rtmp specification, and
matches the names used by wirecast in annotating captured
streams. These names also avoid hardcoding the roles of server
and client, since the handling of them is irrelevant of whether
we act as server or client.
The RTMP_PT_PING type maps to RTMP_PT_USER_CONTROL.
The SERVER_BW and CLIENT_BW types are a bit more intertwined;
RTMP_PT_SERVER_BW maps to RTMP_PT_WINDOW_ACK_SIZE and
RTMP_PT_CLIENT_BW maps to RTMP_PT_SET_PEER_BW.
Diego Biurrun [Mon, 23 Jan 2017 12:17:24 +0000 (13:17 +0100)]
configure: Drop weak dependencies on external libraries for webm muxer
Weak dependencies on external libraries do not obviate having to
explicitly enable these libraries, so the weak dependency does not
simplify the configure command line nor have any real effect.