git.sesse.net Git - x264/log

commit | commitdiff | tree

Steinar H. Gunderson [Thu, 21 Apr 2016 13:58:20 +0000 (15:58 +0200)]

Switch to exponential interpolation between presets.

The preset timings mostly grow in an exponential fashion, not linear,
and as such, interpolation should be exponential, too. More importantly,
this changes extrapolation to be exponential. This fixes an issue where
the chosen preset goes way out of range: If the picture has been at very low
complexity, we would have a very high target (say, 10000 seconds), which
due to linear interpolation would choose a way too high preset (say, preset
5000 out of 25). This would completely drown out the controller responsible
for turning down the preset based on the queue length alone; even though
it would subtract e.g. 20 levels from the chosen preset, preset 4980
would still be chosen, and effectively, we'd oscillate between the highest
and the lowest presets all the time.

As an extra precaution, we cap the chosen preset to five levels of
extrapolation. This makes us less sensitive to choosing a max_presets where the
last two entries are not that far apart (throwing the extrapolation off).

commit | commitdiff | tree

Steinar H. Gunderson [Sat, 23 Apr 2016 01:18:14 +0000 (03:18 +0200)]

Redo the speedcontrol preset list from scratch.

This was primarily motivated by the need for both faster and slower presets
than what were in the list, but it also served to move the presets closer
to the current x264 tunings (the old ones were done before significant new
features, such as weighted P-frames). In particular, we no longer disable psy-rd
or motion estimation based on chroma, since neither is changed by any x264 preset
and I could not get them to give me any speed increase at all.

In general, the strategy was to take x264's existing presets, save for
parameters that we cannot change at runtime (we never set --subme to 0 because
you cannot get out of it, and cannot modify --mbtree, --rc-lookahead and
--weightp after initial setup), and then build intermediate steps between
them based on what would give the biggest SSIM gain with the least fps drop.
This gives a significantly expanded range; the old one went from roughly
"superfast" to a bit faster than "slow", while the new goes from about 55%
the speed of "ultrafast" (more than twice as fast as the old fastest preset)
all the way to "veryslow" (requiring four times as much CPU as the old
slowest preset, but with a ~0.7 dB SSIM boost in the reference benchmark,
which is quite a bit), about a 30x range. It also gives more evenly spaced
presets than before, although especially in the faster end, it would have
been nice with even more in-between presets (the lack of them is not for
lack of trying; I just couldn't find more useful knobs to tweak in these ranges).

The presets assume the base is "--preset faster --ref 16" (in particular,
that means --weightp 1 --mbtree --rc-lookahead 20); if you change
other options (including starting at "ultrafast", which does things like
--no-deblock which has only modest speed gains but _eats_ quality), you're on
your own. In particular, using "medium" or something else as base would
change --rc-lookahead, which affects both performance and quality
(invalidating the timings), and choosing a lower value for --ref would mean
the slower presets would not be able to use that many reference frames.
x264 does not warn about this; it just silently refuses to apply your
settings.

All benchmarks (both timing and the SSIM values) were run by encoding the first
1000 frames of “Tears of Steel” in 1080p at 2 Mbit/sec, on a quadcore 3.2 GHz
i5 Haswell, which should be more representative of today's machines than the
Sandy Bridge the old timings were done on. I also verified the general trend of
the results on a much more difficult test, namely a 5 Mbit/sec encode of 2000
frames (starting at 48 seconds) of "Metamorf II" at 720p60 (source:
http://zd3n.com/get.php3?get=files/str_metamorf2_hi.mp4). Despite the
addition of faster presets, there are still sections of Metamorf II that I
cannot encode in realtime on the target machine without dropping frames from
the 50-frame queue I'm using.

The benchmark was run using the x264 command-line tool on said machine (so no
hyperthreading) when it was otherwise idle. Input was y4m on tmpfs, output was to
/dev/null. The x264 process was run with realtime priority, frequency scaling
was off, and results varied generally less than 1% between runs (although
there were some exceptions). --no-psy --ssim was used to get SSIM numbers;
--no-psy did not have any impact on measured performance. All tests were run
ten times with frame times (not frame rates) averaged.
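
The distinction between averaging frame times and averaging frame rates can
be sketched as follows. This is an illustration only; mean_frame_time and its
parameters are hypothetical names, not part of the benchmark scripts.

```c
#include <assert.h>
#include <stddef.h>

/* Mean time per frame across several runs of the same clip.
 * run_seconds[i] is the wall-clock time of run i; every run encodes the
 * same number of frames. */
static double mean_frame_time(const double *run_seconds, size_t runs,
                              size_t frames)
{
    assert(runs > 0 && frames > 0);

    double total = 0.0;
    for (size_t i = 0; i < runs; i++)
        total += run_seconds[i];

    /* Total time divided by total frames encoded. */
    return total / ((double)runs * (double)frames);
}
```

For two runs of 1000 frames taking 10 and 20 seconds, the mean frame time is
0.015 s (about 66.7 fps), whereas averaging the two frame rates (100 and
50 fps) would give 75 fps and overstate the speed.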

commit | commitdiff | tree

Steinar H. Gunderson [Tue, 19 Apr 2016 22:37:55 +0000 (00:37 +0200)]

Get rid of redundant speedcontrol variable fps, and clarify the units of some variables.

Also fix a bug where the i_buffer_size parameter to x264_speedcontrol_sync()
would get ignored, and not update the dependent parameters properly.

commit | commitdiff | tree

Kieran Kunhya [Fri, 30 Mar 2012 15:50:13 +0000 (16:50 +0100)]

Add ability to signal to x264 when speedcontrol buffering is complete

commit | commitdiff | tree

Kieran Kunhya [Fri, 30 Mar 2012 15:10:18 +0000 (16:10 +0100)]

Turn down verbosity of speedcontrol

commit | commitdiff | tree

Kieran Kunhya [Mon, 29 Aug 2011 16:58:28 +0000 (17:58 +0100)]

Get rid of hardcoded parameters in speedcontrol. This allows SAR changes.

commit | commitdiff | tree

Kieran Kunhya [Fri, 29 Jul 2011 16:41:25 +0000 (17:41 +0100)]

Only use 10 speedcontrol presets for now.

commit | commitdiff | tree

Kieran Kunhya [Wed, 27 Jul 2011 15:27:41 +0000 (16:27 +0100)]

Make speedcontrol idle an informational log message.

commit | commitdiff | tree

Kieran Kunhya [Tue, 19 Jul 2011 12:15:56 +0000 (13:15 +0100)]

Add new speedcontrol timings.

commit | commitdiff | tree

Kieran Kunhya [Wed, 9 Mar 2011 11:55:39 +0000 (11:55 +0000)]

Change defaults and use x264_log

commit | commitdiff | tree

Kieran Kunhya [Tue, 8 Mar 2011 14:57:03 +0000 (14:57 +0000)]

Fix 10L

commit | commitdiff | tree

Kieran Kunhya [Tue, 8 Mar 2011 14:45:04 +0000 (14:45 +0000)]

Start counting when the first frame is encoded, not when x264_encoder_open is called.

commit | commitdiff | tree

Kieran Kunhya [Mon, 21 Feb 2011 02:57:27 +0000 (02:57 +0000)]

Merge speedcontrol.

commit | commitdiff | tree

Kieran Kunhya [Thu, 20 Jan 2011 13:07:54 +0000 (13:07 +0000)]

Add speedcontrol file.

commit | commitdiff | tree

Anton Mitrofanov [Wed, 13 Apr 2016 18:54:25 +0000 (21:54 +0300)]

Clean up header includes

commit | commitdiff | tree

Henrik Gramner [Wed, 13 Apr 2016 15:53:49 +0000 (17:53 +0200)]

Eliminate some compiler warnings on BSD

Include <strings.h> in addition to <string.h>. According to the POSIX
specification the prototypes for strcasecmp() and strncasecmp() are
declared in <strings.h>. On some systems they are also declared in
<string.h> for compatibility reasons but we shouldn't rely on that.

Define _POSIX_C_SOURCE only when it's required to do so. Some BSD
variants don't declare certain function prototypes otherwise.

commit | commitdiff | tree

Henrik Gramner [Tue, 12 Apr 2016 19:33:54 +0000 (21:33 +0200)]

osx: Add -D_DARWIN_C_SOURCE to CFLAGS

OSX doesn't like _POSIX_C_SOURCE being defined when _DARWIN_C_SOURCE isn't.

commit | commitdiff | tree

Anton Mitrofanov [Tue, 12 Apr 2016 17:33:42 +0000 (20:33 +0300)]

Remove an unused parameter from x264_slicetype_frame_cost()

The b_intra_penalty parameter is no longer used anywhere after the
improvements to the --b-adapt 1 algorithm.

commit | commitdiff | tree

Anton Mitrofanov [Sun, 10 Apr 2016 17:17:32 +0000 (20:17 +0300)]

Improve the --b-adapt 1 algorithm

Roughly the same speed as before but with significantly better results,
comparable to --b-adapt 2.

commit | commitdiff | tree

Henrik Gramner [Sun, 3 Apr 2016 13:49:26 +0000 (15:49 +0200)]

analyse: i_sub_partition write combining

commit | commitdiff | tree

Henrik Gramner [Tue, 15 Mar 2016 19:16:45 +0000 (20:16 +0100)]

x86: Use one less register in mbtree_propagate_cost_avx2

Avoids the need to save and restore xmm6 on 64-bit Windows.

commit | commitdiff | tree

Henrik Gramner [Fri, 4 Mar 2016 16:53:08 +0000 (17:53 +0100)]

x86: Add asm for mbtree fixed point conversion

The QP offsets of each macroblock are stored as floats internally and
converted to big-endian Q8.8 fixed point numbers when written to the 2-pass
stats file, and converted back to floats when read from the stats file.

Add SSSE3 and AVX2 implementations for conversions in both directions.

About 8x faster than C on Haswell.
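
For reference, a plain C sketch of the conversion described above. The
function names, the round-to-nearest step and the range assertion are
assumptions made for illustration; the actual C and asm implementations in
x264 may clip and round differently.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a float QP offset as big-endian Q8.8 (8 integer bits including
 * sign, 8 fractional bits). Rounding to nearest is an assumption here. */
static void qp_offset_to_q8_8_be(float offset, uint8_t out[2])
{
    /* The caller is assumed to keep the offset within Q8.8 range. */
    assert(offset >= -128.0f && offset < 127.998f);

    int16_t fixed = (int16_t)(offset * 256.0f + (offset < 0 ? -0.5f : 0.5f));
    out[0] = (uint8_t)((uint16_t)fixed >> 8);   /* high byte first */
    out[1] = (uint8_t)((uint16_t)fixed & 0xff);
}

/* Decode big-endian Q8.8 back to a float QP offset. */
static float q8_8_be_to_qp_offset(const uint8_t in[2])
{
    /* The uint16_t -> int16_t conversion is implementation-defined for
     * values above INT16_MAX, but wraps as expected on two's complement
     * targets. */
    int16_t fixed = (int16_t)(uint16_t)((in[0] << 8) | in[1]);
    return fixed * (1.0f / 256.0f);
}
```

An offset of 1.5 becomes the bytes 0x01 0x80, and -0.25 becomes 0xFF 0xC0.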

commit | commitdiff | tree

Anton Mitrofanov [Thu, 7 Apr 2016 10:09:03 +0000 (13:09 +0300)]

x86inc: Enable AVX emulation in additional cases

Allows emulation to work when dst is equal to src2 as long as the
instruction is commutative, e.g. `addps m0, m1, m0`.

commit | commitdiff | tree

Anton Mitrofanov [Thu, 7 Apr 2016 09:48:29 +0000 (12:48 +0300)]

x86inc: Improve handling of %ifid with multi-token parameters

The yasm/nasm preprocessor only checks the first token, which means that
parameters such as `dword [rax]` are treated as identifiers, which is
generally not what we want.

commit | commitdiff | tree

Anton Mitrofanov [Mon, 28 Mar 2016 15:35:38 +0000 (18:35 +0300)]

x86inc: Fix AVX emulation of some instructions

commit | commitdiff | tree

Henrik Gramner [Fri, 4 Mar 2016 16:51:41 +0000 (17:51 +0100)]

x86inc: Fix AVX emulation of scalar float instructions

Those instructions are not commutative since they only change the first
element in the vector and leave the rest unmodified.

commit | commitdiff | tree

Henrik Gramner [Sat, 27 Feb 2016 19:34:39 +0000 (20:34 +0100)]

x86: dct2x4dc asm

Only used in 4:2:2. MMX2 version implemented for 8-bit, SSE2 and AVX
versions implemented for high bit-depth.

2.5x faster on 32-bit and 1.6x faster on 64-bit compared to C on Ivy Bridge.

commit | commitdiff | tree

Henrik Gramner [Sat, 20 Feb 2016 19:31:22 +0000 (20:31 +0100)]

x86: SSE2/AVX idct_dequant_2x4_(dc|dconly)

Only used in 4:2:2. Both 8-bit and high bit-depth implemented.

Approximate performance improvement compared to C on Ivy Bridge:

                         x86-32  x86-64
idct_dequant_2x4_dc      2.1x    1.7x
idct_dequant_2x4_dconly  2.7x    2.0x

Helps more on 32-bit due to the C versions being register starved.

commit | commitdiff | tree

Henrik Gramner [Sat, 20 Feb 2016 15:53:35 +0000 (16:53 +0100)]

checkasm: Fix idct_dequant_2x4_(dc|dconly) tests

They used the wrong qp values and the dconly test had the wrong name. This
was undetected before because there weren't any assembly implementations.

commit | commitdiff | tree

Henrik Gramner [Sun, 7 Feb 2016 13:55:26 +0000 (14:55 +0100)]

checkasm: Disable Windows Error Reporting

When developing new assembly code it's expected that checkasm may crash,
and the error reporting dialog popup can be somewhat annoying.

commit | commitdiff | tree

Henrik Gramner [Sat, 6 Feb 2016 17:49:46 +0000 (18:49 +0100)]

windows: Flag debug builds in the resource file

commit | commitdiff | tree

Henrik Gramner [Thu, 4 Feb 2016 19:06:57 +0000 (20:06 +0100)]

cli: Refactor filter option parsing

The old code contained a whole bunch of memory leaks, unchecked mallocs,
sections of dead code, etc. and was generally overly complex.

Also consolidate some memory allocations into a single one.

commit | commitdiff | tree

Henrik Gramner [Sun, 31 Jan 2016 20:50:52 +0000 (21:50 +0100)]

ffms: Various improvements

* Drop the MinGW Unicode workarounds. Those were required at the time
   Windows Unicode support was added to x264 but the underlying problem
   has since been fixed in FFMS.

* Use FFMS_IndexBelongsToFile() as an additional sanity check when reading
   an index file to ensure that it belongs to the current source video.

* Upgrade to the new API to prevent deprecation warnings when compiling.

* Fix a resource leak that would occur if FFMS_GetFirstTrackOfType() or
   FFMS_CreateVideoSource() failed.

* Minor string handling adjustments related to progress reporting.

This increases the FFMS version requirement from 2.16.2 to 2.21.0.

commit | commitdiff | tree

Henrik Gramner [Mon, 11 Apr 2016 14:59:46 +0000 (16:59 +0200)]

msvc: Add snprintf/vsnprintf replacements

MSVC pre-VS2015 has broken snprintf/vsnprintf implementations which are
incompatible with C99 and may lead to buffer overflows.

commit | commitdiff | tree

Henrik Gramner [Sun, 31 Jan 2016 19:21:01 +0000 (20:21 +0100)]

configure: Define feature test macros for --std=gnu99

Makes the printf() family functions on MinGW use the correct C99 POSIX
versions instead of the broken pre-VS2015 Microsoft ones.

Also allows us to get rid of some _GNU_SOURCE and _ISOC99_SOURCE defines.

commit | commitdiff | tree

Henrik Gramner [Thu, 28 Jan 2016 17:37:37 +0000 (18:37 +0100)]

mingw: Enable high-entropy ASLR on 64-bit Windows

To fully utilize HEASLR the image base address must also be set above
4 GiB. For consistency use the same address as MSVC uses by default.

This requires binutils 2.25 which isn't available on all common
distributions, so only enable it after checking that it's supported.

commit | commitdiff | tree

Henrik Gramner [Sun, 24 Jan 2016 00:48:18 +0000 (01:48 +0100)]

msvs: WinRT support

To compile x264 for WinRT the following additional steps have to be performed.

* Ensure that the necessary SDK is installed.

* Set the correct environment variables in the VS command prompt as shown at
   https://trac.ffmpeg.org/wiki/CompilationGuide/WinRT

* Add one of the following to --extra-cflags depending on the target OS:
   "-DWINAPI_FAMILY=WINAPI_FAMILY_PC_APP -D_WIN32_WINNT=0x0A00" (Windows 10)
   "-DWINAPI_FAMILY=WINAPI_FAMILY_PC_APP -D_WIN32_WINNT=0x0603" (Windows 8.1)

commit | commitdiff | tree

Henrik Gramner [Sun, 24 Jan 2016 22:58:40 +0000 (23:58 +0100)]

configure: Disable CLI libraries when CLI is disabled

commit | commitdiff | tree

Henrik Gramner [Fri, 5 Feb 2016 17:46:13 +0000 (18:46 +0100)]

matroska: mk_close: Check fseek() return value

commit | commitdiff | tree

Henrik Gramner [Fri, 5 Feb 2016 17:46:02 +0000 (18:46 +0100)]

parse_qpfile: Check ftell() and fseek() return values

commit | commitdiff | tree

Anton Mitrofanov [Sun, 10 Apr 2016 17:13:59 +0000 (20:13 +0300)]

Use the correct default B-ref placement with B-pyramid

Cost analyse functions expect the placement of the B-ref in a sequence of
an even number of B-frames to be located towards the beginning while the
actual placement was towards the end.

Change the placement to be consistent with the analyse expectations, e.g.
PbbBbP -> PbBbbP.
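
The placement rule can be illustrated with a small sketch. It assumes the
B-ref is the middle frame of the run, which matches the example above but is
not taken from the x264 source; bref_index and gop_pattern are hypothetical
helpers.

```c
#include <assert.h>

/* 0-based index, within a run of n consecutive B-frames, of the frame used
 * as B-ref. For an even n the earlier of the two middle frames is chosen,
 * i.e. the placement leans towards the beginning. */
static int bref_index(int n_bframes)
{
    assert(n_bframes >= 1);
    return (n_bframes - 1) / 2;
}

/* Write a pattern such as "PbBbbP" into out, which must hold at least
 * n_bframes + 3 bytes. */
static void gop_pattern(int n_bframes, char *out)
{
    int bref = bref_index(n_bframes);

    out[0] = 'P';
    for (int i = 0; i < n_bframes; i++)
        out[1 + i] = (i == bref) ? 'B' : 'b';
    out[n_bframes + 1] = 'P';
    out[n_bframes + 2] = '\0';
}
```

With four B-frames this yields PbBbbP; using n / 2 instead of (n - 1) / 2
would give the old PbbBbP placement. For an odd number of B-frames both
expressions pick the same middle frame.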

commit | commitdiff | tree

Henrik Gramner [Fri, 5 Feb 2016 17:45:47 +0000 (18:45 +0100)]

parse_zones: Fix memory leak

commit | commitdiff | tree

Alexey Samsonov [Tue, 26 Jan 2016 00:05:25 +0000 (16:05 -0800)]

Fix float-cast-overflow in x264_ratecontrol_end function

According to the C standard, it is undefined behavior to cast a negative
floating point number to an unsigned integer. Float-cast-overflow in
general is known to produce different results on different architectures.

Building x264 code with Clang and -fsanitize=float-cast-overflow
(http://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#availablle-checks)
and running it on some real-life examples occasionally produces errors
of the form:

encoder/ratecontrol.c:1892: runtime error: value -5011.14 is outside the
range of representable values of type 'unsigned short'

Fix these errors by explicitly coding the de-facto x86 behavior: casting
float to uint16_t through int16_t.
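
A sketch of the cast pattern, assuming the value fits in int16_t. The helper
name is hypothetical and the exact expression used in ratecontrol.c is not
reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a possibly negative float to uint16_t without undefined
 * behavior: float -> int16_t truncates toward zero and is defined while
 * the value is in int16_t range; int16_t -> uint16_t is defined to wrap
 * modulo 65536. */
static uint16_t float_to_u16_via_i16(float v)
{
    /* Outside this range the first cast would itself be undefined. */
    assert(v > -32769.0f && v < 32768.0f);
    return (uint16_t)(int16_t)v;
}
```

For the value in the error message above, -5011.14 truncates to -5011 and
wraps to 60525, whereas a direct (uint16_t) cast of the float is undefined.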

commit | commitdiff | tree

Sebastian Dröge [Sun, 20 Dec 2015 20:49:35 +0000 (23:49 +0300)]

Fix AVC-Intra padding for non-Annex B encoding

commit | commitdiff | tree

Anton Mitrofanov [Mon, 11 Jan 2016 18:39:22 +0000 (21:39 +0300)]

ppc: Only perform AltiVec detection if compiled with AltiVec enabled

commit | commitdiff | tree

Anton Mitrofanov [Tue, 13 Oct 2015 12:30:16 +0000 (15:30 +0300)]

2-pass: Take into account possible frame reordering

commit | commitdiff | tree

Anton Mitrofanov [Tue, 13 Oct 2015 09:54:05 +0000 (12:54 +0300)]

Revise the 2-pass algorithm

commit | commitdiff | tree

Anton Mitrofanov [Mon, 4 Jan 2016 23:41:43 +0000 (02:41 +0300)]

Revise the row VBV algorithm (part 2)

Should fix rare cases of VBV emergency mode activation caused by too much trust
in the row predictors.

commit | commitdiff | tree

Henrik Gramner [Fri, 1 Jan 2016 11:44:31 +0000 (12:44 +0100)]

Bump dates to 2016

commit | commitdiff | tree

Henrik Gramner [Mon, 26 Oct 2015 18:54:20 +0000 (19:54 +0100)]

cli: Use memory-mapped input frames for yuv and y4m

Improves performance by avoiding extraneous memory copying.
Most beneficial on fast settings.

On average around 5-10% faster overall on ultrafast but the
performance improvement can be even larger in some cases.

commit | commitdiff | tree

Henrik Gramner [Thu, 7 Jan 2016 00:59:24 +0000 (01:59 +0100)]

y4m: Support extended frame headers when seeking

Use the actual length of the frame header of the first frame instead of
assuming a header without extensions when calculating the frame size.

Also makes the frame counter more accurate with extended frame headers.

commit | commitdiff | tree

Henrik Gramner [Tue, 3 Nov 2015 16:55:08 +0000 (17:55 +0100)]

configure: Simplify cygwin/mingw/msys code

Avoids some code duplication.

Also drop the -mno-cygwin check since that option was removed back in 2008.

commit | commitdiff | tree

Henrik Gramner [Mon, 26 Oct 2015 17:52:46 +0000 (18:52 +0100)]

y4m: Avoid some redundant strlen() calls

commit | commitdiff | tree

Henrik Gramner [Sun, 25 Oct 2015 16:15:10 +0000 (17:15 +0100)]

Simplify threadpool_wait

commit | commitdiff | tree

Henrik Gramner [Fri, 16 Oct 2015 17:05:34 +0000 (19:05 +0200)]

windows: Use native threads by default

--disable-win32thread can be passed as an argument to configure to compile
with pthreads, which was the old default behavior.

commit | commitdiff | tree

Henrik Gramner [Sun, 11 Oct 2015 20:32:11 +0000 (22:32 +0200)]

x86: Avoid some bypass delays and false dependencies

A bypass delay of 1-3 clock cycles may occur on some CPUs when transitioning
between int and float domains, so try to avoid that if possible.

commit | commitdiff | tree

Henrik Gramner [Sun, 11 Oct 2015 20:32:03 +0000 (22:32 +0200)]

x86: Enable high bit-depth x264_coeff_last64_avx2_lzcnt

The function existed but was never enabled.

commit | commitdiff | tree

Geza Lore [Mon, 12 Oct 2015 12:13:42 +0000 (13:13 +0100)]

x86inc: Add debug symbols indicating sizes of compiled functions

Some debuggers/profilers use this metadata to determine which function a
given instruction is in; without it they can get confused by local labels
(if you haven't stripped those). On the other hand, some tools are still
confused even with this metadata. e.g. this fixes `gdb`, but not `perf`.

Currently only implemented for ELF.

commit | commitdiff | tree

Henrik Gramner [Fri, 16 Oct 2015 19:28:49 +0000 (21:28 +0200)]

x86inc: Avoid creating unnecessary local labels

The REP_RET workaround is only needed on old AMD cpus, and the labels clutter
up the symbol table and confuse debugging/profiling tools, so use EQU to
create SHN_ABS symbols instead of creating local labels. Furthermore, skip
the workaround completely in functions that definitely won't run on such cpus.

This patch doesn't modify any emitted instructions, and doesn't actually affect
x264 at all. It's only for other projects that use x86inc.asm without an
appropriate `strip` command in their buildsystem.

Note that EQU is just creating a local label when using nasm instead of yasm.
This is probably a bug, but at least it doesn't break anything.

commit | commitdiff | tree

Henrik Gramner [Thu, 15 Oct 2015 15:42:49 +0000 (17:42 +0200)]

x86inc: Simplify AUTO_REP_RET

cpuflags is never undefined any more, it's set to 0 instead.

Also fix an incorrect comment.

commit | commitdiff | tree

Henrik Gramner [Mon, 12 Oct 2015 19:55:11 +0000 (21:55 +0200)]

x86inc: Use more consistent indentation

commit | commitdiff | tree

Henrik Gramner [Mon, 12 Oct 2015 18:15:18 +0000 (20:15 +0200)]

x86inc: Preserve arguments when allocating stack space

When allocating stack space with a larger alignment than the known stack
alignment a temporary register is used for storing the stack pointer.
Ensure that this isn't one of the registers used for passing arguments.

commit | commitdiff | tree

Henrik Gramner [Sat, 16 Jan 2016 23:25:47 +0000 (00:25 +0100)]

x86inc: Improve FMA instruction handling

* Correctly handle FMA instructions with memory operands.
* Print a warning if FMA instructions are used without the correct cpuflag.
* Simplify the instantiation code.
* Clarify documentation.

Only the last operand in FMA3 instructions can be a memory operand. When
converting FMA4 instructions to FMA3 instructions we can utilize the fact
that multiply is a commutative operation and reorder operands if necessary
to ensure that a memory operand is used only as the last operand.

commit | commitdiff | tree

Henrik Gramner [Sun, 11 Oct 2015 20:31:53 +0000 (22:31 +0200)]

x86inc: Be more verbose in assertion failures

commit | commitdiff | tree

Henrik Gramner [Wed, 30 Sep 2015 21:17:00 +0000 (23:17 +0200)]

x86inc: Make cpuflag() and notcpuflag() return 0 or 1

Makes it possible to use them in arithmetic expressions.

commit | commitdiff | tree

Henrik Gramner [Fri, 30 Oct 2015 15:55:49 +0000 (16:55 +0100)]

encoder_open: Fix memory leak

Furthermore, the x264_analyse_prepare_costs() and x264_analyse_init_costs()
functions were only used in x264_encoder_open(), so move that entire section
of code to analyse.c as well to simplify things.

commit | commitdiff | tree

Janne Grunau [Wed, 18 Nov 2015 10:08:22 +0000 (11:08 +0100)]

arm: do not fill mc_weight*_neon tabs for HIGH_BIT_DEPTH

The asm is only for 8-bit and function prototypes reflect that. Avoids
numerous warnings with --bit-depth=9/10.

commit | commitdiff | tree

Janne Grunau [Tue, 13 Oct 2015 21:50:11 +0000 (23:50 +0200)]

arm: Eliminate text relocations in asm

Android 6 does not link shared libraries with text relocations.

Make the movrel macro position independent and add movrelx for indirect
loads of external symbols.

Move the function pointer table for the aligned memcpy variants to the
data.rel.ro section on Linux/Android.

commit | commitdiff | tree

Martin Storsjö [Thu, 15 Oct 2015 08:50:33 +0000 (11:50 +0300)]

arm: Don't assume alignment in mbtree_propagate_list_internal where it isn't provided

commit | commitdiff | tree

Janne Grunau [Tue, 13 Oct 2015 21:50:12 +0000 (23:50 +0200)]

arm: Fix checkasm register clobber check on iOS

r9 is a volatile register in the iOS ABI and will therefore not be
preserved by compiled functions like the luma motion compensation.

Add the symbol prefix to the puts() call and use blx since a switch
between arm and thumb mode might be required.

commit | commitdiff | tree

Anton Mitrofanov [Wed, 30 Sep 2015 22:02:16 +0000 (01:02 +0300)]

ppc: Add detection of AltiVec support for FreeBSD

Patch from FreeBSD ports.

commit | commitdiff | tree

Anton Mitrofanov [Mon, 28 Sep 2015 18:07:55 +0000 (21:07 +0300)]

Don't assume 16-byte stack alignment by default on x86-32

Some compilers, depending on the target OS, use 4-byte stack alignment by default.
Explicitly check known good compilers and specific options for stack alignment.

commit | commitdiff | tree

Anton Mitrofanov [Tue, 22 Sep 2015 18:33:07 +0000 (21:33 +0300)]

Fix a few static analyzer performance hints

commit | commitdiff | tree

Anton Mitrofanov [Tue, 22 Sep 2015 17:19:23 +0000 (20:19 +0300)]

Revise the row VBV algorithm

commit | commitdiff | tree

Anton Mitrofanov [Tue, 22 Sep 2015 16:26:25 +0000 (19:26 +0300)]

Fix high bit depth lookahead cost compensation algorithm

Now high bit depth VBV should act more like 8-bit depth one.

commit | commitdiff | tree

Anton Mitrofanov [Tue, 22 Sep 2015 16:05:52 +0000 (19:05 +0300)]

Correctly update the intra row predictor in B-frames

It was previously used but never updated from its initialization value.

commit | commitdiff | tree

Anton Mitrofanov [Tue, 22 Sep 2015 15:58:24 +0000 (18:58 +0300)]

Change the predictors update algorithm

Keep predictor offsets more stable. This should fix VBV misprediction in frames
with a large difference in complexity between the top and bottom parts.

commit | commitdiff | tree

Martin Storsjö [Thu, 3 Sep 2015 06:30:44 +0000 (09:30 +0300)]

arm: Implement x264_mbtree_propagate_{cost, list}_neon

The cost function could be simplified to avoid having to clobber
q4/q5, but this requires reordering instructions, which increases
the total runtime.

checkasm timing       Cortex-A7      A8      A9
mbtree_propagate_cost_c      63702   155835  62829
mbtree_propagate_cost_neon   17199   10454   11106

mbtree_propagate_list_c      104203  108949  84532
mbtree_propagate_list_neon   82035   78348   60410

commit | commitdiff | tree

Martin Storsjö [Thu, 3 Sep 2015 06:30:43 +0000 (09:30 +0300)]

x86: Share the mbtree_propagate_list macro with aarch64

This avoids having to duplicate the same code for all architectures
that implement only the internal part of this function in assembler.

commit | commitdiff | tree

Martin Storsjö [Wed, 2 Sep 2015 19:39:51 +0000 (22:39 +0300)]

arm: Implement luma intra deblocking

checkasm timing       Cortex-A7      A8     A9
deblock_luma_intra[0]_c      5988    4653   4316
deblock_luma_intra[0]_neon   3103    2170   2128
deblock_luma_intra[1]_c      7119    5905   5347
deblock_luma_intra[1]_neon   2068    1381   1412

This includes extra optimizations by Janne Grunau.

Timings from a separate build, on Exynos 5422:

                      Cortex-A7     A15
deblock_luma_intra[0]_c      6627   3300
deblock_luma_intra[0]_neon   3059   1128
deblock_luma_intra[1]_c      7314   4128
deblock_luma_intra[1]_neon   2038   720

commit | commitdiff | tree

Martin Storsjö [Mon, 31 Aug 2015 19:40:31 +0000 (22:40 +0300)]

arm: Implement some neon 8x16c intra predict functions

checkasm timing       Cortex-A7      A8     A9
intra_predict_8x16c_dct_c    862     540    590
intra_predict_8x16c_dct_neon 608     511    657
intra_predict_8x16c_h_c      972     707    719
intra_predict_8x16c_h_neon   722     656    672
intra_predict_8x16c_p_c      10183   9819   8655
intra_predict_8x16c_p_neon   2622    1972   1983

commit | commitdiff | tree

Martin Storsjö [Thu, 27 Aug 2015 21:15:01 +0000 (00:15 +0300)]

arm: Implement x264_plane_copy_neon

checkasm timing       Cortex-A7      A8     A9
plane_copy_c                 13124   10925  9106
plane_copy_neon              7349    5103   8945

commit | commitdiff | tree

Martin Storsjö [Fri, 28 Aug 2015 06:40:24 +0000 (09:40 +0300)]

checkasm: arm: Check register clobbering

Cast the function pointer to a different type signature, to
be able to use uint64_t as return type (instead of intptr_t) for
those calls that require it.

Use two separate functions, depending on whether neon is available.

commit | commitdiff | tree

Martin Storsjö [Thu, 13 Aug 2015 21:00:57 +0000 (00:00 +0300)]

checkasm: Try different widths for ssd_nv12

To test all codepaths in the aarch64 neon implementation, one at
the very least needs to test with width 8, 16, 24 and 32.

commit | commitdiff | tree

Jerome Duval [Fri, 13 Jun 2014 19:56:27 +0000 (19:56 +0000)]

Haiku support

Add Haiku as supported platform in configure.
Haiku has no nice() function, use the platform specific substitute instead.

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 11:38:20 +0000 (14:38 +0300)]

checkasm: aarch64: Check register clobbering

Disable this on iOS, since it has got a slightly different ABI
for vararg parameters.

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 20:36:45 +0000 (23:36 +0300)]

arm: Implement x264_decimate_score15/16/64_neon

checkasm timing       Cortex-A7      A8     A9
decimate_score15_c           764     736    535
decimate_score15_neon        487     494    453
decimate_score16_c           782     727    553
decimate_score16_neon        487     494    521
decimate_score64_c           2361    2597   2011
decimate_score64_neon        1017    802    785

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 20:36:44 +0000 (23:36 +0300)]

arm: Implement chroma intra deblock

checkasm timing              Cortex-A7      A8     A9
deblock_chroma_420_intra_mbaff_c    1469    1276   1181
deblock_chroma_420_intra_mbaff_neon 981     717    644
deblock_chroma_intra[1]_c           2954    2402   2321
deblock_chroma_intra[1]_neon        947     581    575
deblock_h_chroma_420_intra_c        2859    2509   2264
deblock_h_chroma_420_intra_neon     1480    1119   1028
deblock_h_chroma_422_intra_c        6211    5030   4792
deblock_h_chroma_422_intra_neon     2894    1990   2077

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 11:38:17 +0000 (14:38 +0300)]

arm: Implement x264_pixel_sa8d_satd_16x16_neon

This requires spilling some registers to the stack,
contrary to the aarch64 version.

checkasm timing        Cortex-A7      A8     A9
sa8d_satd_16x16_neon          12936   6365   7492
sa8d_satd_16x16_separate_neon 14841   6605   8324

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 11:38:16 +0000 (14:38 +0300)]

arm: Implement x264_deblock_h_chroma_mbaff_neon

checkasm timing        Cortex-A7      A8     A9
deblock_chroma_420_mbaff_c    1944    1706   1526
deblock_chroma_420_mbaff_neon 1210    873    865

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 11:38:15 +0000 (14:38 +0300)]

arm: Implement x264_deblock_h_chroma_422_neon

checkasm timing       Cortex-A7      A8     A9
deblock_h_chroma_422_c       6953    6269   5145
deblock_h_chroma_422_neon    3905    2569   2551

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 11:38:14 +0000 (14:38 +0300)]

arm: Implement integral_init4/8h/v_neon

checkasm timing       Cortex-A7      A8     A9
integral_init4h_c            10466   8590   6161
integral_init4h_neon         3021    1494   1800
integral_init4v_c            16250   13590  13628
integral_init4v_neon         3473    2073   3291
integral_init8h_c            10100   8275   5705
integral_init8h_neon         4403    2344   2751
integral_init8v_c            6403    4632   4999
integral_init8v_neon         1184    783    1306

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 11:38:13 +0000 (14:38 +0300)]

arm: Implement x264_denoise_dct_neon

checkasm timing       Cortex-A7      A8     A9
denoise_dct_c                6604    5510   5858
denoise_dct_neon             1774    1139   1614

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 11:38:12 +0000 (14:38 +0300)]

arm: Add x264_nal_escape_neon

checkasm timing      Cortex-A7      A8      A9
nal_escape_c                852758  879566  655497
nal_escape_neon             376831  450678  371673

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 11:38:11 +0000 (14:38 +0300)]

arm: Add neon versions of vsad, asd8 and ssd_nv12_core

These are straight translations of the aarch64 versions.

checkasm timing      Cortex-A7      A8      A9
vsad_c                      16234   10984   9850
vsad_neon                   2132    1020    789

asd8_c                      5859    3561    3543
asd8_neon                   1407    1279    1250

ssd_nv12_c                  608096  591072  426285
ssd_nv12_neon               72752   33549   41347

commit | commitdiff | tree

Martin Storsjö [Tue, 25 Aug 2015 11:38:10 +0000 (14:38 +0300)]

checkasm: Check the right output range for integral_initXh

These functions write their output into sum+stride, while we previously
only checked [0..stride-8] within the sum array.

This catches the previously broken aarch64 version of these functions.

Also check up until stride-4 elements for init4h.

commit | commitdiff | tree

Janne Grunau [Thu, 20 Aug 2015 11:55:54 +0000 (13:55 +0200)]

aarch64: Skip deblocking in x264_deblock_h_chroma_422_neon

If the parameters (alpha, beta, tc0[]) indicated that the deblocking
should have been skipped, every 2nd chroma line would have been deblocked
anyway.

deblock_h_chroma_422_neon: 2259 (before)
deblock_h_chroma_422_neon: 2192 (after)

commit | commitdiff | tree

Janne Grunau [Mon, 17 Aug 2015 14:39:20 +0000 (16:39 +0200)]

aarch64: Optimize various intra_predict asm functions

Make them at least as fast as the compiled C version (tested on
cortex-a53 vs. gcc 4.9.2).

                        C     NEON (before)   NEON (after)
intra_predict_4x4_dc:   260   335             260
intra_predict_4x4_dct:  210   265             200
intra_predict_8x8c_dc:  497   548             493
intra_predict_8x8c_v:   232   309             179 (arm64)
intra_predict_8x16c_dc: 795   830             790

commit | commitdiff | tree

Janne Grunau [Tue, 18 Aug 2015 08:25:10 +0000 (10:25 +0200)]

aarch64: Faster intra_predict_4x4_h

Use multiplication with 0x01010101 for splats.
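
The splat trick in plain C, as a sketch only: multiplying a byte by
0x01010101 replicates it into all four bytes of a 32-bit word, so a whole
4-pixel row can be stored at once. The function names and the separate left[]
parameter are hypothetical and do not mirror the real x264 prototype.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replicate one byte into all four bytes of a 32-bit word. */
static uint32_t splat_u8(uint8_t b)
{
    return (uint32_t)b * 0x01010101u;
}

/* Horizontal 4x4 prediction: each row is filled with its left neighbour. */
static void predict_4x4_h(uint8_t *dst, int stride, const uint8_t left[4])
{
    assert(stride >= 4);

    for (int y = 0; y < 4; y++) {
        uint32_t row = splat_u8(left[y]);
        /* memcpy keeps the 32-bit store alignment-safe; all four bytes
         * are equal, so byte order does not matter. */
        memcpy(dst + y * stride, &row, sizeof(row));
    }
}
```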

On a cortex-a53:
                      gcc 4.9.2  llvm 3.6  neon (before)  neon (after)
intra_predict_4x4_h:  162        147       160/155        139/135

commit | commitdiff | tree

Janne Grunau [Tue, 18 Aug 2015 08:25:09 +0000 (10:25 +0200)]

aarch64: Fix coeff_level_run* macros with LLVM's assembler

LLVM's integrated assembler does not treat symbols as integer constants.
