Historical notes:
Slice-based threading was the original threading model of x264.  It was replaced with frame-based threading in r607.  This document was originally written at that time.  Slice-based threading was brought back (as an optional mode) in r1364 for low-latency encoding.  Furthermore, frame-based threading was modified significantly in r1246, with the addition of threaded lookahead.

Old threading method: slice-based
application calls x264
x264 runs B-adapt and ratecontrol (serial)
split frame into several slices, and spawn a thread for each slice
wait until all threads are done
deblock and hpel filter (serial)
return to application
In x264cli, there is one additional thread to decode the input.
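The slice-based flow above can be sketched in Python as follows. This is an illustration only, not x264 code: split_rows, encode_slice and encode_frame_sliced are hypothetical names, the "encode" just returns a label, and the sketch assumes no more slices than macroblock rows.

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(n_rows, n_slices):
    """Split n_rows macroblock rows into n_slices contiguous ranges of near-equal size."""
    base, extra = divmod(n_rows, n_slices)
    ranges, start = [], 0
    for i in range(n_slices):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def encode_slice(frame_id, rows):
    # Hypothetical stand-in for the per-slice encode; returns a label only.
    first, end = rows
    return f"frame{frame_id}:rows{first}-{end - 1}"

def encode_frame_sliced(frame_id, n_rows, n_threads):
    # Serial part: B-adapt and ratecontrol would run here.
    slices = split_rows(n_rows, n_threads)
    # Parallel part: one thread per slice; leaving the with-block waits for all of them.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        encoded = list(pool.map(lambda rows: encode_slice(frame_id, rows), slices))
    # Serial part: deblock and hpel filter would run here, once every slice is done.
    return encoded
```

The two serial parts are the reason this model stops scaling: they run on one thread no matter how many slices there are.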

New threading method: frame-based
application calls x264
x264 requests a frame from lookahead, which runs B-adapt and ratecontrol in parallel with the current thread, separated by a buffer of size sync-lookahead
spawn a thread for this frame
thread runs encode, deblock, hpel filter
meanwhile x264 waits for the oldest thread to finish
return to application, but the rest of the threads continue running in the background
No additional threads are needed to decode the input, unless decoding is slower than slice+deblock+hpel, in which case an additional input thread would allow decoding in parallel.
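A minimal Python sketch of the frame-based flow, under stated assumptions: FrameThreadedEncoder and encode_frame are hypothetical names, the lookahead stage is left out, and the per-frame work is a placeholder. It shows only the "spawn a thread for this frame, wait for the oldest one" pattern, which gives an output delay of (threads - 1) frames.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def encode_frame(frame_id):
    # Hypothetical stand-in for encode + deblock + hpel filter of one frame.
    return f"encoded{frame_id}"

class FrameThreadedEncoder:
    def __init__(self, n_threads):
        self.n_threads = n_threads
        self.pool = ThreadPoolExecutor(max_workers=n_threads)
        self.in_flight = deque()  # futures, oldest first

    def encode(self, frame_id):
        """Start one frame; return the oldest finished frame, or None while the pipeline fills."""
        self.in_flight.append(self.pool.submit(encode_frame, frame_id))
        if len(self.in_flight) < self.n_threads:
            return None  # nothing to output yet
        # Wait for the oldest frame only; younger frames keep running in the background.
        return self.in_flight.popleft().result()

    def flush(self):
        """Drain the remaining frames in order at end of stream."""
        out = [future.result() for future in self.in_flight]
        self.in_flight.clear()
        self.pool.shutdown()
        return out
```

With three threads, the first two calls to encode() return None, the third returns frame 0, and flush() returns whatever is still in flight.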

Penalties for slice-based threading:
Each slice adds some bitrate (or equivalently reduces quality), for a variety of reasons: the slice header costs some bits, cabac contexts are reset, and mvs and intra samples can't be predicted across the slice boundary.
In CBR mode, multiple slices encode simultaneously, thus increasing the maximum misprediction possible with VBV.
Some parts of the encoder are serial, so it doesn't scale well with lots of cpus.

Some numbers on penalties for slicing:
Tested at 720p with 45 slices (one per mb row) to maximize the total cost for easy measurement. Averaged over 4 movies at crf20 and crf30. Total cost: +30% bitrate at constant psnr.
I enabled the various components of slicing one at a time, and measured the portion of that cost they contribute:
    * 34% intra prediction
    * 25% redundant slice headers, nal headers, and rounding to whole bytes
    * 16% mv prediction
    * 16% reset cabac contexts
    * 6% deblocking between slices (you don't strictly have to turn this off just for standard compliance, but you do if you want to use slices for decoder multithreading)
    * 2% cabac neighbors (cbp, skip, etc)
The proportional cost of redundant headers should certainly depend on bitrate (since the header size is constant and everything else depends on bitrate). Deblocking should too (due to varying deblock strength).
But none of the proportions should depend strongly on the number of slices: some are triggered per slice while some are triggered per macroblock-that's-on-the-edge-of-a-slice, but as long as there's no more than 1 slice per row, the relative frequency of those two conditions is determined solely by the image width.
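As a worked example of these figures, the snippet below converts each share of the cost into an absolute bitrate increase, assuming the shares simply multiply the +30% total. The 1280-pixel frame width and the 16x16 macroblock size are assumptions used to illustrate the geometry, and the shortened component names are mine. The listed shares add up to 99%, presumably because of rounding.

```python
TOTAL_COST = 0.30  # +30% bitrate at constant psnr, 45 slices at 720p

# Portion of the total cost contributed by each component, from the list above.
shares = {
    "intra prediction": 0.34,
    "headers and byte rounding": 0.25,
    "mv prediction": 0.16,
    "reset cabac contexts": 0.16,
    "deblocking between slices": 0.06,
    "cabac neighbors": 0.02,
}

# Absolute bitrate increase attributable to each component, in percent.
absolute = {name: round(TOTAL_COST * share * 100, 1) for name, share in shares.items()}

# Geometry, assuming 16x16 macroblocks and a 1280x720 frame.
mb_rows = 720 // 16       # 45 rows, hence 45 slices at one slice per row
mbs_per_row = 1280 // 16  # 80 macroblocks along each slice edge

for name, pct in absolute.items():
    print(f"{name}: +{pct}% bitrate")
```

Under these assumptions intra prediction alone accounts for about 10.2 points of the 30% increase, and each slice header is paired with 80 edge macroblocks regardless of how many slices there are.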

Penalties for frame-based threading:
To allow encoding of multiple frames in parallel, we have to ensure that any given macroblock uses motion vectors only from pieces of the reference frames that have been encoded already. This is usually not noticeable, but can matter for very fast upward motion.
We have to commit to one frame type before starting on the frame. Thus scenecut detection must run during the lowres pre-motion-estimation along with B-adapt, which makes it faster but less accurate than re-encoding the whole frame.
Ratecontrol gets delayed feedback, since it has to plan frame N before frame N-1 finishes.
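The motion vector restriction in the first penalty can be illustrated with a small Python sketch. It is not how x264 implements the restriction: max_mv_down, clamp_mv_y and margin_px are hypothetical, and the sketch assumes the reference frame is encoded top to bottom in 16-pixel macroblock rows. Content that moves upward is predicted from a position lower down in the reference frame, which is the part most likely to be unfinished.

```python
MB_SIZE = 16  # assumed macroblock height in pixels

def max_mv_down(mb_row, ref_rows_done, margin_px=0):
    """Largest downward vertical mv (in pixels) that stays inside the part of the
    reference frame that has already been encoded.

    mb_row:        row index of the macroblock being encoded in the current frame
    ref_rows_done: number of macroblock rows of the reference frame finished so far
    margin_px:     hypothetical safety margin, e.g. for interpolation or deblocking
    """
    first_unready_px = ref_rows_done * MB_SIZE - margin_px  # first pixel row not yet usable
    block_bottom_px = (mb_row + 1) * MB_SIZE                # bottom edge of the current block
    return first_unready_px - block_bottom_px

def clamp_mv_y(mv_y, mb_row, ref_rows_done, margin_px=0):
    """Clamp a candidate vertical mv so the referenced block lies in the ready region."""
    return min(mv_y, max_mv_down(mb_row, ref_rows_done, margin_px))
```

For a macroblock in row 10 with 14 reference rows finished, a candidate mv of 100 pixels downward is clamped to 48, while upward-pointing vectors pass through unchanged. A negative limit would mean that even the co-located block is not ready, so the thread would have to wait.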

Benchmarks:
cpu: 8core Nehalem (2x E5520) 2.27GHz, hyperthreading disabled
kernel: linux 2.6.34.7, 64-bit
x264: r1732 b20059aa
input: http://media.xiph.org/video/derf/y4m/1080p/park_joy_1080p.y4m

NOTE: the "thread count" listed below does not count the lookahead thread, only encoding threads.  This is why for "veryfast", the speedup for 2 and 3 threads exceeds the listed thread count, which would otherwise be the logical limit.
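The arithmetic behind this note, as a short Python sketch. The function names are hypothetical, and it assumes exactly one lookahead thread runs alongside the encoding threads, so the real ceiling is the listed thread count plus one.

```python
def speedup_limit(encode_threads, lookahead_threads=1):
    """Upper bound on speedup if every running thread were perfectly utilised."""
    return encode_threads + lookahead_threads

def efficiency(speedup, encode_threads, lookahead_threads=1):
    """Measured speedup as a fraction of that upper bound."""
    return speedup / speedup_limit(encode_threads, lookahead_threads)

# Frame-based "veryfast" results for 2 and 3 encoding threads, taken from the table.
veryfast_frame = {2: 2.29, 3: 3.65}
for n, s in veryfast_frame.items():
    print(f"{n} threads: {s:.2f}x of at most {speedup_limit(n)}x "
          f"({efficiency(s, n):.0%} of the limit)")
```

Both speedups exceed the listed thread count but stay below the count that includes the lookahead thread.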

threads  speedup       psnr
      slice frame   slice  frame
x264 --preset veryfast --tune psnr --crf 30
 1:   1.00x 1.00x  +0.000 +0.000
 2:   1.41x 2.29x  -0.005 -0.002
 3:   1.70x 3.65x  -0.035 +0.000
 4:   1.96x 3.97x  -0.029 -0.001
 5:   2.10x 3.98x  -0.047 -0.002
 6:   2.29x 3.97x  -0.060 +0.001
 7:   2.36x 3.98x  -0.057 -0.001
 8:   2.43x 3.98x  -0.067 -0.001
 9:         3.96x         +0.000
10:         3.99x         +0.000
11:         4.00x         +0.001
12:         4.00x         +0.001

x264 --preset medium --tune psnr --crf 30
 1:   1.00x 1.00x  +0.000 +0.000
 2:   1.54x 1.59x  -0.002 -0.003
 3:   2.01x 2.81x  -0.005 +0.000
 4:   2.51x 3.11x  -0.009 +0.000
 5:   2.89x 4.20x  -0.012 -0.000
 6:   3.27x 4.50x  -0.016 -0.000
 7:   3.58x 5.45x  -0.019 -0.002
 8:   3.79x 5.76x  -0.015 -0.002
 9:         6.49x         -0.000
10:         6.64x         -0.000
11:         6.94x         +0.000
12:         6.96x         +0.000

x264 --preset slower --tune psnr --crf 30
 1:   1.00x 1.00x  +0.000 +0.000
 2:   1.54x 1.83x  +0.000 +0.002
 3:   1.98x 2.21x  -0.006 +0.002
 4:   2.50x 2.61x  -0.011 +0.002
 5:   2.93x 3.94x  -0.018 +0.003
 6:   3.45x 4.19x  -0.024 +0.001
 7:   3.84x 4.52x  -0.028 -0.001
 8:   4.13x 5.04x  -0.026 -0.001
 9:         6.15x         +0.001
10:         6.24x         +0.001
11:         6.55x         -0.001
12:         6.89x         -0.001