Ticket #113 (new enhancement)

Opened 2 years ago

Last modified 2 months ago

Optimization of ffmpeg h264 decoder

Reported by: astrange Assigned to: astrange
Priority: normal Milestone: Sometime after 1.0
Component: ffmpeg Version:
Severity: normal Keywords:
Cc:

Description (Last modified by astrange)

The H.264 decoder is impressively fast, but not fast enough.

Possible improvements, based on profiling I did on x86:

Cache misses happen too often:

* rewrite very large functions (fill_caches, decode_mb_cabac) to be smaller; interlacing/other oddities can be slightly penalized if necessary as in hl_decode_mb.

* improve cache hints; hl_motion and hl_decode_mb use some, but both use the same ones, which might be wrong in both cases. x86 has no hints for stores, but maybe non-temporal stores could be used.

* the decoding context is over 200KB!

* lots of really complex struct accesses, ex:

"h->non_zero_count_cache[3+8*1 + 2*8*i]= h->non_zero_count[left_xy[i]][left_block[0+2*i]];"
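One way an access like the one quoted above might be simplified is to hoist the invariant parts of the address computation into local pointers. This is only a sketch: ToyContext, its array sizes, and fill_left_nnz are hypothetical stand-ins, not the real decoder context or fill_caches, and whether it helps depends on what the compiler already does.

```c
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the decoder context.
 * Array sizes are illustrative only. */
typedef struct {
    uint8_t non_zero_count_cache[6 * 8];
    uint8_t non_zero_count[4][16];
} ToyContext;

/* Same assignment as the quoted line, for i = 0 and i = 1, but with
 * the constant offset and the per-row base pointer computed once. */
static void fill_left_nnz(ToyContext *h, const int left_xy[2],
                          const int left_block[4])
{
    uint8_t *cache = h->non_zero_count_cache + 3 + 8 * 1;
    int i;

    for (i = 0; i < 2; i++) {
        const uint8_t *src = h->non_zero_count[left_xy[i]];
        cache[2 * 8 * i] = src[left_block[0 + 2 * i]];
    }
}
```

The stores land at the same indices as in the original expression (3+8*1 and 3+8*1+2*8), so the behaviour is unchanged; only the form of the address arithmetic differs.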

Branch misprediction:

* happens often in the CABAC/arithmetic decoder, which is naturally unpredictable. Rewrite more of it to be branchless?
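As a rough illustration of what "branchless" could mean here, the sketch below shows one step of a toy binary arithmetic decoder written both with a branch and with a mask-based select. ToyCoder and both functions are hypothetical and much simpler than the real CABAC code (no context state, no renormalization); they only show the technique.

```c
#include <stdint.h>

/* Hypothetical decoder state; not ffmpeg's CABACContext. */
typedef struct {
    uint32_t low;
    uint32_t range;
} ToyCoder;

/* Reference version: takes the LPS path when low falls outside
 * the MPS sub-interval. */
static int toy_decode_branchy(ToyCoder *c, uint32_t range_lps, int mps)
{
    uint32_t range_mps = c->range - range_lps;

    if (c->low >= range_mps) {
        c->low  -= range_mps;
        c->range = range_lps;
        return mps ^ 1;
    }
    c->range = range_mps;
    return mps;
}

/* Same step without a conditional jump in the source: the comparison
 * result is turned into an all-ones or all-zeros mask that selects
 * between the two outcomes. */
static int toy_decode_branchless(ToyCoder *c, uint32_t range_lps, int mps)
{
    uint32_t range_mps = c->range - range_lps;
    uint32_t mask = -(uint32_t)(c->low >= range_mps); /* ~0 if LPS, else 0 */

    c->low  -= range_mps & mask;
    c->range = range_mps ^ ((range_mps ^ range_lps) & mask);
    return mps ^ (int)(mask & 1);
}
```

Whether the compiler actually emits branch-free machine code for the second version has to be checked in the generated assembly.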

PPC is missing some assembly optimizations that x86 has (cabac, altivec), but no idea if they're needed. I don't have a recent profile on one.

Attachments

ffmpeg-cbpluma-speed.diff (1.9 kB) - added by astrange on 03/13/07 22:46:17.
This patch SHOULD be an optimization, but it causes gcc to produce worse code elsewhere.
ffmpeg-simpler-cabac.diff (3.0 kB) - added by astrange on 03/13/07 22:47:37.
Same here; get_cabac_noinline() spills registers less, but decode_cabac_residual spills more (with assembler disabled on x86). I think this is still a win on PPC.
altivec_lum.diff (16.6 kB) - added by gbooker on 03/18/07 19:43:52.
First attempt at AltiVec h264_h/v_loop_filter_luma_ routines. There is still a lot of commented-out code and other debug information, but it does work. I think that there are a few mathematical errors in there, because I swear the video looks different. With this, we are very near the speed of Apple's decoder.
altivec_lum.2.diff (16.9 kB) - added by gbooker on 03/19/07 17:39:06.
Fixed the math errors, such a stupid mistake on my part. This works, and works well. Maybe we should push to get it in ffmpeg now.
altivec_lum.3.diff (13.1 kB) - added by gbooker on 03/24/07 11:29:23.
Replaced all functions which have more than one result register with #defines that pass the values back in registers. This eliminates using temp memory for this operation. Also, the 8x16 transpose is changed to a 6x16 (all that is necessary) and the 4 4x4 transposes are made into a single 4x16. On the whole, these improvements really increase the speed. 1080p is quite playable on my G5 now.
ffmpeg-mbcabac-loops.patch (1.5 kB) - added by astrange on 03/24/07 18:15:30.
Small change in cabac, small but positive speed gain.
altivec_lum.4.diff (11.9 kB) - added by gbooker on 03/26/07 15:40:42.
Some cleanup of function names and the like. Removed unused functions. Should be no functionality change.
idct_add_altivec.diff (12.4 kB) - added by dconrad on 04/27/07 01:01:30.
Altivec version of ff_h264_idct_add_c by Mauricio Alvarez. ff_h264_idct_add_altivec didn't work correctly with the first transpose, probably something similar needs to be done to the faster ff_h264_idct_add_altivec_mat to get it to work. With this, Apple's 720p trailers play without framedrops on my G4 again.
ffmpeg-simpler-cabac-2.diff (3.1 kB) - added by astrange on 04/27/07 01:21:27.
Slightly different version of the older patch, should improve x86 and ppc code as long as it's compiled with -fweb (which we do, but ffmpeg doesn't)
ffmpeg-simpler-cabacresid.diff (3.5 kB) - added by astrange on 05/04/07 00:16:10.
Cleanup decode_cabac_residual (on x86, at least). Needs more work, the int[64] array looks like it could be a char[64] array but I didn't want to rewrite any asm.
ffmpeg-64bit-copies.diff (0.7 kB) - added by astrange on 05/06/07 03:24:18.
Awfully ugly hack: make 64-bit memory copies use 'double' (one SSE load) instead of 'long long' (two int loads). 0.841 sec saved on a 30 sec clip, 4 KB saved on the mplayer binary.
ffmpeg-64bit-copies-2.diff (2.0 kB) - added by astrange on 05/06/07 17:22:45.
MMXify fill_rectangle. For some reason, gcc doesn't generate MMX code half the time with the intrinsics? Still faster anyway.
altivec_lum.5.diff (18.7 kB) - added by gpoirier on 05/07/07 16:21:34.
No functional changes from v4, just removal of tabs and trailing spaces (hopefully, all of them), and use of the DECLARE_INLINE_16 macro instead of attribute
ffmpeg-cabacres-types.diff (1.4 kB) - added by astrange on 05/07/07 17:12:25.
Make some variable types in decode_cabac_residual smaller. Breaks x86.
altivec_lum.6.diff (18.7 kB) - added by gbooker on 05/11/07 17:43:07.
Took gpoirier's changes, and added a few more. The deblock_p0_q0 has been reworked (I never trusted the math in the MMX version) into what makes more sense to me and fixed a glaring mistake with the if statement on the four bytes of the tc array. The * in there should have been an &.

Change History

02/24/07 20:23:43 changed by astrange

  • description changed.

02/24/07 20:30:47 changed by astrange

  • type changed from task to enhancement.

02/26/07 08:37:58 changed by astrange

  • description changed.

03/07/07 13:13:23 changed by astrange

Fixing branch prediction helps a little, but not all that much.

PPC needs SIMD versions of h264_h/v_loop_filter_luma_c badly. Also, the C CABAC decoder is somewhat slower than the x86 one.

03/13/07 22:46:17 changed by astrange

  • attachment ffmpeg-cbpluma-speed.diff added.

This patch SHOULD be an optimization, but it causes gcc to produce worse code elsewhere.

03/13/07 22:47:37 changed by astrange

  • attachment ffmpeg-simpler-cabac.diff added.

Same here; get_cabac_noinline() spills registers less, but decode_cabac_residual spills more (with assembler disabled on x86). I think this is still a win on PPC.

03/15/07 04:01:46 changed by astrange

"but decode_cabac_residual spills more (with assembler disabled on x86)"

Successfully worked around, but I have to forcibly disable inlining on one function. Hopefully better compilers won't mind too much.

03/18/07 19:43:52 changed by gbooker

  • attachment altivec_lum.diff added.

First attempt at AltiVec h264_h/v_loop_filter_luma_ routines. There is still a lot of commented-out code and other debug information, but it does work. I think that there are a few mathematical errors in there, because I swear the video looks different. With this, we are very near the speed of Apple's decoder.

03/19/07 17:39:06 changed by gbooker

  • attachment altivec_lum.2.diff added.

Fixed the math errors, such a stupid mistake on my part. This works, and works well. Maybe we should push to get it in ffmpeg now.

03/24/07 11:29:23 changed by gbooker

  • attachment altivec_lum.3.diff added.

Replaced all functions which have more than one result register with #defines that pass the values back in registers. This eliminates using temp memory for this operation. Also, the 8x16 transpose is changed to a 6x16 (all that is necessary) and the 4 4x4 transposes are made into a single 4x16. On the whole, these improvements really increase the speed. 1080p is quite playable on my G5 now.

03/24/07 18:15:30 changed by astrange

  • attachment ffmpeg-mbcabac-loops.patch added.

Small change in cabac, small but positive speed gain.

03/24/07 18:17:35 changed by astrange

I looked into shrinking fill_caches/decode_mb_cabac through removing large untaken if statements but couldn't get anywhere. fill_caches is in several large chunks that could be split up, though.

The simpler cabac + altivec loop filter should be a really big help on PPC.

03/26/07 15:40:42 changed by gbooker

  • attachment altivec_lum.4.diff added.

Some cleanup of function names and the like. Removed unused functions. Should be no functionality change.

04/27/07 00:59:23 changed by dconrad

There were several functions that were Altiveced but never applied from this thread: http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2006-February/007211.html The most useful from my profiling are ff_h264_idct_add and put_h264_chroma_mc4. ff_h264_idct_add_altivec_mat uses an optimization that could probably be utilized in the MMX version as well, but it currently doesn't work on Altivec.

04/27/07 01:01:30 changed by dconrad

  • attachment idct_add_altivec.diff added.

Altivec version of ff_h264_idct_add_c by Mauricio Alvarez. ff_h264_idct_add_altivec didn't work correctly with the first transpose, probably something similar needs to be done to the faster ff_h264_idct_add_altivec_mat to get it to work. With this, Apple's 720p trailers play without framedrops on my G4 again.

04/27/07 01:21:27 changed by astrange

  • attachment ffmpeg-simpler-cabac-2.diff added.

Slightly different version of the older patch, should improve x86 and ppc code as long as it's compiled with -fweb (which we do, but ffmpeg doesn't)

05/04/07 00:16:10 changed by astrange

  • attachment ffmpeg-simpler-cabacresid.diff added.

Cleanup decode_cabac_residual (on x86, at least). Needs more work, the int[64] array looks like it could be a char[64] array but I didn't want to rewrite any asm.

05/04/07 00:26:46 changed by astrange

Every call to get_cabac() -> refill2() involves an unaligned 16-bit load :(
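For comparison, a 16-bit big-endian fetch can be built from two byte loads, which never needs an unaligned access. This is only a sketch of that alternative; read_be16_bytewise is a hypothetical helper, not what refill2() does, and whether it is actually faster than the unaligned load would have to be measured per platform.

```c
#include <stdint.h>

/* Hypothetical helper: fetch the next two bitstream bytes as one
 * big-endian 16-bit value using byte loads only, so the address
 * does not need to be 2-byte aligned. */
static inline unsigned read_be16_bytewise(const uint8_t *p)
{
    return ((unsigned)p[0] << 8) | p[1];
}
```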

05/06/07 03:24:18 changed by astrange

  • attachment ffmpeg-64bit-copies.diff added.

Awfully ugly hack: make 64-bit memory copies use 'double' (one SSE load) instead of 'long long' (two int loads). 0.841 sec saved on a 30 sec clip, 4 KB saved on the mplayer binary.
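The patch itself is not reproduced here. A more portable way to ask for the same thing is a fixed-size memcpy, which compilers can usually turn into a single 64-bit move when one is available. This is a swapped-in technique rather than the 'double' hack described above; copy64 is a hypothetical name, and the resulting instruction sequence depends on the compiler and flags.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy exactly 8 bytes. Going through a local
 * uint64_t with constant-size memcpy calls avoids aliasing problems
 * and leaves the choice of load/store instruction to the compiler. */
static inline void copy64(void *dst, const void *src)
{
    uint64_t tmp;

    memcpy(&tmp, src, sizeof tmp);
    memcpy(dst, &tmp, sizeof tmp);
}
```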

05/06/07 17:22:45 changed by astrange

  • attachment ffmpeg-64bit-copies-2.diff added.

MMXify fill_rectangle. For some reason, gcc doesn't generate MMX code half the time with the intrinsics? Still faster anyway.

05/07/07 16:21:34 changed by gpoirier

  • attachment altivec_lum.5.diff added.

No functional changes from v4, just removal of tabs and trailing spaces (hopefully, all of them), and use of the DECLARE_INLINE_16 macro instead of attribute ...

05/07/07 17:12:25 changed by astrange

  • attachment ffmpeg-cabacres-types.diff added.

Make some variable types in decode_cabac_residual smaller. Breaks x86.

05/11/07 17:43:07 changed by gbooker

  • attachment altivec_lum.6.diff added.

Took gpoirier's changes, and added a few more. The deblock_p0_q0 has been reworked (I never trusted the math in the MMX version) into what makes more sense to me and fixed a glaring mistake with the if statement on the four bytes of the tc array. The * in there should have been an &.
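To see why * versus & matters, here is a small sketch. It assumes the check in question is an early-out that tests whether all four signed tc bytes are negative; the patch is not shown here, so the function names and the exact form of the test are guesses.

```c
#include <stdint.h>

/* Bitwise AND keeps the sign bit set only if every operand has it
 * set, so the result is negative exactly when all four entries are
 * negative. */
static int all_negative_and(const int8_t tc0[4])
{
    return (tc0[0] & tc0[1] & tc0[2] & tc0[3]) < 0;
}

/* Multiplication tracks the parity of the signs instead: four
 * negative entries give a positive product, and a single negative
 * entry gives a negative one. */
static int all_negative_mul(const int8_t tc0[4])
{
    return (tc0[0] * tc0[1] * tc0[2] * tc0[3]) < 0;
}
```

With tc0 = {-1, -1, -1, -1} the & version reports "all negative" and the * version does not; with tc0 = {-1, 1, 1, 1} it is the other way round.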

05/14/07 01:18:23 changed by astrange

A bunch of these are broken now. Will make them better than new after 1.0.

09/03/07 01:23:34 changed by astrange

  • milestone changed from 1.1 to Sometime after 1.0.

Submitted the good ones. The only really interesting thing left is making the really big functions (fill_caches()) smaller so they fit in cache better. It's not worth a 1.1 milestone, anyway.