Ticket #113 (new enhancement)

Opened 3 years ago

Last modified 3 months ago

Optimization of ffmpeg h264 decoder

Reported by: astrange Owned by: astrange
Priority: normal Milestone: Sometime after 1.0
Component: ffmpeg Version:
Severity: normal Keywords:
Cc:

Description (last modified by astrange) (diff)

The H.264 decoder is impressively fast, but not fast enough.

Possible improvements, based on profiling I did on x86:

Cache misses happen too often:

* rewrite very large functions (fill_caches, decode_mb_cabac) to be smaller; interlacing/other oddities can be slightly penalized if necessary as in hl_decode_mb.

* improve cache hints; hl_motion and hl_decode_mb use some, but they use the same ones which might be wrong in both cases. x86 has no hints for store, but maybe non-temporal could be used.

* the decoding context is over 200KB!

* lots of really complex struct accesses, ex:

"h->non_zero_count_cache[3+8*1 + 2*8*i]= h->non_zero_count[left_xy[i]][left_block[0+2*i]];"

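To illustrate the cache-hint bullet above, here is a minimal sketch of a software prefetch using the GCC/Clang builtin `__builtin_prefetch`. The function `sum_block_prefetch` and its parameters are hypothetical and are not taken from hl_motion or hl_decode_mb; the right prefetch distance and locality would have to come from profiling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from FFmpeg: sums a block of pixels row by row,
 * hinting the next row into cache before the current one is processed. */
static uint32_t sum_block_prefetch(const uint8_t *src, size_t stride,
                                   int rows, int width)
{
    uint32_t sum = 0;
    for (int y = 0; y < rows; y++) {
        const uint8_t *row = src + (size_t)y * stride;
        if (y + 1 < rows) {
            /* rw = 0: prefetch for reading; locality = 3: keep the line
             * in all cache levels. Both values are guesses to be tuned. */
            __builtin_prefetch(row + stride, 0, 3);
        }
        for (int x = 0; x < width; x++)
            sum += row[x];
    }
    return sum;
}
```

Passing rw = 1 instead requests a prefetch for writing; which instruction that becomes, if any, depends on the target and compiler flags.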
Branch misprediction:

* happens often in the CABAC/arithmetic decoder, which is naturally unpredictable. Rewrite more of it to be branchless?
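As a rough idea of what "branchless" means here, the sketch below replaces an if/else with a mask. It is a toy decision step, not FFmpeg's get_cabac(); the name `toy_decode_bit` and its `threshold` parameter are made up for illustration.

```c
#include <stdint.h>

/* Toy decision step, NOT FFmpeg's get_cabac(): returns 1 and subtracts
 * `threshold` from *low when *low >= threshold; otherwise returns 0 and
 * leaves *low unchanged. No conditional jump is written in the source. */
static int toy_decode_bit(uint32_t *low, uint32_t threshold)
{
    /* 0xFFFFFFFF when the condition holds, 0 otherwise. */
    uint32_t mask = 0u - (uint32_t)(*low >= threshold);

    *low -= threshold & mask;   /* subtract only when mask is all ones */
    return (int)(mask & 1u);    /* decoded "bit" */
}
```

Whether the compiler actually emits branch-free code for this (for example with setcc or cmov on x86) should be checked in the generated assembly.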

PPC is missing some assembly optimizations that x86 has (cabac, altivec), but I have no idea if they're needed. I don't have a recent profile from one.

Attachments

ffmpeg-cbpluma-speed.diff Download (1.9 KB) - added by astrange 3 years ago.
This patch SHOULD be an optimization, but it causes gcc to produce worse code elsewhere.
ffmpeg-simpler-cabac.diff Download (3.0 KB) - added by astrange 3 years ago.
Same here; get_cabac_noinline() spills registers less, but decode_cabac_residual spills more (with assembler disabled on x86). I think this is still a win on PPC.
altivec_lum.diff Download (16.6 KB) - added by gbooker 3 years ago.
First attempt at AltiVec h264_h/v_loop_filter_luma_ routines. There is still a lot of commented-out code and other debug information, but it does work. I think there are a few mathematical errors in there, because I swear the video looks different. With this, we are very near the speed of Apple's decoder.
altivec_lum.2.diff Download (16.9 KB) - added by gbooker 3 years ago.
Fixed the math errors, such a stupid mistake on my part. This works, and works well. Maybe we should push to get it in ffmpeg now.
altivec_lum.3.diff Download (13.1 KB) - added by gbooker 3 years ago.
Replaced all functions that have more than one result register with #defines, so the values are passed back in registers. This eliminates the use of temporary memory for this operation. Also, the 8x16 transpose is changed to a 6x16 (all that is necessary) and the four 4x4 transposes are merged into a single 4x16. On the whole, these improvements really increase the speed. 1080p is quite playable on my G5 now.
ffmpeg-mbcabac-loops.patch Download (1.5 KB) - added by astrange 3 years ago.
Small change in cabac, small but positive speed gain.
altivec_lum.4.diff Download (11.9 KB) - added by gbooker 3 years ago.
Some cleanup of function names and the like. Removed unused functions. Should be no functionality change.
idct_add_altivec.diff Download (12.4 KB) - added by dconrad 3 years ago.
Altivec version of ff_h264_idct_add_c by Mauricio Alvarez. ff_h264_idct_add_altivec didn't work correctly with the first transpose; something similar probably needs to be done to the faster ff_h264_idct_add_altivec_mat to get it to work. With this, Apple's 720p trailers play without framedrops on my G4 again.
ffmpeg-simpler-cabac-2.diff Download (3.1 KB) - added by astrange 3 years ago.
Slightly different version of the older patch, should improve x86 and ppc code as long as it's compiled with -fweb (which we do, but ffmpeg doesn't)
ffmpeg-simpler-cabacresid.diff Download (3.5 KB) - added by astrange 3 years ago.
Clean up decode_cabac_residual (on x86, at least). Needs more work: the int[64] array looks like it could be a char[64] array, but I didn't want to rewrite any asm.
ffmpeg-64bit-copies.diff Download (0.7 KB) - added by astrange 3 years ago.
Awfully ugly hack: make 64-bit memory copies use 'double' (one SSE load) instead of 'long long' (two int loads). Saves 0.841 s on a 30 s clip and 4 KB in the mplayer binary.
ffmpeg-64bit-copies-2.diff Download (2.0 KB) - added by astrange 3 years ago.
MMXify fill_rectangle. For some reason, gcc doesn't generate MMX code half the time with the intrinsics? Still faster anyway.
altivec_lum.5.diff Download (18.7 KB) - added by gpoirier 3 years ago.
No functional changes from v4, just removal of tabs and trailing spaces (hopefully all of them), and use of the DECLARE_INLINE_16 macro instead of the attribute
ffmpeg-cabacres-types.diff Download (1.4 KB) - added by astrange 3 years ago.
Make some variable types in decode_cabac_residual smaller. Breaks x86.
altivec_lum.6.diff Download (18.7 KB) - added by gbooker 3 years ago.
Took gpoirier's changes and added a few more. deblock_p0_q0 has been reworked (I never trusted the math in the MMX version) into what makes more sense to me, and I fixed a glaring mistake in the if statement on the four bytes of the tc array: the * in there should have been an &.

Change History

Changed 3 years ago by astrange

  • description modified (diff)

Changed 3 years ago by astrange

  • type changed from task to enhancement

Changed 3 years ago by astrange

  • description modified (diff)

Changed 3 years ago by astrange

Fixing branch prediction helps a little, but not all that much.

PPC needs SIMD versions of h264_h/v_loop_filter_luma_c badly. Also, the C CABAC decoder is somewhat slower than the x86 one.

Changed 3 years ago by astrange

This patch SHOULD be an optimization, but it causes gcc to produce worse code elsewhere.

Changed 3 years ago by astrange

Same here; get_cabac_noinline() spills registers less, but decode_cabac_residual spills more (with assembler disabled on x86). I think this is still a win on PPC.

Changed 3 years ago by astrange

> but decode_cabac_residual spills more (with assembler disabled on x86)

Successfully worked around, but I have to forcibly disable inlining on one function. Hopefully better compilers won't mind too much.

Changed 3 years ago by gbooker

First attempt at AltiVec h264_h/v_loop_filter_luma_ routines. There is still a lot of commented-out code and other debug information, but it does work. I think there are a few mathematical errors in there, because I swear the video looks different. With this, we are very near the speed of Apple's decoder.

Changed 3 years ago by gbooker

Fixed the math errors, such a stupid mistake on my part. This works, and works well. Maybe we should push to get it in ffmpeg now.

Changed 3 years ago by gbooker

Replaced all functions that have more than one result register with #defines, so the values are passed back in registers. This eliminates the use of temporary memory for this operation. Also, the 8x16 transpose is changed to a 6x16 (all that is necessary) and the four 4x4 transposes are merged into a single 4x16. On the whole, these improvements really increase the speed. 1080p is quite playable on my G5 now.

Changed 3 years ago by astrange

Small change in cabac, small but positive speed gain.

Changed 3 years ago by astrange

I looked into shrinking fill_caches/decode_mb_cabac through removing large untaken if statements but couldn't get anywhere. fill_caches is in several large chunks that could be split up, though.

The simpler cabac + altivec loop filter should be a really big help on PPC.

Changed 3 years ago by gbooker

Some cleanup of function names and the like. Removed unused functions. Should be no functionality change.

Changed 3 years ago by dconrad

There were several functions that were Altiveced but never applied from this thread:  http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2006-February/007211.html The most useful from my profiling are ff_h264_idct_add and put_h264_chroma_mc4. ff_h264_idct_add_altivec_mat uses an optimization that could probably be utilized in the MMX version as well, but it currently doesn't work on Altivec.

Changed 3 years ago by dconrad

Altivec version of ff_h264_idct_add_c by Mauricio Alvarez. ff_h264_idct_add_altivec didn't work correctly with the first transpose; something similar probably needs to be done to the faster ff_h264_idct_add_altivec_mat to get it to work. With this, Apple's 720p trailers play without framedrops on my G4 again.

Changed 3 years ago by astrange

Slightly different version of the older patch, should improve x86 and ppc code as long as it's compiled with -fweb (which we do, but ffmpeg doesn't)

Changed 3 years ago by astrange

Clean up decode_cabac_residual (on x86, at least). Needs more work: the int[64] array looks like it could be a char[64] array, but I didn't want to rewrite any asm.

Changed 3 years ago by astrange

Every call to get_cabac() -> refill2() involves an unaligned 16-bit load :(
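For reference, two portable ways to express a 16-bit load from an address that may not be 2-byte aligned are sketched below. The helper names are hypothetical, and the big-endian byte order in `load_be16` is an assumption about how the bitstream bytes are consumed; neither function is copied from refill2().

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: assemble a big-endian 16-bit value from two byte
 * loads. Works at any alignment on any host byte order. */
static uint16_t load_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: native-endian 16-bit load via memcpy, which avoids
 * the undefined behaviour of dereferencing a misaligned uint16_t pointer.
 * Compilers commonly turn this into a single load where that is legal. */
static uint16_t load_u16_native(const uint8_t *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Neither variant removes the cost of a load that straddles a cache line; they only make the access well defined in C.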

Changed 3 years ago by astrange

Awfully ugly hack: make 64-bit memory copies use 'double' (one SSE load) instead of 'long long' (two int loads). Saves 0.841 s on a 30 s clip and 4 KB in the mplayer binary.
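A less hacky way to ask for a single 64-bit move is sketched below. Note that this swaps in a fixed-size memcpy through a uint64_t temporary rather than the 'double' trick used in the patch; `copy64` is a hypothetical name, and whether the compiler emits one 64-bit load/store pair on 32-bit x86 depends on the compiler version and flags.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy exactly 8 bytes from src to dst. The fixed
 * size lets the compiler pick the widest move it has available, and the
 * memcpy calls keep the code free of aliasing and alignment problems. */
static void copy64(void *dst, const void *src)
{
    uint64_t tmp;
    memcpy(&tmp, src, sizeof tmp);
    memcpy(dst, &tmp, sizeof tmp);
}
```

Checking the generated assembly is the only way to know whether this matches the gain reported for the 'double' version.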

Changed 3 years ago by astrange

MMXify fill_rectangle. For some reason, gcc doesn't generate MMX code half the time with the intrinsics? Still faster anyway.

Changed 3 years ago by gpoirier

No functional changes from v4, just removal of tabs and trailing spaces (hopefully all of them), and use of the DECLARE_INLINE_16 macro instead of the attribute ...

Changed 3 years ago by astrange

Make some variable types in decode_cabac_residual smaller. Breaks x86.

Changed 3 years ago by gbooker

Took gpoirier's changes and added a few more. deblock_p0_q0 has been reworked (I never trusted the math in the MMX version) into what makes more sense to me, and I fixed a glaring mistake in the if statement on the four bytes of the tc array: the * in there should have been an &.
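To show why * versus & matters in a test like this, here is a small sketch. It assumes the condition in question is an early-out that should fire only when all four tc values are negative; that assumption, and both function names, are illustrative and not taken from the patch.

```c
#include <stdint.h>

/* Bitwise AND keeps the sign bit only if every operand has it set, so the
 * result is negative exactly when all four tc values are negative. */
static int all_tc_negative(const int8_t tc[4])
{
    return (tc[0] & tc[1] & tc[2] & tc[3]) < 0;
}

/* Multiplication is negative whenever an odd number of operands are
 * negative, which is a different condition. */
static int all_tc_negative_buggy(const int8_t tc[4])
{
    return (tc[0] * tc[1] * tc[2] * tc[3]) < 0;
}
```

With tc = {-1, -1, -1, -1} the product is positive, so the multiplied form misses the early-out; with tc = {-1, 1, 2, 3} it fires even though three of the four values are non-negative.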

Changed 3 years ago by astrange

Bunch of these are broken now. Will make them better than new after 1.0.

Changed 3 years ago by astrange

  • milestone changed from 1.1 to Sometime after 1.0

Submitted the good ones. The only really interesting thing left is making the really big functions (fill_caches()) smaller so they fit in cache better. It's not worth a 1.1 milestone, anyway.
