Ticket #113 (new enhancement)
Opened 2 years ago
Last modified 2 months ago
Optimization of ffmpeg h264 decoder
| Reported by: | astrange | Assigned to: | astrange |
|---|---|---|---|
| Priority: | normal | Milestone: | Sometime after 1.0 |
| Component: | ffmpeg | Version: | |
| Severity: | normal | Keywords: | |
| Cc: | | | |
Attachments
- ffmpeg-cbpluma-speed.diff (1.9 kB) - added by astrange on 03/13/07 22:46:17.
- This patch SHOULD be an optimization, but it causes gcc to produce worse code elsewhere.
- ffmpeg-simpler-cabac.diff (3.0 kB) - added by astrange on 03/13/07 22:47:37.
- Same here; get_cabac_noinline() spills registers less, but decode_cabac_residual spills more (with assembler disabled on x86). I think this is still a win on PPC.
- altivec_lum.diff (16.6 kB) - added by gbooker on 03/18/07 19:43:52.
- First attempt at AltiVec h264_h/v_loop_filter_luma_ routines. There is still a lot of commented-out code and other debug information, but it does work. I think there are a few mathematical errors in there, because I swear the video looks different. With this, we are very near the speed of Apple's decoder.
- altivec_lum.2.diff (16.9 kB) - added by gbooker on 03/19/07 17:39:06.
- Fixed the math errors; such a stupid mistake on my part. This works, and works well. Maybe we should push to get it into ffmpeg now.
- altivec_lum.3.diff (13.1 kB) - added by gbooker on 03/24/07 11:29:23.
- Converted all functions that have more than one result register into #defines that pass the values back in registers. This eliminates using temp memory for this operation. Also, the 8x16 transpose is changed to a 6x16 (all that is necessary) and the four 4x4 transposes are made into a single 4x16. On the whole, these improvements really increase the speed. 1080p is quite playable on my G5 now.
- ffmpeg-mbcabac-loops.patch (1.5 kB) - added by astrange on 03/24/07 18:15:30.
- Small change in cabac, small but positive speed gain.
- altivec_lum.4.diff (11.9 kB) - added by gbooker on 03/26/07 15:40:42.
- Some cleanup of function names and the like. Removed unused functions. Should be no functionality change.
- idct_add_altivec.diff (12.4 kB) - added by dconrad on 04/27/07 01:01:30.
- AltiVec version of ff_h264_idct_add_c by Mauricio Alvarez. ff_h264_idct_add_altivec didn't work correctly with the first transpose; probably something similar needs to be done to the faster ff_h264_idct_add_altivec_mat to get it to work. With this, Apple's 720p trailers play without frame drops on my G4 again.
- ffmpeg-simpler-cabac-2.diff (3.1 kB) - added by astrange on 04/27/07 01:21:27.
- Slightly different version of the older patch; should improve x86 and PPC code as long as it's compiled with -fweb (which we do, but ffmpeg doesn't).
- ffmpeg-simpler-cabacresid.diff (3.5 kB) - added by astrange on 05/04/07 00:16:10.
- Clean up decode_cabac_residual (on x86, at least). Needs more work: the int[64] array looks like it could be a char[64] array, but I didn't want to rewrite any asm.
- ffmpeg-64bit-copies.diff (0.7 kB) - added by astrange on 05/06/07 03:24:18.
- Awfully ugly hack: make 64-bit memory copies use 'double' (one SSE load) instead of 'long long' (two int loads). 0.841 sec saved on a 30 sec clip, 4 KB saved on the mplayer binary.
- ffmpeg-64bit-copies-2.diff (2.0 kB) - added by astrange on 05/06/07 17:22:45.
- MMXify fill_rectangle. For some reason, gcc doesn't generate MMX code half the time with the intrinsics? Still faster anyway.
- altivec_lum.5.diff (18.7 kB) - added by gpoirier on 05/07/07 16:21:34.
- No functional changes from v4, just removal of tabs and trailing spaces (hopefully all of them) and use of the DECLARE_INLINE_16 macro instead of attribute …
- ffmpeg-cabacres-types.diff (1.4 kB) - added by astrange on 05/07/07 17:12:25.
- Make some variable types in decode_cabac_residual smaller. Breaks x86.
- altivec_lum.6.diff (18.7 kB) - added by gbooker on 05/11/07 17:43:07.
- Took gpoirier's changes and added a few more. deblock_p0_q0 has been reworked (I never trusted the math in the MMX version) into what makes more sense to me, and I fixed a glaring mistake in the if statement on the four bytes of the tc array: the * in there should have been an &.
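
To illustrate the `*` versus `&` mix-up described for altivec_lum.6.diff, here is a minimal sketch in C. It assumes the check follows the usual H.264 luma deblocking convention, where a negative tc value marks a 4-pixel segment as not filtered, so the whole edge can be skipped only when all four values are negative. The function names are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical helper, not from the patch: returns 1 when at least one of
 * the four tc values is non-negative, i.e. the edge still needs filtering.
 * The bitwise AND of the four values has its sign bit set only when every
 * one of them is negative. */
static int edge_needs_filter_and(const int8_t tc[4])
{
    return (tc[0] & tc[1] & tc[2] & tc[3]) >= 0;
}

/* The mistaken variant: the sign of a product depends on how many factors
 * are negative (odd or even count), not on whether all of them are. */
static int edge_needs_filter_mul(const int8_t tc[4])
{
    return (tc[0] * tc[1] * tc[2] * tc[3]) >= 0;
}
```

With tc = {-1, 1, 1, 1} the AND form reports that filtering is needed while the product form skips the edge; with all four values at -1 the product form asks for filtering that should have been skipped.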

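The 'double' versus 'long long' trick from ffmpeg-64bit-copies.diff can be sketched roughly as follows. This is a guess at the general shape of the hack, not the patch itself: the names copy64_via_double and alias_double are made up for illustration, the may_alias attribute is a gcc/clang extension, and both pointers are assumed to be 8-byte aligned.

```c
#include <stdint.h>

/* gcc/clang extension: accesses through this type may alias other types,
 * so the casts below do not break strict-aliasing rules. */
typedef double __attribute__((may_alias)) alias_double;

/* Hypothetical helper: copy 8 bytes through a double lvalue. On 32-bit x86
 * with SSE2 enabled the compiler can use a single 8-byte load and store
 * here, where a 64-bit integer copy is typically split into two 32-bit
 * moves. Assumes dst and src are 8-byte aligned. */
static inline void copy64_via_double(void *dst, const void *src)
{
    *(alias_double *)dst = *(const alias_double *)src;
}
```

Part of what makes this ugly: if the copy ends up going through the x87 FPU instead of SSE, some bit patterns (signaling NaNs) may not survive the round trip unchanged, so it is only safe where the generated code is known to use SSE moves.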