1
mirror of https://git.videolan.org/git/ffmpeg.git synced 2024-09-16 03:44:15 +02:00
Commit Graph

2474 Commits

Author SHA1 Message Date
Rostislav Pehlivanov
50945482a7 h264_idct: enable unmacro on newer NASM versions
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2018-02-12 10:50:37 +00:00
Martin Vignali
8f9c38b196 avcodec/utvideoenc : add SIMD (avx) for sub_left_prediction
asm code by Henrik Gramner
2018-01-28 20:23:11 +01:00
James Almer
6e80079a28 avcodec: increase AV_INPUT_BUFFER_PADDING_SIZE to 64
AVX-512 support has been introduced, and even if no functions currently
use zmm registers (able to load as much as 64 bytes of consecutive data
per instruction), they will be added eventually.

Reviewed-by: Rostislav Pehlivanov <atomnuker@gmail.com>
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2018-01-11 23:46:31 -03:00
James Almer
438f884fc4 x86/lossless_videodsp: rename ff_add_left_pred_int16_sse4 to ff_add_left_pred_int16_unaligned_ssse3
SSSE3_FAST is the proper check for it.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-12-10 00:51:01 -03:00
James Almer
a4fc63c0f9 x86/lossless_videodsp: don't overread the dst buffer in ff_add_left_pred_unaligned_avx2
Fixes valgrind

Signed-off-by: James Almer <jamrial@gmail.com>
2017-12-10 00:38:05 -03:00
Martin Vignali
630967ef63 avcodec/utvideodec : add SIMD (SSSE3 and AVX2) for gradient_pred 2017-12-09 15:19:03 +01:00
Martin Vignali
4353c35067 avcodec/x86/lossless_videodsp : add avx2 version for add_left_pred 2017-12-09 15:16:03 +01:00
Martin Vignali
cfbcea1cca avcodec/x86/lossless_videodsp.asm : make macro for add_left_pred_unaligned in order to add avx2 version 2017-12-09 15:15:59 +01:00
Martin Vignali
be6d1f9632 avcodec/x86/bswapdsp : use macro for 128 bits constants loading in xmm or ymm 2017-12-02 18:25:25 +01:00
Mikulas Patocka
fbdd78fa3e avcodec/fft: fix INTERL macro on 3dnow
The commit b7c16a3f2c ("x86: fft: Port to
cpuflags") breaks the opus decoder in ffmpeg when compiling for 3dnow. The
output is audible, but there's a lot of noise.

The reason for the breakage is that the commit unintentionally changed the
INTERL macro so that it is empty when compiling for 3dnow. This patch
fixes it.

Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2017-11-25 13:11:45 -03:00
Martin Vignali
515555af6c avcodec/x86/exrdsp : use ymm constant for pb_80
speed seems to be similar, but simplify code
2017-11-23 20:00:13 +01:00
James Almer
beb63baa69 x86/utvideodsp: reuse shared constants
Remove the broadcast instructions as well now that they are wide
enough.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-11-21 10:57:14 -03:00
James Almer
ebf352116b x86/constants: make pb_80 32 byte wide
Signed-off-by: James Almer <jamrial@gmail.com>
2017-11-21 10:57:03 -03:00
Martin Vignali
ba98f8463f avcodec/huffyuvdspenc : add diff_int16 AVX2 func 2017-11-21 09:42:08 +01:00
Martin Vignali
d189a426fa avcodec/huffyuvdspenc : reorganize diff_int16 2017-11-21 09:42:03 +01:00
Martin Vignali
e641c94190 avcodec/huffyuvdsp : add add_int16 AVX2 func 2017-11-21 09:41:58 +01:00
Martin Vignali
6955e8842e avcodec/huffyuvdsp : reorganize add_int16 asm 2017-11-21 09:41:52 +01:00
Martin Vignali
7f9b67bcb6 avcodec/huffyuvdsp(enc) : move duplicate macro to a template file 2017-11-21 09:41:46 +01:00
Martin Vignali
caf51a573d avcodec/x86/utvideodsp.asm : cosmetic
better func separator
and add comment for the restore rgb planes10 declaration
2017-11-21 09:00:47 +01:00
Martin Vignali
b5ebe38443 avcodec/utvideodsp : add avx2 version for the dsp 2017-11-21 09:00:42 +01:00
Martin Vignali
48b7c45b0c avcodec/x86/utvideodsp : make macro for func 2017-11-21 09:00:38 +01:00
James Almer
aea0f06db7 x86/jpeg2000dsp: add ff_ict_float_{fma3,fma4}
jpeg2000_ict_float_c: 2296.0
jpeg2000_ict_float_sse: 628.0
jpeg2000_ict_float_avx: 317.0
jpeg2000_ict_float_fma3: 262.0

Signed-off-by: James Almer <jamrial@gmail.com>
2017-11-20 18:33:58 -03:00
Michael Niedermayer
58cf31cee7 avcodec/x86/mpegvideodsp: Fix signedness bug in need_emu
Fixes: out of array read
Fixes: 3516/attachment-311488.dat

Found-by: Insu Yun, Georgia Tech.
Tested-by: wuninsu@gmail.com
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-11-14 04:54:31 +01:00
Thomas Köppe
43171a2a73 Fix missing used attribute for inline assembly variables
Variables used in inline assembly need to be marked with attribute((used)).
Static constants already were, via the define of DECLARE_ASM_CONST.
But DECLARE_ALIGNED does not add this attribute, and some of the variables
defined with it are const only used in inline assembly, and therefore
appeared dead. This change adds a macro DECLARE_ASM_ALIGNED that marks
variables as used.

This change makes FFMPEG work with Clang's ThinLTO.

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-11-13 03:58:34 +01:00
Martin Vignali
0380b72d35 libavcodec/lossless_video_dsp : cosmetic add better separator for each function, in order to make reading of the asm file easier 2017-11-07 00:56:54 +01:00
Martin Vignali
da62128ea1 libavcodec/lossless_videodsp : add add_bytes avx2 version 2017-11-07 00:56:02 +01:00
James Almer
783535a4cd x86/bswapdsp: add missing preprocessor wrappers for AVX2 functions
Fixes build with old nasm/yasm.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-10-29 22:21:51 -03:00
Martin Vignali
e9930883a2 libavcodec/bswapdsp : add AVX2 func for bswap_buf (swap uint32_t) 2017-10-29 15:21:35 +01:00
James Almer
b7c16a3f2c Merge commit '681a86aba6cb09b98ad716d986182060c7795d20'
* commit '681a86aba6cb09b98ad716d986182060c7795d20':
  x86: fft: Port to cpuflags

Merged-by: James Almer <jamrial@gmail.com>
2017-10-21 12:45:49 -03:00
James Almer
11f5ffd330 Merge commit 'e9bb77fb1012cba1951a82136df7071f71bce8fb'
* commit 'e9bb77fb1012cba1951a82136df7071f71bce8fb':
  x86: h264: Simplify DEQUANT macro with cpuflags

Merged-by: James Almer <jamrial@gmail.com>
2017-10-21 12:39:41 -03:00
James Almer
53eea3a569 Merge commit '307eb1a8ee363db1fcf869e427a8deb6d9538881'
* commit '307eb1a8ee363db1fcf869e427a8deb6d9538881':
  x86: vp8dsp: port FILTER_BILINEAR macro to cpuflags

Merged-by: James Almer <jamrial@gmail.com>
2017-10-21 12:28:39 -03:00
James Almer
2904db9045 Merge commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2'
* commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2':
  x86util: Port all macros to cpuflags

See d5f8a642f6

Merged-by: James Almer <jamrial@gmail.com>
2017-10-21 12:15:57 -03:00
James Almer
b78bb51a7c Merge commit '6eef263aca281fb582e1fa3d841ac20ef747a252'
* commit '6eef263aca281fb582e1fa3d841ac20ef747a252':
  x86: Merge align directives into SECTION_RODATA declarations where possible

Merged-by: James Almer <jamrial@gmail.com>
2017-10-12 13:48:35 -03:00
James Almer
18279738f9 x86/blockdsp: use three operand form for an instruction
Fixes assembling with old yasm.
2017-10-04 23:51:44 -03:00
Michael Niedermayer
26ea142658 avcodec/x86/lossless_videoencdsp: Fix warning: signed dword value exceeds bounds
Add () to regsize define

Suggested-by: Henrik Gramner <henrik@gramner.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-10-05 01:22:44 +02:00
Michael Niedermayer
df62b70de8 avcodec/x86/lossless_videoencdsp: Fix handling of small widths
Fixes out of array access
Fixes: crash-huf.avi

Regression since: 6b41b44149

This could also be fixed by adding checks in the C code that calls the dsp

Found-by: Zhibin Hu and 连一汉 <lianyihan@360.cn>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-10-05 01:22:44 +02:00
Martin Vignali
cbbec68847 libavcodec/blockdsp : add AVX version
Also modify the required alignment, to 32 instead of 16
for several codecs

Signed-off-by: James Almer <jamrial@gmail.com>
2017-10-03 19:47:37 -03:00
Martin Vignali
ac5908b13f libavcodec/exr : add x86 SIMD for predictor
Signed-off-by: James Almer <jamrial@gmail.com>
2017-10-01 17:35:30 -03:00
James Almer
0c005fa86f Merge commit '7abdd026df6a9a52d07d8174505b33cc89db7bf6'
* commit '7abdd026df6a9a52d07d8174505b33cc89db7bf6':
  asm: Consistently uppercase SECTION markers

Merged-by: James Almer <jamrial@gmail.com>
2017-09-26 18:48:06 -03:00
James Almer
318778de9e Merge commit 'fd9212f2edfe9b107c3c08ba2df5fd2cba5ab9e3'
* commit 'fd9212f2edfe9b107c3c08ba2df5fd2cba5ab9e3':
  Mark some arrays that never change as const.

Merged-by: James Almer <jamrial@gmail.com>
2017-09-26 16:02:40 -03:00
Henrik Gramner
18821e3ba1 x86/exrdsp: optimize ff_reorder_pixels_avx2()
Tested with "checkasm --test=exrdsp -bench"

Before:
reorder_pixels_c: 5187.8
reorder_pixels_sse2: 377.0
reorder_pixels_avx2: 331.3

After:
reorder_pixels_c: 5181.5
reorder_pixels_sse2: 377.0
reorder_pixels_avx2: 313.8

Signed-off-by: James Almer <jamrial@gmail.com>
2017-09-18 23:24:55 -03:00
James Almer
98d7ad085e avcodec/exrdsp: improve the ExrDSPContext->reorder_pixels prototype
Make dst be the first parameter and src const. It's more in line with the rest of the codebase.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-09-17 19:01:40 -03:00
Martin Vignali
9b8c1224d7 libavcodec/exr : add X86 SIMD for reorder_pixels
Signed-off-by: James Almer <jamrial@gmail.com>
2017-09-17 17:53:57 -03:00
Michael Niedermayer
bc488ec28a avcodec/me_cmp: Fix crashes on ARM due to misalignment
Adds a diff_pixels_unaligned()

Fixes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=872503

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-08-21 23:19:18 +02:00
Ivan Kalvachev
43dab86bcd opus_pvq_search: Restore the proper use of conditional define and simplify the function name suffix handling.
Using named define properly documents the code paths.
It also avoids passing additional numbered arguments through
multiple levels of macro templates.

The suffix handling is done by concatenation, like in
other asm functions and avoid having two separate
"cglobal" defines.

Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
2017-08-19 22:42:56 +01:00
Rostislav Pehlivanov
3c99523a28 opus_pvq_search: split functions into exactness and only use the exact if its faster
This splits the asm function into exact and non-exact version. The exact
version is as fast or faster on newer CPUs (which EXTERNAL_AVX_FAST describes
well) whilst the non-exact version is faster than the exact on older CPUs.

Also fixes yasm compilation which doesn't accept !cpuflags(avx) syntax.

Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2017-08-18 19:32:55 +01:00
Rostislav Pehlivanov
f386dd70ac opus_pvq_search: only use rsqrtps approximation on CPUs with avx
Makes the search produce idential results with the C version.

Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2017-08-18 17:30:41 +01:00
Rostislav Pehlivanov
8e53cd1fab ops_pvq_search: remove dead macro
There's no point in toggling it, even for debugging. Its just worse.

Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2017-08-18 17:27:41 +01:00
Ivan Kalvachev
7205513f8f SIMD opus pvq_search implementation
Explanation on the workings and methods used by the
Pyramid Vector Quantization Search function
could be found in the following Work-In-Progress mail threads:
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212146.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-June/212816.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213030.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-July/213436.html

Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
2017-08-18 17:18:32 +01:00
Rostislav Pehlivanov
70eb77b34e mdct15: add inverse transform postrotation SIMD
2.5ms frames:
Before   (c):  2638 decicycles in postrotate, 2097040 runs,    112 skips
After (sse3):  1467 decicycles in postrotate, 2097083 runs,     69 skips
After (avx2):  1244 decicycles in postrotate, 2097085 runs,     67 skips

5ms frames:
Before   (c):  4987 decicycles in postrotate, 1048371 runs,    205 skips
After (sse3):  2644 decicycles in postrotate, 1048509 runs,     67 skips
After (avx2):  2031 decicycles in postrotate, 1048523 runs,     53 skips

10ms frames:
Before   (c):  9153 decicycles in postrotate,  523575 runs,    713 skips
After (sse3):  5110 decicycles in postrotate,  523726 runs,    562 skips
After (avx2):  3738 decicycles in postrotate,  524223 runs,     65 skips

20ms frames:
Before   (c): 17857 decicycles in postrotate,  261866 runs,    278 skips
After (sse3): 10041 decicycles in postrotate,  261746 runs,    398 skips
After (avx2):  7050 decicycles in postrotate,  262116 runs,     28 skips

Improves total decoding performance for real world content by 9% with avx2.

Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2017-07-30 07:38:39 +01:00
Wan-Teh Chang
ea1ca17be2 avcodec/x86/cavsdsp: Delete #include "libavcodec/x86/idctdsp.h".
This file already has #include "idctdsp.h", which is resolved to the
idctdsp.h header in the directory where this file resides by compilers.
Two other files in this directory, libavcodec/x86/idctdsp_init.c and
libavcodec/x86/xvididct_init.c, also rely on #include "idctdsp.h"
working this way.

Signed-off-by: Wan-Teh Chang <wtc@google.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-07-21 02:08:33 +02:00
James Almer
9d5e81d3b1 Revert "x86/sbrdsp: remove unnecessary sign extend instruction in apply_noise_main"
This reverts commit 24bb7db403.

noise has to after all be sign extended, not zero extended, on tests
other than checkasm.
Fixes most aac tests broken by the now reverted commit.
2017-07-05 10:29:15 -03:00
James Almer
24bb7db403 x86/sbrdsp: remove unnecessary sign extend instruction in apply_noise_main
noise needs to be zero extended and it can be done implicitly as a side effect
in a subsequent instruction.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-07-04 23:36:17 -03:00
James Almer
bcbe9e4447 x86/sbrdsp: zero extend m_max in apply_noise_main
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2017-07-04 23:02:24 -03:00
James Almer
440285474b x86/utvideodsp: make restore_rgb_planes functions work on x86_32
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2017-07-04 22:49:45 -03:00
James Almer
ac8ad8d098 x86/sbrdsp: sign extend start and end gprs in ff_sbr_hf_gen_sse
Tested-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-30 11:46:24 -03:00
James Darnley
0c2acccd4b avcodec/x86: use new x86-64 functions for -idct simple
They now match according to FATE, barring any further bugs with untested
parts
2017-06-28 17:27:35 +02:00
James Darnley
d7246ea9f2 avcodec/x86: add an 8-bit simple IDCT function based on the x86-64 high depth functions
Includes add/put functions

Rounding contributed by Ronald S. Bultje
2017-06-28 17:27:35 +02:00
James Darnley
8b19467d07 avcodec/x86: allow future 8-bit simple idct to have "DC only hack"
Created by Ronald S. Bultje
2017-06-28 17:27:35 +02:00
Clément Bœsch
b12a36170b lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis 2017-06-28 12:22:39 +02:00
Michael Niedermayer
516c213f08 avcodec/x86/vp9dsp_init_16bpp: Fix linking to missing ff_vp9_ipred_dr_32x32_16_avx2() on 32bit
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-06-28 00:31:33 +02:00
Ilia Valiakhmetov
35a5d9715d avcodec/vp9: add 64-bit ipred_dr_32x32_16 avx2 implementation
vp9_diag_downright_32x32_12bpp_c: 429.7
vp9_diag_downright_32x32_12bpp_sse2: 158.9
vp9_diag_downright_32x32_12bpp_ssse3: 144.6
vp9_diag_downright_32x32_12bpp_avx: 141.0
vp9_diag_downright_32x32_12bpp_avx2: 73.8

Almost 50% faster than avx implementation

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2017-06-27 16:10:50 -04:00
Paul B Mahol
4ed7c2bbc3 avcodec/utvideodec: add SIMD for restore_rgb_planes
Signed-off-by: Paul B Mahol <onemda@gmail.com>
2017-06-27 09:54:10 +02:00
Matthieu Bouron
db5bf64b21 lavc/x86: clear r2 higher bits in ff_sbr_sum_square
Suggested-by: James Almer <jamrial@gmail.com>
2017-06-26 09:55:23 +02:00
James Almer
349446e36f x86/mdct15: use three operand form for some instructions
Fixes compilation with old yasm
2017-06-24 01:44:49 -03:00
Rostislav Pehlivanov
e1120b1c54 mdct15: add assembly optimizations for the 15-point FFT
c:    1802 decicycles in fft15,16774635 runs,   2581 skips
avx:   865 decicycles in fft15,16776378 runs,    838 skips

Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
2017-06-23 23:45:37 +01:00
Diego Biurrun
fd502f4f5f build: Generalize yasm/nasm-related variable names
None of them are specific to the YASM assembler.

(Cherry-picked from libav commit 39e208f4d4)

Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-21 17:00:29 -03:00
James Darnley
8221c71703 avcodec/x86: allow future 8-bit simple idct to use slightly different coefficients 2017-06-20 16:12:25 +02:00
James Darnley
d2597fb0c1 avcodec/x86: modify simple_idct10 macros to add an action paramter 2017-06-20 13:35:01 +02:00
James Darnley
8781330d80 avcodec/x86: cleanup simple_idct10
Use named arguments for the functions so we can remove a define.  The
stride/linesize argument is now ptrdiff_t type so we no longer need to
sign extend the register.
2017-06-20 13:34:38 +02:00
James Darnley
e3db94302c avcodec/x86/mpegenc: support transpose permuation type 2017-06-20 12:12:13 +02:00
James Darnley
fa30a0a548 avcodec/x86/mpegenc: check IDCT permutation type is a valid value 2017-06-20 12:12:13 +02:00
Michael Niedermayer
ae6f6d4e34 avcodec/x86/mpegvideo: Use intra scantable in dct_unquantize_h263_intra_mmx()
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-06-20 00:07:51 +02:00
James Almer
8bb59e6742 x86/aacpsdsp: add ff_ps_hybrid_analysis_ileave_sse
About 2x faster than the c version.
2017-06-18 22:34:22 -03:00
James Almer
e229df9478 x86/aacpsdsp: add ff_ps_hybrid_synthesis_deint_{sse,sse4}
About 2x faster than the c version.
2017-06-18 22:33:27 -03:00
James Almer
623d217ed1 avcodec/aacps: move checks for valid length outside the stereo_interpolate dsp function
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-15 23:49:40 -03:00
James Almer
b3446862bf x86/vorbisdsp: optimize ff_vorbis_inverse_coupling_sse
About 7% faster.
2017-06-15 23:20:05 -03:00
Ronald S. Bultje
d35ff98e27 vp9: fix overwrite in ff_vp9_ipred_dr_16x16_16_avx2.
Fixes trac issue 6459.
2017-06-14 11:37:38 -04:00
Ilia Valiakhmetov
81fc617c12 avcodec/vp9: ipred_dr_16x16_16 avx2 implementation
Signed-off-by: Ilia Valiakhmetov <zakne0ne@gmail.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2017-06-12 12:40:58 -04:00
James Almer
497a4b554c x86/aacpsdsp: fix output of ff_ps_stereo_interpolate_ipdopd_sse3
The fate-aac-al_sbr_ps_04_ur test did not detect this mistake.
2017-06-07 13:53:51 -03:00
Ilia Valiakhmetov
73d9a9a6af libavcodec/vp9: ipred_dl_32x32_16 avx2 implementation
vp9_diag_downleft_32x32_8bpp_c: 580.2
vp9_diag_downleft_32x32_8bpp_sse2: 75.6
vp9_diag_downleft_32x32_8bpp_ssse3: 73.7
vp9_diag_downleft_32x32_8bpp_avx: 72.7
vp9_diag_downleft_32x32_10bpp_c: 1101.2
vp9_diag_downleft_32x32_10bpp_sse2: 145.4
vp9_diag_downleft_32x32_10bpp_ssse3: 137.5
vp9_diag_downleft_32x32_10bpp_avx: 134.8
vp9_diag_downleft_32x32_10bpp_avx2: 94.0
vp9_diag_downleft_32x32_12bpp_c: 1108.5
vp9_diag_downleft_32x32_12bpp_sse2: 145.5
vp9_diag_downleft_32x32_12bpp_ssse3: 137.3
vp9_diag_downleft_32x32_12bpp_avx: 135.2
vp9_diag_downleft_32x32_12bpp_avx2: 94.0

~30% faster than avx implementation

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2017-06-06 08:05:03 -04:00
James Almer
933dd62288 x86/aacpsdsp: optimize ff_ps_mul_pair_single_sse
~2% faster.
2017-06-04 23:29:56 -03:00
James Almer
be3809a521 x86/aacpsdsp: optimize ff_ps_stereo_interpolate_sse3
Move the unpacking outside of the loop. 5% to 10% faster.

Suggested-by: ubitux
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-03 12:39:43 -03:00
James Almer
b5a0971ff0 x86/aacps: add ff_ps_stereo_interpolate_ipdopd_sse3()
About 2x faster than the c version.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-02 11:06:24 -03:00
James Darnley
0dea0114fb avcodec/x86/idctdsp_init: reindent 2017-05-30 13:20:44 +02:00
James Darnley
8e89f6fd37 avcodec/x86: move simple_idct to external assembly 2017-05-30 13:20:42 +02:00
Clément Bœsch
584366a436 lavc/mpegvideoenc: reformat inv_zigzag_direct16 so the zigzag pattern is visible 2017-05-19 11:17:58 +02:00
Clément Bœsch
19bb2cade5 Merge commit 'b4a911c189962e563a09fb0efaf6fa9ab56263a4'
* commit 'b4a911c189962e563a09fb0efaf6fa9ab56263a4':
  mpegvideoenc: make a table const

Merged-by: Clément Bœsch <u@pkh.me>
2017-05-19 11:15:16 +02:00
James Darnley
7aa90b4e94 avcodec/h264: add sse2 versions of previous idct functions
Kaby Lake Pentium:
 - ff_h264_idct_add_8_sse2:    ~1.18x faster than mmxext
 - ff_h264_idct_dc_add_8_sse2: ~1.07x faster than mmxext
2017-05-15 15:00:20 +02:00
James Darnley
27460dfebc avcodec/h264: add avx 8-bit h264_idct_dc_add
Haswell:
 - 1.02x faster (405±0.7 vs. 397±0.8 decicycles) compared with mmxext

Skylake-U:
 - 1.06x faster (498±1.8 vs. 470±1.3 decicycles) compared with mmxext
2017-05-15 15:00:19 +02:00
James Darnley
f61d454ca1 avcodec/h264: add avx 8-bit h264_idct_add
Haswell:
 - 1.11x faster (522±0.4 vs. 469±1.8 decicycles) compared with mmxext

Skylake-U:
 - 1.21x faster (671±5.5 vs. 555±1.4 decicycles) compared with mmxext
2017-05-15 15:00:17 +02:00
James Darnley
b5325c6711 avcodec/h264: use some 3 operand forms 2017-05-15 15:00:16 +02:00
James Darnley
060ba9e5e3 avcodec/h264: change RETs into REP_RETs where appropriate 2017-05-15 15:00:15 +02:00
Michael Niedermayer
fa8fd0808f avcodec/x86/vc1dsp_init: Fix build failure with --disable-optimizations and clang
compilers doing DCE at -O0 do not necessarily understand "complex" boolean expressions
Build succeeds with this change, this was the only failure

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-04-27 04:25:31 +02:00
Clément Bœsch
5be1440c74 Merge commit '0a35f128f3c6e0ae9a0a2236c557602c108da269'
* commit '0a35f128f3c6e0ae9a0a2236c557602c108da269':
  cabac: x86: Give optimizations header a more meaningful name

Merged-by: Clément Bœsch <u@pkh.me>
2017-04-08 14:30:13 +02:00
Ronald S. Bultje
83ae7e6350 x86/idctdsp_init: reindent. 2017-04-06 10:03:28 -04:00
Ronald S. Bultje
e0c205677f x86/simple_idct: add explicit sse2 simple_idct_put/add versions.
These use the mmx IDCT, but sse2 put/add_pixels_clamped implementations.
This way we don't need to use the ff_put/add_pixels_clamped function
pointers.
2017-04-06 10:03:28 -04:00
Ronald S. Bultje
2f0591cfa3 cavs: add a sse2 idct implementation.
This makes using the function pointer ff_add_pixels_clamped() unnecessary,
since we always know what the best implementation is at compile-time.
2017-04-06 10:03:28 -04:00
Ronald S. Bultje
c9d98c5649 cavs: convert idct from inline asm to yasm. 2017-04-06 10:03:27 -04:00
Ronald S. Bultje
b51d7d89f8 x86/xvididct: remove use of ff_put/add_pixels_clamped function pointer.
Since there's separate SSE2 implementations of xvid_idct_put/add, this
patch has no practical impact on performance.
2017-04-06 10:03:27 -04:00