Commit Graph

3190 Commits

Author SHA1 Message Date
Xiwei Gu 7ed753b10a loongarch: Enhance ultrafast encoding performance
With the following command, ultrafast encoding
performance has improved from 182fps to 189fps:
./x264 --preset ultrafast -o out.mkv yuv_1920x1080.yuv
2024-03-21 09:18:50 +08:00
Xiwei Gu 162622863a loongarch: Fixed pixel_sa8d_16x16_lasx
Save and restore FPR
2024-03-21 09:18:32 +08:00
Xiwei Gu 5a61afdbf1 loongarch: Add checkasm_call 2024-03-21 09:18:00 +08:00
Xiwei Gu 982d32400f loongarch: Update loongson_asm.S version to 0.4.0 2024-03-21 09:17:09 +08:00
Henrik Gramner 585e01997f x86inc: Improve XMM-spilling functionality on 64-bit Windows
Prior to this change, the scenario where the number of XMM
registers spilled depends on whether a branch is taken or not was
complicated to handle well. There were essentially three options:

1) Always spill the largest number of XMM registers. Results in
   unnecessary spills.

2) Do the spilling after the branch. Results in code duplication
   for the shared subset of spills.

3) Do the spilling manually. Optimal, but overly complex and vexing.

This adds an additional optional argument to the WIN64_SPILL_XMM
and WIN64_PUSH_XMM macros to make it possible to allocate space
for a certain number of registers but initially only push a subset
of those, with the option of pushing additional registers later.
2024-03-14 23:29:26 +00:00
Henrik Gramner 4df71a75bf x86inc: Restore the stack state between stack allocations
Allows the use of multiple independent stack allocations within
a function without having to manually fiddle with stack offsets.
2024-03-14 23:29:26 +00:00
Henrik Gramner 3d8aff7e26 x86inc: Fix warnings with old nasm versions 2024-03-14 23:29:26 +00:00
Anton Mitrofanov de1bea534f ppc: Fix incompatible pointer type errors
Use correct return type for pixel_sad_x3/x4 functions.
Bug report by Dominik 'Rathann' Mierzejewski.
2024-03-12 23:10:12 +03:00
Martin Storsjö be4f0200ed aarch64: Use regular hwcaps flags instead of HWCAP_CPUID for CPU feature detection on Linux
This makes the code much simpler (especially for adding support
for other instruction set extensions), avoids needing inline
assembly for this feature, and generally is more of the canonical
way to do this.

The CPU feature detection was added in
9c3c716882, using HWCAP_CPUID.

The argument for using that was that HWCAP_CPUID was added much
earlier in the kernel (in Linux v4.11), while the HWCAP flags for
individual features always come later. This allows detecting support
for new CPU extensions before the kernel exposes information about
them via hwcap flags.

However, in practice there's probably little advantage in this.
E.g. HWCAP_SVE was added in Linux v4.15, and HWCAP2_SVE2 was added in
v5.10 - later than HWCAP_CPUID, but there are probably very few
practical cases where one would run a kernel older than that on a CPU
that supports those instructions.

Additionally, we provide our own definitions of the flag values to
check (as they are fixed constants anyway), with names not conflicting
with the ones from system headers. This reduces the number of ifdefs
needed, and allows detecting those features even if building with
userland headers that are lacking the definitions of those flags.

Also, slightly older versions of QEMU, e.g. 6.2 in Ubuntu 22.04,
do expose support for these features via HWCAP flags, but the
emulated cpuid registers are missing the bits for exposing e.g. SVE2.
(This issue is fixed in later versions of QEMU, though.)

Also drop the ifdef check for whether AT_HWCAP is defined; it was
added to glibc in 1997. AT_HWCAP2 was added in 2013, in glibc 2.18,
which also predates common use of aarch64 anyway, so
don't guard the use of that with an ifdef either.
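
As a minimal sketch of the approach (the flag constants are the fixed
architectural values, but the helper and the flag names below are
illustrative assumptions, not the actual x264 code):

    #include <stdint.h>
    #include <sys/auxv.h>

    /* Fixed architectural constants, defined locally so the code builds
     * even with userland headers that lack HWCAP_SVE/HWCAP2_SVE2. */
    #define HWCAP_AARCH64_SVE   (1 << 22)   /* AT_HWCAP  */
    #define HWCAP2_AARCH64_SVE2 (1 << 1)    /* AT_HWCAP2 */

    static uint32_t detect_sve_flags( void )
    {
        uint32_t flags = 0;
        if( getauxval( AT_HWCAP ) & HWCAP_AARCH64_SVE )
            flags |= 1u << 0;   /* e.g. an X264_CPU_SVE bit (value assumed) */
        if( getauxval( AT_HWCAP2 ) & HWCAP2_AARCH64_SVE2 )
            flags |= 1u << 1;   /* e.g. an X264_CPU_SVE2 bit (value assumed) */
        return flags;
    }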
2024-02-28 22:26:17 +00:00
Anton Mitrofanov 7241d02011 CI: Switch 32/64-bit windows builds to LLVM
Use same Docker images as VLC for contrib compilation.
2024-02-28 23:23:15 +03:00
Anton Mitrofanov ea08f58648 CI: Add config.log to job artifacts 2024-02-28 23:19:23 +03:00
Henrik Gramner 12426f5f49 x86inc: Add support for ELF CET properties
Automatically flag x86-64 asm object files as SHSTK-compatible.

Shadow Stack (SHSTK) is a part of Control-flow Enforcement Technology
(CET) which is a feature aimed at defending against ROP attacks by
verifying that 'call' and 'ret' instructions are correctly matched.

For well-written code this works transparently without any code changes,
as return addresses popped from the shadow stack should match return
addresses popped from the normal stack for performance reasons anyway.
2024-02-20 00:03:09 +01:00
Henrik Gramner 6fc4480cf0 x86inc.asm: Add the crc32 SSE4.2 GPR instruction 2024-02-20 00:03:09 +01:00
Henrik Gramner 87476b4c4d x86inc: Add a cpu flag for the Ice Lake AVX-512 subset 2024-02-20 00:03:09 +01:00
Henrik Gramner a6b561792f x86inc: Add CLMUL cpu flag
Also make the GFNI cpu flag imply the presence of both AESNI and CLMUL.
2024-02-20 00:03:09 +01:00
Henrik Gramner 5207a74e77 x86inc: Add template defines for EVEX broadcasts
Broadcasting a memory operand is a binary flag: you either broadcast
or you don't, and there's only a single possible element size for
any given instruction.

The instruction syntax, however, requires the broadcast semantics
to be explicitly defined, which is an issue when using macros to
template code for multiple register widths.

Add some helper defines to alleviate the issue.
2024-02-20 00:02:59 +01:00
Henrik Gramner 436be41fc1 x86inc: Properly sort instructions in alphabetical order 2024-02-19 23:49:36 +01:00
Anton Mitrofanov 4815ccadb1 Bump dates to 2024 2024-01-13 14:45:39 +03:00
David Chen c1c9931dc8 Improve pixel-a.S Performance by Using SVE/SVE2
Improve the performance of NEON functions of aarch64/pixel-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
ssd_4x4_c: 235
ssd_4x4_neon: 226
ssd_4x4_sve: 151
ssd_4x8_c: 409
ssd_4x8_neon: 363
ssd_4x8_sve: 201
ssd_4x16_c: 781
ssd_4x16_neon: 653
ssd_4x16_sve: 313
ssd_8x4_c: 402
ssd_8x4_neon: 192
ssd_8x4_sve: 192
ssd_8x8_c: 728
ssd_8x8_neon: 275
ssd_8x8_sve: 275

Command executed: ./checkasm10 --bench=ssd
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
ssd_4x4_c: 256
ssd_4x4_neon: 226
ssd_4x4_sve: 153
ssd_4x8_c: 460
ssd_4x8_neon: 369
ssd_4x8_sve: 215
ssd_4x16_c: 852
ssd_4x16_neon: 651
ssd_4x16_sve: 340

Command executed: ./checkasm8 --bench=ssd
Testbed: AWS Graviton3
Results:
ssd_4x4_c: 295
ssd_4x4_neon: 288
ssd_4x4_sve: 228
ssd_4x8_c: 454
ssd_4x8_neon: 431
ssd_4x8_sve: 294
ssd_4x16_c: 779
ssd_4x16_neon: 631
ssd_4x16_sve: 438
ssd_8x4_c: 463
ssd_8x4_neon: 247
ssd_8x4_sve: 246
ssd_8x8_c: 781
ssd_8x8_neon: 413
ssd_8x8_sve: 353

Command executed: ./checkasm10 --bench=ssd
Testbed: AWS Graviton3
Results:
ssd_4x4_c: 322
ssd_4x4_neon: 335
ssd_4x4_sve: 240
ssd_4x8_c: 522
ssd_4x8_neon: 448
ssd_4x8_sve: 294
ssd_4x16_c: 832
ssd_4x16_neon: 603
ssd_4x16_sve: 440

Command executed: ./checkasm8 --bench=sa8d
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
sa8d_8x8_c: 2103
sa8d_8x8_neon: 619
sa8d_8x8_sve: 617

Command executed: ./checkasm8 --bench=sa8d
Testbed: AWS Graviton3
Results:
sa8d_8x8_c: 2021
sa8d_8x8_neon: 597
sa8d_8x8_sve: 580

Command executed: ./checkasm8 --bench=var
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
var_8x8_c: 595
var_8x8_neon: 262
var_8x8_sve: 262
var_8x16_c: 1193
var_8x16_neon: 435
var_8x16_sve: 419

Command executed: ./checkasm8 --bench=var
Testbed: AWS Graviton3
Results:
var_8x8_c: 616
var_8x8_neon: 229
var_8x8_sve: 222
var_8x16_c: 1207
var_8x16_neon: 399
var_8x16_sve: 389

Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
hadamard_ac_8x8_c: 2330
hadamard_ac_8x8_neon: 635
hadamard_ac_8x8_sve: 635
hadamard_ac_8x16_c: 4500
hadamard_ac_8x16_neon: 1152
hadamard_ac_8x16_sve: 1151
hadamard_ac_16x8_c: 4499
hadamard_ac_16x8_neon: 1151
hadamard_ac_16x8_sve: 1150
hadamard_ac_16x16_c: 8812
hadamard_ac_16x16_neon: 2187
hadamard_ac_16x16_sve: 2186

Command executed: ./checkasm8 --bench=hadamard_ac
Testbed: AWS Graviton3
Results:
hadamard_ac_8x8_c: 2266
hadamard_ac_8x8_neon: 517
hadamard_ac_8x8_sve: 513
hadamard_ac_8x16_c: 4444
hadamard_ac_8x16_neon: 867
hadamard_ac_8x16_sve: 849
hadamard_ac_16x8_c: 4443
hadamard_ac_16x8_neon: 880
hadamard_ac_16x8_sve: 868
hadamard_ac_16x16_c: 8595
hadamard_ac_16x16_neon: 1656
hadamard_ac_16x16_sve: 1622
2023-11-23 19:01:29 +02:00
David Chen 0ac52d2915 Create Common NEON pixel-a Macros and Constants
Place NEON pixel-a macros and constants that are intended
to be used by SVE/SVE2 functions as well in a common file.
2023-11-23 08:26:53 +02:00
David Chen 06dcf3f9cd Improve mc-a.S Performance by Using SVE/SVE2
Improve the performance of NEON functions of aarch64/mc-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=avg
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
avg_4x2_c: 274
avg_4x2_neon: 215
avg_4x2_sve: 171
avg_4x4_c: 461
avg_4x4_neon: 343
avg_4x4_sve: 225
avg_4x8_c: 806
avg_4x8_neon: 619
avg_4x8_sve: 334
avg_4x16_c: 1523
avg_4x16_neon: 1168
avg_4x16_sve: 558

Command executed: ./checkasm8 --bench=avg
Testbed: AWS Graviton3
Results:
avg_4x2_c: 267
avg_4x2_neon: 213
avg_4x2_sve: 167
avg_4x4_c: 467
avg_4x4_neon: 350
avg_4x4_sve: 221
avg_4x8_c: 784
avg_4x8_neon: 624
avg_4x8_sve: 302
avg_4x16_c: 1445
avg_4x16_neon: 1182
avg_4x16_sve: 485
2023-11-23 08:24:16 +02:00
David Chen 21a788f159 Create Common NEON mc-a Macros and Functions
Place NEON mc-a macros and functions that are intended
to be used by SVE/SVE2 functions as well in a common file.
2023-11-23 08:24:13 +02:00
David Chen 5ad5e5d8f1 Improve deblock-a.S Performance by Using SVE/SVE2
Improve the performance of NEON functions of aarch64/deblock-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=deblock
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
deblock_chroma[1]_c: 735
deblock_chroma[1]_neon: 427
deblock_chroma[1]_sve: 353

Command executed: ./checkasm8 --bench=deblock
Testbed: AWS Graviton3
Results:
deblock_chroma[1]_c: 719
deblock_chroma[1]_neon: 442
deblock_chroma[1]_sve: 345
2023-11-20 08:03:54 +02:00
David Chen 37949a994e Create Common NEON deblock-a Macros
Place NEON deblock-a macros that are intended to be
used by SVE/SVE2 functions as well in a common file.
2023-11-20 08:03:53 +02:00
David Chen 5c382660fb Improve dct-a.S Performance by Using SVE/SVE2
Improve the performance of NEON functions of aarch64/dct-a.S
by using the SVE/SVE2 instruction set. Below, the specific functions
are listed together with the improved performance results.

Command executed: ./checkasm8 --bench=sub
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
sub4x4_dct_c: 528
sub4x4_dct_neon: 322
sub4x4_dct_sve: 247

Command executed: ./checkasm8 --bench=sub
Testbed: AWS Graviton3
Results:
sub4x4_dct_c: 562
sub4x4_dct_neon: 376
sub4x4_dct_sve: 255

Command executed: ./checkasm8 --bench=add
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
add4x4_idct_c: 698
add4x4_idct_neon: 386
add4x4_idct_sve2: 345

Command executed: ./checkasm8 --bench=zigzag
Testbed: Alibaba g8y instance based on Yitian 710 CPU
Results:
zigzag_interleave_8x8_cavlc_frame_c: 582
zigzag_interleave_8x8_cavlc_frame_neon: 273
zigzag_interleave_8x8_cavlc_frame_sve: 257

Command executed: ./checkasm8 --bench=zigzag
Testbed: AWS Graviton3
Results:
zigzag_interleave_8x8_cavlc_frame_c: 587
zigzag_interleave_8x8_cavlc_frame_neon: 257
zigzag_interleave_8x8_cavlc_frame_sve: 249
2023-11-20 08:03:51 +02:00
David Chen b6190c6fa1 Create Common NEON dct-a Macros
Place NEON dct-a macros that are intended to be
used by SVE/SVE2 functions as well in a common file.
2023-11-18 08:42:48 +02:00
Martin Storsjö c196240409 ci: Test the aarch64 build in QEMU with varying SVE sizes
The sve-default-vector-length property sets the maximum vector
length in bytes; the default is 64, i.e. handling up to 512
bit vectors. In order to be able to test 1024 and 2048 bit vectors,
this has to be raised separately from setting the sve<n>=on
property.
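
For example (an illustrative invocation, not necessarily the exact CI
command), 1024 bit vectors can be exercised with:
qemu-aarch64 -cpu max,sve1024=on,sve-default-vector-length=128 ./checkasm8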
2023-11-14 12:44:15 +00:00
Martin Storsjö 9b3e653be4 ci: Update the build-debian-amd64 job to a new base image
In the new version, there's no longer any "wine64" executable,
but both i386 and x86_64 are handled with the same "wine" frontend.
2023-11-14 12:44:15 +00:00
Martin Storsjö 611b87b7a2 checkasm: Print the actual SVE vector length 2023-11-14 12:38:47 +02:00
Martin Storsjö a354f11f8f aarch64: Consistently use lowercase vector element specifiers 2023-11-02 23:34:23 +02:00
Martin Storsjö ef572b9f06 aarch64: Make the assembly indentation slightly more consistent
The assembly currently uses a mixture of different styles. Don't
make all of it entirely consistent now, but try to make functions
more consistent within themselves at least.

In particular, get rid of the convention to have braces hanging
outside of the alignment line.

Some functions have the whole content indented off by one char
compared to other functions; adjust those (but retain the functions
that are self-consistent and match either of the common styles).
2023-11-02 23:34:22 +02:00
Martin Storsjö 3bc7c36256 arm: Make the assembly indentation slightly more consistent
The assembly currently uses a mixture of different styles. Don't
make all of it entirely consistent now, but try to make functions
more consistent within themselves at least.

In particular, get rid of the convention to have braces hanging
outside of the alignment line.
2023-11-02 23:31:40 +02:00
Martin Storsjö dc755eabb9 aarch64: Use rounded right shifts in dequant
Don't manually add in the rounding constant (via a fused multiply-add
instruction) when we can just do a plain rounded right shift.

                     Cortex A53   A72   A73
8bpc:
Before:
dequant_4x4_cqm_neon:       515   246   267
dequant_4x4_dc_cqm_neon:    410   265   266
dequant_4x4_dc_flat_neon:   413   271   271
dequant_4x4_flat_neon:      519   254   274
dequant_8x8_cqm_neon:      1555   980  1002
dequant_8x8_flat_neon:     1562   994  1014
After:
dequant_4x4_cqm_neon:       499   246   255
dequant_4x4_dc_cqm_neon:    376   265   255
dequant_4x4_dc_flat_neon:   378   271   260
dequant_4x4_flat_neon:      500   254   262
dequant_8x8_cqm_neon:      1489   900   925
dequant_8x8_flat_neon:     1493   915   938

10bpc:
Before:
dequant_4x4_cqm_neon:       483   275   275
dequant_4x4_dc_cqm_neon:    429   256   261
dequant_4x4_dc_flat_neon:   435   267   267
dequant_4x4_flat_neon:      487   283   288
dequant_8x8_cqm_neon:      1511  1112  1076
dequant_8x8_flat_neon:     1518  1139  1089
After:
dequant_4x4_cqm_neon:       472   255   239
dequant_4x4_dc_cqm_neon:    404   256   232
dequant_4x4_dc_flat_neon:   406   267   234
dequant_4x4_flat_neon:      472   255   239
dequant_8x8_cqm_neon:      1462   922   978
dequant_8x8_flat_neon:     1462   922   978

This makes it around 3% faster on the Cortex A53, around 8% faster
for 8bpc on Cortex A72/A73, and around 10-20% faster for 10bpc
on A72/A73.
2023-11-02 21:26:03 +00:00
Martin Storsjö 4664f5aa66 aarch64: Improve scheduling in sad_x3/sad_x4
Cortex A53    A72    A73
8 bpc:
Before:
sad_x3_4x4_neon:      580    303    204
sad_x3_4x8_neon:     1065    516    323
sad_x3_8x4_neon:      668    262    282
sad_x3_8x8_neon:     1238    454    471
sad_x3_8x16_neon:    2378    842    847
sad_x3_16x8_neon:    2136    738    776
sad_x3_16x16_neon:   4162   1378   1463
After:
sad_x3_4x4_neon:      477    298    206
sad_x3_4x8_neon:      842    515    327
sad_x3_8x4_neon:      603    260    279
sad_x3_8x8_neon:     1110    451    464
sad_x3_8x16_neon:    2125    841    843
sad_x3_16x8_neon:    2124    730    766
sad_x3_16x16_neon:   4145   1370   1434

10 bpc:
Before:
sad_x3_4x4_neon:      632    247    254
sad_x3_4x8_neon:     1162    419    443
sad_x3_8x4_neon:      890    358    416
sad_x3_8x8_neon:     1670    632    759
sad_x3_8x16_neon:    3230   1179   1458
sad_x3_16x8_neon:    3070   1209   1403
sad_x3_16x16_neon:   6030   2333   2699

After:
sad_x3_4x4_neon:      522    253    255
sad_x3_4x8_neon:      932    443    431
sad_x3_8x4_neon:      880    354    406
sad_x3_8x8_neon:     1660    626    736
sad_x3_8x16_neon:    3220   1170   1397
sad_x3_16x8_neon:    3060   1184   1362
sad_x3_16x16_neon:   6020   2272   2579

Thus, this is around a 20-25% speedup on Cortex A53 for the small
sizes (much smaller difference for bigger sizes though), while it
doesn't make much of a difference at all (mostly within measurement
noise) for the out-of-order cores (A72 and A73).
2023-11-02 13:27:08 +02:00
Anton Mitrofanov d46938dec1 Fix VBV with sliced threads 2023-10-24 22:07:14 +03:00
Martin Storsjö 9c3c716882 Add cpu flags and runtime detection of SVE and SVE2
We could also use HWCAP_SVE and HWCAP2_SVE2 for detecting this,
but these might not be available in all userland headers, while
HWCAP_CPUID is available much earlier.

The register ID_AA64ZFR0_EL1, which indicates if SVE2 is available,
can only be accessed if SVE is available. If not building all the
C code with SVE enabled (which could make it impossible to run on
on HW without SVE), binutils refuses to assemble an instruction
reading ID_AA64ZFR0_EL1 - but if referring to it with the technical
name S3_0_C0_C4_4, it can be assembled even without any extra
extensions enabled.
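
A minimal sketch of that trick in C with GNU inline assembly (the helper
is illustrative, not the literal x264 code, and should only be called
once SVE itself is known to be present):

    #include <stdint.h>

    static int cpu_has_sve2( void )
    {
        uint64_t zfr0;
        /* ID_AA64ZFR0_EL1 spelled by its generic encoding S3_0_C0_C4_4,
         * so it assembles without enabling SVE in the assembler. */
        __asm__( "mrs %0, S3_0_C0_C4_4" : "=r"( zfr0 ) );
        return ( zfr0 & 0xf ) >= 1;  /* SVEver field: >= 1 means SVE2 */
    }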
2023-10-19 22:58:11 +03:00
Martin Storsjö db9bc75b0b configure: Check for support for AArch64 SVE and SVE2
We don't expect the user to build the whole x264 codebase with
SVE/SVE2 enabled, as we only enable this feature for the assembly
files that use it, in order to have binaries that are portable
and enable the SVE codepaths at runtime if supported.
2023-10-18 11:23:47 +03:00
Loongson Technology Corporation Limited 5f84d403fc loongarch: Improve the performance of pixel series functions
Performance has improved from 11.27fps to 20.50fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
hadamard_ac_8x8          117             21
hadamard_ac_8x16         236             42
hadamard_ac_16x8         235             31
hadamard_ac_16x16        473             60
intra_sad_x3_4x4         50              21
intra_sad_x3_8x8         183             34
intra_sad_x3_8x8c        181             36
intra_sad_x3_16x16       643             68
intra_satd_x3_4x4        83              61
intra_satd_x3_8x8c       344             81
intra_satd_x3_16x16      1389            136
sa8d_8x8                 97              19
sa8d_16x16               394             68
satd_4x4                 24              8
satd_4x8                 51              11
satd_4x16                103             24
satd_8x4                 52              9
satd_8x8                 108             12
satd_8x16                218             24
satd_16x8                218             19
satd_16x16               437             38
ssd_4x4                  10              5
ssd_4x8                  24              8
ssd_4x16                 42              15
ssd_8x4                  23              5
ssd_8x8                  37              9
ssd_8x16                 74              17
ssd_16x8                 72              11
ssd_16x16                140             23
var2_8x8                 91              37
var2_8x16                176             66
var_8x8                  50              15
var_8x16                 65              29
var_16x16                132             56

Signed-off-by: Hecai Yuan <yuanhecai@loongson.cn>
2023-10-12 17:28:23 +08:00
Loongson Technology Corporation Limited fa7f1fce7f loongarch: Improve the performance of dct series functions
Performance has improved from 10.53fps to 11.27fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
add4x4_idct              34              9
add8x8_idct              139             31
add8x8_idct8             269             39
add8x8_idct_dc           67              7
add16x16_idct            564             123
add16x16_idct_dc         260             22
dct4x4dc                 18              10
idct4x4dc                16              9
sub4x4_dct               25              7
sub8x8_dct               101             12
sub8x8_dct8              160             25
sub16x16_dct             403             52
sub16x16_dct8            646             68
zigzag_scan_4x4_frame    4               1

Signed-off-by: zhoupeng <zhoupeng@loongson.cn>
2023-10-12 17:28:15 +08:00
Loongson Technology Corporation Limited 981c8f25a2 loongarch: Improve the performance of mc series functions
Performance has improved from 6.78fps to 10.53fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
avg_4x2                  16              5
avg_4x4                  30              6
avg_4x8                  63              10
avg_4x16                 124             19
avg_8x4                  60              6
avg_8x8                  119             10
avg_8x16                 233             19
avg_16x8                 229             21
avg_16x16                451             41
get_ref_4x4              30              9
get_ref_4x8              52              11
get_ref_8x4              45              9
get_ref_8x8              80              11
get_ref_8x16             156             16
get_ref_12x10            137             13
get_ref_16x8             147             11
get_ref_16x16            282             16
get_ref_20x18            278             22
hpel_filter              5163            686
lowres_init              5440            286
mc_chroma_2x2            24              7
mc_chroma_2x4            42              10
mc_chroma_4x2            41              7
mc_chroma_4x4            75              10
mc_chroma_4x8            144             19
mc_chroma_8x4            137             15
mc_chroma_8x8            269             28
mc_luma_4x4              30              10
mc_luma_4x8              52              12
mc_luma_8x4              44              10
mc_luma_8x8              80              13
mc_luma_8x16             156             19
mc_luma_16x8             147             13
mc_luma_16x16            281             19
memcpy_aligned           14              9
memzero_aligned          24              4
offsetadd_w4             79              18
offsetadd_w8             142             18
offsetadd_w16            277             25
offsetadd_w20            1118            38
offsetsub_w4             75              18
offsetsub_w8             140             18
offsetsub_w16            265             25
offsetsub_w20            989             39
weight_w4                111             19
weight_w8                205             19
weight_w16               396             29
weight_w20               1143            45
deinterleave_chroma_fdec 76              9
deinterleave_chroma_fenc 86              9
plane_copy_deinterleave  733             90
plane_copy_interleave    791             245
store_interleave_chroma  82              12

Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
2023-10-12 17:27:40 +08:00
Loongson Technology Corporation Limited 65e7bac50d loongarch: Improve the performance of quant series functions
Performance has improved from 6.34fps to 6.78fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
coeff_last15             3               2
coeff_last16             3               1
coeff_last64             42              6
decimate_score15         8               12
decimate_score16         8               11
decimate_score64         61              43
dequant_4x4_cqm          16              5
dequant_4x4_dc_cqm       13              5
dequant_4x4_dc_flat      13              5
dequant_4x4_flat         16              5
dequant_8x8_cqm          71              9
dequant_8x8_flat         71              9

Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
2023-10-10 09:15:32 +08:00
Loongson Technology Corporation Limited d8ed272a19 loongarch: Improve the performance of predict series functions
Performance has improved from 6.32fps to 6.34fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
intra_predict_4x4_dc     3               2
intra_predict_4x4_dc8    1               1
intra_predict_4x4_dcl    2               1
intra_predict_4x4_dct    2               1
intra_predict_4x4_ddl    7               2
intra_predict_4x4_h      2               1
intra_predict_4x4_v      1               1
intra_predict_8x8_dc     8               2
intra_predict_8x8_dc8    1               1
intra_predict_8x8_dcl    5               2
intra_predict_8x8_dct    5               2
intra_predict_8x8_ddl    27              3
intra_predict_8x8_ddr    26              3
intra_predict_8x8_h      4               2
intra_predict_8x8_v      3               1
intra_predict_8x8_vl     29              3
intra_predict_8x8_vr     31              4
intra_predict_8x8c_dc    8               5
intra_predict_8x8c_dc8   1               1
intra_predict_8x8c_dcl   5               3
intra_predict_8x8c_dct   5               3
intra_predict_8x8c_h     4               2
intra_predict_8x8c_p     58              30
intra_predict_8x8c_v     4               1
intra_predict_16x16_dc   32              8
intra_predict_16x16_dc8  9               4
intra_predict_16x16_dcl  26              6
intra_predict_16x16_dct  26              6
intra_predict_16x16_h    23              7
intra_predict_16x16_p    182             44
intra_predict_16x16_v    22              4

Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
2023-10-10 09:13:58 +08:00
Loongson Technology Corporation Limited 00b8e3b9cd loongarch: Improve the performance of sad/sad_x3/sad_x4 series functions
Performance has improved from 4.92fps to 6.32fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
sad_4x4                 13               3
sad_4x8                 26               7
sad_4x16                57               13
sad_8x4                 24               3
sad_8x8                 54               8
sad_8x16                108              13
sad_16x8                95               8
sad_16x16               189              13
sad_x3_4x4              37               6
sad_x3_4x8              71               13
sad_x3_8x4              70               8
sad_x3_8x8              162              14
sad_x3_8x16             323              25
sad_x3_16x8             279              15
sad_x3_16x16            555              27
sad_x4_4x4              49               8
sad_x4_4x8              95               17
sad_x4_8x4              94               8
sad_x4_8x8              214              16
sad_x4_8x16             429              33
sad_x4_16x8             372              18
sad_x4_16x16            740              34

Signed-off-by: wanglu <wanglu@loongson.cn>
2023-10-10 09:09:52 +08:00
Loongson Technology Corporation Limited d7d283f634 loongarch: Improve the performance of deblock series functions.
Performance has improved from 4.76fps to 4.92fps.
Tested with the following command:
./configure && make -j5
./x264 --threads 4 -o out.mkv yuv_1920x1080.yuv

functions           performance     performance
                        (c)            (asm)
deblock_luma[0]         79               39
deblock_luma[1]         91               18
deblock_luma_intra[0]   63               44
deblock_luma_intra[1]   71               18
deblock_strength        104              33

Signed-off-by: Hao Chen <chenhao@loongson.cn>
2023-10-10 09:04:49 +08:00
Loongson Technology Corporation Limited 25ffd616b1 loongarch: Add loongson_asm.S and loongson_utils.S
Common macros and functions for loongson optimization.

Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
2023-10-10 09:00:47 +08:00
Loongson Technology Corporation Limited 1ecc51ee97 loongarch: Init LSX/LASX support
LSX/LASX is the LOONGARCH 128-bit/256-bit SIMD Architecture.

Signed-off-by: Shiyou Yin <yinshiyou-hf@loongson.cn>
Signed-off-by: Xiwei Gu <guxiwei-hf@loongson.cn>
2023-10-10 09:00:09 +08:00
Hubert Mazur 5a9dfddea4 pixel: Add neon ssim_end implementation for 10 bit
Provide arm64 neon implementation for ssim_end function
for 10 bit depth. The implementation is based on the
previous one for 8 bit depth with a few differences like
IEEE-754 constant values and scheduling. The conversion
to floating point number must be done at the beginning
to prevent range overflows.

Benchmarks are shown below.

ssim_end_c: 715
ssim_end_neon: 380

Signed-off-by: Hubert Mazur <hum@semihalf.com>
2023-10-01 15:45:18 +00:00
Hubert Mazur 67ad1cb635 pixel: Add neon ssim_core implementation for 10 bit
Provide arm64 neon implementation for ssim_core function
for 10 bit depth. Benchmarks are shown below.

ssim_core_c: 1315
ssim_core_neon: 470

Signed-off-by: Hubert Mazur <hum@semihalf.com>
2023-10-01 15:45:18 +00:00
Hubert Mazur 0e6165de1c pixel: Add neon hadamard implementations for 10 bit
Provide arm64 neon implementation for hadamard_ac functions
for 10 bit depth. Benchmarks are shown below.

hadamard_ac_8x8_c: 2995
hadamard_ac_8x8_neon: 682
hadamard_ac_8x16_c: 5959
hadamard_ac_8x16_neon: 1207
hadamard_ac_16x8_c: 5963
hadamard_ac_16x8_neon: 1212
hadamard_ac_16x16_c: 11851
hadamard_ac_16x16_neon: 2260

Signed-off-by: Hubert Mazur <hum@semihalf.com>
2023-10-01 15:45:18 +00:00
Hubert Mazur 8743a46d10 pixel: Add neon sa8d implementations for 10 bit
Provide arm64 neon implementation for sa8d 8x8 and 16x16 functions
for 10 bit depth. Benchmarks are shown below.

sa8d_8x8_c: 2914
sa8d_8x8_neon: 608
sa8d_16x16_c: 11469
sa8d_16x16_neon: 2030

Signed-off-by: Hubert Mazur <hum@semihalf.com>
2023-10-01 15:45:18 +00:00