The --cpumask flag only takes a single flag name; one can't set
a combination like neon+dotprod.
Therefore, apply the same pattern as for x86, by adding mask values
that contain all the implied lower level flags.
This is somewhat complicated, as the set of features isn't entirely
linear - in particular, SVE doesn't imply either dotprod or i8mm,
and SVE2 only implies dotprod, but not i8mm.
This makes sure that "dav1d --cpumask dotprod" actually uses any
SIMD at all; previously it only set the dotprod flag but not
neon, which essentially opted out of all SIMD.
Simplify the TBL usages in small block size (2, 4) parts of the 2D
(horizontal-vertical) put subpel filters. The 2-register TBLs are
replaced with the 1-register form because we only need the lower
64 bits of the result, which can be extracted from a single source
register. Performance is not affected by this change.
Simplify the inner loops of the DotProd code path of horizontal
subpel filters to avoid using 2-register TBL instructions. The
store part of block size 16 of the horizontal put case is also
simplified (str + add -> st1). This patch can improve performance,
mostly on small cores like the Cortex-A510 and newer; other CPUs
are largely unaffected.
Cortex-A510:
mct_8tap_sharp_w16_h_8bpc_dotprod: 2.77x -> 3.13x
mct_8tap_sharp_w32_h_8bpc_dotprod: 2.32x -> 2.56x
Cortex-A55:
mct_8tap_sharp_w16_h_8bpc_dotprod: 3.89x -> 3.89x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.35x -> 3.35x
Cortex-A715:
mct_8tap_sharp_w16_h_8bpc_dotprod: 3.79x -> 3.78x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.30x -> 3.30x
Cortex-A78:
mct_8tap_sharp_w16_h_8bpc_dotprod: 4.30x -> 4.31x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.79x -> 3.80x
Cortex-X3:
mct_8tap_sharp_w16_h_8bpc_dotprod: 4.74x -> 4.75x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.89x -> 3.91x
Cortex-X1:
mct_8tap_sharp_w16_h_8bpc_dotprod: 4.61x -> 4.62x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.67x -> 3.66x
Simplify the accumulator initializations of the DotProd code path of
vertical subpel filters. This also makes it possible for some CPUs to
use zero-latency vector register moves. The load in the inner loop
of the vertical filter for block size 16 is also simplified
(ldr + add -> ld1).
Add a \dot parameter to the filter_8tap_fn macro in preparation for
extending it with an i8mm code path. This patch also contains string
fixes and some instruction reordering, along with some register
renaming to make the code more uniform. These changes don't affect
performance but simplify the code a bit.
Manually add a padding 0 entry to make the odd number of .hword
entries align with the instruction size.
This fixes assembling with GAS, with the --gdwarf2 option, where
it previously produced the error message "unaligned opcodes detected
in executable segment".
The message is slightly misleading, as the error is printed even
if there actually are no opcodes that are misaligned, as the jump
table is the last thing within the .text section. The issue can
be reproduced with an input as small as this, assembled with
"as --gdwarf2 -c test.s".
.text
nop
.hword 0
See a6228f47f0 for earlier cases of
the same error - although in those cases, we actually did have more
code and labels following the unaligned jump tables.
This error is present with binutils 2.39 and earlier; in
binutils 2.40, this input is no longer considered an error, fixed
in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=6f6f5b0adc9efd103c434fd316e8c880a259775d.
Port improvements from the hi token functions to the rest of the symbol
adaption functions. These weren't originally ported since they didn't
work with arbitrary padding. In practice, zero padding is already used
and only the tests need to be updated.
Results - Neoverse N1
Old:
msac_decode_symbol_adapt4_c: 41.4 ( 1.00x)
msac_decode_symbol_adapt4_neon: 31.0 ( 1.34x)
msac_decode_symbol_adapt8_c: 54.5 ( 1.00x)
msac_decode_symbol_adapt8_neon: 32.2 ( 1.69x)
msac_decode_symbol_adapt16_c: 85.6 ( 1.00x)
msac_decode_symbol_adapt16_neon: 37.5 ( 2.28x)
New:
msac_decode_symbol_adapt4_c: 41.5 ( 1.00x)
msac_decode_symbol_adapt4_neon: 27.7 ( 1.50x)
msac_decode_symbol_adapt8_c: 55.7 ( 1.00x)
msac_decode_symbol_adapt8_neon: 30.1 ( 1.85x)
msac_decode_symbol_adapt16_c: 82.4 ( 1.00x)
msac_decode_symbol_adapt16_neon: 35.2 ( 2.34x)
It was originally disabled due to older meson versions mixing the output
of 'meson test -v' from different tests, which made the log difficult to
read. Newer versions, however, cache the output from each test as it runs
and print it in one contiguous block, so that's no longer an issue.
We can simply use the regular mv contexts for intra frames.
They are mutually exclusive, and the dmv contexts were already
discarded and replaced with default contexts on frame completion.
Attempt to finish writing the current frame before exiting to avoid
ending up with a partially written frame at the end of the output file.
Only try catching a signal once, falling back to the default behavior
of exiting immediately the second time a given signal is raised.
On AArch64, the performance counter registers are usually
restricted and not accessible from user space.
On macOS, we currently use mach_absolute_time() as the timer on
aarch64. This measures wall-clock time, but with very coarse
resolution.
There is a private API, kperf, that can be used to get
high-precision timers, though. Unfortunately, it requires running
the checkasm binary as root (e.g. with sudo).
Also, as it is a private, undocumented API, it can potentially
change at any time.
This is handled by adding a new meson build option, for switching
to this timer. If the timer source in checkasm could be changed
at runtime with an option, this wouldn't need to be a build time
option.
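A build option like the one described could look roughly like this in meson_options.txt; the actual option name and description in dav1d may differ.

```meson
# Hypothetical sketch of the build-time switch for the kperf timer.
option('macos_kperf', type: 'boolean', value: false,
       description: 'Use the private macOS kperf API for checkasm timing (requires root)')
```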
This allows getting benchmarks like this:
mc_8tap_regular_w16_hv_8bpc_c: 1522.1 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon: 331.8 ( 4.59x)
Instead of this:
mc_8tap_regular_w16_hv_8bpc_c: 9.0 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon: 1.9 ( 4.76x)
Co-authored-by: J. Dekker <jdek@itanimul.li>
The refmvs_block struct is only 12 bytes in size, but it's accessed
using 16-byte unaligned loads in asm.
In order to avoid reading past the end of the allocated buffer
we therefore need to pad the allocation size by 4 bytes.
When dav1d is built with HWASan, the build fails because globals are
tagged and the normal adrp/add instruction sequence does not have
enough range to take the tagged address. Therefore, use an alternative
instruction sequence when HWASan is enabled, which is the same as
what the compiler generates.