Commit Graph

2485 Commits

Author SHA1 Message Date
Matthias Dressel c7df9a3e65 CI: Improve coverage for argon samples using different thread counts
Similar to 4796b59fc0.
2024-05-01 13:09:09 +00:00
Matthias Dressel 0f504bf57c CI: Add dotprod to argon tests 2024-05-01 13:09:09 +00:00
Henrik Gramner 223901243c x86: Add 6-tap variants of high bit-depth mc AVX-512 (Ice Lake) functions 2024-04-29 17:59:09 +02:00
Henrik Gramner 8ff97b3a0b x86: Add minor high bit-depth mc AVX-512 improvements 2024-04-29 17:59:09 +02:00
Martin Storsjö 236e1d1912 tools: Make ARM cpu flags imply relevant lower level flags
The --cpumask flag only takes a single flag name; one can't set
a combination like neon+dotprod.

Therefore, apply the same pattern as for x86, by adding mask values
that contain all the implied lower level flags.

This is somewhat complicated, as the set of features isn't entirely
linear - in particular, SVE doesn't imply either dotprod or i8mm,
and SVE2 only implies dotprod, but not i8mm.

This makes sure that "dav1d --cpumask dotprod" actually uses any
SIMD at all, as it previously only set the dotprod flag but not
neon, which essentially opted out from all SIMD.
2024-04-26 15:09:19 +00:00
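The flag implication scheme described in 236e1d1912 can be sketched roughly as below. This is a minimal illustration, not dav1d's actual code: the bit values, the macro names and the `cpumask_from_name` helper are all hypothetical. That SVE implies NEON and that i8mm implies dotprod are assumptions of the sketch; the SVE/SVE2 relationships follow the commit message.

```c
#include <string.h>

/* Hypothetical single-feature bits; not dav1d's actual values. */
enum {
    CPU_NEON    = 1 << 0,
    CPU_DOTPROD = 1 << 1,
    CPU_I8MM    = 1 << 2,
    CPU_SVE     = 1 << 3,
    CPU_SVE2    = 1 << 4,
};

/* Mask values that also contain the implied lower level flags. */
#define MASK_NEON    (CPU_NEON)
#define MASK_DOTPROD (MASK_NEON | CPU_DOTPROD)
#define MASK_I8MM    (MASK_DOTPROD | CPU_I8MM)            /* assumption */
#define MASK_SVE     (MASK_NEON | CPU_SVE)                /* no dotprod, no i8mm */
#define MASK_SVE2    (MASK_SVE | MASK_DOTPROD | CPU_SVE2) /* dotprod, but no i8mm */

/* Map a single flag name to its full mask; returns 0 for unknown names. */
static unsigned cpumask_from_name(const char *name) {
    static const struct { const char *name; unsigned mask; } table[] = {
        { "neon",    MASK_NEON    },
        { "dotprod", MASK_DOTPROD },
        { "i8mm",    MASK_I8MM    },
        { "sve",     MASK_SVE     },
        { "sve2",    MASK_SVE2    },
    };
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (!strcmp(name, table[i].name))
            return table[i].mask;
    return 0;
}
```

With masks built this way, selecting "dotprod" also enables the NEON bit, so the NEON code paths stay available.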
Arpad Panyik 1776c45a08 AArch64: Add basic i8mm support for convolutions
Add an Armv8.6-A i8mm code path for standard bitdepth convolutions.
Only horizontal-vertical (HV) convolutions have 6-tap specialisations
of their vertical passes. All other convolutions are 4- or 8-tap
filters which fit well with the 4-element USDOT instruction.

Benchmarks show a 4-9% FPS increase relative to the Armv8.4-A
code path, depending on the input video and the CPU used.

This patch will increase the .text by around 5.7 KiB.

Relative performance to the C reference on some Cortex CPU cores:

                       Cortex-A715   Cortex-X3  Cortex-A510
regular w4 hv neon:          7.20x      11.20x        4.40x
regular w4 hv dotprod:      12.77x      18.35x        6.21x
regular w4 hv i8mm:         14.50x      21.42x        6.16x

  sharp w4 hv neon:          6.24x       9.77x        3.96x
  sharp w4 hv dotprod:       9.76x      14.02x        5.20x
  sharp w4 hv i8mm:         10.84x      16.09x        5.42x

regular w8 hv neon:          2.17x       2.46x        3.17x
regular w8 hv dotprod:       3.04x       3.11x        3.03x
regular w8 hv i8mm:          3.57x       3.40x        3.27x

  sharp w8 hv neon:          1.72x       1.93x        2.75x
  sharp w8 hv dotprod:       2.49x       2.54x        2.62x
  sharp w8 hv i8mm:          2.80x       2.79x        2.70x

regular w16 hv neon:         1.90x       2.17x        2.02x
regular w16 hv dotprod:      2.59x       2.64x        1.93x
regular w16 hv i8mm:         3.01x       2.85x        2.05x

  sharp w16 hv neon:         1.51x       1.72x        1.74x
  sharp w16 hv dotprod:      2.17x       2.22x        1.70x
  sharp w16 hv i8mm:         2.42x       2.42x        1.72x

regular w32 hv neon:         1.80x       1.96x        1.81x
regular w32 hv dotprod:      2.43x       2.36x        1.74x
regular w32 hv i8mm:         2.83x       2.51x        1.83x

  sharp w32 hv neon:         1.42x       1.54x        1.56x
  sharp w32 hv dotprod:      2.07x       2.00x        1.55x
  sharp w32 hv i8mm:         2.29x       2.16x        1.55x

regular w64 hv neon:         1.82x       1.89x        1.70x
regular w64 hv dotprod:      2.43x       2.25x        1.65x
regular w64 hv i8mm:         2.84x       2.39x        1.73x

  sharp w64 hv neon:         1.43x       1.47x        1.49x
  sharp w64 hv dotprod:      2.08x       1.91x        1.49x
  sharp w64 hv i8mm:         2.30x       2.07x        1.48x

regular w128 hv neon:        1.77x       1.84x        1.75x
regular w128 hv dotprod:     2.37x       2.18x        1.70x
regular w128 hv i8mm:        2.76x       2.33x        1.78x

  sharp w128 hv neon:        1.40x       1.45x        1.42x
  sharp w128 hv dotprod:     2.04x       1.87x        1.43x
  sharp w128 hv i8mm:        2.24x       2.02x        1.42x

regular w8 h neon:           3.16x       3.51x        3.43x
regular w8 h dotprod:        4.97x       7.43x        4.95x
regular w8 h i8mm:           7.28x      10.38x        5.69x

  sharp w8 h neon:           2.71x       2.77x        3.10x
  sharp w8 h dotprod:        4.92x       7.14x        4.94x
  sharp w8 h i8mm:           7.21x      10.11x        5.70x

regular w16 h neon:          2.79x       2.76x        3.53x
regular w16 h dotprod:       3.81x       4.77x        3.13x
regular w16 h i8mm:          5.21x       6.04x        3.56x

  sharp w16 h neon:          2.31x       2.38x        3.12x
  sharp w16 h dotprod:       3.80x       4.74x        3.13x
  sharp w16 h i8mm:          5.20x       5.98x        3.56x

regular w64 h neon:          2.49x       2.46x        2.94x
regular w64 h dotprod:       3.17x       3.60x        2.41x
regular w64 h i8mm:          4.22x       4.40x        2.72x

  sharp w64 h neon:          2.07x       2.06x        2.60x
  sharp w64 h dotprod:       3.16x       3.58x        2.40x
  sharp w64 h i8mm:          4.20x       4.38x        2.71x

regular w8 v neon:           6.11x       8.05x        4.07x
regular w8 v dotprod:        5.45x       8.15x        4.01x
regular w8 v i8mm:           7.30x       9.46x        4.19x

  sharp w8 v neon:           4.23x       5.46x        3.09x
  sharp w8 v dotprod:        5.43x       7.96x        4.01x
  sharp w8 v i8mm:           7.26x       9.12x        4.19x

regular w16 v neon:          3.44x       4.33x        2.40x
regular w16 v dotprod:       3.20x       4.53x        2.85x
regular w16 v i8mm:          4.09x       5.27x        2.87x

  sharp w16 v neon:          2.50x       3.14x        1.82x
  sharp w16 v dotprod:       3.20x       4.52x        2.86x
  sharp w16 v i8mm:          4.09x       5.15x        2.86x

regular w64 v neon:          2.74x       3.11x        1.53x
regular w64 v dotprod:       2.63x       3.30x        1.84x
regular w64 v i8mm:          3.31x       3.73x        1.84x

  sharp w64 v neon:          2.01x       2.29x        1.16x
  sharp w64 v dotprod:       2.61x       3.27x        1.83x
  sharp w64 v i8mm:          3.29x       3.68x        1.84x
2024-04-26 14:04:18 +02:00
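To illustrate why 4- and 8-tap filters map well onto the 4-element USDOT instruction mentioned in 1776c45a08, here is a scalar C model of a single 32-bit lane. USDOT multiplies unsigned 8-bit elements by signed 8-bit elements and accumulates the four products, so an 8-tap filter needs two such operations. The function names are hypothetical and this is only a sketch of the arithmetic, not the assembly in the patch.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of USDOT:
 * acc += u[0]*s[0] + u[1]*s[1] + u[2]*s[2] + u[3]*s[3],
 * with unsigned 8-bit u (pixels) and signed 8-bit s (coefficients). */
static int32_t usdot4(int32_t acc, const uint8_t u[4], const int8_t s[4]) {
    for (int i = 0; i < 4; i++)
        acc += (int32_t)u[i] * (int32_t)s[i];
    return acc;
}

/* An 8-tap filter sum expressed as two chained 4-element dot products. */
static int32_t filter_8tap_usdot(const uint8_t px[8], const int8_t coef[8]) {
    int32_t acc = usdot4(0, px, coef);
    return usdot4(acc, px + 4, coef + 4);
}
```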
Arpad Panyik fbf23637ce AArch64: Simplify DotProd path of 2D subpel filters
Simplify the DotProd code path of the 2D (horizontal-vertical) subpel
filters. It contains some instruction reordering and some macro
simplifications to be more similar to the upcoming i8mm version.

These changes have negligible effect on performance.

Cortex-A510:
mc_8tap_regular_w2_hv_8bpc_dotprod:   8.3769 ->  8.3380
mc_8tap_sharp_w2_hv_8bpc_dotprod:     9.5441 ->  9.5457
mc_8tap_regular_w4_hv_8bpc_dotprod:   8.3422 ->  8.3444
mc_8tap_sharp_w4_hv_8bpc_dotprod:     9.5441 ->  9.5367
mc_8tap_regular_w8_hv_8bpc_dotprod:   9.9852 ->  9.9666
mc_8tap_sharp_w8_hv_8bpc_dotprod:    12.5554 -> 12.5314

Cortex-A55:
mc_8tap_regular_w2_hv_8bpc_dotprod:  6.4504  ->  6.4892
mc_8tap_sharp_w2_hv_8bpc_dotprod:    7.5732  ->  7.6078
mc_8tap_regular_w4_hv_8bpc_dotprod:  6.5088  ->  6.4760
mc_8tap_sharp_w4_hv_8bpc_dotprod:    7.5796  ->  7.5763
mc_8tap_regular_w8_hv_8bpc_dotprod:  9.3384  ->  9.3078
mc_8tap_sharp_w8_hv_8bpc_dotprod:   11.1159  -> 11.1401

Cortex-A78:
mc_8tap_regular_w2_hv_8bpc_dotprod:  1.4122  ->  1.4250
mc_8tap_sharp_w2_hv_8bpc_dotprod:    1.7696  ->  1.7821
mc_8tap_regular_w4_hv_8bpc_dotprod:  1.4243  ->  1.4243
mc_8tap_sharp_w4_hv_8bpc_dotprod:    1.7866  ->  1.7863
mc_8tap_regular_w8_hv_8bpc_dotprod:  2.5304  ->  2.5171
mc_8tap_sharp_w8_hv_8bpc_dotprod:    3.0815  ->  3.0632

Cortex-X1:
mc_8tap_regular_w2_hv_8bpc_dotprod:  0.8195  ->  0.8194
mc_8tap_sharp_w2_hv_8bpc_dotprod:    1.0092  ->  1.0081
mc_8tap_regular_w4_hv_8bpc_dotprod:  0.8197  ->  0.8166
mc_8tap_sharp_w4_hv_8bpc_dotprod:    1.0089  ->  1.0068
mc_8tap_regular_w8_hv_8bpc_dotprod:  1.5230  ->  1.5166
mc_8tap_sharp_w8_hv_8bpc_dotprod:    1.8683  ->  1.8625
2024-04-25 17:02:09 +02:00
Arpad Panyik a40301b33f AArch64: Simplify loads in *hv_filter* of DotProd path
Simplify the load sequences in *hv_filter* functions (ldr + add -> ld1)
to be more uniform and smaller. Performance is not affected.
2024-04-25 17:02:09 +02:00
Arpad Panyik b0685c387d AArch64: Simplify TBL usage in 2D DotProd filters
Simplify the TBL usages in small block size (2, 4) parts of the 2D
(horizontal-vertical) put subpel filters. The 2-register TBLs are
replaced with the 1-register form because we only need the lower
64-bits of the result and it can be extracted from only one source
register. Performance is not affected by this change.
2024-04-25 17:02:09 +02:00
Arpad Panyik ad7938d517 AArch64: Simplify DotProd path of horizontal subpel filters
Simplify the inner loops of the DotProd code path of horizontal
subpel filters to avoid using 2-register TBL instructions. The
store part of block size 16 of the horizontal put case is also
simplified (str + add -> st1). This patch can improve performance
mostly on small cores like Cortex-A510 and newer. Other CPUs are
mostly unaffected.

Cortex-A510:
mct_8tap_sharp_w16_h_8bpc_dotprod:  2.77x -> 3.13x
mct_8tap_sharp_w32_h_8bpc_dotprod:  2.32x -> 2.56x

Cortex-A55:
mct_8tap_sharp_w16_h_8bpc_dotprod:  3.89x -> 3.89x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.35x -> 3.35x

Cortex-A715:
mct_8tap_sharp_w16_h_8bpc_dotprod:  3.79x -> 3.78x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.30x -> 3.30x

Cortex-A78:
mct_8tap_sharp_w16_h_8bpc_dotprod:  4.30x -> 4.31x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.79x -> 3.80x

Cortex-X3:
mct_8tap_sharp_w16_h_8bpc_dotprod:  4.74x -> 4.75x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.89x -> 3.91x

Cortex-X1:
mct_8tap_sharp_w16_h_8bpc_dotprod:  4.61x -> 4.62x
mct_8tap_sharp_w32_h_8bpc_dotprod:  3.67x -> 3.66x
2024-04-25 16:59:53 +02:00
Arpad Panyik 317a94c6bb AArch64: Simplify DotProd path of vertical subpel filters
Simplify the accumulator initializations of the DotProd code path of
vertical subpel filters. This also makes it possible for some CPUs to
use zero latency vector register moves. The load is also simplified
(ldr + add -> ld1) in the inner loop of vertical filter for block
size 16.
2024-04-25 16:59:13 +02:00
Arpad Panyik 7eee4a2059 AArch64: Add \dot parameter to filter_8tap_fn macro
Add \dot parameter to filter_8tap_fn macro in preparation to extend
it with i8mm code path. This patch also contains string fixes and
some instruction reorderings along with some register renaming to
make it more uniform. These changes don't affect performance but
simplify the code a bit.
2024-04-25 16:58:11 +02:00
Martin Storsjö cb8151c969 aarch64: Avoid unaligned jump tables
Manually add a padding 0 entry to make the odd number of .hword
entries align with the instruction size.

This fixes assembling with GAS, with the --gdwarf2 option, where
it previously produced the error message "unaligned opcodes detected
in executable segment".

The message is slightly misleading, as the error is printed even
if there actually are no opcodes that are misaligned, as the jump
table is the last thing within the .text section. The issue can
be reproduced with an input as small as this, assembled with
"as --gdwarf2 -c test.s".

        .text
        nop
        .hword 0

See a6228f47f0 for earlier cases of
the same error - although in those cases, we actually did have more
code and labels following the unaligned jump tables.

This error is present with binutils 2.39 and earlier; in
binutils 2.40, this input no longer is considered an error, fixed
in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=6f6f5b0adc9efd103c434fd316e8c880a259775d.
2024-04-22 09:11:37 +00:00
Kyle Siefring a9feab9bc1 ARM64: Minor msac improvements
One addressing optimization, plus some changes that were missing from a
previous commit that ported improvements from hi tok to other decode tok
functions.
2024-04-21 10:32:43 +00:00
Matthias Dressel 5851901772 CI: Move llvm crossfiles from image to project
Since dav1d was the only user of these crossfiles, it was agreed upon to
remove them from the image [0] and move to dav1d directly. [1]

[0] https://code.videolan.org/videolan/docker-images/-/merge_requests/293
[1] https://code.videolan.org/videolan/docker-images/-/merge_requests/294#note_434720
2024-04-16 11:53:16 +02:00
Kyle Siefring 37d52435d1 ARM64: Port msac improvements to more functions
Port improvements from the hi token functions to the rest of the symbol
adaption functions. These weren't originally ported since they didn't
work with arbitrary padding. In practice, zero padding is already used
and only the tests need to be updated.

Results - Neoverse N1

Old:
msac_decode_symbol_adapt4_c:         41.4 ( 1.00x)
msac_decode_symbol_adapt4_neon:      31.0 ( 1.34x)
msac_decode_symbol_adapt8_c:         54.5 ( 1.00x)
msac_decode_symbol_adapt8_neon:      32.2 ( 1.69x)
msac_decode_symbol_adapt16_c:        85.6 ( 1.00x)
msac_decode_symbol_adapt16_neon:     37.5 ( 2.28x)

New:
msac_decode_symbol_adapt4_c:         41.5 ( 1.00x)
msac_decode_symbol_adapt4_neon:      27.7 ( 1.50x)
msac_decode_symbol_adapt8_c:         55.7 ( 1.00x)
msac_decode_symbol_adapt8_neon:      30.1 ( 1.85x)
msac_decode_symbol_adapt16_c:        82.4 ( 1.00x)
msac_decode_symbol_adapt16_neon:     35.2 ( 2.34x)
2024-04-15 12:38:20 +00:00
Henrik Gramner 5b5399911d x86: Add 6-tap variants of 8bpc mc AVX-512 (Ice Lake) functions
6-tap filtering is only performed vertically due to use of VNNI
instructions processing 4 pixels per instruction horizontally.
2024-04-15 13:19:42 +02:00
Henrik Gramner 38df35d2d1 x86: Add various 8bpc mc AVX-512 improvements 2024-04-15 13:12:20 +02:00
Matthias Dressel 313af0b6a5 CI: Update images
Now with clang 18 and downgraded xz-utils.
2024-04-14 01:57:37 +02:00
Luca Barbato 09f2a21e7c Deduplicate itx macros 2024-04-13 23:19:52 +02:00
Ronald S. Bultje f1c518901b Increase timeout multiplier for aarch64/riscv64/la64-qemu CI jobs
They have been failing occasionally lately.
2024-04-13 09:53:54 -04:00
Matthias Dressel aa63a41ccd cli: Add missing ARM cpumasks help text
Forgotten in acc1121d2f.
2024-04-11 23:15:07 +02:00
Arpad Panyik 9d77b6336a AArch64: Add DotProd support for convolutions
Add an Armv8.4-A DotProd code path for standard bitdepth convolutions.
Only horizontal-vertical (HV) convolutions have 6-tap specialisations
of their vertical passes. All other convolutions are 4- or 8-tap
filters which fit well with the 4-element SDOT instruction.

Benchmarks show an FPS increase of up to 7-29%, depending on the input
video and the CPU used.

This patch will increase the .text by around 6.5 KiB.

Performance highly depends on the SDOT and MLA throughput ratio, as
can be seen in the vertical filter cases. Small cores are also
affected by the TBL execution latencies:

Relative performance to the C reference on some CPUs:

                          A76      A78       X1      A55
regular w4 hv neon:      5.52x    5.78x   10.75x    8.27x
regular w4 hv dotprod:   7.94x    8.49x   16.84x    8.09x
sharp w4 hv neon:        5.27x    5.22x    9.06x    7.87x
sharp w4 hv dotprod:     6.61x    6.73x   12.64x    6.89x

regular w8 hv neon:      1.95x    2.19x    2.56x    3.16x
regular w8 hv dotprod:   3.23x    2.81x    3.20x    3.26x
sharp w8 hv neon:        1.61x    1.79x    2.05x    2.72x
sharp w8 hv dotprod:     2.72x    2.29x    2.66x    2.76x

regular w16 hv neon:     1.63x    2.04x    2.16x    2.73x
regular w16 hv dotprod:  2.72x    2.57x    2.67x    2.80x
sharp w16 hv neon:       1.33x    1.67x    1.74x    2.34x
sharp w16 hv dotprod:    2.31x    2.14x    2.26x    2.39x

regular w32 hv neon:     1.48x    1.92x    1.94x    2.51x
regular w32 hv dotprod:  2.49x    2.40x    2.33x    2.58x
sharp w32 hv neon:       1.21x    1.56x    1.53x    2.14x
sharp w32 hv dotprod:    2.12x    2.02x    2.00x    2.22x

regular w64 hv neon:     1.42x    1.87x    1.85x    2.40x
regular w64 hv dotprod:  2.40x    2.32x    2.21x    2.46x
sharp w64 hv neon:       1.16x    1.52x    1.46x    2.04x
sharp w64 hv dotprod:    2.02x    1.96x    1.90x    2.11x

regular w128 hv neon:    1.39x    1.84x    1.80x    2.27x
regular w128 hv dotprod: 2.33x    2.28x    2.14x    2.35x
sharp w128 hv neon:      1.14x    1.50x    1.42x    1.94x
sharp w128 hv dotprod:   1.98x    1.93x    1.84x    2.03x

regular w8 h neon:       2.61x    3.20x    3.51x    3.55x
regular w8 h dotprod:    4.43x    5.17x    6.26x    4.30x
sharp w8 h neon:         2.01x    2.80x    2.89x    3.12x
sharp w8 h dotprod:      4.42x    5.16x    6.27x    4.28x

regular w16 h neon:      2.17x    3.13x    2.92x    3.35x
regular w16 h dotprod:   4.38x    4.27x    4.53x    3.90x
sharp w16 h neon:        1.74x    2.65x    2.48x    2.92x
sharp w16 h dotprod:     4.33x    4.27x    4.53x    3.91x

regular w64 h neon:      1.92x    2.82x    2.39x    2.96x
regular w64 h dotprod:   3.68x    3.60x    3.40x    3.18x
sharp w64 h neon:        1.47x    2.33x    2.05x    2.54x
sharp w64 h dotprod:     3.68x    3.60x    3.40x    3.17x

regular w4 v neon:       5.39x    7.38x   10.27x   11.41x
regular w4 v dotprod:    9.46x   14.15x   18.72x    9.84x
sharp w4 v neon:         4.51x    6.39x    8.17x   10.70x
sharp w4 v dotprod:      9.35x   14.20x   18.63x    9.78x

regular w16 v neon:      3.03x    4.03x    4.65x    6.28x
regular w16 v dotprod:   4.64x    3.75x    4.78x    3.89x
sharp w16 v neon:        2.29x    3.09x    3.44x    5.52x
sharp w16 v dotprod:     4.62x    3.74x    4.77x    3.89x

regular w64 v neon:      2.17x    3.14x    3.19x    4.46x
regular w64 v dotprod:   3.43x    3.00x    3.31x    2.74x
sharp w64 v neon:        1.61x    2.42x    2.34x    3.89x
sharp w64 v dotprod:     3.38x    3.00x    3.29x    2.73x
2024-04-11 19:03:58 +02:00
Henrik Gramner dc9490134f meson: Enable parallel execution of checkasm in 'meson test'
It was originally disabled due to older meson versions mixing the output
of 'meson test -v' from different tests, which made the log difficult to
read. Newer versions, however, cache the output from each test as it runs
and prints it in one contiguous block, so that's no longer an issue.
2024-04-08 22:51:15 +02:00
Henrik Gramner f6e05da093 cdf: Combine memcpy() calls in dav1d_cdf_thread_copy()
Place multiple default contexts inside a single outer struct so
that copying can be performed in larger blocks.
2024-04-08 20:25:59 +02:00
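The idea behind f6e05da093 can be sketched as follows: once several contexts live in one outer struct, they are contiguous and can be copied with a single memcpy(). The type names and array sizes below are hypothetical stand-ins, not dav1d's real CDF structures.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical context types standing in for the real CDF structs. */
typedef struct { uint16_t cdf[16]; } CtxA;
typedef struct { uint16_t cdf[8];  } CtxB;
typedef struct { uint16_t cdf[4];  } CtxC;

/* Grouping the contexts in a single outer struct makes them contiguous. */
typedef struct {
    CtxA a;
    CtxB b;
    CtxC c;
} DefaultCtx;

/* One large copy instead of one memcpy() call per member. */
static void copy_defaults(DefaultCtx *dst, const DefaultCtx *src) {
    memcpy(dst, src, sizeof(*dst));
}
```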
Henrik Gramner c8add4f8bf cdf: Reduce code size of dav1d_cdf_thread_update()
Reorder CDF arrays so that copying can be performed in larger blocks.
2024-04-08 20:25:59 +02:00
Henrik Gramner ed24201356 cdf: Make qcat calculation branchless 2024-04-08 20:25:58 +02:00
Henrik Gramner 67fcf01bf2 decode: Simplify read_mv_residual() 2024-04-08 20:25:58 +02:00
Henrik Gramner 17a2180a61 cdf: Remove separate intra-only dmv contexts
We can simply use the regular mv contexts for intra frames.

They are mutually exclusive, and the dmv contexts were already
discarded and replaced with default contexts on frame completion.
2024-04-08 20:25:58 +02:00
Henrik Gramner e2145f5295 cdf: Skip unnecessary context copying in dav1d_cdf_thread_update()
The intrabc and dmv contexts are never reused between frames.
2024-04-08 20:25:58 +02:00
Henrik Gramner e27b451e2a cli: Handle SIGINT and SIGTERM more gracefully
Attempt to finish writing the current frame before exiting to avoid
ending up with a partially written frame at the end of the output file.

Only try catching a signal once, falling back to the default behavior
of exiting immediately the second time a given signal is raised.
2024-04-04 13:06:12 +00:00
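The "catch once, then fall back to the default" behavior described in e27b451e2a could look roughly like this. It is a minimal sketch using the standard signal() interface; the names `stop_requested`, `on_signal` and `install_handlers` are hypothetical and the actual CLI code may be structured differently.

```c
#include <signal.h>

/* Set by the handler; polled by the main loop (hypothetical name). */
static volatile sig_atomic_t stop_requested = 0;

static void on_signal(int sig) {
    stop_requested = 1;
    /* Restore the default action so that a second signal of the same
     * kind terminates the process immediately. */
    signal(sig, SIG_DFL);
}

static void install_handlers(void) {
    signal(SIGINT, on_signal);
    signal(SIGTERM, on_signal);
}
```

A decoding loop would then check `stop_requested` after finishing each frame write and exit cleanly once it is set.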
Kyle Siefring 72dfbc075b ARM64: Improve hi_tok msac
Before:
msac_decode_hi_tok_c:               259.5 ( 1.00x)
msac_decode_hi_tok_neon:            220.7 ( 1.18x)
msac_decode_symbol_adapt4_c:        105.7 ( 1.00x)
msac_decode_symbol_adapt4_neon:      63.3 ( 1.67x)

After:
msac_decode_hi_tok_c:               260.9 ( 1.00x)
msac_decode_hi_tok_neon:            197.9 ( 1.32x)
msac_decode_symbol_adapt4_c:        105.7 ( 1.00x)
msac_decode_symbol_adapt4_neon:      63.3 ( 1.67x)

decode_symbol_adapt4 is not changed, but is included for reference since
decode_hi_tok calls it.
2024-04-03 09:23:21 +00:00
Martin Storsjö 5e31720b89 checkasm: Add support for the private macOS kperf API for benchmarking
On AArch64, the performance counter registers usually are
restricted and not accessible from user space.

On macOS, we currently use mach_absolute_time() as timer on
aarch64. This measures wallclock time but with a very coarse
resolution.

There is a private API, kperf, that one can use for getting
high precision timers though. Unfortunately, it requires running
the checkasm binary as root (e.g. with sudo).

Also, as it is a private, undocumented API, it can potentially
change at any time.

This is handled by adding a new meson build option, for switching
to this timer. If the timer source in checkasm could be changed
at runtime with an option, this wouldn't need to be a build time
option.

This allows getting benchmarks like this:

mc_8tap_regular_w16_hv_8bpc_c:              1522.1 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon:            331.8 ( 4.59x)

Instead of this:

mc_8tap_regular_w16_hv_8bpc_c:                 9.0 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon:              1.9 ( 4.76x)

Co-authored-by: J. Dekker <jdek@itanimul.li>
2024-04-02 10:35:29 +00:00
Henrik Gramner abc8a1689f lf_mask: Align lvl buffers
Ensures that SIMD stores performed by memset() are aligned.
2024-03-28 15:58:36 +01:00
Henrik Gramner 119df64b21 lf_mask: Use sizeof() in memset() size calculations 2024-03-28 15:58:35 +01:00
Henrik Gramner df3dafddc3 lf_mask: Use a union type for last_delta_lf
On architectures without unaligned load capabilities the compiler will
otherwise load the individual 8-bit values one at a time.
2024-03-28 15:58:34 +01:00
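A rough sketch of the union approach from df3dafddc3: overlaying the 8-bit values with a 32-bit member raises the alignment of the object, so the compiler can handle all values with one 32-bit access. The type name, helper names and the element count of four are assumptions made for this illustration.

```c
#include <stdint.h>

/* Hypothetical union: four 8-bit deltas overlaid with one 32-bit word.
 * The uint32_t member gives the whole object 32-bit alignment. */
typedef union {
    int8_t   i8[4];
    uint32_t u32;
} DeltaLf;

/* Copy all four values with a single aligned 32-bit load and store. */
static void delta_lf_copy(DeltaLf *dst, const DeltaLf *src) {
    dst->u32 = src->u32;
}

/* Compare all four values with a single 32-bit comparison. */
static int delta_lf_equal(const DeltaLf *a, const DeltaLf *b) {
    return a->u32 == b->u32;
}
```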
Henrik Gramner 076955a153 refmvs: Fix buffer overread in save_tmvs() asm
The refmvs_block struct is only 12 bytes large but it's accessed
using 16-byte unaligned loads in asm.

In order to avoid reading past the end of the allocated buffer
we therefore need to pad the allocation size by 4 bytes.
2024-03-28 01:41:28 +01:00
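The padding arithmetic in 076955a153 can be worked through in a few lines: a 16-byte load that starts at the last 12-byte element reads 4 bytes past the end of the array, so the allocation needs 4 extra bytes. `Block12` and `padded_alloc_size` are hypothetical names used only for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte stand-in for the refmvs_block struct. */
typedef struct { uint8_t bytes[12]; } Block12;

enum { LOAD_SIZE = 16 }; /* width of the unaligned loads used in asm */

/* Bytes to allocate so that a LOAD_SIZE-byte load starting at the
 * last element never reads past the end of the buffer. */
static size_t padded_alloc_size(size_t n_blocks) {
    return n_blocks * sizeof(Block12) + (LOAD_SIZE - sizeof(Block12));
}
```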
Henrik Gramner 3d98a242a0 x86: Add 6-tap variants of high bit-depth mc AVX2 functions 2024-03-22 11:11:58 +01:00
Henrik Gramner b3323a8ccd x86: Add minor high bit-depth mc 8-tap AVX2 improvements 2024-03-22 10:41:45 +01:00
Henrik Gramner 9849ede130 x86: Add 6-tap variants of 8bpc mc AVX2 functions
6-tap filters are sufficient in the majority of cases, and are
quite a bit faster than the equivalent 8-tap filters.
2024-03-21 12:30:05 +00:00
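One way to read the 6-tap claim in 9849ede130 is that a 6-tap variant can replace the 8-tap one whenever the two outermost coefficients are zero, since the result is then identical with two fewer multiplies. That reading is an assumption of this sketch, and the function names and any coefficient values used with it are illustrative only, not taken from dav1d's filter tables.

```c
#include <stdint.h>

/* Reference 8-tap filter sum over 8 consecutive pixels. */
static int filter_8tap_ref(const uint8_t src[8], const int8_t f[8]) {
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += f[i] * src[i];
    return sum;
}

/* 6-tap variant: skips f[0] and f[7], so it is only equivalent
 * when both of those coefficients are zero. */
static int filter_6tap(const uint8_t src[8], const int8_t f[8]) {
    int sum = 0;
    for (int i = 1; i < 7; i++)
        sum += f[i] * src[i];
    return sum;
}

/* True if the 6-tap variant gives the same result as the 8-tap one. */
static int can_use_6tap(const int8_t f[8]) {
    return f[0] == 0 && f[7] == 0;
}
```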
Henrik Gramner 02c2033a1e x86: Add minor 8bpc mc 8-tap AVX2 improvements 2024-03-21 12:30:05 +00:00
Peter Collingbourne 8e08426468 arm64: Use different instruction sequence for taking global address with HWASan
When dav1d is built with HWASan, the build fails because globals are
tagged and the normal adrp/add instruction sequence does not have
enough range to take the tagged address. Therefore, use an alternative
instruction sequence when HWASan is enabled, which is the same as
what the compiler generates.
2024-03-18 20:50:37 +00:00
Henrik Gramner 645da27785 x86: Update x86inc.asm
8494a52b95
04f14f431c
2024-03-15 12:19:27 +01:00
Henrik Gramner 8b46166852 ci: Make checkasm work on the x86-32 build 2024-03-15 12:19:24 +01:00
Jean-Baptiste Kempf 872e470ebf NEWS: Forgotten intro sentence 2024-03-09 11:00:33 +01:00
Jean-Baptiste Kempf 162fb6d85c Update to 1.4.1 2024-03-08 22:49:10 +00:00
Matthias Dressel b9312c8dd8 Update THANKS.md 2024-03-08 23:24:30 +01:00
Martin Storsjö 024b260cb9 arm32: Fix right shifts in the 16bpc iwht implementation
These shifts used the wrong element size; this only was noticed in
some argon tests.
2024-03-08 21:49:57 +00:00
Nathan E. Egge 0fff614a4c arm32/msac: Trim C functions, saves 1024 bytes 2024-03-08 20:26:46 +00:00
Nathan E. Egge b9f5333021 arm64/msac: Trim C functions, saves 1392 bytes 2024-03-08 20:16:13 +00:00