The --cpumask flag only takes a single flag name; one can't set
a combination like neon+dotprod.
Therefore, apply the same pattern as for x86, by adding mask values
that contain all the implied lower level flags.
This is somewhat complicated, as the set of features isn't entirely
linear - in particular, SVE doesn't imply either dotprod or i8mm,
and SVE2 only implies dotprod, but not i8mm.
This makes sure that "dav1d --cpumask dotprod" actually uses any
SIMD at all; previously it only set the dotprod flag but not
neon, which essentially opted out of all SIMD.
Simplify the TBL usages in small block size (2, 4) parts of the 2D
(horizontal-vertical) put subpel filters. The 2-register TBLs are
replaced with the 1-register form because we only need the lower
64 bits of the result, which can be extracted from a single source
register. Performance is not affected by this change.
Simplify the inner loops of the DotProd code path of horizontal
subpel filters to avoid using 2-register TBL instructions. The
store part of block size 16 of the horizontal put case is also
simplified (str + add -> st1). This patch can improve performance,
mostly on small cores like the Cortex-A510 and newer; other CPUs
are largely unaffected.
Cortex-A510:
mct_8tap_sharp_w16_h_8bpc_dotprod: 2.77x -> 3.13x
mct_8tap_sharp_w32_h_8bpc_dotprod: 2.32x -> 2.56x
Cortex-A55:
mct_8tap_sharp_w16_h_8bpc_dotprod: 3.89x -> 3.89x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.35x -> 3.35x
Cortex-A715:
mct_8tap_sharp_w16_h_8bpc_dotprod: 3.79x -> 3.78x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.30x -> 3.30x
Cortex-A78:
mct_8tap_sharp_w16_h_8bpc_dotprod: 4.30x -> 4.31x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.79x -> 3.80x
Cortex-X3:
mct_8tap_sharp_w16_h_8bpc_dotprod: 4.74x -> 4.75x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.89x -> 3.91x
Cortex-X1:
mct_8tap_sharp_w16_h_8bpc_dotprod: 4.61x -> 4.62x
mct_8tap_sharp_w32_h_8bpc_dotprod: 3.67x -> 3.66x
Simplify the accumulator initializations of the DotProd code path of
vertical subpel filters. This also makes it possible for some CPUs to
use zero-latency vector register moves. The load in the inner loop
of the vertical filter for block size 16 is also simplified
(ldr + add -> ld1).
Add a \dot parameter to the filter_8tap_fn macro in preparation for
extending it with an i8mm code path. This patch also contains string
fixes and some instruction reordering, along with some register
renaming to make the code more uniform. These changes don't affect
performance but simplify the code a bit.
Manually add a padding 0 entry to make the odd number of .hword
entries align with the instruction size.
This fixes assembling with GAS, with the --gdwarf2 option, where
it previously produced the error message "unaligned opcodes detected
in executable segment".
The message is slightly misleading, as the error is printed even
if there actually are no opcodes that are misaligned, as the jump
table is the last thing within the .text section. The issue can
be reproduced with an input as small as this, assembled with
"as --gdwarf2 -c test.s".
.text
nop
.hword 0
See a6228f47f0 for earlier cases of
the same error - although in those cases, we actually did have more
code and labels following the unaligned jump tables.
This error is present with binutils 2.39 and earlier; in
binutils 2.40, this input is no longer considered an error, fixed
in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=6f6f5b0adc9efd103c434fd316e8c880a259775d.
Port improvements from the hi token functions to the rest of the symbol
adaption functions. These weren't originally ported since they didn't
work with arbitrary padding. In practice, zero padding is already used
and only the tests need to be updated.
Results - Neoverse N1
Old:
msac_decode_symbol_adapt4_c: 41.4 ( 1.00x)
msac_decode_symbol_adapt4_neon: 31.0 ( 1.34x)
msac_decode_symbol_adapt8_c: 54.5 ( 1.00x)
msac_decode_symbol_adapt8_neon: 32.2 ( 1.69x)
msac_decode_symbol_adapt16_c: 85.6 ( 1.00x)
msac_decode_symbol_adapt16_neon: 37.5 ( 2.28x)
New:
msac_decode_symbol_adapt4_c: 41.5 ( 1.00x)
msac_decode_symbol_adapt4_neon: 27.7 ( 1.50x)
msac_decode_symbol_adapt8_c: 55.7 ( 1.00x)
msac_decode_symbol_adapt8_neon: 30.1 ( 1.85x)
msac_decode_symbol_adapt16_c: 82.4 ( 1.00x)
msac_decode_symbol_adapt16_neon: 35.2 ( 2.34x)
It was originally disabled due to older meson versions mixing the output
of 'meson test -v' from different tests, which made the log difficult to
read. Newer versions, however, cache the output from each test as it runs
and print it in one contiguous block, so that's no longer an issue.
We can simply use the regular mv contexts for intra frames.
They are mutually exclusive, and the dmv contexts were already
discarded and replaced with default contexts on frame completion.
Attempt to finish writing the current frame before exiting to avoid
ending up with a partially written frame at the end of the output file.
Only try catching a signal once, falling back to the default behavior
of exiting immediately the second time a given signal is raised.
On AArch64, the performance counter registers are usually
restricted and not accessible from user space.
On macOS, we currently use mach_absolute_time() as the timer on
aarch64. This measures wall-clock time, but with very coarse
resolution.
There is a private API, kperf, that can be used to get
high-precision timers, though. Unfortunately, it requires running
the checkasm binary as root (e.g. with sudo).
Also, as it is a private, undocumented API, it can potentially
change at any time.
This is handled by adding a new meson build option, for switching
to this timer. If the timer source in checkasm could be changed
at runtime with an option, this wouldn't need to be a build time
option.
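A build option like the one described could look roughly like this in meson_options.txt; the actual option name and description in dav1d may differ.

```meson
# Hypothetical sketch of the build-time switch for the kperf timer.
option('macos_kperf', type: 'boolean', value: false,
       description: 'Use the private macOS kperf API for checkasm timing (requires root)')
```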
This allows getting benchmarks like this:
mc_8tap_regular_w16_hv_8bpc_c: 1522.1 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon: 331.8 ( 4.59x)
Instead of this:
mc_8tap_regular_w16_hv_8bpc_c: 9.0 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_neon: 1.9 ( 4.76x)
Co-authored-by: J. Dekker <jdek@itanimul.li>
The refmvs_block struct is only 12 bytes in size, but it's accessed
using 16-byte unaligned loads in asm.
In order to avoid reading past the end of the allocated buffer
we therefore need to pad the allocation size by 4 bytes.
When dav1d is built with HWASan, the build fails because globals are
tagged and the normal adrp/add instruction sequence does not have
enough range to take the tagged address. Therefore, use an alternative
instruction sequence when HWASan is enabled, which is the same as
what the compiler generates.