yuanhecai
fbefb34ae9
loongarch: Improve one function in itx_8bpc.add_8x32 series
...
1. inv_txfm_add_dct_dct_8x32
Relative speedup over C code:
inv_txfm_add_8x32_dct_dct_0_8bpc_c: 33.3 ( 1.00x)
inv_txfm_add_8x32_dct_dct_0_8bpc_lsx: 2.1 (15.58x)
inv_txfm_add_8x32_dct_dct_1_8bpc_c: 311.1 ( 1.00x)
inv_txfm_add_8x32_dct_dct_1_8bpc_lsx: 24.9 (12.49x)
inv_txfm_add_8x32_dct_dct_2_8bpc_c: 308.4 ( 1.00x)
inv_txfm_add_8x32_dct_dct_2_8bpc_lsx: 24.9 (12.37x)
inv_txfm_add_8x32_dct_dct_3_8bpc_c: 309.3 ( 1.00x)
inv_txfm_add_8x32_dct_dct_3_8bpc_lsx: 25.0 (12.37x)
inv_txfm_add_8x32_dct_dct_4_8bpc_c: 308.4 ( 1.00x)
inv_txfm_add_8x32_dct_dct_4_8bpc_lsx: 25.0 (12.35x)
2024-01-21 15:31:46 +08:00
yuanhecai
8c32cde7c1
loongarch: Improve six functions in itx_8bpc.add_16x16 series
...
1. inv_txfm_add_dct_dct_16x16
2. inv_txfm_add_adst_adst_16x16
3. inv_txfm_add_adst_dct_16x16
4. inv_txfm_add_dct_adst_16x16
5. inv_txfm_add_flipadst_dct_16x16
6. inv_txfm_add_dct_flipadst_16x16
Relative speedup over C code:
inv_txfm_add_16x16_adst_adst_0_8bpc_c: 327.6 ( 1.00x)
inv_txfm_add_16x16_adst_adst_0_8bpc_lsx: 30.5 (10.74x)
inv_txfm_add_16x16_adst_adst_1_8bpc_c: 327.6 ( 1.00x)
inv_txfm_add_16x16_adst_adst_1_8bpc_lsx: 30.7 (10.67x)
inv_txfm_add_16x16_adst_adst_2_8bpc_c: 327.6 ( 1.00x)
inv_txfm_add_16x16_adst_adst_2_8bpc_lsx: 30.5 (10.73x)
inv_txfm_add_16x16_adst_dct_0_8bpc_c: 321.0 ( 1.00x)
inv_txfm_add_16x16_adst_dct_0_8bpc_lsx: 27.6 (11.64x)
inv_txfm_add_16x16_adst_dct_1_8bpc_c: 320.9 ( 1.00x)
inv_txfm_add_16x16_adst_dct_1_8bpc_lsx: 27.4 (11.70x)
inv_txfm_add_16x16_adst_dct_2_8bpc_c: 320.8 ( 1.00x)
inv_txfm_add_16x16_adst_dct_2_8bpc_lsx: 27.5 (11.67x)
inv_txfm_add_16x16_dct_adst_0_8bpc_c: 321.1 ( 1.00x)
inv_txfm_add_16x16_dct_adst_0_8bpc_lsx: 27.1 (11.85x)
inv_txfm_add_16x16_dct_adst_1_8bpc_c: 321.1 ( 1.00x)
inv_txfm_add_16x16_dct_adst_1_8bpc_lsx: 27.2 (11.80x)
inv_txfm_add_16x16_dct_adst_2_8bpc_c: 329.4 ( 1.00x)
inv_txfm_add_16x16_dct_adst_2_8bpc_lsx: 27.2 (12.10x)
inv_txfm_add_16x16_dct_dct_0_8bpc_c: 31.9 ( 1.00x)
inv_txfm_add_16x16_dct_dct_0_8bpc_lsx: 1.8 (18.14x)
inv_txfm_add_16x16_dct_dct_1_8bpc_c: 314.3 ( 1.00x)
inv_txfm_add_16x16_dct_dct_1_8bpc_lsx: 23.9 (13.16x)
inv_txfm_add_16x16_dct_dct_2_8bpc_c: 314.3 ( 1.00x)
inv_txfm_add_16x16_dct_dct_2_8bpc_lsx: 24.1 (13.05x)
inv_txfm_add_16x16_dct_flipadst_0_8bpc_c: 321.0 ( 1.00x)
inv_txfm_add_16x16_dct_flipadst_0_8bpc_lsx: 27.1 (11.83x)
inv_txfm_add_16x16_dct_flipadst_1_8bpc_c: 321.0 ( 1.00x)
inv_txfm_add_16x16_dct_flipadst_1_8bpc_lsx: 27.1 (11.84x)
inv_txfm_add_16x16_dct_flipadst_2_8bpc_c: 327.7 ( 1.00x)
inv_txfm_add_16x16_dct_flipadst_2_8bpc_lsx: 27.1 (12.07x)
inv_txfm_add_16x16_flipadst_dct_0_8bpc_c: 322.6 ( 1.00x)
inv_txfm_add_16x16_flipadst_dct_0_8bpc_lsx: 28.1 (11.49x)
inv_txfm_add_16x16_flipadst_dct_1_8bpc_c: 322.5 ( 1.00x)
inv_txfm_add_16x16_flipadst_dct_1_8bpc_lsx: 28.1 (11.48x)
inv_txfm_add_16x16_flipadst_dct_2_8bpc_c: 322.7 ( 1.00x)
inv_txfm_add_16x16_flipadst_dct_2_8bpc_lsx: 28.0 (11.53x)
2024-01-21 15:31:46 +08:00
yuanhecai
233be20140
loongarch: Improve one function in itx_8bpc.add_4x8 series
...
1. inv_txfm_add_dct_dct_4x8
Relative speedup over C code:
inv_txfm_add_4x8_dct_dct_0_8bpc_c: 5.7 ( 1.00x)
inv_txfm_add_4x8_dct_dct_0_8bpc_lsx: 0.8 ( 7.12x)
inv_txfm_add_4x8_dct_dct_1_8bpc_c: 34.5 ( 1.00x)
inv_txfm_add_4x8_dct_dct_1_8bpc_lsx: 3.0 (11.64x)
2024-01-21 15:31:46 +08:00
yuanhecai
8626d9f9a6
loongarch: Improve two functions in itx_8bpc.add_16x8 series
...
1. inv_txfm_add_dct_dct_16x8
2. inv_txfm_add_adst_dct_16x8
Relative speedup over C code:
inv_txfm_add_16x8_adst_dct_0_8bpc_c: 152.1 ( 1.00x)
inv_txfm_add_16x8_adst_dct_0_8bpc_lsx: 13.7 (11.08x)
inv_txfm_add_16x8_adst_dct_1_8bpc_c: 152.1 ( 1.00x)
inv_txfm_add_16x8_adst_dct_1_8bpc_lsx: 13.7 (11.07x)
inv_txfm_add_16x8_adst_dct_2_8bpc_c: 152.1 ( 1.00x)
inv_txfm_add_16x8_adst_dct_2_8bpc_lsx: 13.7 (11.08x)
inv_txfm_add_16x8_dct_dct_0_8bpc_c: 17.0 ( 1.00x)
inv_txfm_add_16x8_dct_dct_0_8bpc_lsx: 1.3 (13.10x)
inv_txfm_add_16x8_dct_dct_1_8bpc_c: 147.5 ( 1.00x)
inv_txfm_add_16x8_dct_dct_1_8bpc_lsx: 11.6 (12.73x)
inv_txfm_add_16x8_dct_dct_2_8bpc_c: 147.5 ( 1.00x)
inv_txfm_add_16x8_dct_dct_2_8bpc_lsx: 11.6 (12.74x)
2024-01-21 15:31:46 +08:00
yuanhecai
5ebe32283e
loongarch: Improve four functions in itx_8bpc.add_8x16 series
...
1. inv_txfm_add_dct_dct_8x16
2. inv_txfm_add_identity_identity_8x16
3. inv_txfm_add_adst_dct_8x16
4. inv_txfm_add_dct_adst_8x16
Relative speedup over C code:
inv_txfm_add_8x16_adst_dct_0_8bpc_c: 151.0 ( 1.00x)
inv_txfm_add_8x16_adst_dct_0_8bpc_lsx: 14.9 (10.10x)
inv_txfm_add_8x16_adst_dct_1_8bpc_c: 151.0 ( 1.00x)
inv_txfm_add_8x16_adst_dct_1_8bpc_lsx: 15.0 (10.10x)
inv_txfm_add_8x16_adst_dct_2_8bpc_c: 151.0 ( 1.00x)
inv_txfm_add_8x16_adst_dct_2_8bpc_lsx: 15.0 (10.10x)
inv_txfm_add_8x16_dct_adst_0_8bpc_c: 157.3 ( 1.00x)
inv_txfm_add_8x16_dct_adst_0_8bpc_lsx: 13.6 (11.59x)
inv_txfm_add_8x16_dct_adst_1_8bpc_c: 154.9 ( 1.00x)
inv_txfm_add_8x16_dct_adst_1_8bpc_lsx: 13.6 (11.38x)
inv_txfm_add_8x16_dct_adst_2_8bpc_c: 154.8 ( 1.00x)
inv_txfm_add_8x16_dct_adst_2_8bpc_lsx: 13.5 (11.46x)
inv_txfm_add_8x16_dct_dct_0_8bpc_c: 17.8 ( 1.00x)
inv_txfm_add_8x16_dct_dct_0_8bpc_lsx: 1.5 (11.75x)
inv_txfm_add_8x16_dct_dct_1_8bpc_c: 149.4 ( 1.00x)
inv_txfm_add_8x16_dct_dct_1_8bpc_lsx: 12.0 (12.49x)
inv_txfm_add_8x16_dct_dct_2_8bpc_c: 159.5 ( 1.00x)
inv_txfm_add_8x16_dct_dct_2_8bpc_lsx: 12.0 (13.33x)
inv_txfm_add_8x16_identity_identity_0_8bpc_c: 75.0 ( 1.00x)
inv_txfm_add_8x16_identity_identity_0_8bpc_lsx: 6.0 (12.50x)
inv_txfm_add_8x16_identity_identity_1_8bpc_c: 67.4 ( 1.00x)
inv_txfm_add_8x16_identity_identity_1_8bpc_lsx: 6.0 (11.26x)
inv_txfm_add_8x16_identity_identity_2_8bpc_c: 66.7 ( 1.00x)
inv_txfm_add_8x16_identity_identity_2_8bpc_lsx: 5.9 (11.40x)
2024-01-21 15:31:46 +08:00
yuanhecai
32809a0222
loongarch: Improve the performance of itx_8bpc.add_8x8 series functions
...
Relative speedup over C code:
inv_txfm_add_8x8_adst_adst_0_8bpc_c: 70.1 ( 1.00x)
inv_txfm_add_8x8_adst_adst_0_8bpc_lsx: 9.4 ( 7.45x)
inv_txfm_add_8x8_adst_adst_1_8bpc_c: 70.1 ( 1.00x)
inv_txfm_add_8x8_adst_adst_1_8bpc_lsx: 9.4 ( 7.43x)
inv_txfm_add_8x8_adst_dct_0_8bpc_c: 68.7 ( 1.00x)
inv_txfm_add_8x8_adst_dct_0_8bpc_lsx: 7.6 ( 9.08x)
inv_txfm_add_8x8_adst_dct_1_8bpc_c: 68.7 ( 1.00x)
inv_txfm_add_8x8_adst_dct_1_8bpc_lsx: 7.6 ( 9.00x)
inv_txfm_add_8x8_adst_flipadst_0_8bpc_c: 70.3 ( 1.00x)
inv_txfm_add_8x8_adst_flipadst_0_8bpc_lsx: 9.4 ( 7.47x)
inv_txfm_add_8x8_adst_flipadst_1_8bpc_c: 70.3 ( 1.00x)
inv_txfm_add_8x8_adst_flipadst_1_8bpc_lsx: 9.4 ( 7.47x)
inv_txfm_add_8x8_adst_identity_0_8bpc_c: 50.6 ( 1.00x)
inv_txfm_add_8x8_adst_identity_0_8bpc_lsx: 5.7 ( 8.88x)
inv_txfm_add_8x8_adst_identity_1_8bpc_c: 49.8 ( 1.00x)
inv_txfm_add_8x8_adst_identity_1_8bpc_lsx: 5.7 ( 8.73x)
inv_txfm_add_8x8_dct_adst_0_8bpc_c: 67.9 ( 1.00x)
inv_txfm_add_8x8_dct_adst_0_8bpc_lsx: 7.5 ( 9.05x)
inv_txfm_add_8x8_dct_adst_1_8bpc_c: 67.9 ( 1.00x)
inv_txfm_add_8x8_dct_adst_1_8bpc_lsx: 7.4 ( 9.13x)
inv_txfm_add_8x8_dct_dct_0_8bpc_c: 9.1 ( 1.00x)
inv_txfm_add_8x8_dct_dct_0_8bpc_lsx: 0.8 (11.20x)
inv_txfm_add_8x8_dct_dct_1_8bpc_c: 66.5 ( 1.00x)
inv_txfm_add_8x8_dct_dct_1_8bpc_lsx: 5.0 (13.42x)
inv_txfm_add_8x8_dct_flipadst_0_8bpc_c: 67.9 ( 1.00x)
inv_txfm_add_8x8_dct_flipadst_0_8bpc_lsx: 7.5 ( 9.06x)
inv_txfm_add_8x8_dct_flipadst_1_8bpc_c: 67.9 ( 1.00x)
inv_txfm_add_8x8_dct_flipadst_1_8bpc_lsx: 7.5 ( 9.06x)
inv_txfm_add_8x8_dct_identity_0_8bpc_c: 47.3 ( 1.00x)
inv_txfm_add_8x8_dct_identity_0_8bpc_lsx: 3.7 (12.70x)
inv_txfm_add_8x8_dct_identity_1_8bpc_c: 47.3 ( 1.00x)
inv_txfm_add_8x8_dct_identity_1_8bpc_lsx: 3.7 (12.70x)
inv_txfm_add_8x8_flipadst_adst_0_8bpc_c: 70.3 ( 1.00x)
inv_txfm_add_8x8_flipadst_adst_0_8bpc_lsx: 9.6 ( 7.35x)
inv_txfm_add_8x8_flipadst_adst_1_8bpc_c: 70.3 ( 1.00x)
inv_txfm_add_8x8_flipadst_adst_1_8bpc_lsx: 9.6 ( 7.33x)
inv_txfm_add_8x8_flipadst_dct_0_8bpc_c: 68.9 ( 1.00x)
inv_txfm_add_8x8_flipadst_dct_0_8bpc_lsx: 7.6 ( 9.10x)
inv_txfm_add_8x8_flipadst_dct_1_8bpc_c: 68.9 ( 1.00x)
inv_txfm_add_8x8_flipadst_dct_1_8bpc_lsx: 7.6 ( 9.11x)
inv_txfm_add_8x8_flipadst_flipadst_0_8bpc_c: 70.4 ( 1.00x)
inv_txfm_add_8x8_flipadst_flipadst_0_8bpc_lsx: 9.6 ( 7.32x)
inv_txfm_add_8x8_flipadst_flipadst_1_8bpc_c: 70.4 ( 1.00x)
inv_txfm_add_8x8_flipadst_flipadst_1_8bpc_lsx: 9.6 ( 7.34x)
inv_txfm_add_8x8_flipadst_identity_0_8bpc_c: 49.9 ( 1.00x)
inv_txfm_add_8x8_flipadst_identity_0_8bpc_lsx: 5.6 ( 8.91x)
inv_txfm_add_8x8_flipadst_identity_1_8bpc_c: 49.9 ( 1.00x)
inv_txfm_add_8x8_flipadst_identity_1_8bpc_lsx: 5.6 ( 8.91x)
inv_txfm_add_8x8_identity_adst_0_8bpc_c: 51.3 ( 1.00x)
inv_txfm_add_8x8_identity_adst_0_8bpc_lsx: 5.5 ( 9.28x)
inv_txfm_add_8x8_identity_adst_1_8bpc_c: 51.3 ( 1.00x)
inv_txfm_add_8x8_identity_adst_1_8bpc_lsx: 5.5 ( 9.28x)
inv_txfm_add_8x8_identity_dct_0_8bpc_c: 50.5 ( 1.00x)
inv_txfm_add_8x8_identity_dct_0_8bpc_lsx: 3.6 (13.83x)
inv_txfm_add_8x8_identity_dct_1_8bpc_c: 50.6 ( 1.00x)
inv_txfm_add_8x8_identity_dct_1_8bpc_lsx: 3.6 (13.87x)
inv_txfm_add_8x8_identity_flipadst_0_8bpc_c: 52.0 ( 1.00x)
inv_txfm_add_8x8_identity_flipadst_0_8bpc_lsx: 5.5 ( 9.40x)
inv_txfm_add_8x8_identity_flipadst_1_8bpc_c: 52.0 ( 1.00x)
inv_txfm_add_8x8_identity_flipadst_1_8bpc_lsx: 5.5 ( 9.39x)
inv_txfm_add_8x8_identity_identity_0_8bpc_c: 31.1 ( 1.00x)
inv_txfm_add_8x8_identity_identity_0_8bpc_lsx: 1.8 (17.06x)
inv_txfm_add_8x8_identity_identity_1_8bpc_c: 31.1 ( 1.00x)
inv_txfm_add_8x8_identity_identity_1_8bpc_lsx: 1.8 (16.97x)
2024-01-21 15:31:46 +08:00
yuanhecai
951646ce56
loongarch: Improve the performance of itx_8bpc.add_8x4 series functions
...
Relative speedup over C code:
inv_txfm_add_8x4_adst_adst_0_8bpc_c: 32.0 ( 1.00x)
inv_txfm_add_8x4_adst_adst_0_8bpc_lsx: 4.1 ( 7.87x)
inv_txfm_add_8x4_adst_adst_1_8bpc_c: 32.3 ( 1.00x)
inv_txfm_add_8x4_adst_adst_1_8bpc_lsx: 4.1 ( 7.92x)
inv_txfm_add_8x4_adst_dct_0_8bpc_c: 33.7 ( 1.00x)
inv_txfm_add_8x4_adst_dct_0_8bpc_lsx: 3.8 ( 8.77x)
inv_txfm_add_8x4_adst_dct_1_8bpc_c: 33.1 ( 1.00x)
inv_txfm_add_8x4_adst_dct_1_8bpc_lsx: 3.8 ( 8.63x)
inv_txfm_add_8x4_adst_flipadst_0_8bpc_c: 32.7 ( 1.00x)
inv_txfm_add_8x4_adst_flipadst_0_8bpc_lsx: 4.1 ( 7.99x)
inv_txfm_add_8x4_adst_flipadst_1_8bpc_c: 32.8 ( 1.00x)
inv_txfm_add_8x4_adst_flipadst_1_8bpc_lsx: 4.0 ( 8.16x)
inv_txfm_add_8x4_adst_identity_0_8bpc_c: 31.2 ( 1.00x)
inv_txfm_add_8x4_adst_identity_0_8bpc_lsx: 3.8 ( 8.29x)
inv_txfm_add_8x4_adst_identity_1_8bpc_c: 28.7 ( 1.00x)
inv_txfm_add_8x4_adst_identity_1_8bpc_lsx: 3.7 ( 7.78x)
inv_txfm_add_8x4_dct_adst_0_8bpc_c: 32.0 ( 1.00x)
inv_txfm_add_8x4_dct_adst_0_8bpc_lsx: 3.0 (10.76x)
inv_txfm_add_8x4_dct_adst_1_8bpc_c: 31.5 ( 1.00x)
inv_txfm_add_8x4_dct_adst_1_8bpc_lsx: 2.8 (11.46x)
inv_txfm_add_8x4_dct_dct_0_8bpc_c: 5.5 ( 1.00x)
inv_txfm_add_8x4_dct_dct_0_8bpc_lsx: 0.6 ( 9.22x)
inv_txfm_add_8x4_dct_dct_1_8bpc_c: 33.1 ( 1.00x)
inv_txfm_add_8x4_dct_dct_1_8bpc_lsx: 2.8 (11.89x)
inv_txfm_add_8x4_dct_flipadst_0_8bpc_c: 32.4 ( 1.00x)
inv_txfm_add_8x4_dct_flipadst_0_8bpc_lsx: 3.0 (10.81x)
inv_txfm_add_8x4_dct_flipadst_1_8bpc_c: 32.4 ( 1.00x)
inv_txfm_add_8x4_dct_flipadst_1_8bpc_lsx: 3.0 (10.81x)
inv_txfm_add_8x4_dct_identity_0_8bpc_c: 27.9 ( 1.00x)
inv_txfm_add_8x4_dct_identity_0_8bpc_lsx: 2.7 (10.35x)
inv_txfm_add_8x4_dct_identity_1_8bpc_c: 28.5 ( 1.00x)
inv_txfm_add_8x4_dct_identity_1_8bpc_lsx: 2.7 (10.53x)
inv_txfm_add_8x4_flipadst_adst_0_8bpc_c: 32.2 ( 1.00x)
inv_txfm_add_8x4_flipadst_adst_0_8bpc_lsx: 4.1 ( 7.86x)
inv_txfm_add_8x4_flipadst_adst_1_8bpc_c: 32.2 ( 1.00x)
inv_txfm_add_8x4_flipadst_adst_1_8bpc_lsx: 4.0 ( 7.95x)
inv_txfm_add_8x4_flipadst_dct_0_8bpc_c: 33.6 ( 1.00x)
inv_txfm_add_8x4_flipadst_dct_0_8bpc_lsx: 3.8 ( 8.73x)
inv_txfm_add_8x4_flipadst_dct_1_8bpc_c: 33.6 ( 1.00x)
inv_txfm_add_8x4_flipadst_dct_1_8bpc_lsx: 3.8 ( 8.74x)
inv_txfm_add_8x4_flipadst_flipadst_0_8bpc_c: 32.6 ( 1.00x)
inv_txfm_add_8x4_flipadst_flipadst_0_8bpc_lsx: 4.0 ( 8.16x)
inv_txfm_add_8x4_flipadst_flipadst_1_8bpc_c: 32.6 ( 1.00x)
inv_txfm_add_8x4_flipadst_flipadst_1_8bpc_lsx: 4.0 ( 8.15x)
inv_txfm_add_8x4_flipadst_identity_0_8bpc_c: 28.7 ( 1.00x)
inv_txfm_add_8x4_flipadst_identity_0_8bpc_lsx: 3.8 ( 7.64x)
inv_txfm_add_8x4_flipadst_identity_1_8bpc_c: 28.7 ( 1.00x)
inv_txfm_add_8x4_flipadst_identity_1_8bpc_lsx: 3.8 ( 7.55x)
inv_txfm_add_8x4_identity_adst_0_8bpc_c: 21.9 ( 1.00x)
inv_txfm_add_8x4_identity_adst_0_8bpc_lsx: 1.9 (11.81x)
inv_txfm_add_8x4_identity_adst_1_8bpc_c: 26.9 ( 1.00x)
inv_txfm_add_8x4_identity_adst_1_8bpc_lsx: 1.9 (14.39x)
inv_txfm_add_8x4_identity_dct_0_8bpc_c: 23.3 ( 1.00x)
inv_txfm_add_8x4_identity_dct_0_8bpc_lsx: 1.7 (13.53x)
inv_txfm_add_8x4_identity_dct_1_8bpc_c: 23.3 ( 1.00x)
inv_txfm_add_8x4_identity_dct_1_8bpc_lsx: 1.7 (13.53x)
inv_txfm_add_8x4_identity_flipadst_0_8bpc_c: 22.3 ( 1.00x)
inv_txfm_add_8x4_identity_flipadst_0_8bpc_lsx: 1.9 (11.46x)
inv_txfm_add_8x4_identity_flipadst_1_8bpc_c: 23.4 ( 1.00x)
inv_txfm_add_8x4_identity_flipadst_1_8bpc_lsx: 1.9 (12.02x)
inv_txfm_add_8x4_identity_identity_0_8bpc_c: 18.5 ( 1.00x)
inv_txfm_add_8x4_identity_identity_0_8bpc_lsx: 1.6 (11.23x)
inv_txfm_add_8x4_identity_identity_1_8bpc_c: 18.5 ( 1.00x)
inv_txfm_add_8x4_identity_identity_1_8bpc_lsx: 1.6 (11.57x)
2024-01-21 15:31:46 +08:00
yuanhecai
a4cd834991
loongarch: Improve the performance of itx_8bpc.add_4x4 series functions
...
Relative speedup over C code:
inv_txfm_add_4x4_adst_adst_0_8bpc_c: 14.1 ( 1.00x)
inv_txfm_add_4x4_adst_adst_0_8bpc_lsx: 1.3 (11.16x)
inv_txfm_add_4x4_adst_adst_1_8bpc_c: 14.1 ( 1.00x)
inv_txfm_add_4x4_adst_adst_1_8bpc_lsx: 1.3 (11.17x)
inv_txfm_add_4x4_adst_dct_0_8bpc_c: 14.8 ( 1.00x)
inv_txfm_add_4x4_adst_dct_0_8bpc_lsx: 1.4 (10.99x)
inv_txfm_add_4x4_adst_dct_1_8bpc_c: 14.9 ( 1.00x)
inv_txfm_add_4x4_adst_dct_1_8bpc_lsx: 1.3 (11.42x)
inv_txfm_add_4x4_adst_flipadst_0_8bpc_c: 14.4 ( 1.00x)
inv_txfm_add_4x4_adst_flipadst_0_8bpc_lsx: 1.2 (11.52x)
inv_txfm_add_4x4_adst_flipadst_1_8bpc_c: 14.4 ( 1.00x)
inv_txfm_add_4x4_adst_flipadst_1_8bpc_lsx: 1.2 (11.52x)
inv_txfm_add_4x4_adst_identity_0_8bpc_c: 12.5 ( 1.00x)
inv_txfm_add_4x4_adst_identity_0_8bpc_lsx: 1.2 (10.22x)
inv_txfm_add_4x4_adst_identity_1_8bpc_c: 12.5 ( 1.00x)
inv_txfm_add_4x4_adst_identity_1_8bpc_lsx: 1.2 (10.26x)
inv_txfm_add_4x4_dct_adst_0_8bpc_c: 14.6 ( 1.00x)
inv_txfm_add_4x4_dct_adst_0_8bpc_lsx: 1.3 (11.37x)
inv_txfm_add_4x4_dct_adst_1_8bpc_c: 14.6 ( 1.00x)
inv_txfm_add_4x4_dct_adst_1_8bpc_lsx: 1.3 (11.55x)
inv_txfm_add_4x4_dct_dct_0_8bpc_c: 3.2 ( 1.00x)
inv_txfm_add_4x4_dct_dct_0_8bpc_lsx: 0.5 ( 6.28x)
inv_txfm_add_4x4_dct_dct_1_8bpc_c: 15.4 ( 1.00x)
inv_txfm_add_4x4_dct_dct_1_8bpc_lsx: 1.2 (13.19x)
inv_txfm_add_4x4_dct_flipadst_0_8bpc_c: 15.0 ( 1.00x)
inv_txfm_add_4x4_dct_flipadst_0_8bpc_lsx: 1.3 (11.73x)
inv_txfm_add_4x4_dct_flipadst_1_8bpc_c: 15.0 ( 1.00x)
inv_txfm_add_4x4_dct_flipadst_1_8bpc_lsx: 1.3 (11.72x)
inv_txfm_add_4x4_dct_identity_0_8bpc_c: 13.0 ( 1.00x)
inv_txfm_add_4x4_dct_identity_0_8bpc_lsx: 1.1 (12.36x)
inv_txfm_add_4x4_dct_identity_1_8bpc_c: 13.0 ( 1.00x)
inv_txfm_add_4x4_dct_identity_1_8bpc_lsx: 1.0 (12.36x)
inv_txfm_add_4x4_flipadst_adst_0_8bpc_c: 14.2 ( 1.00x)
inv_txfm_add_4x4_flipadst_adst_0_8bpc_lsx: 1.3 (11.00x)
inv_txfm_add_4x4_flipadst_adst_1_8bpc_c: 14.2 ( 1.00x)
inv_txfm_add_4x4_flipadst_adst_1_8bpc_lsx: 1.3 (11.03x)
inv_txfm_add_4x4_flipadst_dct_0_8bpc_c: 15.0 ( 1.00x)
inv_txfm_add_4x4_flipadst_dct_0_8bpc_lsx: 1.3 (11.43x)
inv_txfm_add_4x4_flipadst_dct_1_8bpc_c: 15.0 ( 1.00x)
inv_txfm_add_4x4_flipadst_dct_1_8bpc_lsx: 1.3 (11.44x)
inv_txfm_add_4x4_flipadst_flipadst_0_8bpc_c: 14.5 ( 1.00x)
inv_txfm_add_4x4_flipadst_flipadst_0_8bpc_lsx: 1.3 (11.60x)
inv_txfm_add_4x4_flipadst_flipadst_1_8bpc_c: 14.5 ( 1.00x)
inv_txfm_add_4x4_flipadst_flipadst_1_8bpc_lsx: 1.2 (11.61x)
inv_txfm_add_4x4_flipadst_identity_0_8bpc_c: 12.5 ( 1.00x)
inv_txfm_add_4x4_flipadst_identity_0_8bpc_lsx: 1.1 (11.01x)
inv_txfm_add_4x4_flipadst_identity_1_8bpc_c: 12.5 ( 1.00x)
inv_txfm_add_4x4_flipadst_identity_1_8bpc_lsx: 1.1 (10.99x)
inv_txfm_add_4x4_identity_adst_0_8bpc_c: 12.1 ( 1.00x)
inv_txfm_add_4x4_identity_adst_0_8bpc_lsx: 1.1 (11.50x)
inv_txfm_add_4x4_identity_adst_1_8bpc_c: 12.1 ( 1.00x)
inv_txfm_add_4x4_identity_adst_1_8bpc_lsx: 1.1 (10.98x)
inv_txfm_add_4x4_identity_dct_0_8bpc_c: 12.9 ( 1.00x)
inv_txfm_add_4x4_identity_dct_0_8bpc_lsx: 1.0 (12.95x)
inv_txfm_add_4x4_identity_dct_1_8bpc_c: 13.0 ( 1.00x)
inv_txfm_add_4x4_identity_dct_1_8bpc_lsx: 1.0 (12.97x)
inv_txfm_add_4x4_identity_flipadst_0_8bpc_c: 12.4 ( 1.00x)
inv_txfm_add_4x4_identity_flipadst_0_8bpc_lsx: 1.1 (11.26x)
inv_txfm_add_4x4_identity_flipadst_1_8bpc_c: 12.4 ( 1.00x)
inv_txfm_add_4x4_identity_flipadst_1_8bpc_lsx: 1.1 (11.32x)
inv_txfm_add_4x4_identity_identity_0_8bpc_c: 10.6 ( 1.00x)
inv_txfm_add_4x4_identity_identity_0_8bpc_lsx: 0.9 (11.45x)
inv_txfm_add_4x4_identity_identity_1_8bpc_c: 10.6 ( 1.00x)
inv_txfm_add_4x4_identity_identity_1_8bpc_lsx: 0.9 (11.78x)
inv_txfm_add_4x4_wht_wht_0_8bpc_c: 4.1 ( 1.00x)
inv_txfm_add_4x4_wht_wht_0_8bpc_lsx: 0.6 ( 6.84x)
inv_txfm_add_4x4_wht_wht_1_8bpc_c: 4.1 ( 1.00x)
inv_txfm_add_4x4_wht_wht_1_8bpc_lsx: 0.6 ( 6.83x)
2024-01-21 15:31:46 +08:00
yuanhecai
14df65f217
loongarch: Improve the performance of refmvs.splat_mv function
...
Relative speedup over C code:
splat_mv_w1_c: 0.6 ( 1.00x)
splat_mv_w1_lsx: 0.4 ( 1.28x)
splat_mv_w2_c: 0.9 ( 1.00x)
splat_mv_w2_lsx: 0.6 ( 1.65x)
splat_mv_w4_c: 2.2 ( 1.00x)
splat_mv_w4_lsx: 0.8 ( 2.87x)
splat_mv_w8_c: 7.7 ( 1.00x)
splat_mv_w8_lsx: 2.0 ( 3.80x)
splat_mv_w16_c: 19.1 ( 1.00x)
splat_mv_w16_lsx: 4.6 ( 4.18x)
splat_mv_w32_c: 49.0 ( 1.00x)
splat_mv_w32_lsx: 10.3 ( 4.76x)
2024-01-21 15:31:46 +08:00
jinbo
38bc00849a
loongarch: Improve the performance of msac series functions
...
Relative speedup over C code:
msac_decode_bool_c: 0.5 ( 1.00x)
msac_decode_bool_lsx: 0.5 ( 1.09x)
msac_decode_bool_adapt_c: 0.7 ( 1.00x)
msac_decode_bool_adapt_lsx: 0.6 ( 1.20x)
msac_decode_symbol_adapt4_c: 1.3 ( 1.00x)
msac_decode_symbol_adapt4_lsx: 1.0 ( 1.30x)
msac_decode_symbol_adapt8_c: 2.1 ( 1.00x)
msac_decode_symbol_adapt8_lsx: 1.0 ( 2.05x)
msac_decode_symbol_adapt16_c: 3.7 ( 1.00x)
msac_decode_symbol_adapt16_lsx: 0.8 ( 4.77x)
2024-01-21 15:31:46 +08:00
yuanhecai
b98ea43379
loongarch: Improve the performance of looprestoration_8bpc series functions
...
Relative speedup over C code:
wiener_5tap_8bpc_c: 13358.0 ( 1.00x)
wiener_5tap_8bpc_lsx: 2484.7 ( 5.38x)
wiener_7tap_8bpc_c: 13358.4 ( 1.00x)
wiener_7tap_8bpc_lsx: 2486.4 ( 5.37x)
sgr_3x3_8bpc_c: 18989.2 ( 1.00x)
sgr_3x3_8bpc_lsx: 7981.6 ( 2.38x)
sgr_5x5_8bpc_c: 17242.0 ( 1.00x)
sgr_5x5_8bpc_lsx: 5735.5 ( 3.01x)
2024-01-21 15:31:46 +08:00
yuanhecai
78a776d253
loongarch: Improve the performance of loopfilter_8bpc series functions
...
Relative speedup over C code:
lpf_h_sb_uv_w4_8bpc_c: 25.3 ( 1.00x)
lpf_h_sb_uv_w4_8bpc_lsx: 6.7 ( 3.79x)
lpf_h_sb_uv_w6_8bpc_c: 36.5 ( 1.00x)
lpf_h_sb_uv_w6_8bpc_lsx: 11.0 ( 3.31x)
lpf_h_sb_y_w4_8bpc_c: 47.7 ( 1.00x)
lpf_h_sb_y_w4_8bpc_lsx: 12.5 ( 3.82x)
lpf_h_sb_y_w8_8bpc_c: 81.9 ( 1.00x)
lpf_h_sb_y_w8_8bpc_lsx: 22.2 ( 3.69x)
lpf_h_sb_y_w16_8bpc_c: 85.1 ( 1.00x)
lpf_h_sb_y_w16_8bpc_lsx: 18.1 ( 4.70x)
lpf_v_sb_uv_w4_8bpc_c: 25.3 ( 1.00x)
lpf_v_sb_uv_w4_8bpc_lsx: 5.7 ( 4.43x)
lpf_v_sb_uv_w6_8bpc_c: 37.6 ( 1.00x)
lpf_v_sb_uv_w6_8bpc_lsx: 9.5 ( 3.97x)
lpf_v_sb_y_w4_8bpc_c: 59.4 ( 1.00x)
lpf_v_sb_y_w4_8bpc_lsx: 15.7 ( 3.78x)
lpf_v_sb_y_w8_8bpc_c: 94.5 ( 1.00x)
lpf_v_sb_y_w8_8bpc_lsx: 29.4 ( 3.21x)
lpf_v_sb_y_w16_8bpc_c: 97.8 ( 1.00x)
lpf_v_sb_y_w16_8bpc_lsx: 36.3 ( 2.70x)
2024-01-21 15:31:46 +08:00
jinbo
ae8756ed91
loongarch: Improve the performance of mc_8bpc.mct functions
...
Relative speedup over C code:
mct_8tap_regular_w4_0_8bpc_c: 4.2 ( 1.00x)
mct_8tap_regular_w4_0_8bpc_lasx: 0.5 ( 9.08x)
mct_8tap_regular_w4_h_8bpc_c: 12.5 ( 1.00x)
mct_8tap_regular_w4_h_8bpc_lasx: 1.6 ( 7.80x)
mct_8tap_regular_w4_hv_8bpc_c: 33.5 ( 1.00x)
mct_8tap_regular_w4_hv_8bpc_lasx: 6.0 ( 5.54x)
mct_8tap_regular_w4_v_8bpc_c: 13.6 ( 1.00x)
mct_8tap_regular_w4_v_8bpc_lasx: 2.2 ( 6.22x)
mct_8tap_regular_w8_0_8bpc_c: 11.3 ( 1.00x)
mct_8tap_regular_w8_0_8bpc_lasx: 0.7 (15.77x)
mct_8tap_regular_w8_h_8bpc_c: 39.1 ( 1.00x)
mct_8tap_regular_w8_h_8bpc_lasx: 4.7 ( 8.30x)
mct_8tap_regular_w8_hv_8bpc_c: 90.9 ( 1.00x)
mct_8tap_regular_w8_hv_8bpc_lasx: 17.2 ( 5.29x)
mct_8tap_regular_w8_v_8bpc_c: 40.5 ( 1.00x)
mct_8tap_regular_w8_v_8bpc_lasx: 6.9 ( 5.86x)
mct_8tap_regular_w16_0_8bpc_c: 34.3 ( 1.00x)
mct_8tap_regular_w16_0_8bpc_lasx: 1.3 (26.32x)
mct_8tap_regular_w16_h_8bpc_c: 128.3 ( 1.00x)
mct_8tap_regular_w16_h_8bpc_lasx: 20.5 ( 6.26x)
mct_8tap_regular_w16_hv_8bpc_c: 273.5 ( 1.00x)
mct_8tap_regular_w16_hv_8bpc_lasx: 54.5 ( 5.02x)
mct_8tap_regular_w16_v_8bpc_c: 129.7 ( 1.00x)
mct_8tap_regular_w16_v_8bpc_lasx: 22.8 ( 5.69x)
mct_8tap_regular_w32_0_8bpc_c: 133.7 ( 1.00x)
mct_8tap_regular_w32_0_8bpc_lasx: 5.4 (24.65x)
mct_8tap_regular_w32_h_8bpc_c: 511.4 ( 1.00x)
mct_8tap_regular_w32_h_8bpc_lasx: 85.1 ( 6.01x)
mct_8tap_regular_w32_hv_8bpc_c: 1018.2 ( 1.00x)
mct_8tap_regular_w32_hv_8bpc_lasx: 210.0 ( 4.85x)
mct_8tap_regular_w32_v_8bpc_c: 513.6 ( 1.00x)
mct_8tap_regular_w32_v_8bpc_lasx: 88.7 ( 5.79x)
mct_8tap_regular_w64_0_8bpc_c: 315.4 ( 1.00x)
mct_8tap_regular_w64_0_8bpc_lasx: 13.2 (23.86x)
mct_8tap_regular_w64_h_8bpc_c: 1236.8 ( 1.00x)
mct_8tap_regular_w64_h_8bpc_lasx: 208.2 ( 5.94x)
mct_8tap_regular_w64_hv_8bpc_c: 2428.0 ( 1.00x)
mct_8tap_regular_w64_hv_8bpc_lasx: 502.7 ( 4.83x)
mct_8tap_regular_w64_v_8bpc_c: 1238.3 ( 1.00x)
mct_8tap_regular_w64_v_8bpc_lasx: 214.0 ( 5.79x)
mct_8tap_regular_w128_0_8bpc_c: 775.3 ( 1.00x)
mct_8tap_regular_w128_0_8bpc_lasx: 32.5 (23.86x)
mct_8tap_regular_w128_h_8bpc_c: 3077.5 ( 1.00x)
mct_8tap_regular_w128_h_8bpc_lasx: 518.6 ( 5.93x)
mct_8tap_regular_w128_hv_8bpc_c: 5987.0 ( 1.00x)
mct_8tap_regular_w128_hv_8bpc_lasx: 1242.4 ( 4.82x)
mct_8tap_regular_w128_v_8bpc_c: 3077.5 ( 1.00x)
mct_8tap_regular_w128_v_8bpc_lasx: 530.3 ( 5.80x)
2024-01-21 15:31:46 +08:00
jinbo
b34ecaf310
loongarch: Improve the performance of mc_8bpc.mc functions
...
Relative speedup over C code:
mc_8tap_regular_w2_0_8bpc_c: 5.3 ( 1.00x)
mc_8tap_regular_w2_0_8bpc_lsx: 0.8 ( 6.62x)
mc_8tap_regular_w2_h_8bpc_c: 11.0 ( 1.00x)
mc_8tap_regular_w2_h_8bpc_lsx: 2.5 ( 4.40x)
mc_8tap_regular_w2_hv_8bpc_c: 24.4 ( 1.00x)
mc_8tap_regular_w2_hv_8bpc_lsx: 9.1 ( 2.70x)
mc_8tap_regular_w2_v_8bpc_c: 12.9 ( 1.00x)
mc_8tap_regular_w2_v_8bpc_lsx: 3.2 ( 4.08x)
mc_8tap_regular_w4_0_8bpc_c: 4.8 ( 1.00x)
mc_8tap_regular_w4_0_8bpc_lsx: 0.8 ( 5.97x)
mc_8tap_regular_w4_h_8bpc_c: 20.0 ( 1.00x)
mc_8tap_regular_w4_h_8bpc_lsx: 3.9 ( 5.06x)
mc_8tap_regular_w4_hv_8bpc_c: 44.3 ( 1.00x)
mc_8tap_regular_w4_hv_8bpc_lsx: 15.0 ( 2.96x)
mc_8tap_regular_w4_v_8bpc_c: 23.5 ( 1.00x)
mc_8tap_regular_w4_v_8bpc_lsx: 4.2 ( 5.54x)
mc_8tap_regular_w8_0_8bpc_c: 4.8 ( 1.00x)
mc_8tap_regular_w8_0_8bpc_lsx: 0.8 ( 6.03x)
mc_8tap_regular_w8_h_8bpc_c: 37.5 ( 1.00x)
mc_8tap_regular_w8_h_8bpc_lsx: 7.6 ( 4.96x)
mc_8tap_regular_w8_hv_8bpc_c: 84.0 ( 1.00x)
mc_8tap_regular_w8_hv_8bpc_lsx: 23.9 ( 3.51x)
mc_8tap_regular_w8_v_8bpc_c: 44.8 ( 1.00x)
mc_8tap_regular_w8_v_8bpc_lsx: 7.2 ( 6.23x)
mc_8tap_regular_w16_0_8bpc_c: 5.8 ( 1.00x)
mc_8tap_regular_w16_0_8bpc_lsx: 1.1 ( 5.12x)
mc_8tap_regular_w16_h_8bpc_c: 103.8 ( 1.00x)
mc_8tap_regular_w16_h_8bpc_lsx: 21.6 ( 4.80x)
mc_8tap_regular_w16_hv_8bpc_c: 220.2 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_lsx: 65.1 ( 3.38x)
mc_8tap_regular_w16_v_8bpc_c: 124.8 ( 1.00x)
mc_8tap_regular_w16_v_8bpc_lsx: 19.9 ( 6.28x)
mc_8tap_regular_w32_0_8bpc_c: 8.9 ( 1.00x)
mc_8tap_regular_w32_0_8bpc_lsx: 2.9 ( 3.06x)
mc_8tap_regular_w32_h_8bpc_c: 323.6 ( 1.00x)
mc_8tap_regular_w32_h_8bpc_lsx: 69.1 ( 4.68x)
mc_8tap_regular_w32_hv_8bpc_c: 649.5 ( 1.00x)
mc_8tap_regular_w32_hv_8bpc_lsx: 197.7 ( 3.29x)
mc_8tap_regular_w32_v_8bpc_c: 390.5 ( 1.00x)
mc_8tap_regular_w32_v_8bpc_lsx: 61.9 ( 6.31x)
mc_8tap_regular_w64_0_8bpc_c: 13.3 ( 1.00x)
mc_8tap_regular_w64_0_8bpc_lsx: 9.7 ( 1.37x)
mc_8tap_regular_w64_h_8bpc_c: 1145.3 ( 1.00x)
mc_8tap_regular_w64_h_8bpc_lsx: 248.2 ( 4.61x)
mc_8tap_regular_w64_hv_8bpc_c: 2204.4 ( 1.00x)
mc_8tap_regular_w64_hv_8bpc_lsx: 682.1 ( 3.23x)
mc_8tap_regular_w64_v_8bpc_c: 1384.9 ( 1.00x)
mc_8tap_regular_w64_v_8bpc_lsx: 218.9 ( 6.33x)
mc_8tap_regular_w128_0_8bpc_c: 33.6 ( 1.00x)
mc_8tap_regular_w128_0_8bpc_lsx: 27.7 ( 1.21x)
mc_8tap_regular_w128_h_8bpc_c: 3228.1 ( 1.00x)
mc_8tap_regular_w128_h_8bpc_lsx: 701.7 ( 4.60x)
mc_8tap_regular_w128_hv_8bpc_c: 6108.2 ( 1.00x)
mc_8tap_regular_w128_hv_8bpc_lsx: 1905.3 ( 3.21x)
mc_8tap_regular_w128_v_8bpc_c: 3906.8 ( 1.00x)
mc_8tap_regular_w128_v_8bpc_lsx: 617.4 ( 6.33x)
2024-01-21 15:31:46 +08:00
yuanhecai
d618867533
loongarch: Improve the performance of avg functions
...
Relative speedup over C code:
avg_w4_8bpc_c: 7.0 ( 1.00x)
avg_w4_8bpc_lsx: 0.8 ( 8.69x)
avg_w4_8bpc_lasx: 0.8 ( 8.94x)
avg_w8_8bpc_c: 20.4 ( 1.00x)
avg_w8_8bpc_lsx: 1.1 (18.25x)
avg_w8_8bpc_lasx: 0.9 (23.16x)
avg_w16_8bpc_c: 65.1 ( 1.00x)
avg_w16_8bpc_lsx: 2.5 (26.43x)
avg_w16_8bpc_lasx: 2.0 (32.05x)
avg_w32_8bpc_c: 255.1 ( 1.00x)
avg_w32_8bpc_lsx: 8.6 (29.74x)
avg_w32_8bpc_lasx: 6.0 (42.80x)
avg_w64_8bpc_c: 611.0 ( 1.00x)
avg_w64_8bpc_lsx: 21.0 (29.10x)
avg_w64_8bpc_lasx: 12.1 (50.36x)
avg_w128_8bpc_c: 1519.3 ( 1.00x)
avg_w128_8bpc_lsx: 88.7 (17.13x)
avg_w128_8bpc_lasx: 60.3 (25.20x)
2024-01-21 15:31:46 +08:00
yuanhecai
4080673c17
loongarch: Improve the performance of mask_c, w_mask_420 functions
...
Relative speedup over C code:
mask_w4_8bpc_c: 9.2 ( 1.00x)
mask_w4_8bpc_lsx: 1.1 ( 8.31x)
mask_w4_8bpc_lasx: 1.2 ( 7.42x)
mask_w8_8bpc_c: 27.4 ( 1.00x)
mask_w8_8bpc_lsx: 2.6 (10.54x)
mask_w8_8bpc_lasx: 1.9 (14.65x)
mask_w16_8bpc_c: 87.2 ( 1.00x)
mask_w16_8bpc_lsx: 8.0 (10.92x)
mask_w16_8bpc_lasx: 6.5 (13.46x)
mask_w32_8bpc_c: 343.4 ( 1.00x)
mask_w32_8bpc_lsx: 31.7 (10.84x)
mask_w32_8bpc_lasx: 22.1 (15.51x)
mask_w64_8bpc_c: 824.9 ( 1.00x)
mask_w64_8bpc_lsx: 78.0 (10.57x)
mask_w64_8bpc_lasx: 54.1 (15.25x)
mask_w128_8bpc_c: 2042.9 ( 1.00x)
mask_w128_8bpc_lsx: 200.7 (10.18x)
mask_w128_8bpc_lasx: 157.1 (13.00x)
w_mask_420_w4_8bpc_c: 19.0 ( 1.00x)
w_mask_420_w4_8bpc_lsx: 1.7 (11.11x)
w_mask_420_w4_8bpc_lasx: 1.2 (15.87x)
w_mask_420_w8_8bpc_c: 58.2 ( 1.00x)
w_mask_420_w8_8bpc_lsx: 4.6 (12.58x)
w_mask_420_w8_8bpc_lasx: 2.5 (23.74x)
w_mask_420_w16_8bpc_c: 188.0 ( 1.00x)
w_mask_420_w16_8bpc_lsx: 11.8 (15.88x)
w_mask_420_w16_8bpc_lasx: 8.3 (22.66x)
w_mask_420_w32_8bpc_c: 742.2 ( 1.00x)
w_mask_420_w32_8bpc_lsx: 47.3 (15.68x)
w_mask_420_w32_8bpc_lasx: 32.7 (22.68x)
w_mask_420_w64_8bpc_c: 1786.3 ( 1.00x)
w_mask_420_w64_8bpc_lsx: 112.4 (15.89x)
w_mask_420_w64_8bpc_lasx: 78.4 (22.78x)
w_mask_420_w128_8bpc_c: 4442.2 ( 1.00x)
w_mask_420_w128_8bpc_lsx: 298.9 (14.86x)
w_mask_420_w128_8bpc_lasx: 220.5 (20.15x)
2024-01-21 15:31:46 +08:00
Hao Chen
bde69a94bf
loongarch: Improve the performance of w_avg functions
...
Relative speedup over C code:
w_avg_w4_8bpc_c: 8.6 ( 1.00x)
w_avg_w4_8bpc_lsx: 1.0 ( 8.53x)
w_avg_w4_8bpc_lasx: 1.0 ( 8.79x)
w_avg_w8_8bpc_c: 24.4 ( 1.00x)
w_avg_w8_8bpc_lsx: 2.7 ( 8.90x)
w_avg_w8_8bpc_lasx: 1.6 (15.33x)
w_avg_w16_8bpc_c: 77.4 ( 1.00x)
w_avg_w16_8bpc_lsx: 6.9 (11.29x)
w_avg_w16_8bpc_lasx: 5.2 (14.88x)
w_avg_w32_8bpc_c: 303.7 ( 1.00x)
w_avg_w32_8bpc_lsx: 27.2 (11.16x)
w_avg_w32_8bpc_lasx: 14.2 (21.43x)
w_avg_w64_8bpc_c: 725.8 ( 1.00x)
w_avg_w64_8bpc_lsx: 66.1 (10.98x)
w_avg_w64_8bpc_lasx: 35.4 (20.48x)
w_avg_w128_8bpc_c: 1812.6 ( 1.00x)
w_avg_w128_8bpc_lsx: 169.9 (10.67x)
w_avg_w128_8bpc_lasx: 111.7 (16.23x)
2024-01-21 15:31:46 +08:00
yuanhecai
a23a1e7f81
loongarch: Improve the performance of warp8x8, warp8x8t functions
...
Relative speedup over C code:
warp_8x8_8bpc_c: 81.3 ( 1.00x)
warp_8x8_8bpc_lsx: 27.1 ( 3.00x)
warp_8x8_8bpc_lasx: 17.9 ( 4.54x)
warp_8x8t_8bpc_c: 71.7 ( 1.00x)
warp_8x8t_8bpc_lsx: 26.6 ( 2.69x)
warp_8x8t_8bpc_lasx: 17.7 ( 4.04x)
2024-01-21 15:31:46 +08:00
yuanhecai
4fb71a1a01
loongarch: add loongson_asm.S
2024-01-21 15:31:46 +08:00
yuanhecai
2e952f300f
Add loongarch support
2024-01-21 15:06:52 +08:00
Matthias Dressel
7d225bec62
CI: Add loongarch64 tests
2024-01-15 14:54:46 +01:00
Matthias Dressel
655d7ec07d
CI: Add loongarch64 toolchain
2024-01-15 09:35:54 +01:00
Henrik Gramner
d23e87f7ae
checkasm: Prefer sigsetjmp()/siglongjmp() over SA_NODEFER
...
Also prefer re-setting the signal handler upon intercept in combination
with SA_RESETHAND over re-raising exceptions with the SIG_DFL handler.
2024-01-11 12:35:34 +00:00
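A minimal sketch of the pattern named in the sigsetjmp()/siglongjmp() commit above, assuming a test harness that wants to survive a signal raised by the code under test. The names `run_guarded`, `demo_raise` and `demo_noop` are hypothetical and are not taken from checkasm; the signal is produced with raise() so that the example is deterministic.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf recover_point;

static void on_signal(int sig) {
    (void)sig;
    /* Jump back to the sigsetjmp() call. Because the mask was saved there
     * (second argument 1), the signal blocked during handler execution is
     * unblocked again, so SA_NODEFER is not needed. */
    siglongjmp(recover_point, 1);
}

static void install_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    /* One-shot handler: the disposition reverts to SIG_DFL on delivery,
     * so it has to be installed again before the next guarded call. */
    sa.sa_flags = SA_RESETHAND;
    sigaction(SIGFPE, &sa, NULL);
}

/* Returns 1 if fn raised SIGFPE, 0 if it completed normally. */
int run_guarded(void (*fn)(void)) {
    install_handler(); /* re-arm on every call */
    if (sigsetjmp(recover_point, 1))
        return 1;      /* reached via siglongjmp() from the handler */
    fn();
    return 0;
}

/* Hypothetical functions under test. */
void demo_raise(void) { raise(SIGFPE); }
void demo_noop(void) { }
```

Calling `run_guarded(demo_raise)` repeatedly keeps working because the handler is installed again on each call and the signal mask is restored by siglongjmp().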
Henrik Gramner
8501a4b201
checkasm: Make signal handling async-signal-safe
2024-01-11 12:35:34 +00:00
Ronald S. Bultje
ceeb535d94
qm: derive more tables at runtime
...
This reduces binary size from ~50kb to ~35kb. Ideas provided by Yu-Chen
(Eric) Sun and Ryan Lei from Meta.
2024-01-03 13:42:40 -05:00
Henrik Gramner
746ab8b4f3
thread_task: Properly handle spurious wakeups in delayed_fg
...
POSIX explicitly states that spurious wakeups from pthread_cond_wait()
may occur, even without any corresponding call to pthread_cond_signal().
2023-12-19 13:15:43 +01:00
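The usual defence against the spurious wakeups mentioned in the thread_task commit above is to re-check the waited-for condition in a loop. The sketch below illustrates that general idiom only; `Gate`, `gate_wait`, `gate_open` and `gate_demo` are hypothetical names and do not reflect dav1d's task structures.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical one-shot gate: 'ready' is the predicate guarded by 'lock'. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool ready;
} Gate;

static void gate_init(Gate *g) {
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->cond, NULL);
    g->ready = false;
}

/* Block until 'ready' is set. The while-loop re-tests the predicate after
 * every return from pthread_cond_wait(), so a spurious wakeup simply goes
 * back to waiting. */
static void gate_wait(Gate *g) {
    pthread_mutex_lock(&g->lock);
    while (!g->ready)
        pthread_cond_wait(&g->cond, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

/* Set the predicate under the lock, then wake all waiters. */
static void gate_open(Gate *g) {
    pthread_mutex_lock(&g->lock);
    g->ready = true;
    pthread_cond_broadcast(&g->cond);
    pthread_mutex_unlock(&g->lock);
}

static void *opener(void *arg) {
    gate_open(arg);
    return NULL;
}

/* Spawn a thread that opens the gate, wait for it, and report the flag. */
int gate_demo(void) {
    Gate g;
    pthread_t t;
    gate_init(&g);
    if (pthread_create(&t, NULL, opener, &g) != 0)
        return -1;
    gate_wait(&g);
    pthread_join(t, NULL);
    return g.ready ? 1 : 0;
}
```

Because the predicate is only read and written while holding the mutex, it does not matter whether the opener thread runs before or after the waiter reaches pthread_cond_wait().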
Henrik Gramner
b3f5e8cef5
thread_task: Replace goto's with a regular while-loop
2023-12-19 13:15:43 +01:00
Henrik Gramner
8ba0df8492
checkasm: Fix cdef_dir function prototype
2023-12-19 12:11:46 +01:00
Martin Storsjö
5149b27447
checkasm: Map SIGBUS to the right error text
...
This was missed in 2ef970a885.
Also print this text for EXCEPTION_IN_PAGE_ERROR on Windows.
2023-12-15 14:10:01 +02:00
Henrik Gramner
b3779b89c0
x86: Add high bit-depth ipred z1 AVX-512 (Ice Lake) asm
2023-12-11 14:15:30 +01:00
Henrik Gramner
0a8d66402e
x86: Require fast gathers for AVX-512 horizontal loopfilters
...
Prefer using the AVX2 implementations (which don't use gathers) on Zen 4.
2023-12-08 16:21:13 +01:00
Henrik Gramner
a04a724719
x86: Require fast gathers for high bit-depth AVX-512 film grain
...
Prefer using the SSSE3 implementations on Zen 4.
2023-12-08 16:21:13 +01:00
Henrik Gramner
0e438e70fa
x86: Require fast gathers for AVX-512 mc resize and warp
...
Prefer using the AVX2 implementations (which don't use gathers) on Zen 4.
2023-12-08 16:21:13 +01:00
Henrik Gramner
ec05e9b978
x86: Flag Zen 4 as having slow gathers
2023-12-08 15:34:16 +01:00
Henrik Gramner
3c41fa88ce
x86: Add 8-bit ipred z2 AVX-512 (Ice Lake) asm
2023-11-13 13:05:58 +01:00
Henrik Gramner
e47a39ca95
x86: Fix 8bpc AVX2 ipred_z2 filtering with extremely large frame sizes
...
The max_width/max_height values can exceed 16-bit range.
2023-11-12 22:52:18 +01:00
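To illustrate why a value outside the 16-bit range is a problem, as in the ipred_z2 commit above, the sketch below models what a signed 16-bit lane keeps from a 32-bit value and contrasts it with clamping before narrowing. `lane16` and `clamp16` are hypothetical helpers; the source does not say how the actual fix was implemented, and clamping is shown only as one generic remedy.

```c
#include <stdint.h>

/* Model of storing a 32-bit value in a signed 16-bit lane: only the low
 * 16 bits survive and are reinterpreted as a two's-complement number. */
int lane16(int32_t v) {
    uint16_t u = (uint16_t)v; /* well-defined reduction modulo 2^16 */
    return u >= 0x8000 ? (int)u - 0x10000 : (int)u;
}

/* Saturate to the int16_t range first, so large limits stay large. */
int clamp16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int)v;
}
```

For example, a limit of 40000 wraps to a negative number in `lane16`, which would flip the outcome of any comparison against it, whereas `clamp16` keeps it at the largest representable value.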
Martin Storsjö
2179b30c84
checkasm: Fix catching crashes on Windows on ARM
...
longjmp on Windows uses SEH to unwind on ARM/ARM64 too, just like on
x86_64, thus use RtlCaptureContext/RtlRestoreContext instead of
setjmp/longjmp on those architectures as well.
2023-11-01 19:28:07 +02:00
Henrik Gramner
d2ee43892b
checkasm: Improve DSP trimming error message
2023-11-01 14:43:19 +01:00
Henrik Gramner
611abc20db
checkasm: Add missing WINAPI_PARTITION checks on Windows
...
Some functionality is only available on WINAPI_PARTITION_DESKTOP systems.
2023-11-01 14:43:19 +01:00
Henrik Gramner
6bc552eb28
checkasm: Enable virtual terminal processing on Windows
...
This allows for the use of standard VT100 escape codes for text coloring,
which simplifies things by eliminating a bunch of Windows-specific code.
This is only supported since Windows 10. Things will still run on
older systems, just without colored text output.
2023-11-01 14:43:18 +01:00
Henrik Gramner
0f2a877e7e
checkasm: Check for errors in command line parsing
2023-11-01 13:59:46 +01:00
Henrik Gramner
9dbf46285d
ci: Fix test-debian-asan running checkasm with non-existing arguments
2023-11-01 13:59:46 +01:00
Matthias Dressel
48ef395920
CI: Update images
2023-10-24 20:27:33 +02:00
Henrik Gramner
fd4ecc2fd8
x86: Add 8-bit ipred z3 AVX-512 (Ice Lake) asm
2023-10-19 17:00:20 +02:00
Ronald S. Bultje
47107e384b
deblock_avx512: convert byte-shifts to gf2p8affineqb
2023-10-05 17:24:34 +00:00
Henrik Gramner
4c012978fb
x86: Add 8-bit ipred z1 AVX-512 (Ice Lake) asm
2023-10-04 11:49:57 +02:00
Henrik Gramner
8936bab7ba
x86: Consolidate some pb_0to31 and pb_0to63 constants
2023-10-04 11:49:43 +02:00
Jean-Baptiste Kempf
48035599cd
Prepare for release 1.3.0
2023-10-03 17:36:52 +02:00
André Kempe
769bd1457a
fix: various errors in implementation of BTI
...
Amend call type in refmvs. Because these blocks are reached via
blr x11, they need to be annotated.
Add missing BTI landing pads in ipred.S and ipred16.S. Because the
subroutines are called via a br from register, they need annotation with
'bti j' (AARCH64_VALID_JUMP_TARGET).
2023-09-08 10:02:06 +01:00
Henrik Gramner
97becd7372
Use the correct free() function on dav1d_mem_pool_init() failure
2023-08-18 17:41:50 +02:00