Commit Graph

2502 Commits

Author SHA1 Message Date
yuanhecai fbefb34ae9 loongarch: Improve one function in itx_8bpc.add_8x32 series
1. inv_txfm_add_dct_dct_8x32

Relative speedup over C code:

inv_txfm_add_8x32_dct_dct_0_8bpc_c:                  33.3 ( 1.00x)
inv_txfm_add_8x32_dct_dct_0_8bpc_lsx:                 2.1 (15.58x)
inv_txfm_add_8x32_dct_dct_1_8bpc_c:                 311.1 ( 1.00x)
inv_txfm_add_8x32_dct_dct_1_8bpc_lsx:                24.9 (12.49x)
inv_txfm_add_8x32_dct_dct_2_8bpc_c:                 308.4 ( 1.00x)
inv_txfm_add_8x32_dct_dct_2_8bpc_lsx:                24.9 (12.37x)
inv_txfm_add_8x32_dct_dct_3_8bpc_c:                 309.3 ( 1.00x)
inv_txfm_add_8x32_dct_dct_3_8bpc_lsx:                25.0 (12.37x)
inv_txfm_add_8x32_dct_dct_4_8bpc_c:                 308.4 ( 1.00x)
inv_txfm_add_8x32_dct_dct_4_8bpc_lsx:                25.0 (12.35x)
2024-01-21 15:31:46 +08:00
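
The benchmark tables in these commit messages all share one line shape: a function name with a `_c`, `_lsx`, or `_lasx` suffix, a timing, and a speedup relative to the C version in parentheses. The following is a minimal sketch, not part of the repository, of how such lines could be parsed and the speedup recomputed. The helper names `parse_bench` and `recomputed_speedups` are hypothetical. A ratio recomputed from the printed timings will usually differ slightly from the printed ratio (33.3 / 2.1 is about 15.86, while the table reports 15.58x), presumably because the printed timings are rounded to one decimal place.

```python
import re

# Matches lines such as "inv_txfm_add_8x32_dct_dct_0_8bpc_lsx:   2.1 (15.58x)".
# Group 1: base function name, group 2: implementation suffix,
# group 3: printed timing, group 4: printed speedup over C.
LINE_RE = re.compile(r"^(\w+)_(c|lsx|lasx):\s+([\d.]+)\s+\(\s*([\d.]+)x\)$")


def parse_bench(text):
    """Return {function: {variant: (time, reported_speedup)}} (hypothetical helper)."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            name, variant, time, ratio = m.groups()
            results.setdefault(name, {})[variant] = (float(time), float(ratio))
    return results


def recomputed_speedups(results):
    """Recompute C-time / SIMD-time from the printed (rounded) timings."""
    out = {}
    for name, variants in results.items():
        if "c" not in variants:
            continue  # no C baseline listed for this function
        c_time = variants["c"][0]
        for variant, (time, _reported) in variants.items():
            if variant != "c" and time > 0:
                out[(name, variant)] = c_time / time
    return out


SAMPLE = """\
inv_txfm_add_8x32_dct_dct_0_8bpc_c:                  33.3 ( 1.00x)
inv_txfm_add_8x32_dct_dct_0_8bpc_lsx:                 2.1 (15.58x)
"""

if __name__ == "__main__":
    parsed = parse_bench(SAMPLE)
    for (name, variant), ratio in recomputed_speedups(parsed).items():
        reported = parsed[name][variant][1]
        print(f"{name} [{variant}]: reported {reported:.2f}x, recomputed {ratio:.2f}x")
```

Because of that rounding, the printed ratio is the more trustworthy figure for small timings; the recomputed value is mainly useful as a sanity check when aggregating many entries.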
yuanhecai 8c32cde7c1 loongarch: Improve six functions in itx_8bpc.add_16x16 series
1. inv_txfm_add_dct_dct_16x16
2. inv_txfm_add_adst_adst_16x16
3. inv_txfm_add_adst_dct_16x16
4. inv_txfm_add_dct_adst_16x16
5. inv_txfm_add_flipadst_dct_16x16
6. inv_txfm_add_dct_flipadst_16x16

Relative speedup over C code:

inv_txfm_add_16x16_adst_adst_0_8bpc_c:              327.6 ( 1.00x)
inv_txfm_add_16x16_adst_adst_0_8bpc_lsx:             30.5 (10.74x)
inv_txfm_add_16x16_adst_adst_1_8bpc_c:              327.6 ( 1.00x)
inv_txfm_add_16x16_adst_adst_1_8bpc_lsx:             30.7 (10.67x)
inv_txfm_add_16x16_adst_adst_2_8bpc_c:              327.6 ( 1.00x)
inv_txfm_add_16x16_adst_adst_2_8bpc_lsx:             30.5 (10.73x)
inv_txfm_add_16x16_adst_dct_0_8bpc_c:               321.0 ( 1.00x)
inv_txfm_add_16x16_adst_dct_0_8bpc_lsx:              27.6 (11.64x)
inv_txfm_add_16x16_adst_dct_1_8bpc_c:               320.9 ( 1.00x)
inv_txfm_add_16x16_adst_dct_1_8bpc_lsx:              27.4 (11.70x)
inv_txfm_add_16x16_adst_dct_2_8bpc_c:               320.8 ( 1.00x)
inv_txfm_add_16x16_adst_dct_2_8bpc_lsx:              27.5 (11.67x)
inv_txfm_add_16x16_dct_adst_0_8bpc_c:               321.1 ( 1.00x)
inv_txfm_add_16x16_dct_adst_0_8bpc_lsx:              27.1 (11.85x)
inv_txfm_add_16x16_dct_adst_1_8bpc_c:               321.1 ( 1.00x)
inv_txfm_add_16x16_dct_adst_1_8bpc_lsx:              27.2 (11.80x)
inv_txfm_add_16x16_dct_adst_2_8bpc_c:               329.4 ( 1.00x)
inv_txfm_add_16x16_dct_adst_2_8bpc_lsx:              27.2 (12.10x)
inv_txfm_add_16x16_dct_dct_0_8bpc_c:                 31.9 ( 1.00x)
inv_txfm_add_16x16_dct_dct_0_8bpc_lsx:                1.8 (18.14x)
inv_txfm_add_16x16_dct_dct_1_8bpc_c:                314.3 ( 1.00x)
inv_txfm_add_16x16_dct_dct_1_8bpc_lsx:               23.9 (13.16x)
inv_txfm_add_16x16_dct_dct_2_8bpc_c:                314.3 ( 1.00x)
inv_txfm_add_16x16_dct_dct_2_8bpc_lsx:               24.1 (13.05x)
inv_txfm_add_16x16_dct_flipadst_0_8bpc_c:           321.0 ( 1.00x)
inv_txfm_add_16x16_dct_flipadst_0_8bpc_lsx:          27.1 (11.83x)
inv_txfm_add_16x16_dct_flipadst_1_8bpc_c:           321.0 ( 1.00x)
inv_txfm_add_16x16_dct_flipadst_1_8bpc_lsx:          27.1 (11.84x)
inv_txfm_add_16x16_dct_flipadst_2_8bpc_c:           327.7 ( 1.00x)
inv_txfm_add_16x16_dct_flipadst_2_8bpc_lsx:          27.1 (12.07x)
inv_txfm_add_16x16_flipadst_dct_0_8bpc_c:           322.6 ( 1.00x)
inv_txfm_add_16x16_flipadst_dct_0_8bpc_lsx:          28.1 (11.49x)
inv_txfm_add_16x16_flipadst_dct_1_8bpc_c:           322.5 ( 1.00x)
inv_txfm_add_16x16_flipadst_dct_1_8bpc_lsx:          28.1 (11.48x)
inv_txfm_add_16x16_flipadst_dct_2_8bpc_c:           322.7 ( 1.00x)
inv_txfm_add_16x16_flipadst_dct_2_8bpc_lsx:          28.0 (11.53x)
2024-01-21 15:31:46 +08:00
yuanhecai 233be20140 loongarch: Improve one function in itx_8bpc.add_4x8 series
1. inv_txfm_add_dct_dct_4x8

Relative speedup over C code:

inv_txfm_add_4x8_dct_dct_0_8bpc_c:                    5.7 ( 1.00x)
inv_txfm_add_4x8_dct_dct_0_8bpc_lsx:                  0.8 ( 7.12x)
inv_txfm_add_4x8_dct_dct_1_8bpc_c:                   34.5 ( 1.00x)
inv_txfm_add_4x8_dct_dct_1_8bpc_lsx:                  3.0 (11.64x)
2024-01-21 15:31:46 +08:00
yuanhecai 8626d9f9a6 loongarch: Improve two functions in itx_8bpc.add_16x8 series
1. inv_txfm_add_dct_dct_16x8
2. inv_txfm_add_adst_dct_16x8

Relative speedup over C code:

inv_txfm_add_16x8_adst_dct_0_8bpc_c:                152.1 ( 1.00x)
inv_txfm_add_16x8_adst_dct_0_8bpc_lsx:               13.7 (11.08x)
inv_txfm_add_16x8_adst_dct_1_8bpc_c:                152.1 ( 1.00x)
inv_txfm_add_16x8_adst_dct_1_8bpc_lsx:               13.7 (11.07x)
inv_txfm_add_16x8_adst_dct_2_8bpc_c:                152.1 ( 1.00x)
inv_txfm_add_16x8_adst_dct_2_8bpc_lsx:               13.7 (11.08x)
inv_txfm_add_16x8_dct_dct_0_8bpc_c:                  17.0 ( 1.00x)
inv_txfm_add_16x8_dct_dct_0_8bpc_lsx:                 1.3 (13.10x)
inv_txfm_add_16x8_dct_dct_1_8bpc_c:                 147.5 ( 1.00x)
inv_txfm_add_16x8_dct_dct_1_8bpc_lsx:                11.6 (12.73x)
inv_txfm_add_16x8_dct_dct_2_8bpc_c:                 147.5 ( 1.00x)
inv_txfm_add_16x8_dct_dct_2_8bpc_lsx:                11.6 (12.74x)
2024-01-21 15:31:46 +08:00
yuanhecai 5ebe32283e loongarch: Improve four functions in itx_8bpc.add_8x16 series
1. inv_txfm_add_dct_dct_8x16
2. inv_txfm_add_identity_identity_8x16
3. inv_txfm_add_adst_dct_8x16
4. inv_txfm_add_dct_adst_8x16

Relative speedup over C code:

inv_txfm_add_8x16_adst_dct_0_8bpc_c:                151.0 ( 1.00x)
inv_txfm_add_8x16_adst_dct_0_8bpc_lsx:               14.9 (10.10x)
inv_txfm_add_8x16_adst_dct_1_8bpc_c:                151.0 ( 1.00x)
inv_txfm_add_8x16_adst_dct_1_8bpc_lsx:               15.0 (10.10x)
inv_txfm_add_8x16_adst_dct_2_8bpc_c:                151.0 ( 1.00x)
inv_txfm_add_8x16_adst_dct_2_8bpc_lsx:               15.0 (10.10x)
inv_txfm_add_8x16_dct_adst_0_8bpc_c:                157.3 ( 1.00x)
inv_txfm_add_8x16_dct_adst_0_8bpc_lsx:               13.6 (11.59x)
inv_txfm_add_8x16_dct_adst_1_8bpc_c:                154.9 ( 1.00x)
inv_txfm_add_8x16_dct_adst_1_8bpc_lsx:               13.6 (11.38x)
inv_txfm_add_8x16_dct_adst_2_8bpc_c:                154.8 ( 1.00x)
inv_txfm_add_8x16_dct_adst_2_8bpc_lsx:               13.5 (11.46x)
inv_txfm_add_8x16_dct_dct_0_8bpc_c:                  17.8 ( 1.00x)
inv_txfm_add_8x16_dct_dct_0_8bpc_lsx:                 1.5 (11.75x)
inv_txfm_add_8x16_dct_dct_1_8bpc_c:                 149.4 ( 1.00x)
inv_txfm_add_8x16_dct_dct_1_8bpc_lsx:                12.0 (12.49x)
inv_txfm_add_8x16_dct_dct_2_8bpc_c:                 159.5 ( 1.00x)
inv_txfm_add_8x16_dct_dct_2_8bpc_lsx:                12.0 (13.33x)
inv_txfm_add_8x16_identity_identity_0_8bpc_c:        75.0 ( 1.00x)
inv_txfm_add_8x16_identity_identity_0_8bpc_lsx:       6.0 (12.50x)
inv_txfm_add_8x16_identity_identity_1_8bpc_c:        67.4 ( 1.00x)
inv_txfm_add_8x16_identity_identity_1_8bpc_lsx:       6.0 (11.26x)
inv_txfm_add_8x16_identity_identity_2_8bpc_c:        66.7 ( 1.00x)
inv_txfm_add_8x16_identity_identity_2_8bpc_lsx:       5.9 (11.40x)
2024-01-21 15:31:46 +08:00
yuanhecai 32809a0222 loongarch: Improve the performance of itx_8bpc.add_8x8 series functions
Relative speedup over C code:

inv_txfm_add_8x8_adst_adst_0_8bpc_c:                 70.1 ( 1.00x)
inv_txfm_add_8x8_adst_adst_0_8bpc_lsx:                9.4 ( 7.45x)
inv_txfm_add_8x8_adst_adst_1_8bpc_c:                 70.1 ( 1.00x)
inv_txfm_add_8x8_adst_adst_1_8bpc_lsx:                9.4 ( 7.43x)
inv_txfm_add_8x8_adst_dct_0_8bpc_c:                  68.7 ( 1.00x)
inv_txfm_add_8x8_adst_dct_0_8bpc_lsx:                 7.6 ( 9.08x)
inv_txfm_add_8x8_adst_dct_1_8bpc_c:                  68.7 ( 1.00x)
inv_txfm_add_8x8_adst_dct_1_8bpc_lsx:                 7.6 ( 9.00x)
inv_txfm_add_8x8_adst_flipadst_0_8bpc_c:             70.3 ( 1.00x)
inv_txfm_add_8x8_adst_flipadst_0_8bpc_lsx:            9.4 ( 7.47x)
inv_txfm_add_8x8_adst_flipadst_1_8bpc_c:             70.3 ( 1.00x)
inv_txfm_add_8x8_adst_flipadst_1_8bpc_lsx:            9.4 ( 7.47x)
inv_txfm_add_8x8_adst_identity_0_8bpc_c:             50.6 ( 1.00x)
inv_txfm_add_8x8_adst_identity_0_8bpc_lsx:            5.7 ( 8.88x)
inv_txfm_add_8x8_adst_identity_1_8bpc_c:             49.8 ( 1.00x)
inv_txfm_add_8x8_adst_identity_1_8bpc_lsx:            5.7 ( 8.73x)
inv_txfm_add_8x8_dct_adst_0_8bpc_c:                  67.9 ( 1.00x)
inv_txfm_add_8x8_dct_adst_0_8bpc_lsx:                 7.5 ( 9.05x)
inv_txfm_add_8x8_dct_adst_1_8bpc_c:                  67.9 ( 1.00x)
inv_txfm_add_8x8_dct_adst_1_8bpc_lsx:                 7.4 ( 9.13x)
inv_txfm_add_8x8_dct_dct_0_8bpc_c:                    9.1 ( 1.00x)
inv_txfm_add_8x8_dct_dct_0_8bpc_lsx:                  0.8 (11.20x)
inv_txfm_add_8x8_dct_dct_1_8bpc_c:                   66.5 ( 1.00x)
inv_txfm_add_8x8_dct_dct_1_8bpc_lsx:                  5.0 (13.42x)
inv_txfm_add_8x8_dct_flipadst_0_8bpc_c:              67.9 ( 1.00x)
inv_txfm_add_8x8_dct_flipadst_0_8bpc_lsx:             7.5 ( 9.06x)
inv_txfm_add_8x8_dct_flipadst_1_8bpc_c:              67.9 ( 1.00x)
inv_txfm_add_8x8_dct_flipadst_1_8bpc_lsx:             7.5 ( 9.06x)
inv_txfm_add_8x8_dct_identity_0_8bpc_c:              47.3 ( 1.00x)
inv_txfm_add_8x8_dct_identity_0_8bpc_lsx:             3.7 (12.70x)
inv_txfm_add_8x8_dct_identity_1_8bpc_c:              47.3 ( 1.00x)
inv_txfm_add_8x8_dct_identity_1_8bpc_lsx:             3.7 (12.70x)
inv_txfm_add_8x8_flipadst_adst_0_8bpc_c:             70.3 ( 1.00x)
inv_txfm_add_8x8_flipadst_adst_0_8bpc_lsx:            9.6 ( 7.35x)
inv_txfm_add_8x8_flipadst_adst_1_8bpc_c:             70.3 ( 1.00x)
inv_txfm_add_8x8_flipadst_adst_1_8bpc_lsx:            9.6 ( 7.33x)
inv_txfm_add_8x8_flipadst_dct_0_8bpc_c:              68.9 ( 1.00x)
inv_txfm_add_8x8_flipadst_dct_0_8bpc_lsx:             7.6 ( 9.10x)
inv_txfm_add_8x8_flipadst_dct_1_8bpc_c:              68.9 ( 1.00x)
inv_txfm_add_8x8_flipadst_dct_1_8bpc_lsx:             7.6 ( 9.11x)
inv_txfm_add_8x8_flipadst_flipadst_0_8bpc_c:         70.4 ( 1.00x)
inv_txfm_add_8x8_flipadst_flipadst_0_8bpc_lsx:        9.6 ( 7.32x)
inv_txfm_add_8x8_flipadst_flipadst_1_8bpc_c:         70.4 ( 1.00x)
inv_txfm_add_8x8_flipadst_flipadst_1_8bpc_lsx:        9.6 ( 7.34x)
inv_txfm_add_8x8_flipadst_identity_0_8bpc_c:         49.9 ( 1.00x)
inv_txfm_add_8x8_flipadst_identity_0_8bpc_lsx:        5.6 ( 8.91x)
inv_txfm_add_8x8_flipadst_identity_1_8bpc_c:         49.9 ( 1.00x)
inv_txfm_add_8x8_flipadst_identity_1_8bpc_lsx:        5.6 ( 8.91x)
inv_txfm_add_8x8_identity_adst_0_8bpc_c:             51.3 ( 1.00x)
inv_txfm_add_8x8_identity_adst_0_8bpc_lsx:            5.5 ( 9.28x)
inv_txfm_add_8x8_identity_adst_1_8bpc_c:             51.3 ( 1.00x)
inv_txfm_add_8x8_identity_adst_1_8bpc_lsx:            5.5 ( 9.28x)
inv_txfm_add_8x8_identity_dct_0_8bpc_c:              50.5 ( 1.00x)
inv_txfm_add_8x8_identity_dct_0_8bpc_lsx:             3.6 (13.83x)
inv_txfm_add_8x8_identity_dct_1_8bpc_c:              50.6 ( 1.00x)
inv_txfm_add_8x8_identity_dct_1_8bpc_lsx:             3.6 (13.87x)
inv_txfm_add_8x8_identity_flipadst_0_8bpc_c:         52.0 ( 1.00x)
inv_txfm_add_8x8_identity_flipadst_0_8bpc_lsx:        5.5 ( 9.40x)
inv_txfm_add_8x8_identity_flipadst_1_8bpc_c:         52.0 ( 1.00x)
inv_txfm_add_8x8_identity_flipadst_1_8bpc_lsx:        5.5 ( 9.39x)
inv_txfm_add_8x8_identity_identity_0_8bpc_c:         31.1 ( 1.00x)
inv_txfm_add_8x8_identity_identity_0_8bpc_lsx:        1.8 (17.06x)
inv_txfm_add_8x8_identity_identity_1_8bpc_c:         31.1 ( 1.00x)
inv_txfm_add_8x8_identity_identity_1_8bpc_lsx:        1.8 (16.97x)
2024-01-21 15:31:46 +08:00
yuanhecai 951646ce56 loongarch: Improve the performance of itx_8bpc.add_8x4 series functions
Relative speedup over C code:

inv_txfm_add_8x4_adst_adst_0_8bpc_c:                 32.0 ( 1.00x)
inv_txfm_add_8x4_adst_adst_0_8bpc_lsx:                4.1 ( 7.87x)
inv_txfm_add_8x4_adst_adst_1_8bpc_c:                 32.3 ( 1.00x)
inv_txfm_add_8x4_adst_adst_1_8bpc_lsx:                4.1 ( 7.92x)
inv_txfm_add_8x4_adst_dct_0_8bpc_c:                  33.7 ( 1.00x)
inv_txfm_add_8x4_adst_dct_0_8bpc_lsx:                 3.8 ( 8.77x)
inv_txfm_add_8x4_adst_dct_1_8bpc_c:                  33.1 ( 1.00x)
inv_txfm_add_8x4_adst_dct_1_8bpc_lsx:                 3.8 ( 8.63x)
inv_txfm_add_8x4_adst_flipadst_0_8bpc_c:             32.7 ( 1.00x)
inv_txfm_add_8x4_adst_flipadst_0_8bpc_lsx:            4.1 ( 7.99x)
inv_txfm_add_8x4_adst_flipadst_1_8bpc_c:             32.8 ( 1.00x)
inv_txfm_add_8x4_adst_flipadst_1_8bpc_lsx:            4.0 ( 8.16x)
inv_txfm_add_8x4_adst_identity_0_8bpc_c:             31.2 ( 1.00x)
inv_txfm_add_8x4_adst_identity_0_8bpc_lsx:            3.8 ( 8.29x)
inv_txfm_add_8x4_adst_identity_1_8bpc_c:             28.7 ( 1.00x)
inv_txfm_add_8x4_adst_identity_1_8bpc_lsx:            3.7 ( 7.78x)
inv_txfm_add_8x4_dct_adst_0_8bpc_c:                  32.0 ( 1.00x)
inv_txfm_add_8x4_dct_adst_0_8bpc_lsx:                 3.0 (10.76x)
inv_txfm_add_8x4_dct_adst_1_8bpc_c:                  31.5 ( 1.00x)
inv_txfm_add_8x4_dct_adst_1_8bpc_lsx:                 2.8 (11.46x)
inv_txfm_add_8x4_dct_dct_0_8bpc_c:                    5.5 ( 1.00x)
inv_txfm_add_8x4_dct_dct_0_8bpc_lsx:                  0.6 ( 9.22x)
inv_txfm_add_8x4_dct_dct_1_8bpc_c:                   33.1 ( 1.00x)
inv_txfm_add_8x4_dct_dct_1_8bpc_lsx:                  2.8 (11.89x)
inv_txfm_add_8x4_dct_flipadst_0_8bpc_c:              32.4 ( 1.00x)
inv_txfm_add_8x4_dct_flipadst_0_8bpc_lsx:             3.0 (10.81x)
inv_txfm_add_8x4_dct_flipadst_1_8bpc_c:              32.4 ( 1.00x)
inv_txfm_add_8x4_dct_flipadst_1_8bpc_lsx:             3.0 (10.81x)
inv_txfm_add_8x4_dct_identity_0_8bpc_c:              27.9 ( 1.00x)
inv_txfm_add_8x4_dct_identity_0_8bpc_lsx:             2.7 (10.35x)
inv_txfm_add_8x4_dct_identity_1_8bpc_c:              28.5 ( 1.00x)
inv_txfm_add_8x4_dct_identity_1_8bpc_lsx:             2.7 (10.53x)
inv_txfm_add_8x4_flipadst_adst_0_8bpc_c:             32.2 ( 1.00x)
inv_txfm_add_8x4_flipadst_adst_0_8bpc_lsx:            4.1 ( 7.86x)
inv_txfm_add_8x4_flipadst_adst_1_8bpc_c:             32.2 ( 1.00x)
inv_txfm_add_8x4_flipadst_adst_1_8bpc_lsx:            4.0 ( 7.95x)
inv_txfm_add_8x4_flipadst_dct_0_8bpc_c:              33.6 ( 1.00x)
inv_txfm_add_8x4_flipadst_dct_0_8bpc_lsx:             3.8 ( 8.73x)
inv_txfm_add_8x4_flipadst_dct_1_8bpc_c:              33.6 ( 1.00x)
inv_txfm_add_8x4_flipadst_dct_1_8bpc_lsx:             3.8 ( 8.74x)
inv_txfm_add_8x4_flipadst_flipadst_0_8bpc_c:         32.6 ( 1.00x)
inv_txfm_add_8x4_flipadst_flipadst_0_8bpc_lsx:        4.0 ( 8.16x)
inv_txfm_add_8x4_flipadst_flipadst_1_8bpc_c:         32.6 ( 1.00x)
inv_txfm_add_8x4_flipadst_flipadst_1_8bpc_lsx:        4.0 ( 8.15x)
inv_txfm_add_8x4_flipadst_identity_0_8bpc_c:         28.7 ( 1.00x)
inv_txfm_add_8x4_flipadst_identity_0_8bpc_lsx:        3.8 ( 7.64x)
inv_txfm_add_8x4_flipadst_identity_1_8bpc_c:         28.7 ( 1.00x)
inv_txfm_add_8x4_flipadst_identity_1_8bpc_lsx:        3.8 ( 7.55x)
inv_txfm_add_8x4_identity_adst_0_8bpc_c:             21.9 ( 1.00x)
inv_txfm_add_8x4_identity_adst_0_8bpc_lsx:            1.9 (11.81x)
inv_txfm_add_8x4_identity_adst_1_8bpc_c:             26.9 ( 1.00x)
inv_txfm_add_8x4_identity_adst_1_8bpc_lsx:            1.9 (14.39x)
inv_txfm_add_8x4_identity_dct_0_8bpc_c:              23.3 ( 1.00x)
inv_txfm_add_8x4_identity_dct_0_8bpc_lsx:             1.7 (13.53x)
inv_txfm_add_8x4_identity_dct_1_8bpc_c:              23.3 ( 1.00x)
inv_txfm_add_8x4_identity_dct_1_8bpc_lsx:             1.7 (13.53x)
inv_txfm_add_8x4_identity_flipadst_0_8bpc_c:         22.3 ( 1.00x)
inv_txfm_add_8x4_identity_flipadst_0_8bpc_lsx:        1.9 (11.46x)
inv_txfm_add_8x4_identity_flipadst_1_8bpc_c:         23.4 ( 1.00x)
inv_txfm_add_8x4_identity_flipadst_1_8bpc_lsx:        1.9 (12.02x)
inv_txfm_add_8x4_identity_identity_0_8bpc_c:         18.5 ( 1.00x)
inv_txfm_add_8x4_identity_identity_0_8bpc_lsx:        1.6 (11.23x)
inv_txfm_add_8x4_identity_identity_1_8bpc_c:         18.5 ( 1.00x)
inv_txfm_add_8x4_identity_identity_1_8bpc_lsx:        1.6 (11.57x)
2024-01-21 15:31:46 +08:00
yuanhecai a4cd834991 loongarch: Improve the performance of itx_8bpc.add_4x4 series functions
Relative speedup over C code:

inv_txfm_add_4x4_adst_adst_0_8bpc_c:                 14.1 ( 1.00x)
inv_txfm_add_4x4_adst_adst_0_8bpc_lsx:                1.3 (11.16x)
inv_txfm_add_4x4_adst_adst_1_8bpc_c:                 14.1 ( 1.00x)
inv_txfm_add_4x4_adst_adst_1_8bpc_lsx:                1.3 (11.17x)
inv_txfm_add_4x4_adst_dct_0_8bpc_c:                  14.8 ( 1.00x)
inv_txfm_add_4x4_adst_dct_0_8bpc_lsx:                 1.4 (10.99x)
inv_txfm_add_4x4_adst_dct_1_8bpc_c:                  14.9 ( 1.00x)
inv_txfm_add_4x4_adst_dct_1_8bpc_lsx:                 1.3 (11.42x)
inv_txfm_add_4x4_adst_flipadst_0_8bpc_c:             14.4 ( 1.00x)
inv_txfm_add_4x4_adst_flipadst_0_8bpc_lsx:            1.2 (11.52x)
inv_txfm_add_4x4_adst_flipadst_1_8bpc_c:             14.4 ( 1.00x)
inv_txfm_add_4x4_adst_flipadst_1_8bpc_lsx:            1.2 (11.52x)
inv_txfm_add_4x4_adst_identity_0_8bpc_c:             12.5 ( 1.00x)
inv_txfm_add_4x4_adst_identity_0_8bpc_lsx:            1.2 (10.22x)
inv_txfm_add_4x4_adst_identity_1_8bpc_c:             12.5 ( 1.00x)
inv_txfm_add_4x4_adst_identity_1_8bpc_lsx:            1.2 (10.26x)
inv_txfm_add_4x4_dct_adst_0_8bpc_c:                  14.6 ( 1.00x)
inv_txfm_add_4x4_dct_adst_0_8bpc_lsx:                 1.3 (11.37x)
inv_txfm_add_4x4_dct_adst_1_8bpc_c:                  14.6 ( 1.00x)
inv_txfm_add_4x4_dct_adst_1_8bpc_lsx:                 1.3 (11.55x)
inv_txfm_add_4x4_dct_dct_0_8bpc_c:                    3.2 ( 1.00x)
inv_txfm_add_4x4_dct_dct_0_8bpc_lsx:                  0.5 ( 6.28x)
inv_txfm_add_4x4_dct_dct_1_8bpc_c:                   15.4 ( 1.00x)
inv_txfm_add_4x4_dct_dct_1_8bpc_lsx:                  1.2 (13.19x)
inv_txfm_add_4x4_dct_flipadst_0_8bpc_c:              15.0 ( 1.00x)
inv_txfm_add_4x4_dct_flipadst_0_8bpc_lsx:             1.3 (11.73x)
inv_txfm_add_4x4_dct_flipadst_1_8bpc_c:              15.0 ( 1.00x)
inv_txfm_add_4x4_dct_flipadst_1_8bpc_lsx:             1.3 (11.72x)
inv_txfm_add_4x4_dct_identity_0_8bpc_c:              13.0 ( 1.00x)
inv_txfm_add_4x4_dct_identity_0_8bpc_lsx:             1.1 (12.36x)
inv_txfm_add_4x4_dct_identity_1_8bpc_c:              13.0 ( 1.00x)
inv_txfm_add_4x4_dct_identity_1_8bpc_lsx:             1.0 (12.36x)
inv_txfm_add_4x4_flipadst_adst_0_8bpc_c:             14.2 ( 1.00x)
inv_txfm_add_4x4_flipadst_adst_0_8bpc_lsx:            1.3 (11.00x)
inv_txfm_add_4x4_flipadst_adst_1_8bpc_c:             14.2 ( 1.00x)
inv_txfm_add_4x4_flipadst_adst_1_8bpc_lsx:            1.3 (11.03x)
inv_txfm_add_4x4_flipadst_dct_0_8bpc_c:              15.0 ( 1.00x)
inv_txfm_add_4x4_flipadst_dct_0_8bpc_lsx:             1.3 (11.43x)
inv_txfm_add_4x4_flipadst_dct_1_8bpc_c:              15.0 ( 1.00x)
inv_txfm_add_4x4_flipadst_dct_1_8bpc_lsx:             1.3 (11.44x)
inv_txfm_add_4x4_flipadst_flipadst_0_8bpc_c:         14.5 ( 1.00x)
inv_txfm_add_4x4_flipadst_flipadst_0_8bpc_lsx:        1.3 (11.60x)
inv_txfm_add_4x4_flipadst_flipadst_1_8bpc_c:         14.5 ( 1.00x)
inv_txfm_add_4x4_flipadst_flipadst_1_8bpc_lsx:        1.2 (11.61x)
inv_txfm_add_4x4_flipadst_identity_0_8bpc_c:         12.5 ( 1.00x)
inv_txfm_add_4x4_flipadst_identity_0_8bpc_lsx:        1.1 (11.01x)
inv_txfm_add_4x4_flipadst_identity_1_8bpc_c:         12.5 ( 1.00x)
inv_txfm_add_4x4_flipadst_identity_1_8bpc_lsx:        1.1 (10.99x)
inv_txfm_add_4x4_identity_adst_0_8bpc_c:             12.1 ( 1.00x)
inv_txfm_add_4x4_identity_adst_0_8bpc_lsx:            1.1 (11.50x)
inv_txfm_add_4x4_identity_adst_1_8bpc_c:             12.1 ( 1.00x)
inv_txfm_add_4x4_identity_adst_1_8bpc_lsx:            1.1 (10.98x)
inv_txfm_add_4x4_identity_dct_0_8bpc_c:              12.9 ( 1.00x)
inv_txfm_add_4x4_identity_dct_0_8bpc_lsx:             1.0 (12.95x)
inv_txfm_add_4x4_identity_dct_1_8bpc_c:              13.0 ( 1.00x)
inv_txfm_add_4x4_identity_dct_1_8bpc_lsx:             1.0 (12.97x)
inv_txfm_add_4x4_identity_flipadst_0_8bpc_c:         12.4 ( 1.00x)
inv_txfm_add_4x4_identity_flipadst_0_8bpc_lsx:        1.1 (11.26x)
inv_txfm_add_4x4_identity_flipadst_1_8bpc_c:         12.4 ( 1.00x)
inv_txfm_add_4x4_identity_flipadst_1_8bpc_lsx:        1.1 (11.32x)
inv_txfm_add_4x4_identity_identity_0_8bpc_c:         10.6 ( 1.00x)
inv_txfm_add_4x4_identity_identity_0_8bpc_lsx:        0.9 (11.45x)
inv_txfm_add_4x4_identity_identity_1_8bpc_c:         10.6 ( 1.00x)
inv_txfm_add_4x4_identity_identity_1_8bpc_lsx:        0.9 (11.78x)
inv_txfm_add_4x4_wht_wht_0_8bpc_c:                    4.1 ( 1.00x)
inv_txfm_add_4x4_wht_wht_0_8bpc_lsx:                  0.6 ( 6.84x)
inv_txfm_add_4x4_wht_wht_1_8bpc_c:                    4.1 ( 1.00x)
inv_txfm_add_4x4_wht_wht_1_8bpc_lsx:                  0.6 ( 6.83x)
2024-01-21 15:31:46 +08:00
yuanhecai 14df65f217 loongarch: Improve the performance of refmvs.splat_mv function
Relative speedup over C code:

splat_mv_w1_c:                           0.6 ( 1.00x)
splat_mv_w1_lsx:                         0.4 ( 1.28x)
splat_mv_w2_c:                           0.9 ( 1.00x)
splat_mv_w2_lsx:                         0.6 ( 1.65x)
splat_mv_w4_c:                           2.2 ( 1.00x)
splat_mv_w4_lsx:                         0.8 ( 2.87x)
splat_mv_w8_c:                           7.7 ( 1.00x)
splat_mv_w8_lsx:                         2.0 ( 3.80x)
splat_mv_w16_c:                         19.1 ( 1.00x)
splat_mv_w16_lsx:                        4.6 ( 4.18x)
splat_mv_w32_c:                         49.0 ( 1.00x)
splat_mv_w32_lsx:                       10.3 ( 4.76x)
2024-01-21 15:31:46 +08:00
jinbo 38bc00849a loongarch: Improve the performance of msac series functions
Relative speedup over C code:

msac_decode_bool_c:                            0.5 ( 1.00x)
msac_decode_bool_lsx:                          0.5 ( 1.09x)
msac_decode_bool_adapt_c:                      0.7 ( 1.00x)
msac_decode_bool_adapt_lsx:                    0.6 ( 1.20x)
msac_decode_symbol_adapt4_c:                   1.3 ( 1.00x)
msac_decode_symbol_adapt4_lsx:                 1.0 ( 1.30x)
msac_decode_symbol_adapt8_c:                   2.1 ( 1.00x)
msac_decode_symbol_adapt8_lsx:                 1.0 ( 2.05x)
msac_decode_symbol_adapt16_c:                  3.7 ( 1.00x)
msac_decode_symbol_adapt16_lsx:                0.8 ( 4.77x)
2024-01-21 15:31:46 +08:00
yuanhecai b98ea43379 loongarch: Improve the performance of looprestoration_8bpc series functions
Relative speedup over C code:

wiener_5tap_8bpc_c:                       13358.0 ( 1.00x)
wiener_5tap_8bpc_lsx:                      2484.7 ( 5.38x)
wiener_7tap_8bpc_c:                       13358.4 ( 1.00x)
wiener_7tap_8bpc_lsx:                      2486.4 ( 5.37x)
sgr_3x3_8bpc_c:                           18989.2 ( 1.00x)
sgr_3x3_8bpc_lsx:                          7981.6 ( 2.38x)
sgr_5x5_8bpc_c:                           17242.0 ( 1.00x)
sgr_5x5_8bpc_lsx:                          5735.5 ( 3.01x)
2024-01-21 15:31:46 +08:00
yuanhecai 78a776d253 loongarch: Improve the performance of loopfilter_8bpc series functions
Relative speedup over C code:

lpf_h_sb_uv_w4_8bpc_c:                       25.3 ( 1.00x)
lpf_h_sb_uv_w4_8bpc_lsx:                      6.7 ( 3.79x)
lpf_h_sb_uv_w6_8bpc_c:                       36.5 ( 1.00x)
lpf_h_sb_uv_w6_8bpc_lsx:                     11.0 ( 3.31x)
lpf_h_sb_y_w4_8bpc_c:                        47.7 ( 1.00x)
lpf_h_sb_y_w4_8bpc_lsx:                      12.5 ( 3.82x)
lpf_h_sb_y_w8_8bpc_c:                        81.9 ( 1.00x)
lpf_h_sb_y_w8_8bpc_lsx:                      22.2 ( 3.69x)
lpf_h_sb_y_w16_8bpc_c:                       85.1 ( 1.00x)
lpf_h_sb_y_w16_8bpc_lsx:                     18.1 ( 4.70x)
lpf_v_sb_uv_w4_8bpc_c:                       25.3 ( 1.00x)
lpf_v_sb_uv_w4_8bpc_lsx:                      5.7 ( 4.43x)
lpf_v_sb_uv_w6_8bpc_c:                       37.6 ( 1.00x)
lpf_v_sb_uv_w6_8bpc_lsx:                      9.5 ( 3.97x)
lpf_v_sb_y_w4_8bpc_c:                        59.4 ( 1.00x)
lpf_v_sb_y_w4_8bpc_lsx:                      15.7 ( 3.78x)
lpf_v_sb_y_w8_8bpc_c:                        94.5 ( 1.00x)
lpf_v_sb_y_w8_8bpc_lsx:                      29.4 ( 3.21x)
lpf_v_sb_y_w16_8bpc_c:                       97.8 ( 1.00x)
lpf_v_sb_y_w16_8bpc_lsx:                     36.3 ( 2.70x)
2024-01-21 15:31:46 +08:00
jinbo ae8756ed91 loongarch: Improve the performance of mc_8bpc.mct functions
Relative speedup over C code:

mct_8tap_regular_w4_0_8bpc_c:                      4.2 ( 1.00x)
mct_8tap_regular_w4_0_8bpc_lasx:                   0.5 ( 9.08x)
mct_8tap_regular_w4_h_8bpc_c:                     12.5 ( 1.00x)
mct_8tap_regular_w4_h_8bpc_lasx:                   1.6 ( 7.80x)
mct_8tap_regular_w4_hv_8bpc_c:                    33.5 ( 1.00x)
mct_8tap_regular_w4_hv_8bpc_lasx:                  6.0 ( 5.54x)
mct_8tap_regular_w4_v_8bpc_c:                     13.6 ( 1.00x)
mct_8tap_regular_w4_v_8bpc_lasx:                   2.2 ( 6.22x)
mct_8tap_regular_w8_0_8bpc_c:                     11.3 ( 1.00x)
mct_8tap_regular_w8_0_8bpc_lasx:                   0.7 (15.77x)
mct_8tap_regular_w8_h_8bpc_c:                     39.1 ( 1.00x)
mct_8tap_regular_w8_h_8bpc_lasx:                   4.7 ( 8.30x)
mct_8tap_regular_w8_hv_8bpc_c:                    90.9 ( 1.00x)
mct_8tap_regular_w8_hv_8bpc_lasx:                 17.2 ( 5.29x)
mct_8tap_regular_w8_v_8bpc_c:                     40.5 ( 1.00x)
mct_8tap_regular_w8_v_8bpc_lasx:                   6.9 ( 5.86x)
mct_8tap_regular_w16_0_8bpc_c:                    34.3 ( 1.00x)
mct_8tap_regular_w16_0_8bpc_lasx:                  1.3 (26.32x)
mct_8tap_regular_w16_h_8bpc_c:                   128.3 ( 1.00x)
mct_8tap_regular_w16_h_8bpc_lasx:                 20.5 ( 6.26x)
mct_8tap_regular_w16_hv_8bpc_c:                  273.5 ( 1.00x)
mct_8tap_regular_w16_hv_8bpc_lasx:                54.5 ( 5.02x)
mct_8tap_regular_w16_v_8bpc_c:                   129.7 ( 1.00x)
mct_8tap_regular_w16_v_8bpc_lasx:                 22.8 ( 5.69x)
mct_8tap_regular_w32_0_8bpc_c:                   133.7 ( 1.00x)
mct_8tap_regular_w32_0_8bpc_lasx:                  5.4 (24.65x)
mct_8tap_regular_w32_h_8bpc_c:                   511.4 ( 1.00x)
mct_8tap_regular_w32_h_8bpc_lasx:                 85.1 ( 6.01x)
mct_8tap_regular_w32_hv_8bpc_c:                 1018.2 ( 1.00x)
mct_8tap_regular_w32_hv_8bpc_lasx:               210.0 ( 4.85x)
mct_8tap_regular_w32_v_8bpc_c:                   513.6 ( 1.00x)
mct_8tap_regular_w32_v_8bpc_lasx:                 88.7 ( 5.79x)
mct_8tap_regular_w64_0_8bpc_c:                   315.4 ( 1.00x)
mct_8tap_regular_w64_0_8bpc_lasx:                 13.2 (23.86x)
mct_8tap_regular_w64_h_8bpc_c:                  1236.8 ( 1.00x)
mct_8tap_regular_w64_h_8bpc_lasx:                208.2 ( 5.94x)
mct_8tap_regular_w64_hv_8bpc_c:                 2428.0 ( 1.00x)
mct_8tap_regular_w64_hv_8bpc_lasx:               502.7 ( 4.83x)
mct_8tap_regular_w64_v_8bpc_c:                  1238.3 ( 1.00x)
mct_8tap_regular_w64_v_8bpc_lasx:                214.0 ( 5.79x)
mct_8tap_regular_w128_0_8bpc_c:                  775.3 ( 1.00x)
mct_8tap_regular_w128_0_8bpc_lasx:                32.5 (23.86x)
mct_8tap_regular_w128_h_8bpc_c:                 3077.5 ( 1.00x)
mct_8tap_regular_w128_h_8bpc_lasx:               518.6 ( 5.93x)
mct_8tap_regular_w128_hv_8bpc_c:                5987.0 ( 1.00x)
mct_8tap_regular_w128_hv_8bpc_lasx:             1242.4 ( 4.82x)
mct_8tap_regular_w128_v_8bpc_c:                 3077.5 ( 1.00x)
mct_8tap_regular_w128_v_8bpc_lasx:               530.3 ( 5.80x)
2024-01-21 15:31:46 +08:00
jinbo b34ecaf310 loongarch: Improve the performance of mc_8bpc.mc functions
Relative speedup over C code:

mc_8tap_regular_w2_0_8bpc_c:                      5.3 ( 1.00x)
mc_8tap_regular_w2_0_8bpc_lsx:                    0.8 ( 6.62x)
mc_8tap_regular_w2_h_8bpc_c:                     11.0 ( 1.00x)
mc_8tap_regular_w2_h_8bpc_lsx:                    2.5 ( 4.40x)
mc_8tap_regular_w2_hv_8bpc_c:                    24.4 ( 1.00x)
mc_8tap_regular_w2_hv_8bpc_lsx:                   9.1 ( 2.70x)
mc_8tap_regular_w2_v_8bpc_c:                     12.9 ( 1.00x)
mc_8tap_regular_w2_v_8bpc_lsx:                    3.2 ( 4.08x)
mc_8tap_regular_w4_0_8bpc_c:                      4.8 ( 1.00x)
mc_8tap_regular_w4_0_8bpc_lsx:                    0.8 ( 5.97x)
mc_8tap_regular_w4_h_8bpc_c:                     20.0 ( 1.00x)
mc_8tap_regular_w4_h_8bpc_lsx:                    3.9 ( 5.06x)
mc_8tap_regular_w4_hv_8bpc_c:                    44.3 ( 1.00x)
mc_8tap_regular_w4_hv_8bpc_lsx:                  15.0 ( 2.96x)
mc_8tap_regular_w4_v_8bpc_c:                     23.5 ( 1.00x)
mc_8tap_regular_w4_v_8bpc_lsx:                    4.2 ( 5.54x)
mc_8tap_regular_w8_0_8bpc_c:                      4.8 ( 1.00x)
mc_8tap_regular_w8_0_8bpc_lsx:                    0.8 ( 6.03x)
mc_8tap_regular_w8_h_8bpc_c:                     37.5 ( 1.00x)
mc_8tap_regular_w8_h_8bpc_lsx:                    7.6 ( 4.96x)
mc_8tap_regular_w8_hv_8bpc_c:                    84.0 ( 1.00x)
mc_8tap_regular_w8_hv_8bpc_lsx:                  23.9 ( 3.51x)
mc_8tap_regular_w8_v_8bpc_c:                     44.8 ( 1.00x)
mc_8tap_regular_w8_v_8bpc_lsx:                    7.2 ( 6.23x)
mc_8tap_regular_w16_0_8bpc_c:                     5.8 ( 1.00x)
mc_8tap_regular_w16_0_8bpc_lsx:                   1.1 ( 5.12x)
mc_8tap_regular_w16_h_8bpc_c:                   103.8 ( 1.00x)
mc_8tap_regular_w16_h_8bpc_lsx:                  21.6 ( 4.80x)
mc_8tap_regular_w16_hv_8bpc_c:                  220.2 ( 1.00x)
mc_8tap_regular_w16_hv_8bpc_lsx:                 65.1 ( 3.38x)
mc_8tap_regular_w16_v_8bpc_c:                   124.8 ( 1.00x)
mc_8tap_regular_w16_v_8bpc_lsx:                  19.9 ( 6.28x)
mc_8tap_regular_w32_0_8bpc_c:                     8.9 ( 1.00x)
mc_8tap_regular_w32_0_8bpc_lsx:                   2.9 ( 3.06x)
mc_8tap_regular_w32_h_8bpc_c:                   323.6 ( 1.00x)
mc_8tap_regular_w32_h_8bpc_lsx:                  69.1 ( 4.68x)
mc_8tap_regular_w32_hv_8bpc_c:                  649.5 ( 1.00x)
mc_8tap_regular_w32_hv_8bpc_lsx:                197.7 ( 3.29x)
mc_8tap_regular_w32_v_8bpc_c:                   390.5 ( 1.00x)
mc_8tap_regular_w32_v_8bpc_lsx:                  61.9 ( 6.31x)
mc_8tap_regular_w64_0_8bpc_c:                    13.3 ( 1.00x)
mc_8tap_regular_w64_0_8bpc_lsx:                   9.7 ( 1.37x)
mc_8tap_regular_w64_h_8bpc_c:                  1145.3 ( 1.00x)
mc_8tap_regular_w64_h_8bpc_lsx:                 248.2 ( 4.61x)
mc_8tap_regular_w64_hv_8bpc_c:                 2204.4 ( 1.00x)
mc_8tap_regular_w64_hv_8bpc_lsx:                682.1 ( 3.23x)
mc_8tap_regular_w64_v_8bpc_c:                  1384.9 ( 1.00x)
mc_8tap_regular_w64_v_8bpc_lsx:                 218.9 ( 6.33x)
mc_8tap_regular_w128_0_8bpc_c:                   33.6 ( 1.00x)
mc_8tap_regular_w128_0_8bpc_lsx:                 27.7 ( 1.21x)
mc_8tap_regular_w128_h_8bpc_c:                 3228.1 ( 1.00x)
mc_8tap_regular_w128_h_8bpc_lsx:                701.7 ( 4.60x)
mc_8tap_regular_w128_hv_8bpc_c:                6108.2 ( 1.00x)
mc_8tap_regular_w128_hv_8bpc_lsx:              1905.3 ( 3.21x)
mc_8tap_regular_w128_v_8bpc_c:                 3906.8 ( 1.00x)
mc_8tap_regular_w128_v_8bpc_lsx:                617.4 ( 6.33x)
2024-01-21 15:31:46 +08:00
yuanhecai d618867533 loongarch: Improve the performance of avg functions
Relative speedup over C code:

avg_w4_8bpc_c:                           7.0 ( 1.00x)
avg_w4_8bpc_lsx:                         0.8 ( 8.69x)
avg_w4_8bpc_lasx:                        0.8 ( 8.94x)
avg_w8_8bpc_c:                          20.4 ( 1.00x)
avg_w8_8bpc_lsx:                         1.1 (18.25x)
avg_w8_8bpc_lasx:                        0.9 (23.16x)
avg_w16_8bpc_c:                         65.1 ( 1.00x)
avg_w16_8bpc_lsx:                        2.5 (26.43x)
avg_w16_8bpc_lasx:                       2.0 (32.05x)
avg_w32_8bpc_c:                        255.1 ( 1.00x)
avg_w32_8bpc_lsx:                        8.6 (29.74x)
avg_w32_8bpc_lasx:                       6.0 (42.80x)
avg_w64_8bpc_c:                        611.0 ( 1.00x)
avg_w64_8bpc_lsx:                       21.0 (29.10x)
avg_w64_8bpc_lasx:                      12.1 (50.36x)
avg_w128_8bpc_c:                      1519.3 ( 1.00x)
avg_w128_8bpc_lsx:                      88.7 (17.13x)
avg_w128_8bpc_lasx:                     60.3 (25.20x)
2024-01-21 15:31:46 +08:00
yuanhecai 4080673c17 loongarch: Improve the performance of mask_c, w_mask_420 functions
Relative speedup over C code:

mask_w4_8bpc_c:                             9.2 ( 1.00x)
mask_w4_8bpc_lsx:                           1.1 ( 8.31x)
mask_w4_8bpc_lasx:                          1.2 ( 7.42x)
mask_w8_8bpc_c:                            27.4 ( 1.00x)
mask_w8_8bpc_lsx:                           2.6 (10.54x)
mask_w8_8bpc_lasx:                          1.9 (14.65x)
mask_w16_8bpc_c:                           87.2 ( 1.00x)
mask_w16_8bpc_lsx:                          8.0 (10.92x)
mask_w16_8bpc_lasx:                         6.5 (13.46x)
mask_w32_8bpc_c:                          343.4 ( 1.00x)
mask_w32_8bpc_lsx:                         31.7 (10.84x)
mask_w32_8bpc_lasx:                        22.1 (15.51x)
mask_w64_8bpc_c:                          824.9 ( 1.00x)
mask_w64_8bpc_lsx:                         78.0 (10.57x)
mask_w64_8bpc_lasx:                        54.1 (15.25x)
mask_w128_8bpc_c:                        2042.9 ( 1.00x)
mask_w128_8bpc_lsx:                       200.7 (10.18x)
mask_w128_8bpc_lasx:                      157.1 (13.00x)

w_mask_420_w4_8bpc_c:                      19.0 ( 1.00x)
w_mask_420_w4_8bpc_lsx:                     1.7 (11.11x)
w_mask_420_w4_8bpc_lasx:                    1.2 (15.87x)
w_mask_420_w8_8bpc_c:                      58.2 ( 1.00x)
w_mask_420_w8_8bpc_lsx:                     4.6 (12.58x)
w_mask_420_w8_8bpc_lasx:                    2.5 (23.74x)
w_mask_420_w16_8bpc_c:                    188.0 ( 1.00x)
w_mask_420_w16_8bpc_lsx:                   11.8 (15.88x)
w_mask_420_w16_8bpc_lasx:                   8.3 (22.66x)
w_mask_420_w32_8bpc_c:                    742.2 ( 1.00x)
w_mask_420_w32_8bpc_lsx:                   47.3 (15.68x)
w_mask_420_w32_8bpc_lasx:                  32.7 (22.68x)
w_mask_420_w64_8bpc_c:                   1786.3 ( 1.00x)
w_mask_420_w64_8bpc_lsx:                  112.4 (15.89x)
w_mask_420_w64_8bpc_lasx:                  78.4 (22.78x)
w_mask_420_w128_8bpc_c:                  4442.2 ( 1.00x)
w_mask_420_w128_8bpc_lsx:                 298.9 (14.86x)
w_mask_420_w128_8bpc_lasx:                220.5 (20.15x)
2024-01-21 15:31:46 +08:00
Hao Chen bde69a94bf loongarch: Improve the performance of w_avg functions
Relative speedup over C code:

w_avg_w4_8bpc_c:                         8.6 ( 1.00x)
w_avg_w4_8bpc_lsx:                       1.0 ( 8.53x)
w_avg_w4_8bpc_lasx:                      1.0 ( 8.79x)
w_avg_w8_8bpc_c:                        24.4 ( 1.00x)
w_avg_w8_8bpc_lsx:                       2.7 ( 8.90x)
w_avg_w8_8bpc_lasx:                      1.6 (15.33x)
w_avg_w16_8bpc_c:                       77.4 ( 1.00x)
w_avg_w16_8bpc_lsx:                      6.9 (11.29x)
w_avg_w16_8bpc_lasx:                     5.2 (14.88x)
w_avg_w32_8bpc_c:                      303.7 ( 1.00x)
w_avg_w32_8bpc_lsx:                     27.2 (11.16x)
w_avg_w32_8bpc_lasx:                    14.2 (21.43x)
w_avg_w64_8bpc_c:                      725.8 ( 1.00x)
w_avg_w64_8bpc_lsx:                     66.1 (10.98x)
w_avg_w64_8bpc_lasx:                    35.4 (20.48x)
w_avg_w128_8bpc_c:                    1812.6 ( 1.00x)
w_avg_w128_8bpc_lsx:                   169.9 (10.67x)
w_avg_w128_8bpc_lasx:                  111.7 (16.23x)
2024-01-21 15:31:46 +08:00
yuanhecai a23a1e7f81 loongarch: Improve the performance of warp8x8, warp8x8t functions
Relative speedup over C code:

warp_8x8_8bpc_c:                                     81.3 ( 1.00x)
warp_8x8_8bpc_lsx:                                   27.1 ( 3.00x)
warp_8x8_8bpc_lasx:                                  17.9 ( 4.54x)
warp_8x8t_8bpc_c:                                    71.7 ( 1.00x)
warp_8x8t_8bpc_lsx:                                  26.6 ( 2.69x)
warp_8x8t_8bpc_lasx:                                 17.7 ( 4.04x)
2024-01-21 15:31:46 +08:00
yuanhecai 4fb71a1a01 loongarch: add loongson_asm.S 2024-01-21 15:31:46 +08:00
yuanhecai 2e952f300f Add loongarch support 2024-01-21 15:06:52 +08:00
Matthias Dressel 7d225bec62 CI: Add loongarch64 tests 2024-01-15 14:54:46 +01:00
Matthias Dressel 655d7ec07d CI: Add loongarch64 toolchain 2024-01-15 09:35:54 +01:00
Henrik Gramner d23e87f7ae checkasm: Prefer sigsetjmp()/siglongjmp() over SA_NODEFER
Also prefer re-setting the signal handler upon intercept in combination
with SA_RESETHAND over re-raising exceptions with the SIG_DFL handler.
2024-01-11 12:35:34 +00:00
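The pattern named in the commit above can be sketched roughly as follows. This is a minimal illustration, not the checkasm code: run_guarded, install_handler, recover_point and the choice of SIGFPE are all assumptions made for the example. sigsetjmp() is called with a non-zero second argument so that siglongjmp() restores the signal mask, which removes the need for SA_NODEFER, and the handler is installed again after each intercept because SA_RESETHAND resets it to SIG_DFL.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf recover_point;          /* hypothetical name */
static volatile sig_atomic_t caught_sig;

static void on_signal(int sig)
{
    caught_sig = sig;
    /* Leave the handler. The mask saved by sigsetjmp(..., 1) is
     * restored, so the signal is unblocked without SA_NODEFER. */
    siglongjmp(recover_point, 1);
}

static int install_handler(int sig)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sa.sa_flags = SA_RESETHAND; /* back to SIG_DFL once delivered */
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* Runs fn(). Returns 0 if it completed, the signal number if it
 * raised SIGFPE, or -1 if the handler could not be installed. */
int run_guarded(void (*fn)(void))
{
    caught_sig = 0;
    if (install_handler(SIGFPE) != 0)
        return -1;
    if (sigsetjmp(recover_point, 1) == 0) {
        fn();
        return 0;
    }
    /* Reached via siglongjmp(): re-arm the handler for the next
     * intercept, since SA_RESETHAND cleared it. */
    install_handler(SIGFPE);
    return caught_sig;
}

/* Two toy functions to guard: one faults, one does not. */
void faulty(void) { raise(SIGFPE); }
void harmless(void) { }
```

Calling run_guarded(faulty) twice in a row reports the signal both times, which shows that the handler is re-armed after the first intercept.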
Henrik Gramner 8501a4b201 checkasm: Make signal handling async-signal-safe 2024-01-11 12:35:34 +00:00
Ronald S. Bultje ceeb535d94 qm: derive more tables at runtime
This reduces binary size from ~50kb to ~35kb. Ideas provided by Yu-Chen
(Eric) Sun and Ryan Lei from Meta.
2024-01-03 13:42:40 -05:00
Henrik Gramner 746ab8b4f3 thread_task: Properly handle spurious wakeups in delayed_fg
POSIX explicitly states that spurious wakeups from pthread_cond_wait()
may occur, even without any corresponding call to pthread_cond_signal().
2023-12-19 13:15:43 +01:00
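The usual way to tolerate spurious wakeups is to re-check a predicate in a loop around pthread_cond_wait(). The sketch below shows that general pattern only; the Gate type and its functions are hypothetical and are not taken from thread_task.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state shared between a signalling thread and a waiter. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool ready; /* the predicate the waiter actually cares about */
} Gate;

void gate_init(Gate *g)
{
    assert(g != NULL);
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->cond, NULL);
    g->ready = false;
}

/* Blocks until `ready` is set. The while-loop re-checks the predicate
 * after every wakeup, so a spurious return from pthread_cond_wait()
 * simply goes back to waiting. */
void gate_wait(Gate *g)
{
    pthread_mutex_lock(&g->lock);
    while (!g->ready)
        pthread_cond_wait(&g->cond, &g->lock);
    pthread_mutex_unlock(&g->lock);
}

void gate_open(Gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->ready = true; /* change the predicate under the lock */
    pthread_cond_signal(&g->cond);
    pthread_mutex_unlock(&g->lock);
}

static void *opener(void *arg)
{
    gate_open(arg);
    return NULL;
}

/* Returns 1 if the waiter saw the predicate set, 0 on thread error. */
int gate_demo(void)
{
    Gate g;
    pthread_t t;
    gate_init(&g);
    if (pthread_create(&t, NULL, opener, &g) != 0)
        return 0;
    gate_wait(&g);
    pthread_join(t, NULL);
    int ok = g.ready ? 1 : 0;
    pthread_cond_destroy(&g.cond);
    pthread_mutex_destroy(&g.lock);
    return ok;
}
```

An `if (!g->ready)` in place of the `while` would return from gate_wait() on any wakeup, including one that no thread requested.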
Henrik Gramner b3f5e8cef5 thread_task: Replace goto's with a regular while-loop 2023-12-19 13:15:43 +01:00
Henrik Gramner 8ba0df8492 checkasm: Fix cdef_dir function prototype 2023-12-19 12:11:46 +01:00
Martin Storsjö 5149b27447 checkasm: Map SIGBUS to the right error text
This was missed in 2ef970a885.

Also print this text for EXCEPTION_IN_PAGE_ERROR on Windows.
2023-12-15 14:10:01 +02:00
Henrik Gramner b3779b89c0 x86: Add high bit-depth ipred z1 AVX-512 (Ice Lake) asm 2023-12-11 14:15:30 +01:00
Henrik Gramner 0a8d66402e x86: Require fast gathers for AVX-512 horizontal loopfilters
Prefer using the AVX2 implementations (which don't use gathers) on Zen 4.
2023-12-08 16:21:13 +01:00
Henrik Gramner a04a724719 x86: Require fast gathers for high bit-depth AVX-512 film grain
Prefer using the SSSE3 implementations on Zen 4.
2023-12-08 16:21:13 +01:00
Henrik Gramner 0e438e70fa x86: Require fast gathers for AVX-512 mc resize and warp
Prefer using the AVX2 implementations (which don't use gathers) on Zen 4.
2023-12-08 16:21:13 +01:00
Henrik Gramner ec05e9b978 x86: Flag Zen 4 as having slow gathers 2023-12-08 15:34:16 +01:00
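The gather-related commits above share one idea: a CPU can support AVX-512 and still be better served by an older code path when its gather instructions are slow. A minimal sketch of that kind of selection follows. The flag names, bit values and the Impl enum are invented for illustration and are not the project's actual identifiers; the commits also fall back to SSSE3 in one case, which this sketch does not model.

```c
#include <assert.h>

/* Hypothetical CPU feature bits, not the project's real flag values. */
enum {
    CPU_AVX2        = 1 << 0,
    CPU_AVX512      = 1 << 1,
    CPU_SLOW_GATHER = 1 << 2, /* set for CPUs such as Zen 4 */
};

typedef enum { IMPL_C, IMPL_AVX2, IMPL_AVX512 } Impl;

/* Picks an implementation for a kernel whose AVX-512 version relies
 * on gather instructions and whose AVX2 version does not. */
Impl pick_gather_kernel(unsigned flags)
{
    /* Use the gather-based path only when gathers are fast. */
    if ((flags & CPU_AVX512) && !(flags & CPU_SLOW_GATHER))
        return IMPL_AVX512;
    if (flags & CPU_AVX2)
        return IMPL_AVX2;
    return IMPL_C;
}
```

With CPU_AVX2 | CPU_AVX512 | CPU_SLOW_GATHER set, the function returns IMPL_AVX2, matching the "prefer the AVX2 implementations on Zen 4" behaviour described in the messages.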
Henrik Gramner 3c41fa88ce x86: Add 8-bit ipred z2 AVX-512 (Ice Lake) asm 2023-11-13 13:05:58 +01:00
Henrik Gramner e47a39ca95 x86: Fix 8bpc AVX2 ipred_z2 filtering with extremely large frame sizes
The max_width/max_height values can exceed 16-bit range.
2023-11-12 22:52:18 +01:00
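To illustrate why a value outside the 16-bit range is a problem in 16-bit SIMD lanes, the sketch below contrasts what a plain 16-bit store keeps with a saturating clamp. It does not show how the commit above fixed the issue; both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* What a signed 16-bit lane holds after storing v: only the low
 * 16 bits survive, reinterpreted as two's complement. */
int lane16_wrapped(int v)
{
    uint16_t low = (uint16_t)v; /* well-defined modulo 2^16 */
    return low >= 0x8000 ? (int)low - 0x10000 : (int)low;
}

/* Saturating alternative: limit v to the int16_t range first. */
int lane16_clamped(int v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return v;
}
```

A dimension of 40000 wraps to a negative number in the first helper and saturates to 32767 in the second, while values that already fit are unchanged by both.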
Martin Storsjö 2179b30c84 checkasm: Fix catching crashes on Windows on ARM
longjmp on Windows uses SEH to unwind on ARM/ARM64 too, just like on
x86_64, thus use RtlCaptureContext/RtlRestoreContext instead of
setjmp/longjmp on those architectures as well.
2023-11-01 19:28:07 +02:00
Henrik Gramner d2ee43892b checkasm: Improve DSP trimming error message 2023-11-01 14:43:19 +01:00
Henrik Gramner 611abc20db checkasm: Add missing WINAPI_PARTITION checks on Windows
Some functionality is only available on WINAPI_PARTITION_DESKTOP systems.
2023-11-01 14:43:19 +01:00
Henrik Gramner 6bc552eb28 checkasm: Enable virtual terminal processing on Windows
This allows for the use of standard VT100 escape codes for text coloring,
which simplifies things by eliminating a bunch of Windows-specific code.

This is only supported on Windows 10 and later. Things will still run on
older systems, just without colored text output.
2023-11-01 14:43:18 +01:00
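A rough sketch of the approach described above, using the Win32 console calls GetConsoleMode() and SetConsoleMode() with ENABLE_VIRTUAL_TERMINAL_PROCESSING. The function names, the choice of stderr and the green color are assumptions for the example, not the checkasm implementation. On non-Windows builds the sketch simply assumes the terminal understands VT100 codes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#ifndef ENABLE_VIRTUAL_TERMINAL_PROCESSING
#define ENABLE_VIRTUAL_TERMINAL_PROCESSING 0x0004 /* older SDK headers */
#endif
#endif

/* Returns 1 if VT100 escape codes may be written to stderr. */
int enable_color_output(void)
{
#ifdef _WIN32
    HANDLE con = GetStdHandle(STD_ERROR_HANDLE);
    DWORD mode = 0;
    if (con == INVALID_HANDLE_VALUE || !GetConsoleMode(con, &mode))
        return 0; /* not a console, e.g. redirected to a file */
    /* Fails on systems older than Windows 10: fall back to no color. */
    return SetConsoleMode(con, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING) != 0;
#else
    return 1; /* assumption: the terminal handles VT100 codes */
#endif
}

/* Formats msg into buf, wrapped in green when use_color is set. */
int format_status(char *buf, size_t size, int use_color, const char *msg)
{
    if (use_color)
        return snprintf(buf, size, "\x1b[32m%s\x1b[0m", msg);
    return snprintf(buf, size, "%s", msg);
}
```

The same escape-code formatting path then serves every platform, with the Windows-specific part reduced to the one-time mode switch.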
Henrik Gramner 0f2a877e7e checkasm: Check for errors in command line parsing 2023-11-01 13:59:46 +01:00
Henrik Gramner 9dbf46285d ci: Fix test-debian-asan running checkasm with non-existing arguments 2023-11-01 13:59:46 +01:00
Matthias Dressel 48ef395920 CI: Update images 2023-10-24 20:27:33 +02:00
Henrik Gramner fd4ecc2fd8 x86: Add 8-bit ipred z3 AVX-512 (Ice Lake) asm 2023-10-19 17:00:20 +02:00
Ronald S. Bultje 47107e384b deblock_avx512: convert byte-shifts to gf2p8affineqb 2023-10-05 17:24:34 +00:00
Henrik Gramner 4c012978fb x86: Add 8-bit ipred z1 AVX-512 (Ice Lake) asm 2023-10-04 11:49:57 +02:00
Henrik Gramner 8936bab7ba x86: Consolidate some pb_0to31 and pb_0to63 constants 2023-10-04 11:49:43 +02:00
Jean-Baptiste Kempf 48035599cd Prepare for release 1.3.0 2023-10-03 17:36:52 +02:00
André Kempe 769bd1457a fix: various errors in implementation of BTI
Amend call type in refmvs. Because these blocks are reached via
blr x11, they need to be annotated.

Add missing BTI landing pads in ipred.S and ipred16.S. Because the
subroutines are called via a br from register, they need annotation with
'bti j' (AARCH64_VALID_JUMP_TARGET).
2023-09-08 10:02:06 +01:00
Henrik Gramner 97becd7372 Use the correct free() function on dav1d_mem_pool_init() failure 2023-08-18 17:41:50 +02:00