sm4-aesni-avx2: add generic 1 to 16 block bulk processing function
* cipher/sm4-aesni-avx2-amd64.S: Remove unnecessary vzeroupper at
function entries.
(_gcry_sm4_aesni_avx2_crypt_blk1_16): New.
* cipher/sm4.c (_gcry_sm4_aesni_avx2_crypt_blk1_16)
(sm4_aesni_avx2_crypt_blk1_16): New.
(sm4_get_crypt_blk1_16_fn) [USE_AESNI_AVX2]: Add
'sm4_aesni_avx2_crypt_blk1_16'.
Benchmark on AMD Ryzen 5800X:
Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        XTS enc |      1.48 ns/B     643.2 MiB/s      7.19 c/B      4850
        XTS dec |      1.48 ns/B     644.3 MiB/s      7.18 c/B      4850
After (1.37x faster):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        XTS enc |      1.07 ns/B     888.7 MiB/s      5.21 c/B      4850
        XTS dec |      1.07 ns/B     889.4 MiB/s      5.20 c/B      4850
Benchmark on Intel i5-6200U 2.30GHz:
Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        XTS enc |      2.95 ns/B     323.0 MiB/s      8.25 c/B      2792
        XTS dec |      2.95 ns/B     323.0 MiB/s      8.24 c/B      2792
After (1.64x faster):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        XTS enc |      1.79 ns/B     531.4 MiB/s      5.01 c/B      2791
        XTS dec |      1.79 ns/B     531.6 MiB/s      5.01 c/B      2791
Reviewed-and-tested-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>