Add SM4 x86-64/GFNI/AVX2 implementation
* cipher/Makefile.am: Add 'sm4-gfni-avx2-amd64.S'.
* cipher/sm4-gfni-avx2-amd64.S: New.
* cipher/sm4.c (USE_GFNI_AVX2): New.
(SM4_context): Add 'use_gfni_avx2'.
(crypt_blk1_8_fn_t): Rename to...
(crypt_blk1_16_fn_t): ...this.
(sm4_aesni_avx_crypt_blk1_8): Rename to...
(sm4_aesni_avx_crypt_blk1_16): ...this and add handling for 9 to 16
input blocks.
(_gcry_sm4_gfni_avx2_expand_key, _gcry_sm4_gfni_avx2_ctr_enc)
(_gcry_sm4_gfni_avx2_cbc_dec, _gcry_sm4_gfni_avx2_cfb_dec)
(_gcry_sm4_gfni_avx2_ocb_enc, _gcry_sm4_gfni_avx2_ocb_dec)
(_gcry_sm4_gfni_avx2_ocb_auth, _gcry_sm4_gfni_avx2_crypt_blk1_16)
(sm4_gfni_avx2_crypt_blk1_16): New.
(sm4_aarch64_crypt_blk1_8): Rename to...
(sm4_aarch64_crypt_blk1_16): ...this and add handling for 9 to 16
input blocks.
(sm4_armv8_ce_crypt_blk1_8): Rename to...
(sm4_armv8_ce_crypt_blk1_16): ...this and add handling for 9 to 16
input blocks.
(sm4_expand_key): Add GFNI/AVX2 path.
(sm4_setkey): Enable GFNI/AVX2 implementation if HW features
available; Disable AESNI implementations when GFNI implementation is
enabled.
(sm4_encrypt) [USE_GFNI_AVX2]: New.
(sm4_decrypt) [USE_GFNI_AVX2]: New.
(sm4_get_crypt_blk1_8_fn): Rename to...
(sm4_get_crypt_blk1_16_fn): ...this; Update to use *_blk1_16
functions; Add GFNI/AVX2 selection.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): Add GFNI/AVX2 path; Widen
generic bulk processing from 8 blocks to 16 blocks.
(_gcry_sm4_xts_crypt): Widen generic bulk processing from 8 blocks to
16 blocks.
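For readers unfamiliar with the "add handling for 9 to 16 input blocks" entries, one plausible way to widen a 1-to-8 block helper into a 1-to-16 block helper is to run a full 8-block pass first and hand the remainder to a second call. The sketch below only illustrates that idea and is not the code from cipher/sm4.c: 'toy_crypt_blk1_8', 'toy_crypt_blk1_16', 'toy_calls' and the XOR "cipher" are hypothetical stand-ins, not SM4.

```c
#include <stddef.h>

#define BLOCK_SIZE 16  /* SM4 block size in bytes */

/* Counts primitive invocations, for illustration only. */
static unsigned int toy_calls;

/* Hypothetical 1..8 block primitive: XORs every byte with 'mask'.
   This is a placeholder transform, not SM4. */
static unsigned int
toy_crypt_blk1_8 (unsigned char mask, unsigned char *out,
                  const unsigned char *in, unsigned int num_blks)
{
  size_t i;

  toy_calls++;
  for (i = 0; i < (size_t)num_blks * BLOCK_SIZE; i++)
    out[i] = in[i] ^ mask;
  return 0;
}

/* Hypothetical 1..16 block wrapper: when more than 8 blocks are given,
   process the first 8, advance the pointers, then process the rest. */
static unsigned int
toy_crypt_blk1_16 (unsigned char mask, unsigned char *out,
                   const unsigned char *in, unsigned int num_blks)
{
  if (num_blks > 8)
    {
      toy_crypt_blk1_8 (mask, out, in, 8);
      in += 8 * BLOCK_SIZE;
      out += 8 * BLOCK_SIZE;
      num_blks -= 8;
    }
  return toy_crypt_blk1_8 (mask, out, in, num_blks);
}
```

With this shape, callers can pass any count from 1 to 16 through a single function pointer type, while implementations that natively handle 16 blocks can skip the split entirely.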
Benchmark on Intel i3-1115G4 (tigerlake):
Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     10.34 ns/B     92.21 MiB/s     42.29 c/B      4089
        ECB dec |     10.34 ns/B     92.24 MiB/s     42.29 c/B      4090
        CBC enc |     11.06 ns/B     86.26 MiB/s     45.21 c/B      4090
        CBC dec |      1.13 ns/B     844.8 MiB/s      4.62 c/B      4090
        CFB enc |     11.06 ns/B     86.27 MiB/s     45.22 c/B      4090
        CFB dec |      1.13 ns/B     846.0 MiB/s      4.61 c/B      4090
        CTR enc |      1.14 ns/B     834.3 MiB/s      4.67 c/B      4089
        CTR dec |      1.14 ns/B     834.5 MiB/s      4.67 c/B      4089
        XTS enc |      1.93 ns/B     494.1 MiB/s      7.89 c/B      4090
        XTS dec |      1.94 ns/B     492.5 MiB/s      7.92 c/B      4090
        OCB enc |      1.16 ns/B     823.3 MiB/s      4.74 c/B      4090
        OCB dec |      1.16 ns/B     818.8 MiB/s      4.76 c/B      4089
       OCB auth |      1.15 ns/B     831.0 MiB/s      4.69 c/B      4089
After:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      8.39 ns/B     113.6 MiB/s     34.33 c/B      4090
        ECB dec |      8.40 ns/B     113.5 MiB/s     34.35 c/B      4090
        CBC enc |      9.45 ns/B     101.0 MiB/s     38.63 c/B      4089
        CBC dec |     0.650 ns/B      1468 MiB/s      2.66 c/B      4090
        CFB enc |      9.44 ns/B     101.1 MiB/s     38.59 c/B      4090
        CFB dec |     0.660 ns/B      1444 MiB/s      2.70 c/B      4090
        CTR enc |     0.664 ns/B      1437 MiB/s      2.71 c/B      4090
        CTR dec |     0.664 ns/B      1437 MiB/s      2.71 c/B      4090
        XTS enc |     0.756 ns/B      1262 MiB/s      3.09 c/B      4090
        XTS dec |     0.757 ns/B      1260 MiB/s      3.10 c/B      4090
        OCB enc |     0.673 ns/B      1417 MiB/s      2.75 c/B      4090
        OCB dec |     0.675 ns/B      1413 MiB/s      2.76 c/B      4090
       OCB auth |     0.672 ns/B      1418 MiB/s      2.75 c/B      4090
ECB: 1.2x faster
CBC-enc / CFB-enc: 1.17x faster
CBC-dec / CFB-dec / CTR / OCB: 1.7x faster
XTS: 2.5x faster
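The summary ratios above appear to follow from dividing each "Before" ns/B figure by the matching "After" figure. A minimal sketch of that arithmetic, where the helper name 'speedup' is a hypothetical illustration and not part of libgcrypt or its benchmark tool:

```c
/* Speedup factor: time per byte before divided by time per byte after.
   Both arguments are in nanoseconds per byte and must be positive. */
static double
speedup (double before_ns_per_byte, double after_ns_per_byte)
{
  return before_ns_per_byte / after_ns_per_byte;
}
```

For example, XTS enc gives 1.93 / 0.756, which is roughly 2.55, and ECB enc gives 10.34 / 8.39, which is roughly 1.23, in line with the rounded figures listed above.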
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>