Add SM4 x86-64/GFNI/AVX2 implementation

Description

* cipher/Makefile.am: Add 'sm4-gfni-avx2-amd64.S'.
* cipher/sm4-gfni-avx2-amd64.S: New.
* cipher/sm4.c (USE_GFNI_AVX2): New.
(SM4_context): Add 'use_gfni_avx2'.
(crypt_blk1_8_fn_t): Rename to...
(crypt_blk1_16_fn_t): ...this.
(sm4_aesni_avx_crypt_blk1_8): Rename to...
(sm4_aesni_avx_crypt_blk1_16): ...this and add handling for 9 to 16
input blocks.
(_gcry_sm4_gfni_avx2_expand_key, _gcry_sm4_gfni_avx2_ctr_enc)
(_gcry_sm4_gfni_avx2_cbc_dec, _gcry_sm4_gfni_avx2_cfb_dec)
(_gcry_sm4_gfni_avx2_ocb_enc, _gcry_sm4_gfni_avx2_ocb_dec)
(_gcry_sm4_gfni_avx2_ocb_auth, _gcry_sm4_gfni_avx2_crypt_blk1_16)
(sm4_gfni_avx2_crypt_blk1_16): New.
(sm4_aarch64_crypt_blk1_8): Rename to...
(sm4_aarch64_crypt_blk1_16): ...this and add handling for 9 to 16
input blocks.
(sm4_armv8_ce_crypt_blk1_8): Rename to...
(sm4_armv8_ce_crypt_blk1_16): ...this and add handling for 9 to 16
input blocks.
(sm4_expand_key): Add GFNI/AVX2 path.
(sm4_setkey): Enable GFNI/AVX2 implementation if HW features are
available; Disable AESNI implementations when GFNI implementation is
enabled.
(sm4_encrypt) [USE_GFNI_AVX2]: New.
(sm4_decrypt) [USE_GFNI_AVX2]: New.
(sm4_get_crypt_blk1_8_fn): Rename to...
(sm4_get_crypt_blk1_16_fn): ...this; Update to use *_blk1_16 functions;
Add GFNI/AVX2 selection.
(_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec, _gcry_sm4_cfb_dec)
(_gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): Add GFNI/AVX2 path; Widen
generic bulk processing from 8 blocks to 16 blocks.
(_gcry_sm4_xts_crypt): Widen generic bulk processing from 8 blocks to
16 blocks.
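
Several of the renames above (*_blk1_8 to *_blk1_16) add handling for 9 to 16 input blocks. A minimal sketch of one way such a wrapper can be layered on top of an 8-block primitive follows. The names `demo_crypt_blk1_8` and `demo_crypt_blk1_16`, their signatures, and the zero return value are assumptions for illustration only; the stand-in primitive just XORs a constant and is not SM4 or the libgcrypt assembly routine.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BLOCK_SIZE 16  /* SM4 block size in bytes */

/* Hypothetical stand-in for an 8-block primitive: handles 1..8 blocks.
   It XORs every byte with 0x5a so the data flow can be checked; it is
   NOT SM4.  The return value is a placeholder. */
static unsigned int
demo_crypt_blk1_8 (unsigned char *out, const unsigned char *in,
                   unsigned int num_blks)
{
  size_t i;

  assert (num_blks >= 1 && num_blks <= 8);
  for (i = 0; i < (size_t)num_blks * DEMO_BLOCK_SIZE; i++)
    out[i] = in[i] ^ 0x5a;
  return 0;
}

/* Sketch of a 1..16 block wrapper: when more than 8 blocks are given,
   run the first 8 through the primitive, then pass on the remainder. */
static unsigned int
demo_crypt_blk1_16 (unsigned char *out, const unsigned char *in,
                    unsigned int num_blks)
{
  assert (num_blks >= 1 && num_blks <= 16);
  if (num_blks > 8)
    {
      demo_crypt_blk1_8 (out, in, 8);
      in += 8 * DEMO_BLOCK_SIZE;
      out += 8 * DEMO_BLOCK_SIZE;
      num_blks -= 8;
    }
  return demo_crypt_blk1_8 (out, in, num_blks);
}
```

With this layering, a call with 11 blocks runs the primitive once on 8 blocks and once on the remaining 3.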

Benchmark on Intel i3-1115G4 (tigerlake):

Before:
SM4      |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     10.34 ns/B     92.21 MiB/s     42.29 c/B      4089
 ECB dec |     10.34 ns/B     92.24 MiB/s     42.29 c/B      4090
 CBC enc |     11.06 ns/B     86.26 MiB/s     45.21 c/B      4090
 CBC dec |      1.13 ns/B     844.8 MiB/s      4.62 c/B      4090
 CFB enc |     11.06 ns/B     86.27 MiB/s     45.22 c/B      4090
 CFB dec |      1.13 ns/B     846.0 MiB/s      4.61 c/B      4090
 CTR enc |      1.14 ns/B     834.3 MiB/s      4.67 c/B      4089
 CTR dec |      1.14 ns/B     834.5 MiB/s      4.67 c/B      4089
 XTS enc |      1.93 ns/B     494.1 MiB/s      7.89 c/B      4090
 XTS dec |      1.94 ns/B     492.5 MiB/s      7.92 c/B      4090
 OCB enc |      1.16 ns/B     823.3 MiB/s      4.74 c/B      4090
 OCB dec |      1.16 ns/B     818.8 MiB/s      4.76 c/B      4089
OCB auth |      1.15 ns/B     831.0 MiB/s      4.69 c/B      4089

After:
SM4      |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |      8.39 ns/B     113.6 MiB/s     34.33 c/B      4090
 ECB dec |      8.40 ns/B     113.5 MiB/s     34.35 c/B      4090
 CBC enc |      9.45 ns/B     101.0 MiB/s     38.63 c/B      4089
 CBC dec |     0.650 ns/B      1468 MiB/s      2.66 c/B      4090
 CFB enc |      9.44 ns/B     101.1 MiB/s     38.59 c/B      4090
 CFB dec |     0.660 ns/B      1444 MiB/s      2.70 c/B      4090
 CTR enc |     0.664 ns/B      1437 MiB/s      2.71 c/B      4090
 CTR dec |     0.664 ns/B      1437 MiB/s      2.71 c/B      4090
 XTS enc |     0.756 ns/B      1262 MiB/s      3.09 c/B      4090
 XTS dec |     0.757 ns/B      1260 MiB/s      3.10 c/B      4090
 OCB enc |     0.673 ns/B      1417 MiB/s      2.75 c/B      4090
 OCB dec |     0.675 ns/B      1413 MiB/s      2.76 c/B      4090
OCB auth |     0.672 ns/B      1418 MiB/s      2.75 c/B      4090

ECB: 1.2x faster
CBC-enc / CFB-enc: 1.17x faster
CBC-dec / CFB-dec / CTR / OCB: 1.7x faster
XTS: 2.5x faster
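
The factors above correspond to dividing a "Before" ns/B figure by the matching "After" figure. A small sketch of that arithmetic is given here; the helper names `speedup` and `print_speedup` are illustrative only and are not part of the benchmark tool.

```c
#include <stdio.h>

/* Speedup factor = time per byte before / time per byte after. */
static double
speedup (double before_ns_per_byte, double after_ns_per_byte)
{
  return before_ns_per_byte / after_ns_per_byte;
}

/* Print the factor for one row taken from the benchmark tables. */
static void
print_speedup (const char *mode, double before, double after)
{
  printf ("%-8s %.2fx faster\n", mode, speedup (before, after));
}
```

For example, `print_speedup ("ECB enc", 10.34, 8.39)` uses the ECB enc rows of the two tables, where 10.34 / 8.39 is about 1.23.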

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Apr 24 2022, 8:03 PM
Parents
rCaad3381e9384: sm4: add XTS bulk processing