Optimizations for SM4 cipher

* cipher/cipher.c (_gcry_cipher_open_internal): Add SM4 bulk functions.
* cipher/sm4.c (ATTR_ALIGNED_64): New.
(sbox): Convert to ...
(sbox_table): ... this structure for sbox hardening as is done for AES
and GCM.
(prefetch_sbox_table): New.
(sm4_t_non_lin_sub): Make inline; Optimize sbox access pattern.
(sm4_key_lin_sub): Make inline; Tune slightly.
(sm4_key_sub, sm4_enc_sub): Make inline.
(sm4_round): Make inline; Take 'x' as separate parameters instead of
array.
(sm4_expand_key): Return void; Drop keylen; Unroll loops by 4; Wipe
sensitive variables at end; Move key-length check to 'sm4_setkey'.
(sm4_setkey): Add initial self-test step; Add key-length check; Remove
burn stack (as variables are wiped in 'sm4_expand_key').
(sm4_do_crypt): Return burn stack depth; Unroll loops by 4.
(sm4_encrypt, sm4_decrypt): Prefetch sbox table; Return burn stack from
'sm4_do_crypt', as this allows tail-call optimization by compiler.
(sm4_do_crypt_blks2): New two parallel block function for greater
instruction level parallelism.
(sm4_crypt_blocks, _gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec)
(_gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): New bulk
processing functions.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): New bulk
processing self-tests.
(sm4_selftest): Clear SM4 context before use; Use 'sm4_expand_key'
instead of 'sm4_setkey'; Call bulk processing self-tests.
* src/cipher.h (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec)
(_gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): New.
* tests/basic.c (check_ocb_cipher): Add SM4-OCB test vector.

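The sbox prefetch mentioned above can be illustrated roughly as follows.
This is only a minimal sketch of the general technique (load one byte
from every cache line of the look-up table before doing data-dependent
lookups), not the code in cipher/sm4.c. All 'demo_*' names, the 64-byte
line size and the table contents are assumptions made for illustration;
the table values are placeholders and not the SM4 S-box. The alignment
attribute is the GCC/Clang form.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_LINE 64  /* assumed cache-line size in bytes */

/* Hypothetical wrapper structure: with 64-byte alignment the 256-byte
 * table spans exactly four cache lines. */
typedef struct
{
  uint8_t S[256];
} demo_sbox_table_t;

static demo_sbox_table_t demo_table __attribute__ ((aligned (DEMO_LINE)));

/* Fill the table with placeholder values (NOT the SM4 S-box). */
static void
demo_init_table (void)
{
  size_t i;

  for (i = 0; i < sizeof demo_table.S; i++)
    demo_table.S[i] = (uint8_t)(i * 7 + 3);
}

/* Read one byte from each cache line of the table.  The volatile
 * pointer keeps the compiler from removing the loads.  The return
 * value (number of lines touched) exists only for this demo. */
static unsigned int
demo_prefetch_table (void)
{
  const volatile uint8_t *vtab = (const volatile uint8_t *)&demo_table;
  unsigned int lines = 0;
  size_t off;

  for (off = 0; off < sizeof demo_table; off += DEMO_LINE)
    {
      (void)vtab[off];
      lines++;
    }

  return lines;
}

/* Data-dependent lookup, as a cipher round would perform after the
 * prefetch. */
static uint8_t
demo_lookup (uint8_t x)
{
  return demo_table.S[x];
}
```

A caller would run 'demo_init_table' once, then 'demo_prefetch_table'
before each batch of 'demo_lookup' calls. The intent is that every line
of the table is already cached when the secret-dependent index is used,
which narrows (but does not eliminate) cache-timing differences between
lookups.
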
Benchmark on AMD Ryzen 7 3700X (x86-64):

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     17.69 ns/B     53.92 MiB/s     76.50 c/B      4326
        ECB dec |     17.74 ns/B     53.77 MiB/s     76.72 c/B      4325
        CBC enc |     18.14 ns/B     52.56 MiB/s     78.47 c/B      4325
        CBC dec |     18.05 ns/B     52.83 MiB/s     78.09 c/B      4326
        CFB enc |     18.19 ns/B     52.44 MiB/s     78.67 c/B      4326
        CFB dec |     18.16 ns/B     52.53 MiB/s     78.53 c/B      4326
        OFB enc |     16.82 ns/B     56.70 MiB/s     72.96 c/B      4338
        OFB dec |     16.87 ns/B     56.53 MiB/s     72.96 c/B      4325
        CTR enc |     18.17 ns/B     52.47 MiB/s     78.62 c/B      4326
        CTR dec |     18.02 ns/B     52.94 MiB/s     77.92 c/B      4325
        XTS enc |     17.70 ns/B     53.87 MiB/s     76.11 c/B      4300
        XTS dec |     17.65 ns/B     54.04 MiB/s     76.28 c/B    4323±1
        CCM enc |     33.76 ns/B     28.25 MiB/s     146.9 c/B      4350
        CCM dec |     34.07 ns/B     27.99 MiB/s     147.4 c/B      4326
       CCM auth |     16.97 ns/B     56.19 MiB/s     73.41 c/B      4325
        EAX enc |     34.02 ns/B     28.03 MiB/s     147.1 c/B      4325
        EAX dec |     36.56 ns/B     26.08 MiB/s     159.1 c/B      4350
       EAX auth |     17.02 ns/B     56.03 MiB/s     73.62 c/B      4325
        GCM enc |     16.76 ns/B     56.90 MiB/s     72.50 c/B      4325
        GCM dec |     18.01 ns/B     52.94 MiB/s     78.37 c/B      4350
       GCM auth |     0.120 ns/B      7975 MiB/s     0.517 c/B      4325
        OCB enc |     18.19 ns/B     52.43 MiB/s     78.68 c/B      4325
        OCB dec |     18.15 ns/B     52.54 MiB/s     78.51 c/B      4325
       OCB auth |     16.87 ns/B     56.54 MiB/s     72.95 c/B      4325

After (non-parallelizable modes ~2.0x faster, parallel modes ~3.8x):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      8.28 ns/B     115.1 MiB/s     35.84 c/B    4327±1
        ECB dec |      8.33 ns/B     114.4 MiB/s     36.13 c/B    4336±1
        CBC enc |      8.94 ns/B     106.7 MiB/s     38.66 c/B      4325
        CBC dec |      4.78 ns/B     199.7 MiB/s     20.42 c/B      4275
        CFB enc |      8.95 ns/B     106.5 MiB/s     38.72 c/B      4325
        CFB dec |      4.81 ns/B     198.2 MiB/s     20.57 c/B      4275
        OFB enc |      8.48 ns/B     112.5 MiB/s     36.66 c/B      4325
        OFB dec |      8.42 ns/B     113.3 MiB/s     36.41 c/B      4325
        CTR enc |      4.81 ns/B     198.2 MiB/s     20.69 c/B      4300
        CTR dec |      4.80 ns/B     198.8 MiB/s     20.63 c/B      4300
        XTS enc |      8.75 ns/B     109.0 MiB/s     37.83 c/B      4325
        XTS dec |      8.86 ns/B     107.7 MiB/s     38.30 c/B      4326
        CCM enc |     13.74 ns/B     69.42 MiB/s     59.42 c/B      4325
        CCM dec |     13.77 ns/B     69.25 MiB/s     59.57 c/B      4326
       CCM auth |      8.87 ns/B     107.5 MiB/s     38.36 c/B      4325
        EAX enc |     13.76 ns/B     69.29 MiB/s     59.54 c/B      4326
        EAX dec |     13.77 ns/B     69.25 MiB/s     59.57 c/B      4325
       EAX auth |      8.89 ns/B     107.3 MiB/s     38.44 c/B      4325
        GCM enc |      4.96 ns/B     192.3 MiB/s     21.20 c/B      4275
        GCM dec |      4.91 ns/B     194.4 MiB/s     21.10 c/B      4300
       GCM auth |     0.116 ns/B      8232 MiB/s     0.504 c/B      4351
        OCB enc |      4.88 ns/B     195.5 MiB/s     20.86 c/B      4275
        OCB dec |      4.85 ns/B     196.6 MiB/s     20.86 c/B      4301
       OCB auth |      4.80 ns/B     198.9 MiB/s     20.62 c/B      4301

Benchmark on ARM Cortex-A53 (aarch64):

Before:
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     84.08 ns/B     11.34 MiB/s     54.48 c/B     648.0
        ECB dec |     84.07 ns/B     11.34 MiB/s     54.47 c/B     648.0
        CBC enc |     84.90 ns/B     11.23 MiB/s     55.01 c/B     647.9
        CBC dec |     84.69 ns/B     11.26 MiB/s     54.87 c/B     648.0
        CFB enc |     84.55 ns/B     11.28 MiB/s     54.79 c/B     648.0
        CFB dec |     84.55 ns/B     11.28 MiB/s     54.78 c/B     648.0
        OFB enc |     84.45 ns/B     11.29 MiB/s     54.72 c/B     647.9
        OFB dec |     84.45 ns/B     11.29 MiB/s     54.72 c/B     648.0
        CTR enc |     85.42 ns/B     11.16 MiB/s     55.35 c/B     648.0
        CTR dec |     85.42 ns/B     11.16 MiB/s     55.35 c/B     648.0
        XTS enc |     88.72 ns/B     10.75 MiB/s     57.49 c/B     648.0
        XTS dec |     88.71 ns/B     10.75 MiB/s     57.48 c/B     648.0
        CCM enc |     170.2 ns/B      5.60 MiB/s     110.3 c/B     647.9
        CCM dec |     170.2 ns/B      5.60 MiB/s     110.3 c/B     648.0
       CCM auth |     84.27 ns/B     11.32 MiB/s     54.60 c/B     648.0
        EAX enc |     170.6 ns/B      5.59 MiB/s     110.5 c/B     648.0
        EAX dec |     170.6 ns/B      5.59 MiB/s     110.5 c/B     648.0
       EAX auth |     84.51 ns/B     11.29 MiB/s     54.76 c/B     648.0
        GCM enc |     86.99 ns/B     10.96 MiB/s     56.36 c/B     648.0
        GCM dec |     87.00 ns/B     10.96 MiB/s     56.37 c/B     648.0
       GCM auth |      1.56 ns/B     609.9 MiB/s      1.01 c/B     648.0
        OCB enc |     86.77 ns/B     10.99 MiB/s     56.22 c/B     648.0
        OCB dec |     86.77 ns/B     10.99 MiB/s     56.22 c/B     648.0
       OCB auth |     86.20 ns/B     11.06 MiB/s     55.85 c/B     648.0

After (non-parallelizable modes ~30% faster, parallel modes ~80%):
 SM4            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     64.85 ns/B     14.71 MiB/s     42.02 c/B     648.0
        ECB dec |     64.78 ns/B     14.72 MiB/s     41.98 c/B     648.0
        CBC enc |     64.53 ns/B     14.78 MiB/s     41.81 c/B     647.9
        CBC dec |     45.09 ns/B     21.15 MiB/s     29.21 c/B     648.0
        CFB enc |     64.56 ns/B     14.77 MiB/s     41.84 c/B     648.0
        CFB dec |     45.52 ns/B     20.95 MiB/s     29.49 c/B     647.9
        OFB enc |     64.14 ns/B     14.87 MiB/s     41.56 c/B     648.0
        OFB dec |     64.14 ns/B     14.87 MiB/s     41.56 c/B     648.0
        CTR enc |     45.54 ns/B     20.94 MiB/s     29.51 c/B     648.0
        CTR dec |     45.53 ns/B     20.95 MiB/s     29.50 c/B     648.0
        XTS enc |     67.88 ns/B     14.05 MiB/s     43.98 c/B     648.0
        XTS dec |     67.69 ns/B     14.09 MiB/s     43.86 c/B     648.0
        CCM enc |     110.6 ns/B      8.62 MiB/s     71.66 c/B     648.0
        CCM dec |     110.2 ns/B      8.65 MiB/s     71.42 c/B     648.0
       CCM auth |     64.87 ns/B     14.70 MiB/s     42.04 c/B     648.0
        EAX enc |     109.9 ns/B      8.68 MiB/s     71.22 c/B     648.0
        EAX dec |     109.9 ns/B      8.68 MiB/s     71.22 c/B     648.0
       EAX auth |     64.37 ns/B     14.81 MiB/s     41.71 c/B     648.0
        GCM enc |     47.07 ns/B     20.26 MiB/s     30.51 c/B     648.0
        GCM dec |     47.08 ns/B     20.26 MiB/s     30.51 c/B     648.0
       GCM auth |      1.55 ns/B     614.7 MiB/s      1.01 c/B     648.0
        OCB enc |     48.38 ns/B     19.71 MiB/s     31.35 c/B     648.0
        OCB dec |     48.11 ns/B     19.82 MiB/s     31.17 c/B     648.0
       OCB auth |     46.71 ns/B     20.42 MiB/s     30.27 c/B     648.0

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>