Optimizations for SM4 cipher

* cipher/cipher.c (_gcry_cipher_open_internal): Add SM4 bulk
functions.
* cipher/sm4.c (ATTR_ALIGNED_64): New.
(sbox): Convert to ...
(sbox_table): ... this structure for sbox hardening as is done
for AES and GCM.
(prefetch_sbox_table): New.
(sm4_t_non_lin_sub): Make inline; Optimize sbox access pattern.
(sm4_key_lin_sub): Make inline; Tune slightly.
(sm4_key_sub, sm4_enc_sub): Make inline.
(sm4_round): Make inline; Take 'x' as separate parameters instead
of array.
(sm4_expand_key): Return void; Drop keylen; Unroll loops by 4;
Wipe sensitive variables at end; Move key-length check to
'sm4_setkey'.
(sm4_setkey): Add initial self-test step; Add key-length check;
Remove burn stack (as variables wiped in 'sm4_expand_key').
(sm4_do_crypt): Return burn stack depth; Unroll loops by 4.
(sm4_encrypt, sm4_decrypt): Prefetch sbox table; Return burn
stack depth from 'sm4_do_crypt', which allows the compiler to
apply tail-call optimization.
(sm4_do_crypt_blks2): New function processing two blocks in
parallel for greater instruction-level parallelism.
(sm4_crypt_blocks, _gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec)
(_gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): New
bulk processing functions.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): New
bulk processing self-tests.
(sm4_selftest): Clear SM4 context before use; Use 'sm4_expand_key'
instead of 'sm4_setkey'; Call bulk processing self-tests.
* src/cipher.h (_gcry_sm4_ctr_enc, _gcry_sm4_cbc_dec)
(_gcry_sm4_cfb_dec, _gcry_sm4_ocb_crypt, _gcry_sm4_ocb_auth): New.
* tests/basic.c (check_ocb_cipher): Add SM4-OCB test vector.
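
The sbox hardening and prefetch steps listed above can be illustrated with a minimal sketch. This is not the code from cipher/sm4.c: the names 'demo_sbox_table', 'demo_sbox_init' and 'demo_prefetch_sbox_table' are hypothetical, the table contents are placeholders rather than the real SM4 sbox, and the layout is only an assumption based on the changelog wording. The idea shown is to keep the look-up table cache-line aligned, bump a counter on either side of it, and touch every cache line before use so that later key-dependent look-ups are less likely to show cache-timing differences.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Hypothetical layout: the table sits between two counters and starts
 * on its own cache line thanks to the padding after 'counter_head'. */
struct demo_sbox_table
{
  volatile uint32_t counter_head;
  uint32_t pad[DEMO_CACHE_LINE / 4 - 1];
  uint8_t S[256];
  volatile uint32_t counter_tail;
};

/* C11 alignment specifier; the changelog's ATTR_ALIGNED_64 presumably
 * serves the same purpose (assumption). */
static _Alignas(DEMO_CACHE_LINE) struct demo_sbox_table demo_table;

/* Fill the table with placeholder values (NOT the SM4 sbox). */
static void
demo_sbox_init (void)
{
  size_t i;

  for (i = 0; i < sizeof (demo_table.S); i++)
    demo_table.S[i] = (uint8_t)(255 - i);
}

static void
demo_prefetch_sbox_table (void)
{
  const volatile uint8_t *vtab = (const volatile uint8_t *)demo_table.S;
  size_t i;

  /* Writing to the pages that hold the table is meant to force them to
   * be private to this process if they were shared. */
  demo_table.counter_head++;
  demo_table.counter_tail++;

  /* Read one byte per cache line so the whole table is pulled into
   * cache; the volatile pointer keeps the reads from being dropped. */
  for (i = 0; i < sizeof (demo_table.S); i += DEMO_CACHE_LINE)
    (void)vtab[i];
  (void)vtab[sizeof (demo_table.S) - 1];
}
```

In this sketch a caller would run 'demo_sbox_init' once and then call 'demo_prefetch_sbox_table' at the start of each encrypt or decrypt call, before the first table look-up.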

Benchmark on AMD Ryzen 7 3700X (x86-64):

Before:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

 ECB enc |     17.69 ns/B     53.92 MiB/s     76.50 c/B      4326
 ECB dec |     17.74 ns/B     53.77 MiB/s     76.72 c/B      4325
 CBC enc |     18.14 ns/B     52.56 MiB/s     78.47 c/B      4325
 CBC dec |     18.05 ns/B     52.83 MiB/s     78.09 c/B      4326
 CFB enc |     18.19 ns/B     52.44 MiB/s     78.67 c/B      4326
 CFB dec |     18.16 ns/B     52.53 MiB/s     78.53 c/B      4326
 OFB enc |     16.82 ns/B     56.70 MiB/s     72.96 c/B      4338
 OFB dec |     16.87 ns/B     56.53 MiB/s     72.96 c/B      4325
 CTR enc |     18.17 ns/B     52.47 MiB/s     78.62 c/B      4326
 CTR dec |     18.02 ns/B     52.94 MiB/s     77.92 c/B      4325
 XTS enc |     17.70 ns/B     53.87 MiB/s     76.11 c/B      4300
 XTS dec |     17.65 ns/B     54.04 MiB/s     76.28 c/B      4323±1
 CCM enc |     33.76 ns/B     28.25 MiB/s     146.9 c/B      4350
 CCM dec |     34.07 ns/B     27.99 MiB/s     147.4 c/B      4326
CCM auth |     16.97 ns/B     56.19 MiB/s     73.41 c/B      4325
 EAX enc |     34.02 ns/B     28.03 MiB/s     147.1 c/B      4325
 EAX dec |     36.56 ns/B     26.08 MiB/s     159.1 c/B      4350
EAX auth |     17.02 ns/B     56.03 MiB/s     73.62 c/B      4325
 GCM enc |     16.76 ns/B     56.90 MiB/s     72.50 c/B      4325
 GCM dec |     18.01 ns/B     52.94 MiB/s     78.37 c/B      4350
GCM auth |     0.120 ns/B      7975 MiB/s     0.517 c/B      4325
 OCB enc |     18.19 ns/B     52.43 MiB/s     78.68 c/B      4325
 OCB dec |     18.15 ns/B     52.54 MiB/s     78.51 c/B      4325
OCB auth |     16.87 ns/B     56.54 MiB/s     72.95 c/B      4325

After (non-parallelizable modes ~2.0x faster, parallel modes ~3.8x):
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

 ECB enc |      8.28 ns/B     115.1 MiB/s     35.84 c/B      4327±1
 ECB dec |      8.33 ns/B     114.4 MiB/s     36.13 c/B      4336±1
 CBC enc |      8.94 ns/B     106.7 MiB/s     38.66 c/B      4325
 CBC dec |      4.78 ns/B     199.7 MiB/s     20.42 c/B      4275
 CFB enc |      8.95 ns/B     106.5 MiB/s     38.72 c/B      4325
 CFB dec |      4.81 ns/B     198.2 MiB/s     20.57 c/B      4275
 OFB enc |      8.48 ns/B     112.5 MiB/s     36.66 c/B      4325
 OFB dec |      8.42 ns/B     113.3 MiB/s     36.41 c/B      4325
 CTR enc |      4.81 ns/B     198.2 MiB/s     20.69 c/B      4300
 CTR dec |      4.80 ns/B     198.8 MiB/s     20.63 c/B      4300
 XTS enc |      8.75 ns/B     109.0 MiB/s     37.83 c/B      4325
 XTS dec |      8.86 ns/B     107.7 MiB/s     38.30 c/B      4326
 CCM enc |     13.74 ns/B     69.42 MiB/s     59.42 c/B      4325
 CCM dec |     13.77 ns/B     69.25 MiB/s     59.57 c/B      4326
CCM auth |      8.87 ns/B     107.5 MiB/s     38.36 c/B      4325
 EAX enc |     13.76 ns/B     69.29 MiB/s     59.54 c/B      4326
 EAX dec |     13.77 ns/B     69.25 MiB/s     59.57 c/B      4325
EAX auth |      8.89 ns/B     107.3 MiB/s     38.44 c/B      4325
 GCM enc |      4.96 ns/B     192.3 MiB/s     21.20 c/B      4275
 GCM dec |      4.91 ns/B     194.4 MiB/s     21.10 c/B      4300
GCM auth |     0.116 ns/B      8232 MiB/s     0.504 c/B      4351
 OCB enc |      4.88 ns/B     195.5 MiB/s     20.86 c/B      4275
 OCB dec |      4.85 ns/B     196.6 MiB/s     20.86 c/B      4301
OCB auth |      4.80 ns/B     198.9 MiB/s     20.62 c/B      4301
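
The speed-up figures quoted in the "After" heading can be reproduced from the nanosecs/byte columns. The helper names below are hypothetical and only illustrate the arithmetic; they are not part of the benchmark tool.

```c
/* Speed-up factor: how many times faster the "after" run is. */
static double
speedup_factor (double before_ns_per_byte, double after_ns_per_byte)
{
  return before_ns_per_byte / after_ns_per_byte;
}

/* The same comparison expressed as "percent faster". */
static double
percent_faster (double before_ns_per_byte, double after_ns_per_byte)
{
  return (speedup_factor (before_ns_per_byte, after_ns_per_byte) - 1.0)
         * 100.0;
}
```

With the x86-64 numbers above, ECB enc goes from 17.69 to 8.28 ns/B, so 'speedup_factor (17.69, 8.28)' is about 2.14, and CTR enc goes from 18.17 to 4.81 ns/B, so 'speedup_factor (18.17, 4.81)' is about 3.78.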

Benchmark on ARM Cortex-A53 (aarch64):

Before:
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

 ECB enc |     84.08 ns/B     11.34 MiB/s     54.48 c/B     648.0
 ECB dec |     84.07 ns/B     11.34 MiB/s     54.47 c/B     648.0
 CBC enc |     84.90 ns/B     11.23 MiB/s     55.01 c/B     647.9
 CBC dec |     84.69 ns/B     11.26 MiB/s     54.87 c/B     648.0
 CFB enc |     84.55 ns/B     11.28 MiB/s     54.79 c/B     648.0
 CFB dec |     84.55 ns/B     11.28 MiB/s     54.78 c/B     648.0
 OFB enc |     84.45 ns/B     11.29 MiB/s     54.72 c/B     647.9
 OFB dec |     84.45 ns/B     11.29 MiB/s     54.72 c/B     648.0
 CTR enc |     85.42 ns/B     11.16 MiB/s     55.35 c/B     648.0
 CTR dec |     85.42 ns/B     11.16 MiB/s     55.35 c/B     648.0
 XTS enc |     88.72 ns/B     10.75 MiB/s     57.49 c/B     648.0
 XTS dec |     88.71 ns/B     10.75 MiB/s     57.48 c/B     648.0
 CCM enc |     170.2 ns/B      5.60 MiB/s     110.3 c/B     647.9
 CCM dec |     170.2 ns/B      5.60 MiB/s     110.3 c/B     648.0
CCM auth |     84.27 ns/B     11.32 MiB/s     54.60 c/B     648.0
 EAX enc |     170.6 ns/B      5.59 MiB/s     110.5 c/B     648.0
 EAX dec |     170.6 ns/B      5.59 MiB/s     110.5 c/B     648.0
EAX auth |     84.51 ns/B     11.29 MiB/s     54.76 c/B     648.0
 GCM enc |     86.99 ns/B     10.96 MiB/s     56.36 c/B     648.0
 GCM dec |     87.00 ns/B     10.96 MiB/s     56.37 c/B     648.0
GCM auth |      1.56 ns/B     609.9 MiB/s      1.01 c/B     648.0
 OCB enc |     86.77 ns/B     10.99 MiB/s     56.22 c/B     648.0
 OCB dec |     86.77 ns/B     10.99 MiB/s     56.22 c/B     648.0
OCB auth |     86.20 ns/B     11.06 MiB/s     55.85 c/B     648.0

After (non-parallelizable modes ~30% faster, parallel modes ~80%):
SM4 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

 ECB enc |     64.85 ns/B     14.71 MiB/s     42.02 c/B     648.0
 ECB dec |     64.78 ns/B     14.72 MiB/s     41.98 c/B     648.0
 CBC enc |     64.53 ns/B     14.78 MiB/s     41.81 c/B     647.9
 CBC dec |     45.09 ns/B     21.15 MiB/s     29.21 c/B     648.0
 CFB enc |     64.56 ns/B     14.77 MiB/s     41.84 c/B     648.0
 CFB dec |     45.52 ns/B     20.95 MiB/s     29.49 c/B     647.9
 OFB enc |     64.14 ns/B     14.87 MiB/s     41.56 c/B     648.0
 OFB dec |     64.14 ns/B     14.87 MiB/s     41.56 c/B     648.0
 CTR enc |     45.54 ns/B     20.94 MiB/s     29.51 c/B     648.0
 CTR dec |     45.53 ns/B     20.95 MiB/s     29.50 c/B     648.0
 XTS enc |     67.88 ns/B     14.05 MiB/s     43.98 c/B     648.0
 XTS dec |     67.69 ns/B     14.09 MiB/s     43.86 c/B     648.0
 CCM enc |     110.6 ns/B      8.62 MiB/s     71.66 c/B     648.0
 CCM dec |     110.2 ns/B      8.65 MiB/s     71.42 c/B     648.0
CCM auth |     64.87 ns/B     14.70 MiB/s     42.04 c/B     648.0
 EAX enc |     109.9 ns/B      8.68 MiB/s     71.22 c/B     648.0
 EAX dec |     109.9 ns/B      8.68 MiB/s     71.22 c/B     648.0
EAX auth |     64.37 ns/B     14.81 MiB/s     41.71 c/B     648.0
 GCM enc |     47.07 ns/B     20.26 MiB/s     30.51 c/B     648.0
 GCM dec |     47.08 ns/B     20.26 MiB/s     30.51 c/B     648.0
GCM auth |      1.55 ns/B     614.7 MiB/s      1.01 c/B     648.0
 OCB enc |     48.38 ns/B     19.71 MiB/s     31.35 c/B     648.0
 OCB dec |     48.11 ns/B     19.82 MiB/s     31.17 c/B     648.0
OCB auth |     46.71 ns/B     20.42 MiB/s     30.27 c/B     648.0
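
The three measurement columns in these tables are related by simple unit conversions, which gives a quick way to sanity-check a row. The sketch below assumes the columns mean what their headers say (nanoseconds per byte, MiB per second with 1 MiB = 1048576 bytes, cycles per byte, clock in MHz); the function names are hypothetical.

```c
/* Throughput in MiB/s derived from nanoseconds per byte. */
static double
mib_per_sec (double ns_per_byte)
{
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Cycles per byte derived from nanoseconds per byte and clock in MHz:
 * 1 MHz is 1e6 cycles/s, i.e. 1e-3 cycles per nanosecond. */
static double
cycles_per_byte (double ns_per_byte, double mhz)
{
  return ns_per_byte * mhz / 1000.0;
}
```

For the Cortex-A53 "Before" ECB enc row (84.08 ns/B at 648.0 MHz), 'mib_per_sec (84.08)' is about 11.34 and 'cycles_per_byte (84.08, 648.0)' is about 54.48, matching the table.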

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Jun 16 2020, 6:39 PM
Parents
rCa6177e1bc948: ecc: For Ed448, it's only for EdDSA.