rijndael: add VAES/AVX512 accelerated implementation

Description

* cipher/Makefile.am: Add 'rijndael-vaes-avx512-amd64.S'.
* cipher/rijndael-internal.h (USE_VAES_AVX512): New.
(RIJNDAEL_context_s) [USE_VAES_AVX512]: Add 'use_vaes_avx512'.
* cipher/rijndael-vaes-avx2-amd64.S
(_gcry_vaes_avx2_ocb_crypt_amd64): Minor optimization for aligned
blk8 OCB path.
* cipher/rijndael-vaes-avx512-amd64.S: New.
* cipher/rijndael-vaes.c [USE_VAES_AVX512]
(_gcry_vaes_avx512_cbc_dec_amd64, _gcry_vaes_avx512_cfb_dec_amd64)
(_gcry_vaes_avx512_ctr_enc_amd64)
(_gcry_vaes_avx512_ctr32le_enc_amd64)
(_gcry_vaes_avx512_ocb_aligned_crypt_amd64)
(_gcry_vaes_avx512_xts_crypt_amd64)
(_gcry_vaes_avx512_ecb_crypt_amd64): New.
(_gcry_aes_vaes_ecb_crypt, _gcry_aes_vaes_cbc_dec)
(_gcry_aes_vaes_cfb_dec, _gcry_aes_vaes_ctr_enc)
(_gcry_aes_vaes_ctr32le_enc, _gcry_aes_vaes_ocb_crypt)
(_gcry_aes_vaes_ocb_auth, _gcry_aes_vaes_xts_crypt)
[USE_VAES_AVX512]: Add AVX512 code paths.
* cipher/rijndael.c (do_setkey) [USE_VAES_AVX512]: Add setup for
'ctx->use_vaes_avx512'.
* configure.ac: Add 'rijndael-vaes-avx512-amd64.lo'.
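The ChangeLog entries above describe a per-context flag that is set at key setup and later consulted by the bulk functions. A minimal sketch of that pattern follows. It is not libgcrypt's code: the feature bits, the `aes_ctx_sketch` struct, and both function names are hypothetical, and only the flag name `use_vaes_avx512` is taken from the entries above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature bits; NOT libgcrypt's real hardware-feature flags. */
enum {
  FEAT_VAES   = 1 << 0,  /* VAES with 256-bit vectors is usable */
  FEAT_AVX512 = 1 << 1   /* the AVX512 subset the wide path needs is usable */
};

/* Hypothetical context; only the name 'use_vaes_avx512' comes from the
   ChangeLog, the layout is an assumption made for illustration. */
struct aes_ctx_sketch
{
  unsigned int use_vaes : 1;
  unsigned int use_vaes_avx512 : 1;
};

enum bulk_path { PATH_GENERIC, PATH_VAES_AVX2, PATH_VAES_AVX512 };

/* Decide once, at key setup, which accelerated paths may be used. */
static void
setkey_sketch (struct aes_ctx_sketch *ctx, unsigned int hw_features)
{
  assert (ctx != NULL);
  ctx->use_vaes = (hw_features & FEAT_VAES) != 0;
  /* The wide path is only enabled on top of plain VAES support. */
  ctx->use_vaes_avx512 = ctx->use_vaes && (hw_features & FEAT_AVX512) != 0;
}

/* Bulk functions check the widest path first and fall back step by step. */
static enum bulk_path
select_bulk_path (const struct aes_ctx_sketch *ctx)
{
  assert (ctx != NULL);
  if (ctx->use_vaes_avx512)
    return PATH_VAES_AVX512;
  if (ctx->use_vaes)
    return PATH_VAES_AVX2;
  return PATH_GENERIC;
}
```

Deciding at key setup keeps the per-call cost to a flag test, and the fallback order means a context without the AVX512 flag behaves as it did before.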

This commit adds a VAES/AVX512 accelerated implementation of AES.
On AMD zen5, the new implementation is about 2x faster than the
VAES/AVX2 implementation for parallelizable modes such as OCB.
On AMD zen4 and Intel tigerlake, VAES/AVX512 runs at about the same
speed as VAES/AVX2, since the hardware processes AES instructions
only 256 bits wide.

Benchmark on AMD Ryzen 9 9950X3D (zen5):

Before (VAES/AVX2):
     AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     0.029 ns/B     32722 MiB/s     0.162 c/B      5566±1
 ECB dec |     0.029 ns/B     32824 MiB/s     0.162 c/B      5563
 CBC enc |     0.449 ns/B      2123 MiB/s      2.50 c/B      5563
 CBC dec |     0.029 ns/B     32735 MiB/s     0.162 c/B      5566
 CFB enc |     0.449 ns/B      2122 MiB/s      2.50 c/B      5565
 CFB dec |     0.029 ns/B     32752 MiB/s     0.162 c/B      5565
 CTR enc |     0.030 ns/B     31694 MiB/s     0.167 c/B      5565
 CTR dec |     0.030 ns/B     31727 MiB/s     0.167 c/B      5568
 XTS enc |     0.033 ns/B     28776 MiB/s     0.184 c/B      5560
 XTS dec |     0.033 ns/B     28517 MiB/s     0.186 c/B      5551±4
 GCM enc |     0.074 ns/B     12841 MiB/s     0.413 c/B      5565
 GCM dec |     0.075 ns/B     12658 MiB/s     0.419 c/B      5566
GCM auth |     0.045 ns/B     21322 MiB/s     0.249 c/B      5566
 OCB enc |     0.030 ns/B     32298 MiB/s     0.164 c/B      5543±4
 OCB dec |     0.029 ns/B     32476 MiB/s     0.163 c/B      5545±6
OCB auth |     0.029 ns/B     32961 MiB/s     0.161 c/B      5561±2

After (VAES/AVX512):
     AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     0.015 ns/B     62011 MiB/s     0.085 c/B      5553±5
 ECB dec |     0.015 ns/B     63315 MiB/s     0.084 c/B      5552±3
 CBC enc |     0.449 ns/B      2122 MiB/s      2.50 c/B      5565
 CBC dec |     0.015 ns/B     63800 MiB/s     0.083 c/B      5557±4
 CFB enc |     0.449 ns/B      2122 MiB/s      2.50 c/B      5562
 CFB dec |     0.015 ns/B     62510 MiB/s     0.085 c/B      5557±1
 CTR enc |     0.016 ns/B     60975 MiB/s     0.087 c/B      5564
 CTR dec |     0.016 ns/B     60737 MiB/s     0.087 c/B      5556±2
 XTS enc |     0.018 ns/B     53861 MiB/s     0.098 c/B      5561±1
 XTS dec |     0.018 ns/B     53604 MiB/s     0.099 c/B      5549±3
 GCM enc |     0.037 ns/B     25806 MiB/s     0.206 c/B      5561±3
 GCM dec |     0.038 ns/B     25223 MiB/s     0.210 c/B      5555±5
GCM auth |     0.021 ns/B     44365 MiB/s     0.120 c/B      5562
 OCB enc |     0.016 ns/B     61035 MiB/s     0.087 c/B      5545±6
 OCB dec |     0.015 ns/B     62190 MiB/s     0.085 c/B      5544±5
OCB auth |     0.015 ns/B     63886 MiB/s     0.083 c/B      5543±7
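The "about 2x" figure can be checked against the zen5 tables above. The sketch below copies a few MiB/s values from those tables and computes the after/before ratio. It also shows how the cycles/byte column relates to ns/byte and the measured clock. The struct and function names are illustrative only and are not part of libgcrypt or its benchmark tool.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark row: throughput before (VAES/AVX2) and after (VAES/AVX512).
   Values are copied from the zen5 tables; names are illustrative. */
struct bench_row
{
  const char *mode;
  double mib_s_before;
  double mib_s_after;
};

static const struct bench_row zen5_rows[] = {
  { "ECB enc", 32722.0, 62011.0 },
  { "CBC enc",  2123.0,  2122.0 },  /* serial mode, not expected to change */
  { "CTR enc", 31694.0, 60975.0 },
  { "OCB enc", 32298.0, 61035.0 },
};

/* Speedup is the ratio of throughputs (after / before). */
static double
speedup (const struct bench_row *r)
{
  assert (r != NULL && r->mib_s_before > 0.0);
  return r->mib_s_after / r->mib_s_before;
}

/* cycles/byte = ns/byte * clock in MHz / 1000. */
static double
cycles_per_byte (double ns_per_byte, double mhz)
{
  return ns_per_byte * mhz / 1000.0;
}
```

For the rows above, the parallelizable modes come out at roughly 1.9x, while CBC encryption stays at about 1.0x because each block depends on the previous ciphertext block and cannot be processed in wider batches.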

Benchmark on AMD Ryzen 9 7900X (zen4):

Before (VAES/AVX2):
     AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     0.028 ns/B     33759 MiB/s     0.160 c/B      5676
 ECB dec |     0.028 ns/B     33560 MiB/s     0.161 c/B      5676
 CBC enc |     0.441 ns/B      2165 MiB/s      2.50 c/B      5676
 CBC dec |     0.029 ns/B     32766 MiB/s     0.165 c/B      5677±2
 CFB enc |     0.440 ns/B      2165 MiB/s      2.50 c/B      5676
 CFB dec |     0.029 ns/B     33053 MiB/s     0.164 c/B      5686±4
 CTR enc |     0.029 ns/B     32420 MiB/s     0.167 c/B      5677±1
 CTR dec |     0.029 ns/B     32531 MiB/s     0.167 c/B      5690±5
 XTS enc |     0.038 ns/B     25081 MiB/s     0.215 c/B      5650
 XTS dec |     0.038 ns/B     25020 MiB/s     0.217 c/B      5704±6
 GCM enc |     0.067 ns/B     14170 MiB/s     0.370 c/B      5500
 GCM dec |     0.067 ns/B     14205 MiB/s     0.369 c/B      5500
GCM auth |     0.038 ns/B     25110 MiB/s     0.209 c/B      5500
 OCB enc |     0.030 ns/B     31579 MiB/s     0.172 c/B      5708±20
 OCB dec |     0.030 ns/B     31613 MiB/s     0.173 c/B      5722±5
OCB auth |     0.029 ns/B     32535 MiB/s     0.167 c/B      5688±1

After (VAES/AVX512):
     AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     0.028 ns/B     33551 MiB/s     0.161 c/B      5676
 ECB dec |     0.029 ns/B     33346 MiB/s     0.162 c/B      5675
 CBC enc |     0.440 ns/B      2166 MiB/s      2.50 c/B      5675
 CBC dec |     0.029 ns/B     33308 MiB/s     0.163 c/B      5685±3
 CFB enc |     0.440 ns/B      2165 MiB/s      2.50 c/B      5675
 CFB dec |     0.029 ns/B     33254 MiB/s     0.163 c/B      5671±1
 CTR enc |     0.029 ns/B     33367 MiB/s     0.163 c/B      5686
 CTR dec |     0.029 ns/B     33447 MiB/s     0.162 c/B      5687
 XTS enc |     0.034 ns/B     27705 MiB/s     0.195 c/B      5673±1
 XTS dec |     0.035 ns/B     27429 MiB/s     0.197 c/B      5677
 GCM enc |     0.057 ns/B     16625 MiB/s     0.324 c/B      5652
 GCM dec |     0.059 ns/B     16094 MiB/s     0.326 c/B      5510
GCM auth |     0.030 ns/B     31982 MiB/s     0.164 c/B      5500
 OCB enc |     0.030 ns/B     31630 MiB/s     0.166 c/B      5500
 OCB dec |     0.030 ns/B     32214 MiB/s     0.163 c/B      5500
OCB auth |     0.029 ns/B     33413 MiB/s     0.157 c/B      5500

Benchmark on Intel Core i3-1115G4I (tigerlake):

Before (VAES/AVX2):
     AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     0.038 ns/B     25068 MiB/s     0.156 c/B      4090
 ECB dec |     0.038 ns/B     25157 MiB/s     0.155 c/B      4090
 CBC enc |     0.459 ns/B      2080 MiB/s      1.88 c/B      4090
 CBC dec |     0.038 ns/B     25091 MiB/s     0.155 c/B      4090
 CFB enc |     0.458 ns/B      2081 MiB/s      1.87 c/B      4090
 CFB dec |     0.038 ns/B     25176 MiB/s     0.155 c/B      4090
 CTR enc |     0.039 ns/B     24466 MiB/s     0.159 c/B      4090
 CTR dec |     0.039 ns/B     24428 MiB/s     0.160 c/B      4090
 XTS enc |     0.057 ns/B     16760 MiB/s     0.233 c/B      4090
 XTS dec |     0.056 ns/B     16952 MiB/s     0.230 c/B      4090
 GCM enc |     0.102 ns/B      9344 MiB/s     0.417 c/B      4090
 GCM dec |     0.102 ns/B      9312 MiB/s     0.419 c/B      4090
GCM auth |     0.063 ns/B     15243 MiB/s     0.256 c/B      4090
 OCB enc |     0.042 ns/B     22451 MiB/s     0.174 c/B      4090
 OCB dec |     0.042 ns/B     22613 MiB/s     0.172 c/B      4090
OCB auth |     0.040 ns/B     23770 MiB/s     0.164 c/B      4090

After (VAES/AVX512):
     AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     0.040 ns/B     24094 MiB/s     0.162 c/B      4097±3
 ECB dec |     0.040 ns/B     24052 MiB/s     0.162 c/B      4097±3
 CBC enc |     0.458 ns/B      2080 MiB/s      1.88 c/B      4090
 CBC dec |     0.039 ns/B     24385 MiB/s     0.160 c/B      4097±3
 CFB enc |     0.458 ns/B      2080 MiB/s      1.87 c/B      4090
 CFB dec |     0.039 ns/B     24403 MiB/s     0.160 c/B      4097±3
 CTR enc |     0.040 ns/B     24119 MiB/s     0.162 c/B      4097±3
 CTR dec |     0.040 ns/B     24095 MiB/s     0.162 c/B      4097±3
 XTS enc |     0.048 ns/B     19891 MiB/s     0.196 c/B      4097±3
 XTS dec |     0.048 ns/B     20077 MiB/s     0.195 c/B      4097±3
 GCM enc |     0.084 ns/B     11417 MiB/s     0.342 c/B      4097±3
 GCM dec |     0.084 ns/B     11373 MiB/s     0.344 c/B      4097±3
GCM auth |     0.045 ns/B     21402 MiB/s     0.183 c/B      4097±3
 OCB enc |     0.040 ns/B     23946 MiB/s     0.163 c/B      4097±3
 OCB dec |     0.040 ns/B     23760 MiB/s     0.164 c/B      4097±4
OCB auth |     0.041 ns/B     23083 MiB/s     0.169 c/B      4097±4

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Fri, Jan 2, 3:01 PM
Parents
rCd5cf2b90c7d0: rijndael-aesni: use assembly for moving first and last round key
References
HEAD -> master