rijndael: add VAES/AVX512 accelerated implementation

* cipher/Makefile.am: Add 'rijndael-vaes-avx512-amd64.S'.
* cipher/rijndael-internal.h (USE_VAES_AVX512): New.
(RIJNDAEL_context_s) [USE_VAES_AVX512]: Add 'use_vaes_avx512'.
* cipher/rijndael-vaes-avx2-amd64.S
(_gcry_vaes_avx2_ocb_crypt_amd64): Minor optimization for aligned
blk8 OCB path.
* cipher/rijndael-vaes-avx512-amd64.S: New.
* cipher/rijndael-vaes.c [USE_VAES_AVX512]
(_gcry_vaes_avx512_cbc_dec_amd64, _gcry_vaes_avx512_cfb_dec_amd64)
(_gcry_vaes_avx512_ctr_enc_amd64)
(_gcry_vaes_avx512_ctr32le_enc_amd64)
(_gcry_vaes_avx512_ocb_aligned_crypt_amd64)
(_gcry_vaes_avx512_xts_crypt_amd64)
(_gcry_vaes_avx512_ecb_crypt_amd64): New.
(_gcry_aes_vaes_ecb_crypt, _gcry_aes_vaes_cbc_dec)
(_gcry_aes_vaes_cfb_dec, _gcry_aes_vaes_ctr_enc)
(_gcry_aes_vaes_ctr32le_enc, _gcry_aes_vaes_ocb_crypt)
(_gcry_aes_vaes_ocb_auth, _gcry_aes_vaes_xts_crypt)
[USE_VAES_AVX512]: Add AVX512 code paths.
* cipher/rijndael.c (do_setkey) [USE_VAES_AVX512]: Add setup for
'ctx->use_vaes_avx512'.
* configure.ac: Add 'rijndael-vaes-avx512-amd64.lo'.

This commit adds VAES/AVX512 acceleration for AES. On AMD zen5, the
new implementation is about 2x faster than the VAES/AVX2
implementation for parallel modes such as OCB. On AMD zen4 and Intel
tigerlake, VAES/AVX512 runs at about the same speed as VAES/AVX2,
because the hardware processes AES instructions only 256 bits wide.
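
The ChangeLog above says that do_setkey sets 'ctx->use_vaes_avx512' and that the bulk functions gain AVX512 code paths. The following is a minimal sketch of that kind of dispatch, not the actual libgcrypt code: the SKETCH_* feature bits, 'struct sketch_ctx', and both function names are hypothetical and only illustrate the order in which the paths would be tried.

```c
/* Hypothetical feature bits; the real libgcrypt HWF_* names and
   values are not reproduced here. */
#define SKETCH_HWF_VAES    (1u << 0)
#define SKETCH_HWF_AVX2    (1u << 1)
#define SKETCH_HWF_AVX512  (1u << 2)

/* Hypothetical stand-in for the relevant RIJNDAEL_context_s flags. */
struct sketch_ctx
{
  unsigned int use_vaes : 1;
  unsigned int use_vaes_avx512 : 1;
};

enum sketch_path
{
  SKETCH_PATH_GENERIC,
  SKETCH_PATH_VAES_AVX2,
  SKETCH_PATH_VAES_AVX512
};

/* Record once, at key setup, which accelerated paths are usable.
   The AVX512 path is only enabled on top of the VAES/AVX2 path. */
void
sketch_setup (struct sketch_ctx *ctx, unsigned int hwfeatures)
{
  ctx->use_vaes = (hwfeatures & SKETCH_HWF_VAES)
                  && (hwfeatures & SKETCH_HWF_AVX2);
  ctx->use_vaes_avx512 = ctx->use_vaes
                         && (hwfeatures & SKETCH_HWF_AVX512);
}

/* Bulk functions prefer the widest available path and fall back. */
enum sketch_path
sketch_select_path (const struct sketch_ctx *ctx)
{
  if (ctx->use_vaes_avx512)
    return SKETCH_PATH_VAES_AVX512;
  if (ctx->use_vaes)
    return SKETCH_PATH_VAES_AVX2;
  return SKETCH_PATH_GENERIC;
}
```

Deciding at key setup keeps the per-call cost of dispatch to a flag test, which matters little for large buffers but avoids repeating feature detection on every bulk call.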
Benchmark on AMD Ryzen 9 9950X3D (zen5):
Before (VAES/AVX2):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.029 ns/B     32722 MiB/s     0.162 c/B      5566±1
        ECB dec |     0.029 ns/B     32824 MiB/s     0.162 c/B      5563
        CBC enc |     0.449 ns/B      2123 MiB/s      2.50 c/B      5563
        CBC dec |     0.029 ns/B     32735 MiB/s     0.162 c/B      5566
        CFB enc |     0.449 ns/B      2122 MiB/s      2.50 c/B      5565
        CFB dec |     0.029 ns/B     32752 MiB/s     0.162 c/B      5565
        CTR enc |     0.030 ns/B     31694 MiB/s     0.167 c/B      5565
        CTR dec |     0.030 ns/B     31727 MiB/s     0.167 c/B      5568
        XTS enc |     0.033 ns/B     28776 MiB/s     0.184 c/B      5560
        XTS dec |     0.033 ns/B     28517 MiB/s     0.186 c/B      5551±4
        GCM enc |     0.074 ns/B     12841 MiB/s     0.413 c/B      5565
        GCM dec |     0.075 ns/B     12658 MiB/s     0.419 c/B      5566
       GCM auth |     0.045 ns/B     21322 MiB/s     0.249 c/B      5566
        OCB enc |     0.030 ns/B     32298 MiB/s     0.164 c/B      5543±4
        OCB dec |     0.029 ns/B     32476 MiB/s     0.163 c/B      5545±6
       OCB auth |     0.029 ns/B     32961 MiB/s     0.161 c/B      5561±2

After (VAES/AVX512):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.015 ns/B     62011 MiB/s     0.085 c/B      5553±5
        ECB dec |     0.015 ns/B     63315 MiB/s     0.084 c/B      5552±3
        CBC enc |     0.449 ns/B      2122 MiB/s      2.50 c/B      5565
        CBC dec |     0.015 ns/B     63800 MiB/s     0.083 c/B      5557±4
        CFB enc |     0.449 ns/B      2122 MiB/s      2.50 c/B      5562
        CFB dec |     0.015 ns/B     62510 MiB/s     0.085 c/B      5557±1
        CTR enc |     0.016 ns/B     60975 MiB/s     0.087 c/B      5564
        CTR dec |     0.016 ns/B     60737 MiB/s     0.087 c/B      5556±2
        XTS enc |     0.018 ns/B     53861 MiB/s     0.098 c/B      5561±1
        XTS dec |     0.018 ns/B     53604 MiB/s     0.099 c/B      5549±3
        GCM enc |     0.037 ns/B     25806 MiB/s     0.206 c/B      5561±3
        GCM dec |     0.038 ns/B     25223 MiB/s     0.210 c/B      5555±5
       GCM auth |     0.021 ns/B     44365 MiB/s     0.120 c/B      5562
        OCB enc |     0.016 ns/B     61035 MiB/s     0.087 c/B      5545±6
        OCB dec |     0.015 ns/B     62190 MiB/s     0.085 c/B      5544±5
       OCB auth |     0.015 ns/B     63886 MiB/s     0.083 c/B      5543±7

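
The "about 2x" figure for zen5 can be checked against the cycles/byte columns of the two zen5 tables above. The sketch below is only a worked calculation: the helper names are made up for illustration, and the table holds a selection of the "enc" rows copied from those results.

```c
#include <stdio.h>

/* One benchmark row: cycles/byte before (VAES/AVX2) and after
   (VAES/AVX512), copied from the zen5 tables. */
struct cpb_row
{
  const char *mode;
  double before_cpb;
  double after_cpb;
};

const struct cpb_row zen5_rows[] = {
  { "ECB enc", 0.162, 0.085 },
  { "CBC enc", 2.50,  2.50  },
  { "CTR enc", 0.167, 0.087 },
  { "XTS enc", 0.184, 0.098 },
  { "GCM enc", 0.413, 0.206 },
  { "OCB enc", 0.164, 0.087 },
};

/* Speedup factor; values above 1.0 mean the new code is faster. */
double
speedup (double before_cpb, double after_cpb)
{
  return before_cpb / after_cpb;
}

/* Print one line per mode with its speedup factor. */
void
print_zen5_speedups (void)
{
  size_t i;

  for (i = 0; i < sizeof zen5_rows / sizeof zen5_rows[0]; i++)
    printf ("%s: %.2fx\n", zen5_rows[i].mode,
            speedup (zen5_rows[i].before_cpb, zen5_rows[i].after_cpb));
}
```

Calling print_zen5_speedups() from main gives roughly 1.9x for ECB, CTR, XTS and OCB encryption and about 2.0x for GCM encryption. CBC encryption stays at 1.0x because each block depends on the previous ciphertext block, so wider vectors cannot help.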
Benchmark on AMD Ryzen 9 7900X (zen4):
Before (VAES/AVX2):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.028 ns/B     33759 MiB/s     0.160 c/B      5676
        ECB dec |     0.028 ns/B     33560 MiB/s     0.161 c/B      5676
        CBC enc |     0.441 ns/B      2165 MiB/s      2.50 c/B      5676
        CBC dec |     0.029 ns/B     32766 MiB/s     0.165 c/B      5677±2
        CFB enc |     0.440 ns/B      2165 MiB/s      2.50 c/B      5676
        CFB dec |     0.029 ns/B     33053 MiB/s     0.164 c/B      5686±4
        CTR enc |     0.029 ns/B     32420 MiB/s     0.167 c/B      5677±1
        CTR dec |     0.029 ns/B     32531 MiB/s     0.167 c/B      5690±5
        XTS enc |     0.038 ns/B     25081 MiB/s     0.215 c/B      5650
        XTS dec |     0.038 ns/B     25020 MiB/s     0.217 c/B      5704±6
        GCM enc |     0.067 ns/B     14170 MiB/s     0.370 c/B      5500
        GCM dec |     0.067 ns/B     14205 MiB/s     0.369 c/B      5500
       GCM auth |     0.038 ns/B     25110 MiB/s     0.209 c/B      5500
        OCB enc |     0.030 ns/B     31579 MiB/s     0.172 c/B      5708±20
        OCB dec |     0.030 ns/B     31613 MiB/s     0.173 c/B      5722±5
       OCB auth |     0.029 ns/B     32535 MiB/s     0.167 c/B      5688±1

After (VAES/AVX512):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.028 ns/B     33551 MiB/s     0.161 c/B      5676
        ECB dec |     0.029 ns/B     33346 MiB/s     0.162 c/B      5675
        CBC enc |     0.440 ns/B      2166 MiB/s      2.50 c/B      5675
        CBC dec |     0.029 ns/B     33308 MiB/s     0.163 c/B      5685±3
        CFB enc |     0.440 ns/B      2165 MiB/s      2.50 c/B      5675
        CFB dec |     0.029 ns/B     33254 MiB/s     0.163 c/B      5671±1
        CTR enc |     0.029 ns/B     33367 MiB/s     0.163 c/B      5686
        CTR dec |     0.029 ns/B     33447 MiB/s     0.162 c/B      5687
        XTS enc |     0.034 ns/B     27705 MiB/s     0.195 c/B      5673±1
        XTS dec |     0.035 ns/B     27429 MiB/s     0.197 c/B      5677
        GCM enc |     0.057 ns/B     16625 MiB/s     0.324 c/B      5652
        GCM dec |     0.059 ns/B     16094 MiB/s     0.326 c/B      5510
       GCM auth |     0.030 ns/B     31982 MiB/s     0.164 c/B      5500
        OCB enc |     0.030 ns/B     31630 MiB/s     0.166 c/B      5500
        OCB dec |     0.030 ns/B     32214 MiB/s     0.163 c/B      5500
       OCB auth |     0.029 ns/B     33413 MiB/s     0.157 c/B      5500

Benchmark on Intel Core i3-1115G4I (tigerlake):
Before (VAES/AVX2):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.038 ns/B     25068 MiB/s     0.156 c/B      4090
        ECB dec |     0.038 ns/B     25157 MiB/s     0.155 c/B      4090
        CBC enc |     0.459 ns/B      2080 MiB/s      1.88 c/B      4090
        CBC dec |     0.038 ns/B     25091 MiB/s     0.155 c/B      4090
        CFB enc |     0.458 ns/B      2081 MiB/s      1.87 c/B      4090
        CFB dec |     0.038 ns/B     25176 MiB/s     0.155 c/B      4090
        CTR enc |     0.039 ns/B     24466 MiB/s     0.159 c/B      4090
        CTR dec |     0.039 ns/B     24428 MiB/s     0.160 c/B      4090
        XTS enc |     0.057 ns/B     16760 MiB/s     0.233 c/B      4090
        XTS dec |     0.056 ns/B     16952 MiB/s     0.230 c/B      4090
        GCM enc |     0.102 ns/B      9344 MiB/s     0.417 c/B      4090
        GCM dec |     0.102 ns/B      9312 MiB/s     0.419 c/B      4090
       GCM auth |     0.063 ns/B     15243 MiB/s     0.256 c/B      4090
        OCB enc |     0.042 ns/B     22451 MiB/s     0.174 c/B      4090
        OCB dec |     0.042 ns/B     22613 MiB/s     0.172 c/B      4090
       OCB auth |     0.040 ns/B     23770 MiB/s     0.164 c/B      4090

After (VAES/AVX512):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.040 ns/B     24094 MiB/s     0.162 c/B      4097±3
        ECB dec |     0.040 ns/B     24052 MiB/s     0.162 c/B      4097±3
        CBC enc |     0.458 ns/B      2080 MiB/s      1.88 c/B      4090
        CBC dec |     0.039 ns/B     24385 MiB/s     0.160 c/B      4097±3
        CFB enc |     0.458 ns/B      2080 MiB/s      1.87 c/B      4090
        CFB dec |     0.039 ns/B     24403 MiB/s     0.160 c/B      4097±3
        CTR enc |     0.040 ns/B     24119 MiB/s     0.162 c/B      4097±3
        CTR dec |     0.040 ns/B     24095 MiB/s     0.162 c/B      4097±3
        XTS enc |     0.048 ns/B     19891 MiB/s     0.196 c/B      4097±3
        XTS dec |     0.048 ns/B     20077 MiB/s     0.195 c/B      4097±3
        GCM enc |     0.084 ns/B     11417 MiB/s     0.342 c/B      4097±3
        GCM dec |     0.084 ns/B     11373 MiB/s     0.344 c/B      4097±3
       GCM auth |     0.045 ns/B     21402 MiB/s     0.183 c/B      4097±3
        OCB enc |     0.040 ns/B     23946 MiB/s     0.163 c/B      4097±3
        OCB dec |     0.040 ns/B     23760 MiB/s     0.164 c/B      4097±4
       OCB auth |     0.041 ns/B     23083 MiB/s     0.169 c/B      4097±4

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>