Add VAES/AVX2 accelerated i386 implementation for AES

* cipher/Makefile.am: Add 'rijndael-vaes-i386.c' and
'rijndael-vaes-avx2-i386.S'.
* cipher/asm-common-i386.h: New.
* cipher/rijndael-internal.h (USE_VAES_I386): New.
* cipher/rijndael-vaes-avx2-i386.S: New.
* cipher/rijndael-vaes-i386.c: New.
* cipher/rijndael-vaes.c: Update header description (add 'AMD64').
* cipher/rijndael.c [USE_VAES]: Add 'USE_VAES_I386' to ifdef around
'_gcry_aes_vaes_*' function prototypes.
(setkey) [USE_VAES_I386]: Add setup of VAES/AVX2/i386 bulk functions.
* configure.ac: Add 'rijndael-vaes-i386.lo' and
'rijndael-vaes-avx2-i386.lo'.
(gcry_cv_gcc_amd64_platform_as_ok): Rename this to ...
(gcry_cv_gcc_x86_platform_as_ok): ... this and change to check for
both AMD64 and i386 assembler compatibility.
(gcry_cv_gcc_win32_platform_as_ok): New.

Benchmark on Intel Core(TM) i3-1115G4 (tigerlake, linux-i386):

Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.127 ns/B      7523 MiB/s     0.379 c/B      2992
        ECB dec |     0.127 ns/B      7517 MiB/s     0.380 c/B      2992
        CBC dec |     0.108 ns/B      8855 MiB/s     0.322 c/B      2993
        CFB dec |     0.107 ns/B      8938 MiB/s     0.319 c/B      2993
        CTR enc |     0.111 ns/B      8589 MiB/s     0.332 c/B      2992
        CTR dec |     0.111 ns/B      8593 MiB/s     0.332 c/B      2993
        XTS enc |     0.140 ns/B      6833 MiB/s     0.418 c/B      2993
        XTS dec |     0.139 ns/B      6863 MiB/s     0.416 c/B      2993
        OCB enc |     0.138 ns/B      6907 MiB/s     0.413 c/B      2993
        OCB dec |     0.139 ns/B      6884 MiB/s     0.415 c/B      2993
       OCB auth |     0.124 ns/B      7679 MiB/s     0.372 c/B      2993

After:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.053 ns/B     18056 MiB/s     0.158 c/B      2992
        ECB dec |     0.053 ns/B     18009 MiB/s     0.158 c/B      2993
        CBC dec |     0.053 ns/B     17955 MiB/s     0.159 c/B      2993
        CFB dec |     0.054 ns/B     17813 MiB/s     0.160 c/B      2993
        CTR enc |     0.061 ns/B     15633 MiB/s     0.183 c/B      2993
        CTR dec |     0.061 ns/B     15608 MiB/s     0.183 c/B      2993
        XTS enc |     0.082 ns/B     11640 MiB/s     0.245 c/B      2993
        XTS dec |     0.081 ns/B     11717 MiB/s     0.244 c/B      2992
        OCB enc |     0.082 ns/B     11677 MiB/s     0.244 c/B      2993
        OCB dec |     0.089 ns/B     10736 MiB/s     0.266 c/B      2992
       OCB auth |     0.080 ns/B     11883 MiB/s     0.240 c/B      2993

  ECB: ~2.4x faster
  CBC/CFB dec: ~2.0x faster
  CTR: ~1.8x faster
  XTS: ~1.7x faster
  OCB enc: ~1.7x faster
  OCB dec/auth: ~1.5x faster

Benchmark on AMD Ryzen 9 7900X (zen4, linux-i386):

Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.070 ns/B     13582 MiB/s     0.330 c/B      4700
        ECB dec |     0.071 ns/B     13525 MiB/s     0.331 c/B      4700
        CBC dec |     0.072 ns/B     13165 MiB/s     0.341 c/B      4701
        CFB dec |     0.072 ns/B     13197 MiB/s     0.340 c/B      4700
        CTR enc |     0.073 ns/B     13140 MiB/s     0.341 c/B      4700
        CTR dec |     0.073 ns/B     13092 MiB/s     0.342 c/B      4700
        XTS enc |     0.093 ns/B     10268 MiB/s     0.437 c/B      4700
        XTS dec |     0.093 ns/B     10204 MiB/s     0.439 c/B      4700
        OCB enc |     0.088 ns/B     10885 MiB/s     0.412 c/B      4700
        OCB dec |     0.180 ns/B      5290 MiB/s     0.847 c/B      4700
       OCB auth |     0.174 ns/B      5466 MiB/s     0.820 c/B      4700

After:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.035 ns/B     27469 MiB/s     0.163 c/B      4700
        ECB dec |     0.035 ns/B     27482 MiB/s     0.163 c/B      4700
        CBC dec |     0.036 ns/B     26853 MiB/s     0.167 c/B      4700
        CFB dec |     0.035 ns/B     27452 MiB/s     0.163 c/B      4700
        CTR enc |     0.042 ns/B     22573 MiB/s     0.199 c/B      4700
        CTR dec |     0.042 ns/B     22524 MiB/s     0.199 c/B      4700
        XTS enc |     0.054 ns/B     17731 MiB/s     0.253 c/B      4700
        XTS dec |     0.054 ns/B     17788 MiB/s     0.252 c/B      4700
        OCB enc |     0.043 ns/B     22162 MiB/s     0.202 c/B      4700
        OCB dec |     0.044 ns/B     21918 MiB/s     0.205 c/B      4700
       OCB auth |     0.039 ns/B     24327 MiB/s     0.184 c/B      4700

  ECB: ~2.0x faster
  CBC/CFB dec: ~2.0x faster
  CTR/XTS: ~1.7x faster
  OCB enc: ~2.0x faster
  OCB dec: ~4.1x faster
  OCB auth: ~4.4x faster

Benchmark on AMD Ryzen 7 5800X (zen3, win32):

Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.087 ns/B     11002 MiB/s     0.329 c/B      3800
        ECB dec |     0.088 ns/B     10887 MiB/s     0.333 c/B      3801
        CBC dec |     0.097 ns/B      9831 MiB/s     0.369 c/B      3801
        CFB dec |     0.096 ns/B      9897 MiB/s     0.366 c/B      3800
        CTR enc |     0.104 ns/B      9190 MiB/s     0.394 c/B      3801
        CTR dec |     0.105 ns/B      9083 MiB/s     0.399 c/B      3801
        XTS enc |     0.127 ns/B      7538 MiB/s     0.481 c/B      3801
        XTS dec |     0.127 ns/B      7505 MiB/s     0.483 c/B      3801
        OCB enc |     0.117 ns/B      8180 MiB/s     0.443 c/B      3801
        OCB dec |     0.115 ns/B      8296 MiB/s     0.437 c/B      3800
       OCB auth |     0.107 ns/B      8928 MiB/s     0.406 c/B      3801

After:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.042 ns/B     22515 MiB/s     0.161 c/B      3801
        ECB dec |     0.043 ns/B     22308 MiB/s     0.163 c/B      3801
        CBC dec |     0.050 ns/B     18910 MiB/s     0.192 c/B      3801
        CFB dec |     0.049 ns/B     19402 MiB/s     0.187 c/B      3801
        CTR enc |     0.053 ns/B     18002 MiB/s     0.201 c/B      3801
        CTR dec |     0.053 ns/B     17944 MiB/s     0.202 c/B      3801
        XTS enc |     0.076 ns/B     12531 MiB/s     0.289 c/B      3801
        XTS dec |     0.077 ns/B     12465 MiB/s     0.291 c/B      3801
        OCB enc |     0.065 ns/B     14719 MiB/s     0.246 c/B      3801
        OCB dec |     0.060 ns/B     15887 MiB/s     0.228 c/B      3801
       OCB auth |     0.054 ns/B     17504 MiB/s     0.207 c/B      3801

  ECB: ~2.0x faster
  CBC/CFB dec: ~1.9x faster
  CTR: ~1.9x faster
  XTS: ~1.6x faster
  OCB enc: ~1.8x faster
  OCB dec/auth: ~1.9x faster

[v2]:
- Improve CTR performance
- Improve OCB performance

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>