Add VAES/AVX2 accelerated i386 implementation for AES

Description

* cipher/Makefile.am: Add 'rijndael-vaes-i386.c' and
'rijndael-vaes-avx2-i386.S'.
* cipher/asm-common-i386.h: New.
* cipher/rijndael-internal.h (USE_VAES_I386): New.
* cipher/rijndael-vaes-avx2-i386.S: New.
* cipher/rijndael-vaes-i386.c: New.
* cipher/rijndael-vaes.c: Update header description (add 'AMD64').
* cipher/rijndael.c [USE_VAES]: Add 'USE_VAES_I386' to ifdef around
'_gcry_aes_vaes_*' function prototypes.
(setkey) [USE_VAES_I386]: Add setup of VAES/AVX2/i386 bulk functions.
* configure.ac: Add 'rijndael-vaes-i386.lo' and
'rijndael-vaes-avx2-i386.lo'.
(gcry_cv_gcc_amd64_platform_as_ok): Rename this to ...
(gcry_cv_gcc_x86_platform_as_ok): ... this and change to check for
both AMD64 and i386 assembler compatibility.
(gcry_cv_gcc_win32_platform_as_ok): New.
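
The '(setkey)' entry above describes selecting the VAES/AVX2 bulk functions at key setup time. A minimal sketch of that kind of runtime dispatch follows. It is not the code from this commit: 'struct bulk_ops', 'setup_bulk', 'detect_and_setup' and the stub functions are hypothetical names for illustration, and the sketch reads CPUID leaf 7 directly through GCC's '<cpuid.h>' rather than through libgcrypt's own hardware feature detection. A production check would also have to confirm AES-NI and that the OS has enabled AVX register state, which is omitted here.

```c
#include <stdio.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* CPUID leaf 7, sub-leaf 0 feature bits. */
#define CPUID7_EBX_AVX2 (1u << 5)
#define CPUID7_ECX_VAES (1u << 9)

/* Hypothetical stand-in for the bulk-function slots of a cipher context. */
typedef void (*bulk_fn) (void);
struct bulk_ops
{
  bulk_fn ecb_crypt;
  const char *impl;
};

/* Stubs standing in for the real implementations. */
static void generic_ecb (void) {}
static void vaes_avx2_ecb (void) {}

/* Return 1 only if both AVX2 and VAES are reported, else 0. */
int
has_vaes_avx2 (unsigned int ebx, unsigned int ecx)
{
  return (ebx & CPUID7_EBX_AVX2) && (ecx & CPUID7_ECX_VAES);
}

/* Start from the generic path, then override it when the
   accelerated path is usable. */
void
setup_bulk (struct bulk_ops *ops, unsigned int ebx, unsigned int ecx)
{
  ops->ecb_crypt = generic_ecb;
  ops->impl = "generic";
  if (has_vaes_avx2 (ebx, ecx))
    {
      ops->ecb_crypt = vaes_avx2_ecb;
      ops->impl = "vaes-avx2";
    }
}

/* Query the running CPU; on non-x86 builds the generic path is kept. */
void
detect_and_setup (struct bulk_ops *ops)
{
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
#if defined(__i386__) || defined(__x86_64__)
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    ebx = ecx = 0;
#endif
  (void) eax;
  (void) edx;
  setup_bulk (ops, ebx, ecx);
}
```

Calling 'detect_and_setup (&ops)' once and then printing 'ops.impl' shows which path was chosen on the running machine. Keeping the bit test in 'has_vaes_avx2' separate from the CPUID query lets the selection logic be checked with synthetic register values.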

Benchmark on Intel Core(TM) i3-1115G4 (tigerlake, linux-i386):

Before:
     AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     0.127 ns/B      7523 MiB/s     0.379 c/B      2992
 ECB dec |     0.127 ns/B      7517 MiB/s     0.380 c/B      2992
 CBC dec |     0.108 ns/B      8855 MiB/s     0.322 c/B      2993
 CFB dec |     0.107 ns/B      8938 MiB/s     0.319 c/B      2993
 CTR enc |     0.111 ns/B      8589 MiB/s     0.332 c/B      2992
 CTR dec |     0.111 ns/B      8593 MiB/s     0.332 c/B      2993
 XTS enc |     0.140 ns/B      6833 MiB/s     0.418 c/B      2993
 XTS dec |     0.139 ns/B      6863 MiB/s     0.416 c/B      2993
 OCB enc |     0.138 ns/B      6907 MiB/s     0.413 c/B      2993
 OCB dec |     0.139 ns/B      6884 MiB/s     0.415 c/B      2993
OCB auth |     0.124 ns/B      7679 MiB/s     0.372 c/B      2993

After:
          AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
      ECB enc |     0.053 ns/B     18056 MiB/s     0.158 c/B      2992
      ECB dec |     0.053 ns/B     18009 MiB/s     0.158 c/B      2993
      CBC dec |     0.053 ns/B     17955 MiB/s     0.159 c/B      2993
      CFB dec |     0.054 ns/B     17813 MiB/s     0.160 c/B      2993
      CTR enc |     0.061 ns/B     15633 MiB/s     0.183 c/B      2993
      CTR dec |     0.061 ns/B     15608 MiB/s     0.183 c/B      2993
      XTS enc |     0.082 ns/B     11640 MiB/s     0.245 c/B      2993
      XTS dec |     0.081 ns/B     11717 MiB/s     0.244 c/B      2992
      OCB enc |     0.082 ns/B     11677 MiB/s     0.244 c/B      2993
      OCB dec |     0.089 ns/B     10736 MiB/s     0.266 c/B      2992
     OCB auth |     0.080 ns/B     11883 MiB/s     0.240 c/B      2993

ECB: ~2.4x faster
CBC/CFB dec: ~2.0x faster
CTR: ~1.8x faster
XTS: ~1.7x faster
OCB enc: ~1.7x faster
OCB dec/auth: ~1.5x faster
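
The speedup factors above follow from dividing the "Before" nanosecs/byte figure by the "After" figure, and the cycles/byte column is nanosecs/byte multiplied by the clock rate in GHz. A small sketch of both calculations, with hypothetical helper names, assuming the rounded values printed in the tables (so results agree with the tables only up to rounding):

```c
#include <stdio.h>

/* Speedup factor from two ns/byte figures (lower ns/byte is faster). */
double
speedup (double before_ns_per_byte, double after_ns_per_byte)
{
  return before_ns_per_byte / after_ns_per_byte;
}

/* cycles/byte = ns/byte * clock in GHz; the tables give the clock in MHz. */
double
cycles_per_byte (double ns_per_byte, double mhz)
{
  return ns_per_byte * mhz / 1000.0;
}

/* Print the ECB enc figures for the i3-1115G4 run. */
void
print_ecb_example (void)
{
  printf ("ECB enc speedup: ~%.1fx\n", speedup (0.127, 0.053));
  printf ("ECB enc before: %.3f c/B\n", cycles_per_byte (0.127, 2992.0));
}
```

For the ECB enc rows of the i3-1115G4 run, 0.127 / 0.053 gives roughly 2.4, matching the summary line.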

Benchmark on AMD Ryzen 9 7900X (zen4, linux-i386):

Before:
     AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     0.070 ns/B     13582 MiB/s     0.330 c/B      4700
 ECB dec |     0.071 ns/B     13525 MiB/s     0.331 c/B      4700
 CBC dec |     0.072 ns/B     13165 MiB/s     0.341 c/B      4701
 CFB dec |     0.072 ns/B     13197 MiB/s     0.340 c/B      4700
 CTR enc |     0.073 ns/B     13140 MiB/s     0.341 c/B      4700
 CTR dec |     0.073 ns/B     13092 MiB/s     0.342 c/B      4700
 XTS enc |     0.093 ns/B     10268 MiB/s     0.437 c/B      4700
 XTS dec |     0.093 ns/B     10204 MiB/s     0.439 c/B      4700
 OCB enc |     0.088 ns/B     10885 MiB/s     0.412 c/B      4700
 OCB dec |     0.180 ns/B      5290 MiB/s     0.847 c/B      4700
OCB auth |     0.174 ns/B      5466 MiB/s     0.820 c/B      4700

After:
          AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
      ECB enc |     0.035 ns/B     27469 MiB/s     0.163 c/B      4700
      ECB dec |     0.035 ns/B     27482 MiB/s     0.163 c/B      4700
      CBC dec |     0.036 ns/B     26853 MiB/s     0.167 c/B      4700
      CFB dec |     0.035 ns/B     27452 MiB/s     0.163 c/B      4700
      CTR enc |     0.042 ns/B     22573 MiB/s     0.199 c/B      4700
      CTR dec |     0.042 ns/B     22524 MiB/s     0.199 c/B      4700
      XTS enc |     0.054 ns/B     17731 MiB/s     0.253 c/B      4700
      XTS dec |     0.054 ns/B     17788 MiB/s     0.252 c/B      4700
      OCB enc |     0.043 ns/B     22162 MiB/s     0.202 c/B      4700
      OCB dec |     0.044 ns/B     21918 MiB/s     0.205 c/B      4700
     OCB auth |     0.039 ns/B     24327 MiB/s     0.184 c/B      4700

ECB: ~2.0x faster
CBC/CFB dec: ~2.0x faster
CTR/XTS: ~1.7x faster
OCB enc: ~2.0x faster
OCB dec: ~4.1x faster
OCB auth: ~4.4x faster

Benchmark on AMD Ryzen 7 5800X (zen3, win32):

Before:
     AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 ECB enc |     0.087 ns/B     11002 MiB/s     0.329 c/B      3800
 ECB dec |     0.088 ns/B     10887 MiB/s     0.333 c/B      3801
 CBC dec |     0.097 ns/B      9831 MiB/s     0.369 c/B      3801
 CFB dec |     0.096 ns/B      9897 MiB/s     0.366 c/B      3800
 CTR enc |     0.104 ns/B      9190 MiB/s     0.394 c/B      3801
 CTR dec |     0.105 ns/B      9083 MiB/s     0.399 c/B      3801
 XTS enc |     0.127 ns/B      7538 MiB/s     0.481 c/B      3801
 XTS dec |     0.127 ns/B      7505 MiB/s     0.483 c/B      3801
 OCB enc |     0.117 ns/B      8180 MiB/s     0.443 c/B      3801
 OCB dec |     0.115 ns/B      8296 MiB/s     0.437 c/B      3800
OCB auth |     0.107 ns/B      8928 MiB/s     0.406 c/B      3801

After:
          AES |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
      ECB enc |     0.042 ns/B     22515 MiB/s     0.161 c/B      3801
      ECB dec |     0.043 ns/B     22308 MiB/s     0.163 c/B      3801
      CBC dec |     0.050 ns/B     18910 MiB/s     0.192 c/B      3801
      CFB dec |     0.049 ns/B     19402 MiB/s     0.187 c/B      3801
      CTR enc |     0.053 ns/B     18002 MiB/s     0.201 c/B      3801
      CTR dec |     0.053 ns/B     17944 MiB/s     0.202 c/B      3801
      XTS enc |     0.076 ns/B     12531 MiB/s     0.289 c/B      3801
      XTS dec |     0.077 ns/B     12465 MiB/s     0.291 c/B      3801
      OCB enc |     0.065 ns/B     14719 MiB/s     0.246 c/B      3801
      OCB dec |     0.060 ns/B     15887 MiB/s     0.228 c/B      3801
     OCB auth |     0.054 ns/B     17504 MiB/s     0.207 c/B      3801

ECB: ~2.0x faster
CBC/CFB dec: ~1.9x faster
CTR: ~1.9x faster
XTS: ~1.6x faster
OCB enc: ~1.8x faster
OCB dec/auth: ~1.9x faster

[v2]:

  • Improve CTR performance
  • Improve OCB performance

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
Authored by jukivili on Jun 3 2023, 11:49 AM
Parents
rC13f288edd527: rijndael-vaes-avx2-amd64: avoid extra load in CFB & CBC IV handling