aria-avx2: add VAES accelerated implementation
* cipher/aria-aesni-avx2-amd64.S (CONFIG_AS_VAES): New.
[CONFIG_AS_VAES]: Add VAES accelerated assembly macros and functions.
* cipher/aria.c (USE_VAES_AVX2): New.
(ARIA_context): Add 'use_vaes_avx2'.
(_gcry_aria_vaes_avx2_ecb_crypt_blk32)
(_gcry_aria_vaes_avx2_ctr_crypt_blk32)
(aria_avx2_ecb_crypt_blk32, aria_avx2_ctr_crypt_blk32): Add
VAES/AVX2 code paths.
(aria_setkey): Enable VAES/AVX2 implementation based on HW features.
This patch adds VAES/AVX2 accelerated ARIA block cipher implementation.
The VAES instruction set extends the AESNI instructions to operate on
all 128-bit lanes of 256-bit YMM and 512-bit ZMM vector registers, so
AES operations can be executed directly on YMM registers without
manually splitting each YMM register into two XMM halves for AESNI
instructions. This improves performance on CPUs that support VAES but
not GFNI, such as AMD Zen3.
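The difference can be illustrated with the following sketch (register
allocation and macro shape are illustrative, not the exact code in
aria-aesni-avx2-amd64.S):

```asm
/* AESNI path: the 256-bit state in %ymm0 must be split into two
 * 128-bit halves, encrypted separately, and merged back. */
vextracti128 $1, %ymm0, %xmm1        /* high 128-bit lane -> xmm1 */
vaesenclast  %xmm2, %xmm0, %xmm0     /* AES last round on low half */
vaesenclast  %xmm2, %xmm1, %xmm1     /* AES last round on high half */
vinserti128  $1, %xmm1, %ymm0, %ymm0 /* merge halves back to ymm0 */

/* VAES path: one instruction processes both 128-bit lanes. */
vaesenclast  %ymm2, %ymm0, %ymm0
```

The VAES path replaces four instructions with one per AES step, which
is where the speedup on Zen3 comes from.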
Benchmark on Ryzen 7 5800X (zen3, turbo-freq off):
Before (AESNI/AVX2):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
        ECB enc |     0.559 ns/B      1707 MiB/s      2.12 c/B      3800
        ECB dec |     0.560 ns/B      1703 MiB/s      2.13 c/B      3800
        CTR enc |     0.570 ns/B      1672 MiB/s      2.17 c/B      3800
        CTR dec |     0.568 ns/B      1679 MiB/s      2.16 c/B      3800
After (VAES/AVX2, ~33% faster):
ARIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
        ECB enc |     0.435 ns/B      2193 MiB/s      1.65 c/B      3800
        ECB dec |     0.434 ns/B      2197 MiB/s      1.65 c/B      3800
        CTR enc |     0.413 ns/B      2306 MiB/s      1.57 c/B      3800
        CTR dec |     0.411 ns/B      2318 MiB/s      1.56 c/B      3800
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>