camellia: add x86_64 VAES/AVX2 accelerated implementation
* cipher/Makefile.am: Add 'camellia-aesni-avx2-amd64.h' and 'camellia-vaes-avx2-amd64.S'. * cipher/camellia-aesni-avx2-amd64.S: New, old content moved to... * cipher/camellia-aesni-avx2-amd64.h: ...here. (IF_AESNI, IF_VAES, FUNC_NAME): New. * cipher/camellia-vaes-avx2-amd64.S: New. * cipher/camellia-glue.c (USE_VAES_AVX2): New. (CAMELLIA_context): New member 'use_vaes_avx2'. (_gcry_camellia_vaes_avx2_ctr_enc, _gcry_camellia_vaes_avx2_cbc_dec) (_gcry_camellia_vaes_avx2_cfb_dec, _gcry_camellia_vaes_avx2_ocb_enc) (_gcry_camellia_vaes_avx2_ocb_dec) (_gcry_camellia_vaes_avx2_ocb_auth): New. (camellia_setkey): Check for HWF_INTEL_VAES. (_gcry_camellia_ctr_enc, _gcry_camellia_cbc_dec) (_gcry_camellia_cfb_dec, _gcry_camellia_ocb_crypt) (_gcry_camellia_ocb_auth): Add USE_VAES_AVX2 code. * configure.ac: Add 'camellia-vaes-avx2-amd64.lo'.
Camellia AES-NI/AVX2 implementation had to split 256-bit vector
to 128-bit parts for AES processing, but now we can use those
256-bit registers directly with VAES.
Benchmarks on AMD Ryzen 5800X:
Before (AES-NI/AVX2):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.539 ns/B 1769 MiB/s 2.62 c/B 4852 CFB dec | 0.528 ns/B 1806 MiB/s 2.56 c/B 4852±1 CTR enc | 0.552 ns/B 1728 MiB/s 2.68 c/B 4850 OCB enc | 0.550 ns/B 1734 MiB/s 2.65 c/B 4825 OCB dec | 0.577 ns/B 1653 MiB/s 2.78 c/B 4825 OCB auth | 0.546 ns/B 1747 MiB/s 2.63 c/B 4825
After (VAES/AVX2, CBC-dec ~13%, CFB-dec/CTR/OCB ~20% faster):
CAMELLIA128 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz
CBC dec | 0.477 ns/B 1999 MiB/s 2.31 c/B 4850 CFB dec | 0.433 ns/B 2201 MiB/s 2.10 c/B 4850 CTR enc | 0.438 ns/B 2176 MiB/s 2.13 c/B 4851 OCB enc | 0.449 ns/B 2122 MiB/s 2.18 c/B 4850 OCB dec | 0.468 ns/B 2038 MiB/s 2.27 c/B 4850 OCB auth | 0.447 ns/B 2131 MiB/s 2.17 c/B 4850
- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>