sha3: Add x86-64 AVX512 accelerated implementation

* LICENSES: Add 'cipher/keccak-amd64-avx512.S'.
* configure.ac: Add 'keccak-amd64-avx512.lo'.
* cipher/Makefile.am: Add 'keccak-amd64-avx512.S'.
* cipher/keccak-amd64-avx512.S: New.
* cipher/keccak.c (USE_64BIT_AVX512, ASM_FUNC_ABI): New.
[USE_64BIT_AVX512] (_gcry_keccak_f1600_state_permute64_avx512)
(_gcry_keccak_absorb_blocks_avx512, keccak_f1600_state_permute64_avx512)
(keccak_absorb_lanes64_avx512, keccak_avx512_64_ops): New.
(keccak_init) [USE_64BIT_AVX512]: Enable x86-64 AVX512 implementation if
supported by HW features.

Benchmark on Intel Core i3-1115G4 (tigerlake):
Before (BMI2 instructions):
         |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
SHA3-224 |      1.77 ns/B     540.3 MiB/s      7.22 c/B      4088
SHA3-256 |      1.86 ns/B     514.0 MiB/s      7.59 c/B      4089
SHA3-384 |      2.43 ns/B     393.1 MiB/s      9.92 c/B      4089
SHA3-512 |      3.49 ns/B     273.2 MiB/s     14.27 c/B      4088
SHAKE128 |      1.52 ns/B     629.1 MiB/s      6.20 c/B      4089
SHAKE256 |      1.86 ns/B     511.6 MiB/s      7.62 c/B      4089
After (~33% faster):
         |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
SHA3-224 |      1.32 ns/B     721.8 MiB/s      5.40 c/B      4089
SHA3-256 |      1.40 ns/B     681.7 MiB/s      5.72 c/B      4089
SHA3-384 |      1.83 ns/B     522.5 MiB/s      7.46 c/B      4089
SHA3-512 |      2.63 ns/B     362.1 MiB/s     10.77 c/B      4088
SHAKE128 |      1.13 ns/B     840.4 MiB/s      4.64 c/B      4089
SHAKE256 |      1.40 ns/B     682.1 MiB/s      5.72 c/B      4089
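
The "~33% faster" figure can be reproduced from the cycles/byte columns
above. The following is only a sketch for checking the arithmetic. The
names bench_row, speedup_percent and mean_speedup_percent are made up for
this illustration and are not part of libgcrypt. It assumes "faster" means
throughput gain, i.e. (before / after - 1) * 100.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles/byte figures copied from the two benchmark tables above.
 * This struct is hypothetical, for illustration only. */
struct bench_row
{
  const char *name;
  double before_cpb; /* BMI2 implementation */
  double after_cpb;  /* AVX512 implementation */
};

const struct bench_row bench_rows[] =
  {
    { "SHA3-224", 7.22, 5.40 },
    { "SHA3-256", 7.59, 5.72 },
    { "SHA3-384", 9.92, 7.46 },
    { "SHA3-512", 14.27, 10.77 },
    { "SHAKE128", 6.20, 4.64 },
    { "SHAKE256", 7.62, 5.72 },
  };

/* Throughput gain in percent, assuming gain = (before / after - 1) * 100. */
double
speedup_percent (double before_cpb, double after_cpb)
{
  assert (after_cpb > 0.0);
  return (before_cpb / after_cpb - 1.0) * 100.0;
}

/* Arithmetic mean of the per-algorithm gains over all table rows. */
double
mean_speedup_percent (void)
{
  size_t n = sizeof (bench_rows) / sizeof (bench_rows[0]);
  double sum = 0.0;
  size_t i;

  for (i = 0; i < n; i++)
    sum += speedup_percent (bench_rows[i].before_cpb,
                            bench_rows[i].after_cpb);
  return sum / (double) n;
}
```

With these inputs every row lands between 32% and 34%, which is consistent
with the rounded figure quoted in the "After" heading.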

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>