camellia-gfni-avx512: add 1-block constant-time implementation
* cipher/camellia-gfni-avx512-amd64.S
(_gcry_camellia_gfni_avx512_enc_blk1)
(_gcry_camellia_gfni_avx512_dec_blk1): New.
* cipher/camellia-glue.c [USE_GFNI_AVX512]
(_gcry_camellia_gfni_avx512_enc_blk1)
(_gcry_camellia_gfni_avx512_dec_blk1): New prototypes.
(camellia_decrypt, camellia_encrypt) [USE_GFNI_AVX512]: Use GFNI/AVX512
1-block implementation if supported by CPU.
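For illustration only, the "use if supported by CPU" selection could be
sketched roughly as below. This is not the code from camellia-glue.c: the
feature bits, stub routines, and select_blk1 are hypothetical names made up
for this sketch, and the stubs only copy the block instead of running
Camellia.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature bits for this sketch; not libgcrypt's HWF_* values. */
#define FEAT_GFNI   (1u << 0)
#define FEAT_AVX512 (1u << 1)

#define BLOCK_SIZE 16

typedef void (*blk1_fn) (const unsigned char *in, unsigned char *out);

/* Records which path ran last: 1 = SIMD stand-in, 2 = generic stand-in. */
static int last_path;

/* Stand-in for a GFNI/AVX512 1-block routine (only copies the block). */
static void
blk1_simd_stub (const unsigned char *in, unsigned char *out)
{
  assert (in != NULL && out != NULL);
  memcpy (out, in, BLOCK_SIZE);
  last_path = 1;
}

/* Stand-in for the generic 1-block routine (only copies the block). */
static void
blk1_generic_stub (const unsigned char *in, unsigned char *out)
{
  assert (in != NULL && out != NULL);
  memcpy (out, in, BLOCK_SIZE);
  last_path = 2;
}

/* Pick the SIMD path only when both required features are present. */
static blk1_fn
select_blk1 (unsigned int features)
{
  const unsigned int need = FEAT_GFNI | FEAT_AVX512;

  return ((features & need) == need) ? blk1_simd_stub : blk1_generic_stub;
}
```

Requiring both bits keeps the fallback path in use on CPUs that report
only one of the two features.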
Benchmark on Intel (tigerlake):
Before:
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      5.57 ns/B     171.3 MiB/s     22.77 c/B      4090
        CFB enc |      5.57 ns/B     171.2 MiB/s     22.79 c/B      4090
After (~27% faster):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      4.36 ns/B     218.9 MiB/s     17.82 c/B      4090
        CFB enc |      4.35 ns/B     219.1 MiB/s     17.80 c/B      4090
Benchmark on AMD Ryzen 9 7900X (zen4):
Before:
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      3.75 ns/B     254.1 MiB/s     20.64 c/B      5500
        CFB enc |      3.75 ns/B     254.2 MiB/s     20.63 c/B      5500
After (~12% faster):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      3.34 ns/B     285.6 MiB/s     18.29 c/B      5475
        CFB enc |      3.34 ns/B     285.6 MiB/s     18.28 c/B      5475
Benchmark on AMD Ryzen 9 9950X3D (zen5):
Before:
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      3.15 ns/B     302.8 MiB/s     18.10 c/B      5747
        CFB enc |      3.18 ns/B     300.0 MiB/s     18.27 c/B      5748
After (~13% slower):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC enc |      3.58 ns/B     266.7 MiB/s     20.55 c/B    5746±5
        CFB enc |      3.58 ns/B     266.7 MiB/s     20.55 c/B      5748
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>