blake2: avoid AVX/AVX2/AVX512 when CPU has high vector inst latency
* cipher/blake2.c (blake2b_init_ctx, blake2s_init_ctx): Disable
AVX/AVX2/AVX512 implementation if x86 CPU prefers GPR implementation
over scalar integer vector.
* src/hwf-common.h (hwf_x86_cpu_details)
(_gcry_hwf_x86_cpu_details): New.
* src/hwf-x86.c (x86_cpu_details, x86_hw_features)
(x86_detect_done, _gcry_hwf_x86_cpu_details): New.
(detect_x86_gnuc): Detect Zen5 and add 'cpu_details'.
(_gcry_hwf_detect_x86): Add 'x86_cpu_details' setup.
The Blake2s/Blake2b AVX/AVX2/AVX512 implementations are slower than the
generic C implementation when the CPU has an integer vector latency higher
than 1 cycle (for example, AMD Zen5 has an int-vector latency of 2) and
powerful GPR execution. Therefore use the generic C implementation for
Blake2 on Zen5.
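The selection rule described above can be sketched roughly as follows. This is a minimal illustration only, not the code from cipher/blake2.c: the feature bits, the `sketch_cpu_details` struct, its `prefer_gpr_over_int_vector` field and the function name are all hypothetical stand-ins, and the AVX512 > AVX2 > AVX priority order is an assumption.

```c
#include <stdbool.h>

/* Hypothetical feature bits; NOT libgcrypt's real HWF_* values. */
#define SKETCH_HWF_AVX     (1u << 0)
#define SKETCH_HWF_AVX2    (1u << 1)
#define SKETCH_HWF_AVX512  (1u << 2)

/* Hypothetical stand-in for per-CPU details filled in at detection time. */
struct sketch_cpu_details
{
  /* Assumed to be set when a CPU such as Zen5 is detected. */
  bool prefer_gpr_over_int_vector;
};

enum sketch_blake2_impl
{
  SKETCH_IMPL_GENERIC_C,
  SKETCH_IMPL_AVX,
  SKETCH_IMPL_AVX2,
  SKETCH_IMPL_AVX512
};

/* Pick an implementation: mask out all vector variants when the CPU
 * prefers GPR code, then take the widest remaining vector variant. */
static enum sketch_blake2_impl
sketch_select_blake2_impl (unsigned int features,
                           const struct sketch_cpu_details *cpu)
{
  if (cpu->prefer_gpr_over_int_vector)
    features &= ~(SKETCH_HWF_AVX | SKETCH_HWF_AVX2 | SKETCH_HWF_AVX512);

  if (features & SKETCH_HWF_AVX512)
    return SKETCH_IMPL_AVX512;
  if (features & SKETCH_HWF_AVX2)
    return SKETCH_IMPL_AVX2;
  if (features & SKETCH_HWF_AVX)
    return SKETCH_IMPL_AVX;
  return SKETCH_IMPL_GENERIC_C;
}
```

With this sketch, a CPU that reports AVX512 but has the preference flag set falls through to the generic C path, while the same feature mask without the flag still selects AVX512.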
Generic C with AMD Zen5:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |     0.473 ns/B      2016 MiB/s      2.72 c/B      5750
 BLAKE2S_256    |     0.798 ns/B      1195 MiB/s      4.59 c/B      5750
AVX512 with AMD Zen5:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |     0.923 ns/B      1033 MiB/s      5.31 c/B      5750
 BLAKE2S_256    |      1.42 ns/B     672.4 MiB/s      8.15 c/B      5749
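For reading the two result sets above, the columns relate as cycles/byte = ns/byte * MHz / 1000, and the relative gain is the ratio of the ns/byte figures. The helpers below are a small sketch with hypothetical names, not part of the benchmark tool; with the numbers above they give roughly 1.95x for BLAKE2B_512 and 1.78x for BLAKE2S_256 in favour of generic C.

```c
/* Convert a ns/byte figure to cycles/byte at the given clock (in MHz). */
static double
sketch_cycles_per_byte (double ns_per_byte, double mhz)
{
  return ns_per_byte * mhz / 1000.0;
}

/* How many times faster the "fast" result is than the "slow" one,
 * both given in ns/byte (lower is better). */
static double
sketch_speedup (double slow_ns_per_byte, double fast_ns_per_byte)
{
  return slow_ns_per_byte / fast_ns_per_byte;
}
```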
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>