blake2b-avx512: replace VPGATHER with manual gather
* cipher/blake2.c (blake2b_init_ctx): Remove HWF_INTEL_FAST_VPGATHER
check for AVX512 implementation.
* cipher/blake2b-amd64-avx512.S (R16, VPINSRQ_KMASK, .Lshuf_ror16)
(.Lk1_mask): New.
(GEN_GMASK, RESET_KMASKS, .Lgmask*): Remove.
(GATHER_MSG): Use manual gather instead of VPGATHER.
(ROR_16): Use vpshufb for small speed improvement on tigerlake.
(_gcry_blake2b_transform_amd64_avx512): New setup & clean-up for
kmask registers; Reduce excess loop alignment from 64B to 16B.
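For readers unfamiliar with the ROR_16 change: rotating a 64-bit word right by 16 bits moves whole bytes, so it can be written as a byte shuffle, which is what makes vpshufb usable for it. The portable C sketch below only illustrates that equivalence. The names `shuf_ror16`, `ror16_shift`, `ror16_bytes` and `ror16_self_check` are made up for this example, and the index table is an assumption about what a per-lane shuffle mask for this rotation looks like, not a copy of the .Lshuf_ror16 constant.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-lane byte indices (assumption, not taken from the
 * assembly): result byte i is taken from source byte shuf_ror16[i],
 * counting bytes from the least significant end. */
static const uint8_t shuf_ror16[8] = { 2, 3, 4, 5, 6, 7, 0, 1 };

/* Reference: rotate right by 16 using shifts. */
static uint64_t ror16_shift(uint64_t x)
{
  return (x >> 16) | (x << 48);
}

/* Same rotation expressed as a byte permutation, the way a byte-shuffle
 * instruction would apply it to one 64-bit lane. */
static uint64_t ror16_bytes(uint64_t x)
{
  uint64_t r = 0;
  int i;

  for (i = 0; i < 8; i++)
    {
      uint64_t byte = (x >> (8 * shuf_ror16[i])) & 0xff;
      r |= byte << (8 * i);
    }
  return r;
}

/* Small self-check on one known value. */
static void ror16_self_check(void)
{
  assert(ror16_bytes(0x0123456789abcdefULL) == 0xcdef0123456789abULL);
  assert(ror16_bytes(0x0123456789abcdefULL)
         == ror16_shift(0x0123456789abcdefULL));
}
```

Calling `ror16_self_check()` from a test program exercises both forms on one input. The sketch says nothing about which form is faster on a given CPU; the tigerlake figures in this message are the only timing data.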
As VPGATHER is now slow on the majority of CPUs (because of the
"Downfall" mitigation), switch the blake2b-avx512 implementation to
manual memory gathering instead.
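As a rough illustration of what "manual gather" means here: instead of one gather instruction that loads several message words from permuted offsets, each lane is filled by its own ordinary load. The C sketch below models only that data movement and is not the assembly from this commit. `gather_manual` is a hypothetical helper, the 4-lane output array stands in for a vector register, and the two `sigma` rows are the first two rows of the BLAKE2b message schedule from RFC 7693.

```c
#include <assert.h>
#include <stdint.h>

/* First two rows of the BLAKE2b message word schedule (RFC 7693). */
static const uint8_t sigma[2][16] = {
  {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
  { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 }
};

/* Hypothetical helper: fill four lanes with the message words selected
 * by idx[], one plain load per lane, without any gather instruction. */
static void gather_manual(uint64_t out[4], const uint64_t m[16],
                          const uint8_t idx[4])
{
  int lane;

  for (lane = 0; lane < 4; lane++)
    {
      assert(idx[lane] < 16);   /* message block has 16 words */
      out[lane] = m[idx[lane]];
    }
}
```

For example, the first message load of the column step in the second round would use `idx = { sigma[1][0], sigma[1][2], sigma[1][4], sigma[1][6] }`, that is, words 14, 4, 9 and 13.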
Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated
microcode):
Old before "Downfall" (commit 909daa700e4b45d75469df298ee564b8fc2f4b72):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |      0.705 ns/B      1353 MiB/s      2.88 c/B      4088
Old after "Downfall" (~3.0x slower):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |       2.11 ns/B     451.3 MiB/s      8.64 c/B      4089
New (same as before "Downfall"):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |      0.705 ns/B      1353 MiB/s      2.88 c/B      4090
Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"):
Old:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |      0.793 ns/B      1203 MiB/s      3.73 c/B      4700
New (~3% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |      0.771 ns/B      1237 MiB/s      3.62 c/B      4700
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>