blake2b-avx512: replace VPGATHER with manual gather

Description

* cipher/blake2.c (blake2b_init_ctx): Remove HWF_INTEL_FAST_VPGATHER
check for AVX512 implementation.
* cipher/blake2b-amd64-avx512.S (R16, VPINSRQ_KMASK, .Lshuf_ror16)
(.Lk1_mask): New.
(GEN_GMASK, RESET_KMASKS, .Lgmask*): Remove.
(GATHER_MSG): Use manual gather instead of VPGATHER.
(ROR_16): Use vpshufb for small speed improvement on tigerlake.
(_gcry_blake2b_transform_amd64_avx512): New setup & clean-up for
kmask registers; reduce excess loop alignment from 64B to 16B.
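The ROR_16 change relies on the fact that a rotation by a multiple of 8 bits is a pure byte permutation, which is the kind of operation vpshufb performs. The following is a minimal scalar C sketch of that equivalence for one 64-bit lane. It is not the assembly from this commit, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate right by 16 bits using shifts (the textbook form). */
static uint64_t ror64_16_shift(uint64_t x)
{
    return (x >> 16) | (x << 48);
}

/* Same rotation expressed as a byte permutation: result byte i is taken
 * from source byte (i + 2) mod 8.  A byte shuffle such as vpshufb applies
 * a fixed permutation of this kind to every 64-bit lane of a vector. */
static uint64_t ror64_16_bytes(uint64_t x)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t byte = (x >> (8 * ((i + 2) % 8))) & 0xff;
        r |= byte << (8 * i);
    }
    return r;
}

/* Hypothetical self-check: both forms must agree for any input. */
static void check_ror16(uint64_t x)
{
    assert(ror64_16_shift(x) == ror64_16_bytes(x));
}
```

Calling `check_ror16` on arbitrary values should never trip the assertion. Whether the shuffle form is faster than a dedicated rotate depends on the microarchitecture; the commit only reports a small gain on tigerlake.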

As VPGATHER is now slow on the majority of CPUs (because of the
"Downfall" mitigations), switch the blake2b-avx512 implementation to
use manual memory gathering instead.
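To illustrate what "manual gather" means here, the sketch below is a scalar C model rather than the AVX512 assembly from this commit. Instead of one gather instruction fetching several message words by index, each lane is filled with its own plain load. The names `load_le64` and `gather_msg4` and the four-lane width are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one little-endian 64-bit message word, independent of host
 * endianness and alignment. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t w = 0;
    for (int i = 7; i >= 0; i--)
        w = (w << 8) | p[i];
    return w;
}

/* Manual gather: fill four lanes with message words selected by idx[],
 * one ordinary load per lane.  block is a 128-byte BLAKE2b message block
 * holding 16 words, so every index must be below 16. */
static void gather_msg4(uint64_t out[4], const uint8_t block[128],
                        const uint8_t idx[4])
{
    for (size_t lane = 0; lane < 4; lane++) {
        assert(idx[lane] < 16);
        out[lane] = load_le64(block + 8 * (size_t)idx[lane]);
    }
}
```

For example, with the BLAKE2b message schedule for round 1, the first words fed to the four column steps are at indices 14, 4, 9 and 13, so `gather_msg4(out, block, (const uint8_t[4]){14, 4, 9, 13})` would fill the four lanes with those words. In a vector implementation the per-lane loads become separate load and insert instructions, which avoids the gather instruction that the mitigation slows down.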

Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated
microcode):

Old before "Downfall" (commit 909daa700e4b45d75469df298ee564b8fc2f4b72):

                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |     0.705 ns/B      1353 MiB/s      2.88 c/B      4088

Old after "Downfall" (~3.0x slower):

                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |      2.11 ns/B     451.3 MiB/s      8.64 c/B      4089

New (same as before "Downfall"):

                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |     0.705 ns/B      1353 MiB/s      2.88 c/B      4090

Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"):

Old:

                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |     0.793 ns/B      1203 MiB/s      3.73 c/B      4700

New (~3% faster):

                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2B_512    |     0.771 ns/B      1237 MiB/s      3.62 c/B      4700

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Aug 20 2023, 4:35 PM
Parents
rCded3a1ec2ec6: twofish-avx2-amd64: replace VPGATHER with manual gather