twofish-avx2-amd64: replace VPGATHER with manual gather
* cipher/twofish-avx2-amd64.S (do_gather): New.
(g16): Switch to use 'do_gather' instead of VPGATHER instruction.
(__twofish_enc_blk16, __twofish_dec_blk16): Prepare stack for 'do_gather'.
* cipher/twofish.c (twofish) [USE_AVX2]: Remove now unneeded
HWF_INTEL_FAST_VPGATHER check.
As VPGATHER is now slow on the majority of CPUs (because of the
"Downfall" mitigations), switch the twofish-avx2 implementation to use
manual memory gathering instead.
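For illustration only, the idea of a manual gather can be sketched in
portable C. This is not the 'do_gather' macro from
cipher/twofish-avx2-amd64.S, which is written in assembly; the helper
name 'manual_gather8', the 8-lane width and the 256-entry table of
32-bit words indexed by bytes are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libgcrypt): fetch one 32-bit table
 * entry per lane with ordinary scalar loads, which is what a single
 * VPGATHERDD instruction would otherwise do for all lanes at once. */
static void manual_gather8(uint32_t out[8], const uint32_t table[256],
                           const uint8_t idx[8])
{
    for (size_t lane = 0; lane < 8; lane++) {
        /* One independent scalar load per lane; the index is a byte,
         * so it always stays inside the 256-entry table. */
        out[lane] = table[idx[lane]];
    }
}
```

In an assembly implementation the same pattern would presumably
extract each index from the vector register, perform a scalar load, and
insert the result back into a vector lane.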
Benchmark on Intel Core i3-1115G4 (tigerlake, with "Downfall" mitigated
microcode):
Before:
 TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      7.00 ns/B     136.3 MiB/s     28.62 c/B      4089
        ECB dec |      7.00 ns/B     136.2 MiB/s     28.64 c/B      4090
After (~3.2x faster):
 TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      2.19 ns/B     435.5 MiB/s      8.95 c/B      4089
        ECB dec |      2.19 ns/B     436.2 MiB/s      8.94 c/B      4089
Benchmark on AMD Ryzen 9 7900X (zen4, did not suffer from "Downfall"):
Before:
 TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      1.91 ns/B     499.0 MiB/s      8.98 c/B      4700
        ECB dec |      1.90 ns/B     500.7 MiB/s      8.95 c/B      4700
After (~9% faster):
 TWOFISH        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |      1.74 ns/B     547.9 MiB/s      8.18 c/B      4700
        ECB dec |      1.74 ns/B     547.8 MiB/s      8.18 c/B      4700
[v2]:
- reorder memory operations in do_gather for a small performance increase.
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>