blake2-avx512: merge some of the gather loads
* cipher/blake2b-amd64-avx512.S (GATHER_MSG_2, GATHER_MSG_3)
(GATHER_MSG_5, GATHER_MSG_6, GATHER_MSG_8, GATHER_MSG_9): New.
(LOAD_MSG_2, LOAD_MSG_3, LOAD_MSG_5, LOAD_MSG_6, LOAD_MSG_8)
(LOAD_MSG_9): Use GATHER_MSG_<number>.
(_blake2b_avx512_data): Add merged load masks ".L[4-7]_mask".
(_gcry_blake2b_transform_amd64_avx512): Load merged load masks to
%k[4-7] and clear registers on exit.
* cipher/blake2s-amd64-avx512.S (VPINSRD_KMASK, GATHER_MSG_2)
(GATHER_MSG_3, GATHER_MSG_5, GATHER_MSG_6, GATHER_MSG_8)
(GATHER_MSG_9): New.
(LOAD_MSG_2, LOAD_MSG_3, LOAD_MSG_5, LOAD_MSG_6, LOAD_MSG_8)
(LOAD_MSG_9): Use GATHER_MSG_<number>.
(_blake2s_avx512_data): Add merged load masks ".L[4-7]_mask".
(_gcry_blake2s_transform_amd64_avx512): Load merged load masks to
%k[4-7] and clear registers on exit.
Merging the loads slightly reduces the number of memory loads and
instructions in the blake2-avx512 implementations. However, since
GATHER_MSG is not a bottleneck on Intel tigerlake or AMD Zen4, this
does not give an easily measurable performance difference; bench-slope
results remain the same as before.
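To illustrate the idea, below is a minimal C intrinsics sketch, not the
actual GATHER_MSG assembly; the message indices are made up for the
example, and it assumes AVX2 plus AVX512F/AVX512VL (e.g. built with
-mavx2 -mavx512f -mavx512vl):

  #include <immintrin.h>
  #include <stdint.h>

  /* Illustrative only: build the vector { m[5], m[6], m[10], m[14] }
   * from 64-bit message words m[] (indices made up for this sketch). */

  /* Per-element variant: four independent element loads via gather. */
  static __m256i gather_per_element(const uint64_t *m)
  {
    const __m256i idx = _mm256_setr_epi64x(5, 6, 10, 14);
    return _mm256_i64gather_epi64((const long long *)m, idx, 8);
  }

  /* Merged variant: m[5] and m[6] are adjacent in memory, so one
   * k-masked load fills lanes 0-1 in a single memory access; the
   * remaining lanes are merged in with k-masked broadcasts. */
  static __m256i gather_merged(const uint64_t *m)
  {
    __m256i v = _mm256_maskz_loadu_epi64(0x3, &m[5]);      /* lanes 0,1 */
    v = _mm256_mask_set1_epi64(v, 0x4, (long long)m[10]);  /* lane 2    */
    v = _mm256_mask_set1_epi64(v, 0x8, (long long)m[14]);  /* lane 3    */
    return v;
  }

The assembly does the equivalent with vector loads under the %k[4-7]
merge masks (the new ".L[4-7]_mask" constants): wherever a round's
message permutation puts adjacent message words into adjacent lanes,
two element loads can share one wider masked load.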
Benchmark on AMD Ryzen 9 7900X (zen4):
Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2S_256    |      1.14 ns/B     837.6 MiB/s      5.35 c/B      4700
 BLAKE2B_512    |     0.772 ns/B      1235 MiB/s      3.63 c/B      4700
After:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2S_256    |      1.14 ns/B     837.6 MiB/s      5.35 c/B      4700
 BLAKE2B_512    |     0.772 ns/B      1235 MiB/s      3.63 c/B      4700
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2S_256    |      1.02 ns/B     934.2 MiB/s      4.18 c/B      4090
 BLAKE2B_512    |     0.705 ns/B      1353 MiB/s      2.88 c/B      4089
After:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2S_256    |      1.02 ns/B     933.5 MiB/s      4.18 c/B      4089
 BLAKE2B_512    |     0.705 ns/B      1353 MiB/s      2.88 c/B      4089
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>