blake2-avx512: merge some of the gather loads
* cipher/blake2b-amd64-avx512.S (GATHER_MSG_2, GATHER_MSG_3)
(GATHER_MSG_5, GATHER_MSG_6, GATHER_MSG_8, GATHER_MSG_9): New.
(LOAD_MSG_2, LOAD_MSG_3, LOAD_MSG_5, LOAD_MSG_6, LOAD_MSG_8)
(LOAD_MSG_9): Use GATHER_MSG_<number>.
(_blake2b_avx512_data): Add merged load masks ".L[4-7]_mask".
(_gcry_blake2b_transform_amd64_avx512): Load merged load masks to
%k[4-7] and clear registers on exit.
* cipher/blake2s-amd64-avx512.S (VPINSRD_KMASK, GATHER_MSG_2)
(GATHER_MSG_3, GATHER_MSG_5, GATHER_MSG_6, GATHER_MSG_8)
(GATHER_MSG_9): New.
(LOAD_MSG_2, LOAD_MSG_3, LOAD_MSG_5, LOAD_MSG_6, LOAD_MSG_8)
(LOAD_MSG_9): Use GATHER_MSG_<number>.
(_blake2s_avx512_data): Add merged load masks ".L[4-7]_mask".
(_gcry_blake2s_transform_amd64_avx512): Load merged load masks to
%k[4-7] and clear registers on exit.
Merging the loads slightly reduces the number of memory loads and
instructions in the blake2-avx512 implementations. However, since
GATHER_MSG is not a bottleneck on Intel tigerlake or AMD Zen4, this
does not give an easily measurable performance difference; bench-slope
results remain the same as before.
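To illustrate the idea, below is a minimal C intrinsics sketch, not the
actual GATHER_MSG assembly; the message indices are made up for the
example, and it assumes AVX2 plus AVX512F/AVX512VL (e.g. built with
-mavx2 -mavx512f -mavx512vl):

  #include <immintrin.h>
  #include <stdint.h>

  /* Illustrative only: build the vector { m[5], m[6], m[10], m[14] }
   * from 64-bit message words m[] (indices made up for this sketch). */

  /* Per-element variant: four independent element loads via gather. */
  static __m256i gather_per_element(const uint64_t *m)
  {
    const __m256i idx = _mm256_setr_epi64x(5, 6, 10, 14);
    return _mm256_i64gather_epi64((const long long *)m, idx, 8);
  }

  /* Merged variant: m[5] and m[6] are adjacent in memory, so one
   * k-masked load fills lanes 0-1 in a single memory access; the
   * remaining lanes are merged in with k-masked broadcasts. */
  static __m256i gather_merged(const uint64_t *m)
  {
    __m256i v = _mm256_maskz_loadu_epi64(0x3, &m[5]);      /* lanes 0,1 */
    v = _mm256_mask_set1_epi64(v, 0x4, (long long)m[10]);  /* lane 2    */
    v = _mm256_mask_set1_epi64(v, 0x8, (long long)m[14]);  /* lane 3    */
    return v;
  }

The assembly does the equivalent with vector loads under the %k[4-7]
merge masks (the new ".L[4-7]_mask" constants): wherever a round's
message permutation puts adjacent message words into adjacent lanes,
two element loads can share one wider masked load.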
Benchmark on AMD Ryzen 9 7900X (zen4):
Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2S_256    |      1.14 ns/B     837.6 MiB/s      5.35 c/B      4700
 BLAKE2B_512    |     0.772 ns/B      1235 MiB/s      3.63 c/B      4700
After:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2S_256    |      1.14 ns/B     837.6 MiB/s      5.35 c/B      4700
 BLAKE2B_512    |     0.772 ns/B      1235 MiB/s      3.63 c/B      4700
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2S_256    |      1.02 ns/B     934.2 MiB/s      4.18 c/B      4090
 BLAKE2B_512    |     0.705 ns/B      1353 MiB/s      2.88 c/B      4089
After:
                |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 BLAKE2S_256    |      1.02 ns/B     933.5 MiB/s      4.18 c/B      4089
 BLAKE2B_512    |     0.705 ns/B      1353 MiB/s      2.88 c/B      4089
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>