chacha20-avx512: add handling for any input block count and tweak 16 block code a bit
* cipher/chacha20-amd64-avx512.S: Add tail handling for 8/4/2/1
blocks; Rename `_gcry_chacha20_amd64_avx512_blocks16` to
`_gcry_chacha20_amd64_avx512_blocks`; Tweak 16 parallel block
processing for small speed improvement.
* cipher/chacha20.c (_gcry_chacha20_amd64_avx512_blocks16): Rename
to ...
(_gcry_chacha20_amd64_avx512_blocks): ... this.
(chacha20_blocks) [USE_AVX512]: Add AVX512 code-path.
(do_chacha20_encrypt_stream_tail) [USE_AVX512]: Change to handle any
number of full input blocks instead of multiples of 16.
This patch improves the performance of the ChaCha20-AVX512
implementation for small input buffers (fewer than 16 blocks of
64 bytes, i.e. under 1024 bytes).
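As a rough illustration of the tail handling listed in the ChangeLog (not the actual assembly), the C sketch below shows how a count of full 64-byte blocks could be split into 16-wide passes followed by an 8/4/2/1 tail. The names `split_blocks` and `widths` are hypothetical and exist only in this example; the real work is done in cipher/chacha20-amd64-avx512.S, whose control flow may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not libgcrypt code.
 * Parallel widths named in the ChangeLog: 16 blocks per pass for the
 * main loop, then 8, 4, 2 and 1 blocks for the tail. */
static const size_t widths[5] = { 16, 8, 4, 2, 1 };

/* Split NBLKS full 64-byte blocks greedily across the widths above.
 * On return, passes[i] holds how many passes run at widths[i]. */
static void
split_blocks (size_t nblks, size_t passes[5])
{
  size_t i;

  assert (passes != NULL);

  for (i = 0; i < 5; i++)
    {
      passes[i] = nblks / widths[i];  /* passes at this width */
      nblks %= widths[i];             /* remainder goes to narrower widths */
    }
}
```

With this greedy split, 29 blocks become one pass each at widths 16, 8, 4 and 1. Each tail width is used at most once, so any block count below 16 needs at most four narrow passes rather than falling back to a non-AVX512 path.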
The following benchmarks show the improvement in 16-parallel-block
processing performance.
Benchmark on AMD Ryzen 9 7900X:
Before:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |     0.130 ns/B      7330 MiB/s     0.716 c/B      5500
     STREAM dec |     0.128 ns/B      7426 MiB/s     0.713 c/B      5555
   POLY1305 enc |     0.175 ns/B      5444 MiB/s     0.964 c/B      5500
   POLY1305 dec |     0.175 ns/B      5455 MiB/s     0.962 c/B      5500
After:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |     0.123 ns/B      7767 MiB/s     0.691 c/B      5625
     STREAM dec |     0.123 ns/B      7736 MiB/s     0.693 c/B      5625
   POLY1305 enc |     0.168 ns/B      5679 MiB/s     0.945 c/B      5625
   POLY1305 dec |     0.167 ns/B      5708 MiB/s     0.940 c/B      5625
Benchmark on Intel Core i3-1115G4:
Before:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |     0.161 ns/B      5934 MiB/s     0.658 c/B    4097±3
     STREAM dec |     0.160 ns/B      5951 MiB/s     0.656 c/B    4097±4
   POLY1305 enc |     0.220 ns/B      4333 MiB/s     0.902 c/B    4096±3
   POLY1305 dec |     0.220 ns/B      4325 MiB/s     0.903 c/B    4096±3
After:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |     0.152 ns/B      6267 MiB/s     0.623 c/B    4097±3
     STREAM dec |     0.152 ns/B      6287 MiB/s     0.621 c/B    4097±3
   POLY1305 enc |     0.215 ns/B      4443 MiB/s     0.879 c/B    4096±3
   POLY1305 dec |     0.214 ns/B      4452 MiB/s     0.878 c/B    4096±3
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>