Description

chacha20-avx512: add handling for any input block count and tweak 16 block code a bit

* cipher/chacha20-amd64-avx512.S: Add tail handling for 8/4/2/1
blocks; Rename `_gcry_chacha20_amd64_avx512_blocks16` to
`_gcry_chacha20_amd64_avx512_blocks`; Tweak 16 parallel block processing
for small speed improvement.
* cipher/chacha20.c (_gcry_chacha20_amd64_avx512_blocks16): Rename to ...
(_gcry_chacha20_amd64_avx512_blocks): ... this.
(chacha20_blocks) [USE_AVX512]: Add AVX512 code-path.
(do_chacha20_encrypt_stream_tail) [USE_AVX512]: Change to handle any
number of full input blocks instead of multiples of 16.

This patch improves the performance of the ChaCha20-AVX512 implementation
for small input buffers (smaller than 16 blocks of 64 bytes each, i.e.
16 * 64 B = 1024 B).
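
To illustrate why inputs below 1024 bytes benefit, the following is a minimal C sketch of how a count of full 64-byte blocks could be split into 16/8/4/2/1-block chunks. It is not the code from cipher/chacha20-amd64-avx512.S: the helper name `split_blocks`, the `counts` array and the greedy widest-first order are assumptions made only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CHACHA20_BLOCK_SIZE 64 /* bytes per ChaCha20 block */

/* Hypothetical helper, not part of libgcrypt: split NBLKS full blocks
 * into chunks of 16, 8, 4, 2 and 1 blocks, widest first.  COUNTS[i]
 * receives the number of chunks of width 16 >> i.  Returns the total
 * number of chunks. */
static size_t
split_blocks (size_t nblks, size_t counts[5])
{
  static const size_t widths[5] = { 16, 8, 4, 2, 1 };
  size_t total = 0;
  size_t i;

  assert (counts != NULL);

  for (i = 0; i < 5; i++)
    {
      counts[i] = 0;
      while (nblks >= widths[i])
        {
          nblks -= widths[i];
          counts[i]++;
          total++;
        }
    }

  return total;
}
```

Under these assumptions a 960-byte input (960 / CHACHA20_BLOCK_SIZE = 15 full blocks) is covered by one chunk each of 8, 4, 2 and 1 blocks. Before this change the AVX512 path only accepted multiples of 16 blocks, so such an input could not use it at all.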

The following benchmarks show the improvement in the performance of
16 parallel block processing.

Benchmark on AMD Ryzen 9 7900X:

Before:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

  STREAM enc |     0.130 ns/B      7330 MiB/s     0.716 c/B      5500
  STREAM dec |     0.128 ns/B      7426 MiB/s     0.713 c/B      5555
POLY1305 enc |     0.175 ns/B      5444 MiB/s     0.964 c/B      5500
POLY1305 dec |     0.175 ns/B      5455 MiB/s     0.962 c/B      5500

After:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

  STREAM enc |     0.123 ns/B      7767 MiB/s     0.691 c/B      5625
  STREAM dec |     0.123 ns/B      7736 MiB/s     0.693 c/B      5625
POLY1305 enc |     0.168 ns/B      5679 MiB/s     0.945 c/B      5625
POLY1305 dec |     0.167 ns/B      5708 MiB/s     0.940 c/B      5625

Benchmark on Intel Core i3-1115G4:

Before:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

  STREAM enc |     0.161 ns/B      5934 MiB/s     0.658 c/B      4097±3
  STREAM dec |     0.160 ns/B      5951 MiB/s     0.656 c/B      4097±4
POLY1305 enc |     0.220 ns/B      4333 MiB/s     0.902 c/B      4096±3
POLY1305 dec |     0.220 ns/B      4325 MiB/s     0.903 c/B      4096±3

After:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte auto Mhz

  STREAM enc |     0.152 ns/B      6267 MiB/s     0.623 c/B      4097±3
  STREAM dec |     0.152 ns/B      6287 MiB/s     0.621 c/B      4097±3
POLY1305 enc |     0.215 ns/B      4443 MiB/s     0.879 c/B      4096±3
POLY1305 dec |     0.214 ns/B      4452 MiB/s     0.878 c/B      4096±3
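
The size of the improvement can be read off the cycles/byte columns above. A small sketch of that arithmetic follows; the function name `cycles_saved_percent` is hypothetical and only the figures already listed in the tables are used as inputs.

```c
#include <assert.h>

/* Hypothetical helper: percentage reduction in cycles/byte when going
 * from BEFORE to AFTER.  A positive result means AFTER is faster. */
static double
cycles_saved_percent (double before, double after)
{
  assert (before > 0.0);
  return (before - after) / before * 100.0;
}
```

For STREAM enc this gives roughly 3.5% on the Ryzen 9 7900X (0.716 to 0.691 c/B) and roughly 5.3% on the Core i3-1115G4 (0.658 to 0.623 c/B).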

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Dec 5 2022, 5:41 PM
Parents
rC896fe69757e0: doc: Minor fix up.