chacha20: add AVX512 implementation
* cipher/Makefile.am: Add 'chacha20-amd64-avx512.S'. * cipher/chacha20-amd64-avx512.S: New. * cipher/chacha20.c (USE_AVX512): New. (CHACHA20_context_s): Add 'use_avx512'. [USE_AVX512] (_gcry_chacha20_amd64_avx512_blocks16): New. (chacha20_do_setkey) [USE_AVX512]: Setup 'use_avx512' based on HW features. (do_chacha20_encrypt_stream_tail) [USE_AVX512]: Use AVX512 implementation if supported. (_gcry_chacha20_poly1305_encrypt) [USE_AVX512]: Disable stitched chacha20-poly1305 implementations if AVX512 implementation is used. (_gcry_chacha20_poly1305_decrypt) [USE_AVX512]: Disable stitched chacha20-poly1305 implementations if AVX512 implementation is used.
Benchmark on Intel Core i3-1115G4 (tigerlake):
Before:
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz STREAM enc | 0.276 ns/B 3451 MiB/s 1.13 c/B 4090 STREAM dec | 0.284 ns/B 3359 MiB/s 1.16 c/B 4090 POLY1305 enc | 0.411 ns/B 2320 MiB/s 1.68 c/B 4098±3 POLY1305 dec | 0.408 ns/B 2338 MiB/s 1.67 c/B 4091±1 POLY1305 auth | 0.060 ns/B 15785 MiB/s 0.247 c/B 4090±1
After (stream 1.7x faster, poly1305-aead 1.8x faster):
| nanosecs/byte mebibytes/sec cycles/byte auto Mhz STREAM enc | 0.162 ns/B 5869 MiB/s 0.665 c/B 4092±1 STREAM dec | 0.162 ns/B 5884 MiB/s 0.664 c/B 4096±3 POLY1305 enc | 0.221 ns/B 4306 MiB/s 0.907 c/B 4097±3 POLY1305 dec | 0.220 ns/B 4342 MiB/s 0.900 c/B 4096±3 POLY1305 auth | 0.060 ns/B 15797 MiB/s 0.247 c/B 4085±2
- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>