Add stitched ChaCha20-Poly1305 SSSE3 and AVX2 implementations
* cipher/asm-poly1305-amd64.h: New. * cipher/Makefile.am: Add 'asm-poly1305-amd64.h'. * cipher/chacha20-amd64-avx2.S (QUATERROUND2): Add interleave operators. (_gcry_chacha20_poly1305_amd64_avx2_blocks8): New. * cipher/chacha20-amd64-ssse3.S (QUATERROUND2): Add interleave operators. (_gcry_chacha20_poly1305_amd64_ssse3_blocks4) (_gcry_chacha20_poly1305_amd64_ssse3_blocks1): New. * cipher/chacha20.c (_gcry_chacha20_poly1305_amd64_ssse3_blocks4) (_gcry_chacha20_poly1305_amd64_ssse3_blocks1) (_gcry_chacha20_poly1305_amd64_avx2_blocks8): New prototypes. (chacha20_encrypt_stream): Split tail to... (do_chacha20_encrypt_stream_tail): ... new function. (_gcry_chacha20_poly1305_encrypt) (_gcry_chacha20_poly1305_decrypt): New. * cipher/cipher-internal.h (_gcry_chacha20_poly1305_encrypt) (_gcry_chacha20_poly1305_decrypt): New prototypes. * cipher/cipher-poly1305.c (_gcry_cipher_poly1305_encrypt): Call '_gcry_chacha20_poly1305_encrypt' if cipher is ChaCha20. (_gcry_cipher_poly1305_decrypt): Call '_gcry_chacha20_poly1305_decrypt' if cipher is ChaCha20. * cipher/poly1305-internal.h (_gcry_cipher_poly1305_update_burn): New prototype. * cipher/poly1305.c (poly1305_blocks): Make static. (_gcry_poly1305_update): Split main function body to ... (_gcry_poly1305_update_burn): ... new function.
Benchmark on Intel Skylake (i5-6500, 3200 Mhz):
Before, 8-way AVX2:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.378 ns/B 2526 MiB/s 1.21 c/B STREAM dec | 0.373 ns/B 2560 MiB/s 1.19 c/B POLY1305 enc | 0.685 ns/B 1392 MiB/s 2.19 c/B POLY1305 dec | 0.686 ns/B 1390 MiB/s 2.20 c/B POLY1305 auth | 0.315 ns/B 3031 MiB/s 1.01 c/B
After, 8-way AVX2 (~36% faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 0.503 ns/B 1896 MiB/s 1.61 c/B POLY1305 dec | 0.485 ns/B 1965 MiB/s 1.55 c/B
Benchmark on Intel Haswell (i7-4790K, 3998 Mhz):
Before, 8-way AVX2:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.318 ns/B 2999 MiB/s 1.27 c/B STREAM dec | 0.317 ns/B 3004 MiB/s 1.27 c/B POLY1305 enc | 0.586 ns/B 1627 MiB/s 2.34 c/B POLY1305 dec | 0.586 ns/B 1627 MiB/s 2.34 c/B POLY1305 auth | 0.271 ns/B 3524 MiB/s 1.08 c/B
After, 8-way AVX2 (~30% faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 0.452 ns/B 2108 MiB/s 1.81 c/B POLY1305 dec | 0.440 ns/B 2167 MiB/s 1.76 c/B
Before, 4-way SSSE3:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.627 ns/B 1521 MiB/s 2.51 c/B STREAM dec | 0.626 ns/B 1523 MiB/s 2.50 c/B POLY1305 enc | 0.895 ns/B 1065 MiB/s 3.58 c/B POLY1305 dec | 0.896 ns/B 1064 MiB/s 3.58 c/B POLY1305 auth | 0.271 ns/B 3521 MiB/s 1.08 c/B
After, 4-way SSSE3 (~20% faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 0.733 ns/B 1301 MiB/s 2.93 c/B POLY1305 dec | 0.726 ns/B 1314 MiB/s 2.90 c/B
Before, 1-way SSSE3:
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 1.56 ns/B 609.6 MiB/s 6.25 c/B POLY1305 dec | 1.56 ns/B 609.4 MiB/s 6.26 c/B
After, 1-way SSSE3 (~18% faster):
CHACHA20 | nanosecs/byte mebibytes/sec cycles/byte
POLY1305 enc | 1.31 ns/B 725.4 MiB/s 5.26 c/B POLY1305 dec | 1.31 ns/B 727.3 MiB/s 5.24 c/B
For comparison to other libraries (on Intel i7-4790K, 3998 Mhz):
bench-slope-openssl: OpenSSL 1.1.1 11 Sep 2018
Cipher:
chacha20 | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 0.301 ns/B 3166.4 MiB/s 1.20 c/B STREAM dec | 0.300 ns/B 3174.7 MiB/s 1.20 c/B POLY1305 enc | 0.463 ns/B 2060.6 MiB/s 1.85 c/B POLY1305 dec | 0.462 ns/B 2063.8 MiB/s 1.85 c/B POLY1305 auth | 0.162 ns/B 5899.3 MiB/s 0.646 c/B
bench-slope-nettle: Nettle 3.4
Cipher:
chacha | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc | 1.65 ns/B 578.2 MiB/s 6.59 c/B STREAM dec | 1.65 ns/B 578.2 MiB/s 6.59 c/B POLY1305 enc | 2.05 ns/B 464.8 MiB/s 8.20 c/B POLY1305 dec | 2.05 ns/B 464.7 MiB/s 8.20 c/B POLY1305 auth | 0.404 ns/B 2359.1 MiB/s 1.62 c/B
bench-slope-botan: Botan 2.6.0
Cipher:
ChaCha | nanosecs/byte mebibytes/sec cycles/byte
STREAM enc/dec | 0.855 ns/B 1116.0 MiB/s 3.42 c/B
POLY1305 enc | 1.60 ns/B 595.4 MiB/s 6.40 c/B POLY1305 dec | 1.60 ns/B 595.8 MiB/s 6.40 c/B POLY1305 auth | 0.752 ns/B 1268.3 MiB/s 3.01 c/B
- Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>