Add stitched ChaCha20-Poly1305 SSSE3 and AVX2 implementations

Description

* cipher/asm-poly1305-amd64.h: New.
* cipher/Makefile.am: Add 'asm-poly1305-amd64.h'.
* cipher/chacha20-amd64-avx2.S (QUARTERROUND2): Add interleave
operators.
(_gcry_chacha20_poly1305_amd64_avx2_blocks8): New.
* cipher/chacha20-amd64-ssse3.S (QUARTERROUND2): Add interleave
operators.
(_gcry_chacha20_poly1305_amd64_ssse3_blocks4)
(_gcry_chacha20_poly1305_amd64_ssse3_blocks1): New.
* cipher/chacha20.c (_gcry_chacha20_poly1305_amd64_ssse3_blocks4)
(_gcry_chacha20_poly1305_amd64_ssse3_blocks1)
(_gcry_chacha20_poly1305_amd64_avx2_blocks8): New prototypes.
(chacha20_encrypt_stream): Split tail to...
(do_chacha20_encrypt_stream_tail): ... new function.
(_gcry_chacha20_poly1305_encrypt)
(_gcry_chacha20_poly1305_decrypt): New.
* cipher/cipher-internal.h (_gcry_chacha20_poly1305_encrypt)
(_gcry_chacha20_poly1305_decrypt): New prototypes.
* cipher/cipher-poly1305.c (_gcry_cipher_poly1305_encrypt): Call
'_gcry_chacha20_poly1305_encrypt' if cipher is ChaCha20.
(_gcry_cipher_poly1305_decrypt): Call
'_gcry_chacha20_poly1305_decrypt' if cipher is ChaCha20.
* cipher/poly1305-internal.h (_gcry_poly1305_update_burn): New
prototype.
* cipher/poly1305.c (poly1305_blocks): Make static.
(_gcry_poly1305_update): Split main function body to ...
(_gcry_poly1305_update_burn): ... new function.
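
As a rough illustration of what "stitched" means here: instead of encrypting the whole buffer and then authenticating it in a second pass, each block is encrypted and fed to the MAC in the same loop. The sketch below uses toy stand-ins for both primitives (they are NOT ChaCha20 or Poly1305), and every name in it is hypothetical rather than a libgcrypt function. The assembly in this commit appears to interleave the two at instruction level, which plain C like this does not capture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK 16 /* hypothetical block size for this sketch */

/* Toy keystream byte: NOT ChaCha20, only a stand-in with the same role. */
static uint8_t toy_keystream(uint32_t key, size_t pos)
{
  uint32_t x = key ^ ((uint32_t)pos * 2654435761u);
  x ^= x >> 15;
  x *= 2246822519u;
  x ^= x >> 13;
  return (uint8_t)x;
}

/* Toy MAC update: NOT Poly1305, only a running polynomial hash. */
static uint32_t toy_mac_update(uint32_t acc, const uint8_t *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
    acc = (uint32_t)(((uint64_t)acc * 31u + buf[i]) % 2147483647u);
  return acc;
}

/* XOR 'len' bytes with the toy keystream starting at stream offset 'off'. */
static void toy_xor(uint32_t key, size_t off, uint8_t *dst,
                    const uint8_t *src, size_t len)
{
  for (size_t i = 0; i < len; i++)
    dst[i] = src[i] ^ toy_keystream(key, off + i);
}

/* Two passes: encrypt everything, then authenticate the ciphertext. */
uint32_t encrypt_mac_two_pass(uint32_t key, uint8_t *dst,
                              const uint8_t *src, size_t len)
{
  assert(dst != NULL && src != NULL);
  toy_xor(key, 0, dst, src, len);
  return toy_mac_update(0, dst, len);
}

/* One pass: each block is encrypted and immediately fed to the MAC. */
uint32_t encrypt_mac_stitched(uint32_t key, uint8_t *dst,
                              const uint8_t *src, size_t len)
{
  uint32_t acc = 0;

  assert(dst != NULL && src != NULL);
  for (size_t off = 0; off < len; off += TOY_BLOCK)
    {
      size_t n = len - off < TOY_BLOCK ? len - off : TOY_BLOCK;
      toy_xor(key, off, dst + off, src + off, n);
      acc = toy_mac_update(acc, dst + off, n);
    }
  return acc;
}
```

Both functions produce the same ciphertext and the same tag for the same input. The stitched variant only changes the order in which the work is done, so the data is touched once.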

Benchmark on Intel Skylake (i5-6500, 3200 MHz):

Before, 8-way AVX2:
CHACHA20      |  nanosecs/byte   mebibytes/sec   cycles/byte
   STREAM enc |     0.378 ns/B      2526 MiB/s      1.21 c/B
   STREAM dec |     0.373 ns/B      2560 MiB/s      1.19 c/B
 POLY1305 enc |     0.685 ns/B      1392 MiB/s      2.19 c/B
 POLY1305 dec |     0.686 ns/B      1390 MiB/s      2.20 c/B
POLY1305 auth |     0.315 ns/B      3031 MiB/s      1.01 c/B

After, 8-way AVX2 (~36% faster):
CHACHA20     |  nanosecs/byte   mebibytes/sec   cycles/byte
POLY1305 enc |     0.503 ns/B      1896 MiB/s      1.61 c/B
POLY1305 dec |     0.485 ns/B      1965 MiB/s      1.55 c/B
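
The derived columns and the "~36% faster" figure can be reproduced from the ns/B column and the clock speed. The helper names below are made up for this sketch and are not part of bench-slope.

```c
#include <assert.h>

/* cycles/byte from ns/byte and the clock in MHz
   (one nanosecond at f MHz is f/1000 cycles). */
double cycles_per_byte(double ns_per_byte, double clock_mhz)
{
  assert(clock_mhz > 0.0);
  return ns_per_byte * clock_mhz / 1000.0;
}

/* MiB/s from ns/byte: bytes per second divided by 2^20. */
double mib_per_sec(double ns_per_byte)
{
  assert(ns_per_byte > 0.0);
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Throughput gain in percent when going from ns_before to ns_after. */
double percent_faster(double ns_before, double ns_after)
{
  assert(ns_after > 0.0);
  return (ns_before / ns_after - 1.0) * 100.0;
}
```

With the Skylake POLY1305 enc figures (0.685 ns/B before, 0.503 ns/B after, 3200 MHz) this gives about 2.19 c/B before, 1.61 c/B after, and a gain of about 36%.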

Benchmark on Intel Haswell (i7-4790K, 3998 MHz):

Before, 8-way AVX2:
CHACHA20      |  nanosecs/byte   mebibytes/sec   cycles/byte
   STREAM enc |     0.318 ns/B      2999 MiB/s      1.27 c/B
   STREAM dec |     0.317 ns/B      3004 MiB/s      1.27 c/B
 POLY1305 enc |     0.586 ns/B      1627 MiB/s      2.34 c/B
 POLY1305 dec |     0.586 ns/B      1627 MiB/s      2.34 c/B
POLY1305 auth |     0.271 ns/B      3524 MiB/s      1.08 c/B

After, 8-way AVX2 (~30% faster):
CHACHA20     |  nanosecs/byte   mebibytes/sec   cycles/byte
POLY1305 enc |     0.452 ns/B      2108 MiB/s      1.81 c/B
POLY1305 dec |     0.440 ns/B      2167 MiB/s      1.76 c/B

Before, 4-way SSSE3:
CHACHA20      |  nanosecs/byte   mebibytes/sec   cycles/byte
   STREAM enc |     0.627 ns/B      1521 MiB/s      2.51 c/B
   STREAM dec |     0.626 ns/B      1523 MiB/s      2.50 c/B
 POLY1305 enc |     0.895 ns/B      1065 MiB/s      3.58 c/B
 POLY1305 dec |     0.896 ns/B      1064 MiB/s      3.58 c/B
POLY1305 auth |     0.271 ns/B      3521 MiB/s      1.08 c/B

After, 4-way SSSE3 (~20% faster):
CHACHA20     |  nanosecs/byte   mebibytes/sec   cycles/byte
POLY1305 enc |     0.733 ns/B      1301 MiB/s      2.93 c/B
POLY1305 dec |     0.726 ns/B      1314 MiB/s      2.90 c/B

Before, 1-way SSSE3:
CHACHA20     |  nanosecs/byte   mebibytes/sec   cycles/byte
POLY1305 enc |      1.56 ns/B     609.6 MiB/s      6.25 c/B
POLY1305 dec |      1.56 ns/B     609.4 MiB/s      6.26 c/B

After, 1-way SSSE3 (~18% faster):
CHACHA20     |  nanosecs/byte   mebibytes/sec   cycles/byte
POLY1305 enc |      1.31 ns/B     725.4 MiB/s      5.26 c/B
POLY1305 dec |      1.31 ns/B     727.3 MiB/s      5.24 c/B

For comparison with other libraries (on Intel i7-4790K, 3998 MHz):

bench-slope-openssl: OpenSSL 1.1.1 11 Sep 2018
Cipher:
chacha20      |  nanosecs/byte   mebibytes/sec   cycles/byte
   STREAM enc |     0.301 ns/B    3166.4 MiB/s      1.20 c/B
   STREAM dec |     0.300 ns/B    3174.7 MiB/s      1.20 c/B
 POLY1305 enc |     0.463 ns/B    2060.6 MiB/s      1.85 c/B
 POLY1305 dec |     0.462 ns/B    2063.8 MiB/s      1.85 c/B
POLY1305 auth |     0.162 ns/B    5899.3 MiB/s     0.646 c/B

bench-slope-nettle: Nettle 3.4
Cipher:
chacha        |  nanosecs/byte   mebibytes/sec   cycles/byte
   STREAM enc |      1.65 ns/B     578.2 MiB/s      6.59 c/B
   STREAM dec |      1.65 ns/B     578.2 MiB/s      6.59 c/B
 POLY1305 enc |      2.05 ns/B     464.8 MiB/s      8.20 c/B
 POLY1305 dec |      2.05 ns/B     464.7 MiB/s      8.20 c/B
POLY1305 auth |     0.404 ns/B    2359.1 MiB/s      1.62 c/B

bench-slope-botan: Botan 2.6.0
Cipher:
ChaCha         |  nanosecs/byte   mebibytes/sec   cycles/byte
STREAM enc/dec |     0.855 ns/B    1116.0 MiB/s      3.42 c/B

  POLY1305 enc |      1.60 ns/B     595.4 MiB/s      6.40 c/B
  POLY1305 dec |      1.60 ns/B     595.8 MiB/s      6.40 c/B
 POLY1305 auth |     0.752 ns/B    1268.3 MiB/s      3.01 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Jan 27 2019, 10:19 AM
Parents
rC7d9b2f114f3e: Add SSSE3 optimized non-parallel ChaCha20 function