chacha20-amd64-avx2: optimize output xoring

Description

* cipher/chacha20-amd64-avx2.S (STACK_TMP2): Remove.
(transpose_16byte_2x2, xor_src_dst): New.
(BUF_XOR_256_TO_128): Remove.
(_gcry_chacha20_amd64_avx2_blocks8)
(_gcry_chacha20_poly1305_amd64_avx2_blocks8): Replace
BUF_XOR_256_TO_128 with transpose_16byte_2x2/xor_src_dst; Reduce stack
usage; Better interleave chacha20 state merging and output xoring.
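
The following is a portable C model of what the two new macros appear to do, going only by their names; it is a sketch, not the AVX2 assembly from the commit. The type reg256, the *_model function names, and the lane semantics (a 2x2 transpose of 16-byte lanes across two 256-bit registers, then a 32-byte XOR of source into destination) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of one 256-bit register as two 16-byte lanes; lane[0] is the
   low half.  Hypothetical type, not part of libgcrypt. */
typedef struct { uint8_t lane[2][16]; } reg256;

/* Assumed semantics: treat (a, b) as a 2x2 matrix of 16-byte lanes and
   transpose it, so a = (a.lo, b.lo) and b = (a.hi, b.hi). */
static void transpose_16byte_2x2_model(reg256 *a, reg256 *b)
{
  uint8_t tmp[16];

  memcpy(tmp, a->lane[1], 16);
  memcpy(a->lane[1], b->lane[0], 16);
  memcpy(b->lane[0], tmp, 16);
}

/* Assumed semantics: XOR all 32 bytes of r with src[offset..] and
   store the result to dst[offset..]. */
static void xor_src_dst_model(uint8_t *dst, const uint8_t *src,
                              size_t offset, const reg256 *r)
{
  for (size_t i = 0; i < 32; i++)
    dst[offset + i] = src[offset + i] ^ r->lane[i / 16][i % 16];
}

/* Self-check: lanes are tagged 0xA0, 0xA1 (register a) and 0xB0, 0xB1
   (register b) so the lane order after the transpose is visible. */
static void check_model(void)
{
  reg256 a, b;
  uint8_t src[64] = { 0 }, dst[64];

  memset(a.lane[0], 0xA0, 16);
  memset(a.lane[1], 0xA1, 16);
  memset(b.lane[0], 0xB0, 16);
  memset(b.lane[1], 0xB1, 16);

  transpose_16byte_2x2_model(&a, &b);
  xor_src_dst_model(dst, src, 0, &a);
  xor_src_dst_model(dst, src, 32, &b);

  /* With an all-zero source, dst shows the lanes in transposed order. */
  assert(dst[0] == 0xA0 && dst[16] == 0xB0);
  assert(dst[32] == 0xA1 && dst[48] == 0xB1);
}
```

Under these assumptions, each register holds 32 contiguous output bytes after the transpose, so one full-width XOR and store per register is enough instead of two 128-bit ones.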

Benchmark on Intel i7-4790K:

Before:
CHACHA20     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
  STREAM enc |     0.314 ns/B      3035 MiB/s      1.26 c/B      3998
  STREAM dec |     0.314 ns/B      3037 MiB/s      1.26 c/B      3998
POLY1305 enc |     0.451 ns/B      2117 MiB/s      1.80 c/B      3998
POLY1305 dec |     0.441 ns/B      2162 MiB/s      1.76 c/B      3998

After:
CHACHA20     |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
  STREAM enc |     0.309 ns/B      3086 MiB/s      1.24 c/B      3998
  STREAM dec |     0.309 ns/B      3083 MiB/s      1.24 c/B      3998
POLY1305 enc |     0.445 ns/B      2141 MiB/s      1.78 c/B      3998
POLY1305 dec |     0.436 ns/B      2188 MiB/s      1.74 c/B      3998
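
For reading the tables, the columns appear to be related by simple unit conversions. The helpers below are a minimal sketch of that arithmetic; the function names are hypothetical, and the formulas are an assumption about how the columns relate, not code taken from the benchmark tool.

```c
/* cycles/byte = ns/byte * clock in GHz (MHz / 1000). */
static double cycles_per_byte(double ns_per_byte, double mhz)
{
  return ns_per_byte * mhz / 1000.0;
}

/* MiB/s = (1e9 ns per second / ns per byte) / 2^20 bytes per MiB. */
static double mebibytes_per_sec(double ns_per_byte)
{
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Relative throughput gain in percent when ns/byte drops from
   `before` to `after`. */
static double speedup_percent(double before, double after)
{
  return (before / after - 1.0) * 100.0;
}
```

For the STREAM rows, cycles_per_byte(0.314, 3998) is about 1.26 and speedup_percent(0.314, 0.309) is about 1.6, so the change is worth roughly 1.6% of throughput there. The MiB/s figures in the tables differ slightly from what the rounded ns/B values give, presumably because they were computed from unrounded timings.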

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Jan 27 2019, 10:19 AM
Parents
rC28614a77a281: tests/bench-slope: prevent auto-mhz detection getting stuck