chacha20-amd64-avx2: optimize output xoring
* cipher/chacha20-amd64-avx2.S (STACK_TMP2): Remove.
(transpose_16byte_2x2, xor_src_dst): New.
(BUF_XOR_256_TO_128): Remove.
(_gcry_chacha20_amd64_avx2_blocks8)
(_gcry_chacha20_poly1305_amd64_avx2_blocks8): Replace
BUF_XOR_256_TO_128 with transpose_16byte_2x2/xor_src_dst; Reduce stack
usage; Better interleave chacha20 state merging and output xoring.
Benchmark on Intel i7-4790K:
Before:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |     0.314 ns/B      3035 MiB/s      1.26 c/B      3998
     STREAM dec |     0.314 ns/B      3037 MiB/s      1.26 c/B      3998
   POLY1305 enc |     0.451 ns/B      2117 MiB/s      1.80 c/B      3998
   POLY1305 dec |     0.441 ns/B      2162 MiB/s      1.76 c/B      3998
After:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |     0.309 ns/B      3086 MiB/s      1.24 c/B      3998
     STREAM dec |     0.309 ns/B      3083 MiB/s      1.24 c/B      3998
   POLY1305 enc |     0.445 ns/B      2141 MiB/s      1.78 c/B      3998
   POLY1305 dec |     0.436 ns/B      2188 MiB/s      1.74 c/B      3998
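
Illustration only (not the assembly from this change): a scalar C model
of what a 2x2 transpose of 16-byte lanes followed by a source XOR does.
It assumes each 256-bit register holds 16 bytes from each of two
different blocks, so that the transpose yields 32 contiguous output
bytes per register. The type vec256 and the *_model function names are
hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for a 256-bit register: two 16-byte lanes. */
typedef struct { uint8_t lane[2][16]; } vec256;

/* 2x2 transpose of 16-byte lanes (model only):
 *   in:  x0 = [a0 a1], x1 = [b0 b1]
 *   out: x0 = [a0 b0], x1 = [a1 b1]
 */
static void transpose_16byte_2x2_model(vec256 *x0, vec256 *x1)
{
  uint8_t tmp[16];

  memcpy(tmp, x0->lane[1], 16);          /* save a1 */
  memcpy(x0->lane[1], x1->lane[0], 16);  /* x0 high lane <- b0 */
  memcpy(x1->lane[0], tmp, 16);          /* x1 low lane  <- a1 */
}

/* XOR the 32 bytes held in x with src at the given offset and store the
 * result to dst at the same offset (model only). */
static void xor_src_dst_model(uint8_t *dst, const uint8_t *src,
                              size_t offset, const vec256 *x)
{
  size_t i;

  for (i = 0; i < 32; i++)
    dst[offset + i] = src[offset + i] ^ x->lane[i / 16][i % 16];
}
```

In this model the XOR reads the source and writes the destination
directly, so no intermediate buffer is needed between the transpose and
the store.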
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>