Reduce size of x86-64 stitched Chacha20-Poly1305 implementations
* cipher/chacha20-amd64-avx2.c
(_gcry_chacha20_poly1305_amd64_avx2_blocks8): De-unroll round loop.
* cipher/chacha20-amd64-ssse3.c
(_gcry_chacha20_poly1305_amd64_ssse3_blocks4)
(_gcry_chacha20_poly1305_amd64_ssse3_blocks1): Ditto.
Object size before:
   text    data     bss     dec     hex filename
  13428       0       0   13428    3474 cipher/.libs/chacha20-amd64-avx2.o
  23175       0       0   23175    5a87 cipher/.libs/chacha20-amd64-ssse3.o
Object size after:
   text    data     bss     dec     hex filename
   4815       0       0    4815    12cf cipher/.libs/chacha20-amd64-avx2.o
   9284       0       0    9284    2444 cipher/.libs/chacha20-amd64-ssse3.o
Benchmark on AMD Ryzen 3700X (AVX2 impl.):
Before:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |     0.266 ns/B      3581 MiB/s      1.15 c/B      4333
     STREAM dec |     0.265 ns/B      3598 MiB/s      1.15 c/B      4350
   POLY1305 enc |     0.313 ns/B      3046 MiB/s      1.35 c/B      4317
   POLY1305 dec |     0.296 ns/B      3222 MiB/s      1.29 c/B      4345
  POLY1305 auth |     0.221 ns/B      4311 MiB/s     0.972 c/B      4394
After:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |     0.264 ns/B      3614 MiB/s      1.16 c/B    4380±2
     STREAM dec |     0.265 ns/B      3597 MiB/s      1.16 c/B      4374
   POLY1305 enc |     0.293 ns/B      3252 MiB/s      1.27 c/B      4326
   POLY1305 dec |     0.275 ns/B      3464 MiB/s      1.19 c/B      4323
  POLY1305 auth |     0.219 ns/B      4360 MiB/s     0.963 c/B      4400
[v2]: Use two inner round loops, with 3 and 2 iterations, that interleave
Poly1305 at different levels.
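
The loop shape described in the [v2] note can be sketched in plain C. This is an illustration only, not the assembly from this change: it assumes that each inner-loop iteration is one ChaCha20 double round and that an outer loop runs the 3-iteration and 2-iteration loops twice, giving 2 * (3 + 2) = 10 double rounds (20 rounds). The function names are hypothetical, and the Poly1305 work that the real code interleaves is shown only as comments.

```c
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* Standard ChaCha quarter round on state words a, b, c, d. */
#define QROUND(x, a, b, c, d)                              \
  do {                                                     \
    x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 16);   \
    x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 12);   \
    x[a] += x[b]; x[d] ^= x[a]; x[d] = ROTL32(x[d], 8);    \
    x[c] += x[d]; x[b] ^= x[c]; x[b] = ROTL32(x[b], 7);    \
  } while (0)

/* One double round: four column rounds, then four diagonal rounds. */
static void double_round(uint32_t x[16])
{
  QROUND(x, 0, 4, 8, 12);
  QROUND(x, 1, 5, 9, 13);
  QROUND(x, 2, 6, 10, 14);
  QROUND(x, 3, 7, 11, 15);
  QROUND(x, 0, 5, 10, 15);
  QROUND(x, 1, 6, 11, 12);
  QROUND(x, 2, 7, 8, 13);
  QROUND(x, 3, 4, 9, 14);
}

/* Reference shape: a single rolled loop of 10 double rounds. */
void rounds_flat(uint32_t x[16])
{
  for (int i = 0; i < 10; i++)
    double_round(x);
}

/* Hypothetical split shape (assumption): 2 * (3 + 2) double rounds. */
void rounds_split(uint32_t x[16])
{
  for (int outer = 0; outer < 2; outer++)
    {
      for (int i = 0; i < 3; i++)
        double_round(x);  /* one level of Poly1305 interleaving here */
      for (int i = 0; i < 2; i++)
        double_round(x);  /* a different level of interleaving here */
    }
}

/* Returns 1 when both loop shapes produce the same state. */
int split_matches_flat(void)
{
  uint32_t a[16], b[16];

  /* Arbitrary non-zero input; not a real key/nonce setup. */
  for (uint32_t i = 0; i < 16; i++)
    a[i] = b[i] = 0x9e3779b9u * (i + 1);

  rounds_flat(a);
  rounds_split(b);
  return memcmp(a, b, sizeof a) == 0;
}
```

Under these assumptions, splitting the rolled loop changes only where other work can be scheduled between rounds; the ChaCha20 state after all rounds is the same, which `split_matches_flat()` checks.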
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>