Small tweak for PowerPC ChaCha20-Poly1305 round loop
* cipher/chacha20-ppc.c (_gcry_chacha20_poly1305_ppc8_block4): Use inner/outer round loop structure instead of two separate loops for stitched and non-stitched parts.
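For illustration only, the two loop shapes can be sketched in plain C as below. This is not the code from cipher/chacha20-ppc.c: the helpers double_round() and poly_block() are hypothetical stand-ins that only count calls, and the split of 8 stitched plus 2 plain double rounds is an assumed example. The only fixed fact is that ChaCha20 runs 20 rounds, that is 10 double rounds, per block.

```c
#include <assert.h>

/* Hypothetical counters standing in for the real vector code. */
struct counts { int double_rounds; int poly_blocks; };

static void double_round(struct counts *c) { c->double_rounds++; }
static void poly_block(struct counts *c)   { c->poly_blocks++; }

/* Shape 1: two separate loops, stitched part first, then plain rounds.
   The 8/2 split is an assumption made for this sketch. */
static struct counts two_loops(void)
{
  struct counts c = { 0, 0 };
  int i;

  for (i = 0; i < 8; i++)
    {
      poly_block(&c);     /* Poly1305 work stitched into the round */
      double_round(&c);
    }
  for (; i < 10; i++)
    double_round(&c);     /* remaining rounds without Poly1305 work */

  return c;
}

/* Shape 2: inner/outer structure. Each outer pass runs the stitched
   inner loop and then one plain double round. */
static struct counts inner_outer(void)
{
  struct counts c = { 0, 0 };
  int o, i;

  for (o = 0; o < 2; o++)
    {
      for (i = 0; i < 4; i++)
        {
          poly_block(&c);
          double_round(&c);
        }
      double_round(&c);
    }

  return c;
}

/* Returns 1 when both shapes perform the same amount of work. */
static int same_work(void)
{
  struct counts a = two_loops();
  struct counts b = inner_outer();

  assert(a.double_rounds == 10 && b.double_rounds == 10);
  return a.double_rounds == b.double_rounds
         && a.poly_blocks == b.poly_blocks;
}
```

Both shapes perform the same number of double rounds and Poly1305 steps; only the order in which stitched and plain rounds are interleaved differs.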
Benchmark on POWER8 ~3.8GHz:
Before:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
     STREAM enc |     0.619 ns/B      1541 MiB/s      2.35 c/B
     STREAM dec |     0.619 ns/B      1541 MiB/s      2.35 c/B
   POLY1305 enc |     0.784 ns/B      1216 MiB/s      2.98 c/B
   POLY1305 dec |     0.770 ns/B      1239 MiB/s      2.93 c/B
  POLY1305 auth |     0.502 ns/B      1898 MiB/s      1.91 c/B
After (~2% faster):
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
   POLY1305 enc |     0.765 ns/B      1247 MiB/s      2.91 c/B
   POLY1305 dec |     0.749 ns/B      1273 MiB/s      2.85 c/B
Benchmark on POWER9 ~3.8GHz:
Before:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
     STREAM enc |     0.687 ns/B      1389 MiB/s      2.61 c/B
     STREAM dec |     0.692 ns/B      1379 MiB/s      2.63 c/B
   POLY1305 enc |      1.08 ns/B     880.9 MiB/s      4.11 c/B
   POLY1305 dec |      1.07 ns/B     888.0 MiB/s      4.08 c/B
  POLY1305 auth |     0.459 ns/B      2078 MiB/s      1.74 c/B
After (~5% faster):
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
   POLY1305 enc |      1.03 ns/B     929.2 MiB/s      3.90 c/B
   POLY1305 dec |      1.02 ns/B     936.6 MiB/s      3.87 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>