chacha20-aarch64: improve performance through higher SIMD interleaving
* cipher/chacha20-aarch64.S (ROTATE2, ROTATE2_8, ROTATE2_16)
(QUARTERROUND2): Replace with...
(ROTATE4, ROTATE4_8, ROTATE4_16, QUARTERROUND4): ...these.
(_gcry_chacha20_aarch64_blocks4)
(_gcry_chacha20_poly1305_aarch64_blocks4): Adjust to use QUARTERROUND4.
This change improves chacha20 performance on larger ARM cores, such as
Cortex-A72. Performance on Cortex-A53 stays the same.
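The idea, as a rough C sketch (this is not the NEON code in
cipher/chacha20-aarch64.S; rotl32/QROUND/chacha_column_round are
illustrative names only): each quarter-round is a serial
add/xor/rotate dependency chain, and the four chains of a column
round are independent of each other, so issuing four interleaved
chains (QUARTERROUND4) instead of two (QUARTERROUND2) gives a wide
out-of-order core more independent work to overlap.

  #include <stdint.h>

  static inline uint32_t rotl32(uint32_t v, int c)
  {
    return (v << c) | (v >> (32 - c));
  }

  /* One ChaCha quarter-round: a serial dependency chain of
     add/xor/rotate operations on four state words. */
  #define QROUND(a, b, c, d)                        \
    do {                                            \
      a += b; d ^= a; d = rotl32(d, 16);            \
      c += d; b ^= c; b = rotl32(b, 12);            \
      a += b; d ^= a; d = rotl32(d, 8);             \
      c += d; b ^= c; b = rotl32(b, 7);             \
    } while (0)

  /* Column round over the 16-word state: the four quarter-rounds
     below do not depend on each other, so four chains can be kept
     in flight at once instead of two. */
  static void chacha_column_round(uint32_t x[16])
  {
    QROUND(x[0], x[4], x[8],  x[12]);
    QROUND(x[1], x[5], x[9],  x[13]);
    QROUND(x[2], x[6], x[10], x[14]);
    QROUND(x[3], x[7], x[11], x[15]);
  }

An in-order core such as Cortex-A53 has less opportunity to overlap
the chains, which is consistent with its performance staying the same.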
Benchmark on AWS Graviton (Cortex-A72):
Before:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |      3.11 ns/B     306.3 MiB/s      7.16 c/B      2300
     STREAM dec |      3.12 ns/B     306.0 MiB/s      7.17 c/B      2300
   POLY1305 enc |      3.14 ns/B     304.2 MiB/s      7.21 c/B      2300
   POLY1305 dec |      3.11 ns/B     306.6 MiB/s      7.15 c/B      2300
  POLY1305 auth |     0.929 ns/B      1027 MiB/s      2.14 c/B      2300
After (~41% faster):
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
     STREAM enc |      2.19 ns/B     435.1 MiB/s      5.04 c/B      2300
     STREAM dec |      2.20 ns/B     434.1 MiB/s      5.05 c/B      2300
   POLY1305 enc |      2.22 ns/B     429.2 MiB/s      5.11 c/B      2300
   POLY1305 dec |      2.20 ns/B     434.3 MiB/s      5.05 c/B      2300
  POLY1305 auth |     0.931 ns/B      1025 MiB/s      2.14 c/B      2300
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>