chacha20-aarch64: improve performance through higher SIMD interleaving
8d7b1d0a52bd

Description

chacha20-aarch64: improve performance through higher SIMD interleaving

* cipher/chacha20-aarch64.S (ROTATE2, ROTATE2_8, ROTATE2_16)
(QUARTERROUND2): Replace with...
(ROTATE4, ROTATE4_8, ROTATE4_16, QUARTERROUND4): ...these.
(_gcry_chacha20_aarch64_blocks4)
(_gcry_chacha20_poly1305_aarch64_blocks4): Adjust to use QUARTERROUND4.

This change improves chacha20 performance on larger ARM cores, such as
Cortex-A72. Performance on Cortex-A53 stays the same.
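
The interleaving idea can be sketched in plain C. This is only an illustration, not the code in cipher/chacha20-aarch64.S: the names rotl32, quarterround1 and quarterround4 are hypothetical, and scalar words stand in for the NEON vector registers that the real QUARTERROUND4 macro operates on. Each step of the quarter round is issued for four independent (a, b, c, d) sets before moving to the next step, so a core with enough execution resources has four independent dependency chains to overlap instead of two.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t v, int n)
{
    return (v << n) | (v >> (32 - n));
}

/* Reference: one ChaCha20 quarter round on a single (a, b, c, d) set. */
static void quarterround1(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/* Sketch of 4-way interleaving: every step is applied to all four
 * independent sets before the next step starts.  The result of each
 * set is identical to running quarterround1 on it alone; only the
 * order in which the operations are issued differs. */
static void quarterround4(uint32_t a[4], uint32_t b[4],
                          uint32_t c[4], uint32_t d[4])
{
    int i;

    for (i = 0; i < 4; i++) a[i] += b[i];
    for (i = 0; i < 4; i++) d[i] ^= a[i];
    for (i = 0; i < 4; i++) d[i] = rotl32(d[i], 16);

    for (i = 0; i < 4; i++) c[i] += d[i];
    for (i = 0; i < 4; i++) b[i] ^= c[i];
    for (i = 0; i < 4; i++) b[i] = rotl32(b[i], 12);

    for (i = 0; i < 4; i++) a[i] += b[i];
    for (i = 0; i < 4; i++) d[i] ^= a[i];
    for (i = 0; i < 4; i++) d[i] = rotl32(d[i], 8);

    for (i = 0; i < 4; i++) c[i] += d[i];
    for (i = 0; i < 4; i++) b[i] ^= c[i];
    for (i = 0; i < 4; i++) b[i] = rotl32(b[i], 7);
}
```

Since the four sets never depend on each other, the interleaved form changes scheduling only, not results. That is consistent with the gain showing up on Cortex-A72 and not on Cortex-A53.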

Benchmark on AWS Graviton (Cortex-A72):

Before:
     CHACHA20 |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
   STREAM enc |      3.11 ns/B     306.3 MiB/s      7.16 c/B      2300
   STREAM dec |      3.12 ns/B     306.0 MiB/s      7.17 c/B      2300
 POLY1305 enc |      3.14 ns/B     304.2 MiB/s      7.21 c/B      2300
 POLY1305 dec |      3.11 ns/B     306.6 MiB/s      7.15 c/B      2300
POLY1305 auth |     0.929 ns/B      1027 MiB/s      2.14 c/B      2300

After (STREAM and POLY1305 enc/dec ~41% faster; POLY1305 auth unchanged):
     CHACHA20 |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
   STREAM enc |      2.19 ns/B     435.1 MiB/s      5.04 c/B      2300
   STREAM dec |      2.20 ns/B     434.1 MiB/s      5.05 c/B      2300
 POLY1305 enc |      2.22 ns/B     429.2 MiB/s      5.11 c/B      2300
 POLY1305 dec |      2.20 ns/B     434.3 MiB/s      5.05 c/B      2300
POLY1305 auth |     0.931 ns/B      1025 MiB/s      2.14 c/B      2300
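
The speedup figure can be cross-checked from the mebibytes/sec column. A minimal sketch in C, with a hypothetical helper name:

```c
/* Percentage throughput gain going from `before` to `after`.
 * Both arguments must use the same unit (here MiB/s). */
static double speedup_percent(double before, double after)
{
    return (after / before - 1.0) * 100.0;
}
```

With the numbers above, STREAM enc (306.3 to 435.1 MiB/s) gives roughly 42% and POLY1305 enc (304.2 to 429.2 MiB/s) roughly 41%.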