Add stitched ChaCha20-Poly1305 ARMv8/AArch64 implementation
* cipher/Makefile.am: Add 'asm-poly1305-aarch64.h'.
* cipher/asm-poly1305-aarch64.h: New.
* cipher/chacha20-aarch64.S (ROT8, _, ROTATE2_8): New.
(ROTATE2): Add interleave operator.
(QUARTERROUND2): Add interleave operators; Use ROTATE2_8.
(chacha20_data): Rename to...
(_gcry_chacha20_aarch64_blocks4_data_inc_counter): ...this.
(_gcry_chacha20_aarch64_blocks4_data_rot8): New.
(_gcry_chacha20_aarch64_blocks4): Preload ROT8; Fill empty parameters
for QUARTERROUND2 interleave operators.
(_gcry_chacha20_poly1305_aarch64_blocks4): New.
* cipher/chacha20.c [USE_AARCH64_SIMD]
(_gcry_chacha20_poly1305_aarch64_blocks4): New.
(_gcry_chacha20_poly1305_encrypt)
(_gcry_chacha20_poly1305_decrypt) [USE_AARCH64_SIMD]: Use stitched
implementation if ctx->use_neon is set.
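For illustration, a minimal C sketch of how the C-side dispatch for
the stitched path plausibly looks; the prototype and helper below are
simplified assumptions, not the patch's exact code. Inside the
assembly routine, the new QUARTERROUND2 interleave operators slot
scalar Poly1305 instructions between the NEON ChaCha20 instructions,
so the scalar and SIMD pipelines run concurrently.

    #include <stddef.h>
    #include <stdint.h>

    #define CHACHA20_BLOCK_SIZE 64

    /* Assembly entry point added by this patch; this prototype is an
     * assumed simplification for illustration. */
    unsigned int _gcry_chacha20_poly1305_aarch64_blocks4 (
                     uint32_t *state, unsigned char *dst,
                     const unsigned char *src, size_t nblks,
                     void *poly1305_state,
                     const unsigned char *poly1305_src);

    /* Sketch of the stitched path: peel off as many groups of four
     * 64-byte blocks as possible and let the combined routine encrypt
     * them while absorbing 'poly_src' into the Poly1305 state.  The
     * caller handles the remaining tail with the unstitched code. */
    static size_t
    stitched_chunk (uint32_t chacha_state[16], void *poly_state,
                    unsigned char *outbuf, const unsigned char *inbuf,
                    const unsigned char *poly_src, size_t length)
    {
      size_t nblks = (length / CHACHA20_BLOCK_SIZE) & ~(size_t) 3;

      if (nblks == 0)
        return 0;  /* too short for the 4-block routine */

      _gcry_chacha20_poly1305_aarch64_blocks4 (chacha_state, outbuf,
                                               inbuf, nblks, poly_state,
                                               poly_src);
      return nblks * CHACHA20_BLOCK_SIZE;
    }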
This patch also makes a small tweak to the regular ARMv8/AArch64
ChaCha20 implementation for the 'rotate by 8' operation.
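As a hedged illustration of that tweak, here is the equivalent of the
new ROT8/ROTATE2_8 path in C with NEON intrinsics: a single TBL byte
shuffle replaces the usual two-instruction SHL+SRI rotate. The index
table is an assumption mirroring what the new
_gcry_chacha20_aarch64_blocks4_data_rot8 constant presumably holds.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Byte indices that left-rotate each little-endian 32-bit lane by
     * 8 bits: output byte i of each lane takes input byte (i + 3) % 4
     * of the same lane (assumed to match the patch's rot8 data). */
    static const uint8_t rot8_indices[16] =
      { 3, 0, 1, 2,  7, 4, 5, 6,  11, 8, 9, 10,  15, 12, 13, 14 };

    static inline uint32x4_t
    rotl32_by8 (uint32x4_t v)
    {
      /* One TBL instead of SHL+SRI; the table can be preloaded once
       * per call, as the patch does for the ROT8 register. */
      uint8x16_t tbl = vld1q_u8 (rot8_indices);
      return vreinterpretq_u32_u8 (
               vqtbl1q_u8 (vreinterpretq_u8_u32 (v), tbl));
    }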
Benchmark on Cortex-A53 @ 1104 MHz:
Before:
CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
    STREAM enc |      4.93 ns/B     193.5 MiB/s      5.44 c/B
    STREAM dec |      4.93 ns/B     193.6 MiB/s      5.44 c/B
  POLY1305 enc |      7.71 ns/B     123.7 MiB/s      8.51 c/B
  POLY1305 dec |      7.70 ns/B     123.8 MiB/s      8.50 c/B
 POLY1305 auth |      2.77 ns/B     343.7 MiB/s      3.06 c/B
After (chacha20 ~6% faster, chacha20-poly1305 ~29% faster):
CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
    STREAM enc |      4.65 ns/B     205.2 MiB/s      5.13 c/B
    STREAM dec |      4.65 ns/B     205.1 MiB/s      5.13 c/B
  POLY1305 enc |      5.97 ns/B     159.7 MiB/s      6.59 c/B
  POLY1305 dec |      5.92 ns/B     161.1 MiB/s      6.54 c/B
 POLY1305 auth |      2.78 ns/B     343.3 MiB/s      3.07 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>