Add stitched ChaCha20-Poly1305 ARMv8/AArch64 implementation
* cipher/Makefile.am: Add 'asm-poly1305-aarch64.h'.
* cipher/asm-poly1305-aarch64.h: New.
* cipher/chacha20-aarch64.S (ROT8, _, ROTATE2_8): New.
(ROTATE2): Add interleave operator.
(QUARTERROUND2): Add interleave operators; Use ROTATE2_8.
(chacha20_data): Rename to...
(_gcry_chacha20_aarch64_blocks4_data_inc_counter): ...this.
(_gcry_chacha20_aarch64_blocks4_data_rot8): New.
(_gcry_chacha20_aarch64_blocks4): Preload ROT8; Fill empty parameters
for QUARTERROUND2 interleave operators.
(_gcry_chacha20_poly1305_aarch64_blocks4): New.
* cipher/chacha20.c [USE_AARCH64_SIMD]
(_gcry_chacha20_poly1305_aarch64_blocks4): New.
(_gcry_chacha20_poly1305_encrypt)
(_gcry_chacha20_poly1305_decrypt) [USE_AARCH64_SIMD]: Use stitched
implementation if ctx->use_neon is set.
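For illustration, a minimal C sketch of how the C-side dispatch for
the stitched path plausibly looks; the prototype and helper below are
simplified assumptions, not the patch's exact code. Inside the
assembly routine, the new QUARTERROUND2 interleave operators slot
scalar Poly1305 instructions between the NEON ChaCha20 instructions,
so the scalar and SIMD pipelines run concurrently.

    #include <stddef.h>
    #include <stdint.h>

    #define CHACHA20_BLOCK_SIZE 64

    /* Assembly entry point added by this patch; this prototype is an
     * assumed simplification for illustration. */
    unsigned int _gcry_chacha20_poly1305_aarch64_blocks4 (
                     uint32_t *state, unsigned char *dst,
                     const unsigned char *src, size_t nblks,
                     void *poly1305_state,
                     const unsigned char *poly1305_src);

    /* Sketch of the stitched path: peel off as many groups of four
     * 64-byte blocks as possible and let the combined routine encrypt
     * them while absorbing 'poly_src' into the Poly1305 state.  The
     * caller handles the remaining tail with the unstitched code. */
    static size_t
    stitched_chunk (uint32_t chacha_state[16], void *poly_state,
                    unsigned char *outbuf, const unsigned char *inbuf,
                    const unsigned char *poly_src, size_t length)
    {
      size_t nblks = (length / CHACHA20_BLOCK_SIZE) & ~(size_t) 3;

      if (nblks == 0)
        return 0;  /* too short for the 4-block routine */

      _gcry_chacha20_poly1305_aarch64_blocks4 (chacha_state, outbuf,
                                               inbuf, nblks, poly_state,
                                               poly_src);
      return nblks * CHACHA20_BLOCK_SIZE;
    }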
This patch also makes a small tweak to the regular ARMv8/AArch64
ChaCha20 implementation for the 'rotate by 8' operation.
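As a hedged illustration of that tweak, here is the equivalent of the
new ROT8/ROTATE2_8 path in C with NEON intrinsics: a single TBL byte
shuffle replaces the usual two-instruction SHL+SRI rotate. The index
table is an assumption mirroring what the new
_gcry_chacha20_aarch64_blocks4_data_rot8 constant presumably holds.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Byte indices that left-rotate each little-endian 32-bit lane by
     * 8 bits: output byte i of each lane takes input byte (i + 3) % 4
     * of the same lane (assumed to match the patch's rot8 data). */
    static const uint8_t rot8_indices[16] =
      { 3, 0, 1, 2,  7, 4, 5, 6,  11, 8, 9, 10,  15, 12, 13, 14 };

    static inline uint32x4_t
    rotl32_by8 (uint32x4_t v)
    {
      /* One TBL instead of SHL+SRI; the table can be preloaded once
       * per call, as the patch does for the ROT8 register. */
      uint8x16_t tbl = vld1q_u8 (rot8_indices);
      return vreinterpretq_u32_u8 (
               vqtbl1q_u8 (vreinterpretq_u8_u32 (v), tbl));
    }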
Benchmark on Cortex-A53 @ 1104 MHz:
Before:
CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
    STREAM enc |      4.93 ns/B     193.5 MiB/s      5.44 c/B
    STREAM dec |      4.93 ns/B     193.6 MiB/s      5.44 c/B
  POLY1305 enc |      7.71 ns/B     123.7 MiB/s      8.51 c/B
  POLY1305 dec |      7.70 ns/B     123.8 MiB/s      8.50 c/B
 POLY1305 auth |      2.77 ns/B     343.7 MiB/s      3.06 c/B
After (chacha20 ~6% faster, chacha20-poly1305 ~29% faster):
CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
    STREAM enc |      4.65 ns/B     205.2 MiB/s      5.13 c/B
    STREAM dec |      4.65 ns/B     205.1 MiB/s      5.13 c/B
  POLY1305 enc |      5.97 ns/B     159.7 MiB/s      6.59 c/B
  POLY1305 dec |      5.92 ns/B     161.1 MiB/s      6.54 c/B
 POLY1305 auth |      2.78 ns/B     343.3 MiB/s      3.07 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>