OCB ARM CE: Move ocb_get_l handling to assembly part

* cipher/rijndael-armv8-aarch32-ce.S: Add OCB 'L_{ntz(i)}' calculation.
* cipher/rijndael-armv8-aarch64-ce.S: Ditto.
* cipher/rijndael-armv8-ce.c (_gcry_aes_ocb_enc_armv8_ce)
(_gcry_aes_ocb_dec_armv8_ce, _gcry_aes_ocb_auth_armv8_ce)
(ocb_cryt_fn_t): Updated arguments.
(_gcry_aes_armv8_ce_ocb_crypt, _gcry_aes_armv8_ce_ocb_auth): Remove
'ocb_get_l' handling and splitting of input into 32-block chunks;
instead pass full buffers to assembly.

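For readers unfamiliar with the 'L_{ntz(i)}' step, the following is a
minimal C sketch of what the assembly now computes per block: the
number of trailing zero bits of the 1-based block index selects a
precomputed L value, which is XORed into the running offset. The names
'ocb_sketch_ctx', 'ocb_sketch_get_l' and 'ocb_sketch_next_offset' and
the table depth are hypothetical and chosen for illustration only; this
is not the code in the tree, and it assumes ntz(i) stays below the
table depth.

```c
#include <stdint.h>

#define OCB_BLOCK_LEN     16
#define OCB_L_TABLE_SIZE  16   /* hypothetical table depth */

/* Hypothetical context: L[k] is assumed to hold the precomputed L_k. */
struct ocb_sketch_ctx
{
  unsigned char L[OCB_L_TABLE_SIZE][OCB_BLOCK_LEN];
  unsigned char offset[OCB_BLOCK_LEN];
};

/* Number of trailing zero bits in I; I must be non-zero. */
static unsigned int
ntz (uint64_t i)
{
  unsigned int n = 0;

  while ((i & 1) == 0)
    {
      i >>= 1;
      n++;
    }
  return n;
}

/* Return L_{ntz(i)} for the 1-based block index I.
   Assumes ntz(i) < OCB_L_TABLE_SIZE.  */
static const unsigned char *
ocb_sketch_get_l (const struct ocb_sketch_ctx *ctx, uint64_t i)
{
  return ctx->L[ntz (i)];
}

/* Offset_i = Offset_{i-1} xor L_{ntz(i)}.  */
static void
ocb_sketch_next_offset (struct ocb_sketch_ctx *ctx, uint64_t i)
{
  const unsigned char *l = ocb_sketch_get_l (ctx, i);
  int k;

  for (k = 0; k < OCB_BLOCK_LEN; k++)
    ctx->offset[k] ^= l[k];
}
```

Doing this lookup next to the AES rounds means the C wrapper no longer
has to prepare the L pointers for each chunk before calling into
assembly.
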
Performance on Cortex-A53 (AArch32):
Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        OCB enc |      1.63 ns/B     583.8 MiB/s      1.88 c/B
        OCB dec |      1.67 ns/B     572.1 MiB/s      1.92 c/B
       OCB auth |      1.33 ns/B     717.1 MiB/s      1.53 c/B
After (~12% faster):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        OCB enc |      1.47 ns/B     650.2 MiB/s      1.69 c/B
        OCB dec |      1.48 ns/B     644.5 MiB/s      1.70 c/B
       OCB auth |      1.19 ns/B     798.2 MiB/s      1.38 c/B
Performance on Cortex-A53 (AArch64):
Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        OCB enc |      1.29 ns/B     738.5 MiB/s      1.49 c/B
        OCB dec |      1.32 ns/B     723.5 MiB/s      1.52 c/B
       OCB auth |      1.15 ns/B     827.0 MiB/s      1.33 c/B
After (~8% faster):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        OCB enc |      1.21 ns/B     789.1 MiB/s      1.39 c/B
        OCB dec |      1.21 ns/B     789.2 MiB/s      1.39 c/B
       OCB auth |      1.10 ns/B     867.0 MiB/s      1.27 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>