Optimizations for AES aarch64-ce assembly implementation
* cipher/rijndael-armv8-aarch64-ce.S (vk14): Remove.
(vklast, __, _): New.
(aes_preload_keys): Set up vklast.
(do_aes_one128/192/256): Split to ...
(do_aes_one_part1, do_aes_part2_128/192/256): ... these and add
interleave ops.
(do_aes_one128/192/256): New using above part1 and part2 macros.
(aes_round_4): Rename to ...
(aes_round_4_multikey): ... this and allow a different key to be used
for each parallel block.
(aes_round_4): New using above multikey macro.
(aes_lastround_4): Reorder AES round and xor instructions, allow
a different last key for each parallel block.
(do_aes_4_128/192/256): Split to ...
(do_aes_4_part1_multikey, do_aes_4_part1)
(do_aes_4_part2_128/192/256): ... these.
(do_aes_4_128/192/256): New using above part1 and part2 macros.
(CLEAR_REG): Use movi for clearing registers.
(aes_clear_keys): Remove branching and clear all key registers.
(_gcry_aes_enc_armv8_ce, _gcry_aes_dec_armv8_ce): Adjust to macro
changes.
(_gcry_aes_cbc_enc_armv8_ce, _gcry_aes_cbc_dec_armv8_ce)
(_gcry_aes_cfb_enc_armv8_ce, _gcry_aes_cfb_dec_armv8_ce)
(_gcry_aes_ctr32le_enc_armv8_ce): Apply entry/loop-body/exit
optimization for better interleaving of input/output processing;
First/last round key and input/output xoring optimization to reduce
critical path length.
(_gcry_aes_ctr_enc_armv8_ce): Add fast path for counter incrementing
without byte-swaps when the counter's low byte does not overflow 8 bits;
Apply entry/loop-body/exit optimization for better interleaving of
input/output processing; First/last round key and input/output xoring
optimization to reduce critical path length.
(_gcry_aes_ocb_enc_armv8_ce, _gcry_aes_ocb_dec_armv8_ce): Add aligned
processing for nblk and OCB offsets; Apply entry/loop-body/exit
optimization for better interleaving of input/output processing;
First/last round key and input/output xoring optimization to reduce
critical path length; Change to use the same function body macro for
both encryption and decryption.
(_gcry_aes_xts_enc_armv8_ce, _gcry_aes_xts_dec_armv8_ce): Apply
entry/loop-body/exit optimization for better interleaving of
input/output processing; First/last round key and input/output xoring
optimization to reduce critical path length; Change to use the same
function body macro for both encryption and decryption.
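The CTR fast path mentioned above can be illustrated in plain C. This is
only a sketch of the idea, not the assembly in this patch: the helper
names ctr_add_generic and ctr_next_blocks are hypothetical, and the
counter is assumed to be a 16-byte big-endian value as in standard CTR
mode. When adding the block count to the last counter byte cannot carry,
each counter block differs from the previous one only in that byte, so
no full-width big-endian addition (and hence no byte-swap) is needed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Generic path: treat the 16-byte counter as one big-endian integer
 * and propagate the carry from the last byte upwards. */
static void ctr_add_generic(uint8_t ctr[16], unsigned int n)
{
    unsigned int carry = n;
    for (int i = 15; i >= 0 && carry != 0; i--) {
        carry += ctr[i];
        ctr[i] = (uint8_t)carry;
        carry >>= 8;
    }
}

/* Hypothetical helper: write nblks consecutive counter blocks
 * (ctr+0 .. ctr+nblks-1) to out and advance ctr by nblks. */
static void ctr_next_blocks(uint8_t ctr[16], uint8_t out[][16],
                            unsigned int nblks)
{
    if ((unsigned int)ctr[15] + nblks <= 0xff) {
        /* Fast path: the low byte cannot overflow, so only byte 15
         * changes and a plain byte-wise add is enough. */
        for (unsigned int i = 0; i < nblks; i++) {
            memcpy(out[i], ctr, 16);
            out[i][15] = (uint8_t)(ctr[15] + i);
        }
        ctr[15] = (uint8_t)(ctr[15] + nblks);
    } else {
        /* Slow path: a carry out of byte 15 is possible, so do the
         * full big-endian increment for every block. */
        for (unsigned int i = 0; i < nblks; i++) {
            memcpy(out[i], ctr, 16);
            ctr_add_generic(ctr, 1);
        }
    }
}
```

With four blocks processed in parallel, the slow path in this sketch is
taken only when the low byte is within four of wrapping, so most
iterations stay on the cheap byte-wise add.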
Benchmark on AWS Graviton2 (2500 MHz):
Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC enc |     0.663 ns/B      1439 MiB/s      1.66 c/B
        CBC dec |     0.288 ns/B      3310 MiB/s     0.720 c/B
        CFB enc |     0.657 ns/B      1453 MiB/s      1.64 c/B
        CFB dec |     0.288 ns/B      3313 MiB/s     0.720 c/B
        CTR dec |     0.314 ns/B      3039 MiB/s     0.785 c/B
        XTS enc |     0.357 ns/B      2674 MiB/s     0.891 c/B
        XTS dec |     0.358 ns/B      2666 MiB/s     0.894 c/B
        OCB enc |     0.343 ns/B      2784 MiB/s     0.856 c/B
        OCB dec |     0.341 ns/B      2795 MiB/s     0.853 c/B
    GCM-SIV enc |     0.526 ns/B      1813 MiB/s      1.31 c/B
After:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  perf increase
        CBC enc |     0.500 ns/B      1906 MiB/s      1.25 c/B      +33%
        CBC dec |     0.263 ns/B      3622 MiB/s     0.658 c/B       +9%
        CFB enc |     0.500 ns/B      1906 MiB/s      1.25 c/B      +31%
        CFB dec |     0.263 ns/B      3620 MiB/s     0.658 c/B       +9%
        CTR enc |     0.264 ns/B      3618 MiB/s     0.659 c/B      +19%
        XTS enc |     0.350 ns/B      2722 MiB/s     0.876 c/B       +2%
        OCB enc |     0.275 ns/B      3468 MiB/s     0.687 c/B      +25%
        OCB dec |     0.276 ns/B      3459 MiB/s     0.689 c/B      +24%
    GCM-SIV enc |     0.494 ns/B      1929 MiB/s      1.24 c/B       +6%
Benchmark on Cortex-A53 (1152 MHz):
Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC enc |      1.41 ns/B     675.9 MiB/s      1.63 c/B
        CBC dec |     0.910 ns/B      1048 MiB/s      1.05 c/B
        CFB enc |      1.30 ns/B     732.2 MiB/s      1.50 c/B
        CFB dec |     0.910 ns/B      1048 MiB/s      1.05 c/B
        CTR enc |      1.03 ns/B     924.4 MiB/s      1.19 c/B
        XTS enc |      1.25 ns/B     763.0 MiB/s      1.44 c/B
        OCB enc |      1.21 ns/B     789.5 MiB/s      1.39 c/B
        OCB dec |      1.21 ns/B     788.9 MiB/s      1.39 c/B
    GCM-SIV enc |      1.92 ns/B     496.6 MiB/s      2.21 c/B
After:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte  perf increase
        CBC enc |      1.14 ns/B     836.6 MiB/s      1.31 c/B      +24%
        CBC dec |     0.843 ns/B      1132 MiB/s     0.971 c/B       +8%
        CFB enc |      1.19 ns/B     798.8 MiB/s      1.38 c/B       +9%
        CFB dec |     0.842 ns/B      1132 MiB/s     0.970 c/B       +8%
        CTR enc |     0.898 ns/B      1062 MiB/s      1.03 c/B      +16%
        XTS enc |      1.22 ns/B     779.9 MiB/s      1.41 c/B       +2%
        OCB enc |     0.992 ns/B     961.0 MiB/s      1.14 c/B      +22%
        OCB dec |     0.993 ns/B     960.5 MiB/s      1.14 c/B      +22%
    GCM-SIV enc |      1.88 ns/B     507.3 MiB/s      2.17 c/B       +2%
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>