rijndael-ppc: performance improvements
* cipher/rijndael-ppc.c (ALIGNED_LOAD, ALIGNED_STORE, VEC_LOAD_BE)
(VEC_STORE_BE): Rewrite.
(VEC_BE_SWAP, VEC_LOAD_BE_NOSWAP, VEC_STORE_BE_NOSWAP): New.
(PRELOAD_ROUND_KEYS, AES_ENCRYPT, AES_DECRYPT): Adjust to new
input parameters for vector load macros.
(ROUND_KEY_VARIABLES_ALL, PRELOAD_ROUND_KEYS_ALL)
(AES_ENCRYPT_ALL): New.
(vec_bswap32_const_neg): New.
(vec_aligned_ld, vec_aligned_st, vec_load_be_const): Rename to...
(asm_aligned_ld, asm_aligned_st, asm_load_be_const): ...these.
(asm_be_swap, asm_vperm1, asm_load_be_noswap)
(asm_store_be_noswap): New.
(vec_add_uint128): Rename to...
(asm_add_uint128): ...this.
(asm_xor, asm_cipher_be, asm_cipherlast_be, asm_ncipher_be)
(asm_ncipherlast_be): New inline assembly functions with volatile
keyword to allow manual instruction ordering.
(_gcry_aes_ppc8_setkey, aes_ppc8_prepare_decryption)
(_gcry_aes_ppc8_encrypt, _gcry_aes_ppc8_decrypt)
(_gcry_aes_ppc8_cfb_enc, _gcry_aes_ppc8_cbc_enc)
(_gcry_aes_ppc8_ocb_auth): Update to use new and rewritten helper
macros.
(_gcry_aes_ppc8_cfb_dec, _gcry_aes_ppc8_cbc_dec)
(_gcry_aes_ppc8_ctr_enc, _gcry_aes_ppc8_ocb_crypt)
(_gcry_aes_ppc8_xts_crypt): Update to use new and rewritten helper
macros; tune 8-block parallel paths with manual instruction ordering.
Benchmarks on POWER8 (ppc64le, ~3.8GHz):
Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC enc |      1.06 ns/B     902.2 MiB/s      4.02 c/B
        CBC dec |     0.208 ns/B      4585 MiB/s     0.790 c/B
        CFB enc |      1.06 ns/B     900.4 MiB/s      4.02 c/B
        CFB dec |     0.208 ns/B      4588 MiB/s     0.790 c/B
        CTR enc |     0.238 ns/B      4007 MiB/s     0.904 c/B
        CTR dec |     0.238 ns/B      4009 MiB/s     0.904 c/B
        XTS enc |     0.492 ns/B      1937 MiB/s      1.87 c/B
        XTS dec |     0.488 ns/B      1955 MiB/s      1.85 c/B
        OCB enc |     0.243 ns/B      3928 MiB/s     0.922 c/B
        OCB dec |     0.247 ns/B      3858 MiB/s     0.939 c/B
       OCB auth |     0.213 ns/B      4482 MiB/s     0.809 c/B
After (cbc-dec & cfb-dec & xts & ocb ~6% faster, ctr ~11% faster):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC enc |      1.06 ns/B     902.1 MiB/s      4.02 c/B
        CBC dec |     0.196 ns/B      4877 MiB/s     0.743 c/B
        CFB enc |      1.06 ns/B     902.2 MiB/s      4.02 c/B
        CFB dec |     0.195 ns/B      4889 MiB/s     0.741 c/B
        CTR enc |     0.214 ns/B      4448 MiB/s     0.815 c/B
        CTR dec |     0.214 ns/B      4452 MiB/s     0.814 c/B
        XTS enc |     0.461 ns/B      2067 MiB/s      1.75 c/B
        XTS dec |     0.456 ns/B      2092 MiB/s      1.73 c/B
        OCB enc |     0.227 ns/B      4200 MiB/s     0.863 c/B
        OCB dec |     0.234 ns/B      4072 MiB/s     0.890 c/B
       OCB auth |     0.207 ns/B      4604 MiB/s     0.787 c/B
Benchmarks on POWER9 (ppc64le, ~3.8GHz):
Before:
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC enc |      1.04 ns/B     918.7 MiB/s      3.94 c/B
        CBC dec |     0.240 ns/B      3982 MiB/s     0.910 c/B
        CFB enc |      1.04 ns/B     917.6 MiB/s      3.95 c/B
        CFB dec |     0.241 ns/B      3963 MiB/s     0.914 c/B
        CTR enc |     0.249 ns/B      3835 MiB/s     0.945 c/B
        CTR dec |     0.252 ns/B      3787 MiB/s     0.957 c/B
        XTS enc |     0.505 ns/B      1889 MiB/s      1.92 c/B
        XTS dec |     0.495 ns/B      1926 MiB/s      1.88 c/B
        OCB enc |     0.303 ns/B      3152 MiB/s      1.15 c/B
        OCB dec |     0.305 ns/B      3129 MiB/s      1.16 c/B
       OCB auth |     0.265 ns/B      3595 MiB/s      1.01 c/B
After (cbc-dec & cfb-dec ~6% faster, ctr ~11% faster, ocb ~4% faster):
 AES            |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC enc |      1.04 ns/B     917.3 MiB/s      3.95 c/B
        CBC dec |     0.225 ns/B      4234 MiB/s     0.856 c/B
        CFB enc |      1.04 ns/B     917.8 MiB/s      3.95 c/B
        CFB dec |     0.226 ns/B      4214 MiB/s     0.860 c/B
        CTR enc |     0.221 ns/B      4306 MiB/s     0.842 c/B
        CTR dec |     0.223 ns/B      4271 MiB/s     0.848 c/B
        XTS enc |     0.503 ns/B      1897 MiB/s      1.91 c/B
        XTS dec |     0.495 ns/B      1928 MiB/s      1.88 c/B
        OCB enc |     0.288 ns/B      3309 MiB/s      1.10 c/B
        OCB dec |     0.292 ns/B      3266 MiB/s      1.11 c/B
       OCB auth |     0.267 ns/B      3570 MiB/s      1.02 c/B
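For readers cross-checking the tables, the three columns and the quoted
percentages are related by simple arithmetic. The helpers below are a
hypothetical sketch, not part of this change or of the benchmark tool;
the function names are made up for illustration, and the clock is taken
to be the nominal ~3.8GHz stated above.

```c
/* Hypothetical helpers for cross-checking the benchmark columns.
 * These names do not exist in libgcrypt; they are illustration only. */

/* cycles/byte = ns/byte * clock in GHz (1 GHz is one cycle per ns). */
double cycles_per_byte(double ns_per_byte, double clock_ghz)
{
  return ns_per_byte * clock_ghz;
}

/* MiB/s = bytes per second / 2^20, with bytes per second = 1e9 / (ns/byte). */
double mib_per_sec(double ns_per_byte)
{
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Throughput gain in percent when ns/byte drops from before to after. */
double speedup_percent(double before_ns, double after_ns)
{
  return (before_ns / after_ns - 1.0) * 100.0;
}
```

For the POWER8 CTR enc row, speedup_percent(0.238, 0.214) is about 11.2,
which is where the "~11% faster" figure comes from, and
cycles_per_byte(0.208, 3.8) is about 0.790, matching the CBC dec row in
the "Before" table.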
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>