rijndael-ppc: performance improvements

Description

* cipher/rijndael-ppc.c (ALIGNED_LOAD, ALIGNED_STORE, VEC_LOAD_BE)
(VEC_STORE_BE): Rewrite.
(VEC_BE_SWAP, VEC_LOAD_BE_NOSWAP, VEC_STORE_BE_NOSWAP): New.
(PRELOAD_ROUND_KEYS, AES_ENCRYPT, AES_DECRYPT): Adjust to new
input parameters for vector load macros.
(ROUND_KEY_VARIABLES_ALL, PRELOAD_ROUND_KEYS_ALL)
(AES_ENCRYPT_ALL): New.
(vec_bswap32_const_neg): New.
(vec_aligned_ld, vec_aligned_st, vec_load_be_const): Rename to...
(asm_aligned_ld, asm_aligned_st, asm_load_be_const): ...these.
(asm_be_swap, asm_vperm1, asm_load_be_noswap)
(asm_store_be_noswap): New.
(vec_add_uint128): Rename to...
(asm_add_uint128): ...this.
(asm_xor, asm_cipher_be, asm_cipherlast_be, asm_ncipher_be)
(asm_ncipherlast_be): New inline assembly functions with volatile
keyword to allow manual instruction ordering.
(_gcry_aes_ppc8_setkey, aes_ppc8_prepare_decryption)
(_gcry_aes_ppc8_encrypt, _gcry_aes_ppc8_decrypt)
(_gcry_aes_ppc8_cfb_enc, _gcry_aes_ppc8_cbc_enc)
(_gcry_aes_ppc8_ocb_auth): Update to use new and rewritten helper macros.
(_gcry_aes_ppc8_cfb_dec, _gcry_aes_ppc8_cbc_dec)
(_gcry_aes_ppc8_ctr_enc, _gcry_aes_ppc8_ocb_crypt)
(_gcry_aes_ppc8_xts_crypt): Update to use new and rewritten helper
macros; tune 8-block parallel paths with manual instruction ordering.
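
The idea behind the 8-block parallel paths can be sketched in portable C.
This is only an illustration under stated assumptions: toy_round,
encrypt_serial, encrypt_interleaved and paths_agree are hypothetical names,
and the toy round function is NOT AES. The sketch shows that applying each
round to all eight independent blocks before moving to the next round gives
the same result as finishing one block at a time, while exposing eight
independent operations per round that a pipelined unit can overlap.

```c
#include <stdint.h>

#define NBLOCKS 8   /* blocks processed together, as in the 8-block paths */
#define NROUNDS 10  /* assumed round count for the toy example */

/* Hypothetical stand-in for one cipher round; this is NOT AES. */
static uint32_t toy_round(uint32_t state, uint32_t round_key)
{
  state ^= round_key;
  return (state << 5) | (state >> 27);  /* rotate left by 5 bits */
}

/* One block at a time: every round waits on the previous round's result. */
static void encrypt_serial(uint32_t blocks[NBLOCKS],
                           const uint32_t rk[NROUNDS])
{
  for (int b = 0; b < NBLOCKS; b++)
    for (int r = 0; r < NROUNDS; r++)
      blocks[b] = toy_round(blocks[b], rk[r]);
}

/* Interleaved: each round is applied to all eight blocks back to back,
   so consecutive operations do not depend on each other. */
static void encrypt_interleaved(uint32_t blocks[NBLOCKS],
                                const uint32_t rk[NROUNDS])
{
  for (int r = 0; r < NROUNDS; r++)
    {
      uint32_t k = rk[r];  /* round key loaded once per round */
      for (int b = 0; b < NBLOCKS; b++)
        blocks[b] = toy_round(blocks[b], k);
    }
}

/* Returns 1 if both orderings produce identical output on sample data. */
static int paths_agree(void)
{
  uint32_t a[NBLOCKS], b[NBLOCKS], rk[NROUNDS];

  for (int i = 0; i < NBLOCKS; i++)
    a[i] = b[i] = 0x01234567u * (uint32_t)(i + 1);
  for (int r = 0; r < NROUNDS; r++)
    rk[r] = 0x9e3779b9u * (uint32_t)(r + 1);

  encrypt_serial(a, rk);
  encrypt_interleaved(b, rk);

  for (int i = 0; i < NBLOCKS; i++)
    if (a[i] != b[i])
      return 0;
  return 1;
}
```

In plain C the compiler remains free to reschedule these operations; per the
log above, the commit instead wraps the vector instructions in inline
assembly functions marked volatile so that the hand-chosen order is kept.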

Benchmarks on POWER8 (ppc64le, ~3.8 GHz):

Before:
AES | nanosecs/byte mebibytes/sec cycles/byte

 CBC enc |      1.06 ns/B     902.2 MiB/s      4.02 c/B
 CBC dec |     0.208 ns/B      4585 MiB/s     0.790 c/B
 CFB enc |      1.06 ns/B     900.4 MiB/s      4.02 c/B
 CFB dec |     0.208 ns/B      4588 MiB/s     0.790 c/B
 CTR enc |     0.238 ns/B      4007 MiB/s     0.904 c/B
 CTR dec |     0.238 ns/B      4009 MiB/s     0.904 c/B
 XTS enc |     0.492 ns/B      1937 MiB/s      1.87 c/B
 XTS dec |     0.488 ns/B      1955 MiB/s      1.85 c/B
 OCB enc |     0.243 ns/B      3928 MiB/s     0.922 c/B
 OCB dec |     0.247 ns/B      3858 MiB/s     0.939 c/B
OCB auth |     0.213 ns/B      4482 MiB/s     0.809 c/B

After (cbc-dec & cfb-dec & xts & ocb ~6% faster, ctr ~11% faster):
AES | nanosecs/byte mebibytes/sec cycles/byte

 CBC enc |      1.06 ns/B     902.1 MiB/s      4.02 c/B
 CBC dec |     0.196 ns/B      4877 MiB/s     0.743 c/B
 CFB enc |      1.06 ns/B     902.2 MiB/s      4.02 c/B
 CFB dec |     0.195 ns/B      4889 MiB/s     0.741 c/B
 CTR enc |     0.214 ns/B      4448 MiB/s     0.815 c/B
 CTR dec |     0.214 ns/B      4452 MiB/s     0.814 c/B
 XTS enc |     0.461 ns/B      2067 MiB/s      1.75 c/B
 XTS dec |     0.456 ns/B      2092 MiB/s      1.73 c/B
 OCB enc |     0.227 ns/B      4200 MiB/s     0.863 c/B
 OCB dec |     0.234 ns/B      4072 MiB/s     0.890 c/B
OCB auth |     0.207 ns/B      4604 MiB/s     0.787 c/B
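
The three columns in the tables above are related by simple arithmetic, and
the percentages in the "After" line follow from the throughput ratio. The
helpers below are a minimal sketch with hypothetical names (they are not part
of libgcrypt's benchmark tool) and assume the ~3.8 GHz clock stated in the
heading.

```c
#include <assert.h>

/* Convert nanoseconds per byte to mebibytes per second (1 MiB = 2^20 B). */
static double ns_per_byte_to_mib_per_sec(double ns_per_byte)
{
  assert(ns_per_byte > 0.0);
  return (1e9 / ns_per_byte) / (1024.0 * 1024.0);
}

/* Convert nanoseconds per byte to cycles per byte at an assumed clock.
   One nanosecond at N GHz is N cycles. */
static double ns_per_byte_to_cycles_per_byte(double ns_per_byte,
                                             double clock_ghz)
{
  return ns_per_byte * clock_ghz;
}

/* Throughput gain in percent between two MiB/s measurements. */
static double throughput_gain_percent(double before_mib_s,
                                      double after_mib_s)
{
  assert(before_mib_s > 0.0);
  return (after_mib_s / before_mib_s - 1.0) * 100.0;
}
```

For example, the "Before" CBC dec figure of 0.208 ns/B converts to roughly
4585 MiB/s and 0.790 c/B at 3.8 GHz, and CTR enc going from 4007 MiB/s to
4448 MiB/s is a gain of about 11%.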

Benchmarks on POWER9 (ppc64le, ~3.8 GHz):

Before:
AES | nanosecs/byte mebibytes/sec cycles/byte

 CBC enc |      1.04 ns/B     918.7 MiB/s      3.94 c/B
 CBC dec |     0.240 ns/B      3982 MiB/s     0.910 c/B
 CFB enc |      1.04 ns/B     917.6 MiB/s      3.95 c/B
 CFB dec |     0.241 ns/B      3963 MiB/s     0.914 c/B
 CTR enc |     0.249 ns/B      3835 MiB/s     0.945 c/B
 CTR dec |     0.252 ns/B      3787 MiB/s     0.957 c/B
 XTS enc |     0.505 ns/B      1889 MiB/s      1.92 c/B
 XTS dec |     0.495 ns/B      1926 MiB/s      1.88 c/B
 OCB enc |     0.303 ns/B      3152 MiB/s      1.15 c/B
 OCB dec |     0.305 ns/B      3129 MiB/s      1.16 c/B
OCB auth |     0.265 ns/B      3595 MiB/s      1.01 c/B

After (cbc-dec & cfb-dec ~6% faster, ctr ~11% faster, ocb ~4% faster):
AES | nanosecs/byte mebibytes/sec cycles/byte

 CBC enc |      1.04 ns/B     917.3 MiB/s      3.95 c/B
 CBC dec |     0.225 ns/B      4234 MiB/s     0.856 c/B
 CFB enc |      1.04 ns/B     917.8 MiB/s      3.95 c/B
 CFB dec |     0.226 ns/B      4214 MiB/s     0.860 c/B
 CTR enc |     0.221 ns/B      4306 MiB/s     0.842 c/B
 CTR dec |     0.223 ns/B      4271 MiB/s     0.848 c/B
 XTS enc |     0.503 ns/B      1897 MiB/s      1.91 c/B
 XTS dec |     0.495 ns/B      1928 MiB/s      1.88 c/B
 OCB enc |     0.288 ns/B      3309 MiB/s      1.10 c/B
 OCB dec |     0.292 ns/B      3266 MiB/s      1.11 c/B
OCB auth |     0.267 ns/B      3570 MiB/s      1.02 c/B

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>

Details

Provenance
jukivili authored on Dec 22 2019, 3:44 PM
Parents
rC0837d7e6be3e: rijndael-ppc: fix bad register used for vector load/store assembly