Add parallelized AES-NI CBC decryption
* cipher/rijndael.c [USE_AESNI] (aesni_cleanup_5): New macro.
[USE_AESNI] (do_aesni_dec_vec4): New function.
(_gcry_aes_cbc_dec) [USE_AESNI]: Add parallelized CBC loop.
(_gcry_aes_cbc_dec) [USE_AESNI]: Change IV storage register from xmm3
to xmm5.
This gives a ~60% improvement in CBC decryption speed on Sandy Bridge (x86-64).
The overall speed improvement with this and the previous CBC patches is over 400%.
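CBC decryption can be parallelized because each plaintext block is
P[i] = D(C[i]) xor C[i-1], so the block decryptions D(C[i]) do not depend on
one another; only the final XOR needs the previous ciphertext block. The
sketch below illustrates that idea in portable C. It is not the code from
this patch: toy_encrypt_block, toy_decrypt_block, cbc_encrypt and
cbc_decrypt_4way are hypothetical names, and the toy cipher is a stand-in
for AES so that the example stays self-contained.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Hypothetical toy block cipher (NOT AES); it exists only so that the
   CBC logic below can be exercised without a real AES implementation. */
static void toy_encrypt_block(const uint8_t *key, const uint8_t *in, uint8_t *out)
{
  for (unsigned i = 0; i < BLOCK; i++)
    out[i] = (uint8_t)((in[i] ^ key[i]) + 1 + i);
}

static void toy_decrypt_block(const uint8_t *key, const uint8_t *in, uint8_t *out)
{
  for (unsigned i = 0; i < BLOCK; i++)
    out[i] = (uint8_t)((uint8_t)(in[i] - 1 - i) ^ key[i]);
}

/* Serial CBC encryption: every block depends on the previous ciphertext. */
static void cbc_encrypt(const uint8_t *key, const uint8_t *iv,
                        const uint8_t *in, uint8_t *out, size_t nblocks)
{
  uint8_t prev[BLOCK], tmp[BLOCK];
  memcpy(prev, iv, BLOCK);
  for (size_t b = 0; b < nblocks; b++, in += BLOCK, out += BLOCK) {
    for (unsigned i = 0; i < BLOCK; i++)
      tmp[i] = in[i] ^ prev[i];
    toy_encrypt_block(key, tmp, out);
    memcpy(prev, out, BLOCK);
  }
}

/* CBC decryption, four blocks per iteration.  iv is updated to the last
   ciphertext block so that calls can be chained.  Safe for in == out. */
static void cbc_decrypt_4way(const uint8_t *key, uint8_t *iv,
                             const uint8_t *in, uint8_t *out, size_t nblocks)
{
  uint8_t d[4][BLOCK], next_iv[BLOCK];

  while (nblocks >= 4) {
    /* These four decryptions are independent of each other; this is the
       part a vectorized or pipelined implementation can overlap. */
    for (unsigned j = 0; j < 4; j++)
      toy_decrypt_block(key, in + j * BLOCK, d[j]);

    memcpy(next_iv, in + 3 * BLOCK, BLOCK);
    /* XOR from the last block backwards so that an in-place call still
       reads each previous ciphertext block before overwriting it. */
    for (int j = 3; j >= 0; j--) {
      const uint8_t *prev = (j == 0) ? iv : in + (j - 1) * BLOCK;
      for (unsigned i = 0; i < BLOCK; i++)
        out[j * BLOCK + i] = d[j][i] ^ prev[i];
    }
    memcpy(iv, next_iv, BLOCK);
    in += 4 * BLOCK; out += 4 * BLOCK; nblocks -= 4;
  }

  /* Remaining 0..3 blocks, one at a time. */
  while (nblocks > 0) {
    memcpy(next_iv, in, BLOCK);
    toy_decrypt_block(key, in, d[0]);
    for (unsigned i = 0; i < BLOCK; i++)
      out[i] = d[0][i] ^ iv[i];
    memcpy(iv, next_iv, BLOCK);
    in += BLOCK; out += BLOCK; nblocks--;
  }
}
```

CBC encryption has no such independence, since each input to the block
cipher depends on the previous ciphertext block, which is consistent with
the CBC encryption column being unchanged in the numbers below.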
Before:
$ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
               ECB/Stream         CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            670ms   770ms  2920ms   720ms  1900ms   660ms  2260ms  2250ms   480ms   500ms
AES192         860ms   930ms  3250ms   870ms  2210ms   830ms  2580ms  2580ms   570ms   570ms
AES256        1020ms  1080ms  3580ms  1030ms  2550ms   970ms  2880ms  2870ms   660ms   660ms
After:
$ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
               ECB/Stream         CBC             CFB             OFB             CTR
             --------------- --------------- --------------- --------------- ---------------
AES            670ms   770ms  2130ms   450ms  1880ms   670ms  2250ms  2280ms   490ms   490ms
AES192         880ms   920ms  2460ms   540ms  2210ms   830ms  2580ms  2570ms   580ms   570ms
AES256        1020ms  1070ms  2800ms   620ms  2560ms   970ms  2880ms  2880ms   660ms   650ms
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>