Improve parallelizability of CBC decryption for AES-NI
* cipher/rijndael.c (_gcry_aes_cbc_dec) [USE_AESNI]: Add AES-NI specific CBC mode loop with temporary block and IV stored in free SSE registers.
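For illustration only, a minimal sketch of the loop shape described above: in-place CBC decryption that keeps the previous ciphertext block (the IV) and one temporary block in locals, which stand in for the free SSE registers. The names toy_encrypt, toy_decrypt, cbc_encrypt, cbc_decrypt and cbc_selfcheck are hypothetical and not taken from rijndael.c, and the toy block function is an assumption that replaces the real AES-NI rounds so the sketch builds without special compiler flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Hypothetical stand-in for one AES block operation. NOT a real cipher. */
static void toy_encrypt(const uint8_t key[BLOCK], const uint8_t *in, uint8_t *out)
{
  for (size_t i = 0; i < BLOCK; i++)
    out[i] = (uint8_t)((in[i] ^ key[i]) + i);
}

/* Inverse of toy_encrypt. */
static void toy_decrypt(const uint8_t key[BLOCK], const uint8_t *in, uint8_t *out)
{
  for (size_t i = 0; i < BLOCK; i++)
    out[i] = (uint8_t)((uint8_t)(in[i] - i) ^ key[i]);
}

/* In-place CBC encryption: C_i = E(P_i xor C_{i-1}); each block needs the
   previous ciphertext, so the chain is inherently serial. */
static void cbc_encrypt(const uint8_t key[BLOCK], uint8_t iv[BLOCK],
                        uint8_t *buf, size_t nblocks)
{
  for (size_t n = 0; n < nblocks; n++, buf += BLOCK)
    {
      for (size_t i = 0; i < BLOCK; i++)
        buf[i] ^= iv[i];
      toy_encrypt(key, buf, buf);
      memcpy(iv, buf, BLOCK);
    }
}

/* In-place CBC decryption: P_i = D(C_i) xor C_{i-1}. The block decryption
   depends only on ciphertext, so it does not wait on the previous output. */
static void cbc_decrypt(const uint8_t key[BLOCK], uint8_t iv[BLOCK],
                        uint8_t *buf, size_t nblocks)
{
  uint8_t prev[BLOCK];   /* running IV, stands in for one SSE register */
  uint8_t saved[BLOCK];  /* temporary block, stands in for another */

  memcpy(prev, iv, BLOCK);
  for (size_t n = 0; n < nblocks; n++, buf += BLOCK)
    {
      memcpy(saved, buf, BLOCK);      /* keep C_i before it is overwritten */
      toy_decrypt(key, buf, buf);
      for (size_t i = 0; i < BLOCK; i++)
        buf[i] ^= prev[i];
      memcpy(prev, saved, BLOCK);     /* C_i becomes the next IV */
    }
  memcpy(iv, prev, BLOCK);            /* write the IV back once, at the end */
}

/* Round-trip check over three blocks; returns 1 on success. */
static int cbc_selfcheck(void)
{
  uint8_t key[BLOCK], iv_enc[BLOCK], iv_dec[BLOCK];
  uint8_t plain[3 * BLOCK], buf[3 * BLOCK];

  for (size_t i = 0; i < BLOCK; i++)
    {
      key[i] = (uint8_t)(0xA5 ^ i);
      iv_enc[i] = iv_dec[i] = (uint8_t)(i * 7);
    }
  for (size_t i = 0; i < sizeof plain; i++)
    plain[i] = (uint8_t)(i * 3 + 1);

  memcpy(buf, plain, sizeof plain);
  cbc_encrypt(key, iv_enc, buf, 3);
  assert(memcmp(buf, plain, sizeof plain) != 0);
  cbc_decrypt(key, iv_dec, buf, 3);
  assert(memcmp(buf, plain, sizeof plain) == 0);
  /* Both directions must leave the same chaining value behind. */
  assert(memcmp(iv_enc, iv_dec, BLOCK) == 0);
  return 1;
}
```

Holding prev and saved in locals means the IV buffer is read once before the loop and written once after it, not on every block.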
Benchmark results on Intel Core i5-2450M (x86-64) show CBC decryption running roughly 2.5x to 2.9x faster (AES: 2110ms to 720ms):
Before:
$ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
         ECB/Stream          CBC             CFB             OFB             CTR
       --------------- --------------- --------------- --------------- ---------------
AES      690ms   780ms  2940ms  2110ms  1880ms   670ms  2250ms  2250ms   490ms   500ms
AES192   890ms   930ms  3260ms  2390ms  2220ms   820ms  2580ms  2590ms   560ms   570ms
AES256  1040ms  1070ms  3590ms  2640ms  2540ms   970ms  2880ms  2890ms   650ms   650ms
After:
$ tests/benchmark --cipher-repetitions 1000 cipher aes aes192 aes256
Running each test 1000 times.
         ECB/Stream          CBC             CFB             OFB             CTR
       --------------- --------------- --------------- --------------- ---------------
AES      670ms   770ms  2920ms   720ms  1900ms   660ms  2260ms  2250ms   480ms   500ms
AES192   860ms   930ms  3250ms   870ms  2210ms   830ms  2580ms  2580ms   570ms   570ms
AES256  1020ms  1080ms  3580ms  1030ms  2550ms   970ms  2880ms  2870ms   660ms   660ms
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>