cast5: add amd64 assembly implementation
* cipher/Makefile.am: Add 'cast5-amd64.S'.
* cipher/cast5-amd64.S: New file.
* cipher/cast5.c (USE_AMD64_ASM): New macro.
(_gcry_cast5_s1tos4): Merge arrays s1, s2, s3, s4 into a single array
to simplify access from the assembly implementation.
(s1, s2, s3, s4): New macros pointing to subarrays in
_gcry_cast5_s1tos4.
[USE_AMD64_ASM] (_gcry_cast5_amd64_encrypt_block)
(_gcry_cast5_amd64_decrypt_block, _gcry_cast5_amd64_ctr_enc)
(_gcry_cast5_amd64_cbc_dec, _gcry_cast5_amd64_cfb_dec): New
prototypes.
[USE_AMD64_ASM] (do_encrypt_block, do_decrypt_block, encrypt_block)
(decrypt_block): New functions.
(_gcry_cast5_ctr_enc, _gcry_cast5_cbc_dec, _gcry_cast5_cfb_dec)
(selftest_ctr, selftest_cbc, selftest_cfb): New functions.
(selftest): Call new bulk selftests.
* cipher/cipher.c (gcry_cipher_open) [USE_CAST5]: Register CAST5 bulk
functions for ctr-enc, cbc-dec and cfb-dec.
* configure.ac (cast5) [x86_64]: Add 'cast5-amd64.lo'.
* src/cipher.h (_gcry_cast5_ctr_enc, _gcry_cast5_cbc_dec)
(_gcry_cast5_cfb_dec): New prototypes.
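The s1..s4 merge noted in the cipher/cast5.c entry can be pictured roughly as follows. This is only a sketch under assumptions: `sbox_merged`, `S1`..`S4`, `sbox_fill_placeholder` and `sbox_byte_offset` are hypothetical names, and the table contents are placeholders, not the CAST5 S-box constants.

```c
#include <stddef.h>
#include <stdint.h>

#define SBOX_ENTRIES 256

/* Hypothetical stand-in for the merged table: four 256-entry S-boxes
   stored back to back, so assembly code needs only one base address. */
static uint32_t sbox_merged[4][SBOX_ENTRIES];

/* Macros keep the per-table names usable from C code. */
#define S1 (sbox_merged[0])
#define S2 (sbox_merged[1])
#define S3 (sbox_merged[2])
#define S4 (sbox_merged[3])

/* Fill with placeholder values (NOT the real CAST5 constants):
   entry i of table t holds (t << 8) | i. */
static void sbox_fill_placeholder(void)
{
  for (size_t t = 0; t < 4; t++)
    for (size_t i = 0; i < SBOX_ENTRIES; i++)
      sbox_merged[t][i] = (uint32_t)((t << 8) | i);
}

/* Byte offset of table t from the start of the merged array; this is
   the kind of constant displacement assembly could add to the base. */
static size_t sbox_byte_offset(size_t t)
{
  return t * SBOX_ENTRIES * sizeof(uint32_t);
}
```

With this layout, a lookup such as `S3[5]` in C and "base + 2048 + 4*5" in assembly address the same word.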
Provides non-parallel implementations for a small speed-up and 4-way
parallel implementations that get accelerated on `out-of-order' CPUs.
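Why the bulk modes can run several blocks at once is easiest to see with CBC decryption: each plaintext block is D(C[i]) xor C[i-1], so the block-cipher calls do not depend on each other. The sketch below illustrates that idea in plain C only; `toy_encrypt`, `toy_decrypt`, `toy_cbc_enc` and `toy_cbc_dec` are hypothetical, and the toy 64-bit cipher is not CAST5 and offers no security.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy invertible 64-bit "block cipher" (NOT CAST5): xor, rotate, add. */
static uint64_t toy_encrypt(uint64_t key, uint64_t x)
{
  uint64_t v = x ^ key;
  return ((v << 7) | (v >> 57)) + key;
}

static uint64_t toy_decrypt(uint64_t key, uint64_t y)
{
  uint64_t t = y - key;
  return ((t >> 7) | (t << 57)) ^ key;
}

/* CBC encryption is inherently serial: block i needs ciphertext i-1. */
static void toy_cbc_enc(uint64_t key, uint64_t iv,
                        const uint64_t *in, uint64_t *out, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      iv = toy_encrypt(key, in[i] ^ iv);
      out[i] = iv;
    }
}

/* CBC decryption, four blocks per iteration.  The four toy_decrypt
   calls are independent, so an out-of-order CPU (or interleaved
   assembly) can overlap them. */
static void toy_cbc_dec(uint64_t key, uint64_t iv,
                        const uint64_t *in, uint64_t *out, size_t n)
{
  size_t i = 0;

  for (; i + 4 <= n; i += 4)
    {
      /* Load all ciphertext blocks first so in-place use is safe. */
      uint64_t c0 = in[i], c1 = in[i + 1], c2 = in[i + 2], c3 = in[i + 3];
      uint64_t d0 = toy_decrypt(key, c0);
      uint64_t d1 = toy_decrypt(key, c1);
      uint64_t d2 = toy_decrypt(key, c2);
      uint64_t d3 = toy_decrypt(key, c3);

      out[i] = d0 ^ iv;
      out[i + 1] = d1 ^ c0;
      out[i + 2] = d2 ^ c1;
      out[i + 3] = d3 ^ c2;
      iv = c3;
    }

  /* Remaining 0..3 blocks take the one-block-at-a-time path. */
  for (; i < n; i++)
    {
      uint64_t c = in[i];
      out[i] = toy_decrypt(key, c) ^ iv;
      iv = c;
    }
}
```

CTR encryption and CFB decryption have the same property, since their cipher inputs (counter values, previous ciphertext blocks) are all known up front.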
Speed old vs. new on AMD Phenom II X6 1055T:
            ECB/Stream         CBC             CFB             OFB             CTR
          --------------- --------------- --------------- --------------- ---------------
CAST5      1.23x   1.22x   1.21x   2.86x   1.21x   2.83x   1.22x   1.17x   2.73x   2.73x
Speed old vs. new on Intel Core i5-2450M (Sandy-Bridge):
            ECB/Stream         CBC             CFB             OFB             CTR
          --------------- --------------- --------------- --------------- ---------------
CAST5      1.00x   1.04x   1.06x   2.56x   1.06x   2.37x   1.03x   1.01x   2.43x   2.41x
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>