blowfish: add three-block parallel handling to generic C implementation
* cipher/blowfish.c (BLOWFISH_ROUNDS): Remove.
[BLOWFISH_ROUNDS != 16] (function_F): Remove.
(F): Replace big-endian and little-endian version with single
endian-neutral version.
(R3, do_encrypt_3, do_decrypt_3): New.
(_gcry_blowfish_ctr_enc, _gcry_blowfish_cbc_dec)
(_gcry_blowfish_cfb_dec): Use new three block functions.
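The idea behind the endian-neutral F and the three-block functions can be
sketched roughly as below. This is an illustration only, not the code from
cipher/blowfish.c: the names bf_ctx_sketch, bf_F, bf_encrypt_1 and
bf_encrypt_3 are hypothetical, and the table layout is assumed to be the
standard Blowfish one (four 256-entry S-boxes plus an 18-entry P-array).
F takes its four index bytes with shifts and masks, so it behaves the same
on any byte order, and the three-block routine interleaves independent
blocks so that their S-box lookups can overlap.

```c
#include <stdint.h>

/* Hypothetical context layout for this sketch (standard Blowfish sizes);
   not libgcrypt's actual structure. */
typedef struct {
  uint32_t s0[256], s1[256], s2[256], s3[256];
  uint32_t p[18];
} bf_ctx_sketch;

/* Endian-neutral F: the index bytes are extracted with shifts and masks
   instead of by viewing the word as a byte array. */
static inline uint32_t bf_F(const bf_ctx_sketch *c, uint32_t x)
{
  return ((c->s0[(x >> 24) & 0xff] + c->s1[(x >> 16) & 0xff])
          ^ c->s2[(x >> 8) & 0xff]) + c->s3[x & 0xff];
}

/* Single-block encryption, 16 rounds, two rounds per loop iteration so
   that no explicit swap of the halves is needed inside the loop. */
static void bf_encrypt_1(const bf_ctx_sketch *c, uint32_t *l, uint32_t *r)
{
  uint32_t xl = *l, xr = *r;
  for (int i = 0; i < 16; i += 2) {
    xl ^= c->p[i];     xr ^= bf_F(c, xl);
    xr ^= c->p[i + 1]; xl ^= bf_F(c, xr);
  }
  xl ^= c->p[16];
  xr ^= c->p[17];
  *l = xr;  /* final swap of the halves */
  *r = xl;
}

/* Three independent blocks processed in lock step.  Each round's work is
   issued for all three blocks before moving on, which gives the CPU
   independent table lookups to overlap. */
static void bf_encrypt_3(const bf_ctx_sketch *c, uint32_t l[3], uint32_t r[3])
{
  uint32_t l0 = l[0], l1 = l[1], l2 = l[2];
  uint32_t r0 = r[0], r1 = r[1], r2 = r[2];
  for (int i = 0; i < 16; i += 2) {
    l0 ^= c->p[i]; l1 ^= c->p[i]; l2 ^= c->p[i];
    r0 ^= bf_F(c, l0); r1 ^= bf_F(c, l1); r2 ^= bf_F(c, l2);
    r0 ^= c->p[i + 1]; r1 ^= c->p[i + 1]; r2 ^= c->p[i + 1];
    l0 ^= bf_F(c, r0); l1 ^= bf_F(c, r1); l2 ^= bf_F(c, r2);
  }
  l0 ^= c->p[16]; l1 ^= c->p[16]; l2 ^= c->p[16];
  r0 ^= c->p[17]; r1 ^= c->p[17]; r2 ^= c->p[17];
  l[0] = r0; l[1] = r1; l[2] = r2;  /* final swap of the halves */
  r[0] = l0; r[1] = l1; r[2] = l2;
}
```

Interleaving only helps where the blocks do not depend on each other's
output, which is the case for CTR encryption and for CBC and CFB
decryption, the three modes switched over in this change.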
Benchmark on aarch64 (cortex-a53, 816 MHz):
Before:
 BLOWFISH       |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC dec |     29.58 ns/B     32.24 MiB/s     24.13 c/B
        CFB dec |     33.38 ns/B     28.57 MiB/s     27.24 c/B
        CTR enc |     34.18 ns/B     27.90 MiB/s     27.89 c/B
After (~60%-70% faster):
 BLOWFISH       |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC dec |     18.18 ns/B     52.45 MiB/s     14.84 c/B
        CFB dec |     19.67 ns/B     48.50 MiB/s     16.05 c/B
        CTR enc |     19.77 ns/B     48.25 MiB/s     16.13 c/B
Benchmark on i386 (haswell, 4000 MHz):
Before:
 BLOWFISH       |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC dec |      6.10 ns/B     156.4 MiB/s     24.39 c/B
        CFB dec |      6.39 ns/B     149.2 MiB/s     25.56 c/B
        CTR enc |      6.73 ns/B     141.6 MiB/s     26.93 c/B
After (~80% faster):
 BLOWFISH       |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC dec |      3.46 ns/B     275.5 MiB/s     13.85 c/B
        CFB dec |      3.53 ns/B     270.4 MiB/s     14.11 c/B
        CTR enc |      3.56 ns/B     268.0 MiB/s     14.23 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>