cast5: add three-block parallel handling to generic C implementation
* cipher/cast5.c (do_encrypt_block_3, do_decrypt_block_3): New.
(_gcry_cast5_ctr_enc, _gcry_cast5_cbc_dec, _gcry_cast5_cfb_dec): Use
new three block functions.
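For readers unfamiliar with the technique, here is a minimal sketch of three-block interleaving applied to CBC decryption. It is not the code from cipher/cast5.c: the cipher is a toy 16-round Feistel network standing in for CAST5, blocks are plain uint64_t values, and every name (toy_f, toy_decrypt_block_3, toy_cbc_dec, toy_cbc_selftest) is hypothetical. The point it illustrates is that each round of one block depends on the previous round, so interleaving three independent blocks gives the CPU three dependency chains to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_ROUNDS 16

/* Toy round function (assumption: NOT CAST5, just a serial dependency). */
static uint32_t toy_f(uint32_t x, uint32_t k)
{
  x += k;
  x ^= (x << 7) | (x >> 25);
  return x * 0x9e3779b1u;
}

static uint64_t toy_encrypt_block(const uint32_t *k, uint64_t b)
{
  uint32_t l = (uint32_t)(b >> 32), r = (uint32_t)b, t;
  int i;
  for (i = 0; i < TOY_ROUNDS; i++)
    { t = l ^ toy_f(r, k[i]); l = r; r = t; }
  return ((uint64_t)l << 32) | r;
}

static uint64_t toy_decrypt_block(const uint32_t *k, uint64_t b)
{
  uint32_t l = (uint32_t)(b >> 32), r = (uint32_t)b, t;
  int i;
  for (i = TOY_ROUNDS - 1; i >= 0; i--)
    { t = r ^ toy_f(l, k[i]); r = l; l = t; }
  return ((uint64_t)l << 32) | r;
}

/* Decrypt three independent blocks, interleaving their rounds. */
static void toy_decrypt_block_3(const uint32_t *k, uint64_t b[3])
{
  uint32_t l0 = (uint32_t)(b[0] >> 32), r0 = (uint32_t)b[0];
  uint32_t l1 = (uint32_t)(b[1] >> 32), r1 = (uint32_t)b[1];
  uint32_t l2 = (uint32_t)(b[2] >> 32), r2 = (uint32_t)b[2];
  uint32_t t0, t1, t2;
  int i;
  for (i = TOY_ROUNDS - 1; i >= 0; i--)
    {
      /* Three round computations with no data dependency between them. */
      t0 = r0 ^ toy_f(l0, k[i]);
      t1 = r1 ^ toy_f(l1, k[i]);
      t2 = r2 ^ toy_f(l2, k[i]);
      r0 = l0; r1 = l1; r2 = l2;
      l0 = t0; l1 = t1; l2 = t2;
    }
  b[0] = ((uint64_t)l0 << 32) | r0;
  b[1] = ((uint64_t)l1 << 32) | r1;
  b[2] = ((uint64_t)l2 << 32) | r2;
}

/* CBC decryption: three blocks per iteration, then a one-block tail loop. */
static void toy_cbc_dec(const uint32_t *k, uint64_t iv, uint64_t *out,
                        const uint64_t *in, size_t nblocks)
{
  uint64_t c[3], d[3];
  assert(k && out && in);
  while (nblocks >= 3)
    {
      memcpy(c, in, sizeof c);   /* keep ciphertext; out may alias in */
      memcpy(d, c, sizeof d);
      toy_decrypt_block_3(k, d);
      out[0] = d[0] ^ iv;
      out[1] = d[1] ^ c[0];
      out[2] = d[2] ^ c[1];
      iv = c[2];
      in += 3; out += 3; nblocks -= 3;
    }
  for (; nblocks; nblocks--)
    {
      uint64_t ct = *in++;
      *out++ = toy_decrypt_block(k, ct) ^ iv;
      iv = ct;
    }
}

/* Round trip over 8 blocks (3 + 3 + 2 tail); returns 1 on success. */
int toy_cbc_selftest(void)
{
  uint32_t k[TOY_ROUNDS];
  uint64_t pt[8], ct[8], back[8], prev = 0x0123456789abcdefULL;
  size_t i;
  for (i = 0; i < TOY_ROUNDS; i++)
    k[i] = 0x01010101u * (uint32_t)(i + 1);
  for (i = 0; i < 8; i++)
    pt[i] = 0x1111111111111111ULL * (i + 1);
  for (i = 0; i < 8; i++)        /* serial CBC encryption */
    prev = ct[i] = toy_encrypt_block(k, pt[i] ^ prev);
  toy_cbc_dec(k, 0x0123456789abcdefULL, back, ct, 8);
  return memcmp(pt, back, sizeof pt) == 0 && memcmp(pt, ct, sizeof pt) != 0;
}
```

CBC decryption, CFB decryption and CTR encryption are the modes where this works because each block's cipher call depends only on already-available inputs (ciphertext or counter values), not on the previous block's cipher output.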
Benchmark on aarch64 (cortex-a53, 816 MHz):
Before:
 CAST5          |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC dec |     35.24 ns/B     27.07 MiB/s     28.75 c/B
        CFB dec |     34.62 ns/B     27.54 MiB/s     28.25 c/B
        CTR enc |     35.39 ns/B     26.95 MiB/s     28.88 c/B
After (~40%-50% faster):
 CAST5          |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC dec |     23.05 ns/B     41.38 MiB/s     18.81 c/B
        CFB dec |     24.49 ns/B     38.94 MiB/s     19.98 c/B
        CTR enc |     24.57 ns/B     38.82 MiB/s     20.05 c/B
Benchmark on i386 (haswell, 4000 MHz):
Before:
 CAST5          |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC dec |      6.92 ns/B     137.7 MiB/s     27.69 c/B
        CFB dec |      6.83 ns/B     139.7 MiB/s     27.32 c/B
        CTR enc |      7.01 ns/B     136.1 MiB/s     28.03 c/B
After (~70% faster):
 CAST5          |  nanosecs/byte   mebibytes/sec   cycles/byte
        CBC dec |      3.97 ns/B     240.1 MiB/s     15.89 c/B
        CFB dec |      3.96 ns/B     241.0 MiB/s     15.83 c/B
        CTR enc |      4.01 ns/B     237.8 MiB/s     16.04 c/B
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>