camellia-avx2: speed up round key broadcasting
* cipher/camellia-aesni-avx2-amd64.h (roundsm32, fls32): Use
'vpbroadcastb' for loading round key.
* cipher/camellia-glue.c (camellia_encrypt_blk1_32)
(camellia_decrypt_blk1_32): Adjust num_blks thresholds for AVX2
implementations, 2 blks for GFNI, 4 blks for VAES and 5 blks for AESNI.
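The threshold adjustment can be pictured with the minimal sketch below. It is
not the code from cipher/camellia-glue.c: the flag names and both helper
functions are hypothetical, and it assumes that each threshold means "at least
this many blocks" and that GFNI is preferred over VAES, and VAES over AESNI,
when several are available.

```c
/* Hypothetical feature flags, for illustration only. */
enum hw_feature
{
  HW_AESNI_AVX2 = 1 << 0,
  HW_VAES_AVX2  = 1 << 1,
  HW_GFNI_AVX2  = 1 << 2
};

/* Return the smallest block count at which the AVX2 path is assumed to
 * be worthwhile, or 0 if no AVX2 implementation is available.  The
 * values 2, 4 and 5 are the thresholds named in this commit; the
 * priority order is an assumption. */
static unsigned int
avx2_min_blks (unsigned int features)
{
  if (features & HW_GFNI_AVX2)
    return 2;
  if (features & HW_VAES_AVX2)
    return 4;
  if (features & HW_AESNI_AVX2)
    return 5;
  return 0;
}

/* Return 1 if a request of num_blks blocks would take the AVX2 path
 * under the assumptions above, 0 otherwise. */
static int
use_avx2_path (unsigned int features, unsigned int num_blks)
{
  unsigned int min_blks = avx2_min_blks (features);

  return min_blks != 0 && num_blks >= min_blks;
}
```

Under these assumptions a 4-block request takes the AVX2 path with VAES or
GFNI, but falls back to a narrower implementation with AESNI only.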
Benchmark on AMD Ryzen 9 7900X (turbo-freq off):
Before:
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.213 ns/B      4469 MiB/s      1.00 c/B      4700
        ECB dec |     0.215 ns/B      4440 MiB/s      1.01 c/B      4700
After (~10% faster):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.194 ns/B      4919 MiB/s     0.911 c/B      4700
        ECB dec |     0.195 ns/B      4896 MiB/s     0.916 c/B      4700
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>