camellia-aesni-avx: speed up for round key broadcasting
* cipher/camellia-aesni-avx2-amd64.h (roundsm16, fls16): Broadcast round key bytes directly with 'vpshufb'.
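To illustrate the idea of broadcasting a round key byte with a single byte shuffle, here is a minimal scalar sketch in C. It models the 128-bit 'vpshufb' semantics rather than reproducing the assembly from this patch; the helper names `pshufb_model` and `broadcast_key_byte`, and the little-endian placement of the key in the register, are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the 128-bit 'vpshufb' byte shuffle:
 * dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0f]. */
static void
pshufb_model (uint8_t dst[16], const uint8_t src[16], const uint8_t mask[16])
{
  uint8_t tmp[16];
  int i;

  for (i = 0; i < 16; i++)
    tmp[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0f];
  memcpy (dst, tmp, 16);
}

/* Hypothetical helper: replicate byte 'idx' (0..7) of a 64-bit round key
 * into all 16 byte lanes with one shuffle.  The key is assumed to sit in
 * the low 8 bytes of the register in little-endian order. */
static void
broadcast_key_byte (uint8_t out[16], uint64_t round_key, unsigned int idx)
{
  uint8_t key_reg[16] = { 0 };
  uint8_t mask[16];
  int i;

  assert (idx < 8);

  for (i = 0; i < 8; i++)
    key_reg[i] = (uint8_t)(round_key >> (8 * i));

  /* A shuffle mask whose 16 entries all equal 'idx' selects the same
   * source byte for every destination lane. */
  memset (mask, (int)idx, sizeof (mask));
  pshufb_model (out, key_reg, mask);
}
```

Calling `broadcast_key_byte (out, key, 3)` fills all 16 bytes of `out` with byte 3 of `key`; in vector code the per-index masks would be constants, so each broadcast costs one shuffle.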
Benchmark on AMD Ryzen 9 7900X (turbo-freq off):
Before:
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.837 ns/B      1139 MiB/s      3.94 c/B      4700
        ECB dec |     0.839 ns/B      1137 MiB/s      3.94 c/B      4700
After (~3% faster):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.808 ns/B      1180 MiB/s      3.80 c/B      4700
        ECB dec |     0.810 ns/B      1177 MiB/s      3.81 c/B      4700
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>