aria-avx512: small optimization for aria_diff_m
* cipher/aria-gfni-avx512-amd64.S (aria_diff_m): Use 'vpternlogq' for
3-way XOR operation.
---
Using vpternlogq gives a small performance improvement on AMD Zen4. On
Intel Tiger Lake, speed is the same as before.
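
As an illustration of why a single ternary-logic instruction can stand in
for two chained XORs, here is a minimal scalar sketch in C. It emulates the
per-bit lookup on one 64-bit lane and derives the 8-bit truth table for
a ^ b ^ c, which works out to 0x96. The helper names ternlog64 and xor3_imm8
are hypothetical and are not taken from the patch, and this message does not
show the immediate actually used in aria_diff_m, so 0x96 here is only the
standard 3-way XOR table, not a quote from the code.

```c
#include <stdint.h>

/* Scalar model of a ternary-logic operation on one 64-bit lane.
 * For each bit position, the three input bits form a 3-bit index
 * (a is the most significant bit, c the least significant) that
 * selects one bit of the 8-bit truth table imm8.
 * Hypothetical helper for illustration only. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
  uint64_t r = 0;
  int i;

  for (i = 0; i < 64; i++)
    {
      unsigned int idx = (unsigned int)((((a >> i) & 1) << 2)
                                        | (((b >> i) & 1) << 1)
                                        | ((c >> i) & 1));
      r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
  return r;
}

/* Build the truth table for a ^ b ^ c: table bit 'idx' is set when an
 * odd number of the three index bits are set. */
static uint8_t xor3_imm8(void)
{
  uint8_t imm = 0;
  unsigned int idx;

  for (idx = 0; idx < 8; idx++)
    {
      unsigned int bit = ((idx >> 2) ^ (idx >> 1) ^ idx) & 1;
      imm |= (uint8_t)(bit << idx);
    }
  return imm;
}
```

Calling ternlog64 (a, b, c, xor3_imm8 ()) returns the same value as
a ^ b ^ c for any inputs, which is the property that lets one instruction
replace a pair of dependent XOR instructions.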
Benchmark on AMD Ryzen 9 7900X (zen4, turbo-freq off):
Before:
 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.203 ns/B      4703 MiB/s     0.953 c/B      4700
        ECB dec |     0.204 ns/B      4675 MiB/s     0.959 c/B      4700
        CTR enc |     0.207 ns/B      4609 MiB/s     0.973 c/B      4700
        CTR dec |     0.207 ns/B      4608 MiB/s     0.973 c/B      4700
After (~3% faster):
 ARIA128        |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.197 ns/B      4847 MiB/s     0.925 c/B      4700
        ECB dec |     0.197 ns/B      4852 MiB/s     0.924 c/B      4700
        CTR enc |     0.200 ns/B      4759 MiB/s     0.942 c/B      4700
        CTR dec |     0.200 ns/B      4772 MiB/s     0.939 c/B      4700
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>