Camellia AES-NI/AVX/AVX2 size optimization
* cipher/camellia-aesni-avx-amd64.S: Use loop for handling repeating
'(enc|dec)_rounds16/fls16' portions of encryption/decryption.
* cipher/camellia-aesni-avx2-amd64.S: Use loop for handling repeating
'(enc|dec)_rounds32/fls32' portions of encryption/decryption.
Use a round+fls loop to reduce the binary size of the Camellia
AES-NI/AVX/AVX2 implementations. This also gives a small performance
boost on AMD Zen2.
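The control flow of such a loop can be sketched in C. This is only an illustration and not the assembly from this commit: the function name, the enum and the trace buffer are hypothetical, and it assumes each rounds macro covers six Camellia rounds. It relies on the standard Camellia layout of 3 six-round blocks for 128-bit keys and 4 for 192/256-bit keys, with an FL/FL^-1 layer between consecutive blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation tags, used only to trace the control flow. */
enum camellia_op { OP_ROUNDS6, OP_FLS };

/*
 * Record the order of six-round blocks and FL/FL^-1 layers for a given
 * key size. 128-bit keys use 3 six-round blocks, 192/256-bit keys use 4.
 * Returns the number of operations written to out.
 */
size_t camellia_trace(unsigned key_bits, enum camellia_op *out, size_t cap)
{
    unsigned blocks = (key_bits == 128) ? 3 : 4;
    size_t n = 0;

    assert(out != NULL && cap >= 2u * blocks - 1);

    for (unsigned i = 0; ; i++) {
        out[n++] = OP_ROUNDS6;      /* single copy of the round body */
        if (i + 1 == blocks)
            break;                  /* no FL layer after the last block */
        out[n++] = OP_FLS;          /* single copy of the FL/FL^-1 layer */
    }
    return n;
}
```

For a 128-bit key the trace is rounds, fls, rounds, fls, rounds. An unrolled version would emit each of those bodies separately, while the loop keeps one copy of each, which is where the size reduction comes from.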
Before:
   text    data     bss     dec     hex filename
  63877       0       0   63877    f985 cipher/.libs/camellia-aesni-avx2-amd64.o
  59623       0       0   59623    e8e7 cipher/.libs/camellia-aesni-avx-amd64.o
After:
   text    data     bss     dec     hex filename
  22999       0       0   22999    59d7 cipher/.libs/camellia-aesni-avx2-amd64.o
  25047       0       0   25047    61d7 cipher/.libs/camellia-aesni-avx-amd64.o
Benchmark on AMD Ryzen 7 3700X:
Before:
Cipher:
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC dec |     0.670 ns/B      1424 MiB/s      2.88 c/B      4300
        CFB dec |     0.667 ns/B      1430 MiB/s      2.87 c/B      4300
        CTR enc |     0.677 ns/B      1410 MiB/s      2.91 c/B      4300
        CTR dec |     0.676 ns/B      1412 MiB/s      2.90 c/B      4300
        OCB enc |     0.696 ns/B      1370 MiB/s      2.98 c/B      4275
        OCB dec |     0.698 ns/B      1367 MiB/s      2.98 c/B      4275
       OCB auth |     0.683 ns/B      1395 MiB/s      2.94 c/B      4300
After (~8% faster):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC dec |     0.611 ns/B      1561 MiB/s      2.64 c/B      4313
        CFB dec |     0.616 ns/B      1549 MiB/s      2.65 c/B      4312
        CTR enc |     0.625 ns/B      1525 MiB/s      2.69 c/B      4300
        CTR dec |     0.625 ns/B      1526 MiB/s      2.69 c/B      4299
        OCB enc |     0.639 ns/B      1493 MiB/s      2.75 c/B      4307
        OCB dec |     0.642 ns/B      1485 MiB/s      2.76 c/B      4301
       OCB auth |     0.631 ns/B      1512 MiB/s      2.71 c/B      4300
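The "~8%" figure can be cross-checked from the ns/B columns. The sketch below is not part of the benchmark tooling; the function name is hypothetical, and it assumes that "faster" means the mean relative reduction in nanoseconds per byte across the seven listed modes.

```c
#include <assert.h>
#include <stddef.h>

/* ns/B figures copied from the Before and After tables, in table order:
 * CBC dec, CFB dec, CTR enc, CTR dec, OCB enc, OCB dec, OCB auth. */
static const double before_ns[] = {0.670, 0.667, 0.677, 0.676, 0.696, 0.698, 0.683};
static const double after_ns[]  = {0.611, 0.616, 0.625, 0.625, 0.639, 0.642, 0.631};

/* Mean relative reduction in ns/B over all modes, in percent. */
double mean_time_reduction_pct(void)
{
    size_t n = sizeof before_ns / sizeof before_ns[0];
    double sum = 0.0;

    /* Both tables must list the same number of modes. */
    assert(n == sizeof after_ns / sizeof after_ns[0]);

    for (size_t i = 0; i < n; i++)
        sum += (before_ns[i] - after_ns[i]) / before_ns[i] * 100.0;
    return sum / (double)n;
}
```

The per-mode reductions fall roughly between 7.5% and 8.8%, so the mean is consistent with the stated figure.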
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>