camellia-gfni: use GFNI for uint8 right shift in FLS

* cipher/camellia-aesni-avx2-amd64.h (IF_GFNI, IF_NOT_GFNI): New.
[CAMELLIA_GFNI_BUILD] (rol32_1_32): Add GFNI variant which uses
vgf2p8affineqb for uint8 right shift by 7.
(fls32): Load 'right shift by 7' bit-matrix on GFNI build.
[CAMELLIA_GFNI_BUILD] (.Lright_shift_by_7): New.
* cipher/camellia-gfni-avx512-amd64.S (clear_regs): Don't clear %k1.
(rol32_1_64): Use vgf2p8affineqb for uint8 right shift by 7.
(fls64): Adjust for rol32_1_64 changes.
(.Lbyte_ones): Remove.
(.Lright_shift_by_7): New.
(_gcry_camellia_gfni_avx512_ctr_enc): Clear %k1 after use.

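For readers unfamiliar with the trick named in the entries above, the following is a minimal scalar sketch in C of how an affine transform over GF(2) can act as a per-byte right shift by 7, and how that carry bit could feed a byte-sliced 32-bit rotate by one. It is not the assembly from this change. The function models one byte lane of GF2P8AFFINEQB as described by the pseudocode in the Intel SDM; the names affine_byte, RIGHT_SHIFT_BY_7, rol32_1_bytes and the helper functions are hypothetical, and the matrix constant is derived from that pseudocode rather than copied from .Lright_shift_by_7.

```c
#include <stdint.h>

/* Software model of one byte lane of GF2P8AFFINEQB (assumption: follows
 * the Intel SDM pseudocode). Output bit i is the parity of
 * (matrix byte [7 - i] AND x), XORed with bit i of imm8. */
static uint8_t affine_byte(uint64_t matrix, uint8_t x, uint8_t imm8)
{
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t row = (uint8_t)(matrix >> (8 * (7 - i)));
        uint8_t v = row & x;
        v ^= v >> 4;            /* fold down to a single parity bit */
        v ^= v >> 2;
        v ^= v >> 1;
        out |= (uint8_t)(((v ^ (imm8 >> i)) & 1) << i);
    }
    return out;
}

/* Hypothetical constant: only the row for output bit 0 (held in the most
 * significant matrix byte) is non-zero, and it selects input bit 7. */
#define RIGHT_SHIFT_BY_7 UINT64_C(0x8000000000000000)

/* Rotate a 32-bit word left by one while it is held as four separate
 * bytes (b[0] is the least significant byte). */
static void rol32_1_bytes(uint8_t b[4])
{
    uint8_t carry[4];
    for (int i = 0; i < 4; i++)
        carry[i] = affine_byte(RIGHT_SHIFT_BY_7, b[i], 0); /* b[i] >> 7 */
    for (int i = 0; i < 4; i++)
        b[i] = (uint8_t)((b[i] << 1) | carry[(i + 3) & 3]);
}

/* Number of byte values for which the model disagrees with x >> 7. */
static int count_shift_mismatches(void)
{
    int bad = 0;
    for (int x = 0; x < 256; x++)
        bad += affine_byte(RIGHT_SHIFT_BY_7, (uint8_t)x, 0)
               != (uint8_t)(x >> 7);
    return bad;
}

/* Convenience wrapper: split a word into bytes, rotate, and reassemble. */
static uint32_t rol32_1_via_bytes(uint32_t w)
{
    uint8_t b[4] = { (uint8_t)w, (uint8_t)(w >> 8),
                     (uint8_t)(w >> 16), (uint8_t)(w >> 24) };
    rol32_1_bytes(b);
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```

In this model the top bit of each byte is extracted by a single affine transform, which is consistent with the removal of .Lbyte_ones and of the %k1 handling listed above.
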
Benchmark on Intel Core i3-1115G4:
Before:
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.194 ns/B      4920 MiB/s     0.794 c/B    4096±4
        ECB dec |     0.194 ns/B      4916 MiB/s     0.793 c/B      4089
After (~1.7% faster):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        ECB enc |     0.190 ns/B      5008 MiB/s     0.780 c/B    4096±3
        ECB dec |     0.191 ns/B      5002 MiB/s     0.781 c/B    4096±3

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>