camellia: add amd64 GFNI/AVX512 implementation

* cipher/Makefile.am: Add 'camellia-gfni-avx512-amd64.S'.
* cipher/bulkhelp.h (bulk_ocb_prepare_L_pointers_array_blk64): New.
* cipher/camellia-aesni-avx2-amd64.h: Rename internal functions from
"__camellia_???" to "FUNC_NAME(???)"; Minor changes to comments.
* cipher/camellia-gfni-avx512-amd64.S: New.
* cipher/camellia-glue.c (USE_GFNI_AVX512): New.
(CAMELLIA_context): Add 'use_gfni_avx512'.
(_gcry_camellia_gfni_avx512_ctr_enc, _gcry_camellia_gfni_avx512_cbc_dec)
(_gcry_camellia_gfni_avx512_cfb_dec, _gcry_camellia_gfni_avx512_ocb_enc)
(_gcry_camellia_gfni_avx512_ocb_dec)
(_gcry_camellia_gfni_avx512_enc_blk64)
(_gcry_camellia_gfni_avx512_dec_blk64, avx512_burn_stack_depth): New.
(camellia_setkey): Use GFNI/AVX512 if supported by CPU.
(camellia_encrypt_blk1_64, camellia_decrypt_blk1_64): New.
(_gcry_camellia_ctr_enc, _gcry_camellia_cbc_dec, _gcry_camellia_cfb_dec)
(_gcry_camellia_ocb_crypt) [USE_GFNI_AVX512]: Add GFNI/AVX512 code path.
(_gcry_camellia_xts_crypt): Change parallel block size from 32 to 64.
(selftest_ctr_128, selftest_cbc_128, selftest_cfb_128): Increase test
block size.
* cipher/chacha20-amd64-avx512.S: Clear k-mask registers with xor.
* cipher/poly1305-amd64-avx512.S: Likewise.
* cipher/sha512-avx512-amd64.S: Likewise.
---

Benchmark on Intel i3-1115G4 (tigerlake):
Before (GFNI/AVX2):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC dec |     0.356 ns/B      2679 MiB/s      1.46 c/B      4089
        CFB dec |     0.374 ns/B      2547 MiB/s      1.53 c/B      4089
        CTR enc |     0.409 ns/B      2332 MiB/s      1.67 c/B      4089
        CTR dec |     0.406 ns/B      2347 MiB/s      1.66 c/B      4089
        XTS enc |     0.430 ns/B      2216 MiB/s      1.76 c/B      4090
        XTS dec |     0.433 ns/B      2201 MiB/s      1.77 c/B      4090
        OCB enc |     0.460 ns/B      2071 MiB/s      1.88 c/B      4089
        OCB dec |     0.492 ns/B      1939 MiB/s      2.01 c/B      4089

After (GFNI/AVX512):
 CAMELLIA128    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
        CBC dec |     0.207 ns/B      4600 MiB/s     0.827 c/B      3989
        CFB dec |     0.207 ns/B      4610 MiB/s     0.825 c/B      3989
        CTR enc |     0.218 ns/B      4382 MiB/s     0.868 c/B      3990
        CTR dec |     0.217 ns/B      4389 MiB/s     0.867 c/B      3990
        XTS enc |     0.330 ns/B      2886 MiB/s      1.35 c/B    4097±4
        XTS dec |     0.328 ns/B      2904 MiB/s      1.35 c/B    4097±3
        OCB enc |     0.246 ns/B      3879 MiB/s     0.981 c/B      3990
        OCB dec |     0.247 ns/B      3855 MiB/s     0.987 c/B      3990

CBC dec: 70% faster
CFB dec: 80% faster
CTR: 87% faster
XTS: 31% faster
OCB: 92% faster

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>