A few comments on the patches.
blake2-avx512: merge some of the gather loads
I've started wondering how much of this slowdown is due to MinGW libc not having well-optimized memcpy/memmove/memchr/strlen/etc. Are there profiling tools, like 'perf' on Linux, that could be used for Windows builds?
blake2b-avx512: replace VPGATHER with manual gather
twofish-avx2-amd64: replace VPGATHER with manual gather
Avoid VPGATHER usage for most of Intel CPUs
hwf-x86: use CFI statements for 'is_cpuid_available'
configure: fix HAVE_GCC_ASM_CFI_DIRECTIVES check
Add VAES/AVX2 accelerated i386 implementation for AES
rijndael-vaes-avx2-amd64: avoid extra load in CFB & CBC IV handling
rijndael-vaes-avx2-amd64: acceleration for OCB auth
The problem with SHA-256 on x86-64 is that it took a long time for Intel to introduce SHA acceleration (SHA-1 & SHA-256) into their mainstream CPU products.
keccak: add md_read support for SHAKE algorithms
addm/subm/mulm: fix case when destination is same MPI as divider
twofish-avx2: de-unroll round function
serpent: add x86/AVX512 implementation
mpi: optimize mpi_rshift and mpi_lshift to avoid extra MPI copying
mpi/amd64: optimize add_n and sub_n
mpi: avoid MPI copy at gcry_mpi_sub
mpi/amd64: fix use of 'movd' for 64-bit register move in lshift&rshift
bench-slope: add MPI benchmarking
cipher: restore weak-key error-code after mode specific setkey
Here's a fix for mode-specific setkey clearing the error code:
Revert "cipher: Fix edge case for SET_ALLOW_WEAK_KEY."
doc: add documentation for GCRYCTL_SET_ALLOW_WEAK_KEY
About the error code: you need to use gcry_err_code() to get the value.
I'll add documentation for GCRYCTL_SET_ALLOW_WEAK_KEY, which was missing from the original commit.
tests/basic now actually fails because setkey does not return GPG_ERR_WEAK_KEY for weak keys with GCRYCTL_SET_ALLOW_WEAK_KEY.
That's right. With GCRYCTL_SET_ALLOW_WEAK_KEY, setkey still returns GPG_ERR_WEAK_KEY when a weak key is detected. However, the cipher handle can still be used as if setkey had succeeded.
cipher-gcm-ppc: tweak loop structure a bit
Here's the mirroring script that is currently in place:
camellia-simd128: use 8-bit right shift for rotate function
camellia-gfni: use GFNI for uint8 right shift in FLS
rijndael-ppc: use vector registers for key schedule calculations
Add PowerPC vector implementation of SM4
camellia-simd128: faster sbox filtering with uint8 right shift
chacha20-ppc: do not generate p9 code when target attr unavailable
Fix "'inline' is not at beginning of declaration" warnings
Improve PPC target function attribute checks
camellia: add AArch64 crypto-extension implementation
camellia: add POWER8/POWER9 vcrypto implementation
aes-amd64-vaes: fix fast exit path in XTS function
chacha20-ppc: use target and optimize attributes for P8 and P9
ppc: add support for clang target attribute
aes-ppc: use target and optimize attributes for P8 and P9
aes-ppc: add CTR32LE bulk acceleration
aes-ppc: add ECB bulk acceleration for benchmarking purposes
sha2-ppc: better optimization for POWER9
camellia-aesni-avx: speed up for round key broadcasting
camellia-gfni-avx512: speed up for round key broadcasting
camellia-avx2: speed up for round key broadcasting
camellia-avx2: add fast path for full 32 block ECB input
camellia: add CTR-mode byte addition for AVX/AVX2/AVX512 impl.
camellia-aesni-avx: add acceleration for ECB/XTS/CTR32LE modes
sm4: add CTR-mode byte addition for AVX/AVX2/AVX512 implementations
aes-vaes-avx2: improve case when only CTR needs carry handling
aria-avx2: add VAES accelerated implementation
aria-avx512: small optimization for aria_diff_m
aria-avx: small optimization for aria_ark_8way
aria: add x86_64 GFNI/AVX512 accelerated implementation
aria: add x86_64 AESNI/GFNI/AVX/AVX2 accelerated implementations
asm-common-aarch64: fix read-only section for Windows target
aarch64-asm: align functions to 16 bytes
s390x-asm: move constant data to read-only section
aarch64-asm: move constant data to read-only section
powerpc-asm: move constant data to read-only section
mpi/amd64: align functions and inner loops to 16 bytes
amd64-asm: align functions to 16 bytes for cipher algos
amd64-asm: move constant data to read-only section for hash/mac algos
amd64-asm: move constant data to read-only section for cipher algos
tests/bench-slope: skip CPU warm-up in regression tests
tests/basic: perform x86 vector cluttering only when __SSE2__ is set
tests/basic: fix clutter vector register asm for amd64 and i386
avx512: tweak zmm16-zmm31 register clearing
aria: add generic 2-way bulk processing
bulkhelp: change bulk function definition to allow modifying context
sm4: add missing OCB 16-way GFNI-AVX512 path
Fix compiler warnings seen with clang-powerpc64le target
Add GMAC-SM4 and Poly1305-SM4
Add clang support for ARM 32-bit assembly
rijndael-ppc: fix wrong inline assembly constraint
Fix building AVX512 Intel-syntax assembly with x86-64 clang
avx512: tweak AVX512 spec stop, use common macro in assembly
chacha20-avx512: add handling for any input block count and tweak 16 block code…
Any comments on applying these to gnupg-2.2?
sha3-avx512: fix for "x32" target
twofish: accelerate XTS and ECB modes
serpent: fix compiler warning on 32-bit ARM
serpent: accelerate XTS and ECB modes
sm4: accelerate ECB (for benchmarking)