TODO list
- x86-64 implementations
- SHA512
- Implementation using new SHA512 instructions
- SM3
- Implementation using new SM3 instructions
- SM4
- Implementation using new SM4 instructions
- ADX implementation of large integer multiply
- AVX512-IFMA for large integer multiply
- ARMv8 64bit (& 32bit) implementations
- Port Serpent ARMv7/NEON implementation to 64bit
- Power vcrypto
- Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
- RISC-V 32/64 implementations
- Support for scalar crypto extensions (AES, SHA, SM3/4, etc)
- Support for vector crypto extensions (AES, SHA, SM3/4, etc)
- Vector accelerated ChaCha20
- Vector accelerated Blake2
- RISC-V support for generic SIMD Camellia
- Support for more crypto instruction sets on different architectures
- SPARC T4 crypto instruction set
- Performance optimizations for curve 25519
- https://marc.info/?l=gcrypt-devel&m=153295947908984&w=2
- Maybe use mixed asm/C approach as used with poly1305.c
DONE list
- x86-64 implementations
- Newer CPUs can use AVX-512 without a high frequency penalty. To enable AVX-512 only on such newer CPUs, the cpuid check should test for the newest AVX-512 features, such as AVX512-VBMI2.
- Implementations done/checked:
- AES (AVX-512 + VAES) (done)
- Turns out that VAES/AVX512 is slightly slower than VAES/AVX2. Tested CBC-DEC on tigerlake; no frequency drop observed. Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. This paper from Intel gives an estimate of 0.16 c/B as the theoretical limit for "icelake", which is the generation preceding tigerlake (https://eprint.iacr.org/2018/392.pdf).
- AES-VAES-AVX2-CTR on zen4: 0.163cpb.
- GCM (AVX-512 + VPCLMUL) (done)
- Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
- VPCLMUL/AVX2 close to twice as fast on AMD zen3/zen4 (zen3: 0.219 cpb, zen4: 0.212cpb)
- VPCLMUL/AVX512 close to twice as fast on Intel tigerlake (0.181 cpb) and ~2.4x faster on AMD zen4 (0.171cpb)
- ChaCha20+Poly1305 / AVX512 (done)
- AVX512 implementation done.
- Results on tigerlake:
- ChaCha20 stream: 0.665 cpb
- Poly1305 MAC: 0.247 cpb
- ChaCha20+Poly1305 AEAD: 0.907 cpb
- Results on zen4:
- ChaCha20 stream: 0.711 cpb
- Poly1305 MAC: 0.247 cpb
- ChaCha20+Poly1305 AEAD: 0.964 cpb
- Camellia (AVX-512 + GFNI) (done)
- GFNI/AVX2 implementation done. Camellia128-CTR:
- 1.67cpb on tigerlake
- 1.30cpb on zen4
- GFNI/AVX512 implementation done. Camellia128-CTR:
- 0.868cpb on tigerlake
- 0.876cpb on zen4
- SHA512 / AVX512 (done)
- AVX512 with 256-bit/128-bit registers:
- 4.76 cpb on tigerlake
- 4.88 cpb on zen4 (note: sha512-avx2-bmi2 is faster on zen4, 4.35cpb, c0f85e0c8657030eb979a465199a07e2819f81e4)
- SM4 (AVX-512 + GFNI) (done)
- GFNI/AVX2 implementation done. SM4-CTR:
- 2.71cpb on tigerlake
- 2.38cpb on zen4
- GFNI/AVX512, SM4-CTR:
- 1.25cpb on tigerlake
- 1.51cpb on zen4
- SHA3 / AVX512 (done)
- SHA3-256, AVX512 (impl. using lower 64bit on 128bit registers):
- 5.72cpb on tigerlake
- 5.55cpb on zen4
- Blake2 / AVX512 (done)
- BLAKE2b AVX512 (256bit vectors):
- 2.88cpb on tigerlake
- 3.63cpb on zen4
- BLAKE2s AVX512 (128bit vectors):
- 4.18cpb on tigerlake
- 5.32cpb on zen4
- Twofish
- Wider AVX512 registers could give additional performance
- Performance is limited by table lookups
- Serpent / AVX512 (done)
- AVX512/CTR: 2.52cpb on zen4
- ARMv8 64bit (& 32bit) implementations
- SHA512 ARMv8.4 crypto extension acceleration (done)
- 2.54cpb on AWS Graviton3
- SHA3 ARMv8.4 crypto extension acceleration (not feasible)
- SHA3 instruction set "accelerated" implementation is nearly two times slower than generic C version on AWS Graviton3:
- SHA3 ARMv8.4 implementation from CRYPTOGAMS: 11.28 c/B
- SHA3 generic C implementation in libgcrypt: 6.05 c/B
- Use SHA3 ARMv8.4 instructions to accelerate Blake2b (rotate and XOR instruction + three-way XOR instruction) (not feasible)
- AdvSIMD + SHA3/CE instruction set implementation is slower than generic C version on AWS Graviton3:
- Blake2b AdvSIMD+XAR implementation: 5.09 c/B
- Blake2b generic C implementation in libgcrypt: 3.21 c/B
- Port Camellia aesni/avx implementation to ARM-CE AES 64bit
- ECB on Graviton2: 6.93cpb
- Port stitched Chacha20-Poly1305 ARMv8/AArch64 implementation to 32bit (not planned as 32bit ARM seems to be sunsetting)
- Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit (not planned as 32bit ARM seems to be sunsetting)
- Power vcrypto
- Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
- ECB on POWER9: 7.48cpb
- AES CFB/CBC dec/enc fine tuning, as was done for x86 and aarch64. (done)
- CFB-enc/CBC-enc on POWER8: 3.86-3.89c/B
- RISC-V 32/64 implementations
- Add MPI longlong.h support (handled with new generic 'long long' (on 32bit) and '__int128' (on 64bit) macros)
- Needs 32x32->64bit and 64x64->128bit multiply macros.
- Generic addition and subtraction macros are as good as it gets since RISC-V does not have a carry flag.
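
The RISC-V longlong.h note above can be illustrated with a minimal C sketch. It is not the code in libgcrypt's mpi/longlong.h: the helper names (mul_32x32_64, mul_64x64_128, add_2limb, sub_2limb, example_usage) are hypothetical and written as inline functions rather than macros. It assumes a compiler that provides unsigned __int128 on 64-bit targets (GCC and Clang do). The widening multiplies go through a double-width integer type, and the two-limb add/subtract recover the carry or borrow with a comparison, since there is no carry flag to read.

```c
#include <assert.h>
#include <stdint.h>

/* All helper names below are hypothetical and only for illustration. */

/* 32x32->64 bit multiply through a 64-bit ('long long' sized) product. */
static inline void mul_32x32_64(uint32_t a, uint32_t b,
                                uint32_t *hi, uint32_t *lo)
{
  uint64_t p = (uint64_t)a * b;
  *hi = (uint32_t)(p >> 32);
  *lo = (uint32_t)p;
}

/* 64x64->128 bit multiply through unsigned __int128.
 * Assumption: compiler extension available on 64-bit targets. */
static inline void mul_64x64_128(uint64_t a, uint64_t b,
                                 uint64_t *hi, uint64_t *lo)
{
  unsigned __int128 p = (unsigned __int128)a * b;
  *hi = (uint64_t)(p >> 64);
  *lo = (uint64_t)p;
}

/* (sh:sl) = (ah:al) + (bh:bl). The carry out of the low limb is
 * detected by checking whether the low sum wrapped around. */
static inline void add_2limb(uint64_t ah, uint64_t al,
                             uint64_t bh, uint64_t bl,
                             uint64_t *sh, uint64_t *sl)
{
  uint64_t l = al + bl;
  *sh = ah + bh + (l < al);
  *sl = l;
}

/* (dh:dl) = (ah:al) - (bh:bl). The borrow from the low limb is
 * detected by comparing the low operands. */
static inline void sub_2limb(uint64_t ah, uint64_t al,
                             uint64_t bh, uint64_t bl,
                             uint64_t *dh, uint64_t *dl)
{
  uint64_t l = al - bl;
  *dh = ah - bh - (al < bl);
  *dl = l;
}

/* Small usage example: (2^64 - 1) + 1 carries into the high limb. */
static inline void example_usage(void)
{
  uint64_t hi, lo;
  add_2limb(0, UINT64_MAX, 0, 1, &hi, &lo);
  assert(hi == 1 && lo == 0);
}
```

On RISC-V a compiler would typically lower the 128-bit product to a mul/mulhu pair and the carry comparison to sltu, which is why the generic forms leave little room for hand-written improvement.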