TODO list
- x86-64 implementations
- Newer CPUs can use AVX-512 without a large frequency penalty. To enable AVX-512 for such newer CPUs, the cpuid check should test for recent AVX-512 features such as AVX512-VBMI2.
- Implementations to consider:
- ~~AES (AVX-512 + VAES)~~ (done)
- Turns out that VAES/AVX512 is slightly slower than VAES/AVX2. Tested CBC-DEC on tigerlake; no frequency drop observed. Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. A paper from Intel estimates a theoretical limit of 0.16 c/B for "icelake", the generation preceding tigerlake (https://eprint.iacr.org/2018/392.pdf).
- AES-VAES-AVX2-CTR on zen4: 0.163cpb.
- ~~GCM (AVX-512 + VPCLMUL)~~ (done)
- Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
- VPCLMUL/AVX2 close to twice as fast on AMD zen3/zen4 (zen3: 0.219 cpb, zen4: 0.212cpb)
- VPCLMUL/AVX512 close to twice as fast on Intel tigerlake (0.181 cpb) and ~2.4x faster on AMD zen4 (0.171cpb)
- ~~ChaCha20+Poly1305~~ (done)
- AVX512 implementation done.
- Results on tigerlake:
- ChaCha20 stream: 0.665 cpb
- Poly1305 MAC: 0.247 cpb
- ChaCha20+Poly1305 AEAD: 0.907 cpb
- Results on zen4:
- ChaCha20 stream: 0.711 cpb
- Poly1305 MAC: 0.247 cpb
- ChaCha20+Poly1305 AEAD: 0.964 cpb
- ~~Camellia (AVX-512 + GFNI)~~ (done)
- GFNI/AVX2 implementation done. Camellia128-CTR:
- 1.67cpb on tigerlake
- 1.30cpb on zen4
- GFNI/AVX512 implementation done. Camellia128-CTR:
- 0.868cpb on tigerlake
- 0.876cpb on zen4
- ~~SHA512~~ (done)
- AVX512 with 256-bit/128-bit registers:
- 4.76 cpb on tigerlake
- 4.88 cpb on zen4 (note: sha512-avx2-bmi2 is faster on zen4, 4.35cpb, c0f85e0c8657030eb979a465199a07e2819f81e4)
- ~~SM4 (AVX-512 + GFNI)~~ (done)
- GFNI/AVX2 implementation done. SM4-CTR:
- 2.71cpb on tigerlake
- 2.38cpb on zen4
- GFNI/AVX512, SM4-CTR:
- 1.25cpb on tigerlake
- 1.51cpb on zen4
- ~~SHA3~~ (done)
- SHA3-256, AVX512 (impl. using lower 64bit on 128bit registers):
- 5.72cpb on tigerlake
- 5.55cpb on zen4
- ~~Blake2~~ (done)
- Check if EVEX encoded VPGATHERDD/VPGATHERDQ is faster than the VPINSRD/VPINSRQ method
- For BLAKE2b, VPGATHERDQ was faster. For BLAKE2s, VMOVD+VPINSRD was faster (on tigerlake).
- BLAKE2b AVX512 (256bit vectors):
- 2.88cpb on tigerlake
- 3.72cpb on zen4
- BLAKE2s AVX512 (128bit vectors):
- 4.18cpb on tigerlake
- 5.32cpb on zen4
- Twofish
- Wider AVX512 registers and EVEX encoded VPGATHERQQ could give additional performance
- Serpent
- ~~VPSHUFB could be used to implement sboxes with one instruction. This could be done with AVX2 in addition to AVX512.~~ ... but this makes the linear transformation in the round function much more complex, so it is unlikely to be faster than the current bit-logic sboxes.
- Using the vpternlogq instruction would reduce the number of bit-logic instructions per sbox. This could be used with 128b and 256b vector implementations in addition to 512b vectors. A vector rotate instruction would speed up the linear transform part of the round function.
- Wider AVX512 registers could give ~30-50% additional performance vs 256b AVX2.
- ADX implementation of large integer multiply
- AVX512-IFMA for large integer multiply
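As an illustration of the vpternlogq note under Serpent above: the instruction evaluates an arbitrary three-input boolean function per bit, selected by an 8-bit truth table, so a chain of two or three bit-logic operations can collapse into one instruction. The following is only a scalar model of one 64-bit lane, not code from libgcrypt. The names `ternlog64`, `frag_plain`, `TERN_A/B/C` and `FRAG_IMM8` are hypothetical, and the sample expression is an arbitrary stand-in, not an actual Serpent sbox term.

```c
#include <stdint.h>

/* Truth-table inputs. Evaluating a boolean expression on these three
 * constants yields the imm8 that selects the same function.
 * (Hypothetical helper names, for illustration only.) */
#define TERN_A 0xF0u
#define TERN_B 0xCCu
#define TERN_C 0xAAu

/* Scalar model of one 64-bit lane of VPTERNLOGQ: for every bit position,
 * the bits of a, b and c form a 3-bit index (a is the most significant
 * bit) and the result bit is imm8[index]. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
  uint64_t r = 0;
  int i;

  for (i = 0; i < 64; i++)
    {
      unsigned int idx = (unsigned int)((((a >> i) & 1) << 2)
                                        | (((b >> i) & 1) << 1)
                                        | ((c >> i) & 1));
      r |= (uint64_t)((imm8 >> idx) & 1) << i;
    }
  return r;
}

/* Arbitrary example fragment: AND, NOT, OR and XOR as separate
 * operations when written with plain bit-logic. */
static uint64_t frag_plain(uint64_t a, uint64_t b, uint64_t c)
{
  return (a & b) ^ (~c | a);
}

/* The same fragment expressed as a single truth table. */
#define FRAG_IMM8 ((uint8_t)((TERN_A & TERN_B) ^ (~TERN_C | TERN_A)))
```

For any inputs, `ternlog64(a, b, c, FRAG_IMM8)` gives the same value as `frag_plain(a, b, c)`, and `ternlog64(a, b, c, 0x96)` is the three-way XOR `a ^ b ^ c`. In a real vector implementation the loop would be replaced by the instruction itself.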
- ARMv8 64bit (& 32bit) implementations
- ~~SHA512 ARMv8.4 crypto extension acceleration~~ (done)
- 2.54cpb on AWS Graviton3
- ~~SHA3 ARMv8.4 crypto extension acceleration~~ (not feasible)
- SHA3 instruction set "accelerated" implementation is nearly two times slower than generic C version on AWS Graviton3:
- SHA3 ARMv8.4 implementation from CRYPTOGAMS: 11.28 c/B
- SHA3 generic C implementation in libgcrypt: 6.05 c/B
- ~~Use SHA3 ARMv8.4 instructions to accelerate Blake2b (rotate and XOR instruction + three-way XOR instruction)~~ (not feasible)
- AdvSIMD + SHA3/CE instruction set implementation is slower than generic C version on AWS Graviton3:
- Blake2b AdvSIMD+XAR implementation: 5.09 c/B
- Blake2b generic C implementation in libgcrypt: 3.21 c/B
- Port Camellia aesni/avx implementation to ARM-CE AES 64bit
- Port Serpent ARMv7/NEON implementation to 64bit
- ~~Port stitched Chacha20-Poly1305 ARMv8/AArch64 implementation to 32bit~~ (not planned as 32bit ARM seems to be sunsetting)
- ~~Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit~~ (not planned as 32bit ARM seems to be sunsetting)
- Power vcrypto
- Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
- Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
- ~~AES CFB/CBC dec/enc fine tuning, as was done for x86 and aarch64.~~ (done)
- CFB-enc/CBC-enc on POWER8: 3.86-3.89c/B
- RISC-V 32/64 implementations
- ~~Add MPI longlong.h support~~ (handled with new generic 'long long' (on 32bit) and '__int128' (on 64bit) macros)
- ~~Needs 32/32->64bit and 64/64->128bit multiply macros.~~
- ~~Generic addition and subtraction macros are as good as it gets since RISC-V does not have a carry flag.~~
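The RISC-V notes above can be sketched roughly as follows. This is a minimal illustration, assuming a 64-bit target and a compiler that provides `unsigned __int128` (GCC and Clang do), not the actual macros in mpi/longlong.h. The function names `mul_64x64_128`, `add_128` and `sub_128` are hypothetical. Without a carry flag, the carry or borrow is recovered with an unsigned comparison.

```c
#include <stdint.h>

/* 64x64 -> 128 bit multiply via the compiler's 128-bit integer type. */
static void mul_64x64_128(uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
{
  unsigned __int128 p = (unsigned __int128)u * v;

  *hi = (uint64_t)(p >> 64);
  *lo = (uint64_t)p;
}

/* 128-bit addition (ah:al) + (bh:bl) without a carry flag: the low sum
 * wrapped around exactly when it is smaller than one of its operands. */
static void add_128(uint64_t *sh, uint64_t *sl,
                    uint64_t ah, uint64_t al, uint64_t bh, uint64_t bl)
{
  uint64_t lo = al + bl;
  uint64_t carry = lo < al;

  *sh = ah + bh + carry;
  *sl = lo;
}

/* 128-bit subtraction (ah:al) - (bh:bl): a borrow occurs when al < bl. */
static void sub_128(uint64_t *dh, uint64_t *dl,
                    uint64_t ah, uint64_t al, uint64_t bh, uint64_t bl)
{
  uint64_t borrow = al < bl;

  *dh = ah - bh - borrow;
  *dl = al - bl;
}
```

On a 32-bit target the same pattern would use `uint32_t` operands with a `uint64_t` (`long long`) product.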
- Support for more crypto instruction sets on different architectures
- SPARC T4 crypto instruction set
- Performance optimizations for curve 25519
- https://marc.info/?l=gcrypt-devel&m=153295947908984&w=2
- Maybe use mixed asm/C approach as used with poly1305.c