Change Details

TODO list

- x86-64 implementations
  - Newer CPUs can use AVX-512 without a high frequency penalty. To enable AVX-512 only on such CPUs, the cpuid check should test for one of the newest AVX-512 features, such as AVX512-VBMI2.
  - Implementations to consider:
    - ~~AES (AVX-512 + VAES)~~ (done)
      - Turns out that VAES/AVX512 is slightly slower than VAES/AVX2. Tested CBC-DEC on tigerlake; no frequency drop observed. Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. A paper from Intel estimates a theoretical limit of 0.16 c/B for "icelake", the generation before tigerlake (https://eprint.iacr.org/2018/392.pdf).
    - ~~GCM (AVX-512 + VPCLMUL)~~ (done)
      - Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
        - VPCLMUL/AVX2 is close to twice as fast on AMD zen3 (0.219 cpb).
        - VPCLMUL/AVX512 is close to twice as fast on Intel tigerlake (0.181 cpb).
    - ~~ChaCha20+Poly1305~~ (done)
      - AVX512 implementation done. Results on tigerlake:
        - ChaCha20 stream: 0.665 cpb
        - Poly1305 MAC: 0.247 cpb
        - ChaCha20+Poly1305 AEAD: 0.907 cpb
    - ~~Camellia (AVX-512 + GFNI)~~ (done)
      - GFNI/AVX2 implementation done. Camellia128-CTR: 1.67 cpb on tigerlake.
      - GFNI/AVX512 implementation done. Camellia128-CTR: 0.868 cpb on tigerlake.
    - ~~SHA512~~ (done)
      - AVX512 with 256-bit/128-bit registers: 4.76 cpb on tigerlake
    - ~~SM4 (AVX-512 + GFNI)~~ (done)
      - GFNI/AVX2 implementation done. SM4-CTR: 2.71 cpb on tigerlake.
      - GFNI/AVX512, SM4-CTR: 1.25 cpb on tigerlake
    - ~~SHA3~~ (done)
      - AVX512 (impl. using lower 64bit of 128bit registers): 5.72 cpb on tigerlake
    - ~~Blake2~~ (done)
      - Check if EVEX encoded VPGATHERDD/VPGATHERDQ is faster than the VPINSRD/VPINSRQ method
        - For BLAKE2b, VPGATHERDQ was faster. For BLAKE2s, VMOVD+VPINSRD was faster (on tigerlake).
      - BLAKE2b AVX512 (256bit vectors): 2.88 cpb on tigerlake
      - BLAKE2s AVX512 (128bit vectors): 4.18 cpb on tigerlake
    - Twofish
      - Wider AVX512 registers and EVEX encoded VPGATHERQQ could give additional performance
    - Serpent
      - ~~VPSHUFB could be used to implement sboxes with one instruction. This could be done with AVX2 in addition to AVX512.~~ ... but this would make the linear transformation in the round function much more complex, so it is unlikely to be faster than the current bit-logic sboxes.
      - Use of the vpternlogq instruction would reduce the number of bit-logic instructions per SBOX. This could be used with 128b and 256b vector implementations in addition to 512b vectors.
      - Wider AVX512 registers could give ~30-50% additional performance vs 256b AVX2.
    - ADX implementation of large integer multiply
    - AVX512-IFMA for large integer multiply
- ARMv8 64bit (& 32bit) implementations
  - ~~SHA512 ARMv8.4 crypto extension acceleration~~ (done)
    - 2.54 cpb on AWS Graviton3
  - ~~SHA3 ARMv8.4 crypto extension acceleration~~ (not feasible)
    - SHA3 instruction set "accelerated" implementation is nearly two times slower than the generic C version on AWS Graviton3:
      - SHA3 ARMv8.4 implementation from CRYPTOGAMS: 11.28 c/B
      - SHA3 generic C implementation in libgcrypt: 6.05 c/B
  - ~~Use SHA3 ARMv8.4 instructions to accelerate Blake2b (rotate and XOR instruction + three-way XOR instruction)~~ (not feasible)
    - AdvSIMD + SHA3/CE instruction set implementation is slower than the generic C version on AWS Graviton3:
      - Blake2b AdvSIMD+XAR implementation: 5.09 c/B
      - Blake2b generic C implementation in libgcrypt: 3.21 c/B
  - Port Camellia aesni/avx implementation to ARM-CE AES 64bit
  - Port Serpent ARMv7/NEON implementation to 64bit
  - ~~Port stitched Chacha20-Poly1305 ARMv8/AArch64 implementation to 32bit~~ (not planned as 32bit ARM seems to be sunsetting)
  - ~~Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit~~ (not planned as 32bit ARM seems to be sunsetting)
- Power vcrypto
  - Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
  - Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
  - ~~AES CFB/CBC dec/enc fine tuning, like was done for x86 and aarch64.~~ (done)
    - CFB-enc/CBC-enc on POWER8: 3.86-3.89 c/B
- RISC-V 32/64 implementations
  - Add MPI longlong.h support
    - Needs 32x32->64bit and 64x64->128bit multiply macros.
    - Generic addition and subtraction macros are as good as it gets since RISC-V does not have a carry flag.
- Support for more crypto instruction sets on different architectures
  - SPARC T4 crypto instruction set
- Performance optimizations for curve 25519
  - https://marc.info/?l=gcrypt-devel&m=153295947908984&w=2
  - Maybe use mixed asm/C approach as used with poly1305.c
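The cpuid idea from the x86-64 notes (use a recent AVX-512 sub-feature as a proxy for "this CPU has no large AVX-512 frequency penalty") can be sketched as a pure decoding function. This is only an illustration, not libgcrypt's actual detection code: the function name `prefer_avx512` and the macro names are hypothetical, and the register values are assumed to come from CPUID leaf 7, subleaf 0 (for example via `__get_cpuid_count` from GCC's `<cpuid.h>`). A complete check would also have to confirm OS support for the AVX-512 register state, which is omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bits in CPUID.(EAX=7,ECX=0); macro names are hypothetical. */
#define CPUID7_EBX_AVX512F       (UINT32_C(1) << 16) /* AVX-512 Foundation */
#define CPUID7_ECX_AVX512_VBMI2  (UINT32_C(1) << 6)  /* AVX512-VBMI2 */

/* Return true when AVX-512 code paths look worthwhile: the CPU must have
 * AVX512F at all, and must also report AVX512-VBMI2, which is used here
 * only as a marker for newer microarchitectures. */
static bool prefer_avx512(uint32_t leaf7_ebx, uint32_t leaf7_ecx)
{
  bool has_avx512f = (leaf7_ebx & CPUID7_EBX_AVX512F) != 0;
  bool has_vbmi2 = (leaf7_ecx & CPUID7_ECX_AVX512_VBMI2) != 0;

  return has_avx512f && has_vbmi2;
}
```

With this gating, a CPU that reports AVX512F but not AVX512-VBMI2 keeps using the AVX2 code paths.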
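To make the vpternlogq remark in the Serpent notes concrete, here is a scalar C model of what the instruction computes on each 64-bit lane: an arbitrary three-input boolean function selected by an 8-bit truth table. The function name `ternlog64` is hypothetical and the code is a sketch of the semantics only, not part of any Serpent implementation. In hardware the whole function is a single instruction, which is why a chain of AND/OR/XOR/NOT operations in a bit-sliced sbox can collapse into fewer instructions.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of vpternlogq.
 * For every bit position, the bits of a, b and c form the index
 * (a << 2) | (b << 1) | c, and the result bit is imm8[index]. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
  uint64_t result = 0;
  unsigned int idx;

  /* OR together the minterms whose truth-table entry is set. */
  for (idx = 0; idx < 8; idx++)
    {
      uint64_t ma, mb, mc;

      if (!((imm8 >> idx) & 1))
        continue;

      ma = (idx & 4) ? a : ~a;
      mb = (idx & 2) ? b : ~b;
      mc = (idx & 1) ? c : ~c;
      result |= ma & mb & mc;
    }

  return result;
}
```

The truth table for a given expression can be derived by evaluating it on the constants 0xF0 (for a), 0xCC (for b) and 0xAA (for c). For example, `a ^ b ^ c` gives 0x96 and `(a & b) | c` gives 0xEA, so each of these two-operation expressions maps to one ternary-logic operation.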
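The RISC-V longlong.h items can be illustrated with plain C helpers. This is a sketch under stated assumptions rather than the macros libgcrypt would actually add: the function names are hypothetical, and the 64-bit widening multiply relies on the `unsigned __int128` extension of GCC and Clang. The addition and subtraction helpers show the generic approach for a target without a carry flag, where the carry or borrow is recovered with an unsigned comparison.

```c
#include <stdint.h>

/* 32x32 -> 64-bit multiply, split into high and low words. */
static inline void mul_32x32_64(uint32_t *hi, uint32_t *lo,
                                uint32_t u, uint32_t v)
{
  uint64_t p = (uint64_t)u * v;

  *hi = (uint32_t)(p >> 32);
  *lo = (uint32_t)p;
}

/* 64x64 -> 128-bit multiply; assumes the compiler provides
 * unsigned __int128 (a GCC/Clang extension). */
static inline void mul_64x64_128(uint64_t *hi, uint64_t *lo,
                                 uint64_t u, uint64_t v)
{
  unsigned __int128 p = (unsigned __int128)u * v;

  *hi = (uint64_t)(p >> 64);
  *lo = (uint64_t)p;
}

/* Two-limb addition without a carry flag: the carry out of the low
 * limb is detected by checking whether the sum wrapped around. */
static inline void add_2limb(uint64_t *sh, uint64_t *sl,
                             uint64_t ah, uint64_t al,
                             uint64_t bh, uint64_t bl)
{
  uint64_t low = al + bl;

  *sh = ah + bh + (low < al);
  *sl = low;
}

/* Two-limb subtraction without a carry flag: a borrow is needed
 * exactly when the low limb of the minuend is smaller. */
static inline void sub_2limb(uint64_t *dh, uint64_t *dl,
                             uint64_t ah, uint64_t al,
                             uint64_t bh, uint64_t bl)
{
  uint64_t low = al - bl;

  *dh = ah - bh - (al < bl);
  *dl = low;
}
```

On RV64 a compiler would typically lower the 128-bit product to a `mul`/`mulhu` pair, and the comparisons to `sltu`, which matches the observation that the generic add and subtract macros are already about as good as the ISA allows.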

