Change Details

TODO list

- x86-64 AVX-512 implementations
  - Newer CPUs can use AVX-512 without a high frequency penalty. To enable AVX-512 only on such CPUs, the cpuid check should test for one of the newest AVX-512 features, such as AVX512-VBMI2.
  - Implementations to consider:
    - ~~AES (AVX-512 + VAES)~~
      - Turns out that VAES/AVX512 is slightly slower than VAES/AVX2. Tested CBC-DEC on tigerlake; no frequency drop observed. Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. This paper from Intel gives an estimate of 0.16 c/B as the theoretical limit for "icelake", which is the generation before tigerlake (https://eprint.iacr.org/2018/392.pdf).
    - GCM (AVX-512 + VPCLMUL)
      - Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
        - VPCLMUL/AVX2 close to twice as fast on AMD zen3.
        - VPCLMUL/AVX512 close to twice as fast on Intel tigerlake.
    - ChaCha20
    - SHA3
    - SHA512
      - AVX512 with 256-bit/128-bit registers: 4.78 cpb on tigerlake
    - Twofish
    - Serpent
    - Camellia (AVX-512 + GFNI)
    - SM4 (AVX-512 + GFNI)
- Power vcrypto
  - Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
  - Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
  - AES CFB/CBC dec/enc fine tuning, as was done for x86 and aarch64.
- ARMv8 64bit (& 32bit) implementations
  - Port Camellia aesni/avx implementation to ARM-CE AES 64bit(/32bit)
  - Port Serpent ARMv7/NEON implementation to 64bit
  - Port stitched Chacha20-Poly1305 ARMv8/AArch64 implementation to 32bit
  - Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit
- x86_64 / i386 implementations
  - ADX implementation of large integer multiply
- Support for more crypto instruction sets on different architectures
  - SPARC T4 crypto instruction set
- Performance optimizations for curve 25519
  - https://marc.info/?l=gcrypt-devel&m=153295947908984&w=2
  - Maybe use mixed asm/C approach as used with poly1305.c

TODO list

- x86-64 AVX-512 implementations
  - Newer CPUs can use AVX-512 without a high frequency penalty. To enable AVX-512 only on such CPUs, the cpuid check should test for one of the newest AVX-512 features, such as AVX512-VBMI2.
  - Implementations to consider:
    - ~~AES (AVX-512 + VAES)~~
      - Turns out that VAES/AVX512 is slightly slower than VAES/AVX2. Tested CBC-DEC on tigerlake; no frequency drop observed. Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. This paper from Intel gives an estimate of 0.16 c/B as the theoretical limit for "icelake", which is the generation before tigerlake (https://eprint.iacr.org/2018/392.pdf).
    - GCM (AVX-512 + VPCLMUL)
      - Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
        - VPCLMUL/AVX2 close to twice as fast on AMD zen3.
        - VPCLMUL/AVX512 close to twice as fast on Intel tigerlake.
    - ChaCha20
    - SHA3
    - SHA512
      - AVX512 with 256-bit/128-bit registers: 4.76 cpb on tigerlake
    - Twofish
    - Serpent
    - Camellia (AVX-512 + GFNI)
    - SM4 (AVX-512 + GFNI)
- Power vcrypto
  - Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
  - Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
  - AES CFB/CBC dec/enc fine tuning, as was done for x86 and aarch64.
- ARMv8 64bit (& 32bit) implementations
  - Port Camellia aesni/avx implementation to ARM-CE AES 64bit(/32bit)
  - Port Serpent ARMv7/NEON implementation to 64bit
  - Port stitched Chacha20-Poly1305 ARMv8/AArch64 implementation to 32bit
  - Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit
- x86_64 / i386 implementations
  - ADX implementation of large integer multiply
- Support for more crypto instruction sets on different architectures
  - SPARC T4 crypto instruction set
- Performance optimizations for curve 25519
  - https://marc.info/?l=gcrypt-devel&m=153295947908984&w=2
  - Maybe use mixed asm/C approach as used with poly1305.c

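The cpuid item under "x86-64 AVX-512 implementations" could be sketched roughly as follows. This is only an illustration under stated assumptions, not libgcrypt's actual detection code: the function names `want_avx512` and `detect_avx512_newer_cpu` are hypothetical, and the sketch relies on the GCC/Clang `<cpuid.h>` helpers. The bit positions are the documented ones: CPUID leaf 7, sub-leaf 0 reports AVX512F in EBX bit 16 and AVX512_VBMI2 in ECX bit 6, and XCR0 must have the SSE, AVX, opmask and ZMM state bits enabled by the OS.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helpers: __get_cpuid, __get_cpuid_count */
#endif

/* CPUID.(EAX=7,ECX=0) feature bits. */
#define CPUID7_EBX_AVX512F       (1u << 16)
#define CPUID7_ECX_AVX512_VBMI2  (1u << 6)

/* XCR0 state bits needed for AVX-512:
   SSE (1), AVX (2), opmask (5), ZMM_Hi256 (6), Hi16_ZMM (7). */
#define XCR0_AVX512_STATE 0xE6u

/* Hypothetical helper: decide from raw register values whether to take
   the AVX-512 code paths.  Requiring AVX512-VBMI2 in addition to AVX512F
   is the heuristic from the TODO item for selecting newer CPUs. */
int
want_avx512 (uint32_t leaf7_ebx, uint32_t leaf7_ecx, uint64_t xcr0)
{
  if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE)
    return 0;                       /* OS does not save ZMM/opmask state */
  if (!(leaf7_ebx & CPUID7_EBX_AVX512F))
    return 0;                       /* no AVX-512 foundation */
  return (leaf7_ecx & CPUID7_ECX_AVX512_VBMI2) != 0;
}

/* Hypothetical wrapper: query the running CPU.  Returns 0 on non-x86. */
int
detect_avx512_newer_cpu (void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  uint32_t xcr0_lo, xcr0_hi;

  /* Leaf 1, ECX bit 27 (OSXSAVE) must be set before XGETBV may be used. */
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
    return 0;

  /* Read XCR0 (extended control register 0). */
  __asm__ volatile ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));

  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;

  return want_avx512 (ebx, ecx, ((uint64_t) xcr0_hi << 32) | xcr0_lo);
#else
  return 0;
#endif
}
```

Keeping the decision in a function that takes plain register values makes the policy testable without the target hardware: a CPU that reports AVX512F but not AVX512-VBMI2 is rejected, which is the behaviour the TODO item asks for.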