libgcrypt performance TODOs
TODO list

  • x86-64 AVX-512 implementations
    • Newer CPUs can use AVX-512 without a high frequency penalty. To enable AVX-512 only on such CPUs, the CPUID check should test for one of the newer AVX-512 features, such as AVX512-VBMI2.
    • Implementations to consider:
      • AES (AVX-512 + VAES)
        • VAES/AVX512 turned out to be slightly slower than VAES/AVX2. Tested CBC-DEC on tigerlake; no frequency drop observed. Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. This paper from Intel gives an estimate of 0.16 c/B as the theoretical limit for "icelake", which is the generation preceding tigerlake (
      • GCM (AVX-512 + VPCLMUL)
        • Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
          • VPCLMUL/AVX2 close to twice as fast on AMD zen3 (0.219 cpb).
          • VPCLMUL/AVX512 close to twice as fast on Intel tigerlake (0.181 cpb).
      • ChaCha20+Poly1305
        • AVX512 implementation done. Results on tigerlake:
          • ChaCha20 stream: 0.665 cpb
          • Poly1305 MAC: 0.247 cpb
          • ChaCha20+Poly1305 AEAD: 0.907 cpb
      • SHA3
      • SHA512
        • AVX512 with 256-bit/128-bit registers: 4.76 cpb on tigerlake
      • Twofish
      • Serpent
      • Camellia (AVX-512 + GFNI)
        • GFNI/AVX2 implementation done. Camellia128-CTR: 1.67 cpb on tigerlake.
        • GFNI/AVX512 implementation done. Camellia128-CTR: 0.868 cpb on tigerlake.
      • SM4 (AVX-512 + GFNI)
        • GFNI/AVX2 implementation done. SM4-CTR: 2.71 cpb on tigerlake.
  • Power vcrypto
    • Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
    • Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
    • AES CFB/CBC dec/enc fine-tuning, as was done for x86 and aarch64.
  • ARMv8 64bit (& 32bit) implementations
    • Port Camellia aesni/avx implementation to ARM-CE AES 64bit(/32bit)
    • Port Serpent ARMv7/NEON implementation to 64bit
    • Port stitched ChaCha20-Poly1305 ARMv8/AArch64 implementation to 32bit
    • Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit
  • x86_64 / i386 implementations
    • ADX implementation of large integer multiply
    • AVX512-IFMA for large integer multiply
  • Support for more crypto instruction sets on different architectures
    • SPARC T4 crypto instruction set
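
The CPUID check described under the AVX-512 item above could look roughly like the following. This is a minimal sketch for GCC or Clang, not libgcrypt's actual detection code: the function and macro names are hypothetical, and treating AVX512-VBMI2 as the marker for "newer CPU without a large frequency penalty" is the assumption taken from the note above. The bit positions (AVX512F in leaf 7 EBX bit 16, AVX512-VBMI2 in leaf 7 ECX bit 6, OSXSAVE in leaf 1 ECX bit 27) and the XCR0 state bits are the architecturally defined ones. The decision logic is kept in a pure function so it can be tested without AVX-512 hardware.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID leaf 7, sub-leaf 0 feature bits. */
#define LEAF7_EBX_AVX512F      (1u << 16)
#define LEAF7_ECX_AVX512_VBMI2 (1u << 6)
/* CPUID leaf 1, ECX: the OS has enabled XSAVE/XGETBV. */
#define LEAF1_ECX_OSXSAVE      (1u << 27)
/* XCR0 bits: SSE (1), AVX (2), opmask (5), ZMM_Hi256 (6), Hi16_ZMM (7). */
#define XCR0_AVX512_STATE      0xe6u

/* Hypothetical helper: decide from raw register values whether AVX-512
 * code paths should be enabled.  Requires OS-enabled ZMM state, AVX512F,
 * and AVX512-VBMI2 (used here as the "newer CPU" marker). */
static int
avx512_usable_from_regs (uint32_t leaf7_ebx, uint32_t leaf7_ecx,
                         uint64_t xcr0)
{
  if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE)
    return 0;
  if (!(leaf7_ebx & LEAF7_EBX_AVX512F))
    return 0;
  return (leaf7_ecx & LEAF7_ECX_AVX512_VBMI2) != 0;
}

/* Hypothetical wrapper: query the running CPU.  Returns 0 on non-x86
 * targets or when any prerequisite is missing. */
static int
detect_avx512_usable (void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  uint32_t xcr0_lo, xcr0_hi;

  /* XGETBV may only be executed if the OS has set OSXSAVE. */
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx)
      || !(ecx & LEAF1_ECX_OSXSAVE))
    return 0;

  /* Read XCR0 (ECX = 0) to see which register state the OS saves. */
  __asm__ volatile ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));

  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;

  return avx512_usable_from_regs (ebx, ecx,
                                  ((uint64_t) xcr0_hi << 32) | xcr0_lo);
#else
  return 0;
#endif
}
```

A caller would invoke `detect_avx512_usable()` once at start-up and select the AVX-512 code paths only when it returns non-zero. Gating on a single recent feature bit keeps the check simple, at the cost of also excluding any older CPU that might run AVX-512 acceptably.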

