
libgcrypt performance TODOs
Open, Wishlist, Public

Description

TODO list

  • x86-64 AVX-512 implementations
    • Newer CPUs can use AVX-512 without a high frequency penalty. To enable AVX-512 only on such CPUs, the cpuid check should test for one of the newer AVX-512 features, such as AVX512-VBMI2.
    • Implementations to consider:
      • AES (AVX-512 + VAES)
        • It turns out that VAES/AVX512 is slightly slower than VAES/AVX2 (tested with CBC-DEC on tigerlake; no frequency drop observed). Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. This paper from Intel estimates a theoretical limit of 0.16 c/B for "icelake", the generation preceding tigerlake (https://eprint.iacr.org/2018/392.pdf).
      • GCM (AVX-512 + VPCLMUL)
        • Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
          • VPCLMUL/AVX2 close to twice as fast on AMD zen3 (0.219 cpb).
          • VPCLMUL/AVX512 close to twice as fast on Intel tigerlake (0.181 cpb).
      • ChaCha20+Poly1305
        • AVX512 implementation done. Results on tigerlake:
          • ChaCha20 stream: 0.665 cpb
          • Poly1305 MAC: 0.247 cpb
          • ChaCha20+Poly1305 AEAD: 0.907 cpb
      • SHA3
      • SHA512
        • AVX512 with 256-bit/128-bit registers: 4.76 cpb on tigerlake
      • Twofish
      • Serpent
      • Camellia (AVX-512 + GFNI)
        • GFNI/AVX2 implementation done. Camellia128-CTR: 1.67 cpb on tigerlake.
        • GFNI/AVX512 implementation done. Camellia128-CTR: 0.868 cpb on tigerlake.
      • SM4 (AVX-512 + GFNI)
        • GFNI/AVX2 implementation done. SM4-CTR: 2.71 cpb on tigerlake.
  • Power vcrypto
    • Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
    • Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
    • AES CFB/CBC dec/enc fine-tuning, as was done for x86 and aarch64.
  • ARMv8 64bit (& 32bit) implementations
    • Port Camellia aesni/avx implementation to ARM-CE AES 64bit(/32bit)
    • Port Serpent ARMv7/NEON implementation to 64bit
    • Port stitched ChaCha20-Poly1305 ARMv8/AArch64 implementation to 32bit
    • Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit
  • x86_64 / i386 implementations
    • ADX implementation of large integer multiply
    • AVX512-IFMA for large integer multiply
  • Support for more crypto instruction sets on different architectures
    • SPARC T4 crypto instruction set
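
The cpuid gating described in the first AVX-512 item could look roughly like the sketch below. This is not libgcrypt's actual detection code: the function names `want_avx512` and `detect_avx512` are hypothetical, and the policy (require AVX512F plus AVX512-VBMI2 plus OS support for ZMM state) is only an assumption about how "newer CPU" might be approximated. The bit positions are those of CPUID leaf 7, subleaf 0, and XCR0.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header */
#endif

/* CPUID.(EAX=7,ECX=0) feature bits. */
#define BIT_EBX_AVX512F       (1u << 16)
#define BIT_ECX_AVX512_VBMI2  (1u << 6)
#define BIT_ECX_GFNI          (1u << 8)
#define BIT_ECX_VAES          (1u << 9)
#define BIT_ECX_VPCLMULQDQ    (1u << 10)

/* XCR0 bits the OS must enable for AVX-512:
 * SSE (1), AVX (2), opmask (5), ZMM_Hi256 (6), Hi16_ZMM (7). */
#define XCR0_AVX512_STATE 0xE6u

/* Hypothetical policy helper: returns 1 if AVX-512 code paths should be
 * enabled. AVX512-VBMI2 is used as a marker for newer CPUs that are
 * assumed not to suffer a large frequency penalty. */
int want_avx512(uint32_t leaf7_ebx, uint32_t leaf7_ecx, uint64_t xcr0)
{
  if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE)
    return 0;                       /* OS does not save ZMM/opmask state */
  if (!(leaf7_ebx & BIT_EBX_AVX512F))
    return 0;                       /* no AVX-512 foundation */
  return (leaf7_ecx & BIT_ECX_AVX512_VBMI2) != 0;
}

/* Query the running CPU. Returns 0 on non-x86 builds. */
int detect_avx512(void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  uint32_t lo, hi;

  /* Leaf 1, ECX bit 27 (OSXSAVE) must be set before XGETBV is usable. */
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
    return 0;
  __asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));

  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return 0;
  return want_avx512(ebx, ecx, ((uint64_t)hi << 32) | lo);
#else
  return 0;
#endif
}
```

Keeping the bit-decoding in `want_avx512` separate from the hardware query makes the policy testable with synthetic register values. The VAES, VPCLMULQDQ and GFNI bits are defined here because the items above depend on them; an implementation would test each of them the same way before selecting the corresponding code path.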

Event Timeline

jukivili created this object in space S1 Public.
jukivili updated the task description.

Isn't the SPARC crypto instruction set only available in kernel mode?

SPARC T4 has a crypto instruction set for AES, GCM, SHA1, SHA256, SHA512, Camellia and DES, which can be used from user space too.
