
libgcrypt performance TODOs
Open, Wishlist, Public

Description

TODO list

  • x86-64 implementations
    • SHA512
      • Implementation using new SHA512 instructions
    • SM3
      • Implementation using new SM3 instructions
    • SM4
      • Implementation using new SM4 instructions
    • ADX implementation of large integer multiply
    • AVX512-IFMA for large integer multiply
  • ARMv8 64bit (& 32bit) implementations
    • Port Serpent ARMv7/NEON implementation to 64bit
  • Power vcrypto
    • Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
  • RISC-V 32/64 implementations
    • Support for scalar crypto extensions (AES, SHA, SM3/4, etc)
    • Support for vector crypto extensions (AES, SHA, SM3/4, etc)
    • Vector accelerated ChaCha20
    • Vector accelerated Blake2
    • RISC-V support for generic SIMD Camellia
  • Support for more crypto instruction sets on different architectures
    • SPARC T4 crypto instruction set

DONE list

  • x86-64 implementations
    • Newer CPUs can use AVX-512 without a large frequency penalty. To enable AVX-512 only on such CPUs, the cpuid check should test for one of the newest AVX-512 features, such as AVX512-VBMI2.
    • Implementations done/checked:
      • AES (AVX-512 + VAES) (done)
        • It turns out that VAES/AVX512 is slightly slower than VAES/AVX2. Tested CBC-DEC on tigerlake; no frequency drop observed. Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. This paper from Intel estimates a theoretical limit of 0.16 c/B for "icelake", the generation before tigerlake (https://eprint.iacr.org/2018/392.pdf).
        • AES-VAES-AVX2-CTR on zen4: 0.163cpb.
      • GCM (AVX-512 + VPCLMUL) (done)
        • Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
          • VPCLMUL/AVX2 close to twice as fast on AMD zen3/zen4 (zen3: 0.219 cpb, zen4: 0.212cpb)
          • VPCLMUL/AVX512 close to twice as fast on Intel tigerlake (0.181 cpb) and ~2.4x faster on AMD zen4 (0.171 cpb)
      • ChaCha20+Poly1305 / AVX512 (done)
        • AVX512 implementation done.
          • Results on tigerlake:
            • ChaCha20 stream: 0.665 cpb
            • Poly1305 MAC: 0.247 cpb
            • ChaCha20+Poly1305 AEAD: 0.907 cpb
          • Results on zen4:
            • ChaCha20 stream: 0.711 cpb
            • Poly1305 MAC: 0.247 cpb
            • ChaCha20+Poly1305 AEAD: 0.964 cpb
      • Camellia (AVX-512 + GFNI) (done)
        • GFNI/AVX2 implementation done. Camellia128-CTR:
          • 1.67cpb on tigerlake
          • 1.30cpb on zen4
        • GFNI/AVX512 implementation done. Camellia128-CTR:
          • 0.868cpb on tigerlake
          • 0.876cpb on zen4
      • SHA512 / AVX512 (done)
      • SM4 (AVX-512 + GFNI) (done)
        • GFNI/AVX2 implementation done. SM4-CTR:
          • 2.71cpb on tigerlake
          • 2.38cpb on zen4
        • GFNI/AVX512, SM4-CTR:
          • 1.25cpb on tigerlake
          • 1.51cpb on zen4
      • SHA3 / AVX512 (done)
        • SHA3-256, AVX512 (impl. using the lower 64 bits of 128bit registers):
          • 5.72cpb on tigerlake
          • 5.55cpb on zen4
      • Blake2 / AVX512 (done)
        • BLAKE2b AVX512 (256bit vectors):
          • 2.88cpb on tigerlake
          • 3.63cpb on zen4
        • BLAKE2s AVX512 (128bit vectors):
          • 4.18cpb on tigerlake
          • 5.32cpb on zen4
      • Twofish
        • Wider AVX512 registers could give additional performance. Performance is limited by table lookups.
      • Serpent / AVX512 (done)
        • AVX512/CTR: 2.52cpb on zen4
  • ARMv8 64bit (& 32bit) implementations
    • SHA512 ARMv8.4 crypto extension acceleration (done)
      • 2.54cpb on AWS Graviton3
    • SHA3 ARMv8.4 crypto extension acceleration (not feasible)
      • SHA3 instruction set "accelerated" implementation is nearly two times slower than generic C version on AWS Graviton3:
        • SHA3 ARMv8.4 implementation from CRYPTOGAMS: 11.28 c/B
        • SHA3 generic C implementation in libgcrypt: 6.05 c/B
    • Use SHA3 ARMv8.4 instructions to accelerate Blake2b (rotate and XOR instruction + three-way XOR instruction) (not feasible)
      • AdvSIMD + SHA3/CE instruction set implementation is slower than generic C version on AWS Graviton3:
        • Blake2b AdvSIMD+XAR implementation: 5.09 c/B
        • Blake2b generic C implementation in libgcrypt: 3.21 c/B
    • Port Camellia aesni/avx implementation to ARM-CE AES 64bit
      • ECB on Graviton2: 6.93cpb
    • Port stitched Chacha20-Poly1305 ARMv8/AArch64 implementation to 32bit (not planned as 32bit ARM seems to be sunsetting)
    • Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit (not planned as 32bit ARM seems to be sunsetting)
  • Power vcrypto
    • Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
      • ECB on POWER9: 7.48cpb
    • AES CFB/CBC dec/enc fine tuning, as was done for x86 and aarch64. (done)
      • CFB-enc/CBC-enc on POWER8: 3.86-3.89c/B
  • RISC-V 32/64 implementations
    • Add MPI longlong.h support (handled with new generic 'long long' (on 32bit) and '__int128' (on 64bit) macros)
      • Needs 32x32->64bit and 64x64->128bit multiply macros.
      • Generic addition and subtraction macros are as good as it gets, since RISC-V does not have a carry flag.
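
The RISC-V MPI item above can be illustrated with a minimal sketch of generic wide-multiply and carry-propagating helpers. The function names (`umul_32x32`, `umul_64x64`, `add_2limb`, `sub_2limb`) are hypothetical and are not libgcrypt's actual longlong.h macros. The sketch assumes a compiler that provides `unsigned __int128` on 64-bit targets (GCC and Clang do). With no carry flag available, the carry or borrow is recovered with an unsigned comparison.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not the macros used in libgcrypt. */

/* 32x32 -> 64-bit multiply using a plain 64-bit integer (32-bit targets). */
static inline void umul_32x32(uint32_t a, uint32_t b,
                              uint32_t *hi, uint32_t *lo)
{
  uint64_t p = (uint64_t)a * b;
  *hi = (uint32_t)(p >> 32);
  *lo = (uint32_t)p;
}

/* 64x64 -> 128-bit multiply using the compiler's unsigned __int128
 * (assumed to be available, as on GCC/Clang for 64-bit targets). */
static inline void umul_64x64(uint64_t a, uint64_t b,
                              uint64_t *hi, uint64_t *lo)
{
  unsigned __int128 p = (unsigned __int128)a * b;
  *hi = (uint64_t)(p >> 64);
  *lo = (uint64_t)p;
}

/* Two-limb addition (ah:al) + (bh:bl) without a carry flag:
 * the low-limb sum wrapped around exactly when it is smaller than an input. */
static inline void add_2limb(uint64_t ah, uint64_t al,
                             uint64_t bh, uint64_t bl,
                             uint64_t *sh, uint64_t *sl)
{
  uint64_t l = al + bl;
  uint64_t carry = (l < al);
  *sh = ah + bh + carry;
  *sl = l;
}

/* Two-limb subtraction (ah:al) - (bh:bl): a borrow occurs when al < bl. */
static inline void sub_2limb(uint64_t ah, uint64_t al,
                             uint64_t bh, uint64_t bl,
                             uint64_t *dh, uint64_t *dl)
{
  uint64_t borrow = (al < bl);
  *dl = al - bl;
  *dh = ah - bh - borrow;
}
```

Each helper writes the high and low limbs through its output pointers, so `umul_64x64(a, b, &hi, &lo)` leaves the full 128-bit product in `hi:lo`. The comparisons in `add_2limb` and `sub_2limb` stand in for the carry flag that other architectures provide.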

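The cpuid check described under the x86-64 entry of the DONE list could look roughly like the sketch below. It is not libgcrypt's actual hardware-feature detection code, and the function names (`avx512_preferred_from_bits`, `avx512_preferred`) are hypothetical. It assumes GCC or Clang (`<cpuid.h>` and inline assembly) and treats AVX512-VBMI2 as the marker for a newer CPU, as the entry suggests. The bit test is kept separate from the register reads so it can be exercised with synthetic values.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define BIT_OSXSAVE        (1u << 27) /* CPUID.(EAX=1):ECX */
#define BIT_AVX512F        (1u << 16) /* CPUID.(EAX=7,ECX=0):EBX */
#define BIT_AVX512_VBMI2   (1u << 6)  /* CPUID.(EAX=7,ECX=0):ECX */
#define XCR0_AVX512_STATE  0xE6u      /* SSE, AVX, opmask, ZMM_Hi256, Hi16_ZMM */

/* Hypothetical helper: decide from raw register values whether AVX-512
 * code paths should be enabled.  Requires OS support for the AVX-512
 * register state, AVX512F, and AVX512-VBMI2 as the "newer CPU" marker. */
static int avx512_preferred_from_bits(uint32_t ebx7, uint32_t ecx7,
                                      uint64_t xcr0)
{
  if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE)
    return 0;
  if (!(ebx7 & BIT_AVX512F))
    return 0;
  return (ecx7 & BIT_AVX512_VBMI2) != 0;
}

/* Hypothetical helper: query the running CPU.  Returns 0 on non-x86. */
static int avx512_preferred(void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  uint32_t lo, hi;

  /* XGETBV may only be executed when the OS has enabled XSAVE. */
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & BIT_OSXSAVE))
    return 0;
  __asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));

  if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
    return 0;
  return avx512_preferred_from_bits(ebx, ecx, ((uint64_t)hi << 32) | lo);
#else
  return 0;
#endif
}
```

A dispatcher would call `avx512_preferred()` once at start-up and fall back to the AVX2 code paths when it returns 0.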
Event Timeline

jukivili created this object in space S1 Public.
jukivili updated the task description.

Isn't the SPARC crypto instruction set only available in kernel mode?

SPARC T4 has a crypto instruction set for AES, GCM, SHA1, SHA256, SHA512, Camellia and DES that can be used from user space too.
