
libgcrypt performance TODOs
Open, Wishlist, Public

Description

TODO list

  • x86-64 implementations
    • Newer CPUs can use AVX-512 without a high frequency penalty. To enable AVX-512 only on such CPUs, the cpuid check should test for one of the newest AVX-512 features, such as AVX512-VBMI2.
    • Implementations to consider:
      • AES (AVX-512 + VAES) (done)
        • Turns out that VAES/AVX512 is slightly slower than VAES/AVX2. Tested CBC-DEC on tigerlake; no frequency drop observed. Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. This paper from Intel gives an estimated theoretical limit of 0.16 c/B for "icelake", which is the generation before tigerlake (https://eprint.iacr.org/2018/392.pdf).
      • GCM (AVX-512 + VPCLMUL) (done)
        • Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
          • VPCLMUL/AVX2 is close to twice as fast on AMD zen3 (0.219 cpb).
          • VPCLMUL/AVX512 is close to twice as fast on Intel tigerlake (0.181 cpb).
      • ChaCha20+Poly1305 (done)
        • AVX512 implementation done. Results on tigerlake:
          • ChaCha20 stream: 0.665 cpb
          • Poly1305 MAC: 0.247 cpb
          • ChaCha20+Poly1305 AEAD: 0.907 cpb
      • Camellia (AVX-512 + GFNI) (done)
        • GFNI/AVX2 implementation done. Camellia128-CTR: 1.67cpb on tigerlake.
        • GFNI/AVX512 implementation done. Camellia128-CTR: 0.868cpb on tigerlake.
      • SHA512 (done)
        • AVX512 with 256-bit/128-bit registers: 4.76 cpb on tigerlake
      • SM4 (AVX-512 + GFNI) (done)
        • GFNI/AVX2 implementation done. SM4-CTR: 2.71cpb on tigerlake.
        • GFNI/AVX512, SM4-CTR: 1.25cpb on tigerlake
      • SHA3 (done)
        • AVX512 (implementation using the lower 64 bits of 128-bit registers): 5.72cpb on tigerlake
      • Blake2 (done)
        • Check if EVEX encoded VPGATHERDD/VPGATHERDQ is faster than VPINSRD/VPINSRQ method
          • For BLAKE2B, VPGATHERDQ was faster. For BLAKE2S VMOVD+VPINSRD was faster (on tigerlake).
        • BLAKE2b AVX512 (256bit vectors): 2.88cpb on tigerlake
        • BLAKE2s AVX512 (128bit vectors): 4.18cpb on tigerlake
      • Twofish
        • Wider AVX512 registers and EVEX encoded VPGATHERQQ could give additional performance
      • Serpent
        • VPSHUFB could be used to implement sboxes with one instruction, with AVX2 as well as AVX512. However, this would make the linear transformation in the round function much more complex, so it is unlikely to be faster than the current bit-logic sboxes.
        • Use of the vpternlogq instruction would reduce the number of bit-logic instructions per sbox. This could be used with 128b and 256b vector implementations in addition to 512b vectors. A vector rotate instruction would speed up the linear transform part of the round function.
        • Wider AVX512 registers could give ~30-50% additional performance vs 256b AVX2.
    • ADX implementation of large integer multiply
    • AVX512-IFMA for large integer multiply
  • ARMv8 64bit (& 32bit) implementations
    • SHA512 ARMv8.4 crypto extension acceleration (done)
      • 2.54cpb on AWS Graviton3
    • SHA3 ARMv8.4 crypto extension acceleration (not feasible)
      • SHA3 instruction set "accelerated" implementation is nearly two times slower than generic C version on AWS Graviton3:
        • SHA3 ARMv8.4 implementation from CRYPTOGAMS: 11.28 c/B
        • SHA3 generic C implementation in libgcrypt: 6.05 c/B
    • Use SHA3 ARMv8.4 instructions to accelerate Blake2b (rotate and XOR instruction + three-way XOR instruction) (not feasible)
      • AdvSIMD + SHA3/CE instruction set implementation is slower than generic C version on AWS Graviton3:
        • Blake2b AdvSIMD+XAR implementation: 5.09 c/B
        • Blake2b generic C implementation in libgcrypt: 3.21 c/B
    • Port Camellia aesni/avx implementation to ARM-CE AES 64bit
    • Port Serpent ARMv7/NEON implementation to 64bit
    • Port stitched Chacha20-Poly1305 ARMv8/AArch64 implementation to 32bit (not planned as 32bit ARM seems to be sunsetting)
    • Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit (not planned as 32bit ARM seems to be sunsetting)
  • Power vcrypto
    • Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
    • Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
    • AES CFB/CBC dec/enc fine tuning, as was done for x86 and aarch64. (done)
      • CFB-enc/CBC-enc on POWER8: 3.86-3.89c/B
  • RISC-V 32/64 implementations
    • Add MPI longlong.h support
      • Needs 32x32->64bit and 64x64->128bit multiply macros.
      • Generic addition and subtraction macros are as good as it gets, since RISC-V does not have a carry flag.
  • Support for more crypto instruction sets on different architectures
    • SPARC T4 crypto instruction set
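
For the RISC-V longlong.h item above, the following is a minimal portable C sketch of what widening multiply and carry-flag-free add/subtract helpers can look like. The function names (umul_32x32, umul_64x64, add_128, sub_128, longlong_selfcheck) are hypothetical and chosen for illustration; they only mirror the spirit of the umul_ppmm, add_ssaaaa and sub_ddmmss macros in longlong.h and are not libgcrypt code. An architecture-specific version would presumably use the hardware high-half multiply instead of the four partial products shown here.

```c
#include <assert.h>
#include <stdint.h>

/* 32x32 -> 64 bit multiply (hypothetical helper name). */
uint64_t umul_32x32(uint32_t u, uint32_t v)
{
  return (uint64_t)u * v;
}

/* 64x64 -> 128 bit multiply built from four 32x32 -> 64 partial products. */
void umul_64x64(uint64_t *hi, uint64_t *lo, uint64_t u, uint64_t v)
{
  uint64_t ul = (uint32_t)u, uh = u >> 32;
  uint64_t vl = (uint32_t)v, vh = v >> 32;
  uint64_t ll = ul * vl, lh = ul * vh, hl = uh * vl, hh = uh * vh;
  /* Sum of three 32-bit values, so this cannot overflow 64 bits. */
  uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

  *lo = (mid << 32) | (uint32_t)ll;
  *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

/* 128-bit add without a carry flag: the carry out of the low word is
 * recovered with an unsigned comparison. */
void add_128(uint64_t *sh, uint64_t *sl,
             uint64_t ah, uint64_t al, uint64_t bh, uint64_t bl)
{
  uint64_t low = al + bl;

  *sh = ah + bh + (low < al);
  *sl = low;
}

/* 128-bit subtract: the borrow is likewise an unsigned comparison. */
void sub_128(uint64_t *dh, uint64_t *dl,
             uint64_t ah, uint64_t al, uint64_t bh, uint64_t bl)
{
  *dh = ah - bh - (al < bl);
  *dl = al - bl;
}

/* Small self-check doubling as a usage example. */
void longlong_selfcheck(void)
{
  uint64_t hi, lo;

  umul_64x64(&hi, &lo, UINT64_MAX, UINT64_MAX);
  assert(hi == 0xFFFFFFFFFFFFFFFEULL && lo == 1);
  add_128(&hi, &lo, 0, UINT64_MAX, 0, 1);
  assert(hi == 1 && lo == 0);
  sub_128(&hi, &lo, 1, 0, 0, 1);
  assert(hi == 0 && lo == UINT64_MAX);
}
```

The comparison-based carry is the reason generic C is already adequate here: without a carry flag, an assembly version would perform the same compare-and-add sequence.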

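The Serpent item above notes that vpternlogq would reduce the number of bit-logic instructions per sbox. As a rough illustration, here is a portable C model of one 64-bit lane of that instruction: any three-input boolean function is selected by an 8-bit truth table, so a chain of AND/OR/XOR/NOT on three inputs collapses into one operation. The function names and the sample term are hypothetical; the term is not taken from an actual Serpent sbox, and this emulation is not how libgcrypt implements anything.

```c
#include <assert.h>
#include <stdint.h>

/* Canonical truth-table constants: evaluating a boolean expression on
 * these three bytes yields the imm8 for that expression. */
#define TT_A 0xF0
#define TT_B 0xCC
#define TT_C 0xAA

/* Model of one 64-bit lane of a three-input ternary-logic operation.
 * For each bit position, (a, b, c) form a 3-bit index (a is the most
 * significant bit) into the truth table imm8. */
uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8)
{
  uint64_t r = 0;
  int idx;

  for (idx = 0; idx < 8; idx++)
    {
      if ((imm8 >> idx) & 1)
        {
          /* Mask of bit positions where (a, b, c) equals idx. */
          uint64_t ma = (idx & 4) ? a : ~a;
          uint64_t mb = (idx & 2) ? b : ~b;
          uint64_t mc = (idx & 1) ? c : ~c;

          r |= ma & mb & mc;
        }
    }
  return r;
}

/* Self-check: a made-up sbox-style term that needs four bit-logic
 * operations (AND, NOT, OR, XOR) matches a single ternary-logic lookup. */
int ternlog_selfcheck(void)
{
  uint64_t a = 0x0123456789abcdefULL;
  uint64_t b = 0xfedcba9876543210ULL;
  uint64_t c = 0x0f1e2d3c4b5a6978ULL;
  uint8_t imm = (uint8_t)((TT_A & TT_B) ^ (~TT_C | TT_A));
  uint64_t ref = (a & b) ^ (~c | a);

  assert(ternlog64(a, b, c, TT_A ^ TT_B ^ TT_C) == (a ^ b ^ c));
  return ternlog64(a, b, c, imm) == ref;
}
```

In a real vector implementation the loop disappears: the truth table is an immediate operand and the whole lookup is one instruction per vector register.
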
Event Timeline

jukivili created this object in space S1 Public.
jukivili updated the task description.

Isn't the SPARC crypto instruction set only available in kernel mode?

SPARC T4 has a crypto instruction set for AES, GCM, SHA1, SHA256, SHA512, Camellia and DES that can be used from user space too.
