TODO list
- x86-64 AVX-512 implementations
   - Newer CPUs can run AVX-512 code without a large frequency penalty. To enable AVX-512 only on such CPUs, the cpuid check should test for one of the newest AVX-512 features, such as AVX512-VBMI2.
   - Implementations to consider:
      - ~~AES (AVX-512 + VAES)~~ (done)
         - Turns out that VAES/AVX512 is slightly slower than VAES/AVX2. Tested CBC-DEC on tigerlake; no frequency drop observed. Current AES128-CTR/AVX2/VAES speed is 0.160 cycles/byte on tigerlake. A paper from Intel estimates a theoretical limit of 0.16 c/B for "icelake", the generation preceding tigerlake (https://eprint.iacr.org/2018/392.pdf).
      - ~~GCM (AVX-512 + VPCLMUL)~~ (done)
         - Check first if VPCLMUL/AVX2 would be faster than current PCLMUL/SSE:
            - VPCLMUL/AVX2 close to twice as fast on AMD zen3 (0.219 cpb).
            - VPCLMUL/AVX512 close to twice as fast on Intel tigerlake (0.181 cpb).
      - ~~ChaCha20+Poly1305~~ (done)
            - AVX512 implementation done. Results on tigerlake:
                    - ChaCha20 stream: 0.665 cpb
                    - Poly1305 MAC: 0.247 cpb
                    - ChaCha20+Poly1305 AEAD: 0.907 cpb
      - ~~Camellia (AVX-512 + GFNI)~~ (done)
            - GFNI/AVX2 implementation done. Camellia128-CTR: 1.67cpb on tigerlake.
            - GFNI/AVX512 implementation done. Camellia128-CTR: 0.868cpb on tigerlake.
      - ~~SHA512~~ (done)
            - AVX512 with 256-bit/128-bit registers: 4.76 cpb on tigerlake
      - SM4 (AVX-512 + GFNI)
            - GFNI/AVX2 implementation done. SM4-CTR: 2.71cpb on tigerlake.
            - WIP: GFNI/AVX512
      - SHA3
      - Blake2
            - Check whether EVEX-encoded VPGATHERDD/VPGATHERDQ is faster than the VPINSRD/VPINSRQ method
            - Round function could use the AVX-512 rotate instructions VPRORD/VPRORQ
      - Twofish
            - Wider AVX512 registers and EVEX encoded VPGATHERQQ could give additional performance
      - Serpent
            - VPSHUFB could be used to implement sboxes with one instruction. This could be done with AVX2 in addition to AVX512.
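
The cpuid check suggested at the top of the AVX-512 item could look roughly like the sketch below. It is only an illustration, not libgcrypt's actual hwfeatures code: `want_avx512` and `detect_avx512` are hypothetical names, the bit positions are the ones documented for CPUID leaf 7 and XCR0, and `<cpuid.h>` plus inline asm assume GCC or Clang.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID leaf 7, sub-leaf 0 feature bits. */
#define CPUID7_EBX_AVX512F      (1u << 16)
#define CPUID7_ECX_AVX512VBMI2  (1u << 6)

/* XCR0 state bits the OS must have enabled:
 * SSE (1), AVX (2), opmask (5), ZMM_Hi256 (6), Hi16_ZMM (7). */
#define XCR0_AVX512_STATE 0xe6u

/* Hypothetical helper: decide from raw register values whether the
 * AVX-512 code paths should be enabled.  Requiring VBMI2 in addition
 * to AVX512F filters out older AVX-512 CPUs. */
static int
want_avx512 (uint32_t ebx7, uint32_t ecx7, uint64_t xcr0)
{
  if ((xcr0 & XCR0_AVX512_STATE) != XCR0_AVX512_STATE)
    return 0;
  if (!(ebx7 & CPUID7_EBX_AVX512F))
    return 0;
  return (ecx7 & CPUID7_ECX_AVX512VBMI2) != 0;
}

/* Hypothetical runtime probe; returns 0 on non-x86 targets. */
static int
detect_avx512 (void)
{
#if defined(__x86_64__) || defined(__i386__)
  unsigned int eax, ebx, ecx, edx;
  uint32_t xcr0_lo, xcr0_hi;

  /* OSXSAVE (leaf 1, ECX bit 27) must be set before XGETBV is usable. */
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
    return 0;
  __asm__ volatile ("xgetbv" : "=a" (xcr0_lo), "=d" (xcr0_hi) : "c" (0));

  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return 0;
  return want_avx512 (ebx, ecx, ((uint64_t) xcr0_hi << 32) | xcr0_lo);
#else
  return 0;
#endif
}
```

Keeping the decision in a pure function makes it testable on any host; only `detect_avx512` touches the hardware.
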
- ARMv8 64bit (& 32bit) implementations
   - SHA512 ARMv8.4 crypto extension acceleration (WIP)
   - ~~SHA3 ARMv8.4 crypto extension acceleration~~ (not feasible, SHA3 instruction set "accelerated" implementation is over two times slower than generic C version on AWS Graviton3)
   - Use SHA3 ARMv8.4 instructions to accelerate Blake2b (rotate and XOR instruction + three-way XOR instruction)
   - Port Camellia aesni/avx implementation to ARM-CE AES 64bit(/32bit)
   - Port Serpent ARMv7/NEON implementation to 64bit
   - Port stitched ChaCha20-Poly1305 ARMv8/AArch64 implementation to 32bit
   - Port CRC ARM-CE PMULL 64bit implementation to ARM-CE PMULL 32bit
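
To illustrate the Blake2b idea in the ARMv8 item: the "rotate and XOR" instruction (XAR) computes `ror(n ^ m, imm)`, which is the shape of every rotation step in the Blake2b G function, and the three-way XOR (EOR3) matches the final feed-forward `h[i] ^= v[i] ^ v[i+8]`. The scalar model below is only a sketch of that mapping; `xar64`, `eor3_64`, `blake2b_g` and `blake2b_feedforward` are hypothetical names, and a real implementation would operate on two 64-bit lanes per NEON register.

```c
#include <stdint.h>

/* Scalar model of XAR: rotate right of (n XOR m) by imm bits. */
static inline uint64_t
xar64 (uint64_t n, uint64_t m, unsigned int imm)
{
  uint64_t t = n ^ m;

  imm &= 63;
  return imm ? (t >> imm) | (t << (64 - imm)) : t;
}

/* Scalar model of EOR3: three-way XOR. */
static inline uint64_t
eor3_64 (uint64_t n, uint64_t m, uint64_t a)
{
  return n ^ m ^ a;
}

/* Blake2b G function; each XOR+rotate pair is a single XAR. */
static void
blake2b_g (uint64_t v[16], int a, int b, int c, int d,
           uint64_t x, uint64_t y)
{
  v[a] = v[a] + v[b] + x;  v[d] = xar64 (v[d], v[a], 32);
  v[c] = v[c] + v[d];      v[b] = xar64 (v[b], v[c], 24);
  v[a] = v[a] + v[b] + y;  v[d] = xar64 (v[d], v[a], 16);
  v[c] = v[c] + v[d];      v[b] = xar64 (v[b], v[c], 63);
}

/* Feed-forward at the end of the compression function, one EOR3 each. */
static void
blake2b_feedforward (uint64_t h[8], const uint64_t v[16])
{
  int i;

  for (i = 0; i < 8; i++)
    h[i] = eor3_64 (h[i], v[i], v[i + 8]);
}
```
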
- Power vcrypto
   - Add optimized PPC64 MPI assembly functions (PPC32 in mpi/powerpc32/ for reference)
   - Port Camellia aesni/avx implementation to VSX/vcrypto intrinsics
   - AES CFB/CBC dec/enc fine tuning, as was done for x86 and aarch64.
- x86_64 / i386 implementations
   - ADX implementation of large integer multiply
   - AVX512-IFMA for large integer multiply
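
For the AVX512-IFMA item, the relevant instructions (VPMADD52LUQ/VPMADD52HUQ) multiply the low 52 bits of two 64-bit lanes and add either the low or the high 52 bits of the 104-bit product to an accumulator, so large integers would have to be held in radix 2^52 rather than in 64-bit limbs. The following scalar model is a sketch under that assumption; `madd52lo`, `madd52hi` and `mul_radix52` are hypothetical names, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <stdint.h>

#define LIMB_BITS 52
#define LIMB_MASK ((UINT64_C (1) << LIMB_BITS) - 1)

/* Scalar model of one VPMADD52LUQ lane: acc + low 52 bits of a*b. */
static inline uint64_t
madd52lo (uint64_t acc, uint64_t a, uint64_t b)
{
  unsigned __int128 p = (unsigned __int128) (a & LIMB_MASK) * (b & LIMB_MASK);

  return acc + ((uint64_t) p & LIMB_MASK);
}

/* Scalar model of one VPMADD52HUQ lane: acc + bits 103:52 of a*b. */
static inline uint64_t
madd52hi (uint64_t acc, uint64_t a, uint64_t b)
{
  unsigned __int128 p = (unsigned __int128) (a & LIMB_MASK) * (b & LIMB_MASK);

  return acc + (uint64_t) (p >> LIMB_BITS);
}

/* Schoolbook product of two n-limb numbers in radix 2^52.
 * r receives 2*n fully carry-propagated limbs.  Each r[k] collects
 * at most 2*n terms below 2^52, so the 12 spare bits of a 64-bit
 * accumulator are enough as long as n <= 2048. */
static void
mul_radix52 (uint64_t *r, const uint64_t *a, const uint64_t *b, int n)
{
  uint64_t carry = 0;
  int i, j;

  for (i = 0; i < 2 * n; i++)
    r[i] = 0;

  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      {
        r[i + j] = madd52lo (r[i + j], a[i], b[j]);
        r[i + j + 1] = madd52hi (r[i + j + 1], a[i], b[j]);
      }

  /* Deferred carry propagation back to canonical 52-bit limbs. */
  for (i = 0; i < 2 * n; i++)
    {
      uint64_t t = r[i] + carry;

      r[i] = t & LIMB_MASK;
      carry = t >> LIMB_BITS;
    }
}
```

The point of the layout is that carries can be deferred: the inner loop contains only multiply-accumulate steps, which is what makes it a candidate for vectorisation.
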
- Support for more crypto instruction sets on different architectures
   - SPARC T4 crypto instruction set
- Performance optimizations for Curve25519
   - https://marc.info/?l=gcrypt-devel&m=153295947908984&w=2
   - Maybe use mixed asm/C approach as used with poly1305.c