Add SSSE3 optimized non-parallel ChaCha20 function

* cipher/chacha20-amd64-ssse3.S (ROTATE_SHUF, ROTATE, WORD_SHUF)
(QUARTERROUND4, _gcry_chacha20_amd64_ssse3_blocks1): New.
* cipher/chacha20.c (_gcry_chacha20_amd64_ssse3_blocks1): New
prototype.
(chacha20_blocks): Rename to ...
(do_chacha20_blocks): ... this.
(chacha20_blocks): New.
(chacha20_encrypt_stream): Adjust for new chacha20_blocks function.

This patch provides an SSSE3-optimized version of the non-parallel
ChaCha20 core block function. On Intel Haswell, the generic C function
runs at 6.9 cycles/byte; the new function runs at 5.2 cycles/byte,
which makes it ~32% faster.
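
For readers unfamiliar with the trick that macro names such as
ROTATE_SHUF suggest, here is a minimal portable C sketch of the ChaCha20
quarter round (as specified in RFC 8439). It is not the assembly added
by this patch. It only illustrates that the 16-bit and 8-bit rotations
are pure byte permutations, which is presumably why a byte-shuffle
instruction can replace the shift/shift/or sequence for those two
steps. All function names below (rotl32, rotl16_by_bytes,
rotl8_by_bytes, quarterround) are hypothetical and chosen for this
example only.

```c
#include <stdint.h>

/* Generic rotate-left of a 32-bit word by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Rotate left by 16, written as a byte permutation.
 * With x = b3 b2 b1 b0 (most to least significant byte),
 * the result is b1 b0 b3 b2. */
static uint32_t rotl16_by_bytes(uint32_t x)
{
    uint32_t b0 = x & 0xff, b1 = (x >> 8) & 0xff;
    uint32_t b2 = (x >> 16) & 0xff, b3 = (x >> 24) & 0xff;
    return b2 | (b3 << 8) | (b0 << 16) | (b1 << 24);
}

/* Rotate left by 8, written as a byte permutation.
 * With x = b3 b2 b1 b0, the result is b2 b1 b0 b3. */
static uint32_t rotl8_by_bytes(uint32_t x)
{
    uint32_t b0 = x & 0xff, b1 = (x >> 8) & 0xff;
    uint32_t b2 = (x >> 16) & 0xff, b3 = (x >> 24) & 0xff;
    return b3 | (b0 << 8) | (b1 << 16) | (b2 << 24);
}

/* ChaCha20 quarter round per RFC 8439, section 2.1.
 * The rotations by 16 and 8 use the byte-permutation form;
 * the rotations by 12 and 7 need real shifts. */
static void quarterround(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl16_by_bytes(*d);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl8_by_bytes(*d);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

A vectorized implementation would apply the same steps to four 32-bit
lanes at once; the scalar form above is only meant to make the data
flow easy to follow.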

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>