chacha20: avoid AVX512/AVX2/SSSE3 for single block processing with Zen5
* cipher/chacha20.c (CHACHA20_context_s): Add 'skip_one_block_hw_impl'.
(chacha20_blocks, do_chacha20_encrypt_stream_tail): Avoid single block
/ non-parallel processing with AVX512/AVX2/SSSE3.
On AMD Zen5, the integer vector implementations of ChaCha20 are slower
than the general purpose register implementation: generic C is
approximately 50% faster for single block computation. This commit
adjusts the calls to the AVX512/AVX2/SSSE3 code so that trailing single
block computations are handled with generic C on AMD Zen5.
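
The selection described above can be pictured with the minimal sketch
below. It is illustrative only and is not the code in cipher/chacha20.c:
the struct, the 'use_simd' flag, the enum and the function name are
hypothetical, and only 'skip_one_block_hw_impl' is taken from this
change. It assumes the flag is set during context setup when the CPU is
detected as Zen5.

```c
#include <assert.h>

/* Hypothetical stand-in for the cipher context; only the two flags that
   matter for this sketch are modelled. */
struct chacha20_sketch_ctx
{
  unsigned int use_simd : 1;               /* a SIMD implementation is usable */
  unsigned int skip_one_block_hw_impl : 1; /* single-block SIMD is slower here */
};

enum sketch_path
{
  PATH_GENERIC_C = 0, /* general purpose register implementation */
  PATH_SIMD = 1       /* AVX512/AVX2/SSSE3 single-block implementation */
};

/* Choose the code path for blocks that cannot be processed in parallel,
   i.e. the tail left over after the multi-block SIMD code has run. */
static enum sketch_path
choose_single_block_path (const struct chacha20_sketch_ctx *ctx)
{
  if (ctx->use_simd && !ctx->skip_one_block_hw_impl)
    return PATH_SIMD;

  return PATH_GENERIC_C;
}
```

In this sketch the multi-block (parallel) SIMD paths are untouched; only
the single block tail is redirected to generic C when the flag is set.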
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>