s390x/zSeries has a vector register instruction set that can be used to implement a faster ChaCha20.
An 8-block parallel implementation is done. Still need to check whether a 6-block parallel approach is better (as used in OpenSSL; the benefit would be less register pressure and less moving of data between registers and the stack).
Reimplemented the 8-block parallel code in "vertical" orientation.
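For readers unfamiliar with the term, here is a rough sketch of what a "vertical" layout usually means, assuming it refers to keeping the same state word from all 8 blocks in one vector register, so that a quarter round never has to shuffle words within a register. This is portable C with plain arrays standing in for vector registers, not the s390x vector code from this task; `vec_t`, `vquarter` and `chacha20_blocks_vertical` are hypothetical names chosen for illustration.

```c
#include <stdint.h>
#include <string.h>

#define NBLOCKS 8  /* blocks processed in parallel */

/* Stand-in for one vector register: the SAME state word from NBLOCKS blocks. */
typedef struct { uint32_t lane[NBLOCKS]; } vec_t;

/* a += b, lane by lane */
static void vadd(vec_t *a, const vec_t *b)
{
  for (int i = 0; i < NBLOCKS; i++)
    a->lane[i] += b->lane[i];
}

/* d = rotl32(d ^ a, n), lane by lane; n is always one of 16, 12, 8, 7 */
static void vxor_rotl(vec_t *d, const vec_t *a, int n)
{
  for (int i = 0; i < NBLOCKS; i++)
    {
      uint32_t t = d->lane[i] ^ a->lane[i];
      d->lane[i] = (t << n) | (t >> (32 - n));
    }
}

/* ChaCha quarter round applied to NBLOCKS blocks at once. */
static void vquarter(vec_t *a, vec_t *b, vec_t *c, vec_t *d)
{
  vadd(a, b); vxor_rotl(d, a, 16);
  vadd(c, d); vxor_rotl(b, c, 12);
  vadd(a, b); vxor_rotl(d, a, 8);
  vadd(c, d); vxor_rotl(b, c, 7);
}

/* Compute NBLOCKS consecutive ChaCha20 blocks (32-bit counter, 96-bit nonce).
 * out[i] receives the 16 output words of the block with counter + i. */
void chacha20_blocks_vertical(const uint32_t key[8], uint32_t counter,
                              const uint32_t nonce[3],
                              uint32_t out[NBLOCKS][16])
{
  static const uint32_t sigma[4] =
    { 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 };
  vec_t init[16], x[16];
  int i, w, r;

  /* Build the transposed state: only word 12 (counter) differs per lane. */
  for (i = 0; i < NBLOCKS; i++)
    {
      for (w = 0; w < 4; w++) init[w].lane[i] = sigma[w];
      for (w = 0; w < 8; w++) init[4 + w].lane[i] = key[w];
      init[12].lane[i] = counter + (uint32_t)i;
      for (w = 0; w < 3; w++) init[13 + w].lane[i] = nonce[w];
    }
  memcpy(x, init, sizeof x);

  for (r = 0; r < 10; r++)  /* 10 double rounds = 20 rounds */
    {
      /* column round */
      vquarter(&x[0], &x[4], &x[8],  &x[12]);
      vquarter(&x[1], &x[5], &x[9],  &x[13]);
      vquarter(&x[2], &x[6], &x[10], &x[14]);
      vquarter(&x[3], &x[7], &x[11], &x[15]);
      /* diagonal round: only the register selection changes, no shuffles */
      vquarter(&x[0], &x[5], &x[10], &x[15]);
      vquarter(&x[1], &x[6], &x[11], &x[12]);
      vquarter(&x[2], &x[7], &x[8],  &x[13]);
      vquarter(&x[3], &x[4], &x[9],  &x[14]);
    }

  /* Add the initial state and transpose back to per-block order. */
  for (w = 0; w < 16; w++)
    for (i = 0; i < NBLOCKS; i++)
      out[i][w] = x[w].lane[i] + init[w].lane[i];
}
```

With 16 state vectors live at once plus temporaries, a layout like this keeps most of the register file occupied, which is the register pressure the 6-block question above is about. The final transpose is the price paid for shuffle-free rounds.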
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
     STREAM enc |     0.506 ns/B      1886 MiB/s      2.28 c/B
     STREAM dec |     0.506 ns/B      1884 MiB/s      2.28 c/B
bench-slope-openssl: OpenSSL 1.1.1f  31 Mar 2020
Cipher:
 chacha20       |  nanosecs/byte   mebibytes/sec   cycles/byte
     STREAM enc |     0.592 ns/B    1609.9 MiB/s      2.67 c/B
     STREAM dec |     0.593 ns/B    1607.1 MiB/s      2.67 c/B