libgcrypt: s390x/zSeries 128-bit vector implementation of ChaCha20
s390x/zSeries has a vector register instruction set which can be used to implement a faster ChaCha20.

An 8-block parallel implementation is currently done. Still need to check whether a 6-block parallel approach is better (as used in OpenSSL; the benefit would be less register pressure and less moving of data between registers and stack).

Reimplemented the 8-block parallel code in "vertical" orientation.
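
For readers unfamiliar with the term, here is a minimal portable C sketch of what "vertical" orientation usually means: each vector holds the same state word taken from several blocks, so a quarter round is plain lane-wise add/xor/rotate with no shuffles between lanes. This is not the s390x assembly from the commit. The names vec_t, chacha20_blocks_vertical and chacha20_block_scalar are hypothetical, and vec_t is an ordinary array standing in for vector registers.

```c
#include <stdint.h>
#include <string.h>

#define NBLK 8 /* blocks processed in parallel */

/* Hypothetical stand-in for vector registers: one state word from NBLK blocks. */
typedef struct { uint32_t lane[NBLK]; } vec_t;

static uint32_t rotl32(uint32_t v, int n) { return (v << n) | (v >> (32 - n)); }

/* Lane-wise helpers: a += b, and d = rotl(d ^ a, n). */
static void vadd(vec_t *a, const vec_t *b)
{
    for (int i = 0; i < NBLK; i++) a->lane[i] += b->lane[i];
}
static void vxor_rotl(vec_t *d, const vec_t *a, int n)
{
    for (int i = 0; i < NBLK; i++) d->lane[i] = rotl32(d->lane[i] ^ a->lane[i], n);
}

/* One ChaCha quarter round applied to all NBLK blocks at once. */
static void vquarter_round(vec_t *a, vec_t *b, vec_t *c, vec_t *d)
{
    vadd(a, b); vxor_rotl(d, a, 16);
    vadd(c, d); vxor_rotl(b, c, 12);
    vadd(a, b); vxor_rotl(d, a, 8);
    vadd(c, d); vxor_rotl(b, c, 7);
}

/* input: 16-word ChaCha20 state with a 32-bit block counter in word 12.
 * out:   NBLK * 64 bytes of keystream for counters input[12] .. input[12]+NBLK-1. */
void chacha20_blocks_vertical(const uint32_t input[16], uint8_t out[NBLK * 64])
{
    vec_t init[16], x[16];

    /* Broadcast every state word across the lanes; only the counter differs. */
    for (int w = 0; w < 16; w++)
        for (int b = 0; b < NBLK; b++)
            init[w].lane[b] = input[w] + (w == 12 ? (uint32_t)b : 0);
    memcpy(x, init, sizeof x);

    for (int r = 0; r < 10; r++) { /* 10 double rounds = 20 rounds */
        vquarter_round(&x[0], &x[4], &x[8],  &x[12]); /* columns */
        vquarter_round(&x[1], &x[5], &x[9],  &x[13]);
        vquarter_round(&x[2], &x[6], &x[10], &x[14]);
        vquarter_round(&x[3], &x[7], &x[11], &x[15]);
        vquarter_round(&x[0], &x[5], &x[10], &x[15]); /* diagonals */
        vquarter_round(&x[1], &x[6], &x[11], &x[12]);
        vquarter_round(&x[2], &x[7], &x[8],  &x[13]);
        vquarter_round(&x[3], &x[4], &x[9],  &x[14]);
    }

    /* Add the initial state and "transpose" back: block b is 64 contiguous
     * bytes, each word stored little-endian. */
    for (int w = 0; w < 16; w++)
        for (int b = 0; b < NBLK; b++) {
            uint32_t v = x[w].lane[b] + init[w].lane[b];
            uint8_t *p = out + b * 64 + w * 4;
            p[0] = (uint8_t)v;         p[1] = (uint8_t)(v >> 8);
            p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 24);
        }
}

/* Plain one-block reference, used only to cross-check the sketch above. */
#define QR(a, b, c, d) \
    (a += b, d = rotl32(d ^ a, 16), c += d, b = rotl32(b ^ c, 12), \
     a += b, d = rotl32(d ^ a, 8),  c += d, b = rotl32(b ^ c, 7))

void chacha20_block_scalar(const uint32_t in[16], uint8_t out[64])
{
    uint32_t x[16];
    memcpy(x, in, sizeof x);
    for (int r = 0; r < 10; r++) {
        QR(x[0], x[4], x[8],  x[12]); QR(x[1], x[5], x[9],  x[13]);
        QR(x[2], x[6], x[10], x[14]); QR(x[3], x[7], x[11], x[15]);
        QR(x[0], x[5], x[10], x[15]); QR(x[1], x[6], x[11], x[12]);
        QR(x[2], x[7], x[8],  x[13]); QR(x[3], x[4], x[9],  x[14]);
    }
    for (int w = 0; w < 16; w++) {
        uint32_t v = x[w] + in[w];
        out[w * 4]     = (uint8_t)v;         out[w * 4 + 1] = (uint8_t)(v >> 8);
        out[w * 4 + 2] = (uint8_t)(v >> 16); out[w * 4 + 3] = (uint8_t)(v >> 24);
    }
}
```

With 128-bit registers each vector holds four 32-bit lanes, so an 8-block vertical layout corresponds to two groups of 16 such vectors. How the actual assembly schedules those groups and when it spills to the stack is not shown by this sketch.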

libgcrypt chacha20-s390x:

CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
    STREAM enc |     0.506 ns/B      1886 MiB/s      2.28 c/B
    STREAM dec |     0.506 ns/B      1884 MiB/s      2.28 c/B

openssl 1.1.1f:

bench-slope-openssl: OpenSSL 1.1.1f  31 Mar 2020
 chacha20       |  nanosecs/byte   mebibytes/sec   cycles/byte
     STREAM enc |     0.592 ns/B    1609.9 MiB/s      2.67 c/B
     STREAM dec |     0.593 ns/B    1607.1 MiB/s      2.67 c/B

Merged to master.