Tune SHA-512/AVX2 and SHA-256/AVX2 implementations
* cipher/sha256-avx2-bmi2-amd64.S (ONE_ROUND_PART1, ONE_ROUND_PART2)
(ONE_ROUND): New round function.
(FOUR_ROUNDS_AND_SCHED, FOUR_ROUNDS): Use new round function.
(_gcry_sha256_transform_amd64_avx2): Exit early if number of blocks is
zero; write XFER to stack earlier and handle XFER writing in
FOUR_ROUNDS_AND_SCHED.
* cipher/sha512-avx2-bmi2-amd64.S (MASK_YMM_LO, MASK_YMM_LOx): New.
(ONE_ROUND_PART1, ONE_ROUND_PART2, ONE_ROUND): New round function.
(FOUR_ROUNDS_AND_SCHED, FOUR_ROUNDS): Use new round function.
(_gcry_sha512_transform_amd64_avx2): Write XFER to stack earlier and
handle XFER writing in FOUR_ROUNDS_AND_SCHED.
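For orientation, here is a minimal scalar C sketch of the round split the
entries above refer to (the actual change is hand-written AVX2/BMI2
assembly; the helper names below are hypothetical). Part 1 computes the
T1 half that depends only on e..h and the prepared K[t] + W[t] word (the
XFER value spilled to the stack), part 2 folds in the a..c half, so the
assembler macros can interleave the two with the message-schedule vector
code in FOUR_ROUNDS_AND_SCHED:

    #include <stdint.h>

    #define ROR32(x, n)  (((x) >> (n)) | ((x) << (32 - (n))))
    #define CH(e, f, g)  (((e) & (f)) ^ (~(e) & (g)))
    #define MAJ(a, b, c) (((a) & (b)) ^ ((a) & (c)) ^ ((b) & (c)))
    #define BSIG0(x)     (ROR32((x), 2) ^ ROR32((x), 13) ^ ROR32((x), 22))
    #define BSIG1(x)     (ROR32((x), 6) ^ ROR32((x), 11) ^ ROR32((x), 25))

    /* Part 1: the T1 half, depending only on e..h and K[t] + W[t]. */
    static inline uint32_t
    sha256_round_part1 (uint32_t e, uint32_t f, uint32_t g, uint32_t h,
                        uint32_t k_plus_w)
    {
      return h + BSIG1 (e) + CH (e, f, g) + k_plus_w;
    }

    /* Part 2: the T2 half, depending only on a..c.  Afterwards the old
     * 'd' slot holds the new 'e' and the old 'h' slot holds the new 'a';
     * the remaining state words rotate by renaming in the caller. */
    static inline void
    sha256_round_part2 (uint32_t a, uint32_t b, uint32_t c, uint32_t *d,
                        uint32_t *h, uint32_t t1)
    {
      *d += t1;
      *h = t1 + BSIG0 (a) + MAJ (a, b, c);
    }

A scalar round would then be "t1 = sha256_round_part1 (e, f, g, h,
k_plus_w); sha256_round_part2 (a, b, c, &d, &h, t1);" with the state
words rotating one position per round; in the assembly the rotation is
done by renaming registers in the round macros rather than by moving
data. The SHA-512 split is analogous, on 64-bit words.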
Benchmark on Intel Haswell (4.0 GHz):
Before:
        |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA256 |      2.17 ns/B     439.0 MiB/s      8.68 c/B
 SHA512 |      1.56 ns/B     612.5 MiB/s      6.23 c/B

After (~4-6% faster):
        |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA256 |      2.05 ns/B     465.9 MiB/s      8.18 c/B
 SHA512 |      1.49 ns/B     640.3 MiB/s      5.95 c/B
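(For reference, the cycles/byte figures correspond to a
(8.68 - 8.18) / 8.68 ≈ 5.8% improvement for SHA-256 and a
(6.23 - 5.95) / 6.23 ≈ 4.5% improvement for SHA-512, consistent with
the ~4-6% range above.)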
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>