Add AVX2/BMI2 implementation of SHA1

* cipher/Makefile.am: Add 'sha1-avx2-bmi2-amd64.S'.
* cipher/hash-common.h (MD_BLOCK_CTX_BUFFER_SIZE): New.
(gcry_md_block_ctx): Change buffer length to MD_BLOCK_CTX_BUFFER_SIZE.
* cipher/sha1-avx-amd64.S: Add missing .size for transform function.
* cipher/sha1-ssse3-amd64.S: Add missing .size for transform function.
* cipher/sha1-avx-bmi2-amd64.S: Add missing .size for transform
function; Tweak implementation for small ~1% speed increase.
* cipher/sha1-avx2-bmi2-amd64.S: New.
* cipher/sha1.c (USE_AVX2, _gcry_sha1_transform_amd64_avx2_bmi2)
(do_sha1_transform_amd64_avx2_bmi2): New.
(sha1_init) [USE_AVX2]: Enable AVX2 implementation if supported by
HW features.
(sha1_final): Merge processing of two last blocks when extra block
is needed.

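The (sha1_final) change can be pictured with the rough sketch below. It is
not the code from cipher/sha1.c: it only shows standard SHA-1 padding written
so that one or two final blocks are prepared in a single buffer and can then
be passed to the transform in one call. The names sketch_sha1_pad and
SKETCH_BUFFER_SIZE are hypothetical, and the two-block buffer size is an
assumption about what MD_BLOCK_CTX_BUFFER_SIZE provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA1_BLOCK_SIZE 64
/* Assumption: the context buffer has room for two blocks. */
#define SKETCH_BUFFER_SIZE (2 * SHA1_BLOCK_SIZE)

/* Hypothetical helper.  BUF holds COUNT pending message bytes
   (COUNT < 64); TOTAL_BITS is the full message length in bits.
   Pads BUF in place and returns the number of 64-byte blocks
   (1 or 2) that the caller should hand to the transform.  */
static size_t
sketch_sha1_pad (unsigned char buf[SKETCH_BUFFER_SIZE], size_t count,
                 uint64_t total_bits)
{
  size_t nblks, end, i;

  assert (count < SHA1_BLOCK_SIZE);

  buf[count++] = 0x80;  /* mandatory padding marker */

  /* The 64-bit length must fit after the marker; if it does not,
     an extra block is needed and both blocks are padded together.  */
  nblks = (count > SHA1_BLOCK_SIZE - 8) ? 2 : 1;
  end = nblks * SHA1_BLOCK_SIZE;

  /* Zero fill up to the length field.  */
  memset (buf + count, 0, end - 8 - count);

  /* Append the message length in bits, big-endian.  */
  for (i = 0; i < 8; i++)
    buf[end - 8 + i] = (unsigned char)(total_bits >> (56 - 8 * i));

  return nblks;
}
```

With 55 or fewer pending bytes the helper returns 1; with 56 to 63 pending
bytes it returns 2, and both blocks sit back to back in the buffer so one
transform call with a block count of 2 can consume them.
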
Benchmarks on Intel Haswell (4.0 GHz):

Before (AVX/BMI2):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA1           |     0.970 ns/B     983.2 MiB/s      3.88 c/B

After (AVX/BMI2, ~1% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA1           |     0.960 ns/B     993.1 MiB/s      3.84 c/B

After (AVX2/BMI2, ~9% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA1           |     0.890 ns/B      1071 MiB/s      3.56 c/B
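
For reference, the three columns are related by simple arithmetic, sketched
below. This assumes the cycles/byte column is the ns/B figure scaled by the
stated 4.0 GHz clock, which matches the numbers shown. The function names are
hypothetical. The MiB/s values recomputed from the rounded ns/B figures agree
with the table only approximately, presumably because the table was produced
from unrounded timings.

```c
#include <assert.h>

/* cycles/byte = (nanoseconds/byte) * (cycles/nanosecond), where a
   clock of G GHz runs G cycles per nanosecond.  */
static double
sketch_cycles_per_byte (double ns_per_byte, double ghz)
{
  assert (ns_per_byte > 0.0 && ghz > 0.0);
  return ns_per_byte * ghz;
}

/* MiB/s = (bytes/second) / 2^20, with bytes/second = 1e9 / (ns/byte).  */
static double
sketch_mib_per_sec (double ns_per_byte)
{
  assert (ns_per_byte > 0.0);
  return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}
```

For the "Before" row, 0.970 ns/B at 4.0 GHz gives 3.88 c/B and roughly
983 MiB/s.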

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>