Optimizations for GCM Intel/PCLMUL implementation
* cipher/cipher-gcm-intel-pclmul.c (reduction): New.
(gfmul_pclmul): Include shifting to left into pclmul operations; use
'reduction' helper function.
[__x86_64__] (gfmul_pclmul_aggr4): Reorder instructions and adjust
register usage to free up registers; use 'reduction' helper function;
include shifting to left into pclmul operations; move loading of H
values and input from caller into this function.
[__x86_64__] (gfmul_pclmul_aggr8): New.
(gcm_lsh): New.
(_gcry_ghash_setup_intel_pclmul): Left shift H values by one; preserve
XMM6-XMM15 registers on WIN64.
(_gcry_ghash_intel_pclmul) [__x86_64__]: Use 8-block aggregated
reduction function.
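For reference, below is a rough C-intrinsics sketch of two of the
building blocks mentioned above: the per-block carry-less multiply that
the aggregated routines chain together, and the one-bit left shift of
the kind gcm_lsh applies to the H powers at setup time. It is
illustrative only; the names clmul_128 and lsh128_by1 are invented for
the sketch, and the real cipher-gcm-intel-pclmul.c code works with
inline assembly, GHASH's reflected bit/byte order and a Karatsuba-style
3-multiply form rather than this plain schoolbook version.

  #include <emmintrin.h>  /* SSE2 */
  #include <wmmintrin.h>  /* _mm_clmulepi64_si128 (PCLMUL) */

  /* Schoolbook 128x128 -> 256-bit carry-less multiply: the kind of
   * primitive issued once per block/H-power before a single shared
   * reduction folds the 256-bit sum back into GF(2^128). */
  static void clmul_128 (__m128i a, __m128i b, __m128i *lo, __m128i *hi)
  {
    __m128i a0b0 = _mm_clmulepi64_si128 (a, b, 0x00); /* a.lo x b.lo */
    __m128i a1b0 = _mm_clmulepi64_si128 (a, b, 0x01); /* a.hi x b.lo */
    __m128i a0b1 = _mm_clmulepi64_si128 (a, b, 0x10); /* a.lo x b.hi */
    __m128i a1b1 = _mm_clmulepi64_si128 (a, b, 0x11); /* a.hi x b.hi */
    __m128i mid  = _mm_xor_si128 (a1b0, a0b1);        /* middle terms */

    *lo = _mm_xor_si128 (a0b0, _mm_slli_si128 (mid, 8));
    *hi = _mm_xor_si128 (a1b1, _mm_srli_si128 (mid, 8));
  }

  /* 128-bit left shift by one bit with carry across the 64-bit
   * halves; doing this once per H power at setup lets the per-block
   * shift be folded into the pclmul operations in the hot loop. */
  static __m128i lsh128_by1 (__m128i x)
  {
    __m128i shifted = _mm_slli_epi64 (x, 1);   /* each qword << 1 */
    __m128i carry   = _mm_srli_epi64 (x, 63);  /* bit crossing qwords */

    return _mm_or_si128 (shifted, _mm_slli_si128 (carry, 8));
  }

The intent of both tricks is to keep the per-block work down to the
carry-less multiplications plus one shared reduction per aggregated
group of blocks.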
Benchmark on Intel Haswell (amd64):

Before:
                    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 GMAC_AES           |     0.206 ns/B      4624 MiB/s     0.825 c/B      3998

After (+50% faster):
                    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 GMAC_AES           |     0.137 ns/B      6953 MiB/s     0.548 c/B      3998
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>