Enable four block aggregated GCM Intel PCLMUL implementation on i386
* cipher/cipher-gcm-intel-pclmul.c (reduction): Change "%%xmm7" to
"%%xmm5".
(gfmul_pclmul_aggr4): Move outside [__x86_64__] block; remove usage of
XMM8-XMM15 registers; do not preload H-values and be_mask to reduce
register usage for i386.
(_gcry_ghash_setup_intel_pclmul): Enable calculation of H2, H3 and H4
on i386.
(_gcry_ghash_intel_pclmul): Adjust to above gfmul_pclmul_aggr4
changes; move 'aggr4' code path outside [__x86_64__] block.
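For reference, the aggregation that gfmul_pclmul_aggr4 implements folds
four GHASH blocks into a single reduction:

  Y <- (Y ^ X1)*H^4 ^ X2*H^3 ^ X3*H^2 ^ X4*H   (mod P)

Below is a minimal portable-C sketch of that algebra only; the helper
names, the "natural" bit ordering (bit i == coefficient of x^i) and the
test values are illustrative assumptions, not the patch itself, which
uses PCLMUL instructions and GHASH's bit-reflected representation:

/* Sketch of 4-block aggregated GHASH: four carry-less multiplications,
 * one reduction.  P(x) = x^128 + x^7 + x^2 + x + 1. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct { uint64_t w[2]; } elem;   /* 128-bit field element     */
typedef struct { uint64_t w[4]; } prod;   /* 256-bit unreduced product */

static void prod_xor (prod *r, const prod *a)
{ int i; for (i = 0; i < 4; i++) r->w[i] ^= a->w[i]; }

/* Carry-less 128x128->256 bit multiply, bit-by-bit (slow but clear). */
static prod clmul (elem a, elem b)
{
  prod r = { { 0, 0, 0, 0 } };
  prod t = { { a.w[0], a.w[1], 0, 0 } };  /* t = a * x^i */
  int i, j;

  for (i = 0; i < 128; i++)
    {
      if ((b.w[i / 64] >> (i % 64)) & 1)
        prod_xor (&r, &t);
      for (j = 3; j > 0; j--)             /* t <<= 1 over 256 bits */
        t.w[j] = (t.w[j] << 1) | (t.w[j - 1] >> 63);
      t.w[0] <<= 1;
    }
  return r;
}

/* Reduce a 256-bit product modulo P(x), folding bits 255..128 down
 * using x^128 == x^7 + x^2 + x + 1 (mod P). */
static elem reduce (prod p)
{
  static const int tap[4] = { 7, 2, 1, 0 };
  int i, t, bit;

  for (i = 255; i >= 128; i--)
    if ((p.w[i / 64] >> (i % 64)) & 1)
      {
        p.w[i / 64] ^= (uint64_t)1 << (i % 64);
        for (t = 0; t < 4; t++)
          {
            bit = i - 128 + tap[t];
            p.w[bit / 64] ^= (uint64_t)1 << (bit % 64);
          }
      }
  return (elem){ { p.w[0], p.w[1] } };
}

static elem gfmul (elem a, elem b) { return reduce (clmul (a, b)); }

/* Per-block update: Y <- (Y ^ X) * H, one reduction per block. */
static elem ghash_1 (elem y, elem x, elem h)
{ y.w[0] ^= x.w[0]; y.w[1] ^= x.w[1]; return gfmul (y, h); }

/* Aggregated update: Y <- (Y^X1)*H^4 ^ X2*H^3 ^ X3*H^2 ^ X4*H,
 * i.e. four multiplications but only one reduction. */
static elem ghash_aggr4 (elem y, const elem x[4],
                         elem h1, elem h2, elem h3, elem h4)
{
  elem y1 = { { y.w[0] ^ x[0].w[0], y.w[1] ^ x[0].w[1] } };
  prod acc = clmul (y1, h4), p;

  p = clmul (x[1], h3); prod_xor (&acc, &p);
  p = clmul (x[2], h2); prod_xor (&acc, &p);
  p = clmul (x[3], h1); prod_xor (&acc, &p);
  return reduce (acc);
}

int main (void)
{
  elem h = { { 0x0123456789abcdefULL, 0xfedcba9876543210ULL } };
  elem h2 = gfmul (h, h), h3 = gfmul (h2, h), h4 = gfmul (h3, h);
  elem x[4] = { { { 1, 2 } }, { { 3, 4 } }, { { 5, 6 } }, { { 7, 8 } } };
  elem a = { { 0, 0 } }, b;
  int i;

  for (i = 0; i < 4; i++)
    a = ghash_1 (a, x[i], h);
  b = ghash_aggr4 ((elem){ { 0, 0 } }, x, h, h2, h3, h4);

  printf ("aggr4 %s serial\n",
          memcmp (&a, &b, sizeof a) == 0 ? "matches" : "differs from");
  return 0;
}

Not preloading H1..H4 and be_mask into fixed registers keeps the working
set within XMM0-XMM7, which is all that is available on i386, while
x86-64 builds still benefit from the wider XMM8-XMM15 set.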
Benchmark on Intel Haswell (win32):

Before:
                    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 GMAC_AES           |     0.446 ns/B      2140 MiB/s      1.78 c/B      3998

After (~2.38x faster):
                    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 GMAC_AES           |     0.187 ns/B      5107 MiB/s     0.747 c/B      3998
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>