10-20X speed.
However, this Power 9 machine is faster than the one used for the last
Power 9 benchmarks of the optimized versions, so while the numbers are
better than with the last patch, not all of the gain is due to the code.
Before:
GCM enc  |  4.23 ns/B  225.3 MiB/s  - c/B
GCM dec  |  3.58 ns/B  266.2 MiB/s  - c/B
GCM auth |  3.34 ns/B  285.3 MiB/s  - c/B
After:
GCM enc  |  0.370 ns/B  2578 MiB/s  - c/B
GCM dec  |  0.371 ns/B  2571 MiB/s  - c/B
GCM auth |  0.159 ns/B  6003 MiB/s  - c/B
v2: avoid __int128, which is poorly optimized and, bizarrely, not available
in 32-bit addressing mode (our SIMD unit is 128 bits).
v3: properly credit Andy and CRYPTOGAMS (there was never any ill intent here, just FUD).
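
To illustrate the v2 note only (this is not the code from the patch): one
portable way to avoid __int128 is to carry a 128-bit value as two 64-bit
halves, which also compiles in 32-bit addressing mode. The type and helper
names below are hypothetical, and a real implementation would more likely
use the vector unit directly.

```c
#include <stdint.h>

/* Hypothetical type: a 128-bit block held as two 64-bit halves. */
typedef struct {
    uint64_t hi; /* most significant 64 bits */
    uint64_t lo; /* least significant 64 bits */
} u128_pair;

/* XOR of two 128-bit values, i.e. addition in GF(2^128). */
static u128_pair u128_xor(u128_pair a, u128_pair b)
{
    u128_pair r = { a.hi ^ b.hi, a.lo ^ b.lo };
    return r;
}

/* Shift right by one bit; the low bit of hi moves into the top bit of lo. */
static u128_pair u128_shr1(u128_pair a)
{
    u128_pair r;
    r.lo = (a.lo >> 1) | (a.hi << 63);
    r.hi = a.hi >> 1;
    return r;
}
```

Both helpers use only uint64_t arithmetic, so they do not depend on the
compiler providing a native 128-bit integer type.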