Use POWER8 and POWER9 ISA enhancements to improve the performance of SHA-2. Demonstrate why achieved performance is close to optimal for the platform. Optimized implementations in the Cryptogams repository[1] may serve as useful references. Financial bounty upon completion and community acceptance of patches.
Description
Revisions and Commits
| Status | Assigned | Task |
|---|---|---|
| Open | jukivili | T4460 libgcrypt performance TODOs |
| Resolved | jukivili | T4531 PowerPC performance improvements |
| Resolved | jukivili | T4530 libgcrypt: POWER SHA-2 Vector Acceleration |
Event Timeline
Please do not change the priority back without discussing this with the maintainer first. Thanks.
I'll start working on new PowerPC SHA-2 implementations for libgcrypt in the coming weeks.
Patches sent to the mailing list:
https://lists.gnupg.org/pipermail/gcrypt-devel/2019-August/004800.html
https://lists.gnupg.org/pipermail/gcrypt-devel/2019-August/004799.html
SHA256 results:
Benchmark on POWER8 ~3.8 GHz:
Before:
| nanosecs/byte mebibytes/sec cycles/byte
SHA256 | 4.17 ns/B 228.6 MiB/s 15.85 c/B
After (~1.63x faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHA256 | 2.55 ns/B 373.9 MiB/s 9.69 c/B
For comparison, OpenSSL 1.1.1b (~2.4% slower):
| nanosecs/byte mebibytes/sec cycles/byte
SHA256 | 2.61 ns/B 364.8 MiB/s 9.93 c/B
Benchmark on POWER9 ~3.8 GHz:
Before:
| nanosecs/byte mebibytes/sec cycles/byte
SHA256 | 3.23 ns/B 295.6 MiB/s 12.26 c/B
After (~1.03x faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHA256 | 3.11 ns/B 306.8 MiB/s 11.81 c/B
For comparison, OpenSSL 1.1.1b (~6.6% faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHA256 | 2.91 ns/B 327.5 MiB/s 11.07 c/B
SHA512 results:
Benchmark on POWER8 ~3.8 GHz:
Before:
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 3.47 ns/B 274.6 MiB/s 13.20 c/B
After (~2.08x faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 1.66 ns/B 573.1 MiB/s 6.32 c/B
For comparison, OpenSSL 1.1.1b (~1.6% faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 1.64 ns/B 582.2 MiB/s 6.22 c/B
Benchmark on POWER9 ~3.8 GHz:
Before:
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 2.65 ns/B 359.6 MiB/s 10.08 c/B
After (~1.33x faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 1.99 ns/B 479.2 MiB/s 7.56 c/B
For comparison, OpenSSL 1.1.1b (~9.4% faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 1.82 ns/B 524.4 MiB/s 6.91 c/B
I have not been able to get the Altivec/VSX intrinsics implementation to run fast on POWER9. It appears that SHA-2 vector acceleration gives diminishing returns on POWER9. For example, the OpenSSL assembly implementations using vshasigma(w|d) are only 6 to 10% faster than the optimized non-vector C implementations provided here, which is within the speed-up one would expect from turning these C implementations into assembly.
The PowerPC SHA-256 and SHA-512 implementations have been committed with a little more tuning. Most notably, SHA-512 on POWER8 now gives performance similar to OpenSSL:
Benchmark on POWER8 ~3.8 GHz:
Before:
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 3.47 ns/B 274.6 MiB/s 13.20 c/B
After (~2.1x faster):
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 1.64 ns/B 581.8 MiB/s 6.23 c/B
For comparison, OpenSSL 1.1.1b (~same):
| nanosecs/byte mebibytes/sec cycles/byte
SHA512 | 1.64 ns/B 582.2 MiB/s 6.22 c/B