Use POWER8 and POWER9 ISA enhancements to improve the performance of SHA-2. Demonstrate why achieved performance is close to optimal for the platform. Optimized implementations in the Cryptogams repository[1] may serve as useful references. Financial bounty upon completion and community acceptance of patches.
Status | Assigned | Task
---|---|---
Open | jukivili | T4460 libgcrypt performance TODOs
Resolved | jukivili | T4531 PowerPC performance improvements
Resolved | jukivili | T4530 libgcrypt: POWER SHA-2 Vector Acceleration
Event Timeline
Please do not change the priority back without discussing this with the maintainer first. Thanks.
I'll start working on new PowerPC SHA2 implementations for libgcrypt in the coming weeks.
Patches sent to the mailing list:
https://lists.gnupg.org/pipermail/gcrypt-devel/2019-August/004800.html
https://lists.gnupg.org/pipermail/gcrypt-devel/2019-August/004799.html
SHA256 results:
Benchmark on POWER8 ~3.8GHz:

SHA256 | nanosecs/byte | mebibytes/sec | cycles/byte
---|---|---|---
Before | 4.17 ns/B | 228.6 MiB/s | 15.85 c/B
After (~1.63x faster) | 2.55 ns/B | 373.9 MiB/s | 9.69 c/B
OpenSSL 1.1.1b, for comparison (~2.4% slower) | 2.61 ns/B | 364.8 MiB/s | 9.93 c/B

Benchmark on POWER9 ~3.8GHz:

SHA256 | nanosecs/byte | mebibytes/sec | cycles/byte
---|---|---|---
Before | 3.23 ns/B | 295.6 MiB/s | 12.26 c/B
After (~1.03x faster) | 3.11 ns/B | 306.8 MiB/s | 11.81 c/B
OpenSSL 1.1.1b, for comparison (~6.6% faster) | 2.91 ns/B | 327.5 MiB/s | 11.07 c/B
SHA512 results:
Benchmark on POWER8 ~3.8GHz:

SHA512 | nanosecs/byte | mebibytes/sec | cycles/byte
---|---|---|---
Before | 3.47 ns/B | 274.6 MiB/s | 13.20 c/B
After (~2.08x faster) | 1.66 ns/B | 573.1 MiB/s | 6.32 c/B
OpenSSL 1.1.1b, for comparison (~1.6% faster) | 1.64 ns/B | 582.2 MiB/s | 6.22 c/B

Benchmark on POWER9 ~3.8GHz:

SHA512 | nanosecs/byte | mebibytes/sec | cycles/byte
---|---|---|---
Before | 2.65 ns/B | 359.6 MiB/s | 10.08 c/B
After (~1.33x faster) | 1.99 ns/B | 479.2 MiB/s | 7.56 c/B
OpenSSL 1.1.1b, for comparison (~9.4% faster) | 1.82 ns/B | 524.4 MiB/s | 6.91 c/B
I have not been able to get an AltiVec/VSX intrinsics implementation to run fast on POWER9. It appears that SHA2 vector acceleration gives diminishing returns on POWER9. For example, OpenSSL's assembly vshasigma(w|d) implementations are only 6 to 10% faster than the optimized non-vector C implementations provided here, which is within the speed-up one would expect from turning these C implementations into assembly.
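For readers unfamiliar with what vshasigma(w|d) accelerates: as far as I understand the Power ISA, vshasigmaw evaluates one of the four SHA-256 sigma functions on each 32-bit element of a vector register, and vshasigmad does the same for the SHA-512 variants on 64-bit elements. The scalar equivalents for SHA-256, as defined in FIPS 180-4, can be sketched as below. This is an illustrative sketch only, not the code from the patches; the function names are made up for this example.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (valid for 0 < n < 32). */
static inline uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* "Big sigma" functions, used in the SHA-256 round function. */
static inline uint32_t sha256_Sum0(uint32_t x)
{
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22);
}

static inline uint32_t sha256_Sum1(uint32_t x)
{
    return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25);
}

/* "Small sigma" functions, used in the message schedule expansion. */
static inline uint32_t sha256_sigma0(uint32_t x)
{
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
}

static inline uint32_t sha256_sigma1(uint32_t x)
{
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10);
}
```

Each function is three rotates or shifts plus two XORs. A non-vector implementation spends several scalar instructions per sigma evaluation, which is the work the vector instruction folds into one operation.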
PowerPC SHA-256 and SHA-512 implementations, with a little more tuning, have been committed. Most notably, SHA-512 on POWER8 now performs on par with OpenSSL:
Benchmark on POWER8 ~3.8GHz:

SHA512 | nanosecs/byte | mebibytes/sec | cycles/byte
---|---|---|---
Before | 3.47 ns/B | 274.6 MiB/s | 13.20 c/B
After (~2.1x faster) | 1.64 ns/B | 581.8 MiB/s | 6.23 c/B
OpenSSL 1.1.1b, for comparison (~same) | 1.64 ns/B | 582.2 MiB/s | 6.22 c/B
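As a reading aid for the benchmark figures in this task, the three columns are related through the clock frequency. The sketch below shows the arithmetic, assuming the ~3.8 GHz clock stated in the results. The helper names are hypothetical and are not part of libgcrypt's benchmark tool.

```c
/* Cycles per byte from nanoseconds per byte, for a clock given in GHz.
 * One nanosecond at f GHz is f cycles. */
static double cycles_per_byte(double ns_per_byte, double clock_ghz)
{
    return ns_per_byte * clock_ghz;
}

/* Throughput in MiB/s from nanoseconds per byte (1 MiB = 2^20 bytes). */
static double mib_per_sec(double ns_per_byte)
{
    return 1e9 / ns_per_byte / (1024.0 * 1024.0);
}

/* Speed-up factor of "after" over "before", both given in ns/byte. */
static double speedup(double before_ns, double after_ns)
{
    return before_ns / after_ns;
}
```

For example, `cycles_per_byte(4.17, 3.8)` gives about 15.85 c/B and `speedup(4.17, 2.55)` gives about 1.63, in line with the SHA256 POWER8 figures. Small differences in the last digit are expected because the reported ns/B values are already rounded.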