libgcrypt: POWER SHA-2 Vector Acceleration
Closed, Resolved (Public)

Description

Use POWER8 and POWER9 ISA enhancements to improve the performance of SHA-2. Demonstrate why achieved performance is close to optimal for the platform. Optimized implementations in the Cryptogams repository[1] may serve as useful references. Financial bounty upon completion and community acceptance of patches.

[1] https://github.com/dot-asm/cryptogams/

gcwilson created this task. May 20 2019, 7:04 PM
werner renamed this task from [$] libgcrypt: POWER SHA-2 Vector Acceleration to libgcrypt: POWER SHA-2 Vector Acceleration. May 21 2019, 7:52 AM
werner triaged this task as Normal priority.
johnmar raised the priority of this task from Normal to Needs Triage. Jul 15 2019, 9:09 PM
werner triaged this task as Normal priority. Jul 16 2019, 8:31 AM
werner added a subscriber: werner.

Please do not change the priority back without discussing this with the maintainer first. Thanks.

jukivili claimed this task. Aug 25 2019, 6:11 PM
jukivili added a subscriber: jukivili.

I'll start working on new PowerPC SHA-2 implementations for libgcrypt in the coming weeks.

jukivili added a comment. Edited Aug 31 2019, 2:07 AM

Patches sent to the mailing list:
https://lists.gnupg.org/pipermail/gcrypt-devel/2019-August/004800.html
https://lists.gnupg.org/pipermail/gcrypt-devel/2019-August/004799.html

SHA256 results:

Benchmark on POWER8 ~3.8 GHz:
 Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA256         |      4.17 ns/B     228.6 MiB/s     15.85 c/B

 After (~1.63x faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA256         |      2.55 ns/B     373.9 MiB/s      9.69 c/B

 For comparison, OpenSSL 1.1.1b (~2.4% slower):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA256         |      2.61 ns/B     364.8 MiB/s      9.93 c/B


Benchmark on POWER9 ~3.8 GHz:
 Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA256         |      3.23 ns/B     295.6 MiB/s     12.26 c/B

 After (~1.03x faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA256         |      3.11 ns/B     306.8 MiB/s     11.81 c/B

 For comparison, OpenSSL 1.1.1b (~6.6% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA256         |      2.91 ns/B     327.5 MiB/s     11.07 c/B

SHA512 results:

Benchmark on POWER8 ~3.8 GHz:
 Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA512         |      3.47 ns/B     274.6 MiB/s     13.20 c/B

 After (~2.08x faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA512         |      1.66 ns/B     573.1 MiB/s      6.32 c/B

 For comparison, OpenSSL 1.1.1b (~1.6% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA512         |      1.64 ns/B     582.2 MiB/s      6.22 c/B


Benchmark on POWER9 ~3.8 GHz:
 Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA512         |      2.65 ns/B     359.6 MiB/s     10.08 c/B

 After (~1.33x faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA512         |      1.99 ns/B     479.2 MiB/s      7.56 c/B

 For comparison, OpenSSL 1.1.1b (~9.4% faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA512         |      1.82 ns/B     524.4 MiB/s      6.91 c/B
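The three columns in the tables above are related by simple arithmetic, and the quoted speed-ups are ratios of the cycles/byte (or nanosecs/byte) figures. The following is a minimal sketch of that arithmetic, not part of the patches. The function names are hypothetical, and it assumes the "~3.8 GHz" clock in the headings can be treated as exactly 3.8 GHz, so results only agree with the tables to within rounding.

```python
# Sketch: how the benchmark columns relate to each other.
# Assumption: the "~3.8 GHz" clock is taken as exactly 3.8 GHz.
CLOCK_GHZ = 3.8


def cycles_per_byte(ns_per_byte, clock_ghz=CLOCK_GHZ):
    """One nanosecond at f GHz is f clock cycles."""
    return ns_per_byte * clock_ghz


def mib_per_sec(ns_per_byte):
    """Bytes per second (1e9 / ns per byte) expressed in MiB (2**20 bytes)."""
    return 1e9 / ns_per_byte / 2**20


def speedup(before, after):
    """Ratio of a 'before' cost to an 'after' cost (c/B or ns/B)."""
    return before / after


if __name__ == "__main__":
    # SHA256 on POWER8, "Before" row: 4.17 ns/B
    print(f"{cycles_per_byte(4.17):.2f} c/B")
    print(f"{mib_per_sec(4.17):.1f} MiB/s")
    # SHA256 on POWER8, before vs. after in cycles/byte
    print(f"{speedup(15.85, 9.69):.2f}x")
```

Small differences from the table values are expected, because the tables print nanosecs/byte rounded to two decimals.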

I have not been able to get an Altivec/VSX intrinsics implementation to run fast on POWER9. SHA-2 vector acceleration appears to give diminishing returns on POWER9. For example, the OpenSSL assembly implementations using vshasigma(w|d) are only 6 to 10% faster than the optimized non-vector C implementations provided here, which is about the speed-up one would expect from rewriting these C implementations in assembly.
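For readers unfamiliar with the vshasigma(w|d) instructions mentioned above: as far as I understand the Power ISA, they compute the SHA-2 sigma functions on each vector element, which a non-vector implementation builds from scalar rotates, shifts and XORs. The sketch below shows the four SHA-256 (32-bit word) functions as defined in FIPS 180-4. It is written in Python purely for illustration and is not the libgcrypt code; all function names are hypothetical.

```python
# Sketch: the SHA-256 sigma functions (FIPS 180-4) in scalar form.
# Assumption: these are the per-word operations that vshasigmaw performs
# in hardware; the names below are illustrative, not libgcrypt's.
MASK32 = 0xFFFFFFFF


def rotr32(x, n):
    """Rotate a 32-bit value right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def small_sigma0(x):
    """Message schedule function: ROTR7 ^ ROTR18 ^ SHR3."""
    return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3)


def small_sigma1(x):
    """Message schedule function: ROTR17 ^ ROTR19 ^ SHR10."""
    return rotr32(x, 17) ^ rotr32(x, 19) ^ (x >> 10)


def big_sigma0(x):
    """Round function on working variable a: ROTR2 ^ ROTR13 ^ ROTR22."""
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22)


def big_sigma1(x):
    """Round function on working variable e: ROTR6 ^ ROTR11 ^ ROTR25."""
    return rotr32(x, 6) ^ rotr32(x, 11) ^ rotr32(x, 25)


if __name__ == "__main__":
    for f in (small_sigma0, small_sigma1, big_sigma0, big_sigma1):
        print(f.__name__, hex(f(0x00000001)))
```

Each function is three rotate/shift operations plus two XORs in scalar code, so the dedicated instruction replaces only a handful of cheap operations per call.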

jukivili closed this task as Resolved. Sep 3 2019, 9:38 PM

The PowerPC SHA-256 and SHA-512 implementations have been committed with a little more tuning. Most notably, SHA-512 on POWER8 now performs on par with OpenSSL:

Benchmark on POWER8 ~3.8 GHz:
 Before:
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA512         |      3.47 ns/B     274.6 MiB/s     13.20 c/B

 After (~2.1x faster):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA512         |      1.64 ns/B     581.8 MiB/s      6.23 c/B

 For comparison, OpenSSL 1.1.1b (~same):
                |  nanosecs/byte   mebibytes/sec   cycles/byte
 SHA512         |      1.64 ns/B     582.2 MiB/s      6.22 c/B