VPMSUMD accelleration for GCM mode on PPC
Needs ReviewPublic

Authored by slandden on Apr 14 2020, 2:56 PM.

Details

Reviewers
jukivili
Summary

See T4630

10X speed.

Before:

 GCM enc |      4.20 ns/B     226.9 MiB/s      7.97 c/B      1895
 GCM dec |      3.57 ns/B     267.5 MiB/s      6.76 c/B      1895
GCM auth |      3.34 ns/B     285.7 MiB/s      6.33 c/B      1895

After:

 GCM enc |     0.524 ns/B      1818 MiB/s     0.995 c/B      1897
 GCM dec |     0.525 ns/B      1816 MiB/s     0.996 c/B      1897
GCM auth |     0.328 ns/B      2909 MiB/s     0.622 c/B      1897
Test Plan

Is integrated into tests, and added a new test

Diff Detail

Repository
rC libgcrypt
Lint
Lint Skipped
Unit
Unit Tests Skipped
slandden created this revision.Apr 14 2020, 2:56 PM
jukivili edited reviewers, added: jukivili; removed: jwilk.Apr 14 2020, 9:49 PM

Generally nice looking patch and great improvement for performance.

  • For coding style, I'll just remind that libgcrypt follows GNU C style.
  • While use of "vector unsigned __int128" simplifies implementation a bit, it's use causes some problems...
  • 32-bit targets does not support __int128, thus source does not compile on plain old powerpc:
../../cipher/cipher-gcm.c: In function 'ghash_ppc_vpmsum':
../../cipher/cipher-gcm.c:111:21: error: '__int128' is not supported on this target
  111 |   volatile unsigned __int128 *where = (unsigned __int128*)result;
      |                     ^~~~~~~~
  • Code generation for shifting vector unsigned __int128 is not great. For example, rotating "H3 << 64 | H3 >> 64" causes compiler to store vector to stack, and from there two halfs are loaded to scalar registers, then moved to two vector registers and those vectors registers are then merged. About like this:
xxswapd vs32,vs32
addi    r9,r1,-16
stxvd2x vs32,0,r9
ld      r8,-8(r1)
ld      r9,-16(r1)
mtvsrd  vs1,r8
mtvsrd  vs0,r9
xxmrghd vs34,vs0,vs1

This sequence could be replaced with single xxswapd instruction. Left and right shift on vector unsigned __int128 are similar expect zero is loaded to r8 or r9, while operation could be performed with single xxmrghd or xxmrghl instruction. See crc-ppc.c for asm_swap_u64 which swaps upper and lower parts of vector.

  • So overall, it is better to not use vector unsigned __int128 but use vector unsigned long long instead.
  • Code does not compile on big-endian powerpc64 target. Either defined(WORDS_BIGENDIAN) needs to be added to the #ifdef or big-endian support checked.
  • __attribute__((optimize(0))) attributes and volatile pointer in cipher-gcm.c probably are some left overs from debugging. Those should be removed.
  • Copyright header does not look like LGPL2.1.

For results, note that --cpu-mhz auto does not quite work on PowerPC. Automatic Mhz detection has been tested on x86 and few ARM processors and it is not robust enough for all microarchitectures. I've seen it gets wrong CPU Mhz speeds on PowerPC8/9 and for example on AWS's Graviton ARM processors.

Generally nice looking patch and great improvement for performance.

For coding style, I'll just remind that libgcrypt follows GNU C style.

While use of "vector unsigned __int128" simplifies implementation a bit, it's use causes some problems...

32-bit targets does not support __int128, thus source does not compile on plain old powerpc:

../../cipher/cipher-gcm.c: In function 'ghash_ppc_vpmsum':
../../cipher/cipher-gcm.c:111:21: error: '__int128' is not supported on this target

111 |   volatile unsigned __int128 *where = (unsigned __int128*)result;
    |                     ^~~~~~~~

  Code generation for shifting vector unsigned __int128 is not great. For example, rotating "H3 << 64 | H3 >> 64" causes compiler to store vector to stack, and from there two halfs are loaded to scalar registers, then moved to two vector registers and those vectors registers are then merged. About like this:

xxswapd vs32,vs32
addi r9,r1,-16
stxvd2x vs32,0,r9
ld r8,-8(r1)
ld r9,-16(r1)
mtvsrd vs1,r8
mtvsrd vs0,r9
xxmrghd vs34,vs0,vs1

I opened bug reports for these:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94687
Also one for LLVM.

This sequence could be replaced with single xxswapd instruction. Left and right shift on vector unsigned __int128 are similar expect zero is loaded to r8 or r9, while operation could be performed with single xxmrghd or xxmrghl instruction. See crc-ppc.c for asm_swap_u64 which swaps upper and lower parts of vector.

So overall, it is better to not use vector unsigned __int128 but use vector unsigned long long instead.

But I need it to shift H one to the left:

H1 = H << one;

And the compiler not being able to compile this would be really sad.

Code does not compile on big-endian powerpc64 target. Either defined(WORDS_BIGENDIAN) needs to be added to the #ifdef or big-endian support checked.

Minicloud has been sending me e-mails that my account has been canceled months after unexpectedly deleting my account. I am unable to test on big-endian for this reason.

__attribute__((optimize(0))) attributes and volatile pointer in cipher-gcm.c probably are some left overs from debugging. Those should be removed.

Fixed

Copyright header does not look like LGPL2.1.

Fixed

For results, note that --cpu-mhz auto does not quite work on PowerPC. Automatic Mhz detection has been tested on x86 and few ARM processors and it is not robust enough for all microarchitectures. I've seen it gets wrong CPU Mhz speeds on PowerPC8/9 and for example on AWS's Graviton ARM processors.

Yes it works. my cpu is 1.9Ghz

The compiler is suppose to do things for you, and vector __int128 is those things it is suppose to do for you.....and it works well enough, and that bug can get fixed. I am not particularly attached, but I don't get why the compiler cannot be fixed, and that not be acceptable, and just having on the most important architecture until it can be fixed.

slandden updated this revision to Diff 1425.Apr 27 2020, 2:24 PM

described in previous comment. Mostly cosmetic

slandden updated this revision to Diff 1426.Apr 27 2020, 2:32 PM

remove <<64 | >> 64 which has poor codegen.

In D501#4558, @slandden wrote:

But I need it to shift H one to the left:

H1 = H << one;

You could use asm_add_uint128() from rijndael-ppc-common.h. "H1 = H << 1 => H1 = H * 2 => H1 = H + H".

And the compiler not being able to compile this would be really sad.

Code does not compile on big-endian powerpc64 target. Either defined(WORDS_BIGENDIAN) needs to be added to the #ifdef or big-endian support checked.

Minicloud has been sending me e-mails that my account has been canceled months after unexpectedly deleting my account. I am unable to test on big-endian for this reason.

Minicloud appears to cancel all account after few months and then one needs to reapply for access.

__attribute__((optimize(0))) attributes and volatile pointer in cipher-gcm.c probably are some left overs from debugging. Those should be removed.

Fixed

Thanks.

Copyright header does not look like LGPL2.1.

Fixed

Thanks.

For results, note that --cpu-mhz auto does not quite work on PowerPC. Automatic Mhz detection has been tested on x86 and few ARM processors and it is not robust enough for all microarchitectures. I've seen it gets wrong CPU Mhz speeds on PowerPC8/9 and for example on AWS's Graviton ARM processors.

Yes it works. my cpu is 1.9Ghz

The compiler is suppose to do things for you, and vector __int128 is those things it is suppose to do for you.....and it works well enough, and that bug can get fixed. I am not particularly attached, but I don't get why the compiler cannot be fixed, and that not be acceptable, and just having on the most important architecture until it can be fixed.

It would be nice if compiler could generate fast code for these, but I think it is better to get implementation run fast with current compilers even if it means using inline assembly and maybe later use vector __int128 when support gets improved.