VPMSUMD accelleration for GCM mode on PPC
Needs ReviewPublic

Authored by slandden on Apr 14 2020, 2:56 PM.

Details

Reviewers
jukivili
Summary

10-20X speed.

However this Power 9 machine is faster than the last Power 9 benchmarks
on the optimized versions,
so while better than the last patch, it is not all due to the code.

Before:

 GCM enc |      4.23 ns/B     225.3 MiB/s         - c/B
 GCM dec |      3.58 ns/B     266.2 MiB/s         - c/B
GCM auth |      3.34 ns/B     285.3 MiB/s         - c/B

After:

 GCM enc |     0.370 ns/B      2578 MiB/s         - c/B
 GCM dec |     0.371 ns/B      2571 MiB/s         - c/B
GCM auth |     0.159 ns/B      6003 MiB/s         - c/B

v2: avoid __int128 which is poorely optimized, and bizarrely not available

in 32-bit addressing mode (our SIMD unit is 128 bits).

v3: properly credit Andy and Cryptograms (there was never mal-intent here, just FUD).

Test Plan

Is integrated into tests, and added a new test

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped
slandden created this revision.Apr 14 2020, 2:56 PM
jukivili edited reviewers, added: jukivili; removed: jwilk.Apr 14 2020, 9:49 PM

Generally nice looking patch and great improvement for performance.

  • For coding style, I'll just remind that libgcrypt follows GNU C style.
  • While use of "vector unsigned __int128" simplifies implementation a bit, it's use causes some problems...
  • 32-bit targets does not support __int128, thus source does not compile on plain old powerpc:
../../cipher/cipher-gcm.c: In function 'ghash_ppc_vpmsum':
../../cipher/cipher-gcm.c:111:21: error: '__int128' is not supported on this target
  111 |   volatile unsigned __int128 *where = (unsigned __int128*)result;
      |                     ^~~~~~~~
  • Code generation for shifting vector unsigned __int128 is not great. For example, rotating "H3 << 64 | H3 >> 64" causes compiler to store vector to stack, and from there two halfs are loaded to scalar registers, then moved to two vector registers and those vectors registers are then merged. About like this:
xxswapd vs32,vs32
addi    r9,r1,-16
stxvd2x vs32,0,r9
ld      r8,-8(r1)
ld      r9,-16(r1)
mtvsrd  vs1,r8
mtvsrd  vs0,r9
xxmrghd vs34,vs0,vs1

This sequence could be replaced with single xxswapd instruction. Left and right shift on vector unsigned __int128 are similar expect zero is loaded to r8 or r9, while operation could be performed with single xxmrghd or xxmrghl instruction. See crc-ppc.c for asm_swap_u64 which swaps upper and lower parts of vector.

  • So overall, it is better to not use vector unsigned __int128 but use vector unsigned long long instead.
  • Code does not compile on big-endian powerpc64 target. Either defined(WORDS_BIGENDIAN) needs to be added to the #ifdef or big-endian support checked.
  • __attribute__((optimize(0))) attributes and volatile pointer in cipher-gcm.c probably are some left overs from debugging. Those should be removed.
  • Copyright header does not look like LGPL2.1.

For results, note that --cpu-mhz auto does not quite work on PowerPC. Automatic Mhz detection has been tested on x86 and few ARM processors and it is not robust enough for all microarchitectures. I've seen it gets wrong CPU Mhz speeds on PowerPC8/9 and for example on AWS's Graviton ARM processors.

Generally nice looking patch and great improvement for performance.

For coding style, I'll just remind that libgcrypt follows GNU C style.

While use of "vector unsigned __int128" simplifies implementation a bit, it's use causes some problems...

32-bit targets does not support __int128, thus source does not compile on plain old powerpc:

../../cipher/cipher-gcm.c: In function 'ghash_ppc_vpmsum':
../../cipher/cipher-gcm.c:111:21: error: '__int128' is not supported on this target

111 |   volatile unsigned __int128 *where = (unsigned __int128*)result;
    |                     ^~~~~~~~

  Code generation for shifting vector unsigned __int128 is not great. For example, rotating "H3 << 64 | H3 >> 64" causes compiler to store vector to stack, and from there two halfs are loaded to scalar registers, then moved to two vector registers and those vectors registers are then merged. About like this:

xxswapd vs32,vs32
addi r9,r1,-16
stxvd2x vs32,0,r9
ld r8,-8(r1)
ld r9,-16(r1)
mtvsrd vs1,r8
mtvsrd vs0,r9
xxmrghd vs34,vs0,vs1

I opened bug reports for these:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94687
Also one for LLVM.

This sequence could be replaced with single xxswapd instruction. Left and right shift on vector unsigned __int128 are similar expect zero is loaded to r8 or r9, while operation could be performed with single xxmrghd or xxmrghl instruction. See crc-ppc.c for asm_swap_u64 which swaps upper and lower parts of vector.

So overall, it is better to not use vector unsigned __int128 but use vector unsigned long long instead.

But I need it to shift H one to the left:

H1 = H << one;

And the compiler not being able to compile this would be really sad.

Code does not compile on big-endian powerpc64 target. Either defined(WORDS_BIGENDIAN) needs to be added to the #ifdef or big-endian support checked.

Minicloud has been sending me e-mails that my account has been canceled months after unexpectedly deleting my account. I am unable to test on big-endian for this reason.

__attribute__((optimize(0))) attributes and volatile pointer in cipher-gcm.c probably are some left overs from debugging. Those should be removed.

Fixed

Copyright header does not look like LGPL2.1.

Fixed

For results, note that --cpu-mhz auto does not quite work on PowerPC. Automatic Mhz detection has been tested on x86 and few ARM processors and it is not robust enough for all microarchitectures. I've seen it gets wrong CPU Mhz speeds on PowerPC8/9 and for example on AWS's Graviton ARM processors.

Yes it works. my cpu is 1.9Ghz

The compiler is suppose to do things for you, and vector __int128 is those things it is suppose to do for you.....and it works well enough, and that bug can get fixed. I am not particularly attached, but I don't get why the compiler cannot be fixed, and that not be acceptable, and just having on the most important architecture until it can be fixed.

slandden updated this revision to Diff 1425.Apr 27 2020, 2:24 PM

described in previous comment. Mostly cosmetic

slandden updated this revision to Diff 1426.Apr 27 2020, 2:32 PM

remove <<64 | >> 64 which has poor codegen.

In D501#4558, @slandden wrote:

But I need it to shift H one to the left:

H1 = H << one;

You could use asm_add_uint128() from rijndael-ppc-common.h. "H1 = H << 1 => H1 = H * 2 => H1 = H + H".

And the compiler not being able to compile this would be really sad.

Code does not compile on big-endian powerpc64 target. Either defined(WORDS_BIGENDIAN) needs to be added to the #ifdef or big-endian support checked.

Minicloud has been sending me e-mails that my account has been canceled months after unexpectedly deleting my account. I am unable to test on big-endian for this reason.

Minicloud appears to cancel all account after few months and then one needs to reapply for access.

__attribute__((optimize(0))) attributes and volatile pointer in cipher-gcm.c probably are some left overs from debugging. Those should be removed.

Fixed

Thanks.

Copyright header does not look like LGPL2.1.

Fixed

Thanks.

For results, note that --cpu-mhz auto does not quite work on PowerPC. Automatic Mhz detection has been tested on x86 and few ARM processors and it is not robust enough for all microarchitectures. I've seen it gets wrong CPU Mhz speeds on PowerPC8/9 and for example on AWS's Graviton ARM processors.

Yes it works. my cpu is 1.9Ghz

The compiler is suppose to do things for you, and vector __int128 is those things it is suppose to do for you.....and it works well enough, and that bug can get fixed. I am not particularly attached, but I don't get why the compiler cannot be fixed, and that not be acceptable, and just having on the most important architecture until it can be fixed.

It would be nice if compiler could generate fast code for these, but I think it is better to get implementation run fast with current compilers even if it means using inline assembly and maybe later use vector __int128 when support gets improved.

slandden updated this revision to Diff 1431.Jun 16 2020, 2:57 PM

Switching to assembly for the shifts made a significant speed-up. As Minicloud is seemingly broken (can't open up ssh port) I cannot test on 64-bit big-endian or 32-bit and have thus made it 64le-only.

slandden edited the summary of this revision. (Show Details)Jun 16 2020, 2:57 PM

Thanks for the new version. Unfortunately Minicloud seems to be down and therefore cannot test patch at the moment. I'll take look when I regain power64 access.

Just one question at the moment.

Is this CRYPTOGAMS ghashp8-ppc.pl assembly converted to intrinsics (gcm_init_p8 + gcm_ghash_p8/Lshort combined with gcm_ghash_p8_4x)?

Because it would need proper copyright information from the original.

I am using Duff's device which is not in that version (and makes it considerably simpler), but it certainly is influenced by that version (and the preprocessing of the table taking advantage of the communicative nature of carryless multiplication is novel in that version), and I can add a note to that effect.

Minicloud is up. I found that the altivec flags are never passed when libgcrypt is compiled big-endian.

When I took side-by-side comparison of cryptogams version to this patch, what I find is that they are strikingly similar. Operation/instruction ordering matches closely to parts of ghashp8-ppc.pl. In many parts variable/register names are the same also.



So, things I see are needed to be done for inclusion of this patch are:

  1. GNU C coding style fixes.
  2. Adding comment about that this implementation is based on GHASH implementation by Andy Polyakov with original license. This needs to be checked with @werner , but I think following would be sufficient:
/*
 * Based on GHASH implementation by Andy Polyakov from CRYPTOGAMS
 * distribution (ppc/ghashp8-ppc.pl).
 *
 * Original copyright license follows:
 *
 * Copyright (c) 2006, CRYPTOGAMS by <appro@openssl.org>
 * All rights reserved.
 * 
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 
 *       * Redistributions of source code must retain copyright notices,
 *         this list of conditions and the following disclaimer.
 * 
 *       * Redistributions in binary form must reproduce the above
 *         copyright notice, this list of conditions and the following
 *         disclaimer in the documentation and/or other materials
 *         provided with the distribution.
 * 
 *       * Neither the name of the CRYPTOGAMS nor the names of its
 *         copyright holder and contributors may be used to endorse or
 *         promote products derived from this software without specific
 *         prior written permission.
 * 
 * ALTERNATIVELY, provided that this notice is retained in full, this
 * product may be distributed under the terms of the GNU General Public
 * License (GPL), in which case the provisions of the GPL apply INSTEAD OF
 * those given above.
 * 
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

We need to clarify two things:

  1. What is "CRYTOGRAMS by <OPPENSSL-address>". A copyright notice requires an identifiable natural or legal person tobe behind it,. So what is CRYPTOGRAMS an how is it related to OpenSSL Project.?
  1. The OpenSSL Project still has the advertising clause which is incompatible even with the LGPL. I recall that the first patch on the ML showed the advertising clause along with the CRYPTOGRAMS name.

To clarify this I would suggest that we contact Andy Polyakov and check whether he was the sole author and on which grounds the advertising clause was removed.

BTW, I rarly notice comments in Differential unless there is a releated open ticket. Do we have one?

If we can use the code please first commit the original code to the repo and only then apply code style fixes.

https://github.com/dot-asm/cryptogams/blob/master/LICENSE

and Andy is the sole author, and he even told me personally by e-mail this
a long time ago when I was interested in the libcrypt library of glibc is .
He also licensed cryptogams for the Linux kernel (because of WireGuard)
however that did not make it into the version the version that was merged
(some of his code is already there, and IIRC includes the ghash at issue
here).

Copyright (c) 2006, CRYPTOGAMS by <appro@openssl.org>
All rights reserved.

Is this good for the legal person requirement?

And I can work on the other issues when the legal one is made clear. Sorry
for being pugnacious.

slandden updated this revision to Diff 1439.Oct 7 2020, 2:54 PM
slandden edited the summary of this revision. (Show Details)

v2: avoid __int128 which is poorely optimized, and bizarrely not available

in 32-bit addressing mode (our SIMD unit is 128 bits).

v3: properly credit Andy and Cryptograms (there was never mal-intent here, just FUD).