
Optimize Chacha20 and Poly1305 for PPC P10 LE
Closed, Resolved · Public

Description

Support 8-block (64-byte blocks) unrolling for chacha20 and 4-block (16-byte blocks) unrolling for poly1305.


Event Timeline

dannytsen created this object in space S1 Public.

Files affected:

configure.ac - Added chacha20 and poly1305 assembly implementations.
cipher/chacha20-p10le-8x.s (New) - Supports 8-block (512-byte) unrolling.
cipher/poly1305-p10le.s (New) - Supports 4-block (128-byte) unrolling.
cipher/Makefile.am - Added the new chacha20 and poly1305 files.
cipher/chacha20.c - Added PPC P10 LE support for 8x chacha20.
cipher/poly1305.c - Added PPC P10 LE support for 4x poly1305.

Performance is improved. Note that bench-slope needs to be compiled with the -O flag; with -O2 it does not produce any output.

Before

 CHACHA20     |  nanosecs/byte   mebibytes/sec   cycles/byte
   STREAM enc |     0.587 ns/B      1625 MiB/s         - c/B
   STREAM dec |     0.587 ns/B      1625 MiB/s         - c/B
 POLY1305 enc |     0.645 ns/B      1479 MiB/s         - c/B
 POLY1305 dec |     0.634 ns/B      1504 MiB/s         - c/B
POLY1305 auth |     0.414 ns/B      2304 MiB/s         - c/B

After (~66% improvement)

 CHACHA20     |  nanosecs/byte   mebibytes/sec   cycles/byte
   STREAM enc |     0.352 ns/B      2707 MiB/s         - c/B
   STREAM dec |     0.352 ns/B      2709 MiB/s         - c/B
 POLY1305 enc |     0.558 ns/B      1709 MiB/s         - c/B
 POLY1305 dec |     0.556 ns/B      1715 MiB/s         - c/B
POLY1305 auth |     0.204 ns/B      4674 MiB/s         - c/B

The -O2 problem with bench-slope seems strange. Does the problem appear only after this patch is applied?

A few comments on the patch:

  1. A commit log is needed.
  2. Looking at the results on P10 before the patch, the difference between STREAM and POLY1305 is small. POLY1305 adds only ~0.058 cycles/byte of extra processing in the stitched/interleaved Chacha20-Poly1305 implementation, so P10 appears to be quite capable of processing mixed ALU+vector workloads in parallel. I suspect that the interleaving approach would also work with the Chacha20x8 implementation in this patch. I'd estimate performance for a stitched Chacha20x8-Poly1305x1 implementation to be around ~0.42-0.43 cycles/byte, or ~2200 MiB/s.
  3. There are a few places where #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ etc. is used. Libgcrypt has the WORDS_BIGENDIAN macro, which should be used instead: #ifndef WORDS_BIGENDIAN etc.
  4. _gcry_chacha20_poly1305_encrypt should handle P10 a bit differently. Set authptr to NULL to skip the stitched code paths entirely:
#ifdef USE_PPC_VEC_POLY1305
  else if (ctx->use_ppc && ctx->use_p10)
    {
      /* Skip stitched chacha20-poly1305 for P10. */
      authptr = NULL;
    }
  else if (ctx->use_ppc && length >= CHACHA20_BLOCK_SIZE * 4)
  5. Likewise, _gcry_chacha20_poly1305_decrypt should handle P10 a bit differently. Set skip_stitched to 1 to skip the stitched code paths:
#ifdef USE_AVX512
  if (ctx->use_avx512)
    {
      /* Skip stitched chacha20-poly1305 for AVX512. */
      skip_stitched = 1;
    }
#endif
#ifdef USE_PPC_VEC_POLY1305
  if (ctx->use_ppc && ctx->use_p10)
    {
      /* Skip stitched chacha20-poly1305 for P10. */
      skip_stitched = 1;
    }
#endif
  6. A HW feature check is missing for the Poly1305 P10 code. Add "use_p10" to poly1305_context_t and set it up in poly1305_init (libgcrypt master has use_avx512, which can be used as an example).

I tested the patch with a small change so that HWF_PPC_ARCH_3_00 is used instead of HWF_PPC_ARCH_3_10. Building bench-slope with "-O3 -flto" makes a bug in the new implementations visible. Without the new implementations, bench-slope is fine (testing with QEMU):

$ tests/bench-slope --disable-hwf ppc-arch_3_00 cipher chacha20
Cipher:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
     STREAM enc |      2.35 ns/B     405.0 MiB/s         - c/B
     STREAM dec |      2.32 ns/B     410.7 MiB/s         - c/B
   POLY1305 enc |      2.46 ns/B     388.0 MiB/s         - c/B
   POLY1305 dec |      2.34 ns/B     408.1 MiB/s         - c/B
  POLY1305 auth |     0.238 ns/B      4003 MiB/s         - c/B

With the new implementations, some ABI callee-saved register(s) are not being restored properly (could be floating-point or vector registers):

Cipher:
 CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte
     STREAM enc |       nan ns/B       nan MiB/s         - c/B
     STREAM dec |       nan ns/B       nan MiB/s         - c/B
   POLY1305 enc |       nan ns/B       nan MiB/s         - c/B
   POLY1305 dec |       nan ns/B       nan MiB/s         - c/B
  POLY1305 auth |       nan ns/B       nan MiB/s         - c/B

The problem is that the new assembly uses VSX registers vs14-vs31, which overlap with floating-point registers f14-f31. f14-f31 are ABI callee-saved, so they need to be saved and restored.
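The fix can be sketched as spilling the overlapping FPRs in the function prologue and restoring them in the epilogue. This is an illustrative ppc64le fragment with made-up offsets and frame size, not the actual layout used in the patch:

```asm
# Illustrative prologue/epilogue: f14-f31 are the low 64 bits
# of vs14-vs31 and are callee-saved per the ELFv2 ABI, so they
# must be spilled before the vector code clobbers vs14-vs31.
stdu	1, -176(1)		# allocate stack frame
stfd	14, 32(1)		# save f14 (low half of vs14)
stfd	15, 40(1)		# save f15 (low half of vs15)
# ...repeat for f16-f31...

# ...function body clobbering vs14-vs31...

lfd	14, 32(1)		# restore f14
lfd	15, 40(1)		# restore f15
# ...repeat for f16-f31...
addi	1, 1, 176		# tear down stack frame
blr
```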

Hi @jukivili, thanks for the updates. Regarding the f14-f31 registers, that was my mistake; I did not think floating point would be used. I will correct that. For poly1305, it can be used on ARCH_3.0, so checking use_p10 doesn't seem necessary, but I can include that as well.

Stitching chacha20 and poly1305 may not be efficient, because interleaving vectors for both may not be beneficial, unless I can do something with an 8x unroll for poly1305.

I will put all your suggestions together and take the right approach. Thanks.

I meant interleaving integer-register-based 1xPoly1305 with 8xChacha20, as is done for 4xChacha20 in cipher/chacha20-ppc.c (interleaved so that for each 4xChaCha20 processed, 4 blocks of 1xPoly1305 are executed). Quite often microarchitectures have separate execution units for integer and vector registers, and then it makes sense to interleave integer-poly1305 with vector-chacha20, as the algorithms do not end up competing for the same execution resources. Interleaving vector-poly1305 and vector-chacha20 is not likely to give a performance increase (and is likely to run into problems with running out of vector registers).

One example of this execution parallelism is the interleaved chacha20-avx2/poly1305 (8xChacha20, 1xPoly1305) on Zen3. Since vector and integer execution happen in parallel and do not compete with each other, chacha20-poly1305 ends up being almost as fast as either chacha20 or poly1305 alone:

CHACHA20       |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
    STREAM enc |     0.203 ns/B      4706 MiB/s     0.983 c/B      4850
    STREAM dec |     0.202 ns/B      4711 MiB/s     0.982 c/B      4850 <-- 8x vector chacha20-avx2
  POLY1305 enc |     0.226 ns/B      4222 MiB/s      1.10 c/B      4850
  POLY1305 dec |     0.207 ns/B      4605 MiB/s      1.00 c/B      4850 <-- 8x vector chacha20-avx2 + 1x integer poly1305 interleaved
 POLY1305 auth |     0.195 ns/B      4892 MiB/s     0.945 c/B      4849 <-- 1x integer poly1305

But I'm not saying this must be done now, just thought it might be something interesting to look in to.

The use_p10 check is necessary in the case where the library is compiled for an older base architecture but the assembly implementation is included, so that it can be enabled when the library is then run on newer HW.

Thanks @jukivili. I had never thought of interleaving with the integer poly1305 operation; that's a good suggestion. I will think about that one.

I have made some fixes and updated the current patch. Please review. The patch is included. Thanks.

Thanks for the updated patch. I'm travelling next week and will have time to check it closely only after I'm back. On a quick glance it looks good. What is also needed is the changelog for the git commit log.

Thanks @jukivili, here is the changelog:

Chacha20/poly1305 - Optimized chacha20/poly1305 for P10 operation.

Files affected:

configure.ac - Added chacha20 and poly1305 assembly implementations.
cipher/chacha20-p10le-8x.s (New) - Supports 8-block (512-byte) unrolling.
cipher/poly1305-p10le.s (New) - Supports 4-block (128-byte) unrolling.
cipher/Makefile.am - Added the new chacha20 and poly1305 files.
cipher/chacha20.c - Added PPC P10 LE support for 8x chacha20.
cipher/poly1305.c - Added PPC P10 LE support for 4x poly1305.
cipher/poly1305-internal.h - Added PPC P10 LE support for poly1305.

Signed-off-by: Danny Tsen <dtsen@us.ibm.com>

Patch applied to master with small changes.

jukivili claimed this task.