Unexpectedly slow decryption for AEAD (and CFB)
Open, High, Public

Description

Hello,

I started looking into how well the new AEAD mode performs and noticed some performance issues with decryption.

With the latest libgcrypt-devel & gnupg-devel, encryption with AEAD is fast, ~1 GB/s when encrypting 2 GiB from /dev/zero to a ram filesystem (2.0 GB/s when streaming from /dev/zero to /dev/null). The command used was:

dd if=/dev/zero bs=1M count=${GIBI}k | "$GPG" \
  --batch --yes --encrypt --recipient jussi.kivilinna@iki.fi \
  --cipher-algo aes128 --compress-algo none --force-aead --rfc4880bis \
  --output "$OUTFILE"

Now when I decrypt the AEAD-encrypted 2 GiB file from the ram filesystem to /dev/null, I only get 172 MB/s. That is over 5 times slower than I'd expect (namely, the same level of performance for decryption as for encryption, ~1.0 GB/s).

Using 'perf', I can see the functions where most of the time is spent when decrypting the AEAD file:

# Overhead  Command  Shared Object           Symbol                                 
# ........  .......  ......................  .......................................
#
    47.71%  gpg      libgcrypt.so.20.3.0     [.] transform_blk
    28.87%  gpg      gpg                     [.] aead_decode_filter
    16.97%  gpg      libgcrypt.so.20.3.0     [.] _gcry_sha1_transform_amd64_avx_bmi2
     3.10%  gpg      libgcrypt.so.20.3.0     [.] aesni_ocb_dec
     2.58%  gpg      libc-2.26.so            [.] __memmove_avx_unaligned_erms
     0.14%  gpg      libgcrypt.so.20.3.0     [.] transform
     0.06%  gpg      libc-2.26.so            [.] _IO_fwrite
    ...

It appears that some of the slowdown is caused by the data also being passed through SHA1 (_gcry_sha1_transform_amd64_avx_bmi2) and RMD160 (transform_blk). This seemed surprising since AEAD mode was in use, so I dug into the gnupg source code to find the culprit. In 'g10/mainproc.c:proc_plaintext', there is a bit of code that enables both SHA1 and RMD160:

      /* This is for the old GPG LITERAL+SIG case.  It's not legal
         according to 2440, so hopefully it won't come up that often.
         There is no good way to specify what algorithms to use in
         that case, so these there are the historical answer. */
	gcry_md_enable (c->mfx.md, DIGEST_ALGO_RMD160);
	gcry_md_enable (c->mfx.md, DIGEST_ALGO_SHA1);

This looked like legacy cruft, so I tried disabling this part...

	//gcry_md_enable (c->mfx.md, DIGEST_ALGO_RMD160);
	//gcry_md_enable (c->mfx.md, DIGEST_ALGO_SHA1);

I'm not familiar enough with gnupg or the OpenPGP spec to tell whether this change is OK, but it does help with the speed issue. Could this enabling of SHA1+RMD160 for the 'old GPG LITERAL+SIG case' be moved behind a new command-line option that is off by default?

After this change, decryption speed for AEAD went from 172 MB/s to 446 MB/s. That was still not close to the speed seen for encryption. The new 'perf' profiling result shows that the bottleneck is in g10/decrypt-data.c:

# Overhead  Command  Shared Object           Symbol                                 
# ........  .......  ......................  .......................................
#
    80.16%  gpg      gpg                     [.] aead_decode_filter
     9.81%  gpg      libgcrypt.so.20.3.0     [.] aesni_ocb_dec
     8.68%  gpg      libc-2.26.so            [.] __memmove_avx_unaligned_erms
     0.12%  gpg      libc-2.26.so            [.] _IO_fwrite
    ...

The problem appears to be the use of the 'iobuf_get' function, which copies data around one byte at a time. Since the AEAD crypto runs at a sub-cycle-per-byte rate, iobuf_get-style buffer copying adds significant overhead. In my test case, the majority of the CPU time is spent in the following loop from 'aead_underflow':

if (dfx->partial)
  {
    for (; len < size; len++ )
      {
        if ((c = iobuf_get (a)) == -1)
          {
            dfx->eof_seen = 1; /* Normal EOF. */
            break;
          }
        buf[len] = c;
      }
  }
else

Looking into the iobuf code, I found iobuf_read, which can be used to optimize the loop above. I replaced the loop with:

if (dfx->partial)
  {
    for (; len < size; )
      {
        size_t curr = size - len;
        int ret;

        ret = iobuf_read(a, &buf[len], curr);
        if (ret == -1)
          {
            dfx->eof_seen = 1; /* Normal EOF. */
            break;
          }

        len += ret;
      }
  }
else
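To illustrate the pattern outside gnupg, here is a toy model of the same fill loop. The `toy_stream`/`toy_read`/`toy_fill` names are hypothetical stand-ins, not the iobuf API; the key property modeled is that, like iobuf_read, the reader may return fewer bytes than requested, so the loop must accumulate short reads until the buffer is full or EOF is reached:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_stream { const unsigned char *data; size_t pos, len; };

/* Return up to 'size' bytes, or -1 at EOF -- mirroring iobuf_read's
   contract.  Caps each call at 7 bytes to exercise the short-read path. */
static int toy_read (struct toy_stream *s, unsigned char *buf, size_t size)
{
  size_t avail = s->len - s->pos;
  size_t n = size < avail ? size : avail;
  if (!avail)
    return -1;
  if (n > 7)
    n = 7;
  memcpy (buf, s->data + s->pos, n);
  s->pos += n;
  return (int)n;
}

/* Fill 'buf' with up to 'size' bytes using bulk reads; returns the
   number of bytes copied.  Same shape as the optimized loop above. */
static size_t toy_fill (struct toy_stream *s, unsigned char *buf, size_t size)
{
  size_t len = 0;
  while (len < size)
    {
      int ret = toy_read (s, buf + len, size - len);
      if (ret == -1)
        break;              /* Normal EOF. */
      len += (size_t)ret;
    }
  return len;
}
```

The point of the bulk-read form is that each iteration moves a whole chunk with one memcpy-class operation instead of paying per-byte function-call overhead.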

Again, I'm not familiar enough with gnupg to tell whether this change is OK, but it works for my test case. With this optimization, AEAD decryption went up to 1.3 GB/s. The perf report shows that most of the time is now spent in OCB decryption and memmove:

# Overhead  Command  Shared Object        Symbol                            
# ........  .......  ...................  ..................................
#
    51.67%  gpg      libc-2.26.so         [.] __memmove_avx_unaligned_erms
    40.27%  gpg      libgcrypt.so.20.3.0  [.] aesni_ocb_dec
     1.14%  gpg      libc-2.26.so         [.] _IO_file_xsputn@@GLIBC_2.2.5
     0.75%  gpg      libc-2.26.so         [.] _IO_fflush
     0.73%  gpg      libc-2.26.so         [.] _IO_fwrite
    ...

After this, I started wondering whether CFB/MDC mode is also affected by these bottlenecks.

I get ~300 MB/s CFB encryption speed (AES128+SHA1; 2 GiB from /dev/zero to a ram filesystem). The command used was:

dd if=/dev/zero bs=1M count=${GIBI}k | "$GPG" \
  --batch --yes --encrypt --recipient jussi.kivilinna@iki.fi \
  --cipher-algo aes128 --compress-algo none --digest-algo sha1 \
  --output "$OUTFILE"

Without the above code changes, decryption only reaches 110 MB/s, while I'd expect decryption to be faster than encryption in CFB mode (CFB encryption is not parallelizable, while CFB decryption is). The perf profile shows that the extra time is spent in RMD160 and the decrypt-data.c iobuf_get loops:

# Overhead  Command  Shared Object           Symbol                                 
# ........  .......  ......................  .......................................
#
    41.28%  gpg      libgcrypt.so.20.3.0     [.] transform_blk
    28.83%  gpg      libgcrypt.so.20.3.0     [.] _gcry_sha1_transform_amd64_avx_bmi2
    24.49%  gpg      gpg                     [.] mdc_decode_filter
     2.39%  gpg      libc-2.26.so            [.] __memmove_avx_unaligned_erms
     2.25%  gpg      libgcrypt.so.20.3.0     [.] _gcry_aes_aesni_cfb_dec
     0.11%  gpg      libgcrypt.so.20.3.0     [.] transform
     0.05%  gpg      libc-2.26.so            [.] _IO_file_xsputn@@GLIBC_2.2.5
     0.05%  gpg      libc-2.26.so            [.] _IO_fwrite
    ...
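On the parallelizability point: in CFB, decryption computes P[i] = C[i] XOR E(C[i-1]), and all ciphertext blocks are known up front, so the loop iterations are independent; encryption must instead chain through each freshly produced ciphertext block. A minimal sketch of this (a toy 8-byte mixing function stands in for AES; this is not gnupg or libgcrypt code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in "block cipher": an arbitrary invertible-looking mixer.
   Only the forward direction is needed for CFB. */
static uint64_t toy_E (uint64_t x, uint64_t key)
{
  x ^= key;
  x *= 0x9e3779b97f4a7c15ULL;   /* arbitrary odd multiplier */
  x ^= x >> 29;
  return x;
}

/* Encryption is a serial chain: C[i] = P[i] ^ E(C[i-1]).  Each
   iteration needs the ciphertext produced by the previous one. */
static void cfb_enc (const uint64_t *p, uint64_t *c, size_t n,
                     uint64_t key, uint64_t iv)
{
  uint64_t prev = iv;
  for (size_t i = 0; i < n; i++)
    prev = c[i] = p[i] ^ toy_E (prev, key);
}

/* Decryption: P[i] = C[i] ^ E(C[i-1]).  Every block depends only on
   ciphertext that is already in hand, so the iterations are
   independent and could run in parallel. */
static void cfb_dec (const uint64_t *c, uint64_t *p, size_t n,
                     uint64_t key, uint64_t iv)
{
  for (size_t i = 0; i < n; i++)
    p[i] = c[i] ^ toy_E (i ? c[i - 1] : iv, key);
}
```

This block-independence on the decrypt side is what lets AES-NI implementations pipeline several CFB-decrypt blocks at once, while CFB encryption stays serial.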

With the "SHA1+RMD160 for 'old GPG LITERAL+SIG case'" code disabled and the iobuf_get loops changed to use iobuf_read, CFB decryption speed goes up to 618 MB/s.

Perf profiling result with the code changes, for CFB decryption:

# Overhead  Command  Shared Object        Symbol                                 
# ........  .......  ...................  .......................................
#
    71.05%  gpg      libgcrypt.so.20.3.0  [.] _gcry_sha1_transform_amd64_avx_bmi2
    15.10%  gpg      libc-2.26.so         [.] __memmove_avx_unaligned_erms
    11.33%  gpg      libgcrypt.so.20.3.0  [.] _gcry_aes_aesni_cfb_dec
     0.18%  gpg      libc-2.26.so         [.] _IO_file_xsputn@@GLIBC_2.2.5
     0.16%  gpg      gpg                  [.] iobuf_read
     0.14%  gpg      libgcrypt.so.20.3.0  [.] _gcry_md_block_write
     0.13%  gpg      libc-2.26.so         [.] _IO_file_write@@GLIBC_2.2.5
     0.13%  gpg      libc-2.26.so         [.] _IO_fwrite
    ...
jukivili created this task. Feb 11 2018, 1:30 PM
jukivili added a comment (edited). Feb 11 2018, 1:34 PM

Here are the patches I used for testing. The first is a hack for disabling the SHA1+RMD160 'old GPG LITERAL+SIG case', and the second is the 'iobuf_get to iobuf_read' optimization.


werner claimed this task. Feb 13 2018, 12:29 PM
werner added a subscriber: werner.

Thanks for this research. Two weeks ago I also did some testing and started to implement a fast-track path for simple encryption (for example, without signing and filters). But your approach of improving iobuf is probably the more general solution.

Regarding an option for legacy data: that will not be easy. However, with AEAD we should be able to disable running these extra hash contexts.

werner triaged this task as High priority. Apr 17 2018, 8:40 PM
werner edited projects, added gnupg (gpg23); removed gnupg.