Hello,
I started looking into how well the new AEAD mode performs and noticed some performance issues with decryption.
With the latest libgcrypt-devel & gnupg-devel, encryption with AEAD is fast: ~1.0 GB/s when encrypting 2 GiB from /dev/zero to a ram filesystem (2.0 GB/s when streaming from /dev/zero to /dev/null). The command used was:
  dd if=/dev/zero bs=1M count=${GIBI}k | "$GPG" \
      --batch --yes --encrypt --recipient jussi.kivilinna@iki.fi \
      --cipher-algo aes128 --compress-algo none --force-aead --rfc4880bis \
      --output "$OUTFILE"
Now when I attempt decrypting the AEAD-encrypted 2 GiB file from the ram filesystem to /dev/null, I only get 172 MB/s. That is over 5 times slower than I'd expect (namely, the same level of performance for decryption as for encryption, ~1.0 GB/s).
Using 'perf', I see the functions where most of the time is spent when decrypting the AEAD file:
  # Overhead  Command  Shared Object        Symbol
  # ........  .......  ...................  .......................................
  #
      47.71%  gpg      libgcrypt.so.20.3.0  [.] transform_blk
      28.87%  gpg      gpg                  [.] aead_decode_filter
      16.97%  gpg      libgcrypt.so.20.3.0  [.] _gcry_sha1_transform_amd64_avx_bmi2
       3.10%  gpg      libgcrypt.so.20.3.0  [.] aesni_ocb_dec
       2.58%  gpg      libc-2.26.so         [.] __memmove_avx_unaligned_erms
       0.14%  gpg      libgcrypt.so.20.3.0  [.] transform
       0.06%  gpg      libc-2.26.so         [.] _IO_fwrite
       ...
It appears that some of the slowdown is caused by data being passed to SHA1 (_gcry_sha1_transform_amd64_avx_bmi2) and RMD160 (transform_blk). This seemed surprising since the AEAD mode was in use, so I dug into the gnupg source code to find the culprit. In 'g10/mainproc.c:proc_plaintext', there is a bit of code that enables both SHA1 and RMD160:
  /* This is for the old GPG LITERAL+SIG case.  It's not legal
     according to 2440, so hopefully it won't come up that often.
     There is no good way to specify what algorithms to use in
     that case, so these there are the historical answer. */
  gcry_md_enable (c->mfx.md, DIGEST_ALGO_RMD160);
  gcry_md_enable (c->mfx.md, DIGEST_ALGO_SHA1);
This looked like legacy cruft, so I tried disabling this part...
  //gcry_md_enable (c->mfx.md, DIGEST_ALGO_RMD160);
  //gcry_md_enable (c->mfx.md, DIGEST_ALGO_SHA1);
I'm not familiar enough with gnupg or the OpenPGP spec to really tell if this change is OK, but it seems to help with the speed issue. Could this enabling of SHA1+RMD160 for the 'old GPG LITERAL+SIG case' be moved behind a new command-line option that is off by default?
After this change, decryption speed for AEAD went from 172 MB/s to 446 MB/s. This is still not close to the speeds seen for encryption. A new 'perf' profiling run shows that the bottleneck is in g10/decrypt-data.c:
  # Overhead  Command  Shared Object        Symbol
  # ........  .......  ...................  .......................................
  #
      80.16%  gpg      gpg                  [.] aead_decode_filter
       9.81%  gpg      libgcrypt.so.20.3.0  [.] aesni_ocb_dec
       8.68%  gpg      libc-2.26.so         [.] __memmove_avx_unaligned_erms
       0.12%  gpg      libc-2.26.so         [.] _IO_fwrite
       ...
The problem appears to be the use of the 'iobuf_get' function to copy data around one byte at a time. Since the AEAD crypto runs at a sub-cycle-per-byte rate, iobuf_get-style buffer copying adds significant overhead. In my test case, the majority of CPU time is spent in the following loop from 'aead_underflow':
  if (dfx->partial)
    {
      for (; len < size; len++)
        {
          if ((c = iobuf_get (a)) == -1)
            {
              dfx->eof_seen = 1; /* Normal EOF. */
              break;
            }
          buf[len] = c;
        }
    }
  else
Looking into the iobuf code, I found iobuf_read, which can be used to optimize the loop above. I replaced the loop with:
  if (dfx->partial)
    {
      for (; len < size; )
        {
          size_t curr = size - len;
          int ret;

          ret = iobuf_read (a, &buf[len], curr);
          if (ret == -1)
            {
              dfx->eof_seen = 1; /* Normal EOF. */
              break;
            }
          len += ret;
        }
    }
  else
Again, I'm not familiar enough with gnupg to tell whether this change is OK, but it works for my test case. With this optimization, AEAD decryption went up to 1.3 GB/s. The perf report shows that most of the time is now spent in OCB decryption and memmove:
  # Overhead  Command  Shared Object        Symbol
  # ........  .......  ...................  ..................................
  #
      51.67%  gpg      libc-2.26.so         [.] __memmove_avx_unaligned_erms
      40.27%  gpg      libgcrypt.so.20.3.0  [.] aesni_ocb_dec
       1.14%  gpg      libc-2.26.so         [.] _IO_file_xsputn@@GLIBC_2.2.5
       0.75%  gpg      libc-2.26.so         [.] _IO_fflush
       0.73%  gpg      libc-2.26.so         [.] _IO_fwrite
       ...
After this, I started wondering whether CFB/MDC mode is also affected by these bottlenecks.

I get ~300 MB/s CFB encryption (AES128+SHA1) speed (2 GiB from /dev/zero to a ram filesystem). The command used was:
  dd if=/dev/zero bs=1M count=${GIBI}k | "$GPG" \
      --batch --yes --encrypt --recipient jussi.kivilinna@iki.fi \
      --cipher-algo aes128 --compress-algo none --digest-algo sha1 \
      --output "$OUTFILE"
Without the above code changes, decryption only reaches 110 MB/s, while I'd expect decryption to be faster than encryption in CFB mode (CFB-enc is not parallelizable, while CFB-dec is). The perf profile shows that the extra time is spent in RMD160 and the decrypt-data.c iobuf_get loops:
  # Overhead  Command  Shared Object        Symbol
  # ........  .......  ...................  .......................................
  #
      41.28%  gpg      libgcrypt.so.20.3.0  [.] transform_blk
      28.83%  gpg      libgcrypt.so.20.3.0  [.] _gcry_sha1_transform_amd64_avx_bmi2
      24.49%  gpg      gpg                  [.] mdc_decode_filter
       2.39%  gpg      libc-2.26.so         [.] __memmove_avx_unaligned_erms
       2.25%  gpg      libgcrypt.so.20.3.0  [.] _gcry_aes_aesni_cfb_dec
       0.11%  gpg      libgcrypt.so.20.3.0  [.] transform
       0.05%  gpg      libc-2.26.so         [.] _IO_file_xsputn@@GLIBC_2.2.5
       0.05%  gpg      libc-2.26.so         [.] _IO_fwrite
       ...
With the "SHA1+RMD160 for 'old GPG LITERAL+SIG case'" disabled and iobuf_get loops changed to use iobuf_read, CFB-decryption speeds goes to 618 MB/s.
Perf profiling result with the code changes, for CFB decryption:
  # Overhead  Command  Shared Object        Symbol
  # ........  .......  ...................  .......................................
  #
      71.05%  gpg      libgcrypt.so.20.3.0  [.] _gcry_sha1_transform_amd64_avx_bmi2
      15.10%  gpg      libc-2.26.so         [.] __memmove_avx_unaligned_erms
      11.33%  gpg      libgcrypt.so.20.3.0  [.] _gcry_aes_aesni_cfb_dec
       0.18%  gpg      libc-2.26.so         [.] _IO_file_xsputn@@GLIBC_2.2.5
       0.16%  gpg      gpg                  [.] iobuf_read
       0.14%  gpg      libgcrypt.so.20.3.0  [.] _gcry_md_block_write
       0.13%  gpg      libc-2.26.so         [.] _IO_file_write@@GLIBC_2.2.5
       0.13%  gpg      libc-2.26.so         [.] _IO_fwrite
       ...