gpg-agent 2.x poor performance / futex errors
Open, NormalPublic

Description

Hello,

after upgrading from Ubuntu 16.04 to Ubuntu 18.04 we've noticed the issues which came along with the gpg v2.x.

The gpg-agent produces millions of futex syscall errors during a very short time (a second or two) when it's being loaded either by the SaltStack's salt-master decrypting the pillars (our main use case) or when it is being directly tested with "parallel" tool from moreutils package.

$ sudo strace -f -p <pidof gpg-agent>
...
...Ctrl+C just after couple of seconds while "gpg -d" commands are running in parallel (see below for details)
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.35 2800.145231         305   9194103   2009552 futex
  3.63  105.404136      373774       282           pselect6
  0.01    0.431338         102      4246           read
  0.00    0.104701          12      8490           write
  0.00    0.029085         103       283           accept
  0.00    0.016549          58       284           madvise
  0.00    0.012201          22       567           close
  0.00    0.010979           8      1410           getpid
  0.00    0.010405          12       849           access
  0.00    0.006341          22       284       284 wait4
  0.00    0.004350          15       283           openat
  0.00    0.003764          13       283           clone
  0.00    0.002668           9       283           getsockopt
  0.00    0.002568           9       283           fstat
  0.00    0.002564           9       283           set_robust_list
  0.00    0.001941           7       283           lseek
------ ----------- ----------- --------- --------- ----------------
100.00 2906.188821               9212496   2009836 total

I'll describe the issue and steps to reproduce it.

First, prepare the "enc" file:

cat /usr/share/doc/base-files/README | gpg -ear "some-4K-RSA-publick-key" > enc

Run parallel decryptions using "time" to measure it:

time parallel -j 30 sh -c "cat enc | gpg --no-tty -d -q -o /dev/null" -- $(seq 1 3000)

Running "gpg -d" (GPG v2.x, with the gpg-agent) in parallel as described above took:

  • 1minute 18seconds on a big HW; (48 cores, *gpg-agent 2.2.4*-1ubuntu1.2)
  • 32 seconds on my laptop; (4 cores, *gpg-agent 2.2.19*-3ubuntu2)

Running the same commands but with GPG v1.4.20 (no agent):

  • 9 seconds on a big HW: (40 cores, *gnupg 1.4.20*-1ubuntu3.3)
  • 21 seconds on a VM; (1 core, *gnupg 1.4.20*-1ubuntu3.3)

Note: in order to prevent "command 'PKDECRYPT' failed: Cannot allocate memory <gcrypt>" error, the gpg-agent is running either with "--auto-expand-secmem 0x30000" flag or with "auto-expand-secmem" in ~/.gnupg/gpg-agent.conf file.

Since our use case is to have SaltStack's salt-master decrypt many pillars for hundreds of servers, the Ubuntu 16.04 => 18.04 upgrade severely degrades the SaltStack performance making it almost unusable, i.e. it becomes 10 times slower, requires us figuring workarounds such as increasing "gather_job_timeout" or probably even rolling back to gpg v1.x.

Any suggestions?

Kind regards,
Andrey Arapov

Details

Related Objects

gniibe added a subscriber: gniibe.Nov 17 2020, 2:10 AM

I think that it is not gpg-agent but pinentry which causes millions of futex syscall errors.
For interactive use case, pinentry may be the point of contention.
I might be wrong if your key is not protected by passphrase.

If possible, please try adding arguments for gpg invocation: --pinentry-mode loopback --passphrase-file YOUR_FILE_FOR_PASSPHRASE
This can avoid the invocation of pinentry entirely.

werner added a subscriber: werner.Nov 17 2020, 8:32 AM

Note that you actually run 30 independent processes with gpg 1.4 but with gpg-agent there is just one process to handle the private key operations (decrypt). To utilize more cores you need to setup several GNUPGHOME with the same private keys.

werner triaged this task as Normal priority.Nov 17 2020, 8:35 AM
werner edited projects, added gpgagent, Feature Request; removed Bug Report.

I change this to a feature request: Allow several processes to run public key decryption using the same set of private keys.

I think that it is not gpg-agent but pinentry which causes millions of futex syscall errors.
For interactive use case, pinentry may be the point of contention.
I might be wrong if your key is not protected by passphrase.

If possible, please try adding arguments for gpg invocation: --pinentry-mode loopback --passphrase-file YOUR_FILE_FOR_PASSPHRASE
This can avoid the invocation of pinentry entirely.

I've tried running gpg with "--pinentry-mode loopback" (+gpg-agent is running with "--allow-loopback-pinentry" to make sure it is allowed) -- did not change the situation.
Worth mentioning, the gpg private key we are using isn't password protected.

Note that you actually run 30 independent processes with gpg 1.4 but with gpg-agent there is just one process to handle the private key operations (decrypt). To utilize more cores you need to setup several GNUPGHOME with the same private keys.

Yeah, this must be it.

I change this to a feature request: Allow several processes to run public key decryption using the same set of private keys.

Thank you, Werner!

Thanks. I understand the situation. Basically, gpg-agent's computation is done by a single thread (in current implementation), although it accepts many requests simultaneously.

For an experiment, I tried following patch to allow computation by multiple threads.
It is not yet complete because it's only for decryption (not for signing) and protect.c is shared by other commands, but it's worth a try.

diff --git a/agent/pkdecrypt.c b/agent/pkdecrypt.c
index ec23daf83..be2e54679 100644
--- a/agent/pkdecrypt.c
+++ b/agent/pkdecrypt.c
@@ -26,6 +26,7 @@
 #include <unistd.h>
 #include <sys/stat.h>
 
+#include <npth.h>
 #include "agent.h"
 
 
@@ -105,7 +106,9 @@ agent_pkdecrypt (ctrl_t ctrl, const char *desc_text,
 /*           gcry_sexp_dump (s_skey); */
 /*         } */
 
+      npth_unprotect ();
       rc = gcry_pk_decrypt (&s_plain, s_cipher, s_skey);
+      npth_protect ();
       if (rc)
         {
           log_error ("decryption failed: %s\n", gpg_strerror (rc));
diff --git a/agent/protect.c b/agent/protect.c
index 54666e717..ebc1a417a 100644
--- a/agent/protect.c
+++ b/agent/protect.c
@@ -37,6 +37,7 @@
 # include <sys/times.h>
 #endif
 
+#include <npth.h>
 #include "agent.h"
 
 #include "cvt-openpgp.h"
@@ -1451,17 +1452,23 @@ hash_passphrase (const char *passphrase, int hashalgo,
                  unsigned long s2kcount,
                  unsigned char *key, size_t keylen)
 {
+  int r;
+
   /* The key derive function does not support a zero length string for
      the passphrase in the S2K modes.  Return a better suited error
      code than GPG_ERR_INV_DATA.  */
   if (!passphrase || !*passphrase)
     return gpg_error (GPG_ERR_NO_PASSPHRASE);
-  return gcry_kdf_derive (passphrase, strlen (passphrase),
-                          s2kmode == 3? GCRY_KDF_ITERSALTED_S2K :
-                          s2kmode == 1? GCRY_KDF_SALTED_S2K :
-                          s2kmode == 0? GCRY_KDF_SIMPLE_S2K : GCRY_KDF_NONE,
-                          hashalgo, s2ksalt, 8, s2kcount,
-                          keylen, key);
+
+  npth_unprotect ();
+  r = gcry_kdf_derive (passphrase, strlen (passphrase),
+                       s2kmode == 3? GCRY_KDF_ITERSALTED_S2K :
+                       s2kmode == 1? GCRY_KDF_SALTED_S2K :
+                       s2kmode == 0? GCRY_KDF_SIMPLE_S2K : GCRY_KDF_NONE,
+                       hashalgo, s2ksalt, 8, s2kcount,
+                       keylen, key);
+  npth_protect ();
+  return r;
 }

I though about this too but we need to take care about the logging functions of Libgcrypt which are intertwined with nPth (clamp function of libgpg-error).