GPA is stuck if keyring is too big and trust-model is tofu+pgp
Closed, Resolved, Public


When using the tofu+pgp trust model, GPA is sometimes unable to display anything and appears to be stuck as soon as the keyring editor is started (gpa -k). No key is displayed and the interface does not respond to user actions (menus are displayed but clicking on any menu item, or on any other part of the interface, produces no effect).

Here are some observations (initially done with GnuPG 2.2.4 / GPA 0.9.10, and reproduced with a development version freshly built with Speedo from the master branch of all involved projects):

  • While GPA is stuck, two gpg processes are running and one of them is using 100% of CPU time according to top(1).
  • For reference, here are the gpg processes started by GPA (the one using the CPU is the second, with --list-secret-keys):
13151 pts/3    SL+    0:00 gpg --batch --no-sk-comments --lc-messages en_US.UTF-8 --lc-ctype en_US.UTF-8 --status-fd 8 --no-tty --charset utf8 --enable-progress-filter --exit-on-status-write-error --display :0 --ttyname /dev/pts/3 --ttytype rxvt-unicode --with-colons --list-keys --
13155 pts/3    RL+    0:14 gpg --batch --no-sk-comments --lc-messages en_US.UTF-8 --lc-ctype en_US.UTF-8 --status-fd 13 --no-tty --charset utf8 --enable-progress-filter --exit-on-status-write-error --display :0 --ttyname /dev/pts/3 --ttytype rxvt-unicode --with-colons --list-secret-keys --
  • The problem is dependent on the tofu+pgp trust model. Changing the trust model to pgp eliminates the problem. If I switch back to tofu+pgp the problem occurs again.
  • The contents of the TOFU database do not seem to matter. If I remove the tofu.db file and let gpg rebuild a new database from scratch, the problem still occurs.
  • After killing GPA when it is stuck, there are two supplementary files tofu.db-journal and tofu.db-want-lock in the GnuPG home directory.
  • The problem does not seem to be triggered by a specific key, but rather seems to depend on the size of the public keyring. I tried importing (parts of) my public keyring to a fresh new GNUPGHOME, and I observed the following:
    • If I import the entirety of my current public keyring (100 keys, including mine), the bug occurs.
    • If I import my current public keyring in several chunks, everything works fine until I have imported ~95 keys. After importing a 96th key, the bug occurs. Removing any single key (not necessarily the last one) clears the problem.
    • This 95-key threshold is variable. In some of my tests, GPA was still working with a public keyring of 96 keys, and the problem only occurred after importing a 97th key.

I am at a loss trying to figure out what the cause of the problem could be, so I am hoping someone here will be able to shed some light on this issue. I can perform more tests if needed, and/or I can also provide my public keyring if necessary.

If that's relevant (I doubt it but who knows), all tests were performed on Slackware Linux, linux-4.4.12, glibc-2.23, gtk-2.24.31.



Event Timeline

gouttegd created this object in space S1 Public.
werner added projects: gnupg (gpg22), TOFU.
werner added a subscriber: werner.

One of these TOFU bugs. Thanks for the good bug report.

I did a few more tests and here are some more observations:

  • The bug never seems to occur if the private keyring is empty (which is consistent with the earlier observation that the gpg process using the CPU while GPA hangs is the gpg --list-secret-keys one).
  • The number of keys in the public keyring needed to trigger the bug depends on their types. I tested the following conditions:
    • a public keyring containing only ECDSA nistp256 primary keys, each associated with an ECDH nistp256 encryption subkey and a single User ID: I need 187 such keys to trigger the bug;
    • a public keyring containing only RSA 2048 primary keys, each associated with an RSA 2048 encryption subkey and a single User ID: I need 193 such keys to trigger the bug.

I am guessing that the number of keys triggering the bug probably also depends on the actual "contents" of those keys (number of subkeys, number of User IDs, number of signatures). This would explain why I initially observed the bug with my own keyring even though it contains only 100 keys (many of those keys have several User IDs and several signatures). But I didn't test that hypothesis.

Still trying to pinpoint the bug, but I am afraid I am stuck.

First, an update on the description of the problem: contrary to what I described above, GPA is actually not completely unresponsive. Some GUI items do seem to work (e.g., I can start the clipboard or file manager, and display the "About" dialog box or the "Preferences" window). But the Menu>Quit command definitely does not work (no visible effect at all), and Menu>Close has a strange effect: it closes the keyring manager window as expected, but gpa does not relinquish control of the terminal until I send it a ^C signal.

Running with strace -f -e trace=process gpa -k, I observe that the Menu>Quit command actually does have an effect, albeit an unexpected one: every click on it spawns another gpg [...] --list-secret-keys process, which also ends up trying to monopolize the CPU.

From what I can tell, the trouble starts when the gpa_keylist_next function calls gpa_keytable_lookup_key to check whether a given key has a secret part (src/keylist.c, line 487 in current master). The gpa_keytable_lookup_key function then does the following, which is described in the code as a "hack":

keytable->end = (GpaKeyTableEndFunc) gtk_main_quit;
reload_cache (keytable, NULL);
gtk_main ();
keytable->end = NULL;
return gpa_keytable_lookup_key (keytable, fpr);

If I understand correctly, this should call gpg (in the reload_cache function) to initiate a listing of the private keys (this is where the process that ends up monopolizing the CPU is started), then enter a GTK event loop during which GPA should process the data sent by gpg. When this is done, the gtk_main_quit function should be called to exit the inner event loop, allowing execution to continue.

What seems to happen here is that the callbacks for the next_key and done signals are actually never called. Presumably, this explains the behavior described above:

  1. GPA is stuck in the inner event loop started in the gpa_keytable_lookup_key function.
  2. When I activate the Menu>Quit command, the gtk_main_quit function is called. This terminates not the "outer" event loop but the inner one, and GPA resumes its execution in the gpa_keytable_lookup_key function, just after the call to gtk_main.
  3. The gpa_keytable_lookup_key function then calls itself, assuming the keytable has now been initialized. Since it hasn't (because the callbacks were not called), it calls reload_cache again (hence the spawning of a new gpg process) and gets stuck again.
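
The three steps above can be modelled without GTK. What follows is a minimal sketch in plain C, not GPA's actual code: toy_app, toy_main_loop and toy_lookup_key are hypothetical names, and the event array stands in for whatever the real main loop would deliver. It assumes, as described above, that Quit only ends the innermost loop and that the lookup retries once that loop returns.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in GPA or GTK. */
enum toy_event {
    EV_QUIT_CLICK,    /* user activates Menu>Quit */
    EV_LISTING_DONE   /* the "done" callback fires for the secret key listing */
};

struct toy_app {
    const enum toy_event *events; /* events delivered to the loop, in order */
    int n_events;
    int next;          /* index of the next undelivered event */
    bool cache_ready;  /* set only when the listing completes */
    int spawned;       /* simulated "gpg --list-secret-keys" processes */
    bool blocked;      /* a loop ran out of events: the real program hangs */
};

/* One nested loop: returns as soon as any event ends it. */
static void toy_main_loop(struct toy_app *app)
{
    assert(app);
    if (app->next >= app->n_events) {
        app->blocked = true;   /* nothing will ever wake this loop */
        return;
    }
    if (app->events[app->next++] == EV_LISTING_DONE)
        app->cache_ready = true;
    /* EV_QUIT_CLICK also ends this loop, but leaves the cache empty. */
}

/* Mirrors the shape of the quoted hack: reload, loop, retry. */
static bool toy_lookup_key(struct toy_app *app)
{
    assert(app);
    if (app->cache_ready)
        return true;
    app->spawned++;            /* stands in for reload_cache() */
    toy_main_loop(app);        /* stands in for the nested gtk_main() */
    if (app->blocked)
        return false;          /* stuck, as observed in the report */
    return toy_lookup_key(app);
}
```

Feeding the model two EV_QUIT_CLICK events and no EV_LISTING_DONE makes toy_lookup_key fail with spawned equal to 3: one initial process plus one per click, which matches the strace observation. A single EV_LISTING_DONE event lets the lookup succeed after one spawn.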

What I fail to understand is why the callbacks are not called, and especially why this only happens when using the TOFU trust model...

The problem seems to have to do with the locking of the TOFU database.

What happens is roughly the following:

  1. Gpa starts with an empty public keyring.
  2. Gpa starts loading the public keyring (calling gpg [...] --list-keys).
  3. Upon receiving the first public key, Gpa wants to look it up in the secret keyring to see whether that public key has a private counterpart.
  4. Therefore, Gpa starts loading the private keyring, by calling gpg [...] --list-secret-keys

When the trust model is tofu or tofu+pgp, the problem is that both gpg processes need to access the TOFU database. I am not sure what exactly happens next, but I am guessing it is a kind of "deadlock", as follows: the second gpg process (the one spawned to list the secret keys) must wait for the first one to terminate, the first process cannot terminate until Gpa has finished reading its output, and Gpa is waiting for the output of the second process.

The easiest fix would be to make sure the private keyring is loaded in Gpa before loading the public keyring. I'll send a patch for that soon.
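
The suspected circular wait, and why the ordering change would break it, can be written down as a small wait-for graph. This is only a sketch of the reasoning, not a diagnostic tool: the actor names, the waits_for arrays and in_wait_cycle are all hypothetical, and the edges encode the guess made above rather than anything measured.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors; the edges below encode the guess in this comment. */
enum actor { GPG_LIST_KEYS, GPG_LIST_SECRET_KEYS, GPA, N_ACTORS };
#define NOBODY (-1)

/* Returns true if following waits_for[] from start leads back to start. */
static bool in_wait_cycle(const int waits_for[N_ACTORS], int start)
{
    assert(start >= 0 && start < N_ACTORS);
    int cur = start;
    for (int hops = 0; hops < N_ACTORS; hops++) {
        cur = waits_for[cur];
        if (cur == NOBODY)
            return false;      /* chain ends: someone can make progress */
        if (cur == start)
            return true;       /* everyone on the chain waits forever */
    }
    return false;              /* any cycle reached does not include start */
}

/* Public listing started first: GPA asks for secret keys mid-stream. */
static const int waits_before_fix[N_ACTORS] = {
    [GPG_LIST_KEYS]        = GPA,                  /* needs its output read */
    [GPG_LIST_SECRET_KEYS] = GPG_LIST_KEYS,        /* needs the TOFU DB lock */
    [GPA]                  = GPG_LIST_SECRET_KEYS  /* nested loop awaits it */
};

/* Secret keyring already loaded: no second process during the listing. */
static const int waits_after_fix[N_ACTORS] = {
    [GPG_LIST_KEYS]        = GPA,
    [GPG_LIST_SECRET_KEYS] = NOBODY,
    [GPA]                  = NOBODY
};
```

in_wait_cycle(waits_before_fix, GPA) is true, while the same check on waits_after_fix is false: once Gpa no longer needs a second gpg process while the public listing is running, nothing closes the loop.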

gouttegd claimed this task.

Thanks @werner for applying the patch. Closing here, since I have been using that patch for several weeks now without ever encountering the bug again.

I verified that manually putting the DB in WAL mode also resolved this issue, since writers don't block readers in WAL mode.