Kleopatra: Initial keylisting sometimes fails or hangs for some seconds
Open, HighPublic
Actions

Assigned To

Authored By

	• ikloecker
	Oct 10 2024, 2:18 PM

Description

On Linux, this problem is extremely hard to reproduce by running Kleopatra. The easiest way to reproduce this on Linux is to run libkleo's keyresolvercoretest (which consists of 58 tests where each one uses a separate temporary GNUPGHOME and each of those tests initializes the KeyCache by doing keylistings with gpg and gpgsm) in a loop until it fails (see also https://invent.kde.org/pim/libkleo/-/issues/2), e.g. by running the following in the build folder of libkleo:

while ctest --output-on-failure -R keyresolvercore; do echo; done

I ran the tests with GPGME_DEBUG=8:$(pwd)/gpgme-$(date +"%Y-%m-%d-%H%M%S").log to get gpgme logs.

It took 109 tries and 25 tries until the test failed. Additionally, in the first iteration 2 out of the 109 runs took ~10 seconds instead of ~3 seconds and in the second iteration 3 out of the 25 tries took ~10 seconds instead of ~3 seconds.

The gpgme logs of the first failure indicate that gpgsm failed to start gpg-agent. The gpgme logs of the second failure seem to indicate that gpg tried a few times to start gpg-agent before it could connect to gpg-agent. I'll attach the logs.

In any case, the logging suggests that performing concurrent keylistings with gpg and gpgsm without already running gpg-agent, so that gpg and gpgsm race to start gpg-agent, sometimes (on Linux very rarely) results in a failure of one of the keylistings.

To avoid this problem I will serialize the two keylistings done by KeyCache.

Revisions and Commits

rLIBKLEO Libkleo
	rLIBKLEOd1a7ef0c4a69 Create the socket directory before initializing the key cache
	rLIBKLEOe1c79959b723 Create the socket directory before initializing the key cache
	rLIBKLEO9597b11d1e27 Don't merge already known keys with new keys if there are no new keys
	rLIBKLEO55206cfabebd Serialize listing of OpenPGP and S/MIME keys
	rLIBKLEOc32c4a89356f Don't merge already known keys with new keys if there are no new keys
	rLIBKLEO1ab933a893b4 Serialize listing of OpenPGP and S/MIME keys
rG GnuPG
	rG2ae017a25fe8 common: Fix a race condition in creating socketdir.
	rG71840b57f486 common: Fix a race condition in creating socketdir.

Related Objects

Mentioned In: T7289: Release GnuPG 2.5.2
T7434: Kleopatra: Initial keylisting hangs for ~60 seconds (gpg-agent: Socket ...S.gpg-agent cannot be bound)
T7030: Release GnuPG 2.4.6
Mentioned Here: rG71840b57f486: common: Fix a race condition in creating socketdir.
rG69a8aefa5bf7: gpgsm: Fix cached istrusted lookup.

Event Timeline

• ikloecker claimed this task.Oct 10 2024, 2:18 PM

• ikloecker created this task.

gpgme logs for a failed test where the keylisting with gpgsm failed

gpgme-2024-10-09-100752-only-logs-of-initialization-and-failed-test.log86 KBDownload

gpgme logs for a failed test where the keylisting with gpg was delayed for several seconds

gpgme-2024-10-09-164329-only-logs-of-initialization-and-failed-test.log142 KBDownload

I have reproduced this with libkleo from our gpg4win/24.05 branch and with gpg (GnuPG) 2.4.6-beta102 (HEAD of STABLE-BRANCH-2-4) and current master of gpgme and all GnuPG libraries. It took just 8 runs until a unittest failed.

• ikloecker added a commit: rLIBKLEOc32c4a89356f: Don't merge already known keys with new keys if there are no new keys.Oct 10 2024, 10:00 PM

• ikloecker added a commit: rLIBKLEO1ab933a893b4: Serialize listing of OpenPGP and S/MIME keys.

• ikloecker added a commit: rLIBKLEO55206cfabebd: Serialize listing of OpenPGP and S/MIME keys.Oct 11 2024, 9:34 AM

• ikloecker added a commit: rLIBKLEO9597b11d1e27: Don't merge already known keys with new keys if there are no new keys.

systemd based Linux?

• ebo triaged this task as High priority.Oct 14 2024, 10:37 AM

I can reproduce this with gnupg 2.2.45-beta27 (STABLE-BRANCH-2-2 69a8aefa) on openSUSE Tumbleweed.

I found one reason for the intermittently failing concurrent initial keylisting. gpgsm sometimes uses the wrong socket file to (try to) connect to gpg-agent.

On a failed attempt to connect to gpg-agent gpgsm (started by gpgme in server mode) logs:

2024-10-15 10:28:17 gpgsm[36258] DBG: chan_18 -> # Home: /tmp/keyresolvercoretest-lqnneK
2024-10-15 10:28:17 gpgsm[36258] DBG: chan_18 -> # Config: /tmp/keyresolvercoretest-lqnneK/gpgsm.conf
2024-10-15 10:28:17 gpgsm[36258] DBG: chan_18 -> # DirmngrInfo: /tmp/keyresolvercoretest-lqnneK/S.dirmngr
[...]
2024-10-15 10:28:17 gpgsm[36258] DBG: _assuan_connect (ctx=0x000000003a588590): error: No such file or directory
2024-10-15 10:28:17 gpgsm[36258] DBG: assuan_socket_connect (ctx=0x000000003a588590): call: can't connect to `/tmp/keyresolvercoretest-lqnneK/S.gpg-agent': No such file or directory

Note the wrong socket directory /tmp/keyresolvercoretest-lqnneK/ (which is GNUPGHOME) in DirmngrInfo and in the error message of assuan_socket_connect

When gpgsm doesn't fail to connect to gpg-agent (permanently) then gpgsm logs:

2024-10-15 10:27:38 gpgsm[28602] DBG: chan_16 -> # Home: /tmp/keyresolvercoretest-naEZXQ
2024-10-15 10:27:38 gpgsm[28602] DBG: chan_16 -> # Config: /tmp/keyresolvercoretest-naEZXQ/gpgsm.conf
2024-10-15 10:27:38 gpgsm[28602] DBG: chan_16 -> # DirmngrInfo: /run/user/1000/gnupg/d.mc3whx1hqygybbykmi5gejw7/S.dirmngr
[...]
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_socket (ctx=0x00000000099f65d0): enter: namespace=1,style=1,protocol=0
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_socket (ctx=0x00000000099f65d0): leave: result=7
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_connect (ctx=0x00000000099f65d0): enter: socket=7,addr=0x00007ffd0698f940,length=61
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_connect (ctx=0x00000000099f65d0): error: No such file or directory
2024-10-15 10:27:38 gpgsm[28602] DBG: assuan_socket_connect (ctx=0x00000000099f65d0): call: can't connect to `/run/user/1000/gnupg/d.mc3whx1hqygybbykmi5gejw7/S.gpg-agent': No such file or directory
       
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_close (ctx=0x00000000099f65d0): call: fd=0x7
2024-10-15 10:27:38 gpgsm[28602] no running gpg-agent - starting '/opt/gnupg/2.2/bin/gpg-agent'
2024-10-15 10:27:38 gpgsm[28602] DBG: chan_16 -> S PROGRESS starting_agent ? 0 0
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_socket (ctx=0x00000000099f65d0): enter: namespace=1,style=1,protocol=0
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_socket (ctx=0x00000000099f65d0): leave: result=7
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_connect (ctx=0x00000000099f65d0): enter: socket=7,addr=0x00007ffd0698f940,length=61
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_connect (ctx=0x00000000099f65d0): leave: result=0
2024-10-15 10:27:38 gpgsm[28602] DBG: chan_7 <- OK Pleased to meet you, process 28602
2024-10-15 10:27:38 gpgsm[28602] DBG: connection to agent established

In particular, gpgsm uses the correct socket directory /run/user/1000/gnupg/d.mc3whx1hqygybbykmi5gejw7/.

I suspect that gpgsm tried and failed to create the socket directory because gpg won the race to create the directory. Hence, gpgsm falls back to using GNUPGHOME as socket directory.

• ikloecker added a commit: rLIBKLEOe1c79959b723: Create the socket directory before initializing the key cache.Oct 15 2024, 5:11 PM

• gniibe added a commit: rG71840b57f486: common: Fix a race condition in creating socketdir..Oct 16 2024, 3:06 AM

Good catch, @ikloecker !
I located the bug in GnuPG, and the fix is: rG71840b57f486: common: Fix a race condition in creating socketdir.

I confirm the fix. Using gnupg master the unit test ran 544 times without any failures or suspiciously long run time.

• ikloecker added a commit: rLIBKLEOd1a7ef0c4a69: Create the socket directory before initializing the key cache.Oct 16 2024, 4:19 PM

The fix should probably be backported to gnupg 2.2 and 2.4.

• ebo edited projects, added gnupg22, gnupg24; removed gnupg.Oct 17 2024, 1:49 PM

Passing ticket to werner to consider backports.

Fix backported to 2.4

• werner added a commit: rG2ae017a25fe8: common: Fix a race condition in creating socketdir..Oct 29 2024, 1:01 PM

• werner changed the task status from Open to Testing.Oct 29 2024, 1:07 PM

• werner moved this task from Backlog to QA on the gnupg24 board.

• werner mentioned this in T7030: Release GnuPG 2.4.6.Oct 29 2024, 2:24 PM

Unfortunately, this seems not to have ended the sporadic hangs.
I just saw a hanging initial keylisting with gpg4win-beta-70 which has gpg 2.4.6

@ebo Thank you for your continuous testing.

Could you please try invoking following commands with the specific GNUHOME when you see the hang (and let me show the result)?

> gpg-connect-agent.exe --homedir SOMEWHERE 'scd keyinfo --list' /bye
> gpg-connect-agent.exe --homedir SOMEWHERE 'keyinfo --list' /bye
> gpg.exe --homedir SOMEWHERE --debug=ipc -K

The first command invocation tries to connect to the agent asking scdaemon showing available keys. If it stops, kill it by CTRL-C.
The second command invocation tries to connect to the agent for available keys (internally, it will connect scdaemon). If it stops, kill it by CTRL-C.
The third command invocation is doing key listing by gpg.exe and shows communication between gpg-agent. If it stops, kill it by CTRL-C.

When you also see the hang for the first command invocation above, please try killing scdaemon by gpgconf.exe --kill scdaemon and if it really kills scdaemon.
If not, please manually kill scdaemon and see if next command invocation of gpg.exe -K works or not.

When you also see the hang for the second command invocation above, please try killing gpg-agent by gpgconf.exe --kill gpg-agent and if it really kills gpg-agent.
If not, please manually kill gpg-agent and see if next command invocation of gpg.exe -K works or not.

• werner changed the task status from Testing to Open.Nov 2 2024, 4:12 PM

• werner moved this task from QA to WiP on the gnupg24 board.

Had a occurrence of the never ending "loading certificate cache" issue again.
There was a leftover gpgsm process from the previous tests (although Kleopatra warned when I closed it, that processes still running in the background were there and would be aborted).

None of the commands I should try did hang in this case.
Killing scdaemon with 'gpgconf.exe --kill scdaemon' worked and ended the loading message and all certificates in the keyring where shown.

Next I managed to have one gpg and one gpgsm process each left over from the last execution of Kleopatra.
After starting Kleopatra new anyway, again "loading certificate cache" and an additional pair of gpg and gpgsm listing processes start.

In this case the following commands hang:

>gpg-connect-agent "scd keyinfo --list" /bye
>gpg-connect-agent "keyinfo --list" /bye
>gpg --debug=ipc -K
>gpgconf --kill scdaemon

"gpg --debug=ipc -K" gave some output though, before hanging, see:

gpg-debug-ipc-K.txt24 KBDownload

The last 5 lines from the output repeated several times again, after saving this.

Then manually killing scdaemon ended the gpg and gpgsm processes, too, but one gpg process was immediately started again.

I managed to get the same "loading certificate" message several times in a row on this test instance by stopping and starting Kleopatra in a row twice. After removing the Signature Card 2.0 this did not happen again in 5-6 tries, although I collected 2 lingering listing processes again (not both started on the same startup). Even import of a X.509 certificate worked.

After removing the Yubikey, too, repeated restarts of Kleopatra did neither produce lingering listing processes not did I get "loading certificates".

For Beta-75 it looks similar judging from my first tries.

A difference seems to be that while I can provoke collecting leftover gpg and gpgsm processes quite reliably by starting Kleo with both an X.509 card and a Yubikey connected, quit, and restart, this seldom results in al "loading certificates cache" situation in the following start. I've seen that only once today.

• ikloecker mentioned this in T7434: Kleopatra: Initial keylisting hangs for ~60 seconds (gpg-agent: Socket ...S.gpg-agent cannot be bound).Dec 4 2024, 11:11 AM

• werner mentioned this in T7289: Release GnuPG 2.5.2.Dec 5 2024, 11:47 AM

• werner moved this task from Backlog to QA on the gpd5x board.Dec 5 2024, 4:36 PM

Kleopatra: Initial keylisting sometimes fails or hangs for some secondsOpen, HighPublicActions

Description

Revisions and Commits

Related Objects

Event Timeline

Kleopatra: Initial keylisting sometimes fails or hangs for some seconds
Open, HighPublic
Actions