Page MenuHome GnuPG

Kleopatra: Initial keylisting sometimes fails or hangs for some seconds
Open, HighPublic

Description

On Linux, this problem is extremely hard to reproduce by running Kleopatra. The easiest way to reproduce this on Linux is to run libkleo's keyresolvercoretest (which consists of 58 tests where each one uses a separate temporary GNUPGHOME and each of those tests initializes the KeyCache by doing keylistings with gpg and gpgsm) in a loop until it fails (see also https://invent.kde.org/pim/libkleo/-/issues/2), e.g. by running the following in the build folder of libkleo:

while ctest --output-on-failure -R keyresolvercore; do echo; done

I ran the tests with GPGME_DEBUG=8:$(pwd)/gpgme-$(date +"%Y-%m-%d-%H%M%S").log to get gpgme logs.

It took 109 tries and 25 tries until the test failed. Additionally, in the first iteration 2 out of the 109 runs took ~10 seconds instead of ~3 seconds and in the second iteration 3 out of the 25 tries took ~10 seconds instead of ~3 seconds.

The gpgme logs of the first failure indicate that gpgsm failed to start gpg-agent. The gpgme logs of the second failure seem to indicate that gpg tried a few times to start gpg-agent before it could connect to gpg-agent. I'll attach the logs.

In any case, the logging suggests that performing concurrent keylistings with gpg and gpgsm without already running gpg-agent, so that gpg and gpgsm race to start gpg-agent, sometimes (on Linux very rarely) results in a failure of one of the keylistings.

To avoid this problem I will serialize the two keylistings done by KeyCache.

Event Timeline

gpgme logs for a failed test where the keylisting with gpgsm failed

gpgme logs for a failed test where the keylisting with gpg was delayed for several seconds

I have reproduced this with libkleo from our gpg4win/24.05 branch and with gpg (GnuPG) 2.4.6-beta102 (HEAD of STABLE-BRANCH-2-4) and current master of gpgme and all GnuPG libraries. It took just 8 runs until a unittest failed.

I can reproduce this with gnupg 2.2.45-beta27 (STABLE-BRANCH-2-2 69a8aefa) on openSUSE Tumbleweed.

I found one reason for the intermittently failing concurrent initial keylisting. gpgsm sometimes uses the wrong socket file to (try to) connect to gpg-agent.

On a failed attempt to connect to gpg-agent gpgsm (started by gpgme in server mode) logs:

2024-10-15 10:28:17 gpgsm[36258] DBG: chan_18 -> # Home: /tmp/keyresolvercoretest-lqnneK
2024-10-15 10:28:17 gpgsm[36258] DBG: chan_18 -> # Config: /tmp/keyresolvercoretest-lqnneK/gpgsm.conf
2024-10-15 10:28:17 gpgsm[36258] DBG: chan_18 -> # DirmngrInfo: /tmp/keyresolvercoretest-lqnneK/S.dirmngr
[...]
2024-10-15 10:28:17 gpgsm[36258] DBG: _assuan_connect (ctx=0x000000003a588590): error: No such file or directory
2024-10-15 10:28:17 gpgsm[36258] DBG: assuan_socket_connect (ctx=0x000000003a588590): call: can't connect to `/tmp/keyresolvercoretest-lqnneK/S.gpg-agent': No such file or directory

Note the wrong socket directory /tmp/keyresolvercoretest-lqnneK/ (which is GNUPGHOME) in DirmngrInfo and in the error message of assuan_socket_connect

When gpgsm doesn't fail to connect to gpg-agent (permanently) then gpgsm logs:

2024-10-15 10:27:38 gpgsm[28602] DBG: chan_16 -> # Home: /tmp/keyresolvercoretest-naEZXQ
2024-10-15 10:27:38 gpgsm[28602] DBG: chan_16 -> # Config: /tmp/keyresolvercoretest-naEZXQ/gpgsm.conf
2024-10-15 10:27:38 gpgsm[28602] DBG: chan_16 -> # DirmngrInfo: /run/user/1000/gnupg/d.mc3whx1hqygybbykmi5gejw7/S.dirmngr
[...]
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_socket (ctx=0x00000000099f65d0): enter: namespace=1,style=1,protocol=0
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_socket (ctx=0x00000000099f65d0): leave: result=7
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_connect (ctx=0x00000000099f65d0): enter: socket=7,addr=0x00007ffd0698f940,length=61
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_connect (ctx=0x00000000099f65d0): error: No such file or directory
2024-10-15 10:27:38 gpgsm[28602] DBG: assuan_socket_connect (ctx=0x00000000099f65d0): call: can't connect to `/run/user/1000/gnupg/d.mc3whx1hqygybbykmi5gejw7/S.gpg-agent': No such file or directory
       
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_close (ctx=0x00000000099f65d0): call: fd=0x7
2024-10-15 10:27:38 gpgsm[28602] no running gpg-agent - starting '/opt/gnupg/2.2/bin/gpg-agent'
2024-10-15 10:27:38 gpgsm[28602] DBG: chan_16 -> S PROGRESS starting_agent ? 0 0
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_socket (ctx=0x00000000099f65d0): enter: namespace=1,style=1,protocol=0
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_socket (ctx=0x00000000099f65d0): leave: result=7
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_connect (ctx=0x00000000099f65d0): enter: socket=7,addr=0x00007ffd0698f940,length=61
2024-10-15 10:27:38 gpgsm[28602] DBG: _assuan_connect (ctx=0x00000000099f65d0): leave: result=0
2024-10-15 10:27:38 gpgsm[28602] DBG: chan_7 <- OK Pleased to meet you, process 28602
2024-10-15 10:27:38 gpgsm[28602] DBG: connection to agent established

In particular, gpgsm uses the correct socket directory /run/user/1000/gnupg/d.mc3whx1hqygybbykmi5gejw7/.

I suspect that gpgsm tried and failed to create the socket directory because gpg won the race to create the directory. Hence, gpgsm falls back to using GNUPGHOME as socket directory.

I confirm the fix. Using gnupg master the unit test ran 544 times without any failures or suspiciously long run time.

The fix should probably be backported to gnupg 2.2 and 2.4.

Passing ticket to werner to consider backports.

werner changed the task status from Open to Testing.Oct 29 2024, 1:07 PM
werner moved this task from Backlog to QA on the gnupg24 board.

Unfortunately, this seems not to have ended the sporadic hangs.
I just saw a hanging initial keylisting with gpg4win-beta-70 which has gpg 2.4.6

@ebo Thank you for your continuous testing.

Could you please try invoking following commands with the specific GNUHOME when you see the hang (and let me show the result)?

> gpg-connect-agent.exe --homedir SOMEWHERE 'scd keyinfo --list' /bye
> gpg-connect-agent.exe --homedir SOMEWHERE 'keyinfo --list' /bye
> gpg.exe --homedir SOMEWHERE --debug=ipc -K

The first command invocation tries to connect to the agent asking scdaemon showing available keys. If it stops, kill it by CTRL-C.
The second command invocation tries to connect to the agent for available keys (internally, it will connect scdaemon). If it stops, kill it by CTRL-C.
The third command invocation is doing key listing by gpg.exe and shows communication between gpg-agent. If it stops, kill it by CTRL-C.

When you also see the hang for the first command invocation above, please try killing scdaemon by gpgconf.exe --kill scdaemon and if it really kills scdaemon.
If not, please manually kill scdaemon and see if next command invocation of gpg.exe -K works or not.

When you also see the hang for the second command invocation above, please try killing gpg-agent by gpgconf.exe --kill gpg-agent and if it really kills gpg-agent.
If not, please manually kill gpg-agent and see if next command invocation of gpg.exe -K works or not.

werner changed the task status from Testing to Open.Nov 2 2024, 4:12 PM
werner moved this task from QA to WiP on the gnupg24 board.

Had a occurrence of the never ending "loading certificate cache" issue again.
There was a leftover gpgsm process from the previous tests (although Kleopatra warned when I closed it, that processes still running in the background were there and would be aborted).

None of the commands I should try did hang in this case.
Killing scdaemon with 'gpgconf.exe --kill scdaemon' worked and ended the loading message and all certificates in the keyring where shown.

Next I managed to have one gpg and one gpgsm process each left over from the last execution of Kleopatra.
After starting Kleopatra new anyway, again "loading certificate cache" and an additional pair of gpg and gpgsm listing processes start.

In this case the following commands hang:

>gpg-connect-agent "scd keyinfo --list" /bye
>gpg-connect-agent "keyinfo --list" /bye
>gpg --debug=ipc -K
>gpgconf --kill scdaemon

"gpg --debug=ipc -K" gave some output though, before hanging, see:


The last 5 lines from the output repeated several times again, after saving this.

Then manually killing scdaemon ended the gpg and gpgsm processes, too, but one gpg process was immediately started again.

I managed to get the same "loading certificate" message several times in a row on this test instance by stopping and starting Kleopatra in a row twice. After removing the Signature Card 2.0 this did not happen again in 5-6 tries, although I collected 2 lingering listing processes again (not both started on the same startup). Even import of a X.509 certificate worked.

After removing the Yubikey, too, repeated restarts of Kleopatra did neither produce lingering listing processes not did I get "loading certificates".

For Beta-75 it looks similar judging from my first tries.

A difference seems to be that while I can provoke collecting leftover gpg and gpgsm processes quite reliably by starting Kleo with both an X.509 card and a Yubikey connected, quit, and restart, this seldom results in al "loading certificates cache" situation in the following start. I've seen that only once today.