Page MenuHome GnuPG

scdaemon hangs up (when output from scdaemon is not consumed by gpg-agent)
Open, NormalPublic

Description

I managed to create (artificial) reproducible case where scdaemon hangs up.

I have a Gnuk Token inserted, and insert/remove Yubikey for this experiment.

(1) first terminal

$ gpg --card-edit
...
gpg/card> 

This is needed, for another terminal can do DEVINFO --watch for scdaemon by socket connection.

(2) second terminal
Invoke DEVINFO --watch but not consuming its output, by stopping Ctrl-Z

$ gpg-connect-agent
> SCD DEVINFO --watch
...
^Z

(3) third terminal
Observing card statuses.

$ while sleep 0.5; do gpg --card-status all; done

And inserting/removing a Yubikey many times.

Eventually (for my case, 45 times of inserting/removing), the gpg process stops proceeding (because scdaemon hangs up).

(4) fourth terminal
After scdaemon hangs up, listing secret keys hangs.

$ gpg -K

Event Timeline

gniibe renamed this task from scdaemon hang to scdaemon hangs up.Oct 9 2024, 4:04 AM
gniibe created this task.
gniibe updated the task description. (Show Details)
werner added a subscriber: werner.

But the DEVINFO --watch is required to trigger this hang? Kleopatra does not use this but we see simlar hangs from time to time in the current version.

I found fd resource leak in gpg-agent.

  • gpg-connect-agent "scd killscd" /bye seems not release a file descriptor somewhere

Because of this, looping with the command eventually results Too many open files

I found a problem of possible duplicate registration of another APP, due to no serialization for CARD access.

When send_serialno_and_app_status is called with WITH_APPS=1, it may newly register apps by select_all_additional_applications_internal to a card.
If a card is not locked, it is racy and multiple registration of apps may occur.

This problem is fixed in: rG25a140542a91: scd: Serialize CARD access for send_card_and_app_list.

gniibe renamed this task from scdaemon hangs up to scdaemon hangs up (when output from scdaemon is not consumed any more).Nov 13 2024, 3:49 AM
gniibe renamed this task from scdaemon hangs up (when output from scdaemon is not consumed any more) to scdaemon hangs up (when output from scdaemon is not consumed gpg-agent).Nov 13 2024, 5:31 AM
gniibe renamed this task from scdaemon hangs up (when output from scdaemon is not consumed gpg-agent) to scdaemon hangs up (when output from scdaemon is not consumed by gpg-agent).

After fixing two bugs, I changed the title to express the scope of this ticket.

The symptom is: scdaemon does not accept commands any more for new connection when one reading from scdaemon is stopped in a different connection.

In this situation, gpg -K hangs, gpg-connect-agent works, but any "SCD ..." command hangs with gpg-connect-agent.

FWIW, we should eventually get rid of the pipe + socket style connection model. It is just to complex with no real benefit.

I believe this is a case of non-consumption of client. on Gpg4win-Beta-75 + updated GnuPG.
Setup: I had two cards connected, one Yubikey and one Netkey3.0 card. I rebooted windows and started Kleopatra. Nothing else.

Certificates where listed as fast as expected but the listing gpg and gpgsm processes did not disappear.
Therefore tested if scdaemon hangs with gpgconf --kill scdaemon: this did not work, so it's a hard lockup. gpg -K hangs, too.
Scdaemon log showed that the cards are read, at least up to a point, but Kleo does not get the info.

Debugview:

I found an issue in libassuan, which might be related possibly, it is described in T7399.
When the response is not consumed like the case in T7399, if repeatedly, it may fill up the buffer eventually.

ebo: Thank you for your testing.

Reading the log, it's quite simple, thre is only a single execution of "SCD SERIALNO --all".
Only I can see the issue of:

  • gpg-agent is started by gpgconf --launch gpg-agent
  • before the gpgconf finishes, KeyCache::RefreshKeyJob started
  • it takes 6.38 second for gpgconf to finish. I wonder if it's normal. It seems for me that it's too slow.
  • it takes 6.73 second for KeyCache::RefreshKeyJob to finish. It seems slow too.
  • after that, it issues "SCD SERIALNO --all" command to gpg-agent

Let us minimize the testcase.

Please try following two setups.

(1) two cards connected. Reboot Windows and do

> gpgconf --launch gpg-agent
> gpg-connect-agent "SCD SERIALNO --all" /bye

And please check if it hangs or not.

  • If you encounter the hard lockup, this is the next place we can proceed (with smaller number of programs, with no Kleo, involved).
  • If it just works, please do next experiment, start from rebooting Windows again. (please try multiple times)

If no hard lockup for this test multiple times, then please try (2).

(2) two cards connected. Reboot Windows and before starting Kleo please make sure to do

> gpgconf --launch gpg-agent

And then, start Kleo to observe if it results the lingering gpg and gpgsm (and the hard lockup).

  • if you encounter the hard lockup, the testcase would be gpg+gpgsm + Kleo, but not related gpg-agent startup.
  • if it just works, please do next experiment, start from rebooting Windows again. (please try multiple times)

If this running gpg-agent before Kleo could prevent the hard lockup, it would be the race condition of starting gpg-agent by multiple programs.
Workaround will be waiting finish of gpgconf (which starts gpg-agent) by Kleo.

Please note that a card insertion to a card reader and a card reader connection to PC are different things.
It may cause different results.

"Reboot Windows" could be done in different ways:

  • no card reader connected before boot. after boot, it is connected to USB port, then card is inserted.
  • card reader connected but no card is inserted. card is inserted after boot.
  • card reader connected and card is inserted into it before the boot.

When doing tests and making the record, please describe the events of card reader connection to USB port and card insertion to card reader.

Besides:
Scardsvr should be available.
https://learn.microsoft.com/en-us/windows/security/identity-protection/smart-cards/smart-card-smart-cards-for-windows-service

In T7378, the log has the line:
sendCommand "SCD SERIALNO --all" failed: "Service ist nicht aktiv" (code: 318, source: SCD)

I think that the card reader is not connected and there is no Scardsvr at this time.
And the card reader connection to USB port results invoking Scardsvr. Then, "SCD SERIALNO --all" gets success.

For T6567 I changed the way that Kleopatra runs "gpgconf --launch gpg-agent". This change is not yet in Eva's test build. It seems my change is not good because running "gpgconf --launch gpg-agent" timed out after 5 seconds in 3 of 3 tests starting Kleopatra after a reboot of the VM. To check if "gpgconf --launch gpg-agent" really takes that long I measured the time in PowerShell after another reboot of the VM. The result is shocking.

PS C:\Users\g10code> Measure-Command { gpgconf --launch gpg-agent | Out-Default }


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 8
Milliseconds      : 463
Ticks             : 84632651
TotalDays         : 9,79544571759259E-05
TotalHours        : 0,00235090697222222
TotalMinutes      : 0,141054418333333
TotalSeconds      : 8,4632651
TotalMilliseconds : 8463,2651

The command took 8.5 (!!!) seconds. A second try took ~6.5 seconds.

"gpgconf --kill all" also takes many seconds:

PS C:\Users\g10code> Measure-Command { gpgconf --kill all | Out-Default }

[...]
TotalSeconds      : 4,3669872

A second try took ~4.3 seconds.

The next run of "gpgconf --launch gpg-agent" takes almost no time:

PS C:\Users\g10code> Measure-Command { gpgconf --launch gpg-agent | Out-Default }

[...]
TotalSeconds      : 0,2258693

All of the above was done without any physical smart cards attached to the VM.

Maybe Windows/Microsoft Defender asks Redmond if gpgconf and gpg-agent are safe to start on the first start after a reboot. In any case, I think we are chasing ghosts if we look at first start after reboot. I think it makes no sense to try to optimize the first start of Kleopatra after a reboot because it's out of our control how long it takes. The only viable solution I can see is to autostart Kleopatra (or at least gpg-agent) on reboot.

I think that the card reader is not connected and there is no Scardsvr at this time.
And the card reader connection to USB port results invoking Scardsvr. Then, "SCD SERIALNO --all" gets success.

I did 8 reboots of my windows VM with 2 smart card readers connected (with cards inserted).

But only for one of those your commands in https://dev.gnupg.org/T7323#194448 the instructions (1) failed:

gpg-connect-agent "SCD SERIALNO -all" /bye
ERR 100696144 No such device <SCD>

This was the only case of those where SCardSrv was not running.

Regarding Ingo's comment: Yes, launching gpg-agent takes quite long. And I also do not believe that it is connected to the reboot, usually I see these issues not after reboot, because I do not do this that often.

@ikloecker indeed we try only for 5 seconds:

/* The time we wait until the agent or the dirmngr are ready for
   operation after we started them before giving up.  */
#define SECS_TO_WAIT_FOR_AGENT 5
#define SECS_TO_WAIT_FOR_KEYBOXD 5
#define SECS_TO_WAIT_FOR_DIRMNGR 5

which turned out to work well enough on Linux. For WindowCE we once used 30 seconds, but that is another story

gniibe lowered the priority of this task from High to Normal.Nov 18 2024, 3:09 AM

@ebo Thank you for your testing.

IIUC, from CLI without Kleo, your tests results by 8 reboots are:

  • no hang
  • one error

Then, it should not be the bug (~= ghost) which I have been try to catch (the one when output from scdaemon is not consumed by gpg-agent).
The output of "SCD SERIALNO --all" is expected to be a single line, which cannot cause hang of scdaemon even if this output is not consumed (multiple times).

And even the single failure of No such device <SCD> means that there is some sort of race conditions around PC/SC + card reader (or its driver) + scdaemon after boot; We should consider this seriously.

While this particular problem (which can be reproducible by the artificial setup easily) exists, let me lower the priority for this ticket, and concentrate on T7396.

By your experiments and @ikloecker 's comment, I tried to see the bigger picture. Considering again, my next target to fix is the problem around apdu_connect (SCardConnect in PC/SC) in T7396.