extended-key-format test of openpgp/decrypt-unwrap-verify.scm fails on sparc64 and x32
Closed, ResolvedPublic

Description

debian's x32 and sparc64 build daemons both fail when running the extended-key-format test of openpgp/decrypt-unwrap-verify.scm.

The normal tests and the use-keyring tests do *not* fail on those architectures, however.

The test suites contain a lot of noisy reporting that i'm having difficulty wading through, but i think the difference between the failing runs (extended-key-format) and the successful runs (non-extended-key-format) is that the failing runs report:

Child 26814 returned: ((command ("/<<PKGBUILDDIR>>/build/tools/gpgconf" --build-prefix "/<<PKGBUILDDIR>>/build" --kill all)) (status 1) (stdout ) (stderr gpgconf: error running '/<<PKGBUILDDIR>>/build/tools/gpg-connect-agent': terminated
gpgconf: error running '/<<PKGBUILDDIR>>/build/tools/gpg-connect-agent scd killscd': General error
)) 
gpgconf: error running '/<<PKGBUILDDIR>>/build/tools/gpg-connect-agent': terminated
gpgconf: error running '/<<PKGBUILDDIR>>/build/tools/gpg-connect-agent scd killscd': General error

while the successful runs report:

Child 20292 returned: ((command ("/<<PKGBUILDDIR>>/build/tools/gpgconf" --build-prefix "/<<PKGBUILDDIR>>/build" --kill all)) (status 0) (stdout ) (stderr ))

I don't know what to make of this difference.

dkg created this task.Wed, Jun 19, 3:54 PM
werner triaged this task as Normal priority.Wed, Jun 19, 5:01 PM
gniibe claimed this task.EditedFri, Jun 21, 1:45 AM
gniibe added a subscriber: gniibe.

I took this task as it has errors of gpg-connect-agent scd killscd. But, it seems for me that it's not the direct cause.
Anyway, I investigate the bug.

I think that ...gpg-connect-agent': terminated is strange. Because of this failure, it reports scd killscd failed.
From 2.2.13 to 2.2.16, we added disable-scdaemon in openpgp/defs.scm, so, there must be no scdaemon running. Thus, actually the command scd killscd are not sent to gpg-agent.

I can see three: encsig-2-keys-3 Executing:, and IIUC, we should have three encsig-2-keys-4 Executing: too (but only two).
I suspect some cause(s) around here. Let's see...

gniibe edited projects, added gnupg, gpgagent; removed gnupg (gpg22).Fri, Jun 21, 2:33 AM

I found a race condition between KILLAGENT command and accepting another request.
Here is a patch to replicate the race condition :

Because of this, gpgconf --kill all may be failed, since it firstly invokes gpg-connect-agent KILLAGENT then secondly invokes gpg-connect-agent "GETINFO scd_running" .... While first invocation asks killing agent and it may take time, second invocation may connect successfully but during processing the GETINFO command the process of gpg-agent exits.

Correct solution is to implement KILLAGENT synchronously, but it's somehow harder to implement.
Easier workaround is modifying gpgconf like:

diff --git a/tools/gpgconf-comp.c b/tools/gpgconf-comp.c
index d4b3ebd29..1ba242387 100644
--- a/tools/gpgconf-comp.c
+++ b/tools/gpgconf-comp.c
@@ -1339,7 +1339,7 @@ gc_component_kill (int component)
     }
 
   /* Do the restart for the selected backends.  */
-  for (backend = 0; backend < GC_BACKEND_NR; backend++)
+  for (backend = GC_BACKEND_NR-1; backend; backend--)
     {
       if (runtime[backend] && gc_backend[backend].runtime_change)
         (*gc_backend[backend].runtime_change) (1);
dkg added a comment.Fri, Jun 21, 6:24 PM

@gniibe, thanks for the diagnosis! I agree that restarting or shutting down the backends should be done in the reverse order as a simple workaround.

We've known for some time that shutting down the background daemons is problematic. while they're shutting down, they're still holding the listening socket open, but not answering new requests on it. during that time, we see this particular race.

Interestingly, once the agent completely *does* terminate, we *don't* get the problem, because the client will just auto-launch another agent.

It's never been clear to me what the value of of this window of inaccessibility is.

I note that if the agent was to stop accept()ing on the listening socket and close it as soon as the agent plans to shut down, then the window would be significantly smaller (i'm not sure how to eliminate the window entirely).

There are two different cases: (1) By SIGTERM and (2) By KILLAGENT. It's true that the agent stops accepting on the listening socket for (1), but it's not the case for (2).
This particular problem is for the case (2).

For the window between two SIGTERMs, my interpretation is that it is an intentional behavior for debug purpose. Well, I also don't fully understand how it's useful, though.

For fully fixing (2), I think that it requires a bit of larger surgery of the agent. Currently, threads are dispatched upon accepting on the listening socket, and the main loop does no check for those children threads. This structure should be changed (or we need to introduce interaction between child thread to the main loop), so that the main loop can synchronously receive the notification of KILLAGENT (to shutdown the process), and the child thread can defer answer to the client until it can make sure gracefully shutting down.

Hm, T4521 suggests that the two different cases should not be treated differently. If you think that they *should* cause distinct behavior, please do mention it over there!

@werner, perhaps you can comment on whether there is a design goal for this window of inaccessibility?

I'm not sure i understand the options you lay out for dealing with case (2), @gniibe, but "or we need to introduce interaction between child thread to the main loop" sounds like a patch we've been shipping in debian for a while that allows a client to send a signal to the parent. I don't know how to make it work on Windows though -- perhaps @aheinecke can suggest how it would work on that platform?

@dkg, for your patch, it can be improved for Windows by using its event mechanism. You can see gnupg/scd/scdaemon.c.

Somehow related things...

Last year, actually, I tried to merge your patch, when I needed similar for scdaemon.
For scdaemon, currently, we use pipe instead of signal, because I encountered a symptom: pselect doesn't return EINTR by the signal

This use of pipe is needed for NetBSD, and it's also needed for GNU/Linux (other than ARM) in a corner case of suspend/resume.
In suspend/resume on GNU/Linux machine, SIGCONT is used internally to wake up frozen processes, and there is a race condition where a scdaemon thread of USB sending SIGCONT and pselect is restarted wrongly, never returns EINTR.

I sent my fix to the kernel: https://lkml.org/lkml/2018/3/21/909
I had expected that fix were accepted soonish, so that I could merge your patch with no change for GNU/Linux.

Perhaps, it's time to re-evaluate.

dkg added a comment.Tue, Jun 25, 2:57 AM

I'm unlikely to put a windows-specific patch into the debian source, as
i have no good way of testing it, and it wouldn't affect any binary that
we ship.

If you wanted to imlement this functionality in gpg-agent in a way that
i can drop the debian patches, i would be very happy, though :)

gniibe changed the task status from Open to Testing.Mon, Jul 1, 6:14 AM
werner closed this task as Resolved.