"gpgconf --launch gpg-agent" should not take a full second if the agent isn't already started
Closed, ResolvedPublic

Description

The first launch of gpg-agent takes a full second:

$ export GNUPGHOME=$(mktemp -d)
$ time gpgconf --launch gpg-agent

real	0m1.010s
user	0m0.003s
sys	0m0.006s
$ time gpgconf --launch gpg-agent

real	0m0.008s
user	0m0.003s
sys	0m0.004s
$ 

This relatively minor delay becomes a significant delay when gpg is used with ephemeral home directories. ephemeral home directories are recommended for some use cases (e.g. test suites, "empty keyrings" to solve problems like T3277, etc).

Looking at the system calls, i see this pattern:

  • gpgconf spawns gpg-connect-agent
  • gpg-connect-agent tries to see whether gpg-agent is listening on the expected socket.
  • if it is, it confirms that a NOP completes and terminates (this is the second-time use case)
  • if nothing is listening on the socket, gpg-connect-agent spawns gpg-agent itself, which daemonizes itself and starts listening on that socket.
  • gpg-connect-agent sleeps for 1 second explicitly to allow gpg-agent time to start up, and then tries again with the NOP.
5436  15:28:55.353803 execve("/usr/bin/gpgconf", ["gpgconf", "--launch", "gpg-agent"], [/* 49 vars */]) = 0 <0.000539>
5436  15:28:55.360343 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe245be69d0) = 5437 <0.000203>
5437  15:28:55.361988 execve("/usr/bin/gpg-connect-agent", ["gpg-connect-agent", "NOP"], [/* 49 vars */]) = 0 <0.000367>
5437  15:28:55.372350 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f1d0ff609d0) = 5438 <0.000203>
5438  15:28:55.372879 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f1d0ff609d0) = 5439 <0.000221>
5438  15:28:55.373159 exit_group(0)     = ?
5438  15:28:55.373298 +++ exited with 0 +++
5437  15:28:55.373650 nanosleep({tv_sec=1, tv_nsec=0},  <unfinished ...>
5439  15:28:55.374332 execve("/usr/bin/gpg-agent", ["gpg-agent", "--homedir", "/tmp/cdtemp.T2hlVJ", "--use-standard-socket", "--daemon"], [/* 49 vars */]) = 0 <0.000331>
5439  15:28:55.384132 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5954bfc9d0) = 5440 <0.000175>
5439  15:28:55.384584 exit_group(0)     = ?
5439  15:28:55.384913 +++ exited with 0 +++
5437  15:28:56.373911 <... nanosleep resumed> 0x7ffd7872f920) = 0 <1.000228>
5440  15:28:56.375991 clone(child_stack=0x7f5953810ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f59538119d0, tls=0x7f5953811700, child_tidptr=0x7f59538119d0) = 5442 <0.000150>
5437  15:28:56.382540 exit_group(0)     = ?
5437  15:28:56.382971 +++ exited with 0 +++
5436  15:28:56.383201 exit_group(0 <unfinished ...>
5436  15:28:56.383262 <... exit_group resumed>) = ?
5442  15:28:56.383526 exit(0 <unfinished ...>
5436  15:28:56.383560 +++ exited with 0 +++
5442  15:28:56.383580 <... exit resumed>) = ?
5442  15:28:56.383654 +++ exited with 0 +++

So i think the problem is the hard sleep for 1s in gpg-connect-agent.

I see a couple ways to fix this:

  1. make the initial sleep significantly shorter -- not a tight loop, but maybe start at 1ms, and then double each time until 1s delay is reached. This means 10 wakeups in the span of the current first check. It's likely to shave a lot off of the time here. But it still smells like busy loop waiting, and it's pretty much guaranteed to waste roughly half of the time it spends sleeping, because it's polling instead of reacting to events.
  2. actually open the listening socket with gpg-connect-agent itself, and pass it to the spawned child processes, closing it in the parent. Then, the parent can connect to the socket, send the NOP command, and wait for the response with a read() syscall. When the daemonized child is done with initialization, it receives the NOP and responds to it. The kernel wakes the parent as the response is pushed (no arbitrarily-chosen timeout needed), and the process continues.

the proposal 1 seems easier given the current architecture, but proposal 2 seems better (more robust, with less delay). Is there any reason to avoid proposal 2?

Maybe we do the first proposal now as the "easy" fix, but queue up the second proposal for later when more engineering hours are available?

dkg created this task.Nov 8 2017, 3:51 PM
werner triaged this task as Wishlist priority.Nov 8 2017, 4:54 PM
werner removed a project: Bug Report.
werner added a subscriber: werner.

gpg-connect-agent is used by gpgconf to make things easier. Adding socket playing games is the opposite of simplifying things.

Proposal 1 is much better because all components will benefit from that.

I change this to whishlist because this is not a bug.

dkg added a comment.Nov 8 2017, 5:26 PM

I'll push some patches for proposal 1.

dkg added a comment.Nov 8 2017, 5:27 PM

To clarify, i'll push them to a separate branch for you to decide whether to merge.

dkg added a comment.EditedNov 8 2017, 5:48 PM

OK, i've pushed 0471ff9d3bf8d6b9a359f3c426d70d0935066907 and 149041b0b917f4298239fe18b5ebd5ead71584a6 to branch T3490-proposal1. It cuts GnuPG's own simple test suite down from about 3 minutes to 1.5 minutes for me. I haven't tested the speedup for the full test suite yet.

werner claimed this task.Nov 15 2017, 12:27 PM

Merged into master. Lets test it for a few days before I backport it.

werner closed this task as Resolved.Aug 28 2018, 5:03 PM

Backport done for 2.2.10