"gpgconf --launch gpg-agent" should not take a full second if the agent isn't already started
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	dkg
	Nov 8 2017, 3:51 PM

Description

The first launch of gpg-agent takes a full second:

$ export GNUPGHOME=$(mktemp -d)
$ time gpgconf --launch gpg-agent

real	0m1.010s
user	0m0.003s
sys	0m0.006s
$ time gpgconf --launch gpg-agent

real	0m0.008s
user	0m0.003s
sys	0m0.004s
$

This relatively minor delay becomes a significant delay when gpg is used with ephemeral home directories. ephemeral home directories are recommended for some use cases (e.g. test suites, "empty keyrings" to solve problems like T3277, etc).

Looking at the system calls, i see this pattern:

gpgconf spawns gpg-connect-agent
gpg-connect-agent tries to see whether gpg-agent is listening on the expected socket.
if it is, it confirms that a NOP completes and terminates (this is the second-time use case)
if nothing is listening on the socket, gpg-connect-agent spawns gpg-agent itself, which daemonizes itself and starts listening on that socket.
gpg-connect-agent sleeps for 1 second explicitly to allow gpg-agent time to start up, and then tries again with the NOP.

5436  15:28:55.353803 execve("/usr/bin/gpgconf", ["gpgconf", "--launch", "gpg-agent"], [/* 49 vars */]) = 0 <0.000539>
5436  15:28:55.360343 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe245be69d0) = 5437 <0.000203>
5437  15:28:55.361988 execve("/usr/bin/gpg-connect-agent", ["gpg-connect-agent", "NOP"], [/* 49 vars */]) = 0 <0.000367>
5437  15:28:55.372350 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f1d0ff609d0) = 5438 <0.000203>
5438  15:28:55.372879 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f1d0ff609d0) = 5439 <0.000221>
5438  15:28:55.373159 exit_group(0)     = ?
5438  15:28:55.373298 +++ exited with 0 +++
5437  15:28:55.373650 nanosleep({tv_sec=1, tv_nsec=0},  <unfinished ...>
5439  15:28:55.374332 execve("/usr/bin/gpg-agent", ["gpg-agent", "--homedir", "/tmp/cdtemp.T2hlVJ", "--use-standard-socket", "--daemon"], [/* 49 vars */]) = 0 <0.000331>
5439  15:28:55.384132 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5954bfc9d0) = 5440 <0.000175>
5439  15:28:55.384584 exit_group(0)     = ?
5439  15:28:55.384913 +++ exited with 0 +++
5437  15:28:56.373911 <... nanosleep resumed> 0x7ffd7872f920) = 0 <1.000228>
5440  15:28:56.375991 clone(child_stack=0x7f5953810ff0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f59538119d0, tls=0x7f5953811700, child_tidptr=0x7f59538119d0) = 5442 <0.000150>
5437  15:28:56.382540 exit_group(0)     = ?
5437  15:28:56.382971 +++ exited with 0 +++
5436  15:28:56.383201 exit_group(0 <unfinished ...>
5436  15:28:56.383262 <... exit_group resumed>) = ?
5442  15:28:56.383526 exit(0 <unfinished ...>
5436  15:28:56.383560 +++ exited with 0 +++
5442  15:28:56.383580 <... exit resumed>) = ?
5442  15:28:56.383654 +++ exited with 0 +++

So i think the problem is the hard sleep for 1s in gpg-connect-agent.

I see a couple ways to fix this:

make the initial sleep significantly shorter -- not a tight loop, but maybe start at 1ms, and then double each time until 1s delay is reached. This means 10 wakeups in the span of the current first check. It's likely to shave a lot off of the time here. But it still smells like busy loop waiting, and it's pretty much guaranteed to waste roughly half of the time it spends sleeping, because it's polling instead of reacting to events.
actually open the listening socket with gpg-connect-agent itself, and pass it to the spawned child processes, closing it in the parent. Then, the parent can connect to the socket, send the NOP command, and wait for the response with a read() syscall. When the daemonized child is done with initialization, it receives the NOP and responds to it. The kernel wakes the parent as the response is pushed (no arbitrarily-chosen timeout needed), and the process continues.

the proposal 1 seems easier given the current architecture, but proposal 2 seems better (more robust, with less delay). Is there any reason to avoid proposal 2?

Maybe we do the first proposal now as the "easy" fix, but queue up the second proposal for later when more engineering hours are available?

Revisions and Commits

rG GnuPG
	rG38eb7c360bc4 assuan: Fix exponential decay for first second.
	rG1189df2cd7d4 assuan: Use exponential decay for first 1s of spinlock.
	rGa22a55b994e0 assuan: Reorganize waiting for socket.
	rG0cfdd3b57d59 assuan: Fix exponential decay for first second.
	rG0471ff9d3bf8 assuan: Reorganize waiting for socket.
	rG149041b0b917 assuan: Use exponential decay for first 1s of spinlock.

Related Objects

Mentioned In: T3380: Use exponential backoff when spawning agent and dirmngr
T4112: GnuPG 2.2.10 release
rG08a4b3bc41e5: Merge T3490-proposal1 into master
Mentioned Here: rG0471ff9d3bf8: assuan: Reorganize waiting for socket.
rG149041b0b917: assuan: Use exponential decay for first 1s of spinlock.
rG1fe8fc56f246: assuan: Reorganize waiting for socket.
rGceac6e9e6c08: assuan: Use exponential decay for first 1s of spinlock.
T3277: decrypting data symmetrically doesn't reliably convey confidentiality property

Event Timeline

dkg created this task.Nov 8 2017, 3:51 PM

gpg-connect-agent is used by gpgconf to make things easier. Adding socket playing games is the opposite of simplifying things.

Proposal 1 is much better because all components will benefit from that.

I change this to whishlist because this is not a bug.

I'll push some patches for proposal 1.

To clarify, i'll push them to a separate branch for you to decide whether to merge.

OK, i've pushed 0471ff9d3bf8d6b9a359f3c426d70d0935066907 and 149041b0b917f4298239fe18b5ebd5ead71584a6 to branch T3490-proposal1. It cuts GnuPG's own simple test suite down from about 3 minutes to 1.5 minutes for me. I haven't tested the speedup for the full test suite yet.

• werner claimed this task.Nov 15 2017, 12:27 PM

dkg added a commit: rG149041b0b917: assuan: Use exponential decay for first 1s of spinlock..Nov 15 2017, 1:56 PM

dkg added a commit: rG0471ff9d3bf8: assuan: Reorganize waiting for socket..

• werner mentioned this in rG08a4b3bc41e5: Merge T3490-proposal1 into master.Nov 15 2017, 1:56 PM

Merged into master. Lets test it for a few days before I backport it.

• werner added a commit: rG0cfdd3b57d59: assuan: Fix exponential decay for first second..Nov 15 2017, 3:37 PM

• werner moved this task from Backlog to For next release on the gnupg (gpg22) board.Aug 28 2018, 4:57 PM

• werner added a commit: rGa22a55b994e0: assuan: Reorganize waiting for socket..Aug 28 2018, 5:02 PM

• werner added a commit: rG1189df2cd7d4: assuan: Use exponential decay for first 1s of spinlock..

• werner added a commit: rG38eb7c360bc4: assuan: Fix exponential decay for first second..