I will try to reproduce it. It might be that --passwd also triggered the conversion to the new format.
Jun 12 2017
Odd, I cannot reproduce this:
Jun 2 2017
I released libgcrypt 1.7.7 and nPth 1.6.
The libgcrypt secmem fix is not that urgent, I think. The nPth bug for macOS sounds more severe.
Jun 1 2017
So, should we do a new libgcrypt release RSN?
There is another bug with a solution also pending, and it might not be too late for Squeeze if we hurry.
I managed to replicate this issue by preparing an artificial nPth on x86 GNU/Linux.
I fixed a bug in nPth: rPTH4fae99976c31: Fix busy_wait_for.
During this debugging, I also found and fixed a bug in libassuan: rA62f3123d3877: Use gpgrt_free to release memory allocated by gpgrt_asprintf.
Also, I fixed two related bugs in GnuPG:
rGc03e0eb01dc4: agent: Fix error from do_encryption.
rG996544626ea4: agent: Fix memory leaks.
May 31 2017
May 25 2017
@gniibe, I'm not setting the max-passphrase option. Currently, my gpg-agent.conf looks like this:
@landro, do you have any key which might require a passphrase update because of its expiration?
I mean, do you have the gpg-agent option "max_passphrase_days" set? (The default is not set.)
(Since I was writing from my phone, the sentence was terse. Sorry. This time, from a PC.)
May 24 2017
What do you mean by connection error, @gniibe? I hope the user is not impacted by what you are suggesting.
For smartcards, yes. The feature for ssh with a smartcard has been available for more than ten years. I recently applied the same approach to the gpg frontend.
"landro (Stefan Magnus Landrø)" <noreply@dev.gnupg.org> writes:
Just noticed one more thing - I'm not trying to use a smartcard at this time (I plan on moving to YubiKeys in the future though) - why is "new connection to SCdaemon established" all over the logs?
So I'm using pinentry-mac in my gpg-agent.conf:
May 23 2017
So I noticed your log contains lots of "starting a new PIN Entry" messages. I assume you are using some kind of password manager integration, so that you don't need to enter it each time (sorry, I'm not familiar with how pinentry works on macOS).
Ok. To reproduce, I believe the key is to establish lots of connections (in my rig around 20) to (possibly different) ssh server(s) (possibly by going through a bastion) within a short timeframe.
"landro (Stefan Magnus Landrø)" <noreply@dev.gnupg.org> writes:
Too bad. I installed both libgcrypt and gnupg using homebrew, and apparently there is no way to make homebrew include debug info. I guess I could build from source and include debug info - where can I find instructions on doing that?
Hm, it did not give us the location in the source unfortunately, only
the offset from the top of the function, which the original stack trace
already contains. Maybe the library does not contain debug information.
Depending on how you installed that software, there may be a way to
install the debug symbols too. That would make bug reports much more
helpful. Thanks anyway, maybe the log will help us trace the problem.
"landro (Stefan Magnus Landrø)" <noreply@dev.gnupg.org> writes:
In https://dev.gnupg.org/T3027#97654, @justus wrote:
> Hi @landro, thanks for the stack trace. Could you please try to resolve this frame
> 4 libgcrypt.20.dylib 0x000000010d8b14d2 openpgp_s2k + 594
Here it is, @justus:
$ atos -o /usr/local/opt/libgcrypt/lib/libgcrypt.20.dylib -arch x86_64 -l 0x10d896000 0x000000010d8b14d2
openpgp_s2k (in libgcrypt.20.dylib) + 594
@landro Thanks a lot. I think that we see some failures in the log, and there might be another bug in the failure path.
In T3027#97654, @justus wrote:
> Hi @landro, thanks for the stack trace. Could you please try to resolve this frame
> 4 libgcrypt.20.dylib 0x000000010d8b14d2 openpgp_s2k + 594
> to a source code location? I believe it can be done this way:
$ atos -o /usr/local/opt/libgcrypt/lib/libgcrypt.20.dylib -arch x86_64 -l 0x10d896000 0x000000010d8b14d2
I tried to reproduce this issue locally but failed.
Here is the output of the log file
In the crash log of 2017-05-22, I can't find any race or violation of a shared object. It looks like some malloc-related error.
Does gpg-agent emit error message(s)?
May 22 2017
Hi @landro, thanks for the stack trace. Could you please try to resolve this frame
Just retested this with 2.1.21 - unfortunately gpg-agent is still crashing. See the newly attached crash log.
May 17 2017
May 16 2017
Fixed in 2.1.21.
May 15 2017
Automatic creation of socket directories creates cleanup trouble for projects previously relying on the agent shutting down when $GNUPGHOME is removed: https://notmuchmail.org/pipermail/notmuch/2017/024550.html
Apr 19 2017
Yes, 2.1.20 has the issue, too.
The crash report can be explained as follows: if a double free occurs when multiple threads access the cache at the same time, a subsequent allocation in md_open may crash.
I finally got around to testing this again - sorry for the long wait. I tested with 2.1.20 - same behaviour as in 2.1.19. Crash report attached.
Apr 11 2017
Apr 7 2017
Applied as ebe12be034f0.
Apr 6 2017
While I can't reproduce this problem myself, I think I found an issue in gpg-agent's passphrase caching.
A double free may happen when multiple threads enter agent_put_cache, for example.
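To illustrate the kind of race meant here, the following is a minimal, hypothetical C sketch (not GnuPG's actual agent_put_cache code): two threads replace the same cache slot without a lock, so both may free the same old pointer; putting the swap under a mutex avoids the double free.

  /* Hypothetical sketch of a double free caused by an unlocked cache slot.
     Compile with: cc -pthread sketch.c */
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static char *cached_value;                       /* shared cache slot */
  static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Racy version: both threads may read the same old pointer and free it. */
  static void put_cache_racy (const char *value)
  {
    char *old = cached_value;          /* unsynchronized read             */
    cached_value = strdup (value);     /* unsynchronized write            */
    free (old);                        /* second thread may free the same
                                          'old' pointer => double free    */
  }

  /* Fixed version: the pointer swap happens under a mutex. */
  static void put_cache_locked (const char *value)
  {
    pthread_mutex_lock (&cache_lock);
    char *old = cached_value;
    cached_value = strdup (value);
    pthread_mutex_unlock (&cache_lock);
    free (old);                        /* 'old' is now owned by this thread */
  }

  static void *worker (void *arg)
  {
    (void) arg;
    for (int i = 0; i < 100000; i++)
      put_cache_locked ("passphrase"); /* switch to put_cache_racy to see
                                          the crash eventually            */
    return NULL;
  }

  int main (void)
  {
    pthread_t a, b;
    pthread_create (&a, NULL, worker, NULL);
    pthread_create (&b, NULL, worker, NULL);
    pthread_join (a, NULL);
    pthread_join (b, NULL);
    puts ("done");
    return 0;
  }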
Apr 4 2017
In 2.1.19, gpg-agent uses getpeerucred for macOS. I changed it (since it did not seem to be working). In 2.1.20, gpg-agent now uses getsockopt with LOCAL_PEERPID.
It seems to me that the crash occurs in ucred_free. If that is the case, 2.1.20 fixes this issue.
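For reference, a minimal macOS-specific sketch of that approach (simplified, not gpg-agent's actual code): the peer's pid is read straight out of the connected socket with getsockopt(LOCAL_PEERPID), so no ucred object is allocated and nothing has to be passed to ucred_free. A socketpair() stands in for a real client connection here.

  #include <stdio.h>
  #include <sys/socket.h>
  #include <sys/un.h>          /* SOL_LOCAL, LOCAL_PEERPID (macOS) */
  #include <unistd.h>

  static pid_t peer_pid_of (int connected_fd)
  {
    pid_t pid = -1;
    socklen_t len = sizeof pid;

    if (getsockopt (connected_fd, SOL_LOCAL, LOCAL_PEERPID, &pid, &len) < 0)
      return -1;
    return pid;   /* no ucred object is allocated, so nothing to free */
  }

  int main (void)
  {
    int fds[2];

    if (socketpair (AF_UNIX, SOCK_STREAM, 0, fds) < 0)
      { perror ("socketpair"); return 1; }

    /* With a socketpair the "peer" is this very process. */
    printf ("peer pid: %ld (own pid: %ld)\n",
            (long) peer_pid_of (fds[0]), (long) getpid ());
    close (fds[0]);
    close (fds[1]);
    return 0;
  }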
Mar 30 2017
Mar 24 2017
We also have a discussion on the mailing list. It does not currently make sense
to continue here.
The problem of NFS mounted home directories is _real_ and we have a solution for
this which is better than the old redirection hack.
The problem with too long socket names is not severe and has been around for
decades (for other software, and for 14 years for GnuPG). There are workarounds, and
/run/user also solves this.
I proposed a change which does not even require --create-socketdir. There was
no comment on this, and thus I will push it now so that we can do a real-life test.
> Justus: I told you several times that we are not going to change working code
> for no good reason.

Except that it is not working. If it was working, then
06f1f163e96f1039304fd3cf565cf9de1ca45849
would not be necessary.

> Even if your hack (I call it a hack because it does not
> work with getsockname)

1/ Yes it does. It returns precisely the path that was used in bind.
2/ We only use getsockname on sockets that were given us by a service manager
like systemd, and thus those sockets would be unaffected by "the hack".

> would make it, it does not solve the major problem: The
> inability of creating sockets on certain file systems. THAT is the major reason
> why we moved to /var/run.

Please stop conflating these things. This bug is about "dirmngr and gpg-agent
should work automatically even when GNUPGHOME is larger than sun_path". It is
not about NFS or FAT or something.
Mar 21 2017
Justus: I told you several times that we are not going to change working code
for no good reason. Even if your hack (I call it a hack because it does not
work with getsockname) would make it, it does not solve the major problem: The
inability of creating sockets on certain file systems. THAT is the major reason
why we moved to /var/run.
> The whole IPC thing is pretty complex and adding a non-standard hack as proposed
> by Justus will for sure cause breakage on some platforms.
I'm not sure why you call it a hack. I've been looking at POSIX, [0] introduces
pathname resolution, and the terms 'relative path' and 'absolute path'.
0: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_13
Neither the page for connect [1], nor the one for bind [2] state that the path
used to connect/bind unix sockets must be an absolute path.
1: http://pubs.opengroup.org/onlinepubs/9699919799/functions/connect.html
2: http://pubs.opengroup.org/onlinepubs/9699919799/functions/bind.html
Furthermore, my test across a wide range of UNIX implementations did not show
any issues with using relative paths.
Mar 14 2017
I agreed in T2964 (wk on Mar 01 2017, 07:31 AM / Roundup) to auto-create socket directories. I would like to do that
only for a tmpfs, but we can also try to do this always. Adding an inotify watch
to remove the directory is more complex and I am not sure whether this is really
needed. The other thing is simple and we could do that for 2.1.20.
The whole IPC thing is pretty complex and adding a non-standard hack as proposed
by Justus will for sure cause breakage on some platforms.
Yes, we should document /var/run recommendations in the README. I will do that
for the next release.
This bug report simply asks to solve the generic problem of GNUPGHOME being
larger than sun_path. Justus's proposed mechanism is only one way of solving
that problem.
Another proposed mechanism is what i originally proposed in T2964 (dkg on Feb 17 2017, 01:52 AM / Roundup), which
*does* address remote filesystems and re-mounted filesystems.
I don't understand the critique about the code not yet being mature. Code
doesn't become mature by not being written; it needs to be written first and
then tested in order to become mature.
Lastly, i think if we expect that /run/user/$(id -u)/ is a "simple dependency"
for building other software, we need to make that expectation explicit someplace
reasonable (e.g. doc/HACKING or something similar)
Mar 6 2017
My main reasons why I don't want to consider this now are:
- That code is not written and thus has not matured.
- It does not solve the major problem why we moved to /var/run, namely remote file systems and avoidance of possible re-mounted file systems
- The claim that /var/run/user does not exist is not valid, because that is a simple dependency for building the software or using it with non-common setups (remote, long $HOME). Thus an admin will be on duty anyway, and adding a few lines to /etc/rc.local is not a big deal.
FWIW, we may try this in 2.3; see T2987.
Werner does not think that this is a problem and does not want me to spend time
on this.
getsockname is only used to recover the paths of sockets bound by a supervisor
like systemd. So unless systemd starts doing the same trick that I propose,
there is no problem.
Mar 2 2017
From what I've seen there is no variation in getsockname, it just returns
whatever path is passed to bind. I don't understand the need for getsockname
tbh, because we are the ones that bind the socket in the first place.
(The only variation seems to be that the function is broken on Hurd...).
Mar 1 2017
Justus, thanks for this work, it's great! If we can solve the problem by doing
more clever socket(7) manipulation, that would be a big win.
How do you propose dealing with the getsockname() variations? Or should we just
forbid the use of getsockname() entirely in the gnupg codebase?
dkg, I understand that GnuPG does not work with such a homedir; however, it is
not the act of creating the socket that is problematic. In fact, both
bind(2)ing and connect(2)ing are ok if one uses relative paths, as demonstrated
by the test program I have attached here.
Here is the program binding and connecting to a socket with an absolute path
length of ~10 * sizeof sockaddr_un.sun_path:
System: OpenBSD:6.0:GENERIC.MP#1992
sizeof addr.sun_path: 104
Running test with strlen (cwd): 22, name: '/tmp/test-unix-sockets/socket'
getsockname returned '/tmp/test-unix-sockets/socket', addrlen: 106
Running test with strlen (cwd): 22, name: 'socket'
getsockname returned 'socket', addrlen: 106
Running test with strlen (cwd): 126, name: 'socket'
getsockname returned 'socket', addrlen: 106
Running test with strlen (cwd): 1062, name: 'socket'
getsockname returned 'socket', addrlen: 106
This works on all Unices that I have access to. I've asked on gnupg-devel@ for
people to run it elsewhere.
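For readers without the attachment, here is a minimal sketch of the same idea (a reconstruction, not the attached program): chdir() into the socket's directory and bind() with just the file name, then confirm with getsockname() that only the short relative path was stored.

  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  int main (int argc, char **argv)
  {
    const char *dir  = argc > 1 ? argv[1] : "/tmp";
    const char *name = "socket";
    struct sockaddr_un addr;
    socklen_t addrlen = sizeof addr;
    int fd;

    /* Go where the socket lives; the directory's absolute path may be
       far longer than sun_path. */
    if (chdir (dir) < 0)
      { perror ("chdir"); return 1; }

    unlink (name);                     /* remove a stale socket file */

    fd = socket (AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
      { perror ("socket"); return 1; }

    memset (&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strncpy (addr.sun_path, name, sizeof addr.sun_path - 1);

    /* The relative name easily fits into sun_path (104/108 bytes). */
    if (bind (fd, (struct sockaddr *) &addr, sizeof addr) < 0)
      { perror ("bind"); return 1; }

    if (getsockname (fd, (struct sockaddr *) &addr, &addrlen) == 0)
      printf ("getsockname returned '%s'\n", addr.sun_path);

    close (fd);
    return 0;
  }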
I understand that '--create-socketdir' solves problems besides this one. But I
disagree with the statement that our handling of socket paths is unproblematic
because --create-socketdir solves this problem.
Can we test whether /run is mounted on a tmpfs?
Should we assume that /run is always on a tmpfs but /var/run is a classical Unix
directory without a tmpfs? Or is it better to have a configure option?
I can imagine agreeing to auto-create the directory on a tmpfs.
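One possible answer to the tmpfs question, as a Linux-specific sketch (statfs()'s f_type compared against TMPFS_MAGIC; other systems would need their own checks):

  #include <stdio.h>
  #include <sys/vfs.h>              /* statfs() on Linux */
  #ifndef TMPFS_MAGIC
  # define TMPFS_MAGIC 0x01021994   /* from <linux/magic.h> */
  #endif

  static int is_tmpfs (const char *path)
  {
    struct statfs sfs;

    if (statfs (path, &sfs) < 0)
      return 0;                     /* treat errors as "not a tmpfs" */
    return sfs.f_type == TMPFS_MAGIC;
  }

  int main (void)
  {
    printf ("/run is %son a tmpfs\n", is_tmpfs ("/run") ? "" : "not ");
    return 0;
  }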
Yes, notmuch decided that they needed to work around the situation anyway,
because they're in an environment that doesn't create the standard per-user
rundir. That doesn't seem like a great argument that gpg should also fail in
environments where the standard per-user rundir is available. I can demonstrate
a number of environments where gpg or its daemons will fail, but i don't think
any of them justify forcing gpg or its daemons to *also* fail when those
environments aren't present.
In answer to your nitpick, here is evidence that gpg's daemons cannot create
their sockets when the GNUPGHOME is too long:
1 dkg@alice:~$ mkdir -m 0700
/home/dkg/tmp/very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-long
0 dkg@alice:~$
GNUPGHOME=/home/dkg/tmp/very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-very-long
gpgconf --launch dirmngr
gpgconf: error running '/usr/bin/gpg-connect-agent': exit status 1
gpgconf: error running '/usr/bin/gpg-connect-agent --dirmngr NOP': General error
1 dkg@alice:~$
Feb 28 2017
Feb 13 2017
The whole point of a daemon is that it idles in the background, waiting for work.
A more useful feature would be to flush the passphrase cache when the user is
no longer logged in. But for Debian this has already been done by --supervised.
Feb 8 2017
I agree about that race condition being an important thing to consider, but i
think it's orthogonal to whether the process is self-terminating.
That is: we need to consider that race condition even in the case of deliberate
shutdown too, right?
Do we have a test case that involves two concurrent processes, one that tries to
stop the agent, and the other that tries to access it?
Feb 7 2017
One thing to look out for is a race condition between the agent deciding to shut
down, and a client trying to connect at that time, and that might lead to
intermittent failures. It may be doable correctly, but it is something to look
out for.
The other point being raised in the bug report about older daemons hanging
around over package upgrades should be discussed in a different bug. Yes,
shutting down the daemon when idle may work around this issue sometimes, but
clearly this is not a robust solution.
Feb 6 2017
Jan 23 2017
Jan 6 2017
Actually we do not need that function on Windows. On Unix it is called at
startup to get a list of files not to close. On Windows we do not need to close
the files before a CreateProcess, and thus close_all_fds is a dummy anyway.
I removed the call to this function under Windows. To go into 2.1.18.
Dec 17 2016
The problem still occurred after the update of Libgcrypt, but I'm pretty sure now
that I have determined the origin of the problem. In the end it is somehow my fault: over
time I got more and more email accounts which are synchronized with offlineimap, and
the passwords for each account are encrypted with gpg.
Offlineimap offers an option for multithreading, which synchronizes the accounts in a
parallel manner. By changing to a strictly serialized synchronization the problem
seems to vanish. My guess is it was simply too much at once.
For those who encounter the same problem, try the '-1' option of offlineimap.
Thanks for your time and work (in general)!
Dec 9 2016
I just released Libgcrypt 1.7.4 - which should fix that bug.
Dec 7 2016
Backported to LIBGCRYPT-1-7-BRANCH
I have now pushed a change to Libgcrypt master to implement auto-extending of
secure memory pools. Commit b6870cf, but there are two other commits which this
is based upon. My test shows that I can now decrypt a message encrypted to the
test-hugekey.key.
I will port this back to Libgcrypt 1.7.
Dec 6 2016
I will try out the idea of extending the secmem pool even if that means no mlock.
ah right, "ulimit -l" says 64 (kbytes) on my Linux system as well. According to
mlock(2) that's since kernel 2.6.9.
So i think it's worth adopting the supplied patch as a workaround at least (i
can confirm that it resolves the specific use case described in T2857 (dkg on Dec 05 2016, 05:47 PM / Roundup)), and i
agree with you that we should extend libgcrypt to extend secure memory allocation.
it's not clear to me that swap is outside the trust boundary anyway these days,
and modern systems should prefer encrypted swap where possible.
The secmem has two goals:
- Avoid swapping out these pages. Thus the mlock.
- Making sure that on free the memory is zeroized.
mlock requires root privileges and thus a special init sequence is required
(install as setuid(root); gpg-agent drops the privileges directly after
allocating and mlocking the secmem). In the old times this was required, and it
probably still is today on non-Linux platforms. However, Linux now allows
any process to mlock a certain amount (64k on my box).
I tend to suggest that we extend Libgcrypt to extend the secure memory
allocation by not using mlocked memory but keeping the zeroization feature.
That is the second option from T2857 (wk on Dec 05 2016, 07:11 PM / Roundup).
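A very small sketch of that second option (hypothetical, not Libgcrypt's actual implementation): hand out ordinary, non-mlocked heap memory from an overflow pool, but keep enough bookkeeping to wipe it on free.

  #include <stdlib.h>
  #include <string.h>

  struct hdr { size_t len; };        /* size header for wiping on free */

  static void *fallback_secure_malloc (size_t n)
  {
    struct hdr *h = malloc (sizeof *h + n);   /* not mlock'ed */
    if (!h)
      return NULL;
    h->len = n;
    return h + 1;
  }

  static void fallback_secure_free (void *p)
  {
    if (!p)
      return;
    struct hdr *h = (struct hdr *) p - 1;
    memset (p, 0, h->len);           /* keep the zeroize-on-free property;
                                        a real implementation would use an
                                        explicit-wipe primitive so the
                                        compiler cannot optimize this away */
    free (h);
  }

  int main (void)
  {
    char *s = fallback_secure_malloc (32);
    if (s)
      {
        strcpy (s, "secret");
        fallback_secure_free (s);    /* wiped before returning to the heap */
      }
    return 0;
  }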
is the only goal of the secure memory to keep the RAM from being written to
swap, or are there other goals of secure memory? why is it unlikely that a new
block of memory can be mlock'd? what are the consequences of the new block not
being mlock'd? will it still be treated as secure memory?
crashing in the event that we run out of secure memory is simply not acceptable
these days, especially in a model where we have persistent long-term daemons
that people expect to remain running.
I just posted 0001-agent-Respect-enable-large-secmem.patch to gnupg-devel:
https://lists.gnupg.org/pipermail/gnupg-devel/2016-December/032285.html
Dec 5 2016
Yeah, I saw the Debian bug report. Unfortunately there is no easy
solution to this except for rejecting the use of large secret keys.
The problem here is that the big number library needs to allocate from
a limited secure memory region (32 KiB by default) and terminates on
allocation failure. I know that this is sub-optimal, but we have been doing
this for 19 years now. Checking for an error after each low-level big
number operation would make the code unreadable and would introduce
bugs. Ideas on what to do:
- On secure memory allocation failure, call the out-of-memory handler, which may then free other memory (or purchase new memory). This can be done in the application (a sketch of such a handler follows this list).
- On secure memory allocation failure, allocate a new block of secure memory and allocate from that one. There are two disadvantages: a) It is unlikely that the new block can be mlock'd. b) A free will be a little bit slower because it needs to check the list of secure memory blocks and not just one address range. The address range check is needed so that we can figure out whether the freed address is in the secmem range and needs to be zeroed out. This requires a new Libgcrypt version, though no ABI change.
- We have a ./configure option --enable-large-secmem which sets 64k instead of 32k aside for the secmem. This is currently only used in gpg to enable gpg's --enable-large-rsa option. Given that in 2.1 we use the gpg-agent for the secret key operations we should have the same options in gpg-agent. However, it is only a kludge, but one we once agreed upon to silence some pretty vocal experts on key size.
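For the first idea, an application-side hook already exists in Libgcrypt's public API: gcry_set_outofcore_handler. A minimal sketch of registering such a handler follows; whether it is invoked for secure-memory failures is exactly what the first option proposes, and the handler signature should be checked against the gcrypt.h shipped with your version.

  #include <gcrypt.h>
  #include <stdio.h>

  /* Hypothetical application hook: it may free other memory and return
     nonzero to ask Libgcrypt to retry the allocation; returning 0 keeps
     the current terminate-on-failure behaviour. */
  static int
  my_no_mem_handler (void *opaque, size_t requested, unsigned int flags)
  {
    (void) opaque;
    fprintf (stderr, "allocation of %zu bytes failed (flags 0x%x)\n",
             requested, flags);
    /* e.g. flush an application-level cache here, then return 1 to retry */
    return 0;
  }

  int main (void)
  {
    gcry_check_version (NULL);                     /* initialize Libgcrypt */
    gcry_set_outofcore_handler (my_no_mem_handler, NULL);
    gcry_control (GCRYCTL_INIT_SECMEM, 32768, 0);  /* default-sized pool   */
    gcry_control (GCRYCTL_INITIALIZATION_FINISHED, 0);
    return 0;
  }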
fwiw, i'm seeing this too, over at https://bugs.debian.org/846953 , for a user
with an insanely large (10240-bit) RSA key when it is locked with a passphrase.
I'm attaching such an example secret key (with passphrase "abc123"), and you can
trigger the crash with:
gpg --batch --yes --import test-hugekey.key
echo test | gpg -r 861A97D02D4EE690A125DCC156CC9789743D4A89 --encrypt --armor --trust-model=always --batch --yes --output data.gpg
gpg --decrypt data.gpg
While i think it's fair to say that we need to have some limits on the sizes of
keys we can handle, gpg-agent should not crash when asked to deal with
extra-large keys; it should fail gracefully and return a sensible error code.