re-evaluate default randomness choices during key generation on GNU/Linux platforms
Closed, ResolvedPublic

Description

systems with a modern Linux kernel (> 3.17) really should use the getrandom interface rather than reading from /dev/random or /dev/urandom directly. See random(7) for more details about the different options avalable today.

We don't want to generate keys before the kernel's PRNG has been adequately seeded. But once it's seeded, it's not clear that we have any security gain from waiting for more "external entropy", whatever that source is. Indeed, djb has argued that an attacker capable of controlling the inputs to your seeding algorithm might gain an advantage in a continuous-seeding scenario that they wouldn't get in a seed-once-and-done approach.

/dev/urandom is clearly not acceptable because it doesn't block at all when the PRNG hasn't been seeded. But /dev/random is also problematic because of the severe delays it incurs, especially on systems with few sources of external entropy. These delays cause usability problems, and some of those usability problems turn into security problems.

If the kernel says that the PRNG is seeded, then GnuPG should just use it. as random(7) says:

The cryptographic algorithms used  for  the  urandom source
are quite conservative, and so should be sufficient for all purposes.

if the kernel is going to lie about seeding, then it could also lie about "available entropy" in /dev/random as well. So it doesn't make sense for GnuPG to treat the kernel as adversarial in making this decision.

Looking at the table in random(7) it seems clear to me that what we want to just invoke getrandom() with no arguments. This blocks until the kernel's PRNG has been adequately seeded, but once seeded it doesn't block, while still pulling from an unbreakably-strong PRNG. this is the best-of-both-worlds situation that we want.

Changing the GnuPG long-term (and short-term) key generation techniques to use this approach might require coordination with gcrypt. gcrypt's gcry_random_level currently has GCRY_WEAK_RANDOM and GCRY_STRONG_RANDOM and GCRY_VERY_STRONG_RANDOM, which doesn't represent the nuance described above.

One approach might be to just have gcrypt on Linux treat all values of gcry_random_level the same, and use getrandom() for all of them.

Related Objects

dkg created this task.Apr 11 2018, 8:01 PM
werner added a subscriber: werner.Apr 11 2018, 8:55 PM

To clarify: We already use the getrandom system call if it is available. To map /dev/random to /dev/urandom you can create a file /etc/gcrypt/random.conf with this line:

only-urandom

The other currently implemented option for that file is "disable-jent". See the manual.
I use it for a long time now and it is really helpful. Do this only on Linux with the getrandom system call or if you are sure that Libgcrypt is not used in the early boot stage.

I hesitate to change Libgcrypt's RNG code again because it has just been security evaluated as part of the Gpg4vs-nfd project.

dkg added a comment.EditedApr 13 2018, 9:05 PM

Werner wrote:

we already use the getrandom system call if it is available

I think this is only true for GCRY_STRONG_RANDOM or GCRY_WEAK_RANDOM. But GnuPG uses GCRY_VERY_STRONG_RANDOM for secret key generation, right?

I'm looking at gcrypt's random/rndlinux.c -- i see:

#if defined(__linux__) && defined(HAVE_SYSCALL) && defined(__NR_getrandom)
      if (fd == fd_urandom)
        {

wrapping any attempt to invoke the getrandom syscall.

i'm assuming that using GCRY_VERY_STRONG_RANDOM will mean that fd == fdrandom, which means that the fd == fd_urandom test will fail. This means that for secret key generation, GnuPG is back to blocking unnecessarily, which seems inappropriate and is likely to encourage bad behavior.

Setting only-urandom in /etc/gcrypt/random.conf seems inappropriate to me because it's documented as:

Always use the non-blocking /dev/urandom or the respective
system call instead of the blocking /dev/random.  If Libgcrypt
is used early in the boot process of the system, this option
should only be used if the system also supports the getrandom
system call.

But at runtime, the user of a pre-compiled gcrypt may not reliably know whether their copy of gcrypt was compiled on a system that supports the getrandom syscall, even if the current kernel does support it. (or, maybe the current kernel doesn't support that syscall, even though gcrypt was built against it!) Either way, it seems unsafe for the user who wants to ensure that the kernel's entropy pool has been adequately seeded to unilaterally set only-urandom.

The subsystem that *does* know whether gcrypt was compiled with support for getrandom, and whether the current kernel can support getrandom is of course gcrypt itself. So gcrypt should be able to decide to use getrandom() if it exists, regardless of whether the user passed GCRY_VERY_STRONG_RANDOM or not.

So, i think i'm proposing this change (or something comparable) to gcrypt:

diff --git a/random/rndlinux.c b/random/rndlinux.c
index 1bb7c761..509b0b19 100644
--- a/random/rndlinux.c
+++ b/random/rndlinux.c
@@ -220,17 +220,16 @@ _gcry_rndlinux_gather_random (void (*add)(const void*, size_t,
       struct timeval tv;
       int rc;
 
-      /* If we have a modern Linux kernel and we want to read from the
-       * the non-blocking /dev/urandom, we first try to use the new
+      /* If we have a modern Linux kernel, we first try to use the new
        * getrandom syscall.  That call guarantees that the kernel's
        * RNG has been properly seeded before returning any data.  This
        * is different from /dev/urandom which may, due to its
        * non-blocking semantics, return data even if the kernel has
-       * not been properly seeded.  Unfortunately we need to use a
+       * not been properly seeded.  And it differs from /dev/random by never
+       * blocking once the kernel is seeded. Unfortunately we need to use a
        * syscall and not a new device and thus we are not able to use
        * select(2) to have a timeout. */
 #if defined(__linux__) && defined(HAVE_SYSCALL) && defined(__NR_getrandom)
-      if (fd == fd_urandom)
         {
           long ret;
           size_t nbytes;
@@ -247,7 +246,7 @@ _gcry_rndlinux_gather_random (void (*add)(const void*, size_t,
             }
           while (ret == -1 && errno == EINTR);
           if (ret == -1 && errno == ENOSYS)
-            ; /* The syscall is not supported - fallback to /dev/urandom.  */
+            ; /* The syscall is not supported - fallback to pulling from fd.  */
           else
             { /* The syscall is supported.  Some sanity checks.  */
               if (ret == -1)

I don't know what the Gpg4vs-nfd project is, but i do wonder whether its security evaluation considered the concern about unnecessary blocking and how that may lead users down the path of doing potentially-unsafe things like setting only-urandom or manually seeding their kernel's "entropy pool" with data of dubious (and potentially attacker-controlled) provenance just to get past the lengthy delays during key generation.

I'd be happy to talk to the security evaluators about this if that would be useful.

@dkg : Can’t this be solved at the distribution level? I assume the packager/maintainer for Libgcrypt on a given distribution should know whether the getrandom syscall is available on said distribution, so he could install a /etc/gcrypt/random.conf file with the only-urandom option.

dkg added a comment.EditedApr 14 2018, 6:42 PM

@gouttegd : setting only-urandom at the distro level problematic due to two factors:

  • the distro usually doesn't guarantee the choice of running kernel (it's possible to run debian atop a vanilla or Android or hardware-vendor-supplied kernel, for example)
  • if for whatever reason, getrandom() isn't available at runtime, only-urandom will happily pull from an uninitialized pool, which is likely to lead to catastrophic failure.

This is an objectively more-dangerous choice for a distro maintainer than just applying the patch i propose above, which actively tries to make the right decision at runtime, rather than at packaging time.

werner triaged this task as Normal priority.
dkg added a comment.Jun 9 2018, 12:03 AM

I've heard no critique of the logic above. could we get this fix landed? it is concretely useful for doing key generation on modern GNU/Linux systems.

dkg added a comment.Jun 19 2018, 4:32 PM

could i get feedback on this ticket? a simple, clean patch is available, and i don't understand what is blocking it.

dkg added a comment.Jul 2 2018, 4:47 PM

ping again…

anarcat added a subscriber: anarcat.Jul 2 2018, 5:24 PM

Looking at the table in random(7) it seems clear to me that what we want to just invoke getrandom() with no arguments. This blocks until the kernel's PRNG has been adequately seeded, but once seeded it doesn't block, while still pulling from an unbreakably-strong PRNG. this is the best-of-both-worlds situation that we want.

Changing the GnuPG long-term (and short-term) key generation techniques to use this approach might require coordination with gcrypt. gcrypt's gcry_random_level currently has GCRY_WEAK_RANDOM and GCRY_STRONG_RANDOM and GCRY_VERY_STRONG_RANDOM, which doesn't represent the nuance described above.

One approach might be to just have gcrypt on Linux treat all values of gcry_random_level the same, and use getrandom() for all of them.

This seems very reasonable to me. Waiting for entropy is a usability limitation that makes it needlessly difficult and slow to generate new keys. We do not want to discourage users from doing that, especially since that can be useful in test suites, for example. And while I know there are ways of working around that problem now, they might not be well known to everyone. At least when I wrote the Monkeysign test suite, those tools were certainly not available and it made me use static keys instead of key generation in the test suite. This means the test suite breaks often, when keys expire for example. A nightmare.

I also have concerns about the use of /dev/random for key generation. Others have spoken about this better than me, so I won't expand on that.

I have briefly reviewed @dkg's patch and it seems reasonable to my novice gcrypt eyes.

Yar.

User input, anything to solve the lack of entropy on servers would be *great*. We have a bunch of buildbot workers we would *love* to have sign their artifacts... however we end up (unsuccessfully) doing stupid things like this to try and drive up entropy as a non-root user:

https://github.com/haiku/infrastructure/commit/7f4e58079eb5f134e674076c41da6ce5adc465e2

dkg added a comment.Jul 22 2018, 5:22 AM

Here is an example of the kinds of UI/UX mystery that users face while this decision is unresolved:

(this is from the enigmail setup wizard)

there is no reason on a modern GNU/linux system where the kernel's PRNG is already seeded to see this.

dkg added a comment.Jul 22 2018, 5:28 AM

Here is another example of users doing sketchy things to try to "fix" this process:

https://admin.hostpoint.ch/pipermail/enigmail-users_enigmail.net/2018-April/004896.html

the sketchy thing tried is:

sudo rngd -r /dev/urandom

If gcrypt's use of randomness was corrected on the GNU/Linux platform, the user would not have done this dubious thing.

dkg added a comment.EditedJul 22 2018, 7:54 AM

I've now run the proposed patch on a GNU/Linux system where the kernel's RNG is initialized but /proc/sys/kernel/random/entropy_avail shows numbers below 100, and i can confirm that 3072-bit RSA key generation takes roughly 0.8 seconds: 20 sequential default --quick-keygen operations (each creating two secret keys) took ~32s.

dkg added a comment.Aug 2 2018, 7:34 PM

This bug report has been around for several months now. it has a simple patch, a clear explanation, a report of running code, and examples of problems it solves.

I am trying to understand why it is not merged. What more is needed? Do we need additional review? if so, from who?

I think that the ultimate decision here lies with Werner. Additional review.
I think the biggest obstacle is that we don't want to change the random gathering code if it can be avoided and that the random code has been thoroughly reviewed / tested and is currently considered secure.

Werner is on vacation until the end of august so please don't be even more annoyed if you don't get a response to your last question until then. :-)

dkg added a comment.Aug 23 2018, 5:59 PM

@aheinecke thanks for the followup!

i can't tell whether you're saying that @werner is the one who needs to provide additional review, or whether we need someone from outside the project to review it. if it's the latter, can you recommend who you think would be a good reviewer?

Well, Werner is just back so he can say more.
An excellent reviewer was Stephan Müller from atsec. He is also involved a bit afaik in the kernel random development.

But what I meant to say was:

GnuPG's random code is considered secure. Universities have looked at it. Governments have looked at it. Especially the German government has looked at it (through contractors like atsec and others).
Any change there is not like "Ok this makes sense technically." Or "It makes sense from a User Experience standpoint" it has to be weight against man months (years) that were spent documenting and "evalutating" the current behavior.

Technically I agree with you. But any change in the random code of GnuPG is (as it should be) hugely expensive. This is one of the most sensitive parts of one of the most sensitive security solutions there is. And "If it ain't broke don't fix it" applies a bit.

The question here is: Is the current implementation from a User Experience standpoint "broke enough" to change it? For me as a MUA developer I would say yes. Definitely!
For me as a developer of the only free software solution that might be allowed in Germany to handle restricted government documents I would say no. Probably not. At least not at this time.

/dev/random, RDRAND, etc involves a lot of political arguments and thus it is not easy to decide what to do. What you are calling for is a linux kernel specific code path (note that rndlinux is used by most Unices) and won't be helpful for other OSes. I am of course willing to do add specific for for a few widespread OSes (and in any case for Debian). It is a major change and thus does not belong into 1.8 - I am fine with master which Debian might want to backport.

@kallisti5: For you server you can add only_urandom to random.conf when changing to a multiuser runlevel and remove it early at startup and termination.

dkg added a comment.Sep 5 2018, 4:46 PM

@werner -- yes, i am asking for a change that is specific to the way that gcrypt interacts with the Linux kernel. The minor patch i've proposed only affects a codeblock within #if defined(__linux__), so i don't believe it would have an effect on other Unices. I hope that people working with other kernels will propose any necessary fixes for them.

I'm not convinced by the suggestion of adding only_urandom to /etc/gcrypt/random.conf and removing it at certain times -- it seems like a few different opportunities for failure, in different directions. Doesn't it make more sense to have the code that actually retrieves entropy from the kernel do the right thing as it can detect its circumstances?

I've made the commit rC842ff5f60cfaf6ce3b236a44dadeddf241dbd2c3 on branch dkg/fix-T3894 if you want to merge to master.

dkg added a comment.Sep 5 2018, 4:54 PM

well, i tried to push, anyway, but it looks like playfair is rejecting my pushes:

remote: error: invalid key: hooks.denypush.branch.dkg/fix-T3894
remote: error: invalid key: hooks.denymerge.dkg/fix-T3894

at any rate, the branch in question can be found at on my gitlab mirror, or via:

git remote add dkg https://gitlab.com/dkg/libgcrypt.git
git remote update dkg
git cherry-pick 842ff5f60cfaf6ce3b236a44dadeddf241dbd2c3

Apologies for not having read all comments in this long thread. I was asked to comment on the patch, so here is my comment:

Using getrandom(2) implies that you use a noise source which has an entropy
level of at least 128 bits as outlined in [1]. The PRNG behind getrandom(2) is
constantly being reseeded if entropy is available after the elapse of 60
seconds (per default).

[1] even outlines that on contemporary hardware with a high-resolution timer,
the Linux RNG has much more entropy than the entropy estimator of the Linux
RNG will tell you. I.e. the entropy estimator is conservative which is beyond
what I would call paranoid (I would even call the entropy estimator
meaningless, because it is derived from data which has no connection to the
entropy, but that is a different topic).

Considering the quantitative analysis given in [1], particular chapter 6, I
would see no concern in permanently switching to getrandom(2) and thus
removing support for /dev/random under Linux.

The fallback in case getrandom(2) preserves the old behavior of using either /
dev/urandom or /dev/random depending on the request. Thus, the fallback is
also appropriate.

[1] https://www.bsi.bund.de/SharedDocs/Downloads/EN/BSI/Publications/Studies/
LinuxRNG/LinuxRNG_EN.html

dkg added a comment.Sep 7 2018, 9:49 PM

@aheinecke -- @smueller_chronox.de (author of the comment above) is Stephan Müller from atsec. Glad to see he seems ok with the proposal :)

His referenced study of the Linux RNG has an overwhelming amount of detail! :)

so, what else needs doing to decide whether this patch gets merged on master?

werner claimed this task.Sep 8 2018, 11:13 AM
werner raised the priority of this task from Normal to High.

Thanks for your comments, Stephan.

dkg added a comment.Oct 9 2018, 6:39 PM

What are the next steps here? i confess i'm a little tired of doing regular checkins on this issue, and i'm sure other people are tired of me nagging. What can we do to move this along?

georg added a subscriber: georg.Oct 12 2018, 6:51 PM

The next step is to release libgcrypt 1.8.4 :-)

werner closed this task as Resolved.Oct 26 2018, 1:54 PM

Fixed in master and 1.8.
@dkg: Thanks for the comments and your patience to convince me.