Page MenuHome GnuPG

(Near) endless loop in gcry_mpi_powm
Closed, ResolvedPublic

Description

I'm encountering a problem with gcry_mpi_powm() as used by libssh2. On a 32 bit
x86 Linux system with both libgcrypt-1.5.4 and libgcrypt-1.6.4 I discovered that
this function occasionally doesn't return (or, at least it doesn't after a day
of waiting) during the calculations performed on an ssh connection. I tracked
the problem to a loop within _gcry_mpi_powm that is initialized < 0 and
therefore never hits the == 0 condition to exit without the counter variable
underflowing first.

The loop is at line 661 of mpi/mpi-pow.c:

  for (j += W - c0; j; j--)

On one example of this condition being hit, on entry j=-139312672, W=3 and c0=0.
I tried creating a standalone program to reproduce this by hard-coding a set of
four arguments to gcry_mpi_powm that caused this condition to be hit, but didn't
succeed in getting it to end up in this loop. It only takes a few dozen ssh
connections to get it into this state, so it's easy to reproduce. How can I
track this down further?

Details

Version
1.6.4

Related Objects

Event Timeline

danf set Version to 1.6.4.
danf added a subscriber: danf.

Thanks for writing the report - that is better than having your report ticked
only in my mail folder.

gniibe added a subscriber: gniibe.

In what condition can J can be initialized < 0?
Do you have some idea?

When you get negative value for J on entry of the for loop, you can examine four
arguments to _gcry_mpi_powm. And then, you can write standalone program to
emulate it. Debugger or printf.

I tried doing exactly that, but it didn't reproduce the problem. I assumed that
either the internal representation of the input values set up in my test program
with gcry_mpi_scan() and gcry_mpi_set_ui() subtly differed from the ones
encountered in production, or there was some code path that uses an uninitialized
variable, but I don't know if either theory could be the case.

Thank you for your experiment.
I suspect other cause(s). In the code itself, there is no possibility J can be
negative. However, it could be possible, in practice, when the stack is
corrupted because of wrong allocation of memory or by other threads.

The code that's failing is single threaded and passes valgrind, address sanitizer
and undefined sanitizer tests. I can't think of how the stack could be corrupted
from outside the routine, except perhaps that a signal handler is involved. If
you're confident that j could never be negative in the normal case, then I'll try
tracking down how that could happen.

I can't show you math proof at hand, but I'm confident enough J can't be negative.
This implementation of mpi_powm was introduced in October 2013.

libgcrypt 1.5.3 was the one with old implementation.

If it is difficult for you to minimize your test case, as long as it is
reproducible, please let us have your test case. We'll try to figure out the bug.

This recipe generally causes a hang within no more than 5 minutes of running
through the test suite on my system. libcrypt is assumed to be installed in the
normal location, or set PKG_CONFIG_PATH appropriately. Run "src/curl --version"
to make sure it says libssh2/1.7.0_DEV to prove it's picked up the right libssh2
and "ldd src/.libs/lt-curl" to make sure it's using gcrypt.

git clone https://github.com/libssh2/libssh2.git
cd libssh2
./buildconf
./configure --prefix=/tmp/install --with-libgcrypt
make -j6 && make install
cd ..
curl -O https://curl.haxx.se/download/curl-7.47.1.tar.lzma
tar xaf curl-7.47.1.tar.lzma
cd curl-7.47.1
PKG_CONFIG_PATH=/tmp/install/lib/pkgconfig ./configure --enable-debug
--without-ssl --with-libssh2
make -j6
while true; do make -j6 test TEST_Q='-a -p -n SFTP SCP'; done

A couple more point: openssh must be installed on the system so the test suite
will work. Also, the problem seems to have started in commit fc4a969a in libssh2.

Thank you very much. It is reproducible for me, too. I located the issue.
I think that it is reproducible for any libgcrypt (even < 1.5.3).
With the patch attached, problem seems to be gone.
Problem is that the DH exchange introduced in the commit fc4a969a in libssh2,
the EXPO argument is coming without normalization, so, count_leading_zeros
results undefined value on IA-32.
In libssh2, it's random bytes, so, it can be all 0.

By "all zero", I mean that a limb can be with bits of all zeros, so that e =
ep[i] can be zero.

I tried the patch and the problem hasn't shown up for me after an hour of
continuously running the test suite, so it looks fixed! Thanks for the fast
turnaround on this tricky problem.

(The patch has been applied to 1.6 and master)

werner removed a project: Testing.