Gpgme-pthread keylist not thread safe
Closed, ResolvedPublic

Description

I'm currently in the process of fixing kleopatra's unit test suite. There is one
failing test that randomly fails with assertions in gpgme while running many
threaded keylistings and verify jobs.

To rule out that kleopatra or gpgme do something unsupported in their jobs I've
wrote a test for plain gpgme that duplicated the behavior.

First I just started 500 verify jobs this worked. So I added the mixed keylist /
verify mode that the kleopatra test used and this started to trigger bugs.

With a modified version that only does a keylist I can still trigger this but It
felt like more rarely and I never saw the io_select with that test. So I've
added both here.

The tests are quite aggressive with 500 threads I'm running them on an Intel
Core I7 2600k and they take quite a while. (Be sure to run them with nice)
I don't propose to add them to the regression test suite for the same reason ;-)

Here is an example of the errors I saw (running the tests ~10 times)
The most regular assertion is:

t-thread-keylist: wait.c:204: _gpgme_run_io_cb: Assertion `nr <= 1' failed.

But you can also get:
t-thread-keylist-verify: posix-io.c:563: _gpgme_io_select: Assertion
`!((((&writefds)->fds_bits)[((fds[i].fd) / (8 * (int) sizeof (fd_mask)))] &
((
fd_mask) 1 << ((fds[i].fd) % (8 * (int) sizeof (__fd_mask))))) != 0)' failed.

And segfaults in both tests.
Segmentation fault nice ./t-thread-keylist
Segmentation fault nice ./t-thread-keylist-verify

The keylist test also hang twice until I kill it after some minutes.

Details

Version
master
aheinecke set Version to master.Sep 8 2015, 7:43 PM
aheinecke added projects: kleopatra, Bug Report, gpgme.
aheinecke added a subscriber: werner.
werner added a comment.Sep 9 2015, 9:25 AM

Needs to be run under GPMGE_DEBUG but I bet that this well defeat the bug.

I can still see the asserts and segfaults when running the tests in gdb or with
logging enabled. So it's not a heisenbug ;-) (Although I was never able to get a
backtrace from kleopatras unit test. That test passed when running in gdb)

Do you want me to provide you with logs?

Here is an example segfault:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe50f01700 (LWP 25135)]
0x00007ffff7babfe3 in _gpgme_wait_on_condition (ctx=ctx@entry=0x7fff6c0070e0,
cond=0x6a8844002920, cond@entry=0x7fff6c007f20,

op_err_p=op_err_p@entry=0x0) at wait-private.c:155

155 if (cond && *cond)
(gdb) bt
#0 0x00007ffff7babfe3 in _gpgme_wait_on_condition
(ctx=ctx@entry=0x7fff6c0070e0, cond=0x6a8844002920, cond@entry=0x7fff6c007f20,

op_err_p=op_err_p@entry=0x0) at wait-private.c:155

#1 0x00007ffff7bb3236 in gpgme_op_keylist_next (ctx=0x7fff6c0070e0,
r_key=r_key@entry=0x7ffe50f00ef8) at keylist.c:996
#2 0x00000000004010de in start_keylist (arg=<optimized out>) at
t-thread-keylist-verify.c:62
#3 0x00007ffff798c0a4 in start_thread (arg=0x7ffe50f01700) at pthread_create.c:309
#4 0x00007ffff76c104d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Attached is the corresponding gpgme-log (issue-2092-gpgme-log-1.txt)

I've searched for the context in question and it turns out the last 1000 lines
are not nearly enough to follow it.

So here is the full log of that run.

I've just disabled the unit tests in Kleopatra that expose this bug.
The tests should be renabled depending on a version check for a version of gpgme
where the problems are fixed.

Still in issue after we figured out that select could have been the reason for
this? The explict check for a valid FD number was added in June so before that
a segv could have been the outcome.

I don't think there is an issue anymore.

aheinecke closed this task as Resolved.Nov 11 2016, 1:23 PM

The explicit check for a valid FD (in select) I mentioned above is commit 8173c4f1f8a145c4b1d454f6f05e26950e23d675