Intermittent crashes in gpgscm on s390x
Closed, ResolvedPublic

Description

Over in https://bugs.debian.org/858400, i've reported a problem where gpgscm
from 2.1.19 (and the current git head) crashes intermittently on s390x. I've
done some further digging based on the upstream source on zelenka.debian.org, an
s390x debian "porterbox".

I would have included details here but roundup says "too many links", so you'll
have to get the details from the debian BTS, i guess.

dkg set Version to 2.1.19.
dkg set External Link to https://bugs.debian.org/858400.
dkg added a subscriber: dkg.
dkg added a comment.Mar 22 2017, 10:54 PM

Roundup won't let me include the details, but i will say that from a git bisect,
i discovered that the first commit that has this behavior is
49e2ae65e892f93be7f87cfaae3392b50a99e4b1 ("gpgscm: Use a compact vector
representation.")

The crashes that happen are segfaults.

werner added a subscriber: werner.Mar 25 2017, 8:47 AM

Can you rebuild using -O0 -g and try to get a back trace again. That might be
helpful.

werner assigned this task to justus.Mar 25 2017, 8:47 AM
werner added a subscriber: justus.

I have looked into this. I installed Debian on an s390 emulator (hercules), but
have been unable to reproduce the problem there, maybe due to the emulation (it
is quite slow on my system, and the gpgscm interpreter seems especially slow,
maybe because of the challenge of doing branch prediction on interpreters).

Your stack trace suggests a memory corruption early during the initialization
("init.scm", the standard library, is being loaded), we see an error being
generated due to an unbound variable (i.e. the environment hash table is
corrupted / does not perform as expected). Then we see a segfault while the
history buffer is flattened into a list for the error message (i.e. hints at a
corruption).

Unfortunately, memory corruption bugs are very hard to detect in gpgscm due to
its use of a custom memory allocator. The allocator allocates large segments
using malloc and hands out cells from that pool as necessary. However, memory
is never freed, so tools like valgrind can not be used to detect use-after-free,
or even most out-of-bounds accesses.

I have been working on the low-level allocator last week trying to make it more
debuggable and memory errors more detectable, e.g. by moving parts of the
interpreter into readonly sections.

As Werner said, a stack trace with less optimizations would be helpful. Also,
is the problem always the same if it happens? If so, it would be interesting to
know what kind of variable is unbound (for that, inspect the 'a' parameter of
'_Error_1' [I'm attaching a pretty-printer for gdb, with that, do 'print a']).

Access to the porter box would be helpful as well.

dkg added a comment.Mar 28 2017, 2:34 PM

I've now pulled from the current master head
(caf00915532e6e8e509738962964edcd14fb0654), rebuilt on zelenka with -O0 -g, and
triggered the error again, causing a core file to be dumped.

I copied gpgscm-gdb.py into tests/gpgscm/ , added it to add-auto-load-safe-path
in ~/.gdbinit, and then ran "gdb -c tests/gpgscm/core tests/gpgscm/gpgscm" and
tried to print a, as requested. here's what i got:

0 (sid_s390x-dchroot)dkg@zelenka:~/src/gnupg2/gnupg2/build$ echo
add-auto-load-safe-path
/home/dkg/src/gnupg2/gnupg2/build/tests/gpgscm/gpgscm-gdb.py > /home/dkg/.gdbinit
0 (sid_s390x-dchroot)dkg@zelenka:~/src/gnupg2/gnupg2/build$ gdb -c
tests/gpgscm/core ./tests/gpgscm/gpgscm

GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later < GPL license >
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "s390x-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
< GDB Bugs >.
Find the GDB manual and other documentation resources online at:
< GDB Documentation >.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./tests/gpgscm/gpgscm...done.
[New LWP 7145]
Core was generated by `./gpgscm ../../../tests/gpgscm/t-child.scm'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000002aae4ecf748 in is_vector (p=0x4634508) at
../../../tests/gpgscm/scheme.c:220
220 INTERFACE INLINE int is_vector(pointer p) { return (type(p)==T_VECTOR); }
(gdb) bt
#0 0x000002aae4ecf748 in is_vector (p=0x4634508) at
../../../tests/gpgscm/scheme.c:220
#1 0x000002aae4ed3470 in vector_elem (vec=0x4634508, ielem=7) at
../../../tests/gpgscm/scheme.c:1349
#2 0x000002aae4ed975e in tailstack_flatten (sc=0x2ab046296f0,
tailstack=0x4634508, i=8, n=7, acc=0x2ab04629838) at
../../../tests/gpgscm/scheme.c:3117
#3 0x000002aae4ed99d4 in callstack_flatten (sc=0x2ab046296f0, i=8, n=7,
acc=0x2ab04629838) at ../../../tests/gpgscm/scheme.c:3155
#4 0x000002aae4ed9af0 in history_flatten (sc=0x2ab046296f0) at
../../../tests/gpgscm/scheme.c:3173
#5 0x000002aae4ed8488 in _Error_1 (sc=0x2ab046296f0, s=0x2aae4efe634 "eval:
unbound variable:", a=0x2ab0462bdd8) at ../../../tests/gpgscm/scheme.c:2777
#6 0x000002aae4eda162 in opexe_0 (sc=0x2ab046296f0, op=OP_EVAL) at
../../../tests/gpgscm/scheme.c:3298
#7 0x000002aae4ee3ef0 in Eval_Cycle (sc=0x2ab046296f0, op=OP_T0LVL) at
../../../tests/gpgscm/scheme.c:5358
#8 0x000002aae4ee5384 in scheme_load_named_file (sc=0x2ab046296f0,
fin=0x2ab04684f90, filename=0x2ab04684d80 "../../../tests/gpgscm/init.scm") at
../../../tests/gpgscm/scheme.c:5748
#9 0x000002aae4ec1ec6 in load (sc=0x2ab046296f0, file_name=0x2aae4efc7d4
"init.scm", lookup_in_cwd=0, lookup_in_path=1) at ../../../tests/gpgscm/main.c:180
#10 0x000002aae4ec22cc in main (argc=0, argv=0x3ffffe44e48) at
../../../tests/gpgscm/main.c:266
(gdb) up 5
#5 0x000002aae4ed8488 in _Error_1 (sc=0x2ab046296f0, s=0x2aae4efe634 "eval:
unbound variable:", a=0x2ab0462bdd8) at ../../../tests/gpgscm/scheme.c:2777
2777 history = history_flatten(sc);
(gdb) print a
$1 = (pointer) 0x2ab0462bdd8
(gdb) print *a
$2 = define-macro
(gdb)

maybe i'm doing something wrong? i'll ask and see whether i can give out an
account on the porterbox for you, justus.

dkg removed justus as the assignee of this task.Mar 28 2017, 2:34 PM

Yes, print *a was correct. Could you please do

  print *sc->load_stack[sc->file_i]->curr_line

there?

justus claimed this task.Mar 28 2017, 3:05 PM
dkg added a comment.Apr 3 2017, 11:39 PM

Sure:

(gdb) print *sc->load_stack[sc->file_i]->curr_line                             
$3 = 784
(gdb)
dkg updated the task description. (Show Details)Apr 6 2017, 6:14 AM
dkg changed Version from 2.1.19 to 2.1.19, 2.1.20.
gniibe added a subscriber: gniibe.Apr 6 2017, 3:45 PM

IIUC, cells are used for a place for vector elements.
I'm afraid what happens for memory space not used for vector elements.

Suppose we have vector with two elements. It is represented by two cells.

[ _flag = VECTOR, car = vector-length, cdr= vector-element-0 ]
[ _flag = vector-element-1, car = not-used, cdr = not-used ]

Will the "not-used" places be initialized? How will not-used places be marked and be reclaimed?

justus added a comment.Apr 6 2017, 3:48 PM

@gniibe good catch! I'll fix that and we'll see if that improves things.

justus added a comment.Apr 6 2017, 4:41 PM

Fwiw b83903f59ec5d49ac579f263da70ebc8dc3645b5.

I went over the more interesting places where vector elements are accessed (mark(), gc(), finalize_cell()) and they all seemed only to touch vector_length(v) items, so it should have been fine before.

dkg added a comment.Apr 6 2017, 9:32 PM

I just merged the current git head over on zelenka, which includes b83903f59ec5d49ac579f263da70ebc8dc3645b5, and managed to still get the same segfaults.

(gdb) bt
#0  0x000002aae77b6748 in is_vector (p=0x245ed508) at ../../../tests/gpgscm/scheme.c:220
#1  0x000002aae77ba680 in vector_elem (vec=0x245ed508, ielem=7) at ../../../tests/gpgscm/scheme.c:1407
#2  0x000002aae77c0956 in tailstack_flatten (sc=0x2ab245e26f0, tailstack=0x245ed508, i=8, n=7, acc=0x2ab245e2838) at ../../../tests/gpgscm/scheme.c:3175
#3  0x000002aae77c0bcc in callstack_flatten (sc=0x2ab245e26f0, i=8, n=7, acc=0x2ab245e2838) at ../../../tests/gpgscm/scheme.c:3213
#4  0x000002aae77c0ce8 in history_flatten (sc=0x2ab245e26f0) at ../../../tests/gpgscm/scheme.c:3231
#5  0x000002aae77bf680 in _Error_1 (sc=0x2ab245e26f0, s=0x2aae77e58bc "eval: unbound variable:", a=0x2ab245e4dd8) at ../../../tests/gpgscm/scheme.c:2835
#6  0x000002aae77c135a in opexe_0 (sc=0x2ab245e26f0, op=OP_EVAL) at ../../../tests/gpgscm/scheme.c:3356
#7  0x000002aae77cb048 in Eval_Cycle (sc=0x2ab245e26f0, op=OP_T0LVL) at ../../../tests/gpgscm/scheme.c:5402
#8  0x000002aae77cc4dc in scheme_load_named_file (sc=0x2ab245e26f0, fin=0x2ab2463df90, filename=0x2ab2463dd80 "../../../tests/gpgscm/init.scm") at ../../../tests/gpgscm/scheme.c:5792
#9  0x000002aae77a8ec6 in load (sc=0x2ab245e26f0, file_name=0x2aae77e392c "init.scm", lookup_in_cwd=0, lookup_in_path=1) at ../../../tests/gpgscm/main.c:180
#10 0x000002aae77a92cc in main (argc=0, argv=0x3ffff8f8668) at ../../../tests/gpgscm/main.c:266
(gdb) up 5
#5  0x000002aae77bf680 in _Error_1 (sc=0x2ab245e26f0, s=0x2aae77e58bc "eval: unbound variable:", a=0x2ab245e4dd8) at ../../../tests/gpgscm/scheme.c:2835
2835         history = history_flatten(sc);
(gdb) print *a
$1 = define-macro
(gdb)   print *sc->load_stack[sc->file_i]->curr_line
$2 = 784

If s390x is big-endian, we need padding at the start of the cell structure. So that the _flag can be compatible to the vector element.
I'll see on the porterbox myself, too.

gniibe added a comment.Apr 7 2017, 1:08 AM

I confirmed that it's 64-bit big-endian.
I wrote a patch for testing. D421: padding is needed for 64-bit big endian