the calibrate_get_time() function depends on a system that has a non-tickless kernel
Closed, ResolvedPublic

Description

the calibrate_get_time() function used by do_encryption() -> get_standard_s2k_count() among others currently ends up using 100% cpu in what seems as an infinite loop on a system that has a tickless kernel configured like:

CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y

see these two bugs that describe the problem as seen by users:
https://bugs.gentoo.org/show_bug.cgi?id=595808
https://bugs.gentoo.org/show_bug.cgi?id=605806

from the kernel docs:

CONFIG_NO_HZ_IDLE:

This option enables a tickless idle system: timer interrupts
will only trigger on an as-needed basis when the system is idle.
This is usually interesting for energy saving.
Most of the time you want to say Y here.

the root problem is that times().tms_utime always returns 0 and thus the calibration keeps incrementing the number of itterations sent to the gcry_kdf_derive() function.

tested on a gentoo linux system with gnupg-2.1.21, 4.12.0 vanilla kernel

solution:

please stop using times() and switch to something like clock_gettime(CLOCK_MONOTONIC_RAW, struct)

switching to CONFIG_HZ_PERIODIC=y solves the 100% cpu loop, but it means that gnupg depends on a kernel setup that is not the recommended one.

peter_tux renamed this task from the calibrate_get_time() function depends on a system that has a non-tickless system to the calibrate_get_time() function depends on a system that has a non-tickless kernel.
peter_tux updated the task description. (Show Details)
werner added a subscriber: werner.Jul 12 2017, 7:00 PM

I don't yet understand your problem. What has this idle configuration to do with the way how we calibrate the loop. After all we are not idle while calibrating but are heavily employing the CPU. Can you please elaborate and consider that times(2) is a POSIX API and clock ticks are an essential POSIX feature.

And sure we want to utilize the CPU at 100% - that is the whole point of using a KDF.

werner triaged this task as Normal priority.Jul 12 2017, 7:00 PM

result 1 (working):

S2K calibration: 65536 -> 0ms
S2K calibration: 131072 -> 0ms
S2K calibration: 262144 -> 0ms
S2K calibration: 524288 -> 0ms
S2K calibration: 1048576 -> 0ms
S2K calibration: 2097152 -> 10ms
S2K calibration: 4194304 -> 20ms
S2K calibration: 8388608 -> 30ms
S2K calibration: 16777216 -> 70ms
S2K calibration: 33554432 -> 140ms
S2K calibration: 23966720 -> 90ms

the below-100ms number of iterations is found during calibrate_s2k_count()

result 2 (non-working) - on the same laptop and binaries with a different kernel config:

$ time gnupg-2.1.21/agent/t-protect | grep S2K
[..]
S2K calibration: 65536 -> 0ms
S2K calibration: 131072 -> 0ms
S2K calibration: 262144 -> 0ms
S2K calibration: 524288 -> 0ms
S2K calibration: 1048576 -> 0ms
S2K calibration: 2097152 -> 0ms
S2K calibration: 4194304 -> 0ms
S2K calibration: 8388608 -> 0ms
S2K calibration: 16777216 -> 0ms
S2K calibration: 33554432 -> 0ms
S2K calibration: 67108864 -> 0ms
S2K calibration: 134217728 -> 0ms
S2K calibration: 268435456 -> 0ms
S2K calibration: 536870912 -> 0ms
S2K calibration: 1073741824 -> 0ms
S2K calibration: 2147483648 -> 0ms
S2K calibration: 4294967296 -> 0ms
S2K calibration: 8589934592 -> 0ms
S2K calibration: 17179869184 -> 0ms
^C

real    2m38.115s
user    0m0.002s
sys     2m37.757s

the below-100ms number of iterations is never to be found during calibrate_s2k_count(), cpu will hang at 100% until unsigned long count overflows (probably never). this happens because calibrate_s2k_count_one() _always_ returns 0 in this case.

I played with multiple kernel configurations and these are my findings:

scenario 1 - works with result 1

CONFIG_HZ_PERIODIC=y
CONFIG_TICK_CPU_ACCOUNTING=y

scenario 2 - works with result 1

CONFIG_NO_HZ_IDLE=y
CONFIG_TICK_CPU_ACCOUNTING=y

scenario 3 - breaks with result 2

CONFIG_NO_HZ_IDLE=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y

scenario 4 - breaks with result 2

CONFIG_NO_HZ_FULL=y
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y (this option becomes mandatory)

so apparently the config options to watch out for are

CONFIG_TICK_CPU_ACCOUNTING and
CONFIG_VIRT_CPU_ACCOUNTING_GEN

if CONFIG_VIRT_CPU_ACCOUNTING(_GEN) gets set either by mistake or as a dependency to CONFIG_NO_HZ_FULL then gpg-agent will end up in 'infinite' loops that eat 100% cpu until the end of time.

regarding the

Can you please elaborate and consider that times(2) is a POSIX API and clock ticks are an essential POSIX feature.

not sure if the clock ticks feature are as set in stone: https://lwn.net/Articles/549580/

I tested in the past CONFIG_NO_HZ_FULL and ended up with CONFIG_VIRT_CPU_ACCOUNTING(_GEN) set. it remained that way even after I switched to CONFIG_NO_HZ_IDLE. that is why I ended up experiencing this gnupg bug since ~2.1.x forward.

I can reproduce the described behavior. I have a 4.11.7 kernel with NO_HZ_IDLE=y and everything works fine, but if I set VIRT_CPU_ACCOUNTING_GEN (all other options remaining unchanged), the agent is stuck in the calibration loop (tested with GnuPG 2.1.21 and current master branch).

I note, however, than VIRT_CPU_ACCOUNTING_GEN is not frankly a "recommended" option (the kernel help says: "For now this is only useful if you are working on the full dynticks subsystem development. If unsure, say N."). NO_HZ_FULL, which automatically activates VIRT_CPU_ACCOUNTING_GEN, is not recommended either ("Say N."). So there is an issue, but not one that would affect users running on a "recommended kernel setup". And to me, it looks more like a kernel issue than a GnuPG issue...

But interestingly, I did test crudely replacing times by clock_gettime(CLOCK_PROCESS_CPUTIME_ID), and I got the normal, expected behavior even under the faulty VIRT_CPU_ACCOUNTING_GEN configuration. I don't know why this options alter the behavior of times, but whatever it does, clock_gettime seems immune to it, so maybe it is worth considering? After all clock_gettime is also a POSIX interface.

@werner : If you're willing to consider it, I can send a proper patch sometimes later in the day.

@gouttegd that is a very nice and fair assessment, thanks for taking your time to look into it.

the affected setup proved to be indeed one that is off the beaten track, but replacing the times() function would be beneficial for the reasons explained above. if there are worries of hidden problems with clock_gettime() maybe you would consider a manual non-default compile-time #ifdef controlled by something like an autoconf AC_ARG_ENABLE(tickless) ?

I'm not happy to admit it but CONFIG_VIRT_CPU_ACCOUNTING(_GEN) ended up enabled on 1000+ production servers of mine by mistake. no other piece of software had any issues with it.

@gouttegd thanks for the offer. I will consider adding a variant using clock_gettime. However clock_gettime is not available on all platforms and thus it needs to be ifdef'd. Fotunately there is npth_clock_gettime which has a builtin fallback to gettimeofday. Given that we require nPth in gpg-agent anyway, this seems to be the easiest way. It uses CLOCK_REALTIME

Using an npth function is not good because we want to come up with a reasonable iteration count. Allowing npth to switch threads would not be good. The Linux specific solution in /D450 looks like a good solution but it needs some testing.

werner raised the priority of this task from Normal to High.Oct 26 2017, 12:40 PM

The Linux specific solution in /D450 looks like a good solution but it needs some testing.

I will try to test that patch this week-end. Just need some time to recompile my kernel with the right (or shall I say, wrong) configuration options to trigger the bug.

It turns out I cannot reproduce the bug with a 4.13.2 kernel. Whatever happened to times in slightly older kernels when VIRT_CPU_ACCOUNTING_GEN was enabled seems to have been fixed in newer kernels.

But interestingly, D450 does have a drastic effect on S2K calibration.

With current master:

$ time ./agent/t-protect
S2K calibration: 25810944 -> 100ms

real	0m0.466s
user	0m0.464s
sys	0m0.000s

With D450 applied:

$ time ./agent/t-protect
S2K calibration: 65536 -> 2650ms

real	0m0.014s
user	0m0.004s
sys	0m0.004s

Something is wrong here: how can the calibration function think the calibration took more than 2 seconds when the program ran for no more than 14 milliseconds?

OK, the problem with D450 lies in the way the value obtained from clock_gettime(2) is used.

Actually, there was already problem in the original code. It used the CLOCKS_PER_SEC constant to convert the clock ticks returned by times(2) to a time expressed in milliseconds, but this seems incorrect: according to the man page, the number of clock ticks per second is to be retrieved from sysconf. CLOCKS_PER_SEC is only meant to be used for clock ticks returned by the clock(3) function (which, confusingly enough, are not the same ticks as those returned by times(2)).

I proposed a new patch D451 based on @gniibe 's one to fix both problems. I tested it with a 4.11.7 kernel with VIRT_CPU_ACCOUNTING_GEN enabled, and it avoids the bug as expected.

gniibe added a comment.EditedOct 30 2017, 3:16 AM

Thanks for testing and proposing new patch.

I see that use of CLOCKS_PER_SEC in the original code was not good (even if it worked).

You are right that in some systems where sysconf(_SC_CLK_TCK) != 100, the original code calibration is wrong.

Introducing sysconf(_SC_CLK_TCK) is worth??, I wonder. How about using clock(3)?
It works perfectly on my Debian, it looks it's better than times(3), in terms of resolution.

I updated D450, so that it uses clock(3).

clock returns CPU time on POSIX, wall clock time on Windows. For threads, I don't know.

werner added a comment.Nov 6 2017, 3:20 PM

We won't have a solution for 2.2.2 but I added --2k-count as a workaround
(rG78a6d0ce88ae) and the GETINFO subcommands s2k_count_cal and s2k_time.

werner lowered the priority of this task from High to Normal.Nov 6 2017, 3:20 PM
gniibe added a comment.Nov 7 2017, 1:36 AM

# My update of D450: clock_gettime if CLOCK_THREAD_CPUTIME_ID is available. has gone somewhere. So, I update it again.

I confirmed that times doesn't work well on my NetBSD 7.1 running QEMU.
Updated version of D450: clock_gettime if CLOCK_THREAD_CPUTIME_ID is available. with clock works better.

gniibe added a comment.Nov 7 2017, 2:37 AM

I confirmed that clock is better on FreeBSD, too. And FreeBSD has clock_gettime with CLOCK_THREAD_CPUTIME_ID.
I tested FreeBSD 11.1 running QEMU.

So, I'm going to apply D450 to master.

Fixed in 2.2.3 and master. Closing.

gniibe closed this task as Resolved.Dec 4 2017, 2:23 AM
larryv added a subscriber: larryv.Jan 1 2018, 8:52 AM