Page MenuHome GnuPG

dotlock is not perfect (errornously remove .lock as stale lockfile)
Closed, ResolvedPublic


Because has been reopened, I review again the code path in question.

I found that dotlock implementation is not perfect. There is a race condition for handling stale lock file.

  • Process A got the lock.
  • Process B waits for the lock, examines the lockfile by read_lockfile, pid is one of Process A.
  • Process A released the lock and finishes its work, terminates its process.
  • Process C got the lock, started some work with the lock.
  • Process B wrongly considers the lockfile is stale (by checking kill(pid,0) with pid of A), removes the lockfile by mistake.
  • Process D can get the lock, while actually it's process C which takes the lock.

I confirmed reproducible failure by adding usleep at the end of read_lockfile.

Event Timeline

gniibe created this task.
gniibe updated the task description. (Show Details)

I pushed a change for t-dotlock.c for testing.

Reproducible scenario:
apply a patch to dotlock.c:

diff --git a/common/dotlock.c b/common/dotlock.c
index cd8e95066..42c24c316 100644
--- a/common/dotlock.c
+++ b/common/dotlock.c
@@ -608,6 +608,7 @@ read_lockfile (dotlock_t h, int *same_node )
   if (buffer != buffer_space)
     xfree (buffer);
+  usleep (10000);
   return pid;
 #endif /*HAVE_POSIX_SYSTEM */

From a terminal, run one-shot t-dotlock by a shell command line:

while ./t-dotlock --one-shot; do sleep 0.01; done

From other four terminals, run t-dotlock in parallel.

Then in a minute or so, I got:

removing stale lockfile (created by 127229)
t-dotlock[127083]: lock taken
error opening lockfile 't-dotlock.tmp.lock': No such file or directory
release_dotlock: lockfile error
t-dotlock[127083]: error releasing lock

in a terminal or more.

Fixed in master. Should be backported when found stable.

Before the fix above, can be explained by the following scenario:

  • lock_spawning doesn't work well and multiple processes enter the critical section (of spawning gpg-agent)
  • In agent/gpg-agent.c,
    • two or more processes try to bind the socket by assuan_sock_bind in create_server_socket function simultaneously in parallel
    • one successes, but before this successful process call listen,
      • another process gets an error with EADDRINUSE
        • then, calls check_for_running_agent, but no agent because no process ever called listen yet
        • so, this process removes the socket, and calls again assuan_sock_bind
        • then, it is this process which will listen to (and accept) the socket
        • but suppose this process cannot proceed to listen because of high load
    • the successful process call listen but on removed socket, so no client can connect
    • then, that successful process does fork/exec, parent does exit(0)
  • In this situation, gpg which did gnupg_spawn_process_fd and gnupg_wait_process in common/asshelp.c, actually does wait_for_sock of gpg-agent asynchronously (to another process)
  • That's because the successful gpg-agent listens on removed socket
  • So, it may fail on heavy load

Fixed another race in commit: rG2f1afc129662: common: Fix another race condition, and address the other one.

In the old code we had a comment:

/* Note: It is unlikely that we get a race here unless a pid is
   reused too fast or a new process with the same pid as the one
   of the stale file tries to lock right at the same time as we.  */

After the commit rGd94b411f129f, there are two cases for the problem addressed:

  1. When h->use_o_excl is 1 (true), and it detects stale lockfile.
  2. When h->use_o_excl is 0 (false), and for some reason, only .lock file remains (while no file is removed), and it detect the lockfile.

Latter case is more unlikely, because both files usually remain. And that case, manual intervention is required to remove those files. (It fails in dotlock_create_unix.)

Anyway, it is better to fix the race addressed.

The race is fixed by changing the behavior when pid == getpid (). That is: When a stale lockfile is detected, regardless of same use of pid or not, unlink that file. If one case removes, but another case use, indeed, there is a race condition, where valid lockfile in use is mistakenly removed.

Note that there is a race condition still (after a fix of one race condition which may be somewhat likely and reproducible, and another fix of race condition when there is a stale lockfile).

That is fundamental one, which cannot be solved, when there is a stale lockfile and we want to recover automatically.

When it emits removing stale lockfile (created by %d)\n, it does automatic recovery, but that part is racy. Existence of stale lockfile means something wrong happened in the past. Removing the lockfile also means it's done in (unlikely) race.

If we want to be perfect, don't do automatic recovery here, but ask user's manual intervention.

That would be bad for unattended use cases. Recording the time the lock file was created might be a solution. Then cleanup only after 15 minutes or so.

Now, the problem is not about the case of pid == getpid () any more.

It is not atomic operation doing judge-then-remove the stale lockfile.

It only happens when there is:

  • When there is a stale lockfile
  • AND multiple processes race on detecting and removing the stale lockfile concurrently

So my opinion is standing with non-perfect solution. Because having a stale lock means something went wrong in the past, removing that stale lock only increase a bit of non-correctness.

gniibe edited projects, added Bug Report; removed Restricted Project.