dotlock is not perfect (errornously remove .lock as stale lockfile)
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• gniibe
	Mar 18 2022, 2:19 AM

Description

Because https://bugs.debian.org/972525 has been reopened, I review again the code path in question.

I found that dotlock implementation is not perfect. There is a race condition for handling stale lock file.

Process A got the lock.
Process B waits for the lock, examines the lockfile by read_lockfile, pid is one of Process A.
Process A released the lock and finishes its work, terminates its process.
Process C got the lock, started some work with the lock.
Process B wrongly considers the lockfile is stale (by checking kill(pid,0) with pid of A), removes the lockfile by mistake.
Process D can get the lock, while actually it's process C which takes the lock.

I confirmed reproducible failure by adding usleep at the end of read_lockfile.

Revisions and Commits

rG GnuPG
		rGd9a8d3353afd common,unix: Backport dotlock changes from GnuPG 2.3.
		rG2f1afc129662 common: Fix another race condition, and address the other one.
	Concern Raised	rGd94b411f129f common: Fix a race condition removing stale lockfile.
		rGa30359cecb5b common: More heavy test condition for t-dotlock.c.

Related Objects
Search...

		Status	Assigned	Task
		Resolved	• gniibe	T5109 Initial socket connection to server
		Resolved	• gniibe	T5884 dotlock is not perfect (errornously remove .lock as stale lockfile)

Event Timeline

• gniibe triaged this task as High priority.Mar 18 2022, 2:19 AM

• gniibe created this task.

• gniibe updated the task description. (Show Details)

• gniibe updated the task description. (Show Details)Mar 18 2022, 2:22 AM

• gniibe added a commit: rGa30359cecb5b: common: More heavy test condition for t-dotlock.c..Mar 18 2022, 4:36 AM

I pushed a change for t-dotlock.c for testing.

Reproducible scenario:
apply a patch to dotlock.c:

diff --git a/common/dotlock.c b/common/dotlock.c
index cd8e95066..42c24c316 100644
--- a/common/dotlock.c
+++ b/common/dotlock.c
@@ -608,6 +608,7 @@ read_lockfile (dotlock_t h, int *same_node )
 
   if (buffer != buffer_space)
     xfree (buffer);
+  usleep (10000);
   return pid;
 }
 #endif /*HAVE_POSIX_SYSTEM */

From a terminal, run one-shot t-dotlock by a shell command line:

while ./t-dotlock --one-shot; do sleep 0.01; done

From other four terminals, run t-dotlock in parallel.

Then in a minute or so, I got:

removing stale lockfile (created by 127229)
t-dotlock[127083]: lock taken
error opening lockfile 't-dotlock.tmp.lock': No such file or directory
release_dotlock: lockfile error
t-dotlock[127083]: error releasing lock

in a terminal or more.

• gniibe added a commit: rGd94b411f129f: common: Fix a race condition removing stale lockfile..Mar 18 2022, 6:57 AM

Fixed in master. Should be backported when found stable.

Before the fix above, https://bugs.debian.org/972525 can be explained by the following scenario:

lock_spawning doesn't work well and multiple processes enter the critical section (of spawning gpg-agent)
In agent/gpg-agent.c,
- two or more processes try to bind the socket by assuan_sock_bind in create_server_socket function simultaneously in parallel
- one successes, but before this successful process call listen,
  - another process gets an error with EADDRINUSE
    - then, calls check_for_running_agent, but no agent because no process ever called listen yet
    - so, this process removes the socket, and calls again assuan_sock_bind
    - then, it is this process which will listen to (and accept) the socket
    - but suppose this process cannot proceed to listen because of high load
- the successful process call listen but on removed socket, so no client can connect
- then, that successful process does fork/exec, parent does exit(0)
In this situation, gpg which did gnupg_spawn_process_fd and gnupg_wait_process in common/asshelp.c, actually does wait_for_sock of gpg-agent asynchronously (to another process)
That's because the successful gpg-agent listens on removed socket
So, it may fail on heavy load

• gniibe added a commit: rG2f1afc129662: common: Fix another race condition, and address the other one..Mar 19 2022, 6:55 AM

Fixed another race in commit: rG2f1afc129662: common: Fix another race condition, and address the other one.

In the old code we had a comment:

/* Note: It is unlikely that we get a race here unless a pid is
   reused too fast or a new process with the same pid as the one
   of the stale file tries to lock right at the same time as we.  */

After the commit rGd94b411f129f, there are two cases for the problem addressed:

When h->use_o_excl is 1 (true), and it detects stale lockfile.
When h->use_o_excl is 0 (false), and for some reason, only .lock file remains (while no #lk.hostname.pid file is removed), and it detect the lockfile.

Latter case is more unlikely, because both files usually remain. And that case, manual intervention is required to remove those files. (It fails in dotlock_create_unix.)

Anyway, it is better to fix the race addressed.

The race is fixed by changing the behavior when pid == getpid (). That is: When a stale lockfile is detected, regardless of same use of pid or not, unlink that file. If one case removes, but another case use, indeed, there is a race condition, where valid lockfile in use is mistakenly removed.

Note that there is a race condition still (after a fix of one race condition which may be somewhat likely and reproducible, and another fix of race condition when there is a stale lockfile).

That is fundamental one, which cannot be solved, when there is a stale lockfile and we want to recover automatically.

When it emits removing stale lockfile (created by %d)\n, it does automatic recovery, but that part is racy. Existence of stale lockfile means something wrong happened in the past. Removing the lockfile also means it's done in (unlikely) race.

If we want to be perfect, don't do automatic recovery here, but ask user's manual intervention.

That would be bad for unattended use cases. Recording the time the lock file was created might be a solution. Then cleanup only after 15 minutes or so.

Now, the problem is not about the case of pid == getpid () any more.

It is not atomic operation doing judge-then-remove the stale lockfile.

It only happens when there is:

When there is a stale lockfile
AND multiple processes race on detecting and removing the stale lockfile concurrently

So my opinion is standing with non-perfect solution. Because having a stale lock means something went wrong in the past, removing that stale lock only increase a bit of non-correctness.

• gniibe added a commit: rGd9a8d3353afd: common,unix: Backport dotlock changes from GnuPG 2.3..Mar 29 2022, 4:27 AM

• werner mentioned this in T5743: Release GnuPG 2.3.5.Apr 21 2022, 5:59 PM

• gniibe closed this task as Resolved.Apr 22 2022, 6:43 AM

• gniibe edited projects, added Bug Report; removed Restricted Project.

• werner mentioned this in T5928: Release GnuPG 2.2.35.Apr 25 2022, 7:12 PM

• werner mentioned this in T6962: gpg lock issue on Alma Linux upgraded servers.Jan 30 2024, 11:04 AM

dotlock is not perfect (errornously remove .lock as stale lockfile)Closed, ResolvedPublicActions

Description

Revisions and Commits

Related ObjectsSearch...

Event Timeline

dotlock is not perfect (errornously remove .lock as stale lockfile)
Closed, ResolvedPublic
Actions

Related Objects
Search...