Page MenuHome GnuPG

w32: daemon (gpg-agent/keyboxd/dirmngr) startup and connection race when there is a socket file already
Open, NormalPublic

Description

Thanks to @timegrid , I located a bug of daemon start up.

  • There is a socket file remained for some reason (like forcibly kill the process, not by gpgconf --kill)
  • client side: Connection with the socket file fails because no daemon is running
  • client side: Spawn the daemon
  • client side: loop until connection success or timeout
  • server side: assuan_sock_bind fails because there is the socket file already
  • server side: try to remove the socket file with gnupg_remove
    • Suppose that the file is opened by client here, during the file is requested for removal
  • server side: assuan_sock_bind again, but this may fail at CreateFileW (with ERROR_ACCESS_DENIED) because client side still has the valid handle to the socket file
  • server side: mysterious system error code: 0 (0x0) and error binding socket to ...: Unknown error are emitted and the process vanishes (by agent_exit(2))
  • client side: timeout, because the daemon vanished

See:
https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-createfilew

Event Timeline

gniibe triaged this task as Normal priority.Wed, Oct 1, 7:58 AM
gniibe created this task.
gniibe created this object with edit policy "Contributor (Project)".

Here is a possible fix:

diff --git a/common/asshelp.c b/common/asshelp.c
index 0152d1243..c2d342fb8 100644
--- a/common/asshelp.c
+++ b/common/asshelp.c
@@ -536,6 +536,7 @@ start_new_service (assuan_context_t *r_ctx,
           && assuan_socket_connect (ctx, sockname, 0, connect_flags))
         {
 #ifdef HAVE_W32_SYSTEM
+          gnupg_remove (sockname);
           err = gpgrt_process_spawn (program? program : program_name, argv,
                                      GPGRT_PROCESS_DETACHED, NULL, NULL);
 #else /*!W32*/

That is, let us make sure the socket file does not exist (somehow), when we start the daemon.
I know that it's not perfect; There is still a possibility that other processes (of gpg/gpgsm) open the socket file and the handle is active.

The gnupg_remove should retry if it has a sharing violation. Similar to what we do in gnupg_rename_file. I just figured that we do a remove in the latter function too w/o handling a sharing violation.

I think that modifying gnupg_remove is a bit risky because it's used in many places.
I'd rather introduce new function for Windows; gnupg_w32_delete_file for this particular purpose.
Factoring out wait_when_sharing_violation function from gnupg_rename_file.

Like this, I mean:

diff --git a/common/asshelp.c b/common/asshelp.c
index 0152d1243..33eb8f2b4 100644
--- a/common/asshelp.c
+++ b/common/asshelp.c
@@ -536,6 +536,7 @@ start_new_service (assuan_context_t *r_ctx,
           && assuan_socket_connect (ctx, sockname, 0, connect_flags))
         {
 #ifdef HAVE_W32_SYSTEM
+          gnupg_w32_delete_file (sockname);
           err = gpgrt_process_spawn (program? program : program_name, argv,
                                      GPGRT_PROCESS_DETACHED, NULL, NULL);
 #else /*!W32*/
diff --git a/common/sysutils.c b/common/sysutils.c
index 2e39d4737..daf446d52 100644
--- a/common/sysutils.c
+++ b/common/sysutils.c
@@ -980,8 +980,68 @@ w32_rename (const char *oldname, const char *newname)
   else
     return rename (oldname, newname);
 }
-#endif /*HAVE_W32_SYSTEM*/
 
+static int
+wait_when_sharing_violation (int wtime, const char *fname)
+{
+  int ec = GetLastError ();
+
+  if (ec != ERROR_SHARING_VIOLATION)
+    {
+      gnupg_w32_set_errno (-1);
+      return 0;
+    }
+
+  /* Another process has the file open.  We do not use a
+   * lock for read but instead we wait until the other
+   * process has closed the file.  This may take long but
+   * that would also be the case with a dotlock approach for
+   * read and write.  Note that we don't need this on Unix
+   * due to the inode concept.
+   *
+   * So let's wait until the file_op has worked.  The retry
+   * intervals are 50, 100, 200, 400, 800, 50ms, ...  */
+  if (!wtime || wtime >= 800)
+    wtime = 50;
+  else
+    wtime *= 2;
+
+  if (wtime >= 800)
+    log_info (_("waiting for file '%s' to become accessible ...\n"),
+              fname);
+
+  gnupg_usleep ((wtime + 999) / 1000);
+  return wtime;
+}
+
+
+void
+gnupg_w32_delete_file (const char *fname)
+{
+  wchar_t *wfname;
+  int wtime = 0;
+
+  wfname = utf8_to_wchar (fname);
+  if (!wfname)
+    return;
+
+ again:
+  if (DeleteFileW (wfname))
+    {
+      xfree (wfname);
+      return;
+    }
+
+  /* Failure of DeleteFileW.  */
+  if ((wtime = wait_when_sharing_violation (wtime, fname)))
+    goto again;
+
+  log_error ("failed to remove %s: %s\n", fname,
+             gpg_strerror (my_error_from_syserror ()));
+
+  xfree (wfname);
+}
+#endif /*HAVE_W32_SYSTEM*/
 
 /* Wrapper for rename(2) to handle Windows peculiarities.  If
  * BLOCK_SIGNALS is not NULL and points to a variable set to true, all
@@ -998,35 +1058,17 @@ gnupg_rename_file (const char *oldname, const char *newname, int *block_signals)
 
 #ifdef HAVE_DOSISH_SYSTEM
   {
-    int wtime = 0;
+    int wtime;
 
-    gnupg_remove (newname);
+    gnupg_remove (newname); /*FIXME: Also gnupg_w32_delete_file? */
+
+    wtime = 0;
   again:
     if (w32_rename (oldname, newname))
       {
-        if (GetLastError () == ERROR_SHARING_VIOLATION)
-          {
-            /* Another process has the file open.  We do not use a
-             * lock for read but instead we wait until the other
-             * process has closed the file.  This may take long but
-             * that would also be the case with a dotlock approach for
-             * read and write.  Note that we don't need this on Unix
-             * due to the inode concept.
-             *
-             * So let's wait until the rename has worked.  The retry
-             * intervals are 50, 100, 200, 400, 800, 50ms, ...  */
-            if (!wtime || wtime >= 800)
-              wtime = 50;
-            else
-              wtime *= 2;
-
-            if (wtime >= 800)
-              log_info (_("waiting for file '%s' to become accessible ...\n"),
-                        oldname);
-
-            Sleep (wtime);
-            goto again;
-          }
+        if ((wtime = wait_when_sharing_violation (wtime, oldname)))
+          goto again;
+
         err = my_error_from_syserror ();
       }
   }
diff --git a/common/sysutils.h b/common/sysutils.h
index c832c7932..c75e6eab8 100644
--- a/common/sysutils.h
+++ b/common/sysutils.h
@@ -126,6 +126,7 @@ void output_debug_string (const char *format, ...) GPGRT_ATTR_PRINTF(1,2);
 #ifdef HAVE_W32_SYSTEM
 int gnupg_w32_set_errno (int ec);
 void *w32_get_user_sid (void);
+void gnupg_w32_delete_file (const char *fname);
 
 #include "../common/w32help.h"

Please let us not clutter the code with OS specific things. We could use a gnupg_remove_ext or gnupg_remove_maybe_wait with a wait parameter which maps to a plain gnupg_remove for Unix. The GPGRT_PROCESS_DETACHED, in the asshelp is also the only specific thing which can be move to a file global macro.

I need to do a VSD new beta today and will implement this for gnupg22 so that we can test it.

I implemented that in the old 2.2 branch for easier testing.

gniibe mentioned this in Unknown Object (Maniphest Task).Mon, Oct 6, 7:33 AM
werner mentioned this in Unknown Object (Maniphest Task).Mon, Oct 13, 3:33 PM
werner moved this task from Backlog to QA on the gnupg22 board.