Page MenuHome GnuPG

libgcrypt-1.10.3 regression on hppa
Closed, ResolvedPublic

Description

Hi, on hppa, the libgcrypt test suite has regressed pretty severely between 1.10.2 and 1.10.3. I've bisected it to the following commit: https://github.com/gpg/libgcrypt/commit/01e7052cb245619280769f683d697d6b2f68e041

Here is a log of what the test suite looks like for 1.10.3 (as compared to a pass for 1.10.2): https://925284.bugs.gentoo.org/attachment.cgi?id=885758

It doesn't appear to be just my environment, because Debian CI also reproduces the problem: https://buildd.debian.org/status/logs.php?pkg=libgcrypt20&arch=hppa

If you don't have an environment available, I offer FREE shell access the exact hardware I reproduced this on: https://static.matoro.tk/isa-sandbox-faq.html

Let me know if there is any further info I can provide.

Details

External Link
https://bugs.gentoo.org/925284
Version
1.10.3

Revisions and Commits

Event Timeline

gniibe triaged this task as Normal priority.Feb 28 2024, 2:47 AM

It looks like computation for NIST P-256 failed on hppa (32-bit big-endian, actually running on 64-bit machine, IIUC).
powerpc is similar (32-bit big-endian, actually running on 64-bit machine), but no failures.

By the commit, we use asm volatile statement now, instead of using volatile variables for zero and one when computing masks.
Perhaps, with the compiler implementation of the particular architecture, something goes wrong.

Here is a possible workaround to prefer volatile variables when computing masks (for hppa).

diff --git a/src/const-time.c b/src/const-time.c
index 0fb53a07..c771eb99 100644
--- a/src/const-time.c
+++ b/src/const-time.c
@@ -24,7 +24,7 @@
 #include "const-time.h"
 
 
-#ifndef HAVE_GCC_ASM_VOLATILE_MEMORY
+#if USE_GCC_ASM_VOLATILE_MEMORY_CT == 0
 /* These variables are used to generate masks from conditional operation
  * flag parameters.  Use of volatile prevents compiler optimizations from
  * converting AND-masking to conditional branches.  */
diff --git a/src/const-time.h b/src/const-time.h
index fe07cc7a..aba7f0cb 100644
--- a/src/const-time.h
+++ b/src/const-time.h
@@ -27,8 +27,24 @@
 #define ct_memequal _gcry_ct_memequal
 #define ct_memmov_cond _gcry_ct_memmov_cond
 
+/*
+ * Macro: USE_GCC_ASM_VOLATILE_MEMORY_CT
+ *
+ * For constant-time mask generation functions:
+ *   0: Don't use asm volatile statement
+ *   1: Use asm volatile statement 
+ */
+#ifdef HAVE_GCC_ASM_VOLATILE_MEMORY
+# ifdef __hppa
+# define USE_GCC_ASM_VOLATILE_MEMORY_CT 0 /* Possibly, compiler problem.  */
+# else
+# define USE_GCC_ASM_VOLATILE_MEMORY_CT 1
+# endif
+#else
+#define USE_GCC_ASM_VOLATILE_MEMORY_CT 0
+#endif
 
-#ifndef HAVE_GCC_ASM_VOLATILE_MEMORY
+#if USE_GCC_ASM_VOLATILE_MEMORY_CT == 0
 extern volatile unsigned int _gcry_ct_vzero;
 extern volatile unsigned int _gcry_ct_vone;
 #endif
@@ -85,7 +101,7 @@ unsigned int _gcry_ct_memequal (const void *b1, const void *b2, size_t len);
 /*
  * Return all bits set if A is 1 and return 0 otherwise.
  */
-#ifdef HAVE_GCC_ASM_VOLATILE_MEMORY
+#if USE_GCC_ASM_VOLATILE_MEMORY_CT
 #  define DEFINE_CT_TYPE_GEN_MASK(name, type) \
      static inline type \
      ct_##name##_gen_mask (unsigned long op_enable) \
@@ -109,7 +125,7 @@ DEFINE_CT_TYPE_GEN_MASK(ulong, unsigned long)
 /*
  * Return all bits set if A is 0 and return 1 otherwise.
  */
-#ifdef HAVE_GCC_ASM_VOLATILE_MEMORY
+#if USE_GCC_ASM_VOLATILE_MEMORY_CT
 #  define DEFINE_CT_TYPE_GEN_INV_MASK(name, type) \
      static inline type \
      ct_##name##_gen_inv_mask (unsigned long op_enable) \

Thanks, I can confirm that this patch fixes the issue. I'll let Sam decide if this is how we want to handle it downstream or wait for confirmation from gcc.

Clarification from Dave:

I believe the issue is PA 2.0 machines are only weakly ordered. The volatile asm is only
a compiler memory barrier. The code probably needs to use a hardware memory barrier
such as __sync_synchronize() or an atomic store with release semantics.

The PA-RISC sync instructions is a full barrier. The ldcw instruction can also be used as a
barrier with a temporary variable (e.g., top of stack assuming 16-byte alignment). I haven't
compared the relative performance of these two barriers.

I am wondering if it is wrong everywhere to rely on a compiler barrier? Should this be unconditionally changed to a hardware barrier?

No, hardware barrier is not needed here. Compiler barrier is used here to prevent optimization removing mask generation and usage in following constant-time code.

Alternatively (more narrow workaround), when I add a line:

#pragma GCC optimize("O1")

before the function _gcry_mpi_ec_nist256_mod in mpi/ec-nist.c, it works for me on panama.debian.net (Debian porterbox for hppa).

gniibe changed the task status from Open to Testing.Mar 1 2024, 2:34 AM
gniibe claimed this task.

Since I don't like to introduce hppa specific workaround in a way like pragma (and I have no time to fix compiler itself), I tried to improve the ec-nist.c for hppa so that register pressure can be lower.
Here is my solution.

diff --git a/mpi/ec-inline.h b/mpi/ec-inline.h
index 0ffdf8eb..c24d5352 100644
--- a/mpi/ec-inline.h
+++ b/mpi/ec-inline.h
@@ -921,6 +921,46 @@ LIMB64_HILO(mpi_limb_t hi, mpi_limb_t lo)
 
 #endif /* HAVE_COMPATIBLE_GCC_ARM_PLATFORM_AS */
 
+#if defined (__hppa) && __GNUC__ >= 4
+#define ADD4_LIMB32(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \
+  __asm__ ("add %7,%11,%3\n\t" \
+	   "addc %6,%10,%2\n\t" \
+	   "addc %5,%9,%1\n\t" \
+	   "addc %4,%8,%0" \
+	   : "=r" (A3), \
+	     "=&r" (A2), \
+	     "=&r" (A1), \
+	     "=&r" (A0) \
+	   : "rM" ((mpi_limb_t)(B3)), \
+	     "rM" ((mpi_limb_t)(B2)), \
+	     "rM" ((mpi_limb_t)(B1)), \
+	     "rM" ((mpi_limb_t)(B0)), \
+	     "rM" ((mpi_limb_t)(C3)), \
+	     "rM" ((mpi_limb_t)(C2)), \
+	     "rM" ((mpi_limb_t)(C1)), \
+	     "rM" ((mpi_limb_t)(C0)) \
+	   : "cc")
+
+#define SUB4_LIMB32(A3, A2, A1, A0, B3, B2, B1, B0, C3, C2, C1, C0) \
+  __asm__ ("sub %7,%11,%3\n\t" \
+	   "subb %6,%10,%2\n\t" \
+	   "subb %5,%9,%1\n\t" \
+	   "subb %4,%8,%0\n\t" \
+	   : "=r" (A3), \
+	     "=&r" (A2), \
+	     "=&r" (A1), \
+	     "=&r" (A0) \
+	   : "rM" ((mpi_limb_t)(B3)), \
+	     "rM" ((mpi_limb_t)(B2)), \
+	     "rM" ((mpi_limb_t)(B1)), \
+	     "rM" ((mpi_limb_t)(B0)), \
+	     "rM" ((mpi_limb_t)(C3)), \
+	     "rM" ((mpi_limb_t)(C2)), \
+	     "rM" ((mpi_limb_t)(C1)), \
+	     "rM" ((mpi_limb_t)(C0)) \
+	   : "cc")
+
+#endif /* __hppa */
 
 /* Common 32-bit arch addition/subtraction macros.  */
 
diff --git a/mpi/longlong.h b/mpi/longlong.h
index a0e19279..35402f9c 100644
--- a/mpi/longlong.h
+++ b/mpi/longlong.h
@@ -404,23 +404,23 @@ extern UDItype __udiv_qrnnd ();
  ***************************************/
 #if defined (__hppa) && W_TYPE_SIZE == 32
 # define add_ssaaaa(sh, sl, ah, al, bh, bl) \
-  __asm__ ("	add %4,%5,%1\n"                                             \
- 	   "	addc %2,%3,%0"                                              \
+  __asm__ ("add %4,%5,%1\n\t"                                           \
+ 	   "addc %2,%3,%0"                                              \
 	   : "=r" ((USItype)(sh)),                                      \
 	     "=&r" ((USItype)(sl))                                      \
 	   : "%rM" ((USItype)(ah)),                                     \
 	     "rM" ((USItype)(bh)),                                      \
 	     "%rM" ((USItype)(al)),                                     \
-	     "rM" ((USItype)(bl)))
+	     "rM" ((USItype)(bl)) __CLOBBER_CC)
 # define sub_ddmmss(sh, sl, ah, al, bh, bl) \
-  __asm__ ("	sub %4,%5,%1\n"                                             \
-	   "	subb %2,%3,%0"                                              \
+  __asm__ ("sub %4,%5,%1\n\t"                                           \
+	   "subb %2,%3,%0"                                              \
 	   : "=r" ((USItype)(sh)),                                      \
 	     "=&r" ((USItype)(sl))                                      \
 	   : "rM" ((USItype)(ah)),                                      \
 	     "rM" ((USItype)(bh)),                                      \
 	     "rM" ((USItype)(al)),                                      \
-	     "rM" ((USItype)(bl)))
+	     "rM" ((USItype)(bl)) __CLOBBER_CC)
 # if defined (_PA_RISC1_1)
 #  define umul_ppmm(wh, wl, u, v) \
   do {									\

Note that I modified longlong.h adding __CLOBBER_CC since I think it's right. Perhaps, it's not needed for hppa actually.

If no objection, I will apply this change.

Looks good to me. __CLOBBER_CC is needed as PA-RISC has carry/borrow bits in status register for add/sub instructions.

Applied to both (master and 1.10 branch).