gpg2 cannot find keys by non-ASCII User IDs unless the system locale is UTF-8
Closed, Resolved, Public

Description

over on https://bugs.debian.org/795229, anarcat writes:

------

gpg can't seem to operate properly if the environment is not correctly set:

[1007]anarcat@marcos:~$ LANG=C gpg --search-keys ='Antoine Beaupré
<anarcat@koumbit.org>'
gpg: searching for "=Antoine Beaupré <anarcat@koumbit.org>" from hkp server
pool.sks-keyservers.net
gpg: key "=Antoine Beaupré <anarcat@koumbit.org>" not found on keyserver
[1008]anarcat@marcos:~$ LANG=C.UTF-8 gpg --search-keys ='Antoine Beaupré
<anarcat@koumbit.org>'
gpg: searching for "=Antoine Beaupré <anarcat@koumbit.org>" from hkp server
pool.sks-keyservers.net
(1) Antoine Beaupré <anarcat@debian.org>

Antoine Beaupré <anarcat@koumbit.org>
Antoine Beaupré <anarcat@orangeseeds.org>
Antoine Beaupré (work) <anarcat@koumbit.org>
Antoine Beaupré (Debian) <anarcat@debian.org>
Antoine Beaupré (home address) <anarcat@anarcat.ath.cx>
  4096 bit RSA key 7B75921E, created: 2009-05-29, expires: 2016-06-01

(2) The Anarcat <anarcat@koumbit.org>

The Anarcat <anarcat@anarcat.ath.cx>
Antoine Beaupré <antoine@koumbit.org>
  1024 bit DSA key 4023702F, created: 2005-03-08, expires: 2010-03-12

(revoked) (expired)
Keys 1-2 of 2 for "=Antoine Beaupré <anarcat@koumbit.org>". Enter number(s),
N)ext, or Q)uit > q

This is pretty annoying, because it expects *everyone* to have a UTF-8
locale. Because my uid has an accent in it, it makes it impossible for
some people to search for my key on the keyservers.

This is also a problem with gpg2:

[1002]anarcat@marcos:~$ LANG=C gpg2 --search-keys ='Antoine Beaupré <a
gpg: searching for "=Antoine Beaupré <anarcat@koumbit.org>" from hkp server po
gpg: key "=Antoine Beaupré <anarcat@koumbit.org>" not found on keyserver
[1003]anarcat@marcos:~$ LANG=C.UTF-8 gpg2 --search-keys ='Antoine Beaupré
<anarcat@koumbit.org>'
gpg: searching for "=Antoine Beaupré <anarcat@koumbit.org>" from hkp server
pool.sks-keyservers.net
(1) Antoine Beaupré <anarcat@debian.org>

Antoine Beaupré <anarcat@koumbit.org>
Antoine Beaupré <anarcat@orangeseeds.org>
Antoine Beaupré (work) <anarcat@koumbit.org>
Antoine Beaupré (Debian) <anarcat@debian.org>
Antoine Beaupré (home address) <anarcat@anarcat.ath.cx>
  4096 bit RSA key 7B75921E, created: 2009-05-29, expires: 2016-06-01

(2) The Anarcat <anarcat@koumbit.org>

The Anarcat <anarcat@anarcat.ath.cx>
Antoine Beaupré <antoine@koumbit.org>
  1024 bit DSA key 4023702F, created: 2005-03-08, expires: 2010-03-12

(revoked) (expired)
Keys 1-2 of 2 for "=Antoine Beaupré <anarcat@koumbit.org>". Enter number(s),
N)ext, or Q)uit > q

-----------

I'm not sure what the right fix should be. Even if LANG is broken or names a
non-existent locale, can we just marshal the bytes from argv and treat them as a
UTF-8 string?

Details

External Link
https://bugs.debian.org/795229
Version
1.4.19, 2.1.7

Event Timeline

dkg added projects: Debian, Bug Report.
dkg added a subscriber: dkg.

I did a couple of tests but I do not understand what is going on.
There is also an older key of Antoine 231A87628530E205 which encodes
his name in Latin-1 (wrong charset during creation or PGP was used).

Using

  gpg -vvv ....

shows the character set used by gpg. Maybe this gives some insights.
If you know that the command line is UTF-8 you may use the option
--utf-strings to avoid any conversion.

FWIW, gpg uses LC_ALL, LC_CTYPE, LANG in that order to determine the
locale. Antoine's original report shows

  Locale: LANG=fr_CA.UTF-8, LC_CTYPE=fr_CA.UTF-8 (charmap=UTF-8)

and thus UTF-8 should be used due to LC_CTYPE. gpg converts command
line arguments back and forth as needed but passes them as UTF-8 to
the keyserver (which is the reason that the "searching for =..."
message renders it differently).
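
To make the lookup order concrete, here is a minimal sketch in Python (not GnuPG's actual C code). It assumes the three variables are LC_ALL, LC_CTYPE and LANG, as in POSIX; the helper names effective_locale and charset_of are made up for illustration, and the charset is read naively from the part after the dot.

```python
def effective_locale(env):
    """Return the first non-empty value among LC_ALL, LC_CTYPE, LANG."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None


def charset_of(locale_name):
    """Extract the charset suffix, e.g. 'fr_CA.UTF-8' -> 'UTF-8'.

    A bare name such as 'C' carries no suffix, so None is returned.
    """
    if locale_name and "." in locale_name:
        # Drop an optional modifier such as '@euro' after the charset.
        return locale_name.split(".", 1)[1].split("@", 1)[0]
    return None


# The two invocations from the report differ only in LANG.
print(charset_of(effective_locale({"LANG": "C"})))
print(charset_of(effective_locale({"LANG": "C.UTF-8"})))
```

With only LANG=C set, no charset can be derived from the environment, which is the case where gpg has to fall back on some assumption.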

werner set External Link to https://bugs.debian.org/795229,. Aug 12 2015, 9:58 AM

I think werner means --utf8-strings instead of --utf-strings.

hm, common/utf8conv.c says this:

  /* Note that we silently assume that plain ASCII is actually meant
     as Latin-1.  This makes sense because many Unix system don't have
     their locale set up properly and thus would get annoying error
     messages and we have to handle all the "bug" reports. Latin-1 has
     always been the character set used for 8 bit characters on Unix
     systems. */

I wonder if this is still the best choice. In my experience, far more machines
today have text encoded in UTF-8 than in Latin-1. This is especially true
for systems that deal with OpenPGP User IDs, where UTF-8 is the canonical
representation.
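
As an illustration of why the Latin-1 assumption hurts here, a small Python sketch (an assumption about the effect, not a trace of gpg's code path): the bytes typed in a UTF-8 terminal are read as Latin-1 and then re-encoded as UTF-8 for the keyserver. The function name latin1_assumed_to_utf8 is hypothetical.

```python
def latin1_assumed_to_utf8(raw: bytes) -> bytes:
    """Treat raw bytes as Latin-1 text and re-encode that text as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


# Bytes a UTF-8 terminal would place in argv for this name.
typed = "Antoine Beaupré".encode("utf-8")

# What would be sent if those bytes are taken to be Latin-1.
sent = latin1_assumed_to_utf8(typed)

print(typed)
print(sent)
print(sent.decode("utf-8"))  # prints "Antoine BeauprÃ©"
```

The two bytes of the UTF-8 "é" are each turned into a separate character, so the search string no longer matches the User ID stored on the keyserver.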

If the user's environment claims that it's plain ASCII and we're seeing 8-bit
characters, gpg does have to make a decision about what to do. i see four options:

a) report an error and fail.

b) pretend that the 8-bit characters are Latin-1 (this is "OK" because any
bytestring is a valid Latin-1 string)

c) pretend that the 8-bit characters are UTF-8

d) do some sort of autodetection on the bytestring (e.g. if it is a valid UTF-8
byte sequence then treat as UTF-8, otherwise treat as Latin-1)

option (a) is annoying and likely a cause of spurious complaints, as the comment
notes. GnuPG is currently going with option (b). Option (c) seems more
reasonable to me because of OpenPGP's relationship with UTF-8, but introduces
some error cases (what do we do where the bytestring is not valid UTF-8?).
Option (d) avoids error cases but might be a bit more delicate to implement.

What do you think?
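
For comparison, the four options can be sketched in Python as follows. This is only an illustration of the behaviours described above, not proposed GnuPG code; decode_argument and its option letters are hypothetical.

```python
def decode_argument(raw: bytes, option: str) -> str:
    """Decode a command-line argument under one of the four policies."""
    if option == "a":
        # (a) fail on any 8-bit byte: raises UnicodeDecodeError
        return raw.decode("ascii")
    if option == "b":
        # (b) pretend Latin-1: every byte string decodes, never fails
        return raw.decode("latin-1")
    if option == "c":
        # (c) pretend UTF-8: raises UnicodeDecodeError on invalid input
        return raw.decode("utf-8")
    if option == "d":
        # (d) autodetect: valid UTF-8 wins, otherwise fall back to Latin-1
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return raw.decode("latin-1")
    raise ValueError("unknown option: " + option)


utf8_bytes = "Beaupré".encode("utf-8")
latin1_bytes = "Beaupré".encode("latin-1")

print(decode_argument(utf8_bytes, "b"))    # mis-decoded under (b)
print(decode_argument(utf8_bytes, "c"))    # correct under (c)
print(decode_argument(latin1_bytes, "d"))  # (d) falls back to Latin-1
```

Note that (d) can still guess wrong in principle: a Latin-1 string whose bytes happen to form valid UTF-8 would be read as UTF-8.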

anarcat changed External Link from https://bugs.debian.org/795229, to https://bugs.debian.org/795229. Aug 18 2015, 1:30 AM
anarcat added a subscriber: anarcat.

i prefer solution (c): we should assume utf8, if we are going to assume anything
at all.

if the user doesn't provide UTF8 *and* doesn't have the proper locale set, then
we should exit with a meaningful message.

that way, things break only for people who don't have a properly configured
locale *and* try to input non-UTF8, as opposed to failing whenever the locale
is *not configured*, which is a pretty common scenario.
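
A rough Python sketch of that rule, under the assumption that a charset derived from the locale (if any) is passed in by the caller; interpret_argument and the wording of the message are hypothetical, and this is not how gpg is actually implemented.

```python
def interpret_argument(raw: bytes, locale_charset=None) -> str:
    """Use the locale's charset when one is configured, else assume UTF-8."""
    if locale_charset:
        # A properly configured locale decides the encoding.
        return raw.decode(locale_charset)
    try:
        # No usable locale: assume UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Neither a locale nor valid UTF-8: stop with a clear message.
        raise SystemExit(
            "argument is not valid UTF-8 and no locale charset is set; "
            "set a locale such as LANG=C.UTF-8 or pass UTF-8 input"
        ) from None


print(interpret_argument("Beaupré".encode("utf-8")))
print(interpret_argument("Beaupré".encode("latin-1"), "ISO-8859-1"))
```

Only the combination of an unconfigured locale and non-UTF-8 input ends in an error.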

Solution (c) will be used for 2.1.8.

Won't fix in 1.4 because that version is mostly useful on old systems and those
don't have proper UTF-8 support anyway.