Re: UTF-8

From: Christian Weisgerber
Date: Wednesday, July 28, 2010 - 7:58 am

Some stuff that works:

== xterm ==

If you start xterm with LC_CTYPE=en_US.UTF-8 set, it will come up
in UTF-8 mode.  Apart from displaying UTF-8 encoded text, it will
also allow you to enter such text.  Keysyms are translated, e.g.
if you use a German, Swedish, etc. keymap with <adiaeresis>, that
key will produce the byte sequence 0xC3 0xA4 in xterm.
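
To see where those bytes come from: <adiaeresis> is U+00E4, and UTF-8
encodes any code point from U+0080 to U+07FF as two bytes.  The
following is only a minimal sketch of the standard encoding rules;
utf8_encode is a made-up name for illustration, not a libc function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8 (illustrative helper).
 * Writes up to 4 bytes to out and returns the number written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF.
 */
size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
	assert(out != NULL);
	if (cp < 0x80) {			/* 0xxxxxxx */
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {			/* 110xxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp >= 0xD800 && cp <= 0xDFFF)	/* surrogates are not encodable */
		return 0;
	if (cp < 0x10000) {			/* 1110xxxx 10xxxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {			/* 11110xxx + 3 continuation bytes */
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

Calling utf8_encode(0xE4, buf) fills buf with 0xC3 0xA4 and returns 2.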

If you have a compose key (<Multi_key> in X11 terms), you can enter
_a lot_ of characters with compose sequences.  For instance, you
can use
$ setxkbmap -option compose:ralt
to configure the right Alt key as compose.

Compose sequences work by pressing (and releasing) the compose key
and then two or three other keys that get combined into a single
character, e.g.:
  <'><e>  e with acute (French etc.)
  <c><r>  r with caron (Czech)
  </><l>  l with stroke (Polish)
Some combinations are fairly intuitive, some are not.  The complete
list of supported sequences is here:
/usr/X11R6/share/X11/locale/en_US.UTF-8/Compose

If you have been using a compose key for ISO 8859-X input all along,
note that the UTF-8 sequences can be different, and in particular
the order is important, e.g. it is always <'><e> now and <e><'>
is not accepted.

== GTK2 ==

The default GTK2 input method provides its own compose key processing,
which already worked without UTF-8 locale.  However, GTK2's compose
sequences diverge from the X11 ones, and if you find that as confusing
as I do, you can disable GTK2's own compose handling and use the
X11 one by setting GTK_IM_MODULE=xim in the environment.  That
didn't work before, but now does.

-- 
Christian "naddy" Weisgerber                          naddy@mips.inka.de

From: Christian Weisgerber
Date: Wednesday, July 28, 2010 - 12:45 pm

What doesn't work: UTF-8 mode is incompatible with 8-bit control
sequences.  If that doesn't ring a bell for you, then you don't
need to worry about it. ;-)

I only noticed because the RMC on my AlphaServer 800 inserts 8-bit
controls to set bold and blink attributes in its status output.
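
For anyone wondering why the two clash: the 8-bit (C1) controls are
single bytes in the range 0x80..0x9F, for example CSI is 0x9B, and that
range lies inside the UTF-8 continuation byte range 0x80..0xBF.  A raw
0x9B at the start of a character is therefore malformed UTF-8, while
the same byte is perfectly legal as the second byte of U+011B
(0xC4 0x9B).  A rough sketch of the check, with all function names
invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* C1 control characters occupy 0x80..0x9F in 8-bit encodings. */
bool
is_c1_control_byte(unsigned char b)
{
	return b >= 0x80 && b <= 0x9F;
}

/* UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF). */
bool
is_utf8_continuation(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Return true if a continuation byte shows up where a lead byte is
 * expected, which is what a raw 8-bit control looks like to a UTF-8
 * decoder.  This is not a full UTF-8 validator.
 */
bool
has_stray_continuation(const unsigned char *s, size_t n)
{
	size_t i = 0;

	assert(s != NULL || n == 0);
	while (i < n) {
		unsigned char b = s[i];
		size_t follow = 0;

		if (b >= 0x80) {
			if (is_utf8_continuation(b))
				return true;
			if ((b & 0xE0) == 0xC0)
				follow = 1;
			else if ((b & 0xF0) == 0xE0)
				follow = 2;
			else if ((b & 0xF8) == 0xF0)
				follow = 3;
		}
		i++;
		/* Skip the continuation bytes that belong to this lead byte. */
		while (follow > 0 && i < n && is_utf8_continuation(s[i])) {
			i++;
			follow--;
		}
	}
	return false;
}
```

The 7-bit form of the same control (ESC [, i.e. 0x1B 0x5B) is plain
ASCII and passes through a UTF-8 decoder untouched.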

-- 
Christian "naddy" Weisgerber                          naddy@mips.inka.de

From: Jordi Beltran Creix
Subject: Re: UTF-8
Date: Wednesday, August 4, 2010 - 6:22 am

ls(1) needs to use wcwidth(3) instead of just assuming a width of 1 for
alignment, and if I remember correctly it also mangles strings using
isprint(3) or hardcoded values instead of iswprint(3) when printing to a
terminal, which is probably what you are seeing here. ed(1) is broken
by the latter, and ksh(1) for both reasons.

wcwidth(3) doesn't seem to have been added yet, though.
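
As a rough illustration of what "use wcwidth(3) for alignment" means,
here is a sketch that measures a multibyte string in terminal columns
using the standard mbrtowc(3) and wcwidth(3).  The name display_width
and the policy of counting invalid or non-printable input as one
column are assumptions made for this example, not anything ls(1) does
today.

```c
#define _XOPEN_SOURCE 700	/* exposes wcwidth() on some systems */
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/*
 * Return the number of terminal columns needed to show the multibyte
 * string s, decoded according to the current LC_CTYPE locale.
 * Invalid bytes and non-printable characters count as one column,
 * on the assumption that the caller prints a '?' in their place.
 */
int
display_width(const char *s)
{
	mbstate_t st;
	size_t left;
	int cols = 0;

	assert(s != NULL);
	left = strlen(s);
	memset(&st, 0, sizeof(st));
	while (left > 0) {
		wchar_t wc;
		size_t n = mbrtowc(&wc, s, left, &st);
		int w;

		if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
			/* Invalid or incomplete: skip one byte, reset state. */
			memset(&st, 0, sizeof(st));
			cols++;
			s++;
			left--;
			continue;
		}
		w = wcwidth(wc);	/* -1 means non-printable */
		cols += (w < 0) ? 1 : w;
		s += n;
		left -= n;
	}
	return cols;
}
```

The caller has to run setlocale(LC_CTYPE, "") first; in the default
"C" locale only ASCII is decoded.  In a UTF-8 locale a combining
accent adds 0 columns and a CJK ideograph adds 2, which is exactly
where counting bytes or characters goes wrong.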

From: Matthew Dempsky
Subject: Re: UTF-8
Date: Wednesday, August 4, 2010 - 1:36 pm

On Wed, Aug 4, 2010 at 6:22 AM, Jordi Beltran Creix

Is there any useful documentation that explains how you're supposed to
write C code and what's changed under the i18n New World Order?  From
your message, it sounds like we're going to have to rewrite nearly all
of our user-space code...

From: Jordi Beltran Creix
Subject: Re: UTF-8
Date: Wednesday, August 4, 2010 - 6:20 pm

Not everything, but utilities that do ls-like alignment of file
names and other user-provided strings do need small modifications if
they are to be made Unicode-friendly. The names should still print
correctly as long as they aren't mangled, but anything that uses glyphs
0 or 2 columns wide will be misaligned. Reading user input interactively
from a terminal needs to account for glyph width as well, but that
mostly happens in the libraries.

String and input mangling occurs when programs try to sanitize
control characters. In UTF-8, bytes of 0x80 and above, which an 8-bit
encoding may treat as terminal control codes, can be a valid part of a
printable character.
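
A sketch of sanitizing per character rather than per byte, using the
standard mbrtowc(3) and iswprint(3): valid printable characters are
copied through byte for byte, and anything else becomes a single '?'.
The function name, the buffer interface and the '?' policy are all
assumptions for the sake of the example.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Copy src to dst, replacing every non-printable character and every
 * invalid byte with '?'.  Decoding follows the current LC_CTYPE locale.
 * Output is truncated to fit dstsize and always NUL-terminated.
 * Returns the number of bytes written, not counting the NUL.
 */
size_t
sanitize_printable(const char *src, char *dst, size_t dstsize)
{
	mbstate_t st;
	size_t left, out = 0;

	assert(src != NULL && dst != NULL && dstsize > 0);
	left = strlen(src);
	memset(&st, 0, sizeof(st));
	while (left > 0) {
		wchar_t wc;
		size_t n = mbrtowc(&wc, src, left, &st);
		int ok = 1;

		if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
			/* Invalid or incomplete: consume a single byte. */
			memset(&st, 0, sizeof(st));
			n = 1;
			ok = 0;
		}
		if (ok && iswprint((wint_t)wc)) {
			if (out + n >= dstsize)
				break;
			memcpy(dst + out, src, n);	/* keep original bytes */
			out += n;
		} else {
			if (out + 1 >= dstsize)
				break;
			dst[out++] = '?';
		}
		src += n;
		left -= n;
	}
	dst[out] = '\0';
	return out;
}
```

In a UTF-8 locale the sequence 0xC4 0x9B survives intact even though
its second byte is in the 8-bit control range, while a lone 0x9B is
replaced.  A byte-wise isprint(3) loop cannot tell these two cases
apart.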

And then there is collation which means people get angry when IJ.txt
is listed after II.txt. However, many Unicode aware programs ignore it
and it is optional in POSIX regexes.

All programs that output raw strings, don't attempt alignment, and
don't work with glyphs or code points (regexes are out, but simple
matching and replacement are fine) are safe from i18n. If you
ignore its features, UTF-8 is just like ASCII and nothing has to
change; there is no need to use Unicode functions for everything.
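
To illustrate the "simple matching" case: plain strstr(3) works on
UTF-8 without any decoding, because lead bytes and continuation bytes
use disjoint ranges, so a valid UTF-8 needle cannot match starting in
the middle of another character.  A tiny sketch; find_bytes is a
made-up wrapper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the byte offset of the first occurrence of needle in
 * haystack, or -1 if it does not occur.  Both strings are treated as
 * opaque bytes; no locale or wide-character functions are involved.
 */
long
find_bytes(const char *haystack, const char *needle)
{
	const char *p;

	assert(haystack != NULL && needle != NULL);
	p = strstr(haystack, needle);
	return p == NULL ? -1 : (long)(p - haystack);
}
```

Note that the result is a byte offset, not a character or column
position.  As soon as a program needs one of the latter it is back in
the territory described above.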

This old FAQ is the best resource there is by far about supporting
UTF-8 and locales in POSIX programs:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

... and then there are many other implementations of the same
utilities that have been adapted to different degrees before.

From: Kevin Chadwick
Subject: Re: UTF-8
Date: Thursday, August 5, 2010 - 3:05 pm

On Thu, 5 Aug 2010 10:20:12 +0900

Reminded me of the PHP escapeshellarg exploit that only worked on
UTF-8-enabled systems, if you didn't do your own filtering, that is. Is
there anything like that in Perl or anywhere else that may need
considering?

I guess most, if not all, code where it would be a problem would be
made for UTF-8 systems anyway.

From: Matthew Szudzik
Subject: Re: UTF-8
Date: Thursday, August 5, 2010 - 5:49 am

Not only does switching to unicode require a lot of work, but it
requires perpetual, unending work.  Unicode has the foolish goal of
including all known characters, so every time a country invents a new
currency symbol, for example, the unicode fonts (such as DejaVu) must be
updated to include the symbol and the C library has to be updated to
recognize that the symbol is printable, and so on.  It requires constant
maintenance.

But it's even worse, because unicode also violates the principle
(established by Alan Turing in 1936) that any two characters should be
humanly distinguishable "at a glance".  This has led to the invention of
punycode for translating unicode strings into humanly distinguishable
ASCII strings.  But then why did we switch from ASCII to unicode in the
first place?

It's my opinion that unicode shouldn't have a place in the Unix
terminal.  You might want your GUI to display unicode characters, but
when I'm working from a terminal, I want to see the data as closely as
possible to the way that the computer "sees" the data.  For example, I
don't want nvi to display unicode characters, I want to see each
individual 8-bit byte that composes the character.  I don't want unicode
on the command line.

From: Philip Guenther
Subject: Re: UTF-8
Date: Thursday, August 5, 2010 - 11:33 am

On Thu, Aug 5, 2010 at 5:49 AM, Matthew Szudzik <mszudzik@andrew.cmu.edu>
wrote:


Umm, punycode wasn't developed because of problems with
distinguishability.  Indeed, it does nothing to solve those, so I'm
not sure why you would suggest that.  punycode exists to encode
unicode across a transport that is, in effect, in base36, with various

So you want to see '41' instead of the letter 'A'?  That's "how the

Those that are wedded to plain ASCII can continue to have that
experience by using LC_ALL=C.  Oops, never mind, OpenBSD hasn't
actually implemented "plain ASCII only" for years.


Philip Guenther

From: Matthew Szudzik
Subject: Re: UTF-8
Date: Thursday, August 5, 2010 - 12:50 pm

Although punycode may not have been developed to solve problems with
distinguishability, it is used for that purpose.  For example, punycode
is commonly used as a defense against phishers who impersonate online
banks using URLs that are indistinguishable from the banks' actual URLs.
 http://en.wikipedia.org/wiki/IDN_homograph_attack
But with a properly-designed font, ASCII characters are all easily

I simply want to be able to know "at one glance" what data the computer
is using.  For that purpose, it is unnecessary to decode an "A" as 0x41.

The fact that OpenBSD doesn't implement "plain ASCII only" doesn't mean
that it shouldn't. ;)

And by the way, the Turing quote is from the paper in which he first
proposed the idea of a mechanical computer.  He argued that it is
sufficient for a computer to have a finite character set where each
character can be distinguished at a glance.  His argument begins at the
bottom of this page:
 http://www.turingarchive.org/viewer/?id=466&title=01u
and continues onto the top of the next.  Although it is not necessary
that we follow his proposal, it has served as a historical precedent
since the very beginning of computing.

From: Philip Guenther
Subject: Re: UTF-8
Date: Thursday, August 5, 2010 - 7:40 pm

On Thu, Aug 5, 2010 at 12:50 PM, Matthew Szudzik


You're saying that a side-comment in the original Turing machine
thought experiment is an argument against character sets with similar
characters?


Philip Guenther

From: Marco Peereboom
Subject: Re: UTF-8
Date: Thursday, August 5, 2010 - 8:06 pm

7 bits ought to be enough for everyone!


From: Dmitrij D. Czarkoff
Subject: Re: UTF-8
Date: Thursday, August 5, 2010 - 10:28 pm

By default all output is in the ASCII range, and you can configure your
keyboard to input only ASCII-range symbols. Your problem isn't one of
Unicode, it's just one of character ranges, and right now it doesn't
happen to exist in real-world OpenBSD.

--
Dmitrij D. Czarkoff

From: Marc Espie
Subject: Re: UTF-8
Date: Friday, August 6, 2010 - 4:31 am

So what ? human languages are complicated. It's great that finally, some
large proportion of humanity is not ignored.

Your view is so narrow-minded, this is mind-boggling.

Do you realize that almost 1 billion people live in India ? and more than
that in China ?  Do you think there is proper support for the languages of
those people outside of unicode ?  (hint: even there, it's tough. If you
have time, check the logs of qt, see all the fixes about accents and other
diacritical marks in languages you may never have heard of... which often

Stay in your backwater county, redneck.

Anyways, you're a troll, and you're not really relevant.

Rest assured that OpenBSD developers are interested in better i18n support.
It goes slow, because it's a tough problem, and yeah, we don't want to
create security issues, and yeah, we have to be really, really careful about
a lot of things.

Don't like it ? feel free to leave.

Ou, si tu préfères, va te faire voir ailleurs... ;-)

From: Kevin Chadwick
Subject: Re: UTF-8
Date: Friday, August 6, 2010 - 6:52 am

I'd hope everyone is interested in better i18n support, but I certainly
don't envy your task. Best of luck to anyone involved.

From: STeve Andre'
Subject: Re: UTF-8
Date: Friday, August 6, 2010 - 6:18 am

Thank you Marc.  I started to write something twice but I devolved into
much less useful language, talking about this.  I'm going to keep this
handy, for future such conversations, see if I can expand it a bit.

I begin to think that this is uniquely an American thing, not understanding
about the rest of the world and computer usage.  Despite the added
complexity it's a wonderful thing, making computers mold to people
rather than the other way around.

-- 
STeve Andre'
Disease Control Warden
Dept. of Political Science
Michigan State University

A day without Windows is like a day without a nuclear incident.

From: Alexander Polakov
Subject: Re: UTF-8
Date: Saturday, August 7, 2010 - 7:52 am

wcwidth() is in, but no man page yet.


--- ls.c	2010/08/07 15:15:04	1.1
+++ ls.c	2010/08/07 15:17:32
@@ -41,6 +41,7 @@
 #include <err.h>
 #include <errno.h>
 #include <fts.h>
+#include <locale.h>
 #include <grp.h>
 #include <pwd.h>
 #include <stdio.h>
@@ -102,6 +103,7 @@
 	int kflag = 0;
 	char *p;

+	setlocale(LC_CTYPE, "");
 	/* Terminal defaults to -Cq, non-terminal defaults to -1. */
 	if (isatty(STDOUT_FILENO)) {
 		if ((p = getenv("COLUMNS")) != NULL)
--- util.c	2010/08/07 15:00:48	1.1
+++ util.c	2010/08/07 15:13:52
@@ -41,18 +41,75 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
+#include <wchar.h>

 #include "ls.h"
 #include "extern.h"

+#define MB_LEN_MAX 32 /* goes into limits.h */
+
+static int
+printwc(wchar_t wc, mbstate_t * pst)
+{
+	size_t		size;
+	char		buf[MB_LEN_MAX];
+
+	size = wcrtomb(buf, wc, pst);
+	if (size == (size_t) -1)       /* This shouldn't happen, but for
+					 * sure */
+		return 0;
+	if (wc == L'\0') {
+		/* The following condition must be always true, but for sure */
+		if (size > 0 && buf[size - 1] == '\0')
+			--size;
+	}
+	if (size > 0)
+		fwrite(buf, 1, size, stdout);
+	return wc == L'\0' ? 0 : wcwidth(wc);
+}
+
 int
-putname(char *name)
+putname(char *src)
 {
-	int len;
+	int             n = 0;
+	mbstate_t       src_state, stdout_state;
+	/* The following +1 is to pass '\0' at the end of src to mbrtowc(). */
+	const char     *endptr = src + strlen(src) + 1;

-	for (len = 0; *name; len++, name++)
-		putchar((!isprint(*name) && f_nonprint) ? '?' : *name);
-	return len;
+	/*
+	* We have to reset src_state each time in this function, because
+	* the codeset of src pathname may not match with current locale.
+	* Note that if we pass NULL instead of src_state to mbrtowc(),
+	* there is no way to reset the state.
+	*/
+	memset(&src_state, 0, sizeof(src_state));
+	memset(&stdout_state, 0, sizeof(stdout_state));
+	while (src < endptr) {
+		wchar_t         wc;
+		size_t          ...

From: Ingo Schwarze
Subject: Re: UTF-8
Date: Saturday, August 7, 2010 - 9:47 am

Ugh.
I hate that.

Well, i partly see a point in providing such functions in libraries,
because libraries are used for compiling all sorts of software,
including typesetting software etc. etc.

But do we really want to uproot the tree in src/bin in this respect?
I value correctness by simplicity, and this code looks nightmarish.
More than 20 lines of code basically for counting up the length
of a word, involving five calls to two different library functions,
not counting memset.

Yes, it is possible to put all kinds of bytes into filenames.
But that doesn't mean it is a smart idea, and i'm not sure
it should be actively encouraged.

And even if you happen to think there is nothing wrong with
filenames that cannot be displayed everywhere, or with filenames
that look identical but are actually different, or with filenames
that some software may handle poorly, I consider the price we have
to pay for this feature rather high.
