Some stuff that works:

== xterm ==

If you start xterm with LC_CTYPE=en_US.UTF-8 set, it will come up in UTF-8 mode. Apart from displaying UTF-8 encoded text, it will also allow you to enter such text. Keysyms are translated, e.g. if you use a German, Swedish, etc. keymap with <adiaeresis>, that key will produce the byte sequence 0xC3 0xA4 in xterm.

If you have a compose key (<Multi_key> in X11 terms), you can enter _a lot_ of characters with compose sequences. For instance, you can use

  $ setxkbmap -option compose:ralt

to configure the right Alt key as compose. Compose sequences work by pressing (and releasing) the compose key and then two or three other keys that get combined into a single character, e.g.:

  <'><e>   e with acute (French etc.)
  <c><r>   r with caron (Czech)
  </><l>   l with stroke (Polish)

Some combinations are fairly intuitive, some are not. The complete list of supported sequences is here:

  /usr/X11R6/share/X11/locale/en_US.UTF-8/Compose

If you have been using a compose key for ISO 8859-X input all along, note that the UTF-8 sequences can be different, and in particular the order is important, e.g. it is always <'><e> now and <e><'> is not accepted.

== GTK2 ==

The default GTK2 input method provides its own compose key processing, which already worked without a UTF-8 locale. However, GTK2's compose sequences diverge from the X11 ones, and if you find that as confusing as I do, you can disable GTK2's own compose handling and use the X11 one by setting GTK_IM_MODULE=xim in the environment. That didn't work before, but now does.

--
Christian "naddy" Weisgerber naddy@mips.inka.de
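For anyone who wants to check which bytes a given keysym should end up as, here is a minimal sketch of the UTF-8 encoding rule. The function name utf8_encode is made up for illustration and is not part of xterm or libc; <adiaeresis> is U+00E4, which this encodes as 0xC3 0xA4.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 * Encode one Unicode code point as UTF-8 into out[].
 * Returns the number of bytes written (1 to 4), or 0 if cp is not
 * a valid code point (surrogate or above U+10FFFF).
 */
size_t
utf8_encode(unsigned long cp, unsigned char out[4])
{
	assert(out != NULL);

	if (cp < 0x80) {			/* plain ASCII, one byte */
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {			/* two bytes: 110xxxxx 10xxxxxx */
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {			/* three bytes */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;		/* surrogates are not characters */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {			/* four bytes */
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}
```

Calling utf8_encode(0xE4, buf) fills buf with the two bytes xterm sends for that key in UTF-8 mode; in an ISO 8859-1 xterm the same key would send the single byte 0xE4.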
What doesn't work: UTF-8 mode is incompatible with 8-bit control sequences. If that doesn't ring a bell for you, then you don't need to worry about it. ;-) I only noticed because the RMC on my AlphaServer 800 inserts 8-bit controls to set bold and blink attributes in its status output. -- Christian "naddy" Weisgerber naddy@mips.inka.de
ls(1) needs to use wcwidth(3) for alignment instead of just assuming a width of 1, and if I remember correctly it also mangles strings by using isprint(3) or hardcoded values instead of iswprint(3) when printing to a terminal, which is probably what you are seeing here. ed(1) is broken by the latter, and ksh(1) for both reasons. wcwidth(3) doesn't seem to have been added yet, though.
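To make the alignment point concrete, here is a rough sketch of measuring a name in terminal columns rather than bytes. It is not the ls(1) code; display_width is a made-up name, it assumes the caller has already run setlocale(LC_CTYPE, ""), and counting invalid bytes and non-printable characters as one column each (as if they were shown as '?') is my own choice.

```c
#define _XOPEN_SOURCE 700	/* for wcwidth() on some systems */
#include <assert.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper, for illustration only.
 * Return the number of terminal columns s occupies in the current
 * LC_CTYPE locale.  Invalid bytes and non-printable characters are
 * counted as one column each.
 */
int
display_width(const char *s)
{
	mbstate_t st;
	size_t len, n;
	wchar_t wc;
	int w, total = 0;

	assert(s != NULL);
	len = strlen(s);
	memset(&st, 0, sizeof(st));

	while (len > 0) {
		n = mbrtowc(&wc, s, len, &st);
		if (n == (size_t)-1 || n == (size_t)-2) {
			/* Invalid or truncated sequence: skip one byte. */
			memset(&st, 0, sizeof(st));
			n = 1;
			total += 1;
		} else {
			if (n == 0)
				break;	/* not expected: len excludes the NUL */
			w = wcwidth(wc);
			total += (w < 0) ? 1 : w;
		}
		s += n;
		len -= n;
	}
	return total;
}
```

A column formatter would then pad each name with (column_width - display_width(name)) spaces instead of relying on strlen(name).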
On Wed, Aug 4, 2010 at 6:22 AM, Jordi Beltran Creix wrote:

Is there any useful documentation that explains how you're supposed to write C code and what's changed under the i18n New World Order? From your message, it sounds like we're going to have to rewrite nearly all of our user-space code...
Not everything, but utilities that do ls-like alignment of file names and other user-provided strings do need small modifications if they are to be made Unicode friendly. The names should still print correctly as long as they aren't mangled, but anything that uses glyphs 0 or 2 columns wide will be misaligned. Reading user input interactively from a terminal needs to account for glyph width as well, but that mostly happens in the libraries.

String and input mangling occurs when programs try to sanitize control characters. In UTF-8, bytes in the 8-bit control range (0x80 and up) can be a valid part of a printable character.

And then there is collation, which means people get angry when IJ.txt is listed after II.txt. However, many Unicode-aware programs ignore it, and it is optional in POSIX regexes.

All programs that output raw strings, don't attempt alignment, and don't work with glyphs or code points (stuff like regexes is out, but not simple matching and replacement) are safe from i18n. If you ignore its features, UTF-8 is just like ASCII and nothing has to change; there is no need to use Unicode functions for everything.

This old FAQ is by far the best resource there is about supporting UTF-8 and locales in POSIX programs:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

... and then there are many other implementations of the same utilities that have been adapted to different degrees before.
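As an illustration of the mangling problem, here is a small sketch comparing a byte-wise control filter of the kind written for 8-bit charsets with a UTF-8-aware one. Both function names are made up, neither is taken from any existing utility, and the second one assumes its input is already valid UTF-8 (it does not validate anything).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical 8-bit style filter: replaces C0 controls, DEL and
 * bytes in the C1 range (0x80-0x9F) with '?'.
 * Returns the number of bytes replaced.
 */
size_t
sanitize_bytes(unsigned char *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != '\0'; s++) {
		if (*s < 0x20 || *s == 0x7F || (*s >= 0x80 && *s <= 0x9F)) {
			*s = '?';
			n++;
		}
	}
	return n;
}

/*
 * Hypothetical UTF-8-aware filter, assuming valid UTF-8 input:
 * replaces C0 controls and DEL, and the C1 controls only where they
 * appear as encoded characters U+0080-U+009F (0xC2 0x80-0x9F).
 * Continuation bytes of other characters are left alone.
 * Returns the number of characters replaced.
 */
size_t
sanitize_utf8(unsigned char *s)
{
	size_t n = 0;

	assert(s != NULL);
	while (*s != '\0') {
		if (*s < 0x20 || *s == 0x7F) {
			*s++ = '?';
			n++;
		} else if (s[0] == 0xC2 && s[1] >= 0x80 && s[1] <= 0x9F) {
			s[0] = s[1] = '?';	/* keep the byte length unchanged */
			s += 2;
			n++;
		} else {
			s++;
		}
	}
	return n;
}
```

With the name "\xC3\x84.txt" (a capital A with diaeresis, U+00C4), sanitize_bytes() clobbers the second byte 0x84 and leaves an invalid sequence behind, while sanitize_utf8() leaves the name untouched.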
On Thu, 5 Aug 2010 10:20:12 +0900 Reminded me of the php escapeshellarg exploit that only worked on utf-8 enabled systems, if you didn't do your own filtering that is. Is there anything like that in perl or anywhere else that may need considering? I guess most/if not all code where it would be a problem, would be made for utf-8 systems anyway.
Not only does switching to unicode require a lot of work, but it requires perpetual, unending work. Unicode has the foolish goal of including all known characters, so every time a country invents a new currency symbol, for example, the unicode fonts (such as DejaVu) must be updated to include the symbol and the C library has to be updated to recognize that the symbol is printable, and so on. It requires constant maintenance. But it's even worse, because unicode also violates the principle (established by Alan Turing in 1936) that any two characters should be humanly distinguishable "at a glance". This has led to the invention of punycode for translating unicode strings into humanly distinguishable ASCII strings. But then why did we switch from ASCII to unicode in the first place? It's my opinion that unicode shouldn't have a place in the Unix terminal. You might want your GUI to display unicode characters, but when I'm working from a terminal, I want to see the data as closely as possible to the way that the computer "sees" the data. For example, I don't want nvi to display unicode characters, I want to see each individual 8-bit byte that composes the character. I don't want unicode on the command line.
On Thu, Aug 5, 2010 at 5:49 AM, Matthew Szudzik <mszudzik@andrew.cmu.edu> wrote:

Umm, punycode wasn't developed because of problems with distinguishability. Indeed, it does nothing to solve those, so I'm not sure why you would suggest that. punycode exists to encode unicode across a transport that is, in effect, in base36, with various

So you want to see '41' instead of the letter 'A'? That's "how the

Those that are wedded to plain ASCII can continue to have that experience by using LC_ALL=C. Oops, never mind, OpenBSD hasn't actually implemented "plain ASCII only" for years.

Philip Guenther
Although punycode may not have been developed to solve problems with distinguishability, it is used for that purpose. For example, punycode is commonly used as a defense against phishers who impersonate online banks using URLs that are indistinguishable from the banks' actual URLs.

http://en.wikipedia.org/wiki/IDN_homograph_attack

But with a properly-designed font, ASCII characters are all easily

I simply want to be able to know "at one glance" what data the computer is using. For that purpose, it is unnecessary to decode an "A" as 0x41.

The fact that OpenBSD doesn't implement "plain ASCII only" doesn't mean that it shouldn't. ;)

And by the way, the Turing quote is from the paper in which he first proposed the idea of a mechanical computer. He argued that it is sufficient for a computer to have a finite character set where each character can be distinguished at a glance. His argument begins at the bottom of this page:

http://www.turingarchive.org/viewer/?id=466&title=01u

and continues onto the top of the next. Although it is not necessary that we follow his proposal, it has served as a historical precedent since the very beginning of computing.
On Thu, Aug 5, 2010 at 12:50 PM, Matthew Szudzik wrote:

You're saying that a side-comment in the original Turing machine thought experiment is an argument against character sets with similar characters?

Philip Guenther
By default all output is in the ASCII range, and you can configure your keyboard to input only ASCII-range symbols. Your problem isn't one of Unicode, it's just one of character ranges, and right now it doesn't happen to exist in a real-world OpenBSD.

--
Dmitrij D. Czarkoff
So what ? human languages are complicated. It's great that finally, some large proportion of humanity is not ignored.

Your view is so narrow-minded, this is mind-boggling. Do you realize that almost 1 billion people live in India ? and more than that in China ? Do you think there is proper support for the languages of those people outside of unicode ? (hint: even there, it's tough. If you have time, check the logs of qt, see all the fixes about accents and other diacritical marks in languages you may never have heard of... which often

Stay in your backwater county, redneck.

Anyways, you're a troll, and you're not really relevant. Rest assured that OpenBSD developers are interested in better i18n support. It goes slow, because it's a tough problem, and yeah, we don't want to create security issues, and yeah, we have to be really, really careful about a lot of things.

Don't like it ? feel free to leave. Or, if you prefer, go get lost somewhere else... ;-)
I'd hope everyone is interested in better i18n support, but I certainly don't envy your task. Best of luck to anyone involved.
Thank you Marc. I started to write something twice but I devolved into much less useful language, talking about this. I'm going to keep this handy, for future such conversations, see if I can expand it a bit.

I begin to think that this is uniquely an American thing, not understanding about the rest of the world and computer usage. Despite the added complexity it's a wonderful thing, making computers mold to people rather than the other way.

--
STeve Andre'
Disease Control Warden
Dept. of Political Science
Michigan State University

A day without Windows is like a day without a nuclear incident.
wcwidth() is in, but no man page yet.
--- ls.c 2010/08/07 15:15:04 1.1
+++ ls.c 2010/08/07 15:17:32
@@ -41,6 +41,7 @@
#include <err.h>
#include <errno.h>
#include <fts.h>
+#include <locale.h>
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
@@ -102,6 +103,7 @@
int kflag = 0;
char *p;
+ setlocale(LC_CTYPE, "");
/* Terminal defaults to -Cq, non-terminal defaults to -1. */
if (isatty(STDOUT_FILENO)) {
if ((p = getenv("COLUMNS")) != NULL)
--- util.c 2010/08/07 15:00:48 1.1
+++ util.c 2010/08/07 15:13:52
@@ -41,18 +41,75 @@
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
+#include <wchar.h>
#include "ls.h"
#include "extern.h"
+#define MB_LEN_MAX 32 /* goes into limits.h */
+
+static int
+printwc(wchar_t wc, mbstate_t * pst)
+{
+ size_t size;
+ char buf[MB_LEN_MAX];
+
+ size = wcrtomb(buf, wc, pst);
+ if (size == (size_t) -1) /* This shouldn't happen, but for
+ * sure */
+ return 0;
+ if (wc == L'\0') {
+ /* The following condition must be always true, but for sure */
+ if (size > 0 && buf[size - 1] == '\0')
+ --size;
+ }
+ if (size > 0)
+ fwrite(buf, 1, size, stdout);
+ return wc == L'\0' ? 0 : wcwidth(wc);
+}
+
int
-putname(char *name)
+putname(char *src)
{
- int len;
+ int n = 0;
+ mbstate_t src_state, stdout_state;
+ /* The following +1 is to pass '\0' at the end of src to mbrtowc(). */
+ const char *endptr = src + strlen(src) + 1;
- for (len = 0; *name; len++, name++)
- putchar((!isprint(*name) && f_nonprint) ? '?' : *name);
- return len;
+ /*
+ * We have to reset src_state each time in this function, because
+ * the codeset of src pathname may not match with current locale.
+ * Note that if we pass NULL instead of src_state to mbrtowc(),
+ * there is no way to reset the state.
+ */
+ memset(&src_state, 0, sizeof(src_state));
+ memset(&stdout_state, 0, sizeof(stdout_state));
+ while (src < endptr) {
+ wchar_t wc;
+ size_t ...

Ugh. I hate that.

Well, i partly see a point in providing such functions in libraries, because libraries are used for compiling all sorts of software, including typesetting software etc. etc.

But do we really want to uproot the tree in src/bin in this respect? I value correctness by simplicity, and this code looks nightmarish. More than 20 lines of code basically for counting up the length of a word, involving five calls to two different library functions, not counting memset.

Yes, it is possible to put all kinds of bytes into filenames. But that doesn't mean it is a smart idea, and i'm not sure it should be actively encouraged. And even if you happen to think there is nothing wrong with filenames that cannot be displayed everywhere, or with filenames that look identical but are actually different, or with filenames that some software may handle poorly, I consider the price we have to pay for this feature rather high.
