This patch converts some non-UTF-8 encoded text in comments to UTF-8.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
---
This patch is attached compressed to prevent my MUA from mangling it.
 Documentation/PCI/pcieaer-howto.txt |    2 +-
 arch/arm/mach-omap2/io.c            |    2 +-
 arch/s390/kernel/ebcdic.c           |   36 ++++++++++++++--------------
 drivers/hid/hid-input.c             |    2 +-
 drivers/isdn/hisax/enternow_pci.c   |    2 +-
 drivers/media/video/saa5249.c       |    2 +-
 drivers/misc/ibmasm/command.c       |    2 +-
 drivers/misc/ibmasm/dot_command.c   |    2 +-
 drivers/misc/ibmasm/dot_command.h   |    2 +-
 drivers/misc/ibmasm/event.c         |    2 +-
 drivers/misc/ibmasm/heartbeat.c     |    2 +-
 drivers/misc/ibmasm/i2o.h           |    2 +-
 drivers/misc/ibmasm/ibmasm.h        |    2 +-
 drivers/misc/ibmasm/ibmasmfs.c      |    2 +-
 drivers/misc/ibmasm/lowlevel.c      |    2 +-
 drivers/misc/ibmasm/lowlevel.h      |    2 +-
 drivers/misc/ibmasm/module.c        |    2 +-
 drivers/misc/ibmasm/r_heartbeat.c   |    2 +-
 drivers/misc/ibmasm/remote.h        |    2 +-
 drivers/misc/ibmasm/uart.c          |    2 +-
 drivers/s390/ebcdic.c               |   36 ++++++++++++++--------------
 drivers/scsi/jazz_esp.c             |    2 +-
 drivers/spi/omap2_mcspi.c           |    2 +-
 drivers/usb/storage/cypress_atacb.c |    2 +-
 drivers/video/omap/rfbi.c           |    2 +-
 drivers/video/omap/sossi.c          |    2 +-
 26 files changed, 60 insertions(+), 60 deletions(-)
Is this really needed, Adrian? I mean, everyone reads iso-8859-1, not everyone reads UTF-8. Now I get random crappy chars which cripple my xterms when reading such comments, and I have to do a full reset once I've read them.
It's not as if it was *that* important, and to be honest, if you had not sent this patch, I would not even have known that non-ASCII characters were there. However, it will quickly get annoying if a recursive grep returns those pesky codes on non-compatible consoles... Quite frankly, it does not bring anything beyond trouble.
I'm not adding a NAK here because I find this rude, but I don't like the direction we're taking with the sources. We should not force people to install version X or Y of a particular system just to read sources. In fact, I would rather have converted accented chars to their ASCII equivalents, to be more friendly to people who only read 7-bit.
Regards,
Willy
--
"Everyone" who speaks a Western European language, perhaps; and even then, mostly because a lot of tools still have a "oh, it's not valid UTF-8, guess iso-8859-1" mode. The most common instance of non-ASCII characters in Linux kernel code are people's names, and there are plenty of names which aren't representable in either ASCII or iso-8859-1. The debate on this was years ago, and the consensus was to migrate to UTF-8; however, the salient information should be expressed in the ASCII character set unless impossible. -hpa --
Or simply because people have not migrated all their installs, or have explicitly disabled UTF-8 a few hours after starting to use it, once they discovered the mess it caused and the poor support from the ...
And do we really consider that people's names in *comments* cannot be converted to pure ASCII? I'm Western European and have always been against accents in comments (another reason to write comments in English, BTW). Unix and the Internet have lived without accents for almost 30 years without anyone really bothering. And now we try to put them everywhere (even in domain names, implying big security issues) and it causes real annoyances. People's names have not changed in 30 years, so I guess that the rules used during this ...
Willy
--
For some languages, it's considered acceptable, for others it's considered major corruption. -hpa --
Non-ancient distributions default to UTF-8 and have tools that handle it
fine.
Accents are very rare in names in the kernel.
Most non-ASCII characters are umlauts and there's no sane way to
express them in ASCII (and the vowels without umlaut are pronounced
quite differently and might even make names look very strange).
And that's only within European languages, outside it becomes even
The comments in the kernel were converted to UTF-8 quite some time
ago; what I'm fixing with my patch is just some recent non-UTF-8 stuff
that crept in.
And names in comments in the kernel have not been pure ASCII since very
early on; they were in other charsets.
Mostly iso-8859-1, but not all of them.
I remember that for one name we first guessed which character it was and
then tried to figure out which charset it was in (no, it was not one
of iso-8859-*).
So it was not "ASCII -> UTF-8", it was
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
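The "guess the character first, then work out the charset" approach described above can be sketched as follows. The candidate list, the byte value and the helper name charsets_matching are all assumptions made for the example; the thread does not say which name or which charset was actually involved.

```python
# Hypothetical candidate list; any codec names known to Python would do.
CANDIDATES = ["iso-8859-1", "iso-8859-2", "iso-8859-15",
              "cp437", "cp850", "cp1252", "koi8-r"]


def charsets_matching(raw, expected, candidates=CANDIDATES):
    """Return the candidate charsets in which `raw` decodes to `expected`."""
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except UnicodeDecodeError:
            # The byte is not defined in this charset at all.
            continue
    return matches


# Made-up case: a single byte 0x81 that we believe should be "ü".
print(charsets_matching(b"\x81", "\u00fc"))
```

A result with no iso-8859-* entry corresponds to the situation described in the message: the byte only makes sense in some other charset.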
Well, I accidentally used a freshly installed laptop running Mandriva 2008. I was typing in a terminal inside KDE (I don't know the program name, sort of an xterm, but with huge borders all around). I made a typo in a word and typed in an "é" (e acute). Pressing backspace to fix it showed me that I removed more chars than I had typed. I tried again: pressing this letter 5 times, then backspace 10 times. I removed 5 chars from the prompt. I suspect that if I had used some chars with a wider encoding (e.g. 4 bytes), I could have removed as many... Clearly those tools are not ready.
Also, I recently upgraded one machine from 2.6.22 to 2.6.25. Same crappy behaviour on the console (with bash). I quickly set the vt.defaults on the kernel command line to fix the problem.
At this stage, I'm not even trying to "fix" the problem, as it's a philosophical debate and I do not want to enter it. Some people consider it normal that we break user-space applications and that it's obvious that all userland code has to be replaced to remain compatible with "evolutions", and I simply do not support this principle. I just care about having the ability to disable the broken behaviour.
Most of the problem comes from the variable-length characters causing wrapping lines and misplaced tabs when read in non-UTF-8-aware editors and/or terminals. The rest of the problem with the terminal going mad could have been caused by ...
Agreed, but it's been done for *years*. I received mails from people spelled "jorn" or "jurgen" and they had no trouble using that spelling ...
I would have loved to see "several different charsets -> ASCII".
Willy
--
So don't use that particular tool, and/or file a bug with the maintainer. :-)
I have used utf-8 for years - the fact that some editors and some terminal emulators fail is not a problem for me. There are so many that work just fine. There is unicode xterm, and rxvt if you consider xterm too heavy. Both vi and emacs have versions that handle utf-8 competently.
You may have to put in a one-off effort in finding a suitable font for your xterm, if you actually want to see proper umlauts in all cases. If you don't care about looks, then xterm will display blanks/squares and backspace etc. will ...
Outside the English-speaking world, userland _was_ completely broken in the day of ascii. And supporting the multiple iso8859-xx encodings was completely broken too, if you ever needed more than one of them. Unicode gives userland an opportunity to actually work decently for the first time.
Now, ascii may be fine if C development is all you ever use the machine for. You can mangle a few names in comments - some people won't like that at all, some won't care. But try using the same machine for writing a business letter without a proper character set. You won't be taken seriously. Or even a non-English gui app with ascii-only menus.
If you want to know what it is like, knock three vowels or so out of the English alphabet. Consider them not supported. Invent "transcriptions" if you like. Try writing a letter that way! Or even kernel code with informative comments.
Consider the alternative - disable the broken behavior by using a tool that handles UTF-8. There are certainly enough aware apps/tools for ...
It has been done for years because there was no other choice. If you wanted to work in unix, just forget your own name! Now there is a choice. Some people still don't care and are fine with "jorn" and such. Some are pissed off, take offense, stick to windows, or simply put unicode into kernel comments.
If your mailer doesn't support utf-8, chances are you get some mail ...
Lots of ...
> Outside the english-speaking world, userland _was_ completely
(American)
Formal UK English uses accented characters for some foreign imports (eg café), ï for words like naïve, and if you are really pretentious you need the æ symbol for words like mediæval, although for modern writing this is considered silly.
The bash problem btw should have been fixed (if it is bash causing it) as of 2.05b and readline 4.3. If it's being caused by the KDE terminal that would surprise me, but it might be worth filing a bug.
Alan
--
It was not my machine, and had you been there, you would have heard me call ...
It's too easy to impose crappy designs on end-users and tell them that if that does not work they have to file a bug. There is a minimal set of things that must be tested before shipping. Seeing that the default terminal emulator in KDE on Mandriva 2008 is configured in UTF-8 and does not properly render it simply makes me sick. This is broken by design, and even distros trying to get it working for years still can't cope with it.
I don't care about the *look*. Mutt shows me a question mark when it does not know. I care about the *behaviour*. Having backspace go back farther than the prompt is not acceptable. Having 80-col lines span over two lines ...
yes but you just had unexpected characters. Just like MS-DOS when ...
Unicode yes, UTF-8 no. UTF-8 is a compressed encoding of unicode. That's as silly as if you had to replace your terminals to read native gzip, and expect them as well as all the tools to work ...
Well, booting 2.6.25 with "init=/bin/bash" results in backspace eating the prompt after pressing accented letters. Even the control chars have been correctly handled on many UNIXes for decades!
The real problem with this crap is that it is viral: "replace all userland applications or die alone on your island". Then "ah, your applications behave in a funny manner, well that may be because of UTF-8, but that is not important, just wait for the update". I'm not even speaking about the security implications it has on a lot of tools, starting with regex ...
Funny that you mention Windows. Windows has been using 16-bit unicode for a long time without problems. It's a clean encoding. Like it or not. Since they have started using UTF-8, bare windows users have started telling me that there are often bizarre characters in texts instead of accents. That most often happens in forwarded mails. so they get hit ...
Once again, I don't care about the strange looking, just about the ...
You know why we got this ...
I would describe the UCS-2 situation as a disaster area - embedded nuls
causing breakage, inability to represent the full unicode space and
Actually it was primarily designed to make moving encoding painless so
that ascii still worked and C properties like \0 plus traditional
screen supports the needed transliteration for you.
Alan
--
"Having worked in a university for more than twenty years after leaving
industry, I had become unused to seeing management skill routinely
exercised, universities being administered rather than managed"
-- Peter Checkland
--
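The "embedded nuls" point above can be made concrete with a small Python sketch. The helper c_string_length is hypothetical and only mimics what a NUL-terminated C string routine would see; "utf-16-le" is used as a stand-in for little-endian UCS-2, which produces the same bytes for text inside the Basic Multilingual Plane.

```python
def c_string_length(buf):
    """Bytes a NUL-terminated C string routine would see before stopping."""
    end = buf.find(b"\x00")
    return len(buf) if end < 0 else end


text = "na\u00efve"                 # "naïve"
utf8 = text.encode("utf-8")
ucs2 = text.encode("utf-16-le")     # same bytes as UCS-2 LE for BMP-only text

# UTF-8: no zero bytes, so C string handling sees the whole buffer.
print(len(utf8), c_string_length(utf8))
# 16-bit encoding: the high byte of every ASCII character is a NUL.
print(len(ucs2), c_string_length(ucs2))
```

In UTF-8 a zero byte only ever encodes U+0000 itself, which is what lets "\0"-terminated strings keep working.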
The console yes (by default until I disabled it to restore correct behaviour). The shell no, it was the one present on my machine and has never been compiled with UTF-8 support, and should not have to be. If we say that starting with 2.6.24, we're explicitly breaking compatibility with old userland, fine. But that was not explicitly stated.
In my opinion, the problem is that when I press "é", the system sends two chars to bash, which itself sends two chars to the terminal, which only displays one and moves the cursor one step ahead. Then, pressing backspace once sends one backspace all along, resulting in the terminal blanking one displayed char, but the shell not being aware that only half of it was removed.
But if you look at how control chars are handled, if you display ^H then press backspace, you remove all of it. It's the terminal which adjusts the position depending on the character length. So in my opinion, when we send one backspace to the terminal to remove one character, since there are two in the buffer, we should not get back one full char. Ideally, the console driver should send as many backspaces as needed to fix the multiple characters that were emitted. It's not logical at all that if we send 3 chars to a process with one key, sending a cancellation of those chars only sends one backspace.
You see, that's really what I hate with this encoding. Every stage relies on the next one to do the fixup. And of course, a ...
But at least, there is no feeling of having it working. You immediately ...
I cannot imagine how one can believe that something which transcodes one char as a series of 1-to-4 chars will be a painless move. A lot of code ...
Willy
--
Bizarre, so you are using deliberately misconfigured ancient userspace to ...
The shell puts the terminal in character by character mode and readline does this. If you have your shell/readline deliberately set up not to be ...
The console driver isn't involved - readline took over for the shell, and readline most definitely supports this in a utf8 locale.
Alan
--
Hi Alan,
No, I'm not using anything deliberately misconfigured. I'm trying to explain that, on the contrary, any tool which has not been explicitly adapted to those ...
Please, I'm not "deliberately" setting my tools *not* to support unicode. I have tools which have worked for years and which are now asked to behave ...
OK, I could reproduce the case without ever involving either a shell or readline or anything. Using "cat" as the init program exhibited the anomaly, though it was not very easy to analyze. Then I switched to "init=od -An -tx1 -".
1) if I enter "A" then press backspace, I get nothing. Pressing enter 16 times flushes the line buffer and "od" prints 16 times "0a", indicating nothing was remaining in the buffer.
2) if I enter Ctrl-V Ctrl-A, my display prints "^A", and when I press backspace, I correctly get the cursor back two chars. Once again, flushing the buffer with enter shows it was empty.
3) if I enter Alt-196, I get a "Ä". Flushing the buffer shows that od got two bytes: c3 84.
4) now if I enter Alt-196 and press backspace, my "Ä" is removed by the backspace, but only the second byte is flushed from the line buffer. Then, if I press enter 15 times, I get a line with c3 0a 0a 0a ...
And there is no user-land involved here. I'm really hoping you understand the problem better now. Pressing backspace to fix input does not correct the input with multi-byte chars, it leaves incomplete start sequences. If I press Alt-1111111, then backspace, I get f4 8f 91 0a 0a 0a 0a because it is f4 8f 91 87 minus one byte. Of course, pressing Backspace multiple times removes them all, but it also removes previous characters on the display.
Another experiment: I press 01234, then Alt-255, Backspace, then 56789. On the display, I have 0123456789. od gets 30 31 32 33 34 c3 35 36 37 38 39. Now if I want to correctly fix the input, I have to press backspace twice, but then I have to make the '4' disappear from my display, while knowing ...
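The byte values reported in this experiment can be reproduced with a small simulation of a line buffer. This is a sketch under stated assumptions, not the kernel's tty code: erase_bytewise and erase_utf8 are hypothetical helpers that model an erase which removes one byte versus one that removes a whole UTF-8 sequence.

```python
def erase_bytewise(buf):
    """Erase as a non-UTF-8-aware line editor would: drop one byte."""
    if buf:
        buf.pop()


def erase_utf8(buf):
    """Erase one whole UTF-8 character: drop trailing continuation
    bytes (bit pattern 10xxxxxx), then the lead byte."""
    while buf and (buf[-1] & 0xC0) == 0x80:
        buf.pop()
    if buf:
        buf.pop()


# Case 4 above: Alt-196 is U+00C4, encoded as c3 84.
line = bytearray(chr(196).encode("utf-8"))
erase_bytewise(line)
print(line.hex())          # the stray lead byte that od reported

# Alt-1111111 is a four-byte sequence; one byte-wise erase leaves three.
line = bytearray(chr(1111111).encode("utf-8"))
print(line.hex())
erase_bytewise(line)
print(line.hex())

# The same input with a character-aware erase leaves nothing behind.
line = bytearray(chr(1111111).encode("utf-8"))
erase_utf8(line)
print(line.hex())
```

The difference between the two helpers is the whole bug report in miniature: the display removes one glyph, while a byte-wise buffer removes one byte.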
Did you put the console into utf-8 mode before the cat ? --
I had not *explicitly* disabled it, since as the doc suggests:
vt.default_utf8=
[VT]
Format=<0|1>
Set system-wide default UTF-8 mode for all tty's.
Default is 1, i.e. UTF-8 mode is enabled for all
newly opened terminals.
And I know that I can fix the behaviour by explicitly setting it to zero.
Also, the fact that "od" shows me multi-byte characters on the input
indicates to me that everything is set to UTF-8. So unless I'm missing
something, my console is set by default to UTF-8 (I test this on 2.6.25).
Regards,
Willy
--
Yes, there is apparently a real bug here: this vt setting doesn't propagate to the tty layer iutf8 flag. -hpa --
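For anyone wanting to check the iutf8 flag on their own terminal, here is a minimal Python sketch. It assumes a Linux system; the numeric fallback for IUTF8 is the value from the Linux termios headers and is an assumption for Python builds that do not export the constant. The helper names are made up for the example.

```python
import sys
import termios

# Input flag telling the tty line discipline that input is UTF-8.
# Fallback value is an assumption taken from the Linux headers.
IUTF8 = getattr(termios, "IUTF8", 0o40000)


def iutf8_enabled(iflag):
    """True if the IUTF8 bit is set in a termios input-flag word."""
    return bool(iflag & IUTF8)


def with_iutf8(iflag):
    """Return the input-flag word with IUTF8 switched on."""
    return iflag | IUTF8


def stdin_iutf8():
    """Report IUTF8 for stdin, or None when stdin is not a terminal."""
    if sys.stdin is None or not sys.stdin.isatty():
        return None
    iflag = termios.tcgetattr(sys.stdin.fileno())[0]  # index 0 is the input flags
    return iutf8_enabled(iflag)


print("stdin IUTF8:", stdin_iutf8())
```

If the flag reads as off on a console that is otherwise in UTF-8 mode, that would match the mismatch described above.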
export LANG=en_US.UTF-8 (i.e., inform the userspace that you are using UTF-8), unset LC_CTYPE and unset LC_ALL (so that they don't override $LANG), and problem solved. -- Alexander E. Patrakov --
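The reason for unsetting LC_CTYPE and LC_ALL is the usual lookup order for the character-type locale: LC_ALL first, then LC_CTYPE, then LANG. A small sketch of that order (the function effective_ctype is hypothetical, and the environments below are made-up examples):

```python
def effective_ctype(env):
    """Return the locale that decides character handling for `env`."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:               # unset or empty counts as "not specified"
            return value
    return "C"                  # default when nothing is set


# LANG says UTF-8, but a leftover LC_CTYPE silently wins.
print(effective_ctype({"LANG": "en_US.UTF-8", "LC_CTYPE": "en_US"}))
# After unsetting the overrides, LANG takes effect.
print(effective_ctype({"LANG": "en_US.UTF-8"}))
```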
Not to mention the fact that UCS-2 ran out of code points almost as soon as they said "no more codepoints." The result was UTF-16, a hideous abortion which took all the problems with wide encodings, combined it with all the problems of multibyte encodings, and added a few new ones for good measure. -hpa --
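To illustrate why UTF-16 is both "wide" and "multibyte", the sketch below lists the 16-bit code units of two characters. The helper utf16_units is hypothetical, and the sample characters are arbitrary choices.

```python
def utf16_units(ch):
    """Return the 16-bit code units of `ch` in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A character inside the 16-bit range takes one unit.
print([hex(u) for u in utf16_units("\u00e9")])
# A character beyond U+FFFF needs a surrogate pair, i.e. two units.
print([hex(u) for u in utf16_units("\U0001D11E")])
```

So code that assumed "one 16-bit unit per character" under UCS-2 is wrong under UTF-16 in the same way byte-oriented code is wrong under UTF-8.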
I can reproduce your problem in a plain xterm when setting LANG=en_US
(most likely the same problem can occur with other non UTF-8 settings).
In this case I'm actually more surprised that the character is displayed
correctly than that you have to type backspace twice.
Any kind of charset mixing is highly problematic (which is also why my
patch was attached compressed), so if you disable UTF-8 anywhere in a
modern distribution problems are somehow expected (it could also be a
It's not a compressed encoding, it's a variable-length encoding.
Besides the size advantages one main advantage of UTF-8 is that ASCII is
valid UTF-8. This means that for the ASCII source code in the kernel it
doesn't matter whether it's treated as ASCII or UTF-8, and no conversion
was needed.
cu
Adrian
--
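The "variable-length, not compressed" distinction can be seen directly from the encoded bytes. A small sketch with arbitrarily chosen sample characters; utf8_hex is a made-up helper name.

```python
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001D11E"]   # 1, 2, 3 and 4 bytes


def utf8_hex(ch):
    """Hex dump of one character's UTF-8 encoding."""
    return ch.encode("utf-8").hex()


for ch in SAMPLES:
    print("U+%04X -> %s" % (ord(ch), utf8_hex(ch)))

# Pure ASCII bytes decode to the same text under either codec.
ascii_source = b"/* plain comment */ int x = 0;\n"
same = ascii_source.decode("ascii") == ascii_source.decode("utf-8")
print(same)
```

Each code point maps to a fixed byte sequence by a bit-layout rule; nothing depends on neighbouring text, which is what separates this from compression.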
It's not that I *had* to type it twice. But I *could* type it twice, and ...
No, it was not disabled at all. I had to type in a command for a co-worker who just did a default install the day before, and typed a ...
I don't agree. If you refuse character-set mixing, there's no problem. Bit 7 of first char == 1 ? => full text is 32 bit.
Willy
--
You miss my point.
The point is:
A conversion "ASCII -> UTF-8" is a nop.
This means when changing the kernel from half a dozen charsets used in
comments to UTF-8 we only had to change the few characters actually
containing non UTF-8.
Going to something like UTF-32 as you suggest would have involved
cu
Adrian
--
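The "nop" argument can be checked with a short sketch. Both sample lines are invented for the example (they are not taken from the patch), and reencode is a hypothetical helper.

```python
def reencode(raw, source_encoding, target_encoding):
    """Decode bytes from one charset and encode them in another."""
    return raw.decode(source_encoding).encode(target_encoding)


ascii_line = b"static int example_init(void)\n"                 # made-up code line
latin1_line = "/* Written by J\u00f6rn */\n".encode("iso-8859-1")  # made-up comment

# ASCII -> UTF-8 leaves every byte untouched.
unchanged = reencode(ascii_line, "ascii", "utf-8") == ascii_line
# iso-8859-1 -> UTF-8 only grows by the one non-ASCII character.
utf8_growth = len(reencode(latin1_line, "iso-8859-1", "utf-8")) - len(latin1_line)
# A fixed 32-bit encoding rewrites every character, ASCII included.
utf32_size = len(reencode(ascii_line, "ascii", "utf-32-le"))

print(unchanged, utf8_growth, len(ascii_line), utf32_size)
```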
Yeah, ascii-only is a crappy design. :-/
I don't know if mandriva is broken by design - I only use debian. It would not surprise me if some distros botch utf-8 through negligence. They are based in English-speaking countries and have their biggest user bases there - the majority of their customers aren't going to use more than ascii so why should they bother. Someone made a "cool" terminal emulator? Transparency and effects? Distribute it, despite the fact that it won't work in all cases. The distro contains xterm anyway for those that need a fallback. Machine owner thinks one terminal emulator is enough and ...
I don't see how wrong characters are better than backspace eating the prompt or 80-col overflowing when it shouldn't. It is all breakage either way. Stuff breaks if TERM is set wrong for the terminal in use too, or if the app in use doesn't _use_ the TERM variable. This happens too, and you only notice if the app runs on a terminal incompatible with TERM=linux.
Amusing and accurate. I use Norwegian which has 3 non-ascii vowels. As well as some accented characters, but they don't crop up in _every other ...
It had to be done in an ascii-compatible way. That way, a userland containing a mix of ascii-only apps, fully utf-8 supporting apps, and apps with partial utf-8 support will work flawlessly for ascii-only stuff. Like C source and English language tools. Of course utf-8 only works in the apps supporting it, but utf-8 users keep fixing this in the apps they need.
Breaking ascii compatibility was not an option, because that means replacing the entire userland in one operation. That cannot be done unless a single authority controls everything, and the open source world isn't like that.
Variable length encoding is necessary, given that:
* Ascii should work as before, i.e. one "char" per ascii character
* One single encoding so a plain text file can contain the symbols of any writing system in use. There are way more than 256 symbols.
No, I don't have a utf-8 ...
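One property behind the "ascii-only tools keep working" claim is that no byte of a multi-byte UTF-8 sequence falls in the ASCII range, so a byte-wise search for an ASCII token cannot match inside a non-ASCII character. A sketch, with an invented comment line and a hypothetical helper name:

```python
def multibyte_bytes_are_non_ascii(text):
    """Check that every byte of every multi-byte character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


print(multibyte_bytes_are_non_ascii("\u00e6\u00f8\u00e5 na\u00efve \u20ac"))

# A byte-oriented tool looking for ASCII tokens is unaffected by the name.
line = "/* fixed by K\u00e5gedal */ return 0;".encode("utf-8")  # made-up line
print(line.count(b"*/"), line.endswith(b"return 0;"))
```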
Mandriva is a French company.
And what Willy describes really sounds like someone fiddling with some
settings (or something like accidentally selecting some non UTF-8
locale).
Bad things can happen when you somehow get charsets mixed, but
distributions have defaulted to UTF-8 for quite some time, and problems
cu
Adrian
--
Well, we were talking about Mandriva, which is a Brazilian-French company; their main languages are Portuguese and French, so you'd think they'd notice themselves. Most likely there was something in Willy's configuration that buggered it up.
-hpa
--
This sounds as if you had UTF-8 characters in a non UTF-8 environment.
Email addresses are a different topic.
But it's not right in names, and if someone then pronounces their name
cu
Adrian
--
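What "UTF-8 characters in a non UTF-8 environment" looks like in practice can be shown in a few lines. The word is just sample data borrowed from earlier in the thread.

```python
original = "caf\u00e9"                      # "café"

# UTF-8 bytes read by something that assumes iso-8859-1:
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")
print(misread)                              # two odd characters replace the "é"

# The opposite mix: iso-8859-1 bytes read by a strict UTF-8 decoder.
latin1_bytes = original.encode("iso-8859-1")
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced)                             # the "é" becomes a replacement mark
```

Neither side is broken on its own; the garbage only appears when the writer and the reader disagree about the charset.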
Presumably, this was konsole. konsole works fine with UTF-8 (I use it that way every day); the most common cause of this kind of problem is people explicitly clobbering the locale or charset class defaults in their login scripts.
-hpa
--
Possible. It was the one you get by clicking on a terminal icon. Huuhhh what a horror, I'm discussing icons and GUIs on LKML. I must ...
I really doubt the miss would have done this. Or someone would have done it for her, which I really doubt in such a small time frame after a fresh install from the day before. I will investigate though.
Willy
--
From one of Alan's posts it sounds like there was a bug with multibyte characters in readline at some point that got fixed relatively quickly, but still made it out. -hpa --
That's a ridiculous statement. Just because you didn't bother, you can't assume that the people who were actually affected didn't bother. I went through large parts of the 1990's under the name "David K}gedal". And I bothered. And no, the second character in my last name is not an accented a, they have been separate letters for hundreds of years in Sweden. So I can live without using accented letters, as long as I can write Kågedal including the å. :-) Not that my name appears anywhere in the Linux source, but I still felt the urge to reply... -- David Kågedal --
Perhaps we should put them in Latin as well, just in case any Roman is struggling with this new language 8)
Distributions have been shipping UTF enabled by default for years and years.
Alan
--
"enabled" does not mean "working" Alan. I know one distro which I will not name in order not to offense you which shipped with it enabled by default, but which would not properly display the characters on the console, resulting in mangled messages during boot. I particularly remember the "[ECHEC]" ("[FAILED]") with random garbage instead of the Willy --
No offence taken. In fact I seem to remember filing similar bugs at the
time about rpm/popt getting its help formatting wrong in some locales (eg
Welsh) for similar reasons - but that was some time ago.
All the mainstream tools handle utf-8 just fine, joe is quite happy
editing utf-8 these days (as are the legacy vim and emacs editing
tools ;)). There really are no good reasons left not to use UTF-8.
Alan
--
> you are confusing me even more.
Of course. "I'm from IBM. I'm here to help." ;-)
-- Alan Altmark
--
Good Job!
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
AFAIK some files are already written in utf-8.
Frankly, speaking from the standpoint of a non-European:
all files are written in ascii: no problem
all files are written in iso8859-1: need to customize the editor
all files are written in utf-8: no problem
some files are written in iso8859-1, but other files are written in utf-8: Ouch! Noooooo!!
--
