On Thu, 17 Jan 2008, Johannes Schindelin wrote:

One of the advantages (the biggest one, in fact, apart from the obvious US-ASCII down-compatibility and the fact that you can do C-compatible NUL-terminated strings) of UTF-8 is that it's locale-independent, and doesn't care about LANG, because it's valid in all languages.

And that's really important. It's important for a very simple reason: there is almost never such a thing as "a locale" except for US-ASCII. Once you move away from US-ASCII, it actually tends to be much more common that you have a *mixture* of locales - often in the same "document" - than to have one single locale.

It very much happens even in filenames - people "mix" locales in trivial ways even within a single pathname component (non-US-ASCII filename, but with a regular file extension), but much more interestingly they do so within a directory tree (ie you have translation subdirectories where the filenames themselves are in another language, and you can have full pathnames where different components are in different languages, for example).

And UTF-8 is _wonderful_ for this, because LANG doesn't matter, and cannot matter, and thus mixing isn't a problem.

Of course, you can screw it up. Locales still can change things like sort order and capitalization etc, so even if you use UTF-8, you sure can get into trouble with LANG and thinking that a per-session locale makes sense.

So choosing UTF-8 for the filesystem isn't wrong per se. It's a fine choice, and has no issues with LANG in itself.

Limiting it to strictly valid UTF-8 encodings is also fine. Limiting it (further) to only character normalized UTF-8 is also fine.

Most Linux filesystems don't limit it in any way, so you can make filenames that aren't valid UTF-8 at all, much less normalizing multi-character sequences.
I personally think that's the best option, but I probably do so mostly because I know some people still use Latin1 as their only locale (and I suspect Asia will take decades before it has converted to UTF-8 and will also have cases where they use other non-UTF locales).

But enforcing clean UTF-8 is not a bad idea per se. Not allowing byte sequences that aren't a valid UTF-8 encoding (eg \xc0\xc0 is not a valid UTF-8 character) is fine. I wouldn't call people crazy for doing that, although it does mean that you cannot, for example, decide to write a Latin1 filename (which is not necessarily a *good* idea in this day and age, but I think there's a difference between "that's not a good idea" and "you cannot do that").

And even limiting the UTF-8 charset further to only the minimal representation of one particular glyph (ie not allowing multi-character sequences that can be represented more simply) may be even *more* big-brother, but would at least not cause the technical aliasing issues. I personally think that's so controlling as to be stupid (and has no real advantage), but hey, at least it doesn't *corrupt* anything silently.

So I think that using UTF-8 as a character encoding is a *good* thing to do, and that automatically means that LANG shouldn't matter for filenames, but within that choice of UTF-8 there are still mistakes that you can make. Notably multi-character normalization and case-insensitivity.

Linus
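
To make the two distinctions above concrete, here is a minimal Python sketch. It is only an illustration: the helper name is_valid_utf8 and the sample filename are assumptions made up for this example, not anything taken from git or from a filesystem. It shows that \xc0\xc0 and a Latin1-encoded name are rejected by a strict UTF-8 decoder, and that one visible glyph can have two different, equally valid UTF-8 byte sequences, which is where the aliasing comes from.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes as strictly valid UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The byte sequence mentioned in the text: 0xC0 can never appear in UTF-8.
bad_bytes = b"\xc0\xc0"

# A hypothetical filename written in Latin1: 0xE9 followed by ASCII.
latin1_name = "\u00e9.txt".encode("latin-1")

# The same glyph as one precomposed code point, and as a base letter
# followed by a combining accent (a "multi-character sequence").
composed = "\u00e9"
decomposed = "e\u0301"
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(is_valid_utf8(bad_bytes))
print(is_valid_utf8(latin1_name))
print(is_valid_utf8(composed_bytes), is_valid_utf8(decomposed_bytes))
print(composed_bytes == decomposed_bytes)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

A filesystem that stores the bytes untouched keeps the two spellings as two distinct names. One that normalizes on the way in folds them into a single name, so the bytes a program reads back may not be the bytes it wrote.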
