On Mon, 21 Jan 2008, Kevin Ballard wrote:

You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.

The fact is, text-as-string-of-codepoints (let's make the "codepoints" obvious, so that there is no ambiguity, but I'd also like to make it clear that a codepoint *is* how a Unicode character is defined, and a Unicode "string" is actually *defined* to be a sequence of codepoints, and totally independent of normalization!) is fine.

That was never the issue at all. Unicode codepoints are wonderful.

Now, git _also_ heavily depends on the actual encoding of those codepoints, since we create hashes etc, so in fact, as far as git is concerned, names have to be in some particular encoding to be hashed, and UTF-8 is the only sane encoding for Unicode. People can blather about UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is simply technically superior in so many ways that I don't even understand why anybody ever uses anything else.

So I would not disagree with using UTF-8 at all.

But that is *entirely* a separate issue from "normalization".

Kevin, you seem to think that normalization is somehow forced on you by the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE. Normalization is a totally separate decision, and it's a STUPID one, because it breaks so many of the _nice_ properties of using UTF-8.

And THAT is where we differ. It has nothing to do with "octets". It has nothing to do with not liking Unicode. It has nothing to do with "strings".

In short:

 - normalization is by no means required or even a good feature. It's something you do when you want to know if two strings are equivalent, but that doesn't actually mean that you should keep the strings normalized all the time!

 - normalization has *nothing* to do with "treating text as octets". That's entirely an encoding issue.

 - of *course* git has to treat things as a binary stream at some point, since you need that to even compute a SHA1 in the first place, but that has *nothing* to do with normalization or the lack of it.

Got it? Forced normalization is stupid, because it changes the data and removes information, and unless you know that change is safe, it's the wrong thing to do.

One reason _not_ to do normalization is that if you don't, you can still interact with no ambiguity with other non-Unicode locales. You can do the 1:1 Latin1<->Unicode translation, and you *never* get into trouble. In contrast, if you normalize, it's no longer a 1:1 translation any more, and you can get into a situation where the translation from Latin1 to Unicode and back results in a *different* filename than the one you started with!

See? That's a *serious*problem*. A system that forces normalization BY DEFINITION cannot work with people who use a Latin1 filesystem, because it will corrupt the filenames!

But you are apparently too damn stupid to understand that "data corruption" == "bad", and too damn stupid to see that "Unicode" does not mean "Forced normalization".

But I'll try one more time. Let's say that I work on a project where there are some people who use Latin1, and some people who use UTF-8, and we use special characters. It should all work, as long as we use only the common subset, and we teach git to convert to UTF-8 as a common base. Right?

In your *idiotic* world, where you have to normalize and corrupting filenames is ok, that doesn't work! It works wonderfully well if you do the obvious 1:1 translation and you do *not* normalize, but the moment you start normalizing, you actually corrupt the filenames!

And yes, the character sequence 'a¨' is exactly one such sequence.
It's perfectly representable in both Latin1 and in UTF-8: in Latin1 it is a two-character '\x61\xa8', and when doing a Latin1->UTF-8 conversion, it becomes '\x61\xc2\xa8', and you can convert back and forth between those two forms an infinite number of times, and you never corrupt it.

But the moment you add normalization to the mix, you start screwing up. Suddenly, the sequence '\x61\xa8' in Latin1 becomes (assuming NFD) '\xc3\xa4' in UTF-8, and when converted back to Latin1, it is now '\xe4', ie that filename has been corrupted!

See? Normalization in the face of working together with others is a total and utter mistake, and yes, it really *does* corrupt data. It makes it fundamentally impossible to reliably work together with other encodings - even when you do conversion between the two!

[ And that's the really sad part. Non-normalized Unicode can pretty much be used as a "generic encoding" for just about all locales - if you know the locale you convert from and to, you can generally use UTF-8 as an internal format, knowing that you can always get the same result back in the original encoding.

  Normalization literally breaks that wonderful generic capability of Unicode.

  And the fact that Unicode is such a "generic replacement" for any locale is exactly what makes it so wonderful, and allows you to fairly seamlessly convert piece-meal from some particular locale to Unicode: even if you have some programs that still work in the original locale, you know that you can convert back to it without loss of information.

  Except if you normalize. In that case, you *do* lose information, and suddenly one of the best things about Unicode simply disappears.

  As a result, people who force-normalize are idiots. But they seem to also be stupid enough that they don't understand that they are idiots. Sad.

  It's a bit like whitespace.
  Whitespace "doesn't matter" in text (== is equivalent), but an email client that force-normalizes whitespace in text is a really *broken* email client, because it turns out that sometimes even the "equivalent" forms simply do matter. Patches are text, but whitespace is meaningful there.

  Same exact deal: it's good to have the *ability* to normalize whitespace (in email, we call this "text=flowed" or similar), and in some cases you might even want to make it the default action, but *forcing* normalization is total idiocy and actually makes the system less useful! ]

		Linus
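
The round-trip argument in the message can be checked with a short sketch. This is not code from the thread: the helper names (latin1_to_utf8, utf8_to_latin1, survives_round_trip) are made up for illustration, and it relies on Python's standard unicodedata module. Which sequences a given normalization form rewrites depends on the Unicode data tables that ship with the interpreter, so the sketch tries several forms instead of assuming one. With the tables known to the editor, the compatibility forms rewrite the spacing diaeresis in 'a¨', and NFD splits a precomposed Latin1 character such as '\xe4' into a base letter plus a combining mark that Latin1 cannot encode.

```python
import unicodedata

# The two-character name from the message: 'a' followed by the spacing
# diaeresis, which is byte 0xA8 in Latin1.
latin1_name = b"\x61\xa8"


def latin1_to_utf8(raw: bytes) -> bytes:
    """Plain 1:1 transcoding from Latin1 to UTF-8, no normalization."""
    return raw.decode("latin-1").encode("utf-8")


def utf8_to_latin1(raw: bytes) -> bytes:
    """Plain 1:1 transcoding back; raises if a codepoint is not in Latin1."""
    return raw.decode("utf-8").encode("latin-1")


def survives_round_trip(raw: bytes, form: str) -> bool:
    """Transcode to UTF-8 with forced normalization, then back to Latin1.

    Returns True only if the original bytes come back unchanged.
    """
    normalized = unicodedata.normalize(form, raw.decode("latin-1"))
    try:
        return utf8_to_latin1(normalized.encode("utf-8")) == raw
    except UnicodeEncodeError:
        # Normalization produced a codepoint that Latin1 cannot represent.
        return False


# Without normalization the conversion is lossless in both directions.
utf8_name = latin1_to_utf8(latin1_name)
round_trip = utf8_to_latin1(utf8_name)
print(utf8_name)                   # b'a\xc2\xa8'
print(round_trip == latin1_name)   # True

# With forced normalization the result depends on the form and the input.
for name in (latin1_name, b"\xe4"):
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        print(name, form, survives_round_trip(name, form))
```

The point of the loop is that a converter which normalizes has to be audited per form and per input, while the plain transcoding at the top needs no such care.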
