On Jan 16, 2008, at 5:32 PM, Linus Torvalds wrote:

You're right, it doesn't actually have to store the normalized form. And yes, it's possible to compare without normalizing them. Admittedly, I don't know much about the implementation details of unicode, but I would assume that the easiest way to compare two strings is to normalize them first. But in the case of the filesystem, normalization actually is important if you're thinking about filenames in terms of characters rather than bytes. When I feed the filesystem a given unicode string, it has to find the file I'm talking about - should it do a relatively expensive unicode-sensitive comparison of all the filenames with the one I gave it, or should it just normalize all names and do the much cheaper lookup that way? I don't know about you, but I'd prefer to let my filesystem normalize the name and run faster.

There's a difference between "looks similar" as in "Polish" vs "polish", and actually is the same string as in "Ma<UMLAUT MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid semantic meaning, normalization doesn't. The only way to argue that normalization is wrong is by providing a good reason to preserve the exact byte sequence, and so far the only reason I've seen is to help git. Applications in general don't care one whit about the byte sequence of the filename, they care about the underlying file the name represents.

Additionally, it would be a terrible experience for a user to enter "Märchen" and have the application say "sorry, I can't find this file" simply because the application used decomposed characters and the filename used composed characters. Unless the user is knowledgeable about the OS, filesystems, and unicode, they wouldn't have a hope of figuring out what the problem was.

How do you figure? When I type "Märchen", I'm typing a string, not a byte sequence. I have no control over the normalization of the characters. Therefore, depending on what program I'm typing the name in, I might use the same normalization as the filename, or I might miss. It's completely out of my control. This is why the filesystem has to step in and say "You composed that character differently, but I know you were trying to specify this file".

There are valid reasons for case to matter, but what reason is there for "single character" vs "two character overlay" to matter in filenames? They're different representations of the exact same string, and that's what a filename is - a string.

It seems like your arguments stem from the assumption that the user cares about the byte sequence that represents the filename, which is wrong. The user has no idea what the byte sequence is - the user cares about the string. Normalization is meant to help computers, not users, and claiming that different normalizations of the same string produces different meaningful strings is complete bunk.

If you were to have two different files on your system, both of them called "Märchen", but one precomposed and one decomposed, how would you specify which one you wanted? Unless Linux has a special text input system which gives the user control over the normalization of their typed characters, you'd have to write out the UTF-8 bytes manually.

I just don't understand this insistence on treating the specific byte sequence that makes up the filename as significant.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
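
For readers who want to see the composed/decomposed distinction concretely, here is a minimal Python sketch using the standard `unicodedata` module. It is only an illustration of the idea discussed above, not how any particular filesystem or git implements it. The helper names (`same_name`, `build_index`, `lookup`) are hypothetical, and the choice of NFC as the canonical form is an assumption made for the example.

```python
import unicodedata

# Two spellings of the same visible string.
composed = "M\u00e4rchen"     # a-umlaut as one precomposed code point
decomposed = "Ma\u0308rchen"  # "a" followed by COMBINING DIAERESIS


def same_name(a: str, b: str) -> bool:
    """Compare two names as strings, ignoring how characters were composed.

    NFC is an arbitrary choice here (assumption); any single normalization
    form applied to both sides would work equally well for the comparison.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def build_index(names):
    """Hypothetical directory index: normalized name -> name as stored."""
    return {unicodedata.normalize("NFC", n): n for n in names}


def lookup(index, name):
    """Normalize the query once, then do a plain dictionary lookup."""
    return index.get(unicodedata.normalize("NFC", name))


if __name__ == "__main__":
    # The raw strings and their UTF-8 bytes differ...
    print(composed == decomposed)
    print(composed.encode("utf-8"), decomposed.encode("utf-8"))
    # ...but they compare equal once normalized.
    print(same_name(composed, decomposed))
    # Case is a different matter: normalization does not fold it.
    print(same_name("Polish", "polish"))
    # A name stored decomposed is still found by a composed query.
    index = build_index([decomposed])
    print(lookup(index, composed) == decomposed)
```

The index keeps the stored name as the value, so the sketch normalizes only for comparison and lookup while leaving the original byte sequence recoverable.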
