On Jan 16, 2008, at 5:32 PM, Linus Torvalds wrote:

You're right, it doesn't actually have to store the normalized form. And yes, it's possible to compare without normalizing them. Admittedly, I don't know much about the implementation details of Unicode, but I would assume that the easiest way to compare two strings is to normalize them first. But in the case of the filesystem, normalization actually is important if you're thinking about filenames in terms of characters rather than bytes. When I feed the filesystem a given Unicode string, it has to find the file I'm talking about - should it do a relatively expensive Unicode-sensitive comparison of all the filenames with the one I gave it, or should it just normalize all names and do the much cheaper lookup that way? I don't know about you, but I'd prefer to let my filesystem normalize the name and run faster.

There's a difference between "looks similar" as in "Polish" vs "polish", and actually is the same string as in "Ma<UMLAUT MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid semantic meaning, normalization doesn't. The only way to argue that normalization is wrong is by providing a good reason to preserve the exact byte sequence, and so far the only reason I've seen is to help git. Applications in general don't care one whit about the byte sequence of the filename, they care about the underlying file the name represents. Additionally, it would be a terrible experience for a user to enter "Märchen" and have the application say "sorry, I can't find this file" simply because the application used decomposed characters and the filename used composed characters. Unless the user is knowledgeable about the OS, filesystems, and Unicode, they wouldn't have a hope of figuring out what the problem was.

How do you figure? When I type "Märchen", I'm typing a string, not a byte sequence. I have no control over the normalization of the characters. Therefore, depending on what program I'm typing the name in, I might use the same normalization as the filename, or I might miss. It's completely out of my control. This is why the filesystem has to step in and say "You composed that character differently, but I know you were trying to specify this file".

There are valid reasons for case to matter, but what reason is there for "single character" vs "two character overlay" to matter in filenames? They're different representations of the exact same string, and that's what a filename is - a string.

It seems like your arguments stem from the assumption that the user cares about the byte sequence that represents the filename, which is wrong. The user has no idea what the byte sequence is - the user cares about the string. Normalization is meant to help computers, not users, and claiming that different normalizations of the same string produce different meaningful strings is complete bunk.

If you were to have two different files on your system, both of them called "Märchen", but one precomposed and one decomposed, how would you specify which one you wanted? Unless Linux has a special text input system which gives the user control over the normalization of their typed characters, you'd have to write out the UTF-8 bytes manually.

I just don't understand this insistence on treating the specific byte sequence that makes up the filename as significant.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
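
To make the precomposed vs. decomposed point concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper names `same_name` and `lookup`, and the dictionary standing in for a directory, are hypothetical and only for illustration; this is not a claim about how any particular filesystem implements lookup, and the choice of NFD as the stored form is an assumption.

```python
import unicodedata

# Two spellings of the same visible name.
composed = "M\u00e4rchen"      # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "Ma\u0308rchen"   # 'a' followed by U+0308 COMBINING DIAERESIS


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after bringing both to one normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def lookup(directory: dict, name: str, form: str = "NFD"):
    """Hypothetical lookup: keys are assumed to be stored already
    normalized to `form`, so one normalization of the query is enough."""
    return directory.get(unicodedata.normalize(form, name))


# A toy "directory" whose single entry is stored in decomposed form.
directory = {unicodedata.normalize("NFD", composed): "file contents"}

# The raw code point sequences (and so the UTF-8 bytes) differ...
bytes_differ = composed.encode("utf-8") != decomposed.encode("utf-8")
# ...but both spellings name the same entry once normalized.
found_composed = lookup(directory, composed)
found_decomposed = lookup(directory, decomposed)
```

With names normalized once at storage time, each lookup needs only a single normalization of the query followed by an ordinary exact-match comparison, which is the cheaper path argued for above.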
