Re: git on MacOSX and files with decomposed utf-8 file names

To: Linus Torvalds <torvalds@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 7:11 pm

On Jan 16, 2008, at 5:32 PM, Linus Torvalds wrote:





You're right, it doesn't actually have to store the normalized form.
And yes, it's possible to compare without normalizing them.
Admittedly, I don't know much about the implementation details of
unicode, but I would assume that the easiest way to compare two
strings is to normalize them first. But in the case of the filesystem,
normalization actually is important if you're thinking about filenames
in terms of characters rather than bytes. When I feed the filesystem a
given unicode string, it has to find the file I'm talking about -
should it do a relatively expensive unicode-sensitive comparison of
all the filenames with the one I gave it, or should it just normalize
all names and do the much cheaper lookup that way? I don't know about
you, but I'd prefer to let my filesystem normalize the name and run
faster.
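
To make the "normalize all names and do the cheaper lookup" idea
concrete, here is a minimal sketch in Python. It is not how any real
filesystem is implemented; the NormalizingDirectory class, its method
names, and the choice of NFD as the stored form are all assumptions
made for illustration. The point is only that once every name is
normalized on the way in, lookup becomes a plain exact-match hash
lookup.

```python
import unicodedata


def normalize_name(name: str) -> str:
    # Assumption: NFD (canonical decomposition) is the stored form.
    # Any single normalization form would work for this sketch.
    return unicodedata.normalize("NFD", name)


class NormalizingDirectory:
    """Hypothetical in-memory directory that normalizes names on entry."""

    def __init__(self):
        self._entries = {}

    def create(self, name: str, data: str) -> None:
        # Normalize once at creation time.
        self._entries[normalize_name(name)] = data

    def lookup(self, name: str):
        # Normalize the query, then do an ordinary exact-match lookup.
        return self._entries.get(normalize_name(name))


composed = "M\u00e4rchen"      # a-umlaut as one code point (U+00E4)
decomposed = "Ma\u0308rchen"   # "a" followed by combining diaeresis (U+0308)

directory = NormalizingDirectory()
directory.create(composed, "once upon a time")

# The two spellings differ as code point sequences,
# yet both find the same entry.
found = directory.lookup(decomposed)
```

The alternative, comparing the query against every stored name with a
normalization-aware comparison, would need a pass over the whole
directory instead of a single hash lookup.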



There's a difference between "looks similar" as in "Polish" vs
"polish", and actually is the same string as in "Ma<UMLAUT
MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid
semantic meaning, normalization doesn't. The only way to argue that
normalization is wrong is by providing a good reason to preserve the
exact byte sequence, and so far the only reason I've seen is to help
git. Applications in general don't care one whit about the byte
sequence of the filename, they care about the underlying file the name
represents. Additionally, it would be a terrible experience for a user
to enter "Märchen" and have the application say "sorry, I can't find
this file" simply because the application used decomposed characters
and the filename used composed characters. Unless the user is
knowledgeable about the OS, filesystems, and unicode, they wouldn't
have a hope of figuring out what the problem was.


How do you figure? When I type "Märchen", I'm typing a string, not a
byte sequence. I have no control over the normalization of the
characters. Therefore, depending on what program I'm typing the name
in, I might use the same normalization as the filename, or I might
miss. It's completely out of my control. This is why the filesystem
has to step in and say "You composed that character differently, but I
know you were trying to specify this file".



There are valid reasons for case to matter, but what reason is there
for "single character" vs "two character overlay" to matter in
filenames? They're different representations of the exact same string,
and that's what a filename is - a string.

It seems like your arguments stem from the assumption that the user
cares about the byte sequence that represents the filename, which is
wrong. The user has no idea what the byte sequence is - the user cares
about the string. Normalization is meant to help computers, not users,
and claiming that different normalizations of the same string produce
different meaningful strings is complete bunk.

If you were to have two different files on your system, both of them
called "Märchen", but one precomposed and one decomposed, how would
you specify which one you wanted? Unless Linux has a special text
input system which gives the user control over the normalization of
their typed characters, you'd have to write out the UTF-8 bytes
manually.
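
For reference, this is what "writing out the UTF-8 bytes manually"
would involve. A small Python sketch using the standard unicodedata
module shows the two byte sequences that sit behind the same visible
name; the variable names are just for illustration.

```python
import unicodedata

name = "M\u00e4rchen"

# Precomposed form: a-umlaut is the single code point U+00E4.
composed = unicodedata.normalize("NFC", name)
# Decomposed form: "a" followed by combining diaeresis U+0308.
decomposed = unicodedata.normalize("NFD", name)

composed_bytes = composed.encode("utf-8")      # b"M\xc3\xa4rchen"
decomposed_bytes = decomposed.encode("utf-8")  # b"Ma\xcc\x88rchen"

# The byte sequences differ (8 bytes vs 9 bytes) ...
same_bytes = composed_bytes == decomposed_bytes
# ... but the strings are canonically equivalent once normalized.
same_string = unicodedata.normalize("NFC", decomposed) == composed
```

A byte-oriented filesystem would treat these as two unrelated names,
even though they render identically on screen.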

I just don't understand this insistence on treating the specific byte
sequence that makes up the filename as significant.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com