Re: git on MacOSX and files with decomposed utf-8 file names

To: Linus Torvalds <torvalds@...>
Cc: Theodore Tso <tytso@...>, Mike Hommey <mh@...>, <git@...>
Date: Wednesday, January 23, 2008 - 1:19 pm

On Jan 23, 2008, at 11:16 AM, Linus Torvalds wrote:



Well yes, any context in which a string is treated as Unicode instead
of an opaque sequence of bytes will probably lead to normalization at
some point (e.g. when searching text, I'm going to want Märchen and
Märchen to be treated as the same string). The Mac OS X APIs use NFD,
and everybody else uses NFC, but either way it's still normalization.
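
As a rough illustration of the point above, the following Python sketch compares the precomposed (NFC) and decomposed (NFD) spellings of the same word. The variable names and the `same_text` helper are made up for this example; only `unicodedata.normalize` is a real standard-library call. It shows that the two forms differ as byte sequences but compare equal once both are normalized to a common form.

```python
import unicodedata

# Two spellings of the same visible word (names are illustrative only).
nfc = "M\u00e4rchen"    # precomposed a-umlaut, U+00E4
nfd = "Ma\u0308rchen"   # plain 'a' followed by combining diaeresis, U+0308

def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Treated as opaque sequences, the two spellings are different...
raw_equal = nfc == nfd
nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# ...but they are the same string once a normalization form is chosen.
normalized_equal = same_text(nfc, nfd)

print(raw_equal, len(nfc_bytes), len(nfd_bytes), normalized_equal)
```

Which form is used for the comparison does not matter here, as long as both sides are brought to the same one.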




Why would the globbing libraries have to do anything special to
understand NFD? In fact, I prefer that they don't - it's very handy to
be able to type Ma* and have that match Märchen, as the globbing
library sees Ma??rchen and is happy to match the ??rchen against *.
Were the filename in NFC, I couldn't do that. Similarly, Ma<tab>
autocompletes the name Märchen for me. But the convenience is beside
the point - what I'm trying to show here is that if the globbing
library were NFD-aware, it probably would decide Ma* shouldn't match
Märchen, right?
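
The globbing behaviour described above can be sketched in Python with the standard `fnmatch` module, which (like the globbing library in question) matches without any normalization. The variable names are invented for this example, and the last step is only one possible reading of what a normalization-aware matcher might do, not a description of any real library.

```python
import fnmatch
import unicodedata

# The same filename in both forms (names are illustrative only).
name_nfc = "M\u00e4rchen"                           # precomposed a-umlaut
name_nfd = unicodedata.normalize("NFD", name_nfc)   # 'a' + combining diaeresis

pattern = "Ma*"

# A normalization-unaware matcher sees the NFD name as starting with a
# literal 'a', so the pattern matches; the NFC name starts with a
# different code point, so it does not.
matches_nfd = fnmatch.fnmatchcase(name_nfd, pattern)
matches_nfc = fnmatch.fnmatchcase(name_nfc, pattern)

# Assumption: a normalization-aware matcher composes both sides first.
# The plain 'a' in the pattern then no longer matches the a-umlaut.
aware_match = fnmatch.fnmatchcase(
    unicodedata.normalize("NFC", name_nfd),
    unicodedata.normalize("NFC", pattern),
)

print(matches_nfd, matches_nfc, aware_match)
```

Under these assumptions only the unaware match against the decomposed name succeeds, which is the convenience being described.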

I assume globbing libraries et al don't do UTF-8 hackery in Linux,
right? And yet using NFC-encoded filenames is fairly common? So why
should it be any different on OS X, especially since HFS+ isn't the
only option here (and thus doing NFD conversion in the library would
mess up other filesystems)?

In fact, probably the biggest reason the NFD-encoding was done at the
HFS+ level is because they simply couldn't trust user-level libraries
to always do the NFD conversion for pathnames. And I quote:

"I would prefer that case sensitivity and unicode normalization were
not the responsibility of the file system -- but I realize that we
cannot just ignore the problem and let the other layers sort it all
out."


I don't get why you're still calling it corruption when, on an HFS+
system, NFD-encoding is correct. It would be corruption for HFS+ to
write anything else but NFD.



There's no reason to assume that OS X is actually storing the NFD on
the volume. In fact, it's quite explicitly not:

"As far as storing exactly what was passed in, it's not just HFS
that's involved here. In Mac OS X, SMB, MSDOS, UDF, ISO 9660
(Joliet), NTFS and ZFS file systems all store in one form -- NFC. We
store in NFC since that's what is expected for these file systems. If
we were to allow NFD to pass through, it would cause problems when
these names were accessed outside of Mac OS X. So this is not just an
HFS issue but an interchange issue for Mac OS X. We have the legacy
NFD use/expectation in our applications and we chose not to ignore the
problem but make a conscious effort to have the appropriate form used
(NFD in Mac OS X APIs, NFC elsewhere). It's not perfect but neither is
the agnostic approach where both forms can be used and you can have
duplicate filenames in your file system."

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

Messages in current thread:
Re: git on MacOSX and files with decomposed utf-8 file names, Linus Torvalds, (Wed Jan 23, 12:16 pm)
Re: git on MacOSX and files with decomposed utf-8 file names, Kevin Ballard, (Wed Jan 23, 1:19 pm)
Re: git on MacOSX and files with decomposed utf-8 file names, Martin Langhoff, (Wed Jan 23, 7:37 pm)
Re: git on MacOSX and files with decomposed utf-8 file names, Jonathan del Strother, (Wed Jan 23, 5:02 am)
Re: git on MacOSX and files with decomposed utf-8 file names, Martin Langhoff, (Tue Jan 22, 9:27 pm)
Re: git on MacOSX and files with decomposed utf-8 file names, Martin Langhoff, (Tue Jan 22, 9:14 pm)
Re: git on MacOSX and files with decomposed utf-8 file names, Martin Langhoff, (Tue Jan 22, 9:47 pm)