Re: git on MacOSX and files with decomposed utf-8 file names

To: Theodore Tso <tytso@...>
Cc: <git@...>
Date: Tuesday, January 22, 2008 - 8:38 pm

On Jan 22, 2008, at 7:08 PM, Theodore Tso wrote:

http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
http://lists.limewire.org/pipermail/gui-dev/2003-January/001110.html
http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html

I just finished talking to one of the HFS+ developers, so I suspect I
know a lot more on this subject now than you do. Here's some of the
relevant information:

* Any new characters added to Unicode will only have one form
(decomposed), so HFS+ will always accept new characters as they will
be NFD. The only exception is case-sensitivity, as the case-folding
tables in HFS+ are static, so new characters with case variants will
be treated in a case-sensitive manner. However, as they are already
decomposed, the NFD algorithm will not change their encoding. This
means that no, there are zero problems moving HFS+ drives between
versions of OS X.

* At the time HFS+ was developed, there was no one common standard for
normalization. The HFS+ developers picked NFD because they thought it
was "a more flexible, future-looking form", but Microsoft ended up
picking the opposite just a short time later. Interestingly, NFC is a
weird hybrid form which only has composed forms for pre-existing
characters, and decomposed forms for all new characters (as they only
have one form). So in a sense NFD is more sane than NFC.
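
The "hybrid" behaviour of NFC described above can be seen with Python's
standard unicodedata module. This is only an illustrative sketch; the
variable names are made up for the example, and the q + diaeresis pair is
simply one sequence that has no precomposed code point.

```python
import unicodedata

composed = "\u00e4"      # "ä" as a single precomposed code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

# For a character that has a precomposed form, NFC and NFD differ.
nfc_a = unicodedata.normalize("NFC", decomposed)
nfd_a = unicodedata.normalize("NFD", composed)

# "q" + diaeresis has no precomposed code point, so NFC has nothing
# to compose and leaves the sequence exactly as NFD would.
q_umlaut = "q\u0308"
nfc_q = unicodedata.normalize("NFC", q_umlaut)
nfd_q = unicodedata.normalize("NFD", q_umlaut)

print(len(nfc_a), len(nfd_a))   # code point counts for the "ä" case
print(nfc_q == nfd_q)           # same sequence under both forms
```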

* The core issue here, which is why you think HFS+ is so stupid, is
that you guys see no problem with having 2 files "Märchen" (NFC) and
"Märchen" (NFD), whereas the HFS+ developers don't consider it
acceptable to have 2 visually identical names as independent files.
Unfortunately, the only way to do this matching is to store the
normalized form in the filesystem, because it would be a performance
nightmare to try and do this matching any other way. The HFS+
developers considered it an acceptable trade-off, and as an
application developer I tend to agree with them.
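
To make the trade-off concrete, here is a toy in-memory "directory" that
stores names in normalized form, so the NFC and NFD spellings of "Märchen"
resolve to the same entry. NormalizingDirectory and its methods are
hypothetical names invented for this sketch, and standard Unicode NFD is
used as a stand-in for whatever decomposition tables HFS+ actually applies.

```python
import unicodedata


class NormalizingDirectory:
    """Toy directory that keys entries by the NFD form of their name."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(name):
        # Normalize once, at the boundary, so lookups are plain dict hits.
        return unicodedata.normalize("NFD", name)

    def create(self, name, data):
        key = self._key(name)
        if key in self._entries:
            raise FileExistsError(name)
        self._entries[key] = data

    def lookup(self, name):
        return self._entries[self._key(name)]

    def names(self):
        # Names come back in stored (normalized) form, not as typed.
        return list(self._entries)


nfc_name = "M\u00e4rchen"    # precomposed "ä"
nfd_name = "Ma\u0308rchen"   # "a" + combining diaeresis

directory = NormalizingDirectory()
directory.create(nfc_name, b"once upon a time")
```

A side effect worth noticing is that names() returns the NFD spelling even
though the entry was created with the NFC one, which is exactly the
behaviour that surprises tools expecting to read back the bytes they wrote.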

As I have stated in the past, this isn't a case of HFS+ being stupid
and causing problems, it's a case of HFS+ being *different* and
causing problems. But this difference is just as much your fault as it
is HFS+'s fault.

* For detecting case-sensitive filesystems you can use pathconf(2):
_PC_CASE_SENSITIVE (if unsupported, you can assume the filesystem is
case-sensitive). There is also the getattrlist(2) attribute:
VOL_CAP_FMT_CASE_SENSITIVE.
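
Where those calls are not available, a portable fallback is to probe the
filesystem directly. The sketch below does not use pathconf(2) or
getattrlist(2); it swaps in a simple create-and-stat probe, and
is_case_sensitive is a hypothetical helper name. It assumes the given
directory is writable.

```python
import os
import tempfile


def is_case_sensitive(directory):
    """Probe `directory`: create "probe" and see if "PROBE" also resolves."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        lower = os.path.join(tmp, "probe")
        upper = os.path.join(tmp, "PROBE")
        with open(lower, "w"):
            pass
        # On a case-insensitive volume the upper-case name finds the
        # file that was just created under the lower-case name.
        return not os.path.exists(upper)


if __name__ == "__main__":
    print(is_case_sensitive(tempfile.gettempdir()))
```

The probe reports on the volume that holds the directory it is given, so it
should be run against the working tree itself rather than against /tmp.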

There appears to be no API for determining if normalization will be
applied. However, any filesystem that uses UTF-8 explicitly as storage
(unlike the Linux filesystems, which you claim use UTF-8 but which
obviously use no particular encoding at all) is pretty much guaranteed
to have to normalize or it will have abysmal performance.
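
Lacking an API, one can at least observe the behaviour empirically: create
a file under an NFC name, read the directory back, and check whether the
NFD spelling resolves. This is a rough sketch under the assumption that the
directory is writable; probe_normalization and its three return strings are
invented for the example, and byte paths are used so the result does not
depend on the locale's filename encoding.

```python
import os
import tempfile
import unicodedata


def probe_normalization(directory):
    """Return a short description of how `directory` treats an NFC name."""
    nfc = unicodedata.normalize("NFC", "M\u00e4rchen").encode("utf-8")
    nfd = unicodedata.normalize("NFD", "M\u00e4rchen").encode("utf-8")
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        tmp_bytes = os.fsencode(tmp)
        with open(os.path.join(tmp_bytes, nfc), "wb"):
            pass
        stored = os.listdir(tmp_bytes)[0]   # the only entry in the directory
        alias_resolves = os.path.exists(os.path.join(tmp_bytes, nfd))
    if stored == nfd:
        return "stores NFD"
    if alias_resolves:
        return "preserves bytes, matches normalization-insensitively"
    return "no normalization"


if __name__ == "__main__":
    print(probe_normalization(tempfile.gettempdir()))
```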

I must say it is shocking that someone as smart as you is still more
interested in finding ways to prove me wrong than to actually address
the problem. It's obvious that the only research you did was intended
to find ways to call me stupid.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
