I did some research on this point, since if we really are going to be compatible with Mac OS X's crappy HFS+ system, we need to know what the decomposition algorithm actually is. Turns out, there are *two* of them. Kevin didn't know what he was talking about. In fact, different versions of Mac OS use different normalization algorithms. Mac OS 8.1 through 10.2.x used decompositions based on Unicode 2.1. Mac OS X 10.3 and later use decompositions based on Unicode 3.2.[1]

As I correctly predicted, Apple is changing their normalization algorithm in different versions of Mac OS X. It is not static, which means there will be compatibility problems when moving hard drives between Mac OS X versions. I don't know if they try to fix this in their fsck or not, when upgrading from 10.2 to 10.3, but if not, certain files could disappear as part of the Mac OS X upgrade. Fun fun fun.

And clearly Kevin didn't read the tech note very carefully, since it clearly admits why they did it. The Mac OS X developers were being cheesy with how they implemented their HFS B-tree algorithms, and took the cheap, easy way out. So yeah, "crappy" is the only word that can be used for what Mac OS X perpetrated on the world. Because of that, a quick Google search shows it causes problems all over the stack, for many different programs beyond just git, including limewire and gnutella[2][3], Slim[4], and no doubt others.

[1] http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
[2] http://lists.limewire.org/pipermail/gui-dev/2003-January/001110.html
[3] http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.html
[4] http://forums.slimdevices.com/showthread.php?t=40582

In any case, it seems pretty clear that by now everyone except Kevin has realized that HFS+ is crappy and causes Internet-wide interoperability problems. So I'll justify sending this note by pointing out that the specific table for Mac OS's filesystem corruption algorithm can be found here: ...
http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg00000.

I just finished talking to one of the HFS+ developers, so I suspect I know a lot more on this subject now than you do. Here's some of the relevant information:

* Any new characters added to Unicode will only have one form (decomposed), so HFS+ will always accept new characters as they will be NFD. The only exception is case-sensitivity, as the case-folding tables in HFS+ are static, so new characters with case variants will be treated in a case-sensitive manner. However, as they are already decomposed, the NFD algorithm will not change their encoding. This means that no, there are zero problems moving HFS+ drives between versions of OS X.
* At the time HFS+ was developed, there was no one common standard for normalization. The HFS+ developers picked NFD because they thought it was "a more flexible, future-looking form", but Microsoft ended up picking the opposite just a short time later. Interestingly, NFC is a weird hybrid form which only has composed forms for pre-existing characters, and decomposed forms for all new characters (as they only have one form). So in a sense NFD is more sane than NFC.
* The core issue here, which is why you think HFS+ is so stupid, is that you guys see no problem with having 2 files "Märchen" (NFC) and "Märchen" (NFD), whereas the HFS+ developers don't consider it acceptable to have 2 visually identical names as independent files. Unfortunately, the only way to do this matching is to store the normalized form in the filesystem, because it would be a performance nightmare to try and do this matching any other way. The HFS+ developers considered it an acceptable trade-off, and as an application developer I tend to agree with them.

As I have stated in the past, this isn't a case of HFS+ being stupid and causing problems, it's a case of HFS+ being ...
Don't ruin it. You were silent for 12 hours and lots of patches and research on the problem started flowing. If you keep making a nuisance of yourself, people will turn from helping you to beating you up for being so annoying. Perhaps help prepare those tests you said it was a good idea to work on. If you manage to stay silent a bit, we'll need them soon ;-) m -
Except there *are* problems, because this promise doesn't apply to Unicode 2.1 (Mac OS 10.2 and before) and Unicode 3.2 (Mac OS 10.3 and above). And there were changes to the normalization algorithm between Unicode 3.2 and Unicode 4.1. So taking a hard drive between Mac OS X 10.2 and 10.3 *will* cause problems. The guarantees of Unicode stability didn't come until well past Unicode 2.1.

Also, I know of no guarantee that there will be no more new compositions. According to Unicode Standard Annex #15 (http://unicode.org/reports/tr15/), new characters that can be decomposed are strongly discouraged, but "It would be possible to add more compositions in a future version of Unicode". Got a reference to

NFC is better if you care about compatibility with existing legacy character sets, where you want round-trip conversions to be idempotent. On the other hand, given that Mac OS has historically never cared about being compatible with the rest of the world, it

Yep. No problems to do that. You seem to think that supporting Unicode requires imposing this constraint, but that's simply not true,

Nope. They were just not clever enough. If they had used a hashed key for their B-tree, with a hash that had the property that two strings that are equivalent in the Unicode sense have the same hash value, it's quite possible to do Unicode-equivalence lookups quickly. Yeah, calculating the hash takes a bit of time, but it gets called no more often than the normalization routine, and its performance overhead is no worse than normalizing a string.

I know how to do it in a Linux filesystem; it's just an insane thing to do, and so I choose not to do it. But it is doable; if you must pursue the course of filesystem insanity, it's possible to do it in a performant way, without normalization; it's the same way that you can

No, I did the research to try to find the HFS-specific filename mangling algorithm. And given that's based on a back-level, ...
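The equivalence-preserving lookup described in this message can be sketched in Python. This is only an illustration under stated assumptions: the `Directory` class, `equiv_key`, and the method names are hypothetical, a dict stands in for the on-disk B-tree, and the key is derived with `unicodedata.normalize` rather than with anything HFS+ or a Linux filesystem actually uses. What it shows is that the lookup key can ignore normalization differences while the stored name keeps exactly what the caller passed in.

```python
import unicodedata


def equiv_key(name: str) -> str:
    """Return a key that is identical for canonically equivalent names.

    The normalized string is used only as a lookup key (assumption for
    this sketch); it is never written back as the file name.
    """
    return unicodedata.normalize("NFD", name)


class Directory:
    """Hypothetical in-memory directory; a dict stands in for the B-tree."""

    def __init__(self):
        self._entries = {}  # equiv_key -> name exactly as it was created

    def create(self, name: str) -> bool:
        key = equiv_key(name)
        if key in self._entries:
            return False  # a canonically equivalent name already exists
        self._entries[key] = name
        return True

    def lookup(self, name: str):
        # Returns the name as originally stored, or None if absent.
        return self._entries.get(equiv_key(name))


d = Directory()
nfc = "M\u00e4rchen"    # precomposed a-umlaut
nfd = "Ma\u0308rchen"   # 'a' followed by a combining diaeresis
d.create(nfc)
stored = d.lookup(nfd)  # found via the equivalent spelling
print(stored == nfc)    # the stored name is still the NFC original
```

The cost per operation is one key computation, comparable to one normalization pass, but the name that comes back from `lookup` is byte-for-byte what was created.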
Uh, Ted is a filesystem developer. I can't count the hours I spent talking with my father, a theoretical physicist, but that does not make me qualified to consider myself a better authority on physics than a sub-average actual grad student of the matter. If you don't manage to check your arrogance eventually, you'll be causing more damage to your cause than if you just shut up. You make abundantly clear that you don't understand the _implications_ of the details you may or may not happen to find out. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
One thing I'd like somebody to check: what _does_ happen with OS X and NFS (OS X as a client, not server)? In particular: - Is it suddenly sane and case-sensitive? - Does the NFS client do any unicode conversion? I tried to google for it, but didn't find the right keywords to get anything useful out of that modern-day internet oracle. Linus -
Yes. Similarly with UFS partitions. After much grief with case-insensitivity on OSX I reinstalled the OS on a UFS partition, only to find that most 3rd party apps can't cope with case-sensitive

Don't know, unfortunately. I suspect both bits of mangling happen in the fs code. martin -
-Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
So this bit of insanity can affect users on other OSs too, if they use git on an NFS mountpoint hosted on OSX/HFS+. IIRC Apple does recommend UFS for servers though. I wonder how XServe machines ship by default. m -
Using a Linux server, and an OS X client, over NFS, it is case-sensitive. This is not unexpected, since you can mount UFS partitions on Mac OS X, or reformat HFS+ filesystems and make them be

Nope:

    # perl -CO -e 'print pack("U",0x00C4)."\n"' | xargs touch
    # ls -l | cat -v
    total 0
    0 -rw-r--r-- 1 nobody nobody 0 Jan 22 20:30 M-CM-^D

It's pretty clear the Unicode conversion is being done in HFS+, not in the VFS layer of Mac OS X. So presumably if and when Mac OS adopts ZFS, they will be able to be free of this mess, at least if they care about being compatible with Solaris. - Ted -
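For readers unfamiliar with the `cat -v` notation in the listing above, the following Python sketch reproduces it for a file name. The function `cat_v` is a hypothetical helper written for this illustration and only approximates what `cat -v` does; it is not taken from any real tool. It shows why `M-CM-^D` means the name is still the two-byte composed form (0xC3 0x84), and what a decomposed name would have looked like instead.

```python
def cat_v(data: bytes) -> str:
    """Render bytes roughly the way `cat -v` shows non-printing characters."""
    out = []
    for b in data:
        if b in (0x09, 0x0A):      # cat -v passes tab and newline through
            out.append(chr(b))
            continue
        prefix = ""
        if b >= 0x80:              # high bit set: "M-" plus the low 7 bits
            prefix = "M-"
            b -= 0x80
        if b < 0x20:               # control character: caret notation
            out.append(prefix + "^" + chr(b + 0x40))
        elif b == 0x7F:
            out.append(prefix + "^?")
        else:
            out.append(prefix + chr(b))
    return "".join(out)


composed = "\u00c4".encode("utf-8")      # U+00C4, as in the perl one-liner
decomposed = "A\u0308".encode("utf-8")   # 'A' + combining diaeresis

print(cat_v(composed))     # M-CM-^D
print(cat_v(decomposed))   # AM-LM-^H
```

Since the listing shows the first form, the name reached the NFS server without being decomposed on the way.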
Ok. That's going to make it both easier and harder for them in the future. In particular, it probably means that their VFS layer really has no notion of this at all, and it's going to be fairly hard to support any kind of

I wouldn't hold my breath on ZFS, considering the memory requirements. ZFS apparently wants *lots* of memory:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Administra...
http://wiki.freebsd.org/ZFSTuningGuide

in fact it seems that the FreeBSD people basically recommend against using ZFS on 32-bit kernels because of the memory use issues. Yes, it could be BSD-specific, but considering Solaris has the same recommendation, it sure seems like ZFS isn't ready for prime time on any low-end (read: consumer) hardware.

Of course, in a year or two, 2GB will be the norm. Right now it's still fairly unusual on Mac hardware outside of the Mac Pro line (which, I think, comes with a *minimum* of 2GB), and the people who get it want it not for the filesystem caches, but for big photo editing jobs.. Linus -
HFS+ was developed on Mac OS 8, which I believe didn't have the notion of a VFS, or at least not one that would have been in any way capable of doing the case-insensitivity and normalization necessary. However, I'm not sure what you mean by a "backwards compatibility" layer on other filesystems - if you mean treating another filesystem like HFS+, well, if you're using a filesystem that doesn't do normalization then Actually, interestingly the new MacBook Air comes with 2GB stock (I'm assuming it's soldered onto the motherboard, though, so it makes sense that Apple's giving customers 2GB as they can't upgrade themselves). In any case, everybody's making a big fuss about ZFS, but it really doesn't make a lot of sense to use for a consumer system, it seems more geared for a server. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
There must be something at the VFS layer, or some other layer:

- IIRC, Joliet iso9660 volumes end up being mounted with file names in NFS when the real file names are NFC on the disk.
- Likewise for Samba shares.
- When I had my problems with iso9660 rockridge volumes using NFC (you can create that just fine with mkisofs), the volume is mounted without normalisation, i.e. if you get to a shell and want to access files, you must use NFC, but at least the Finder does transliteration at some stage, because going into the mount point and opening some files fails because it's trying to open the file with the name transliterated to NFD.

I just hope the same doesn't happen with other filesystems. Also, with OSX using NFD widely, a file created from non Unix applications may end up being named in NFD on any file system. File contents, too, may end up being transliterated whenever a file is modified with non Unix applications, introducing unwanted changes. Typing file names in the Terminal might also make them encoded in NFD, too. Mike -
I assume you mean NFD, not NFS, but here's what one of the HFS+ engineers had to say:

"In Mac OS X, SMB, MSDOS, UDF, ISO 9660 (Joliet), NTFS and ZFS file systems all store in one form -- NFC. We store in NFC since that what

Can you produce a reproducible set of steps for this? Because the Finder shouldn't be doing any of this work on its own, all the

Entirely possible, though renormalizing file contents seems a bit less likely. I will point out that the text input system in OS X seems to default to producing NFC (at least, typing `echo 'Märchen' | xxd` in the Terminal shows that the input string there is NFC). So user input will most likely produce NFC, the only way you're probably going to end up with NFD is if you move a file from HFS+ to another filesystem. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
I wonder what happens if you do this:

    touch 'Märchen'
    echo M*rchen | xxd -g1

Will that produce NFC or NFD? Dmitry -
0000000: 4d 61 cc 88 72 63 68 65 6e 0a Ma..rchen. -
This is NFC! Did you do that on HFS+? If so, it means that shell on Mac also converts filenames to NFC when it reads them from the disk. Dmitry -
Simple: on a Linux host, create files with NFC names, and create an iso image with mkisofs, with rockridge but no joliet. Burn this to a disc, and insert the disc in your OSX host, and try to open files from the Finder. Interestingly, IIRC, Finder is able to copy the files, though. As a bonus, try the same with an iso volume name in NFC, it's even better: the created mount point is NFD, but it tries to mount on the name in NFC and fails. And then you just can't eject the CD anymore. Mike -
Here's a reliable test case to test filename normalization on Mac OS.

------ cut here -------
cat > test.pl << EOF
#!/usr/bin/perl -CO

print "M".pack("U",0x00E4)."rchen\n";
print "Ma".pack("U",0x0308)."rchen\n";
EOF
chmod +x test.pl
./test.pl | xargs touch
echo M* | xxd -g1
------ cut here -------

On an NFS mounted filesystem, what you will get is this:

0000000: 4d 61 cc 88 72 63 68 65 6e 20 4d c3 a4 72 63 68  Ma..rchen M..rch
0000010: 65 6e 0a                                         en.

and on an HFS+ mounted filesystem, what you will get is this:

0000000: 4d 61 cc 88 72 63 68 65 6e 0a  Ma..rchen.

So this demonstrates that on my MacOS 10.4.11 system, on NFS, MacOS is doing no normalization, as it is creating two files. On HFS+, MacOS is mapping both filenames to the same decomposed name.

More (or not) surprisingly, given Kevin Ballard's "reliable source":

"In Mac OS X, SMB, MSDOS, UDF, ISO 9660 (Joliet), NTFS and ZFS file systems all store in one form -- NFC. We store in NFC since that what is expected for these files systems."

Using a Sony Reader (which uses an internal FAT filesystem) hooked up to a MacOS 10.4.11 system:

% /fs/u1/tmp/test.pl | xargs touch
% echo M* | xxd -g1
0000000: 4d 61 cc 88 72 63 68 65 6e 0a  Ma..rchen.

.. which is the decomposed form. So it looks like on FAT/MSDOS filesystems MacOS 10.4.11 normalizes files to NFD, which will *not* do the right thing as far as Windows compatibility is concerned on USB sticks, et al. Mac OS users would be well advised not to use non-ASCII names in their filesystems if they care about interoperating with other systems. :-P - Ted -
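The byte sequences in the hex dumps above can be checked mechanically. Here is a minimal Python sketch that decodes the bytes of each name and reports which normalization form it is in; `classify` is a hypothetical helper introduced only for this illustration, and the hex strings are copied from the dumps with the separating space and trailing newline removed.

```python
import unicodedata


def classify(hex_bytes: str) -> str:
    """Report the normalization form of a UTF-8 file name given as hex."""
    name = bytes.fromhex(hex_bytes).decode("utf-8")
    is_nfc = unicodedata.normalize("NFC", name) == name
    is_nfd = unicodedata.normalize("NFD", name) == name
    if is_nfc and is_nfd:
        return "both"      # e.g. pure ASCII: nothing to compose or decompose
    if is_nfc:
        return "NFC"
    if is_nfd:
        return "NFD"
    return "neither"


# First name in the NFS dump, and the only name in the HFS+ and FAT dumps.
print(classify("4d 61 cc 88 72 63 68 65 6e"))   # NFD
# Second name in the NFS dump.
print(classify("4d c3 a4 72 63 68 65 6e"))      # NFC
```

The sequence `61 cc 88` is `a` followed by U+0308 (combining diaeresis), which is the decomposed spelling; `c3 a4` is the single precomposed code point U+00E4.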
Well, it demonstrates that (a) the OS and (b) _perl_ don't mangle filenames on non-HFS+ filesystems. The problem is that since most native applications *expect* that name mangling, they'll probably do name mangling of their own (internally) just to compare the names! So I would not be surprised if the globbing libraries, for example, will do NFD-mangling in order to glob "correctly", so even programs ported from real Unix might end up getting pathnames subtly changed into NFD as part of some hot library-on-library action with UTF hackery inside. Things like the finder etc, which must be very aware of the fact that filenames get corrupted, would presumably internally always convert everything they get into NFD in order to compare names from different sources. And as part of that, programs may well corrupt the name before they then use it to create a pathname. The fact that your perl program works under NFS, but creates NFD on a VFAT volume, does imply that they probably used at least some of the same routines they use in HFS+ for VFAT. Not entirely surprising: doing case insensitive stuff with Unicode is nasty code, so why not share it (even if it's then incorrect for FAT).. Piece of crap it is, though. Apple has painted themselves into a nasty corner there. Linus -
Well "touch" actually since that was what was actually creating the files; I only used perl because it was the easiest way to guarantee exactly

It's worse than that. You can specify at format time whether or not HFS+ is case-sensitive, and of course, there is UFS, which I expect does no Unicode normalization at all, much like NFS. I suspect what you've pointed out is why certain MacOS programs break horribly when run on non-HFS+ filesystems, though. And if that is the case, then those same programs might not be reliable if the user's home directory is stored on NFS --- like they would be in an enterprise/corporate environment, if Apple ever wants to have any hope of penetrating that market.

Because of this, git code won't be able to just check for HFS+; it will probably have to do a run-time test to see whether or not the filesystem is doing case-folding or not, since that can be turned on or off on a per-filesystem basis. Also unknown, and which should be tested, is whether turning off case-folding also turns off Unicode normalization. It may be that they did this so that HFS+ could be UFS compatible, since Darwin *must* be built on a UFS filesystem, reflecting its Mach/BSD heritage. (I ran across this while doing my web research; apparently HFS+ has been causing Apple headaches

Well, hopefully not everyone inside Apple's OS groups are total morons, and actually use a utf8_str_equiv() routine instead of strcmp() to do their Unicode comparisons. But then again, maybe

No kidding!! - Ted -
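A run-time test of the kind suggested here could look roughly like the following Python sketch. It is an illustration only, not how git does or would do it: `probe_filesystem` is a hypothetical name, and the approach (create scratch files, then see how the filesystem reports them) is an assumption about what such a test might do. Paths are handled as bytes so that the comparison is on exactly what the filesystem returns.

```python
import os
import tempfile


def probe_filesystem(directory):
    """Return (folds_case, normalizes) for the filesystem holding `directory`.

    Creates and removes a scratch subdirectory, so `directory` must be
    writable.
    """
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        tmpb = os.fsencode(tmp)

        # Case folding: create a lowercase name, then ask for it in uppercase.
        open(os.path.join(tmpb, b"probe"), "w").close()
        folds_case = os.path.exists(os.path.join(tmpb, b"PROBE"))

        # Normalization: create an NFC name, then check whether the
        # directory listing returns those exact bytes.
        nfc = "M\u00e4rchen".encode("utf-8")
        open(os.path.join(tmpb, nfc), "w").close()
        normalizes = nfc not in os.listdir(tmpb)

    return folds_case, normalizes


folds_case, normalizes = probe_filesystem(tempfile.gettempdir())
print("case folding:", folds_case, "normalization:", normalizes)
```

Because the two properties are probed independently, the same test would also answer the open question of whether turning off case-folding turns off normalization on a given volume.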
Well yes, any context in which a string is treated as Unicode instead of an opaque sequence of bytes will probably lead to normalization at some point (e.g. when searching text, I'm going to want Märchen and Märchen to be treated as the same string). The Mac OS X APIs use NFD,

Why would the globbing libraries have to do anything special to understand NFD? In fact, I prefer that they don't - it's very handy to be able to type Ma* and have that match Märchen, as the globbing library sees Ma??rchen and is happy to match the ??rchen against *. Were the filename in NFC, I couldn't do that. Similarly, Ma<tab> autocompletes the name Märchen for me. But the convenience is beside the point - what I'm trying to show here is that if the globbing library were NFD-aware, it probably would decide Ma* shouldn't match Märchen, right?

I assume globbing libraries et al don't do UTF-8 hackery in Linux, right? And yet using NFC-encoded filenames is fairly common? So why should it be any different on OS X, especially since HFS+ isn't the only option here (and thus doing NFD conversion in the library would mess up other filesystems)?

In fact, probably the biggest reason the NFD-encoding was done at the HFS+ level is because they simply couldn't trust user-level libraries to always do the NFD conversion for pathnames. And I quote:

"I would prefer that case sensitivity and unicode normalization were not the responsibility of the file system -- but I realize that we cannot just ignore the problem and let the other layers sort it all

I don't get why you're still calling it corruption when, on an HFS+ system, NFD-encoding is correct. It would be corruption for HFS+ to

There's no reason to assume that OS X is actually storing the NFD on the volume. In fact, it's quite explicitly not:

"As far as storing exactly what was passed in, its not just HFS that's ...
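The globbing behaviour described in this message is easy to reproduce with a matcher that has no knowledge of normalization. The sketch below uses Python's standard `fnmatch.fnmatchcase`, which compares code points rather than bytes (a shell works on bytes or locale characters), so it is only an analogy for what a shell's globbing does; the two names are the NFC and NFD spellings used throughout the thread.

```python
import fnmatch

nfc = "M\u00e4rchen"    # precomposed a-umlaut: second character is not 'a'
nfd = "Ma\u0308rchen"   # plain 'a' followed by a combining diaeresis

# A normalization-unaware matcher sees a literal "Ma" prefix only in the
# decomposed spelling.
print(fnmatch.fnmatchcase(nfd, "Ma*"))   # True
print(fnmatch.fnmatchcase(nfc, "Ma*"))   # False
```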
As pointed out (multiple times), this is only true if the programmer is a moron. You do not need to - and *should* not - convert to a common normalization in order to compare two Unicode strings. You should just compare them with a Unicode-aware comparison routine. It will be faster, and it will avoid corrupting the input. Sadly, stupid people are much too common. Linus -
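A comparison routine in this spirit can be sketched in Python as follows. The name `unicode_equiv` is hypothetical (loosely echoing the `utf8_str_equiv()` mentioned earlier in the thread), and the sketch makes one simplification that should be stated plainly: when the bytes differ it normalizes temporary copies internally, which is simpler but slower than the incremental comparison a dedicated C routine could perform. The property it does demonstrate is that the caller's strings are only read, never rewritten.

```python
import unicodedata


def unicode_equiv(a: bytes, b: bytes) -> bool:
    """True if two UTF-8 names are canonically equivalent.

    Neither input is modified; any normalized form exists only as a
    temporary inside this function.
    """
    if a == b:
        return True            # fast path: identical bytes
    try:
        ua = a.decode("utf-8")
        ub = b.decode("utf-8")
    except UnicodeDecodeError:
        return False           # not valid UTF-8: only byte identity counts
    return unicodedata.normalize("NFD", ua) == unicodedata.normalize("NFD", ub)


nfc = "M\u00e4rchen".encode("utf-8")
nfd = "Ma\u0308rchen".encode("utf-8")

print(unicode_equiv(nfc, nfd))           # True
print(nfc == nfd)                        # False: the inputs are untouched
print(unicode_equiv(nfc, b"Marchen"))    # False
```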
How about this: it's lossy. It's lossy in a similar sense that TIFF -> JPEG -> TIFF doesn't give you back exactly the same bytes, even though (modulo the compression level) the two TIFFs might be visually indistinguishable. You seem to have an issue with calling this "corruption", but to most of us, if you have a system where you don't get back *exactly the same data* that you put in, then the data has been corrupted. Now, please stop trolling this point, agree to disagree, and either contribute some code or be quiet and allow others to make progress. j. -
Because in a modern Internet-aware world, whoever designs a FS needs to acknowledge that they will need to store files from other systems that have other assumptions. That is, if they want to interoperate. As you noted not long ago, it is a serious problem if an HFS+ partition is shared over NFS. If you look at all the apps that have problems with this aspect of HFS+, they are all apps that transfer files over the network over diverse protocols. That's why it's a problem with git, because the files may be coming from a different machine, running any arbitrary OS that git supports. In such a scenario, can you understand why everyone is saying that HFS+ and the VFS should not mangle names, even if it makes sense to some use cases under OSX? And do you understand why the same applies to git, being a network-sharing-oriented app? So -- if OSX was doing things to make it easier for users to find matching files at the Finder level, that'd be _fine_. But the FS has to deal with a lot more variety than that. So this is a bad design decision -- perhaps less obvious under OS8/9, but completely disastrous with a network OS such as OSX. Call it "different" if you want, but that's a euphemism for "wrong". cheers, m -
I was actually asking for you to show this instead of just asserting it, but I realized I have access to an SMB share myself so I just tested. And you're right. That's very curious. I guess they did that because the entire Carbon stack was written assuming NFD (back at the same time HFS+ was created), and they wanted to provide a consistent interface to applications. Since the filesystem already uses NFC, renormalizing to NFD shouldn't lose anything (want the original I was actually hoping for something I could test myself. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
Wait, did you tell us some time ago that normalization does not matter and you just need to treat strings "as text"? Now, it looks like the Carbon stack does not treat strings "as text". How come? Maybe, you should stop lying and admit that changing Unicode

On Windows, you can create two *different* files -- one with NFC and the other with NFD name. I wonder, how it is going to work with your renormalization back and forth. Dmitry -
I'm not sure what you're trying to say here. As near as I can tell, SMB already does encoding conversions itself when talking to different clients, so you can hardly say OS X is doing something bad by converting between local NFD and NFC on SMB. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
