I did some research on this point, since if we really are going to be
compatible with MacOS X's crappy HFS+ system, we need to know what the
decomposition algorithm actually is. Turns out, there are *two* of
them. Kevin didn't know what he was talking about. In fact,
different versions of Mac OS X use different normalization algorithms.

Mac OS 8.1 through 10.2.x used decompositions based on Unicode 2.1.
Mac OS X 10.3 and later use decompositions based on Unicode 3.2.[1]

As I correctly predicted, Apple is changing their normalization
algorithm in different versions of Mac OS X. It is not static, which
means there will be compatibility problems when moving hard drives
between Mac OS X versions. I don't know if they try to fix this in
their fsck or not, when upgrading from 10.2 to 10.3, but if not,
certain files could disappear as part of the Mac OS X upgrade. Fun
fun fun.

And clearly Kevin didn't read the tech note very carefully, since it
clearly admits why they did it. The Mac OS X developers were being
cheesy with how they implemented their HFS B-tree algorithms, and took
the cheap, easy way out. So yeah, "crappy" is the only word that can
be used for what Mac OS X perpetrated on the world. Because of that,
a quick Google search shows it causes problems all over the stack, for
many different programs beyond just git, including limewire and
gnutella[2][3], Slim[4], and no doubt others.

[1] http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties
[2] http://lists.limewire.org/pipermail/gui-dev/2003-January/001110.html
[3] http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg0000...
[4] http://forums.slimdevices.com/showthread.php?t=40582

In any case, it seems pretty clear that by now everyone except Kevin
has realized that HFS+ is crappy and causes Internet-wide
interoperability problems. So I'll justify sending this note by
pointing out the specific table of Mac OS's filesystem corruption
algorithm can be found here:

http:...
One thing I'd like somebody to check: what _does_ happen with OS X and NFS
(OS X as a client, not server)? In particular:

- Is it suddenly sane and case-sensitive?
- Does the NFS client do any unicode conversion?

I tried to google for it, but didn't find the right keywords to get
anything useful out of that modern-day internet oracle.

Linus
-
Using a Linux server and an OS X client, over NFS, it is
case-sensitive. This is not unexpected, since you can mount UFS
partitions on Mac OS X, or reformat HFS+ filesystems and make them be

Nope:

# perl -CO -e 'print pack("U",0x00C4)."\n"' | xargs touch
# ls -l | cat -v
total 0
0 -rw-r--r-- 1 nobody nobody 0 Jan 22 20:30 M-CM-^D

It's pretty clear the Unicode conversion is being done in HFS+, not in
the VFS layer of Mac OS X.

So presumably if and when Mac OS adopts ZFS, they will be able to be
free of this mess, at least if they care about being compatible with
Solaris.

- Ted
-
There must be something at the VFS layer, or some other layer:
- IIRC, Joliet iso9660 volumes end up being mounted with file names in
NFS when the real file names are NFC on the disk.
- Likewise for Samba shares.
- When I had my problems with iso9660 rockridge volumes using NFC (you
can create that just fine with mkisofs), the volume is mounted without
normalisation, i.e. if you get to a shell and want to access files,
you must use NFC, but at least the Finder does transliteration at some
stage, because going into the mount point and opening some files fails
because it's trying to open the file with the name transliterated to
NFD. I just hope the same doesn't happen with other filesystems.

Also, since OS X uses NFD widely, a file created from non-Unix applications
may end up being named in NFD on any file system. File contents, too,
may end up being transliterated whenever a file is modified with non-Unix
applications, introducing unwanted changes.

Typing file names in the Terminal might also make them encoded in NFD,
too.

Mike
-
I assume you mean NFD, not NFS, but here's what one of the HFS+
engineers had to say:

"In Mac OS X, SMB, MSDOS, UDF, ISO 9660 (Joliet), NTFS and ZFS file
systems all store in one form -- NFC. We store in NFC since that what

Can you produce a reproducible set of steps for this? Because the
Finder shouldn't be doing any of this work on its own, all the

Entirely possible, though renormalizing file contents seems a bit less
likely. I will point out that the text input system in OS X seems to
default to producing NFC (at least, typing `echo 'Märchen' | xxd` in
the Terminal shows that the input string there is NFC). So user input
will most likely produce NFC, the only way you're probably going to
end up with NFD is if you move a file from HFS+ to another filesystem.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Simple: on a Linux host, create files with NFC names, and create an iso
image with mkisofs, with rockridge but no joliet. Burn this to a disc,
insert the disc in your OS X host, and try to open files from the Finder.
Interestingly, IIRC, Finder is able to copy the files, though.

As a bonus, try the same with an iso volume name in NFC, it's even better:
the created mount point is NFD, but it tries to mount on the name in NFC and
fails. And then you just can't eject the CD anymore.

Mike
-
I was actually asking for you to show this instead of just asserting
it, but I realized I have access to an SMB share myself so I just
tested.

And you're right. That's very curious. I guess they did that because
the entire Carbon stack was written assuming NFD (back at the same
time HFS+ was created), and they wanted to provide a consistent
interface to applications. Since the filesystem already uses NFC,
renormalizing to NFD shouldn't lose anything (want the original

I was actually hoping for something I could test myself.
-Kevin Ballard
Wait, did you tell us some time ago that normalization does not
matter and you just need to treat strings "as text"? Now, it looks
like the Carbon stack does not treat strings "as text". How come?

Maybe, you should stop lying and admit that changing Unicode

On Windows, you can create two *different* files -- one with NFC
and the other with NFD name. I wonder how it is going to work
with your renormalization back and forth.

Dmitry
-
I'm not sure what you're trying to say here. As near as I can tell,
SMB already does encoding conversions itself when talking to different
clients, so you can hardly say OS X is doing something bad by
converting between local NFD and NFC on SMB.

-Kevin Ballard
Here's a reliable test case to test filename normalization on Mac OS.

------ cut here -------
cat > test.pl << EOF
#!/usr/bin/perl -CO
# precomposed form (NFC): U+00E4
print "M".pack("U",0x00E4)."rchen\n";
# decomposed form (NFD): "a" followed by combining U+0308
print "Ma".pack("U",0x0308)."rchen\n";
EOF
chmod +x test.pl
./test.pl | xargs touch
echo M* | xxd -g1
------ cut here -------

On an NFS mounted filesystem, what you will get is this:

0000000: 4d 61 cc 88 72 63 68 65 6e 20 4d c3 a4 72 63 68  Ma..rchen M..rch
0000010: 65 6e 0a                                         en.

and on an HFS+ mounted filesystem, what you will get is this:

0000000: 4d 61 cc 88 72 63 68 65 6e 0a                    Ma..rchen.

So this demonstrates that on my MacOS 10.4.11 system, on NFS, MacOS is
doing no normalization, as it is creating two files. On HFS+, MacOS
is mapping both filenames to the same decomposed name.

More (or not) surprisingly, given Kevin Ballard's "reliable source":

"In Mac OS X, SMB, MSDOS, UDF, ISO 9660 (Joliet), NTFS and ZFS file
systems all store in one form -- NFC. We store in NFC since that what is
expected for these files systems."

Using a Sony Reader (which uses an internal FAT filesystem) hooked up
to a MacOS 10.4.11 system:

% /fs/u1/tmp/test.pl | xargs touch
% echo M* | xxd -g1
0000000: 4d 61 cc 88 72 63 68 65 6e 0a                    Ma..rchen.

... which is the decomposed form. So it looks like on FAT/MSDOS
filesystems MacOS 10.4.11 normalizes files to NFD, which will *not* do
the right thing as far as Windows compatibility is concerned on USB
sticks, et al. Mac OS users would be well advised not to use
non-ASCII names in their filesystems if they care about interoperating
with other systems. :-P

- Ted
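
For readers who want to check the byte sequences in the dumps above without a Mac at hand, here is a minimal Python sketch. It relies on the standard-library `unicodedata` module, which implements the standard Unicode normalization forms and not Apple's HFS+ variant of the decomposition tables, so it only illustrates what NFC and NFD look like for this one name. The variable names are illustrative.

```python
import unicodedata

# "Märchen" with a precomposed U+00E4, as in the first line test.pl prints.
name = "M\u00e4rchen"

nfc = unicodedata.normalize("NFC", name)  # composed form
nfd = unicodedata.normalize("NFD", name)  # decomposed form: "a" + U+0308

# Space-separated hex of the UTF-8 encoding, in the style of `xxd -g1`.
nfc_hex = " ".join(f"{b:02x}" for b in nfc.encode("utf-8"))
nfd_hex = " ".join(f"{b:02x}" for b in nfd.encode("utf-8"))

print("NFC:", nfc_hex)
print("NFD:", nfd_hex)
```

The NFD line corresponds to the single name seen on HFS+ and on the FAT volume; the NFC line corresponds to the second name that shows up only on NFS.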
-
Well, it demonstrates that (a) the OS and (b) _perl_ don't mangle
filenames on non-HFS+ filesystems.

The problem is that since most native applications *expect* that name
mangling, they'll probably do name mangling of their own (internally) just
to compare the names!

So I would not be surprised if the globbing libraries, for example, will
do NFD-mangling in order to glob "correctly", so even programs ported from
real Unix might end up getting pathnames subtly changed into NFD as part
of some hot library-on-library action with UTF hackery inside.

Things like the Finder etc, which must be very aware of the fact that
filenames get corrupted, would presumably internally always convert
everything they get into NFD in order to compare names from different
sources. And as part of that, programs may well corrupt the name before
they then use it to create a pathname.

The fact that your perl program works under NFS, but creates NFD on a VFAT
volume, does imply that they probably used at least some of the same
routines they use in HFS+ for VFAT. Not entirely surprising: doing case
insensitive stuff with Unicode is nasty code, so why not share it (even if
it's then incorrect for FAT)..

Piece of crap it is, though. Apple has painted themselves into a nasty
corner there.

Linus
-
Well yes, any context in which a string is treated as Unicode instead
of an opaque sequence of bytes will probably lead to normalization at
some point (e.g. when searching text, I'm going to want Märchen (NFC) and
Märchen (NFD) to be treated as the same string). The Mac OS X APIs use NFD,

Why would the globbing libraries have to do anything special to
understand NFD? In fact, I prefer that they don't - it's very handy to
be able to type Ma* and have that match Märchen, as the globbing
library sees Ma??rchen and is happy to match the ??rchen against *.
Were the filename in NFC, I couldn't do that. Similarly, Ma<tab>
autocompletes the name Märchen for me. But the convenience is beside
the point - what I'm trying to show here is that if the globbing
library were NFD-aware, it probably would decide Ma* shouldn't match
Märchen, right?

I assume globbing libraries et al don't do UTF-8 hackery in Linux,
right? And yet using NFC-encoded filenames is fairly common? So why
should it be any different on OS X, especially since HFS+ isn't the
only option here (and thus doing NFD conversion in the library would
mess up other filesystems)?

In fact, probably the biggest reason the NFD-encoding was done at the
HFS+ level is because they simply couldn't trust user-level libraries
to always do the NFD conversion for pathnames. And I quote:

"I would prefer that case sensitivity and unicode normalization were
not the responsibility of the file system -- but I realize that we
cannot just ignore the problem and let the other layers sort it all

I don't get why you're still calling it corruption when, on an HFS+
system, NFD-encoding is correct. It would be corruption for HFS+ to

There's no reason to assume that OS X is actually storing the NFD on
the volume. In fact, it's quite explicitly not:

"As far as storing exactly what was passed in, its not just HFS
that's invo...
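
The globbing behaviour described above (Ma* matching the decomposed name but not the composed one) can be reproduced with a short Python sketch. It assumes Python's `fnmatch` module as a stand-in for a normalization-unaware globbing library; `fnmatch` matches on code points rather than raw bytes, but the outcome for this pattern is the same. The variable names are illustrative.

```python
import fnmatch
import unicodedata

nfc_name = unicodedata.normalize("NFC", "M\u00e4rchen")  # M, ä, r, c, h, e, n
nfd_name = unicodedata.normalize("NFD", "M\u00e4rchen")  # M, a, U+0308, r, ...

# A normalization-unaware matcher sees a plain "a" in the NFD name,
# so "Ma*" matches it; the NFC name has U+00E4 in that position instead.
matches_nfd = fnmatch.fnmatchcase(nfd_name, "Ma*")
matches_nfc = fnmatch.fnmatchcase(nfc_name, "Ma*")

print("Ma* matches NFD name:", matches_nfd)
print("Ma* matches NFC name:", matches_nfc)
```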
Because in a modern Internet-aware world, whoever designs a FS needs
to acknowledge that they will need to store files from other systems
that have other assumptions. That is, if they want to interoperate.

As you noted not long ago, it is a serious problem if an HFS+
partition is shared over NFS. If you look at all the apps that have
problems with this aspect of HFS+, they are all apps that transfer
files over the network over diverse protocols. That's why it's a
problem with git, because the files may be coming from a different
machine, running any arbitrary OS that git supports.

In such a scenario, can you understand why everyone is saying that HFS+
and the VFS should not mangle names, even if it makes sense to some
use cases under OSX? And do you understand why the same applies to
git, being a network-sharing-oriented app?

So -- if OSX was doing things to make it easier for users to find
matching files at the Finder level, that'd be _fine_. But the FS has
to deal with a lot more variety than that. So this is a bad design
decision -- perhaps less obvious under OS8/9, but completely
disastrous with a network OS such as OSX. Call it "different" if you
want, but that's a euphemism for "wrong".

cheers,
m
-
How about this: it's lossy. It's lossy in a similar sense that TIFF ->
JPEG -> TIFF doesn't give you back exactly the same bytes, even though
(modulo the compression level) the two TIFFs might be visually
indistinguishable.

You seem to have an issue with calling this "corruption", but to most
of us, if you have a system where you don't get back *exactly the same
data* that you put in, then the data has been corrupted.

Now, please stop trolling this point, agree to disagree, and either
contribute some code or be quiet and allow others to make progress.

j.
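
To make the "lossy" point concrete, here is a small Python sketch. It assumes the standard Unicode NFD/NFC tables shipped with Python's `unicodedata` module, which are not necessarily the tables HFS+ applies, so treat the specific characters as an illustration of the principle only. Two different input names collapse to the same normalized form, after which the original spelling can no longer be recovered.

```python
import unicodedata

angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_with_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

# Simulate "store the name in normalized form" for both inputs.
stored_1 = unicodedata.normalize("NFD", angstrom_sign)
stored_2 = unicodedata.normalize("NFD", a_with_ring)

# Both distinct inputs end up as the same stored name: "A" + U+030A.
same_after_store = stored_1 == stored_2

# Recomposing cannot tell which of the two was originally passed in.
round_trip = unicodedata.normalize("NFC", stored_1)

print("inputs differ:       ", angstrom_sign != a_with_ring)
print("stored forms equal:  ", same_after_store)
print("round trip == input: ", round_trip == angstrom_sign)
```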
-
As pointed out (multiple times), this is only true if the programmer is a
moron.

You do not need to - and *should* not - convert to a common normalization
in order to compare two Unicode strings. You should just compare them with a
Unicode-aware comparison routine. It will be faster, and it will avoid
corrupting the input.

Sadly, stupid people are much too common.

Linus
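
As a rough illustration of comparing without touching the input, here is a Python sketch. The function name `unicode_equiv` is hypothetical. Note the simplification: this version normalizes temporary copies inside the comparison, whereas the routine described above would walk both strings and compare them without building normalized copies at all. What the sketch does preserve is the key property that the caller's strings are never rewritten.

```python
import unicodedata

def unicode_equiv(a: str, b: str) -> bool:
    """Return True if a and b are canonically equivalent Unicode strings.

    Hypothetical helper: the inputs are left untouched; only temporary
    copies are decomposed for the comparison.
    """
    if a == b:  # fast path: identical code points
        return True
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

stored = "M\u00e4rchen"   # name as it was created (composed)
typed = "Ma\u0308rchen"   # name as another program spells it (decomposed)

print(unicode_equiv(stored, typed))      # equivalent spellings
print(unicode_equiv(stored, "Marchen"))  # genuinely different name
```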
-
Well "touch" actually since that was what was actually creating the
files; I only used perl because it was the easiest way to guarantee exactly

It's worse than that. You can specify at format time whether or not
HFS+ does case-sensitivity, and of course, there is UFS, which
I expect does no Unicode normalization at all, much like NFS. I
suspect what you've pointed out is why certain MacOS programs break
horribly when run on non-HFS+ filesystems, though. And if that is the
case, then those same programs might not be reliable if the user's
home directory is stored on NFS --- like they would be in an
enterprise/corporate environment, if Apple ever wants to have any hope
of penetrating that market.

Because of this, git code won't be able to just check for HFS+; it
will probably have to do a run-time test to see whether or not the
filesystem is doing case-folding, since that can be turned on
or off on a per-filesystem basis. Also unknown, and which should be
tested, is whether turning off case-folding also turns off Unicode
normalization. It may be that they did this so that HFS+ could be UFS
compatible, since Darwin *must* be built on a UFS filesystem,
reflecting its Mach/BSD heritage. (I ran across this while doing my
web research; apparently HFS+ has been causing Apple headaches

Well, hopefully not everyone inside Apple's OS groups is a total
moron, and they actually use a utf8_str_equiv() routine instead of
strcmp() to do their Unicode comparisons. But then again, maybe

No kidding!!

- Ted
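
A run-time test of the kind suggested above might look like the following Python sketch. This is not git's code; the function name `probe_filesystem`, the probe file name, and the result keys are all made up for illustration. It creates one file with an NFC name in the directory under test and then checks how that name comes back and which other spellings reach the same file. Paths are handled as bytes so that the exact byte sequence is what gets compared.

```python
import os
import tempfile
import unicodedata

def probe_filesystem(directory: str) -> dict:
    """Create a probe file in `directory` and report how its name behaves."""
    nfc = unicodedata.normalize("NFC", "probe-M\u00e4rchen")
    nfd = unicodedata.normalize("NFD", nfc)
    base = os.fsencode(directory)
    nfc_b, nfd_b = nfc.encode("utf-8"), nfd.encode("utf-8")
    upper_b = nfc.upper().encode("utf-8")

    path = os.path.join(base, nfc_b)
    with open(path, "wb"):
        pass
    try:
        listed = os.listdir(base)  # bytes in, bytes out
        return {
            # The directory listing no longer contains the bytes we passed in.
            "renames_on_store": nfc_b not in listed,
            # The decomposed spelling reaches the same file.
            "normalization_insensitive": os.path.exists(os.path.join(base, nfd_b)),
            # An upper-cased spelling reaches the same file.
            "case_insensitive": os.path.exists(os.path.join(base, upper_b)),
        }
    finally:
        os.remove(path)

with tempfile.TemporaryDirectory() as tmp:
    result = probe_filesystem(tmp)
    print(result)
```

Because the two properties are probed separately, the same test would also answer the open question of whether turning off case-folding turns off normalization on a given volume.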
-
I wonder what happens if you do this:
touch 'Märchen'
echo M*rchen | xxd -g1

Will that produce NFC or NFD?

Dmitry
-
0000000: 4d 61 cc 88 72 63 68 65 6e 0a Ma..rchen.
-
This is NFC! Did you do that on HFS+?
If so, it means that the shell on Mac also converts filenames to NFC when
it reads them from the disk.

Dmitry
-
NFD, you mean?
Mike
-
Oops, you are right.
Dmitry
-
Ok. That's going to make it both easier and harder for them in the future.
In particular, it probably means that their VFS layer really has no notion
of this at all, and it's going to be fairly hard to support any kind of

I wouldn't hold my breath on ZFS, considering the memory requirements.
ZFS apparently wants *lots* of memory:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#...
http://wiki.freebsd.org/ZFSTuningGuide

in fact it seems that the FreeBSD people basically recommend against using
ZFS on 32-bit kernels because of the memory use issues.

Yes, it could be BSD-specific, but considering Solaris has the same
recommendation, it sure seems like ZFS isn't ready for prime time on any
low-end (read: consumer) hardware.

Of course, in a year or two, 2GB will be the norm. Right now it's still
fairly unusual on Mac hardware outside of the Mac Pro line (which, I
think, comes with a *minimum* of 2GB), and the people who get it want it
not for the filesystem caches, but for big photo editing jobs..

Linus
-
HFS+ was developed on Mac OS 8, which I believe didn't have the notion
of a VFS, or at least not one that would have been in any way capable
of doing the case-insensitivity and normalization necessary. However,
I'm not sure what you mean by a "backwards compatibility" layer on
other filesystems - if you mean treating another filesystem like HFS+,
well, if you're using a filesystem that doesn't do normalization then

Actually, interestingly the new MacBook Air comes with 2GB stock (I'm
assuming it's soldered onto the motherboard, though, so it makes sense
that Apple's giving customers 2GB as they can't upgrade themselves).

In any case, everybody's making a big fuss about ZFS, but it really
doesn't make a lot of sense to use for a consumer system, it seems
more geared for a server.

-Kevin Ballard
So this bit of insanity can affect users on other OSs too, if they use
git on an NFS mountpoint hosted on OSX/HFS+.

IIRC Apple does recommend UFS for servers though. I wonder how XServe
machines ship by default.

m
-
Yes. Similarly with UFS partitions. After much grief with
case-insensitivity on OSX I reinstalled the OS on a UFS partition,
only to find that most 3rd party apps can't cope with case-sensitive

Don't know, unfortunately. I suspect both bits of mangling happen in
the fs code.

martin
-
I just finished talking to one of the HFS+ developers, so I suspect I
know a lot more on this subject now than you do. Here's some of the
relevant information:

* Any new characters added to Unicode will only have one form
(decomposed), so HFS+ will always accept new characters as they will
be NFD. The only exception is case-sensitivity, as the case-folding
tables in HFS+ are static, so new characters with case variants will
be treated in a case-sensitive manner. However, as they are already
decomposed, the NFD algorithm will not change their encoding. This
means that no, there are zero problems moving HFS+ drives between
versions of OS X.

* At the time HFS+ was developed, there was no one common standard for
normalization. The HFS+ developers picked NFD because they thought it
was "a more flexible, future-looking form", but Microsoft ended up
picking the opposite just a short time later. Interestingly, NFC is a
weird hybrid form which only has composed forms for pre-existing
characters, and decomposed forms for all new characters (as they only
have one form). So in a sense NFD is more sane than NFC.

* The core issue here, which is why you think HFS+ is so stupid, is
that you guys see no problem with having 2 files "Märchen" (NFC) and
"Märchen" (NFD), whereas the HFS+ developers don't consider it
acceptable to have 2 visually identical names as independent files.
Unfortunately, the only way to do this matching is to store the
normalized form in the filesystem, because it would be a performance
nightmare to try and do this matching any other way. The HFS+
developers considered it an acceptable trade-off, and as an
application developer I tend to agree with them.

As I have stated in the past, this isn't a case of HFS+ being stupid
and causing problems, it's a case of HFS+ being *differen...
Uh, Ted is a filesystem developer. I can't count the hours I spent
talking with my father, a theoretical physicist, but that does not make
me qualified to consider myself a better authority on physics than a
sub-average actual grad student of the matter.

If you don't manage to check your arrogance eventually, you'll be
causing more damage to your cause than if you just shut up. You make
abundantly clear that you don't understand the _implications_ of the
details you may or may not happen to find out.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
Except there *are* problems, because this promise doesn't apply to
Unicode 2.1 (Mac OS 10.2 and before) and Unicode 3.2 (Mac OS 10.3 and
above). And there were changes to the normalization algorithm
between Unicode 3.2 and Unicode version 4.1. So taking a hard
drive between Mac OS X 10.2 and 10.3 *will* cause problems. The
guarantees of Unicode stability didn't come until well past Unicode
2.1.

Also, I know of no guarantee that there will be no more new
compositions. According to Unicode Standard Annex #15
(http://unicode.org/reports/tr15/), new characters that can be
decomposed are strongly discouraged, but "It would be possible to add
more compositions in a future version of Unicode". Got a reference to

NFC is better if you care about compatibility with existing legacy
character sets, where you want round-trip conversions to be
idempotent. On the other hand, given that Mac OS has historically
never cared about being compatible with the rest of the world, it

Yep. No problems to do that. You seem to think that supporting
Unicode requires imposing this constraint, but that's simply not true,

Nope. They were just not clever enough. If they had used a hashed key for
their b-tree, with a hash which had the property that two strings
that were equivalent in the Unicode sense have the same hash value,
it would be quite possible to do Unicode-equivalence lookups quickly. Yeah,
calculating the hash takes a certain amount of time, but it gets
called no more often than the normalization routine, and its performance
overhead is no worse than normalizing a string.

I know how to do it in a Linux filesystem; it's just an insane thing
to do, and so I choose not to do it. But it is doable; if you must
pursue the course of filesystem insanity, it's possible to do it in a
performant way, without normalization; it's the same way that you can

No, I did the research to try to find the HFS-specific filename
mangling algorithm. And given that's based on a back-level, old
ver...
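
The hashed-key idea from this message can be sketched in Python as follows. Everything here is an assumption made for illustration: a plain dict stands in for the b-tree, `equiv_hash` and `EquivIndex` are made-up names, and CRC32 is an arbitrary choice of hash. For brevity the hash is computed over a temporary decomposed copy of the name; a real implementation could feed decomposed code points into the hash one at a time instead. The point being shown is that names are stored exactly as given, while equivalent spellings still land in the same bucket.

```python
import unicodedata
import zlib

def equiv_hash(name: str) -> int:
    """Hash with the property that canonically equivalent names collide."""
    canonical = unicodedata.normalize("NFD", name)  # temporary copy only
    return zlib.crc32(canonical.encode("utf-8"))

class EquivIndex:
    """Toy directory index keyed by equiv_hash; names are stored unmodified."""

    def __init__(self):
        self._buckets = {}

    def add(self, name: str) -> None:
        self._buckets.setdefault(equiv_hash(name), []).append(name)

    def lookup(self, name: str):
        """Return the stored spelling equivalent to `name`, or None."""
        target = unicodedata.normalize("NFD", name)
        for stored in self._buckets.get(equiv_hash(name), []):
            if unicodedata.normalize("NFD", stored) == target:
                return stored
        return None

index = EquivIndex()
index.add("M\u00e4rchen")                  # created with the composed spelling
found = index.lookup("Ma\u0308rchen")      # looked up with the decomposed one
print(found == "M\u00e4rchen")
```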
Don't ruin it. You were silent for 12 hours and lots of patches and
research on the problem started flowing. If you keep making a nuisance
of yourself, people will turn from helping you to beating you up for
being so annoying.

Perhaps help prepare those tests you said it was a good idea to work
on. If you manage to stay silent a bit, we'll need them soon ;-)

m
-
