Hi,
I have some files like "Lüftung.txt" in my repository. The strange thing
is that I can pull / add / commit / push those files without problem but
git-status always complains that those files are untracked (but not
missing). My assumption is that it's a problem with the way Mac OS X
stores the file names (decomposed UTF-8). So something like
"Lüftung.txt" (precomposed "ü") becomes "Lüftung.txt" (decomposed: "u"
followed by a combining diaeresis).

It seems that git-status does two things:
1. Find files under version control (i.e. search for missing files)
2. Find files not under version control (i.e. search for untracked files)

I guess that the first look-up succeeds because Mac OS X converts
composed UTF-8 to decomposed UTF-8 when searching for a file. But it
seems that the second look-up takes the file names as-is (decomposed)
without converting them to composed UTF-8.

Is there an easy way to fix this behaviour? It's really annoying to see
all those "untracked" files that are already under version control when
executing a git-status.

Regards,
Mark
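
The two look-ups described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how git itself is implemented: the "tracked" set and the "on_disk" list are hypothetical stand-ins for the index and for a directory listing, and standard Unicode NFD is used as an approximation of what the filesystem does.

```python
import unicodedata

# Name as it might be recorded in the repository (precomposed, NFC).
tracked = {unicodedata.normalize("NFC", "Lüftung.txt").encode("utf-8")}

# Name as a decomposing filesystem might hand it back in a listing (NFD).
on_disk = [unicodedata.normalize("NFD", "Lüftung.txt").encode("utf-8")]

# Byte-wise comparison: the listed name is not found among the tracked names.
untracked = [name for name in on_disk if name not in tracked]


def nfc(raw: bytes) -> bytes:
    """Re-encode a UTF-8 file name in composed (NFC) form."""
    return unicodedata.normalize("NFC", raw.decode("utf-8")).encode("utf-8")


# Comparing after bringing both sides to one form finds the match.
tracked_nfc = {nfc(name) for name in tracked}
untracked_normalized = [name for name in on_disk if nfc(name) not in tracked_nfc]

print(untracked)             # [b'Lu\xcc\x88ftung.txt']
print(untracked_normalized)  # []
```

The byte-wise check reports the file as untracked even though both names display identically, which matches the symptom reported above.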
-
Hi,
On Wed, 16 Jan 2008, Mark Junker wrote:
> I have some files like "L
More like, Mac OS X has standardized on Unicode and the rest of the
world hasn't caught up yet. Git is the only tool I've ever heard of
that has a problem with OS X using Unicode.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Apple's decision[*] to use _decomposed_ unicode causes all sorts of
little problems because other tools aren't expecting to see strings
changed behind their backs.

I know little about the gritty details, but I see the bug reports...
-Miles
--
Any man who is a triangle, has thee right, when in Cartesian Space, to
have angles, which when summed, come to know more, nor no less, than
nine score degrees, should he so wish. [TEMPLE OV THEE LEMUR]
.
-
As far as I know, Subversion has basically exactly the same problem,
and any time you consume/produce files on Mac OS X that are to be
consumed/produced on other platforms you will run into this kind of
issue, with any software.

Tell Mac OS X to write a file with "ó" in the file name ("\xc3\xb3" in
UTF-8), and it will "normalize" it prior to writing by converting it
into a decomposed form (that is, ASCII "o" followed by "\xcc\x81", or
"combining acute accent"). So they're both valid Unicode, both valid
UTF-8, and they encode exactly the same characters but the byte stream
is different.

If you only work on Mac OS X then this will never be a problem because
all the files you create and therefore all the files you add to your
Git repository will have their names in decomposed UTF-8. But when you
start cloning repositories containing files added on other systems,
systems which might use precomposed rather than decomposed UTF-8, then
you'll run into exactly this kind of problem. The git.git repo has one
such file itself (gitweb/test/Märchen, if I remember correctly, which
Git reports as untracked).

Now, Mac OS X's behaviour is not entirely "insane" as some would
claim; there is indeed a rationale behind it even if you don't agree
with it, but it *does* produce some unfortunate teething problems for
people wanting to use Mac OS X in a cross-platform environment.

Here are some Apple docs on the subject:
http://developer.apple.com/qa/qa2001/qa1173.html
http://developer.apple.com/qa/qa2001/qa1235.html
I personally wish that UTF-8 didn't allow different normalization
forms; then this kind of problem wouldn't arise. But it has arisen and
we have to live with it. Some workarounds have been proposed for Git,
but I haven't seen any convincing proposals yet.

Cheers,
Wincent
-
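
The byte sequences quoted in the message above can be checked with Python's standard unicodedata module. This is a small sketch that assumes standard Unicode NFD as a stand-in for the filesystem's decomposition; the Apple Q&A cited in this thread notes that most volume formats only roughly follow the normal forms.

```python
import unicodedata

composed = "\u00f3"                                  # LATIN SMALL LETTER O WITH ACUTE
decomposed = unicodedata.normalize("NFD", composed)  # "o" + COMBINING ACUTE ACCENT

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed_bytes)                      # b'\xc3\xb3'
print(decomposed_bytes)                    # b'o\xcc\x81'
print(composed_bytes == decomposed_bytes)  # False: different byte streams

# Normalizing back to the composed form recovers the original string.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```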
Hi,
> > > I have some files like "L
To be more exact, the encoding used to _create_ a file differs from the encoding
...which means that the sequence of bytes differs. And Git by design is
(both for filenames and for blob contents) encoding agnostic.

HFS+ is just _stupid_. And unfortunately Git doesn't support stupid
filesystems (e.g. case insensitive filesystems) well.

--
Jakub Narebski
Poland
ShadeHawk on #git
-
There are two different ways to do filesystem encodings. One is to have
the fs simply not care about encoding, which is what the linux world
seems to prefer. Sure, this is great in that what you create the file
with is what you get back, but on the other hand, given an arbitrary
non-ASCII file on disk, you have absolutely no idea what the encoding
should be and you can't display it without making assumptions (yes you
can use heuristics, but you're still making assumptions). Filesystems
like HFS+ that standardize the encoding, on the other hand, make it
such that you always know what the encoding of a file should be, so
you can always display and use the filename intelligently. It also
means it plays much nicer in a non-ASCII world, since you don't have
to worry about different normalizations of a given string referring to
different files (it's one thing to be case-sensitive, but claiming
that "föo" and "föo" are different files just because one uses a
composed character and the other doesn't is extremely
user-unfriendly). On the other hand, what you create the file with may
not be what you read back later, since the name has been standardized.

It's hard to say one is better than the other, they're just different
ways of doing it. However, I have noticed that everybody who's voiced
an opinion on this list in favor of the encoding-agnostic approach
seems to be unwilling to accept that any other approach might have
validity, to the extent of calling an OS/filesystem that does things
differently stupid or insane. This strikes me as extremely elitist and
risks alienating what I expect to be a fast-growing group of users
(i.e. OS X users).

I'm willing to give Linus a free pass on calling other OS's stupid and
insane, as I don't think Linux would exist as it does today without
his strong opinions, but I don't think this should give carte blanche
to th...
There is no technical reason for *kernel* to care about file name
encoding. It is something that can be and should be dealt with in

And also because a user space program can deal with it much more

Wrong. If you have a policy that all file names are stored in UTF-8
encoding then there is no problem here. It should not be a kernel
problem to care about encoding, besides you cannot fully solve it

Yeah, right... Like Microsoft likes to "standardize" everything, which
in practice means forcing on others something fundamentally broken and
that does not follow any existing standard precisely:

===
IMPORTANT:
The terms used in this Q&A, decomposed and precomposed, roughly
correspond to Unicode Normal Forms D and C, respectively. However, most
volume formats do not follow the exact specification for these normal
forms.
===
http://developer.apple.com/qa/qa2001/qa1173.html

Not to mention that the use of decomposed Unicode as the standard is

Somehow I have no problem with displaying non-ASCII names on Linux.
I can see both Unicode Normal Forms C and D encoded symbols without

As you typed them, they both are exactly the same, and both of them are
in Normal Form C (which the Mac calls precomposed). So why do you

I am sure everyone here is scared to death... I mean we are used to
hearing such threats from some MS salespeople, but from a Mac guy? It is
really scary....

Wake up, and stop shooting this nonsense at us. If you have technical
reasons why your solution is better, let us know. So far, you do not
sound very convincing here. Why do you think that the issue of encoding
cannot be dealt with in user space? Why does Mac OS X use so-called
decomposed Unicode, which does not even follow any standard precisely?
Why does Mac OS X choose to decompose characters while it does not

I suppose it would be much better a subject for discussion...
At least, it would be more likely to result in that Git working

First, no one called Mac OS X insane, but case insensitive files...
By the way, calling HFS+ stupid, or rather calling at least two
different normalizations of UTF-8 (two different encodings) used for
writing and reading filenames stupid is wrong _for me_. I have quoted

For me it looks like a layering violation... but my knowledge about
filesystems is close to nil. IMHO it is VFS and libc which should do the

But using one encoding to create a file, and another when reading filenames
is strange. It is IMHO better to simply refuse creating filenames which
are outside the chosen encoding / normalization. But having different
encodings used for reading and writing on the level of filesystem

First, it is Git philosophy and the very core of its design to be encoding
agnostic (to be a "content tracker"). Second, using the same sequence of
bytes on the filesystem, in the index, and in 'tree' objects ensures good
performance... this is something to think about if you want to add
patches which would deal with HFS+ API/UI quirks.

[cut]
--
Jakub Narebski
Poland
-
It's not using different encodings, it's all Unicode. However, it
accepts different normalization variants of Unicode, since it can read
them all and it would be folly to require everybody to conform to its
own special internal variant. But it does have to normalize them,
otherwise how would it detect the same filename using different
normalizations? Also, it may seem strange to have different names
between reading and writing, but that's only if you think of the name
as a sequence of bytes - when treated as a sequence of characters, you
get the same result. In other words, you're used to filenames as

Sure, it makes sense from a performance perspective, but it causes
problems with HFS+ and any other filesystem that behaves the same way.
In the previous discussion about case-sensitivity, somebody suggested
using a lookup table to map between git's internal representation and
the name the filesystem returns, which seems like a decent idea and
one that could be enabled with a config parameter to avoid penalizing
repos on other filesystems. But I don't know enough about the
internals of git to even think of trying to implement it myself.

--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
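
The lookup-table idea mentioned above could look roughly like the following Python sketch. Everything here is hypothetical: build_name_map and resolve are illustrative names, not git functions, and NFC is an assumed choice of canonical key. It maps whatever spelling the filesystem returns back to the exact spelling stored in the index.

```python
import unicodedata


def build_name_map(index_names):
    """Map a normalized key to the exact name stored in the (hypothetical) index."""
    return {unicodedata.normalize("NFC", name): name for name in index_names}


def resolve(fs_name, name_map):
    """Return the index spelling for a name reported by the filesystem, or None."""
    return name_map.get(unicodedata.normalize("NFC", fs_name))


# Precomposed names, as they might have been committed on another system.
index_names = ["gitweb/test/M\u00e4rchen", "README"]
name_map = build_name_map(index_names)

# Decomposed spelling, as a normalizing filesystem might return it.
fs_name = "gitweb/test/Ma\u0308rchen"

print(resolve(fs_name, name_map) == index_names[0])  # True
print(resolve("untracked.txt", name_map))            # None
```

One caveat of this sketch: if two index entries differ only in normalization, they collapse onto one key, so a real implementation would have to decide how to report that collision.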
Hi,
But that's the _point_! It _is_ Unicode, yet it uses _different_
encodings of the _same_ string.

Now, this discussion gets really annoying. The real question is: will you
do something about it, or reply with another 500-line email?

Ciao,
Dscho
-
I wish I could do something about it. But right now I'm a full-time
student trying to do contracting jobs on the side, and I don't believe
I have the time to learn enough about the guts of git to try and make
any changes to something as core as index filename handling. I just
want people here to recognize that this is a valid problem instead of
simply dismissing it as "HFS+ is insane, let's just ignore this issue".

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
That's a singularly *stupid* argument.
Here, let me rephrase that same idiotic argument:

"But it does have to uppercase them, otherwise how would it detect the
same filename using different cases?"

..and if you don't see how that's *exactly* the same argument, you really
are stupid.

The fact is, normalization is wrong.

It's wrong when you normalize upper/lower case (no, the word "Polish" is
not the same as "polish"), and it's equally wrong when you normalize for

No. HFS+ treats users as idiots and thinks that it should "fix" the
filename for them. And it causes problems.

It causes problems for exactly the same reasons case-independence causes
problems, because it's EXACTLY THE SAME ISSUE. People may think that "but
they are the same", but they aren't. Case matters. And so does "single
character" vs "two character overlay".

Does it always matter? Hell no. But the problem with a filesystem that
thinks it knows better is that when it *sometimes* matters, the filesystem
simply DOES THE WRONG THING.

Can't you understand that?
Linus
-
Side note: there are ways to do it right.
You can:

- not do conversion at all (which is always right). Not corrupting the
user data means that the user never gets something back that he didn't
put in

(And, btw, the "security" argument is total BS. The fact that two
characters look the same does not mean that they should act the same,
and it is *not* a security feature. Quite the reverse. Having programs
that get different results back from what they actually wrote, *that*
tends to be a security issue, because now you have a confused program,
and I guarantee that there are more bugs in unexpected cases than in
the expected ones)

- Not accept data in formats that you don't like. This is also always
right, but can be rather impolite.

- Not accept data in formats that you don't like, and give people
explicit conversion and comparison routines so that they can then make
their own decisions and they are *aware* of the conversion (so that
they don't come back to the problem of being confused)

So there are certainly many ways to handle things like this.

The one thing you shouldn't do is to silently convert data behind the
program's back, without even giving any way to disable it (and that disable
has to be on a use-by-use basis, not some "disable/enable for all users of
this filesystem", because you can - and do - have different programs that
have different expectations).

And finally: all of the above is true at *all* levels. It doesn't matter
one whit whether the automatic conversion is in the kernel or
in a library. Doing it on a library level has advantages (namely the whole
"disable/enable" thing tends to get *much* easier to do, and applications
can decide to link against a particular version to get the behaviour
*they* want, for example).

So doing it inside the kernel is just about the worst possible case,
exactly because it makes it really hard to do on a "case-by-case" basis.

Yes,...
You're right, it doesn't actually have to store the normalized form.
And yes, it's possible to compare without normalizing them.
Admittedly, I don't know much about the implementation details of
unicode, but I would assume that the easiest way to compare two
strings is to normalize them first. But in the case of the filesystem,
normalization actually is important if you're thinking about filenames
in terms of characters rather than bytes. When I feed the filesystem a
given unicode string, it has to find the file I'm talking about -
should it do a relatively expensive unicode-sensitive comparison of
all the filenames with the one I gave it, or should it just normalize
all names and do the much cheaper lookup that way? I don't know about
you, but I'd prefer to let my filesystem normalize the name and run

There's a difference between "looks similar" as in "Polish" vs
"polish", and actually is the same string as in "Ma<UMLAUT
MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid
semantic meaning, normalization doesn't. The only way to argue that
normalization is wrong is by providing a good reason to preserve the
exact byte sequence, and so far the only reason I've seen is to help
git. Applications in general don't care one whit about the byte
sequence of the filename, they care about the underlying file the name
represents. Additionally, it would be a terrible experience for a user
to enter "Märchen" and have the application say "sorry, I can't find
this file" simply because the application used decomposed characters
and the filename used composed characters. Unless the user is
knowledgeable about the OS, filesystems, and unicode, they wouldn't

How do you figure? When I type "Märchen", I'm typing a string, not a
byte sequence. I have no control over the normalization of the
characters. Therefore, depending on what p...
That simply isn't true.

Normalization actually has real semantic meaning. If it didn't, there
would never ever be a reason why you'd use the non-normalized form in the
first place.

Others have argued the exact same thing for capitalization. "A" is the
same letter as "a". Except there is a distinction.

The same is true of "a<UMLAUT MODIFIER>" and "<a WITH UMLAUT>". Yes, it's
the same "character" in either case. Except when there is a distinction.

And there *are* cases where there are distinctions. Especially inside
computers. For one thing, you may not be talking about "characters on
screen", but you may be talking about "key sequences". And suddenly
"a<UMLAUT MODIFIER>" is a two-key sequence, and "<a WITH UMLAUT>" is a
single-key sequence, and THEY ARE DIFFERENT.

See?

"a" and "A" are the same letter. But sometimes case matters.

Multi-character UTF-8 sequences may be the same character. But sometimes
the sequence matters.

Git doesn't care. Just use the *same* sequence everywhere. Make sure
something doesn't change it. Because if something changes it, git will

Pure and utter garbage.

What you are describing is an *input method* issue, not a filesystem
issue.

The fact that you think this has anything what-so-ever to do with
filesystems, I cannot understand.

Here's an example: I can type Märchen two different ways on my keyboard: I
can press the 'ä' key (yes, I have one, I have a Swedish keyboard), or I
could press the '¨' key and the 'a' key.

See: I get 'ä' and 'ä' respectively.

And as I send this email off, those characters never *ever* got written as
filenames to any filesystem. But they *did* get written as part of
text-files to the disk using "write()", yes.

And according to your *insane* logic, that write() call should have
converted them to the same representation, no?

Hell no! That conversion has absolutely nothing to do with the filesystem.
It's done at a totally different layer that actual...
The problem is that you don't control the sequence that everybody uses.
See this example:
melo@speed(~)$ uname -a
Linux speed.simplicidade.org 2.6.9-55.ELsmp #1 SMP Wed May 2 14:28:44
EDT 2007 i686 i686 i386 GNU/Linux
melo@speed(~)$ set | grep LANG
LANG=en_US.UTF-8
melo@speed(~)$ mkdir t
melo@speed(~)$ cd t
melo@speed(~/t)$ git init
Initialized empty Git repository in .git/
melo@speed(~/t)$ touch á
melo@speed(~/t)$ git-add á
melo@speed(~/t)$ git-commit -m "added a in utf8"
Created initial commit 7a473a2: added a in utf8
0 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 "\303\241"
melo@speed(~/t)$ export LANG=en_US
melo@speed(~/t)$ touch á
melo@speed(~/t)$ ls -la
total 12
drwxrwxr-x 3 melo melo 4096 Jan 16 23:44 .
drwx--x--x 31 melo melo 4096 Jan 16 23:43 ..
-rw-rw-r-- 1 melo melo 0 Jan 16 23:44 á
-rw-rw-r-- 1 melo melo 0 Jan 16 23:43 á
drwxrwxr-x 8 melo melo 4096 Jan 16 23:43 .git
melo@speed(~/t)$ git-add á
melo@speed(~/t)$ git-commit -m "added a in iso-latin-1"
Created commit 4282fca: Oláx!
0 files changed, 0 insertions(+), 0 deletions(-)
create mode 100644 "\341"

So two (simulated in this test) users who use different LANG settings
will be in trouble in no time.

What I take from this conversation is that I have to specify, for
each project I work on, which encoding we should use, across all
users, before they start using git with files with accented chars.

The difference I see between us is that if I tell my filesystem that
I want to name my file with a particular string encoded in X, users
using encoding Y will be able to read it correctly. I like my
filesystem to make that work for me.

Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
-
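
The two quoted names in the session above, "\303\241" and "\341", are the same character "á" encoded as UTF-8 and as ISO Latin-1 respectively. The following Python sketch reproduces them. The octal_quote helper is a hypothetical, simplified imitation of the backslash-octal quoting seen in the commit output, not git's actual quoting code.

```python
name = "\u00e1"  # "á"

utf8_bytes = name.encode("utf-8")      # what a UTF-8 locale writes
latin1_bytes = name.encode("latin-1")  # what an ISO Latin-1 locale writes


def octal_quote(raw: bytes) -> str:
    """Show non-ASCII bytes as backslash-octal escapes inside double quotes."""
    body = "".join(chr(b) if 32 <= b < 127 else "\\%03o" % b for b in raw)
    return '"' + body + '"'


print(octal_quote(utf8_bytes))     # "\303\241"
print(octal_quote(latin1_bytes))   # "\341"
print(utf8_bytes == latin1_bytes)  # False: two distinct file names on disk
```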
The difference I see between us is that when I tell you that this is
exactly the same thing as your file *contents*, you don't seem to get it.

An OS that silently changes the contents of your files is *crap*.
Get it?
An OS that silently changes the contents of your directories is *crap*.
Get it now?
Linus
-
This is the same issue as the CRLF issue I posted on earlier, and it
all stems from the fact that git also sees file names as a stream of bytes, not

A program that silently ignores the conventions of the platform it runs
on is *crap*, no matter if the conventions are not the same as for

A program that silently ignores the conventions of the file system it
tries to store its files on is *crap* :-)

In my perfect world, file names would be stored as a string of characters,
so if I save a file with an å in it, that å would be preserved no
matter if I run Linux on ext2 with my locale set to latin-1 (which
stores it as byte 0xE5), on Windows with NTFS (which stores it as the
UTF-16 code 0x00E5), on Windows/DOS with FAT (which stores it as the
byte 0x86) or on Mac OS X which stores it as decomposed UTF-8 (whose
byte sequence I don't know off the top of my head). If that was just
stored as U+00E5 in whatever encoding in the filename index, the local
implementation of git could just check it out in the form needed.

--
\\// Peter - http://www.softwolves.pp.se/
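
The byte values listed above, including the decomposed UTF-8 sequence the message leaves open, can be printed with a short Python sketch. Two assumptions are made here: code page 437 stands in for the DOS/FAT code page, and standard Unicode NFD stands in for the Mac OS X decomposition.

```python
import unicodedata

char = "\u00e5"  # "å"

encodings = {
    "latin-1": char.encode("latin-1"),
    "utf-16-be": char.encode("utf-16-be"),
    "cp437": char.encode("cp437"),
    "utf-8 (NFC)": char.encode("utf-8"),
    "utf-8 (NFD)": unicodedata.normalize("NFD", char).encode("utf-8"),
}

# Print each encoding's bytes in hexadecimal, one encoding per line.
for label, raw in encodings.items():
    print(f"{label:12} {raw.hex(' ')}")
```

Under these assumptions the decomposed form is the ASCII letter "a" (0x61) followed by the two bytes 0xCC 0x8A, the UTF-8 encoding of U+030A COMBINING RING ABOVE.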
-
You have to be careful about CRLF conversion, lest you corrupt your

The Git philosophy of seeing the contents of files and the "contents" of
directories (filenames) as streams of bytes, i.e. using the 'native'
encoding, works perfectly well and _fast_ if all developers work in the
same environment.
Troubles start if you are working across operating systems, and across

Git has long had i18n.commitEncoding, and for some time it has saved
it in the 'encoding' header of the commit object (if different from
'utf-8'); it also has i18n.logOutputEncoding.

For dealing with different filesystem encodings you would also have
to have both: the encoding used in 'tree' objects (by the repository) for
filenames, saved somewhere in the repository, either in the tree object
(argh!) or in some kind of .gitconfig file; and the encoding used by the
filesystem, in the repository config as i18n.filesystemEncoding or
something like that.
And think about what to put in the on-disk index, and in the in-memory index.

NOTE, NOTE, NOTE! If a filename is used somewhere in the file contents
(manifest-like file, include-like statement), and this filename uses
characters which are encoded differently in different encodings, you
are screwed with this fancy system, badly, anyway.

--
Jakub Narebski
Poland
-
Hi,
I get that you think it's the same thing.

What I don't get is why a user should be forced to know what type of
encoding he and the other users are using on all the layers going
down to the filesystem. If two users on different systems, or in
different configurations, choose the same unicode string as the name,
why do we need to make it harder for things to just work out?

The content of the file is sacred, we both agree on that. We disagree
on the filename, because for me it's more important that equal
strings, even if encoded to different byte sequences, should be

I was not talking about content of files, those are sacred. I was
talking about filenames. Those *for me* are not, but are for you. No
problem, we just have different values: I want my computer to work
for me, not me working for the computer. I'm willing to accept a file
system or other layer that normalizes encoding of filenames if that
makes the end-user's life easier, especially in a tool distributed by

As I said before, we disagree on file meta-data, not on file
contents. For you, byte in must be the same byte out. For me string
in must be the same string out.

And as I said in the previous email, what I learned today is that in
a distributed project using git, if you need to use accented
characters, I need to tell all the users to use the same LANG settings.

It's important information, at least for me.
Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
-
Hi,
Why should the filename be _stored_ normalised? I agree on the lookup,
yes, but not the storage.

Hth,
Dscho
-
If you do the normalization in the right place, things will just work
Well, as the issue shows it does not make life for the end-user easier.
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
Hello,
No problem, but don't you think that git should do it?

Don't you think it's important in a distributed tool that no matter
what system they use, be it linux or solaris, they are able to talk
about a file with non-ascii chars and have it be the same file to both
of them?

That's the point I'm making. The fact that I need to set LANG across

I'm assuming you are talking about HFS+ and the strange normalization
it does.

I'm sorry but that was not the problem I sent. I sent a scenario in
which two users, using the same linux system but with different LANG
settings, cannot use git reliably.

Although this thread started because of HFS+ "choices", the problem
is not really related to HFS+ given that you can have the same issues
even on the same physical <insert flavor here> POSIX system.

Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
-
I don't think I'd call that "insane" (in fact, I think these
discussions would be much less irritating for all involved if we
didn't use that word so often, even when it's not called for). It's
not that different than the whole LF/CRLF line-ending thing.

The real problem is that setting LANG won't help you on Mac OS X; set
LANG to whatever you want and there is *nothing* that you can do to
stop your filenames being normalized into decomposed UTF-8, short of
dropping HFS+. You can use an alternative filesystem, but support for
basically everything except HFS+ is suboptimal in Mac OS X at the
moment.

Cheers,
Wincent
-
Hi,
On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
> El 17/1/2008, a las 1:40, Pedro Melo escribi
One of the advantages (the biggest one, in fact, apart from the obvious
US-ASCII down-compatibility and the fact that you can do C-compatible
NUL-terminated strings) of UTF-8 is that it's locale-independent, and
doesn't care about LANG, because it's valid in all languages.

And that's really important. It's important for a very simple reason:
there is almost never such a thing as "a locale" except for US-ASCII. Once
you move away from US-ASCII, it actually tends to be much more common that
you have a *mixture* of locales - often in the same "document" - than to
have one single locale.

It very much happens even in filenames - people "mix" locales in trivial
ways even within a single pathname component (non-US-ASCII filename, but
with a regular file extension), but much more interestingly they do so
within a directory tree (ie you can have translation subdirectories where
the filenames themselves are in another language, and you can have full
pathnames where different components are in different languages, for
example).

And UTF-8 is _wonderful_ for this, because LANG doesn't matter, and
cannot matter, and thus mixing isn't a problem.

Of course, you can screw it up. Locales still can change things like sort
order and capitalization etc, so even if you use UTF-8, you sure can get
into trouble with LANG and thinking that a per-session locale makes sense.

So choosing UTF-8 for the filesystem isn't wrong per se. It's a fine
choice, and has no issues with LANG in itself. Limiting it to strictly
valid UTF-8 encodings is also fine. Limiting it (further) to only
character normalized UTF-8 is also fine.

Most Linux filesystems don't limit it in any way, so you can make
filenames that aren't valid UTF-8 at all, much less normalizing
multi-character sequences.

I personally think that's the best option, but I probably do so mostly
because I know some people still use Latin1 as their only locale (and I
suspect Asia will take decades before it has converted t...
Alright, you've made your point, and I'm willing to concede at least
some of what you've said. So perhaps we can now move onto the more
relevant and practical issue of: HFS+, despite how stupid it may or
may not be, normalizes filenames (and is case-insensitive, which is a
related issue). This causes a problem with git. How can this be solved?

I'm more than willing to do work to solve it, my biggest issue is I
don't believe I actually have the free time to learn the git internals
well enough to actually do proper work on what I would assume is a
fairly performance-critical section of git's code. However, I would be
happy to work with others who are perhaps more knowledgeable in this
area.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
My understanding is that normalization is there to help the computer.
That doesn't give it any semantic meaning, because all normal forms of

The argument for case insensitivity is different than the argument for
normalization. I certainly hope you understand why they are different

You're right, sometimes the sequence matters. As in key sequences. But
we're not talking about key sequences, we're talking about strings.
And how am I supposed to use the same sequence everywhere? When I type
"Märchen", I don't know which form I'm typing, nor should I. It's not
something that I, as a user, should have to know. Especially if I pass
this name through various other utilities before using it - I have no
idea if another utility is going to end up normalizing the name, and

On a US keyboard I only have one way of typing ä, and I have no idea
whether it ends up precomposed or decomposed in the resulting byte
stream. And I don't care. Because I'm typing characters, not bytes. I
could be typing in a file in ISO-Latin-1 and I still wouldn't care,
because it looks the same to me. If my filesystem did make a
distinction between the normal forms, and I see that I have a file
named "Märchen", how am I supposed to type that at my keyboard? I
don't know which normal form it's using.

The fact that you think the normalization of the string matters, I

What a fabulous straw man argument you just put together. I hope you

I'm speaking as a user, and as such, I shouldn't even have to know
that it's possible to write the same character in multiple different
ways. As a user, HFS+ behaves exactly the way I want it to. You were
talking earlier about not messing with the "user data", but what is
the "user data"? It's the string, not the byte sequence. That's all I
care about - the string. That's all the OS cares about, that's all any
application I use cares a...
The thing is, you seem to argue that what OS X does helps you as the user.

But you are arguing based on incorrect assumptions.

First off, we've had years and years and years of usage of non-corrupting
filesystems (pretty much every UNIX OS around since day 1, and many other
OS's too), and it's simply not true that it's a problem. You see the
filename in the file dialog, and you open it, and you're done. OS X isn't
any "easier" in this regard.

In fact, this whole thread comes from the fact that the OS X choice that
you *think* is easier, is in fact not easier at all. It's not easier for
the user, it's not easier for the application programmer, and the really
sad part is that it's very much *not* easier for OS X itself either (ie
they had to literally write extra code with nasty tables to do it, and it
really does hurt them in performance and complexity).

And _that_ is why the OS X situation is so sad. Apple literally added
extra code to make things slower and more complex *and* harder to use
reliably.

Does it show up in normal behaviour? Of course not. You'd probably never
see it in real life outside of test-suites. People simply don't even tend
to use filenames outside of US-ASCII, and when they do use them, input
methods really *do* tend to do the normalization for you.

But when it comes to automation (which is what computers are all about),
the OS X choice is literally the wrong one. And there's no _upside_. It's
all downside. Which is why it's so stupid.

I bet it only exists because OS X engineers didn't really even think about
it, and they just assumed that "normalization is helpful". They took your
stance - thinking it was worth it, without ever really thinking it
through.

Linus
-
On Jan 16, 2008, at 8:16 PM, Linus Torvalds <torvalds@linux-foundation.org
I believe it exists because HFS+ was created at a time when the Mac
was moving from a multi-encoding world (which was a nightmare) to a
Unicode world and they wanted to remove ambiguity in filenames. But I
wasn't around when they made this decision so this is just a guess.

-Kevin Ballard
-
I do agree. And I think starting out case-insensitive (something they must
really hate by now) also made it less of an issue. When you're
case-insensitive, the issues with any UTF-8 normalization are simply
swamped by all the issues of case, so you probably don't even think about
it very much.

The big problem with any name rewriting is that I can open file 'xyz', and
I literally have a very hard time knowing whether that file I know I
opened and created has anything to do with the file 'Xyz' that I see when
I do a readdir().

Are they the same? Maybe. But it's literally hard to tell on OS X. I can
do an fstat() on my file descriptor and on the directory entry, and if
they get the same d_ino they *probably* are the same entry, but even then
it actually could have been a hardlink (and my 'xyz' is really *another*
name for it entirely, and the filesystem is actually case-sensitive and
'Xyz' was a *different* name that somebody else did!).

See? If you're creating a content tracker, these kinds of issues are not
"idle chatter". It's really *really* important. Was that file the one I
was told to track? Or was it a temporary file that was just hardlinked?

This is why case-insensitivity is so hard: you have a very real "aliasing"
on the filesystem level, where all those really *different* pathnames end
up being the same thing.

And all the same issues show up with utf-8 rewriting, so if you normalize
utf-8 names, you actually end up having almost all the same problems that
a case-insensitive filesystem has. They're just much rarer in practice, so
you just won't hit them as often - but when you do, they are equally
painful!

(In fact, they can be a whole lot *more* painful, because now they are
really rare, and really confusing when they happen!)

But if you come from a case-insensitive background, all the UTF-8
rewriting really looks like such a small problem compared to all the
horrid problems that you had with different locales and cases, so I
suspe...
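
The inode check described above can be sketched roughly as follows. This is only an illustration in Python, not anything git does; the helper name same_inode and the file names are made up for the example. It shows why a matching inode is inconclusive: a hard link under a completely different name matches just as well as a case-folded alias would.

```python
import os
import tempfile


def same_inode(path_a, path_b):
    """Return True when both paths resolve to the same device and inode."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        opened = os.path.join(tmp, "xyz")
        with open(opened, "w") as handle:
            handle.write("tracked content\n")

        # A hard link is a second, independent name for the same inode.
        alias = os.path.join(tmp, "other-name")
        os.link(opened, alias)

        unrelated = os.path.join(tmp, "unrelated")
        with open(unrelated, "w") as handle:
            handle.write("something else\n")

        return same_inode(opened, alias), same_inode(opened, unrelated)


linked_matches, unrelated_matches = demo()
print(linked_matches, unrelated_matches)
```

A match therefore says "same file", never "same name", which is the distinction a content tracker needs.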
I hope you're right (about them hating it), but we'll see. They've
just opened the source for the ZFS port they're working on. By the
time it goes final and becomes the default FS, replacing HFS+,
probably within a couple of years, we'll see if they make the same two
design decisions which cause the kinds of problems being discussed
here (case-insensitivity, and ubiquitous FS-level UTF-8 normalization).

I've done a dumb search in the ZFS source code for "CASE" and see that
it can in theory support case-insensitivity as an optional feature.
The potential is there for Apple to use this. I personally hope that
they don't, because as has already been pointed out, these little
tricks tend to make life more difficult for users rather than helping
them (the day I have two files in the same directory called "Märchen"
and want to specify one of them on the command line I'll worry about
that when I come to it).

http://fuzzy.wordpress.com/2007/06/09/zfsandfilesystemoptions/

Cheers,
Wincent
-
Side note: the thing is, the reason people shouldn't worry about it is
that this is a *trivial* thing to handle. You really don't even need to
know what you're doing. And you can test it today, easily.

Having two (differently encoded) files like that is really no different
from the traditional UNIX FAQ of "how do I remove a file starting with
'-'" or even more closely "how do I remove a file that has a character in
it that I cannot get at the keyboard".

In other words, on a bog-standard UNIX (and yes, in this case, I bet OS X
works fine too for this test), just try this

    filename1=$(echo -e "hello\002there")
    filename2=$(echo -e "hello\003there")
    echo Odd file > "$filename1"
    echo Another odd file > "$filename2"

and now you have a filename that is actually rather hard to type on the
command line. In fact, for me they even *look* the same:

    [torvalds@woody ~]$ ll hello*
    -rw-rw-r-- 1 torvalds torvalds  9 2008-01-17 08:23 hello?there
    -rw-rw-r-- 1 torvalds torvalds 17 2008-01-17 08:23 hello?there

See?
Even in my graphical browser, those two filenames look 100% *identical*. I
could give you a screen-shot, but I'm lazy. Just take my word for it, or
just fire up konqueror on Linux (but it may well depend on the particular
font you're using).

[ And yes, for other browsers, you might have something that shows them as
different characters - depending on the font, it might show up as a
small box with [00 02] vs [00 03] in it, for example. But that's also
actually 100% true of the two different encodings of 'ä' - you could
easily have a file browser that shows the multi-character as a
multi-character, exactly to distinguish them and show that one of them
isn't "normalized"!

The point is, once the filesystem doesn't corrupt the data, it's always
easy to get at, and there is never any ambiguity. ]

How is this different from "Märchen" spelled with two different encodings
for that "ä"?

I'll tell you: it's not at all differen...
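
As a small illustration of that last point, here is a sketch (Python, chosen only for brevity; the helper name escaped is made up) of how a tool could display the two spellings of "Märchen" so that they are distinguishable, much like the boxed [00 02] / [00 03] rendering mentioned above.

```python
precomposed = "M\u00e4rchen"   # 'ä' as the single code point U+00E4
decomposed = "Ma\u0308rchen"   # 'a' followed by U+0308 COMBINING DIAERESIS


def escaped(name):
    """Render a name with every non-ASCII code point shown as an escape."""
    return ascii(name)


print(escaped(precomposed))
print(escaped(decomposed))
```

Both names may render identically in a terminal, but the escaped forms differ, so there is no ambiguity about which file is meant as long as the filesystem keeps the names apart.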
With the exception of Unicode. If you check the standard, two Unicode
codepoints (i.e. the numeric value that gets stored on disk) *can* map
to the same character, hence they are the same. They don't just look the
same, they are the same character -- even if the codepoints are
different (i.e. precomposed vs. decomposed characters). In fact, part of
the Unicode standard deals with that. (Technically, Unicode calls it
equivalence, but what the hey).

In other words, Unicode treats e.g. both U+0065 and U+00E9 as
fundamentally the same character. This comes even more into play in such
alphabets as Hangul (Korean) and the Japanese Kana.

--
JM Ibanez
Software Architect
Orange & Bronze Software Labs, Ltd. Co.

jm@orangeandbronze.com
http://software.orangeandbronze.com/
-
Hi,
As Linus _already_ pointed out, you are confusing characters with glyphs.
Hth,
Dscho
-
Someone is.
He is referring to the Unicode definition of an (abstract) character.
Ch3.4 D11 - "A single abstract character may also be represented by a sequence
of code points—for example, latin capital letter g with acute may be represented
by the sequence <U+0047 latin capital letter g, U+0301 combining acute accent>,
rather than being mapped to a single code point."

-- robin
-
But if you want to make it clear, you can use "encoded character" or yes,
"code point".

But the thing is, even the unicode standard tends to just say "character",
and a unicode string (for example) is defined to be a sequence of "code
units" which in turn is about those *encoded* characters, which is all
about the code points.

So you'll find that they are very careful in some technical definition
parts to talk about "code points", but then in other sequences they talk
about "character" even though they are referring to the actual code point
(ie the figure literally has the unicode number in it!)

In fact, they sometimes even talk about "characters" in the totally
non-encoding meaning of "glyph".

So yes, "character" is often ambiguous. It would be good to never use the
word at all, and only talk about "code point" and "glyph" and one of the
well-defined special terms like "combining character" or "replacement
character".

But to take a representative example from The Unicode Standard, Chapter 2:
"Unicode Design Principles":

    Characters are represented by code points that reside only in a memory
    representation, as strings in memory, on disk, or in data transmission.
    The Unicode Standard deals only with character codes.

(any speling mistakes mine). In other words, from the very beginning of
the standard, very basic design principles chapter, it starts talking
about characters being represented by code points and explicitly says that
it really only deals with CHARACTER CODES.

Yes, I'm sure you can argue ad infinitum that all the "equivalences" and
other crap means that a "character" can sometimes mean just about
anything, but I'd say that it's pretty damn reasonable to equate "unicode
character" with "code point" or "character code".

Linus
-
So they are not the same after all? It is just you don't care
about what it actually says, right? How about this: Unicode
provides a unique number for every character. So, if numbers
are not the same then by definition of the Unicode standard

There is no notion of "fundamentally the same character" in the Unicode
standard as far as I know, and the characters you mentioned are very
different in Unicode:
http://www.fileformat.info/info/unicode/char/0065/index.htm
http://www.fileformat.info/info/unicode/char/00e9/index.htm
They have different names, they have different glyphs, and they
are functionally different.

Dmitry
-
Sorry, but you're using different characters that look the same. But
Kevin's point was that it's a different thing if you use two characters
that look the same or the same character with different encodings. This
makes this HFS-specific problem different from the "look the same"- or
the "case-insensitivity"-issues.

BTW: I also read about your argument that you wouldn't convert file data
to normalized UTF-8 (I agree with you that this would be nonsense) and
therefore filenames shouldn't be converted either. This is something where
I have to disagree because a filename (like ctime, mtime, atime, ...)
is metadata (while file contents aren't) and - until now - I would've
guessed that you agree on this point because git doesn't care about
filenames but contents.

IMHO it would be the best solution if git stored all string metadata
in UTF-8 and converted it to the target system's file system encoding.
That would fix all those problems with different locales and file system
encodings ...

However, I have to agree that the enforced character set conversion
causes more problems than it solves.

Regards,
Mark
-
But that's exactly the case he gave - 'ä' vs 'a¨' are exactly that:
different strings (not even characters: the second is actually a
multi-character) that just look the same.

You try to twist the argument by just claiming that they are the same
"character". They aren't, unless you *define* character to be the same as
"glyph". Of course, if you claim that, then you can always support your
argument, but I claim that is a bogus and incorrect axiom to start with!

Too many people confuse "character" and "glyph". They are different.
See, for example
http://en.wikipedia.org/wiki/Unicode
and notice the *many* places where they try to make that distinction
between "character" and "glyph" clear (and also "code values", which are
the actual bytes that encode a character).

See also
http://en.wikipedia.org/wiki/Unicode_normalization
and realize that a Unicode sequence is a sequence of *characters* even if
it is not normalized! Those things are still characters, when they are the
"simpler" non-combined characters.

You are trying to make a totally BOGUS argument, and you base it on the
INCORRECT basis that the TWO characters 'a'+'¨' somehow aren't independent
characters. They *are*. They are *different* characters from 'ä', even
though they may be "Canonically equivalent" as a sequence.

The fact is that "equivalent" does not mean "same". Why cannot people
accept that?

Linus
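
The distinction being argued over can be made concrete with a short sketch. It uses Python's standard unicodedata module purely as an illustration; nothing here is taken from git or from HFS+.

```python
import unicodedata

single = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS, one code point
pair = "a\u0308"    # LATIN SMALL LETTER A + COMBINING DIAERESIS, two code points

# "Same": identical sequences of code points.
same = single == pair

# "Canonically equivalent": identical only after normalizing both sides.
equivalent = (
    unicodedata.normalize("NFC", single) == unicodedata.normalize("NFC", pair)
)

utf8_single = single.encode("utf-8")
utf8_pair = pair.encode("utf-8")

print(same, equivalent, utf8_single, utf8_pair)
```

The two strings compare unequal and encode to different bytes; they only become equal once someone chooses to normalize them, which is a decision about comparison, not a property of the stored data.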
-
Hi,
I'll shut up now if you can answer me one question, because it
really is a problem for my team.

We have people using windows, people using Macs, and people using
several flavors of Linux desktops. They all have different settings
and if I add a file like áéióú that happens to be UTF-8 encoded, it
will reach an iso-latin-1 user as visual garbage. git will track the
file perfectly, we know that, because the sequence of bytes that my
system used to create the file will be the same on all "sane"
systems, but the file will look "funny" to some users, and we get
complaints from some less enlightened ones.

The answer is that users should not create filenames with non-ascii
characters if they want a consistent experience, right?

This is just so that I can write a best practices document for them...
Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
-
I can't really suggest anything else than trying to make everybody use
UTF-8.

[ Not just for filenames, by the way - this is one of the reasons I think
it is so *important* to not corrupt filenames, exactly because this is
in no way filename-specific at all, and filenames are generally "textual
data" exactly the same way a text-file is.

But only totally insane people think that you should force-normalize
text-files, even though all the issues are obviously all the same
regardless of whether it's a filename or a word in textfile. ]

And yes, I also realize that it's not going to be realistic. We're
probably *closer* to that than we used to be, but I don't think you can
even make Windows think FAT is UTF-8.

I don't know how NTFS works (I know it is Unicode-aware, and I think it
encodes filenames in UCS-2 or possibly UTF-16, but there is an obvious 1:1
translation to UTF-8, and since we use C strings, I'd assume/hope Windows
actually uses that unambiguous translation for any filenames).

Under modern Linux and OS X, UTF-8 is basically the only way (older Linux
distros may be set up for Latin1, but at least the newer ones seem to all

Oh, absolutely. That takes care of 99.9% of all source projects. Even then
you can have problems with case insensitivity (the Linux kernel sources
are all US-ASCII filenames, for example, but *literally* has many files
that are identical if you ignore case, and that's not unheard of).

So yes, to a first approximation, the answer is to simply avoid using
anything but US-ASCII. It's seldom a big limitation when talking about
filenames.

Linus
-
But they are not different strings, they are canonically equivalent as
far as Unicode is concerned. They're even supposed to map to the same
glyph (if the font has an "ä", it should display it in both cases, if
it has an "a" and a combining diaeresis, it should make up one).

You cannot do a binary comparison of text to see if two strings are

Whereas you are confusing characters and code points.
"ä" and "a¨" use different code points, but they encode the same
character, and from the user's perspective it is the *character* that

Actually, NTFS is a bit broken. It sees file names as a string of
16-bit words. It doesn't check that it is valid UTF-16, or even valid
UCS-2, it allows almost anything.

Apple made Mac OS X handle filenames properly, by seeing that file
names are a string of characters, not code points, so they use a
canonical form for all characters (personally, I would have preferred
the pre-composed form, though).

--
\\// Peter - http://www.softwolves.pp.se/
-
Fuck me with a spoon.
Why the hell cannot people see that "equivalent" and "same" are two
.. and this is relevant how? They are different strings. Not the same.
Equivalence doesn't matter. Equivalence is *evil*. Equivalence is what
gives us case-insensitive filesystems ("because the names are
equivalent").

Filesystems don't *want* equivalence. They want a much stronger exactness
guarantee. Exactly because sometimes the differences matter.

Linus
-
It is relevant because the Mac OS file system stores file names as a
sequence of Unicode code points, in an (apparently slightly modified)
normalized form, whereas Git prefers to see file systems that store
file names as a sequence of octets, which may, or may not, actually map
to something that the user would call characters.

I happen to prefer the text-as-string-of-characters (or code points,
since you use the other meaning of characters in your posts), since I
come from the text world, having worked a lot on Unicode text
processing.

You apparently prefer the text-as-sequence-of-octets, which I tend to
dislike because I would have thought computer engineers would have
evolved beyond this when we left the 1900s.

But the real issue is that Git cannot use its filenames as strings of
octets on Mac OS X, since the file system doesn't handle it. So Git
needs to do something sensible. That's part of porting. Preferably
that would involve supporting real Unicode file names, which would also
work on Windows (through its UTF-16 file APIs), and in part on other
systems (through conversion to the systems' locale encoding).

--
\\// Peter - http://www.softwolves.pp.se/
-
No. The *only* issue is that git doesn't normalize.
You can think of git as a UTF-8 namespace all you want, and it will work
together wonderfully with OS X.

Some of us just know what we're doing, and have been working with UTF-8
for a long time. It's not about sequence-of-octets, it's about not
corrupting the data.

You think data should be changed behind people's backs, potentially causing
corruption due to unintended conversions. And I don't.

You can call me "left behind in the 1900s", but that's apparently because
you don't understand the issues. Data corruption wasn't something that
magically became ok just because we switched into a new century.

Linus
-
I agree. Every single problem that I can recall Linus bringing up as a
consequence of HFS+ treating filenames as strings is in fact only a
problem if you then think of the filename as octets at some point. If
you stick with UTF-8 equivalence comparison the entire time, then
everything just works.

Granted, this is a problem when you have to operate on a filesystem
that thinks of filenames as octets, but as I said before, this doesn't
mean the HFS+ approach is wrong, it just means it's incompatible with
Linus's approach.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
At *some* point everything stored in computers is a sequence of octets.
In fact, the whole point of the Unicode standard is to define characters
and how to map each character to a unique number (code points) and then

There is more than one equivalence comparison. The Unicode standard
defines at least two, and for some other purpose you may want to use
some others, but for some reason you are trying to present it as if
working with text means following only one type of equivalence the entire
time...

Dmitry
-
You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.
The fact is, text-as-string-of-codepoints (let's make the "codepoints"
obvious, so that there is no ambiguity, but I'd also like to make it clear
that a codepoint *is* how a Unicode character is defined, and a Unicode
"string" is actually *defined* to be a sequence of codepoints, and totally
independent of normalization!) is fine.

That was never the issue at all. Unicode codepoints are wonderful.
Now, git _also_ heavily depends on the actual encoding of those
codepoints, since we create hashes etc, so in fact, as far as git is
concerned, names have to be in some particular encoding to be hashed, and
UTF-8 is the only sane encoding for Unicode. People can blather about
UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is
simply technically superior in so many ways that I don't even understand
why anybody ever uses anything else.

So I would not disagree with using UTF-8 at all.
But that is *entirely* a separate issue from "normalization".
Kevin, you seem to think that normalization is somehow forced on you by
the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE.
Normalization is a totally separate decision, and it's a STUPID one,
because it breaks so many of the _nice_ properties of using UTF-8.

And THAT is where we differ. It has nothing to do with "octets". It has
nothing to do with not liking Unicode. It has nothing to do with
"strings".

In short:
 - normalization is by no means required or even a good feature. It's
   something you do when you want to know if two strings are equivalent,
   but that doesn't actually mean that you should keep the strings
   normalized all the time!

 - normalization has *nothing* to do with "treating text as octets".
   That's entirely an encoding issue.

 - of *course* git has to treat things as a binary stream at some point,
   since you need that to even compute a SHA1 in the first place, but that
   ...
A code point is a unique numerical value assigned to every Unicode character.
Also, every Unicode character has a unique name assigned to it. There are
some other non-unique properties that every Unicode character has. So, to say
that a Unicode character is just a code point is not exactly correct, because
the code point is one of the properties of a Unicode character. But, yes, any
Unicode character can be identified by its code point. So, it is a one to
one relation.

Dmitry
-
Linus,
(slightly offtopic) are you praising UTF-8 as storage format (for disk
and network) or in general? UTF-8-aware string ops like counting
characters seem to me a horrendous thing at the ASM level.

More on topic, I suspect Kevin's experience is more on end-user apps,
where input sanitization and even canonicalisation are common
practice. From a kernel and filesystems POV, a filename is data as
sacred as file data. On the webapp world, we "corrupt" user input
liberally to avoid XSS attacks and the like. In some cases, these
practices are stupid and can be replaced with escaping data properly,
but in other cases, the web platform is so broken that there's no
option.

At least in Moodle we store *exactly* what the user POSTed and
cleanup^Wcorrupt it when displaying it, so that if it does happen that
the cleanup was buggy, we never corrupted the data.

So no point in calling each other stupid this much. Once is enough ;-)
And no point in arguing that something that is ok for an end-user app
is a good design decision for an OS.

martin
-
Huh? Why? Just count all bytes outside the range 80-bf. That's the
exact character count of utf-8 characters.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
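
A minimal sketch of that counting trick, written in Python rather than assembly and assuming well-formed UTF-8 input: every code point contributes exactly one byte that is not a continuation byte (0x80 through 0xBF), so counting those bytes counts code points. The function name count_code_points is made up for the example.

```python
def count_code_points(data: bytes) -> int:
    """Count the bytes that are not UTF-8 continuation bytes (0x80..0xBF)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


print(count_code_points("L\u00fcftung".encode("utf-8")))
print(count_code_points("Lu\u0308ftung".encode("utf-8")))
```

Note that this counts code points, not what a user would see as letters: the decomposed spelling of "Lüftung" comes out one longer than the precomposed one.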
-
I'm praising UTF-8 (without normalization) as a wonderful format where you
can do 99.9% of everything without ever caring about all the expensive
stuff.

But in order to do that, you really need to avoid normalization, and you
also need to accept mis-formed UTF-8 strings (because even if it is real
UTF-8, the string may actually be just a fragment of some larger string).

Once you do that (and _only_ if you do that), then UTF-8 is actually a
wonderful thing. You can consider it to be a traditional "everything is a
stream of bytes", and everything that only cares about a stream of bytes
will work wonderfully well.

And then, the (actually relatively few) things that want to do things like
show things on the screen, or check for equivalence, or worry about width
of the characters, *those* can still do so.

So the beauty of UTF-8 is that you can switch between thinking of it like
just a binary blob and thinking of it like text, and everything works
(including the traditional C null-termination).

Sure. And I'm not arguing against them. Knowing the rules for combining

Absolutely. It's what the kernel does, and I think that's what perl does
too for their "strings". It works really well. It also allows you to
handle binary data (ie data that *really* isn't text) with shared routines
etc etc.

And that's the beauty of non-normalized (and possibly badly formed) UTF-8.
Linus
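
The "stream of bytes" view can be illustrated with a short sketch (Python, with a made-up path; this is not kernel or git code). Splitting on '/' and testing a suffix never require decoding, and a slice that cuts through a multi-byte sequence is still a usable byte string even though it is not valid UTF-8 on its own.

```python
path = "docs/M\u00e4rchen/L\u00fcftung.txt".encode("utf-8")

# Byte-level operations never need to decode: '/' and '.' are single ASCII
# bytes, and every byte of a multi-byte UTF-8 sequence is 0x80 or above.
components = path.split(b"/")
has_txt_suffix = path.endswith(b".txt")

# A slice that cuts a multi-byte sequence is not valid UTF-8 by itself...
fragment = path[:7]   # b"docs/M" plus only the first byte of 'ä'
try:
    fragment.decode("utf-8")
    fragment_is_valid = True
except UnicodeDecodeError:
    fragment_is_valid = False

# ...but as bytes it can still be stored, compared and concatenated.
rejoined = fragment + path[7:]

print(components, has_txt_suffix, fragment_is_valid, rejoined == path)
```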
-
*thanks* for these notes. Very useful, and...
I find the above amusing -- different worlds we live in. Programming
webapps means that 90% of the code deals with a bit of metaprogramming
(with lots of string manipulation) to talk SQL to a backend, and then
doing lots of string manipulation on the data the DB returns, which
ends up in humongous strings of goop otherwise known as HTML+CSS+JS.
After waiting for the DB to return data, over 50% of cpu time is spent
in regexes, concatenations, counting words, array ops, etc. So it is
pretty significant.

So now I have to worry about cost and correctness of stuff that I took
for granted in the pre-unicode days - strtolower() can be quite
expensive and... buggy! But that's mainly due to Unicode, not UTF8. I
think the only slowdown I can pin on UTF-8 is in counting chars, and
probably slower regexes. Not that I deal with the C implementation of
any of this stuff -- and so happy about it! ;-)

</offtopic>
I had a few issues with Perl v5.6's utf-8 handling that wasn't binary
safe (fread() to a fixed-length buffer would break the input if a
unicode char landed across the boundary - ouch!) -- made me think that
you couldn't do this in binary safe ways. So I tend to tell Perl to
treat files as binary, and switch to utf-8 in specially chosen spots. I
suspect that 5.8 is a bit saner about this, but I'm not taking
chances.

cheers,
martin
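
The buffer-boundary problem described above is not specific to Perl. As a sketch of one way around it, here is the same situation in Python (not the Perl code in question): the data stays binary while it is read in fixed-size chunks, and an incremental decoder carries a split multi-byte sequence over to the next chunk. The function name decode_in_chunks is made up for the example.

```python
import codecs


def decode_in_chunks(data: bytes, chunk_size: int) -> str:
    """Decode UTF-8 that arrives in fixed-size chunks, as from a read loop."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = []
    for start in range(0, len(data), chunk_size):
        # The decoder buffers an incomplete trailing sequence internally.
        pieces.append(decoder.decode(data[start:start + chunk_size]))
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


text = "L\u00fcftung M\u00e4rchen"
raw = text.encode("utf-8")

# Decoding one chunk in isolation fails when it ends mid-character.
try:
    raw[:2].decode("utf-8")
    naive_chunk_ok = True
except UnicodeDecodeError:
    naive_chunk_ok = False

print(naive_chunk_ok, decode_in_chunks(raw, 2) == text)
```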
-
Maybe because it's 1.5 times bigger for any text in Chinese, Japanese or
Korean?

Mike
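
The 1.5x figure is easy to check for characters in the Basic Multilingual Plane, which take three bytes in UTF-8 and two in UTF-16. A quick sketch in Python, using three ideographs as sample text:

```python
sample = "\u65e5\u672c\u8a9e"   # three CJK ideographs

utf8_len = len(sample.encode("utf-8"))       # 3 bytes per ideograph
utf16_len = len(sample.encode("utf-16-le"))  # 2 bytes per ideograph, no BOM
ratio = utf8_len / utf16_len

print(utf8_len, utf16_len, ratio)
```

Text that mixes in ASCII (markup, paths, digits) shifts the ratio back toward UTF-8, since those characters take one byte instead of two.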
-
I'm not saying it's forced on you, I'm saying when you treat filenames
as text, it DOESN'T MATTER if the string gets normalized. As long as
the string remains equivalent, YOU DON'T CARE about the underlying

Alright, fine. I'm not saying HFS+ is right in storing the normalized
version, but I do believe the authors of HFS+ must have had a reason
to do that, and I also believe that it shouldn't make any difference

Sure it does. Normalizing a string produces an equivalent string, and
so unless I look at the octets the two strings are, for all intents

You're right, but it doesn't have to treat it as a binary stream at
the level I care about. I mean, no matter what you do at some level
the string is evaluated as a binary stream. For our purposes, just
redefine the hashing algorithm to hash all equivalent strings the
same, and you can implement that by using SHA1 on a particular

Decomposing and recomposing shouldn't lose any information we care
about - when treating filenames as text, a<COMBINING DIAERESIS> and <A
WITH DIAERESIS> are equivalent, and thus no distinction is made between
them. I'm not sure what other information you might be considering

Wrong. '\x61\xa8' in Latin1, when converted to UTF-8 (NFD) is still
'\x61\xc2\xa8'. You're mixing up DIAERESIS (U+00A8) and COMBINING
DIAERESIS (U+0308).

I suspect this is why you've been yelling so much - you have a

See above as to why you're not losing the information you so fervently

People who insult others run the risk of looking like a fool when

Sure, it all depends on what level you need to evaluate text. If we're
talking about english paragraphs, then whitespace can be messed with.
When we're talking about unicode strings, then specific encoding can
be messed with. When talking about byte sequence, nothing can be
messed with.

In our case, when working on an HFS+ filesystem al...
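
The DIAERESIS versus COMBINING DIAERESIS point above can be checked with a short sketch. It uses Python's unicodedata module and takes 0xA8 as the Latin-1 byte for U+00A8; the results reflect the Unicode tables shipped with the Python runtime, not HFS+'s own decomposition.

```python
import unicodedata

latin1_bytes = b"\x61\xa8"                 # 'a' followed by Latin-1 0xA8
text = latin1_bytes.decode("latin-1")
names = [unicodedata.name(ch) for ch in text]

utf8_bytes = text.encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", text).encode("utf-8")

# The combining mark is a different code point with a different encoding.
combining_utf8 = "a\u0308".encode("utf-8")
combining_name = unicodedata.name("\u0308")

print(names, utf8_bytes, nfd_bytes, combining_name, combining_utf8)
```

U+00A8 has no canonical decomposition, so NFD leaves it alone; only the spacing and combining marks' names look alike.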
to treat as text could mean different things to different people. Some
may prefer fi and fi_ligature to be treated as the same in some

As a matter of fact it does, otherwise the characters would be the
same and we would not have this conversation at all. Strings
can be equivalent and not equivalent at the same time, because there
are different equivalence relations. Finally, what HFS+ does
is not even normalization. In the technote, Apple explains
that they decompose some characters but not others for better

It is not about byte stream. After all, if it were UTF-16 instead
of UTF-8, it would be one to one conversion for each character.

I don't say they do that without *any* reason, but I suppose all
Apple developers in the Copland project had some reasons for they

Not true. You lose the original sequence of *characters*.
Dmitry
-
Those people can use NFKC/NFKD (compatibility equivalence). As I've
said before, I'm talking about canonical equivalence, because that
doesn't lose information like compatibility equivalence does (ex. the
fi ligature gets turned into fi in compatibility equivalence, but not

Again, I've specified many times that I'm talking about canonical
equivalence.

And yes, HFS+ does normalization, it just doesn't use NFD. It uses a

Stupid engineers don't get to work on developing new filesystems. And
Copland didn't fail because of stupid engineers anyway. If I had to

Which is only a problem if you care about the byte sequence, which is
kinda the whole point of my argument.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
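
The canonical versus compatibility distinction in the message above can be seen on the fi ligature itself. A minimal sketch with Python's unicodedata module (standard Unicode forms only; HFS+'s own variant is not modelled here):

```python
import unicodedata

ligature = "\ufb01"   # LATIN SMALL LIGATURE FI

# Canonical forms keep the ligature as it is.
nfc = unicodedata.normalize("NFC", ligature)
nfd = unicodedata.normalize("NFD", ligature)

# Compatibility forms replace it with the two letters 'f' and 'i'.
nfkc = unicodedata.normalize("NFKC", ligature)
nfkd = unicodedata.normalize("NFKD", ligature)

print(nfc == ligature, nfd == ligature, nfkc, nfkd)
```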
If you think that HFS+ does normalization then you apparently have no
idea of what the term "normalization" means. Have you? But if you
don't know what "normalization" is then you cannot really know what

Assigning someone to work on a new filesystem does not make him
suddenly smart. As to stupid engineers not getting to work on it,
that is like saying there are no stupid engineers at all. There is
plenty of evidence to the contrary. And when management is disastrous
then most idiots with a big mouth and little capacity to produce
anything useful do get assigned to develop new features, while
those who can actually solve problems are assigned to fix the
next build, because the only thing that this management worries

But if the code was so good then why was most of that code thrown away

Byte sequences are not an issue here. If the filesystem used UTF-16 to
store filenames, that would NOT cause this problem, because characters
would be the same even though bytes stored on the disk were different.
So, what you actually lose here is the original sequence of *characters*.

Dmitry
-
I would go look up specifics to back me up, but my DNS is screwing up
right now so I can't access most of the internet. In any case, there
are 4 standard normalization forms - NFC, NFD, NFKC, NFKD. If there
are others, they aren't notable enough to be listed in the resource I
was reading. HFS+ uses a variant on NFD - it's a well-defined variant,
and thus can safely be called its own normalization form. I fail to

I'm not talking about assigning engineers, I'm saying developing a new
filesystem, especially one that's proven itself to be usable and
extendable for the last decade, is something that only smart engineers

Yes. Even the best of engineers will produce crap code when overworked
and required to implement new features instead of fixing bugs and
stabilizing the system. Copland is well-known to have suffered from
featuritis, to the extent that it was practically impossible to test
in any sane fashion. Bad management can kill any project regardless of

I've already talked about that, but you are apparently incapable of
understanding.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
The defining property of normalization is producing binary identical
strings for equivalent strings, IOW, normalization allows you to tell
what strings are equivalent and what are not just by binary comparison.
HFS+ decomposition lacks that property, because strings are not fully
decomposed, thus binary comparison of equivalent strings may give false

You know, many people still use FAT, but somehow I don't think that
FAT is good despite it being extendable for more than a decade...
Apparently, HFS+ was not the worst part of the Copland project, but I

I don't think that anyone asked them to implement so many new features.
AFAIK, it was very difficult (nearly impossible) to get anyone to work

Exactly. IMHO, both management and developers are equally responsible

Well, it is *you* who is incapable of understanding anything, even
basic terms as encoding and normalization...

Dmitry
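
The defining property stated above (equivalent strings become binary identical) can be demonstrated with the standard forms. This sketch uses Python's unicodedata module and three canonically equivalent spellings of "Å"; it shows what full NFC and NFD do and makes no claim about what HFS+'s variant does with these particular code points.

```python
import unicodedata

forms = [
    "\u00c5",    # LATIN CAPITAL LETTER A WITH RING ABOVE
    "A\u030a",   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
    "\u212b",    # ANGSTROM SIGN
]

raw = {form.encode("utf-8") for form in forms}
nfc = {unicodedata.normalize("NFC", form).encode("utf-8") for form in forms}
nfd = {unicodedata.normalize("NFD", form).encode("utf-8") for form in forms}

print(len(raw), nfc, nfd)
```

Three different byte strings go in; after either normalization exactly one comes out, so a plain byte comparison is enough to decide equivalence.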
-
That's horribly broken, for a couple of reasons. First of all,
changing the hash algorithm breaks compatibility with existing
repositories; sure, you can try to guess what will least likely break
existing repository (which won't be the native MacOSX normalization
algorithm, since it's more likely the combined character will likely
be used on other environments), but there's still no guarantee there
aren't filenames that use some other form of byte-string for the
filename.Secondly, the hash algorithm would not be stable. Unicode is not
static, and new characters can get added that may be composable, and
thus would be normalized differently. This is one of the reasons why
Unicode is so horribly broken as a standard. It was originally
created by representatives from the printing world that were horribly
clueless about what was needed with respect to canonicalization
representation, so they compromised and allowed both forms, not realizing
what a massive f*ckup this would cause later on. So people have over
the years piled kludges on top of kludges in order to make Unicode
"work".

So we can't blame all of the craziness on the MacOS designers,
although they seem to have been very creative about how to take a
bad situation and make it worse....

- Ted
-
You seem to be under the impression that I'm advocating that git treat
all filenames as unicode strings, and thus change its hashing
algorithm as described. I am not. I am saying that, if git only had to
deal with HFS+, then it could treat all filenames as strings, etc.
However, since git does not only have to deal with HFS+, this will not
work. What I am describing is an ideal, not a practicality.

In other words, what I'm saying is that treating filenames as strings
works perfectly fine, *provided you can do that 100% of the time*. git
cannot do that 100% of the time, therefore it's not appropriate here.
The purpose of this argument is to illustrate that treating filenames
as strings isn't wrong, it's simply incompatible with treating
filenames as byte sequences.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Well, why are you arguing on the git list about precisely that (when

No, it's still broken, because of the Unicode-is-not-static problem.
What happens when you start adding more composable characters, which
some future version of HFS+ will start breaking apart?

Presumably the whole *reason* why HFS+ was corrupting strings was so
that "stupid applications" that only did byte comparisons would work
correctly. But when you upgrade from Mac OS 10.5 to 10.6, and it adds
support for new composable characters, and you now take a USB hard
drive that was hooked up to a MacBook Air, running one version of
MacOS, and hook it up to another Macintosh, running another version of
MacOS, the normalization algorithm will be different, so the byte
comparisons won't work.

So all of this extra work which MacOS put in to corrupt filenames
behind our back doesn't actually do any good; applications still need
to be smart, or there will be rare, hard to reproduce bugs
nevertheless. So if MacOS wants to supply Unicode libraries that
compare strings keeping in mind Unicode "equivalences" it can be our
guest (although how they deal with different versions of Unicode with
different equivalence classes will be their cross to bear). BUT MacOS
X SHOULD NOT BE CORRUPTING FILENAMES. TO DO SO IS BROKEN.

Even Microsoft got this right; its filesystem is case-preserving, but
it has case-insensitive lookups. Hence, it is not corrupting
filenames behind the application's back, unlike MacOS.

- Ted
-
Because of the way in which an argument evolves. This started out as
"HFS+ is stupid because it normalizes", and I was arguing that said
normalization wasn't stupid. This turned into an argument as to why
HFS+ wasn't stupid for normalizing, which is basically this argument of
the ideal. Yes, I realize that it's not producing any practical
results, but I'm stubborn (as, apparently, are most of you), and I
believe that if the official stance of the git project is "HFS+ is
stupid" then there's a lower chance of a patch being accepted than if

If you need a static representation, you normalize to a specific form.
And in fact, adding new composable characters doesn't matter, since if
they didn't exist before, you couldn't have possibly used them. Unless
you mean adding new composed forms of existing simpler characters, at

I doubt that HFS+ normalized so that "stupid applications" could do
byte comparisons. But even if that were the case, see previous

Your entire argument is based on the assumption that HFS+ "corrupts"
filenames in order to allow dumb clients to do byte comparisons, and I
don't believe that to be the case. In fact, it's only considered a
corruption if you care about the byte sequence of filenames, and my
argument is that, on HFS+, you aren't supposed to care about the byte
sequence.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Sure you can. Suppose you unpack the same tar file or zip file that
contains one of these new-fangled characters, one on a MacOS 10.5
system, and one on a MacOS 10.9 system. How HFS+ will corrupt that
filename will differ depending which version of MacOS you are running.
Hence, normalizing the filename when you store it is stupid and
broken. If MacOS and its applications and libraries want to do
normalization in the privacy of their own address space, that's their
business. It can pursue any fetish it wants, among consenting adults.
Safe, sane and consensual, and all that... well, consensual, anyway.
I'm not sure about "safe" and "sane"....

My argument is basically that there is absolutely no value in what
HFS+ is doing by corrupting filenames --- if you want to call it
"normalizing" them, fine, but since Unicode is not static, you
can't even call it a "canonical" form. It's just some random
corruption of what was passed in at open(2) time, that can and will
change depending on what version of MacOS you are running.

If you want to play the insane Unicode game of "equivalent"
characters, you have to do it at comparison time, so there's no point
trying to "normalize" them when you store them. It doesn't buy you

OK, what's your reason for why HFS+ corrupts filenames? What do you
think is its excuse? What problem does it solve? If the answer is
"no reason at all, but because it *can*", according to the Great God
Unicode, then that's really not very impressive....

- Ted
-
Note: resent to list due to bounce.
Original CC list: tytso@MIT.EDU, torvalds@linux-foundation.org, peter@softwolves.pp.se
, mjscod@web.de, melo@simplicidade.org

You're making the huge assumption that the HFS+ normalization
algorithms will change. As the technote states:

"Platform algorithms tend to evolve with the Unicode standard. The HFS
Plus algorithms cannot evolve because such evolution would invalidate

It must have bought somebody something, or they never would have done

I have no idea why HFS+ stores filenames in a normalized form, and
further I am smart enough to know that speculating is completely
pointless. I assume the authors had a good reason (which should be a
safe assumption, filesystem authors are a smart bunch). The reason may
not be valid anymore, but if it was valid back in 1998, then I can
accept it without complaining.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Sure I do, because it matters a lot for things like - wait for it - things

I've already told you the reason: they made the mistake of wanting to be
case-independent, and a (bad) case compare is easier in NFD.

Once you give strings semantic meaning (and "case independent" implies
that semantic meaning), suddenly normalization looks like a good idea, and
since you're going to corrupt the data *anyway*, who cares? You just
created a file like "Hello", and readdir() returns "hello" (because there
was an old file under that name), and it's a lot more obviously corrupt.

.. but you *have* to look at the octets at some point. They're kind of
what the string is built up of. They never went away, even if you chose to
ignore them. The encoding is really quite important, and is visible both
in memory and on disk.

It's what shows up when you sha1sum, but it's also as simple as what shows
up when you do an "ls -l" and look at a file size.

It doesn't matter if the text is "equivalent", when you then see the
differences in all these small details.

You can shut your eyes as much as you want, and say that you don't care,

You're right, I messed up. I used a non-combining diaeresis, and you're
right, it doesn't get corrupted. And I think that means that if Apple had
used NFC, we'd not have this problem with Latin1 systems (because then the
UTF-8 representation would be the same).

So I still think that normalization is totally idiotic, but the thing that
actually causes most problems for people on OS X is that they chose the
really inconvenient one.

Linus
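
The point about Latin1 and NFC can be checked with a small sketch. It uses the "Lüftung.txt" name from the start of the thread and Python's standard unicodedata module; the variable names are only illustrative, and standard NFD is used as a stand-in for the HFS+ variant of decomposition.

```python
import unicodedata

# "Lüftung.txt" as it would be stored on a Latin1 system (one byte for ü).
name_latin1 = b"L\xfcftung.txt"
text = name_latin1.decode("latin-1")

# Straight transcoding to UTF-8, with no normalization step at all.
transcoded = text.encode("utf-8")

# The same name after NFC and after NFD normalization, encoded as UTF-8.
nfc_bytes = unicodedata.normalize("NFC", text).encode("utf-8")
nfd_bytes = unicodedata.normalize("NFD", text).encode("utf-8")

print(transcoded)               # composed form: ü is the two bytes c3 bc
print(nfd_bytes)                # decomposed form: 'u' followed by cc 88
print(transcoded == nfc_bytes)  # plain transcoding already yields NFC here
print(transcoded == nfd_bytes)  # the decomposed name is a different byte string
```

For this name, a tool that simply converts Latin1 to UTF-8 ends up with the NFC bytes, so only the decomposed form shows up as a "different" file name.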
-
I believe I already responded to the issue of hashing. In summary,
just re-define your hash function to convert the string to a specific
encoding. Sure, you'll lose some speed, but we're already assuming
that it's worth taking a speed hit in order to treat filenames as
strings (please don't argue this point, it's an opinion, not a factual
statement, and I'm not necessarily saying I agree with it, I'm just

Perhaps that is the reason, I don't know (neither do you, you're just
guessing). However, my point still stands - as long as the string
stays canonically equivalent, it doesn't matter to me if the

Someone has to look at the octets, but it doesn't have to be me. As
long as I use unicode-aware libraries and such, I can let the

It does? Why on earth should it do that? Filename doesn't contribute
to the listed filesize on OS X.

kevin@KBLAPTOP:~> echo foo > foo; echo foo > foobar
kevin@KBLAPTOP:~> ls -l foo*
-rw-r--r-- 1 kevin kevin 4 Jan 21 14:50 foo
-rw-r--r-- 1 kevin kevin 4 Jan 21 14:50 foobar

It would be singularly stupid for the filesize to reflect the
filename, especially since this means you would report different

Visible at some level, sure, but not visible at the level my code
I'm not sure what you mean. The byte sequence is different from Latin1
to UTF-8 even if you use NFC, so I don't think, in this case, it makes
any difference whether you use NFC or NFD. Yes, the codepoints are the
same in Latin1 and UTF-8 if you use NFC, but that's hardly relevant.
Please correct me if I'm wrong, but I believe Latin1->UTF-8->Latin1
conversion will always produce the same Latin1 text whether you use

The only reason it's particularly inconvenient is because it's
different from what most other systems picked. And if you want to
blame someone for that, blame Unicode for having so many different
normalization forms.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
[ message continues ]
Umm. What's this inability to see that data is data is data?
Why do you think Unicode has anything in particular to do with filenames?
Those same unicode strings are often part of the file data itself, and
then that encoding damn well is visible in "ls -l".

Doing

    echo ä > file
    ls -l file

sure shows that "underlying octet" thing that you wanted to avoid so much.

My point was that those underlying octets are always there, and they do
matter. The fact that the differences may not be visible when you compare
the normalized forms doesn't make it any less true.

You can choose to put blinders on and try to claim that normalization is
invisible, but it's only invisible TO THOSE THINGS THAT DON'T WANT TO SEE
IT.

But that doesn't change the fact that a lot of things *do* see it. There
are very few things that are "Unicode specific", and a *lot* of tools that
are just "general data tools".

The problem is that the UTF-8 form is different, so if you save things in
UTF-8 (which we hopefully agree is a sane thing to do), then you should
try to use a representation that people agree on.

And NFC is the more common normalization form by far, so by normalizing to
something else, you actually de-normalize as far as those other people are
concerned.

I blame them for encouraging normalization at all.
It's stupid.
You don't need it.
The people who care about "are these strings equivalent" shouldn't do a
"memcmp()" on them in the first place. And if you don't do a memcmp() on
things, then you don't need to normalize.

So you have two cases:
(a) the cases that care about *identity*. They don't want normalization
(b) the cases that care about *equivalence*. And they shouldn't do
octet-by-octet comparison.

See? Either you want to see equivalence, or you don't. And in neither case
is normalization the right thing to do (except as *possibly* an internal
part of the comparison, but there are actually better ways to check for
equivalence...
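
The two cases (a) and (b) can be sketched as two separate comparison functions. This is only an illustration: the function names are made up, and normalizing inside the comparison is used purely because it is the shortest way to write an equivalence check, not because it is the best one.

```python
import unicodedata

def identical(a: str, b: str) -> bool:
    """Case (a): identity. Compare the encoded octets, memcmp()-style."""
    return a.encode("utf-8") == b.encode("utf-8")

def equivalent(a: str, b: str) -> bool:
    """Case (b): canonical equivalence, decided at comparison time.

    The stored strings are never modified; normalization happens only
    on temporary copies inside this function.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "L\u00fcftung.txt"     # ü as a single code point
decomposed = "Lu\u0308ftung.txt"  # u followed by a combining diaeresis

print(identical(composed, decomposed))   # False: the octets differ
print(equivalent(composed, decomposed))  # True: canonically equivalent
```

Neither function requires the stored name to be rewritten, which is the distinction being argued about.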
I'm not sure what you mean. I stated a fact - at least on OS X, the
filename does not contribute to the listed filesize, so changing the
encoding of the filename doesn't change the filesize. This isn't a

I don't, but I do think this discussion revolves around filenames,

Yes, I am well aware that the encoding of the *file contents* affects
filesize. But when did I suggest changing the encoding of filenames
inside file contents? If you treat filenames as strings, there's no
requirement to change the encoding of filenames inside file contents.

I'm talking specifically about the filenames, not about file contents,

Don't want to, or don't need to? It's not a matter of ignoring
encoding because I don't want to deal with it, it's ignoring encoding

Yes, I realize that. See my previous message about discussing ideal vs

Was NFC the common normalization form back in 1998? My understanding
is Unicode was still in the process of being adopted back then, so

I could argue against this, but frankly, I'm really tired of arguing
this same point. I suggest we simply agree to disagree, and move on to
actually fixing the problem.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Changing the encoding of the file name most certainly changes the
file size of the directory.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
And my point was that your *whole* argument boils down to "normalization
is invisible".

When it isn't. It's not invisible for filenames, it's not invisible for
file contents.

You're trying to claim that normalization cannot matter. I'm just pointing
out that it sure as hell can. Exactly because lots of things don't
actually look at data other than as just a Unicode string. They do look at
the raw format.

I'm surprised that you make generalized sweeping statements about how it's
ok to normalize because normalization is "invisible", and then when I
point out that that isn't true, you try to limit it.

And no, that normalization is not invisible EVEN IN FILENAMES. If it was,

I don't know which argument you're talking about. Git (and, btw, Linux)
does the "ideal" thing (don't screw up people's data), and it turns out to
be the "practical" thing too (it can handle a wider range of cases than OS
X can).

So no, this is not "ideal" vs "practical". They aren't in any conflict

.. and people have even suggested how. Hide the idiotic OS X choices by
making an OS X-specific wrapper around readdir() that turns it into NFC.

That's just about the best we can do. We can't *fix* the thing that OS X
loses information, but at least we can then show the lost information in
the same form it _probably_ was in originally.

But no, it won't "fix" git on OS X.
Linus
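
A rough illustration of the readdir() wrapper idea, written in Python rather than in git's C code. The name listdir_nfc is hypothetical, and this says nothing about how git itself would implement it; it only shows directory entries being converted to NFC on the way out.

```python
import os
import tempfile
import unicodedata

def listdir_nfc(path="."):
    """Hypothetical readdir()-style wrapper: return directory entries in NFC."""
    return [unicodedata.normalize("NFC", name) for name in os.listdir(path)]

# Demo: create a file under a decomposed (NFD) name, then list it
# through the wrapper.
decomposed = "Lu\u0308ftung.txt"
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, decomposed), "w").close()
    names = listdir_nfc(tmp)

# Whatever form the filesystem handed back, the caller sees the composed name.
print(names == ["L\u00fcftung.txt"])
```

As the message says, this only hides the form the name comes back in. If the name was originally created in some other form, that information is already gone.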
-
I'm really surprised that, after all of this, you're still horribly
misunderstanding my argument. I never said it was invisible. NEVER.

I'm also surprised that you seem to care more about this argument than

You misunderstand my point. In a previous email I specifically used
the words "ideal" and "practical" to describe arguments, which is what

And I've responded to that suggestion, multiple times, saying that

Quite a while ago it was suggested that git use a table that maps the
original byte sequence as seen in the index to the form returned by
readdir(). So far this has sounded like the best solution, but as I've
said before I don't know git's internals enough (or, really, at all)
to be able to work on this myself.

This solution should only "lose" information in the case where the
index has 2 filenames that HFS+ treats as a single filename.

Is there some reason this won't work?
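
The table idea can be sketched roughly as follows. Everything here is an assumption made for illustration: hfs_key stands in for whatever HFS+ actually does to a name (standard NFD is used as an approximation, and case folding is ignored), and build_table and to_index_name are made-up names, not git functions.

```python
import unicodedata

def hfs_key(name: str) -> str:
    """Approximate the form readdir() would return (assumption: plain NFD)."""
    return unicodedata.normalize("NFD", name)

def build_table(index_names):
    """Map the on-disk form of each name back to the form stored in the index.

    Also returns the collisions: pairs of distinct index names that the
    filesystem would treat as one file.
    """
    table = {}
    collisions = []
    for name in index_names:
        key = hfs_key(name)
        if key in table and table[key] != name:
            collisions.append((table[key], name))
        else:
            table[key] = name
    return table, collisions

def to_index_name(table, readdir_name: str) -> str:
    """Translate a name from readdir() into the index form, if one is known."""
    return table.get(hfs_key(readdir_name), readdir_name)

table, collisions = build_table(["L\u00fcftung.txt", "README"])
print(to_index_name(table, "Lu\u0308ftung.txt") == "L\u00fcftung.txt")
print(collisions)
```

The collision list corresponds to the one failure case described above: an index that holds two names the filesystem folds into one.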
-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
You said it was invisible when you treat things "as text". Here's the
quote:

    .. when you treat filenames as text, it DOESN'T MATTER if the
    string gets normalized ..

Without ever apparently realizing that "as text" is part of the problem in
itself. What is "text" to one person is gibberish to another.

In particular, the biggest reason to not normalize is that you don't know
it's text or Unicode in the first place. Which is why git doesn't do it.

And no, even with filenames you don't know that they are "text". People
encode stuff in them. And people don't always use UTF-8.

Of course, you could ask everybody to create OS X-only programs that know
that under OS X, you only have a subset of filenames. If so, you're
complaining about the wrong tool. Especially when the whole point of the
tool was to be distributed (not to mention coming from an environment that
simply doesn't have the same silly limitations OS X has).

So here's a few clues:

- "as text" isn't "as unicode": it may well be Latin1 or EUC-JP or
something. Yes, it's still used. Git doesn't care, and very consciously
has avoided forcing character sets, even if the *default* (and notice
how it's overridable) commit message encoding may be utf-8.

- In fact, even in unicode, the difference between "identical" and
"equivalent" strings exists, and even in the standard, unicode
strings are very much defined to be arbitrary codepoint sequences, not
normalized.

So even for the very specific case of unicode text, it's simply not true
that "it doesn't matter if the string gets normalized". The unicode spec
itself talks about cases where even canonical normalization makes a
difference.

Search for this quote:

"Not all processes are required to respect canonical equivalence. For
example:

* A function that collects a set of the General_Category values
present in a string will and should produce a different value for
<angstrom sign, semicolon>...
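
The General_Category example can be tried with Python's unicodedata module. Since the quote above is cut off, the sketch simply compares <angstrom sign, semicolon> with its own NFD form rather than guessing at the rest of the quoted sentence; the helper name categories is made up.

```python
import unicodedata

def categories(s: str) -> set:
    """Collect the set of General_Category values present in a string."""
    return {unicodedata.category(ch) for ch in s}

original = "\u212b;"                             # ANGSTROM SIGN, SEMICOLON
decomposed = unicodedata.normalize("NFD", original)

# The two strings are canonically equivalent...
print(unicodedata.normalize("NFC", original) ==
      unicodedata.normalize("NFC", decomposed))

# ...but a process that looks at General_Category sees different sets,
# because the decomposed form contains a combining mark (category Mn).
print(sorted(categories(original)))
print(sorted(categories(decomposed)))
```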
Which is actually a good argument as to why filenames should be

Sure, I understand why git doesn't do it. I'm saying in a system which
uses unicode top-to-bottom, which you can create if you're using HFS+

Again, I was talking about a system that used unicode top-to-bottom.
I find it amusing that you keep arguing against having git treat
filenames as unicode when, if you had actually taken my advice and
read my previous email talking about "ideal" vs "practical", you'd
realize that I was not suggesting git should. I was simply describing
why having the filesystem specifically treat filenames as utf-8 isn't
a problem when the entire system is unicode-aware, and thus showing
how the problems that are cropping up in git aren't because the
filesystem treats filenames as unicode, but rather because the
filesystem treats filenames differently than other filesystems. In
other words, I was trying to illustrate that HFS+ isn't wrong, it's
just different, and the difference is causing the problem.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
NO I DO NOT!
Dammit, stop this idiocy.
I think it's fine having git treat filenames "as unicode", as long as you
don't do any munging on it.

Why? Because if it's utf-8, then treating them "as unicode" means exactly
the same as treating them "as a user-specified string".

So stop lying about this whole thing. I have never *ever* argued against
unicode per se.

All my complaints - every single one of them - come down to making the
idiotic choice of trying to munge those strings (not even strictly
"normalize") into something they are not.

And what you don't seem to understand is that once you accept _unmodified_
raw UTF-8 as a good unicode transport mechanism, suddenly other encodings
are possible. I'm not out to force my world-view on users. If they are
using legacy encodings (whether in filenames *or* in commit texts or in
their file contents), that's *their* choice.

I actually personally happen to use UTF-8-encoded unicode.
I'm just not stupid enough to think that (a) corrupting it is a good idea,
*or* (b) that I should force every Asian installation of git to also force
people to use unicode (or even having all the conversion libraries and
overheads!)

So stop this idiotic "unicode == normalization" crap.
I'm a huge fan of UTF-8. But that does not mean that I think normalization
is a good idea.

Linus
-
Please read to the bottom of this email. As near as I can figure out,
you haven't done that on any of my previous emails.

When I say "treat filenames as unicode" I'm implying the equivalence

If that's what "as unicode" meant, then the phrase "as unicode" has

No, you've argued against unicode equivalency in filenames. Can't you
figure out, when the entire time I've been talking about equivalency,

You're not using raw UTF-8, you're just using raw bytes. Calling it
UTF-8 doesn't mean anything, since you don't actually know that's what

How many times must I say the same thing over and over? I'm not
arguing that forced normalization is a good thing. I'm arguing that,
in a system which is unicode-aware top to bottom, forced normalization
is irrelevant to the user, since they don't care about the exact byte
sequence. And I'm also arguing that git should have some solution to
this problem. I find it interesting that you're perfectly happy to
rant and rail against your misperception of my argument, and yet you
consistently and repeatedly ignore my offers to stop this argument and
work towards a solution, as well as my comments on existing proposed
solutions.

Are you even reading to the end of my emails?
- Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Yes, because you're an idiot.
I've told you over and over again that equivalence is stupid.
It's stupid when it's "equivalent except for case", and it's stupid when

I agree: normalization and equivalency are idiotic.
You don't seem to.
The thing is, the two are inexorably intertwined. Any filename equivalence
(except for the trivial "identity" equivalence) INVARIABLY means that
filenames get munged.

Why?
Think about the file name "Abc", and think about what happens when you
create it.

Now, think about what happens if that filename is considered equivalent in
case..

See? The filesystem has to *corrupt* the filename.
Can you not UNDERSTAND this? Equivalence and normalization are STUPID. They're
just two sides of the exact same coin. They both INVARIABLY cause the
filename to be munged.

And changing user data is not acceptable.
Do you get it now?
Linus "probably not" Torvalds
-
Let me make this really clear, because I'm afraid that you won't get it
when I leave out any steps of the way.

Let us say that there is a filename "xyz" that is equivalent to a filename
"abc" in *any* way. It does not matter if xyz/abc is Hello/hello, or
whether it's two canonically equivalent strings.

So now, do

    close(open(xyz, O_WRONLY | O_CREAT, 0666));
    close(open(abc, O_WRONLY | O_CREAT, 0666));

and then look at the directory contents afterwards.

There are two, and only two, choices here (*):
- the filesystem created both files, and they show up as created
- the filesystem decided they were equivalent, and munged one (or both)
of them

Now, let's go back to my claim:
- munging user data is unacceptable
and realize that equivalence BY DEFINITION must do it.

So no, you do *not* get to have your cake and eat it too. You simply
fundamentally *cannot* have both filename equivalence and a non-munging
filesystem. See above why.

Linus

(*) Actually, there is a third choice above, which is:
- the filesystem created the first file, and errored out on the second
because it noticed it was equivalent - but not identical - to one it
already had

This one is actually a perfectly fine choice, but it's not "your" kind
of equivalence, since it actually makes a difference between two
equivalent but non-identical names. So the filenames aren't actually
interchangeable, and this case is really more of a "the filesystem has
some very specific limitations on what it allows".
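
The experiment above can be run as a small script. This is a sketch under assumptions: the two "equivalent" names are taken to be the composed and decomposed spellings of "Lüftung.txt", and the helper name create_both is made up. The result depends on the filesystem the temporary directory lives on, so no particular output is claimed.

```python
import os
import tempfile
import unicodedata

def create_both(directory):
    """Create two canonically equivalent names and return the directory listing."""
    xyz = "L\u00fcftung.txt"                 # composed spelling
    abc = unicodedata.normalize("NFD", xyz)  # decomposed spelling
    for name in (xyz, abc):
        # Same calls as the C snippet: open with O_WRONLY | O_CREAT, then close.
        fd = os.open(os.path.join(directory, name),
                     os.O_WRONLY | os.O_CREAT, 0o666)
        os.close(fd)
    return sorted(os.listdir(directory))

with tempfile.TemporaryDirectory() as tmp:
    entries = create_both(tmp)

# Two entries: the filesystem kept both names as given.
# One entry: it treated them as the same file, in whichever form it shows.
print(len(entries), [e.encode("utf-8") for e in entries])
```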
-
Linus, have you even bothered to read my arguments, or do you just get
a kick out of building these straw man arguments? You have
consistently failed to actually address what I'm talking about, and
instead persist in explaining stuff I already know, as if that was the
answer to anything I've been talking about. You are clearly incapable
of understanding my basic point, no matter how simple I break it down.
I suspect it's because you've been working low-level so long you can't
think high-level, and so you manage to misinterpret my high-level
arguments as boneheaded low-level mistakes.

Anyway, please see my countless former emails where I ask to work
towards a solution instead of just arguing.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
We know what the solution is:
- The OS X filesystem _is_ crap (and you seem to have almost admitted as
much by your comment that the HFS+ designers did it back in the dark
ages and didn't mean for it to ever be a server filesystem anyway)

- But we can at least make a wrapper around readdir() return the NFC form
on OS X, and effectively hide much of the fallout from the crap.

There is no way around it. Your "solutions" all seem to boil down to
asking git to do the same idiotic crap that OS X does, taking all the
same performance hits, and just generally doing crap just to work around
crap in your favourite OS.

And no, making git be stupid just to suit a stupid filesystem simply isn't
going to happen.

So how about you see _my_ point instead: OS X may have an inferior
filesystem, but we don't have to make git inferior just for that. The fact
that OS X does case independence is *its* problem, not git's.

Linus
-
I agree that HFS+ isn't well suited for tasks which it is being asked
to do. I was never arguing that it was the perfect filesystem. But
that hardly matters now, I know nobody's going to bother understanding

Again, I don't think that's the correct solution. What about the
translation table that was suggested back at the beginning of the
thread? That would solve the case insensitivity issue as well, whereas

No, I am not asking git to do the same thing HFS+ does. You just
persist in misinterpreting my arguments, no matter how many times I

So, what, you're saying git shouldn't do any work at all to try and
behave nicer on OS X? Because OS X sure as hell can't change to suit
git.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Kevin,
you seem to know the problem fairly well. Could you write up a set of
testcases that show the bug? See the "t" directory in the git sources
-- you don't need to learn much about git internals, they are just
shell scripts (mostly, I think there's some perl there too). That
could lead to a good contribution to the project.

... and keep you from telling everyone else that you know better how
to hack a project that you know nothing about ;-)

Kevin - for your edification, that question is usually referred to as
"trolling" in this place we call the internet. Linus outlined what his
technical plan is, so git will probably do something designed by
someone who knows a thing or two about git's internals. So when you
pretend that he is saying the opposite of what he is saying... well,
people do get upset.

cheers,
m
-
See now this is actually a very good suggestion. I probably should
have done this long ago. Thank you very much for actually responding
about the problem. You are the first person to do so.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Sure you do. You continue to say that unicode is the only choice, and you
continue to say that unicode requires that equivalent names be considered
the same.

Umm. Git works perfectly fine on OS X, and it's not like we can do a whole
lot more about it, exactly because we cannot fix the real problem. We can
hide some of the fallout (idiotic choice of normalization), but the bigger
issues we can hardly even do anything about (case independence).

And quite frankly, you've also made sure that I have absolutely zero
interest in even trying to help people with it.

Linus
-
If by "ideal" you mean a world where 100% of all computers were
designed by Steve Jobs, you might have a point. But trying to argue
for such a state of idealism seems to be stupid, and certainly a
complete waste of everyone's time on the git mailing list. It's
simply not reality. It's like with the infamous resource forks, which
would have worked fine if all the world were MacOS, but which had a
tendency to get stripped off whenever you used a program that wasn't
resource fork aware, like zip, or a protocol that wasn't resource fork
aware, like FTP. And so people had to put in all sorts of kludges
like BinHex to work around MacOS's "if only the entire world was like
*me*, no one would get hurt" attitude. In some ways, the MacOS
designers are even worse than Microsoft in terms of having the "the

And if you want to interoperate with the rest of the world, where at
last count over 92% of computers are NOT running HFS+, then "Thinking
Different" is indeed causing the problem, yes. And whose fault is that?

The whole point of interoperability is that when we communicate, we
have to do so in a uniform and predictable way. If we can't, the next
best thing is to have protocol translators; but in order to do that,
we must avoid lossy transformations, such as HFS+'s
pseudo-normalization. (Which, by the way, will not result in a "normal"
form for any glyph which can be encoded with and without a combining
character if said glyph was introduced into Unicode after 1988. So
you can't even call it a "normalization" algorithm, but just a
pseudo-normalization transformation which is lossy and which DESTROYS
filename information in an irrecoverable way.)

- Ted
-
NO NO NO NO NO. READ MY EMAIL. STOP MAKING ASSUMPTIONS ABOUT WHAT I'M
TALKING ABOUT.

The most frustrating thing about this thread is everybody keeps
arguing about what they *assume* I'm talking about without actually

And if you want to interoperate with the rest of the world, where at
last count over 92% of computers are running Windows, then using
another OS is stupid, right? Right? I mean, if everyone else is doing

Sure it's normalization, it's just not using one of the standard
forms. But the form is well-defined.

And yes, protocol translators are a good idea. That's why I thought
the original suggestion of using a table to map index filenames <->
HFS+ filenames sounded like it could work. The only time that should fail
is if the index contains multiple filenames that HFS+ will treat as a
single filename. Is there a problem with this approach?

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Hmmm. I'm pretty sure HFS+ has a lot of problems if you run OSX as an
NFS server with clients in different encodings. It would never work in
real life. The "envelope" OSs have to work in is hugely varied -- much

Did you spot the rather nasty issues that Ted mentioned earlier in the
thread? I would say HFS+ is a bit "special" rather than "different".

cheers,
m
-
Kevin,
as you might know, Linus' "other hobby" is to write kernels ;-) From
that POV, a filename is as much data as the data in the file. Doing
odd things like sorting it, searching through it, etc, is all work for
code higher in the stack that is free to mangle the data in any way it
wants, including creating nice case-insensitive indexes, and
who-knows-what for ideogram-based languages. In contrast, the core OS
treats user data as sacred stuff, and I'm thankful it does.

And from a kernel/filesystem POV, a directory is also a file. So if a
filename has a different number of octets, the directory will be
different.

For all the searching and matching, it really makes sense to have
something like locate or SpotLight or whatever to index user files
that should be easy to find and match, because all the locale rules
for matching are hideously expensive to apply. Even today, most UTF-8
aware (and supposedly collation-smart) applications have trouble
matching MARTÍN when asked for martín in a case-insensitive search.
That pesky Latin í trips them up every time.

cheers,
martin
-
That's certainly a reasonable POV. However, it's not the only one. As
evidenced by the Mac, treating filenames as strings rather than bytes
is a viable alternative POV - you can't argue that it doesn't work,
because OS X proves it does.

Sure, that makes sense. That's why, if you are going to mangle
filenames, you need to pick a stable form to always use, which HFS+

Perhaps you should try OS X. Every single Cocoa app should do the
search properly. In fact, I just checked using 3 different text
engines (WebKit, Cocoa's text engine, and ATSUI) and all 3 did the
case-insensitive search properly. That said, this isn't particularly
relevant.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
With its own slew of bugs. See Ted's reply earlier for a mouthful of

OSX has given me enough grief with other filesystem and general OS
problems that I have definitely abandoned it after 2 years trying to
use it part-time. It has been back to linux for me ;-)

cheers,
martin
-
Linus' approach is _FAST_.
Why do you think Git has now acquired a reputation of kicking asses all
around the SCM scene?

The HFS+ approach might be fine if you think of it in terms of "the user
will be awfully confused if two file names are shown identically in the
File Open dialog box". But it otherwise sucks big time when it comes to
high performance applications needing to deal with a huge amount of file
names at once.

Normalization will always hurt performance. This is an overhead.
Sometimes that overhead might be insignificant and not be perceptible,
but sometimes it is. And Git is clearly in the latter case. Performance
will be hurt big time the day it is made aware of that normalization.
This is why there is so much resistance to it, especially when the
benefits of normalizing file names are not shown to be worth their cost
in performance and complexity, as other systems do rather fine without
it.

Nicolas
-
Normalization is cheap if you normalize user input. The user will
always be quite slower than any reasonable normalization algorithm. But
in the filesystem, one is normalizing the same stuff over and over.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
I agree, Linus's approach is indeed fast. And if speed is more
important than treating filenames as text instead of octets, then so
be it. This is a trade-off. But a trade-off doesn't mean one approach
is "wrong", it just means the authors of HFS+ thought it was an
acceptable trade-off. HFS+ wasn't designed to be a high-performance
filesystem that deals with lots of files, it was designed to be a
filesystem used by regular people on the Mac, and I believe treating
filenames as text is a good choice in this scenario. Unfortunately,
this does mean git has to do extra work to behave correctly on this
system.

Now, to move on to actually coming up with a solution. Unfortunately I
don't know enough about the internals of git to really evaluate the
proposed ideas myself, or to write a patch. Hopefully I'll come up
with the time to acquire the necessary knowledge, but until then I can
only participate in these higher-level discussions.

--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Regular people have brains, not filesystems. HFS+ is employed by
computers, and computers can produce or query or process lots of data in
very short time spans, in their own pace. And if Mac users did not want
to make use of that, they would still be using Mac classics.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
Git's data model relies on SHA-1 hashing of data, including filenames.
So at some level, git _has_ to treat data as octets, and "equivalent"
strings must be the same at the octet level (or else you lose all of the
useful properties that the hashing data model provides). You can argue
about where in the program conversion and normalization occur, but I
don't think you can get around the fact that you're going to need
to think of the "filename as octets" at some point.

-Peff
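
Peff's point can be seen with a minimal Python sketch. It is a simplification and not git's code: git hashes whole tree objects rather than a bare filename, and the helper `sha1_of_name` is hypothetical. It only shows that two canonically equivalent names hash differently once they are octets.

```python
import hashlib
import unicodedata

# The same visible name in composed (NFC) and decomposed (NFD) form.
name_nfc = unicodedata.normalize("NFC", "L\u00fcftung.txt")
name_nfd = unicodedata.normalize("NFD", "L\u00fcftung.txt")

def sha1_of_name(name):
    """Hash the UTF-8 octets of a name, as a byte-oriented store sees it."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()

# Canonically equivalent as text ...
same_text = unicodedata.normalize("NFC", name_nfd) == name_nfc
# ... but different octets, and therefore different hashes.
same_octets = name_nfc.encode("utf-8") == name_nfd.encode("utf-8")
same_hash = sha1_of_name(name_nfc) == sha1_of_name(name_nfd)
```

Any normalization therefore has to happen before the name reaches the hashing layer.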
-
Could people _please_ stop this already?
I think the sane people see the difference between equivalence
and sameness, and we established that a filesystem that mangles
the filenames behind user's back is a bad design. Anybody who
followed the thread and still does not agree with you is, eh,
"ugly-and-stupid", as you might say ;-). You cannot educate
them all.

The thing is, even if you manage to educate them all, that broken
filesystem, and other filesystems with similar brokenness, do
not go away. If your ultimate objective is to declare that it
is the right thing for git not to support such broken
filesystems, and to make everybody agree to it, that is fine.
Please keep pouring fuel to the fire. But if that is not the
case, we would need to devise a way to make lives easier for the
unfortunate people who are stuck on such filesystems. They may
not even realize that they are unfortunate now, and I agree that
some education is justified, but this thread has raged on long
enough to salvage any salvageable lost souls (the remaining ones
may be beyond salvation but let's not waste time on them).

I'd rather see our mental bandwidth spent on coming up with a
workable workaround for such broken filesystems, while not
hurting use of git on sane platforms.

I fear it might have to end up being very messy and slow,
though.
-
Random thought. Would it make sense to implement a git paranoid
mode to autodetect name mangling?

I.e. after opening or creating a file by name we do a readdir in the
same directory to make certain we can find that same name/inode
combination. Then on name-mangling systems we can autodetect that they
exist and limit ourselves to just what they don't mangle, with no
prior knowledge, by refusing to process names that actively
get mangled. For small directories that you frequently see in
development it shouldn't even be that slow.

Eric
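
The check Eric describes could look roughly like the following Python sketch. The function names are made up for illustration, and it compares names only; the inode comparison he mentions is left out to keep the sketch short.

```python
import os
import tempfile

def name_survives_roundtrip(directory, name):
    """Create `name` in `directory`, then check that a directory listing
    hands back exactly the same bytes. False means the filesystem
    rewrote the name (for example into another Unicode form)."""
    wanted = os.fsencode(name)
    dir_bytes = os.fsencode(directory)
    path = os.path.join(dir_bytes, wanted)
    with open(path, "wb"):
        pass                                  # just create the file
    try:
        return wanted in os.listdir(dir_bytes)   # bytes in, bytes out
    finally:
        os.remove(path)

def filesystem_mangles(name="M\u00e4rchen"):
    """Probe a scratch directory with one test name (composed U+00E4 by default)."""
    with tempfile.TemporaryDirectory() as scratch:
        return not name_survives_roundtrip(scratch, name)
```

On a byte-preserving filesystem `filesystem_mangles()` should return False. On a filesystem that decomposes names, the listing would no longer contain the composed bytes and it would return True.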
-
Inside init-db where we already check how the filesystem
behaves, we could have an autodetection. A rough equivalent of
what I had in mind is:

    mkdir -p "Märchen/Märchen"
    if test "$(cd Märchen && echo M*)" = "Märchen"
    then
        : not mangling
    else
        git config core.namemangle true
    fi

(of course we do that in C not in shell).
-
I wonder if that is good enough. Git repositories can be copied over to
different filesystems.

Nicolas
-
Do you mean "cp -a"? If I am not mistaken we already have that
issue, due to core.filemode, when user does that across
filesystems with different behaviours.

There is not much we can do against "cp -a" other than telling
users that some configurations need to be adjusted.
-
Hi,
Actually, I see some value in calling them names, see

I was almost starting with hacking on this, but then the discussion
annoyed me too much, and I asked myself for who I think I'd do this.

IMHO those people should ask "how could I begin to work on this".
Instead, they started a useless flamewar.

Now, back to the issue: Robin posted a link to his UTF-8 work. While it
is way too intrusive, and not limited to filenames at all, I think it has
a few good pointers.

Ciao,
Dscho "who needs to calm down now"
-
As far as I can tell, the only time you ever run into the problems
you've described on a filesystem which treats filenames as unicode
strings (and therefore is free to normalize), are when you're trying
to interact with a filesystem that treats filenames as sequences of
bytes.

This doesn't mean treating filenames as unicode strings is wrong, it
just means that the world would be much better if every filesystem had
the same behaviour here. It's kinda like the endian issue, except
there's no simple solution here.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Hi,
[please do not top post. Just delete everything you do not reply to]
If you read the first message in this thread, you would probably know
that the problem exists on Mac even without any other filesystem being
involved. My understanding is that it is caused by HFS+ converting
one sequence of Unicode characters (generated by the Mac keyboard driver)
to another sequence using "fast decomposed" conversion.

[And please stop calling normalization what is not. Mac does NOT
normalize Unicode strings, it uses some sub-standard conversion,
which neither produces a normalized string nor is guaranteed to be

Actually, there is, if you care to do something. You can write a wrapper
around readdir(3) that recodes filenames into Unicode Normalization Form C.
This does not require much knowledge of Git -- what it requires is the
desire to do something to solve the problem. Of course, this step alone
is not a complete solution (it does not solve the case-insensitivity issue),
but it is the first step in the right direction...

BTW, Git is far from being the only software that ran into this problem with
Mac. But not being first, we can benefit from other people's experiences:
http://osdir.com/ml/network.gnutella.limewire.core.devel/2003-01/msg0000...

Dmitry
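
The readdir() wrapper Dmitry suggests would be written in C inside git. As a rough illustration of the idea only, here is a Python sketch; `listdir_nfc` is a hypothetical name, and the sketch assumes names on disk are UTF-8.

```python
import os
import tempfile
import unicodedata

def listdir_nfc(path):
    """Like os.listdir(), but return names recoded to Normalization Form C.

    Assumes UTF-8 names on disk; undecodable bytes are carried through
    with surrogateescape instead of raising."""
    raw_names = os.listdir(os.fsencode(path))      # bytes in, bytes out
    return [unicodedata.normalize("NFC", n.decode("utf-8", "surrogateescape"))
            for n in raw_names]

# Demo: create a file under a decomposed name, then list the directory.
with tempfile.TemporaryDirectory() as d:
    decomposed = "Ma\u0308rchen.txt".encode("utf-8")   # 'a' + U+0308
    open(os.path.join(os.fsencode(d), decomposed), "wb").close()
    names = listdir_nfc(d)     # the caller only ever sees the composed form
```

As the message says, this alone does nothing about case-insensitivity.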
-
If somebody wants to do this, then readdir() isn't the only place, but
yes, readdir() is one of the places.

I suspect that if we were to just do the "turn into NFC on readdir() on OS
X", that might actually be good enough to hide most of the problems. The
issue isn't just that OS X mangles the filenames, it's that it picks a
particularly *stupid* way to mangle them (the decomposed forms), which
means that OS X will actually not just corrupt "odd cases" of Unicode, but
will corrupt the obvious and *common* Latin1 translations of Unicode.

I don't know if NFC is better for other locales, but I doubt it. Usually
people want to do the *composite* forms, not the *de*composed forms.

A trivial example of this for some cross-OS issue:
- let's say that you have a file "Märchen" on just about *any* other OS
than OS X. It could be Latin1 or it could be Unicode, but even if it is
Unicode, I can almost guarantee that the 'ä' is going to be the
*single* Unicode character U+00e4 (utf-8: "\xc3\xa4", latin1: "\xe4")

So from a cross-OS standpoint, that's the *common* representation, and
yes, you can create the file that way (I don't know what happens if you
actually create it with the Latin1 encoding, but I would not be
surprised if OS X notices that it's not a valid UTF sequence and
assumes it's Latin1 and converts it to Unicode)

- But on OS X, because of Apple's *insane* choice of normal form, it will
then be turned into "a¨". I doubt *anybody* else does that. If you have
to normalize it, NFD is just about the *worst* choice.

So yeah, even just re-coding it as NFC on readdir() would at least mean
that any OS X git client would be MORE LIKELY to pick the same
representation as git clients on other OS's.

It wouldn't solve all problems (and it would almost certainly create a few
new ones), but it would likely at least increase compatibility between
systems.

So doing the NFC conversion on readdir() on OS X is probably a good ide...
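
The byte sequences in the "Märchen" example can be checked with a few lines of Python. This is only an illustration of the encodings named in the message, with variable names chosen here.

```python
import unicodedata

composed = "\u00e4"                                    # single code point U+00E4
decomposed = unicodedata.normalize("NFD", composed)    # 'a' followed by U+0308

utf8_composed = composed.encode("utf-8")       # the common cross-OS form
latin1_composed = composed.encode("latin-1")   # one byte in Latin1
utf8_decomposed = decomposed.encode("utf-8")   # what a decomposing filesystem stores

# Recoding to NFC brings the decomposed form back to the common bytes.
recomposed = unicodedata.normalize("NFC", decomposed).encode("utf-8")
```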
For what it's worth, their choice wasn't entirely "insane" ie. it did
have an element of rationality: that decomposed forms are a little bit
simpler to sort.

Of course, this doesn't excuse them for creating a file system that
interacts so horridly with basically everything else out there.

Cheers,
Wincent
-
No they are *not*.
In many languages, 'ä' does *not* sort like 'a' at all, and if you think
it does, you'll sort at least Finnish and Swedish totally wrong (åäö are
real letters, and they sort at the *end* of the alphabet, they have
nothing what-so-ever to do with the letters 'a' or 'o').

The fact that in *some* languages the decomposed forms sort as the base
letter is immaterial. It's only true in some cases.

So no, sort order is not it. To sort right, you need to use a real
Unicode sort (and the decomposed form is *not* going to help you one bit,
quite the reverse).

It may be that a case compare is easier in NFD (ie you basically only do
the case-compare on the base letter).

Linus
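
A toy Python sketch of the sorting argument: the alphabet string and both key functions are assumptions made for this illustration, not a real collation implementation (a real one would use locale-aware collation). It contrasts "sort by base letter", which is what decomposition makes easy, with a Swedish-style order where å, ä and ö come after z.

```python
import unicodedata

# Toy alphabet: the 26 basic letters followed by å, ä, ö.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    """Rank each composed letter by its position in the toy alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    return [SWEDISH_ALPHABET.index(ch) for ch in composed]

def base_letter_key(word):
    """Decompose, then drop combining marks: 'ä' is treated like 'a'."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["\u00f6l", "zebra", "\u00e4pple", "apa"]
by_alphabet = sorted(words, key=swedish_key)
by_base_letter = sorted(words, key=base_letter_key)
```

With the base-letter key, "äpple" lands next to "apa"; with the alphabet key it sorts after "zebra", which is the order the message describes for Finnish and Swedish.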
-
That's what I get for believing Wikipedia ("This makes sorting far
simpler"):

http://en.wikipedia.org/wiki/UTF-8#Mac_OS_X

Cheers,
Wincent
-
But there is no way to know whether 'ä' in a document is the Finnish 'ä'

Unicode sort is not enough, there is no language indicator in a Unicode
document, which is why Unicode, while solving a bunch of problems, has
its very own, cf. the infamous CJK problem.

But that's all very OT.
Mike
-
... without knowing the locale. Correct.
That's why sorting is locale-dependent, even in Unicode. And why you
should always sort using the *combined* character, not think that you can
sort by decomposed sequence.

That said, even then you get the wrong thing. Some things cannot be sorted
character by character at all, and have semantical sorting at a higher
level entirely. I think most European family names are traditionally
sorted by effectively using the prefixes (ie d', von, etc) as a secondary
sort key (so even though they are in front, they sort as if they were
at the _end_ of the name).

So unicode doesn't help with sorting, and you shouldn't even try to find
sort rules in the Unicode spec or tech reports. But in general,
decomposing the characters just makes things worse, not better. To sort
well, you tend to need the bigger picture, not the details.

Of course, for something like git, we sort by binary value, because we
also require the sort to be not just well-defined, but *stable*. A sort
based on any kind of unicode rule is rather likely to change over time.

Linus
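
Sorting "by binary value" can be sketched in Python as a plain byte-wise comparison of the UTF-8 names. This is an illustration of the principle only and does not claim to reproduce git's exact tree ordering rules.

```python
def byte_order(names):
    """Sort names by their UTF-8 octets: no locale, no Unicode rules."""
    return sorted(names, key=lambda n: n.encode("utf-8"))

# Composed 'ä' (0xC3 0xA4) sorts after every ASCII letter.
composed_order = byte_order(["b.txt", "a.txt", "\u00e4.txt", "Z.txt"])

# The decomposed spelling of the same visible name starts with 'a' (0x61),
# so it lands between "a.txt" and "b.txt" instead.
decomposed_order = byte_order(["b.txt", "a.txt", "a\u0308.txt", "Z.txt"])
```

The result never depends on locale or Unicode version, but it does depend on which form the name was stored in.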
-
That said, the locale doesn't necessarily express the language in which
the document is written. It's easy enough to read documents that are not
written in your native language on the net. That's already what we are both
doing right now. Fortunately, HTTP and HTML have ways to indicate the
language in which a document is written in, but that leaves out plain
mail, for instance.

That said, the "decomposed" version of UTF-8 has nice side effects on
OSX, with UTF-8 encoded RockRidge ISO-9660 volumes (with or without
Joliet; OSX will use RockRidge by default when it's there), for instance.

Mike
-
AFAIK, the RockRidge standard prescribes to use the portable character
set, and it has nothing to do with Unicode. Basically, it is a subset of
ASCII.

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html

So, I don't think UTF-8 encoded filenames are valid regardless of whether
they are decomposed or not.

Dmitry
-
Actually, it prescribes to use the portable *filename* character set,
which is even more restrictive than just the portable character set.

http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#t...

Anyway, there is no place for UTF-8 in it.
Dmitry
-
.. and quite commonly, there are multiple languages per document.
The good news is that sorting is almost never relevant or done over
general documents. You sort almost only well-behaved data, and quite often
the exact order is less than important: and when it is, you have very
specific rules (which probably seldom have anything what-so-ever to do

I think Unicode in general (and UTF-8 in particular) is a great thing. I
do not argue against Unicode at all. It's what I use myself.

The thing I argue against is that they force normalization (and then, as a
secondary complaint, their insane choice of target format).

Linux is generally UTF-8 too, and does all of this much better. No forced
normalization, and it uses UTF-8 everywhere as the encoding model. Joliet
and RR work beautifully.

(I don't think RR is NFD, btw. It's the standard microsoft UTF-16 without
normalization, afaik. I think you can happily generate a Rock Ridge disk
that has two _different_ filenames that OS X cannot tell apart, but that
both Linux and Windows can see properly)

Linus
-
Hi,
I think a better approach would be to try to match the name to what we
have in the index. Then we could implement case-insensitivity and MacOSX
workaround at the same time.

Ciao,
Dscho
-
I thought about that, but the problem is that HFS+ _already_ mangled
names from what the user entered (and what is used by anyone else)
to some sub-standard form, which no one outside of Mac likes or uses.
Thus, bringing filenames back to the NFC form (which is what almost
anyone uses) is the only sane thing to do, because no one outside of Mac
really needs to know about this HFS+ specific craziness.

So I really dislike the idea that due to some HFS+ specific conversion,
we may end up having some strangely encoded names in a Git repository.
Sane people enter names only in NFC, so why should they suffer because
of some insane conversion made by the filesystem behind everyone's back?
And I am not entertaining the idea of having this Mac OS/X specific
workaround outside of Mac OS/X.

Besides, writing a wrapper around readdir() is not difficult. We
already have git-compat-util.h, which redefines some functions for
some platforms, so I don't see any problem with writing a wrapper
around readdir().

Dmitry
-
Hi,
So? That's why I said "match", not "compare for identity".
To be a little bit more precise: I think a viable plan would be to
- have a config switch which determines what type of filename mangling we
allow the host OS to perform (Unicode "normalisation", case mongering),
and leave _everybody_ alone who left that switch unset,
- "overload" readdir() (by the famous git_X(); #define X git_X trick),
- have the overloaded readdir() _know_ which is the current prefix, and
load the index if it has not yet been loaded (but probably into a static
variable to avoid reloading, and to avoid interfering with the global
"cache" instance).

It _could_ be wise to store the "normalised" forms at one stage (instead
of the index) to speed up comparison -- the prefix has to be normalised
for readdir()s purposes, too, then.

This is possible with the HFS+ problem, since we know exactly how HFS+
tries to "help", and for case insensitivity too, I think. But it may be
restricting ourselves for other filename "equivalences" we might want to
handle one day.

BTW: I cannot think of anything else than readdir() which should have the
"problem" of reading back a name that the user did not specify. What am I

No. I think that would be a serious mistake. If you add a file on MacOSX
(with a _mangled_ filename, think of "git add ."), git should not try to

It _is_ UTF-8, so what's the problem?

As for the HFS+ specific conversion: like the CRLF issue, I am opposed to
have a "solution" affecting other people than those on broken system. So

Exactly.

Ciao,
Dscho
-
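
The "match against the index" plan Dscho outlines above could be sketched as follows in Python. This is not git code: the function names and the `fold_case` switch are hypothetical stand-ins for the config switch he proposes, and the "index" is just a list of names.

```python
import unicodedata

def match_key(name, fold_case=False):
    """Key under which two names count as 'the same file' for matching."""
    key = unicodedata.normalize("NFC", name)
    return key.casefold() if fold_case else key

def build_lookup(index_names, fold_case=False):
    """Precompute the normalised forms once, as the message suggests."""
    return {match_key(n, fold_case): n for n in index_names}

def match_to_index(fs_name, lookup, fold_case=False):
    """Return the name as recorded in the index, or None if untracked."""
    return lookup.get(match_key(fs_name, fold_case))

index = ["M\u00e4rchen.txt", "Makefile"]
lookup = build_lookup(index)
folded = build_lookup(index, fold_case=True)

# A decomposed name read back from the filesystem maps to the index entry.
tracked = match_to_index("Ma\u0308rchen.txt", lookup)
# Case differences only match when the switch for them is turned on.
strict = match_to_index("makefile", lookup)
relaxed = match_to_index("makefile", folded, fold_case=True)
```

A name with no index entry returns None, which is the "git add" case raised in the reply that follows.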
Well, more importantly, most of the important cases actually don't have an
index entry yet.

For example, what about "git add"? That's when it really matters that you
add things in a sane format, and by definition, you don't have an index
entry to try to match to.

So once you aim for NFC in "git add", now the index will generally be in
NFC anyway (since I agree that that's what you'd normally get on non-OSX
systems), so there is little point in then matching the index.

But no, it won't fix all problems. I do suspect it would make them less
obvious in practice, though.

Linus
-
FWIW: I just made a test and it seems that MacOS X refuses the creation

Maybe I'll try it.
Regards,
Mark
-
In this case, git's index counts as a filesystem that treats filenames
as sequences of bytes. But yes, it is possible, though somewhat
difficult, to produce this problem on just HFS+. It's far more common

From what the HFS+ technote says, it produces a variant of Normal
Form D. This variant is not guaranteed to be stable across
versions of HFS+, but in practice it is stable.

I'm not sure how that would solve anything. Sure, it would provide a
stable, known encoding for git to compare filenames against, but that
would only work if the filename is known to be Unicode, and as has
been pointed out, on other filesystems the filename can be whatever

It looks like their problem was binary compatibility with strings from
other clients that were using Normal Form C instead of Normal Form D.
git's problem is that it's only even using a known encoding on HFS+.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
There is no such thing in the standard as a variant of NFD. Moreover,
even if this conversion were described in the standard, it would never
be called normalization, because normalization means a conversion that
makes all equivalent strings have identical binary representations.

Apple calls it decomposition, which is correct even if it is not full

I believe that Git internally should use only UTF-8 for encoding file
names, commit messages, etc. The problem with some other filesystems
should be addressed separately (by those who work on those systems or
at least have access to them). Regardless of interoperability with other
systems, this change alone should solve the issue that was described
in the first message of this thread.

Dmitry
-
They are canonically equivalent, but they are different sequences
of characters as far as Unicode is concerned. In one case, we have one
character; in the other case, we have two characters that canonically

By definition, sequences of characters that are canonically equivalent

I am afraid it is you who confuses "characters" with "abstract
characters"; there is no place in the standard saying that
"characters" are "abstract characters" only. On the contrary, the
term "characters" is used to refer to non-abstract characters.

Dmitry
-
Perhaps it's just a case of confusion about naming conventions. I tend
to use "character" as a "grapheme cluster", i.e a "user character" (to
the end user, "ä" and "a"+diaeresis is the same character, no matter if
they would display as different glyphs), whereas some people use
"character" as a "code point", which would be more of a "programmer
character". And then there are some people that still use "character"
interchangeably for "bytes" or "code units" (for UTF-16; a pair of
surrogate code units is still only one "code point").

--
\\// Peter - http://www.softwolves.pp.se/
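
The different meanings of "character" that Peter lists can be counted directly. The Python sketch below uses a helper named `describe`, chosen here for illustration; it reports code points, UTF-16 code units and UTF-8 bytes. Counting grapheme clusters (the "user character") is not in the standard library and is left out.

```python
def describe(text):
    """Count the same text at three of the levels named above."""
    return {
        "code_points": len(text),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }

composed = describe("\u00e4")       # 'ä' as one code point
decomposed = describe("a\u0308")    # 'a' + combining diaeresis
clef = describe("\U0001d11e")       # outside the BMP: needs a surrogate pair
```

To the end user the first two are one and the same letter, yet every count differs between them.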
-
I just don't understand why you insist that the filename is data, when
it is clearly metadata. The filename has two purposes: to identify
the file to the user, and to provide a handle with which to reference
the file contents. The specific byte sequence is in no way sacred.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Uhh. And exactly how do you know the difference, and why should it matter?
A lot of data is metadata. Look at the git index file. It's *all*
metadata. Does that mean that the OS has the right to corrupt it?

IOW, why do you seem to argue that metadata is something you can corrupt, but
not "regular" data?

Why is it ok to change a filename, when that same filename may *also* be
encoded by the user in a regular data file (think about MD5SUM files, for
example, that include the pathname, but now the pathname is part of the
file data, not on a filesystem)?

So filenames are data, they're metadata, they're whatever. None of that
means that it's acceptable to corrupt them, or gives the OS any reason to
say that it "knows better" than the user in how users use them. It's still
the *user's* metadata, not the filesystem's own metadata!

In many cases, users use filenames *as* data, ie the filename actually has
a meaning in itself, not just as a handle to get the file contents.

If this was truly metadata that isn't visible to the user, and not under
the user's control (ie indirect block numbers etc), then you'd have a good
point. At that point, it's obviously entirely up to the filesystem how the
heck it encodes it.

But that's not what filenames are. Filenames are an index specified by the
user, not by the computer.

Linus
-
It's UTF-16 (when needed). I think it's all in the Linux kernel for you

It uses the local 8-bit codepage, which is not UTF-8, often some latin-inspired
thingy, but in Asia multi-byte encodings are used. In western Europe it is
Windows-1252, which is almost, but not exactly iso-8859-1. Oh, and then we
have the cmd prompt which has another encoding in 8-bit mode.

I think there is a cygwin patch that converts to and from UTF-8. An application
can choose to use the "A" or "W" interfaces. The W-APIs are the real ones and
the others are just wrappers that convert to and from UTF-16 before anything
happens (i.e. CreateFileA is slower than CreateFileW and so on).

-- robin
-
.. well, FAT certainly wasn't. But yes, VFAT probably is. Not that I want

Well, if it uses an 8-bit codepage, then that means that as far as the
POSIX filename interface is concerned, it has nothing what-so-ever to do
with Unicode (ie unicode is just a totally invisible internal encoding
issue, not externally visible).

I assume you have to use some insane Windows-only UCS-2 filename function
to actually see any Unicode behaviour.

Sad. Because there really is no reason to use a local 8-bit codepage when

So the CreateFileW() is the "native UTF-16 interface", and CreateFileA()
is the 8-bit codepage one that has nothing to do with Unicode and is
purely some local thing.

But for a UNIX interface layer, the most logical thing would probably be
to map "open()" and friends not to CreateFileA(), but to
CreateFileW(utf8_to_utf16(filename)).

Once you do that, then it sounds like Windows would basically be Unicode,
and hopefully without any crazy normalization (but presumably all the
crazy case-insensitivity cannot be fixed ;^).

So it probably really only depends on whether you choose to use the insane
8-bit code page translation or whether you just use a sane and trivial
UTF8<->UTF16 conversion.

Anybody know which one cygwin/mingw does?
Linus
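
The "sane and trivial UTF8<->UTF16 conversion" can be illustrated with a short Python sketch. `utf8_to_utf16` is the placeholder name from the message, not a real Windows or git function, and an actual compatibility layer would do this in C with NUL-terminated wide strings.

```python
def utf8_to_utf16(filename):
    """Transcode UTF-8 octets to UTF-16LE code units, code point for code point."""
    return filename.decode("utf-8").encode("utf-16-le")

def utf16_to_utf8(filename):
    """The reverse direction, e.g. for names read back from a directory."""
    return filename.decode("utf-16-le").encode("utf-8")

nfc = "M\u00e4rchen".encode("utf-8")
nfd = "Ma\u0308rchen".encode("utf-8")

# A pure transcoding round-trips exactly and keeps the two forms distinct.
roundtrip_ok = utf16_to_utf8(utf8_to_utf16(nfc)) == nfc
forms_stay_distinct = utf8_to_utf16(nfc) != utf8_to_utf16(nfd)
```

Unlike a code page translation, nothing is lost or merged in either direction.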
-
I just had to investigate this a bit, so on a Vista machine I started a cmd
prompt and typed mode con: cp select=65001, selected the lucida font and then
echo å >x.txt and opened it in notepad and it was UTF-8 encoded. So there might
be some hope after all. I don't know how to change the encoding for non-console
apps. I leave that as an exercise for the list.

-- robin
-
Yes, but have you tried to run any batch file? At least, on WinXP
all batch files silently stopped working after choosing 65001, and
I don't know what else gets broken, because Microsoft C library
does not work with encoding that requires more than two bytes per

It is not difficult to change the current encoding in any Windows
application; the real issue is that neither the Microsoft C library nor
the Cygwin library works correctly with UTF-8. There is a patch
for Cygwin though...

Dmitry
-
Indeed. On Windows, you should avoid using UTF-8 and instead use UTF-16
everywhere. That usually works better, and if you run on an NT-based
system it will convert all the data passed to the WinAPI to UTF-16 anyway.

--
\\// Peter - http://www.softwolves.pp.se/
-
Yes, the default code page for the command prompt uses the so-called OEM
encoding, and GUI programs use another one, which MS calls the "ANSI"
encoding. However, if you use Cygwin, then you have ANSI encoding in
the command prompt. So, in the same command prompt window, you can have
Cygwin programs using one encoding and other window console programs

Some people tried to set the current code page to 65001, which is
the Microsoft code page for UTF-8. However, it seems that does not
work very well.

http://support.microsoft.com/kb/175392
http://blogs.msdn.com/michkap/archive/2006/03/13/550191.aspx

It seems to me that Win32 API functions work correctly with
UTF-8 (after all, they are just wrappers over UTF-16 functions),
but Microsoft's C library cannot handle UTF-8 (or any other

There is a patch for Cygwin that adds UTF-8 support for it, however,
Cygwin maintainers do not like it, so it is not integrated. I think
Cygwin 1.7 will support UTF-8, but I have no idea how soon it will be
released.

I don't know much about mingw, but if I am not mistaken, mingw relies
on Microsoft's C library, so I suppose it uses an "OEM" code page for
console programs by default.

Dmitry
-
That's the easiest thing to do if you want to assure that things will
mostly work across multiple different OS's, with different levels of
sanity. You might also want to include that it's a bad idea to create
two filenames that are identical on case-insensitive filesystems,
i.e., "makefile" and "Makefile", or "foo.H" and "foo.h" which even
though it works Just Fine on Linux, will likely cause problems on
Windows and MacOS filesystems, and other systems that are insane with
respect to case insensitivity.

- Ted
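
Ted's advice about names that collide on case-insensitive filesystems is easy to check mechanically. A minimal Python sketch follows; `case_collisions` is a hypothetical helper, not an existing git feature, and it uses Python's `str.casefold()` as a stand-in for whatever folding a given filesystem really applies.

```python
from collections import defaultdict

def case_collisions(paths):
    """Group paths that would be the same name on a case-insensitive filesystem."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return [names for names in groups.values() if len(names) > 1]

collisions = case_collisions(["makefile", "Makefile", "foo.H", "foo.h", "README"])
```

A check like this could run before committing in a project that wants to stay portable.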
-
Ahhhh ... now I understand.
Regards,
Mark
-
Hi,
+1.
And I would suggest the use of RFC 3454 as the guidelines for UTF-8
normalization.

Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
-
Hi,
-1.
It's just too arrogant to force your particular preferences down the
throat of every git user.

Ciao,
Dscho
-
It's not arrogant to make a suggestion. Where is your alternative solution?
However, what about storing an additional information like the file
system encoding (for every file)? This would result in the same
behaviour (and speed) as today as long as the file system encoding is
the same. Conversion will only be done when the target's file system
encoding is different.

BTW: This reminds me of the code page switching stuff back in the times
of MS-DOS 4/5. This really wasn't funny.

Regards,
Mark
-
Hi,
Do you agree that you need to store or at least calculate a
normalized version of each filename to see if you are already
tracking the file, to take into account all the filesystems out
there that are not case-preserving, case-sensitive?

If so, do you think those rules should be an option? Or a preference?
Should I specify in my config file that I want my filenames to be
normalized?

Ignoring encoding and case-sensitivity issues in the git index creates
problems for those people who want/need to use non-ascii chars in
their filenames, and have some chance of being able to collaborate
with other users on different operating systems.

Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
-
The problem is that there is no way to know what the "target system
encoding" is.

And it wouldn't actually solve the bigger problem on OS X anyway: as long
as you are case-insensitive, you'll have all the same problems (ie the
insane OS X filesystem presumably thinks that "MÄRCHEN" and "Märchen" are
also identical, because they are "equivalent" names).

Linus
-
Hi,
Correct. Storing or using a normalized version of the filename is
only part of the problem.

The full problem is:

User A <-> filesystem A <-#-> git < ...... > git <-#-> filesystem B <-> user B.

You have to encode/decode/normalize on all the <-#-> and there is no
magic bullet. Each user would have to tell git "Hey I'm using utf-8"
or "Hey, I'm a masochist using HFS+".

But I think it's important for git to store the filenames in something
that at least permits this kind of scenario.

All encoding/decoding/normalization is of course optional, and for

Correct. HFS+ has bigger problems. I'm not sure if this is enough to
solve it.

But it would solve two linux users using different encodings.
And given that the filtering layers are optional, you have to
configure them, it won't bite anybody.

Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
-
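
The optional filter layers at each `<-#->` in Pedro's diagram could be sketched like this in Python. Everything here is an assumption for illustration: the function names, the idea that each user declares an encoding and a Unicode form, and the choice of UTF-8 in NFC as the stored form (which the thread did not agree on).

```python
import unicodedata

def to_repo(raw, fs_encoding):
    """filesystem -> git: decode with the user's declared encoding,
    store as UTF-8 in NFC (an assumed canonical form)."""
    text = raw.decode(fs_encoding)
    return unicodedata.normalize("NFC", text).encode("utf-8")

def from_repo(stored, fs_encoding, fs_form="NFC"):
    """git -> filesystem: re-encode for the receiving user's declared settings."""
    text = stored.decode("utf-8")
    return unicodedata.normalize(fs_form, text).encode(fs_encoding)

# User A is on a Latin1 system, user B on UTF-8, user C on a decomposing one.
stored = to_repo(b"M\xe4rchen", "latin-1")
for_user_b = from_repo(stored, "utf-8")
for_user_c = from_repo(stored, "utf-8", fs_form="NFD")
back_to_a = from_repo(stored, "latin-1")
```

A user who configures nothing would bypass both functions, so existing repositories behave as before.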
No, the encoding was the same -- UTF-8. MacOSX converts one sequence of
Unicode characters to *another* sequence, which is canonically equivalent,
but being canonically equivalent does not mean they are the same characters.
In the same way, being compatibility equivalent does not mean being the
same, and being case-insensitively equivalent does not mean being
the same... Do you remember DOS? It stored all filenames in upper-case,
so the original and stored names are case-insensitively equivalent, but
they are not the same!

Dmitry
-
Those of us who grew up on a case-insensitive filesystem don't find
there to be any problem with it. I can count on one hand the number of
times I've run into a problem caused by a case-insensitive filesystem.
That number is 1. And that 1 time is when git screwed up trying to

That's only true if you don't know what type of filesystem you're on.
And, in the vast majority of cases (in fact, a content tracker is the
only exception I can think of), it doesn't matter. If the user said
'xyz' and you can stat() it, great, that's what the user wanted! Just
because it's really called 'Xyz' on the filesystem doesn't make any

But git is a content tracker, so even if it's really a different
hardlink that shouldn't matter, it's still referencing the same
content. Go ahead and track whatever name the user specified
originally, as long as it maps to a file on disk with the expected
content you're set. If the file is really called 'foo' and I told git
to track 'Foo', I'm perfectly happy with it continuing to think 'foo'

I don't see that as being a problem. Think of it, if you will, as if
every single file simply had an implicit hardlink for every possible
case or normalization variant. The whole point of the filename is that
it is meta-information, used as an identifier and not as actual
content, and thus it is perfectly fine for it to be a real string,
subject to interpretation, rather than treated as a sacred binary blob
like content is. The whole purpose of the name is to identify the
inode in question, and case and normalization aren't particularly

Again, as someone who grew up in a case-insensitive world, there are no
problems here. I wish I could tell you that it causes problems, I wish
I could agree with you, but I can't.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
I guess you haven't used unix tools much. The ever-popular HEAD perl
utility (which does an HTTP HEAD against a URL), when installed,
silently overwrites the head shell utility, which is used for all
sorts of things, some even in startup scripts. Ooops! I've been hit by
this more than once - and if you google for it, it hurt a lot of

Hmmm. Many important tools - that I wouldn't want to ever fail! - have
similar needs to git. Backup/restore and file replication tools for

Ok - but how do you track the directory then (in git's terms, the
tree). There's no way to tell what the user wants. Does the user want
a copy of the file with different capitalization, or is the OS playing

Probably because you have been surrounded by tools that have a lot of
extra code to cope with the case insensitive way of life, and learned
to not do things that are completely valid, just to avoid trouble.
Which is ok, but I don't think it makes the OS design decision
defensible.

cheers,
m
-
I can imagine. However, I've never been hit by such a situation. This
doesn't mean a case-insensitive filesystem is a problem per se, it
means interactions between a case-insensitive and a case-sensitive
filesystem can be a problem. That doesn't mean either way is "correct",
it just means both don't work well together.

I like ice cream, and I like steak, but I sure don't think a mixture

Both of which would be replicating the directory contents, not a
listing of files specified by the user. If, as a user, I were to say
"please replicate file FOO" and the file was really called "foo", I
wouldn't be in the least surprised to see the tool take me at my word
and produce a file called "FOO" with the contents of "foo". But in
general, things like this operate on the filesystem, not on the user

If I say "track FOO", I probably mean it. So go ahead and track "FOO",
even if you end up tracking the contents of file "foo". I certainly

Sure I do. I find it very convenient, for example, to say "cd
documents/school" when I really want to go to "Documents/School".
Similarly, if I'm trying to reference gitweb/tests/Märchen, I'm quite
happy to not have to figure out what normalization the filename is
using and attempt to replicate that (especially as I have no idea
which normalization my input mechanism uses - unlike Linus, I don't
have a key dedicated to ä, and even if I did I wouldn't necessarily
expect it to use precomposed vs decomposed). I can't think of a single
reason why I'd want to be able to have 2 different files named
"Märchen" on my disk. On the other hand, treating unicode
normalization as significant can pose security risks - how am I to
know that the file that is named "foo.txt" is really the same file
"foo.txt" that I last saw? Someone I know on IRC sent me this
image[1], which shows 6 files all apparently named "foo.txt" on a disk
...
For those on Mac OS X: it is possible to create a case-sensitive HFS+
partition and use it with git. You even can just create a disk image
and mount it. However, I wouldn't quite try to use it as startup
filesystem...

-Geert
PS. I'm working on a proposal/patch for addressing the UFS/case
sensitivity issues.
Will try to mail something later this week.
-
I was going to post this earlier, but wanted to search the archives
first. Here are the commands assuming you don't want to or can't
partition a drive and format as ufs (I don't care for HFS+ much). I
can't believe I didn't find the command in the git list archives, so
voilà:

$ hdiutil create -size 300m -fs UFS foo.dmg
...............................................................................
created: /Users/mitch/foo.dmg
$ hdiutil attach foo.dmg
/dev/disk2 GUID_partition_scheme
/dev/disk2s1 Apple_UFS /Volumes/untitled
$ cd /Volumes/untitled && git clone git://git.kernel.org/pub/scm/git/git.git
... snipped ...
$ cd git && git status
# On branch master
nothing to commit (working directory clean)

After git clone in HFS+ land...

$ git status
# On branch master
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# gitweb/test/Märchen
nothing added to commit but untracked files present (use "git add" to track)

Should I just add this to the wiki? Then we can all go back to
ignoring the insane filesystems.

Mitch
-
While it's a nice workaround, it really is just that (a workaround)
because performance will be suboptimal in a repository running on a
disk image (and many of us switched to Git because of its speed).

Cheers,
Wincent
-
Not only is it suboptimal, it's also not acceptable, plain and simple.
If an individual wants to do that, sure, but it's simply not an
appropriate solution in general for this problem. I certainly don't
want to have to attach a disk image every time I want access to
anything I keep in a git repo, nor do I want to be restricted to
keeping everything within a certain filesystem on disk. Additionally,
while I'm not certain it's impossible, it's certainly very difficult
to attach a disk image without anybody logged into the system at the
GUI, as diskarbitrationd won't be running.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Hi,
If it's not acceptable, do something about it (and I don't mean writing 50
emails). If you don't want to do something about it, I have to assume that
you accept it as-is.

Ciao,
Dscho
-
I never said I don't want to do anything about it. However, I do
believe that it will take a significant investment of time and energy
to learn all the gooey details of how git handles filenames and how
the index works and all that jazz, which is knowledge that other
people already have. I believe that, for me to solve this problem
independently, it may require so much time that it never gets done
(after all, I am fairly busy). However, if other people who already
have this knowledge are willing to help, that would make this task far
easier, especially given that if nobody else even acknowledges that
this is a problem I don't have much hope of getting a patch accepted.

So again, I'm certainly going to try, but working by myself it simply
may never get done.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
(This is only for those that think the problem should be solved somehow. The
rest can move on - nothing to see here)

You may look at http://rosenberg.homelinux.net/cgi-bin/gitweb/gitweb.cgi?p=GIT.git;a=log...
for inspiration. It's pretty obsolete by now and only a "proof of concept", i.e.
it can be done, not that it necessarily should be done exactly this way.

Basically it intercepts the user's access to git, i.e. certain commands
and how files are named (since those names represent a user interface). Then
it assumes the internal encoding is UTF-8 (or garbage), converting to and
from the user's local encoding. The heuristic is based on the assumption that
a string (even a random one) that looks like UTF-8 is, with very high
probability, actually UTF-8 encoded.

The test cases might be usable almost as is.
-- robin
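
A rough sketch of the heuristic described above, written in Python for
illustration only. It is not the code from the linked proof of concept;
the function name decode_filename and the latin-1 fallback are
assumptions chosen for the example:

```python
def decode_filename(raw: bytes, local_encoding: str = "latin-1") -> str:
    """Treat bytes that are valid UTF-8 as UTF-8, else use the local encoding.

    Hypothetical helper: bytes in a legacy 8-bit encoding rarely form
    valid UTF-8 by accident, so a successful strict decode is taken as
    evidence that the name really is UTF-8.
    """
    try:
        return raw.decode("utf-8")          # strict: raises on invalid input
    except UnicodeDecodeError:
        return raw.decode(local_encoding)   # assumed fallback encoding


utf8_name = decode_filename(b"M\xc3\xa4rchen")   # valid UTF-8
legacy_name = decode_filename(b"M\xe4rchen")     # latin-1 bytes, invalid as UTF-8
print(utf8_name == legacy_name)                  # True
```

Pure ASCII names decode identically either way, so the guess only matters
for names with bytes above 0x7F.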
-
This is starting to stray far afield, but the first thing I did when I
got a Macbook was to reinstall it with case-sensitive HFS as the boot
file system. Works fine, including with git. The only problem I have
had is that FileVault does not work. There are rumored to be some
third-party apps that do not work but I do not use that many of those
anyway.

andrew
-
The main problem with this approach is you know for certain that using
HFSX as the boot partition is barely tested by Apple, and certainly
untested by third-party apps. This means the potential for breakage is
extremely high.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
No, actually, HFSX boot partitions are fairly well tested by Apple and
most 3rd-party programs. I had one for a while and the only problems
I encountered were with programs ported from Windows without Mac
versions, such as "Microsoft Office for Mac" and "World of Warcraft".
"Quake 4" has a few quirks which are easily worked around.

Cheers,
Kyle Moffett
-
Perhaps the big name companies might do some testing on HFSX, but I
can guarantee most third-party programs will not be tested under HFSX.

Also, World of Warcraft isn't a ported program. It was developed for
the Mac concurrently with the Windows version. Same with MS Office -
it's an entirely different team (the Mac BU) developing MS Office for
Mac independently of the Windows version, not a porting job. However,
if you're saying these two big-name programs had problems, I wouldn't
be surprised to see many more problems on various other third-party
apps from smaller companies.

In any case, "just use HFSX" is still not an appropriate solution to
the problem, especially since that will only take care of case
sensitivity and not the utf-8 stuff.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
THAT IS NOT TRUE!
How the hell does the computer know what the string means?
Hint: it does not.
The fact is, the user may use a non-normalized string on purpose. It's not
your place to say that the user is wrong. Your "understanding" is simply
wrong. Two strings are *different* if they are [un]normalized differently.

Really.

The exact same way the word Polish and polish are different, just because

You do not understand.

In *order* to do case-insensitivity, you generally need to normalize (and
do other things too - normalization is just *one* of the things you need
to do).

So if you are a case-insensitive filesystem, then normalization is sane.

You define "string" to be something totally made-up.
In your world "string" means "normalized". BUT IT'S NOT TRUE!
You define normalization to be a property of strings, without any actual
backing for why that would be.

The fact is, *looks the same* is very very different from *is the same*.
But you seem to be too stupid to understand the difference.

Linus
-
You're right. The normalization only really needs to happen as part of the
name comparison itself.

Linus
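
As a sketch of what "normalize only inside the comparison" could look
like (an illustration in Python, not anything git does; the helper names
compare_key and same_name are made up for this example), the stored
strings are never rewritten and only a temporary key is normalized:

```python
import unicodedata


def compare_key(name: str, ignore_case: bool = False) -> str:
    """Build a throwaway comparison key; the name itself is left untouched."""
    key = unicodedata.normalize("NFD", name)
    if ignore_case:
        # Case folding can undo normalization, so normalize again afterwards.
        key = unicodedata.normalize("NFD", key.casefold())
    return key


def same_name(a: str, b: str, ignore_case: bool = False) -> bool:
    return compare_key(a, ignore_case) == compare_key(b, ignore_case)


composed = "M\u00e4rchen"
decomposed = "Ma\u0308rchen"

print(composed == decomposed)                               # False
print(same_name(composed, decomposed))                      # True
print(same_name("Polish", "polish"))                        # False
print(same_name("M\u00c4RCHEN", decomposed, ignore_case=True))  # True
```

Whether names that compare equal this way should be treated as the same
file is exactly the policy question argued over in this thread; the
sketch only shows the mechanism.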
-
Hi,
For the record, HFS+ is case-insensitive but case-preserving so I
believe they keep the original filename around. I don't have the spec
in front of me, but from memory I believe that this is what they do.

But I think that focusing on HFS+ is losing sight of the real
problem. It's not about encoding at the filesystem, but encoding
inside the git structures.

Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
-
Hi,
For the record, that's only the default setting. AFAIK you can configure
it to care about case, too.

Also for the record, the whole thread was about HFS+ _not_ keeping the

So far I have not seen anyone talking _seriously_ about this issue. Only
a few shouts "you should support", and a few shouts back "I don't care
about insane filesystems".

Therefore, I fully agree with you that we're losing sight of the real
problem.

Ciao,
Dscho
-
Actually, there is no good reason for non-normalized forms (deficient
software not able to deal with some of the normalized forms is not a
good reason: such software should be fixed).

It is just that the file system is a rather quirky place for enforcing
the normalization. One should not be able to get unnormalized forms

No. Input methods are not the same as their resulting string. I can
even produce some ASCII characters on my keyboard in more than one way

Yup. But that does not mean that normalization is a bad idea. It is
just that the filesystem is not the right place for it.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
I'd actually agree, and it then boils down to the second sane choice I
gave earlier:

- don't accept data you don't like

If you don't like non-normalized names, don't create them. That's fine.

Oh, absolutely. You can - and often should - normalize in the application
(or have libraries to do it for you).

Not silently and behind people's backs.

Linus
-
"FUD" is a bit strong, don't you think? HFS+ is the way it is and it
would be nice if Git could deal with it.

The problem is that HFS+ normalizes filenames to avoid multiple files
that appear to have the same name (eg "M<A WITH UMLAUT>rchen" vs
"Ma<UMLAUT MODIFIER>rchen", in gitweb/test). This is sort of like
case sensitivity, but filenames are normalized when a file is
_created_. Git, not unreasonably, expects a file to keep the name it
was created with.

As far as I can tell, as long as you add all your internationally
becharactered files to git from an HFS+ file system using a gui or
command-line completion, you'll be okay; trouble starts when you check
in a file with the composed form of a character, by typing the name on
the command line (I'm not sure about this one) or committing on
another OS. Git will store the filename in composed form, but the
Mac's filesystem will decompose the filename when you check the file
out.

The result looks like this:

vredefort:[git]% git status
# On branch master
# Untracked files:
# (use "git add <file>..." to include in what will be committed)
#
# gitweb/test/Märchen
nothing added to commit but untracked files present (use "git add" to track)

(this is directly after checking out git.git @ v1.5.4-rc3)

There are two things to note here. One is that Git thinks that there
is a new file called "gitweb/test/Märchen" (decomposed) when it's
"really" just the same "gitweb/test/Märchen" (precomposed) that's in
the repository. The other is that git _thinks_ that the
"gitweb/test/Märchen" (precomposed) it's expecting is still there,
because the filesystem, when asked for "gitweb/test/Märchen" in any
form will return the file "gitweb/test/Märchen" (decomposed).

Trying to check out the "next" branch at this point is a pain since
next's "Märchen" would overwrite the untracked "Märchen".

I can't provide links to any previous discussions about this, but
here's Ap...
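
The two observations above can be reproduced in a toy model. This is a
simplified sketch in Python, not how git or HFS+ are implemented: the
"index" is a set of names compared exactly, and the "filesystem" is
assumed to store names in NFD and to normalize every lookup the same way.

```python
import unicodedata


def nfd(path: str) -> str:
    return unicodedata.normalize("NFD", path)


# Name as committed on another OS: precomposed.
index = {"gitweb/test/M\u00e4rchen"}

# What the directory listing reports after checkout on the toy filesystem.
on_disk = {nfd(p) for p in index}


def exists(path: str) -> bool:
    """Toy lookup: the filesystem normalizes the requested name first."""
    return nfd(path) in on_disk


# Pass 1: is every tracked file still there? Lookups succeed in any form.
missing = {p for p in index if not exists(p)}

# Pass 2: is every listed file tracked? Exact comparison fails.
untracked = {p for p in on_disk if p not in index}

print(missing)           # set()
print(len(untracked))    # 1
```

The same file therefore shows up as both present and untracked, which
matches the status output quoted above.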
FWIW, here's Sun's take on the issue of filesystems and i18n:
http://developers.sun.com/global/products_platforms/solaris/reference/pr...
j.
-
Pretty sane, from a quick read-through, although most of it seems to be
not so much about the general issue as about "let's emulate others
correctly on their filesystems" (ie the rules are different for NTFS and
HFS+, little enough discussion about "native" preferred logic).

However, while they don't consider normalization on file creates to be the
"preferred solution", they *do* consider filename comparison with
canonical equivalence to be that. Which means that you can get the same
odd problems:

fd = open(filename, O_CREAT);
+
readdir()

can actually return a *different* filename than the one we just created,
if it already existed in the directory under the different normalization.

So it's basically "normalization-preserving, but normalization-ignoring"
(the same way many filesystems are case-preserving, but case-ignoring). I
don't much like it either, but as with case, the "preserving" behaviour is
probably the nicer one.

I'd guess the problems are harder to trigger in practice, but you can
still get some pretty hairy cases. It's just painful when readdir() and
your own file creation doesn't have any obvious 1:1 relationship.

Linus
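
A small simulation of the "normalization-preserving, but
normalization-ignoring" behaviour described above, in Python. The class
NormIgnoringDir is a made-up model, not a real filesystem API: it keeps
whichever spelling created the entry and matches later names by their
NFC form.

```python
import unicodedata


class NormIgnoringDir:
    """Toy directory: preserves the first spelling, ignores normalization."""

    def __init__(self):
        self._entries = {}   # NFC key -> name exactly as first created

    def create(self, name: str) -> None:
        # Like open(name, O_CREAT): an equivalent existing entry is reused.
        key = unicodedata.normalize("NFC", name)
        self._entries.setdefault(key, name)

    def readdir(self) -> list:
        return list(self._entries.values())


d = NormIgnoringDir()
d.create("Ma\u0308rchen")    # decomposed spelling already in the directory
d.create("M\u00e4rchen")     # we "create" the precomposed spelling

names = d.readdir()
print(len(names))                  # 1
print("M\u00e4rchen" in names)     # False: readdir() returns the older spelling
```

A program that expects to find the exact name it just created in the
listing will not find it here.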
-
So here's what I can see as being useful additions to git:

* Allowing a repo to be *optionally* configured to disallow two files
in a directory that can cause aliasing problems, with options for
unicode normalization aliasing and/or case-insensitivity aliasing. Can
this already be done via hooks and someone just needs to write the
appropriate hooks?

* Having git warn during checkout if there are files which alias in
the working copy filesystem. I guess it might be interesting if there
were a mechanism in this situation for telling git which of the
aliases you want checked out, though that doesn't seem like a very
good feature.

Thoughts (besides "patches welcomed")?

j.
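
The aliasing check proposed above could be sketched roughly as follows.
This is an illustrative Python sketch, not an existing git hook or
option; find_aliases, alias_key and their fold_case/normalize switches
are hypothetical names. How a hook would obtain the list of paths is
left out.

```python
import unicodedata
from collections import defaultdict


def alias_key(path: str, fold_case: bool = True, normalize: bool = True) -> str:
    """Map a path to a key under which aliasing paths collide."""
    key = path
    if normalize:
        key = unicodedata.normalize("NFD", key)
    if fold_case:
        key = key.casefold()
        if normalize:
            key = unicodedata.normalize("NFD", key)
    return key


def find_aliases(paths, fold_case: bool = True, normalize: bool = True):
    """Return groups of paths that would alias under the chosen rules."""
    groups = defaultdict(list)
    for p in paths:
        groups[alias_key(p, fold_case, normalize)].append(p)
    return [sorted(g) for g in groups.values() if len(g) > 1]


paths = [
    "Makefile",
    "makefile",
    "gitweb/test/M\u00e4rchen",
    "gitweb/test/Ma\u0308rchen",
    "README",
]

for group in find_aliases(paths):
    print("aliasing paths:", [p.encode("utf-8") for p in group])
```

Either rule can be switched off independently, matching the "and/or" in
the proposal: fold_case=False reports only the normalization clash, and
normalize=False reports only the case clash.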
-
I think we already discussed a plan to store normalization
mapping in the index extension section and use it to avoid
getting confused by readdir(3) that lies to us. Is there
anything more that needs to be discussed?

I would presume that we would still add _new_ paths using the
pathname we receive from the user (there is no need for us to be
similarly insane as broken "normalizing" filesystems), but
deciding if a path is new or we already have it in the index
would be done by seeing if an entry already exists in the index
whose "normalized" form is the same as the "normalized" form of
the given path --- that way we would not add two paths to the
index that would "normalize" to the same string.
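
The lookup rule described above might look like this in miniature. It is
only a sketch in Python of the idea; the Index class is hypothetical, NFC
is an arbitrary choice of normalized form, and nothing here reflects
git's actual index format or its extension section.

```python
import unicodedata


class Index:
    """Toy index: stores paths as given, but looks them up by normalized form."""

    def __init__(self):
        self._by_key = {}   # normalized form -> path exactly as first added

    @staticmethod
    def _key(path: str) -> str:
        return unicodedata.normalize("NFC", path)

    def lookup(self, path: str):
        """Return the stored spelling of an equivalent path, or None."""
        return self._by_key.get(self._key(path))

    def add(self, path: str) -> str:
        """Add a new path as the user spelled it, unless an equivalent exists."""
        existing = self.lookup(path)
        if existing is not None:
            return existing
        self._by_key[self._key(path)] = path
        return path


idx = Index()
first = idx.add("Ma\u0308rchen")     # new path: kept exactly as received
second = idx.add("M\u00e4rchen")     # equivalent path: resolves to the first
print(first == second)               # True
```

The stored name is never rewritten, so a repository shared with other
platforms keeps whatever spelling was committed first.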
-
And what do we do when asked to check out a tree which has two
different files in it whose normalized forms are the same (ie. a clone
of a repo created on a non-HFS+ filesystem)?

We either have to fail catastrophically, preventing the user from
working with that tree on HFS+, or arbitrarily pick one of the files
as the "winner" which gets written out into the work tree. None of the
options is particularly attractive, although luckily this exact
situation is unlikely to come up in practice.

Cheers,
Wincent
-
Hi,
[Jay, don't cull Cc: lists on vger.kernel.org. I consider it rude.]
On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
> El 17/1/2008, a las 6:15, Junio C Hamano escribi
Hi,
I searched the archives for the posts about normalization and I could
not find them, sorry.

Is stringprep (RFC 3454) being proposed as an optional normalization
step before lookups in the index?

Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
-
Hi,
> > > El 17/1/2008, a las 6:15, Junio C Hamano escribi
If this is really just a platform-specific hack, can we use
platform-specific code to do the normalization?

On Mac OS X we have (unfortunately only 10.4 and up):

CFStringCreateWithFileSystemRepresentation()
CFStringGetFileSystemRepresentation()
CFStringGetMaximumSizeOfFileSystemRepresentation()

If we were to use those you'd at least know that you're getting the

http://developer.apple.com/qa/qa2001/qa1173.html

Cheers,
Wincent
-
Hi,
That remark about the version raises my eyebrows. Where I live, 10.2.8 is
_still_ quite common.

Ciao,
Dscho
-
There may be alternatives that I don't know about.
All the way back to 10.0 you have -[NSString
fileSystemRepresentation], which does the same thing but that's
Objective-C. I wouldn't be surprised if that's just a wrapper for the
CF functions; that's often the way it is on Mac OS X. And often, the
CF functions *are* present on older systems, but they're just not
declared in public headers. I wouldn't actually recommend using a
private SPI, but they are often there.

Cheers,
Wincent
-
Such a special mode would be mostly useless in most contexts, where
Git is used to track source code. It would enable you to check out the
tree for inspection, but you probably couldn't build anything from it
seeing as at least one of the filenames specified in your Makefile
wouldn't be present in the work tree.

As such, in that kind of situation I'd rather see a big red warning
printed out that the checkout failed because a particular file
couldn't be written out, and perhaps an instruction to the user that
they can use "git show" if they want to see the blob/s which
wasn't/weren't written.

Cheers,
Wincent
-
