Hi, I have some files like "Lüftung.txt" in my repository. The strange thing is that I can pull / add / commit / push those files without problem but git-status always complains that those files are untracked (but not missing). My assumption is that it's a problem with the way Mac OS X stores the file names (decomposed UTF-8). So something like "Lüftung.txt" becomes "Lüftung.txt". It seems that git-status does two things: 1. Find files under version control (i.e. search for missing files) 2. Find files not under version control (i.e. search for untracked files) I guess that the first look-up succeeds because Mac OS X converts composed UTF-8 to decomposed UTF-8 when searching for a file. But it seems that the second look-up takes the file names as-is (decomposed) without converting them to composed UTF-8. Is there an easy way to fix this behaviour? It's really annoying to see all those "untracked" files that are already under version control when executing a git-status. Regards, Mark
Hi, On Wed, 16 Jan 2008, Mark Junker wrote: > I have some files like "L
More like, Mac OS X has standardized on Unicode and the rest of the world hasn't caught up yet. Git is the only tool I've ever heard of that has a problem with OS X using Unicode. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
Apple's decision[*] to use _decomposed_ unicode causes all sorts of little problems because other tools aren't expecting to see strings changed behind their backs. I know little about the gritty details, but I see the bug reports... -Miles -- Any man who is a triangle, has thee right, when in Cartesian Space, to have angles, which when summed, come to know more, nor no less, than nine score degrees, should he so wish. [TEMPLE OV THEE LEMUR] . -
As far as I know, Subversion has basically exactly the same problem,
and any time you consume/produce files on Mac OS X that are to be
consumed/produced on other platforms you will run into this kind of
issue, with any software.
Tell Mac OS X to write a file with "ó" in the file name ("\xc3\xb3" in
UTF-8), and it will "normalize" it prior to writing by converting it
into a decomposed form (that is, ASCII "o" followed by "\xcc\x81", or
"combining acute accent"). So they're both valid Unicode, both valid
UTF-8, and they encode exactly the same characters but the byte stream
is different.
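
The two byte streams described above can be reproduced with a short Python sketch. It relies only on the standard library's `unicodedata.normalize`; the NFC and NFD forms are used here as stand-ins for "precomposed" and "decomposed", which is an assumption, since HFS+ is reported to use a variant that does not match NFD exactly.

```python
import unicodedata

composed = "\u00f3"                                  # "ó" as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # "o" followed by U+0301

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

print(composed_bytes)    # b'\xc3\xb3'
print(decomposed_bytes)  # b'o\xcc\x81'

# Same text once normalized, but different byte streams on disk.
same_bytes = composed_bytes == decomposed_bytes
same_text = unicodedata.normalize("NFC", decomposed) == composed
print(same_bytes, same_text)
```

A tool that compares file names byte for byte sees two different names here, while a tool that normalizes before comparing sees one.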
If you only work on Mac OS X then this will never be a problem because
all the files you create and therefore all the files you add to your
Git repository will have their names in decomposed UTF-8. But when you
start cloning repositories containing files added on other systems,
systems which might use precomposed rather than decomposed UTF-8 then
you'll run into exactly this kind of problem. The git.git repo has one
such file itself (gitweb/test/Märchen, if I remember correctly, which
Git reports as untracked).
Now, Mac OS X's behaviour is not entirely "insane" as some would
claim; there is indeed a rationale behind it even if you don't agree
with it, but it *does* produce some unfortunate teething problems for
people wanting to use Mac OS X in a cross-platform environment.
Here are some Apple docs on the subject:
http://developer.apple.com/qa/qa2001/qa1173.html
http://developer.apple.com/qa/qa2001/qa1235.html
I personally wish that UTF-8 didn't allow different normalization
forms; then this kind of problem wouldn't arise. But it has arisen and
we have to live with it. Some workarounds have been proposed for Git,
but I haven't seen any convincing proposals yet.
Cheers,
Wincent
To be more exact encoding used to _create_ file differs from encoding ...which means that sequence of bytes differ. And Git by design is (both for filenames and for blob contents) encoding agnostic. HFS+ is just _stupid_. And unfortunately Git doesn't support stupid filesystems (e.g. case insensitive filesystems) well. -- Jakub Narebski Poland ShadeHawk on #git -
There are two different ways to do filesystem encodings. One is to have the fs simply not care about encoding, which is what the linux world seems to prefer. Sure, this is great in that what you create the file with is what you get back, but on the other hand, given an arbitrary non-ASCII file on disk, you have absolutely no idea what the encoding should be and you can't display it without making assumptions (yes you can use heuristics, but you're still making assumptions). Filesystems like HFS+ that standardize the encoding, on the other hand, make it such that you always know what the encoding of a file should be, so you can always display and use the filename intelligently. It also means it plays much nicer in a non-ASCII world, since you don't have to worry about different normalizations of a given string referring to different files (it's one thing to be case-sensitive, but claiming that "föo" and "föo" are different files just because one uses a composed character and the other doesn't is extremely user-unfriendly). On the other hand, what you create the file with may not be what you read back later, since the name has been standardized. It's hard to say one is better than the other, they're just different ways of doing it. However, I have noticed that everybody who's voiced an opinion on this list in favor of the encoding-agnostic approach seems to be unwilling to accept that any other approach might have validity, to the extent of calling an OS/filesystem that does things differently stupid or insane. This strikes me as extremely elitist and risks alienating what I expect to be a fast-growing group of users (i.e. OS X users). I'm willing to give Linus a free pass on calling other OS's stupid and insane, as I don't think Linux would exist as it does today without his strong opinions, but I don't think this should give carte blanche to th...
There is no technical reason for *kernel* to care about file name encoding. It is something that can be and should be dealt with in [...] And also because a user space program can deal with it much more [...] Wrong. If you have a policy that all file names are stored in UTF-8 encoding then there is no problem here. It should not be a kernel problem to care about encoding, besides you cannot fully solve it [...] Yeah, right... Like Microsoft likes to "standardize" everything, which in practice means forcing on others something fundamentally broken and that does not follow any existing standard precisely: === IMPORTANT: The terms used in this Q&A, decomposed and precomposed, roughly correspond to Unicode Normal Forms D and C, respectively. However, most volume formats do not follow the exact specification for these normal forms. === http://developer.apple.com/qa/qa2001/qa1173.html Not to mention that the use of decomposed Unicode as the standard is [...] Somehow I have no problem with displaying non-ASCII names on Linux. I can see both Unicode Normal Forms C and D encoded symbols without [...] As you typed them, they both are exactly the same, and both of them are in Normal Form C (which Mac calls precomposed). So why do you [...] I am sure everyone here is scared to death... I mean we are used to hearing such threats from some MS salespeople, but from a Mac guy? It is really scary.... Wake up, and stop shooting this nonsense at us. If you have technical reasons why your solution is better, let us know. So far, you do not sound very convincing here. Why do you think that the issue of encoding can not be dealt with in the user space? Why does Mac OS X use so-called decomposed Unicode, which does not even follow any standard precisely? Why does Mac OS X choose to decompose characters while it does not [...] I suppose it would be much better a subject for discussion... At least, it would be more likely to result in Git working [...] First, no one called Mac OS X insane, but case insensitive files...
By the way, calling HFS+ stupid, or rather calling at least two different normalizations of UTF-8 (two different encodings) used for writing and reading filenames stupid is wrong _for me_. I have quoted [...] For me it looks like a layering violation... but my knowledge about filesystems is close to nil. IMHO it is VFS and libc which should do the [...] But using one encoding to create a file, and another when reading filenames is strange. It is IMHO better to simply refuse creating filenames which are outside the chosen encoding / normalization. But having different encodings used for reading and writing on the level of filesystem [...] First, it is Git philosophy and very core of design to be encoding agnostic (to be "content tracker"). Second, using the same sequence of bytes on filesystem, in the index, and in 'tree' objects ensures good performance... this is something to think about if you want to add patches which would deal with HFS+ API/UI quirks. [cut] -- Jakub Narebski Poland -
It's not using different encodings, it's all Unicode. However, it accepts different normalization variants of Unicode, since it can read them all and it would be folly to require everybody to conform to its own special internal variant. But it does have to normalize them, otherwise how would it detect the same filename using different normalizations? Also, it may seem strange to have different names between reading and writing, but that's only if you think of the name as a sequence of bytes - when treated as a sequence of characters, you get the same result. In other words, you're used to filenames as [...] Sure, it makes sense from a performance perspective, but it causes problems with HFS+ and any other filesystem that behaves the same way. In the previous discussion about case-sensitivity, somebody suggested using a lookup table to map between git's internal representation and the name the filesystem returns, which seems like a decent idea and one that could be enabled with a config parameter to avoid penalizing repos on other filesystems. But I don't know enough about the internals of git to even think of trying to implement it myself. -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
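
The lookup-table idea mentioned in the message above could look roughly like the following Python sketch. Everything in it is hypothetical: the helper names `key`, `build_lookup` and `untracked`, the in-memory lists standing in for the index and for a directory listing, and the choice of NFC as the comparison key are illustrative assumptions, not a description of how Git works.

```python
import unicodedata

def key(name: str) -> str:
    """Comparison key: the name in one fixed normalization form (NFC assumed)."""
    return unicodedata.normalize("NFC", name)

def build_lookup(index_names):
    """Map normalized key -> name exactly as recorded in the (hypothetical) index."""
    return {key(name): name for name in index_names}

def untracked(index_names, fs_names):
    """Names returned by the filesystem that match no index entry under the key."""
    lookup = build_lookup(index_names)
    return [name for name in fs_names if key(name) not in lookup]

index_names = ["M\u00e4rchen"]             # precomposed, as committed elsewhere
fs_names = ["Ma\u0308rchen", "notes.txt"]  # decomposed, as a normalizing filesystem might return it

# Exact comparison: the tracked file is reported as untracked.
naive = [name for name in fs_names if name not in index_names]
print(naive)

# Comparison through the lookup table: only the new file is left.
print(untracked(index_names, fs_names))
```

The index keeps its original byte sequence; only the comparison goes through the table, which is why such a scheme could be switched on per repository without touching stored names.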
Hi, But that's the _point_! It _is_ Unicode, yet it uses _different_ encodings of the _same_ string. Now, this discussion gets really annoying. The real question is: will you do something about it, or reply with another 500-line email? Ciao, Dscho -
I wish I could do something about it. But right now I'm a full-time student trying to do contracting jobs on the side, and I don't believe I have the time to learn enough about the guts of git to try and make any changes to something as core as index filename handling. I just want people here to recognize that this is a valid problem instead of simply dismissing it as "HFS+ is insane, let's just ignore this issue". -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
That's a singularly *stupid* argument. Here, let me rephrase that same idiotic argument: "But it does have to uppercase them, otherwise how would it detect the same filename using different cases?" ..and if you don't see how that's *exactly* the same argument, you really are stupid. The fact is, normalization is wrong. It's wrong when you normalize upper/lower case (no, the word "Polish" is not the same as "polish"), and it's equally wrong when you normalize for No. HFS+ treats users as idiots and thinks that it should "fix" the filename for them. And it causes problems. It causes problems for exactly the same reasons case-independence causes problems, because it's EXACTLY THE SAME ISSUE. People may think that "but they are the same", but they aren't. Case matters. And so does "single character" vs "two character overlay". Does it always matter? Hell no. But the problem with a filesystem that thinks it knows better is that when it *sometimes* matters, the filesystem simply DOES THE WRONG THING. Can't you understand that? Linus -
Side note: there are ways to do it right. You can:

 - not do conversion at all (which is always right). Not corrupting the user data means that the user never gets something back that he didn't put in

   (And, btw, the "security" argument is total BS. The fact that two characters look the same does not mean that they should act the same, and it is *not* a security feature. Quite the reverse. Having programs that get different results back from what they actually wrote, *that* tends to be a security issue, because now you have a confused program, and I guarantee that there are more bugs in unexpected cases than in the expected ones)

 - Not accept data in formats that you don't like. This is also always right, but can be rather impolite.

 - Not accept data in formats that you don't like, and give people explicit conversion and comparison routines so that they can then make their own decisions and they are *aware* of the conversion (so that they don't come back to the problem of being confused)

So there are certainly many ways to handle things like this. The one thing you shouldn't do is to silently convert data behind the program's back, without even giving any way to disable it (and that disable has to be on a use-by-use basis, not some "disable/enable for all users of this filesystem", because you can - and do - have different programs that have different expectations).

And finally: all of the above is true at *all* levels. It doesn't matter one whit whether the automatic conversion is in the kernel or in a library. Doing it on a library level has advantages (namely the whole "disable/enable" thing tends to get *much* easier to do, and applications can decide to link against a particular version to get the behaviour *they* want, for example). So doing it inside the kernel is just about the worst possible case, exactly because it makes it really hard to do a "on a case-by-case" basis. Yes,...
You're right, it doesn't actually have to store the normalized form. And yes, it's possible to compare without normalizing them. Admittedly, I don't know much about the implementation details of unicode, but I would assume that the easiest way to compare two strings is to normalize them first. But in the case of the filesystem, normalization actually is important if you're thinking about filenames in terms of characters rather than bytes. When I feed the filesystem a given unicode string, it has to find the file I'm talking about - should it do a relatively expensive unicode-sensitive comparison of all the filenames with the one I gave it, or should it just normalize all names and do the much cheaper lookup that way? I don't know about you, but I'd prefer to let my filesystem normalize the name and run [...] There's a difference between "looks similar" as in "Polish" vs "polish", and actually is the same string as in "Ma<UMLAUT MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid semantic meaning, normalization doesn't. The only way to argue that normalization is wrong is by providing a good reason to preserve the exact byte sequence, and so far the only reason I've seen is to help git. Applications in general don't care one whit about the byte sequence of the filename, they care about the underlying file the name represents. Additionally, it would be a terrible experience for a user to enter "Märchen" and have the application say "sorry, I can't find this file" simply because the application used decomposed characters and the filename used composed characters. Unless the user is knowledgeable about the OS, filesystems, and unicode, they wouldn't [...] How do you figure? When I type "Märchen", I'm typing a string, not a byte sequence. I have no control over the normalization of the characters. Therefore, depending on what p...
That simply isn't true. Normalization actually has real semantic meaning. If it didn't, there would never ever be a reason why you'd use the non-normalized form in the first place. Others have argued the exact same thing for capitalization. "A" is the same letter as "a". Except there is a distinction. The same is true of "a<UMLAUT MODIFIER>" and "<a WITH UMLAUT>". Yes, it's the same "character" in either case. Except when there is a distinction. And there *are* cases where there are distinctions. Especially inside computers. For one thing, you may not be talking about "characters on screen", but you may be talking about "key sequences". And suddenly "a<UMLAUT MODIFIER>" is a two-key sequence, and "<a WITH UMLAUT>" is a single-key sequence, and THEY ARE DIFFERENT. See? "a" and "A" are the same letter. But sometimes case matters. Multi-character UTF-8 sequences may be the same character. But sometimes the sequence matters. Git doesn't care. Just use the *same* sequence everywhere. Make sure something doesn't change it. Because if something changes it, git will [...] Pure and utter garbage. What you are describing is an *input method* issue, not a filesystem issue. The fact that you think this has anything what-so-ever to do with filesystems, I cannot understand. Here's an example: I can type Märchen two different ways on my keyboard: I can press the 'ä' key (yes, I have one, I have a Swedish keyboard), or I could press the '¨' key and the 'a' key. See: I get 'ä' and 'ä' respectively. And as I send this email off, those characters never *ever* got written as filenames to any filesystem. But they *did* get written as part of text-files to the disk using "write()", yes. And according to your *insane* logic, that write() call should have converted them to the same representation, no? Hell no! That conversion has absolutely nothing to do with the filesystem. It's done at a totally different layer that actual...
The problem is that you don't control the sequence that everybody uses. See this example:

melo@speed(~)$ uname -a
Linux speed.simplicidade.org 2.6.9-55.ELsmp #1 SMP Wed May 2 14:28:44 EDT 2007 i686 i686 i386 GNU/Linux
melo@speed(~)$ set | grep LANG
LANG=en_US.UTF-8
melo@speed(~)$ mkdir t
melo@speed(~)$ cd t
melo@speed(~/t)$ git init
Initialized empty Git repository in .git/
melo@speed(~/t)$ touch á
melo@speed(~/t)$ git-add á
melo@speed(~/t)$ git-commit -m "added a in utf8"
Created initial commit 7a473a2: added a in utf8
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 "\303\241"
melo@speed(~/t)$ export LANG=en_US
melo@speed(~/t)$ touch á
melo@speed(~/t)$ ls -la
total 12
drwxrwxr-x 3 melo melo 4096 Jan 16 23:44 .
drwx--x--x 31 melo melo 4096 Jan 16 23:43 ..
-rw-rw-r-- 1 melo melo 0 Jan 16 23:44 á
-rw-rw-r-- 1 melo melo 0 Jan 16 23:43 á
drwxrwxr-x 8 melo melo 4096 Jan 16 23:43 .git
melo@speed(~/t)$ git-add á
melo@speed(~/t)$ git-commit -m "added a in iso-latin-1"
Created commit 4282fca: Oláx!
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 "\341"

So two (simulated in this test) users who use different LANG settings will be in trouble in no time.

What I take from this conversation is that I have to specify, for each project I work on, which encoding we should use, across all users, before they start using git with files with accented chars.

The difference I see between us is that if I tell my filesystem that I want to name my file with a particular string encoded in X, users using encoding Y will be able to read it correctly. I like my filesystem to make that work for me.

Best regards,
--
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!
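
The two `create mode` lines in the transcript above show the same character stored under different byte sequences. A small Python sketch reproduces them, assuming the first `touch` received the name encoded as UTF-8 and the second as ISO-8859-1 (as the commit messages in the transcript suggest). The helper `octal_escape` is made up for this illustration.

```python
name = "\u00e1"  # "á"

utf8_name = name.encode("utf-8")          # name as passed under a UTF-8 setup
latin1_name = name.encode("iso-8859-1")   # name as passed under a Latin-1 setup

def octal_escape(raw: bytes) -> str:
    """Render bytes as octal escapes, similar to the quoting in the transcript."""
    return "".join("\\%03o" % byte for byte in raw)

print(octal_escape(utf8_name))    # \303\241
print(octal_escape(latin1_name))  # \341
```

Both entries display as "á" to their respective users, yet a byte-oriented tool treats them as two unrelated files.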
The difference I see between us is that when I tell you that this is exactly the same thing as your file *contents*, you don't seem to get it. An OS that silently changes the contents of your files is *crap*. Get it? An OS that silently changes the contents of your directories is *crap*. Get it now? Linus -
This is the same issue as the CRLF issue I posted on earlier, and it all stems from that git also sees file names as a stream of bytes, not [...] A program that silently ignores the conventions of the platform it runs on is *crap*, no matter if the conventions are not the same as for [...] A program that silently ignores the conventions of the file system it tries to store its files on is *crap* :-) In my perfect world, file names would be stored as a string of characters, so if I save a file with an å in it, that å would be preserved no matter if I run Linux on ext2 with my locale set to latin-1 (which stores it as byte 0xE5), on Windows with NTFS (which stores it as the UTF-16 code 0x00E5), on Windows/DOS with FAT (which stores it as the byte 0x86) or on Mac OS X which stores it as decomposed UTF-8 (whose byte sequence I don't know off the top of my head). If that was just stored as U+00E5 in whatever encoding in the filename index, the local implementation of git can just check it out in the form needed. -- \\// Peter - http://www.softwolves.pp.se/ -
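
The byte values listed in the message above can be checked with Python's built-in codecs. This is a sketch under stated assumptions: `cp437` stands in for the DOS/FAT code page (the message does not name one), UTF-16 is shown little-endian, and NFD is used as an approximation of the decomposed form Mac OS X stores.

```python
import unicodedata

char = "\u00e5"  # "å"

encodings = {
    "latin-1": char.encode("latin-1"),
    "utf-16-le": char.encode("utf-16-le"),
    "cp437": char.encode("cp437"),
    "utf-8 (NFD)": unicodedata.normalize("NFD", char).encode("utf-8"),
}

# Print each encoded form as space-separated hex bytes.
for label, raw in encodings.items():
    print(f"{label:12} {raw.hex(' ')}")
```

Under these assumptions the decomposed UTF-8 form is the letter "a" followed by the two bytes of U+030A (combining ring above).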
You have to be careful about CRLF conversion, lest you corrupt your [...] Git philosophy to see the contents of files and "contents" of directories (filenames) as stream of bytes, i.e. to use 'native' encoding works perfectly well and _fast_ if all developers work in the same environment. Troubles start if you are working across operating systems, and across [...] Git has had i18n.commitEncoding for a long time, and for some time it has saved it in the 'encoding' header in the commit object (if different from 'utf-8'), and it also has i18n.logOutputEncoding. For dealing with different filesystem encodings you would also have to have both: encoding used in 'tree' objects (by repository) for filenames saved somewhere in repository, either in tree object (argh!) or in some kind of .gitconfig file; encoding used by filesystem in repository config as i18n.filesystemEncoding or something like that. And think what to put in the on disk index, and in memory index. NOTE, NOTE, NOTE! If a filename is used somewhere in the file contents (manifest-like file, include-like statement), and this filename uses characters which are differently encoded in different encodings, you are screwed with this fancy system, badly, anyway. -- Jakub Narebski Poland -
Hi, I get that you think it's the same thing. What I don't get is why a user should be forced to know what type of encoding he and the other users are using on all the layers going down to the filesystem. If two users on different systems or in different configurations, choose the same unicode string as the name, why do we need to make it harder for things to just work out? The content of the file is sacred, we both agree on that. We disagree on the filename, because for me it's more important that equal strings, even if encoded to different byte sequences, should be [...] I was not talking about content of files, those are sacred. I was talking about filenames. Those *for me* are not, but are for you. No problem, we just have different values: I want my computer to work for me, not me working for the computer. I'm willing to accept a file system or other layer that normalizes encoding of filenames if that makes the end-user life easier, especially in a tool distributed by [...] As I said before, we disagree on file meta-data, not on file contents. For you, byte in must be the same byte out. For me string in must be the same string out. And as I said in the previous email, what I learned today is that in a distributed project using git, and if you need to use accented characters, I need to tell all the users to use the same LANG settings. It's important information, at least for me. Best regards, -- Pedro Melo Blog: http://www.simplicidade.org/notes/ XMPP ID: melo@simplicidade.org Use XMPP! -
Hi, Why should the filename be _stored_ normalised? I agree on the lookup, yes, but not the storage. Hth, Dscho -
If you do the normalization in the right place, things will just work [...] Well, as the issue shows it does not make life for the end-user easier. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
Hello, No problem, but don't you think that git should do it? Don't you think it's important in a distributed tool that no matter what system they use, be it linux or solaris, they are able to talk about a file with non-ascii chars and be the same file to both of them? That's the point I'm making. The fact that I need to set LANG across [...] I'm assuming you are talking about HFS+ and the strange normalization it does. I'm sorry but that was not the problem I sent. I sent a scenario, in which two users, using the same linux system but with different LANG settings cannot use git reliably. Although this thread started because of HFS+ "choices", the problem is not really related to HFS+ given that you can have the same issues even on the same physical <insert flavor here> POSIX system. Best regards, -- Pedro Melo Blog: http://www.simplicidade.org/notes/ XMPP ID: melo@simplicidade.org Use XMPP! -
I don't think I'd call that "insane" (in fact, I think these discussions would be much less irritating for all involved if we didn't use that word so often, even when it's not called for). It's not that different than the whole LF/CRLF line-ending thing. The real problem is that setting LANG won't help you on Mac OS X; set LANG to whatever you want and there is *nothing* that you can do to stop your filenames being normalized into decomposed UTF-8, short of dropping HFS+. You can use an alternative filesystem, but support for basically everything except HFS+ is suboptimal in Mac OS X at the moment. Cheers, Wincent -
Hi, On Thu, 17 Jan 2008, Wincent Colaiuta wrote: > El 17/1/2008, a las 1:40, Pedro Melo escribi
One of the advantages (the biggest one, in fact, apart from the obvious US-ASCII down-compatibility and the fact that you can do C-compatible NUL-terminated strings) of UTF-8 is that it's locale-independent, and doesn't care about LANG, because it's valid in all languages. And that's really important. It's important for a very simple reason: there is almost never such a thing as "a locale" except for US-ASCII. Once you move away from US-ASCII, it actually tends to be much more common that you have a *mixture* of locales - often in the same "document" - than to have one single locale. It very much happens even in filenames - people "mix" locales in trivial ways even within a single pathname component (non-US-ASCII filename, but with a regular file extension), but much more interestingly they do so within a directory tree (ie you have translation subdirectories where the filenames themselves are in another language, and you can have full pathnames where different components are in different languages, for example). And UTF-8 is _wonderful_ for this, because LANG doesn't matter, and cannot matter, and thus mixing isn't a problem. Of course, you can screw it up. Locales still can change things like sort order and capitalization etc, so even if you use UTF-8, you sure can get into trouble with LANG and thinking that a per-session locale makes sense. So choosing UTF-8 for the filesystem isn't wrong per se. It's a fine choice, and has no issues with LANG in itself. Limiting it to strictly valid UTF-8 encodings is also fine. Limiting it (further) to only character normalized UTF-8 is also fine. Most Linux filesystems don't limit it in any way, so you can make filenames that aren't valid UTF-8 at all, much less normalizing multi-character sequences. I personally think that's the best option, but I probably do so mostly because I know some people still use Latin1 as their only locale (and I suspect Asia will take decades before it has converted t...
Alright, you've made your point, and I'm willing to concede at least some of what you've said. So perhaps we can now move onto the more relevant and practical issue of: HFS+, despite how stupid it may or may not be, normalizes filenames (and is case-insensitive, which is a related issue). This causes a problem with git. How can this be solved? I'm more than willing to do work to solve it, my biggest issue is I don't believe I actually have the free time to learn the git internals well enough to actually do proper work on what I would assume is a fairly performance-critical section of git's code. However, I would be happy to work with others who are perhaps more knowledgeable in this area. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
My understanding is that normalization is there to help the computer. That doesn't give it any semantic meaning, because all normal forms of [...] The argument for case insensitivity is different than the argument for normalization. I certainly hope you understand why they are different [...] You're right, sometimes the sequence matters. As in key sequences. But we're not talking about key sequences, we're talking about strings. And how am I supposed to use the same sequence everywhere? When I type "Märchen", I don't know which form I'm typing, nor should I. It's not something that I, as a user, should have to know. Especially if I pass this name through various other utilities before using it - I have no idea if another utility is going to end up normalizing the name, and [...] On a US keyboard I only have one way of typing ä, and I have no idea whether it ends up precomposed or decomposed in the resulting byte stream. And I don't care. Because I'm typing characters, not bytes. I could be typing in a file in ISO-Latin-1 and I still wouldn't care, because it looks the same to me. If my filesystem did make a distinction between the normal forms, and I see that I have a file named "Märchen", how am I supposed to type that at my keyboard? I don't know which normal form it's using. The fact that you think the normalization of the string matters, I [...] What a fabulous straw man argument you just put together. I hope you [...] I'm speaking as a user, and as such, I shouldn't even have to know that it's possible to write the same character in multiple different ways. As a user, HFS+ behaves exactly the way I want it to. You were talking earlier about not messing with the "user data", but what is the "user data"? It's the string, not the byte sequence. That's all I care about - the string. That's all the OS cares about, that's all any application I use cares a...
The thing is, you seem to argue that what OS X does helps you as the user. But you are arguing based on incorrect assumptions. First off, we've had years and years and years of usage of non-corrupting filesystems (pretty much every UNIX OS around since day 1, and many other OS's too), and it's simply not true that it's a problem. You see the filename in the file dialog, and you open it, and you're done. OS X isn't any "easier" in this regard. In fact, this whole thread comes from the fact that the OS X choice that you *think* is easier, is in fact not easier at all. It's not easier for the user, it's not easier for the application programmer, and the really sad part is that it's very much *not* easier for OS X itself either (ie they had to literally write extra code with nasty tables to do it, and it really does hurt them in performance and complexity). And _that_ is why the OS X situation is so sad. Apple literally added extra code to make things slower and more complex *and* harder to use reliably. Does it show up in normal behaviour? Of course not. You'd probably never see it in real life outside of test-suites. People simply don't even tend to use filenames outside of US-ASCII, and when they do use them, input methods really *do* tend to do the normalization for you. But when it comes to automation (which is what computers are all about), the OS X choice is literally the wrong one. And there's no _upside_. It's all downside. Which is why it's so stupid. I bet it only exists because OS X engineers didn't really even think about it, and they just assumed that "normalization is helpful". They took your stance - thinking it was worth it, without ever really thinking it through. Linus -
On Jan 16, 2008, at 8:16 PM, Linus Torvalds <torvalds@linux-foundation.org I believe it exists because HFS+ was created at a time when the Mac was moving from a multi-encoding world (which was a nightmare) to a Unicode world and they wanted to remove ambiguity in filenames. But I wasn't around when they made this decision so this is just a guess. -Kevin Ballard -
I do agree. And I think starting out case-insensitive (something they must really hate by now) also made it less of an issue. When you're case-insensitive, the issues with any UTF-8 normalization are simply swamped by all the issues of case, so you probably don't even think about it very much. The big problem with any name rewriting is that I can open file 'xyz', and I literally have a very hard time knowing whether that file I know I opened and created has anything to do with the file 'Xyz' that I see when I do a readdir(). Are they the same? Maybe. But it's literally hard to tell on OS X. I can do an fstat() on my file descriptor and on the directory entry, and if they get the same d_ino they *probably* are the same entry, but even then it actually could have been a hardlink (and my 'xyz' is really *another* name for it entirely, and the filesystem is actually case-sensitive and 'Xyz' was a *different* name that somebody else did!). See? If you're creating a content tracker, these kinds of issues are not "idle chatter". It's really *really* important. Was that file the one I was told to track? Or was it a temporary file that was just hardlinked? This is why case-insensitivity is so hard: you have a very real "aliasing" on the filesystem level, where all those really *different* pathnames end up being the same thing. And all the same issues show up with utf-8 rewriting, so if you normalize utf-8 names, you actually end up having almost all the same problems that a case-insensitive filesystem has. They're just much rarer in practice, so you just won't hit them as often - but when you do, they are equally painful! (In fact, they can be a whole lot *more* painful, because now they are really rare, and really confusing when they happen!) But if you come from a case-insensitive background, all the UTF-8 rewriting really looks like such a small problem compared to all the horrid problems that you had with different locales and cases, so I suspe...
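[ Editorial illustration, not part of the original message. The inode comparison described above can be sketched roughly as follows, assuming a POSIX filesystem that supports hard links. The helper name matching_entries and the file names are made up for the example. It shows why a matching inode only says "probably the same entry": a hard link gives a second, unrelated name with the same inode. ]

```python
import os
import tempfile

def matching_entries(fd, directory):
    """Return the names in `directory` whose (device, inode) match the open fd."""
    info = os.fstat(fd)
    matches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            st = entry.stat(follow_symlinks=False)
            if (st.st_dev, st.st_ino) == (info.st_dev, info.st_ino):
                matches.append(entry.name)
    return sorted(matches)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "xyz")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    # A hard link: a second directory entry pointing at the same inode.
    os.link(path, os.path.join(tmp, "other-name"))
    names = matching_entries(fd, tmp)
    os.close(fd)

print(names)
```

Both names come back, so the inode alone cannot tell which directory entry corresponds to the name that was actually opened.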
I hope you're right (about them hating it), but we'll see. They've just opened the source for the ZFS port they're working on. By the time it goes final and becomes the default FS, replacing HFS+, probably within a couple of years, we'll see if they make the same two design decisions which cause the kinds of problems being discussed here (case-insensitivity, and ubiquitous FS-level UTF-8 normalization). I've done a dumb search in the ZFS source code for "CASE" and see that it can in theory support case-insensitivity as an optional feature. The potential is there for Apple to use this. I personally hope that they don't, because as has already been pointed out, these little tricks tend to make life more difficult for users rather than helping them (the day I have two files in the same directory called "Märchen" and want to specify one of them on the command line I'll worry about that when I come to it). http://fuzzy.wordpress.com/2007/06/09/zfsandfilesystemoptions/ Cheers, Wincent -
Side note: the thing is, the reason people shouldn't worry about it is that this is a *trivial* thing to handle. You really don't even need to know what you're doing. And you can test it today, easily.

Having two (differently encoded) files like that is really no different from the traditional UNIX FAQ of "how do I remove a file starting with '-'" or even more closely "how do I remove a file that has a character in it that I cannot get at the keyboard".

In other words, on a bog-standard UNIX (and yes, in this case, I bet OS X works fine too for this test), just try this

    filename1=$(echo -e "hello\002there")
    filename2=$(echo -e "hello\003there")
    echo Odd file > "$filename1"
    echo Another odd file > "$filename2"

and now you have a filename that is actually rather hard to type on the command line. In fact, for me they even *look* the same:

    [torvalds@woody ~]$ ll hello*
    -rw-rw-r-- 1 torvalds torvalds  9 2008-01-17 08:23 hello?there
    -rw-rw-r-- 1 torvalds torvalds 17 2008-01-17 08:23 hello?there

See? Even in my graphical browser, those two filenames look 100% *identical*. I could give you a screen-shot, but I'm lazy. Just take my word for it, or just fire up konqueror on Linux (but it may well depend on the particular font you're using).

[ And yes, for other browsers, you might have something that shows them as different characters - depending on the font, it might show up as a small box with [00 02] vs [00 03] in it, for example. But that's also actually 100% true of the two different encodings of 'ä' - you could easily have a file browser that shows the multi-character as a multi-character, exactly to distinguish them and show that one of them isn't "normalized"! The point is, once the filesystem doesn't corrupt the data, it's always easy to get at, and there is never any ambiguity. ]

How is this different from "Märchen" spelled with two different encodings for that "ä"? I'll tell you: it's not at all differen...
With the exception of Unicode. If you check the standard, two Unicode codepoints (i.e. the numeric value that gets stored on disk) *can* map to the same character, hence they are the same. They don't just look the same, they are the same character -- even if the codepoints are different (i.e. precomposed vs. decomposed characters). In fact, part of the Unicode standard deals with that. (Technically, Unicode calls it equivalence, but what the hey). In other words, Unicode treats e.g. both U+0065 and U+00E9 as fundamentally the same character. This comes even more into play in such alphabets as Hangul (Korean) and the Japanese Kana. -- JM Ibanez Software Architect Orange & Bronze Software Labs, Ltd. Co. jm@orangeandbronze.com http://software.orangeandbronze.com/ -
Hi, As Linus _already_ pointed out, you are confusing characters with glyphs. Hth, Dscho -
Someone is. He is referring to the Unicode definition of an (abstract) character. Ch3.4 D11 - "A single abstract character may also be represented by a sequence of code points—for example, latin capital letter g with acute may be represented by the sequence <U+0047 latin capital letter g, U+0301 combining acute accent>, rather than being mapped to a single code point." -- robin -
But if you want to make it clear, you can use "encoded character" or yes, "code point". But the thing is, even the unicode standard tends to just say "character", and a unicode string (for example) is defined to be a sequence of "code units" which in turn is about those *encoded* characters, which is all about the code points. So you'll find that they are very careful in some technical definition parts to talk about "code points", but then in other sequences they talk about "character" even though they are referring to the actual code point (ie the figure literally has the unicode number in it!) In fact, they sometimes even talk about "characters" in the totally non-encoding meaning of "glyph". So yes, "character" is often ambiguous. It would be good to never use the word at all, and only talk about "code point" and "glyph" and one of the well-defined special terms like "combining character" or "replacement character". But to take a representative example from The Unicode Standard, Chapter 2: "Unicode Design Principles": Characters are represented by code points that reside only in a memory representation, as strings in memory, on disk, or in data transmission. The Unicode Standard deals only with character codes. (any speling mistakes mine). In other words, from the very beginning of the standard, very basic design principles chapter, it starts talking about characters being represented by code points and explicitly says that it really only deals with CHARACTER CODES. Yes, I'm sure you can argue ad infinitum that all the "equivalences" and other crap means that a "character" can sometimes mean just about anything, but I'd say that it's pretty damn reasonable to equate "unicode character" with "code point" or "character code". Linus -
So they are not the same after all? It is just you don't care about what it actually says, right? How about this: Unicode provides a unique number for every character. So, if numbers are not the same then by definition of the Unicode standard There is no notion of "fundamentally the same character" in the Unicode standard as far as I know, and the characters you mentioned are very different in Unicode: http://www.fileformat.info/info/unicode/char/0065/index.htm http://www.fileformat.info/info/unicode/char/00e9/index.htm They have different names, they have different glyphs, and they are functionally different. Dmitry -
Sorry, but you're using different characters that look the same. But Kevin's point was that it's a different thing if you use two characters that look the same or the same character with different encodings. This makes this HFS-specific problem different from the "look the same"- or the "case-insensitivity"-issues. BTW: I also read about your argument that you wouldn't convert file data to normalized UTF-8 (I agree with you that this would be nonsense) and therefore filenames shouldn't be converted either. This is something where I have to disagree because a filename (like ctime, mtime, atime, ...) is metadata (while file contents aren't) and - until now - I would've guessed that you agree on this point because git doesn't care about filenames but contents. IMHO the best solution would be for git to store all string metadata in UTF-8 and convert it to the target system's file system encoding. That would fix all those problems with different locales and file system encodings ... However, I have to agree that the enforced character set conversion causes more problems than it solves. Regards, Mark -
But that's exactly the case he gave - 'ä' vs 'a¨' are exactly that: different strings (not even characters: the second is actually a multi-character) that just look the same. You try to twist the argument by just claiming that they are the same "character". They aren't, unless you *define* character to be the same as "glyph". Of course, if you claim that, then you can always support your argument, but I claim that is a bogus and incorrect axiom to start with! Too many people confuse "character" and "glyph". They are different. See, for example http://en.wikipedia.org/wiki/Unicode and notice the *many* places where they try to make that distinction between "character" and "glyph" clear (and also "code values", which are the actual bytes that encode a character). See also http://en.wikipedia.org/wiki/Unicode_normalization and realize that a Unicode sequence is a sequence of *characters* even if it is not normalized! Those things are still characters, when they are the "simpler" non-combined characters. You are trying to make a totally BOGUS argument, and you base it on the INCORRECT basis that the TWO characters 'a'+'¨' somehow aren't independent characters. They *are*. They are *different* characters from 'ä', even though they may be "Canonically equivalent" as a sequence. The fact is that "equivalent" does not mean "same". Why cannot people accept that? Linus -
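[ Editorial illustration, not part of the original message. The distinction between "same" and "canonically equivalent" can be checked with Python's standard unicodedata module. This is only a sketch of the comparison being argued about; the variable names are made up for the example. ]

```python
import unicodedata

precomposed = "M\u00e4rchen"   # 'ä' as the single code point U+00E4
decomposed = "Ma\u0308rchen"   # 'a' followed by U+0308 COMBINING DIAERESIS

# Compared as sequences of code points (or as UTF-8 bytes) they differ.
same_string = precomposed == decomposed
same_bytes = precomposed.encode("utf-8") == decomposed.encode("utf-8")

# Only after normalizing both sides do they compare equal.
equivalent = (unicodedata.normalize("NFC", precomposed)
              == unicodedata.normalize("NFC", decomposed))

print(len(precomposed), len(decomposed))
print(same_string, same_bytes, equivalent)
```

The two strings have different lengths and different bytes, yet they normalize to the same form: equivalent, but not the same.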
Hi, I'll shut up now if you can answer me one question, because it really is a problem for my team. We have people using Windows, people using Macs, and people using several flavors of Linux desktops. They all have different settings and if I add a file like áéióú that happens to be UTF-8 encoded, it will reach an iso-latin-1 user as visual garbage. git will track the file perfectly, we know that, because the sequence of bytes that my system used to create the file will be the same on all "sane" systems, but the file will look "funny" to some users, and we get complaints from some of the less enlightened ones. The answer is that users should not create filenames with non-ascii characters if they want a consistent experience, right? This is just so that I can write a best practices document for them... Best regards, -- Pedro Melo Blog: http://www.simplicidade.org/notes/ XMPP ID: melo@simplicidade.org Use XMPP! -
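[ Editorial illustration, not part of the original message. The "visual garbage" effect described above can be reproduced by decoding UTF-8 bytes as ISO-8859-1. This is a sketch assuming the name was created with precomposed characters; the variable names are made up for the example. ]

```python
name = "\u00e1\u00e9i\u00f3\u00fa"     # "áéióú" with precomposed accents

utf8_bytes = name.encode("utf-8")       # what a UTF-8 system writes to disk
as_seen_by_latin1 = utf8_bytes.decode("latin-1")  # what a Latin-1 locale displays

print(len(name), len(utf8_bytes))
print(as_seen_by_latin1)

# The bytes themselves are untouched: re-encoding the garbage gives them back.
round_trip = as_seen_by_latin1.encode("latin-1") == utf8_bytes
```

The bytes are identical on every system, which is why tracking works; only the interpretation on display differs.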
I can't really suggest anything else than trying to make everybody use UTF-8. [ Not just for filenames, by the way - this is one of the reasons I think it is so *important* to not corrupt filenames, exactly because this is in no way filename-specific at all, and filenames are generally "textual data" exactly the same way a text-file is. But only totally insane people think that you should force-normalize text-files, even though all the issues are obviously all the same regardless of whether it's a filename or a word in textfile. ] And yes, I also realize that it's not going to be realistic. We're probably *closer* to that than we used to be, but I don't think you can even make Windows think FAT is UTF-8. I don't know how NTFS works (I know it is Unicode-aware, and I think it encodes filenames in UCS-2 or possibly UTF-16, but there is an obvious 1:1 translation to UTF-8, and since we use C strings, I'd assume/hope Windows actually uses that unambiguous translation for any filenames). Under modern Linux and OS X, UTF-8 is basically the only way (older Linux distros may be set up for Latin1, but at least the newer ones seem to all Oh, absolutely. That takes care of 99.9% of all source projects. Even then you can have problems with case insensitivity (the Linux kernel sources are all US-ASCII filenames, for example, but *literally* has many files that are identical if you ignore case, and that's not unheard of). So yes, to a first approximation, the answer is to simply avoid using anything but US-ASCII. It's seldom a big limitation when talking about filenames. Linus -
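[ Editorial illustration, not part of the original message. A best-practices check along the lines discussed above could be sketched like this. The function audit_names and the sample paths are hypothetical, and str.casefold() is only an approximation of how any particular case-insensitive filesystem compares names. ]

```python
from collections import defaultdict

def audit_names(paths):
    """Report names outside US-ASCII and groups of names that differ only by case."""
    non_ascii = [p for p in paths if not p.isascii()]
    by_folded = defaultdict(list)
    for p in paths:
        by_folded[p.casefold()].append(p)
    collisions = [group for group in by_folded.values() if len(group) > 1]
    return non_ascii, collisions

# Hypothetical file list for the example.
paths = ["Makefile", "makefile", "README", "L\u00fcftung.txt"]
non_ascii, collisions = audit_names(paths)

print(non_ascii)
print(collisions)
```

Such a check could run over a list of tracked paths before a commit; how it would be wired into a real workflow is left open here.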
But they are not different strings, they are canonically equivalent as far as Unicode is concerned. They're even supposed to map to the same glyph (if the font has an "ä", it should display it in both cases, if it has an "a" and a combining diaeresis, it should make up one). You cannot do a binary comparison of text to see if two strings are Whereas you are confusing characters and code points. "ä" and "a¨" use different code points, but they encode the same character, and from the user's perspective it is the *character* that Actually, NTFS is a bit broken. It sees file names as a string of 16-bit words. It doesn't check that it is valid UTF-16, or even valid UCS-2, it allows almost anything. Apple made Mac OS X handle filenames properly, by seeing that file names are a string of characters, not code points, so they use a canonical form for all characters (personally, I would have preferred the pre-composed form, though). -- \\// Peter - http://www.softwolves.pp.se/ -
Fuck me with a spoon.
Why the hell cannot people see that "equivalent" and "same" are two
.. and this is relevant how? They are different strings. Not the same.
Equivalence doesn't matter. Equivalence is *evil*. Equivalence is what
gives us case-insensitive filesystems ("because the names are
equivalent").
Filesystems don't *want* equivalence. They want a much stronger exactness
guarantee. Exactly because sometimes the differences matter.
Linus -
It is relevant because the Mac OS file system stores file names as a sequence of Unicode code points, in an (apparently slightly modified) normalized form, whereas Git prefers to see file systems that store file names as a sequence of octets, which may, or may not, actually map to something that the user would call characters. I happen to prefer the text-as-string-of-characters (or code points, since you use the other meaning of characters in your posts), since I come from the text world, having worked a lot on Unicode text processing. You apparently prefer the text-as-sequence-of-octets, which I tend to dislike because I would have thought computer engineers would have evolved beyond this when we left the 1900s. But the real issue is that Git cannot use its filenames as strings of octets on Mac OS X, since the file system doesn't handle it. So Git needs to do something sensible. That's part of porting. Preferably that would involve supporting real Unicode file names, which would also work on Windows (through its UTF-16 file APIs), and in part on other systems (through conversion to the systems' locale encoding). -- \\// Peter - http://www.softwolves.pp.se/ -
No. The *only* issue is that git doesn't normalize. You can think of git as a UTF-8 namespace all you want, and it will work together wonderfully with OS X. Some of us just know what we're doing, and have been working with UTF-8 for a long time. It's not about sequence-of-octets, it's about not corrupting the data. You think data should be changed behind people's backs, potentially causing corruption due to unintended conversions. And I don't. You can call me "left behind in the 1900s", but that's apparently because you don't understand the issues. Data corruption wasn't something that magically became ok just because we switched into a new century. Linus -
I agree. Every single problem that I can recall Linus bringing up as a consequence of HFS+ treating filenames as strings is in fact only a problem if you then think of the filename as octets at some point. If you stick with UTF-8 equivalence comparison the entire time, then everything just works. Granted, this is a problem when you have to operate on a filesystem that thinks of filenames as octets, but as I said before, this doesn't mean the HFS+ approach is wrong, it just means it's incompatible with Linus's approach. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
At *some* point everything stored in computers is a sequence of octets. In fact, the whole point of the Unicode standard is to define characters and how to map each character to a unique number (code points) and then There is more than one equivalence comparison. The Unicode standard defines at least two, and for some other purposes you may want to use others, but for some reason you present working with text as if it meant following only one type of equivalence the entire time... Dmitry -
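[ Editorial illustration, not part of the original message. The two equivalences the Unicode standard defines, canonical and compatibility, give different answers for the same pair of strings. A sketch using Python's unicodedata module; the helper names are made up for the example. ]

```python
import unicodedata

def canonically_equivalent(a, b):
    """Equal after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def compatibility_equivalent(a, b):
    """Equal after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)

ligature = "\ufb01le"   # U+FB01 LATIN SMALL LIGATURE FI followed by "le"
plain = "file"

print(canonically_equivalent("\u00e4", "a\u0308"))    # precomposed vs. decomposed 'ä'
print(canonically_equivalent(ligature, plain))
print(compatibility_equivalent(ligature, plain))
```

The ligature form is only compatibility-equivalent to the plain spelling, so "the" equivalence comparison depends on which one is chosen.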
You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON. The fact is, text-as-string-of-codepoints (let's make the "codepoints" obvious, so that there is no ambiguity, but I'd also like to make it clear that a codepoint *is* how a Unicode character is defined, and a Unicode "string" is actually *defined* to be a sequence of codepoints, and totally independent of normalization!) is fine. That was never the issue at all. Unicode codepoints are wonderful. Now, git _also_ heavily depends on the actual encoding of those codepoints, since we create hashes etc, so in fact, as far as git is concerned, names have to be in some particular encoding to be hashed, and UTF-8 is the only sane encoding for Unicode. People can blather about UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is simply technically superior in so many ways that I don't even understand why anybody ever uses anything else. So I would not disagree with using UTF-8 at all. But that is *entirely* a separate issue from "normalization". Kevin, you seem to think that normalization is somehow forced on you by the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE. Normalization is a totally separate decision, and it's a STUPID one, because it breaks so many of the _nice_ properties of using UTF-8. And THAT is where we differ. It has nothing to do with "octets". It has nothing to do with not liking Unicode. It has nothing to do with "strings". In short: - normalization is by no means required or even a good feature. It's something you do when you want to know if two strings are equivalent, but that doesn't actually mean that you should keep the strings normalized all the time! - normalization has *nothing* to do with "treating text as octets". That's entirely an encoding issue. - of *course* git has to treat things as a binary stream at some point, since you need that to even compute a SHA1 in the first place, but that ...
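[ Editorial illustration, not part of the original message, and not a description of git's actual code. The point that hashing depends on the exact encoding, while normalization is something done only at comparison time, can be sketched as follows. The helpers name_hash and find_equivalent are hypothetical. ]

```python
import hashlib
import unicodedata

def name_hash(name):
    """SHA-1 over the UTF-8 bytes of a name, taken exactly as given."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()

def find_equivalent(name, stored_names):
    """Normalize both sides for the comparison only; stored names stay untouched."""
    key = unicodedata.normalize("NFC", name)
    return [s for s in stored_names if unicodedata.normalize("NFC", s) == key]

stored = ["M\u00e4rchen"]      # precomposed, as originally recorded
query = "Ma\u0308rchen"        # decomposed, e.g. as handed back by a normalizing filesystem

hashes_differ = name_hash(stored[0]) != name_hash(query)
matches = find_equivalent(query, stored)

print(hashes_differ, matches == stored)
```

The stored name keeps its original bytes and therefore its hash; the equivalence check is layered on top rather than rewriting the data.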
A code point is a unique numerical value assigned to every Unicode character. Also, every Unicode character has a unique name assigned to it. There are some other non-unique properties that every Unicode character has. So, to say that a Unicode character is just a code point is not exactly correct, because the code point is one of the properties of a Unicode character. But, yes, any Unicode character can be identified by its code point. So, it is a one-to-one relation. Dmitry -
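[ Editorial illustration, not part of the original message. The one-to-one relation between a character's code point and its name can be inspected with Python's unicodedata module. A small sketch; the variable rows is made up for the example. ]

```python
import unicodedata

# Decomposed 'ä' (two code points) followed by precomposed 'ä' (one code point).
text = "a\u0308\u00e4"

rows = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]
for code_point, name in rows:
    print(code_point, name)

# The name identifies the character just as the code point does.
looked_up = unicodedata.lookup("COMBINING DIAERESIS")
```

Each of the three code points has its own number and its own name, including the combining mark.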
