Re: git on MacOSX and files with decomposed utf-8 file names

To: <git@...>
Date: Wednesday, January 16, 2008 - 11:17 am

Hi,

I have some files like "Lüftung.txt" in my repository. The strange thing
is that I can pull / add / commit / push those files without problem but
git-status always complains that those files are untracked (but not
missing). My assumption is that it's a problem with the way MacOSX
stores the file names (decomposed UTF-8). So something like
"Lüftung.txt" (precomposed "ü") becomes "Lüftung.txt" (decomposed: "u"
followed by a combining diaeresis).

It seems that git-status does two things:
1. Find files under version control (i.e. search for missing files)
2. Find files not under version control (i.e. search for untracked files)

I guess that the first look-up succeeds because MacOS X converts 
composed UTF-8 to decomposed UTF-8 when searching for a file. But it 
seems that the second look-up takes the file names as-is (decomposed) 
without converting them to composed UTF-8.
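
The two look-ups described above can be simulated without a Mac. This is only a minimal sketch in Python, assuming the index holds the precomposed name and the directory listing hands back a decomposed one. The names `index_names`, `listed_names`, `untracked` and `untracked_normalized` are made up for illustration and are not git internals, and plain NFD is used as a stand-in for whatever HFS+ really does:

```python
import unicodedata

# Name as it was committed on another system (precomposed, NFC).
index_names = {"L\u00fcftung.txt"}

# Name as a decomposing filesystem would list it (approximated with NFD).
listed_names = [unicodedata.normalize("NFD", "L\u00fcftung.txt")]


def untracked(listed, index):
    """Return listed names with no byte-identical entry in the index."""
    index_bytes = {name.encode("utf-8") for name in index}
    return [name for name in listed if name.encode("utf-8") not in index_bytes]


def untracked_normalized(listed, index):
    """Same check, but comparing NFC-normalized forms instead of raw bytes."""
    index_nfc = {unicodedata.normalize("NFC", name) for name in index}
    return [name for name in listed
            if unicodedata.normalize("NFC", name) not in index_nfc]


print(len(untracked(listed_names, index_names)))             # 1
print(len(untracked_normalized(listed_names, index_names)))  # 0
```

The byte-wise comparison reports the file as untracked even though it looks identical on screen; the normalized comparison does not.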

Is there an easy way to fix this behaviour? It's really annoying to see 
all those "untracked" files that are already under version control when 
executing a git-status.

Regards,
Mark
-
To: Mark Junker <mjscod@...>
Cc: <git@...>
Date: Wednesday, January 16, 2008 - 11:34 am

Hi,

On Wed, 16 Jan 2008, Mark Junker wrote:

> I have some files like "L
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 11:43 am

More like, Mac OS X has standardized on Unicode and the rest of the
world hasn't caught up yet. Git is the only tool I've ever heard of
that has a problem with OS X using Unicode.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
To: Kevin Ballard <kevin@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Thursday, January 17, 2008 - 3:29 am

Apple's decision[*] to use _decomposed_ unicode causes all sorts of
little problems because other tools aren't expecting to see strings
changed behind their backs.

I know little about the gritty details, but I see the bug reports...

-Miles

-- 
Any man who is a triangle, has thee right, when in Cartesian Space, to
have angles, which when summed, come to know more, nor no less, than
nine score degrees, should he so wish.  [TEMPLE OV THEE LEMUR]
.
-
To: Kevin Ballard <kevin@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 7:03 pm

As far as I know, Subversion has basically exactly the same problem,
and any time you consume/produce files on Mac OS X that are to be
consumed/produced on other platforms you will run into this kind of
issue, with any software.

Tell Mac OS X to write a file with "ó" in the file name ("\xc3\xb3" in  
UTF-8), and it will "normalize" it prior to writing by converting it  
into a decomposed form (that is, ASCII "o" followed by "\xcc\x81", or  
"combining acute accent"). So they're both valid Unicode, both valid  
UTF-8, and they encode exactly the same characters but the byte stream  
is different.
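
The byte sequences quoted above can be checked with Python's standard `unicodedata` module. This is just an illustrative sketch; it uses plain NFD, which may differ in corner cases from the decomposition Mac OS X actually applies:

```python
import unicodedata

composed = "\u00f3"                                  # "ó" as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # "o" + U+0301 combining acute

print(composed.encode("utf-8"))     # b'\xc3\xb3'
print(decomposed.encode("utf-8"))   # b'o\xcc\x81'

# Same characters for a reader, different strings for a byte-wise comparison.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```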

If you only work on Mac OS X then this will never be a problem because  
all the files you create and therefore all the files you add to your  
Git repository will have their names in decomposed UTF-8. But when you  
start cloning repositories containing files added on other systems,  
systems which might use precomposed rather than decomposed UTF-8 then  
you'll run into exactly this kind of problem. The git.git repo has one  
such file itself (gitweb/test/Märchen, if I remember correctly, which  
Git reports as untracked).

Now, Mac OS X's behaviour is not entirely "insane" as some would  
claim; there is indeed a rationale behind it even if you don't agree  
with it, but it *does* produce some unfortunate teething problems for  
people wanting to use Mac OS X in a cross-platform environment.

Here are some Apple docs on the subject:

http://developer.apple.com/qa/qa2001/qa1173.html

http://developer.apple.com/qa/qa2001/qa1235.html

I personally wish that UTF-8 didn't allow different normalization  
forms; then this kind of problem wouldn't arise. But it has arisen and  
we have to live with it. Some workarounds have been proposed for Git,  
but I haven't seen any convincing proposals yet.

Cheers,
Wincent



-
To: Kevin Ballard <kevin@...>
Cc: Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 12:32 pm

Hi,

> > > I have some files like "L
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Kevin Ballard <kevin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 12:46 pm

To be more exact encoding used to _create_ file differs from encoding

...which means that sequence of bytes differ. And Git by design is
(both for filenames and for blob contents) encoding agnostic.

HFS+ is just _stupid_. And unfortunately Git doesn't support stupid
filesystems (e.g. case insensitive filesystems) well.

-- 
Jakub Narebski
Poland
ShadeHawk on #git
-
To: Jakub Narebski <jnareb@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 4:39 pm

There's two different ways to do filesystem encodings. One is to have
the fs simply not care about encoding, which is what the linux world
seems to prefer. Sure, this is great in that what you create the file
with is what you get back, but on the other hand, given an arbitrary
non-ASCII file on disk, you have absolutely no idea what the encoding
should be and you can't display it without making assumptions (yes you
can use heuristics, but you're still making assumptions). Filesystems
like HFS+ that standardize the encoding, on the other hand, make it
such that you always know what the encoding of a file should be, so
you can always display and use the filename intelligently. It also
means it plays much nicer in a non-ASCII world, since you don't have
to worry about different normalizations of a given string referring to
different files (it's one thing to be case-sensitive, but claiming
that "föo" and "föo" are different files just because one uses a
composed character and the other doesn't is extremely
user-unfriendly). On the other hand, what you create the file with may not
be what you read back later, since the name has been standardized.
It's hard to say one is better than the other, they're just different
ways of doing it. However, I have noticed that everybody who's voiced
an opinion on this list in favor of the encoding-agnostic approach
seem to be unwilling to accept that any other approach might have
validity, to the extent of calling an OS/filesystem that does things
different stupid or insane. This strikes me as extremely elitist and
risks alienating what I expect to be a fast-growing group of users
(i.e. OS X users).

I'm willing to give Linus a free pass on calling other OS's stupid and
insane, as I don't think Linux would exist as it does today without
his strong opinions, but I don't think this should give carte blanche
to th...
To: Kevin Ballard <kevin@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 7:52 pm

There is no technical reason for *kernel* to care about file name
encoding. It is something that can be and should be dealt with in

And also because a user space program can deal with it much more

Wrong. If you have a policy that all file names are stored in UTF-8
encoding then there is no problem here. It should not be a kernel
problem to care about encoding, besides you cannot fully solve it

Yeah, right... Like Microsoft likes to "standardize" everything, which
in practice means forcing on others something fundamentally broken and
that does not follow any existing standard precisely:

===
IMPORTANT:
The terms used in this Q&A, decomposed and precomposed, roughly
correspond to Unicode Normal Forms D and C, respectively. However, most
volume formats do not follow the exact specification for these normal
forms.
===
http://developer.apple.com/qa/qa2001/qa1173.html

Not to mention that the use of decomposed Unicode as the standard is

Somehow I have no problem with displaying non-ASCII names on Linux.
I can see both Unicode Normal Forms C and D encoded symbols without

As you typed them, they both are exactly the same, and both of them are
in the Normal Forms C (which Mac calls as precomposed). So why do you

I am sure everyone here is scared to death... I mean we are used to
hearing such threats from some MS salespeople, but from a Mac guy? It is
really scary....

Wake up, and stop shooting this nonsense at us. If you have technical
reasons why your solution is better, let us know. So far, you do not
sound very convincing here. Why do you think that the issue of encoding
cannot be dealt with in user space? Why does Mac OS X use so-called
decomposed Unicode, which does not even follow any standard precisely?
Why does Mac OS X choose to decompose characters while it does not

I suppose it would be a much better subject for discussion...
At least, it would be more likely to result in Git working

First, no one called Mac OS X insane, but case insensitive files...
To: Kevin Ballard <kevin@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 5:51 pm

By the way, calling HFS+ stupid, or rather calling at least two 
different normalizations of UTF-8 (two different encodings) used for 
writing and reading filenames stupid is wrong _for me_. I have quoted 

For me it looks like a layering violation... but my knowledge about 
filesystem is close to nil. IMHO it is VFS and libc which should do the

But using one encoding to create file, and another when reading filenames
is strange. It is IMHO better to simply refuse creating filenames which 
are outside chosen encoding / normalization. But having different 
encodings used for reading and writing on the level of filesystem 

First, it is Git philosophy and very core of design to be encoding 
agnostic (to be "content tracker"). Second, using the same sequence of 
bytes on filesystem, in the index, and in 'tree' objects ensures good 
performance... this is something to think about if you want to add 
patches which would deal with HFS+ API/UI quirks.

[cut]
-- 
Jakub Narebski
Poland
-
To: Jakub Narebski <jnareb@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 6:06 pm

It's not using different encodings, it's all Unicode. However, it  
accepts different normalization variants of Unicode, since it can read  
them all and it would be folly to require everybody to conform to its  
own special internal variant. But it does have to normalize them,  
otherwise how would it detect the same filename using different  
normalizations? Also, it may seem strange to have different names  
between reading and writing, but that's only if you think of the name  
as a sequence of bytes - when treated as a sequence of characters, you  
get the same result. In other words, you're used to filenames as  

Sure, it makes sense from a performance perspective, but it causes  
problems with HFS+ and any other filesystem that behaves the same way.  
In the previous discussion about case-sensitivity, somebody suggested  
using a lookup table to map between git's internal representation and  
the name the filesystem returns, which seems like a decent idea and  
one that could be enabled with a config parameter to avoid penalizing  
repos on other filesystems. But I don't know enough about the  
internals of git to even think of trying to implement it myself.
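
For what it's worth, the lookup-table idea could look roughly like the sketch below. This is not how git works internally; `build_name_map` and `to_index_name` are hypothetical helpers, and NFC is picked arbitrarily as the common key:

```python
import unicodedata


def build_name_map(index_names):
    """Map a normalized key to the exact name stored in the index."""
    return {unicodedata.normalize("NFC", name): name for name in index_names}


def to_index_name(fs_name, name_map):
    """Translate a name returned by the filesystem to the index's spelling.

    Returns None when the index has no equivalent entry.
    """
    return name_map.get(unicodedata.normalize("NFC", fs_name))


index_names = ["gitweb/test/M\u00e4rchen", "README"]
name_map = build_name_map(index_names)

fs_name = "gitweb/test/Ma\u0308rchen"   # decomposed, as the filesystem reports it
print(to_index_name(fs_name, name_map) == index_names[0])  # True
print(to_index_name("new-file", name_map))                 # None
```

One open question with such a table: if the index already contains two names that differ only in normalization, they collide on the same key, so a real implementation would have to detect and report that case.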

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
To: Kevin Ballard <kevin@...>
Cc: Jakub Narebski <jnareb@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 6:23 pm

Hi,


But that's the _point_!  It _is_ Unicode, yet it uses _different_ 
encodings of the _same_ string.

Now, this discussion gets really annoying.  The real question is: will you 
do something about it, or reply with another 500-line email?

Ciao,
Dscho
-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Jakub Narebski <jnareb@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 7:16 pm

I wish I could do something about it. But right now I'm a full-time  
student trying to do contracting jobs on the side, and I don't believe  
I have the time to learn enough about the guts of git to try and make  
any changes to something as core as index filename handling. I just  
want people here to recognize that this is a valid problem instead of  
simply dismissing it as "HFS+ is insane, let's just ignore this issue".

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
To: Kevin Ballard <kevin@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 6:32 pm

That's a singularly *stupid* argument.

Here, let me rephrase that same idiotic argument:

  "But it does have to uppercase them, otherwise how would it detect the 
   same filename using different cases?"

..and if you don't see how that's *exactly* the same argument, you really 
are stupid.

The fact is, normalization is wrong.

It's wrong when you normalize upper/lower case (no, the word "Polish" is 
not the same as "polish"), and it's equally wrong when you normalize for 

No. HFS+ treats users as idiots and thinks that it should "fix" the 
filename for them. And it causes problems.

It causes problems for exactly the same reasons case-independence causes 
problems, because it's EXACTLY THE SAME ISSUE. People may think that "but 
they are the same", but they aren't. Case matters. And so does "single 
character" vs "two character overlay". 

Does it always matter? Hell no. But the problem with a filesystem that 
thinks it knows better is that when it *sometimes* matters, the filesystem 
simply DOES THE WRONG THING.

Can't you understand that?

			Linus
-
To: Kevin Ballard <kevin@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 6:52 pm

Side note: there are ways to do it right.

You can:

 - not do conversion at all (which is always right). Not corrupting the 
   user data means that the user never gets something back that he didn't 
   put in

   (And, btw, the "security" argument is total BS. The fact that two 
   characters look the same does not mean that they should act the same, 
   and it is *not* a security feature. Quite the reverse. Having programs 
   that get different results back from what they actually wrote, *that* 
   tends to be a security issue, because now you have a confused program, 
   and I guarantee that there are more bugs in unexpected cases than in 
   the expected ones)

 - Not accept data in formats that you don't like. This is also always 
   right, but can be rather impolite.

 - Not accept data in formats that you don't like, and give people 
   explicit conversion and comparison routines so that they can then make 
   their own decisions and they are *aware* of the conversion (so that 
   they don't come back to the problem of being confused)

So there are certainly many ways to handle things like this.
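
The third option in the list above (refuse, but hand out explicit routines) might be sketched like this in Python. All names here (`NameNotNormalized`, `accept_name`, `convert_name`, `same_name`) are invented for illustration, and NFC as the accepted form is an assumption, not something proposed in this thread:

```python
import unicodedata


class NameNotNormalized(ValueError):
    """Raised when a name is not already in the accepted form."""


def accept_name(name):
    """Refuse anything that is not already NFC; never convert silently."""
    if unicodedata.normalize("NFC", name) != name:
        raise NameNotNormalized(repr(name))
    return name


def convert_name(name):
    """Explicit conversion that the caller has to ask for."""
    return unicodedata.normalize("NFC", name)


def same_name(a, b):
    """Explicit comparison under canonical equivalence."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


decomposed = "Ma\u0308rchen"
try:
    accept_name(decomposed)
except NameNotNormalized:
    # The caller decides to convert and therefore knows the name changed.
    stored = accept_name(convert_name(decomposed))

print(stored == "M\u00e4rchen")       # True
print(same_name(decomposed, stored))  # True
```

The point of the design is that the conversion happens only where the program asks for it, so the program is never surprised by what it reads back.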

The one thing you shouldn't do is to silently convert data behind the
program's back, without even giving any way to disable it (and that disable
has to be on a use-by-use basis, not some "disable/enable for all users of
this filesystem", because you can - and do - have different programs that
have different expectations).

And finally: all of the above is true at *all* levels. It doesn't matter 
one whit whether the automatic conversion is in the kernel or
in a library. Doing it on a library level has advantages (namely the whole 
"disable/enable" thing tends to get *much* easier to do, and applications 
can decide to link against a particular version to get the behaviour 
*they* want, for example).

So doing it inside the kernel is just about the worst possible case, 
exactly because it makes it really hard to do a "on a case-by-case" basis. 

Yes,...
To: Linus Torvalds <torvalds@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 7:11 pm

You're right, it doesn't actually have to store the normalized form.
And yes, it's possible to compare without normalizing them.
Admittedly, I don't know much about the implementation details of
unicode, but I would assume that the easiest way to compare two
strings is to normalize them first. But in the case of the filesystem,
normalization actually is important if you're thinking about filenames
in terms of characters rather than bytes. When I feed the filesystem a
given unicode string, it has to find the file I'm talking about -
should it do a relatively expensive unicode-sensitive comparison of
all the filenames with the one I gave it, or should it just normalize
all names and do the much cheaper lookup that way? I don't know about
you, but I'd prefer to let my filesystem normalize the name and run

There's a difference between "looks similar" as in "Polish" vs
"polish", and actually is the same string as in "Ma<UMLAUT
MODIFIER>rchen" vs "M<A WITH UMLAUT>rchen". Capitalization has a valid
semantic meaning, normalization doesn't. The only way to argue that
normalization is wrong is by providing a good reason to preserve the
exact byte sequence, and so far the only reason I've seen is to help
git. Applications in general don't care one whit about the byte
sequence of the filename, they care about the underlying file the name
represents. Additionally, it would be a terrible experience for a user
to enter "Märchen" and have the application say "sorry, I can't find
this file" simply because the application used decomposed characters
and the filename used composed characters. Unless the user is
knowledgeable about the OS, filesystems, and unicode, they wouldn't

How do you figure? When I type "Märchen", I'm typing a string, not a
byte sequence. I have no control over the normalization of the
characters. Therefore, depending on what p...
To: Kevin Ballard <kevin@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 7:38 pm

That simply isn't true.

Normalization actually has real semantic meaning. If it didn't, there 
would never ever be a reason why you'd use the non-normalized form in the 
first place.

Others have argued the exact same thing for capitalization. "A" is the 
same letter as "a". Except there is a distinction.

The same is true of "a<UMLAUT MODIFIER>" and "<a WITH UMLAUT>". Yes, it's
the same "character" in either case. Except when there is a distinction.

And there *are* cases where there are distinctions. Especially inside 
computers. For one thing, you may not be talking about "characters on 
screen", but you may be talking about "key sequences". And suddenly
"a<UMLAUT MODIFIER>" is a two-key sequence, and "<a WITH UMLAUT>" is a
single-key sequence, and THEY ARE DIFFERENT.

See?

"a" and "A" are the same letter. But sometimes case matters.

Multi-character UTF-8 sequences may be the same character. But sometimes 
the sequence matters.


Git doesn't care. Just use the *same* sequence everywhere. Make sure 
something doesn't change it. Because if something changes it, git will 

Pure and utter garbage.

What you are describing is an *input method* issue, not a filesystem 
issue.

The fact that you think this has anything what-so-ever to do with 
filesystems, I cannot understand.

Here's an example: I can type Märchen two different ways on my keyboard: I 
can press the 'ä' key (yes, I have one, I have a Swedish keyboard), or I 
could press the '¨' key and the 'a' key.

See: I get 'ä' and 'ä' respectively.

And as I send this email off, those characters never *ever* got written as 
filenames to any filesystem. But they *did* get written as part of 
text-files to the disk using "write()", yes.

And according to your *insane* logic, that write() call should have 
converted them to the same representation, no?

Hell no! That conversion has absolutely nothing to do with the filesystem. 
It's done at a totally different layer that actual...
To: Linus Torvalds <torvalds@...>
Cc: Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 7:57 pm

The problem is that you don't control the sequence that everybody uses.

See this example:

melo@speed(~)$ uname -a
Linux speed.simplicidade.org 2.6.9-55.ELsmp #1 SMP Wed May 2 14:28:44  
EDT 2007 i686 i686 i386 GNU/Linux
melo@speed(~)$ set | grep LANG
LANG=en_US.UTF-8
melo@speed(~)$ mkdir t
melo@speed(~)$ cd t
melo@speed(~/t)$ git init
Initialized empty Git repository in .git/
melo@speed(~/t)$ touch á
melo@speed(~/t)$ git-add á
melo@speed(~/t)$ git-commit -m "added a in utf8"
Created initial commit 7a473a2: added a in utf8
  0 files changed, 0 insertions(+), 0 deletions(-)
  create mode 100644 "\303\241"
melo@speed(~/t)$ export LANG=en_US
melo@speed(~/t)$ touch á
melo@speed(~/t)$ ls -la
total 12
drwxrwxr-x   3 melo melo 4096 Jan 16 23:44 .
drwx--x--x  31 melo melo 4096 Jan 16 23:43 ..
-rw-rw-r--   1 melo melo    0 Jan 16 23:44 á
-rw-rw-r--   1 melo melo    0 Jan 16 23:43 á
drwxrwxr-x   8 melo melo 4096 Jan 16 23:43 .git
melo@speed(~/t)$ git-add á
melo@speed(~/t)$ git-commit -m "added a in iso-latin-1"
Created commit 4282fca: Oláx!
  0 files changed, 0 insertions(+), 0 deletions(-)
  create mode 100644 "\341"

So two (simulated in this test) users who use different LANG settings  
will be in trouble in no time.
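
The two quoted names in the commit output above are simply the same character encoded two ways. Here is a small Python sketch to confirm it; `git_quote` is a hypothetical helper that only imitates the octal escaping seen in that output and is not git's own quoting code:

```python
name = "\u00e1"   # "á"

utf8_bytes = name.encode("utf-8")       # what LANG=en_US.UTF-8 produced
latin1_bytes = name.encode("latin-1")   # what LANG=en_US produced


def git_quote(raw):
    """Render non-ASCII bytes as octal escapes, like the listing above."""
    return "".join(chr(b) if b < 0x80 else "\\%03o" % b for b in raw)


print(git_quote(utf8_bytes))    # \303\241
print(git_quote(latin1_bytes))  # \341
```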

What I take from this conversation is that I have to specify, for  
each project I work on, which encoding we should use, across all  
users, before they start using git with files with accented chars.

The difference I see between us is that if I tell my filesystem that  
I want to name my file with a particular string encoded in X, users  
using encoding Y will be able to read it correctly. I like my
filesystem to make that work for me.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!


-
To: Pedro Melo <melo@...>
Cc: Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 8:16 pm

The difference I see between us is that when I tell you that this is 
exactly the same thing as your file *contents*, you don't seem to get it.

An OS that silently changes the contents of your files is *crap*.

Get it?

An OS that silently changes the contents of your directories is *crap*.

Get it now?

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Pedro Melo <melo@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Friday, January 18, 2008 - 4:29 am

This is the same issue as the CRLF issue I posted on earlier, and it
all stems from that git also sees file names as a stream of bytes, not

A program that silently ignores the conventions of the platform it runs
on is *crap*, no matter if the conventions are not the same as for

A program that silently ignores the conventions of the file system it
tries to store its files on is *crap* :-)


In my perfect world, file names would be stored as a string of characters,
so if I save a file with an å in it, that å would be preserved no
matter if I run Linux on ext2 with my locale set to latin-1 (which
stores it as byte 0xE5), on Windows with NTFS (which stores it as the
UTF-16 code 0x00E5), on Windows/DOS with FAT (which stores it as the
byte 0x86) or on Mac OS X which stores it as decomposed UTF-8 (whose
byte sequence I don't know off the top of my head). If that was just
stored as U+00E5 in whatever encoding in the filename index, the local
implementation of git can just check it out in the form needed.
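
The byte values mentioned above, including the decomposed UTF-8 form, can be listed with a few lines of Python. This is a sketch under two assumptions: that the FAT case refers to the DOS code page 437, and that plain NFD is close enough to what Mac OS X stores:

```python
import unicodedata

a_ring = "\u00e5"   # "å", U+00E5

encodings = {
    "latin-1": a_ring.encode("latin-1"),
    "utf-16-be": a_ring.encode("utf-16-be"),
    "cp437": a_ring.encode("cp437"),
    "utf-8 (composed)": a_ring.encode("utf-8"),
    "utf-8 (decomposed)": unicodedata.normalize("NFD", a_ring).encode("utf-8"),
}

# Print each encoding as hex so the forms can be compared side by side.
for label, raw in encodings.items():
    print(f"{label:20} {raw.hex()}")
```

The decomposed form comes out as "a" (0x61) followed by the combining ring above, 0xCC 0x8A.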

-- 
\\// Peter - http://www.softwolves.pp.se/
-
To: Peter Karlsson <peter@...>
Cc: Linus Torvalds <torvalds@...>, Pedro Melo <melo@...>, Kevin Ballard <kevin@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Friday, January 18, 2008 - 7:16 am

You have to be careful about CRLF conversion, lest you corrupt your

Git philosophy to see the contents of files and "contents" of directories
(filenames) as stream of bytes, i.e. to use 'native' encoding works
perfectly well and _fast_ if all developers work in the same environment.
Troubles start if you are working across operating systems, and across

Git has had i18n.commitEncoding for a long time, and for some time it
has saved it in the 'encoding' header in the commit object (if different
from 'utf-8'); it also has i18n.logOutputEncoding.

For dealing with different filesystem encodings you would also have
to have both: encoding used in 'tree' objects (by repository) for
filenames saved somewhere in repository, either in tree object (argh!)
or in some kind of .gitconfig file; encoding used by filesystem in
repository config as i18n.filesystemEncoding or something like that.
And think what to put in the on disk index, and in memory index.


NOTE, NOTE, NOTE! If filename is used somewhere in the file contents
(manifest-like file, include-like statement), and this filename uses
characters which are differently encoded in different encoding you
are screwed with this fancy system, badly, anyway.

-- 
Jakub Narebski
Poland
-
To: Linus Torvalds <torvalds@...>
Cc: Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 8:27 pm

Hi,


I get that you think it's the same thing.

What I don't get is why a user should be forced to know what type of  
encoding he and the other users are using on all the layers going  
down to the filesystem. If two users on different systems or in  
different configurations, choose the same unicode string as the name,  
why do we need to make it harder for things to just work out?

The content of the file is sacred, we both agree on that. We disagree  
on the filename, because for me it's more important that equal  
strings, even if encoded to different byte sequences, should be  

I was not talking about content of files, those are sacred. I was  
talking about filenames. Those *for me* are not, but are for you. No  
problem, we just have different values: I want my computer to work  
for me, not me working for the computer. I'm willing to accept a file  
system or other layer that normalizes encoding of filenames if that  
makes the end-user's life easier, especially in a tool distributed by

As I said before, we disagree on file meta-data, not on file  
contents. For you, byte in must be the same byte out. For me string  
in must be the same string out.

And as I said in the previous email, what I learned today is that in  
a distributed project using git, and if you need to use accented  
characters, I need to tell all the users to use the same LANG settings.

It's important information, at least for me.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!


-
To: Pedro Melo <melo@...>
Cc: Linus Torvalds <torvalds@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 8:35 pm

Hi,


Why should the filename be _stored_ normalised?  I agree on the lookup, 
yes, but not the storage.

Hth,
Dscho

-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Linus Torvalds <torvalds@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 8:45 pm

[Empty message]
To: Pedro Melo <melo@...>
Cc: Linus Torvalds <torvalds@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 8:32 pm

If you do the normalization in the right place, things will just work

Well, as the issue shows it does not make life for the end-user easier.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To: David Kastrup <dak@...>
Cc: Linus Torvalds <torvalds@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 8:40 pm

Hello,


No problem, but don't you think that git should do it?

Don't you think it's important in a distributed tool that no matter
what system they use, be it linux or solaris, they are able to talk  
about a file with non-ascii chars and be the same file to both of them?

That's the point I'm making. The fact that I need to set LANG across  

I'm assuming you are talking about HFS+ and the strange normalization  
it does.

I'm sorry but that was not the problem I sent. I sent a scenario, in  
which two users, using the same linux system but with different LANG  
settings cannot use git reliably.

Although this thread started because of HFS+ "choices", the problem  
is not really related to HFS+ given that you can have the same issues  
even on the same physical &lt;insert flavor here&gt; POSIX system.

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!


-
To: Pedro Melo <melo@...>
Cc: David Kastrup <dak@...>, Linus Torvalds <torvalds@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 8:54 pm

I don't think I'd call that "insane" (in fact, I think these  
discussions would be much less irritating for all involved if we  
didn't use that word so often, even when it's not called for). It's  
not that different than the whole LF/CRLF line-ending thing.

The real problem is that setting LANG won't help you on Mac OS X; set  
LANG to whatever you want and there is *nothing* that you can do to  
stop your filenames being normalized into decomposed UTF-8, short of  
dropping HFS+. You can use an alternative filesystem, but support for  
basically everything except HFS+ is suboptimal in Mac OS X at the  
moment.

Cheers,
Wincent

-
To: Wincent Colaiuta <win@...>
Cc: Pedro Melo <melo@...>, David Kastrup <dak@...>, Linus Torvalds <torvalds@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 9:08 pm

Hi,

On Thu, 17 Jan 2008, Wincent Colaiuta wrote:

> El 17/1/2008, a las 1:40, Pedro Melo escribi
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Wincent Colaiuta <win@...>, Pedro Melo <melo@...>, David Kastrup <dak@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 9:41 pm

One of the advantages (the biggest one, in fact, apart from the obvious 
US-ASCII down-compatibility and the fact that you can do C-compatible 
NUL-terminated strings) of UTF-8 is that it's locale-independent, and 
doesn't care about LANG, because it's valid in all languages.

And that's really important. It's important for a very simple reason: 
there is almost never such a thing as "a locale" except for US-ASCII. Once 
you move away from US-ASCII, it actually tends to be much more common that 
you have a *mixture* of locales - often in the same "document" - than to 
have one single locale.

It very much happens even in filenames - people "mix" locales in trivial 
ways even within a single pathname component (non-US-ASCII filename, but 
with a regular file extension), but much more interestingly they do so 
within a directory tree (ie you can have translation subdirectories where 
the filenames themselves are in another language, and you can have full 
pathnames where different components are in different languages, for 
example).

And UTF-8 is _wonderful_ for this, because LANG doesn't matter, and 
cannot matter, and thus mixing isn't a problem.

Of course, you can screw it up. Locales still can change things like sort 
order and capitalization etc, so even if you use UTF-8, you sure can get 
into trouble with LANG and thinking that a per-session locale makes sense.

So choosing UTF-8 for the filesystem isn't wrong per se. It's a fine 
choice, and has no issues with LANG in itself. Limiting it to strictly 
valid UTF-8 encodings is also fine. Limiting it (further) to only 
character normalized UTF-8 is also fine.

Most Linux filesystems don't limit it in any way, so you can make 
filenames that aren't valid UTF-8 at all, much less normalizing 
multi-character sequences.
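
As a rough illustration of the point about unrestricted filenames, here is a minimal Python sketch of a name that a byte-oriented filesystem would happily store even though it is not valid UTF-8. The helper `is_valid_utf8` is a hypothetical name used only for this example, and the Latin-1 locale is an assumption.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "Lüftung.txt" as a Latin-1 locale would write it: ü is the single byte 0xFC.
latin1_name = "L\u00fcftung.txt".encode("latin-1")
# The same name as a UTF-8 locale would write it: ü is the two bytes 0xC3 0xBC.
utf8_name = "L\u00fcftung.txt".encode("utf-8")

print(latin1_name, is_valid_utf8(latin1_name))
print(utf8_name, is_valid_utf8(utf8_name))
```

Both byte strings are legal names on a filesystem that does not inspect them; only the second one is also valid UTF-8.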

I personally think that's the best option, but I probably do so mostly 
because I know some people still use Latin1 as their only locale (and I 
suspect Asia will take decades before it has converted t...
To: Linus Torvalds <torvalds@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Wincent Colaiuta <win@...>, Pedro Melo <melo@...>, David Kastrup <dak@...>, Jakub Narebski <jnareb@...>, Mark Junker <mjscod@...>, <git@...>
Date: Thursday, January 17, 2008 - 12:07 am

Alright, you've made your point, and I'm willing to concede at least
some of what you've said. So perhaps we can now move onto the more
relevant and practical issue of: HFS+, despite how stupid it may or
may not be, normalizes filenames (and is case-insensitive, which is a
related issue). This causes a problem with git. How can this be solved?

I'm more than willing to do work to solve it, my biggest issue is I
don't believe I actually have the free time to learn the git internals
well enough to actually do proper work on what I would assume is a
fairly performance-critical section of git's code. However, I would be
happy to work with others who are perhaps more knowledgeable in this
area.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
To: Linus Torvalds <torvalds@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 8:09 pm

My understanding is that normalization is there to help the computer.
That doesn't give it any semantic meaning, because all normal forms of

The argument for case insensitivity is different than the argument for
normalization. I certainly hope you understand why they are different

You're right, sometimes the sequence matters. As in key sequences. But
we're not talking about key sequences, we're talking about strings.

And how am I supposed to use the same sequence everywhere? When I type
"Märchen", I don't know which form I'm typing, nor should I. It's not
something that I, as a user, should have to know. Especially if I pass
this name through various other utilities before using it - I have no
idea if another utility is going to end up normalizing the name, and

On a US keyboard I only have one way of typing ä, and I have no idea
whether it ends up precomposed or decomposed in the resulting byte
stream. And I don't care. Because I'm typing characters, not bytes. I
could be typing in a file in ISO-Latin-1 and I still wouldn't care,
because it looks the same to me. If my filesystem did make a
distinction between the normal forms, and I see that I have a file
named "Märchen", how am I supposed to type that at my keyboard? I
don't know which normal form it's using.

The fact that you think the normalization of the string matters, I

What a fabulous straw man argument you just put together. I hope you

I'm speaking as a user, and as such, I shouldn't even have to know
that it's possible to write the same character in multiple different
ways. As a user, HFS+ behaves exactly the way I want it to. You were
talking earlier about not messing with the "user data", but what is
the "user data"? It's the string, not the byte sequence. That's all I
care about - the string. That's all the OS cares about, that's all any
application I use cares a...
To: Kevin Ballard <kevin@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, <git@...>
Date: Wednesday, January 16, 2008 - 9:16 pm

The thing is, you seem to argue that what OS X does helps you as the user.

But you are arguing based on incorrect assumptions.

First off, we've had years and years and years of usage of non-corrupting 
filesystems (pretty much every UNIX OS around since day 1, and many other 
OS's too), and it's simply not true that it's a problem. You see the 
filename in the file dialog, and you open it, and you're done. OS X isn't 
any "easier" in this regard.

In fact, this whole thread comes from the fact that the OS X choice that 
you *think* is easier, is in fact not easier at all. It's not easier for 
the user, it's not easier for the application programmer, and the really 
sad part is that it's very much *not* easier for OS X itself either (ie 
they had to literally write extra code with nasty tables to do it, and it 
really does hurt them in performance and complexity).

And _that_ is why the OS X situation is so sad. Apple literally added 
extra code to make things slower and more complex *and* harder to use 
reliably.

Does it show up in normal behaviour? Of course not. You'd probably never 
see it in real life outside of test-suites. People simply don't even tend 
to use filenames outside of US-ASCII, and when they do use them, input 
methods really *do* tend to do the normalization for you.

But when it comes to automation (which is what computers are all about), 
the OS X choice is literally the wrong one. And there's no _upside_. It's 
all downside. Which is why it's so stupid.

I bet it only exists because OS X engineers didn't really even think about 
it, and they just assumed that "normalization is helpful". They took your 
stance - thinking it was worth it, without ever really thinking it 
through.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Wednesday, January 16, 2008 - 11:52 pm

On Jan 16, 2008, at 8:16 PM, Linus Torvalds <torvalds@linux-foundation.org

I believe it exists because HFS+ was created at a time when the Mac  
was moving from a multi-encoding world (which was a nightmare) to a  
Unicode world and they wanted to remove ambiguity in filenames. But I  
wasn't around when they made this decision so this is just a guess.

-Kevin Ballard
-
To: Kevin Ballard <kevin@...>
Cc: Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 12:08 am

I do agree. And I think starting out case-insensitive (something they must 
really hate by now) also made it less of an issue. When you're 
case-insensitive, the issues with any UTF-8 normalization are simply 
swamped by all the issues of case, so you probably don't even think about 
it very much.

The big problem with any name rewriting is that I can open file 'xyz', and 
I literally have a very hard time knowing whether that file I know I 
opened and created has anything to do with the file 'Xyz' that I see when 
I do a readdir().

Are they the same? Maybe. But it's literally hard to tell on OS X. I can 
do an fstat() on my file descriptor and on the directory entry, and if 
they get the same d_ino they *probably* are the same entry, but even then 
it actually could have been a hardlink (and my 'xyz' is really *another* 
name for it entirely, and the filesystem is actually case-sensitive and 
'Xyz' was a *different* name that somebody else did!).
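
The inode comparison described above, and the reason it is only "probably" conclusive, can be sketched in Python. This is a minimal illustration, assuming a filesystem that supports hard links; `same_inode` and the file names are hypothetical, and it uses `os.stat` on paths rather than `fstat()` on a descriptor.

```python
import os
import tempfile


def same_inode(path_a: str, path_b: str) -> bool:
    """Compare the (device, inode) pairs of two paths."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "xyz")
    alias = os.path.join(d, "hardlinked-temp")
    with open(original, "w") as f:
        f.write("content\n")
    # A second, unrelated name for the very same inode.
    os.link(original, alias)
    names = sorted(os.listdir(d))
    linked = same_inode(original, alias)

print(names, linked)
```

Matching inode numbers show that two names reach the same file, but not that they are the same directory entry, which is the distinction a content tracker cares about.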

See? If you're creating a content tracker, these kinds of issues are not 
"idle chatter". It's really *really* important. Was that file the one I 
was told to track? Or was it a temporary file that was just hardlinked? 

This is why case-insensitivity is so hard: you have a very real "aliasing" 
on the filesystem level, where all those really *different* pathnames end 
up being the same thing.

And all the same issues show up with utf-8 rewriting, so if you normalize 
utf-8 names, you actually end up having almost all the same problems that 
a case-insensitive filesystem has. They're just much rarer in practice, so 
you just won't hit them as often - but when you do, they are equally 
painful!

(In fact, they can be a whole lot *more* painful, because now they are 
really rare, and really confusing when they happen!)
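
To make the aliasing concrete, here is a toy Python model of a directory that rewrites names on creation. It is only a sketch: the class `NormalizingDir` is hypothetical, and plain NFD stands in for whatever variant HFS+ really applies. It reproduces the symptom reported at the start of this thread, where a file is found when looked up by name but also shows up as untracked.

```python
import unicodedata


class NormalizingDir:
    """Toy stand-in for a filesystem that stores names in decomposed form."""

    def __init__(self):
        self._entries = {}

    def create(self, name: str, data: str) -> None:
        # The name is silently rewritten (NFD) before it is stored.
        self._entries[unicodedata.normalize("NFD", name)] = data

    def exists(self, name: str) -> bool:
        # Lookups normalize too, so either spelling finds the file.
        return unicodedata.normalize("NFD", name) in self._entries

    def listdir(self) -> list:
        # Listing returns the stored (rewritten) names.
        return list(self._entries)


tracked = {"L\u00fcftung.txt"}  # precomposed name recorded by the tracker
fs = NormalizingDir()
fs.create("L\u00fcftung.txt", "data")

# Pass 1: is every tracked name present? Pass 2: is every listed name tracked?
missing = {n for n in tracked if not fs.exists(n)}
untracked = {n for n in fs.listdir() if n not in tracked}

print(missing, untracked)
```

Nothing is missing, yet the listing contains a name the tracker never recorded, because the comparison in the second pass is done on the rewritten name.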

But if you come from a case-insensitive background, all the UTF-8 
rewriting really looks like such a small problem compared to all the 
horrid problems that you had with different locales and cases, so I 
suspe...
To: Linus Torvalds <torvalds@...>
Cc: Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 6:08 am

I hope you're right (about them hating it), but we'll see. They've  
just opened the source for the ZFS port they're working on. By the  
time it goes final and becomes the default FS, replacing HFS+,  
probably within a couple of years, we'll see if they make the same two  
design decisions which cause the kinds of problems being discussed  
here (case-insensitivity, and ubiquitous FS-level UTF-8 normalization).

I've done a dumb search in the ZFS source code for "CASE" and see that  
it can in theory support case-insensitivity as an optional feature.  
The potential is there for Apple to use this. I personally hope that  
they don't, because as has already been pointed out, these little  
tricks tend to make life more difficult for users rather than helping  
them (the day I have two files in the same directory called "Märchen"  
and want to specify one of them on the command line I'll worry about  
that when I come to it).

http://fuzzy.wordpress.com/2007/06/09/zfsandfilesystemoptions/

Cheers,
Wincent



-
To: Wincent Colaiuta <win@...>
Cc: Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 12:43 pm

Side note: the thing is, the reason people shouldn't worry about it is 
that this is a *trivial* thing to handle. You really don't even need to 
know what you're doing. And you can test it today, easily.

Having two (differently encoded) files like that is really no different 
from the traditional UNIX FAQ of "how do I remove a file starting with 
'-'" or even more closely "how do I remove a file that has a character in 
it that I cannot get at the keyboard".

In other words, on a bog-standard UNIX (and yes, in this case, I bet OS X 
works fine too for this test), just try this

	filename1=$(echo -e "hello\002there")
	filename2=$(echo -e "hello\003there")
	echo Odd file > "$filename1"
	echo Another odd file > "$filename2"

and now you have a filename that is actually rather hard to type on the 
command line. In fact, for me they even *look* the same:

	[torvalds@woody ~]$ ll hello*
	-rw-rw-r-- 1 torvalds torvalds  9 2008-01-17 08:23 hello?there
	-rw-rw-r-- 1 torvalds torvalds 17 2008-01-17 08:23 hello?there

See?

Even in my graphical browser, those two filenames look 100% *identical*. I 
could give you a screen-shot, but I'm lazy. Just take my word for it, or 
just fire up konqueror on Linux (but it may well depend on the particular 
font you're using).

[ And yes, for other browsers, you might have something that shows them as 
  different characters - depending on the font, it might show up as a 
  small box with [00 02] vs [00 03] in it, for example. But that's also 
  actually 100% true of the two different encodings of 'ä' - you could 
  easily have a file browser that shows the multi-character as a 
  multi-character, exactly to distinguish them and show that one of them 
  isn't "normalized"!

  The point is, once the filesystem doesn't corrupt the data, it's always 
  easy to get at, and there is never any ambiguity. ]

How is this different from "Märchen" spelled with two different encodings 
for that "ä"?

I'll tell you: it's not at all differen...
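
For readers who want to see the two spellings of "Märchen" side by side, the following Python sketch prints their code points and UTF-8 bytes. The helper `code_points` is a hypothetical name for this example; the escapes are used so that the source itself cannot be silently renormalized by an editor.

```python
precomposed = "M\u00e4rchen"  # ä as one code point, U+00E4
decomposed = "Ma\u0308rchen"  # a followed by U+0308 COMBINING DIAERESIS


def code_points(s: str) -> list:
    """Format each code point of s as U+XXXX."""
    return [f"U+{ord(c):04X}" for c in s]


print(code_points(precomposed))
print(code_points(decomposed))
print(precomposed.encode("utf-8"))
print(decomposed.encode("utf-8"))
print(precomposed == decomposed)
```

The two strings usually render identically, but they differ in length, in code points and in bytes, so a tool that inspects them can always tell them apart.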
To: Linus Torvalds <torvalds@...>
Cc: Wincent Colaiuta <win@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 6:01 pm

With the exception of Unicode. If you check the standard, two Unicode
codepoints (i.e. the numeric value that gets stored on disk) *can* map
to the same character, hence they are the same. They don't just look the
same, they are the same character -- even if the codepoints are
different (i.e. precomposed vs. decomposed characters). In fact, part of
the Unicode standard deals with that. (Technically, Unicode calls it
equivalence, but what the hey).

In other words, Unicode treats e.g. both U+0065 and U+00E9 as
fundamentally the same character. This comes even more into play in such
alphabets as Hangul (Korean) and the Japanese Kana.


-- 
JM Ibanez
Software Architect
Orange & Bronze Software Labs, Ltd. Co.

jm@orangeandbronze.com
http://software.orangeandbronze.com/
-
To: JM Ibanez <jm@...>
Cc: Linus Torvalds <torvalds@...>, Wincent Colaiuta <win@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 6:09 pm

Hi,


As Linus _already_ pointed out, you are confusing characters with glyphs.

Hth,
Dscho

-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: JM Ibanez <jm@...>, Linus Torvalds <torvalds@...>, Wincent Colaiuta <win@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 9:27 pm

Someone is. 

He is referring to the Unicode definition of an (abstract) character.

Ch3.4 D11 - "A single abstract character may also be represented by a sequence
of code points—for example, latin capital letter g with acute may be represented
by the sequence <U+0047 latin capital letter g, U+0301 combining acute accent>, 
rather than being mapped to a single code point."


-- robin
-
To: JM Ibanez <jm@...>
Cc: Wincent Colaiuta <win@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 7:05 pm

But if you want to make it clear, you can use "encoded character" or yes, 
"code point". 

But the thing is, even the unicode standard tends to just say "character", 
and a unicode string (for example) is defined to be a sequence of "code 
units" which in turn is about those *encoded* characters, which is all 
about the code points.

So you'll find that they are very careful in some technical definition 
parts to talk about "code points", but then in other sequences they talk 
about "character" even though they are referring to the actual code point 
(ie the figure literally has the unicode number in it!)

In fact, they sometimes even talk about "characters" in the totally 
non-encoding meaning of "glyph".

So yes, "character" is often ambiguous. It would be good to never use the 
word at all, and only talk about "code point" and "glyph" and one of the 
well-defined special terms like "combining character" or "replacement 
character".

But to take a representative example from The Unicode Standard, Chapter 2: 
"Unicode Design Principles":

  Characters are represented by code points that reside only in a memory 
  representation, as strings in memory, on disk, or in data transmission. 
  The Unicode Standard deals only with character codes.

(any speling mistakes mine). In other words, from the very beginning of 
the standard, very basic design principles chapter, it starts talking 
about characters being represented by code points and explicitly says that 
it really only deals with CHARACTER CODES.

Yes, I'm sure you can argue ad infinitum that all the "equivalences" and 
other crap means that a "character" can sometimes mean just about 
anything, but I'd say that it's pretty damn reasonable to equate "unicode 
character" with "code point" or "character code".

			Linus
-
To: JM Ibanez <jm@...>
Cc: Linus Torvalds <torvalds@...>, Wincent Colaiuta <win@...>, Kevin Ballard <kevin@...>, Jakub Narebski <jnareb@...>, Johannes Schindelin <Johannes.Schindelin@...>, Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 7:10 pm

So they are not the same after all? It is just you don't care
about what it actually says, right? How about this: Unicode
provides a unique number for every character. So, if numbers
are not the same then by definition of the Unicode standard

There is no notion of "fundamentally the same character" in the Unicode
standard as far as I know, and the characters you mentioned are very
different in Unicode:
http://www.fileformat.info/info/unicode/char/0065/index.htm
http://www.fileformat.info/info/unicode/char/00e9/index.htm
They have different names, they have different glyphs, and they
are functionally different.

Dmitry
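
The two code points under discussion can be checked against the Unicode character database that ships with Python. This is a small sketch using the standard `unicodedata` module; the variable names are only for illustration.

```python
import unicodedata

e_plain = "\u0065"
e_acute = "\u00e9"

name_plain = unicodedata.name(e_plain)
name_acute = unicodedata.name(e_acute)
# Canonical decomposition of U+00E9, as space-separated hex code points.
decomp = unicodedata.decomposition(e_acute)

# U+0065 and U+00E9 do not normalize to the same thing...
nfd_equal = unicodedata.normalize("NFD", e_plain) == unicodedata.normalize("NFD", e_acute)
# ...whereas U+00E9 and the sequence <U+0065, U+0301> do.
equiv_to_sequence = unicodedata.normalize("NFD", e_acute) == "\u0065\u0301"

print(name_plain, "|", name_acute, "|", decomp)
print(nfd_equal, equiv_to_sequence)
```

So the canonical equivalence is between U+00E9 and the two-code-point sequence, not between U+00E9 and U+0065 on its own.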
-
To: git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 2:09 pm

Sorry, but you're using different characters that look the same. But 
Kevin's point was that it's a different thing if you use two characters 
that look the same or the same character with different encodings. This 
makes this HFS-specific problem different from the "look the same"- or 
the "case-insensitivity"-issues.

BTW: I also read about your argument that you wouldn't convert file data 
to normalized UTF-8 (I agree with you that this would be nonsense) and 
therefore filenames shouldn't be converted either. This is something where 
I have to disagree because a filename (like ctime, mtime, atime, ...) 
is metadata (while file content isn't) and - until now - I would've 
guessed that you agree on this point because git doesn't care about 
filenames but contents.

IMHO it would be the best solution when git stores all string meta data 
in UTF-8 and converts it to the target systems file system encoding. 
That would fix all those problems with different locales and file system 
encodings ...

However, I have to agree that the enforced character set conversion 
causes more problems than it solves.

Regards,
Mark
-
To: Mark Junker <mjscod@...>
Cc: git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 2:42 pm

But that's exactly the case he gave - 'ä' vs 'a¨' are exactly that: 
different strings (not even characters: the second is actually a 
multi-character) that just look the same.

You try to twist the argument by just claiming that they are the same 
"character". They aren't, unless you *define* character to be the same as 
"glyph". Of course, if you claim that, then you can always support your 
argument, but I claim that is a bogus and incorrect axiom to start with!

Too many people confuse "character" and "glyph". They are different.

See, for example

	http://en.wikipedia.org/wiki/Unicode

and notice the *many* places where they try to make that distinction 
between "character" and "glyph" clear (and also "code values", which are 
the actual bytes that encode a character).

See also

	http://en.wikipedia.org/wiki/Unicode_normalization

and realize that a Unicode sequence is a sequence of *characters* even if 
it is not normalized! Those things are still characters, when they are the 
"simpler" non-combined characters.

You are trying to make a totally BOGUS argument, and you base it on the 
INCORRECT basis that the TWO characters 'a'+'¨' somehow aren't independent 
characters. They *are*. They are *different* characters from 'ä', even 
though they may be "Canonically equivalent" as a sequence.

The fact is that "equivalent" does not mean "same". Why cannot people 
accept that?

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 2:52 pm

Hi,



I'll shut up now if you can answer me one question, because it  
really is a problem for my team.

We have people using Windows, people using Macs, and people using  
several flavors of Linux desktops. They all have different settings  
and if I add a file like áéióú that happens to be UTF-8 encoded, it  
will reach an iso-latin-1 user as visual garbage. git will track the  
file perfectly, we know that, because the sequence of bytes that my  
system used to create the file will be the same on all "sane"  
systems, but the file will look "funny" to some users, and we get  
complaints from some less enlightened ones.

The answer is that users should not create filenames with non-ascii  
characters if they want a consistent experience, right?
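
The "visual garbage" is easy to reproduce. Here is a minimal Python sketch, assuming the receiving side interprets the unchanged bytes as ISO-8859-1; the variable names are only for illustration.

```python
name = "\u00e1\u00e9i\u00f3\u00fa"  # "áéióú"

# The bytes a UTF-8 system writes for this filename.
raw = name.encode("utf-8")
# What a Latin-1 user sees when the same bytes are displayed.
as_seen_in_latin1 = raw.decode("latin-1")
# The bytes were never changed, so the original name is still recoverable.
recovered = as_seen_in_latin1.encode("latin-1").decode("utf-8")

print(raw)
print(as_seen_in_latin1)
print(recovered == name)
```

The tracked bytes are identical on both machines; only their interpretation differs, which is why an ASCII-only name sidesteps the problem.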

This is just so that I can write a best practices document for them...

Best regards,
-- 
Pedro Melo
Blog: http://www.simplicidade.org/notes/
XMPP ID: melo@simplicidade.org
Use XMPP!


-
To: Pedro Melo <melo@...>
Cc: Mark Junker <mjscod@...>, git@vger.kernel.org <git@...>
Date: Thursday, January 17, 2008 - 3:11 pm

I can't really suggest anything else than trying to make everybody use 
UTF-8.

[ Not just for filenames, by the way - this is one of the reasons I think
  it is so *important* to not corrupt filenames, exactly because this is 
  in no way filename-specific at all, and filenames are generally "textual 
  data" exactly the same way a text-file is.

  But only totally insane people think that you should force-normalize 
  text-files, even though all the issues are obviously all the same 
  regardless of whether it's a filename or a word in textfile. ]

And yes, I also realize that it's not going to be realistic. We're 
probably *closer* to that than we used to be, but I don't think you can 
even make Windows think FAT is UTF-8.

I don't know how NTFS works (I know it is Unicode-aware, and I think it 
encodes filenames in UCS-2 or possibly UTF-16, but there is an obvious 1:1 
translation to UTF-8, and since we use C strings, I'd assume/hope Windows 
actually uses that unambiguous translation for any filenames).

Under modern Linux and OS X, UTF-8 is basically the only way (older Linux 
distros may be set up for Latin1, but at least the newer ones seem to all 

Oh, absolutely. That takes care of 99.9% of all source projects. Even then 
you can have problems with case insensitivity (the Linux kernel sources 
are all US-ASCII filenames, for example, but *literally* have many files 
that are identical if you ignore case, and that's not unheard of).

So yes, to a first approximation, the answer is to simply avoid using 
anything but US-ASCII. It's seldom a big limitation when talking about 
filenames.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Mark Junker <mjscod@...>, Pedro Melo <melo@...>, git@vger.kernel.org <git@...>
Date: Friday, January 18, 2008 - 6:19 am

But they are not different strings, they are canonically equivalent as
far as Unicode is concerned. They're even supposed to map to the same
glyph (if the font has an "ä", it should display it in both cases, if
it has an "a" and a combining diaeresis, it should make up one).

You cannot do a binary comparison of text to see if two strings are

Whereas you are confusing characters and code points.

"ä" and "a¨" use different code points, but they encode the same
character, and from the user's perspective it is the *character* that

Actually, NTFS is a bit broken. It sees file names as a string of
16-bit words. It doesn't check that it is valid UTF-16, or even valid
UCS-2, it allows almost anything.
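
A short Python sketch of both halves of this point: a well-formed UTF-16 name has exactly one UTF-8 translation, while a sequence of 16-bit words that is not valid UTF-16 has none. The function `utf16_name_to_utf8` is a hypothetical helper, not any Windows or git API, and little-endian byte order is an assumption.

```python
def utf16_name_to_utf8(units: bytes) -> bytes:
    """Transcode a little-endian UTF-16 filename to UTF-8.

    Raises UnicodeDecodeError if the input is not well-formed UTF-16.
    """
    return units.decode("utf-16-le").encode("utf-8")


well_formed = "M\u00e4rchen".encode("utf-16-le")
# U+D800 (a high surrogate) followed by "A" instead of a low surrogate.
lone_surrogate = b"\x00\xd8\x41\x00"

converted = utf16_name_to_utf8(well_formed)

try:
    utf16_name_to_utf8(lone_surrogate)
    rejected = False
except UnicodeDecodeError:
    rejected = True

print(converted, rejected)
```

A filesystem that accepts the second name can therefore hold entries that a strict UTF-8 consumer cannot represent.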


Apple made Mac OS X handle filenames properly, by seeing that file
names are a string of characters, not code points, so they use a
canonical form for all characters (personally, I would have preferred
the pre-composed form, though).

-- 
\\// Peter - http://www.softwolves.pp.se/
-
To: Peter Karlsson <peter@...>
Cc: Mark Junker <mjscod@...>, Pedro Melo <melo@...>, git@vger.kernel.org <git@...>
Date: Friday, January 18, 2008 - 1:11 pm

Fuck me with a spoon.

Why the hell cannot people see that "equivalent" and "same" are two 

.. and this is relevant how? They are different strings. Not the same.

Equivalence doesn't matter. Equivalence is *evil*. Equivalence is what 
gives us case-insensitive filesystems ("because the names are 
equivalent").

Filesystems don't *want* equivalence. They want a much stronger exactness 
guarantee. Exactly because sometimes the differences matter.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Mark Junker <mjscod@...>, Pedro Melo <melo@...>, git@vger.kernel.org <git@...>
Date: Monday, January 21, 2008 - 10:14 am

It is relevant because the Mac OS file system stores file names as a
sequence of Unicode code points, in an (apparently slightly modified)
normalized form, whereas Git prefers to see file systems that store
file names as a sequence of octets, which may, or may not, actually map
to something that the user would call characters.

I happen to prefer the text-as-string-of-characters (or code points,
since you use the other meaning of characters in your posts), since I
come from the text world, having worked a lot on Unicode text
processing.

You apparently prefer the text-as-sequence-of-octets, which I tend to
dislike because I would have thought computer engineers would have
evolved beyond this when we left the 1900s.

But the real issue is that Git cannot use its filenames as strings of
octets on Mac OS X, since the file system doesn't handle it. So Git
needs to do something sensible. That's part of porting. Preferably
that would involve supporting real Unicode file names, which would also
work on Windows (through its UTF-16 file APIs), and in part on other
systems (through conversion to the systems' locale encoding).

-- 
\\// Peter - http://www.softwolves.pp.se/
-
To: Peter Karlsson <peter@...>
Cc: Mark Junker <mjscod@...>, Pedro Melo <melo@...>, git@vger.kernel.org <git@...>
Date: Monday, January 21, 2008 - 2:16 pm

No. The *only* issue is that git doesn't normalize.

You can think of git as a UTF-8 namespace all you want, and it will work 
together wonderfully with OS X. 


Some of us just know what we're doing, and have been working with UTF-8 
for a long time. It's not about sequence-of-octets, it's about not 
corrupting the data.

You think data should be changed behind people's backs, potentially causing 
corruption due to unintended conversions. And I don't.

You can call me "left behind in the 1900s", but that's apparently because 
you don't understand the issues. Data corruption wasn't something that 
magically became ok just because we switched into a new century.

			Linus
-
To: Peter Karlsson <peter@...>
Cc: Linus Torvalds <torvalds@...>, Mark Junker <mjscod@...>, Pedro Melo <melo@...>, git@vger.kernel.org <git@...>
Date: Monday, January 21, 2008 - 12:43 pm

I agree. Every single problem that I can recall Linus bringing up as a  
consequence of HFS+ treating filenames as strings is in fact only a  
problem if you then think of the filename as octets at some point. If  
you stick with UTF-8 equivalence comparison the entire time, then  
everything just works.

Granted, this is a problem when you have to operate on a filesystem  
that thinks of filenames as octets, but as I said before, this doesn't  
mean the HFS+ approach is wrong, it just means it's incompatible with  
Linus's approach.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
To: Kevin Ballard <kevin@...>
Cc: Peter Karlsson <peter@...>, Linus Torvalds <torvalds@...>, Mark Junker <mjscod@...>, Pedro Melo <melo@...>, git@vger.kernel.org <git@...>
Date: Monday, January 21, 2008 - 4:30 pm

At *some* point everything stored in computers is a sequence of octets.
In fact, the whole point of the Unicode standard is to define characters
and how to map each character to a unique number (code points) and then

There is more than one equivalence comparison. The Unicode standard
defines at least two, and for some other purpose you may want to use
some others, but for some reason you are trying to present working
with text as meaning following only one type of equivalence the entire
time...
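
The two equivalences defined by the standard, canonical and compatibility, can be told apart with a few lines of Python. This is a sketch using the standard `unicodedata` module; the two predicate names are hypothetical.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


ligature = "\ufb01le"  # starts with U+FB01 LATIN SMALL LIGATURE FI
plain = "file"

print(canonically_equivalent("\u00e4", "a\u0308"))
print(canonically_equivalent(ligature, plain))
print(compatibility_equivalent(ligature, plain))
```

Which of these (if any) a tool should apply is a policy choice, so "just compare by equivalence" still leaves the question of which equivalence.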

Dmitry
-
To: Kevin Ballard <kevin@...>
Cc: Peter Karlsson <peter@...>, Mark Junker <mjscod@...>, Pedro Melo <melo@...>, git@vger.kernel.org <git@...>
Date: Monday, January 21, 2008 - 2:12 pm

You say "I agree", BUT YOU DON'T EVEN SEEM TO UNDERSTAND WHAT IS GOING ON.

The fact is, text-as-string-of-codepoints (let's make the "codepoints" 
obvious, so that there is no ambiguity, but I'd also like to make it clear 
that a codepoint *is* how a Unicode character is defined, and a Unicode 
"string" is actually *defined* to be a sequence of codepoints, and totally 
independent of normalization!) is fine.

That was never the issue at all. Unicode codepoints are wonderful.

Now, git _also_ heavily depends on the actual encoding of those 
codepoints, since we create hashes etc, so in fact, as far as git is 
concerned, names have to be in some particular encoding to be hashed, and 
UTF-8 is the only sane encoding for Unicode. People can blather about 
UCS-2 and UTF-16 and UTF-32 all they want, but the fact is, UTF-8 is 
simply technically superior in so many ways that I don't even understand 
why anybody ever uses anything else.
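
The dependence on exact bytes can be seen by hashing the two spellings of one name. This is only a sketch: it hashes the bare name bytes with SHA-1 and is not a reproduction of git's actual object format.

```python
import hashlib

nfc_bytes = "M\u00e4rchen".encode("utf-8")   # precomposed spelling
nfd_bytes = "Ma\u0308rchen".encode("utf-8")  # decomposed spelling

digest_nfc = hashlib.sha1(nfc_bytes).hexdigest()
digest_nfd = hashlib.sha1(nfd_bytes).hexdigest()

print(digest_nfc)
print(digest_nfd)
print(digest_nfc == digest_nfd)
```

Any silent rewrite of a name between the two forms therefore changes every hash computed over it.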

So I would not disagree with using UTF-8 at all.

But that is *entirely* a separate issue from "normalization". 

Kevin, you seem to think that normalization is somehow forced on you by 
the "text-as-codepoints" decision, and that is SIMPLY NOT TRUE. 
Normalization is a totally separate decision, and it's a STUPID one, 
because it breaks so many of the _nice_ properties of using UTF-8.

And THAT is where we differ. It has nothing to do with "octets". It has 
nothing to do with not liking Unicode. It has nothing to do with 
"strings". 

In short:

 - normalization is by no means required or even a good feature. It's 
   something you do when you want to know if two strings are equivalent, 
   but that doesn't actually mean that you should keep the strings 
   normalized all the time!

 - normalization has *nothing* to do with "treating text as octets". 
   That's entirely an encoding issue.

 - of *course* git has to treat things as a binary stream at some point, 
   since you need that to even compute a SHA1 in the first place, but that 
...
To: Linus Torvalds <torvalds@...>
Cc: Kevin Ballard <kevin@...>, Peter Karlsson <peter@...>, Mark Junker <mjscod@...>, Pedro Melo <melo@...>, git@vger.kernel.org <git@...>
Date: Monday, January 21, 2008 - 4:36 pm

A code point is a unique numerical value assigned to every Unicode character.
Also, every Unicode character has a unique name assigned to it. There are
some other non-unique properties that every Unicode character has. So, to say
that a Unicode character is just a code point is not exactly correct, because
the code point is one of the properties of a Unicode character. But, yes, any
Unicode character can be identified by its code point. So, it is a one-to-one
relation.

Dmitry
-
To: Linus Torvalds <torvalds@...>
Cc: Kevin Ballard <kevin@...>, Peter Karlsson <peter@...>, Mark Junker <mjscod@...>, Pedro Melo <melo@...>, git@vger.kernel.org <git@...>
Subject: