Re: What's cooking in git.git (Jul 2009, #01; Mon, 06)

From: Linus Torvalds
Date: Tuesday, July 7, 2009 - 12:17 pm

On Mon, 6 Jul 2009, Junio C Hamano wrote:

Hmm. I'm not sure what array you're talking about (the newpath/newbase 
ones? We do protect against PATH_MAX, it's just that we protect against it 
in the "previous iteration").

The bigger issue, though, is that I spent half a day looking more at this 
series last Thursday, and I've got some improvements, but getting "all the 
way" turns out to be really quite painful.

Why?

We have a _lot_ of code that does "lstat()" on pathnames, and it all 
basically uses the internal git representation of the pathname. In 
particular, we do this a lot for index lookups, but it's true in other 
cases too (example: things like tree merging, where we check whether a 
file exists in the working tree).

To test this all out, I actually fleshed out the patches to the point 
where I could do

	[core]
		PathEncoding = Latin1

and actually have the working tree use Latin1 encoding, and convert 
internally in git to UTF-8, and have a working "git add ."

However, "git add ." was just about the only thing that I made do the 
right thing. Even doing a simple "git diff" afterwards would then show the 
file as deleted, because the UTF-8 version of the file (that the index 
contained) didn't exist in the filesystem. I fixed that with a hack, but 
it basically turns out to be pretty damn ugly, and there's a _lot_ of 
those places.
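
A minimal sketch of that failure mode, assuming a working tree that stores names as Latin1 bytes while the index holds UTF-8. The directory is simulated with a set of byte strings, and `to_git` and `to_fs` are made-up helper names for illustration, not anything from the actual patches:

```python
# Hypothetical helpers; the real series converts inside git's C code.

def to_git(fs_name: bytes, fs_encoding: str = "latin-1") -> bytes:
    """Filesystem bytes -> internal UTF-8 bytes (on the way into the index)."""
    return fs_name.decode(fs_encoding).encode("utf-8")

def to_fs(git_name: bytes, fs_encoding: str = "latin-1") -> bytes:
    """Internal UTF-8 bytes -> filesystem bytes (needed before any lstat)."""
    return git_name.decode("utf-8").encode(fs_encoding)

# Simulated readdir() result on a Latin1 filesystem: the name "åäö".
on_disk = {b"\xe5\xe4\xf6"}

# "git add ." converts on the way in, so the index holds UTF-8 names.
index = {to_git(name) for name in on_disk}

for git_name in index:
    # Using the index name as-is misses the file ("shown as deleted") ...
    found_raw = git_name in on_disk
    # ... while converting back to the filesystem encoding finds it.
    found_converted = to_fs(git_name) in on_disk
    print(git_name, found_raw, found_converted)
```

Every code path that stats a pathname taken from the index would need the equivalent of the `to_fs` step, which is where the "_lot_ of those places" comes from.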

So, the question is, "What now?"

There are a few alternatives:

(a) don't do any of this crap at all. What git does right now works fairly
    well for most people. Instead, perhaps worry about just the crazy 
    case-insensitive filesystems, which are a totally separate issue.

    End result: git will always have problems with the crazy NFD format
    that OS X uses. Mixing git archives across OS X and other saner
    operating systems (and in this context, Windows really does count as
    "saner" - it really is OS X that is braindamaged!) will be painful if
    you have odd characters in your working tree.

    This is the simplest approach, of course. The case-insensitivity is 
    still not trivial, but we could work on it, and it really is a 
    different problem (and has none of the "if you look the file up with a 
    converted name, you cannot see it" issues that the Latin1<->UTF8 
    example had).

(b) Forget about the general case (like Latin1) that needs two-way 
    conversion. Just worry about OS X being crazy, and do the NFD->NFC 
    translation, which only needs to be done one way (because OS X will
    still accept and recognize NFC characters, so the "converted" path is 
    still seen as valid by 'lstat()' and friends).

    This is very much just a special case of handling filesystems that are 
    UTF-8, but are confused about what "equivalent" and "identical" means, 
    and where the filesystem designer was a moron on some seriously crazy 
    drugs, and thought that equivalence means identity, and thought that 
    NFD is a sane form to expose.

    This is a much simpler case than the general approach. I don't have OS 
    X to test with, though, and so far it hasn't appeared that any OS X 
    people really care enough to actually implement it. So I can fix up my 
    series to a certain point, but will never be able to really do the 
    final testing and tuning. At least with the full "treat filesystem as 
    Latin1 encoding", I could _test_ it.

(c) Try to bite the bullet. I can do this, but it really is going to be a 
    _very_ invasive patch-series, and it will probably involve some nasty 
    changes to the index format (for performance, we'll likely have to 
    change the index to have _both_ the "git filename", and the 
    "filesystem filename" in it).

    This was what I wanted to do, and it's what you'd need to do if you do 
    things like Latin1 filesystem trees or ones where pathnames are done 
    with shift-JIS encoding or if we want to actually use the (crazy) 
    native Windows UCS filesystem accessors or whatever.

    But I have to admit that after looking at the pain, I'm not at all 
    convinced it's worth it. Do we ever want to say "git supports 
    filesystems with shift-JIS encoding"? Do people really care deeply 
    enough about non-utf filesystems that they'd be willing to live with a 
    _lot_ of pretty nasty complexity, and some real performance overhead?
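
To make option (c) concrete, here is a rough sketch of what "both names in the index" could look like. It only illustrates the idea and is not the git index format: `IndexEntry`, `make_entry` and `path_for_lstat` are hypothetical names, and the encodings are just the ones mentioned above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexEntry:
    git_name: bytes  # UTF-8 name, as recorded in trees and shown to the user
    fs_name: bytes   # same name in the working tree's own encoding

def make_entry(fs_name: bytes, fs_encoding: str) -> IndexEntry:
    """Convert once, when the entry is added, and remember both forms."""
    git_name = fs_name.decode(fs_encoding).encode("utf-8")
    return IndexEntry(git_name=git_name, fs_name=fs_name)

# One Latin1 name ("åäö") and one Shift-JIS name ("テスト").
entries = [
    make_entry(b"\xe5\xe4\xf6", "latin-1"),
    make_entry("\u30c6\u30b9\u30c8".encode("shift_jis"), "shift_jis"),
]
index = {entry.git_name: entry for entry in entries}

def path_for_lstat(git_name: bytes) -> bytes:
    """Look up the filesystem name without re-encoding on every call."""
    return index[git_name].fs_name

print(path_for_lstat("\u00e5\u00e4\u00f6".encode("utf-8")))
```

Storing `fs_name` trades index size for not having to re-convert on every stat call, which is the performance concern raised in (c).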

I have to say, even with plain UTF-8, git isn't really a pleasure to use. 
While I did my Latin1 test, I used filenames like "åäö" (the three extra 
Finnish/Swedish characters), and if you do this

	mkdir test-repo
	cd test-repo
	git init
	echo testfile > åäö
	git add .
	git ls-files

the end result is not really usable. We quote it to a binary 
mess, rather than showing "åäö". Our pathname quoting is trying to be 
safe, which is good, but it does mean that right now, odd characters 
aren't very friendly even _if_ you are using a sane filesystem, and all 
plain NFC UTF-8.
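
For reference, the "binary mess" is C-style quoting with octal escapes for every byte outside printable ASCII. The sketch below is a simplified approximation of that behavior, not git's actual quoting code (control characters get plain octal escapes here), and `quote_path` is a name made up for the example:

```python
def quote_path(name: bytes) -> str:
    """Simplified C-style quoting: escape anything outside printable ASCII."""
    out = []
    needs_quotes = False
    for byte in name:
        if byte in (0x22, 0x5C):        # double quote or backslash
            out.append("\\" + chr(byte))
            needs_quotes = True
        elif 0x20 <= byte < 0x7F:       # printable ASCII passes through
            out.append(chr(byte))
        else:                           # everything else becomes \ooo
            out.append("\\%03o" % byte)
            needs_quotes = True
    text = "".join(out)
    return '"' + text + '"' if needs_quotes else text

# "åäö" as NFC UTF-8 is six bytes, so the listing shows six escapes.
print(quote_path("\u00e5\u00e4\u00f6".encode("utf-8")))
# prints "\303\245\303\244\303\266"
```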

So right now, my personal opinion is:

 - let's just face the fact that the only sane filename representation is 
   NFC UTF-8. Show filenames as UTF-8 when possible, rather than quoting 
   them.

 - Do case (b) above: add support for converting NFD -> NFC at readdir() 
   time, so that OS X people can use UTF-8 sanely. 

 - add a "binary encoding" mode for filesystems that actually use Latin1, 
   just so that if people use Latin1 or Shift-JIS filesystem encodings, we 
   promise that we'll never munge those kinds of names. 

 - Maybe we'd make the "binary encoding" (which is effectively existing 
   git behavior) be the default on non-OSX platforms.

but that's just my gut feel from trying to weigh the costs of trying to do 
something more involved against the costs of OS X support and just letting 
crazy encodings exist in their own little worlds. So a development group 
that uses Shift-JIS (or Latin1) would be able to work internally with git 
that way, but would not be able to sanely work with the world at large 
that uses UTF-8.
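
A small sketch of how the NFD -> NFC step and the "binary encoding" mode could fit together at readdir() time. The function name `normalize_dirent` and the `mode` values are assumptions made for this example. Leaving non-UTF-8 names untouched in the normalizing mode is also just one possible choice:

```python
import unicodedata

def normalize_dirent(name: bytes, mode: str = "nfc") -> bytes:
    """Return the name git would record for one readdir() result."""
    if mode == "binary":
        return name                     # never munge: existing git behavior
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return name                     # not UTF-8 at all, leave it alone
    return unicodedata.normalize("NFC", text).encode("utf-8")

nfc = "\u00e5\u00e4\u00f6".encode("utf-8")                       # "åäö", precomposed
nfd = unicodedata.normalize("NFD", "\u00e5\u00e4\u00f6").encode("utf-8")

print(len(nfc), len(nfd))                   # prints 6 9
print(normalize_dirent(nfd) == nfc)         # prints True
print(normalize_dirent(nfd, "binary") == nfd)   # prints True
```

In real use the input would come from something like `os.listdir(b".")`, which returns raw byte names. The conversion only has to go one way because, as noted in (b), OS X still accepts the NFC form when the name is handed back to lstat().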

		Linus