On Jan 16, 2008, at 8:41 PM, Linus Torvalds wrote:
> On Thu, 17 Jan 2008, Johannes Schindelin wrote:
>> On Thu, 17 Jan 2008, Wincent Colaiuta wrote:
>>
>>> On 17/1/2008, at 1:40, Pedro Melo wrote:
>>>
>>>> That's the point I'm making. The fact that I need to set LANG across
>>>> all users of a project is insane...
>>
>> FWIW if you use another filesystem, such as reiserfs or ext[2-4], the
>> filenames will be _unaffected_ by your particular setting of LANG. They
>> will be stored byte-wise exactly like asked for. That's why I call them
>> "sane".
>
> One of the advantages (the biggest one, in fact, apart from the obvious
> US-ASCII down-compatibility and the fact that you can do C-compatible
> NUL-terminated strings) of UTF-8 is that it's locale-independent, and
> doesn't care about LANG, because it's valid in all languages.
>
> And that's really important. It's important for a very simple reason:
> there is almost never such a thing as "a locale" except for US-ASCII. Once
> you move away from US-ASCII, it actually tends to be much more common that
> you have a *mixture* of locales - often in the same "document" - than to
> have one single locale.
>
> It very much happens even in filenames - people "mix" locales in trivial
> ways even within a single pathname component (non-US-ASCII filename, but
> with a regular file extension), but much more interestingly they do so
> within a directory tree (ie you have translation subdirectories where
> the filenames themselves are in another language, and you can have full
> pathnames where different components are in different languages, for
> example).
>
> And UTF-8 is _wonderful_ for this, because LANG doesn't matter, and
> cannot matter, and thus mixing isn't a problem.
>
> Of course, you can screw it up. Locales still can change things like sort
> order and capitalization etc, so even if you use UTF-8, you sure can get
> into trouble with LANG and thinking that a per-session locale makes sense.
>
> So choosing UTF-8 for the filesystem isn't wrong per se. It's a fine
> choice, and has no issues with LANG in itself. Limiting it to strictly
> valid UTF-8 encodings is also fine. Limiting it (further) to only
> character normalized UTF-8 is also fine.
>
> Most Linux filesystems don't limit it in any way, so you can make
> filenames that aren't valid UTF-8 at all, much less normalizing
> multi-character sequences.
>
> I personally think that's the best option, but I probably do so mostly
> because I know some people still use Latin1 as their only locale (and I
> suspect Asia will take decades before it has converted to UTF-8 and will
> also have cases where they use other non-UTF locales).
>
> But enforcing clean UTF-8 is not a bad idea per se. Not allowing byte
> sequences that aren't a valid UTF-8 encoding (eg \xc0\xc0 is not a valid
> UTF-8 character) is fine.
>
> I wouldn't call people crazy for doing that, although it does mean that
> you cannot, for example, decide to write a Latin1 filename (which is not
> necessarily a *good* idea in this day and age, but I think there's a
> difference between "that's not a good idea" and "you cannot do that").
>
> And even limiting the UTF-8 charset further to only the minimal
> representation of one particular glyph (ie not allowing multi-character
> sequences that can be represented more simply) may be even *more*
> big-brother, but would at least not cause the technical aliasing issues. I
> personally think that's so controlling as to be stupid (and has no real
> advantage), but hey, at least it doesn't *corrupt* anything silently.
>
> So I think that using UTF-8 as a character encoding is a *good* thing to
> do, and that automatically means that LANG shouldn't matter for filenames,
> but within that choice of UTF-8 there are still mistakes that you can
> make. Notably multi-character normalization and case-insensitivity.
>
> Linus
Alright, you've made your point, and I'm willing to concede at least
some of what you've said. So perhaps we can now move on to the more
relevant and practical issue: HFS+, however stupid it may or
may not be, normalizes filenames (and is case-insensitive, which is a
related issue). This causes a problem with git. How can this be solved?
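
To make the aliasing problem concrete, here is a minimal Python sketch. It is
not git code, and it assumes standard Unicode NFD as a stand-in for the
decomposed form HFS+ stores; the filename and the helper `same_name` are
hypothetical, chosen only for illustration. It shows that one visible name
can have two different byte sequences, which a byte-wise comparison treats
as two different files.

```python
import unicodedata

# The same visible name in two Unicode forms (hypothetical example filename).
composed = "caf\u00e9.txt"     # 'é' as the single code point U+00E9
decomposed = "cafe\u0301.txt"  # 'e' followed by combining acute accent U+0301

composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# A byte-wise comparison sees two different names.
bytewise_equal = composed_bytes == decomposed_bytes

# Normalizing both sides to one form (NFD here, as an assumption about
# what a normalizing filesystem would hand back) makes them compare equal.
def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

normalized_equal = same_name(composed, decomposed)

print(bytewise_equal, normalized_equal)
```

If a tool records the composed bytes and the filesystem later reports the
decomposed ones, the two no longer match byte-for-byte, so any fix would
presumably need a comparison along the lines of `same_name` somewhere in
the lookup path.
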
I'm more than willing to do work to solve it. My biggest issue is that I
don't believe I have the free time to learn the git internals well
enough to do proper work on what I would assume is a fairly
performance-critical section of git's code. However, I would be happy
to work with others who are more knowledgeable in this area.
-Kevin Ballard
-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com