On Jan 21, 2008, at 1:12 PM, Linus Torvalds wrote:
I could say the same thing about you.
I'm not saying it's forced on you; I'm saying that when you treat
filenames as text, it DOESN'T MATTER if the string gets normalized. As
long as the string remains equivalent, YOU DON'T CARE about the
underlying byte stream.
Alright, fine. I'm not saying HFS+ is right in storing the normalized
version, but I do believe the authors of HFS+ must have had a reason
to do that, and I also believe that it shouldn't make any difference
to me since it remains equivalent.
Sure it does. Normalizing a string produces an equivalent string, and
so unless I look at the octets, the two strings are, for all intents
and purposes, the same.
You're right, but it doesn't have to treat it as a binary stream at
the level I care about. I mean, no matter what you do, at some level
the string is evaluated as a binary stream. For our purposes, just
redefine the hashing algorithm to hash all equivalent strings the
same, and you can implement that by using SHA1 on a particular
encoding of the string.
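As a rough sketch of what I mean, in Python: normalize to one canonical
form, encode it, and hash the result. The choice of NFD and UTF-8 here,
and the name filename_hash, are assumptions made for illustration, not
anything git actually does.

```python
import hashlib
import unicodedata

def filename_hash(name: str) -> str:
    """Hash a filename so that canonically equivalent strings hash the same."""
    # Pick one canonical form (NFD, an arbitrary choice) and one encoding (UTF-8),
    # then run SHA-1 over the resulting bytes.
    canonical = unicodedata.normalize("NFD", name).encode("utf-8")
    return hashlib.sha1(canonical).hexdigest()

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# Same hash, even though the raw byte sequences differ.
print(filename_hash(precomposed) == filename_hash(decomposed))    # True
print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))  # False
```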
Decomposing and recomposing shouldn't lose any information we care
about - when treating filenames as text, a<COMBINING DIAERESIS> and
<A WITH DIAERESIS> are equivalent, and thus no distinction is made
between them. I'm not sure what other information you might be
considering lost in this case.
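To make that concrete, here is a small Python sketch using the standard
unicodedata module. The helper name equivalent is my own, introduced
only for illustration. It shows the two spellings converting into each
other under NFD and NFC, with only the underlying bytes differing.

```python
import unicodedata

def equivalent(a: str, b: str) -> bool:
    """True if two strings are canonically equivalent (compared in NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e4"   # <A WITH DIAERESIS>
decomposed = "a\u0308"   # a<COMBINING DIAERESIS>

# Decomposing one form yields the other, and recomposing goes back again.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# As text they are equivalent; only the byte sequences differ.
print(equivalent(precomposed, decomposed))                      # True
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))  # b'\xc3\xa4' b'a\xcc\x88'
```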
I don't believe you. See below.
When have I ever said that Unicode meant Forced normalization?
Wrong.
Wrong. '\x61\xa8' in Latin1, when converted to UTF-8 (NFD) is still
'\x61\xc2\xa8'. You're mixing up DIAERESIS (U+00A8) and COMBINING
DIAERESIS (U+0308).
I suspect this is why you've been yelling so much - you have a
fundamental misunderstanding about what normalization is actually doing.
See above as to why you're not losing the information you so fervently
believe you are.
People who insult others run the risk of looking like fools when
shown to be wrong.
Sure, it all depends on what level you need to evaluate text. If we're
talking about English paragraphs, then whitespace can be messed with.
When we're talking about Unicode strings, then the specific encoding
can be messed with. When talking about byte sequences, nothing can be
messed with.
In our case, when working on an HFS+ filesystem all you have to care
about is the Unicode string level. The specific encoding can be messed
with, and the client shouldn't care. Problems only arise when
attempting to interoperate with filesystems that work at the byte
sequence level.
The only information you lose when doing canonical normalization is
what the original byte sequence was. Sure, this is a problem when
working on a filesystem that cares about byte sequence, but it's not a
problem when working on a filesystem that cares about the Unicode
string.
-Kevin Ballard
-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com