No, no, no, NO!
Egads! Why do people constantly do these totally idiotic things for
Unicode?
You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all
source code projects in UTF-8, without *ever* doing any format changes at
all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie
we don't care about byte-order or size of integers or anything like that),
but that's the common case for most hashes that aren't used for BTree
lookup on disk or something like that.
Here, let me show you:
unsigned int name_hash(const char *name, int size)
{
hash = HASH_INIT;
do {
unsigned char c;
if (size >= sizeof(long)) {
unsigned long val = get_unaligned_long(name);
if (!(val & 0x8080808080808080)) {
/* Make it equivalent in case */
val &= ~0x2020202020202020;
hash = hash_long(hash, val);
name += sizeof(long);
size -= sizeof(long);
continue;
}
}
c = *name;
if (!(c & 0x80)) {
hash = hash_long(hash, c & ~0x20);
name++;
size--;
continue;
}
/* This is the unusual and slowish case */
hash = hash_utf8_char(hash, c, &name, &size);
} while (size);
return hassh;
}
and then the only point you ever do that actual UTF8->codepoint conversion
is for that "high bit set" case.
A few things to note on the above:
- the hash obviously has "odd" characteristics. We're not necessarily
hashing characters at a time at all, and the alignment of characters
with high bits *within*the*string* will make a difference to how we
hash them.
But it's also important that the "get_unaligned_long()" means that the
alignment of the string itself doesn't matter, so its' purely a
"chunking within the string" thing, and the alignment of the string
itself won't affect the hash value
- I'm not writing out hash_utf8_char(), because it's certainly not
totally trivial, but it's not *really* complex either. The
nontriviality isn't so much the decoding into a codepoint (which is
pretty simple), but the fact that when you have the codepoint you
should then decompose it and turn it into lower case, which is
generally two table lookups. Then, you just do
for_each_decomposed_uppercased_codepoint(c)
hash = hash_long(hash, c);
and one thing to note is that for the hashing, the decomposition and
uppercasing doesn't even have to be "exact" (the same way I didn't do
an "exact" upper-casing for US-ASCII, just a "good enough" one!)
Similarly, when you actually do a unicode *compare* function, you should
never *ever* actually convert to any unicode codepoints or do any
expensive decomposition AT ALL by default! What you do is to compare
things byte-by-byte, and only convert to unicode/decompse if there are any
differences, and only for those parts of the sequence that differ!
So if you have two UTF-8 strings (even if they have "complex" characters,
ie with the high bit set), the *common* case is that you'll match them
byte for byte, and they'll match without any Unicode conversion needed at
all! This is common because:
- normally, even if you don't ever normalize, people tend to input things
in a *similar* manner (again, OS X is the odd man out), so even
non-normalized strings are often non-normalized the same way!
- we're only going to compare things that have hashed to the same thing
anyway, so the common case is that it's the same string, and most
likely had the same source. And if it's a collision, it's often totally
different. And even if it's different only in case, the *common* case
is going to be (for source code trees, at least) that the different
point is a US-ASCII letter, and the case-comparison will again be done
without any Unicode knowledge at all!
This is why normalization of strings before-hand is generally so utterly
stupid. It doesn't buy you anything. It complicates things a lot (you
don't want to normalize in-place, so you have memory management issues),
and it actually SLOWS THINGS DOWN.
It's much better to do UTF-8 comparisons and hashing char-by-char. At
least if you know that the common case is not going to be the complex part
of the character set (which is undoubtedly true for source code
filenames).
Normalizing things ahead of time *only* makes sense if:
- you expect complex characters to be a big part of your load
and
- you're going to do literally *thousands* of comparisons against the
*same* strings over and over (so that the cost of normalization is
literally up-front)
For example, for a filesystem, it's true that you're going to compare
against the *target* (on-disk) multiple times, but that doesn't actually
mean that normalizing it makes any sense - because the data you're going
to compare against comes from user space and isn't guaranteed to be
normalized, so you still cannot do a simple memcmp() without the expense
of normalizing that.
And since you're going to hash the filenames anyway, you will not have
"thousands of comparisons" per source lookup, you'll generally only have a
couple, so now your normalization actually cost you *more* than doing the
above on-the-fly comparison!
See?
Basically, it's almost always a stupid thing to actually normalize a whole
string. You do those things character-by-character, and only lazily when
you actually need to!
Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html