Re: I'm a total push-over..

Previous thread: [PATCH] git-clone -s: document problems with git gc --prune by Miklos Vajna on Tuesday, January 22, 2008 - 2:03 pm. (3 messages)

Next thread: Re: git on MacOSX and files with decomposed utf-8 file names by Theodore Tso on Tuesday, January 22, 2008 - 5:08 pm. (30 messages)
From: Linus Torvalds
Date: Tuesday, January 22, 2008 - 4:37 pm

Ok, here's an interesting patch based on the current 'next' (since it very 
intimately requires the new in-memory index format).

What it does is to create a hash index of every single file added to the 
index. Right now that hash index isn't actually used for much: I 
implemented a "cache_name_exists()" function that uses it to efficiently 
look up a filename in the index without having to do the O(logn) binary 
search, but quite frankly, that's not why this patch is interesting.

No, the whole and only reason to create the hash of the filenames in the 
index is that by modifying the hash function, you can fairly easily do 
things like making it always hash equivalent names into the same bucket. 

That, in turn, means that suddenly questions like "does this name exists 
in the index under an _equivalent_ name?" becomes much much cheaper.

Guiding principles behind this patch:

 - it shouldn't be too costly. In fact, my primary goal here was to 
   actually speed up "git commit" with a fully populated kernel tree, by 
   being faster at checking whether a file already existed in the index. I 
   did succeed, but only barely:

	Best before:
		[torvalds@woody linux]$ time git commit > /dev/null
		real    0m0.255s
		user    0m0.168s
		sys     0m0.088s

	Best after:

		[torvalds@woody linux]$ time ~/git/git commit > /dev/null
		real    0m0.233s
		user    0m0.144s
		sys     0m0.088s

   so some things are actually faster (~8%).

   Caveat: that's really the best case. Other things are invariably going 
   to be slightly slower, since we populate that index cache, and quite 
   frankly, few things really use it to look things up. 

   That said, the cost is really quite small. The worst case is probably 
   doing a "git ls-files", which will do very little except puopulate the 
   index, and never actually looks anything up in it, just lists it.

	Before:
		[torvalds@woody linux]$ time git ls-files > /dev/null
		real    0m0.016s
		user    0m0.016s
		sys     ...
From: Kevin Ballard
Date: Tuesday, January 22, 2008 - 6:35 pm

This is fantastic. Thank you very much for actually taking this issue  
seriously despite the mess I made on the list. This is exactly why I  
wanted to discuss on the lists instead of hacking away myself - there  
are very smart people on the list (like you) that already know how git  
works that can come up with ideas like this while I would still be  
trying to figure out where the index code is even stored.

-Kevin Ballard

-- 
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com


From: Junio C Hamano
Date: Tuesday, January 22, 2008 - 7:23 pm

This is nice.  It does not do anything specific with HFS+ issues
but aims for faster look-ups, which would help everybody.

Two things I noticed (only two, not necessarily because you are
good but mostly because I am still mired in day job and could
not get enough uninterrupted minutes to read the patch ;-)):

 - You might want to store the hash table (once computed) in the
   index extension section, and lazily unpack the table the
   first time index_name_exists() or set_index_entry() is called
   on the given istate, instead of unpacking it immediately when
   you read from the disk.  That way, ls-files does not have to
   suffer at all.

 - You would need to get rid of the table in discard_index().

-

From: Junio C Hamano
Date: Tuesday, January 22, 2008 - 7:36 pm

Actually, I take one fourth of this back.  

 * I am not yet retracting the suggestion to do the hashing
   lazily (what I mean by "lazily" is that the first access that
   wants hashed access will iterate active_cache and hash them
   all, not "lazily one entry at a time as needed" which would
   not make any sense for a hashtable).  I have to find time to
   try benching the effect of it myself.  So that's one half
   retained.

 * We certainly do not necessarily want to store this in the
   index right now.  The hash algorithms would be improved from
   the version you are almost ashamed of ;-).  That sounds as if
   I am retrating the other half, but not quite.

 * Once we have a mechanism to detect that the extension section
   stores a precomputed hash that was done with a different
   algorithm and ignore it (and recompute afresh when needed),
   then we can afford to put a more elaborate hashing algorithm,
   slightly loosening one of your "Guiding principles", and keep
   the result in the generated index to be reused by the next
   user.  So that is why I am retracting only half of the
   suggestion to save it in the extension section (which in turn
   is a half of my suggestion).
-

From: Johannes Schindelin
Date: Wednesday, January 23, 2008 - 5:24 am

Hi,


Both issues (and the config variable issue Linus raised) are easily helped 
with: store not only the hashmap in the extension, but also an identifier 
for the hash method used.

Then you can improve on the hash function all you like, and add the config 
variable dependent choice of the hashing everybody seems to want all of a 
sudden, as long as you change the method identifier, too.

Ciao,
Dscho

-

From: David Kastrup
Date: Wednesday, January 23, 2008 - 5:28 am

For a reasonably implemented hash algorithm, computing the hash should
be cheaper than reading it from disk.

So storing precomputed hashes is not worth the trouble.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-

From: Theodore Tso
Date: Wednesday, January 23, 2008 - 5:56 am

Yes, but if on Mac OS systems when the git repository is stored on an
HFS+ system, the hash algorithm gets changed to one which forces
Unicode strings which HFS+ happens to "normalize" into the same hash
bucket pre- and post- normalization, it might not be cheap any
more....

       	  	      	    	   	     - Ted
-

From: Linus Torvalds
Date: Tuesday, January 22, 2008 - 7:58 pm

I really hate that. 

Basically, I dislike having two copies of the same data. If something can 
be computed from something else, then only the original data should exist, 
and the other thing should be recomputed. Otherwise you easily get into 
situations where you spend a lot of time maintaining the other copy, or 
worse, you have inconsistent data and it's really subtle what is going on.

Also, one of the ideas behind the index is that would depend on your 
notion of what is "equivalent", which is actually somehing fairly fluid. 
In fact, it's likely going to depend on a config option.

So encoding the indexing on disk, when it can change when you do a simple 
"git config", or even just depending on your LANG environment variable, 
seems like a singularly bad idea, even if it wasn't for the coherence.

I did consider doing the indexing only on demand, and we can certainly 
simply just "turn it off" when we know it's never going to get used (ie 
"git ls-files"). So in that sense, it's easy to get rid of the overhead, 
but it didn't really seem like the conceptual complexity (even if it's 
just a couple of lines) is really worth it. It's not like git ls-files is 

Now this, of course, is obviously true.

And the patch to do that is very simple too. No need to walk any chains, 
since the "free(istate->alloc);" will release all the pre-allocated 
cache_entry structures, and the rest are (necessarily) leaked anyway.

[ Side note for non-Junios: the leaking of cache_entry structures isn't 
  new, we've always done it, and it's even done on purpose. The common 
  case is that there is one *big* allocation (istate->alloc) that contains 
  all the original cache entries.

  There are usually none, or only a very few individual allocations, and 
  we don't even keep track of them. With the new in-memory format, we 
  could make a special flag that does "is this cache-entry an individual 
  allocation or not" (or we could even just see if they are inside the 
  "alloc" ...
From: Linus Torvalds
Date: Tuesday, January 22, 2008 - 8:19 pm

Ok, I pushed the squashed/fixed commit to the "new-lstat" branch. It's now 
based on your 'next' branch, and I should have renamed it, since it's not 
about any of the old lstat() optimizations any more. Whatever.

But here it is also as a full patch, with a fixed up subject line etc. You 
tend to want to have them as topic-branches, and it's probably easier this 
way.

		Linus

---
From ca98bbc7b0cc1d9d088a8c6ae80e733115a1f775 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 22 Jan 2008 18:41:14 -0800
Subject: [PATCH] Create pathname-based hash-table lookup into index

This creates a hash index of every single file added to the index.
Right now that hash index isn't actually used for much: I implemented a
"cache_name_exists()" function that uses it to efficiently look up a
filename in the index without having to do the O(logn) binary search,
but quite frankly, that's not why this patch is interesting.

No, the whole and only reason to create the hash of the filenames in the
index is that by modifying the hash function, you can fairly easily do
things like making it always hash equivalent names into the same bucket.

That, in turn, means that suddenly questions like "does this name exist
in the index under an _equivalent_ name?" becomes much much cheaper.

Guiding principles behind this patch:

 - it shouldn't be too costly. In fact, my primary goal here was to
   actually speed up "git commit" with a fully populated kernel tree, by
   being faster at checking whether a file already existed in the index. I
   did succeed, but only barely:

	Best before:
		[torvalds@woody linux]$ time git commit > /dev/null
		real    0m0.255s
		user    0m0.168s
		sys     0m0.088s

	Best after:

		[torvalds@woody linux]$ time ~/git/git commit > /dev/null
		real    0m0.233s
		user    0m0.144s
		sys     0m0.088s

   so some things are actually faster (~8%).

   Caveat: that's really the best case. Other things are invariably going
   ...
From: Junio C Hamano
Date: Thursday, January 24, 2008 - 11:50 pm

I wonder if the issue Dave Miller addressed with
69ae517541ed5ab7d4fdcd8f82a9b8bd949df347 (fast-import: fix
unalinged allocation and access) applies here.

commit 69ae517541ed5ab7d4fdcd8f82a9b8bd949df347
Author: David S. Miller <davem@davemloft.net>
Date:   Fri Dec 14 20:39:16 2007 -0800

    fast-import: fix unalinged allocation and access
    
    The specialized pool allocator fast-import uses aligned objects on the
    size of a pointer, which was not sufficient at least on Sparc.  Instead,
    make the alignment for objects of type unitmax_t.
    
    Signed-off-by: David S. Miller <davem@davemloft.net>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>

-

From: Linus Torvalds
Date: Friday, January 25, 2008 - 9:24 am

Good point, although we actually do things wrong for *another* reason.

We currently force cache_entry to be 8-byte aligned regardless of what the 
actual "sizeof(ptr)" is, so we should assume that alignment:

	#define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7)

and if that isn't correct, we'd need to change this #define.

So right now, the right thing to do is probably to make this alignment 
explicit:

	#define CE_ALIGN 8

and then use that both in the "cache_entry_size()" _and_ in the 
"estimate_cache_size()" calculations to make it obvious what the alignment 
is.

And then we could actually make the alignment less on architectures that 
don't need that much (there may be architectures that need more, but I 
doubt it: we don't have any large fields in that structure, so the 
structure alignment really probably does max out at 8 in practice even if 
the C language theory doesn't give you any such guarantees).

Side note: this is not likely to be a problem in _practice_. The on-disk 
representation is also aligned (also by 8), and while they can be 
*differently* aligned due to the relative alignment of the varying-length 
"name[]" field, and that can cause some padding to be needed, in practice 
it will never matter. The on-disk size also contains a header that we 
don't take into account, so it's already "over-estimated" to begin with 
for the in-memory representation.

So "estimate_cache_size()" really does over-estimate its needs by a 
biggish amount, which is why it all works regardless, but better safe than 
sorry. 

		Linus
-

From: Junio C Hamano
Date: Wednesday, January 23, 2008 - 12:23 am

Yes, I agree with that in principle. Storing computable values
makes sense only when it is expensive to recompute.  We did not
have cache-tree for quite a long time until you noticed that it
was rather expensive and wasteful to recompute tree objects from
unchanged parts of the index every time.

It's the same argument; when the hashing performance starts to
become noticeable, we can think about storing and reusing it,

Yes, ls-files is cheap.  So is lstat(2) on Linux.  It only
matters when you do it many many times.

In any case, the change does not look too bad.  The best time
(real) of running git-ls-files in the kernel repository on my
box is 0.010s vs 0.011s (10% improvement, heh!, which is the
same as the master version) and empty commit is both 0.082s (no
change).

-- >8 --
[PATCH] lazy index hashing

This delays the hashing of index names until it becomes necessary for
the first time.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h      |    1 +
 read-cache.c |   26 +++++++++++++++++++++++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index 409738c..e4aeff0 100644
--- a/cache.h
+++ b/cache.h
@@ -191,6 +191,7 @@ struct index_state {
 	struct cache_tree *cache_tree;
 	time_t timestamp;
 	void *alloc;
+	unsigned name_hash_initialized : 1;
 	struct hash_table name_hash;
 };
 
diff --git a/read-cache.c b/read-cache.c
index 9477c0b..e45f4b3 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -34,12 +34,11 @@ static unsigned int hash_name(const char *name, int namelen)
 	return hash;
 }
 
-static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 {
 	void **pos;
 	unsigned int hash = hash_name(ce->name, ce_namelen(ce));
 
-	istate->cache[nr] = ce;
 	pos = insert_hash(hash, ce, &istate->name_hash);
 	if (pos) {
 		ce->next = *pos;
@@ -47,6 +46,24 @@ static void set_index_entry(struct ...
From: Johannes Schindelin
Date: Wednesday, January 23, 2008 - 5:25 am

Hi,


I fully expect it to be noticable with that UTF-8 "normalisation".  But 
then, the infrastructure is there, and whoever has an itch to scratch...

Ciao,
Dscho

-

From: Linus Torvalds
Date: Wednesday, January 23, 2008 - 9:25 am

Actually, it's going to be totally invisible even with UTF-8 
normalization, because we're going to do it sanely.

And by "sanely" I mean just having the code test the high bit, and using 
US-ASCII as-is (possibly with that " & ~0x20 " thing to ignore case in 
it).

End result: practically all projects will never notice anything at all for 
99.9% of all files. One extra well-predicted branch, and a few more hash 
collissions for cases where you have both "Makefile" and "makefile" etc.

Doing names with *lots* of UTF-8 characters will be rather slower. It's 
still not horrible to do if you do it the smart way, though. In fact, it's 
pretty simple, just a few table lookups (one to find the NFD form, one to 
do the upcasing).

And yes, for hashing, it makes sense to turn things into NFD because it's 
generally simpler, but the point is that you really don't actually modify 
the name itself at all, you just hash things (or compare things) character 
by expanded character.

IOW, only a total *moron* does Unicode name comparisons with

	strcmp(convert_to_nfd(a), convert_to_nfd(b));

which is essentially what Apple does. It's quite possible to do

	utf8_nfd_strcmp(a,b)

and (a) do it tons and tons faster and (b) never have to modify the 
strings themselves. Same goes (even more) for hashing.

			Linus
-

From: Johannes Schindelin
Date: Wednesday, January 23, 2008 - 9:34 am

Hi,


Well, that's the point, to avoid having both "Makefile" and "makefile" in 

Heh, indeed that is what I would have done as an initial step (out of 

Okay.  Point taken.

But I really hope that you are not proposing to use the case-ignoring 
hash when we are _not_ on a case-challenged filesystem...

Ciao,
Dscho

-

From: Linus Torvalds
Date: Wednesday, January 23, 2008 - 10:09 am

Right. But what I'm saying is that this is *really* cheap to test for for 
US-ASCII-only characters, and if only 0.1% of all filenames have unicode 
in them, the fact that they are much mroe expensive isn't even going to be 

Note that one reason the above is tons faster is that even with complex 
unicode, the *common* case is going to be that the names match with a 

I actually suspect that we could, and nobody will notice. The hash would 
cause a few more collissions, but not so you'd know.

And the thing is, people who work with other people who are on 
case-challenged systems would still want to have the case-insenstive 
compare too - although it should just warn, not actually "work".

			Linus
-

From: Linus Torvalds
Date: Wednesday, January 23, 2008 - 10:29 am

To clarify: the thing I want to point out that the decision to *hash* the 
filenames in a case-insensitive hash, is very different from the decision 
to then *compare* the filenames when traversing the hash with a 
case-insensitive compare.

And this difference is actually very important. Hashing things together 
that are "equivalent" according to any random rule is what makes it 
possible to then *check* for equivalence cheaply (because you only need to 
make the potentially expensive check with the subset of cases where it 
might trigger), but it in no way forces you to actually recode or mangle 
or compare things equivalently.

In fact, I'd argue that this is what HFS+ did wrong in the first place: 
they had stupid/incompetent people who didn't understand about this, so 
they normalized the string *before* the hashing rather than as part of the 
hash itself, and thus actually corrupt the string itself.

So what you can do (and I'd argue that we do) is to have a hash that can 
handle almost arbitrary input, but then never corrupt the filename, and 
always compare exactly by default.

Then, depending on a config option, we can decide to change the compare so 
that equivalent (according to whatever rule) filenames either cause a 
warning (people on sane filesystems, but working with people who aren't), 
or are silently considered the same file (people on insane filesystems).

			Linus
-

From: Jeremy Maitin-Shepard
Date: Thursday, January 24, 2008 - 10:21 pm

Linus Torvalds wrote:

In general, there may be a large number of comparison function options
that git will eventually support, and they will likely not all form a
single chain of increasing "strictness".

Given that the hash values aren't even being stored on disk (and if they
were, a simple approach of also storing an identifier for the hash
function to know whether they stored values are still valid could be
used), having a chain of increasingly "strict" comparison functions and
using a hash function that corresponds to the least strict one is useful
for exactly one reason: giving (possibly several different levels of)
non-fatal warnings for various types of duplicates.

But since multiple hash functions will be needed anyway to support
different notions of case-insensitivity, if the warning is not enabled,
there is no reason to use a case-insensitive hash function with a
byte-exact comparison.

-- 
Jeremy Maitin-Shepard
-

From: Johannes Schindelin
Date: Friday, January 25, 2008 - 5:51 am

Hi,


No, only multiple compare functions will be needed.  The hash function can 
be built in such a manner that it guarantees that file names being equal 
with _any_ of the compare functions fall into the same bucket.

The upside of such a hash function: less code to maintain.

Hth,
Dscho

-

From: Jeremy Maitin-Shepard
Date: Friday, January 25, 2008 - 11:19 am

In theory, I agree that this is possible, but in practice it may not be
reasonable at all.  Consider two possible comparison functions:

1. compare file names as strings case-insensitively assuming a latin 1
encoding

2. compare file names as strings case-insensitively assuming a UTF-8
encoding

Actually writing a hash function such that two strings hash to the same
value if either of these comparison functions says that the strings are

A simple hash function that doesn't try to do anything regarding
case-insensitivity is extremely short and simple and therefore is hardly
a maintenance burden.

Although in some cases it is possible to "share" a hash function, except
for the "warning" purpose, actually doing so doesn't make much sense.
Using the "case-insensitive" hash function when you intend to use an
"exact" comparison function just amounts to using a hash function that
is unequivocally worse: it is slower, more complicated, and has a higher
collision rate.

-- 
Jeremy Maitin-Shepard
-

From: Johannes Schindelin
Date: Friday, January 25, 2008 - 11:24 am

Hi,


You misunderstand me.  If the complicated hash function is the one that is 
less exercised, you _will_ face problems.

OTOH if you _already_ need the "complicated" hash function, there is 
_little_ point not to use it, and be consistent between platforms, 
_especially_ since now all people eat the same dog food.

So I never thought about the simple hash function as being a burden.

Hth,
Dscho

-

From: Junio C Hamano
Date: Friday, January 25, 2008 - 12:07 pm

Once you start adding more "case folding" supported filesystems
to the repertoire, such a unified hash function Dscho suggests
needs to throw paths that other (N-1) "case folding" filesystems
treat as distinct but only 1 filesystem treats "equivalent" into
the same hash bucket.  I would say not just difficult but the
resulting function would have too many collisions to make it
ineffective.

-

From: Andreas Ericsson
Date: Wednesday, January 23, 2008 - 1:32 am

Insofar as hashes go, it's not that shabby for hashing filenames.
Here's the test output from a small hash-comparison program I've
got, which runs the test-input through a series of different hashes
to compare dispersion, collisions, lookup- and insert times and
other things that are interesting from a practical PoV.

Fowler/Noll/Vo cheap hash
Collisions: 1829, 7.91%. Depth max: 3, average: 0.09.
Time spent in lookups: 167.859ms. Per entry: 0.145us
Time spent inserting: 2.753ms. Per entry: 0.119us

PHP Zend cheap hash
Collisions: 1908, 8.25%. Depth max: 3, average: 0.09.
Time spent in lookups: 171.819ms. Per entry: 0.149us
Time spent inserting: 2.778ms. Per entry: 0.120us

Phong Vo's linear congruential hash
Collisions: 1996, 8.63%. Depth max: 3, average: 0.09.
Time spent in lookups: 168.276ms. Per entry: 0.146us
Time spent inserting: 2.840ms. Per entry: 0.123us

Phong Vo's second linear congruential hash
Collisions: 1933, 8.36%. Depth max: 3, average: 0.09.
Time spent in lookups: 170.416ms. Per entry: 0.147us
Time spent inserting: 2.774ms. Per entry: 0.120us

Glib string hash
Collisions: 1907, 8.24%. Depth max: 3, average: 0.09.
Time spent in lookups: 192.420ms. Per entry: 0.166us
Time spent inserting: 3.154ms. Per entry: 0.136us

sdbm hash
Collisions: 1899, 8.21%. Depth max: 3, average: 0.09.
Time spent in lookups: 170.797ms. Per entry: 0.148us
Time spent inserting: 2.724ms. Per entry: 0.118us

bfd hash
Collisions: 1949, 8.43%. Depth max: 3, average: 0.09.
Time spent in lookups: 206.504ms. Per entry: 0.179us
Time spent inserting: 3.241ms. Per entry: 0.140us

Linus' GIT hash
Collisions: 1946, 8.41%. Depth max: 4, average: 0.09.
Time spent in lookups: 179.336ms. Per entry: 0.155us
Time spent inserting: 2.781ms. Per entry: 0.120us

This is with 64K buckets, which (with my implementation) means
a total hash-table size of 256KiB. The test-input is a simple
file-listing from the linux kernel (although it has the .git
directory included too).

As you can see, ...
From: Dmitry Potapov
Date: Wednesday, January 23, 2008 - 2:15 am

Actually, Bob Jenkins' lookup3 hash is twice faster in my tests
than FNV, and also it is much less likely to have any collision.

The description and some comparision with other hash can be found here:
http://burtleburtle.net/bob/hash/doobs.html
http://burtleburtle.net/bob/c/lookup3.c

Perhaps, the second choice is Paul Hsieh's hash.
http://www.azillionmonkeys.com/qed/hash.html

Note: Paul Hsieh provides the table where he compares his hash
with others. There is also the program he used. I ran his program
on my computer, advantage of his over others was not so big on
my computer. Moreover, his test includes an old version of Bob
Jenkins' hash. The new version -- lookup3, which I mentione above,
has about the same speed as Paul Hsieh's hash (with -O2) or even
12% faster when I used -O3 -march=athlon-xp.

Also, Bob Jenkins' hash is better for non-x86 architectures. So,
I believe it is the best hash for today.

Dmitry
-

From: Andreas Ericsson
Date: Wednesday, January 23, 2008 - 2:31 am

---
FNV Hash

I need to fill this in. Search the web for FNV hash. It's faster than my hash on Intel (because Intel has fast multiplication), but slower on most other platforms. Preliminary tests suggested it has decent distributions. 
---

My tests ran on Intel. I also noticed I had a few hashes commented out when
doing the test, one of them being Paul Hsie's. For some reason, Jenkin's and
Hsie's didn't perform well for me last time I used the comparison thing (I
did a more thorough job back then, with tests running for several minutes
per hash and table-size, so I commented out the poor candidates).

I still believe that for this very simple case, the lookup3.c case is not
very practical, as the code is that much more complicated, which was my
main point with posting the comparison. Iow, not "switch to this hash,
because it's better", but rather "the hash is not as bad as you think and
will probably work well for all practical purposes".

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

-

From: Marko Kreen
Date: Wednesday, January 23, 2008 - 7:01 am

I suspect that this paragraph was about comparison with lookup2
(not lookup3) because lookup3 beat easily all the "simple" hashes
in my testing.  Only competitor was Hsieh one which was like 50:50


If you don't mind few percent speed penalty compared to Jenkings
own optimized version, you can use my simplified version:

  http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c3457...

It works always with "native" endianess, unlike Jenkins fixed-endian
hashlittle() / hashbig().  It may or may not matter if you plan
to write values on disk.

Speed-wise it may be 10-30% slower worst case (in my case sparc-classic
with unaligned data), but on x86, lucky gcc version and maybe
also memcpy() hack seen in system.h, it tends to be ~10% faster,
especially as it does always 4byte read in main loop.

-- 
marko
-

From: Andreas Ericsson
Date: Wednesday, January 23, 2008 - 7:39 am

It might be. It's from the link Dmitry posted in his reply to my original

By how much? FNV beat Linus' hash by 0.01 microseconds / insertion,
and 0.1 microsecons / lookup. We're talking about a case here where
there will never be more lookups than insertions (unless I'm much

I don't, but I don't care that deeply either. On the one hand,
it would be nifty to have an excellent hash-function in git.
On the other hand, it would look stupid with something that's

It would have to be a significant improvement in wall-clock time
on a test-case of hashing 30k strings to warrant going from 6 to 80
lines of code, imo. I still believe the original dumb hash Linus
wrote is "good enough".

On a side-note, it was very interesting reading, and I shall have
to add jenkins3_mkreen() to my test-suite (although the "keep
copyright note" license thing bugs me a bit).

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

-

From: Luke Lu
Date: Wednesday, January 23, 2008 - 11:51 pm

Would you, for completeness' sake, please add Tcl and STL hashes to  
your test suite? The numbers are quite interesting. Is your test  
suite available somewhere, so we can test with our own data and  
hardware as well. Both Tcl hash and STL (from SGI probably HP days,  
still the current default with g++) string hashes are extremely  
simple (excluding the loop constructs):

Tcl: h += (h<<3) + c; 	// essentially *9+c (but work better on non- 
late-intels)
STL: h = h * 5 + c;	// worse than above for most of my data

Thanks,

__Luke

-

From: Andreas Ericsson
Date: Thursday, January 24, 2008 - 3:24 am

I could do that. Or I just publish the entire ugly thing and let someone

Not yet, no. I usually munge it up quite a lot when I want to test hashes

They sure do look simple enough. As for loop constructs, I've tried to
use the same looping mechanics for everything, so as to let the algorithm
be the only difference. Otherwise it gets tricky to do comparisons. The
exceptions are ofcourse hashes relying on Duff's device or similar
alignment trickery.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-

From: Marko Kreen
Date: Thursday, January 24, 2008 - 6:19 am

FNV is around 40% slower than lookup3 on my Intel Core CPU, on 4byte aligned

Jenkins hash is fast because it does not look at individual bytes.
If you _do_ want to look at them for unrelated reasons, (case-insensitive,
unicode-juggling), then it obiously loses the point.  That is, if you

Well, ad-hoc dumb hashes may have horrible worst-cases that you cant
see with light testing.  Therefore I'd still suggest some better

Sorry.  I just used template boilerplate.  Considering all the
hard work was done by other people, it not proper to put under
my own license.  I tagged the file as 'public domain' and pushed out.

Btw, the reason I started cleaning lookup3 was that at first I was
scared of the complexity of Jenkins code and decided to go with
Hsieh hash.  Then I found out that Hsieh code is under totally
werdo license (http://www.azillionmonkeys.com/qed/weblicense.html)
so I could not use it.

====================================================================

Here is my raw-speed test of different hashes.  Input is 4-byte
aligned which should be common case for malloc()-ed strings.
This also is best case for original lookup3(), on unaligned
input the memcpy variants beat it easily.  Input string
length varies randomly in range 0..64.

own_memcpy - last 12-byte memcpy() calls out to libc
memcpy_hack - last memcpy is inlined bytewise copy loop:

  while (len--) *dst++ = *src++;

Note that is is raw-speed test, if you benchmark larger code the
hash difference probably matters less.

--------------------------------------------------------------------

Testing: seed=34 align=4 minlen=0 maxlen=64 trycnt=2 duration=10

lookup3 : try=0: ... 247.4880 MB/s
lookup3 : try=1: ... 247.6154 MB/s
own_memcpy: try=0: ... 223.5508 MB/s
own_memcpy: try=1: ... 223.5830 MB/s
memcpy_hack: try=0: ... 241.2241 MB/s
memcpy_hack: try=1: ... 241.2492 MB/s
lookup2 : try=0: ... 190.2697 MB/s
lookup2 : try=1: ... 190.3283 MB/s
fnv     : try=0: ... 153.0318 MB/s
fnv     : try=1: ... ...
From: Andreas Ericsson
Date: Thursday, January 24, 2008 - 9:00 am

But the tests surely need to check for unaligned cases, as that's what

I believe the ability to add unicode-juggling was a major point
with the patch, so perhaps Jenkins' isn't such a good option.

I'm not familiar with data-mangling the way Linus (or Theo Tso

True. FNV is used in both MySQL and PostgreSQL. I'd say it's safe to

Thanks. I'll see if I can add it, although it'll probably have to
wait until I have reason to dig into it at work again. I've added
the hash to the code-base, but not yet incorporated it into the

True. I was in contact with him a while back since I wanted to use it
in an opensource project, but the licensing issues made me go with
another one instead. The patch got turned down anyways, so it was a

Unless arena allocated, like we do in git.

I'm not surprised that this test favours Jenkin's and Hsie's.
That's to be expected as those benefit far more than simpler
hashing algorithms for long strings. The overhead when trying
shorter strings (say, between 3 and 15 chars, and not necessarily

Well, memcpy() isn't very interesting to compare against
hashes, as they test vastly different parts of the hardware's
parts' performance. memcpy() should also perform exactly the
same no matter what the test-data, which isn't always true for
hashes.

What *would* be interesting would be something along the lines
of "duff_cpy()": ie, an unrolled loop that aligns itself and
copies each byte to the same address each time.

The bytewise equivalence would ofcourse be

magic_cpy(unsigned char *k, int len)
{
	unsigned char magic;
	do {
		magic = *k++;
	} while (--len);

Interesting output, but not very surprising. Do you have the code
available somewhere?

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-

From: Marko Kreen
Date: Thursday, January 24, 2008 - 9:13 am

Ok, here is 0..15 chars, random alignment:

Testing: seed=34 align=0 minlen=0 maxlen=15 trycnt=2 duration=10

lookup3 : try=0: ...  69.8092 MB/s
lookup3 : try=1: ...  69.8146 MB/s
own_memcpy: try=0: ...  66.7808 MB/s
own_memcpy: try=1: ...  66.7814 MB/s
memcpy_hack: try=0: ...  74.0635 MB/s
memcpy_hack: try=1: ...  74.0518 MB/s
lookup2 : try=0: ...  68.6582 MB/s
lookup2 : try=1: ...  68.6634 MB/s
fnv     : try=0: ...  74.5098 MB/s
fnv     : try=1: ...  74.5283 MB/s
hsieh   : try=0: ...  71.6708 MB/s
hsieh   : try=1: ...  71.6814 MB/s
oat     : try=0: ...  74.7828 MB/s
oat     : try=1: ...  74.7716 MB/s
elf     : try=0: ...  65.2077 MB/s
elf     : try=1: ...  65.2128 MB/s

Results compared to reference:

lookup3         : 100.000 %
own_memcpy      :  95.659 %
memcpy_hack     : 106.082 %
lookup2         :  98.351 %
fnv             : 106.743 %
hsieh           : 102.670 %
oat             : 107.112 %



I can put it out.

-- 
marko
-

From: Dmitry Potapov
Date: Thursday, January 24, 2008 - 9:28 am

I don't think you can any meaningful unicode-juggling without converting
symbols to UCS-4, and after that it makes much more sense to operate
with uint32 than bytes. So, Jenkins' hash is still relevant, just because
it does not operate on single bytes, but using uint32.

Dmitry
-

From: Linus Torvalds
Date: Thursday, January 24, 2008 - 10:15 am

No, no, no, NO!

Egads! Why do people constantly do these totally idiotic things for 
Unicode?

You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all 
source code projects in UTF-8, without *ever* doing any format changes at 
all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie 
we don't care about byte-order or size of integers or anything like that), 
but that's the common case for most hashes that aren't used for BTree 
lookup on disk or something like that.

Here, let me show you:

	unsigned int name_hash(const char *name, int size)
	{
		hash = HASH_INIT;
		do {
			unsigned char c;
			if (size >= sizeof(long)) {
				unsigned long val = get_unaligned_long(name);
				if (!(val & 0x8080808080808080)) {
					/* Make it equivalent in case */
					val &= ~0x2020202020202020;
					hash = hash_long(hash, val);
					name += sizeof(long);
					size -= sizeof(long);
					continue;
				}
			}

			c = *name;
			if (!(c & 0x80)) {
				hash = hash_long(hash, c & ~0x20);
				name++;
				size--;
				continue;
			}

			/* This is the unusual and slowish case */
			hash = hash_utf8_char(hash, c, &name, &size);
		} while (size);
		return hassh;
	}

and then the only point you ever do that actual UTF8->codepoint conversion 
is for that "high bit set" case.

A few things to note on the above:

 - the hash obviously has "odd" characteristics. We're not necessarily 
   hashing characters at a time at all, and the alignment of characters 
   with high bits *within*the*string* will make a difference to how we 
   hash them.

   But it's also important that the "get_unaligned_long()" means that the 
   alignment of the string itself doesn't matter, so its' purely a 
   "chunking within the string" thing, and the alignment of the string 
   itself won't affect the hash value

 - I'm not writing out hash_utf8_char(), because it's certainly not 
   totally trivial, but it's not *really* complex either. The 
   nontriviality ...
From: Dmitry Potapov
Date: Thursday, January 24, 2008 - 11:45 am

It is better to use 'while' instead of 'if' here, i.e.:

			while (!((c = *name) & 0x80)) {
				hash = hash_long(hash, c & ~0x20);
				name++;
				if (!--size)
					return hash;


Dmitry
-

From: Linus Torvalds
Date: Thursday, January 24, 2008 - 12:08 pm

Yes, that looks like a good further micro-optimization.

		Linus
-

From: Marko Kreen
Date: Friday, January 25, 2008 - 1:52 pm

Well, although this is very clever approach, I suggest against it.
You'll end up with complex code that gives out substandard results.

I think its better to have separate case-folding function (or several),
that copies string to temp buffer and then run proper optimized hash
function on that buffer.

That way you can use already tested building blocks and can optimize
both sides separately.  Eg. the folding-only function can  aswell be
optimized to load 4 or 8-byte at-a-time.  This also isolates hashing
from exact details how folding happens to access the input string which
seem to be the weak point in your approach.  (In both collision and
complexity sense.)

Such temp buffer happens to fits my lookup3_memcpy also better (heh).
Its weak point is that on platforms that do not allow unaligned access,
it degenerates to byte-by-byte loading.  But if know you always
have aligned buffer, you can notify gcc to do 4-byte fetch there too.
It should be as simple as tagging data pointer as uint32_t *.


If your input strings are over kilobyte on average then I'd
agree with you, but if you process 20-30 bytes on average,
is the additional complexity worth it?

-- 
marko
-

From: Linus Torvalds
Date: Friday, January 25, 2008 - 3:16 pm

I'm sorry, but you just cannot do that efficiently and portably.

I can write a hash function that reliably does 8 bytes at a time for the 
common case on a 64-bit architecture, exactly because it's easy to do 
"test high bits in parallel" with a simple bitwise 'and', and we can do 
the same with "approximate lower-to-uppercase 8 bytes at a time" for a 
hash by just clearing bit 5.

In contrast, trying to do the same thing in half-way portable C, but being 
limited to having to get the case-folding *exactly* right (which you need 
for the comparison function) is much much harder. It's basically 
impossible in portable C (it's doable with architecture-specific features, 
ie vector extensions that have per-byte compares etc).

And hashing is performance-critical, much more so than the compares (ie 
you're likely to have to hash tens of thousands of files, while you will 
only compare a couple). So it really is worth optimizing for.

And the thing is, "performance" isn't a secondary feature. It's also not 
something you can add later by optimizing. 

It's also a mindset issue. Quite frankly, people who do this by "convert 
to some folded/normalized form, then do the operation" will generally make 
much more fundamental mistakes. Once you get into the mindset of "let's 
pass a corrupted strign around", you are in trouble. You start thinking 
that the corrupted string isn't really "corrupt", it's in an "optimized 
format". 

And it's all downhill from there. Don't do it.

			Linus
-

From: Linus Torvalds
Date: Friday, January 25, 2008 - 3:35 pm

Side note: you *can* get better approximations fairly cheaply if you care.

If you want to distinguish the characters 0-31 from the characters 31-63 
in your hash (pointless for filenames, but it can be worthwhile for some 
other string cases), you can decide to clear bit#5 only if bit#6 in that 
byte was also set, with just a few bitwise operations.

Eg, imagine that you have "unsigned long x" containing eight bytes of 
ascii data (ie you already did the test by 0x8080808080808080), you can do 
things like

	unsigned long bit6 = x & 0x4040404040404040;
	x &= ~(bit6 >> 1);

which will only clear bit5 if bit6 in the same byte was set..

So you can do tricks like that, and it will still be plenty fast. And 
nobody will ever care that while it collides 'A' with 'a' (by design), it 
also causes '{' and '[' to be considered "case collisions".

[ Amusing side note: '{' and '[' *are* case collisions in legacy 7-bit 
  "Finnish ASCII". The editor I use still "upper-cases" '{' to '['. I'm 
  not kidding, and yes, it really does it on purpose!

  It used to be that before everybody turned to Latin1, the {|} characters 
  were re-used in Finland (and Sweden, for that matter) for the 
  extra characters needed in Finnish. Because obviously nobody ever
  needed them for any real work.

  I (and probably every Finnish C UNIX programmer) used to be very good at 
  reading C source code even when it was full of odd finnish characters 
  with dots on top, instead of curly braces! ]

And yes, from a performance standpoint, things liek this probably do realy 
matter. For the kernel tree, the average pathname length is ~28 
characters. If you can do it with three iterations that do the first 24 
characters eight characters at a time, and then four iterations over the 
four last ones, rather than 28 iterations with byte->longword and 
multiplications in each, I bet it's quite visible.

Of course, it's going to be visible only if everything else is fast too, 
but git has been pretty ...
From: Marko Kreen
Date: Saturday, January 26, 2008 - 5:16 am

Here you misunderstood me, I was proposing following:

int hash_folded(const char *str, int len)
{
   char buf[512];
   do_folding(buf, str, len);
   return do_hash(buf, len);
}

That is - the folded string should stay internal to hash function.

Only difference from combined foling+hashing would be that

Againg, you seem to keep HFS+ behaviour in mind, but that was
not what I did suggest.  Probably my mistake, sorry.

-- 
marko
-

From: Linus Torvalds
Date: Saturday, January 26, 2008 - 11:51 pm

If it's internal, it's much better, but you still missed the performance 
angle.

The fact is, hashing can take shortcuts that folding cannot do!

Case folding, by definition, has to be "exact" (since the whole point is 
what you're going to use the same folding function to do the compare, so 
if you play games with folding, the compares will be wrong).

But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if 
they were different cases of the same character, if that gives you a 
faster hash function. Especially as those charactes are rather rare in 
filenames.

So if you do hashing as a function of its own, you can simply do a better 
job at it.

I do agree that the functions that create a folded set of characters from 
a _complex_ UTF-8 character should be shared between folding and hashing, 
since that code is too complex and there are no simple shortcuts for doing 
a faster hash that still retains all the properties we want. 

			Linus
-

From: Dmitry Potapov
Date: Sunday, January 27, 2008 - 1:21 am

Let's rename do_folding as something else, because it is not a real
folding, but a preparation step for hash calculation. Keeping these
steps separately simplifies the code, and allows further optimization,
for instance, you do not need this do_folding step on a case-sensitive
filesystem. Though it is certainly possible to mix both steps together,
it bloats the code and makes it less readable. Of course, the idea to
avoid a temporary buffer and do everything at once is very appealing,
so I gave it a try -- and here is a 32-bit version of name_hash(), but
I am not very happy with the result:


#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))

#define mix(a,b,c) \
{ \
	a -= c;  a ^= rot(c, 4);  c += b; \
	b -= a;  b ^= rot(a, 6);  a += c; \
	c -= b;  c ^= rot(b, 8);  b += a; \
	a -= c;  a ^= rot(c,16);  c += b; \
	b -= a;  b ^= rot(a,19);  a += c; \
	c -= b;  c ^= rot(b, 4);  b += a; \
}
#define final(a,b,c) \
{ \
	c ^= b; c -= rot(b,14); \
	a ^= c; a -= rot(c,11); \
	b ^= a; b -= rot(a,25); \
	c ^= b; c -= rot(b,16); \
	a ^= c; a -= rot(c,4);  \
	b ^= a; b -= rot(a,14); \
	c ^= b; c -= rot(b,24); \
}

#define hash_value(x) \
	hs[hp] += (x); \
	if (++hp == 3) { \
		mix (hs[0], hs[1], hs[2]); \
		hp = 0; \
	}
unsigned int name_hash(const char *name, unsigned size)
{
	unsigned hp = 0;
	unsigned hs[3];
	hs[0] = hs[1] = hs[2] = 0xdeadbeef + size;

	do {
		unsigned char c;
		if (size >= sizeof(unsigned)) {
			unsigned val = get_unaligned_uint(name);
			if (!(val & 0x80808080)) {
				val &= ~0x20202020;
				hash_value(val);
				name += sizeof(val);
				size -= sizeof(val);
				continue;
			}
		}

		while (!((c = *name) & 0x80)) {
			hash_value(c & ~0x20);
			name++;
			if (!--size)
				goto done:
		}

		do {
			// TODO: add denormalization for Mac
			unsigned val = towupper (utf8_to_wchar(&name, &size));
			hash_value(val);
		} while (size && (*name & 0x80));

	} while (size);
done:
	if (hp)
		final(a,b,c);
	return hs[2];
}


Dmitry
-

From: Johannes Schindelin
Date: Sunday, January 27, 2008 - 7:07 am

Hi,


<irony>Oh yes, let's take this one, it is so much shorter, cleaner and 
overall more elegant than Linus' code.</irony>

Ciao,
Dscho

-

From: Dmitry Potapov
Date: Sunday, January 27, 2008 - 7:48 am

Hi,


I am not sure what you meant by that, and what exactly code you meant
saying Linus' code....

Anyway, my point was that mixing both steps together does not look very
nice, and the code was intended to demonstrate why.

Dmitry
-

From: Marko Kreen
Date: Sunday, January 27, 2008 - 2:45 am

Well, you can always have fold_quick_and_dirty() function that
is used only internally in hash_folded() function, which can:

- fold with simple |= 0x20202020..
- write out full uint32/64, no need to make result proper string
- zero-fill at the end, so hash function does not need to check
  for partial block, which is pretty expensive part of hashing.

The win would be:
- more modularized code
- can use faster/any hash
- hash function can be certain to work on aligned data
  (win on non-x86)

The minus:
- some memory i/o overhead which may or may not matter
- the parts would not be fully generic, but special to hashing


-- 
marko


PS. Typo in last mail - "inner loop should be reversible - that
means that details from beginning of data should shift out of
horizon."  That obviously means "data should _not_ shift
out of horizon.

btw, "reversible" for integer hashes means that there is 1:1
mapping between input and output - no collisions.  Thus
no info loss.
-

From: Dmitry Potapov
Date: Sunday, January 27, 2008 - 8:06 am

If a string is short, it will probably reside in the processor cache, so
there is no real memory i/o overhead here. For more longer strings, it
may be better to do that in short chunks, so each chunk can reside in
the cache. But I don't think filenames are long, so it is not an issue

I think the second part can be rather generic to be reused for hashing
something else that does not require filename specific case-folding.

Dmitry
-

From: Marko Kreen
Date: Saturday, January 26, 2008 - 5:37 am

Ok, you seem to focus on case folding and general performance,
I focus on hash quality and code complexity.  Considering you
may want several folding methods, it seemed to me that it would
be good to separate the two aspects.

But I'll try to follow your path for a moment.

Hashing 32 or 64 bits at a time is not trivial, eg. you cannot
use same algorithm for both cases, 64 bits requires twice the
work to mix well.

Per Jenkins notes, hash inner loop should be reversible - that
means that details from beginning of data should shift out of horizon.
But final mixing should achieve avalanche - that means each bit in
input should affect 50% bits in output.

Also must be noted that we are mixing 32->32 and 64->64 instead
of the usual 8->32.  So it seems that the best bet would be to use
integer hash functions as core.  Jenkins himself points to Thomas
Wang (http://www.cris.com/~Ttwang/tech/inthash.htm) who has good
integer mix functions for both 32 and 64 bits.

Integer hash function must have both reversibility and avalance,
so they may be slightly overkill for this purpose, but fixing that
means lot of work.

So here is what I propose for hashing, if you really want to have
combined folding + hashing:

/* Thomas Wang integer hash functions */

static inline uint32_t hash32(uint32_t key)
{
        key = ~key + (key << 15);
        key = key ^ (key >> 12);
        key = key + (key << 2);
        key = key ^ (key >> 4);
        key = key * 2057;
        key = key ^ (key >> 16);
        return key;
}

static inline uint64_t hash64(uint64_t key)
{
        key = (~key) + (key << 21); // key = (key << 21) - key - 1;
        key = key ^ (key >> 24);
        key = (key + (key << 3)) + (key << 8); // key * 265
        key = key ^ (key >> 14);
        key = (key + (key << 2)) + (key << 4); // key * 21
        key = key ^ (key >> 28);
        key = key + (key << 31);
        return key;
}

/*
 * Simple addition should be enough for new values,
 * considering the mix ...
From: Marko Kreen
Date: Friday, January 25, 2008 - 1:08 pm

http://pgbouncer.projects.postgresql.org/hashtest/hashtest-2008-01-25.tgz

I cleaned it up a bit.  Also fixed a bug - the lookup3 was called
via wrapper function so it acually is tiny bit faster than the
results show.

You can consider the code as public domain too.  Also remember
the was not meant to be published, so it rather hack...

-- 
marko
-

From: Dmitry Potapov
Date: Wednesday, January 23, 2008 - 10:10 am

I believe that under words "my hash", Bob Jenkins meant lookup2, which


I expected that Paul Hsieh's hash may not do well on some architecture,

I would not describe lookup3 as impractical. It is widely used and well
tested. Perhaps, for some Intel CPUs, the difference in speed is not so
big, and FNV hash is much smaller and simpler, so FNV is a reasonable
choice, but the hash is twice slower on my AMD processor and I suspect
it may be even worse on other CPUs, where integer multiplication is slow.
Besides, it may turn out that hashing filename may be not only case where
a fast hash is needed.

Dmitry
-

From: Andreas Ericsson
Date: Thursday, January 24, 2008 - 3:39 am

From /proc/cpuinfo. It's the best I can do without going to our purchase
department and asking for the spec so I can contact the vendor and get
the real thing. Dualcore shouldn't matter for this test, as it isn't
threaded.

It doesn't do that well on certain types of data, in my experience. It does
have excellent dispersion, so with very long strings it's usually the

Ah well. I think once the patch is in master, it will be easy enough to
test and verify different algorithms. Since it's intended for in-memory
data only, it's no problem to have several algorithms and pick the one
most suitable for the architecture we're compiling for.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-

From: Linus Torvalds
Date: Wednesday, January 23, 2008 - 9:06 am

Hashing filenames is pretty easy. You can do a reasonable job with any 
"multiply by an odd number, add in value". Picking the odd number is a 
random choice, some are better than others (and it depends on whether you 
end up always having a power-of-two bucket size etc), but it generally 
won't be horrible.

And considering that we generally won't have tons and tons of pathnames 
(ie we'd generally have thousands, not millions), and that the underlying 
hash not only resizes itself but actually uses the ful 32 bits as the 
lookup key, I don't worry too much. I suspect my random choice is fine.

So no, I didn't think my hash would _suck_, although I also didn't 
research which odd numbers to pick.

No, it's a joke because it doesn't really give an example of how to do the 
_expected_ hash collissions.

Here's another hash that is actually going to collide *much* more (and on 
purpose!), but is actually showing an example of something that actually 
hashes UTF strings so that upper-case and lower-case (and normalization) 
will all still hash to the same value:

	static unsigned int hash_name(const char *name, int namelen)
	{
	        unsigned int hash = 0x123;

	        do {
	                unsigned char c = *name++;
			if (c & 0x80)
				c = 0;
			c &= ~0x20;
	                hash = hash*101 + c;
	        } while (--namelen);
	        return hash;
	}

but the above does so by making the hash much much worse (although 
probably still acceptable for "normal source code name distributions" 
that don't have very many same-name-in-different-cases and high bit 
characters anyway).

The above is still fairly fast, but obviously at a serious cost in hash 
goodness, to the point of being totally unusable for anybody who uses 
Asian characters in their filenames.

To actually be really useful, you'd have to teach it about the character 
system and do a lookup into a case/normalization table.

So *that* is mainly why it's a joke. But it should work fine for the case ...
Previous thread: [PATCH] git-clone -s: document problems with git gc --prune by Miklos Vajna on Tuesday, January 22, 2008 - 2:03 pm. (3 messages)

Next thread: Re: git on MacOSX and files with decomposed utf-8 file names by Theodore Tso on Tuesday, January 22, 2008 - 5:08 pm. (30 messages)