I'm a total push-over..

Previous thread: [PATCH] git-clone -s: document problems with git gc --prune by Miklos Vajna on Tuesday, January 22, 2008 - 5:03 pm. (3 messages)

Next thread: Re: git on MacOSX and files with decomposed utf-8 file names by Theodore Tso on Tuesday, January 22, 2008 - 8:08 pm. (30 messages)
To: Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Tuesday, January 22, 2008 - 7:37 pm

Ok, here's an interesting patch based on the current 'next' (since it very
intimately requires the new in-memory index format).

What it does is to create a hash index of every single file added to the
index. Right now that hash index isn't actually used for much: I
implemented a "cache_name_exists()" function that uses it to efficiently
look up a filename in the index without having to do the O(logn) binary
search, but quite frankly, that's not why this patch is interesting.

No, the whole and only reason to create the hash of the filenames in the
index is that by modifying the hash function, you can fairly easily do
things like making it always hash equivalent names into the same bucket.

That, in turn, means that suddenly questions like "does this name exists
in the index under an _equivalent_ name?" becomes much much cheaper.

Guiding principles behind this patch:

- it shouldn't be too costly. In fact, my primary goal here was to
actually speed up "git commit" with a fully populated kernel tree, by
being faster at checking whether a file already existed in the index. I
did succeed, but only barely:

Best before:
[torvalds@woody linux]$ time git commit > /dev/null
real 0m0.255s
user 0m0.168s
sys 0m0.088s

Best after:

[torvalds@woody linux]$ time ~/git/git commit > /dev/null
real 0m0.233s
user 0m0.144s
sys 0m0.088s

so some things are actually faster (~8%).

Caveat: that's really the best case. Other things are invariably going
to be slightly slower, since we populate that index cache, and quite
frankly, few things really use it to look things up.

That said, the cost is really quite small. The worst case is probably
doing a "git ls-files", which will do very little except puopulate the
index, and never actually looks anything up in it, just lists it.

Before:
[torvalds@woody linux]$ time git ls-files > /dev/null
real 0m0.016s
user 0m0.016s
sys 0...

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 4:32 am

Insofar as hashes go, it's not that shabby for hashing filenames.
Here's the test output from a small hash-comparison program I've
got, which runs the test-input through a series of different hashes
to compare dispersion, collisions, lookup- and insert times and
other things that are interesting from a practical PoV.

Fowler/Noll/Vo cheap hash
Collisions: 1829, 7.91%. Depth max: 3, average: 0.09.
Time spent in lookups: 167.859ms. Per entry: 0.145us
Time spent inserting: 2.753ms. Per entry: 0.119us

PHP Zend cheap hash
Collisions: 1908, 8.25%. Depth max: 3, average: 0.09.
Time spent in lookups: 171.819ms. Per entry: 0.149us
Time spent inserting: 2.778ms. Per entry: 0.120us

Phong Vo's linear congruential hash
Collisions: 1996, 8.63%. Depth max: 3, average: 0.09.
Time spent in lookups: 168.276ms. Per entry: 0.146us
Time spent inserting: 2.840ms. Per entry: 0.123us

Phong Vo's second linear congruential hash
Collisions: 1933, 8.36%. Depth max: 3, average: 0.09.
Time spent in lookups: 170.416ms. Per entry: 0.147us
Time spent inserting: 2.774ms. Per entry: 0.120us

Glib string hash
Collisions: 1907, 8.24%. Depth max: 3, average: 0.09.
Time spent in lookups: 192.420ms. Per entry: 0.166us
Time spent inserting: 3.154ms. Per entry: 0.136us

sdbm hash
Collisions: 1899, 8.21%. Depth max: 3, average: 0.09.
Time spent in lookups: 170.797ms. Per entry: 0.148us
Time spent inserting: 2.724ms. Per entry: 0.118us

bfd hash
Collisions: 1949, 8.43%. Depth max: 3, average: 0.09.
Time spent in lookups: 206.504ms. Per entry: 0.179us
Time spent inserting: 3.241ms. Per entry: 0.140us

Linus' GIT hash
Collisions: 1946, 8.41%. Depth max: 4, average: 0.09.
Time spent in lookups: 179.336ms. Per entry: 0.155us
Time spent inserting: 2.781ms. Per entry: 0.120us

This is with 64K buckets, which (with my implementation) means
a total hash-table size of 256KiB. The test-input is a simple
file-listing from the linux kernel (although it has the .git
directory included too).

As you can see, it l...

To: Andreas Ericsson <ae@...>
Cc: Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 12:06 pm

Hashing filenames is pretty easy. You can do a reasonable job with any
"multiply by an odd number, add in value". Picking the odd number is a
random choice, some are better than others (and it depends on whether you
end up always having a power-of-two bucket size etc), but it generally
won't be horrible.

And considering that we generally won't have tons and tons of pathnames
(ie we'd generally have thousands, not millions), and that the underlying
hash not only resizes itself but actually uses the ful 32 bits as the
lookup key, I don't worry too much. I suspect my random choice is fine.

So no, I didn't think my hash would _suck_, although I also didn't
research which odd numbers to pick.

No, it's a joke because it doesn't really give an example of how to do the
_expected_ hash collissions.

Here's another hash that is actually going to collide *much* more (and on
purpose!), but is actually showing an example of something that actually
hashes UTF strings so that upper-case and lower-case (and normalization)
will all still hash to the same value:

static unsigned int hash_name(const char *name, int namelen)
{
unsigned int hash = 0x123;

do {
unsigned char c = *name++;
if (c & 0x80)
c = 0;
c &= ~0x20;
hash = hash*101 + c;
} while (--namelen);
return hash;
}

but the above does so by making the hash much much worse (although
probably still acceptable for "normal source code name distributions"
that don't have very many same-name-in-different-cases and high bit
characters anyway).

The above is still fairly fast, but obviously at a serious cost in hash
goodness, to the point of being totally unusable for anybody who uses
Asian characters in their filenames.

To actually be really useful, you'd have to teach it about the character
system and do a lookup into a case/normalization table.

So *that* is mainly why it's a joke. But it should work fine for the c...

To: Andreas Ericsson <ae@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 5:15 am

Actually, Bob Jenkins' lookup3 hash is twice faster in my tests
than FNV, and also it is much less likely to have any collision.

The description and some comparision with other hash can be found here:
http://burtleburtle.net/bob/hash/doobs.html
http://burtleburtle.net/bob/c/lookup3.c

Perhaps, the second choice is Paul Hsieh's hash.
http://www.azillionmonkeys.com/qed/hash.html

Note: Paul Hsieh provides the table where he compares his hash
with others. There is also the program he used. I ran his program
on my computer, advantage of his over others was not so big on
my computer. Moreover, his test includes an old version of Bob
Jenkins' hash. The new version -- lookup3, which I mentione above,
has about the same speed as Paul Hsieh's hash (with -O2) or even
12% faster when I used -O3 -march=athlon-xp.

Also, Bob Jenkins' hash is better for non-x86 architectures. So,
I believe it is the best hash for today.

Dmitry
-

To: Dmitry Potapov <dpotapov@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 5:31 am

---
FNV Hash

I need to fill this in. Search the web for FNV hash. It's faster than my hash on Intel (because Intel has fast multiplication), but slower on most other platforms. Preliminary tests suggested it has decent distributions.
---

My tests ran on Intel. I also noticed I had a few hashes commented out when
doing the test, one of them being Paul Hsie's. For some reason, Jenkin's and
Hsie's didn't perform well for me last time I used the comparison thing (I
did a more thorough job back then, with tests running for several minutes
per hash and table-size, so I commented out the poor candidates).

I still believe that for this very simple case, the lookup3.c case is not
very practical, as the code is that much more complicated, which was my
main point with posting the comparison. Iow, not "switch to this hash,
because it's better", but rather "the hash is not as bad as you think and
will probably work well for all practical purposes".

--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

-

To: Andreas Ericsson <ae@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 1:10 pm

I believe that under words "my hash", Bob Jenkins meant lookup2, which

I expected that Paul Hsieh's hash may not do well on some architecture,

I would not describe lookup3 as impractical. It is widely used and well
tested. Perhaps, for some Intel CPUs, the difference in speed is not so
big, and FNV hash is much smaller and simpler, so FNV is a reasonable
choice, but the hash is twice slower on my AMD processor and I suspect
it may be even worse on other CPUs, where integer multiplication is slow.
Besides, it may turn out that hashing filename may be not only case where
a fast hash is needed.

Dmitry
-

To: Dmitry Potapov <dpotapov@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 6:39 am

From /proc/cpuinfo. It's the best I can do without going to our purchase
department and asking for the spec so I can contact the vendor and get
the real thing. Dualcore shouldn't matter for this test, as it isn't
threaded.

It doesn't do that well on certain types of data, in my experience. It does
have excellent dispersion, so with very long strings it's usually the

Ah well. I think once the patch is in master, it will be easy enough to
test and verify different algorithms. Since it's intended for in-memory
data only, it's no problem to have several algorithms and pick the one
most suitable for the architecture we're compiling for.

--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-

To: Andreas Ericsson <ae@...>
Cc: Dmitry Potapov <dpotapov@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 10:01 am

I suspect that this paragraph was about comparison with lookup2
(not lookup3) because lookup3 beat easily all the "simple" hashes
in my testing. Only competitor was Hsieh one which was like 50:50

If you don't mind few percent speed penalty compared to Jenkings
own optimized version, you can use my simplified version:

http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c2...

It works always with "native" endianess, unlike Jenkins fixed-endian
hashlittle() / hashbig(). It may or may not matter if you plan
to write values on disk.

Speed-wise it may be 10-30% slower worst case (in my case sparc-classic
with unaligned data), but on x86, lucky gcc version and maybe
also memcpy() hack seen in system.h, it tends to be ~10% faster,
especially as it does always 4byte read in main loop.

--
marko
-

To: Marko Kreen <markokr@...>
Cc: Dmitry Potapov <dpotapov@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 10:39 am

It might be. It's from the link Dmitry posted in his reply to my original

By how much? FNV beat Linus' hash by 0.01 microseconds / insertion,
and 0.1 microsecons / lookup. We're talking about a case here where
there will never be more lookups than insertions (unless I'm much

I don't, but I don't care that deeply either. On the one hand,
it would be nifty to have an excellent hash-function in git.
On the other hand, it would look stupid with something that's

It would have to be a significant improvement in wall-clock time
on a test-case of hashing 30k strings to warrant going from 6 to 80
lines of code, imo. I still believe the original dumb hash Linus
wrote is "good enough".

On a side-note, it was very interesting reading, and I shall have
to add jenkins3_mkreen() to my test-suite (although the "keep
copyright note" license thing bugs me a bit).

--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

-

To: Andreas Ericsson <ae@...>
Cc: Dmitry Potapov <dpotapov@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 9:19 am

FNV is around 40% slower than lookup3 on my Intel Core CPU, on 4byte aligned

Jenkins hash is fast because it does not look at individual bytes.
If you _do_ want to look at them for unrelated reasons, (case-insensitive,
unicode-juggling), then it obiously loses the point. That is, if you

Well, ad-hoc dumb hashes may have horrible worst-cases that you cant
see with light testing. Therefore I'd still suggest some better

Sorry. I just used template boilerplate. Considering all the
hard work was done by other people, it not proper to put under
my own license. I tagged the file as 'public domain' and pushed out.

Btw, the reason I started cleaning lookup3 was that at first I was
scared of the complexity of Jenkins code and decided to go with
Hsieh hash. Then I found out that Hsieh code is under totally
werdo license (http://www.azillionmonkeys.com/qed/weblicense.html)
so I could not use it.

====================================================================

Here is my raw-speed test of different hashes. Input is 4-byte
aligned which should be common case for malloc()-ed strings.
This also is best case for original lookup3(), on unaligned
input the memcpy variants beat it easily. Input string
length varies randomly in range 0..64.

own_memcpy - last 12-byte memcpy() calls out to libc
memcpy_hack - last memcpy is inlined bytewise copy loop:

while (len--) *dst++ = *src++;

Note that is is raw-speed test, if you benchmark larger code the
hash difference probably matters less.

--------------------------------------------------------------------

Testing: seed=34 align=4 minlen=0 maxlen=64 trycnt=2 duration=10

lookup3 : try=0: ... 247.4880 MB/s
lookup3 : try=1: ... 247.6154 MB/s
own_memcpy: try=0: ... 223.5508 MB/s
own_memcpy: try=1: ... 223.5830 MB/s
memcpy_hack: try=0: ... 241.2241 MB/s
memcpy_hack: try=1: ... 241.2492 MB/s
lookup2 : try=0: ... 190.2697 MB/s
lookup2 : try=1: ... 190.3283 MB/s
fnv : try=0: ... 153.0318 MB/s
fnv : try=1: ... 153.017...

To: Marko Kreen <markokr@...>
Cc: Dmitry Potapov <dpotapov@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 12:00 pm

But the tests surely need to check for unaligned cases, as that's what

I believe the ability to add unicode-juggling was a major point
with the patch, so perhaps Jenkins' isn't such a good option.

I'm not familiar with data-mangling the way Linus (or Theo Tso

True. FNV is used in both MySQL and PostgreSQL. I'd say it's safe to

Thanks. I'll see if I can add it, although it'll probably have to
wait until I have reason to dig into it at work again. I've added
the hash to the code-base, but not yet incorporated it into the

True. I was in contact with him a while back since I wanted to use it
in an opensource project, but the licensing issues made me go with
another one instead. The patch got turned down anyways, so it was a

Unless arena allocated, like we do in git.

I'm not surprised that this test favours Jenkin's and Hsie's.
That's to be expected as those benefit far more than simpler
hashing algorithms for long strings. The overhead when trying
shorter strings (say, between 3 and 15 chars, and not necessarily

Well, memcpy() isn't very interesting to compare against
hashes, as they test vastly different parts of the hardware's
parts' performance. memcpy() should also perform exactly the
same no matter what the test-data, which isn't always true for
hashes.

What *would* be interesting would be something along the lines
of "duff_cpy()": ie, an unrolled loop that aligns itself and
copies each byte to the same address each time.

The bytewise equivalence would ofcourse be

magic_cpy(unsigned char *k, int len)
{
unsigned char magic;
do {
magic = *k++;
} while (--len);

Interesting output, but not very surprising. Do you have the code
available somewhere?

--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-

To: Andreas Ericsson <ae@...>
Cc: Dmitry Potapov <dpotapov@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Friday, January 25, 2008 - 4:08 pm

http://pgbouncer.projects.postgresql.org/hashtest/hashtest-2008-01-25.tgz

I cleaned it up a bit. Also fixed a bug - the lookup3 was called
via wrapper function so it acually is tiny bit faster than the
results show.

You can consider the code as public domain too. Also remember
the was not meant to be published, so it rather hack...

--
marko
-

To: Andreas Ericsson <ae@...>
Cc: Marko Kreen <markokr@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 12:28 pm

I don't think you can any meaningful unicode-juggling without converting
symbols to UCS-4, and after that it makes much more sense to operate
with uint32 than bytes. So, Jenkins' hash is still relevant, just because
it does not operate on single bytes, but using uint32.

Dmitry
-

To: Dmitry Potapov <dpotapov@...>
Cc: Marko Kreen <markokr@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 1:15 pm

No, no, no, NO!

Egads! Why do people constantly do these totally idiotic things for
Unicode?

You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all
source code projects in UTF-8, without *ever* doing any format changes at
all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie
we don't care about byte-order or size of integers or anything like that),
but that's the common case for most hashes that aren't used for BTree
lookup on disk or something like that.

Here, let me show you:

unsigned int name_hash(const char *name, int size)
{
hash = HASH_INIT;
do {
unsigned char c;
if (size >= sizeof(long)) {
unsigned long val = get_unaligned_long(name);
if (!(val & 0x8080808080808080)) {
/* Make it equivalent in case */
val &= ~0x2020202020202020;
hash = hash_long(hash, val);
name += sizeof(long);
size -= sizeof(long);
continue;
}
}

c = *name;
if (!(c & 0x80)) {
hash = hash_long(hash, c & ~0x20);
name++;
size--;
continue;
}

/* This is the unusual and slowish case */
hash = hash_utf8_char(hash, c, &name, &size);
} while (size);
return hassh;
}

and then the only point you ever do that actual UTF8->codepoint conversion
is for that "high bit set" case.

A few things to note on the above:

- the hash obviously has "odd" characteristics. We're not necessarily
hashing characters at a time at all, and the alignment of characters
with high bits *within*the*string* will make a difference to how we
hash them.

But it's also important that the "get_unaligned_long()" means that the
alignment of the string itself doesn't matter, so its' purely a
"chunking within the string" thing, and the alignment of the string
itself won't affect the hash value

- I'm not writing out hash_utf8_char(), because it's certainly not
totally trivial, but it's not *really* complex either....

To: Linus Torvalds <torvalds@...>
Cc: Dmitry Potapov <dpotapov@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Friday, January 25, 2008 - 4:52 pm

Well, although this is very clever approach, I suggest against it.
You'll end up with complex code that gives out substandard results.

I think its better to have separate case-folding function (or several),
that copies string to temp buffer and then run proper optimized hash
function on that buffer.

That way you can use already tested building blocks and can optimize
both sides separately. Eg. the folding-only function can aswell be
optimized to load 4 or 8-byte at-a-time. This also isolates hashing
from exact details how folding happens to access the input string which
seem to be the weak point in your approach. (In both collision and
complexity sense.)

Such temp buffer happens to fits my lookup3_memcpy also better (heh).
Its weak point is that on platforms that do not allow unaligned access,
it degenerates to byte-by-byte loading. But if know you always
have aligned buffer, you can notify gcc to do 4-byte fetch there too.
It should be as simple as tagging data pointer as uint32_t *.

If your input strings are over kilobyte on average then I'd
agree with you, but if you process 20-30 bytes on average,
is the additional complexity worth it?

--
marko
-

To: Marko Kreen <markokr@...>
Cc: Dmitry Potapov <dpotapov@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Friday, January 25, 2008 - 6:16 pm

I'm sorry, but you just cannot do that efficiently and portably.

I can write a hash function that reliably does 8 bytes at a time for the
common case on a 64-bit architecture, exactly because it's easy to do
"test high bits in parallel" with a simple bitwise 'and', and we can do
the same with "approximate lower-to-uppercase 8 bytes at a time" for a
hash by just clearing bit 5.

In contrast, trying to do the same thing in half-way portable C, but being
limited to having to get the case-folding *exactly* right (which you need
for the comparison function) is much much harder. It's basically
impossible in portable C (it's doable with architecture-specific features,
ie vector extensions that have per-byte compares etc).

And hashing is performance-critical, much more so than the compares (ie
you're likely to have to hash tens of thousands of files, while you will
only compare a couple). So it really is worth optimizing for.

And the thing is, "performance" isn't a secondary feature. It's also not
something you can add later by optimizing.

It's also a mindset issue. Quite frankly, people who do this by "convert
to some folded/normalized form, then do the operation" will generally make
much more fundamental mistakes. Once you get into the mindset of "let's
pass a corrupted strign around", you are in trouble. You start thinking
that the corrupted string isn't really "corrupt", it's in an "optimized
format".

And it's all downhill from there. Don't do it.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Dmitry Potapov <dpotapov@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Saturday, January 26, 2008 - 8:37 am

Ok, you seem to focus on case folding and general performance,
I focus on hash quality and code complexity. Considering you
may want several folding methods, it seemed to me that it would
be good to separate the two aspects.

But I'll try to follow your path for a moment.

Hashing 32 or 64 bits at a time is not trivial, eg. you cannot
use same algorithm for both cases, 64 bits requires twice the
work to mix well.

Per Jenkins notes, hash inner loop should be reversible - that
means that details from beginning of data should shift out of horizon.
But final mixing should achieve avalanche - that means each bit in
input should affect 50% bits in output.

Also must be noted that we are mixing 32->32 and 64->64 instead
of the usual 8->32. So it seems that the best bet would be to use
integer hash functions as core. Jenkins himself points to Thomas
Wang (http://www.cris.com/~Ttwang/tech/inthash.htm) who has good
integer mix functions for both 32 and 64 bits.

Integer hash function must have both reversibility and avalance,
so they may be slightly overkill for this purpose, but fixing that
means lot of work.

So here is what I propose for hashing, if you really want to have
combined folding + hashing:

/* Thomas Wang integer hash functions */

static inline uint32_t hash32(uint32_t key)
{
key = ~key + (key << 15);
key = key ^ (key >> 12);
key = key + (key << 2);
key = key ^ (key >> 4);
key = key * 2057;
key = key ^ (key >> 16);
return key;
}

static inline uint64_t hash64(uint64_t key)
{
key = (~key) + (key << 21); // key = (key << 21) - key - 1;
key = key ^ (key >> 24);
key = (key + (key << 3)) + (key << 8); // key * 265
key = key ^ (key >> 14);
key = (key + (key << 2)) + (key << 4); // key * 21
key = key ^ (key >> 28);
key = key + (key << 31);
retu...

To: Linus Torvalds <torvalds@...>
Cc: Dmitry Potapov <dpotapov@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Saturday, January 26, 2008 - 8:16 am

Here you misunderstood me, I was proposing following:

int hash_folded(const char *str, int len)
{
char buf[512];
do_folding(buf, str, len);
return do_hash(buf, len);
}

That is - the folded string should stay internal to hash function.

Only difference from combined foling+hashing would be that

Againg, you seem to keep HFS+ behaviour in mind, but that was
not what I did suggest. Probably my mistake, sorry.

--
marko
-

To: Marko Kreen <markokr@...>
Cc: Dmitry Potapov <dpotapov@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Sunday, January 27, 2008 - 2:51 am

If it's internal, it's much better, but you still missed the performance
angle.

The fact is, hashing can take shortcuts that folding cannot do!

Case folding, by definition, has to be "exact" (since the whole point is
what you're going to use the same folding function to do the compare, so
if you play games with folding, the compares will be wrong).

But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if
they were different cases of the same character, if that gives you a
faster hash function. Especially as those charactes are rather rare in
filenames.

So if you do hashing as a function of its own, you can simply do a better
job at it.

I do agree that the functions that create a folded set of characters from
a _complex_ UTF-8 character should be shared between folding and hashing,
since that code is too complex and there are no simple shortcuts for doing
a faster hash that still retains all the properties we want.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Dmitry Potapov <dpotapov@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Sunday, January 27, 2008 - 5:45 am

Well, you can always have fold_quick_and_dirty() function that
is used only internally in hash_folded() function, which can:

- fold with simple |= 0x20202020..
- write out full uint32/64, no need to make result proper string
- zero-fill at the end, so hash function does not need to check
for partial block, which is pretty expensive part of hashing.

The win would be:
- more modularized code
- can use faster/any hash
- hash function can be certain to work on aligned data
(win on non-x86)

The minus:
- some memory i/o overhead which may or may not matter
- the parts would not be fully generic, but special to hashing

--
marko

PS. Typo in last mail - "inner loop should be reversible - that
means that details from beginning of data should shift out of
horizon." That obviously means "data should _not_ shift
out of horizon.

btw, "reversible" for integer hashes means that there is 1:1
mapping between input and output - no collisions. Thus
no info loss.
-

To: Marko Kreen <markokr@...>
Cc: Linus Torvalds <torvalds@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Sunday, January 27, 2008 - 11:06 am

If a string is short, it will probably reside in the processor cache, so
there is no real memory i/o overhead here. For more longer strings, it
may be better to do that in short chunks, so each chunk can reside in
the cache. But I don't think filenames are long, so it is not an issue

I think the second part can be rather generic to be reused for hashing
something else that does not require filename specific case-folding.

Dmitry
-

To: Linus Torvalds <torvalds@...>
Cc: Marko Kreen <markokr@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Sunday, January 27, 2008 - 4:21 am

Let's rename do_folding as something else, because it is not a real
folding, but a preparation step for hash calculation. Keeping these
steps separately simplifies the code, and allows further optimization,
for instance, you do not need this do_folding step on a case-sensitive
filesystem. Though it is certainly possible to mix both steps together,
it bloats the code and makes it less readable. Of course, the idea to
avoid a temporary buffer and do everything at once is very appealing,
so I gave it a try -- and here is a 32-bit version of name_hash(), but
I am not very happy with the result:

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))

#define mix(a,b,c) \
{ \
a -= c; a ^= rot(c, 4); c += b; \
b -= a; b ^= rot(a, 6); a += c; \
c -= b; c ^= rot(b, 8); b += a; \
a -= c; a ^= rot(c,16); c += b; \
b -= a; b ^= rot(a,19); a += c; \
c -= b; c ^= rot(b, 4); b += a; \
}
#define final(a,b,c) \
{ \
c ^= b; c -= rot(b,14); \
a ^= c; a -= rot(c,11); \
b ^= a; b -= rot(a,25); \
c ^= b; c -= rot(b,16); \
a ^= c; a -= rot(c,4); \
b ^= a; b -= rot(a,14); \
c ^= b; c -= rot(b,24); \
}

#define hash_value(x) \
hs[hp] += (x); \
if (++hp == 3) { \
mix (hs[0], hs[1], hs[2]); \
hp = 0; \
}
unsigned int name_hash(const char *name, unsigned size)
{
unsigned hp = 0;
unsigned hs[3];
hs[0] = hs[1] = hs[2] = 0xdeadbeef + size;

do {
unsigned char c;
if (size >= sizeof(unsigned)) {
unsigned val = get_unaligned_uint(name);
if (!(val & 0x80808080)) {
val &= ~0x20202020;
hash_value(val);
name += sizeof(val);
size -= sizeof(val);
continue;
}
}

while (!((c = *name) & 0x80)) {
hash_value(c & ~0x20);
name++;
if (!--size)
goto done:
}

do {
// TODO: add denormalization for Mac
unsigned val = towupper (utf8_to_wchar(&name, &size));
hash_value(val);
} while (size && (*name & 0x80));

} while (size);
done:
if (hp)
fin...

To: Dmitry Potapov <dpotapov@...>
Cc: Marko Kreen <markokr@...>, Linus Torvalds <torvalds@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Sunday, January 27, 2008 - 10:07 am

Hi,

<irony>Oh yes, let's take this one, it is so much shorter, cleaner and
overall more elegant than Linus' code.</irony>

Ciao,
Dscho

-

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Marko Kreen <markokr@...>, Linus Torvalds <torvalds@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Sunday, January 27, 2008 - 10:48 am

Hi,

I am not sure what you meant by that, and what exactly code you meant
saying Linus' code....

Anyway, my point was that mixing both steps together does not look very
nice, and the code was intended to demonstrate why.

Dmitry
-

To: Marko Kreen <markokr@...>
Cc: Dmitry Potapov <dpotapov@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Friday, January 25, 2008 - 6:35 pm

Side note: you *can* get better approximations fairly cheaply if you care.

If you want to distinguish the characters 0-31 from the characters 31-63
in your hash (pointless for filenames, but it can be worthwhile for some
other string cases), you can decide to clear bit#5 only if bit#6 in that
byte was also set, with just a few bitwise operations.

Eg, imagine that you have "unsigned long x" containing eight bytes of
ascii data (ie you already did the test by 0x8080808080808080), you can do
things like

unsigned long bit6 = x & 0x4040404040404040;
x &= ~(bit6 >> 1);

which will only clear bit5 if bit6 in the same byte was set..

So you can do tricks like that, and it will still be plenty fast. And
nobody will ever care that while it collides 'A' with 'a' (by design), it
also causes '{' and '[' to be considered "case collisions".

[ Amusing side note: '{' and '[' *are* case collisions in legacy 7-bit
"Finnish ASCII". The editor I use still "upper-cases" '{' to '['. I'm
not kidding, and yes, it really does it on purpose!

It used to be that before everybody turned to Latin1, the {|} characters
were re-used in Finland (and Sweden, for that matter) for the
extra characters needed in Finnish. Because obviously nobody ever
needed them for any real work.

I (and probably every Finnish C UNIX programmer) used to be very good at
reading C source code even when it was full of odd finnish characters
with dots on top, instead of curly braces! ]

And yes, from a performance standpoint, things liek this probably do realy
matter. For the kernel tree, the average pathname length is ~28
characters. If you can do it with three iterations that do the first 24
characters eight characters at a time, and then four iterations over the
four last ones, rather than 28 iterations with byte->longword and
multiplications in each, I bet it's quite visible.

Of course, it's going to be visible only if everything else is fast too,
but git has ...

To: Linus Torvalds <torvalds@...>
Cc: Marko Kreen <markokr@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 2:45 pm

It is better to use 'while' instead of 'if' here, i.e.:

while (!((c = *name) & 0x80)) {
hash = hash_long(hash, c & ~0x20);
name++;
if (!--size)
return hash;

Dmitry
-

To: Dmitry Potapov <dpotapov@...>
Cc: Marko Kreen <markokr@...>, Andreas Ericsson <ae@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 3:08 pm

Yes, that looks like a good further micro-optimization.

Linus
-

To: Andreas Ericsson <ae@...>
Cc: Dmitry Potapov <dpotapov@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 12:13 pm

Ok, here is 0..15 chars, random alignment:

Testing: seed=34 align=0 minlen=0 maxlen=15 trycnt=2 duration=10

lookup3 : try=0: ... 69.8092 MB/s
lookup3 : try=1: ... 69.8146 MB/s
own_memcpy: try=0: ... 66.7808 MB/s
own_memcpy: try=1: ... 66.7814 MB/s
memcpy_hack: try=0: ... 74.0635 MB/s
memcpy_hack: try=1: ... 74.0518 MB/s
lookup2 : try=0: ... 68.6582 MB/s
lookup2 : try=1: ... 68.6634 MB/s
fnv : try=0: ... 74.5098 MB/s
fnv : try=1: ... 74.5283 MB/s
hsieh : try=0: ... 71.6708 MB/s
hsieh : try=1: ... 71.6814 MB/s
oat : try=0: ... 74.7828 MB/s
oat : try=1: ... 74.7716 MB/s
elf : try=0: ... 65.2077 MB/s
elf : try=1: ... 65.2128 MB/s

Results compared to reference:

lookup3 : 100.000 %
own_memcpy : 95.659 %
memcpy_hack : 106.082 %
lookup2 : 98.351 %
fnv : 106.743 %
hsieh : 102.670 %
oat : 107.112 %

I can put it out.

--
marko
-

To: Andreas Ericsson <ae@...>
Cc: Dmitry Potapov <dpotapov@...>, Marko Kreen <markokr@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 2:51 am

Would you, for completeness' sake, please add Tcl and STL hashes to
your test suite? The numbers are quite interesting. Is your test
suite available somewhere, so we can test with our own data and
hardware as well. Both Tcl hash and STL (from SGI probably HP days,
still the current default with g++) string hashes are extremely
simple (excluding the loop constructs):

Tcl: h += (h<<3) + c; // essentially *9+c (but work better on non-
late-intels)
STL: h = h * 5 + c; // worse than above for most of my data

Thanks,

__Luke

-

To: Luke Lu <git@...>
Cc: Dmitry Potapov <dpotapov@...>, Marko Kreen <markokr@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, January 24, 2008 - 6:24 am

I could do that. Or I just publish the entire ugly thing and let someone

Not yet, no. I usually munge it up quite a lot when I want to test hashes

They sure do look simple enough. As for loop constructs, I've tried to
use the same looping mechanics for everything, so as to let the algorithm
be the only difference. Otherwise it gets tricky to do comparisons. The
exceptions are ofcourse hashes relying on Duff's device or similar
alignment trickery.

--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Tuesday, January 22, 2008 - 10:23 pm

This is nice. It does not do anything specific with HFS+ issues
but aims for faster look-ups, which would help everybody.

Two things I noticed (only two, not necessarily because you are
good but mostly because I am still mired in day job and could
not get enough uninterrupted minutes to read the patch ;-)):

- You might want to store the hash table (once computed) in the
index extension section, and lazily unpack the table the
first time index_name_exists() or set_index_entry() is called
on the given istate, instead of unpacking it immediately when
you read from the disk. That way, ls-files does not have to
suffer at all.

- You would need to get rid of the table in discard_index().

-

To: Junio C Hamano <gitster@...>
Cc: Git Mailing List <git@...>
Date: Tuesday, January 22, 2008 - 10:58 pm

I really hate that.

Basically, I dislike having two copies of the same data. If something can
be computed from something else, then only the original data should exist,
and the other thing should be recomputed. Otherwise you easily get into
situations where you spend a lot of time maintaining the other copy, or
worse, you have inconsistent data and it's really subtle what is going on.

Also, one of the ideas behind the index is that would depend on your
notion of what is "equivalent", which is actually somehing fairly fluid.
In fact, it's likely going to depend on a config option.

So encoding the indexing on disk, when it can change when you do a simple
"git config", or even just depending on your LANG environment variable,
seems like a singularly bad idea, even if it wasn't for the coherence.

I did consider doing the indexing only on demand, and we can certainly
simply just "turn it off" when we know it's never going to get used (ie
"git ls-files"). So in that sense, it's easy to get rid of the overhead,
but it didn't really seem like the conceptual complexity (even if it's
just a couple of lines) is really worth it. It's not like git ls-files is

Now this, of course, is obviously true.

And the patch to do that is very simple too. No need to walk any chains,
since the "free(istate->alloc);" will release all the pre-allocated
cache_entry structures, and the rest are (necessarily) leaked anyway.

[ Side note for non-Junios: the leaking of cache_entry structures isn't
new, we've always done it, and it's even done on purpose. The common
case is that there is one *big* allocation (istate->alloc) that contains
all the original cache entries.

There are usually none, or only a very few individual allocations, and
we don't even keep track of them. With the new in-memory format, we
could make a special flag that does "is this cache-entry an individual
allocation or not" (or we could even just see if they are inside the
"alloc" ran...

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 3:23 am

Yes, I agree with that in principle. Storing computable values
makes sense only when it is expensive to recompute. We did not
have cache-tree for quite a long time until you noticed that it
was rather expensive and wasteful to recompute tree objects from
unchanged parts of the index every time.

It's the same argument; when the hashing performance starts to
become noticeable, we can think about storing and reusing it,

Yes, ls-files is cheap. So is lstat(2) on Linux. It only
matters when you do it many many times.

In any case, the change does not look too bad. The best time
(real) of running git-ls-files in the kernel repository on my
box is 0.010s vs 0.011s (10% improvement, heh!, which is the
same as the master version) and empty commit is both 0.082s (no
change).

-- >8 --
[PATCH] lazy index hashing

This delays the hashing of index names until it becomes necessary for
the first time.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
cache.h | 1 +
read-cache.c | 26 +++++++++++++++++++++++---
2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index 409738c..e4aeff0 100644
--- a/cache.h
+++ b/cache.h
@@ -191,6 +191,7 @@ struct index_state {
struct cache_tree *cache_tree;
time_t timestamp;
void *alloc;
+ unsigned name_hash_initialized : 1;
struct hash_table name_hash;
};

diff --git a/read-cache.c b/read-cache.c
index 9477c0b..e45f4b3 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -34,12 +34,11 @@ static unsigned int hash_name(const char *name, int namelen)
return hash;
}

-static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
{
void **pos;
unsigned int hash = hash_name(ce->name, ce_namelen(ce));

- istate->cache[nr] = ce;
pos = insert_hash(hash, ce, &istate->name_hash);
if (pos) {
ce->next = *pos;
@@ -47,6 +46,24 @@ static void set_...

To: Junio C Hamano <gitster@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 8:25 am

Hi,

I fully expect it to be noticable with that UTF-8 "normalisation". But
then, the infrastructure is there, and whoever has an itch to scratch...

Ciao,
Dscho

-

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 12:25 pm

Actually, it's going to be totally invisible even with UTF-8
normalization, because we're going to do it sanely.

And by "sanely" I mean just having the code test the high bit, and using
US-ASCII as-is (possibly with that " & ~0x20 " thing to ignore case in
it).

End result: practically all projects will never notice anything at all for
99.9% of all files. One extra well-predicted branch, and a few more hash
collissions for cases where you have both "Makefile" and "makefile" etc.

Doing names with *lots* of UTF-8 characters will be rather slower. It's
still not horrible to do if you do it the smart way, though. In fact, it's
pretty simple, just a few table lookups (one to find the NFD form, one to
do the upcasing).

And yes, for hashing, it makes sense to turn things into NFD because it's
generally simpler, but the point is that you really don't actually modify
the name itself at all, you just hash things (or compare things) character
by expanded character.

IOW, only a total *moron* does Unicode name comparisons with

strcmp(convert_to_nfd(a), convert_to_nfd(b));

which is essentially what Apple does. It's quite possible to do

utf8_nfd_strcmp(a,b)

and (a) do it tons and tons faster and (b) never have to modify the
strings themselves. Same goes (even more) for hashing.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 12:34 pm

Hi,

Well, that's the point, to avoid having both "Makefile" and "makefile" in

Heh, indeed that is what I would have done as an initial step (out of

Okay. Point taken.

But I really hope that you are not proposing to use the case-ignoring
hash when we are _not_ on a case-challenged filesystem...

Ciao,
Dscho

-

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 1:09 pm

Right. But what I'm saying is that this is *really* cheap to test for for
US-ASCII-only characters, and if only 0.1% of all filenames have unicode
in them, the fact that they are much mroe expensive isn't even going to be

Note that one reason the above is tons faster is that even with complex
unicode, the *common* case is going to be that the names match with a

I actually suspect that we could, and nobody will notice. The hash would
cause a few more collissions, but not so you'd know.

And the thing is, people who work with other people who are on
case-challenged systems would still want to have the case-insenstive
compare too - although it should just warn, not actually "work".

Linus
-

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 1:29 pm

To clarify: the thing I want to point out that the decision to *hash* the
filenames in a case-insensitive hash, is very different from the decision
to then *compare* the filenames when traversing the hash with a
case-insensitive compare.

And this difference is actually very important. Hashing things together
that are "equivalent" according to any random rule is what makes it
possible to then *check* for equivalence cheaply (because you only need to
make the potentially expensive check with the subset of cases where it
might trigger), but it in no way forces you to actually recode or mangle
or compare things equivalently.

In fact, I'd argue that this is what HFS+ did wrong in the first place:
they had stupid/incompetent people who didn't understand about this, so
they normalized the string *before* the hashing rather than as part of the
hash itself, and thus actually corrupt the string itself.

So what you can do (and I'd argue that we do) is to have a hash that can
handle almost arbitrary input, but then never corrupt the filename, and
always compare exactly by default.

Then, depending on a config option, we can decide to change the compare so
that equivalent (according to whatever rule) filenames either cause a
warning (people on sane filesystems, but working with people who aren't),
or are silently considered the same file (people on insane filesystems).

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: <git@...>
Date: Friday, January 25, 2008 - 1:21 am

Linus Torvalds wrote:

In general, there may be a large number of comparison function options
that git will eventually support, and they will likely not all form a
single chain of increasing "strictness".

Given that the hash values aren't even being stored on disk (and if they
were, a simple approach of also storing an identifier for the hash
function to know whether they stored values are still valid could be
used), having a chain of increasingly "strict" comparison functions and
using a hash function that corresponds to the least strict one is useful
for exactly one reason: giving (possibly several different levels of)
non-fatal warnings for various types of duplicates.

But since multiple hash functions will be needed anyway to support
different notions of case-insensitivity, if the warning is not enabled,
there is no reason to use a case-insensitive hash function with a
byte-exact comparison.

--
Jeremy Maitin-Shepard
-

To: Jeremy Maitin-Shepard <jbms@...>
Cc: Linus Torvalds <torvalds@...>, <git@...>
Date: Friday, January 25, 2008 - 8:51 am

Hi,

No, only multiple compare functions will be needed. The hash function can
be built in such a manner that it guarantees that file names being equal
with _any_ of the compare functions fall into the same bucket.

The upside of such a hash function: less code to maintain.

Hth,
Dscho

-

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Linus Torvalds <torvalds@...>, <git@...>
Date: Friday, January 25, 2008 - 2:19 pm

In theory, I agree that this is possible, but in practice it may not be
reasonable at all. Consider two possible comparison functions:

1. compare file names as strings case-insensitively assuming a latin 1
encoding

2. compare file names as strings case-insensitively assuming a UTF-8
encoding

Actually writing a hash function such that two strings hash to the same
value if either of these comparison functions says that the strings are

A simple hash function that doesn't try to do anything regarding
case-insensitivity is extremely short and simple and therefore is hardly
a maintenance burden.

Although in some cases it is possible to "share" a hash function, except
for the "warning" purpose, actually doing so doesn't make much sense.
Using the "case-insensitive" hash function when you intend to use an
"exact" comparison function just amounts to using a hash function that
is unequivocally worse: it is slower, more complicated, and has a higher
collision rate.

--
Jeremy Maitin-Shepard
-

To: Jeremy Maitin-Shepard <jbms@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, <git@...>
Date: Friday, January 25, 2008 - 3:07 pm

Once you start adding more "case folding" supported filesystems
to the repertoire, such a unified hash function Dscho suggests
needs to throw paths that other (N-1) "case folding" filesystems
treat as distinct but only 1 filesystem treats "equivalent" into
the same hash bucket. I would say not just difficult but the
resulting function would have too many collisions to make it
ineffective.

-

To: Jeremy Maitin-Shepard <jbms@...>
Cc: Linus Torvalds <torvalds@...>, <git@...>
Date: Friday, January 25, 2008 - 2:24 pm

Hi,

You misunderstand me. If the complicated hash function is the one that is
less exercised, you _will_ face problems.

OTOH if you _already_ need the "complicated" hash function, there is
_little_ point not to use it, and be consistent between platforms,
_especially_ since now all people eat the same dog food.

So I never thought about the simple hash function as being a burden.

Hth,
Dscho

-

To: Junio C Hamano <gitster@...>
Cc: Git Mailing List <git@...>
Date: Tuesday, January 22, 2008 - 11:19 pm

Ok, I pushed the squashed/fixed commit to the "new-lstat" branch. It's now
based on your 'next' branch, and I should have renamed it, since it's not
about any of the old lstat() optimizations any more. Whatever.

But here it is also as a full patch, with a fixed up subject line etc. You
tend to want to have them as topic-branches, and it's probably easier this
way.

Linus

---
From ca98bbc7b0cc1d9d088a8c6ae80e733115a1f775 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 22 Jan 2008 18:41:14 -0800
Subject: [PATCH] Create pathname-based hash-table lookup into index

This creates a hash index of every single file added to the index.
Right now that hash index isn't actually used for much: I implemented a
"cache_name_exists()" function that uses it to efficiently look up a
filename in the index without having to do the O(logn) binary search,
but quite frankly, that's not why this patch is interesting.

No, the whole and only reason to create the hash of the filenames in the
index is that by modifying the hash function, you can fairly easily do
things like making it always hash equivalent names into the same bucket.

That, in turn, means that suddenly questions like "does this name exist
in the index under an _equivalent_ name?" becomes much much cheaper.

Guiding principles behind this patch:

- it shouldn't be too costly. In fact, my primary goal here was to
actually speed up "git commit" with a fully populated kernel tree, by
being faster at checking whether a file already existed in the index. I
did succeed, but only barely:

Best before:
[torvalds@woody linux]$ time git commit > /dev/null
real 0m0.255s
user 0m0.168s
sys 0m0.088s

Best after:

[torvalds@woody linux]$ time ~/git/git commit > /dev/null
real 0m0.233s
user 0m0.144s
sys 0m0.088s

so some things are actually faster (~8%).

Caveat: that's really the best case. Other things are invariably g...

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Friday, January 25, 2008 - 2:50 am

I wonder if the issue Dave Miller addressed with
69ae517541ed5ab7d4fdcd8f82a9b8bd949df347 (fast-import: fix
unalinged allocation and access) applies here.

commit 69ae517541ed5ab7d4fdcd8f82a9b8bd949df347
Author: David S. Miller <davem@davemloft.net>
Date: Fri Dec 14 20:39:16 2007 -0800

fast-import: fix unalinged allocation and access

The specialized pool allocator fast-import uses aligned objects on the
size of a pointer, which was not sufficient at least on Sparc. Instead,
make the alignment for objects of type unitmax_t.

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

-

To: Junio C Hamano <gitster@...>
Cc: Git Mailing List <git@...>
Date: Friday, January 25, 2008 - 12:24 pm

Good point, although we actually do things wrong for *another* reason.

We currently force cache_entry to be 8-byte aligned regardless of what the
actual "sizeof(ptr)" is, so we should assume that alignment:

#define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7)

and if that isn't correct, we'd need to change this #define.

So right now, the right thing to do is probably to make this alignment
explicit:

#define CE_ALIGN 8

and then use that both in the "cache_entry_size()" _and_ in the
"estimate_cache_size()" calculations to make it obvious what the alignment
is.

And then we could actually make the alignment less on architectures that
don't need that much (there may be architectures that need more, but I
doubt it: we don't have any large fields in that structure, so the
structure alignment really probably does max out at 8 in practice even if
the C language theory doesn't give you any such guarantees).

Side note: this is not likely to be a problem in _practice_. The on-disk
representation is also aligned (also by 8), and while they can be
*differently* aligned due to the relative alignment of the varying-length
"name[]" field, and that can cause some padding to be needed, in practice
it will never matter. The on-disk size also contains a header that we
don't take into account, so it's already "over-estimated" to begin with
for the in-memory representation.

So "estimate_cache_size()" really does over-estimate its needs by a
biggish amount, which is why it all works regardless, but better safe than
sorry.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Tuesday, January 22, 2008 - 10:36 pm

Actually, I take one fourth of this back.

* I am not yet retracting the suggestion to do the hashing
lazily (what I mean by "lazily" is that the first access that
wants hashed access will iterate active_cache and hash them
all, not "lazily one entry at a time as needed" which would
not make any sense for a hashtable). I have to find time to
try benching the effect of it myself. So that's one half
retained.

* We certainly do not necessarily want to store this in the
index right now. The hash algorithms would be improved from
the version you are almost ashamed of ;-). That sounds as if
I am retrating the other half, but not quite.

* Once we have a mechanism to detect that the extension section
stores a precomputed hash that was done with a different
algorithm and ignore it (and recompute afresh when needed),
then we can afford to put a more elaborate hashing algorithm,
slightly loosening one of your "Guiding principles", and keep
the result in the generated index to be reused by the next
user. So that is why I am retracting only half of the
suggestion to save it in the extension section (which in turn
is a half of my suggestion).
-

To: Junio C Hamano <gitster@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 8:24 am

Hi,

Both issues (and the config variable issue Linus raised) are easily helped
with: store not only the hashmap in the extension, but also an identifier
for the hash method used.

Then you can improve on the hash function all you like, and add the config
variable dependent choice of the hashing everybody seems to want all of a
sudden, as long as you change the method identifier, too.

Ciao,
Dscho

-

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 8:28 am

For a reasonably implemented hash algorithm, computing the hash should
be cheaper than reading it from disk.

So storing precomputed hashes is not worth the trouble.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-

To: David Kastrup <dak@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, January 23, 2008 - 8:56 am

Yes, but if on Mac OS systems when the git repository is stored on an
HFS+ system, the hash algorithm gets changed to one which forces
Unicode strings which HFS+ happens to "normalize" into the same hash
bucket pre- and post- normalization, it might not be cheap any
more....

- Ted
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Tuesday, January 22, 2008 - 9:35 pm

This is fantastic. Thank you very much for actually taking this issue
seriously despite the mess I made on the list. This is exactly why I
wanted to discuss on the lists instead of hacking away myself - there
are very smart people on the list (like you) that already know how git
works that can come up with ideas like this while I would still be
trying to figure out where the index code is even stored.

-Kevin Ballard

--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com

Previous thread: [PATCH] git-clone -s: document problems with git gc --prune by Miklos Vajna on Tuesday, January 22, 2008 - 5:03 pm. (3 messages)

Next thread: Re: git on MacOSX and files with decomposed utf-8 file names by Theodore Tso on Tuesday, January 22, 2008 - 8:08 pm. (30 messages)