Ok, here's an interesting patch based on the current 'next' (since it very intimately requires the new in-memory index format). What it does is to create a hash index of every single file added to the index. Right now that hash index isn't actually used for much: I implemented a "cache_name_exists()" function that uses it to efficiently look up a filename in the index without having to do the O(logn) binary search, but quite frankly, that's not why this patch is interesting. No, the whole and only reason to create the hash of the filenames in the index is that by modifying the hash function, you can fairly easily do things like making it always hash equivalent names into the same bucket. That, in turn, means that suddenly questions like "does this name exists in the index under an _equivalent_ name?" becomes much much cheaper. Guiding principles behind this patch: - it shouldn't be too costly. In fact, my primary goal here was to actually speed up "git commit" with a fully populated kernel tree, by being faster at checking whether a file already existed in the index. I did succeed, but only barely: Best before: [torvalds@woody linux]$ time git commit > /dev/null real 0m0.255s user 0m0.168s sys 0m0.088s Best after: [torvalds@woody linux]$ time ~/git/git commit > /dev/null real 0m0.233s user 0m0.144s sys 0m0.088s so some things are actually faster (~8%). Caveat: that's really the best case. Other things are invariably going to be slightly slower, since we populate that index cache, and quite frankly, few things really use it to look things up. That said, the cost is really quite small. The worst case is probably doing a "git ls-files", which will do very little except puopulate the index, and never actually looks anything up in it, just lists it. Before: [torvalds@woody linux]$ time git ls-files > /dev/null real 0m0.016s user 0m0.016s sys ...
This is fantastic. Thank you very much for actually taking this issue seriously despite the mess I made on the list. This is exactly why I wanted to discuss on the lists instead of hacking away myself - there are very smart people on the list (like you) that already know how git works that can come up with ideas like this while I would still be trying to figure out where the index code is even stored. -Kevin Ballard -- Kevin Ballard http://kevin.sb.org kevin@sb.org http://www.tildesoft.com
This is nice. It does not do anything specific with HFS+ issues but aims for faster look-ups, which would help everybody. Two things I noticed (only two, not necessarily because you are good but mostly because I am still mired in day job and could not get enough uninterrupted minutes to read the patch ;-)): - You might want to store the hash table (once computed) in the index extension section, and lazily unpack the table the first time index_name_exists() or set_index_entry() is called on the given istate, instead of unpacking it immediately when you read from the disk. That way, ls-files does not have to suffer at all. - You would need to get rid of the table in discard_index(). -
Actually, I take one fourth of this back. * I am not yet retracting the suggestion to do the hashing lazily (what I mean by "lazily" is that the first access that wants hashed access will iterate active_cache and hash them all, not "lazily one entry at a time as needed" which would not make any sense for a hashtable). I have to find time to try benching the effect of it myself. So that's one half retained. * We certainly do not necessarily want to store this in the index right now. The hash algorithms would be improved from the version you are almost ashamed of ;-). That sounds as if I am retrating the other half, but not quite. * Once we have a mechanism to detect that the extension section stores a precomputed hash that was done with a different algorithm and ignore it (and recompute afresh when needed), then we can afford to put a more elaborate hashing algorithm, slightly loosening one of your "Guiding principles", and keep the result in the generated index to be reused by the next user. So that is why I am retracting only half of the suggestion to save it in the extension section (which in turn is a half of my suggestion). -
Hi, Both issues (and the config variable issue Linus raised) are easily helped with: store not only the hashmap in the extension, but also an identifier for the hash method used. Then you can improve on the hash function all you like, and add the config variable dependent choice of the hashing everybody seems to want all of a sudden, as long as you change the method identifier, too. Ciao, Dscho -
For a reasonably implemented hash algorithm, computing the hash should be cheaper than reading it from disk. So storing precomputed hashes is not worth the trouble. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
Yes, but if on Mac OS systems when the git repository is stored on an
HFS+ system, the hash algorithm gets changed to one which forces
Unicode strings which HFS+ happens to "normalize" into the same hash
bucket pre- and post- normalization, it might not be cheap any
more....
- Ted
-
I really hate that. Basically, I dislike having two copies of the same data. If something can be computed from something else, then only the original data should exist, and the other thing should be recomputed. Otherwise you easily get into situations where you spend a lot of time maintaining the other copy, or worse, you have inconsistent data and it's really subtle what is going on. Also, one of the ideas behind the index is that would depend on your notion of what is "equivalent", which is actually somehing fairly fluid. In fact, it's likely going to depend on a config option. So encoding the indexing on disk, when it can change when you do a simple "git config", or even just depending on your LANG environment variable, seems like a singularly bad idea, even if it wasn't for the coherence. I did consider doing the indexing only on demand, and we can certainly simply just "turn it off" when we know it's never going to get used (ie "git ls-files"). So in that sense, it's easy to get rid of the overhead, but it didn't really seem like the conceptual complexity (even if it's just a couple of lines) is really worth it. It's not like git ls-files is Now this, of course, is obviously true. And the patch to do that is very simple too. No need to walk any chains, since the "free(istate->alloc);" will release all the pre-allocated cache_entry structures, and the rest are (necessarily) leaked anyway. [ Side note for non-Junios: the leaking of cache_entry structures isn't new, we've always done it, and it's even done on purpose. The common case is that there is one *big* allocation (istate->alloc) that contains all the original cache entries. There are usually none, or only a very few individual allocations, and we don't even keep track of them. With the new in-memory format, we could make a special flag that does "is this cache-entry an individual allocation or not" (or we could even just see if they are inside the "alloc" ...
Ok, I pushed the squashed/fixed commit to the "new-lstat" branch. It's now based on your 'next' branch, and I should have renamed it, since it's not about any of the old lstat() optimizations any more. Whatever. But here it is also as a full patch, with a fixed up subject line etc. You tend to want to have them as topic-branches, and it's probably easier this way. Linus --- From ca98bbc7b0cc1d9d088a8c6ae80e733115a1f775 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue, 22 Jan 2008 18:41:14 -0800 Subject: [PATCH] Create pathname-based hash-table lookup into index This creates a hash index of every single file added to the index. Right now that hash index isn't actually used for much: I implemented a "cache_name_exists()" function that uses it to efficiently look up a filename in the index without having to do the O(logn) binary search, but quite frankly, that's not why this patch is interesting. No, the whole and only reason to create the hash of the filenames in the index is that by modifying the hash function, you can fairly easily do things like making it always hash equivalent names into the same bucket. That, in turn, means that suddenly questions like "does this name exist in the index under an _equivalent_ name?" becomes much much cheaper. Guiding principles behind this patch: - it shouldn't be too costly. In fact, my primary goal here was to actually speed up "git commit" with a fully populated kernel tree, by being faster at checking whether a file already existed in the index. I did succeed, but only barely: Best before: [torvalds@woody linux]$ time git commit > /dev/null real 0m0.255s user 0m0.168s sys 0m0.088s Best after: [torvalds@woody linux]$ time ~/git/git commit > /dev/null real 0m0.233s user 0m0.144s sys 0m0.088s so some things are actually faster (~8%). Caveat: that's really the best case. Other things are invariably going ...
I wonder if the issue Dave Miller addressed with
69ae517541ed5ab7d4fdcd8f82a9b8bd949df347 (fast-import: fix
unalinged allocation and access) applies here.
commit 69ae517541ed5ab7d4fdcd8f82a9b8bd949df347
Author: David S. Miller <davem@davemloft.net>
Date: Fri Dec 14 20:39:16 2007 -0800
fast-import: fix unalinged allocation and access
The specialized pool allocator fast-import uses aligned objects on the
size of a pointer, which was not sufficient at least on Sparc. Instead,
make the alignment for objects of type unitmax_t.
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
-
Good point, although we actually do things wrong for *another* reason. We currently force cache_entry to be 8-byte aligned regardless of what the actual "sizeof(ptr)" is, so we should assume that alignment: #define cache_entry_size(len) ((offsetof(struct cache_entry,name) + (len) + 8) & ~7) and if that isn't correct, we'd need to change this #define. So right now, the right thing to do is probably to make this alignment explicit: #define CE_ALIGN 8 and then use that both in the "cache_entry_size()" _and_ in the "estimate_cache_size()" calculations to make it obvious what the alignment is. And then we could actually make the alignment less on architectures that don't need that much (there may be architectures that need more, but I doubt it: we don't have any large fields in that structure, so the structure alignment really probably does max out at 8 in practice even if the C language theory doesn't give you any such guarantees). Side note: this is not likely to be a problem in _practice_. The on-disk representation is also aligned (also by 8), and while they can be *differently* aligned due to the relative alignment of the varying-length "name[]" field, and that can cause some padding to be needed, in practice it will never matter. The on-disk size also contains a header that we don't take into account, so it's already "over-estimated" to begin with for the in-memory representation. So "estimate_cache_size()" really does over-estimate its needs by a biggish amount, which is why it all works regardless, but better safe than sorry. Linus -
Yes, I agree with that in principle. Storing computable values
makes sense only when it is expensive to recompute. We did not
have cache-tree for quite a long time until you noticed that it
was rather expensive and wasteful to recompute tree objects from
unchanged parts of the index every time.
It's the same argument; when the hashing performance starts to
become noticeable, we can think about storing and reusing it,
Yes, ls-files is cheap. So is lstat(2) on Linux. It only
matters when you do it many many times.
In any case, the change does not look too bad. The best time
(real) of running git-ls-files in the kernel repository on my
box is 0.010s vs 0.011s (10% improvement, heh!, which is the
same as the master version) and empty commit is both 0.082s (no
change).
-- >8 --
[PATCH] lazy index hashing
This delays the hashing of index names until it becomes necessary for
the first time.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
cache.h | 1 +
read-cache.c | 26 +++++++++++++++++++++++---
2 files changed, 24 insertions(+), 3 deletions(-)
diff --git a/cache.h b/cache.h
index 409738c..e4aeff0 100644
--- a/cache.h
+++ b/cache.h
@@ -191,6 +191,7 @@ struct index_state {
struct cache_tree *cache_tree;
time_t timestamp;
void *alloc;
+ unsigned name_hash_initialized : 1;
struct hash_table name_hash;
};
diff --git a/read-cache.c b/read-cache.c
index 9477c0b..e45f4b3 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -34,12 +34,11 @@ static unsigned int hash_name(const char *name, int namelen)
return hash;
}
-static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
{
void **pos;
unsigned int hash = hash_name(ce->name, ce_namelen(ce));
- istate->cache[nr] = ce;
pos = insert_hash(hash, ce, &istate->name_hash);
if (pos) {
ce->next = *pos;
@@ -47,6 +46,24 @@ static void set_index_entry(struct ...Hi, I fully expect it to be noticable with that UTF-8 "normalisation". But then, the infrastructure is there, and whoever has an itch to scratch... Ciao, Dscho -
Actually, it's going to be totally invisible even with UTF-8 normalization, because we're going to do it sanely. And by "sanely" I mean just having the code test the high bit, and using US-ASCII as-is (possibly with that " & ~0x20 " thing to ignore case in it). End result: practically all projects will never notice anything at all for 99.9% of all files. One extra well-predicted branch, and a few more hash collissions for cases where you have both "Makefile" and "makefile" etc. Doing names with *lots* of UTF-8 characters will be rather slower. It's still not horrible to do if you do it the smart way, though. In fact, it's pretty simple, just a few table lookups (one to find the NFD form, one to do the upcasing). And yes, for hashing, it makes sense to turn things into NFD because it's generally simpler, but the point is that you really don't actually modify the name itself at all, you just hash things (or compare things) character by expanded character. IOW, only a total *moron* does Unicode name comparisons with strcmp(convert_to_nfd(a), convert_to_nfd(b)); which is essentially what Apple does. It's quite possible to do utf8_nfd_strcmp(a,b) and (a) do it tons and tons faster and (b) never have to modify the strings themselves. Same goes (even more) for hashing. Linus -
Hi, Well, that's the point, to avoid having both "Makefile" and "makefile" in Heh, indeed that is what I would have done as an initial step (out of Okay. Point taken. But I really hope that you are not proposing to use the case-ignoring hash when we are _not_ on a case-challenged filesystem... Ciao, Dscho -
Right. But what I'm saying is that this is *really* cheap to test for for US-ASCII-only characters, and if only 0.1% of all filenames have unicode in them, the fact that they are much mroe expensive isn't even going to be Note that one reason the above is tons faster is that even with complex unicode, the *common* case is going to be that the names match with a I actually suspect that we could, and nobody will notice. The hash would cause a few more collissions, but not so you'd know. And the thing is, people who work with other people who are on case-challenged systems would still want to have the case-insenstive compare too - although it should just warn, not actually "work". Linus -
To clarify: the thing I want to point out that the decision to *hash* the filenames in a case-insensitive hash, is very different from the decision to then *compare* the filenames when traversing the hash with a case-insensitive compare. And this difference is actually very important. Hashing things together that are "equivalent" according to any random rule is what makes it possible to then *check* for equivalence cheaply (because you only need to make the potentially expensive check with the subset of cases where it might trigger), but it in no way forces you to actually recode or mangle or compare things equivalently. In fact, I'd argue that this is what HFS+ did wrong in the first place: they had stupid/incompetent people who didn't understand about this, so they normalized the string *before* the hashing rather than as part of the hash itself, and thus actually corrupt the string itself. So what you can do (and I'd argue that we do) is to have a hash that can handle almost arbitrary input, but then never corrupt the filename, and always compare exactly by default. Then, depending on a config option, we can decide to change the compare so that equivalent (according to whatever rule) filenames either cause a warning (people on sane filesystems, but working with people who aren't), or are silently considered the same file (people on insane filesystems). Linus -
Linus Torvalds wrote: In general, there may be a large number of comparison function options that git will eventually support, and they will likely not all form a single chain of increasing "strictness". Given that the hash values aren't even being stored on disk (and if they were, a simple approach of also storing an identifier for the hash function to know whether they stored values are still valid could be used), having a chain of increasingly "strict" comparison functions and using a hash function that corresponds to the least strict one is useful for exactly one reason: giving (possibly several different levels of) non-fatal warnings for various types of duplicates. But since multiple hash functions will be needed anyway to support different notions of case-insensitivity, if the warning is not enabled, there is no reason to use a case-insensitive hash function with a byte-exact comparison. -- Jeremy Maitin-Shepard -
Hi, No, only multiple compare functions will be needed. The hash function can be built in such a manner that it guarantees that file names being equal with _any_ of the compare functions fall into the same bucket. The upside of such a hash function: less code to maintain. Hth, Dscho -
In theory, I agree that this is possible, but in practice it may not be reasonable at all. Consider two possible comparison functions: 1. compare file names as strings case-insensitively assuming a latin 1 encoding 2. compare file names as strings case-insensitively assuming a UTF-8 encoding Actually writing a hash function such that two strings hash to the same value if either of these comparison functions says that the strings are A simple hash function that doesn't try to do anything regarding case-insensitivity is extremely short and simple and therefore is hardly a maintenance burden. Although in some cases it is possible to "share" a hash function, except for the "warning" purpose, actually doing so doesn't make much sense. Using the "case-insensitive" hash function when you intend to use an "exact" comparison function just amounts to using a hash function that is unequivocally worse: it is slower, more complicated, and has a higher collision rate. -- Jeremy Maitin-Shepard -
Hi, You misunderstand me. If the complicated hash function is the one that is less exercised, you _will_ face problems. OTOH if you _already_ need the "complicated" hash function, there is _little_ point not to use it, and be consistent between platforms, _especially_ since now all people eat the same dog food. So I never thought about the simple hash function as being a burden. Hth, Dscho -
Once you start adding more "case folding" supported filesystems to the repertoire, such a unified hash function Dscho suggests needs to throw paths that other (N-1) "case folding" filesystems treat as distinct but only 1 filesystem treats "equivalent" into the same hash bucket. I would say not just difficult but the resulting function would have too many collisions to make it ineffective. -
Insofar as hashes go, it's not that shabby for hashing filenames. Here's the test output from a small hash-comparison program I've got, which runs the test-input through a series of different hashes to compare dispersion, collisions, lookup- and insert times and other things that are interesting from a practical PoV. Fowler/Noll/Vo cheap hash Collisions: 1829, 7.91%. Depth max: 3, average: 0.09. Time spent in lookups: 167.859ms. Per entry: 0.145us Time spent inserting: 2.753ms. Per entry: 0.119us PHP Zend cheap hash Collisions: 1908, 8.25%. Depth max: 3, average: 0.09. Time spent in lookups: 171.819ms. Per entry: 0.149us Time spent inserting: 2.778ms. Per entry: 0.120us Phong Vo's linear congruential hash Collisions: 1996, 8.63%. Depth max: 3, average: 0.09. Time spent in lookups: 168.276ms. Per entry: 0.146us Time spent inserting: 2.840ms. Per entry: 0.123us Phong Vo's second linear congruential hash Collisions: 1933, 8.36%. Depth max: 3, average: 0.09. Time spent in lookups: 170.416ms. Per entry: 0.147us Time spent inserting: 2.774ms. Per entry: 0.120us Glib string hash Collisions: 1907, 8.24%. Depth max: 3, average: 0.09. Time spent in lookups: 192.420ms. Per entry: 0.166us Time spent inserting: 3.154ms. Per entry: 0.136us sdbm hash Collisions: 1899, 8.21%. Depth max: 3, average: 0.09. Time spent in lookups: 170.797ms. Per entry: 0.148us Time spent inserting: 2.724ms. Per entry: 0.118us bfd hash Collisions: 1949, 8.43%. Depth max: 3, average: 0.09. Time spent in lookups: 206.504ms. Per entry: 0.179us Time spent inserting: 3.241ms. Per entry: 0.140us Linus' GIT hash Collisions: 1946, 8.41%. Depth max: 4, average: 0.09. Time spent in lookups: 179.336ms. Per entry: 0.155us Time spent inserting: 2.781ms. Per entry: 0.120us This is with 64K buckets, which (with my implementation) means a total hash-table size of 256KiB. The test-input is a simple file-listing from the linux kernel (although it has the .git directory included too). As you can see, ...
Actually, Bob Jenkins' lookup3 hash is twice faster in my tests than FNV, and also it is much less likely to have any collision. The description and some comparision with other hash can be found here: http://burtleburtle.net/bob/hash/doobs.html http://burtleburtle.net/bob/c/lookup3.c Perhaps, the second choice is Paul Hsieh's hash. http://www.azillionmonkeys.com/qed/hash.html Note: Paul Hsieh provides the table where he compares his hash with others. There is also the program he used. I ran his program on my computer, advantage of his over others was not so big on my computer. Moreover, his test includes an old version of Bob Jenkins' hash. The new version -- lookup3, which I mentione above, has about the same speed as Paul Hsieh's hash (with -O2) or even 12% faster when I used -O3 -march=athlon-xp. Also, Bob Jenkins' hash is better for non-x86 architectures. So, I believe it is the best hash for today. Dmitry -
--- FNV Hash I need to fill this in. Search the web for FNV hash. It's faster than my hash on Intel (because Intel has fast multiplication), but slower on most other platforms. Preliminary tests suggested it has decent distributions. --- My tests ran on Intel. I also noticed I had a few hashes commented out when doing the test, one of them being Paul Hsie's. For some reason, Jenkin's and Hsie's didn't perform well for me last time I used the comparison thing (I did a more thorough job back then, with tests running for several minutes per hash and table-size, so I commented out the poor candidates). I still believe that for this very simple case, the lookup3.c case is not very practical, as the code is that much more complicated, which was my main point with posting the comparison. Iow, not "switch to this hash, because it's better", but rather "the hash is not as bad as you think and will probably work well for all practical purposes". -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 -
I suspect that this paragraph was about comparison with lookup2 (not lookup3) because lookup3 beat easily all the "simple" hashes in my testing. Only competitor was Hsieh one which was like 50:50 If you don't mind few percent speed penalty compared to Jenkings own optimized version, you can use my simplified version: http://repo.or.cz/w/pgbouncer.git?a=blob;f=src/hash.c;h=5c9a73639ad098c296c0be562c3457... It works always with "native" endianess, unlike Jenkins fixed-endian hashlittle() / hashbig(). It may or may not matter if you plan to write values on disk. Speed-wise it may be 10-30% slower worst case (in my case sparc-classic with unaligned data), but on x86, lucky gcc version and maybe also memcpy() hack seen in system.h, it tends to be ~10% faster, especially as it does always 4byte read in main loop. -- marko -
It might be. It's from the link Dmitry posted in his reply to my original By how much? FNV beat Linus' hash by 0.01 microseconds / insertion, and 0.1 microsecons / lookup. We're talking about a case here where there will never be more lookups than insertions (unless I'm much I don't, but I don't care that deeply either. On the one hand, it would be nifty to have an excellent hash-function in git. On the other hand, it would look stupid with something that's It would have to be a significant improvement in wall-clock time on a test-case of hashing 30k strings to warrant going from 6 to 80 lines of code, imo. I still believe the original dumb hash Linus wrote is "good enough". On a side-note, it was very interesting reading, and I shall have to add jenkins3_mkreen() to my test-suite (although the "keep copyright note" license thing bugs me a bit). -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 -
Would you, for completeness' sake, please add Tcl and STL hashes to your test suite? The numbers are quite interesting. Is your test suite available somewhere, so we can test with our own data and hardware as well. Both Tcl hash and STL (from SGI probably HP days, still the current default with g++) string hashes are extremely simple (excluding the loop constructs): Tcl: h += (h<<3) + c; // essentially *9+c (but work better on non- late-intels) STL: h = h * 5 + c; // worse than above for most of my data Thanks, __Luke -
I could do that. Or I just publish the entire ugly thing and let someone Not yet, no. I usually munge it up quite a lot when I want to test hashes They sure do look simple enough. As for loop constructs, I've tried to use the same looping mechanics for everything, so as to let the algorithm be the only difference. Otherwise it gets tricky to do comparisons. The exceptions are ofcourse hashes relying on Duff's device or similar alignment trickery. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 -
FNV is around 40% slower than lookup3 on my Intel Core CPU, on 4byte aligned Jenkins hash is fast because it does not look at individual bytes. If you _do_ want to look at them for unrelated reasons, (case-insensitive, unicode-juggling), then it obiously loses the point. That is, if you Well, ad-hoc dumb hashes may have horrible worst-cases that you cant see with light testing. Therefore I'd still suggest some better Sorry. I just used template boilerplate. Considering all the hard work was done by other people, it not proper to put under my own license. I tagged the file as 'public domain' and pushed out. Btw, the reason I started cleaning lookup3 was that at first I was scared of the complexity of Jenkins code and decided to go with Hsieh hash. Then I found out that Hsieh code is under totally werdo license (http://www.azillionmonkeys.com/qed/weblicense.html) so I could not use it. ==================================================================== Here is my raw-speed test of different hashes. Input is 4-byte aligned which should be common case for malloc()-ed strings. This also is best case for original lookup3(), on unaligned input the memcpy variants beat it easily. Input string length varies randomly in range 0..64. own_memcpy - last 12-byte memcpy() calls out to libc memcpy_hack - last memcpy is inlined bytewise copy loop: while (len--) *dst++ = *src++; Note that is is raw-speed test, if you benchmark larger code the hash difference probably matters less. -------------------------------------------------------------------- Testing: seed=34 align=4 minlen=0 maxlen=64 trycnt=2 duration=10 lookup3 : try=0: ... 247.4880 MB/s lookup3 : try=1: ... 247.6154 MB/s own_memcpy: try=0: ... 223.5508 MB/s own_memcpy: try=1: ... 223.5830 MB/s memcpy_hack: try=0: ... 241.2241 MB/s memcpy_hack: try=1: ... 241.2492 MB/s lookup2 : try=0: ... 190.2697 MB/s lookup2 : try=1: ... 190.3283 MB/s fnv : try=0: ... 153.0318 MB/s fnv : try=1: ... ...
But the tests surely need to check for unaligned cases, as that's what
I believe the ability to add unicode-juggling was a major point
with the patch, so perhaps Jenkins' isn't such a good option.
I'm not familiar with data-mangling the way Linus (or Theo Tso
True. FNV is used in both MySQL and PostgreSQL. I'd say it's safe to
Thanks. I'll see if I can add it, although it'll probably have to
wait until I have reason to dig into it at work again. I've added
the hash to the code-base, but not yet incorporated it into the
True. I was in contact with him a while back since I wanted to use it
in an opensource project, but the licensing issues made me go with
another one instead. The patch got turned down anyways, so it was a
Unless arena allocated, like we do in git.
I'm not surprised that this test favours Jenkin's and Hsie's.
That's to be expected as those benefit far more than simpler
hashing algorithms for long strings. The overhead when trying
shorter strings (say, between 3 and 15 chars, and not necessarily
Well, memcpy() isn't very interesting to compare against
hashes, as they test vastly different parts of the hardware's
parts' performance. memcpy() should also perform exactly the
same no matter what the test-data, which isn't always true for
hashes.
What *would* be interesting would be something along the lines
of "duff_cpy()": ie, an unrolled loop that aligns itself and
copies each byte to the same address each time.
The bytewise equivalence would ofcourse be
magic_cpy(unsigned char *k, int len)
{
unsigned char magic;
do {
magic = *k++;
} while (--len);
Interesting output, but not very surprising. Do you have the code
available somewhere?
--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
Ok, here is 0..15 chars, random alignment: Testing: seed=34 align=0 minlen=0 maxlen=15 trycnt=2 duration=10 lookup3 : try=0: ... 69.8092 MB/s lookup3 : try=1: ... 69.8146 MB/s own_memcpy: try=0: ... 66.7808 MB/s own_memcpy: try=1: ... 66.7814 MB/s memcpy_hack: try=0: ... 74.0635 MB/s memcpy_hack: try=1: ... 74.0518 MB/s lookup2 : try=0: ... 68.6582 MB/s lookup2 : try=1: ... 68.6634 MB/s fnv : try=0: ... 74.5098 MB/s fnv : try=1: ... 74.5283 MB/s hsieh : try=0: ... 71.6708 MB/s hsieh : try=1: ... 71.6814 MB/s oat : try=0: ... 74.7828 MB/s oat : try=1: ... 74.7716 MB/s elf : try=0: ... 65.2077 MB/s elf : try=1: ... 65.2128 MB/s Results compared to reference: lookup3 : 100.000 % own_memcpy : 95.659 % memcpy_hack : 106.082 % lookup2 : 98.351 % fnv : 106.743 % hsieh : 102.670 % oat : 107.112 % I can put it out. -- marko -
I don't think you can any meaningful unicode-juggling without converting symbols to UCS-4, and after that it makes much more sense to operate with uint32 than bytes. So, Jenkins' hash is still relevant, just because it does not operate on single bytes, but using uint32. Dmitry -
No, no, no, NO!
Egads! Why do people constantly do these totally idiotic things for
Unicode?
You can do a perfectly fine 8-bytes-at-a-time hash for almost 100% of all
source code projects in UTF-8, without *ever* doing any format changes at
all. Admittedly, it's a lot easier if the hash is a pure in-memory one (ie
we don't care about byte-order or size of integers or anything like that),
but that's the common case for most hashes that aren't used for BTree
lookup on disk or something like that.
Here, let me show you:
unsigned int name_hash(const char *name, int size)
{
hash = HASH_INIT;
do {
unsigned char c;
if (size >= sizeof(long)) {
unsigned long val = get_unaligned_long(name);
if (!(val & 0x8080808080808080)) {
/* Make it equivalent in case */
val &= ~0x2020202020202020;
hash = hash_long(hash, val);
name += sizeof(long);
size -= sizeof(long);
continue;
}
}
c = *name;
if (!(c & 0x80)) {
hash = hash_long(hash, c & ~0x20);
name++;
size--;
continue;
}
/* This is the unusual and slowish case */
hash = hash_utf8_char(hash, c, &name, &size);
} while (size);
return hassh;
}
and then the only point you ever do that actual UTF8->codepoint conversion
is for that "high bit set" case.
A few things to note on the above:
- the hash obviously has "odd" characteristics. We're not necessarily
hashing characters at a time at all, and the alignment of characters
with high bits *within*the*string* will make a difference to how we
hash them.
But it's also important that the "get_unaligned_long()" means that the
alignment of the string itself doesn't matter, so its' purely a
"chunking within the string" thing, and the alignment of the string
itself won't affect the hash value
- I'm not writing out hash_utf8_char(), because it's certainly not
totally trivial, but it's not *really* complex either. The
nontriviality ...It is better to use 'while' instead of 'if' here, i.e.:
while (!((c = *name) & 0x80)) {
hash = hash_long(hash, c & ~0x20);
name++;
if (!--size)
return hash;
Dmitry
-
Yes, that looks like a good further micro-optimization. Linus -
Well, although this is very clever approach, I suggest against it. You'll end up with complex code that gives out substandard results. I think its better to have separate case-folding function (or several), that copies string to temp buffer and then run proper optimized hash function on that buffer. That way you can use already tested building blocks and can optimize both sides separately. Eg. the folding-only function can aswell be optimized to load 4 or 8-byte at-a-time. This also isolates hashing from exact details how folding happens to access the input string which seem to be the weak point in your approach. (In both collision and complexity sense.) Such temp buffer happens to fits my lookup3_memcpy also better (heh). Its weak point is that on platforms that do not allow unaligned access, it degenerates to byte-by-byte loading. But if know you always have aligned buffer, you can notify gcc to do 4-byte fetch there too. It should be as simple as tagging data pointer as uint32_t *. If your input strings are over kilobyte on average then I'd agree with you, but if you process 20-30 bytes on average, is the additional complexity worth it? -- marko -
I'm sorry, but you just cannot do that efficiently and portably. I can write a hash function that reliably does 8 bytes at a time for the common case on a 64-bit architecture, exactly because it's easy to do "test high bits in parallel" with a simple bitwise 'and', and we can do the same with "approximate lower-to-uppercase 8 bytes at a time" for a hash by just clearing bit 5. In contrast, trying to do the same thing in half-way portable C, but being limited to having to get the case-folding *exactly* right (which you need for the comparison function) is much much harder. It's basically impossible in portable C (it's doable with architecture-specific features, ie vector extensions that have per-byte compares etc). And hashing is performance-critical, much more so than the compares (ie you're likely to have to hash tens of thousands of files, while you will only compare a couple). So it really is worth optimizing for. And the thing is, "performance" isn't a secondary feature. It's also not something you can add later by optimizing. It's also a mindset issue. Quite frankly, people who do this by "convert to some folded/normalized form, then do the operation" will generally make much more fundamental mistakes. Once you get into the mindset of "let's pass a corrupted strign around", you are in trouble. You start thinking that the corrupted string isn't really "corrupt", it's in an "optimized format". And it's all downhill from there. Don't do it. Linus -
Side note: you *can* get better approximations fairly cheaply if you care.
If you want to distinguish the characters 0-31 from the characters 31-63
in your hash (pointless for filenames, but it can be worthwhile for some
other string cases), you can decide to clear bit#5 only if bit#6 in that
byte was also set, with just a few bitwise operations.
Eg, imagine that you have "unsigned long x" containing eight bytes of
ascii data (ie you already did the test by 0x8080808080808080), you can do
things like
unsigned long bit6 = x & 0x4040404040404040;
x &= ~(bit6 >> 1);
which will only clear bit5 if bit6 in the same byte was set..
So you can do tricks like that, and it will still be plenty fast. And
nobody will ever care that while it collides 'A' with 'a' (by design), it
also causes '{' and '[' to be considered "case collisions".
[ Amusing side note: '{' and '[' *are* case collisions in legacy 7-bit
"Finnish ASCII". The editor I use still "upper-cases" '{' to '['. I'm
not kidding, and yes, it really does it on purpose!
It used to be that before everybody turned to Latin1, the {|} characters
were re-used in Finland (and Sweden, for that matter) for the
extra characters needed in Finnish. Because obviously nobody ever
needed them for any real work.
I (and probably every Finnish C UNIX programmer) used to be very good at
reading C source code even when it was full of odd finnish characters
with dots on top, instead of curly braces! ]
And yes, from a performance standpoint, things liek this probably do realy
matter. For the kernel tree, the average pathname length is ~28
characters. If you can do it with three iterations that do the first 24
characters eight characters at a time, and then four iterations over the
four last ones, rather than 28 iterations with byte->longword and
multiplications in each, I bet it's quite visible.
Of course, it's going to be visible only if everything else is fast too,
but git has been pretty ...Here you misunderstood me, I was proposing following:
int hash_folded(const char *str, int len)
{
char buf[512];
do_folding(buf, str, len);
return do_hash(buf, len);
}
That is - the folded string should stay internal to hash function.
Only difference from combined foling+hashing would be that
Againg, you seem to keep HFS+ behaviour in mind, but that was
not what I did suggest. Probably my mistake, sorry.
--
marko
-
If it's internal, it's much better, but you still missed the performance
angle.
The fact is, hashing can take shortcuts that folding cannot do!
Case folding, by definition, has to be "exact" (since the whole point is
what you're going to use the same folding function to do the compare, so
if you play games with folding, the compares will be wrong).
But hashing doesn't have to be exact. It's ok to hash '{' and '[' as if
they were different cases of the same character, if that gives you a
faster hash function. Especially as those charactes are rather rare in
filenames.
So if you do hashing as a function of its own, you can simply do a better
job at it.
I do agree that the functions that create a folded set of characters from
a _complex_ UTF-8 character should be shared between folding and hashing,
since that code is too complex and there are no simple shortcuts for doing
a faster hash that still retains all the properties we want.
Linus
-
Let's rename do_folding as something else, because it is not a real
folding, but a preparation step for hash calculation. Keeping these
steps separately simplifies the code, and allows further optimization,
for instance, you do not need this do_folding step on a case-sensitive
filesystem. Though it is certainly possible to mix both steps together,
it bloats the code and makes it less readable. Of course, the idea to
avoid a temporary buffer and do everything at once is very appealing,
so I gave it a try -- and here is a 32-bit version of name_hash(), but
I am not very happy with the result:
#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))
#define mix(a,b,c) \
{ \
a -= c; a ^= rot(c, 4); c += b; \
b -= a; b ^= rot(a, 6); a += c; \
c -= b; c ^= rot(b, 8); b += a; \
a -= c; a ^= rot(c,16); c += b; \
b -= a; b ^= rot(a,19); a += c; \
c -= b; c ^= rot(b, 4); b += a; \
}
#define final(a,b,c) \
{ \
c ^= b; c -= rot(b,14); \
a ^= c; a -= rot(c,11); \
b ^= a; b -= rot(a,25); \
c ^= b; c -= rot(b,16); \
a ^= c; a -= rot(c,4); \
b ^= a; b -= rot(a,14); \
c ^= b; c -= rot(b,24); \
}
#define hash_value(x) \
hs[hp] += (x); \
if (++hp == 3) { \
mix (hs[0], hs[1], hs[2]); \
hp = 0; \
}
unsigned int name_hash(const char *name, unsigned size)
{
unsigned hp = 0;
unsigned hs[3];
hs[0] = hs[1] = hs[2] = 0xdeadbeef + size;
do {
unsigned char c;
if (size >= sizeof(unsigned)) {
unsigned val = get_unaligned_uint(name);
if (!(val & 0x80808080)) {
val &= ~0x20202020;
hash_value(val);
name += sizeof(val);
size -= sizeof(val);
continue;
}
}
while (!((c = *name) & 0x80)) {
hash_value(c & ~0x20);
name++;
if (!--size)
goto done:
}
do {
// TODO: add denormalization for Mac
unsigned val = towupper (utf8_to_wchar(&name, &size));
hash_value(val);
} while (size && (*name & 0x80));
} while (size);
done:
if (hp)
final(a,b,c);
return hs[2];
}
Dmitry
-
Hi, <irony>Oh yes, let's take this one, it is so much shorter, cleaner and overall more elegant than Linus' code.</irony> Ciao, Dscho -
Hi, I am not sure what you meant by that, and what exactly code you meant saying Linus' code.... Anyway, my point was that mixing both steps together does not look very nice, and the code was intended to demonstrate why. Dmitry -
Well, you can always have fold_quick_and_dirty() function that is used only internally in hash_folded() function, which can: - fold with simple |= 0x20202020.. - write out full uint32/64, no need to make result proper string - zero-fill at the end, so hash function does not need to check for partial block, which is pretty expensive part of hashing. The win would be: - more modularized code - can use faster/any hash - hash function can be certain to work on aligned data (win on non-x86) The minus: - some memory i/o overhead which may or may not matter - the parts would not be fully generic, but special to hashing -- marko PS. Typo in last mail - "inner loop should be reversible - that means that details from beginning of data should shift out of horizon." That obviously means "data should _not_ shift out of horizon. btw, "reversible" for integer hashes means that there is 1:1 mapping between input and output - no collisions. Thus no info loss. -
If a string is short, it will probably reside in the processor cache, so there is no real memory i/o overhead here. For more longer strings, it may be better to do that in short chunks, so each chunk can reside in the cache. But I don't think filenames are long, so it is not an issue I think the second part can be rather generic to be reused for hashing something else that does not require filename specific case-folding. Dmitry -
Ok, you seem to focus on case folding and general performance, I focus on hash quality and code complexity. Considering you may want several folding methods, it seemed to me that it would be good to separate the two aspects. But I'll try to follow your path for a moment. Hashing 32 or 64 bits at a time is not trivial, eg. you cannot use same algorithm for both cases, 64 bits requires twice the work to mix well. Per Jenkins notes, hash inner loop should be reversible - that means that details from beginning of data should shift out of horizon. But final mixing should achieve avalanche - that means each bit in input should affect 50% bits in output. Also must be noted that we are mixing 32->32 and 64->64 instead of the usual 8->32. So it seems that the best bet would be to use integer hash functions as core. Jenkins himself points to Thomas Wang (http://www.cris.com/~Ttwang/tech/inthash.htm) who has good integer mix functions for both 32 and 64 bits. Integer hash function must have both reversibility and avalance, so they may be slightly overkill for this purpose, but fixing that means lot of work. So here is what I propose for hashing, if you really want to have combined folding + hashing: /* Thomas Wang integer hash functions */ static inline uint32_t hash32(uint32_t key) { key = ~key + (key << 15); key = key ^ (key >> 12); key = key + (key << 2); key = key ^ (key >> 4); key = key * 2057; key = key ^ (key >> 16); return key; } static inline uint64_t hash64(uint64_t key) { key = (~key) + (key << 21); // key = (key << 21) - key - 1; key = key ^ (key >> 24); key = (key + (key << 3)) + (key << 8); // key * 265 key = key ^ (key >> 14); key = (key + (key << 2)) + (key << 4); // key * 21 key = key ^ (key >> 28); key = key + (key << 31); return key; } /* * Simple addition should be enough for new values, * considering the mix ...
http://pgbouncer.projects.postgresql.org/hashtest/hashtest-2008-01-25.tgz I cleaned it up a bit. Also fixed a bug - the lookup3 was called via wrapper function so it acually is tiny bit faster than the results show. You can consider the code as public domain too. Also remember the was not meant to be published, so it rather hack... -- marko -
I believe that under words "my hash", Bob Jenkins meant lookup2, which I expected that Paul Hsieh's hash may not do well on some architecture, I would not describe lookup3 as impractical. It is widely used and well tested. Perhaps, for some Intel CPUs, the difference in speed is not so big, and FNV hash is much smaller and simpler, so FNV is a reasonable choice, but the hash is twice slower on my AMD processor and I suspect it may be even worse on other CPUs, where integer multiplication is slow. Besides, it may turn out that hashing filename may be not only case where a fast hash is needed. Dmitry -
From /proc/cpuinfo. It's the best I can do without going to our purchase department and asking for the spec so I can contact the vendor and get the real thing. Dualcore shouldn't matter for this test, as it isn't threaded. It doesn't do that well on certain types of data, in my experience. It does have excellent dispersion, so with very long strings it's usually the Ah well. I think once the patch is in master, it will be easy enough to test and verify different algorithms. Since it's intended for in-memory data only, it's no problem to have several algorithms and pick the one most suitable for the architecture we're compiling for. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 -
Hashing filenames is pretty easy. You can do a reasonable job with any
"multiply by an odd number, add in value". Picking the odd number is a
random choice, some are better than others (and it depends on whether you
end up always having a power-of-two bucket size etc), but it generally
won't be horrible.
And considering that we generally won't have tons and tons of pathnames
(ie we'd generally have thousands, not millions), and that the underlying
hash not only resizes itself but actually uses the ful 32 bits as the
lookup key, I don't worry too much. I suspect my random choice is fine.
So no, I didn't think my hash would _suck_, although I also didn't
research which odd numbers to pick.
No, it's a joke because it doesn't really give an example of how to do the
_expected_ hash collissions.
Here's another hash that is actually going to collide *much* more (and on
purpose!), but is actually showing an example of something that actually
hashes UTF strings so that upper-case and lower-case (and normalization)
will all still hash to the same value:
static unsigned int hash_name(const char *name, int namelen)
{
unsigned int hash = 0x123;
do {
unsigned char c = *name++;
if (c & 0x80)
c = 0;
c &= ~0x20;
hash = hash*101 + c;
} while (--namelen);
return hash;
}
but the above does so by making the hash much much worse (although
probably still acceptable for "normal source code name distributions"
that don't have very many same-name-in-different-cases and high bit
characters anyway).
The above is still fairly fast, but obviously at a serious cost in hash
goodness, to the point of being totally unusable for anybody who uses
Asian characters in their filenames.
To actually be really useful, you'd have to teach it about the character
system and do a lookup into a case/normalization table.
So *that* is mainly why it's a joke. But it should work fine for the case ...