Linus Torvalds wrote:Insofar as hashes go, it's not that shabby for hashing filenames. Here's the test output from a small hash-comparison program I've got, which runs the test-input through a series of different hashes to compare dispersion, collisions, lookup- and insert times and other things that are interesting from a practical PoV. Fowler/Noll/Vo cheap hash Collisions: 1829, 7.91%. Depth max: 3, average: 0.09. Time spent in lookups: 167.859ms. Per entry: 0.145us Time spent inserting: 2.753ms. Per entry: 0.119us PHP Zend cheap hash Collisions: 1908, 8.25%. Depth max: 3, average: 0.09. Time spent in lookups: 171.819ms. Per entry: 0.149us Time spent inserting: 2.778ms. Per entry: 0.120us Phong Vo's linear congruential hash Collisions: 1996, 8.63%. Depth max: 3, average: 0.09. Time spent in lookups: 168.276ms. Per entry: 0.146us Time spent inserting: 2.840ms. Per entry: 0.123us Phong Vo's second linear congruential hash Collisions: 1933, 8.36%. Depth max: 3, average: 0.09. Time spent in lookups: 170.416ms. Per entry: 0.147us Time spent inserting: 2.774ms. Per entry: 0.120us Glib string hash Collisions: 1907, 8.24%. Depth max: 3, average: 0.09. Time spent in lookups: 192.420ms. Per entry: 0.166us Time spent inserting: 3.154ms. Per entry: 0.136us sdbm hash Collisions: 1899, 8.21%. Depth max: 3, average: 0.09. Time spent in lookups: 170.797ms. Per entry: 0.148us Time spent inserting: 2.724ms. Per entry: 0.118us bfd hash Collisions: 1949, 8.43%. Depth max: 3, average: 0.09. Time spent in lookups: 206.504ms. Per entry: 0.179us Time spent inserting: 3.241ms. Per entry: 0.140us Linus' GIT hash Collisions: 1946, 8.41%. Depth max: 4, average: 0.09. Time spent in lookups: 179.336ms. Per entry: 0.155us Time spent inserting: 2.781ms. Per entry: 0.120us This is with 64K buckets, which (with my implementation) means a total hash-table size of 256KiB. The test-input is a simple file-listing from the linux kernel (although it has the .git directory included too). As you can see, it loses out on mathematical correctness, as it has more collisions (but not that many). OTOH, the simplicity of the implementation makes it a viable option anyway, since the time spent in insertion (which is almost completely inside the hash-function) is on average so much shorter. For a temporary hash-table, it will work splendidly. It will probably have issues scaling to, say, 30 or 40 million input lines (fairly typical test-data for database hashes, fe). Note that real-world timings will be shorter. My test-program doesn't have the luxury of keeping hash-functions inline, so some compiler optimizations become impossible. This is on a Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz (according to /proc/cpuinfo), with a cache size of 4meg. The use of multiplication is sane though, as it means no arch will suffer greatly. The FNV hash would be better (pasted below), but I doubt anyone will ever care, and there will be larger differences between architectures with this one than the lt_git hash (well, a function's gotta have a name). /* * Fowler/Noll/Vo hash * * The basis of the hash algorithm was taken from an idea sent by email to the * IEEE Posix P1003.2 mailing list from Phong Vo (kpv@research.att.com) and * Glenn Fowler (gsf@research.att.com). Landon Curt Noll (chongo@toad.com) * later improved on their algorithm. * * The magic is in the interesting relationship between the special prime * 16777619 (2^24 + 403) and 2^32 and 2^8. * * This hash produces the fewest collisions of any function that we've seen so * far, and works well on both numbers and strings. * * (Last comment from MySQL code) * */ u32 FNV1(u8 *k, u32 len) { u8 *e; u32 h; e = k + len; for (h = 0; k < e; k++) { h *= 16777619; h ^= *k; } return (h); } I could provide figures for other table-sizes too, if anyone's interested. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
| David Miller | Slow DOWN, please!!! |
| KAMEZAWA Hiroyuki | Re: 2.6.22-rc1-mm1 |
| Steven Rostedt | [RFC PATCH 1/3] Unified trace buffer |
| Steven Rostedt | [RFC PATCH 0/6] Convert all tasklets to workqueues |
git: | |
| Peter Klavins | Re: CRLF problems with Git on Win32 |
| J. Bruce Fields | Re: Git User's Survey 2007 unfinished summary continued |
| Linus Torvalds | Re: VCS comparison table |
| Junichi Uekawa | Re: [ANNOUNCE] GIT 1.5.4 |
| Arjan van de Ven | Re: [GIT]: Networking |
| Rémi | [PATCH 0/6] [RFC] Phonet pipes protocol (v2) |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Jozsef Kadlecsik | Re: TCP connection stalls under 2.6.24.7 |
| Richard Stallman | Real men don't attack straw men |
| Rogier Krieger | Re: bcw(4) is gone |
| Leon Dippenaar | New tcp stack attack |
| Brandon Lee | DELL PERC 5iR slow performance |
| high memory | 6 hours ago | Linux kernel |
| semaphore access speed | 9 hours ago | Applications and Utilities |
| the kernel how to power off the machine | 10 hours ago | Linux kernel |
| Easter Eggs in windows XP | 13 hours ago | Windows |
| Shared swap partition | 14 hours ago | Linux general |
| Root password | 14 hours ago | Linux general |
| Where/when DNOTIFY is used? | 16 hours ago | Linux kernel |
| How to convert Linux Kernel built-in module into a loadable module | 18 hours ago | Linux kernel |
| Linux 2.6.24 and I/O schedulers | 19 hours ago | Linux kernel |
| USB Driver -- Interrupt Polling -- A Little Help Please | 1 day ago | Linux general |
