login
Header Space

 
 

Re: I'm a total push-over..

Score:
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: Linus Torvalds <torvalds@...>
Cc: Dmitry Potapov <dpotapov@...>, Andreas Ericsson <ae@...>, Git Mailing List <git@...>, Junio C Hamano <gitster@...>
Date: Saturday, January 26, 2008 - 8:37 am

On 1/26/08, Linus Torvalds <torvalds@linux-foundation.org> wrote:

Ok, you seem to focus on case folding and general performance,
I focus on hash quality and code complexity.  Considering you
may want several folding methods, it seemed to me that it would
be good to separate the two aspects.

But I'll try to follow your path for a moment.

Hashing 32 or 64 bits at a time is not trivial, eg. you cannot
use same algorithm for both cases, 64 bits requires twice the
work to mix well.

Per Jenkins notes, hash inner loop should be reversible - that
means that details from beginning of data should shift out of horizon.
But final mixing should achieve avalanche - that means each bit in
input should affect 50% bits in output.

Also must be noted that we are mixing 32->32 and 64->64 instead
of the usual 8->32.  So it seems that the best bet would be to use
integer hash functions as core.  Jenkins himself points to Thomas
Wang (http://www.cris.com/~Ttwang/tech/inthash.htm) who has good
integer mix functions for both 32 and 64 bits.

Integer hash function must have both reversibility and avalance,
so they may be slightly overkill for this purpose, but fixing that
means lot of work.

So here is what I propose for hashing, if you really want to have
combined folding + hashing:

/* Thomas Wang integer hash functions */

static inline uint32_t hash32(uint32_t key)
{
        key = ~key + (key << 15);
        key = key ^ (key >> 12);
        key = key + (key << 2);
        key = key ^ (key >> 4);
        key = key * 2057;
        key = key ^ (key >> 16);
        return key;
}

static inline uint64_t hash64(uint64_t key)
{
        key = (~key) + (key << 21); // key = (key << 21) - key - 1;
        key = key ^ (key >> 24);
        key = (key + (key << 3)) + (key << 8); // key * 265
        key = key ^ (key >> 14);
        key = (key + (key << 2)) + (key << 4); // key * 21
        key = key ^ (key >> 28);
        key = key + (key << 31);
        return key;
}

/*
 * Simple addition should be enough for new values,
 * considering the mix functions does work well.
 */

/* this is functon to use in git */
static inline unsigned long hash_long(unsigned long hash, unsigned long val)
{
        if (sizeof(long) == 8)
                return hash64(hash + val);
        else
                return hash32(hash + val);
}

/* below is regular hash for testing */

static uint32_t inline hash_int32(uint32_t hash, uint32_t val)
{
        return hash32(hash + val);
}

static uint64_t inline hash_int64(uint64_t hash, uint64_t val)
{
        return hash64(hash + val);
}

/* hack to avoid call to libc memcpy() */
static inline void simple_memcpy(void *_dst, const void *_src, unsigned len)
{
        const uint8_t *src = _src;
        uint8_t *dst = _dst;
        while (len--)
                *dst++ = *src++;
}

uint32_t int32_hash(const void *_data, unsigned int size)
{
        const uint8_t *src = _data;
        /* inital value.  +size avoids \0 and \0\0 hashing same */
        uint32_t hash = 1234567890 + size;
        uint32_t val;
        while (size >= 4) {
                memcpy(&val, src, 4); /* direct load on x86/64 */
                src += 4;
                size -= 4;
                hash = hash_int32(hash, val);
        }
        if (size > 0) {
                val = 0;
                simple_memcpy(&val, src, size);
                hash = hash_int32(hash, val);
        }
        return hash;
}

uint32_t int64_hash(const void *_data, unsigned int size)
{
        const uint8_t *src = _data;
        uint64_t hash = 12345678901234567890ULL + size;
        uint64_t val;
        while (size >= 8) {
                memcpy(&val, src, 8); /* direct load on x86/64 */
                hash = hash_int64(hash, val);
                src += 8;
                size -= 8;
        }
        if (size > 0) {
                val = 0;
                simple_memcpy(&val, src, size);
                hash = hash_int64(hash, val);
        }
        /* here we go to 32 bits, simple masking is enough */
        return hash;
}


In the "regular" hash functions I again use the memcpy() trick, because
I don't want to bother with access optimizations.  Especially considering
that part is unnecessary for git.


Intel Core Duo (32bit).  String length 0 .. 40, random alignment:
-------------------------------------------------------------------


Testing: seed=34 align=0 minlen=0 maxlen=40 trycnt=3 duration=10

lookup3             :  #0 .. 165.489  #1 .. 165.494  #2 .. 165.490 MB/s
int32_hash          :  #0 .. 148.359  #1 .. 148.350  #2 .. 148.435 MB/s
int64_hash          :  #0 .. 123.105  #1 .. 123.040  #2 .. 123.039 MB/s
lookup3_memcpy_hack :  #0 .. 169.791  #1 .. 169.795  #2 .. 169.749 MB/s
oat                 :  #0 .. 134.737  #1 .. 134.702  #2 .. 134.735 MB/s
fnv                 :  #0 .. 131.457  #1 .. 131.470  #2 .. 131.474 MB/s
hsieh               :  #0 .. 166.619  #1 .. 166.622  #2 .. 166.588 MB/s

Results compared to reference:

lookup3             : 100.000 %
int32_hash          :  89.661 %
int64_hash          :  74.361 %
lookup3_memcpy_hack : 102.591 %
oat                 :  81.409 %
fnv                 :  79.441 %
hsieh               : 100.676 %


AMD Opteron(tm) Processor 252 (64bit) 2.6GHz
-------------------------------------------------

Testing: seed=34 align=0 minlen=0 maxlen=40 trycnt=3 duration=10

lookup3             :  #0 .. 208.819  #1 .. 208.877  #2 .. 208.897 MB/s
int32_hash          :  #0 .. 181.096  #1 .. 181.100  #2 .. 181.097 MB/s
int64_hash          :  #0 .. 196.823  #1 .. 196.761  #2 .. 196.825 MB/s
lookup3_memcpy_hack :  #0 .. 201.593  #1 .. 201.597  #2 .. 201.594 MB/s
oat                 :  #0 .. 160.769  #1 .. 160.774  #2 .. 160.772 MB/s
fnv                 :  #0 .. 200.046  #1 .. 200.044  #2 .. 200.046 MB/s
hsieh               :  #0 .. 205.515  #1 .. 205.520  #2 .. 205.517 MB/s

Results compared to reference:

lookup3             : 100.000 %
int32_hash          :  86.706 %
int64_hash          :  94.225 %
lookup3_memcpy_hack :  96.519 %
oat                 :  76.974 %
fnv                 :  95.778 %
hsieh               :  98.398 %


So speedwise the result is not bad.  Especially considering unoptimized
data fetching.  On larger data (~1k) is tends to lose to lookup3 more,
I guess lookup3 parallelizes better (3x 32bit int vs. 1x 32/64 int).

The functions pass Jenkins lookup3 selftest that eg. FNV does not.

The code is also available at:

 http://pgbouncer.projects.postgresql.org/hashtest/hashtest-2008-01-26.tgz


-- 
marko
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
I'm a total push-over.., Linus Torvalds, (Tue Jan 22, 7:37 pm)
Re: I'm a total push-over.., Andreas Ericsson, (Wed Jan 23, 4:32 am)
Re: I'm a total push-over.., Linus Torvalds, (Wed Jan 23, 12:06 pm)
Re: I'm a total push-over.., Dmitry Potapov, (Wed Jan 23, 5:15 am)
Re: I'm a total push-over.., Andreas Ericsson, (Wed Jan 23, 5:31 am)
Re: I'm a total push-over.., Dmitry Potapov, (Wed Jan 23, 1:10 pm)
Re: I'm a total push-over.., Andreas Ericsson, (Thu Jan 24, 6:39 am)
Re: I'm a total push-over.., Marko Kreen, (Wed Jan 23, 10:01 am)
Re: I'm a total push-over.., Andreas Ericsson, (Wed Jan 23, 10:39 am)
Re: I'm a total push-over.., Marko Kreen, (Thu Jan 24, 9:19 am)
Re: I'm a total push-over.., Andreas Ericsson, (Thu Jan 24, 12:00 pm)
Re: I'm a total push-over.., Marko Kreen, (Fri Jan 25, 4:08 pm)
Re: I'm a total push-over.., Dmitry Potapov, (Thu Jan 24, 12:28 pm)
Re: I'm a total push-over.., Linus Torvalds, (Thu Jan 24, 1:15 pm)
Re: I'm a total push-over.., Marko Kreen, (Fri Jan 25, 4:52 pm)
Re: I'm a total push-over.., Linus Torvalds, (Fri Jan 25, 6:16 pm)
Re: I'm a total push-over.., Marko Kreen, (Sat Jan 26, 8:37 am)
Re: I'm a total push-over.., Marko Kreen, (Sat Jan 26, 8:16 am)
Re: I'm a total push-over.., Linus Torvalds, (Sun Jan 27, 2:51 am)
Re: I'm a total push-over.., Marko Kreen, (Sun Jan 27, 5:45 am)
Re: I'm a total push-over.., Dmitry Potapov, (Sun Jan 27, 11:06 am)
Re: I'm a total push-over.., Dmitry Potapov, (Sun Jan 27, 4:21 am)
Re: I'm a total push-over.., Johannes Schindelin, (Sun Jan 27, 10:07 am)
Re: I'm a total push-over.., Dmitry Potapov, (Sun Jan 27, 10:48 am)
Re: I'm a total push-over.., Linus Torvalds, (Fri Jan 25, 6:35 pm)
Re: I'm a total push-over.., Dmitry Potapov, (Thu Jan 24, 2:45 pm)
Re: I'm a total push-over.., Linus Torvalds, (Thu Jan 24, 3:08 pm)
Re: I'm a total push-over.., Marko Kreen, (Thu Jan 24, 12:13 pm)
Re: I'm a total push-over.., Luke Lu, (Thu Jan 24, 2:51 am)
Re: I'm a total push-over.., Andreas Ericsson, (Thu Jan 24, 6:24 am)
Re: I'm a total push-over.., Junio C Hamano, (Tue Jan 22, 10:23 pm)
Re: I'm a total push-over.., Linus Torvalds, (Tue Jan 22, 10:58 pm)
Re: I'm a total push-over.., Junio C Hamano, (Wed Jan 23, 3:23 am)
Re: I'm a total push-over.., Johannes Schindelin, (Wed Jan 23, 8:25 am)
Re: I'm a total push-over.., Linus Torvalds, (Wed Jan 23, 12:25 pm)
Re: I'm a total push-over.., Johannes Schindelin, (Wed Jan 23, 12:34 pm)
Re: I'm a total push-over.., Linus Torvalds, (Wed Jan 23, 1:09 pm)
Re: I'm a total push-over.., Linus Torvalds, (Wed Jan 23, 1:29 pm)
Re: I'm a total push-over.., Jeremy Maitin-Shepard, (Fri Jan 25, 1:21 am)
Re: I'm a total push-over.., Johannes Schindelin, (Fri Jan 25, 8:51 am)
Re: I'm a total push-over.., Jeremy Maitin-Shepard, (Fri Jan 25, 2:19 pm)
Re: I'm a total push-over.., Junio C Hamano, (Fri Jan 25, 3:07 pm)
Re: I'm a total push-over.., Johannes Schindelin, (Fri Jan 25, 2:24 pm)
Re: I'm a total push-over.., Linus Torvalds, (Tue Jan 22, 11:19 pm)
Re: I'm a total push-over.., Junio C Hamano, (Fri Jan 25, 2:50 am)
Re: I'm a total push-over.., Linus Torvalds, (Fri Jan 25, 12:24 pm)
Re: I'm a total push-over.., Junio C Hamano, (Tue Jan 22, 10:36 pm)
Re: I'm a total push-over.., Johannes Schindelin, (Wed Jan 23, 8:24 am)
Re: I'm a total push-over.., David Kastrup, (Wed Jan 23, 8:28 am)
Re: I'm a total push-over.., Theodore Tso, (Wed Jan 23, 8:56 am)
Re: I'm a total push-over.., Kevin Ballard, (Tue Jan 22, 9:35 pm)
speck-geostationary