Re: [PATCH 2/2] Implement a simple delta_base cache

Previous thread: [PATCH] use xstrdup please by Shawn O. Pearce on Thursday, March 15, 2007 - 6:02 pm. (1 message)

Next thread: Merging on a dirty tree??? by Dmitry Torokhov on Thursday, March 15, 2007 - 8:59 pm. (3 messages)
From: Linus Torvalds
Date: Thursday, March 15, 2007 - 6:04 pm

I looked at git profiles yesterday, and some of them are pretty scary. We 
spend about 50% of the time under some loads in just zlib uncompression, 
and when I actually looked closer at the zlib sources I can kind of 
understand why. That thing is horrid.

The sad part is that it looks like it should be quite possible to make 
zlib simply just perform better. The profiles seem to say that a lot of 
the cost is literally in the "inflate()" state machine code (and by that I 
mean *not* the code itself, but literally in the indirect jump generated 
by the case-statement).

Now, on any high-performance CPU, doing state-machines by having

	for (;;)
		switch (data->state) {
			...
			data->state = NEW_STATE;
			continue;
		}

(which is what zlib seems to be doing) is just about the worst possible 
way to code things.

Now, it's possible that I'm just wrong, but the instruction-level profile 
really did pinpoint the "look up state branch pointer and jump to it" as 
some of the hottest part of that function. Which is just *evil*. You can 
most likely use direct jumps within the loop (zero cost at all on most OoO 
CPU's) most of the time, and the entry condition is likely quite 
predictable too, so a lot of that overhead seems to be just sad and 
unnecessary.

Now, I'm just wondering if anybody knows if there are better zlib 
implementations out there? This really looks like it could be a noticeable 
performance issue, but I'm lazy and would be much happier to hear that 
somebody has already played with optimizing zlib. Especially since I'm not 
100% sure it's really going to be noticeable..

		Linus
-

From: Shawn O. Pearce
Date: Thursday, March 15, 2007 - 6:10 pm

Yes.  This is actually one of the motivations behind pack v4.
We don't store the "important bits" of commits and trees in zlib
compressed format at all; allowing us to completely bypass the
inflate() penalty you describe.

We're already much faster on the linux-2.6 kernel tree, and that's
*with* converting the pack raw data into text, then reparsing
that text into a struct commit* or a struct name_entry using the
current code.  We're also planning on reworking those parsers to
parse the raw pack data, allowing us to save some very unnecessary
raw->string->raw conversion time.

But Nico and I still looking to use zlib for commit messages and
blob content, so any improvements to inflate (or its replacement)
would still be most helpful.

-- 
Shawn.
-

From: Jeff Garzik
Date: Thursday, March 15, 2007 - 6:11 pm

ISTR there are a bunch of state transitions per byte, which would make 

I could have sworn that either Matt Mackall or Ben LaHaise had cleaned 
up the existing zlib so much that it was practically a new 
implementation.  I'm not aware of any open source implementations 
independent of zlib (except maybe that C++ behemoth, 7zip).

	Jeff


-

From: Matt Mackall
Date: Thursday, March 15, 2007 - 6:14 pm

I cleaned up the version in lib/ that's used for boot on most systems.
It's quite a bit simpler and cleaner than the code lib/zlib (and
elsewhere!). But making it faster is another matter entirely - I don't
know off-hand how the two compare.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Linus Torvalds
Date: Thursday, March 15, 2007 - 6:46 pm

Yeah. I only looked at the kernel sources (which include a cleaned-up 
older version of zlib), but they seemed to match the disassembly that I 
saw when doing the instruction-level profiling.

The code *sometimes* falls through from one state to another, ie there's a 
case where zlib does:

	...
        case DICTID:
            NEEDBITS(32);
            strm->adler = state->check = REVERSE(hold);
            INITBITS();
            state->mode = DICT;
        case DICT:
	...

which will obviously generate fine code. There's a few other examples of 
that, *BUT* most of the stuff is just horrid, like

        case HEAD:
            if (state->wrap == 0) {
                state->mode = TYPEDO;
                break;
            }

where the "break" (which simply breaks out of the case-statement, and thus 
just loops back to it:

    for (;;)
        switch (state->mode) {

thing is nasty nasty nasty).

A trivial thing to do is to just replace such

	state->mode = xxx;
	break;

with

	state->mode = xxx;
	goto xxx_mode;

and all of that complicated run-time code *just*goes*away* and is replaced 
by a relative no-op (ie an unconditional direct jump).

Some of them are slightly more complex, like

	state->mode = hold & 0x200 ? DICTID : TYPE;
	INITBITS();
	break;

which would need to be rewritten as

	old_hold = hold;
	INITBITS();
	if (old_hold & 0x200) {
		state->mode = DICTID;
		goto dictid_mode;
	}
	state->mode = TYPE;
	goto dictid_mode;

but while that looks more complicated on a source code level it's a *lot* 
easier for a CPU to actually execute.

Same obvious performance problems go for

	case STORED:
	   ...
        case COPY:
            copy = state->length;
            if (copy) {
		...
		state->length -= copy;
		break;
	    }
	    Tracev((stderr, "inflate:       stored end\n"));
	    state->mode = TYPE;
	    break;

notice how when you get to the COPY state it will actually have nicely 
fallen through from STORED (one ...
From: Linus Torvalds
Date: Thursday, March 15, 2007 - 6:54 pm

As an example, I *think* this patch to zlib-1.2.3 not only generates 
better code, but is (a) shorter and (b) more logical anyway.

Together with Davide's suggestion on using C macro expansion to make most 
of the mode switches simple branches, it might get rid of most of the 
indirect branches (to get rid of them all, you'd have to also find the 
places where we *don't* set a new state, because it stays the same like 
this one, and the ones where we have conditionals on what the mode is 
going to be..

Of course, the zlib sources are pretty horrid for other reasons (K&R 
source code meant to be compiled on 16-bit architectures too). But that's 
a separate issue, and at least shouldn't affect the resulting code 
quality..

		Linus

---
 inflate.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/inflate.c b/inflate.c
index 792fdee..fb26f39 100644
--- a/inflate.c
+++ b/inflate.c
@@ -819,7 +819,7 @@ int flush;
             state->mode = COPY;
         case COPY:
             copy = state->length;
-            if (copy) {
+            while (copy) {
                 if (copy > have) copy = have;
                 if (copy > left) copy = left;
                 if (copy == 0) goto inf_leave;
@@ -829,7 +829,6 @@ int flush;
                 left -= copy;
                 put += copy;
                 state->length -= copy;
-                break;
             }
             Tracev((stderr, "inflate:       stored end\n"));
             state->mode = TYPE;
-

From: Davide Libenzi
Date: Thursday, March 15, 2007 - 7:43 pm

That's the diff against 1.2.3, but it does not seem to make an substantial 
difference in my Opteron ...




- Davide



Index: zlib-1.2.3.quilt/inflate.c
===================================================================
--- zlib-1.2.3.quilt.orig/inflate.c	2007-03-15 18:17:19.000000000 -0700
+++ zlib-1.2.3.quilt/inflate.c	2007-03-15 18:31:14.000000000 -0700
@@ -551,6 +551,15 @@
    will return Z_BUF_ERROR if it has not reached the end of the stream.
  */
 
+#define CASE_DECL(n) \
+	case n: \
+	lbl_##n:
+
+#define STATE_CHANGE(s) do { \
+	state->mode = s; \
+	goto lbl_##s; \
+} while (0)
+
 int ZEXPORT inflate(strm, flush)
 z_streamp strm;
 int flush;
@@ -586,10 +595,9 @@
     ret = Z_OK;
     for (;;)
         switch (state->mode) {
-        case HEAD:
+        CASE_DECL(HEAD)
             if (state->wrap == 0) {
-                state->mode = TYPEDO;
-                break;
+		STATE_CHANGE(TYPEDO);
             }
             NEEDBITS(16);
 #ifdef GUNZIP
@@ -597,8 +605,7 @@
                 state->check = crc32(0L, Z_NULL, 0);
                 CRC2(state->check, hold);
                 INITBITS();
-                state->mode = FLAGS;
-                break;
+		STATE_CHANGE(FLAGS);
             }
             state->flags = 0;           /* expect zlib header */
             if (state->head != Z_NULL)
@@ -609,20 +616,17 @@
 #endif
                 ((BITS(8) << 8) + (hold >> 8)) % 31) {
                 strm->msg = (char *)"incorrect header check";
-                state->mode = BAD;
-                break;
+	        STATE_CHANGE(BAD);
             }
             if (BITS(4) != Z_DEFLATED) {
                 strm->msg = (char *)"unknown compression method";
-                state->mode = BAD;
-                break;
+	        STATE_CHANGE(BAD);
             }
             DROPBITS(4);
             len = BITS(4) + 8;
             if (len > state->wbits) {
                 strm->msg = (char *)"invalid window size";
-                ...
From: Linus Torvalds
Date: Thursday, March 15, 2007 - 7:56 pm

But the "goto" stuff you did is 5-6%? 

Is that 5-6% of total git costs, or just of inflate() itself?

		Linus
-

From: Davide Libenzi
Date: Thursday, March 15, 2007 - 8:16 pm

Didn't do proper cache warmup and test time was fairly short. Now I'm not 
able to notice substantial differences.
Hacked up test case below ...




- Davide



/* example.c -- usage example of the zlib compression library
 * Copyright (C) 1995-2004 Jean-loup Gailly.
 * For conditions of distribution and use, see copyright notice in zlib.h
 */

/* @(#) $Id$ */

#include <sys/time.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include "zlib.h"



#define CHECK_ERR(err, msg) do { \
	if (err != Z_OK) { \
		fprintf(stderr, "%s error: %d\n", msg, err); \
		exit(1); \
	} \
} while (0)




unsigned long long getustime(void) {
	struct timeval tm;

	gettimeofday(&tm, NULL);
	return tm.tv_sec * 1000000ULL + tm.tv_usec;
}


/* ===========================================================================
 * Test deflate() with large buffers and dynamic change of compression level
 */
void do_defl(Byte *compr, uLong *comprLen,
	     Byte *uncompr, uLong uncomprLen) {
	z_stream c_stream; /* compression stream */
	int err;

	c_stream.zalloc = (alloc_func)0;
	c_stream.zfree = (free_func)0;
	c_stream.opaque = (voidpf)0;

	err = deflateInit(&c_stream, Z_BEST_SPEED);
	CHECK_ERR(err, "deflateInit");

	c_stream.next_out = compr;
	c_stream.avail_out = (uInt) *comprLen;

	/* At this point, uncompr is still mostly zeroes, so it should compress
	 * very well:
	 */
	c_stream.next_in = uncompr;
	c_stream.avail_in = (uInt) uncomprLen;
	err = deflate(&c_stream, Z_FINISH);
	if (err != Z_STREAM_END) {
		fprintf(stderr, "whoops, got %d instead of Z_STREAM_END\n", err);
		exit(1);
	}

	err = deflateEnd(&c_stream);
	CHECK_ERR(err, "deflateEnd");

	*comprLen = c_stream.next_out - compr;
}

/* ===========================================================================
 * Test inflate() with large buffers
 */
void do_infl(Byte *compr, uLong comprLen,
	     Byte *uncompr, uLong *uncomprLen) {
	int err;
	z_stream d_stream; /* ...
From: Linus Torvalds
Date: Friday, March 16, 2007 - 9:21 am

This one seems to do benchmarking with 8MB buffers if I read it right 
(didn't try).

The normal size for the performance-critical git objects are in the couple 
of *hundred* bytes. Not kilobytes, and not megabytes.

The most performance-critical objects for uncompression are commits and 
trees. At least for the kernel, the average size of a tree object is 678
bytes. And that's ignoring the fact that most of them are then deltified, 
so about 80% of them are likely just a ~60-byte delta.

		Linus
-

From: Davide Libenzi
Date: Friday, March 16, 2007 - 9:24 am

Yes, I just wanted to have the biggest time spent in inflate(). That why I 

Definitely. The nature of the data matters.
Did you try to make a zlib with my patch and oprofile git on real data 
with that?



- Davide


-

From: Linus Torvalds
Date: Friday, March 16, 2007 - 9:35 am

Right. But if the biggest time is spent in setup, the big-buffer thing 

I haven't actually set it up so that I can build against my own zlib yet. 
Exactly because I was hoping that somebody would already have a solution 
;)

Will try to do later today, although since it's the kids school 
conferences today I'll effectively be in and out all day long.

		Linus
-

From: Davide Libenzi
Date: Friday, March 16, 2007 - 12:21 pm

I modified ztest.c to be able to bench on various data files (you list 
them on the command line), but result are pretty much same.
I cannot measure any sensible difference between the two.
Attached there's ztest.c and the diff, in case you want to try on your 

An LD_PRELOAD pointing to your own build should do it.




- Davide




#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <time.h>
#include "zlib.h"


#define MIN_TESTIME (2 * 1000000)
#define INNER_CYCLES 32



#define CHECK_ERR(err, msg) do { \
	if (err != Z_OK) { \
		fprintf(stderr, "%s error: %d\n", msg, err); \
		exit(1); \
	} \
} while (0)



static unsigned long long mintt = MIN_TESTIME;
static uLong incycles = INNER_CYCLES;



static unsigned long long getustime(void) {
	struct timeval tm;

	gettimeofday(&tm, NULL);
	return tm.tv_sec * 1000000ULL + tm.tv_usec;
}

static void do_defl(Byte *cdata, uLong *clen,
		    Byte *udata, uLong uclen) {
	z_stream c_stream; /* compression stream */
	int err;

	c_stream.zalloc = (alloc_func) NULL;
	c_stream.zfree = (free_func) NULL;
	c_stream.opaque = (voidpf) NULL;

	err = deflateInit(&c_stream, Z_BEST_SPEED);
	CHECK_ERR(err, "deflateInit");

	c_stream.next_out = cdata;
	c_stream.avail_out = (uInt) *clen;

	/* At this point, udata is still mostly zeroes, so it should compress
	 * very well:
	 */
	c_stream.next_in = udata;
	c_stream.avail_in = (uInt) uclen;
	err = deflate(&c_stream, Z_FINISH);
	if (err != Z_STREAM_END) {
		fprintf(stderr, "whoops, got %d instead of Z_STREAM_END\n", err);
		exit(1);
	}

	err = deflateEnd(&c_stream);
	CHECK_ERR(err, "deflateEnd");

	*clen = c_stream.next_out - cdata;
}

static void do_infl(Byte *cdata, uLong clen,
		    Byte *udata, uLong *uclen) {
	int err;
	z_stream d_stream; /* decompression stream */

	d_stream.zalloc = (alloc_func) NULL;
	d_stream.zfree = ...
From: Linus Torvalds
Date: Friday, March 16, 2007 - 5:01 pm

I'm using your previous patch (is it the same?) along with the additional 
patch appended.

And yes, it's not hugely faster, but I seem to see *some* difference: this 
is the real-time of ten runs of 

	time git log drivers/usb/ > /dev/null

Before:

	0m2.673s
	0m2.476s
	0m2.603s
	0m2.576s
	0m2.625s
	0m2.628s
	0m2.493s
	0m2.696s
	0m2.525s
	0m2.575s

After:

	0m2.639s
	0m2.519s
	0m2.454s
	0m2.604s
	0m2.499s
	0m2.497s
	0m2.506s
	0m2.394s
	0m2.409s
	0m2.562s

ie after I actually get under 2.4s once, and under 2.5s most of the time, 
while before it was under 2.5s just twice, and mostly in the 2.6s..

(I did end up adding the "-g", but I trust that doesn't make things 
*faster*. Generally gcc is good at not actually changing code generation 
based on -g)

But yeah, not very impressive changes. We're talking *maybe* 0.1s out of 
2.5, so potentially about 4% of total time but more likely about 2-3%, and 
it's clearly mostly in the noise. And inflate() is still at 16%, and 
inflate_fast obviously got no faster.

The nice part is that the instruction-level profile for inflate() got more 
interesting. Instead of clearly peaking at the silly indirect jump, the 
peak now seems to be a specific path through the thing. I've not decoded 
it fully yet, but it seems to be mostly the LEN/LIT cases:

 file inflate.c, line 942.
 file inflate.c, line 942.
 file inflate.c, line 949.
 file inflate.c, line 949.
 file inflate.c, line 949.
 file inflate.c, line 949.
 file inflate.c, line 949.
 file inflate.c, line 950.
 file inflate.c, line 950.
 file inflate.c, line 951.
 file inflate.c, line 951.
 file inflate.c, line 951.
 file inflate.c, line 951.
 file inflate.c, line 951.
 file inflate.c, line 949.
 file inflate.c, line 949.
 file inflate.c, line 950.
 file inflate.c, line 950.
 file inflate.c, line 953.
 file inflate.c, line 953.
 file inflate.c, line 953.
 file inflate.c, line 969.
 file inflate.c, line 1058.
 file inflate.c, line 1059.
 file ...
From: Linus Torvalds
Date: Friday, March 16, 2007 - 6:11 pm

Damn. I think I know why it's happening, and I'm an idiot. I think it's 
actually an issue I wondered about a *loong* time ago, and then forgot all 
about. And later or Nico made it almost impossible to fix with his "pack 
offset" changes.

The thing that made me realize was one of the callchains into inflate() 
that I looked at:

   (gdb) where
   #0  inflate (strm=0x7fff10d83810, flush=4) at inflate.c:566
   #1  0x000000000044c165 in unpack_compressed_entry (p=0x6d52e0, w_curs=0x7fff10d838e0, curpos=94941911,
       size=<value optimized out>) at sha1_file.c:1348
   #2  0x000000000044c2b6 in unpack_entry (p=0x6d52e0, obj_offset=94941909, type=0x7fff10d85d8c, sizep=0x7fff10d83928)
       at sha1_file.c:1408
   #3  0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94942707, type=0x7fff10d85d8c, sizep=0x7fff10d83988)
       at sha1_file.c:1373
   #4  0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94943021, type=0x7fff10d85d8c, sizep=0x7fff10d839e8)
       at sha1_file.c:1373
   #5  0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94943382, type=0x7fff10d85d8c, sizep=0x7fff10d83a48)
       at sha1_file.c:1373
   #6  0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94943531, type=0x7fff10d85d8c, sizep=0x7fff10d83aa8)
       at sha1_file.c:1373
   #7  0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94943622, type=0x7fff10d85d8c, sizep=0x7fff10d83b08)
       at sha1_file.c:1373
   #8  0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94945357, type=0x7fff10d85d8c, sizep=0x7fff10d83b68)
       at sha1_file.c:1373
   #9  0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94945447, type=0x7fff10d85d8c, sizep=0x7fff10d83bc8)
       at sha1_file.c:1373
   #10 0x000000000044c32e in unpack_entry (p=0x6d52e0, obj_offset=94945571, type=0x7fff10d85d8c, sizep=0x7fff10d85d80)
       at sha1_file.c:1373
   #11 0x000000000044c3f8 in read_packed_sha1 (sha1=<value optimized out>, type=0x7fff10d85d8c, size=0x7fff10d85d80)
     ...
From: Nicolas Pitre
Date: Friday, March 16, 2007 - 8:28 pm

In the worst case, yes.  And if you're walking history then the 

Right.  Should be really trivial to hook into unpack_delta_entry() 
actually replacing the call to unpack_entry() with a wrapper function 
that returns cached data, or populates the cache with unpack_entry() 
when no match is found.

Then it would only be a matter of coming up with a clever cache 

Well, I think the pack format issue is significant too.  But because 
those are independent issues the gain in performance will be additive.


Nicolas
-

From: Shawn O. Pearce
Date: Friday, March 16, 2007 - 10:19 pm

Yes.  Linus above seems to imply (at least to me) that we wouldn't
want to cache the original object requested by read_sha1_file(), as
its not the delta base.  But given our packing rules, we should be
(in general anyway) first asking for the most recent revision of
a file, which is stored whole, then for an older revision, which
will be a delta of the more recent revision we just saw.

Hence we probably would want to cache an object.  Well, at least
anything that had been packed as a delta.  Caching a deflated

I'm torn there.

There's two places that we do lots of unpacks of objects where we
run into this difficult case of unpacking the same base object many
times: git-blame and a rev-list with a path limiter.

Now the git-blame case is obvious: we are constantly unpacking
various revisions of the same file, and these are probably delta'd
against each other, so the unpacking gets really brutal after a
while.  A blob cache here would probably *really* help out git-blame.

What's slightly less obvious about git-blame is we are probably also
traversing the different versions of the same trees over and over, as
we resolve the path to the correct blob in each commit we traverse.
So again here we are hitting lots of the same trees multiple times.

That last part about git-blame also obviously applies to the rev-list
with a path limiter.

But most other operations don't seem like they would benefit from a
base object cache; actually they might slow down from having such
a cache present!

Commits tend not to delta well; if they delta it is a very rare
occurrance.  So we aren't getting huge unpacking benefits there
by caching them.  Scratch any benefit of the cache for any sort of
rev-list operation that doesn't require tree access.

As for the other common operations (diff, read-tree, checkout-index,
merge-recursive): I don't think these will benefit from a cache
either.  Their data access patterns are pretty spread out over
the tree.  With the exception of rename ...
From: Linus Torvalds
Date: Saturday, March 17, 2007 - 10:55 am

Actually, it's even better than that.

If we're walking a certain pathspec (which is reall ythe only thing that 
is expensive), we're pretty much *guaranteed* that we'll hit exactly this 
case. Doing some instrumentation on the test-case I've been using (which 
is just "git log drivers/usb/ > /dev/null") shows:

	[torvalds@woody linux]$ grep Needs delta-base-trace | wc -l
	469334
	[torvalds@woody linux]$ grep Needs delta-base-trace | sort -u | wc -l
	21933

where that delta-base-trace is just a trace of which delta bases were 
needed. Look how we currently generate almost half a million of them, but 
only 22000 are actually unique objects - we just generate many of them 
over and over again. In fact, the top delta bases with counts looks like:

    558 Needs 102398354
    556 Needs 161353360
    554 Needs 161354852
    552 Needs 161354916
    550 Needs 161354980
    526 Needs 161355044
    524 Needs 161355108
    522 Needs 161355174
    520 Needs 161355238
    508 Needs 161445724
    446 Needs 119712387
    425 Needs 133406737
    420 Needs 161513997
    387 Needs 120784913
    331 Needs 127094253
    321 Needs 95694853
    319 Needs 125888524
    303 Needs 155109487
    301 Needs 155627964
    299 Needs 155628028
    .....

ie the top twenty objects were all generated hundreds of times each.

More importantly, the trace also shows that it actually has very good 
locality too - exactly as you'd expect, since when we traverse the trees, 
we'd generally see a particular delta base used as a base when that thing 
is slowly changing, so of the half-million "needs" entries in my trace, if 
I pick the top delta_base (102398354), and use "cat -n" to give them all 
line numbers (from 1 to half a million), and grep for that particular 
delta:

	grep Needs delta-base-trace | cat -n | grep 102398354 | less -S

they are *all* at lines 61624..89352, with the bulk of them being very 
close together (the bulk of those are all around 88k line mark).

In other words, ...
From: Linus Torvalds
Date: Saturday, March 17, 2007 - 12:40 pm

Ok, got distracted by guests coming to look at the new puppy, so it took 
longer than it should have, but the following is a simple two-patch series 
that improves path-following by a factor of almost 2.5 for me.

The cache is *really* simple. It's just a 256-entry hashed cache of the 
last few base entries, and it brings down my test-case of

	git log drivers/usb/ > /dev/null

from 2.5s to just over 1s. I have *not* tuned or tweaked this at all, and 
maybe there are better ways to do this, but this was simple as hell and 
obviously quite effective.

It also speeds up "git blame", for all the same reasons. Before (best 
times out of a run of five):

	[torvalds@woody linux]$ time git blame drivers/char/Makefile > /dev/null
	real    0m1.585s
	user    0m1.576s
	sys     0m0.004s

after:

	[torvalds@woody linux]$ time ~/git/git blame drivers/char/Makefile > /dev/null
	real    0m0.763s
	user    0m0.644s
	sys     0m0.120s

so it's a factor of two there too (just a random file, I'm not at all 
going to guarantee that this is really consistent - it should get more 
testing etc).

The first patch just does some obvious re-factoring and setting up (no 
real code changes). The second patch just uses the new functions to 
actually add a cache.

		Linus
-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 12:42 pm

This doesn't change any code, it just creates a point for where we'd
actually do the caching of delta bases that have been generated.
    
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Done this way to make all the changes as obvious as possible.


diff --git a/sha1_file.c b/sha1_file.c
index 110d696..f11ca3f 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1352,6 +1352,18 @@ static void *unpack_compressed_entry(struct packed_git *p,
 	return buffer;
 }
 
+static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
+	unsigned long *base_size, enum object_type *type)
+{
+	return unpack_entry(p, base_offset, type, base_size);
+}
+
+static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
+	void *base, unsigned long base_size, enum object_type type)
+{
+	free(base);
+}
+
 static void *unpack_delta_entry(struct packed_git *p,
 				struct pack_window **w_curs,
 				off_t curpos,
@@ -1365,7 +1377,7 @@ static void *unpack_delta_entry(struct packed_git *p,
 	off_t base_offset;
 
 	base_offset = get_delta_base(p, w_curs, &curpos, *type, obj_offset);
-	base = unpack_entry(p, base_offset, type, &base_size);
+	base = cache_or_unpack_entry(p, base_offset, &base_size, type);
 	if (!base)
 		die("failed to read delta base object"
 		    " at %"PRIuMAX" from %s",
@@ -1378,7 +1390,7 @@ static void *unpack_delta_entry(struct packed_git *p,
 	if (!result)
 		die("failed to apply delta");
 	free(delta_data);
-	free(base);
+	add_delta_base_cache(p, base_offset, base, base_size, *type);
 	return result;
 }
 
-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 12:44 pm

This trivial 256-entry delta_base cache improves performance for some 
loads by a factor of 2.5 or so.

Instead of always re-generating the delta bases (possibly over and over 
and over again), just cache the last few ones. They often can get re-used.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

This should have some other people doing performance testing too, since 
it's fairly core. But *dang*, it's really simple.

diff --git a/sha1_file.c b/sha1_file.c
index f11ca3f..a7e3a2a 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1352,16 +1352,57 @@ static void *unpack_compressed_entry(struct packed_git *p,
 	return buffer;
 }
 
+#define MAX_DELTA_CACHE (256)
+
+static struct delta_base_cache_entry {
+	struct packed_git *p;
+	off_t base_offset;
+	unsigned long size;
+	void *data;
+	enum object_type type;
+} delta_base_cache[MAX_DELTA_CACHE];
+
+static unsigned long pack_entry_hash(struct packed_git *p, off_t base_offset)
+{
+	unsigned long hash;
+
+	hash = (unsigned long)p + (unsigned long)base_offset;
+	hash += (hash >> 8) + (hash >> 16);
+	return hash & 0xff;
+}
+
 static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 	unsigned long *base_size, enum object_type *type)
 {
+	void *ret;
+	unsigned long hash = pack_entry_hash(p, base_offset);
+	struct delta_base_cache_entry *ent = delta_base_cache + hash;
+
+	ret = ent->data;
+	if (ret && ent->p == p && ent->base_offset == base_offset)
+		goto found_cache_entry;
 	return unpack_entry(p, base_offset, type, base_size);
+
+found_cache_entry:
+	ent->data = NULL;
+	*type = ent->type;
+	*base_size = ent->size;
+	return ret;
 }
 
 static void add_delta_base_cache(struct packed_git *p, off_t base_offset,
 	void *base, unsigned long base_size, enum object_type type)
 {
-	free(base);
+	unsigned long hash = pack_entry_hash(p, base_offset);
+	struct delta_base_cache_entry *ent = delta_base_cache + hash;
+
+	if (ent->data)
+		free(ent->data);
+	ent->p = ...
From: Linus Torvalds
Date: Saturday, March 17, 2007 - 2:45 pm

Not just to compare actual timings, this shows the difference in the 
traces I did. Remember, before we had:

	[torvalds@woody linux]$ grep Needs delta-base-trace | wc -l
	469334
	[torvalds@woody linux]$ grep Needs delta-base-trace |sort -u | wc -l
	21933

and now with the simple cache, I get:

	[torvalds@woody linux]$ grep Needs delta-base-trace-new | wc -l
	28688
	[torvalds@woody linux]$ grep Needs delta-base-trace-new | sort -u | wc -l
	21933

ie, we still re-generate some of the objects multiple times, but now, 
rather than generating them (on average) 20+ times each, we now generate 
them an average of just 1.3 times each. Which explains why the wall-time 
goes down by over a factor of two.

Changing the (statically sized) cache from 256 entries to 1024 (and 
updating the hash function appropriately of course) gets the number down 
to 23953 delta-base lookups (the number of unique ones obviously stays the 
same), for an average of just 1.1 object generates per unique object, and 
also means that you occasionally get sub-second times for my test-case of 
logging drivers/usb/.

It all also means that libz isn't really even the top entry in the 
profiles any more, although it's still pretty high. But the profile now 
says:

	samples  %        app name                 symbol name
	41527    15.6550  git                      strlen
	30215    11.3905  git                      inflate
	27504    10.3685  git                      inflate_table
	20321     7.6607  git                      find_pack_entry_one
	16892     6.3680  git                      interesting
	16259     6.1294  vmlinux                  __copy_user_nocache
	16010     6.0355  git                      inflate_fast
	9240      3.4833  git                      get_mode
	8863      3.3412  git                      tree_entry_extract
	7145      2.6935  git                      strncmp
	7131      2.6883  git                      memcpy
	6863      2.5872  git                      diff_tree
	6113      2.3045 ...
From: Junio C Hamano
Date: Saturday, March 17, 2007 - 3:37 pm

This is beautiful.  You only cache what we were about to discard
anyway, and when giving a cached one out, you invalidate the
cached entry, so there is no way the patch can introduce leaks
nor double-frees and it is absolutely safe (as long as we can
pin the packed_git structure, which I think is the case --- even
when we re-read the packs, I do not think we discard old ones).

I've thought about possible ways to improve on it, but came up
almost empty.

When unpacking a depth-3 deltified object A, the code finds the
target object A (which is a delta), ask for its base B and put B
in the cache after using it to reconstitute A.  While doing so,
the first-generation base B is also a delta so its base C (which
is a non-delta) is found and placed in the cache.  When A is
returned, the cache has B and C.  If you ask for B at this
point, we read the delta, pick up its base C from the cache,
apply, and return while putting C back in the cache.  If you ask
for A after that, we do not read from the cache, although it is
available.

Which feels a bit wasteful at first sight, and we *could* make
read_packed_sha1() also steal from the cache, but after thinking
about it a bit, I am not sure if it is worth it.  The contract
between read_packed_sha1() and read_sha1_file() and its callers
is that the returned data belongs to the caller and it is a
responsibility for the caller to free the buffer, and also the
caller is free to modify it, so stealing from the cache from
that codepath means an extra allocation and memcpy.  If the
object stolen from the cache is of sufficient depth, it might be
worth it, but to decide it we somehow need to compute and store
which delta depth the cached one is at.

In any way, your code makes a deeply delitified packfiles a lot
more practical.  As long as the working set of delta chains fits
in the cache, after unpacking the longuest delta, the objects on
the chain can be had by one lookup and one delta application.


Ack ;-)

-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 4:09 pm

Yes.

I debated that a bit with myself, but decided that:

 (a) it probably doesn't really matter a lot (but I don't have the 
     numbers)

 (b) trying to *also* fill non-delta-base queries from the delta-base 
     cache actually complicates things a lot. Surprisingly much so (the 
     current logic of removing the entry from the cache only to re-insert 
     it after being used made the memory management totally trivial, as 
     you noticed)

 (c) and regardless, we could decide to do a more extensive caching layer 
     later if we really wanted to, and at that point it probably makes 
     more sense to integrate it with the delta-base cache.

     Most git objects are use-once, which is why we really *just* save the 
     flag bits and the SHA1 hash name itself in "struct object", but doing 
     a generic caching layer for object content would likely obviate the 
     need for the current logic to do "save_commit_buffer".

That (c) in particular was what made me think that it's better to keep it 
simple and obvious for now, since even the simple thing largely fixes the 
performance issue.  Almost three seconds I felt bad about, while just over 
a second for something as complex as "git log drivers/usb/" I just cannot 

Yeah. I think it would be good to probably (separately and as "further 
tweaks"):

 - have somebody actually look at hit-rates for different repositories and 
   hash sizes.

 - possibly allow people to set the hash size as a config option, if it 
   turns out that certain repository layouts or usage scenarios end up 
   preferring bigger caches.

   For example, it may be that for historical archives you might want to 
   have deeper delta queues to make the repository smaller, and if they 
   are big anyway maybe they would prefer to have a larger-than-normal 
   cache as a result. On the other hand, if you are memory-constrained, 
   maybe you'd prefer to re-generate the objects and waste a bit of CPU 
   rather than cache the ...
From: Linus Torvalds
Date: Saturday, March 17, 2007 - 4:54 pm

Well, to some degree I obviously *do* have the numbers.

I have the numbers that we used to re-generate the object data over five 
*hundred* times per object for some cases, and that I got the average 
such delta-base usage down from 20x to 1.1-1.3x depending on cache size.

In contrast, the "use delta-base also for non-delta queries" fairly 
obviously cannot touch those kinds of numbers. We migth avoid a *few* 
object generation cases, but we're not looking at factors of 20 for any 
kind of sane cases.

So I do think that a higher-level caching approach can work too, but it's 
going to be more effective in other areas:

 - get rid of some ugly hacks (like the "save_commit_buffer" thing I 
   mentioned)
 - possibly help some insane loads (eg cases where we really *do* end up 
   seeing the same object over and over again, perhaps simply because some 
   idiotic automated commit system ends up switching between a few states 
   back-and-forth).

I really think the "insane loads" thing is unlikely, but I could construct 
some crazy usage scenario where a cache of objects in general (and not 
just delta bases) would work. I don't think it's a very realistic case, 
but who knows - people sometimes do really stupid things.

		Linus
-

From: Nicolas Pitre
Date: Saturday, March 17, 2007 - 6:13 pm

So?

A malloc() + memcpy() will always be faster than mmap() + malloc() + 
inflate().  If the data is already there it is certainly better to copy 
it straight away.

With the patch below I can do 'git log drivers/scsi/ > /dev/null' about 
7% faster.  I bet it might be even more on those platforms with bad 
mmap() support.

Signed-off-by: Nicolas Pitre <nico@cam.org>
---
diff --git a/sha1_file.c b/sha1_file.c
index a7e3a2a..ee64865 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1372,7 +1372,7 @@ static unsigned long pack_entry_hash(struct packed_git *p, off_t base_offset)
 }
 
 static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
-	unsigned long *base_size, enum object_type *type)
+	unsigned long *base_size, enum object_type *type, int keep_cache)
 {
 	void *ret;
 	unsigned long hash = pack_entry_hash(p, base_offset);
@@ -1384,7 +1384,13 @@ static void *cache_or_unpack_entry(struct packed_git *p, off_t base_offset,
 	return unpack_entry(p, base_offset, type, base_size);
 
 found_cache_entry:
-	ent->data = NULL;
+	if (!keep_cache)
+		ent->data = NULL;
+	else {
+		ret = xmalloc(ent->size + 1);
+		memcpy(ret, ent->data, ent->size);
+		((char *)ret)[ent->size] = 0;
+	}
 	*type = ent->type;
 	*base_size = ent->size;
 	return ret;
@@ -1418,7 +1424,7 @@ static void *unpack_delta_entry(struct packed_git *p,
 	off_t base_offset;
 
 	base_offset = get_delta_base(p, w_curs, &curpos, *type, obj_offset);
-	base = cache_or_unpack_entry(p, base_offset, &base_size, type);
+	base = cache_or_unpack_entry(p, base_offset, &base_size, type, 0);
 	if (!base)
 		die("failed to read delta base object"
 		    " at %"PRIuMAX" from %s",
@@ -1615,7 +1621,7 @@ static void *read_packed_sha1(const unsigned char *sha1,
 	if (!find_pack_entry(sha1, &e, NULL))
 		return NULL;
 	else
-		return unpack_entry(e.p, e.offset, type, size);
+		return cache_or_unpack_entry(e.p, e.offset, size, type, 1);
 }
 
 /*
-

From: Junio C Hamano
Date: Sunday, March 18, 2007 - 12:47 am

I do not know if there is mmap() cost involved, but you are
correct to point out that my aversion to malloc() cost was
unfounded.  We need to allocate anyway, and memcpy() should of

Wonderful.  I was going to nitpick but you even took care of the
convention of returning a buffer with one extra byte that
terminates the contents with NUL.  Perfect.


-

From: Junio C Hamano
Date: Saturday, March 17, 2007 - 4:12 pm

This largely would depend on the project, but if a blob that is
cached is 20kB each, a 1024-entry cache would grow to 20MB.  We
may need to introduce early eviction of cached objects with
total cache size limit, configurable per repository.

-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 4:24 pm

One thing that I considered was to limit the delta-base cache to just tree 
entries. Those tend to be the really performance-sensitive ones - by the 
time you actually unpack blob entries, you're going to do something with 
that *single* entry anyway (like compare it to another blob), and the cost 
of unpacking the entry is likely to not be really all that noticeable.

That said, it was just simpler to do it unconditionally, and it obviously 
*works* fine regardless of the object type, so limiting it to trees is a 
bit sad. And since the intensive tree operations tend to be in a separate 
phase (ie the commit simplification phase) from the the blob operations 
(say, doing "git log -p <pathspec>"), I suspect that the cache locality 
would still remain good.

So I didn't do anything along the lines of "only cache for case Xyzzy".

But yes, especially if a project has big blobs, it might make sense to 
limit by full size of the cached entries some way.

			Linus
-

From: Jon Smirl
Date: Saturday, March 17, 2007 - 4:52 pm

If you still have a Mozilla pack file around it would be a good test
case. It has delta chains thousands of entries long. If I remember
correctly one had over 4,000 deltas.

-- 
Jon Smirl
jonsmirl@gmail.com
-

From: Morten Welinder
Date: Saturday, March 17, 2007 - 6:14 pm

Almost 16% in strlen?  Ugh!

That's a lot of strings, or perhaps very long strings.  Or a profiling bug.

M.
-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 6:29 pm

It's likely real, and the problem is likely lots of small strings.

Each git tree entry is:

	"<octal mode> name\0" <20-byte sha1>

so you do have a *lot* of strlen() calls when doing any tree parsing. And 
for some inexplicable reason, glibc thinks strings are long on average, so 
it has a fancy algorithm to do 8 bytes at a time and tries to do things 
aligned etc.

The size of strlen() on x86-64 with glibc is 232 bytes. I'm not kidding.

			Linus
-

From: Nicolas Pitre
Date: Saturday, March 17, 2007 - 6:38 pm

This is definitely an area where pack v4 will bring that cost down to 
zero.


Nicolas
-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 6:55 pm

Heh. I believe that when I see it. The thing is, unless you re-generate 
the tree object data structures, you'll have to have totally different 
tree walkers for different tree types, and it will all be quite ugly and 
complex. And "ugly and complex" seldom translates into "zero cost".

		Linus
-

From: Nicolas Pitre
Date: Saturday, March 17, 2007 - 7:03 pm

Well... in my opinion it is the _current_ tree walker that is quite ugly 
and complex.  It is always messier to parse strings than fixed width 
binary fields.


Nicolas
-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 7:20 pm

Sure. On the other hand, text is what made things easy to do initially, 
and you're missing one *BIG* clue: you cannot remote the support without 
losing compatibility with all traditional object formats.

So you have no choice. You need to support the text representation. As a 
result, *your* code will now be way more ugly and messy.

The thing is, parsing some little text may sound expensive, but if the 
expense is in finding the end of the string, we're doing really well.

In other words: the data structures are both simple and straightforward, 
and the only reason strlen() shows up at all is:

 - we pass strings around as just C strings, even when we know their 
   lengths. Prime example: look at tree-diff.c. And when you look at it, 
   realize that *for*every*single*strlen* in that file except for the very 
   last one (which is only used once per process for setup) we actually 
   know the string length from before, but we (well, *I*) decided that it 
   wasn't worth passing down as a parameter all the time.

 - the simple parsing of the tree itself (which really isn't that 
   expensive - the real expense is bringing the data into the CPU cache, 
   but that's something we'd need to do *anyway*).

So I seriously suspect that you could get the strlen() overhead down from 
that 16% pretty easily, but you'd have to pass the length of the "base" 
string along all the time (and in the tree_entry cases you'd replace the 
"strlen()" calls with a call to something like

	static inline int tree_entry_len(const char *name, const unsigned char *sha1)
	{
		return (char *)sha1 - (char *)name - 1;
	}

which will do it for you).

But what you're ignoring here is that "16%" may sound like a huge deal, 
but it's 16% of somethng that takes 1 second, and that other SCM's cannot 
do AT ALL.

		Linus
-

From: Nicolas Pitre
Date: Saturday, March 17, 2007 - 8:00 pm

Depends. We currently have separate parsers for trees, commits, tags, 
etc.  That should be easy enough to add another (separate) parser for 
new tree objects while still having a common higher level accessor 
interface like tree_entry().

But right now we only regenerate the text representation whenever the 
binary representation is encountered just to make things easy to do, and 
yet we still have a performance gain already in _addition_ to a net 

Of course the current tree parser will remain, probably forever.  And it 

Sure.  But at this point the reference to compare GIT performance 
against might be GIT itself.  And while 1 second is really nice in this 
case, there are some repos where it could be (and has already been 
reported to be) much more.

I still have a feeling that we can do even better than we do now.  Much 
much better than 16% actually.  But that require a new data format that 
is designed for speed.

We'll see.


Nicolas
-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 8:31 pm

I'd still like to see the KDE repo, that thing went quiet after it was 
supposed to hit sneaker-net..

If it was 30 seconds before to do a "git log" for some individual file, 
after the recent optimizations it should hopefully be down to 10. And I 
agree that I might be more motivated to try to get it down further if I 
could just find a repository where it's that much. 

Right now I can can do a "git log" on any file in the kernel archive in 
under a second (well, when I say "any file", I started with a script, but 
with 22 thousand files I didn't bother to run it for all that long, so I 
ended up testing a few random files in addition to the first few hundred 
files of "git ls-files", and they are all well under a second).

And that's without the "git diff --quiet" thing that is still in "next", 
and that cut down some of the overhead for other reasons (although I 
suspect the effect of that will be less when combined with my patches 
since the stuff it cut down I probably cut down even more).

I really suspect you'll have a hard time beating "normal" git with the 
patches I sent out. I'm sure it's quite possible - don't get me wrong - I 
just suspect it won't be spectacular, and it will be a lot of work.

		Linus
-

From: Julian Phillips
Date: Saturday, March 17, 2007 - 10:30 pm

In my test repository (which emulates a real repository in terms of 
approximate size in terms of commits, branches and tags) "git log f12000" 
takes about 15m (using 1.5.0.4).  After applying patches 1/2 and 2/2 on 
top of master I get ~3m50s.  With 3/2 as well it goes down a bit more to 
~3m20s.

I've attached the script that generated the repository in case you feel 
the urge to try some move time shaving exercises ... ;)

(This is a rather unrealistic repository consisting of a long series of 
commits of new binary files, but I don't have access to the repository 
that is being approximated until I get back to work on Monday ...)

-- 
Julian

  ---
That must be wonderful: I don't understand it at all.
 		-- Moliere
From: Linus Torvalds
Date: Sunday, March 18, 2007 - 10:23 am

This is a *horrible* test repo.

Is this actually really trying to approximate anything you work with? If 
so, please check whether you have cyanide or some other effective poison 
to kill all your cow-orkers - it's really doing them a favor - and then do 
the honorable thing yourself? Use something especially painful on whoever 
came up with the idea to track 25000 files in a single directory.

I'll see what the profile is, but even without the repo full generated 
yet, I can already tell you that you should *not* put tens of thousands of 
files in a single directory like this.

It's not only usually horribly bad quite independently of any SCM issues 
(ie most filesystems will have some bad performance behaviour with things 
like this - if only because "readdir()" will inevitably be slow).

And for git it means that you lose all ability to efficiently prune away 
the parts of the tree that you don't care about. git will always end up 
working with a full linear filemanifest instead of a nice collection of 
recursive trees, and a lot of the nice tree-walking optimizations that git 
has will just end up being no-ops: each tree is always one *huge* 
manifest.

So it's not that git cannot handle it, it's that a lot of the nice things 
that make git really efficient simply won't trigger for your repository.

In short: avoiding tens of thousands of files in a single directory is 
*always* a good idea. With or without git.

(Again, SCM's that are really just "one file at a time" like CVS, 
won't care as much. They never really track all files anyway, so while 
they are limited by potential filesystem performance bottlenecks, they 
won't have the fundamental issue of tracking 25,000 files..)

		Linus
-

From: Robin Rosenberg
Date: Sunday, March 18, 2007 - 3:53 am

From: Linus Torvalds
Date: Sunday, March 18, 2007 - 10:34 am

Do you happen to have a fast internet connection that you can expose this 
thing on?

3GB will take me a while to download, but it sounds like a great 
test-case. A 3GB pack-file is what we're supposed to be able to handle 
fairly comfortably right now, so it sounds like the ideal project to do 
performance testing on. 

		Linus
-

From: Robin Rosenberg
Date: Sunday, March 18, 2007 - 11:29 am

From: Shawn O. Pearce
Date: Sunday, March 18, 2007 - 2:25 pm

Robin Rosenberg <robin.rosenberg.lists@dewire.com> wrote:
> s
From: David Brodsky
Date: Monday, March 19, 2007 - 6:16 am

Robin Rosenberg wrote:
> s
From: Robin Rosenberg
Date: Monday, March 19, 2007 - 11:35 pm

Uploaded now.

David Brodsky provides the final location.

Linus: I noted a large extra file that I don't know where it is from. Seems to
be form the first convetsion. Perhaps you wont' need to download it: 

ECLIPSE.git/.git/objects/pack_sETUPg

-- robin
-

From: David Brodsky
Date: Tuesday, March 20, 2007 - 2:13 am

Anonymous ftp at agnes.kajka.koleje.cuni.cz:10000 - it will stay up for
a while, but since it my desktop machine, I don't guarantee anything.

And (hopefully) permanent http://steamer.kajka.koleje.cuni.cz/Eclipse

Enjoy

David
-

From: Linus Torvalds
Date: Tuesday, March 20, 2007 - 7:37 pm

Ok, thanks, downloaded. Although the pack-file is just 1.7GB for me, not 
3.7 like somebody said.

Anyway, doing a

	git blame --incremental HEAD -- org.eclipse.debug.ui/plugin.xml > /dev/null

on that thing (picked a random file that got modified in a recent commit) 
took something like 12 seconds, so this is certainly a perfectly fine 
test-case.

Sadly, "git-gui blame" doesn't work in a bare git repository, so I had to 
do an ugly

	ln -s . .git

to make git-gui happy, and that worked, and was pretty usable. Still, 
12seconds should be something we can improve on.

And yeah, the profile is pretty horrid:

	samples  %        app name                 symbol name
	70307    20.9412  libc-2.5.so              strlen
	50925    15.1682  libz.so.1.2.3            (no symbols)
	24295     7.2364  git                      tree_entry_interesting
	19816     5.9023  libc-2.5.so              memcpy
	19569     5.8287  git                      tree_entry_extract
	17693     5.2699  vmlinux                  memcpy_c
	17032     5.0730  git                      assign_blame
	16956     5.0504  git                      get_mode
	12401     3.6937  git                      get_origin
	11815     3.5191  git                      skip_uninteresting
	10449     3.1123  git                      update_tree_entry
	10359     3.0855  git                      find_pack_entry_one
	7946      2.3667  git                      cmp_suspect
	4572      1.3618  libc-2.5.so              strncmp
	...

so I guess we need to find some more strlen's to remove ;)

			Linus
-

From: Nicolas Pitre
Date: Tuesday, March 20, 2007 - 7:54 pm

There is a 1.3GB garbage pack_sETUPg file in eclipse.git/objects/ laying 
there, probably resulting from an interrupted index-pack, making the 
repository needlessly bigger.  It can be safely deleted.

We probably should make git-prune get rid of those automatically.


Nicolas
-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 8:06 pm

This is a micro-optimization that grew out of the mailing list discussion 
about "strlen()" showing up in profiles. 

We used to pass regular C strings around to the low-level tree walking 
routines, and while this worked fine, it meant that we needed to call 
strlen() on strings that the caller always actually knew the size of 
anyway.

So pass the length of the string down wih the string, and avoid 
unnecessary calls to strlen(). Also, when extracting a pathname from a 
tree entry, use "tree_entry_len()" instead of strlen(), since the length 
of the pathname is directly calculable from the decoded tree entry itself 
without having to actually do another strlen().

This shaves off another ~5-10% from some loads that are very tree 
intensive (notably doing commit filtering by a pathspec).

Signed-off-by: Linus Torvalds  <torvalds@linux-foundation.org>"
---


So here's the patch.

It definitely cuts down on CPU usage, and I actually left one extra 
"strlen()" around, simply because I didn't want to mess with the exported 
interface of "diff_tree()".

But that other strlen() is also one that is done *once* for the whole 
tree, so from a performance standpoint it doesn't matter (we *could* have 
passed in that length too, but that would have involved more changes that 
simply aren't really useful).

Does it help? Yes it does. It takes another 5-10% off my test-case. 
"strlen()" still exists, but it's basically half of what it used to be 
because we now basically only call it when literally parsing the tree data 
itself (ie now it's ~8% of the total, and no longer the hottest entry.

Is it worth it? If it was just a random micro-optimization I might not 
care, but I guess it's not that ugly to pass an extra "baselen" around all 
the time. And that "tree_entry_len()" helper function is actually quite 
nice. So yeah, I'd suggest applying this one just because it's actually a 
perfectly fine patch and it does speed things up.

So it *is* very much a micro-optimization, ...
From: Junio C Hamano
Date: Sunday, March 18, 2007 - 2:45 am

With your 256-entry cache, Nico's reusing objects out of delta
base cache, and this strlen() patch

	git blame -C block/ll_rw_blk.c

gets these numbers:

(v1.5.0)
14.71user 0.26system 0:15.07elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+93622minor)pagefaults 0swaps

(master + three patches)
8.94user 0.14system 0:09.10elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+40075minor)pagefaults 0swaps

Just for fun, these are the same for the kernel history with tglx-history 
repository's history grafted behind it, i.e. with this grafts file:

$ cat .git/info/grafts
1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 e7e173af42dbf37b1d946f9ee00219cb3b2bea6a

(v1.5.0)
73.80user 2.57system 1:16.40elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+773077minor)pagefaults 0swaps

(master + three patches)
65.14user 0.40system 1:05.55elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+125052minor)pagefaults 0swaps

In either case, it is showing drastic reduction of minor faults.

-

From: Linus Torvalds
Date: Sunday, March 18, 2007 - 8:54 am

That's an interesting test-case (and I get 53 seconds, nyaah, nyaah ;)

However, it's almost totally *not* about object access any more with my 
patches. All the top profiling hits are about generating the patches and 
assigning blame:

	samples  %        image name               app name                 symbol name
	470352   15.5813  git                      git                      xdl_hash_record
	298683    9.8944  git                      git                      cmp_suspect
	225156    7.4587  git                      git                      assign_blame
	221308    7.3312  libc-2.5.so              libc-2.5.so              memcpy
	177621    5.8840  libc-2.5.so              libc-2.5.so              memchr
	163571    5.4186  vmlinux                  vmlinux                  __copy_user_nocache
	129301    4.2833  git                      git                      xdl_prepare_ctx
	99009     3.2799  libc-2.5.so              libc-2.5.so              _int_malloc
	83899     2.7793  git                      git                      xdiff_outf
	80588     2.6696  libz.so.1.2.3            libz.so.1.2.3            (no symbols)
	..

so as you can see, libz is down in the 2.5% range, and strlen and the tree 
accessor functions are totally un the noise. 

So it looks like it *used* to be somewhat of a problem (the object access 
itself must have been about 10 seconds, since that got shaved off the 
time), but realistically, if you want to speed up "git blame", we can 
totally ignore the git object data structures, an dconcentrate on xdiff 
and on blame itself (cmp_suspect and assign_blame probably have some nasty 
O(n^2) behaviour or something like that, that could hopefully be fixed 
fairly easily. The xdl hashing is a different thing, and I don't think 
it's necessarily easy to fix that one..)

			Linus
-

From: Linus Torvalds
Date: Sunday, March 18, 2007 - 8:57 am

Btw, it's also an example of why the incremental blame is so much nicer.

No way would I want to wait 53 seconds to get the whole blame. But doing

	git gui blame HEAD block/ll_rw_blk.c

(the "git gui" command line is a bit unwieldly) you get something quite 
usable!

Of course, the git gui blame colorization is clearly done by somebody who 
is still actively popping LSD with both fists and didn't realize that the 
60's are long done, but that's another issue.

		Linus
-

From: Shawn O. Pearce
Date: Sunday, March 18, 2007 - 2:38 pm

git-gui is open source.  I'd be happy to take a patch.  Or,
since that is horribly messy Tcl/Tk code, just a better color
suggestion. :-)

-- 
Shawn.
-

From: Linus Torvalds
Date: Sunday, March 18, 2007 - 2:48 pm

Yeah, the Tcl/Tk part means that I take one look and decide that I have 
absolutely zero clue..

Also, I'm not entirely sure what the "right" color is, but the changing 
colors do confuse me. Also, maybe I'm some kind of white suburban 
house-wife or something, but I prefer calmer pastel colors over the bright 
ones you've selected.

I would suggest:

 - some special color for "currently selected" (which defaults to being 
   the first one coming out of the blame thing, of course). 

   I'd suggest "black text on pale green background", but that may be just 
   me. Patricia calls the current color "hot pink", and maybe that's 
   appropriate for a certain segment of the population, but I'm not sure I 
   want to even *meet* that segment ;)

 - some *stable* graduated color for the rest. I don't think it 
   necessarily needs to be "older" vs "newer", and in fact I'd suggest 
   just two slightly different shades of gray for the background - just 
   pick alternating shades for each blame entry that comes in (and leave 
   un-blamed lines white).

The flickering just makes me go "ooh, I'm really happy I don't have 
epilepsy, because otherwise I'd be writhing on the floor every time I 
tried to use this tool".

			Linus


-

From: Johannes Schindelin
Date: Monday, March 19, 2007 - 8:05 pm

Hi,


I felt a little left out in all that performance slashing, and so I 
thought maybe, just maybe, a small change in xdl_hash_record() can do 
wonders (since it _is_ really simple, but still takes almost a 6th of the 
CPU time). I don't have a proper test case setup, so maybe you want to try 
this:

-- snipsnap --
[PATCH] xdiff/xutils.c(xdl_hash_record): factor out whitespace handling

Since in at least one use case, xdl_hash_record() takes over 15% of the
CPU time, it makes sense to even micro-optimize it. For many cases, no
whitespace special handling is needed, and in these cases we should not
even bother to check for whitespace in _every_ iteration of the loop.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>

---

	Please do not consider this patch _unless_ it is proven to enhance 
	the profile statistics substantially.

 xdiff/xutils.c |   22 ++++++++++++++++++++--
 1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 3653864..bf91c0f 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -236,12 +236,13 @@ int xdl_recmatch(const char *l1, long s1, const char *l2, long s2, long flags)
 	return 0;
 }
 
-unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
+static unsigned long xdl_hash_record_with_whitespace(char const **data,
+		char const *top, long flags) {
 	unsigned long ha = 5381;
 	char const *ptr = *data;
 
 	for (; ptr < top && *ptr != '\n'; ptr++) {
-		if (isspace(*ptr) && (flags & XDF_WHITESPACE_FLAGS)) {
+		if (isspace(*ptr)) {
 			const char *ptr2 = ptr;
 			while (ptr + 1 < top && isspace(ptr[1])
 					&& ptr[1] != '\n')
@@ -270,6 +271,23 @@ unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
 }
 
 
+unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
+	unsigned long ha = 5381;
+	char const *ptr = *data;
+
+	if (flags & XDF_WHITESPACE_FLAGS)
+		return xdl_hash_record_with_whitespace(data, ...
From: Shawn O. Pearce
Date: Monday, March 19, 2007 - 8:29 pm

This is a massive difference for me.  I ran it on git-gui.sh in
the git-gui repository - this is a 6000 line file that has a lot of
revisions, and has been renamed a few times.  I applied the patch on
top of current 'master' (v1.5.1-rc1), so I was testing with Linus'
delta_base_cache.

# stock v1.5.1-rc1
$ for a in 1 2 3 4 5;do /usr/bin/time ../lt-blame blame --incremental HEAD git-gui.sh >/dev/null;done
        6.27 real         5.31 user         0.55 sys
        6.40 real         5.32 user         0.55 sys
        6.33 real         5.33 user         0.53 sys
        6.67 real         5.32 user         0.55 sys
        6.18 real         5.31 user         0.53 sys

# with the above patch
$ for a in 1 2 3 4 5;do /usr/bin/time ../js-blame blame --incremental HEAD git-gui.sh >/dev/null;done
        3.57 real         2.87 user         0.51 sys
        3.58 real         2.87 user         0.51 sys
        3.53 real         2.86 user         0.52 sys
        3.61 real         2.86 user         0.51 sys
        3.64 real         2.87 user         0.52 sys

For the record, both versions did produce identical output.

Given how small of a change it is, and how much of an improvement
it made, I say apply it.

-- 
Shawn.
-

From: Shawn O. Pearce
Date: Monday, March 19, 2007 - 8:40 pm

DrNick suggested on #git to try flipping the isspace test around.
This is a smaller change and generated the same ~3.60 seconds run
as Dscho's patch.  I like DrNick's version better.  ;-)

-->8--
[PATCH] xdiff/xutils.c(xdl_hash_record): factor out whitespace handling

Since in at least one use case, xdl_hash_record() takes over 15%
of the CPU time, it makes sense to even micro-optimize it. For
many cases, no whitespace special handling is needed, and in these
cases we should not even bother to check for whitespace in _every_
iteration of the loop.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
 xdiff/xutils.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/xdiff/xutils.c b/xdiff/xutils.c
index 3653864..7b1f213 100644
--- a/xdiff/xutils.c
+++ b/xdiff/xutils.c
@@ -241,7 +241,7 @@ unsigned long xdl_hash_record(char const **data, char const *top, long flags) {
 	char const *ptr = *data;
 
 	for (; ptr < top && *ptr != '\n'; ptr++) {
-		if (isspace(*ptr) && (flags & XDF_WHITESPACE_FLAGS)) {
+		if ((flags & XDF_WHITESPACE_FLAGS) && isspace(*ptr)) {
 			const char *ptr2 = ptr;
 			while (ptr + 1 < top && isspace(ptr[1])
 					&& ptr[1] != '\n')
-- 
1.5.1.rc1.595.gd1206

-- 
Shawn.
-

From: Linus Torvalds
Date: Monday, March 19, 2007 - 9:11 pm

For me, the result seems to be in the noise.

It may be due to running on Core 2. It's not very sensitive to 
micro-optimizations like this. It definitely makes sense to test the 
*stable* test first, since that will help branch prediction (the 
"isspace()" test is *not* very predictable), so I don't disagree with the 
patch, but I suspect it depends a lot on the microarchitecture just how 
much it matters.

Do you perhaps have a P4? It has a very bad branch mispredict penalty, so 
putting the predictable branch first could explain the big difference you 
see..

Dscho's bigger patch probably helps more on an in-order architecture, and 
should be equally good on a P4 (or Opteron). On Core 2, neither of the 
patches seem to make a huge difference.

			Linus
-

From: Shawn O. Pearce
Date: Monday, March 19, 2007 - 9:18 pm

I tested both patches on a PowerPC G4.  (Apple PowerBook, 1.5 GHz)
Running on Mac OS X 10.4.8.

Might be more of a Linux<->Darwin thing; perhaps my isspace is
significantly slower than yours is...  after all my mmap runs
like a PC from the 1980s...  ;-)

-- 
Shawn.
-

From: Linus Torvalds
Date: Monday, March 19, 2007 - 9:45 pm

No, we do a git-specific isspace().

But yeah, a G4 will explain the thing even more than a P4 would. The G4 
really isn't a very good uarch compared to the modern x86 ones. Not 
aggressively out-of-order with deep instruction queues and I don't think 
it does basically any memop re-ordering at all. I know Apple used to claim 
that they were the fastest PC around (both with the G4 and the G5), but 
let's face it, they lied.

The closer to in-order you are, the more instruction scheduling in sw 
tends to matter.

		Linus
-

From: Junio C Hamano
Date: Monday, March 19, 2007 - 10:44 pm

Because hoisting stable test outside loop is always better for
any architecture, I thought picking between Gitte and Gitney
patches is a no brainer, and I didn't bother to compare-bench,
but I got curious.

(plain)
7.89user 0.15system 0:08.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41608minor)pagefaults 0swaps
7.93user 0.18system 0:08.14elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41608minor)pagefaults 0swaps

(gitte -- separate function for slow path)
6.98user 0.18system 0:07.17elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41606minor)pagefaults 0swaps
7.14user 0.12system 0:07.26elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41607minor)pagefaults 0swaps

(gitney -- cheap test first before isspace)
7.23user 0.18system 0:07.42elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41608minor)pagefaults 0swaps
7.32user 0.14system 0:07.48elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+41607minor)pagefaults 0swaps

So it does not seem to make much difference on Athlon 64x2 either.

Will apply the "stupid hashcmp() removal" and Dscho's patch and
call it a day.

-

From: Junio C Hamano
Date: Monday, March 19, 2007 - 8:16 pm

With this stupidity-removal patch, it gets down to 7.80user from
8.72user (comparable number of minor faults) for blaming
block/ll_rw_blk.c (without tglx grafts)

diff --git a/builtin-blame.c b/builtin-blame.c
index b51cdc7..104521e 100644
--- a/builtin-blame.c
+++ b/builtin-blame.c
@@ -182,9 +182,8 @@ struct scoreboard {
 
 static int cmp_suspect(struct origin *a, struct origin *b)
 {
-	int cmp = hashcmp(a->commit->object.sha1, b->commit->object.sha1);
-	if (cmp)
-		return cmp;
+	if (a->commit != b->commit)
+		return 1;
 	return strcmp(a->path, b->path);
 }
 

-

From: Linus Torvalds
Date: Monday, March 19, 2007 - 9:31 pm

Yeah, this one works for me too. Even more than for you. For me, 

	git blame --incremental -C HEAD block/ll_rw_blk.c

takes 6.71s (best of ten) normally, and 4.85 (best of ten again) with your 
patch and Nico's one-liner. In fact, that's a much bigger improvement than 
I would have expected from the profile, but it may be that you just cut 
the data cache footprint down a lot, and thus made other things more 
efficient.

(I just double-checked. Nico's one-liner does help, but not nearly as 
radically as it did for Nico. The "best of ten" with *just* Nico's 
one-liner is 6.22 for me - better than before, but the combination of 
Nico's patch and yours is much more dramatic).

Btw, Dscho's slightly more invasive patch seems to *just* edge out Nico's 
one-liner for me, with best-of-ten being 6.17s.

The winner is your patch *with* Dscho's slightly more invasive one: 4.69s.

But the difference between the numbers of Dscho's bigger patch and Nico's 
one-liner really are totally in the noise. Dscho *just* wins the 
best-of-ten both with and without your patch, but in both cases it's 
*way* in the noise. For example, while 4.69s was the best for your+Dscho 
in my testing, the full series was

	0:05.69
	0:04.69
	0:04.82
	0:04.97
	0:04.85
	0:05.88
	0:04.77
	0:04.69
	0:05.12
	0:04.98

so the variability was big enough that I wouldn't say that 0.1s is really 
all that meaningful even for "best of ten". I didn't try to make the 
machine totally quiescent, I've got xmms playing in the background etc..

But these kinds of things will definitely vary from machine to machine. 
It's all good, though.

			Linus
-

From: Shawn O. Pearce
Date: Monday, March 19, 2007 - 9:39 pm

Uh, instead of Nico here don't you mean DrNick on #git?  He is in
real life Nicholas Miell.  Google says he's somewhat active in the
kernel world, so maybe you know him?  ;-)

-- 
Shawn.
-

From: Linus Torvalds
Date: Monday, March 19, 2007 - 9:57 pm

I actually meant you.

For some reason, I confuse you and Nico. I've done it several times, and 
even without any DrNick mention.

Time to take my meds, 

		Linus
-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 6:44 pm

Btw, the reason I'm pretty sure that it's not a profiling bug is that
 (a) the rest of the profile looks fine
 (b) it actually matches the rest of the profile.

In particular, while you reacted to

	samples  %        app name                 symbol name
	41527    15.6550  git                      strlen

you didn't bat an eye on

	9240      3.4833  git                      get_mode
	8863      3.3412  git                      tree_entry_extract

ie over 3% of time spent in tree entry extract and get_mode. But take 
another look at that tree_entry_extract() function in particular and look 
what it does, and ask yourself: if *that* function takes up 3% of time, 
what does it tell you about strlen()?

(Side note: we could probably improve "strlen()" in particular. We 
sometimes call it twice: look at "entry_extract()", which calls strlen() 
on the tree entry extract, but then *also* calls strlen on the resulting 
path.

I suspect the

	a->pathlen = strlen(a->path);

could be written as

	a->pathlen = (char *)a->sha1 - (char *)a->path - 1;

but somebody should check that I didn't off-by-one or something. Also, it 
migt be better to make that part of "tree_entry_extract()" itself, because 
other callers do the same thing (see "find_tree_entry()": doing a 
"strlen()" on the path return of tree_entry_extract() seems to be a common 
pattern).

HOWEVER!

Once we get to *that* level of optimizations, we're doing pretty damn 
well. I'm sure we could probably cut down that strlen() from 16% to 8% by 
being smart about it, but still - this is a "good kind of problem" to 
have, if these things are your lowest-hanging fruit!

Maybe it all boils down to the same thing: I just can't seem to be really 
upset about "git log drivers/usb/ > /dev/null" taking all of a second. It 
just doesn't strike me as a performance problem ;)

			Linus
-

From: Avi Kivity
Date: Saturday, March 17, 2007 - 11:28 pm

Or maybe strlen() is the first function to touch the page/cacheline.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

-

From: Linus Torvalds
Date: Saturday, March 17, 2007 - 3:44 pm

Btw, final comment on this issue:

I was initially a bit worried about optimizing for just the "git log" with 
pathspec or "git blame" kind of behaviour, and possibly pessimizing some 
other load.

But the way the caching works, this is likely to be faster (or at least 
not slower) even for something that doesn't ever need the cache (which in 
turn is likely to be because it's a smaller footprint query and only works 
on one version).

Because the way the cache works, it doesn't really do any extra work: it 
basically just delays the "free()" on the buffer we allocated. So for 
really small footprints it just avoids the overhead of free() (let the OS 
reap the pages for it at exit), and for bigger footprints (that end up 
replacing the cache entries) it will just do the same work a bit later.

Because it's a simple direct-mapped cache, the only cost is the (trivial) 
hash of a few instructions, and possibly the slightly bigger D$ footprint. 
I would strongly suspect that even on loads where it doesn't help by 
reusing the cached objects, the delayed free'ing on its own is as likely 
to help as it is to hurt.

So there really shouldn't be any downsides.

Testing on some other loads (for example, drivers/scsi/ has more activity 
than drivers/usb/), the 2x performance win seems to happen for other 
things too. For drivers/scsi, the log generating went down from 3.582s 
(best) to 1.448s.

"git blame Makefile" went from 1.802s to 1.243s (both best-case numbers 
again: a smaller win, but still a win), but there the issue seems to be 
that with a file like that, we actually spend most of our time comparing 
different versions.

For the "git blame Makefile" case *all* of zlib combined is just 18%, 
while the ostensibly trivial "cmp_suspect()" is 23% and another 11% is 
from "assign_blame()" - so for top-level entries the costs would seem to 
tend to be in the blame algorithm itself, rather than in the actual object 
handling.

(I'm sure that could be improved too, but ...
From: Jeff Garzik
Date: Friday, March 16, 2007 - 9:35 am

Ahhh.  At least for me, that explains a lot.  Rather than spending all 
its time in inflate_fast(), git is dealing with lots of zlib 
startup/shutdown overhead.

Although it sounds like zlib could indeed be optimized to reduce its 
startup and shutdown overhead, I wonder if switching compression 
algorithms to a pure Huffman or even RLE compression (with associated 
lower startup/shutdown costs) would perform better in the face of all 
those small objects.

And another random thought, though it may be useless in this thread:  I 
bet using a pre-built (compiled into git) static zlib dictionary for git 
commit and tree objects might improve things a bit.

	Jeff


-

From: Linus Torvalds
Date: Friday, March 16, 2007 - 9:51 am

Well, the thing is, I personally much prefer to have just a single 
compression algorithm and object layout. Most of the performance-critical 
objects from a decompression standpoint during commit traversal are all 
small (especially if you do pathname limiting), but when you do something 
like a "git add ." most objects are actually random blob objects and you 
need to have a compression algorithm that works in the general case too.

Of course, pack-v4 may (likely will) end up using different strategies for 
different objects (delta's in particular), but the "one single object 
compression type" was a big deal for initial implementation.

It's may not be fundamental to git operation (so we can fairly easily 
change it and make it more complex without any higher-level stuff even 
noticing), but it was definitely fundamental to "get something stable and 

That's kind of pack-v4 area. It will happen, but I'd actually like to see 
if we can just avoid stupid performance problems with zlib, independently 
of trying to make more tuned formats.

		Linus
-

From: Matt Mackall
Date: Friday, March 16, 2007 - 9:42 am

Mercurial simply stores uncompressed objects below a threshold of 44
bytes, based on benchmarks I did in April 2005. I'd probably up that
number if I redid my measurements today. There's just not a whole lot
zlib can do at these small sizes. Given that a SHA hash is an
uncompressible 20 bytes already, you're well into the domain of

Ideally, you'd compress all deltas in a chain with the same context.
You've got to decompress the delta base to do the delta
calculation, so this should allow you to recover the context up to
that point. Zlib isn't really set up for this sort of thing though.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Nicolas Pitre
Date: Friday, March 16, 2007 - 10:12 am

See my last post.  We'll do even better with special object 
encoding altogether.  Those representations are so dense that 
compression provides no gain at all making the point moot.


Nicolas
-

From: Shawn O. Pearce
Date: Friday, March 16, 2007 - 4:22 pm

As Nico already stated, for pack v4 we are probably heading in a
direction where these really small (except for blobs anyway) objects
aren't compressed at all by zlib.  They are smaller in disk space,

I've actually tried this with the Mozilla project.  The improvement
was under 2% on disk space usage and no runtime performance gains.
Not worth the pain involved.  We are seeing much higher disk
space improvements and much better performance gains in the pack
v4 prototype.

Oh, and that was *with* a dictionary that was customized to Mozilla.
Not a static one.  A lot of keywords in the dictionary were Mozilla
project specific, and would actually *hurt* compression for the
Linux kernel, Git, X.org, etc...

-- 
Shawn.
-

From: Nicolas Pitre
Date: Friday, March 16, 2007 - 10:06 am

This is why in pack v4 there will be an alternate tree object 
representation which is not deflated at all.

In short we intend to have 3 tables where common things are factored 
out:

 1) the path component string table (deflated)

 2) author/committer string table (deflated)

 3) sorted SHA1 table (obviously not deflated)

The sorted SHA1 table will be part of the pack instead of being in the 
pack index.  The idea is that most SHA1's are already duplicated in the 
pack already anyway within commit and tree objects.  With a single table 
then commit and tree objects can index into that SHA1 table rather than 
providing the SHA1 value inline for the objects they refer to.

This means that a tree object record would be only 6 bytes according to 
the current design: 2 bytes to index into the path component string 
table (which also include the mode information), and 4 bytes to index 
into the sorted SHA1 table.  And similarly for commit objects.

This means that the pack index will only have a table of offsets 
corresponding to the table of sorted SHA1's.

So... walking revisions will become only a matter of picking the first 
commit object, using the tree index value (which is not deflated), but 
instead of using it in the SHA1 table it could be used in the offset 
table to find the location of the corresponding tree object directly.  
Same goes for tree entries, or for locating the parent's commit object.

No deflating, no binary searching, no SHA1 comparisons.  Plain straight 
pointer dereference.

Then, if you want to filter tree walking on path spec, you only need to 
locate the path component in the path table once and use the 
corresponding index to filter tree entries instead of repeated strcmp().  
Same thing if you want to filter commits based on author/committer.  
One side effect of this is that you can tell straight away that a path 
doesn't exist in the whole pack if one of its components cannot be found 
in the table (that works only if no legacy tree ...
From: Linus Torvalds
Date: Friday, March 16, 2007 - 10:51 am

Well, the thing is, for things that really don't compress, zlib shouldn't 
add much of an overhead on uncompression. It *should* just end up being a 
single "memcpy()" after you've done:
 - check the header for size and mode ("plain data")
 - check the adler checksum (which is *really* nice - we've found real 
   corruption this way!).

The adler32 checksumming may sound unnecessary when you already have the 
SHA1 checksum, but the thing is, we normally don't actually *check* the 
SHA1 except when doing a full fsck. So I actually like the fact that 
object unpacking always checks at least the adler32 checksum at each 
stage, which you get "for free" when you use zlib.

So not using compression at all actually not only gets rid of the 
compression, it gets rid of a good safety valve - something that may not 
be immediately obvious when you don't think about what all zlib entails. 

People think of zlib as just compressing, but I think the checksumming is 
almost as important, which is why it isn't an obviously good thing to not 
compress small objects just because you don't win on size!

Remember: stability and safety of the data is *the* #1 objective here. The 
git SHA1 checksums guarantees that we can find any corruption, but in 
every-day git usage, the adler32 checksum is the one that generally would 
*notice* the corruption and cause us to say "uhhuh, need to fsck".

Everything else is totally secondary to the goal of "your data is secure". 
Yes, performance is a primary goal too, but it's always "performance with 
correctness guarantees"!

But I just traced through a simple 60-byte incompressible zlib thing. It's 
painful. This should be *the* simplest case, and it should really just be 
the memcpy and the adler32 check. But:

	[torvalds@woody ~]$ grep '<inflate' trace | wc -l
	460
	[torvalds@woody ~]$ grep '<adler32' trace | wc -l
	403
	[torvalds@woody ~]$ grep '<memcpy' trace | wc -l
	59

ie we spend *more* instructions on just the stupid setup in ...
From: Nicolas Pitre
Date: Friday, March 16, 2007 - 11:09 am

But the thing is that with tree objects which records are 6 fairly 
random bytes we already know that compression will never be worth it 
size wise, so it is not worth it even if the header overhead was zero.  

We still can perform adler32 on undeflated objects directly though.  But 
they need no be stored in the pack.  I'd store the adler32 checksum for 
each object in the pack index as it can be recomputed by index-pack 
(which will do the full SHA1 validation anyway).


Nicolas
-

From: Davide Libenzi
Date: Thursday, March 15, 2007 - 6:33 pm

A quick hack would be to just define:

#define SWITCH_LBL(n) \
	case n: \
	lbl_##n:

#define STATE_CHANGE(s) \
	state->mode = s; \
	goto lbl_##s;

Then replace all the "state->mode = STATE; break;" into STATE_CHANGE(STATE);
I'm giving it a try as we speak ...




- Davide


-

From: Davide Libenzi
Date: Thursday, March 15, 2007 - 7:06 pm

I get about 5-6% boost with it AFAICS ...



- Davide


-

Previous thread: [PATCH] use xstrdup please by Shawn O. Pearce on Thursday, March 15, 2007 - 6:02 pm. (1 message)

Next thread: Merging on a dirty tree??? by Dmitry Torokhov on Thursday, March 15, 2007 - 8:59 pm. (3 messages)