Actually, it's even better than that.
If we're walking a certain pathspec (which is really the only thing that
is expensive), we're pretty much *guaranteed* that we'll hit exactly this
case. Doing some instrumentation on the test-case I've been using (which
is just "git log drivers/usb/ > /dev/null") shows:
    [torvalds@woody linux]$ grep Needs delta-base-trace | wc -l
    469334
    [torvalds@woody linux]$ grep Needs delta-base-trace | sort -u | wc -l
    21933
where that delta-base-trace is just a trace of which delta bases were
needed. Look how we currently generate almost half a million of them, but
only 22000 are actually unique objects - we just generate many of them
over and over again. In fact, the top delta bases with counts look like:
    558 Needs 102398354
    556 Needs 161353360
    554 Needs 161354852
    552 Needs 161354916
    550 Needs 161354980
    526 Needs 161355044
    524 Needs 161355108
    522 Needs 161355174
    520 Needs 161355238
    508 Needs 161445724
    446 Needs 119712387
    425 Needs 133406737
    420 Needs 161513997
    387 Needs 120784913
    331 Needs 127094253
    321 Needs 95694853
    319 Needs 125888524
    303 Needs 155109487
    301 Needs 155627964
    299 Needs 155628028
    .....
ie the top twenty objects were all generated hundreds of times each.
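A per-base count like the one above is what the usual "sort | uniq -c | sort -rn" pipeline gives you. The exact command isn't shown here, so the following is only a guess at an equivalent, assuming every trace line looks like "Needs <offset>". The function name top_bases is made up for illustration.

```shell
# top_bases FILE [N]: print the N most frequently needed delta bases,
# most-requested first, in "count Needs offset" form.
# Assumes trace lines of the form "Needs <offset>" (hypothetical helper).
top_bases() {
    grep Needs "$1" |      # keep only the "Needs" trace lines
        sort |             # group identical lines so uniq can count them
        uniq -c |          # prefix each distinct line with its count
        sort -rn |         # highest count first
        head -n "${2:-20}" # default to the top twenty
}
```

Running "top_bases delta-base-trace" should then produce output in the same shape as the list above.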
More importantly, the trace also shows that it actually has very good
locality too - exactly as you'd expect, since when we traverse the trees,
we'd generally see a particular delta base used as a base while that thing
is slowly changing. So of the half-million "needs" entries in my trace, if
I pick the top delta base (102398354), use "cat -n" to give them all
line numbers (from 1 to half a million), and grep for that particular
delta base:
    grep Needs delta-base-trace | cat -n | grep 102398354 | less -S
they are *all* at lines 61624..89352, with the bulk of them being very
close together (most of those are around the 88k line mark).
In other words, it's not "spread out" over time. It's very clustered,
which I'd expect anyway, which means that even a simple cache of just a
few hundred entries (statically sized) will be very effective.
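One way to sanity-check that claim is to replay the trace through a toy cache and count hits. The sketch below assumes a direct-mapped cache indexed by "offset modulo number of slots", which is just one simple choice and not necessarily the scheme that ends up being used. The function name sim_cache and the default of 256 slots are made up for illustration.

```shell
# sim_cache FILE [SLOTS]: replay the "Needs <offset>" lines through a
# fixed-size direct-mapped cache and report how many lookups would hit.
# Hypothetical helper; the slot function (offset % SLOTS) is an assumption.
sim_cache() {
    grep Needs "$1" | awk -v slots="${2:-256}" '
        {
            base = $2                # the delta base offset
            slot = base % slots      # direct-mapped: one possible slot
            if ((slot in cache) && cache[slot] == base)
                hits++               # base is still cached
            else
                cache[slot] = base   # miss: replace whatever was there
            total++
        }
        END { printf "%d hits out of %d lookups\n", hits, total }'
}
```

Comparing, say, "sim_cache delta-base-trace 64" against "sim_cache delta-base-trace 1024" would show how quickly the hit rate levels off as the cache grows.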
So the cache doesn't need to be "complete". It will get good hit-rates
even from being very simple. I think I have a very simple and cunning
plan; I'll try it out asap.
Linus