Re: What's cooking in git.git (topics)

Previous thread: [PATCH] gitk: Do not pick up file names of "copy from" lines by Johannes Sixt on Tuesday, October 2, 2007 - 7:21 am. (1 message)

Next thread: git-diff not showing changes (corrupt repo?) by Dan Zwell on Tuesday, October 2, 2007 - 11:55 am. (8 messages)
From: David Kastrup
Date: Tuesday, October 2, 2007 - 9:31 am

This does not actually require an actual merge _sort_ AFAICS: do the
"sort file.hashed" step using qsort.  The comparison step does not
actually need to produce merged output, but merely advances through
two hash arrays and generates statistics.

This should already beat the pants off the current implementation,
even when the hash array is sparse, simply because our inner loop then
has perfect hash coherence.

Getting rid of this outer O(n^2) remains an interesting challenge,
though.  One way would be the following: fill a _single_ array with
entries containing _both_ hash and file number.  Sort this, and then
gather the statistics of hash runs by making a single pass through.
That reduces the O(n^2) behavior to only those parts with actual hash
collisions.

-- 
David Kastrup

-

From: Jeff King
Date: Tuesday, October 2, 2007 - 10:39 am

Right, that's why I used "merge" in quotes. The sort used in the O(n)
step is irrelevant, but we are doing a merge-sort-like behavior in the
second step (except instead of actually merging into a new list, we are
summarizing the comparisons in a numeric "difference" variable). But I


Interesting. Care to take a stab at implementing it?

-Peff
-

From: David Kastrup
Date: Tuesday, October 2, 2007 - 11:44 am

I actually have worked through the last night on the day job, have
urgent stuff piling up in my freelance work queue, and the next thing
I need to finish for git is some smart stuff for delta packing.

So it's unlikely I'll get to _that_ anytime soon.  However, I had a
hilarious idea on the way home that kept me rather amused (perhaps my
programmer's humour is affected by sleep deprivation).

I was annoyed at needing double the space because of having to keep
score of both hash and file number.  So I came up with a rather cute
manner to avoid this: first do all files in isolation with full
precision, but store the resulting list of hash as difference to the
last value.  When merging the data of 2^k and 2^k (or somewhat less)
files, we multiply the values by two (this will not carry except for
utterly improbable cases or very small data sets which we can do
differently) and add one bit of identification.  When we have just a
single sequence remaining, undeltafying will tell us about collisions
in the high bits, and the affected files in the low bits.

Of course, using a merge-like algorithm means that we temporarily need
double space anyway.  Which takes some of the fun.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-

From: Linus Torvalds
Date: Tuesday, October 2, 2007 - 7:28 pm

[ This is the discussed stupid approach - just sort the dang hash array, 
  so that we can use a linear scan over the src/dst ]


Sadly, that's not the case. It *does* seem to beat the current 
implementation, but it's not "beat the pants off". It looks like an 
improvement of about 15%, which is nothing to sneeze at, but it's not an 
order-of-magnitude improvement either.

Here's a test-patch. I don't guarantee anything, except that when I did 
the timings I also did a "wc" on the result, and they matched..

Before:
	[torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. | wc
	   7104   28574  438020

	real    0m10.526s
	user    0m10.401s
	sys     0m0.136s

After:
	[torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. | wc
	   7104   28574  438020
	
	real    0m8.876s
	user    0m8.761s
	sys     0m0.128s

but the diff is fairly simple, so if somebody will go over it and say 
whether it's likely to be *correct* too, that 15% may well be worth it.

[ Side note, without rename detection, that diff takes just under three 
  seconds for me, so in that sense the improvement to the rename detection 
  itself is larger than the overall 15% - it brings the cost of just 
  rename detection from 7.5s to 5.9s, which would be on the order of just 
  over a 20% performance improvement. ]

Hmm. The patch depends on half-way subtle issues like the fact that the 
hashtables are guaranteed to not be full => we're guaranteed to have zero 
counts at the end => we don't need to do any steenking iterator count in 
the loop. A few comments might in order.

		Linus

---
 diffcore-delta.c |   54 ++++++++++++++++++++++++++++++------------------------
 1 files changed, 30 insertions(+), 24 deletions(-)

diff --git a/diffcore-delta.c b/diffcore-delta.c
index d9729e5..6d65697 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -46,22 +46,6 @@ struct spanhash_top {
 	struct spanhash data[FLEX_ARRAY];
 };
 
-static struct spanhash *spanhash_find(struct ...
From: Jeff King
Date: Tuesday, October 2, 2007 - 11:54 pm

I get slightly better speedups with my pathological case (around 30%):

Before:
  $ /usr/bin/time git-diff --raw -M -l0 06d288^ 06d288 >/dev/null
  105.38user 3.65system 2:14.90elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (15432major+542627minor)pagefaults 0swaps

After:
  $ /usr/bin/time git-diff --raw -M -l0 06d288^ 06d288 >/dev/null
  71.70user 3.47system 1:40.43elapsed 74%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (15065major+551778minor)pagefaults 0swaps


I found less noise in the timing by using --raw, since the patch

Patch looks correct, and it produces correct results on my (admittedly
limited) test data.

I think it's worth applying (though I agree that a comment on the
assumption of a zero "cnt" at the end is worth adding) unless some
drastically different solution comes along (e.g., David's idea to try
avoiding the outer O(n^2) loop). But I don't think there is much more to
be gained from a different approach to comparing the two hash tables.

-Peff
-

From: Linus Torvalds
Date: Wednesday, October 3, 2007 - 9:13 am

Ok, 30% is definitely "worth doing". Even if your performance still sucks, 
and 71 seconds is just way out of line for anything like this (of course, 
these days you need that "-l0" to ever trigger that case, but it would be 
nice if we could speed things up so much that we no longer care).

		Linus
-

From: David Kastrup
Date: Wednesday, October 3, 2007 - 1:20 am

Part of the reason is that it is not actually what I had in mind.  Why
create the hash array as a hash array?  Filling the hash array in
basically random order, then sort+compressing it is what is causing
much of the costs.  My idea was to just fill the "hash array"
linearly.  It is quite pointless (and certainly very inefficient with
regard to cache poisoning) to do it in hash order when we are going to
sort it anyway.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-

From: Jeff King
Date: Wednesday, October 3, 2007 - 9:59 am

Try profiling the code, and you will see that the creation of the hashes
is totally dwarfed by the comparisons. So yes, you might be able to
speed up the creation code, but it's going to have a minimal impact on
the overall run time.

-Peff
-

From: Linus Torvalds
Date: Wednesday, October 3, 2007 - 10:53 am

Yeah. Oprofile is your friend.

The biggest win would be to avoid calling diffcore_count_changes() in the 
first place, and we actually do that in the caller by looking at the size 
of the files. However, while that prunes out the *really* obvious cases, 
it's not nearly enough, since there tends to be very limited filesizes 
anyway.

What we could also do is to pass in the minimum similarity score, and use 
that to at least exit early. We currently pass in "delta_limit" which is 
close, but we never use it, and we really probably would be better off 
passing in the minimum score itself and perhaps be able to exit early from 
diffcore_count_changes.

However, while I did think about doing that, I came to the conclusion that 
we'd still always end up having to look at least at *half* the hash data 
(for the default 50% score), so while it would help, it again wouldn't be 
a matter of orders-of-magnitudes, and it looked like the end result would 
be unnecessarily complex too.

The best option, of course, would be to use a similarity hash to make the 
initial choice. I think we had one at one point, but it wasn't 
fine-grained enough. But it might be interesting to do that as a filter in 
*front* of the more expensive diffcore_count_changes() call.

We had some "similarity fingerprint" code at some point using a Rabin 
polynomial. It wasn't reliable enough to be used as a direct score, but 
maybe it can be used as a first-line "we know this isn't even worth doing 
the expensive stuff on".

			Linus
-

From: David Kastrup
Date: Wednesday, October 3, 2007 - 11:09 am

Well, and if -Oprofile has no strong opinion, I'd let wc -l pitch in.

When we are not actually going to use the hash tables as hash tables,
why create them as such?  If the first thing that actually looks at
the values of the hashes (except possibly for the optimization of not
storing the same hash twice in succession) is the sort, then there is
no code that can go out of whack when confronted with degenerate data.

Maybe it's not much of an optimization, but it certainly should be a
cleanup.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-

From: Junio C Hamano
Date: Thursday, October 4, 2007 - 12:10 am

The patch actually is quite readable.  That double-loop finding
the matching hashval in destination hash was simply silly to
begin with, so even if this is not "orders of magnitude"
improvement, I think your patch is worth doing.
-

Previous thread: [PATCH] gitk: Do not pick up file names of "copy from" lines by Johannes Sixt on Tuesday, October 2, 2007 - 7:21 am. (1 message)

Next thread: git-diff not showing changes (corrupt repo?) by Dan Zwell on Tuesday, October 2, 2007 - 11:55 am. (8 messages)