This does not actually require an actual merge _sort_ AFAICS: do the "sort file.hashed" step using qsort. The comparison step does not actually need to produce merged output, but merely advances through two hash arrays and generates statistics. This should already beat the pants off the current implementation, even when the hash array is sparse, simply because our inner loop then has perfect hash coherence. Getting rid of this outer O(n^2) remains an interesting challenge, though. One way would be the following: fill a _single_ array with entries containing _both_ hash and file number. Sort this, and then gather the statistics of hash runs by making a single pass through. That reduces the O(n^2) behavior to only those parts with actual hash collisions. -- David Kastrup -
Right, that's why I used "merge" in quotes. The sort used in the O(n) step is irrelevant, but we are doing a merge-sort-like behavior in the second step (except instead of actually merging into a new list, we are summarizing the comparisons in a numeric "difference" variable). But I Interesting. Care to take a stab at implementing it? -Peff -
I actually have worked through the last night on the day job, have urgent stuff piling up in my freelance work queue, and the next thing I need to finish for git is some smart stuff for delta packing. So it's unlikely I'll get to _that_ anytime soon. However, I had a hilarious idea on the way home that kept me rather amused (perhaps my programmer's humour is affected by sleep deprivation). I was annoyed at needing double the space because of having to keep score of both hash and file number. So I came up with a rather cute manner to avoid this: first do all files in isolation with full precision, but store the resulting list of hash as difference to the last value. When merging the data of 2^k and 2^k (or somewhat less) files, we multiply the values by two (this will not carry except for utterly improbable cases or very small data sets which we can do differently) and add one bit of identification. When we have just a single sequence remaining, undeltafying will tell us about collisions in the high bits, and the affected files in the low bits. Of course, using a merge-like algorithm means that we temporarily need double space anyway. Which takes some of the fun. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
[ This is the discussed stupid approach - just sort the dang hash array,
so that we can use a linear scan over the src/dst ]
Sadly, that's not the case. It *does* seem to beat the current
implementation, but it's not "beat the pants off". It looks like an
improvement of about 15%, which is nothing to sneeze at, but it's not an
order-of-magnitude improvement either.
Here's a test-patch. I don't guarantee anything, except that when I did
the timings I also did a "wc" on the result, and they matched..
Before:
[torvalds@woody linux]$ time git diff -l0 --stat -C v2.6.22.. | wc
7104 28574 438020
real 0m10.526s
user 0m10.401s
sys 0m0.136s
After:
[torvalds@woody linux]$ time ~/git/git diff -l0 --stat -C v2.6.22.. | wc
7104 28574 438020
real 0m8.876s
user 0m8.761s
sys 0m0.128s
but the diff is fairly simple, so if somebody will go over it and say
whether it's likely to be *correct* too, that 15% may well be worth it.
[ Side note, without rename detection, that diff takes just under three
seconds for me, so in that sense the improvement to the rename detection
itself is larger than the overall 15% - it brings the cost of just
rename detection from 7.5s to 5.9s, which would be on the order of just
over a 20% performance improvement. ]
Hmm. The patch depends on half-way subtle issues like the fact that the
hashtables are guaranteed to not be full => we're guaranteed to have zero
counts at the end => we don't need to do any steenking iterator count in
the loop. A few comments might in order.
Linus
---
diffcore-delta.c | 54 ++++++++++++++++++++++++++++++------------------------
1 files changed, 30 insertions(+), 24 deletions(-)
diff --git a/diffcore-delta.c b/diffcore-delta.c
index d9729e5..6d65697 100644
--- a/diffcore-delta.c
+++ b/diffcore-delta.c
@@ -46,22 +46,6 @@ struct spanhash_top {
struct spanhash data[FLEX_ARRAY];
};
-static struct spanhash *spanhash_find(struct ...I get slightly better speedups with my pathological case (around 30%): Before: $ /usr/bin/time git-diff --raw -M -l0 06d288^ 06d288 >/dev/null 105.38user 3.65system 2:14.90elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (15432major+542627minor)pagefaults 0swaps After: $ /usr/bin/time git-diff --raw -M -l0 06d288^ 06d288 >/dev/null 71.70user 3.47system 1:40.43elapsed 74%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (15065major+551778minor)pagefaults 0swaps I found less noise in the timing by using --raw, since the patch Patch looks correct, and it produces correct results on my (admittedly limited) test data. I think it's worth applying (though I agree that a comment on the assumption of a zero "cnt" at the end is worth adding) unless some drastically different solution comes along (e.g., David's idea to try avoiding the outer O(n^2) loop). But I don't think there is much more to be gained from a different approach to comparing the two hash tables. -Peff -
Ok, 30% is definitely "worth doing". Even if your performance still sucks, and 71 seconds is just way out of line for anything like this (of course, these days you need that "-l0" to ever trigger that case, but it would be nice if we could speed things up so much that we no longer care). Linus -
Part of the reason is that it is not actually what I had in mind. Why create the hash array as a hash array? Filling the hash array in basically random order, then sort+compressing it is what is causing much of the costs. My idea was to just fill the "hash array" linearly. It is quite pointless (and certainly very inefficient with regard to cache poisoning) to do it in hash order when we are going to sort it anyway. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
Try profiling the code, and you will see that the creation of the hashes is totally dwarfed by the comparisons. So yes, you might be able to speed up the creation code, but it's going to have a minimal impact on the overall run time. -Peff -
Yeah. Oprofile is your friend. The biggest win would be to avoid calling diffcore_count_changes() in the first place, and we actually do that in the caller by looking at the size of the files. However, while that prunes out the *really* obvious cases, it's not nearly enough, since there tends to be very limited filesizes anyway. What we could also do is to pass in the minimum similarity score, and use that to at least exit early. We currently pass in "delta_limit" which is close, but we never use it, and we really probably would be better off passing in the minimum score itself and perhaps be able to exit early from diffcore_count_changes. However, while I did think about doing that, I came to the conclusion that we'd still always end up having to look at least at *half* the hash data (for the default 50% score), so while it would help, it again wouldn't be a matter of orders-of-magnitudes, and it looked like the end result would be unnecessarily complex too. The best option, of course, would be to use a similarity hash to make the initial choice. I think we had one at one point, but it wasn't fine-grained enough. But it might be interesting to do that as a filter in *front* of the more expensive diffcore_count_changes() call. We had some "similarity fingerprint" code at some point using a Rabin polynomial. It wasn't reliable enough to be used as a direct score, but maybe it can be used as a first-line "we know this isn't even worth doing the expensive stuff on". Linus -
Well, and if -Oprofile has no strong opinion, I'd let wc -l pitch in. When we are not actually going to use the hash tables as hash tables, why create them as such? If the first thing that actually looks at the values of the hashes (except possibly for the optimization of not storing the same hash twice in succession) is the sort, then there is no code that can go out of whack when confronted with degenerate data. Maybe it's not much of an optimization, but it certainly should be a cleanup. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
The patch actually is quite readable. That double-loop finding the matching hashval in destination hash was simply silly to begin with, so even if this is not "orders of magnitude" improvement, I think your patch is worth doing. -
