Re: git-diff-tree inordinately (O(M*N)) slow on files with many changes

Previous thread: [RFC/PATCH] git-fetch: Use already fetched branch with the --local flag. by sbejar on Monday, October 16, 2006 - 6:40 am. (4 messages)

Next thread: Re: git-diff-tree inordinately (O(M*N)) slow on files with many changes by Jakub Narebski on Monday, October 16, 2006 - 9:54 am. (1 message)
From: Jim Meyering
Date: Monday, October 16, 2006 - 7:12 am

[using very latest code, built an hour ago:
 git version 1.4.3.rc3.gb32db-dirty ]

I found that git-diff-tree is *very* slow when processing files with
many changes.  The offending example involves comparing the configure
file from coreutils-6.3 with that from the latest coreutils development
sources.  Both are over 50k lines long, and the diff -u output is almost
50k-lines long, divided into ~700 hunks.

  http://meyering.net/code/git-perf/configure.gz
  http://meyering.net/code/git-perf/configure-curr.gz

Comparing them with "diff -u" takes about 0.3s.
Putting them in a git repo (uncompressed, and with the same name,
of course) and comparing with git-diff-tree takes over a minute.
That's 200 times slower.

I traced the problem to xdiff/xprepare.c's xdl_cleanup_records function.
Each of its two doubly-nested loops ends up running the inner-loop
code more than 2 *billion* times.

That seems to be due to the two typos fixed by this patch:
With this patch, my "git-diff-tree --no-commit-id -r -p 2c2172"
command completes in just 2 seconds.

diff --git a/xdiff/xprepare.c b/xdiff/xprepare.c
index 1be7b31..e5438a9 100644
--- a/xdiff/xprepare.c
+++ b/xdiff/xprepare.c
@@ -381,7 +381,7 @@ static int xdl_cleanup_records(xdfile_t
 		hav = (*recs)->ha;
 		rhi = (long) XDL_HASHLONG(hav, xdf2->hbits);
 		for (nm = 0, rec = xdf2->rhash[rhi]; rec; rec = rec->next)
-			if (rec->ha == hav && ++nm == mlim)
+			if (rec->ha == hav || ++nm == mlim)
 				break;
 		dis1[i] = (nm == 0) ? 0: (nm >= mlim) ? 2: 1;
 	}
@@ -392,7 +392,7 @@ static int xdl_cleanup_records(xdfile_t
 		hav = (*recs)->ha;
 		rhi = (long) XDL_HASHLONG(hav, xdf1->hbits);
 		for (nm = 0, rec = xdf1->rhash[rhi]; rec; rec = rec->next)
-			if (rec->ha == hav && ++nm == mlim)
+			if (rec->ha == hav || ++nm == mlim)
 				break;
 		dis2[i] = (nm == 0) ? 0: (nm >= mlim) ? 2: 1;
 	}

However, that change causes the t800*.sh (14,16,18) annotate/blame
tests to fail, so take it with a grain of salt.

I.e., running one of ...
From: Linus Torvalds
Date: Monday, October 16, 2006 - 8:47 am

Davide? I'm quoting the whole report, because I suspect you don't follow 
the git lists, and this is all original libxdiff code.

Jim: the annotation failure _may_ just be due to a "valid" diff change 
(there is not always a unique correct answer for a diff, and so two 
different diff algorithms can validly give two different answers, which 
will also mean that git-annotate/blame would give different explanations). 

But it could certainly also be that you just broke the diffs entirely, so 
I would like to wait for Davide to comment on your diff before Junio 
should apply it. 

Others may be intimately familiar with the diff algorithms too, of course. 
It just scares me personally ;)

		Linus

-

From: Linus Torvalds
Date: Monday, October 16, 2006 - 9:12 am

I think you broke it. 

If the "&& vs ||" makes a difference (and it clearly does), that implies 
that you have lots of different hash values on the same hash chain, and 
you end up considering those _different_ hash values to be all equivalent 
for the counting, even though they obviously aren't.

I think the real problem is that with big input, the hash tables are too 
small, making the hash chains too long - even though the values on the 
chains are different (ie we're not hashing different records with the same 
hash value over and over again - if that was true, the "&& vs ||" change 
wouldn't make any difference).

So I think xdiff has chosen too small a hash. Can you try what happens if 
you change xdl_hashbits() (in xdiff/xutil.c) instead? Try making it return 
a bigger value (for example, by initializing "bits" to 2 instead of 0), 
and see if that makes a difference.

But again, I'm not actually all _that_ familiar with the libxdiff 
algorithms, _especially_ the line-based ones (I can follow the regular 
binary delta code, but the line-based one just makes my head hurt). So 
take anything I say with a pinch of salt.

		Linus
-

From: Jim Meyering
Date: Monday, October 16, 2006 - 9:33 am

It makes no difference.

Bear in mind that there are a *lot* of duplicate lines in the files
being compared: filtering each through "sort -u" removes 40-50k lines.
-

From: Davide Libenzi
Date: Monday, October 16, 2006 - 9:42 am

Ok, try to bring down XDL_MAX_EQLIMIT to something like 8 or 16 ...



- Davide


-

From: Jim Meyering
Date: Monday, October 16, 2006 - 9:50 am

That helps a little.
Now, instead of taking 63s, my test takes ~30s.
(32 for XDL_MAX_EQLIMIT = 16, 30 for XDL_MAX_EQLIMIT = 8)
-

From: Davide Libenzi
Date: Monday, October 16, 2006 - 9:54 am

That's too much still. May I have the offending files?



- Davide


-

From: Jim Meyering
Date: Monday, October 16, 2006 - 9:57 am

These URLs were in at least one of the messages to which
-

From: Davide Libenzi
Date: Monday, October 16, 2006 - 10:02 am

Thanks, those are fine.



- Davide


-

From: Linus Torvalds
Date: Monday, October 16, 2006 - 10:56 am

Btw, what architecture is this on?

I'm testing those two files, and I get much more reasonable numbers with 
both ppc32 and x86. Both 32-bit:

	[torvalds@macmini test-perf]$ time git show | wc -l
	25221

	real    0m1.437s
	user    0m1.436s
	sys     0m0.012s

ie it generated the diff in less than a second and a half. Not wonderful, 
but certainly not your 63s either.

HOWEVER. On x86-64, it takes forever (still not 63 seconds, but it takes 
17 seconds on my 2GHz merom machine).

So I think there's something seriously broken with hashing on 64-bit. 

And I think I know what it is.

Try this patch. And make sure to do a "make clean" first, since I think 
the dependencies on xdiff may be broken.

Davide: there's two things wrong with your old XDL_HASHLONG():

 - the GR_PRIME was just 32-bit, so it wouldn't shift low bits up far 
   enough on a 64-bit architecture, so then shifting things down caused 
   pretty much everything to be very small.

 - The whole idea of shifting up by multiplying and then shifting down to 
   get the high bits is _broken_. Even on 32-bit architectures. Think 
   about what happens when "hashbits" is 16 on a 32-bit architecture: the 
   multiply moves the low bits _up_, but it doesn't move the high bits 
   _down_. And with hashbits being a large fraction of the whole word, you 
   need to shift things down, not up.

So just making GR_PRIME be a bigger value on a 64-bit architecture would 
not have fixed it. The whole hash was simply broken. Do it the sane and 
obvious way instead: always pick the low bits, but mix in upper bits there 
too..

This patch brings the time down from 17 seconds to 0.8 seconds for me.

		Linus

---
diff --git a/xdiff/xmacros.h b/xdiff/xmacros.h
index 4c2fde8..bb4830b 100644
--- a/xdiff/xmacros.h
+++ b/xdiff/xmacros.h
@@ -24,14 +24,27 @@ #if !defined(XMACROS_H)
 #define XMACROS_H
 
 
-#define GR_PRIME 0x9e370001UL
+static inline unsigned long xdl_hashlong(unsigned long val, unsigned int ...
From: Linus Torvalds
Date: Monday, October 16, 2006 - 11:03 am

Side note: in _practice_ I think it would have fixed it. The "not mixing 
in high bits" is not a real problem if the original hash-value has a good 
distribution of bits, which I think we do have. So it's unclear whether we 
even need any mixing in of bits at all, and it's possible that it would be 
fine to just have

	#define XDL_HASHLONG(v,b) ((unsigned long)(v) & ((1ul << (b))-1))

which is simpler than my patch.

I prefer the mixing in of high bits just because it can help if the 
original hash was bad (or had a tendency to have patterns in the low bits, 
which could be the case). But I'm not sure xdiff actually needs it in this 
case.

			Linus
-

From: Davide Libenzi
Date: Monday, October 16, 2006 - 11:41 am

ATM, I added AC_CHECK_SIZEOF(long) inside libxdiff's configure.in, and I 
have (inside xmacros.h):

#if SIZEOF_LONG == 4
#define GR_PRIME 0x9e370001UL
#else
#define GR_PRIME 0x9e37fffffffc0001UL
#endif

I'm also looking into streamlining the discard loop to reuse information 
collected during the context-setup phase ...




- Davide


-

From: Davide Libenzi
Date: Monday, October 16, 2006 - 11:18 am

Yeah, using an appropriate golden ratio prime for 64 bits fixes it. I 
think it's the best/minimal fix (use 0x9e37fffffffc0001UL, like the 
kernel does).
I'm also looking into optimizing the multi-match discard loop, that 
actually loses the classifier informations collected in the context 
prepare phase.




- Davide


-

From: Linus Torvalds
Date: Monday, October 16, 2006 - 11:51 am

Junio, I think this is worthy to go in before a 1.4.3 release. Possibly 
even back-ported to earlier trees. Anything that causes an almost two 
orders of magnitude slowdown (even if it's just on 64-bit architectures 
and most people won't necessarily compile git that way) is worth fixing 
pronto.


Ok. But then you need something like the appended to avoid warnings..

(This is the only nice portable way to figure out at compile-time whether 
"unsigned long" is more than 32 bits that I can come up with: everything 
that uses actual C expressions ends up warning about integers not fitting 
etc)

Quite frankly, I prefer my previous patch more, it just avoids that whole 
problem, and two shifts and adds (even with a conditional) are often 
faster than a full 64-bit multiply.

		Linus
---
diff --git a/xdiff/xmacros.h b/xdiff/xmacros.h
index 4c2fde8..38f8f93 100644
--- a/xdiff/xmacros.h
+++ b/xdiff/xmacros.h
@@ -23,8 +23,13 @@
 #if !defined(XMACROS_H)
 #define XMACROS_H
 
+#include <limits.h>
 
+#if LONG_MAX > 2147483647ul
+#define GR_PRIME 0x9e37fffffffc0001UL
+#else
 #define GR_PRIME 0x9e370001UL
+#endif
 
 
 #define XDL_MIN(a, b) ((a) < (b) ? (a): (b))
-

From: Davide Libenzi
Date: Monday, October 16, 2006 - 12:44 pm

I ended up using this one:

#define XDL_HASHLONG(v, b) ((((unsigned long) (v) >> ((CHAR_BIT * sizeof(unsigned long)) - (b))) + \
                            (unsigned long) (v)) & ((1UL << (b)) - 1))

The GR_PRIME selection does not make me feel good, and the 'static inline' 
is puked-over by certain C compilers. It'd be probably fine to just use a 
simple function, though the above should work just fine.

real    0m0.665s
user    0m0.655s
sys     0m0.010s

(Opteron 252)



- Davide


-

From: Junio C Hamano
Date: Monday, October 16, 2006 - 3:53 pm

Thanks for resolving this quickly; I was (am still somewhat)

I agree (although I am not sure about the "do it twice for
small" bit), and I think Davide agrees with you in his reply:


so I am inclined to apply Davide's version, but I am going to
bed again now...



-

From: Linus Torvalds
Date: Monday, October 16, 2006 - 4:24 pm

Sure. Davide's all-macro version is fine. I don't like re-using the same 
value twice even in a ALL-CAPS macro, so I'm used to inline functions, but 
all the uses of XDL_HASHLONG() are fine with multiple uses of the 
arguments.

Somebody should just double-check that all the parentheses ended up being 
right ;)

It might be easier to read if you write it as

	#define BITS_IN_LONG	(CHAR_BIT * sizeof(unsigned long))
	#define XDL_HIGHBITS(v,b) ((v) >> (BITS_IN_LONG - (b)))
	#define XDL_MASKBITS(b) ((1UL << (b)) - 1)
	#define XDL_HASHBITS(v,b) (((v) + XDL_HIGHBITS(v,b)) & XDL_MASKBITS(b))
	#define XDL_HASHLONG(v,b) XDL_HASHBITS( (unsigned long)(v) , b )

just to avoid one huge #define.

That said, it unnecessarily calculates "BITS_IN_LONG - (b)" to shift with, 
because it really shouldn't matter _which_ high bits you use for hashing, 
so you might as well just use the "next" b bits, and have

	#define XDL_ADDBITS(v,b)	((v) + ((v) >> (b)))
	#define XDL_MASKBITS(b)		((1UL << (b)) - 1)
	#define XDL_HASHLONG(v,b)	(XDL_ADDBITS((unsigned long)(v), b) & XDL_MASKBITS(b))

which generates better code at least on x86 (and x86-64), because the 
shift count stays the same for all shifts and can thus be kept in %ecx. 
For example, on x86-64, you get

	movq    %rdi, %rax		# copy 'val'
	movl    $1, %edx		# const 1: start generating (1 << b) - 1
	shrq    %cl, %rax		# val >> b
	salq    %cl, %rdx		# 1 << b
	leaq    (%rdi,%rax), %rax	# val + (val >> b)
	subq    $1, %rdx		# (1 << b) -1
	andq    %rdx, %rax		# final hash

which is short and sweet. And on ppc32 (or ppc64) you get

	li 9,1			# const 1: start generating (1 << b) - 1
	srw 0,3,4		# val >> b
	slw 9,9,4		# 1 << b
	add 0,0,3		# val + (val >> b)
	addi 9,9,-1		# (1 << b) - 1
	and 3,0,9		# final hash

in other words, apart from having two shifts (which you can't really 
avoid, although a multiply can do one of them) it's just a very efficient 
way to mix together (2*b) bits into a (b)-bit hash.

But taking the high bits ...
From: Davide Libenzi
Date: Monday, October 16, 2006 - 4:52 pm

Ok, I'm fine with this. And my Opteron agrees too:

real    0m0.283s
user    0m0.267s
sys     0m0.016s



- Davide


-

From: Jim Meyering
Date: Monday, October 16, 2006 - 11:24 am

Yep.  Dependencies are definitely broken.
Applied your patch.  No improvement after a plain "make",
but doing "make clean && make" solved the problem.

Now, my diff-tree takes 2s (it's comparing other files, too).
Thank you!

IMHO, my "&& vs. ||" patch is still worth applying.
If not, then the existing code doesn't make sense, and
there can be significant simplification in the affected loops.
With my patch, I get an additional 3x speed-up: diff-tree takes 0.7s
-

From: Davide Libenzi
Date: Monday, October 16, 2006 - 11:30 am

No, the patch is broken. It will discard *any* line seen at least two 
times, that is an extremely low threashold.



- Davide


-

From: Jim Meyering
Date: Monday, October 16, 2006 - 11:43 am

Davide Libenzi <davidel@xmailserver.org> wrote:

Oh!  I see it now.  Thanks.
-

From: Linus Torvalds
Date: Monday, October 16, 2006 - 9:54 am

It can't be due to duplicate lines. If the lines are truly duplicate, then 
they'd get the same 32-bit hash value, and then the first conditional in 
the expression would always be true, and then it wouldn't _matter_ if it's 
a "&&" or a "||".

See?

So as far as I can tell it has to be some kind of collission on the hash 
queue with _different_ hash values being queued on the same hash queue.

Now, it could be that there's a bad hash algorithm somewhere (eg if 
XDL_HASHLONG() just does horribly badly in distributing the hash values 
onto the hash queues, you'd see this _regardless_ of how many bits you 
have, just because it clumps).

Or there could be something else that I'm just missing..

It would probably be nice to just get a sampling of what the hash-queue 
looks like for the bad case? Maybe it would be obvious that certain 
different hash values then get the same XDL_HASHLONG() thing..

		Linus
-

From: Davide Libenzi
Date: Monday, October 16, 2006 - 9:36 am

I think the xdl_hashbits() picks up the hash table size "almost" 
correctly. I think we're looking at some bad hash *collisions* (not 
records with same hash value, that'd be stopped by the mlim check). 

That's my revenge on myself having to follow your code in the kernel  :D




- Davide


-

From: Linus Torvalds
Date: Monday, October 16, 2006 - 9:57 am

Davide, they were mentioned in the original report:

	http://meyering.net/code/git-perf/configure.gz
	http://meyering.net/code/git-perf/configure-curr.gz

I'd take a look myself, but I need to run..

		Linus
-

From: Davide Libenzi
Date: Monday, October 16, 2006 - 9:24 am

The test is fine as is. Only really bad hash collisions can show O(M*N). 
Can I have the two sample files to test?



- Davide


-

Previous thread: [RFC/PATCH] git-fetch: Use already fetched branch with the --local flag. by sbejar on Monday, October 16, 2006 - 6:40 am. (4 messages)

Next thread: Re: git-diff-tree inordinately (O(M*N)) slow on files with many changes by Jakub Narebski on Monday, October 16, 2006 - 9:54 am. (1 message)