Linear time/space rename logic for *inexact* case

Previous thread: [PATCH] s/pattern/prefix/ in help's list_commands by Scott R Parish on Monday, October 22, 2007 - 4:51 am. (1 message)

Next thread: git filter-branch --subdirectory-filter error by Jan Wielemaker on Monday, October 22, 2007 - 6:27 am. (4 messages)
To: <git@...>
Date: Monday, October 22, 2007 - 5:40 am

I just subscribed to this list, so sorry I can't respond to the
threads already started here. I'm the guy that was mailing Linus
about this algorithm to do similarity detection in linear time,
mentioned here:

Subject: [PATCH] Split out "exact content match" phase of rename detection
http://marc.info/?l=git&m=119299209323278&w=2

Subject: [PATCH, take 1] Linear-time/space rename logic (exact renames
http://marc.info/?l=git&m=119301122908852&w=2

Jeff, in this message http://marc.info/?l=git&m=119303566130201&w=2 I
think you basically hit on the first half of what I was getting at.

The step 1 you describe is the first step of the algorithm -- make an
inverted index of lines (and you can use a hash code of the line to
stand in for the line).

To avoid the m*n memory usage of step 2, I use a hash table which maps
2-tuples to counts to represent the sparse similarity matrix, instead
of representing it directly. The 2-tuple is a pair of filenames,
corresponding to the row/column of the matrix, and the counts/values
are the entries in the matrix.

You can also prune entries which have a long "postings list" (using
the term in the IR sense here:
http://www.xapian.org/docs/intro_ir.html). This has the nice side
effect of getting rid of quadratic behavior, *and* making the
algorithm more accurate because it stops considering common lines like
"#endif" as contributing to similarity.

This pruning of common lines from the index makes step 3 linear. In
fact, I prune the index to only include lines which occur just *once*.
Nearly every line of code in real data sets is unique, so this works
well in practice.

I already sent this demo to Linus, but I think it's worth sending to
the list as well. I am just going to copy parts of my earlier e-mails
below, and attach the same demos (hopefully it is kosher to send
attachments to this list).

Before I do that, Jeff, can you still reproduce this problem:

[ message continues ]

" title="http://marc.info/?l=git&m=118975452330644...">http://marc.info/?l=git&m=118975452330644...

To: Andy C <andychup@...>
Cc: <git@...>
Date: Monday, October 22, 2007 - 6:34 am

OK, makes sense (that's what I was trying to propose near the end of my

Ah, very clever. I think that should have nice behavior most of the
time, though I wonder about a pathological case where we have many
files, all very similar to each other, and then have a commit where they
all start to diverge, but just by a small amount, while getting moved.
We would end up with an artificially low score between renamed files
(because we've thrown out all of the commonality) which may lead us to
believe there was no rename at all.

Yes, I still have the problem (the 2 minutes was _after_ we did fixes,
down from 20 minutes; the latest patch from Linus just covers the "exact
match" case where two files have identical SHA1s).

It's a 1.2G repo at the end of a slow DSL line, so rather than cloning
it, here's a way to reproduce a repo with similar properties:

-- >8 --
#!/bin/sh
rm -rf repo
mkdir repo && cd repo

# seq and openssl aren't portable, but the
# point is to generate 200 random 1M files
for i in `seq -f %03g 1 600`; do
openssl rand 100000 >$i.rand
done

# make repo, fully packed
# we don't bother trying to delta in the pack
# since the files are all random
git-init
git-add .
git-commit -q -m one
git-repack -a -d --window=0

# move every file
mkdir new
git-mv *.rand new

# modify every file
for i in new/*.rand; do
echo foo >>$i
done
git-add new

# this is the operation of interest
time git-diff --cached --raw -M -l0 >/dev/null

-- >8 --

The idea is to have a large number of files that are slightly changed
and moved, and to try to find the pairs. The diff takes about 20
seconds to run for me (the real repo has 1M files rather than 100K
files, but it's nice to have the tests take a lot less time). If you
want a bigger test, bump up the file size (or increase the number of

We have to handle binary files, too. In the current implementation, we
consider either lines or "chunks", and similarity is increased by the

I think ...

To: Jeff King <peff@...>
Cc: <git@...>
Date: Monday, October 22, 2007 - 7:47 pm

Well I do adjust the file sizes when pruning. In the .py demo code
you can see it at these lines:

for f in filelist:
sizes[f] -= 1

The similarity metric depends on the sizes, so that way it won't get
out of whack if there are a lot of frequently occurring lines.

I did mention that pathological case to Linus. The worst thing is if
you have N adds and N deletes and every single file is identical.
Well the convenient thing is that the pruning handles this gracefully.
*Every* line in both indices will be pruned, and then the similarity
matrix will be completely empty, and no renames will be recorded.
Arguably that is what you want. You don't want N^2 renames. I guess

Ah OK thanks. This is a good test case for binary files, and my

Yes, my thing doesn't handle binary files, but I have a variation of
the idea for that too which I'll explain below.

Can you explain what the current implementation does for binary files
in more detail, or point me to some docs? I'm having a hard code
figuring out what the code is doing, even though the comments insist
that it is very simple : )

from diffcore-delta.c:
"
* Idea here is very simple.
*
* Almost all data we are interested in are text, but sometimes we have
* to deal with binary data. So we cut them into chunks delimited by
* LF byte, or 64-byte sequence, whichever comes first, and hash them.
"

What is the similarity metric for binary files? Is it related to the
number of 64 byte chunks they have in common? How do you know that
the 64 byte chunks match up? Suppose I have 10k file, and then I
insert 10 random bytes at positions, 1000, 2000, 3000, etc. How does

Yes, there seem to be quite a few flags that have to do with trading
off features for computational time/memory. I believe this algorithm
could be made fast enough to simplify the user interface a bit. I'm
not yet super familiar with git, but I imagine it's probably hard for

Yes, I think we should try it out for text files first, sinc...

To: Andy C <andychup@...>
Cc: Jeff King <peff@...>, <git@...>
Date: Monday, October 22, 2007 - 8:49 pm

The 'LF' byte will start a new window, so even with binary files (assuming
some random-enough distribution that you have *some* 'LF' bytes!), you
basically get a synchronization event on each LF.

So the 64-byte hunks "line up" automatically, even for binary files. Maybe
not immediately, but soon enough (ie on average, assuming some reasonably
compressed - ie virtually random - binary format) you should find a LF
roughly every fourth hunk.

There are probably smarter ways to guarantee it even in the absense of
certain bit-patterns, but I bet the "break every 64 bytes or when you see
a LF" thing works well for any reasonable binary file too.

But text files are obviously the most important case. Diffs on binaries
are somewhat hit-and-miss regardless.

Linus
-

To: <git@...>
Date: Monday, October 22, 2007 - 6:09 am

I think I can make this a lot clearer than I did, while glossing over
some details and the line_threshold parameter.

1) Make a "left index" and a "right index" out of the 2 sets of files,
{ line => [list of docs] }.

2) Remove any lines that appear in more than one doc from the left
index. Do the same for the right index. (this corresponds to
line_threshold=1 case)

3) For all lines, if the line appears in *both* the left index and the
right index, increment the count of the (row=doc from left set,
column=doc from right set) entry in the similarity matrix by 1. The
matrix is represented by a hash of 2-tuples => counts.

After this is done for all lines, then the matrix is sparsely filled
with the count of common lines between every pair of files in the 2
sets. The vast majority of cells in the matrix are implicitly 0 and
thus consume neither memory nor CPU with the hash table representation
of matrix.

4) Then you can use this to compute similarity scores.

Hopefully that is more clear... though I guess it might not be obvious
that it works for the problem that git has. I am fairly sure it does,
but the demo should allow us to evaluate that.

Andy
-

Previous thread: [PATCH] s/pattern/prefix/ in help's list_commands by Scott R Parish on Monday, October 22, 2007 - 4:51 am. (1 message)

Next thread: git filter-branch --subdirectory-filter error by Jan Wielemaker on Monday, October 22, 2007 - 6:27 am. (4 messages)