Re: Linear time/space rename logic for *inexact* case

Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
From: Jeff King
Date: Monday, October 22, 2007 - 3:34 am

On Mon, Oct 22, 2007 at 02:40:08AM -0700, Andy C wrote:


Great, welcome aboard. :)


OK, makes sense (that's what I was trying to propose near the end of my
mail).


Ah, very clever. I think that should have nice behavior most of the
time, though I wonder about a pathological case where we have many
files, all very similar to each other, and then have a commit where they
all start to diverge, but just by a small amount, while getting moved.
We would end up with an artificially low score between renamed files
(because we've thrown out all of the commonality) which may lead us to
believe there was no rename at all.

But it might be worth ignoring that case.


Makes sense.


Yes, I still have the problem (the 2 minutes was _after_ we did fixes,
down from 20 minutes; the latest patch from Linus just covers the "exact
match" case where two files have identical SHA1s).

It's a 1.2G repo at the end of a slow DSL line, so rather than cloning
it, here's a way to reproduce a repo with similar properties:

-- >8 --
#!/bin/sh
rm -rf repo
mkdir repo && cd repo

# seq and openssl aren't portable, but the
# point is to generate 200 random 1M files
for i in `seq -f %03g 1 600`; do
  openssl rand 100000 >$i.rand
done

# make repo, fully packed
# we don't bother trying to delta in the pack
# since the files are all random
git-init
git-add .
git-commit -q -m one
git-repack -a -d --window=0

# move every file
mkdir new
git-mv *.rand new

# modify every file
for i in new/*.rand; do
  echo foo >>$i
done
git-add new

# this is the operation of interest
time git-diff --cached --raw -M -l0 >/dev/null

-- >8 --

The idea is to have a large number of files that are slightly changed
and moved, and to try to find the pairs.  The diff takes about 20
seconds to run for me (the real repo has 1M files rather than 100K
files, but it's nice to have the tests take a lot less time). If you
want a bigger test, bump up the file size (or increase the number of
files, which will emphasize the quadratic behavior).


We have to handle binary files, too. In the current implementation, we
consider either lines or "chunks", and similarity is increased by the
size of the chunk.


I think we are already treating the files as unordered sets of lines.
And really, I think there is some value in that, anyway. If I reverse
the order of all lines in a file, it might be useful for git to say
"this file came from that file".


Great. git has a "look for copies also" flag, but it is usually disabled
because of the computational cost. If we can get it low enough, it might
actually become a lot more useful.


I'll try it on my test data, but it sounds like it doesn't really handle
binary files.


Trying to fit it into the C git code would be useful, but I doubt I'll
have time to work on it tonight, since it's getting onto dawn here.

-Peff
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: Linear time/space rename logic for *inexact* case, Jeff King, (Mon Oct 22, 3:34 am)
Re: Linear time/space rename logic for *inexact* case, Linus Torvalds, (Mon Oct 22, 5:49 pm)