Well I do adjust the file sizes when pruning. In the .py demo code
you can see it at these lines:
for f in filelist:
sizes[f] -= 1
The similarity metric depends on the sizes, so that way it won't get
out of whack if there are a lot of frequently occurring lines.
I did mention that pathological case to Linus. The worst thing is if
you have N adds and N deletes and every single file is identical.
Well the convenient thing is that the pruning handles this gracefully.
*Every* line in both indices will be pruned, and then the similarity
matrix will be completely empty, and no renames will be recorded.
Arguably that is what you want. You don't want N^2 renames. I guess
you could just assign them randomly to each other too for N renames.
Ah OK thanks. This is a good test case for binary files, and my
Linux 2.4 vs Linux 2.6 test seems reasonable for text files.
Yes, my thing doesn't handle binary files, but I have a variation of
the idea for that too which I'll explain below.
Can you explain what the current implementation does for binary files
in more detail, or point me to some docs? I'm having a hard code
figuring out what the code is doing, even though the comments insist
that it is very simple : )
from diffcore-delta.c:
"
* Idea here is very simple.
*
* Almost all data we are interested in are text, but sometimes we have
* to deal with binary data. So we cut them into chunks delimited by
* LF byte, or 64-byte sequence, whichever comes first, and hash them.
"
What is the similarity metric for binary files? Is it related to the
number of 64 byte chunks they have in common? How do you know that
the 64 byte chunks match up? Suppose I have 10k file, and then I
insert 10 random bytes at positions, 1000, 2000, 3000, etc. How does
that work?
Yes, there seem to be quite a few flags that have to do with trading
off features for computational time/memory. I believe this algorithm
could be made fast enough to simplify the user interface a bit. I'm
not yet super familiar with git, but I imagine it's probably hard for
some users to intuit what these thresholds do.
Yes, I think we should try it out for text files first, since that's
probably the more important case. But at the risk of getting a bit
ahead of myself, I think you can do a similar thing for binary files.
This first part isn't my idea; I heard it at a talk by Udi Manber at
Google, but I think it could be combined with my pruning and sparse
similarity matrix thing. And it looks like there are already some of
the same ideas in git, though I'm still deciphering the code (any docs
on this area would help).
Basically his idea is that you take a rolling window of say 64 bytes
and create an inverted index, just like you do for text files. So if
you have a 10,000 byte file, then there are (10,000 - 64) hashes to be
made.
This could be a lot, so you can speed it up by sampling. But you want
to sample in such a way that preserves commonality. That is, you
don't want to sample a given 64-byte sequence in one doc but miss it
in the other one. So then all you do is keep all windows where the
hash code is 0 mod k (and you can adjust k that to achieve the
sampling rate you want). If a given sequence is sampled in *one* doc,
then it's guaranteed to be sampled in *all* docs.
The second point is that you can do this rolling window hashing very
fast with a polynomial hash. You can use the results of the previous
hash to compute the next hash. I think you already have this in
diff-delta.c, but I'm not quite sure. I'm not sure exactly what that
code is used for either (I don't think it is for the rename matching,
because that is done by estimate_similarity and
diffcore_count_changes?).
Once you have these 2 inverted indices of { sampled subset of 64-byte
substrings -> [list of documents] } then you simply do the same thing
as I suggested for text files. The logic is more or less the same.
If 2 files have many 64-byte substrings in common compared to their
sizes, then they are similar. If a given 64-byte sequence (say
0000000000000...000) appears in many docs, then it's not a good
differentiator or indicator of similarity -- so throw it out.
(It occurs to me that it might be useful to do this with a small
window (8 bytes?) and a larger one (>64), and somehow combine the
results. I guess some experimentation is in order, or maybe there is
some theory about that.)
So this is a linear algorithm too. The hashing could have a high
constant factor, but the sampling and polynomial hashing should
basically eliminate that, I believe.
But like I said, it's probably better to concentrate on the text part
before anyone tackles that (at least I will look at the text case
first). I have less experience with this algorithm, but it seems like
it should work just as well, and is quite simple. I think both of
these algorithms are a bit more general and should lead to a net
reduction in lines of code, which is a good thing.
Andy
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html