Re: Linear time/space rename logic for *inexact* case

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: <git@...>
Date: Monday, October 22, 2007 - 6:09 am

On 10/22/07, Andy C <andychup@gmail.com> wrote:

I think I can make this a lot clearer than I did, while glossing over
some details and the line_threshold parameter.

1) Make a "left index" and a "right index" out of the 2 sets of files,
{ line => [list of docs] }.

2) Remove any lines that appear in more than one doc from the left
index.  Do the same for the right index.  (this corresponds to
line_threshold=1 case)

3) For all lines, if the line appears in *both* the left index and the
right index, increment the count of the (row=doc from left set,
column=doc from right set) entry in the similarity matrix by 1.  The
matrix is represented by a hash of 2-tuples => counts.

After this is done for all lines, then the matrix is sparsely filled
with the count of common lines between every pair of files in the 2
sets.  The vast majority of cells in the matrix are implicitly 0 and
thus consume neither memory nor CPU with the hash table representation
of matrix.

4) Then you can use this to compute similarity scores.

Hopefully that is more clear... though I guess it might not be obvious
that it works for the problem that git has.  I am fairly sure it does,
but the demo should allow us to evaluate that.

Andy
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: Linear time/space rename logic for *inexact* case, Linus Torvalds, (Mon Oct 22, 8:49 pm)
Re: Linear time/space rename logic for *inexact* case, Andy C, (Mon Oct 22, 6:09 am)