Re: What's cooking in git.git (topics)

Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
From: Jeff King
Date: Tuesday, October 2, 2007 - 9:11 am

On Tue, Oct 02, 2007 at 08:10:28AM +0200, David Kastrup wrote:


The algorithm is something like this:  We have N files, and we want to
find "similar" candidates. So we go through each file and generate a
table of fingperint hashes (diffcore-rename.c:hash_chars), and then
compare each file with every other file, using the hash tables to do the
comparison.

So the comparison step for two files is currently something like:

  for each hash in file1
    hash2 = look up hash in file2
    compare hash and hash2

and if they were sorted, perhaps we could do something merge-like:

  while hashes are left to compare
      compare file1.next, file2.next
      advance file1, file2, or both (depending on comparison)


It would be sort once. I.e.,:

  for each file
     generate file.hashes
     sort file.hashes
  for each file1
    for each file2
      compare file1.hashes to file2.hashes

where that 'compare' step is taking most of the CPU time (for the
obvious reason that we call it in an O(n^2) loop).

I will try to implement this as time permits, but if you want to tinker
with it in the meantime, feel free.

-Peff
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: What's cooking in git.git (topics), David Kastrup, (Mon Oct 1, 11:10 pm)
Re: What's cooking in git.git (topics), Jeff King, (Tue Oct 2, 9:11 am)