On Mon, Oct 22, 2007 at 02:40:08AM -0700, Andy C wrote:Great, welcome aboard. :) OK, makes sense (that's what I was trying to propose near the end of my mail). Ah, very clever. I think that should have nice behavior most of the time, though I wonder about a pathological case where we have many files, all very similar to each other, and then have a commit where they all start to diverge, but just by a small amount, while getting moved. We would end up with an artificially low score between renamed files (because we've thrown out all of the commonality) which may lead us to believe there was no rename at all. But it might be worth ignoring that case. Makes sense. Yes, I still have the problem (the 2 minutes was _after_ we did fixes, down from 20 minutes; the latest patch from Linus just covers the "exact match" case where two files have identical SHA1s). It's a 1.2G repo at the end of a slow DSL line, so rather than cloning it, here's a way to reproduce a repo with similar properties: -- >8 -- #!/bin/sh rm -rf repo mkdir repo && cd repo # seq and openssl aren't portable, but the # point is to generate 200 random 1M files for i in `seq -f %03g 1 600`; do openssl rand 100000 >$i.rand done # make repo, fully packed # we don't bother trying to delta in the pack # since the files are all random git-init git-add . git-commit -q -m one git-repack -a -d --window=0 # move every file mkdir new git-mv *.rand new # modify every file for i in new/*.rand; do echo foo >>$i done git-add new # this is the operation of interest time git-diff --cached --raw -M -l0 >/dev/null -- >8 -- The idea is to have a large number of files that are slightly changed and moved, and to try to find the pairs. The diff takes about 20 seconds to run for me (the real repo has 1M files rather than 100K files, but it's nice to have the tests take a lot less time). If you want a bigger test, bump up the file size (or increase the number of files, which will emphasize the quadratic behavior). We have to handle binary files, too. In the current implementation, we consider either lines or "chunks", and similarity is increased by the size of the chunk. I think we are already treating the files as unordered sets of lines. And really, I think there is some value in that, anyway. If I reverse the order of all lines in a file, it might be useful for git to say "this file came from that file". Great. git has a "look for copies also" flag, but it is usually disabled because of the computational cost. If we can get it low enough, it might actually become a lot more useful. I'll try it on my test data, but it sounds like it doesn't really handle binary files. Trying to fit it into the C git code would be useful, but I doubt I'll have time to work on it tonight, since it's getting onto dawn here. -Peff - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
| Jeremy Fitzhardinge | Re: [RFC 00/15] x86_64: Optimize percpu accesses |
| Vladislav Bolkhovitin | Re: Integration of SCST in the mainstream Linux kernel |
| Mike Galbraith | Re: regression: CD burning (k3b) went broke |
git: | |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Linus Torvalds | Re: [GIT]: Networking |
| Michael Grollman | Re: 8169 Intermittent ifup Failure Issue With RTL8102E Chipset in Intel's New D945... |
