This patch teaches diffcore_rename() to look into
$GIT_DIR/rename-cache and make use of it to recreate diff_filepair.
With proper cache, there should be no available entry for estimation
after exact matching.

Rename caching is per commit. I don't think arbitrary tree-tree caching
is worth it.

$GIT_DIR/rename-cache is laid out like $GIT_DIR/objects. Each file
corresponds to one commit. Its content consists of lines like this:

<Destination SHA-1> <SPC> <Source SHA-1> <SPC> <Score in decimal> <NL>

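To make the line format above concrete, here is a minimal sketch of a parser for one cache line. It is not the code from the patch: the struct name `rename_cache_entry`, the function `parse_rename_cache_line()`, and the assumption that SHA-1s are written as 40 lowercase hex digits are all illustrative guesses.

```c
#include <stdio.h>
#include <string.h>

#define SHA1_HEX_LEN 40	/* assumed: SHA-1s are stored as 40 hex digits */

/* Hypothetical in-memory form of one cache line; not the patch's struct. */
struct rename_cache_entry {
	char dst[SHA1_HEX_LEN + 1];	/* destination blob, hex */
	char src[SHA1_HEX_LEN + 1];	/* source blob, hex */
	unsigned score;			/* similarity score, decimal */
};

static int is_hex_sha1(const char *s)
{
	return strlen(s) == SHA1_HEX_LEN &&
	       strspn(s, "0123456789abcdef") == SHA1_HEX_LEN;
}

/*
 * Parse "<dst> SP <src> SP <score> NL".
 * Returns 0 on success, -1 if the line does not match the format.
 */
static int parse_rename_cache_line(const char *line,
				   struct rename_cache_entry *e)
{
	int consumed = 0;

	if (sscanf(line, "%40s %40s %u%n",
		   e->dst, e->src, &e->score, &consumed) != 3)
		return -1;
	/* Nothing but the line terminator may follow the score. */
	if (line[consumed] != '\n' && line[consumed] != '\0')
		return -1;
	if (!is_hex_sha1(e->dst) || !is_hex_sha1(e->src))
		return -1;
	return 0;
}
```

A reader would call this once per line of the per-commit cache file and skip (or abort on) lines that return -1. The sketch is lenient about the amount of whitespace between fields, which a real reader might want to tighten.
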
This can be used to:
- Make --find-copies-harder practically usable for moderate-size
repositories. The first "git show" on a linux kernel commit took 5.3
sec; it then went down to 0.13 sec.
- Give git-svn a chance to (locally) import explicit renames from
Subversion
- People may correct rename results for better diff, if automatic
rename detection is not good enough.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
diff.h | 2 +
diffcore-rename.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++-
log-tree.c | 2 +
t/t4030-rename-cache.sh | 55 ++++++++++++++++++
4 files changed, 199 insertions(+), 2 deletions(-)
create mode 100755 t/t4030-rename-cache.sh

diff --git a/diff.h b/diff.h
index a49d865..8b68f6f 100644
--- a/diff.h
+++ b/diff.h
@@ -110,6 +110,8 @@ struct diff_options {
add_remove_fn_t add_remove;
diff_format_fn_t format_callback;
void *format_callback_data;
+
+ struct commit *commit;
};

enum color_diff {
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 168a95b..598cc8d 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -5,6 +5,7 @@
#include "diff.h"
#include "diffcore.h"
#include "hash.h"
+#include "commit.h"

/* Table of rename/copy destinations */
@@ -409,13 +410,130 @@ static void record_if_better(struct diff_score m[], struct diff_score *o)
m[worst] = *o;
}

+struct cached_filepair {
+ un...
This is something I have thought about in the past, good to see that
implemented :)

That could be a nice complement to my directory-rename patch.
--
Has anybody thought about interaction between that caching and pathspec
limited operation?
--
I didn't. But I think all out-of-pathspec diff pairs are removed
before it reaches diffcore_rename() so the cache has nothing to do
with it (except it still loads full cache for a commit).
--
Duy
--
Well, it could be that an out-of-pathspec pair would have a better
score than an in-pathspec one. Maybe cache recording should be turned
off when doing pathspec limitation ?
--
One thing I notice is that the cache works at the level of "here is the
best rename for this commit." Maybe it could go down a level and say
"here is the inexact rename score between these blobs". Then you would
still find the best score between two blobs each time, but save the
really computationally intensive part (which is comparing the actual
_content_ of the blobs).

That should work in the face of path limiting or any other option,
because it is caching something immutable: this is the similarity score
between two pieces of content. And then you get arbitrary tree-to-tree
speedups for free, since such a cache would be valid for every commit.

The downsides are:

- your cache is potentially bigger, since you are caching the score of
every pair you look at, instead of just "good" pairs (OTOH, you are
not doing a per-commit cache, which helps reduce the size)

- you can still "lie" about a score to pre-seed imported SVN renames,
but such lying will actually apply to all commits.

-Peff
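
The blob-pair idea above can be sketched as a small table keyed by (source SHA-1, destination SHA-1). This is only an illustration under assumptions: the names `score_entry`, `score_cache_get()` and `score_cache_put()` are hypothetical, the table is a fixed-size in-memory array rather than anything persisted under $GIT_DIR, and the key is kept ordered because a source-to-destination score is not assumed to be symmetric.

```c
#include <string.h>

#define SHA1_RAWSZ 20
#define TABLE_SIZE 1024	/* power of two; arbitrary for this sketch */

/* Hypothetical cache entry: one similarity score per ordered blob pair. */
struct score_entry {
	unsigned char src[SHA1_RAWSZ];
	unsigned char dst[SHA1_RAWSZ];
	unsigned short score;
	int used;
};

static struct score_entry table[TABLE_SIZE];

static unsigned slot_of(const unsigned char *src, const unsigned char *dst)
{
	unsigned h, g;

	/* SHA-1 bytes are already well mixed; fold the leading bytes. */
	memcpy(&h, src, sizeof(h));
	memcpy(&g, dst, sizeof(g));
	return (h ^ (g * 31u)) & (TABLE_SIZE - 1);
}

/* Linear probing: return the matching or first free slot, NULL if full. */
static struct score_entry *find_slot(const unsigned char *src,
				     const unsigned char *dst)
{
	unsigned i, pos = slot_of(src, dst);

	for (i = 0; i < TABLE_SIZE; i++) {
		struct score_entry *e = &table[(pos + i) & (TABLE_SIZE - 1)];
		if (!e->used ||
		    (!memcmp(e->src, src, SHA1_RAWSZ) &&
		     !memcmp(e->dst, dst, SHA1_RAWSZ)))
			return e;
	}
	return NULL;
}

/* Returns 1 and fills *score on a hit, 0 on a miss. */
static int score_cache_get(const unsigned char *src,
			   const unsigned char *dst, unsigned short *score)
{
	struct score_entry *e = find_slot(src, dst);

	if (!e || !e->used)
		return 0;
	*score = e->score;
	return 1;
}

/* Returns 0 on success, -1 if the table is full. */
static int score_cache_put(const unsigned char *src,
			   const unsigned char *dst, unsigned short score)
{
	struct score_entry *e = find_slot(src, dst);

	if (!e)
		return -1;
	memcpy(e->src, src, SHA1_RAWSZ);
	memcpy(e->dst, dst, SHA1_RAWSZ);
	e->score = score;
	e->used = 1;
	return 0;
}
```

With such a table, the rename loop would consult `score_cache_get()` before comparing blob contents and call `score_cache_put()` after computing a fresh score. The best-match search still runs every time, so the result stays correct under any pathspec.
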
--
I did that and realized the cost was not from each diff, in
--find-copies-harder case, but from the number of diffs you had to do.
Even with exact matching on linux-2.6.git, it could take significant
time (it was about 5 minutes in no-cache case, 1 minute without exact

It is huge if you accidentally add --find-copies-harder to your
command, considering that every new file will be compared against
--
Duy
--
Hmm, yeah. I was thinking you might be able to do some kind of cut-off
on the caching (i.e., don't bother storing anything that didn't come
close). But you can't safely assume that because an entry isn't there,
it isn't worth seeing (since it might also just not have been computed
yet). You could still organize by commit, and then each commit is either
fully computed or not. But then you still have a pathspec problem.

One thing you could do is just compute the rename score between all
pairs, even if a pathspec is given, limit it to values over "0.5" (or
something low, but that eliminates the totally uninteresting cases), and
then store that as the complete cache for that commit (or tree pair, if
you want to support that).

Then you would have the full information and could do an arbitrary
pathspec limit on it. If you wanted to set the rename threshold below
0.5, then we would have to recompute without the cache (but in practice,
that should be rare).

The real downside is that you pay for the whole-tree detection when you
have asked for a pathspec (but only the first time, after which you can
always generate from cache).

Just thinking out loud...
-Peff
--
Right, recording should be turned off or something. Let me see..
--
Duy
--
If diff.cacherenames is true, then renames will be cached to
$GIT_DIR/rename-cache. By default, it will not overwrite existing
cache. Add --refresh-cache to overwrite.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
If git-svn is going to use this, then perhaps we should add a rule to prevent
overwriting certain cache files with .keep files, so that git-svn generated cache
does not get lost.

Documentation/config.txt | 5 ++++
Documentation/diff-options.txt | 5 ++++
diff.c | 12 +++++++++++
diff.h | 2 +
diffcore-rename.c | 41 ++++++++++++++++++++++++++++++++++++++++
t/t4030-rename-cache.sh | 27 ++++++++++++++++++++++++++
6 files changed, 92 insertions(+), 0 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 29369d0..81160d3 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -630,6 +630,11 @@ diff.renames::
will enable basic rename detection. If set to "copies" or
"copy", it will detect copies, as well.

+diff.cacherenames::
+ Tells git to automatically cache renames when detected. The
+ cache resides in $GIT_DIR/rename-cache, which is used by git
+ if exists.
+
fetch.unpackLimit::
If the number of objects fetched over the git native
transfer is below this
diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index c62b45c..d477a40 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -102,6 +102,11 @@ endif::git-format-patch[]
Turn off rename detection, even when the configuration
file gives the default to do so.

+--refresh-rename-cache::
+ By default, when git finds a cached version of a commit, it
+ will not overwrite the cache. This option makes git overwrite
+ old cache or create a new one.
+
--check::
Warn if changes introduce trailing whitespace
or an indent that uses a space before a tab. Exits with
diff --git ...
