I have something that is tentatively called pickaxe, but that is different from "diff -S<pickaxe>". The name will need to change, perhaps to git-blame. It's quite a while since the original algorithm sketch that led to the current git-blame was posted here; I suspect that most of the people recently joined the list haven't seen it. The idea in there still applies, though... Here is a re-drawn sketch of how the 'pickaxe' edition of the algorithm works. 0. Outline. We track range of lines from a path in one revision. That's what "-L n,m", "path", and "commit" parameters to the command are about. When the program starts, the commit given on the command line is suspected to be responsible for all lines in the given range. But being a suspect and determined to be the guilty party are different things. The name of the game is "not taking responsibility but passing the blame to your parents". In other words, "that's not a new problem I introduced -- it was there before my patch was applied". You are exonerated for those lines that you can successfully pass blame to your parent, making that parent a new suspect for the lines. There will remain lines you cannot pass blame to anybody else -- you will be determined to be the guilty party for them. When nobody can pass any more blame to anybody else, the dance stops and we will know who is guilty for every line of the input. 1. Data structure. We keep track of who the current suspect for each line is in the structure called 'scoreboard'. It is a collection of ranges, called 'blame_entry'. Each blame_entry describes which lines (in the final image) it is about, who the current suspect for the range is, if the suspect is already known to be guilty, where in the suspected commit the lines came from (path and line number). In the 'blame' implementation, a suspect is represented by a commit object with one path hanging to its util field. 'pickaxe' edition introduces the concept of 'origin', which is a pair of ...
In an earlier message, I talked about pickaxe data structure
that has potential for blaming more than one paths in a single
revision.
I was talking about the future enhancements, but it turns out
that there is a corner case that the feature is already needed
using the classic blame algorithm, and running blame shows its
limitation through.
The scenario is like this:
- initial has two files A and B
- branch left renames A to C and adds stuff at the top.
- branch right renames B to C and adds stuff at the bottom.
- left and right is merged and leaves one file, C, by taking
what both branches did.
- annotate C from the top.
The parts that came from A and B in the initial revision should
be assigned to these files in the initial commit. If you look
at the blame output, you will see it fails to do so. pickaxe
gives what is expected.
-- >8 --
#!/bin/sh
if test -d .git
then
echo 'please run this in a fresh empty directory'
exit
fi
git init-db
for i in 1 2 3 4 5 6 7 8 9 ; do echo line line line $i ; done >A
for i in A B C D E F G H I ; do echo line line line $i ; done >B
git add A B
git commit --author='Initial <initial@author>' -m initial
git branch right
git branch left
# Left
git checkout left
for i in 1 2 3; do echo added by left; done >C
cat A >>C
rm -f A B
git update-index --add --remove A B C
git commit --author='Left <left@branch>' -m Left
# Right
git checkout right
cat B >C
for i in 1 2 3; do echo added by right; done >>C
rm -f A B
git update-index --add --remove A B C
git commit --author='Right <right@branch>' -m Right
# Merge them; this should fail which is expected
git pull . left
{
git cat-file blob :3:C
git cat-file blob :2:C
} >C
git update-index C
git commit --author='Merge <merge@branch>' -m 'Changes are merged.'
rm -f C~*
echo blame
git blame -f -n -t C
echo pickaxe
git pickaxe -f -n -t C
-
I've spent a few hours tonight to further work (eh, "have fun") on this. The version at the tip of "pu" implements detection of You can use the pickaxe on its source itself, like this: git pickaxe -n master..pu builtin-pickaxe.c If you compare this with output from: git log --pretty=short -p master..pu builtin-pickaxe.c you will notice the line-movement detection in action. During the course of development, I had to move quite a few static functions around so that they are defined before their first call site. This is partly because I am very bad at initial planning (who is?) and this still being in experimental stage I did not bother declaring static functions upfront as forward declarations. For example, commit db3f0f2 introduces find_last_in_target() function, but it was moved up by commit b5c0e4f (near the tip of "pu"). pickaxe blames the implementation of it to db3f0f20, and also notices that the bulk of its logic was actually copied from the implementation of pass_blame_to_parent() function in commit b14dc9ef (the initial commit that introduced builtin-pickaxe.c). What _ought_ to come next is to detect line movement across files, but I'll go to bed for now. -
And this implements it. After testing it some more, I'll have
it near the tip of "pu" sometime tonight.
I have to admit that I am becoming quite fond of the name
"pickaxe", by the way.
I consider that the "following the changed lines and see where
it came from" you described in the ancient message:
http://thread.gmane.org/gmane.comp.version-control.git/27/focus=217
the holy grail of software archaeology. Once we have it, I
would feel I am pretty much _done_ with git. It will have
everything I want it to do.
My "diff -S" (where the "pickaxe" name started) was a crude
first step to achieve that goal. The way I envisioned it was
that you would feed the lines 50-89 as a single -S parameter to
"git whatchanged" and the tool finds the commit that touched
that area, and the Porcelain would either figure out what the
corresponding lines in the previous commit look like or ask the
user to highlight what to dig deeper interactively, and restart
another "whatchanged" with adjusted value for -S. Unfortunately
nobody wrote such a Porcelain command.
I think the command with this patch gets us closer to that goal;
although it still does not have a way to detect "oh these five
copies were consolidated into one" (I'd imagine we _could_ do
that in find_copy_in_parent() function if we really wanted to --
but that would be a tad expensive). In any case, I think this
is going in the right direction.
Before anybody asks, the example Daniel Barkalow gave on date.c
is far trickier and even this version would not catch it. It
requires digging random, distant, ancestors in order to look for
lines that were lost by an earlier removal.
-- >8 --
git-pickaxe: blame cut-and-paste lines.
This completes the initial round of git-pickaxe. In addition to
the line movement the previous round implements, this finds new
lines that were created by moving or cutting-and-pasting lines
from different path in the parent.
With this,
git pickaxe -f -n v1.4.0 -- revision.c
finds ...Thanks for the nice writeup!
Dear diary, on Fri, Oct 13, 2006 at 03:43:46AM CEST, I got a letter
(This got me nervous but I guess it actually makes sense - if only one
parent modified a line, it's right to pass the blame up; else if you
took one parent's version verbatim, it's right to pass the blame up as
well (I think); else you've already got the blame assigned to the merge
Now, this is very nifty and so for moving functions around, but I
think it is very dangerous for anything where ordering matters - large
arrays definitions, patch series files, etc. In that case, you've
completely ommitted the fact that the movement occured, which can be
crucial and based on the current behaviour of the tools, I think people
expect this now. To put it shortly, "who wrote this line" vs "who put
this line here".
I would prefer to stay more conservative here by default. Graphical
tools could solve this by some nice presentation like
3fea1 int moved() -.
3fea1 { |
28abc watch(more + anime); | 17d50
17d50 return !hypno; |
3fea1 } -'
Dear diary, on Sun, Oct 15, 2006 at 01:43:53AM CEST, I got a letter
Gaah, please don't. ;-)
--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
#!/bin/perl -sp0777i<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<j]dsj
$/=unpack('H*',$_);$_=`echo 16dio\U$k"SK$/SM$n\EsN0p[lN*1
lK[d2%Sa2/d0$^Ixp"|dc`;s/\W//g;$_=pack('H*',/((..)*)$/)
-
Well, this part is the classic blame algorithm. The beauty of
it is that a merge case falls out as a natural consequence of
the one-parent case and you do not have to do anything special.
You either inherited a line from your parent (or one or more of
your parents) or you created the line yourself. If you subtract
what you can blame your parents for, the remainder is what you
introduced. The number of parents you have does not have any
That's exactly why we have -f and -n options, so that the
program that reads from blame output can tell where things came
from. It is not about "who wrote it" vs "who put it here";
pickaxe gives a lot more than that: "where did this originally
come from", i.e. "by whom in which file at what line number was
the line created".
If the user is not prepared to see code movement, pickaxe can be
run without -M nor -C to get the classic blame output.
By the way, We would not want to do the code movement (or
copying from unrelated file) for very trivial lines. E.g. you
would not want to blame the following three lines:
+ }
+}
+#endif
to a random file that happens to have the exact copy of the
above that is not related at all. Something like the above can
happen almost anywhere. The current implementation of -M/-C
does not do this very well. find_copy_in_blob() currently
passes blame to the parent when it finds nontrivial copies, but
instead it should inspect the patch and return a score, and the
caller should take the parent with the best match and assign
blame to it.
-
Hi, this blame-passing thing really looks very promising and powerful. Another blame-passing heuristic would be very interesting for code: "Ignore white-space changes". This way, commits which only do some reindentations simply are skipped. It looks like such a thing would just be a matter of passing "-b" to executions of "diff" in the blame-passing algorithm. Josef -
I am thinking that that is probabally going to need to be optional, for example python the indentation is everything to the meaning of the code. -apw -
Dear diary, on Mon, Oct 16, 2006 at 04:15:01PM CEST, I got a letter Ok, so in this case -M and -C does not mean just looking for copies/movements in other files but inside the same file as well. Perhaps we might want to differentiate those two cases since searching (OTOH, just today I was retrieving some code from deep inside a script to a common function, which of course caused massive indentation shift. So it is very desirable in order to catch these. But more we get involved in this, the more we will probably want to know about the syntax of the content we are digging in.) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ #!/bin/perl -sp0777i<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<j]dsj $/=unpack('H*',$_);$_=`echo 16dio\U$k"SK$/SM$n\EsN0p[lN*1 lK[d2%Sa2/d0$^Ixp"|dc`;s/\W//g;$_=pack('H*',/((..)*)$/) -
I do not personally worry about people confusing -M/-C to pickaxe with -M/-C given to diff (to pickaxe, use of diff is purely an internal implementation issue), and reused -M and -C to mean quite different things. -M is to detect line movement inside a file (it is not strictly limited to "line movement", though. It _is_ about "copies and moves within the same file"). On the other hand, -C (and its stronger form -C -C) is to detect copies and moves across file boundaries (but wholesale "file rename" comes as part of the basic algorithm so you do not have to ask for it with -M nor -C). So in that sense pronouncing M "move" and C "copy" is not accurate. Their differences is already what you said "we might want". However they match the cost expectation people are used to in diff options pretty well. -M is not so expensive and -C is somewhat. -C -C is like find-copies-harder and is quite expensive. Also currently the code does not do "move" detection at all. Contrary to intuition, move detection is more expensive than copy detection in this case. -
Junio,
Excellent write up. Comments inline:
Do you handle the case where a merge had a conflict and
the user changed the code (resolved) and then committed?
In this case some lines will have to be blamed on the
How about move-and-edit scenario? Wouldn't that make the algorithm
somewhat complicated if we wanted to properly describe (blame)
Here I see a lot of _arbitration_ in the form of "choose a good
value for N and a good value for M".
Is it possible that we do away with such "user" arbitration,
and instead find an algorithm that solves every case...? Even
if we have to go back to a simpler "blame".
I'm concerned that with "user" arbitration, there would always
be at least one corner case for which the "setting-on-the-knobs"
Indeed.
Luben
-
Working on a small example by hand is a good way to convince
yourself. The whole point of "try to pass the blame, and take
the blame yourself only when you can't pass to anybody" is
precisely to handle the merges sanely. The answer to your later
question also would become crystal clear with that exercise.
Suppose that we are looking at a merge that would give this in
its "git show" output:
diff --cc hello.c
index 3c27792,db3fdef..cec80d2
--- a/hello.c
+++ b/hello.c
@@@ -1,4 -1,6 +1,6 @@@
int main(int ac, char **av)
{
- printf("hello, world.\n");
- const char *msg = "hello, world";
++ const char *msg = "hello, world.";
+
+ printf("%s\n", msg);
}
First, we inspect the diff from the first parent:
diff --git a/hello.c b/hello.c
index 3c27792..cec80d2 100644
--- a/hello.c
+++ b/hello.c
@@ -1,4 +1,6 @@
1 int main(int ac, char **av)
2 {
- printf("hello, world.\n");
3 + const char *msg = "hello, world.";
4 +
5 + printf("%s\n", msg);
6 }
That would find that lines 1, 2 and 6 came from the first parent
(line numbers are of the postimage; e.g. line 6 is the closing
brace).
We are still left with lines 3, 4, and 5. So we will see the
difference from the second parent:
diff --git a/hello.c b/hello.c
index db3fdef..cec80d2 100644
--- a/hello.c
+++ b/hello.c
@@ -1,6 +1,6 @@
1 int main(int ac, char **av)
2 {
- const char *msg = "hello, world";
3 + const char *msg = "hello, world.";
4
5 printf("%s\n", msg);
6 }
It shows that lines 1, 2, 4, 5 and 6 are the same as the second
parent (again, line numbers are of the postimage). This means
that we _could_ attribute line 1, 2 and 6 to the second parent
if we wanted to, but we have already passed blame for 1, 2 and 6
to the first parent [*1*] and only lines 4 and 5 are ...