Re: On blame/pickaxe

Previous thread: Re: [PATCH] git-pickaxe: blame rewritten. by Junio C Hamano on Thursday, October 12, 2006 - 5:06 pm. (3 messages)

Next thread: none
From: Junio C Hamano
Date: Thursday, October 12, 2006 - 6:43 pm

I have something that is tentatively called pickaxe, but that is
different from "diff -S<pickaxe>".  The name will need to
change, perhaps to git-blame.

It's quite a while since the original algorithm sketch that led
to the current git-blame was posted here; I suspect that most of
the people recently joined the list haven't seen it.  The idea
in there still applies, though...

Here is a re-drawn sketch of how the 'pickaxe' edition of the
algorithm works.

0. Outline.

We track range of lines from a path in one revision.  That's
what "-L n,m", "path", and "commit" parameters to the command
are about.

When the program starts, the commit given on the command line is
suspected to be responsible for all lines in the given range.
But being a suspect and determined to be the guilty party are
different things.

The name of the game is "not taking responsibility but passing
the blame to your parents".  In other words, "that's not a new
problem I introduced -- it was there before my patch was
applied".  You are exonerated for those lines that you can
successfully pass blame to your parent, making that parent a new
suspect for the lines.  There will remain lines you cannot pass
blame to anybody else -- you will be determined to be the guilty
party for them.  When nobody can pass any more blame to anybody
else, the dance stops and we will know who is guilty for every
line of the input.


1. Data structure.

We keep track of who the current suspect for each line is in the
structure called 'scoreboard'.  It is a collection of ranges,
called 'blame_entry'.

Each blame_entry describes which lines (in the final image) it
is about, who the current suspect for the range is, if the
suspect is already known to be guilty, where in the suspected
commit the lines came from (path and line number).

In the 'blame' implementation, a suspect is represented by a
commit object with one path hanging to its util field.
'pickaxe' edition introduces the concept of 'origin', which is a
pair of ...
From: Junio C Hamano
Date: Friday, October 13, 2006 - 12:54 am

In an earlier message, I talked about pickaxe data structure
that has potential for blaming more than one paths in a single
revision.

I was talking about the future enhancements, but it turns out
that there is a corner case that the feature is already needed
using the classic blame algorithm, and running blame shows its
limitation through.

The scenario is like this:

 - initial has two files A and B
 - branch left renames A to C and adds stuff at the top.
 - branch right renames B to C and adds stuff at the bottom.
 - left and right is merged and leaves one file, C, by taking
   what both branches did.
 - annotate C from the top.

The parts that came from A and B in the initial revision should
be assigned to these files in the initial commit.  If you look
at the blame output, you will see it fails to do so.  pickaxe
gives what is expected.

-- >8 --
#!/bin/sh

if test -d .git
then
        echo 'please run this in a fresh empty directory'
	exit
fi

git init-db

for i in 1 2 3 4 5 6 7 8 9 ; do echo line line line $i ; done >A
for i in A B C D E F G H I ; do echo line line line $i ; done >B

git add A B
git commit --author='Initial <initial@author>' -m initial

git branch right
git branch left

# Left
git checkout left
for i in 1 2 3; do echo added by left; done >C
cat A >>C
rm -f A B
git update-index --add --remove A B C
git commit --author='Left <left@branch>' -m Left

# Right
git checkout right
cat B >C
for i in 1 2 3; do echo added by right; done >>C
rm -f A B
git update-index --add --remove A B C
git commit --author='Right <right@branch>' -m Right

# Merge them; this should fail which is expected
git pull . left
{
	git cat-file blob :3:C
	git cat-file blob :2:C
} >C
git update-index C
git commit --author='Merge <merge@branch>' -m 'Changes are merged.'
rm -f C~*

echo blame
git blame -f -n -t C

echo pickaxe
git pickaxe -f -n -t C

-

From: Junio C Hamano
Date: Saturday, October 14, 2006 - 3:26 am

I've spent a few hours tonight to further work (eh, "have fun")
on this.  The version at the tip of "pu" implements detection of

You can use the pickaxe on its source itself, like this:

	git pickaxe -n master..pu builtin-pickaxe.c

If you compare this with output from:

	git log --pretty=short -p master..pu builtin-pickaxe.c

you will notice the line-movement detection in action.

During the course of development, I had to move quite a few
static functions around so that they are defined before their
first call site.  This is partly because I am very bad at
initial planning (who is?) and this still being in experimental
stage I did not bother declaring static functions upfront as
forward declarations.

For example, commit db3f0f2 introduces find_last_in_target()
function, but it was moved up by commit b5c0e4f (near the tip of
"pu").  pickaxe blames the implementation of it to db3f0f20, and
also notices that the bulk of its logic was actually copied from
the implementation of pass_blame_to_parent() function in commit
b14dc9ef (the initial commit that introduced builtin-pickaxe.c).

What _ought_ to come next is to detect line movement across
files, but I'll go to bed for now.

-

From: Junio C Hamano
Date: Saturday, October 14, 2006 - 4:43 pm

And this implements it.  After testing it some more, I'll have
it near the tip of "pu" sometime tonight.

I have to admit that I am becoming quite fond of the name
"pickaxe", by the way.

I consider that the "following the changed lines and see where
it came from" you described in the ancient message:

    http://thread.gmane.org/gmane.comp.version-control.git/27/focus=217

the holy grail of software archaeology.  Once we have it, I
would feel I am pretty much _done_ with git.  It will have
everything I want it to do.

My "diff -S" (where the "pickaxe" name started) was a crude
first step to achieve that goal.  The way I envisioned it was
that you would feed the lines 50-89 as a single -S parameter to
"git whatchanged" and the tool finds the commit that touched
that area, and the Porcelain would either figure out what the
corresponding lines in the previous commit look like or ask the
user to highlight what to dig deeper interactively, and restart
another "whatchanged" with adjusted value for -S.  Unfortunately
nobody wrote such a Porcelain command.

I think the command with this patch gets us closer to that goal;
although it still does not have a way to detect "oh these five
copies were consolidated into one" (I'd imagine we _could_ do
that in find_copy_in_parent() function if we really wanted to --
but that would be a tad expensive).  In any case, I think this
is going in the right direction.

Before anybody asks, the example Daniel Barkalow gave on date.c
is far trickier and even this version would not catch it.  It
requires digging random, distant, ancestors in order to look for
lines that were lost by an earlier removal.

-- >8 --
git-pickaxe: blame cut-and-paste lines.

This completes the initial round of git-pickaxe.  In addition to
the line movement the previous round implements, this finds new
lines that were created by moving or cutting-and-pasting lines
from different path in the parent.

With this,

	git pickaxe -f -n v1.4.0 -- revision.c

finds ...
From: Petr Baudis
Date: Sunday, October 15, 2006 - 7:21 pm

Thanks for the nice writeup!

Dear diary, on Fri, Oct 13, 2006 at 03:43:46AM CEST, I got a letter

  (This got me nervous but I guess it actually makes sense - if only one
parent modified a line, it's right to pass the blame up; else if you
took one parent's version verbatim, it's right to pass the blame up as
well (I think); else you've already got the blame assigned to the merge

  Now, this is very nifty and so for moving functions around, but I
think it is very dangerous for anything where ordering matters - large
arrays definitions, patch series files, etc. In that case, you've
completely ommitted the fact that the movement occured, which can be
crucial and based on the current behaviour of the tools, I think people
expect this now. To put it shortly, "who wrote this line" vs "who put
this line here".

  I would prefer to stay more conservative here by default. Graphical
tools could solve this by some nice presentation like

	3fea1 int moved()                         -.
	3fea1 {                                    |
	28abc 	watch(more + anime);               | 17d50
	17d50 	return !hypno;                     |
	3fea1 }                                   -'
	

Dear diary, on Sun, Oct 15, 2006 at 01:43:53AM CEST, I got a letter

  Gaah, please don't. ;-)

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
#!/bin/perl -sp0777i<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<j]dsj
$/=unpack('H*',$_);$_=`echo 16dio\U$k"SK$/SM$n\EsN0p[lN*1
lK[d2%Sa2/d0$^Ixp"|dc`;s/\W//g;$_=pack('H*',/((..)*)$/)
-

From: Junio C Hamano
Date: Sunday, October 15, 2006 - 11:43 pm

Well, this part is the classic blame algorithm.  The beauty of
it is that a merge case falls out as a natural consequence of
the one-parent case and you do not have to do anything special.
You either inherited a line from your parent (or one or more of
your parents) or you created the line yourself.  If you subtract
what you can blame your parents for, the remainder is what you
introduced.  The number of parents you have does not have any

That's exactly why we have -f and -n options, so that the
program that reads from blame output can tell where things came
from.  It is not about "who wrote it" vs "who put it here";
pickaxe gives a lot more than that: "where did this originally
come from", i.e. "by whom in which file at what line number was
the line created".

If the user is not prepared to see code movement, pickaxe can be
run without -M nor -C to get the classic blame output.

By the way, We would not want to do the code movement (or
copying from unrelated file) for very trivial lines.  E.g. you
would not want to blame the following three lines:

        +	}
        +}
        +#endif

to a random file that happens to have the exact copy of the
above that is not related at all.  Something like the above can
happen almost anywhere.  The current implementation of -M/-C
does not do this very well.  find_copy_in_blob() currently
passes blame to the parent when it finds nontrivial copies, but
instead it should inspect the patch and return a score, and the
caller should take the parent with the best match and assign
blame to it.



-

From: Josef Weidendorfer
Date: Monday, October 16, 2006 - 7:02 am

Hi,

this blame-passing thing really looks very promising and powerful.


Another blame-passing heuristic would be very interesting for code:
"Ignore white-space changes".
This way, commits which only do some reindentations simply are skipped.

It looks like such a thing would just be a matter of passing "-b" to
executions of "diff" in the blame-passing algorithm.

Josef
-

From: Andy Whitcroft
Date: Monday, October 16, 2006 - 7:15 am

I am thinking that that is probabally going to need to be optional, for
example python the indentation is everything to the meaning of the code.

-apw
-

From: Petr Baudis
Date: Monday, October 16, 2006 - 5:44 pm

Dear diary, on Mon, Oct 16, 2006 at 04:15:01PM CEST, I got a letter

Ok, so in this case -M and -C does not mean just looking for
copies/movements in other files but inside the same file as well.

Perhaps we might want to differentiate those two cases since searching

(OTOH, just today I was retrieving some code from deep inside a script
to a common function, which of course caused massive indentation shift.
So it is very desirable in order to catch these. But more we get
involved in this, the more we will probably want to know about the
syntax of the content we are digging in.)

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
#!/bin/perl -sp0777i<X+d*lMLa^*lN%0]dsXx++lMlN/dsM0<j]dsj
$/=unpack('H*',$_);$_=`echo 16dio\U$k"SK$/SM$n\EsN0p[lN*1
lK[d2%Sa2/d0$^Ixp"|dc`;s/\W//g;$_=pack('H*',/((..)*)$/)
-

From: Junio C Hamano
Date: Friday, October 20, 2006 - 8:20 pm

I do not personally worry about people confusing -M/-C to
pickaxe with -M/-C given to diff (to pickaxe, use of diff is
purely an internal implementation issue), and reused -M and -C
to mean quite different things.  -M is to detect line movement
inside a file (it is not strictly limited to "line movement",
though. It _is_ about "copies and moves within the same file").
On the other hand, -C (and its stronger form -C -C) is to detect
copies and moves across file boundaries (but wholesale "file
rename" comes as part of the basic algorithm so you do not have
to ask for it with -M nor -C).  So in that sense pronouncing M
"move" and C "copy" is not accurate.  Their differences is
already what you said "we might want".

However they match the cost expectation people are used to in
diff options pretty well.  -M is not so expensive and -C is
somewhat.  -C -C is like find-copies-harder and is quite
expensive.

Also currently the code does not do "move" detection at all.
Contrary to intuition, move detection is more expensive than
copy detection in this case.

-

From: Luben Tuikov
Date: Monday, October 16, 2006 - 4:45 pm

Junio,

Excellent write up.  Comments inline:



Do you handle the case where a merge had a conflict and
the user changed the code (resolved) and then committed?
In this case some lines will have to be blamed on the

How about move-and-edit scenario?  Wouldn't that make the algorithm
somewhat complicated if we wanted to properly describe (blame)

Here I see a lot of _arbitration_ in the form of "choose a good
value for N and a good value for M".

Is it possible that we do away with such "user" arbitration,
and instead find an algorithm that solves every case...? Even
if we have to go back to a simpler "blame".

I'm concerned that with "user" arbitration, there would always
be at least one corner case for which the "setting-on-the-knobs"

Indeed.

    Luben

-

From: Junio C Hamano
Date: Tuesday, October 17, 2006 - 2:03 am

Working on a small example by hand is a good way to convince
yourself.  The whole point of "try to pass the blame, and take
the blame yourself only when you can't pass to anybody" is
precisely to handle the merges sanely.  The answer to your later
question also would become crystal clear with that exercise.

Suppose that we are looking at a merge that would give this in
its "git show" output:

diff --cc hello.c
index 3c27792,db3fdef..cec80d2
--- a/hello.c
+++ b/hello.c
@@@ -1,4 -1,6 +1,6 @@@
  int main(int ac, char **av)
  {
-       printf("hello, world.\n");
 -      const char *msg = "hello, world";
++      const char *msg = "hello, world.";
+ 
+       printf("%s\n", msg);
  }

First, we inspect the diff from the first parent:

        diff --git a/hello.c b/hello.c
        index 3c27792..cec80d2 100644
        --- a/hello.c
        +++ b/hello.c
        @@ -1,4 +1,6 @@
1        int main(int ac, char **av)
2        {
        -       printf("hello, world.\n");
3       +       const char *msg = "hello, world.";
4       +
5       +       printf("%s\n", msg);
6        }

That would find that lines 1, 2 and 6 came from the first parent
(line numbers are of the postimage; e.g. line 6 is the closing
brace).

We are still left with lines 3, 4, and 5.  So we will see the
difference from the second parent:

        diff --git a/hello.c b/hello.c
        index db3fdef..cec80d2 100644
        --- a/hello.c
        +++ b/hello.c
        @@ -1,6 +1,6 @@
1        int main(int ac, char **av)
2        {
        -       const char *msg = "hello, world";
3       +       const char *msg = "hello, world.";
4
5               printf("%s\n", msg);
6        }

It shows that lines 1, 2, 4, 5 and 6 are the same as the second
parent (again, line numbers are of the postimage).  This means
that we _could_ attribute line 1, 2 and 6 to the second parent
if we wanted to, but we have already passed blame for 1, 2 and 6
to the first parent [*1*] and only lines 4 and 5 are ...
Previous thread: Re: [PATCH] git-pickaxe: blame rewritten. by Junio C Hamano on Thursday, October 12, 2006 - 5:06 pm. (3 messages)

Next thread: none