[PATCH] Add git-annotate - a tool for annotating files with the revision and person that created each line in the file.

Previous thread: Handling large files with GIT by Martin Langhoff on Wednesday, February 8, 2006 - 2:14 am. (39 messages)

Next thread: Shortest path between commits by Ralf Baechle on Wednesday, February 8, 2006 - 9:03 am. (2 messages)

Signed-off-by: Ryan Anderson <ryan@michonline.com>

---

I think this version is mostly ready to go.

Junio, the post you pointed me at was very helpful (once I got around to
listening to it), but the code it links to is missing - if that's a
better partial implementation than this, can you resurrect it
somewhere?  I'd be happy to reintegrate it together.

 Makefile          |    1 
 git-annotate.perl |  291 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 292 insertions(+), 0 deletions(-)
 create mode 100755 git-annotate.perl

86fa163e7fd1bee2929b7946456407dbc7745193
diff --git a/Makefile b/Makefile
index 5c32934..8d24660 100644
--- a/Makefile
+++ b/Makefile
@@ -117,6 +117,7 @@ SCRIPT_SH = \
 SCRIPT_PERL = \
 	git-archimport.perl git-cvsimport.perl git-relink.perl \
 	git-shortlog.perl git-fmt-merge-msg.perl git-rerere.perl \
+	git-annotate.perl \
 	git-svnimport.perl git-mv.perl git-cvsexportcommit.perl
 
 SCRIPT_PYTHON = \
diff --git a/git-annotate.perl b/git-annotate.perl
new file mode 100755
index 0000000..a3ea201
--- /dev/null
+++ b/git-annotate.perl
@@ -0,0 +1,291 @@
+#!/usr/bin/perl
+# Copyright 2006, Ryan Anderson <ryan@michonline.com>
+#
+# GPL v2 (See COPYING)
+#
+# This file is licensed under the GPL v2, or a later version
+# at the discretion of Linus Torvalds.
+
+use warnings;
+use strict;
+
+use Data::Dumper;
+
+my $filename = shift @ARGV;
+
+
+my @stack = (
+	{
+		'rev' => "HEAD",
+		'filename' => $filename,
+	},
+);
+
+our (@lineoffsets, @pendinglineoffsets);
+our @filelines = ();
+open(F,"<",$filename)
+	or die "Failed to open filename: $!";
+
+while(<F>) {
+	chomp;
+	push @filelines, $_;
+}
+close(F);
+our $leftover_lines = @filelines;
+our %revs;
+our @revqueue;
+our $head;
+
+my $revsprocessed = 0;
+while (my $bound = pop @stack) {
+	my @revisions = git_rev_list($bound->{'rev'}, $bound->{'filename'});
+	foreach my $revinst (@revisions) {
+		my ($rev, @parents) = @$revinst;
+		$head ||= ...

Does it depend on some earlier patch?  I get this:

git]$ git-annotate diff-delta.c
Undefined subroutine &main::all_lines_claimed called at
/home/peter/bin/git-annotate line 124.

The patch was applied to: git version 1.1.6.gd19e-dirty.

Peter
-


Hi,


Just add a function like

-- snip --
sub all_lines_claimed {
        return ($leftover_lines == 0);
}
-- snap --

and you're done.

However, it does not yet do the correct thing: it does not show the root 
commit. For example, if you do "git annotate git-am.sh" it should show 
"d1c5f2a4" for the first lines, not "a1451104" as it does.

Ciao,
Dscho

-


another perl script :(

Are there any rules on the choice of the script language ?

Thanks
--
               Franck
-

From: Johannes Schindelin
Date: Wednesday, February 8, 2006 - 10:45 am

Hi,


Yes. Do not try to introduce unnecessary dependencies. But if it is 
the right tool to do the job, you should use it. As of now, we have perl, 
python and Tcl/Tk.

Hth,
Dscho

-


Very well said.  That's what currently stands.

-


The dependency on Python 2.4 already is a problem for installation on some
systems ...

  Ralf
-


Not many though. Since Python is only required on the workstation where 
the developer does his/her work it's not a very cumbersome requirement. 
The same holds for Perl, btw. It's not a requirement on the server 
hosting the public repositories, unless some of the scripts are used 
from the hooks (git shortlog is used from the default update-hook, but 
that can be changed with no trouble at all).

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-


I understand that in the environments where the Python dependency is a
problem it is probably not due to the specific version. However, if
WITH_OWN_SUBPROCESS is defined in the Makefile then Python 2.3 should
work fine too (this is actually automatically detected now, so you
shouldn't have to do anything special to use Python 2.3).

- Fredrik
-


>>>>> "Franck" == Franck Bui-Huu <vagabon.xyz@gmail.com> writes:

Franck> another perl script :(

Franck> Are there any rules on the choice of the script language ?

I could argue that they should all be Perl. :)

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<merlyn@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
-


Brave thing to do among such a bunch of hardcore C hackers. ;)

So long as we never involve ruby, java or DCL, I'm a happy fellow.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-


Wholeheartedly seconded ;-).

-


I agree too, but my point was more: why not use only Python scripts?
Why are some scripts written in Perl where Python could be used, and
vice versa?

Thanks
--
               Franck
-


Perl is better suited for some tasks, Python for others. Mostly it's 
because the contributor (one out of 137 to date) thought the language 
appropriate for the tool he/she set out to write and felt comfortable 
with it.

I personally abhor the syntax of Perl and the block indentation of 
Python but I happily embrace both if the alternative is to rewrite all 
the script tools in C.

That said, some tools have been rewritten in the past (mostly scripts 
have been replaced by C code versions), but I don't think Junio will 
accept replacement tools just because they're in one particular 
language. If anything, it would be to replace the two python scripts 
with Perl versions, since more tools are implemented in Perl than in 
Python (so we could drop one dependency), Perl exists on more platforms 
(so git becomes more portable), and Perl is used inline in four of the 
shell-scripts (which means we can't get rid of the Perl dependency 
without major hackery anyway).

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-


Hmm.. I get

   [torvalds@g5 git]$ ./git-annotate Makefile
   fatal: 'e83c5163316f89bfbde7d9ab23ca2e25604af290^1..e83c5163316f89bfbde7d9ab23ca2e25604af290': No such file or directory
   Undefined subroutine &main::all_lines_claimed called at ./git-annotate line 124.

where that fatal error is because e83c51.. doesn't _have_ a parent, it's 
the root (so doing ^1 on it doesn't work).
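
One way to avoid asking for "^1" on a root commit is to build the range only when a parent is actually known. This is a minimal sketch, not the fix that went into git-annotate; rev_range_for is a hypothetical helper, and it assumes the caller already has the parent list for each commit.

```perl
use strict;
use warnings;

# Hypothetical helper: build the revision argument for one commit.
# $rev is a commit id; @parents are its parent ids (empty for a root commit).
sub rev_range_for {
    my ($rev, @parents) = @_;
    # A root commit has no "^1", so name the commit itself
    # instead of the range "parent..rev".
    return $rev unless @parents;
    return "$parents[0]..$rev";
}

print rev_range_for('e83c5163'), "\n";              # root: no range
print rev_range_for('1c15afb9', 'a1451104'), "\n";  # ordinary commit
```

With a root commit the argument is just the commit id, so nothing tries to resolve a parent that does not exist.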

After fixing the "all_lines_claimed" problem as outlined by Dscho, I get a 
lot of

	Skipping diff-parse - i = filelines)

and no actual output.

Doing it on a file that didn't exist in the root commit still gives those 
"Skipping" messages, but at least it did actually output something. 

However, what it output was clearly not correct, so there's still some 
tweaking to do.

For example, doing

	./git-annotate apply.c

annotates most of that file to Junio's commit 1c15afb9, which is totally 
incorrect: that commit actually only changed a few lines.

So it looks like there's still some work to be done on this..

			Linus
-


I still have it, but the reason I stopped circulating it
was that I found that on some inputs it did not work
correctly as intended.  Not that the algorithm was necessarily
broken but the implementation certainly was.

Unlike yours, mine reads and interprets diff output to find which
lines are common and which lines are added, and I think the diff
interpretation logic gets various corner cases wrong.  I wrote the
combine-diff.c diff interpreter without looking at my
'git-blame', so I do not remember where I got it wrong,
though...

It's been a while since I last looked at it, so it may
not even work with the current git, but here it is..

--
#!/usr/bin/perl -w

use strict;

package main;
$::debug = 0;

sub read_blob {
    my $sha1 = shift;
    my $fh = undef;
    my $result;
    local ($/) = undef;
    open $fh, '-|', 'git-cat-file', 'blob', $sha1
	or die "cannot read blob $sha1";
    $result = join('', <$fh>);
    close $fh
	or die "failure while closing pipe to git-cat-file";
    return $result;
}

sub read_diff_raw {
    my ($parent, $filename) = @_;
    my $fh = undef;
    local ($/) = "\0";
    my @result = (); 
    my ($meta, $status, $sha1_1, $sha1_2, $file1, $file2);

    print STDERR "* diff-index --cached $parent $filename\n" if $::debug;
    my $has_changes = 0;
    open $fh, '-|', 'git-diff-index', '--cached', '-z', $parent, $filename
	or die "cannot read git-diff-index $parent $filename";
    while (defined ($meta = <$fh>)) {
	$has_changes = 1;
    }
    close $fh
	or die "failure while closing pipe to git-diff-index";
    if (!$has_changes) {
	return ();
    }

    $fh = undef;
    print STDERR "* diff-index -B -C --find-copies-harder --cached $parent\n" if $::debug;
    open($fh, '-|', 'git-diff-index', '-B', '-C', '--find-copies-harder',
	 '--cached', '-z', $parent)
	or die "cannot read git-diff-index with $parent";
    while (defined ($meta = <$fh>)) {
	chomp($meta);
	(undef, undef, $sha1_1, $sha1_2, $status) = split(/ ...

I tried that approach at first, and it was much much more confusing to
try to keep track of.  The problem Linus found (that of a missing
"all_lines_claimed()") was related to that code.  This implementation is
simple, though it has to have some problems with guessing at duplicated

I'll take a look through this in greater detail later, hopefully your
approach can be applied.  Diff-analyzing is apparently tricky.

-- 

Ryan Anderson
  sometimes Pug Majere
-


Reading a diff is tricky, but I was too lazy to match up the lines
by hand, which is also real work ;-).

There are a few things I should add to that ancient code:

 - It wants old ls-tree behaviour.  The command line used in the
   "sub find_file" needs to be updated to something like this:

    open $fh, '-|', 'git-ls-tree', '-z', '-r', $commit->{TREE}, $path
	or die "cannot read git-ls-tree $commit->{TREE}";

 - It only cares about the line numbers and its output is meant
   to be postprocessed with the contents from the latest blob.

 - It predates the recent rev-list that skips commits that do
   not change the specified paths, so it literally follows each
   parent and does its own optimization to avoid diffing with
   uninteresting parents.
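
As a rough illustration of leaning on rev-list instead of walking parents by hand: `git rev-list --parents <rev> -- <path>` prints one commit per line followed by its parents, already limited to commits that touch the path. The sketch below only parses text in that format; parse_rev_list is a hypothetical name, and the sample input is made up.

```perl
use strict;
use warnings;

# Parse text in the format printed by `git rev-list --parents`:
# one commit per line, "<commit> <parent1> <parent2> ...".
# Returns a list of array refs, each [ $rev, @parents ].
sub parse_rev_list {
    my ($text) = @_;
    my @revisions;
    foreach my $line (split /\n/, $text) {
        next unless $line =~ /\S/;          # skip blank lines
        my ($rev, @parents) = split ' ', $line;
        push @revisions, [ $rev, @parents ];
    }
    return @revisions;
}

# Made-up sample: a linear commit, a merge, and a root commit.
my $sample = "c3 c2\nc2 c1 b1\nc1\n";
foreach my $revinst (parse_rev_list($sample)) {
    my ($rev, @parents) = @$revinst;
    printf "%s has %d parent(s)\n", $rev, scalar @parents;
}
```

Each element unpacks the same way the posted patch unpacks its revisions, with `my ($rev, @parents) = @$revinst;`.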

I suspect if you go with the diff-reading approach, it might be
easy to convert it to C (or even write the initial version in C)
using the machinery similar to what is in combine-diff.c.

The algorithm combine-diff.c uses keeps the lines discarded from
each parent in lline structure linked to the sline structure
(which keeps track of the lines in the final version), but for
your annotate purposes what you care about is only what the
child adds to the parent (IOW, we do not care about the lines
that do not appear in the final version), so the logic and the
data structure could be greatly simplified.  You only need to
keep "flag" element in the sline structure, and maybe bol and
len that point at the contents of the resulting line from the
final version.  In addition, you would need to store "the
current suspect commit" (starts from the final revision and
updated as you pass the blame along) and another bool that says
if "the current suspect" is known to be the guilty party or if
the true culprit is one of its ancestors (capital vs lowercase
difference in that explanatory note).
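
To make the simplified bookkeeping concrete, here is a minimal Perl sketch of one record per final line, holding the current suspect and a flag saying whether that suspect is known to be guilty. The field and function names are invented for illustration, it handles one parent at a time, and it assumes something else has already worked out which final-line indexes the child added relative to that parent.

```perl
use strict;
use warnings;

# One record per line of the final version (field names are illustrative).
sub new_lines {
    my ($final_rev, @text) = @_;
    return map { +{ text => $_, suspect => $final_rev, guilty => 0 } } @text;
}

# Pass blame from $child to $parent for every line still pinned on $child.
# $added is a hash ref whose keys are the 0-based final-line indexes that
# $child added relative to $parent; those lines stay with $child for good.
sub pass_blame {
    my ($lines, $child, $parent, $added) = @_;
    foreach my $i (0 .. $#$lines) {
        my $l = $lines->[$i];
        next if $l->{guilty} || $l->{suspect} ne $child;
        if ($added->{$i}) {
            $l->{guilty} = 1;           # the child introduced this line
        } else {
            $l->{suspect} = $parent;    # the line predates the child
        }
    }
}

my @lines = new_lines('C', 'one', 'two', 'three');
pass_blame(\@lines, 'C', 'B', { 1 => 1 });   # C added "two"
pass_blame(\@lines, 'B', 'A', { 2 => 1 });   # B added "three"
foreach my $l (@lines) {
    printf "%s%s %s\n", $l->{suspect}, ($l->{guilty} ? '' : '?'), $l->{text};
}
```

Lines that are still unmarked when the walk reaches a commit with no parent would be pinned on that commit; the sketch leaves that last step out.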


-


Reading a diff is tricky, yes, but if you're willing to just throw RAM
at the problem, it might not be quite as bad as I was trying at first.

My current thought on how to get it more correct is this:
	foreach $rev (@revqueue) {
		foreach $parent (@{$revs{$rev}{parents}}) {
			my @templines = @{$revs{$rev}{lines}};

			$revs{$parent}{lines} = apply_diff(\@templines);
		}
	}

The @lines arrays that get built will be entirely made up of hash or
array references, so they just get reused for each successive file.

When apply_diff() deletes a line from the new copy, it should mark that
line as "claimed" by the current rev.

I'm thinking that each element of @lines will look like this:
	{
		text => $text,
		in_original => [0 | 1],
		claimed_by => $rev,
	}
at least to start.

This method can sanity check itself by calling git cat-file and actually
reading in each version of the file, and comparing it against the
generated copy, aborting if we get the two out of sync.
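
A rough sketch of what such an apply_diff() might look like, under assumptions that are not in the plan above: the diff is a unified diff from parent to child, passed as chomped lines starting at the first "@@" (file headers already stripped), and the rev is passed in explicitly. Line records are shared by reference, so a claim made here shows up in every list that holds the same record.

```perl
use strict;
use warnings;

# Rebuild the parent's line list from the child's list and a unified diff
# (parent -> child). Lines the child added are claimed by $rev.
sub apply_diff {
    my ($child_lines, $rev, @diff) = @_;
    my @parent_lines;
    my $pos = 0;                        # next unread index in @$child_lines
    foreach my $d (@diff) {
        if ($d =~ /^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@/) {
            # With a zero-length new side the start names the line before.
            my $start = (defined $2 && $2 == 0) ? $1 : $1 - 1;
            # Copy untouched lines that precede this hunk.
            push @parent_lines, @{$child_lines}[$pos .. $start - 1];
            $pos = $start;
        } elsif ($d =~ /^\+/) {
            # Added by $rev: absent from the parent, so $rev claims it.
            my $l = $child_lines->[$pos++];
            $l->{claimed_by} = $rev unless defined $l->{claimed_by};
        } elsif ($d =~ /^-(.*)/) {
            # Only the parent had this line; it never reaches the final file.
            push @parent_lines,
                { text => $1, in_original => 0, claimed_by => undef };
        } elsif ($d =~ /^ /) {
            push @parent_lines, $child_lines->[$pos++];   # context line
        }
    }
    # Copy whatever follows the last hunk.
    push @parent_lines, @{$child_lines}[$pos .. $#$child_lines];
    return \@parent_lines;
}

# Sanity check: does a generated copy match the real file contents?
sub lines_match {
    my ($lines, @expected) = @_;
    return join("\n", map { $_->{text} } @$lines) eq join("\n", @expected);
}

my @child = map { +{ text => $_, in_original => 1, claimed_by => undef } }
            qw(a b c);
my $parent = apply_diff(\@child, 'C', '@@ -1,2 +1,3 @@', ' a', '-x', '+b', '+c');
print "parent copy ok\n" if lines_match($parent, qw(a x));
```

In real use the expected contents for lines_match would come from git cat-file for the parent's blob, which is the check described above.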

I'll see about implementing something along these lines this weekend,
time permitting.

-- 

Ryan Anderson
  sometimes Pug Majere
-
