Re: GSoC draft proposal: Line-level history browser

From: Bo Yang
Date: Saturday, March 20, 2010 - 2:18 am

Hi,

I am very interested in the project 'Line-level history browser'.
After some days of consideration, I have written a draft of my proposal,
and I think it is helpful to send it to the list before submitting it.
Could you please give me some advice?

-----------------------------------------------
Draft proposal: Line-level History Browser

=====Purpose of this project=====
"git blame" can tell us who is responsible for a line of code, but it
cannot show in detail how the lines of code have evolved into what
they are now.
This project will add a new utility for git called 'git line-log'. It
can trace the history of any line range of a certain file at any
revision. For simplicity, a user can run a command like ' git line-log
builtin/diff.c 6..8 ' to get the change history of the code between
line 6 and line 8 of the diff.c file. For each history entry, it
will provide the commit and the diff block which contains the changes
to the lines the user is interested in.
This utility will trace the whole modification history of the lines of
interest and stop when it finds the root of the lines, which is the
point where all the new code is added from scratch. Also, the user can
specify how deeply this utility should trace. This tool will
treat code movement just like modification, so it will follow code
movement inside one file.
Note that the history may not always be a single thread of commits.
If more than one commit produced the specified line
range, the thread of history will split. In that case this utility will
stop and present all the commits with their code changes to the user,
and let the user select which one to trace next.

=====Work and technical issues=====
==Command options==
This new tool should be used for exploring the history of changes for
certain line range of code in one file.

git line-log [options] <file> <line range>

Options:
1. Since it will output commit description, it will contain the option
used to control whether we should show the whole commit ...
From: Johannes Schindelin
Date: Saturday, March 20, 2010 - 4:30 am

Hi,


I like it very much already! You obviously put in a substantial amount of 
time to learn intricate details about the way Git operates, and what is 
already available.

And you also provided a patch (unrelated to line-level history browser), 
so you proved that you actually cloned Git, and that you can actually 
patch it and use Git itself to send a patch to this list.

Very good.


I think that that might be good for starters, but one could imagine that 
an integration into "git log" might be even better, so that gitk can use 

It would be good if the code looked harder after failing with the simple 
strategy, such as looking for code removed in other files, fuzzy matching 
(optional), and looking for code duplication (i.e. literal copying, or 
slightly modified copying).

The fuzzy matching might be necessary to catch things like a Java class 
moving from one file into another (and changing its name): the first line 

Almost.

Just have a look at the word-level diff (--color-words):

http://repo.or.cz/w/git/dscho.git/blob/bc1ed6aafd9ee4937559535c66c8bddf1864bec6:/diff....

You will see that there is a function fn_out_diff_words_aux(), which is 
passed to xdi_diff_outf(). That latter function calls xdiff such that the 
former function receives a complete line at a time. And this is what I 
would suggest doing in the line-level log, too.

Ciao,
Dscho

--

From: Bo Yang
Date: Saturday, March 20, 2010 - 6:10 am

Hi Johannes,




That's really a good idea.
So, when the program reaches the end of the history thread for some
line range, it should not stop immediately. It should then
search harder and try to find out whether the newly added lines of
code were moved there or copied there from another place. And
this kind of search should use fuzzy matching instead of exact string
matching.

But notice that detecting code movement in one commit is much more
efficient than detecting code copies. So, I think we should add an
option to control whether we detect this kind of code copy. By default, we

I have looked over the function fn_out_diff_words_aux; this function
parses each line of an in-memory diff. We can use it to detect the diff
hunk header and find the line changes. If you think the performance is
acceptable, I think using this callback mechanism is all right.

Regards!
Bo
-- 
My blog: http://blog.morebits.org
--

From: Junio C Hamano
Date: Saturday, March 20, 2010 - 6:30 am

If you are hooking into "git log", it already has "-M / -C / -C -C" as a
notion to express "different levels of digging" to find code movement and
copies, and so does "git blame".  You probably will save a lot of time if
you studied the current blame implementation thoroughly before designing
or coding.

Two things that you need to think about carefully are why "blame" stops at
the commits it shows, and if you could "peel" these lines in its output to
peek what are behind the lines, what you would see.  This is not a rocket
science topic, but it is not entirely trivial.
--

From: Bo Yang
Date: Saturday, March 20, 2010 - 11:03 pm

Hi Junio,


Yes, both blame and log have such '-M/-C/-C -C' options.  But their
meanings are not quite the same:
For 'git log': -M is used to detect file renames, -C is used to trace
code copies. Both options accept no argument.
For 'git blame': -M is used to trace code movement, -C is used to trace
code copies. And both options accept a <num> which specifies the lower
bound on the number of 'same code characters'.
And I think the line-level history tool acts more like 'git blame'.
So, the '-C' option of 'git log' is exactly what we need but '-M' is
not. So, I think maybe we should add another '-m' option to 'git log'
for line-level code movement detection.

I have taken a rough look over blame.c; it is really very helpful and I
find I can borrow some code from 'git blame' to make the line-level
history browser.


I think blame's purpose is to find who is responsible for each line
of code. So, it stops after it finds the origin of the code. The
line-level history browser will continue further back into history from
what blame found: it will find what the line was before this
commit, then go backward through history from that original line to get
an older state, and so on. Simply put, it is something like 'git
blame' run recursively. :)

Thanks again for your advice, I learned a lot from your feedback, thanks!

Regards!
Bo
--

From: Johannes Schindelin
Date: Saturday, March 20, 2010 - 6:36 am

Hi,

[please do not cull the Cc: list]


Yes, I think that this should be the target for the user interface. 
However, the logic should be different enough to merit a completely new 

Yes, it is much more difficult, and it is more expensive. So: there are 
several steps in the project (you could also call them "milestones"), and 
fuzzy matching end lines would come later than simple code movement. And 

Yes, I think that the performance is alright there, it works well enough 
for --color-words.

Thanks,
Dscho

--

From: Bo Yang
Date: Saturday, March 20, 2010 - 11:05 pm

Ok, I will add some milestones to the next version of my proposal, thanks.

Regards!
Bo
--

From: Alex Riesen
Date: Saturday, March 20, 2010 - 1:35 pm

You might want to reconsider the line range syntax. Exactly the same syntax
is already used to specify a commit range, so reusing it may lead to confusion.
--

From: Junio C Hamano
Date: Saturday, March 20, 2010 - 1:57 pm

I would actually recommend you take a look at -L option from blame.  What
I use most often and find very handy myself is this pattern:

	blame -L '/^void some_function()/,/^}/' -- path

as I do not have to count the line numbers.
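For illustration only, here is a minimal Python sketch of how such a '/start regex/,/end regex/' pair could be turned into line numbers. It is not the code blame uses: resolve_range is a hypothetical helper, the sketch assumes the end pattern is searched for after the start line, and Python's regex syntax differs from POSIX (the parentheses have to be escaped here).

```python
import re

def resolve_range(lines, start, end):
    """Resolve a blame-style (start, end) pair to 1-based inclusive line numbers.

    Each of start/end is either an int line number or a "/regex/" string.
    """
    def is_pattern(spec):
        return (isinstance(spec, str) and len(spec) >= 2
                and spec[0] == "/" and spec[-1] == "/")

    def find(pattern, first):
        # Search forward from 1-based line `first` for the first matching line.
        rx = re.compile(pattern)
        for number in range(first, len(lines) + 1):
            if rx.search(lines[number - 1]):
                return number
        raise ValueError("no line matches %r" % pattern)

    begin = find(start[1:-1], 1) if is_pattern(start) else int(start)
    # Assumption of this sketch: the end pattern is looked up after the start line.
    finish = find(end[1:-1], begin + 1) if is_pattern(end) else int(end)
    return begin, finish

source = [
    "#include <stdio.h>",
    "",
    "void some_function()",
    "{",
    '    puts("hi");',
    "}",
    "",
    "int main(void) { some_function(); return 0; }",
]
print(resolve_range(source, r"/^void some_function\(\)/", "/^}/"))  # (3, 6)
```

Plain numbers pass straight through, so resolve_range(source, 2, 4) gives (2, 4).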

There also was a discussion on allowing more than one -L to blame, which I
think is applicable to this feature.  Check the list archive for the past
few months.

--

From: Bo Yang
Date: Saturday, March 20, 2010 - 11:10 pm

I have looked at that option and I find it is very convenient and

I think it is reasonable for 'git blame' to allow more than one -L to
let users see more than one block of code. But for a tool which is
used to explore history, I think the user mostly focuses on one thread
of history. If the history splits at some point, we should ask the user
to choose one to follow. So, I think the line-level browser need not
support such a thing. :)

Regards!
Bo
--

From: A Large Angry SCM
Date: Saturday, March 20, 2010 - 2:58 pm

I, actually, think the proposed line range syntax works because it uses 
the same _range_ notation. The issue is how to differentiate the _line_ 
range(s) from the _commit_ range(s); and, yes, I would like multiple 
ranges of each type as well as multiple files.
--

From: Bo Yang
Date: Saturday, March 20, 2010 - 11:16 pm

As I said in my previous post, I think we should adopt the 'git blame'
way: use '-L <start pos>,<end pos>' to specify the line range. It
supports both line numbers and POSIX regexes.
As for multiple ranges, I don't think it is very useful to support
them in a history browser. After all, our users can only focus on one
thread of history at a time. I am very willing to hear what your use
case for multiple ranges is.

Thanks for your valuable advice!

Regards!
Bo
--

From: A Large Angry SCM
Date: Sunday, March 21, 2010 - 6:19 am

Bo Yang wrote:

More than one line range can be related and of interest to a 
forensics/archeology task.

In a simple multi range case, you'd have 2 line ranges in the same file 
that you want to see the history and graph of. Such as 2 related macro 
definitions in a header file.

In a complex multi range case, you'd have many line ranges spread over 
multiple blobs and some of the blobs have disjoint commit graphs.

The complex multi range case may be too much for a GSOC project, and the 
simple multi range case may be also. However, the command syntax should 
be general enough to handle them without being too ugly so that the 
implementation could be improved and expanded later.
--

From: Bo Yang
Date: Sunday, March 21, 2010 - 8:48 pm

Yeah, what do you think of using the following syntax:

<file1>@<rev1>:<start pos>,<end pos> <file2>@<rev2>:<start pos>,<end pos>

Thanks!

Regards!
Bo
--

From: Junio C Hamano
Date: Sunday, March 21, 2010 - 9:24 pm

Horrible.  That is not how we name things.

What's wrong with bog standard:

    $ git log -L 10,20 master -- Documentation/git.txt

which is exactly how "blame" does it?

--

From: Bo Yang
Date: Sunday, March 21, 2010 - 9:34 pm

The 'blame' way is very good if we only support one line range. But if
we want to support multiple line ranges, I don't think it is suitable
for that case. How can I specify multiple ranges which refer to
multiple files at multiple revisions using the
above syntax?

Apart from that, I still can't convince myself that we need
multiple-range support. How would we display such a result to our users?

Regards!
Bo
--

From: Junio C Hamano
Date: Sunday, March 21, 2010 - 10:32 pm

I would sort of see you may want to be able to say "explain lines 10 thru
15 of config.h and lines 100-115 of hello.c that appear in v1.2.0", but I
think it is a total nonsense to ask for "ll 10-15 of config.h in v1.2.0
and ll 110-115 of hello.c in v1.0.0".  After all they never existed in the
same revision (otherwise you would have said "ll 7-13 of config.h and ll
110-115 of hello.c that appear in v1.0.0").  So I would reject the
SVN-like "rev@" in the first place.

While I don't seriously buy "multiple files" either, if that is really
needed, I could be persuaded with  "log -- path1:10-15 path2:1-7", or
"log -L path1:10-15 -Lpath2:1-7 -- path1 path2" or something similarly
ugly like these, but that is not how we generally name things, and it
probably shouldn't be a new option to "log" anymore.

On the other hand, multiple ranges in a single file is something that
may be quite reasonable, e.g.

  $ git log -L10-15 -L200-210 -- Makefile
  $ git log -L'*/^#ifdef WINDOWS/,/^#endif \/\* WINDOWS \/\*/' -- config.h

As I already said, I wouldn't be so worried about multiple-range feature,
but I would be worried about the usefulness of this feature, even for the
case to track a single range of a single file, starting from one given
revision.  When you want to know where the first few lines of Makefile
came from, and if blame says the first line came from 2731d048, that
really means that between the revision you started digging from and the
found revision, there is no commit that touched that particular line, but
equally importantly, that before that found revision, there wasn't a
corresponding line in that file---blame stopped exactly because there is
nobody before that found revision that the line can be blamed on.

So implementing "git log -L1,10 -- Makefile" might be just the matter of
doing something like:

 1. Run "git blame -L1,10 -- Makefile";
 2. Note the commits that appear in the output;
 3. Topologically sort these commits;
 4. Run "git show <the result of that ...
From: Bo Yang
Date: Monday, March 22, 2010 - 12:31 am

I am sorry, but I did not follow you here. Are you worried about the
usefulness of the multi-range feature or of the line-level history
browser?

I think tracking a single range of a single file, starting from one
given revision, is useful when the line of history splits at some point.
This can let users focus on a single line of history using this

Yes, this is not satisfying. But as I understand it, the line-level
history browser will do more than just this. It will not stop at 'step
4'; it can follow the change history recursively and deeply to find
more. I think this is useful when we focus on just one range of code
and want to know how it came to be in its current state.

Anyway, it is not a bad thing to add a new convenient feature to a
daily tool. :)

Regards!
Bo
--

From: Junio C Hamano
Date: Monday, March 22, 2010 - 12:41 am

I am actually questioning the existence of "recursively and deeply to find
more"; the reason blame stopped at a particular commit is exactly because
there is no more---otherwise it wouldn't have stopped there but kept
digging deeper.

That is what I meant in the message you are responding to, quoted at the
top of this message.
--

From: Bo Yang
Date: Monday, March 22, 2010 - 12:52 am

I think an example may explain what I mean.

commit 1 of the file:
line 1 rev 1
line 2 rev 1

commit 2 of the file:
line 1 rev 2
line 2 rev 2

commit 3 of the file:
line 1 rev 3
line 2 rev 3

If we run 'git blame file', it will show that both lines are blamed on
commit 3. The line-level utility will also show rev2 and rev1 to users,
in a format like what git log provides. I think git blame focuses on who
produced the current code range. The line-level browser will
provide more than that: it also answers how the lines evolved into
their current state.
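To make the three-commit example concrete, here is a rough Python sketch of the backward walk. It is only an illustration under stated assumptions: the names map_range_back and line_history are hypothetical, the history is taken to be linear, Python's difflib stands in for libxdiff, and a changed block is mapped onto the whole block it replaced.

```python
import difflib

def map_range_back(old, new, lo, hi):
    """Map the 1-based inclusive range lo..hi of `new` onto `old`.

    Returns (lo, hi) in `old`, or None when the lines were added from scratch.
    """
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    found = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Overlap of [j1, j2) with the 0-based half-open range [lo - 1, hi).
        start, stop = max(j1, lo - 1), min(j2, hi)
        if start >= stop:
            continue
        if tag == "equal":
            found.append((i1 + (start - j1), i1 + (stop - j1)))
        elif tag == "replace":
            found.append((i1, i2))  # widen to the whole replaced block
        # "insert": these lines have no counterpart in `old`
    if not found:
        return None
    return min(s for s, _ in found) + 1, max(e for _, e in found)

def line_history(versions, lo, hi):
    """versions: list of (name, lines), newest first. Returns the names of
    the versions that changed or introduced the tracked range."""
    trail = []
    for (name, new), (_, old) in zip(versions, versions[1:]):
        mapped = map_range_back(old, new, lo, hi)
        if mapped is None:
            trail.append(name)  # the range was introduced here
            return trail
        if old[mapped[0] - 1:mapped[1]] != new[lo - 1:hi]:
            trail.append(name)  # the range was modified here
        lo, hi = mapped
    trail.append(versions[-1][0])  # the oldest version introduced what is left
    return trail

versions = [
    ("commit 3", ["line 1 rev 3", "line 2 rev 3"]),
    ("commit 2", ["line 1 rev 2", "line 2 rev 2"]),
    ("commit 1", ["line 1 rev 1", "line 2 rev 1"]),
]
print(line_history(versions, 1, 2))  # ['commit 3', 'commit 2', 'commit 1']
```

Where blame would report only commit 3 for both lines, this walk keeps going and reports all three commits.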

I hope I explain everything clearly. :)

Regards!
Bo
--

From: Jonathan Nieder
Date: Monday, March 22, 2010 - 1:10 am

Hmm, I can imagine some (mutually inconsistent) heuristics:

 - Suppose in the blamed commit a single isolated line changed.  Then
   it is clear where to look next.

 - If the mystery code is at the beginning of the file (resp.
   beginning of a diff -C0 hunk), maybe it was based on the line at the
   same position within the previous commit.

 - Take the line with the lowest Levenshtein distance from the mystery
   code.

 - Expect certain common patterns of change: substituted words,
   whitespace changes, added arguments for a function, things like that.
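As an illustration of the Levenshtein heuristic in the list above, here is a small Python sketch (the function names are hypothetical and nothing like this exists in git). It computes the classic edit distance and picks the candidate line closest to the mystery line; the sample lines are taken from the run-command.c hunk quoted later in this thread.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]

def closest_line(target, candidates):
    """Return (index, distance) of the candidate with the lowest distance."""
    distances = [levenshtein(target, c) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return best, distances[best]

preimage = [
    "static void notify_parent(void)",
    "{",
    '\twrite(child_notifier, "", 1);',
    "}",
]
mystery = '\tunused = write(child_notifier, "", 1);'
print(closest_line(mystery, preimage))  # (2, 9)
```

The distance 9 is exactly the inserted text "unused = ". A real tool would still need a cutoff to decide when even the closest line is too far away to count as a previous version.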

That said, I still don’t have a clear picture of a basic strategy.

Interested,
Jonathan
--

From: Bo Yang
Date: Monday, March 22, 2010 - 11:01 pm

I can't fully understand your strategy above. I think we can
categorize code changes into two cases:
1. The diff looks like:

@@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
                add_signoff = xmemdupz(committer, endpos - committer + 1);
        }

-       for (i = 0; i < extra_hdr_nr; i++) {
-               strbuf_addstr(&buf, extra_hdr[i]);
+       for (i = 0; i < extra_hdr.nr; i++) {
+               strbuf_addstr(&buf, extra_hdr.items[i].string);
                strbuf_addch(&buf, '\n');
        }


i.e. there are both deletions and additions in a change. This means we
modified some lines of the code. So, what we do is trace the two
'minus' lines and then find another diff, and start tracing from that
diff recursively.
Yes, the newly added code may also have been moved or copied from
another place. But I think here we should focus on the lines before
this changeset.

2. The diff looks like:

@@ -879,9 +885,12 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
        opt.regflags = REG_NEWLINE;
        opt.max_depth = -1;

+       strcpy(opt.color_context, "");
        strcpy(opt.color_filename, "");
+       strcpy(opt.color_function, "");
        strcpy(opt.color_lineno, "");
        strcpy(opt.color_match, GIT_COLOR_BOLD_RED);

This means the code here is added from scratch. Here, I think we have
three options.
1. Find out whether the new code was moved here from another place.
2. Find out whether the new code was copied from another place.
3. We have found the end of the history, so stop here.

The problem remains how we find the copied/moved code. The newly
added code may have been copied/moved from multiple places with small
changes.

I hope I understand the requirements of the line-level browser; could
you please point it out if I have made a mistake?

Regards!
Bo
--

From: Jakub Narebski
Date: Tuesday, March 23, 2010 - 3:08 am

From: Bo Yang
Date: Tuesday, March 23, 2010 - 3:38 am

If I understand correctly, that is as follows.

@@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
               add_signoff = xmemdupz(committer, endpos - committer + 1);
       }

-       for (i = 0; i < extra_hdr_nr; i++) {
-               strbuf_addstr(&buf, extra_hdr[i]);
+       for (i = 0; i < extra_hdr.nr; i++) {
+               strbuf_addstr(&buf, extra_hdr.items[i].string);
               strbuf_addch(&buf, '\n');
       }

Here, the user only asks to track the strbuf_addstr line, and we
find the above diff hunk. I think we can then find where the line was
in the preimage using @@ -1008,29 +1000,29 @@.  In the postimage, the
strbuf_addstr line is located at
1000 (the postimage start line number)
+3 (the number of context lines)
+1 (the number of '+' lines before this line),
and we can calculate its line number in the preimage in the same way:
1008
+3
+1 (the number of '-' lines before this line).

What do you think of this method?
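The arithmetic above can be checked with a small Python sketch that numbers every line of a hunk from its header. This is only an illustration: number_hunk is a hypothetical helper, the wrapped header is joined onto one line, and the body is just the excerpt quoted above, so the 29-line counts in the header are not validated.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def number_hunk(hunk_lines):
    """Yield (old_no, new_no, text) for every body line of one unified-diff hunk.

    old_no is None for added lines and new_no is None for removed lines.
    """
    match = HUNK_HEADER.match(hunk_lines[0])
    if match is None:
        raise ValueError("not a hunk header: %r" % hunk_lines[0])
    old_no, new_no = int(match.group(1)), int(match.group(3))
    for line in hunk_lines[1:]:
        marker, text = line[:1], line[1:]
        if marker == "-":
            yield old_no, None, text
            old_no += 1
        elif marker == "+":
            yield None, new_no, text
            new_no += 1
        else:  # context line: present in both preimage and postimage
            yield old_no, new_no, text
            old_no += 1
            new_no += 1

hunk = [
    "@@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)",
    " \t\tadd_signoff = xmemdupz(committer, endpos - committer + 1);",
    " \t}",
    " ",
    "-\tfor (i = 0; i < extra_hdr_nr; i++) {",
    "-\t\tstrbuf_addstr(&buf, extra_hdr[i]);",
    "+\tfor (i = 0; i < extra_hdr.nr; i++) {",
    "+\t\tstrbuf_addstr(&buf, extra_hdr.items[i].string);",
    " \t\tstrbuf_addch(&buf, '\\n');",
    " \t}",
]
rows = list(number_hunk(hunk))
for old_no, new_no, text in rows:
    print(old_no, new_no, text.strip())
```

The new strbuf_addstr line comes out as postimage line 1004 and the old one as preimage line 1012, matching 1000+3+1 and 1008+3+1. The counting works here because the removed and added blocks have the same size; it says nothing yet about which removed line corresponds to which added line.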

Regards!
Bo
--

From: Jakub Narebski
Date: Tuesday, March 23, 2010 - 4:22 am

On Tue, 23 Mar 2010, Bo Yang wrote:

Please do not forget to include attribution line, like the one I have

This would work with the simplest case, but not in more complicated
cases, like for example preimage and postimage with different size.

Take for example the following chunk (fragment):

diff --git a/run-command.c b/run-command.c
index 2feb493..3206d61 100644
--- a/run-command.c
+++ b/run-command.c
@@ -67,19 +67,21 @@ static int child_notifier = -1;
 
 static void notify_parent(void)
 {
-	write(child_notifier, "", 1);
+	ssize_t unused;
+	unused = write(child_notifier, "", 1);
 }
 
 static NORETURN void die_child(const char *err, va_list params)

If you follow ssize_t line, it is created.  If you follow line with
write, which is 2nd line in postimage, its previous version is 1st
line in preimage.
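One way to see this on the hunk above is to pair each added line with the most similar removed line and to treat added lines without a good partner as newly created. The Python sketch below is only an illustration: pair_changed_lines is a hypothetical name, difflib's similarity ratio stands in for whatever measure a real implementation would use, and the 0.6 cutoff is an arbitrary assumption.

```python
import difflib

def pair_changed_lines(removed, added, threshold=0.6):
    """For each added line, return the index of the most similar removed
    line, or None when no removed line is similar enough."""
    pairs = []
    for new_line in added:
        scores = [
            difflib.SequenceMatcher(None, old_line, new_line).ratio()
            for old_line in removed
        ]
        best = max(range(len(removed)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            pairs.append(best)
        else:
            pairs.append(None)  # treated as created in this commit
    return pairs

removed = ['\twrite(child_notifier, "", 1);']
added = [
    "\tssize_t unused;",
    '\tunused = write(child_notifier, "", 1);',
]
print(pair_changed_lines(removed, added))  # [None, 0]
```

The ssize_t line has no counterpart (None), while the write line maps back to the single removed line, even though it sits at a different offset inside the changed block.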


Another example would be reordering of lines, or reordering with
some change.

-- 
Jakub Narebski
Poland
--

From: Bo Yang
Date: Tuesday, March 23, 2010 - 5:23 am

Hi,


Ah, yes, you are right.

And now I really see the difference between our understandings of the
line-level browser. :) When users want to browse the history
of some line or line range, you want to display only the related lines
to them, but I want to display the minimum diff hunk to them. :)
And I think displaying the minimum diff hunk is sensible and feasible.
Could you please tell me what you think about this?

Regards!
Bo
--

From: Jakub Narebski
Date: Tuesday, March 23, 2010 - 6:49 am

The problem is not what (part of) diff you would display.  The problem
is with following the history (with history simplification).  *After*
displaying diff / chunk / chunk fragment, do we further follow history
of the whole preimage?  Or do we follow history of line pre-change
starting from blamed commit?

If we *don't* follow the history, how is the line-level browser different
from (wrapped) git-blame?


Try to come up with the result of line-level history for some line in
git sources "by hand": this would help in discussion about what 
line-level history browser should do, and perhaps even be first test
of it (see e.g. tests for git-blame).

-- 
Jakub Narebski
Poland
--

From: Bo Yang
Date: Tuesday, March 23, 2010 - 8:23 am

Hi,


Thanks for your advice to come up with a real example, Jakub! I can
give a not too trivial one. :)

If you look at line 1032 of pretty.c, you will find a line like:

format_commit_message(commit, user_format, sb, context);

As an example, let us trace the history of this line.
We will find the change where this line first appears:

@@ -900,18 +900,18 @@ char *reencode_commit_message(const struct commit *commit, const char **encoding
...skipped...
        if (fmt == CMIT_FMT_USERFORMAT) {
-               format_commit_message(commit, user_format, sb, dmode);
+               format_commit_message(commit, user_format, sb, context);
                return;
        }
And we should trace the preimage, something like:
        if (fmt == CMIT_FMT_USERFORMAT) {
               format_commit_message(commit, user_format, sb, dmode);

We will find these below:
@@ -770,7 +775,7 @@ void pretty_print_commit(enum cmit_fmt fmt, const struct com
        const char *encoding;

        if (fmt == CMIT_FMT_USERFORMAT) {
-               format_commit_message(commit, user_format, sb);
+               format_commit_message(commit, user_format, sb, dmode);
                return;
        }

Again:
+
+       if (fmt == CMIT_FMT_USERFORMAT) {
+               format_commit_message(commit, user_format, sb);
+               return;
+       }
+

Here, we find that the line is added from scratch, so the line-level
history browser will try code movement and copy matching to find
out whether this line was moved from another file.

And it was. In commit 93fc05eb9 (Split off the pretty print stuff into
its own file), some code was moved from commit.c to pretty.c, and this
line is from commit.c.

Ok, now we will trace into commit.c for this line.
Again:
        char *reencoded;
        const char *encoding;
-       char *buf;

-       if (fmt == CMIT_FMT_USERFORMAT)
-               return format_commit_message(commit, user_format,
buf_p, space_p);
+       if (fmt == ...
From: Jonathan Nieder
Date: Tuesday, March 23, 2010 - 12:57 pm

That might be necessary, but I will admit that I suspect it to be
harder to make useful.  One of the very nice things about ‘git log’ is
that it is easy to browse through history in a nonlinear way in a
pager (by using a pager’s search functionality).  The “backend” ‘git
rev-list’ is easy to write scripts with, also because of its simple
input and output.

If your program requires input from the user, how will it paginate its
output?  Most pagers expect the standard input to be available for
input from the user.

One approach (I will not say it is a good one) to the problem of
ambiguous origins for a line is to blame _both_ parents.  That is,
start following both lines of history in your revision walking.
Perhaps higher-level tools like ‘git log --graph’ and gitk could
visually represent the branched history you are showing.

Another approach is to just choose one parent automatically: for
example, prefer the first parent, or assign some score representing
the relatedness of each parent and choose the most related one.

Jonathan
--

From: A Large Angry SCM
Date: Tuesday, March 23, 2010 - 2:51 pm

What I would like to see (and may be too much for a GSOC project) is the 
  result to be a simplified commit graph with additional annotations of 
the line range mappings that could be fed into something like a modified 
gitk to view the _history_ of the lines of interest.
--

From: Bo Yang
Date: Tuesday, March 23, 2010 - 7:30 pm

Hi,


Both approaches are very valuable to me. I think maybe I will
propose the first one in my real proposal to Git, thanks a lot! You
have really helped me very much! Thanks!

Regards!
Bo
--

From: Peter Kjellerstedt
Date: Tuesday, March 23, 2010 - 5:02 am

Look more closely. Hint: a _ is not the same as a . ;)

//Peter

--

From: Jonathan Nieder
Date: Tuesday, March 23, 2010 - 11:57 am

Hi,

[reordering quoted text for convenience]


Thanks!  What you said is much more coherent than the vague things I

If the code is copied verbatim from elsewhere, this is something ‘git
blame’ is already very good at.  See [1].

Fuzzy matching is a big pain.  ‘git blame’ knows how to ignore
whitespace.  Dscho suggested counting common words.  Maybe there are
some other ways.  I think there is a real danger of getting lost in this
problem and wasting a lot of time, so although it is very interesting, I

If you can make a heuristic along these lines work well, I think it
would be great.  I imagine it might work very well for commits that made
nice, small changes (like many of those in git.git).  Jakub pointed out
some of the difficulties, and I like to hope your idea of “when in doubt,
include more lines” may work well in many cases in git.git still.

Good luck, and thank you for taking my crazy ideas seriously. :)

Regards,
Jonathan

[1] See v1.4.4-rc1~2 (Merge branch 'jc/pickaxe', 2006-11-07) and the
commits preceding it.  About that series, Junio wrote:

	Actually the plan is to make it do _true_ pickaxe,
	although it will most likely end up either in dustbin or
	replace blame.

It replaced blame.

I am not actually sure, but I assume “true pickaxe” refers to the
goals described in <http://gitster.livejournal.com/35628.html>
and the linked-to message.
--

From: Bo Yang
Date: Tuesday, March 23, 2010 - 7:39 pm

Hi,


I have looked over the article and the message from Linus; they really
helped me very much. The message and article point out most of the
things a line-level tool should do, and I am happy to find that it is
similar to my proposal. :) Thanks again for your valuable advice;
I think I can come up with a better proposal now. Thanks!

Regards!
Bo
--

From: Jonathan Nieder
Date: Tuesday, March 23, 2010 - 9:02 pm

Okay, so now I looked over that thread again.  I found this [1]:

  <http://minnie.tuhs.org/Programs/Ctcompare/index.html>

It’s for fuzzy matching of a certain kind.  The latest version is under
the GPLv3, unfortunately for us.  I would still like to reiterate my
warning to not get sidetracked on this, but maybe it would be pleasant
reading.

Enjoy,
Jonathan

[1] Thanks, Linus.
http://thread.gmane.org/gmane.comp.version-control.git/27/focus=225
--

From: Alex Riesen
Date: Monday, March 22, 2010 - 3:39 am

But then, how about putting the "path" last in the argument,
so that the unambiguously defined part of the format comes first?
Less need for quoting of ":" (or "@") in pathnames.
--

From: Johannes Schindelin
Date: Monday, March 22, 2010 - 8:05 am

Hi,


Yes. Besides, it is an easy fall-out of the common "a Java class was split 
into two" case, where you follow line ranges in different files (at least 
at some stage) _anyway_.

Ciao,
Dscho
--

From: Bo Yang
Date: Sunday, March 21, 2010 - 8:52 pm

Hi all,

     Thanks a lot for your valuable advice. Based on it, I have
prepared a new version of my proposal. Generally, it provides detailed
options which I want to add to 'git log' and a new syntax for
supporting multiple line ranges in any file at any revision. This
version also provides milestones and a timeline for this project. Thanks
again for your advice; I would appreciate your feedback on this version
very much.

-----------------------------------------------------------------------
Draft proposal(v2): Line-level History Browser

=====Purpose of this project=====
"git blame" can tell us who is responsible for a line of code, but it
cannot show in detail how the lines of code have evolved into what
they are now.
This project will add a new feature to 'git log' to display line-level
history. It can trace the history of any line range of a certain
file at any revision. For simplicity, a user can run a command like '
git log -L builtin/diff.c:6,8 ' to get the change history of the
code between line 6 and line 8 of the diff.c file. For each
history entry, it will provide the commit and the diff block which
contains the changes to the lines the user is interested in.
This utility will trace the whole modification history of the lines of
interest and stop when it finds the root of the lines, which is the
point where all the new code is added from scratch. Also, the user can
specify how deeply this utility should trace. This tool will
treat code movement just like modification, so it will follow code
movement inside one commit.
Note that the history may not always be a single thread of commits.
If more than one commit produced the specified line
range, the thread of history will split. In that case this utility will
stop and present all the commits with their code changes to the user,
and let the user select which one to trace next.

=====Work and technical issues=====
==Command options==
This new feature should be used for exploring the history of changes
for ...
From: Jakub Narebski
Date: Monday, March 22, 2010 - 8:48 am

I think that, at least at first, line-level log should follow the
git-blame, i.e.

  git log -L <begin>,<end>  <revs>  -- <file>

If we want (in the future) to follow history of some lines from one
file, and other lines from other file together, we do not need to use

  -L <file>:<begin>,<end>

syntax.  If parseopt allows, we can use position of parameters, i.e.


The most important *new* algorithm you need to implement is, after
finding (blame-like) the commit that created given version of given
line, what was previous version of given line and which line that was.

You can probably find some heuristic in existing merge tools, like
emerge from GNU Emacs, or graphical diff tools.

-- 
Jakub Narebski
Poland
ShadeHawk on #git
--

From: Johannes Schindelin
Date: Monday, March 22, 2010 - 11:21 am

Hi,


Oh, is it bikeshedding time already? /me might have missed the start 

I do not think that these tools can help, as they never look further than 
identical lines (and they mustn't, either).

More importantly, the first step really is about driving the libxdiff in 
such a way that you can recognize the exact same lines.

(One point to note for the technical details: the algorithm has to expect 
opposite code moves, i.e. it must cope well when the diff shows the code 
in question removed in one hunk and added in another.)

We also should not get ahead of ourselves, but allow the student to get a 
full understanding of the requirements, from which he can then make a 
project plan (with milestones, Christian, no problem).

BTW by "requirements" I do not mean something as technical as the syntax, 
but rather a definition what people should be able to expect to do with 
this at the end of the summer.

As to fuzzy matching of lines that could not be attributed otherwise, I 
think that that will require a lot of playing around with different ideas. 
A simple Levenshtein-Damerau is highly unlikely to be enough.

Ciao,
Dscho

--

From: Sverre Rabbelier
Date: Monday, March 22, 2010 - 11:38 am

Heya,

On Mon, Mar 22, 2010 at 19:21, Johannes Schindelin

I'd recommend making this either the last milestone, or not a
milestone at all. As I noticed with git-stats such metrics might not
exist at all (or at least be too hard to find/implement), and it's
quite a bummer to not be able to implement your primary milestone ;).

-- 
Cheers,

Sverre Rabbelier
--

From: Johannes Schindelin
Date: Monday, March 22, 2010 - 12:26 pm

Hi,


Indeed. TBH, I wanted to ask you to assist in that part of the project. 
You probably can give a good overview over what does not work, and why.

Ciao,
Dscho

--

From: Sverre Rabbelier
Date: Monday, March 22, 2010 - 1:21 pm

Heya,

On Mon, Mar 22, 2010 at 20:26, Johannes Schindelin

Back then I think we even talked about teaching git log to find code
moves? I have some silly code online on repo.or.cz even. maybe.
Anyway, my main problem there was finding a heuristic that would give
a sensible answer both in small _and_ large moves. It might be worth
investigating two or more metrics instead: one that works for (very)
small chunks of code, and thus requires an almost exact match, then
perhaps a somewhat linear function (the longer the block moved, the
more 'fuzz' you allow), and maybe after some size, say practically
full-file moves, use an algorithm similar to what rename detection
does. </braindump>
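
The three regimes could be sketched roughly as follows. Every number here
(small, large, max_fuzz) is an arbitrary placeholder, not a tuned value, and
pick_strategy is a made-up name; "almost exact" is simplified to "exact":

```python
def pick_strategy(block_len, small=3, large=200, max_fuzz=0.4):
    """Choose a matching strategy and allowed mismatch fraction by block size."""
    if block_len <= small:
        # Very small chunks: allow no fuzz at all.
        return ("exact", 0.0)
    if block_len < large:
        # The longer the moved block, the more fuzz is allowed (linear ramp).
        fuzz = max_fuzz * (block_len - small) / (large - small)
        return ("linear", fuzz)
    # Practically full-file moves: hand over to a rename-detection-like
    # similarity score instead of per-line matching.
    return ("rename-like", max_fuzz)
```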

-- 
Cheers,

Sverre Rabbelier
--

From: Johannes Schindelin
Date: Monday, March 22, 2010 - 12:24 pm

Hi,


I would not be too specific here about the exact syntax. I would rather 
have an example where this might be useful.

In git.git, for example, you could point to pretty_print_commit() which 
was split out from commit.c into pretty.c in 93fc05e(Split off the pretty 
print stuff into its own file), and mention that it is hard to verify 
without much hassle that the code split was really only a code split, 
rather than a split with an evil change.

Or you could point to 691f1a2(replace direct calls to unlink(2) with 
unlink_or_warn), where code was refactored, into a new function 
(unfortunately in two commits, so it might be a case not covered by your 
project) and it might be somebody's task to find out the original author 
for that function.

Basically, I would like to have a structure in the proposal like this: 

Do not forget the case where there are more than one source of a code 

I would like this not to be specified too much here. For example, we do 
not know yet, whether the matching will be fuzzy, or whether we find 
something cleverer than that.

So, I suggest to list not the command line options, but what you intend to 

Here you do not need to say that it is -m<num>, but that you want to 
support following code movements both inside and between files, but only 
optionally, for performance reasons (or some such).


It would be more in line with the diff options to use -U, but you do not 

Again, there are better options for "git log" already, but you do not need 
to be too explicit on the syntax side. Just say that you want to be able 
to use as many of "git log"s options as make sense in the context of 




Again, do not be too specific about details that have to be fleshed out 
while working on the project. For example, we do not know yet whether it 
would make more sense to look for code movements automatically when we 
detected a deletion, and maybe fall back automatically to detecting code 




IMHO this should be split into

	1a) have ...
From: Bo Yang
Date: Monday, March 22, 2010 - 11:08 pm

Yeah, I really did overlook that case. Thanks a lot!
And any newly added code can be moved/copied from multiple sources. This


I will make the milestones and timeline more specific, thanks!

Regards!
Bo
--

From: Bo Yang
Date: Monday, March 22, 2010 - 11:27 pm

I am afraid not. I mean lines copied from other files that were
modified in the same commit. Just what 'blame' means with one '-C'

I think fuzzy matching is used to track copies/moves of multiple
lines, even when the source has changed a little.
For example, a C function is moved from file1 to file2 and gets
renamed. In this case, most of the original code of the function body will
remain unchanged except for the function name. So simply comparing the newly
added lines with the original code line by line, and permitting some percentage
of mismatch, will help to find this kind of movement.
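
As a minimal sketch of this line-by-line comparison (the function name and the
25% default threshold are only assumptions for illustration, and it only
handles blocks of equal length):

```python
def is_fuzzy_move(added_lines, candidate_lines, max_mismatch=0.25):
    """Return True if the added block matches the candidate source block,
    allowing up to max_mismatch (a fraction) of the lines to differ."""
    if not added_lines or len(added_lines) != len(candidate_lines):
        return False
    # Compare line by line, ignoring leading/trailing whitespace.
    mismatches = sum(
        1 for a, b in zip(added_lines, candidate_lines) if a.strip() != b.strip()
    )
    return mismatches / len(added_lines) <= max_mismatch


original = ["int foo(int x)", "{", "    x += 1;", "    return x;", "}"]
renamed = ["int bar(int x)", "{", "    x += 1;", "    return x;", "}"]
print(is_fuzzy_move(renamed, original))
```

Here only the line with the function name differs, which is 1 line out of 5
and so within the threshold.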


Regards!
Bo
--

From: Bo Yang
Date: Sunday, March 28, 2010 - 9:14 pm

Hi Thomas,

You mean the dates? They are made up according to 'GSoC's timeline'
and my estimate of the workload of each milestone.

This is the draft proposal; after a long thread of discussion, the
timeline and milestones have changed a lot.  The fuzzy matching milestone
will become a bonus milestone instead of a primary GSoC milestone. I
think it may help if I provide the newest version, so I have pasted it at
the end of this email.

And I will appreciate any feedback from you. Especially about the
implementation section :)


Regards!
Bo

-------------------------------------------------------------------------
Draft proposal(v3): Line-level History Browser

=====Purpose of this project=====
"git blame" can tell us who is responsible for a line of code, but it
can't help if we want to see in detail how the lines of code have
evolved into what they are now. For example, in Git, commit 93fc05e (Split
off the pretty print stuff into its own file) split out
pretty_print_commit() from commit.c into pretty.c, and it is hard to
verify without much hassle that the code split was really only a code
split, rather than a split with an evil change.

This project will add a new feature to 'git log' to display line-level
history. It can trace the history of any line range of a given
file at any revision. For each history entry, it will provide the
commit and the diff block that contains the changes to the lines the
user is interested in.

This utility will trace the full modification history of the lines of
interest and stop when it finds the root of the lines, which is the point
where all the code was added from scratch. Users can also
specify how deep they want the utility to trace. The tool will
also follow code movement and copying inside one commit.

Note that the history may not always be a single thread of commits.
If more than one commit produces the specified line
range, or there is more than one source of a code move/copy, the thread
of history will ...
From: Thomas Rast
Date: Monday, March 29, 2010 - 11:42 am

Is this really the right use-case?  AFAICT the answer to the implied
question is given by simply running 'git blame -M 93fc05e:pretty.c'.

(Coming up with a better example should be easy; the way I currently
think of the feature means that it will mostly replace git-blame for

I would, by far, prefer the latter.  So far 'git log' has always been
noninteractive, and there's no really good way to make it interactive
because it also goes through the pager.  (In the case of blame this is
solved in 'git gui blame', which might also be a reasonable approach.)

OTOH, if you can really fake a history walk, then just about any
log-oriented tool should be able to work with it.  You'd get graphical
output for free with gitk and git log --graph.  I haven't really

I would prefer if you could inline a short example, perhaps starting
at your second diff snippet.  Examples are good ;-)

Even if not, please drop the /match= parameter since it is very


One thing that IMO is missing from this list, is a plumbing mode that
just feeds the raw data to a (presumed) frontend.  It could be as
simple as supporting

  git log -L ... --pretty=raw --raw

or similar, if this provides sufficient information.  Compare 'git

This section is too handwavy for my taste.  I think in most cases you
say "we can" when you really mean "git-blame already does it, so we
can just use a similar algorithm".  Which is fine, but I'd rather see

I agree with what Dscho pointed out earlier in the thread: multiple
ranges will be an easy exercise once you can follow a "blame split"
where half the lines blame to some file and half the lines blame to
another.

Other than that I think the milestones look sensible.  As a theory
guy, I'm not a huge believer in timelines, so let's hope someone else

Push the code somewhere public as you go, even between feature
completions.  Post RFCs once you have workable features so people can
comment.  Generally try to be visible.

Bonus points if you can think of something visible ...
From: Bo Yang
Date: Monday, March 29, 2010 - 7:52 pm

Hi Thomas,






Changed in the next version to make this clear. But I only added some words to

Yeah, that really is a good point. I have been playing around on
github.com and have tried to set up http://github.com/byang/my_git for this

Thanks a lot for this good advice, I will do so.

With this feedback, I think I can put together a complete version of the
proposal and submit it to Google. Thanks!

Regards!
Bo
--

From: Michael J Gruber
Date: Tuesday, March 30, 2010 - 2:07 am

You may want to create your repo as a fork of gitster/git instead.
That's easier on github, they have a hard time anyways these days ;)
Seriously, it helps making use of their network feature etc.

I don't have anything to add to your proposal (I like it), but I'll be
at NKU next week (Conference @ Chern Institute) so drop me a PM if you wish.

Cheers,
Michael
--

From: Michael J Gruber
Date: Tuesday, March 30, 2010 - 2:38 am

Actually, make this git/git, the other one isn't being updated... Sorry!

Michael
--

From: Bo Yang
Date: Tuesday, March 30, 2010 - 4:10 am

Hi Michael,

On Tue, Mar 30, 2010 at 5:07 PM, Michael J Gruber


That is really a big coincidence. :)
I would be very glad to meet you at NKU, and I think I can be your guide
at NKU and to some beautiful spots in Tianjin if you have spare time. :)
Anyway, let us talk about this in personal email off the list. :-)

Regards!
Bo
--

From: Jakub Narebski
Date: Tuesday, March 30, 2010 - 2:10 am

By the way, it would be good to find an example with "evil merge",
which means that the change to given line(s) is in the merge commit
itself.  Correctly simplifying history in such case might be
non-trivial.

Another example that it would be good to have is "history split"
example, which means the case where some lines were consolidated
(e.g. after refactoring), and some of lines in "preimage" come
from different lines of history.

This would help with writing tests for this feature (compare tests
for blame), although they are not in my opinion necessary for the
proposal itself.
 
I hope that all these cases would fall out naturally from the
implementation.


my_git is not very descriptive... well, unless you would do your work
on GSoC2010/line-level-history-browser branch, or something like that.

It might be good idea to have repo.or.cz as an additional repository,
as a fork of git.git repo, and with SoC / GSoC labels.  See
http://repo.or.cz/w/git.git/forks?t=soc

-- 
Jakub Narebski
Poland
ShadeHawk on #git
--

From: Bo Yang
Date: Tuesday, March 30, 2010 - 4:15 am

Hi Jakub,


It is a little time-consuming to find such a change in the history. I
think we can come up with some manually at the start of the project

Ah, a repo at http://github.com/byang/gsoc-line-browser has been created,
with a mirror at http://repo.or.cz/w/gsoc-line-browser.git. I think
this is enough. :-)

Thanks!
Bo
--
