Hi,

I am very interested in the project 'Line-level history browser'. After some days of consideration, I have put together a draft of my proposal, and I think it is helpful to send it to the list before submitting it. Could you please give me some advice?

-----------------------------------------------
Draft proposal: Line-level History Browser

=====Purpose of this project=====

"git blame" can tell us who is responsible for a line of code, but it can't help if we want to see in detail how the lines of code have evolved into what they are now.

This project will add a new utility for git called 'git line-log'. It can trace the history of any line range of a certain file at any revision. For simplicity, users can run a command like ' git line-log builtin/diff.c 6..8 ' to get the change history of the code between line 6 and line 8 of the diff.c file. For each history entry, it will provide the commit and the diff block which contains the changes to the lines the user is interested in.

This utility will trace the whole modification history of the lines of interest and stop when it finds the root of the lines, which is the point where all the new code was added from scratch. Users can also specify how deeply they want the utility to trace. The tool will treat a code move just like a modification, so it will follow code moves inside one file.

Note that the history may not always be a single thread of commits. If more than one commit produced the specified line range, the thread of history will split. The utility will then stop, show all the commits with their code changes to the user, and let the user select which one to trace next.

=====Work and technical issues=====

==Command options==

This new tool should be used for exploring the history of changes for a certain line range of code in one file.

git line-log [options] <file> <line range>

Options:
1. Since it will output commit description, it will contain the option used to control whether we should show the whole commit ...
Hi,

I like it very much already! You obviously put in a substantial amount of time to learn intricate details about the way Git operates, and what is already available. And you also provided a patch (unrelated to line-level history browser), so you proved that you actually cloned Git, and that you can actually patch it and use Git itself to send a patch to this list. Very good.

I think that that might be good for starters, but one could imagine that an integration into "git log" might be even better, so that gitk can use

It would be good if the code looked harder after failing with the simple strategy, such as looking for code removed in other files, fuzzy matching (optional), and looking for code duplication (i.e. literal copying, or slightly modified copying).

The fuzzy matching might be necessary to catch things like a Java class moving from one file into another (and changing its name): the first line

Almost. Just have a look at the word-level diff (--color-words):

http://repo.or.cz/w/git/dscho.git/blob/bc1ed6aafd9ee4937559535c66c8bddf1864bec6:/diff....

You will see that there is a function fn_out_diff_words_aux(), which is passed to xdi_diff_outf(). That latter function calls xdiff such that the former function receives a complete line at a time. And this is what I would suggest doing in the line-level log, too.

Ciao,
Dscho
--
Hi Johannes,

That's really a good idea. So, when the program reaches the end of the history thread for some line range, it should not stop immediately. It should then do a harder code search and try to find out whether the newly added lines were moved there or copied there from another place. And this kind of search should use fuzzy matching instead of exact string matching.

But notice that detecting code movement within one commit is much more efficient than detecting code copies. So I think we should add an option to control whether we detect this kind of code copy. By default, we

I have looked over the function fn_out_diff_words_aux; it parses each line of an in-memory diff. We can use it to detect the diff hunk header and find the line changes. If you think the performance is acceptable, I think using this callback mechanism is all right.

Regards!
Bo
--
My blog: http://blog.morebits.org
--
If you are hooking into "git log", it already has "-M / -C / -C -C" as a notion to express "different levels of digging" to find code movement and copies, and so does "git blame".

You will probably save a lot of time if you study the current blame implementation thoroughly before designing or coding. Two things that you need to think about carefully are why "blame" stops at the commits it shows, and, if you could "peel" these lines in its output to peek at what is behind them, what you would see.

This is not a rocket science topic, but it is not entirely trivial.
--
Hi Junio,

Yes, both blame and log have such '-M/-C/-C -C' options. But their meanings are not quite the same:

For 'git log': -M is used to detect file renames, -C is used to trace code copies. Both options accept no argument.

For 'git blame': -M is used to trace code moves, -C is used to trace code copies. And both options accept a <num> which specifies the lower bound of the 'same code characters'.

And I think the line-level history tool acts more like 'git blame'. So the '-C' option of 'git log' is exactly what we need, but '-M' is not. So I think maybe we should add another '-m' option to 'git log' for line-level code movement detection.

I have had a rough look over blame.c. It is really very helpful, and I find I can borrow some code from 'git blame' to build the line-level history browser.

I think blame's purpose is to find who is responsible for which line of code, so it stops after it finds the origin of the code. The line-level history browser will continue further back into history from what blame found: it will find what the line was before this commit, and go backward through history from that earlier line to get an even older state, and so on. Simply put, it is something like running 'git blame' recursively. :)

Thanks again for your advice, I have learned a lot from your feedback, thanks!

Regards!
Bo
--
Hi, [please do not cull the Cc: list] Yes, I think that this should be the target for the user interface. However, the logic should be different enough to merit a completely new Yes, it is much more difficult, and it is more expensive. So: there are several steps in the project (you could also call them "milestones"), and fuzzy matching end lines would come later than simple code movement. And Yes, I think that the performance is alright there, it works well enough for --color-words. Thanks, Dscho --
Ok, I will add some milestones to the next version of my proposal, thanks.

Regards!
Bo
--
You might want to reconsider the line range syntax. Exactly the same syntax is already used to specify a commit range, so reusing it may lead to confusion. --
I would actually recommend you take a look at -L option from blame. What I use most often and find very handy myself is this pattern: blame -L '/^void some_function()/,/^}/' -- path as I do not have to count the line numbers. There also was a discussion on allowing more than one -L to blame, which I think is applicable to this feature. Check the list archive for the past few months. --
I have looked at that option and find it very convenient, and I think it is reasonable for 'git blame' to allow more than one -L to let users see more than one block of code. But for a tool used to explore history, I think the user mostly focuses on one thread of history. If the history splits at some point, we should ask the user to choose one to go on with. So I think the line-level browser need not support such a thing. :)

Regards!
Bo
--
I, actually, think the proposed line range syntax works because it uses the same _range_ notation. The issue is how to differentiate the _line_ range(s) from the _commit_ range(s); and, yes, I would like multiple ranges of each type as well as multiple files. --
As I said in my previous post, I think we should adopt the 'git blame' way: use '-L <start pos>,<end pos>' to specify the line range. It supports both line numbers and POSIX regexes.

As for multiple ranges, I don't think it is very useful to support them in a history browser. After all, our users can only focus on one thread of history at a time. I am very willing to hear what your use case for multiple ranges is.

Thanks for your precious advice!

Regards!
Bo
--
Bo Yang wrote: More than one line range can be related and of interest to a forensics/archeology task. In a simple multi range case, you'd have 2 line ranges in the same file that you want to see the history and graph of. Such as 2 related macro definitions in a header file. In a complex multi range case, you'd have many line ranges spread over multiple blobs and some of the blobs have disjoint commit graphs. The complex multi range case may be too much for a GSOC project, and the simple multi range case may be also. However, the command syntax should be general enough to handle them without being too ugly so that the implementation could be improved and expanded later. --
Yeah, what do you think of the following syntax:

<file1>@<rev1>:<start pos>,<end pos> <file2>@<rev2>:<start pos>,<end pos>

Thanks!

Regards!
Bo
--
Horrible. That is not how we name things.
What's wrong with bog standard:
$ git log -L 10,20 master -- Documentation/git.txt
which is exactly how "blame" does it?
--
The 'blame' way is very good if we only support one line range. But if we want to support multiple line ranges, I don't think it is suitable for that case. Anyway, how can I specify multi-ranges which refers to multiple files at multiple revision and multiple line ranges using above syntax? Except that, I still can't convince myself that we need multiple ranges support. Anyway, how do we display such a result to our users? Regards! Bo --
I would sort of see you may want to be able to say "explain lines 10 thru 15 of config.h and lines 100-115 of hello.c that appear in v1.2.0", but I think it is total nonsense to ask for "ll 10-15 of config.h in v1.2.0 and ll 110-115 of hello.c in v1.0.0". After all they never existed in the same revision (otherwise you would have said "ll 7-13 of config.h and ll 110-115 of hello.c that appear in v1.0.0"). So I would reject the SVN-like "rev@" in the first place.

While I don't seriously buy "multiple files" either, if that is really needed, I could be persuaded with "log -- path1:10-15 path2:1-7", or "log -L path1:10-15 -Lpath2:1-7 -- path1 path2" or something similarly ugly like these, but that is not how we generally name things, and it probably shouldn't be a new option to "log" anymore.

On the other hand, multiple ranges in a single file is something that may be quite reasonable, e.g.

$ git log -L10-15 -L200-210 -- Makefile
$ git log -L'*/^#ifdef WINDOWS/,/^#endif \/\* WINDOWS \/\*/' -- config.h

As I already said, I wouldn't be so worried about the multiple-range feature, but I would be worried about the usefulness of this feature, even for the case of tracking a single range of a single file, starting from one given revision.

When you want to know where the first few lines of Makefile came from, and if blame says the first line came from 2731d048, that really means that between the revision you started digging from and the found revision, there is no commit that touched that particular line, but equally importantly, that before that found revision, there wasn't a corresponding line in that file---blame stopped exactly because there is nobody before that found revision that the line can be blamed on.

So implementing "git log -L1,10 -- Makefile" might be just the matter of doing something like:

1. Run "git blame -L1,10 -- Makefile";
2. Note the commits that appear in the output;
3. Topologically sort these commits;
4. Run "git show <the result of that ...
I am sorry, but I did not follow you here. Are you worried about the usefulness of the multi-range feature or of the line-level history browser itself? I think tracking a single range of a single file, starting from one given revision, is useful when the line of history splits at some point. This can let users focus on a single line of history using this

Yes, this is not satisfying. But as I understand it, the line-level history browser will do more than just this. It will not stop at 'step 4'; it can follow the change history recursively and deeply, to find more. I think this is useful when we focus on just one range of code and want to know how it came to be in its current state. Anyway, it is not a bad thing to add a new convenient feature to a daily tool. :)

Regards!
Bo
--
I am actually questioning the existence of "recursively and deeply to find more"; the reason blame stopped at a particular commit is exactly because there is no more---otherwise it wouldn't have stopped there but kept digging deeper. That is what I meant in the message you are responding to, quoted at the top of this message. --
I think an example may explain this well.

commit 1 of the file:
line 1 rev 1
line 2 rev 1

commit 2 of the file:
line 1 rev 2
line 2 rev 2

commit 3 of the file:
line 1 rev 3
line 2 rev 3

If we run 'git blame file', it will show that both lines are blamed on commit 3. The line-level utility will also show rev 2 and rev 1 to the user, in a format like the one git log provides.

I think git blame focuses on who produced the current code range. The line-level browser will provide more than that: it also answers how the lines evolved into their current state.

I hope I have explained everything clearly. :)

Regards!
Bo
--
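The "blame, then keep going" walk in this example can be sketched roughly as follows. This is only an illustration in Python on in-memory file versions, not how git would implement it: `line_history` is a made-up name, Python's `difflib` stands in for xdiff, and following a changed line "at the same offset" is a simplifying assumption.

```python
import difflib


def line_history(versions, line_index):
    """Return indices of the versions that touched one line, newest first.

    `versions` is a list of file contents (each a list of lines), oldest
    first. `line_index` is the 0-based position of the line in the newest
    version.
    """
    touched = []
    for rev in range(len(versions) - 1, 0, -1):
        old, new = versions[rev - 1], versions[rev]
        matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if j1 <= line_index < j2:
                break
        else:
            raise IndexError("line_index is outside the file")
        if tag == "equal":
            # Untouched by this version: translate the position and go on.
            line_index = i1 + (line_index - j1)
            continue
        touched.append(rev)
        if tag == "insert":
            # Added from scratch here: this is where blame would stop too.
            return touched
        # 'replace': assume the line at the same offset is the predecessor,
        # clamped to the replaced range of the older version.
        line_index = min(i1 + (line_index - j1), i2 - 1)
    touched.append(0)  # whatever is left was introduced by the first version
    return touched
```

With the three versions from the example (indices 0, 1, 2 for commits 1, 2, 3), tracking the first line reports all three versions, whereas blame alone would name only the newest one.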
Hmm, I can imagine some (mutually inconsistent) heuristics:

- Suppose in the blamed commit a single isolated line changed. Then it is clear where to look next.
- If the mystery code is at the beginning of the file (resp. beginning of a diff -C0 hunk), maybe it was based on the line at the same position within the previous commit.
- Take the line with the lowest Levenshtein distance from the mystery code.
- Expect certain common patterns of change: substituted words, whitespace changes, added arguments for a function, things like that.

That said, I still don’t have a clear picture of a basic strategy.

Interested,
Jonathan
--
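The Levenshtein heuristic, for instance, could look something like the sketch below. It is a plain Python illustration, not anything git does today; the names `levenshtein` and `closest_line` are invented for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def closest_line(mystery, candidates):
    """Return (index, distance) of the candidate nearest to `mystery`.

    Leading and trailing whitespace is ignored; ties go to the earliest
    candidate.
    """
    if not candidates:
        raise ValueError("no candidate lines to compare against")
    distances = [levenshtein(mystery.strip(), c.strip()) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return best, distances[best]
```

A raw edit distance favours short candidates, so a real heuristic would probably need to normalise by line length; that is left out here.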
I can't fully understand your strategy above. I think we can
divide code changes into two cases:
1. The diff looks like:
@@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
add_signoff = xmemdupz(committer, endpos - committer + 1);
}
- for (i = 0; i < extra_hdr_nr; i++) {
- strbuf_addstr(&buf, extra_hdr[i]);
+ for (i = 0; i < extra_hdr.nr; i++) {
+ strbuf_addstr(&buf, extra_hdr.items[i].string);
strbuf_addch(&buf, '\n');
}
i.e. there are both deletions and additions in a change, which means
some lines of the code were modified. So what we do is trace the two
'minus' lines until we find another diff, and then continue tracing
from that diff recursively.
Yes, the newly added code may also have been moved or copied from
another place. But I think here we should focus on the lines before
this changeset.
2. The diff looks like:
@@ -879,9 +885,12 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
opt.regflags = REG_NEWLINE;
opt.max_depth = -1;
+ strcpy(opt.color_context, "");
strcpy(opt.color_filename, "");
+ strcpy(opt.color_function, "");
strcpy(opt.color_lineno, "");
strcpy(opt.color_match, GIT_COLOR_BOLD_RED);
This means, the code here is added from scratch. Here, I think we have
three options.
1. Find if the new code is moved here from other place.
2. Find if the new code is copied from other place.
3. We find the end of the history, so stop here.
The remaining problem is how to find the copied/moved code. The newly
added code may have been copied/moved from multiple places with small
changes.
I hope I have understood the requirements of the line-level browser;
could you please point it out if I have made some mistake?
Regards!
Bo
--
If I understand correctly, it is as follows.
@@ -1008,29 +1000,29 @@ int cmd_format_patch(int argc, const char **argv, const char *prefix)
add_signoff = xmemdupz(committer, endpos - committer + 1);
}
- for (i = 0; i < extra_hdr_nr; i++) {
- strbuf_addstr(&buf, extra_hdr[i]);
+ for (i = 0; i < extra_hdr.nr; i++) {
+ strbuf_addstr(&buf, extra_hdr.items[i].string);
strbuf_addch(&buf, '\n');
}
Here, the user only asks to track the strbuf_addstr line, and we
find the above diff hunk. I think we can then find where the line would
be in the preimage using @@ -1008,29 +1000,29 @@. The strbuf_addstr
line is located at
1000(the postimage start line number)
+3(the context number)
+1(the number of lines '+' before this line) in the postimage,
and we can calculate its line number in the preimage by the same way
1008
+3
+1(the number of lines with '-' before this line).
What do you think of this method?
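The calculation above can be written down as a small sketch. It is plain Python working on the text of one unified-diff hunk; `pair_by_position` is a made-up name, and pairing the k-th added line with the k-th removed line of the same change block is the assumption being illustrated, not an established git algorithm.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def pair_by_position(hunk_lines):
    """Map postimage line numbers of '+' lines to preimage line numbers.

    Within each run of '-' lines followed by '+' lines, the k-th added
    line is paired with the k-th removed line. Added lines without a
    counterpart map to None.
    """
    header = HUNK_HEADER.match(hunk_lines[0])
    if header is None:
        raise ValueError("expected a unified diff hunk header")
    old_no, new_no = int(header.group(1)), int(header.group(2))

    mapping = {}
    removed, added = [], []  # line numbers of the current change block

    def flush():
        # Pair the collected lines of one change block by position.
        for k, new_line in enumerate(added):
            mapping[new_line] = removed[k] if k < len(removed) else None
        removed.clear()
        added.clear()

    for line in hunk_lines[1:]:
        if line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        if line.startswith("-"):
            if added:
                flush()  # a '-' after '+' lines starts a new block
            removed.append(old_no)
            old_no += 1
        elif line.startswith("+"):
            added.append(new_no)
            new_no += 1
        else:
            flush()  # a context line ends the current block
            old_no += 1
            new_no += 1
    flush()
    return mapping
```

With three context lines before the change, as counted in the message, the added strbuf_addstr line lands on postimage line 1004 and is paired with preimage line 1012.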
Regards!
Bo
--
On Tue, 23 Mar 2010, Bo Yang wrote:
Please do not forget to include attribution line, like the one I have
This would work with the simplest case, but not in more complicated
cases, like for example preimage and postimage with different size.
Take for example the following chunk (fragment):
diff --git a/run-command.c b/run-command.c
index 2feb493..3206d61 100644
--- a/run-command.c
+++ b/run-command.c
@@ -67,19 +67,21 @@ static int child_notifier = -1;
static void notify_parent(void)
{
- write(child_notifier, "", 1);
+ ssize_t unused;
+ unused = write(child_notifier, "", 1);
}
static NORETURN void die_child(const char *err, va_list params)
If you follow ssize_t line, it is created. If you follow line with
write, which is 2nd line in postimage, its previous version is 1st
line in preimage.
Another example would be reordering of lines, or reordering with
some change.
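One way to cope with a preimage and postimage of different size is to pair lines inside a change block by similarity rather than by position. The sketch below is only an illustration in Python: `pair_by_similarity` and the 0.6 threshold are assumptions made for the example, and `difflib.SequenceMatcher` stands in for whatever matching a real implementation would use.

```python
import difflib
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def pair_by_similarity(hunk_lines, threshold=0.6):
    """Map each '+' line to the most similar '-' line of the same block.

    Returns {postimage_line_no: preimage_line_no or None}. None means no
    removed line reached `threshold`, so the line counts as newly created.
    Two added lines may end up pointing at the same removed line.
    """
    header = HUNK_HEADER.match(hunk_lines[0])
    if header is None:
        raise ValueError("expected a unified diff hunk header")
    old_no, new_no = int(header.group(1)), int(header.group(2))

    mapping = {}
    removed, added = [], []  # (line number, text) pairs of the current block

    def flush():
        for new_line, new_text in added:
            best_no, best_ratio = None, 0.0
            for old_line, old_text in removed:
                ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
                if ratio > best_ratio:
                    best_no, best_ratio = old_line, ratio
            mapping[new_line] = best_no if best_ratio >= threshold else None
        removed.clear()
        added.clear()

    for line in hunk_lines[1:]:
        if line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        text = line[1:].strip()
        if line.startswith("-"):
            if added:
                flush()  # a '-' after '+' lines starts a new block
            removed.append((old_no, text))
            old_no += 1
        elif line.startswith("+"):
            added.append((new_no, text))
            new_no += 1
        else:
            flush()  # a context line ends the current block
            old_no += 1
            new_no += 1
    flush()
    return mapping
```

Applied to the run-command.c fragment, taking the fragment as the start of the hunk, this treats the ssize_t line as created and maps the new write line back to the old one. Reordered lines within a block are handled too, since position plays no role in the pairing.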
--
Jakub Narebski
Poland
--
Hi,

Ah, yes, you are right. And now I really see the difference between our understandings of the line-level browser. :) When users want to browse the history of some line or line range, you want to display only the related lines to them, but I want to display the minimum diff hunk to them. :)

And I think displaying the minimum diff hunk is sensible and feasible. Could you please tell me what you think about this?

Regards!
Bo
--
The problem is not what (part of) diff you would display. The problem is with following the history (with history simplification). *After* displaying diff / chunk / chunk fragment, do we further follow history of the whole preimage? Or do we follow history of line pre-change starting from blamed commit? If we *don't* follow the history, how is the line-level browser different from (wrapped) git-blame?

Try to come up with the result of line-level history for some line in git sources "by hand": this would help in the discussion about what the line-level history browser should do, and perhaps even be the first test of it (see e.g. tests for git-blame).

--
Jakub Narebski
Poland
--
Hi,
Thanks for your advice to come up with a real example, Jakub! I can
give a not too trivial one. :)
If you look at line 1032 of pretty.c, you will find a line like:
format_commit_message(commit, user_format, sb, context);
As an example, let us trace the history of this line.
We find that this line first appears in:
@@ -900,18 +900,18 @@ char *reencode_commit_message(const struct commit *commit, const char **encoding
...skipped...
if (fmt == CMIT_FMT_USERFORMAT) {
- format_commit_message(commit, user_format, sb, dmode);
+ format_commit_message(commit, user_format, sb, context);
return;
}
And we should trace the preimage, something like:
if (fmt == CMIT_FMT_USERFORMAT) {
format_commit_message(commit, user_format, sb, dmode);
We will find these below:
@@ -770,7 +775,7 @@ void pretty_print_commit(enum cmit_fmt fmt, const struct com
const char *encoding;
if (fmt == CMIT_FMT_USERFORMAT) {
- format_commit_message(commit, user_format, sb);
+ format_commit_message(commit, user_format, sb, dmode);
return;
}
Again:
+
+ if (fmt == CMIT_FMT_USERFORMAT) {
+ format_commit_message(commit, user_format, sb);
+ return;
+ }
+
Here, we find that the line is added from scratch, so the line-level
history browser will do code movement and copy matching to find out
whether this line was moved from other files.
And it was. In commit 93fc05eb9 (Split off the pretty print stuff into
its own file), some code was moved from commit.c to pretty.c, and this
line came from commit.c.
Ok, now we will trace this line into commit.c.
Again:
char *reencoded;
const char *encoding;
- char *buf;
- if (fmt == CMIT_FMT_USERFORMAT)
- return format_commit_message(commit, user_format,
buf_p, space_p);
+ if (fmt == ...

That might be necessary, but I will admit that I suspect it to be harder to make useful. One of the very nice things about ‘git log’ is that it is easy to browse through history in a nonlinear way in a pager (by using a pager’s search functionality). The “backend” ‘git rev-list’ is easy to write scripts with, also because of its simple input and output.

If your program requires input from the user, how will it paginate its output? Most pagers expect the standard input to be available for input from the user.

One approach (I will not say it is a good one) to the problem of ambiguous origins for a line is to blame _both_ parents. That is, start following both lines of history in your revision walking. Perhaps higher-level tools like ‘git log --graph’ and gitk could visually represent the branched history you are showing.

Another approach is to just choose one parent automatically: for example, prefer the first parent, or assign some score representing the relatedness of each parent and choose the most related one.

Jonathan
--
What I would like to see (and may be too much for a GSOC project) is the result to be a simplified commit graph with additional annotations of the line range mappings that could be fed into something like a modified gitk to view the _history_ of the lines of interest. --
Hi,

Both approaches are very valuable to me. I think I will probably propose the first one in my real proposal to Git, thanks a lot! You have really helped me very much! Thanks!

Regards!
Bo
--
Look more closely. Hint: a _ is not the same as a . ;) //Peter --
Hi,

[reordering quoted text for convenience]

Thanks! What you said is much more coherent than the vague things I

If the code is copied verbatim from elsewhere, this is something ‘git blame’ is already very good at. See [1].

Fuzzy matching is a big pain. ‘git blame’ knows how to ignore whitespace. Dscho suggested counting common words. Maybe there are some other ways. I think there is a real danger of getting lost in this problem and wasting a lot of time, so although it is very interesting, I

If you can make a heuristic along these lines work well, I think it would be great. I imagine it might work very well for commits that made nice, small changes (like many of those in git.git). Jakub pointed out some of the difficulties, and I like to hope your idea of “when in doubt, include more lines” may work well in many cases in git.git still.

Good luck, and thank you for taking my crazy ideas seriously. :)

Regards,
Jonathan

[1] See v1.4.4-rc1~2 (Merge branch 'jc/pickaxe', 2006-11-07) and the commits preceding it. About that series, Junio wrote:

Actually the plan is to make it do _true_ pickaxe, although it will most likely end up either in dustbin or replace blame.

It replaced blame. I am not actually sure, but I assume “true pickaxe” refers to the goals described in <http://gitster.livejournal.com/35628.html> and the linked-to message.
--
Hi,

I have looked over the article and the message from Linus; they really helped me very much. The message and the article point out most of the things a line-level tool should do, and I am happy to find that this is similar to my proposal. :)

Thanks again for your precious advice. I think I can come up with a better proposal now. Thanks!

Regards!
Bo
--
Okay, so now I looked over that thread again. I found this [1]: <http://minnie.tuhs.org/Programs/Ctcompare/index.html> It’s for fuzzy matching of a certain kind. The latest version is under the GPLv3, unfortunately for us. I would still like to reiterate my warning to not get sidetracked on this, but maybe it would be pleasant reading. Enjoy, Jonathan [1] Thanks, Linus. http://thread.gmane.org/gmane.comp.version-control.git/27/focus=225 --
But then, how about putting the "path" last in the argument, so that the unambiguously defined part of the format comes first? Less need for quoting of ":" (or "@") in pathnames.
--
Hi, Yes. Besides, it is an easy fall-out of the common "a Java class was split into two" case, where you follow line ranges in different files (at least at some stage) _anyway_. Ciao, Dscho --
Hi all,
Thanks a lot for your precious advice. Based on it, I have prepared a
new version of my proposal. In general, it provides detailed options
which I want to add to 'git log' and a new syntax for supporting
multiple line ranges in any file at any revision. This version also
provides milestones and a timeline for the project. Thanks again for
your advice; I would appreciate your feedback on this version very
much.
-----------------------------------------------------------------------
Draft proposal(v2): Line-level History Browser
=====Purpose of this project=====
"git blame" can tell us who is responsible for a line of code, but it
can't help if we want to see in detail how the lines of code have
evolved into what they are now.
This project will add a new feature to 'git log' to display line
level history. It can trace the history of any line range of a certain
file at any revision. For simplicity, users can run a command like '
git log -L builtin/diff.c:6,8 ' to get the change history of the
code between line 6 and line 8 of the diff.c file. For each
history entry, it will provide the commit and the diff block which
contains the changes to the lines the user is interested in.
This utility will trace the whole modification history of the lines
of interest and stop when it finds the root of the lines, which is the
point where all the new code was added from scratch. Users can also
specify how deeply they want the utility to trace. The tool will
treat a code move just like a modification, so it will follow code
moves inside one commit.
Note that the history may not always be a single thread of commits.
If more than one commit produced the specified line range, the
thread of history will split. The utility will then stop, show all
the commits with their code changes to the user, and let the user
select which one to trace next.
=====Work and technical issues=====
==Command options==
This new feature should be used for exploring the history of changes
for ...

I think that, at least at first, line-level log should follow git-blame, i.e.

git log -L <begin>,<end> <revs> -- <file>

If we want (in the future) to follow the history of some lines from one file and other lines from another file together, we do not need to use the -L <file>:<begin>,<end> syntax. If parseopt allows, we can use the position of parameters, i.e.

The most important *new* algorithm you need to implement is: after finding (blame-like) the commit that created the given version of a given line, what was the previous version of that line, and which line was it? You can probably find some heuristics in existing merge tools, like emerge from GNU Emacs, or graphical diff tools.

--
Jakub Narebski
Poland
ShadeHawk on #git
--
Hi, Oh, is it bikeshedding time already? /me might have missed the start I do not think that these tools can help, as they never look further than identical lines (and they mustn't, either). More importantly, the first step really is about driving the libxdiff in such a way that you can recognize the exact same lines. (One point to note for the technical details: the algorithm has to expect opposite code moves, i.e. it must cope well when the diff shows the code in question removed in one hunk and added in another.) We also should not get ahead of ourselves, but allow the student to get a full understanding of the requirements, from which he can then make a project plan (with milestones, Christian, no problem). BTW by "requirements" I do not mean something as technical as the syntax, but rather a definition what people should be able to expect to do with this at the end of the summer. As to fuzzy matching of lines that could not be attributed otherwise, I think that that will require a lot of playing around with different ideas. A simple Levenshtein-Damerau is highly unlikely to be enough. Ciao, Dscho --
Heya, On Mon, Mar 22, 2010 at 19:21, Johannes Schindelin I'd recommend making this either the last milestone, or not a milestone at all. As I noticed with git-stats such metrics might not exist at all (or at least be too hard to find/implement), and it's quite a bummer to not be able to implement your primary milestone ;). -- Cheers, Sverre Rabbelier --
Hi, Indeed. TBH, I wanted to ask you to assist in that part of the project. You probably can give a good overview over what does not work, and why. Ciao, Dscho --
Heya, On Mon, Mar 22, 2010 at 20:26, Johannes Schindelin Back then I think we even talked about teaching git log to find code moves? I have some silly code online on repo.or.cz even. maybe. Anyway, my main problem there was finding a heuristic that would give a sensible answer both in small _and_ large moves. It might be worth investigating two or more metrics instead, one that works for (very) small chunks of code, and thus require an almost exact match, then perhaps a somewhat linear function (the longer the block moved, the more 'fuzz' you allow), and maybe after some size, say practical full-file moves, use an algorithm similar to what rename detection does. </brandump> -- Cheers, Sverre Rabbelier --
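The size-dependent tolerance described here might be sketched like this. It is only an illustration in Python: the constants and the names `allowed_fuzz` and `looks_like_move` are invented, small blocks are required to match exactly rather than "almost" exactly to keep the example short, and the 50% cap is borrowed from git's default rename similarity threshold.

```python
import difflib

SMALL_BLOCK = 3   # hypothetical: at or below this many lines, demand an exact match
LARGE_BLOCK = 50  # hypothetical: at or above this, treat the block like a file rename
MAX_FUZZ = 0.5    # hypothetical cap on the tolerated difference


def allowed_fuzz(num_lines):
    """Fraction of a moved block that may differ and still count as a move."""
    if num_lines <= SMALL_BLOCK:
        return 0.0
    if num_lines >= LARGE_BLOCK:
        return MAX_FUZZ
    # Linear ramp between the two regimes: longer blocks tolerate more fuzz.
    return MAX_FUZZ * (num_lines - SMALL_BLOCK) / (LARGE_BLOCK - SMALL_BLOCK)


def looks_like_move(old_block, new_block):
    """Compare two blocks (lists of lines) under the size-dependent tolerance."""
    similarity = difflib.SequenceMatcher(None, old_block, new_block).ratio()
    size = max(len(old_block), len(new_block))
    return 1.0 - similarity <= allowed_fuzz(size)
```

Under these made-up constants, a two-line block must match exactly, while a twenty-line block with one changed line still counts as a move.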
Hi, I would not be too specific here about the exact syntax. I would rather have an example where this might be useful. In git.git, for example, you could point to pretty_print_commit() which was split out from commit.c into pretty.c in 93fc05e(Split off the pretty print stuff into its own file), and mention that it is hard to verify without much hassle that the code split was really only a code split, rather than a split with an evil change. Or you could point to 691f1a2(replace direct calls to unlink(2) with unlink_or_warn), where code was refactored, into a new function (unfortunately in two commits, so it might be a case not covered by your project) and it might be somebody's task to find out the original author for that function. Basically, I would like to have a structure in the proposal like this: Do not forget the case where there are more than one source of a code I would like this not to be specified too much here. For example, we do not know yet, whether the matching will be fuzzy, or whether we find something cleverer than that. So, I suggest to list not the command line options, but what you intend to Here you do not need to say that it is -m<num>, but that you want to support following code movements both inside and between files, but only optionally, for performance reasons (or some such). It would be more in line with the diff options to use -U, but you do not Again, there are better options for "git log" already, but you do not need to be too explicit on the syntax side. Just say that you want to be able to use as many of "git log"s options as make sense in the context of Again, do not be too specific about details that have to be fleshed out while working on the project. For example, we do not know yet whether it would make more sense to look for code movements automatically when we detected a deletion, and maybe fall back automatically to detecting code IMHO this should be split into 1a) have ...
Yeah, I really overlooked that case. Thanks a lot! And any newly added code can be moved/copied from multiple sources. This

I will make more specific milestones and a timeline, thanks!

Regards!
Bo
--
I am sorry, no. I mean lines copied from other files that were modified in the same commit. Just what 'blame' means with one '-C'

I think fuzzy matching is used to track copies/movements of multiple lines, even when the source has changed a little. For example, one C function is moved from file1 to file2 and gets renamed. In this case, most of the original code of the function body will remain unchanged, except for the function name. So, simply comparing the newly added lines with the original code line by line and permitting some percentage of mismatches will help to find this kind of movement.

Regards!
Bo
--
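The line-by-line comparison with a tolerated percentage of mismatches could be sketched as below. This is a simplified Python illustration with invented names (`fuzzy_block_match`, `find_source`) and an arbitrary 20% default; it only compares blocks of equal length, so it would miss copies where lines were inserted or dropped.

```python
def fuzzy_block_match(added, original, max_mismatch=0.2):
    """True if the blocks have equal length and few enough lines differ.

    Lines are compared with surrounding whitespace stripped, and at most
    `max_mismatch` (a fraction) of the line pairs may differ.
    """
    if not added or len(added) != len(original):
        return False
    differing = sum(1 for a, b in zip(added, original)
                    if a.strip() != b.strip())
    return differing / len(added) <= max_mismatch


def find_source(added, candidate_file, max_mismatch=0.2):
    """Return the 0-based offsets in `candidate_file` where `added` fits."""
    size = len(added)
    return [start
            for start in range(len(candidate_file) - size + 1)
            if fuzzy_block_match(added, candidate_file[start:start + size],
                                 max_mismatch)]
```

For the renamed-function case, the only mismatching line is the one carrying the function name, so a six-line function with a changed first line is still found at its old position.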
Hi Thomas, You mean the dates? They are based on the 'GSoC timeline' and my estimate of the workload of each milestone. And this is the draft proposal; after a long thread of discussion, the timeline and milestones have changed a lot. The fuzzy matching milestone will become a bonus milestone instead of a primary GSoC milestone. I think it may help if I provide the newest version, so I have pasted it at the end of this email. I would appreciate any feedback from you, especially about the implementation section :) Regards! Bo ------------------------------------------------------------------------- Draft proposal(v3): Line-level History Browser =====Purpose of this project===== "git blame" can tell us who is responsible for a line of code, but it can't help if we want the details of how the lines of code have evolved into what they are now. For example, in Git, commit 93fc05e (Split off the pretty print stuff into its own file) split out pretty_print_commit() from commit.c into pretty.c, and it is hard to verify without much hassle that the code split was really only a code split, rather than a split with an evil change. This project will add a new feature to 'git log' to display line-level history. It can trace the history of any line range of a given file at any revision. For each history entry, it will provide the commit and the diff block which contains the changes to the lines the user is interested in. This utility will trace the whole modification history of the lines of interest and stop when it finds the root of the lines, which is the point where all the new code is added from scratch. Users can also specify how deep they want this utility to trace. This tool will also follow code movement and copying inside one commit. Note that the history may not always be a single thread of commits. If more than one commit produces the specified line range, or there is more than one source of a code move/copy, the thread of history will ...
Is this really the right use-case? AFAICT the answer to the implied question is given by simply running 'git blame -M 93fc05e:pretty.c'. (Coming up with a better example should be easy; the way I currently think of the feature means that it will mostly replace git-blame for I would, by far, prefer the latter. So far 'git log' has always been noninteractive, and there's no really good way to make it interactive because it also goes through the pager. (In the case of blame this is solved in 'git gui blame', which might also be a reasonable approach.) OTOH, if you can really fake a history walk, then just about any log-oriented tool should be able to work with it. You'd get graphical output for free with gitk and git log --graph. I haven't really I would prefer if you could inline a short example, perhaps starting at your second diff snippet. Examples are good ;-) Even if not, please drop the /match= parameter since it is very One thing that IMO is missing from this list is a plumbing mode that just feeds the raw data to a (presumed) frontend. It could be as simple as supporting git log -L ... --pretty=raw --raw or similar, if this provides sufficient information. Compare 'git This section is too handwavy for my taste. I think in most cases you say "we can" when you really mean "git-blame already does it, so we can just use a similar algorithm". Which is fine, but I'd rather see I agree with what Dscho pointed out earlier in the thread: multiple ranges will be an easy exercise once you can follow a "blame split" where half the lines blame to some file and half the lines blame to another. Other than that I think the milestones look sensible. As a theory guy, I'm not a huge believer in timelines, so let's hope someone else Push the code somewhere public as you go, even between feature completions. Post RFCs once you have workable features so people can comment. Generally try to be visible. Bonus points if you can think of something visible ...
Hi Thomas, Changed in the next version to make this clear. But only add some words to Yeah, that really is a good point. I have played around on github.com and tried to set up http://github.com/byang/my_git for this Thanks a lot for this good advice, I will do so. With this feedback, I think I can put together a complete version of the proposal and submit it to Google. Thanks! Regards! Bo --
You may want to create your repo as a fork of gitster/git instead. That's easier on github, they have a hard time anyways these days ;) Seriously, it helps to make use of their network feature etc. I don't have anything to add to your proposal (I like it), but I'll be at NKU next week (Conference @ Chern Institute) so drop me a PM if you wish. Cheers, Michael --
Actually, make this git/git, the other one isn't being updated... Sorry! Michael --
Hi Michael, On Tue, Mar 30, 2010 at 5:07 PM, Michael J Gruber That is really a big coincidence. :) I would be very glad to meet you at NKU, and I think I can be your guide at NKU and to some beautiful spots in Tianjin if you have spare time. :) Anyway, let us talk about this by personal email off the list. :-) Regards! Bo --
By the way, it would be good to find an example with an "evil merge", which means that the change to the given line(s) is in the merge commit itself. Correctly simplifying history in such a case might be non-trivial. Another example that would be good to have is a "history split" example, which means the case where some lines were consolidated (e.g. after refactoring), and some of the lines in the "preimage" come from different lines of history. This would help with writing tests for this feature (compare the tests for blame), although they are not, in my opinion, necessary for the proposal itself. I hope that all these cases would fall out naturally from the implementation. my_git is not very descriptive... well, unless you would do your work on a GSoC2010/line-level-history-browser branch, or something like that. It might be a good idea to have repo.or.cz as an additional repository, as a fork of the git.git repo, and with SoC / GSoC labels. See http://repo.or.cz/w/git.git/forks?t=soc -- Jakub Narebski Poland ShadeHawk on #git --
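To make the two test scenarios above concrete, here is a toy Python model, not git's actual history simplification. It assumes each parent of a merge is given as a plain list of lines, and attributes every tracked line to the first parent that contains it; lines found in no parent are attributed to the merge itself, which is how an "evil merge" would show up. The function name attribute_lines and the sample data are hypothetical.

```python
def attribute_lines(tracked, parents):
    """Pair each tracked line with the name of the first parent containing it.

    parents maps a parent name to that parent's file content (list of lines).
    A line present in no parent is attributed to "merge", i.e. it was
    introduced by the merge commit itself.
    """
    result = []
    for line in tracked:
        origin = "merge"
        for name, content in parents.items():
            if line in content:
                origin = name
                break
        result.append((line, origin))
    return result


# Two lines of history that were consolidated by a merge.
parents = {
    "parent1": ["a = 1", "b = 2"],
    "parent2": ["a = 1", "c = 3"],
}

# The tracked range in the merge result: one line from each parent
# ("history split") plus one line that exists in neither ("evil merge").
tracked = ["b = 2", "c = 3", "d = 4"]

origins = attribute_lines(tracked, parents)
```

Matching by exact line content is a deliberate simplification; a real implementation would presumably work from the diffs against each parent, as blame does, so that identical lines at different positions are not confused.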
Hi Jakub, It is a little time-consuming to find such a change in the history. I think we can come up with some at the start of the project manually Ah, a repo at http://github.com/byang/gsoc-line-browser has been created, with a mirror at http://repo.or.cz/w/gsoc-line-browser.git. I think this is enough. :-) Thanks! Bo --
