This implements an experimental "git log-fpc" command that shows
short-log style output sorted by topics.
A "topic" is identified by going through the first-parent
chains; this ignores the fast-forward case, but for a top-level
integrator it often is good enough.
For example, if the commit ancestry graph looks like this:
    x---x---x---X---o---*---o---o---o HEAD
                 \     /
                  o---o---o---o---o
and the command line asks for
    git log-fpc --no-merges X..
It first finds all the commits 'o'. Then it emits the four
commits on the upper line (assume the merge '*' has the commit
that is a child of X as its first parent in the picture). When
it does so, it shows the list of authors for these four commits on
one line, followed by the titles of these commits. After that, it
does the same for the five commits on the lower line.
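
The grouping described above can be sketched roughly as follows. This
is only an illustration, not the log-fpc implementation: the function
name fpc_groups is made up here, and it assumes its input is the output
of "git rev-list --parents <range>" (each line a commit followed by its
parents, newest first). Unlike the --no-merges example, it also lists
the merge commit itself in the first group.

```shell
# Split commits into first-parent chains.
# stdin:  lines of "<commit> <parent1> [<parent2> ...]"
# stdout: lines of "<group-number> <commit>"
fpc_groups() {
    awk '
    {
        if (NR == 1) tip = $1            # newest commit starts the first chain
        inrange[$1] = 1
        nparents[$1] = NF - 1
        for (i = 2; i <= NF; i++) parent[$1, i - 1] = $i
    }
    END {
        if (NR == 0) exit
        queue[1] = tip; head = 1; tail = 1; group = 0
        while (head <= tail) {
            c = queue[head++]
            if (!(c in inrange) || (c in seen)) continue
            group++
            # Follow first parents until we leave the range.
            while ((c in inrange) && !(c in seen)) {
                seen[c] = 1
                print group, c
                # Side branches (second and later parents) start new chains.
                for (i = 2; i <= nparents[c]; i++) queue[++tail] = parent[c, i]
                if (nparents[c] == 0) break
                c = parent[c, 1]
            }
        }
    }'
}
```

It would be used as "git rev-list --parents X..HEAD | fpc_groups";
when only the top chain is wanted, "git rev-list --first-parent X..HEAD"
gives that directly.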
---
I initially wanted to do this inside Johannes's enhanced
shortlog, but ended up doing this as a pretty much independent
thing, because the shortlog implementation stringifies the
information from the commits too early to be easily enhanced for
this purpose.
If this turns out to be a better way to present shortlog,
however, this should become an option to git-shortlog.
A sample output from:
    git log-fpc --no-merges v1.4.4.1..f64d7fd2
looks like this (f64d7fd2 was the tip of master when the last
"What's in" message was sent out). It shows that many "fixes"
and git-svn enhancements were directly done on "master" (that is
the first group), while many gitweb enhancements, changing the
output from "prune -n", "git branch" enhancements, etc. were
first cooked in separate topic branches and then later merged
into 'master'.
To this output, I can manually add a topic title to the
beginning of each group, and it would make a better overview than
what I currently send out in the "What's in" message, which is
generated with shortlog.
----------------------------------------------------------------
Eric Wong (6), ...

Umm. May I suggest that you try this with the kernel repo too..

There, the "first parent chain" tends to be less interesting than a
lot of other heuristics:

 - committer

   If the committer changes, you should probably consider it a break,
   the same way a second parent would be a break. You probably won't
   see this in the git archive, because there tends to be a single
   committer, but on something like the kernel where we really merge
   other people's repos, it's going to be as good (or better) than
   looking at "other parents".

 - subdirectory heuristics

   Again, with git it's not very interesting, but I bet that you'd be
   able to use heuristics like "the bulk of the changes were contained
   within this directory tree" for projects like the kernel, and
   automatically decide on "topics" like drivers/scsi, fs/ext3 etc.

In other words, I don't think the "fpc" decision is even very
interesting. If you _really_ want to do a cool shortlogger, I bet it
can be done, but I suspect that it would be a LOT cooler to do some
automatic bayesian clustering based on committer, author and list of
filenames changed.

Of course, such a thing done well would probably be worthy of a
doctoral thesis or something. Maybe somebody on this list who is into
bayesian clustering and doesn't have a thesis subject...

(Of course, since I haven't been in a University setting for the last
ten years, maybe bayesian clustering isn't the cool thing to work on
any more).

Anyway, "topics" really should be something that is extremely open to
various clustering models, bayesian or not ..

Linus
-
Have you?

I've compared

    gitk HEAD~40..HEAD

and

    git-log-fpc --no-merges HEAD~40..HEAD

Admittedly, the first group ("from the tip of the master") tends to
be seriously mixed up without a fixed theme (well the theme appears
to be "fix trivial warnings and compilation breakages not limited to
any particular subsystem"), but I find the other groups quite a sane
representation of what actually happened.

My copy of your tree is a bit old (HEAD is at 1abbfb412), but I see:

 - a two-commit series on MIPS via Ralf Baechle,
 - a four-commit series on ARM via Russell King,
 - a three-commit series on POWERPC via Paul Mackerras,
 - a seventeen-commit series in net/ area via Dave Miller,
 - a three-commit series on x86_64 via Andi Kleen.
 ...

As you said, committer would be a good addition to break a
fast-forward case to make it even better.
-
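
The committer break mentioned above can be sketched as a small filter.
This is an illustration under assumptions, not part of any posted
patch: the name group_by_committer is made up, and the input is
assumed to be "<committer name><TAB><subject>" lines in log order.

```shell
# Start a new group whenever the committer changes between
# consecutive commits.
# stdin:  "<committer>\t<subject>" lines, e.g. from
#         git log --no-merges --format='%cn%x09%s' <range>
group_by_committer() {
    awk -F '\t' '
    NR == 1 || $1 != prev {
        if (NR > 1) print ""             # blank line between groups
        print "via " $1 ":"
        prev = $1
    }
    # Print everything after the first tab, so subjects with tabs survive.
    { print "  " substr($0, length($1) + 2) }'
}
```

Because it only compares neighbouring commits, a committer who shows
up again later in the log starts a fresh group rather than being
folded into the earlier one.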
You'll reasonably often see in the kernel:

 - a patch-series by Andrew (where nothing but filename clustering
   really would help: the committer is me, and the thing is linear)

 - linearly on top of that, a git merge that was a fast-forward
   (especially from the subset of people who actively rebase their
   trees: that notably includes Dave Miller, but also for example the
   DVB people)

so purely a first-parent logic would not catch that case at all (but
the committer would at least catch the "patch-series by Andrew" ->
"Merge of network tree by Davem" break).

But especially with long patch-series through Andrew, it would be nice
to have some other heuristics (although they _tend_ to be fairly
random, especially at the end of the release cycle - at the beginning,
I tend to have series of 100-200 patches that often _could_ be clearly
clustered into a few clusters).

Anyway, the real win of clustering would likely be for big releases,
ie something like "v2.6.18..v2.6.19-rc1", where there's definitely
some clustering even apart from just merging (although the merge
topology will definitely get some of it)

Linus
-
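
As a rough sketch of the "bulk of the changes were contained within
this directory tree" heuristic: given the paths one commit touches,
report the directory prefix that covers most of them. The function
name dominant_dir and the choice of a two-level prefix (drivers/scsi,
fs/ext3) are assumptions made for this illustration.

```shell
# stdin:  one path per line, e.g. from
#         git diff-tree --no-commit-id --name-only -r <commit>
# stdout: "<prefix> <paths under prefix>/<total paths>"
dominant_dir() {
    awk -F/ '
    {
        # Two leading directories when available, one otherwise,
        # "." for files at the top level.
        key = (NF >= 3) ? ($1 "/" $2) : ((NF == 2) ? $1 : ".")
        count[key]++
    }
    END {
        max = 0
        for (k in count)
            # Ties are broken alphabetically to keep the output stable.
            if (count[k] > max || (count[k] == max && k < best)) {
                max = count[k]; best = k
            }
        if (NR > 0) printf "%s %d/%d\n", best, max, NR
    }'
}
```

A caller could treat the prefix as the commit's "topic" only when the
ratio is high enough; where that threshold should sit is left open
here.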
After sending out a response, I re-read your message because I did
not quite get where bayesian would come into the picture.

I think I should have used the word "topic branch" instead of
"topic". In other words, I was not interested in sifting the various
totally unrelated linear commits into groups that deal with distinct
problems.

But again you are showing your superior intelligence by setting the
problem in a much grander scheme ;-), where there is no such developer
discipline that would help the shortlogger (like use of topic
branches). In such a case, you would need a set of heuristics that
you described.
-
Well, I think you've grown slightly jaded by the fact that git has
very active "normal" development, that is actually done by you on the
main branch, and you do basically zero rebasing along the side
branches.

I think that's actually likely the exception rather than the rule.
It's much more likely that people have almost _all_ active development
done on side branches, and that - together with rebasing of the side
branches - inevitably means that the "main branch" ends up not having
such a clean set of "topic branch" merges.

In addition, on a more mature tree, a lot (probably _most_) of the
commits aren't really "topics" at all, but "maintenance", which
exacerbates the problem: you don't have a "line of development of this
feature", you tend to have much more of a random "fix this general
area", where the only common theme may be the fact that things are
_related_ to some common subsystem, but not that they are a "topic
branch" in the _development_ sense.

Put another way: bugs get fixed one by one, not in a nice linear
fashion by "topic".

So I'm coming at it from a totally different project - where "topic
branches" simply aren't delineated as much, and even when they are,
they tend to be merged in multiple steps (and they pull both ways when
they aren't re-based).

So that's why I don't think the pure branch topology is as
interesting. A single line of development ends up being useful for
you, and we'll certainly see _some_ of that, but in the kernel, I
pretty much guarantee that you probably get better "topic clustering"
by going simply by author, like the old standard "git shortlog" does.
Because that will tend to get the clustering at a finer granularity
(ie not just "networking", but things like "packet filtering" etc).

So the "sort by people" actually works fairly well, but it's kind of
an "incidental" thing, and it _would_ be potentially useful to have
other ways of grouping things.

See? It's not about "superior intelligence", it's ...
You are absolutely right about "Andrew patchbomb" which is linear and
does not have the series boundary. Import from mostly linear foreign
SCM would have the same issue. Merge

Again, you are right, but that only means topic based grouping is not
for everybody, and certainly is not suitable for a long stretch of
commits on the trunk of a mature project because they tend to touch
everywhere and not all that clustered.

If those bugs were fixed by committing on separate topic branches and
then later merged, the topology based clustering would get the
grouping right, but I would imagine we would end up seeing hundreds of
such short groups which would not be useful at all. In such cases, it
would be much more useful to have one huge group that says "these are
small fixes, each of which may touch different areas -- they are not
related but grouped together because they are all small, obviously
correct and harmless fixes". So I suspect that is a slightly
different issue -- it

I agree multiple steps merge and merging both ways would happen in
real life, but I had an impression that fpc handles that topology
reasonably well, unless that "merge from upstream" are

I think "networking" vs "packet filtering" largely depends on how the
networking subsystem you pull from is managed. If netfilter comes as
e-mailed patches to DaveM and are applied onto the trunk of networking
subsystem, we will face exactly the same problem as we have with
Andrew's patchbomb to your trunk. If it were managed on a separate
topic branch in the networking subsystem repository (either DaveM
manages them in his repository as a topic, or DaveM pulls from
netfilter git repository -- I do not know how that part of the
patchflow works), I would imagine you would get the same "per topic"
grouping.

Another factor is that the author population of a wide and mature
project like the kernel tends to be more diverse, and a single person
tends to be focused on one thing at a time while others work on
different ...
Most of the subsystems end up using patches - they're simply better
ways to move things around and have people comment on them than saying
"please pull on this tree to see my suggestion".

I do it myself: even when I _generate_ the diff in my tree, I will
often just do a

    git diff > ~/diff

and then import the thing into my mailer, and say "Maybe something
like this?".

So I think patches are fundamentally the core way to get things in the
periphery into just about any system. Maybe we do it more than most
just because we're so _used_ to them, but I actually think that if the
kernel does it more than most (and I'm not sure it does), it's simply
because the thing about patches is that they really _work_.

So yes, the network subsystem tends to be entirely linear by the time
it hits me. That's true of a lot of other subsystems too (SCSI etc).
There's a _few_ subsystems that actually have real topic branches:
ACPI and network driver development comes to mind, but it seems to
actually be the exception rather than the rule.

(I think that a lot of people work like I occasionally do: they do
have their own local branches for some stuff, but they end up
re-linearizing and keeping them active with "git rebase", so the
branches really are purely local, rather than something that is
visible in the end result).

But the REAL reason I'd love to see a smarter "data-mining" git log
(whether it does things by bayesian clustering or any other kind of
grouping technology) is that this is actually something that people
ask for: when I make my "git shortlog" for major releases, the thing
is often thousands of lines long, and it would be _beautiful_ if that
could be data-mined somewhat more intelligently.

So, for example, do a simple

    git shortlog v2.6.17..v2.6.18

(with the shortlog in "next" that can do this - btw, why doesn't it
default to using PAGER like "git log" does?), and realize that it's
about 8500 lines of stuff, and nobody can really be expected ...
Hi,

Funny you should mention it... I recently was exposed to Formal
Concept Analysis, and immediately thought that this would have
applications in the visualization of source codes' histories.

Maybe there is a way to apply Bayesian Inference to determine a subset
which bears the highest information / subset size ratio.

As for reducing the number of lines in the shortlog: taking myself as
an example, I often touch the same code several times, just to fix
bugs. So, if the same code was touched several times, just take the
first oneline, and add "(+fixes)". Of course, this is more like a
wedding between shortlog and annotate, and likely to be slow.

Ciao,
Dscho
-
Interesting. While driving to work this morning I had the same
thought. A revision that does not appear in the output from
    for file in $(list of files the commit touches)
    do
        git blame v2.6.17..v2.6.18 -- $file
    done

can safely be omitted from the shortlog, because later changes
fully supersede it.
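
A minimal sketch of that check, with the pseudo-code's "list of files
the commit touches" filled in by git diff-tree. The function name
commit_survives is made up for illustration, and the sketch assumes
ordinary path names (no embedded newlines or quoting by git).

```shell
# Succeed if <commit> is still blamed for at least one line, within
# <range>, of some path it touched; fail if later changes fully
# superseded it.
# usage: commit_survives <range> <commit>
commit_survives() {
    range=$1
    commit=$(git rev-parse --verify "$2^0") || return 2
    git diff-tree --no-commit-id --name-only -r "$commit" |
    while read -r path
    do
        # "git blame -l" prints the full commit name at the start of
        # each line; paths that no longer exist are simply skipped.
        if git blame -l "$range" -- "$path" 2>/dev/null |
           grep -q "^$commit"
        then
            echo yes
            break
        fi
    done | grep -q yes
}
```

For example, "commit_survives v2.6.17..v2.6.18 $rev || echo omit $rev"
would flag revisions that could be dropped from the shortlog. Running
blame once per touched file per commit is slow; it is meant only to
make the idea concrete.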
I think the list of "important" changes is an interesting
problem, but the importance may not directly be related to the
number of paths a patch touches (e.g. "you reorder the members
of a structure everybody uses in one include file and everything
starts performing faster due to better cache behaviour" would be
a few lines of a single header file). Also better clues to
judge the importance would be found outside the repository.
"The patch discussed by many people on the list" and "the patch
that had very many iterations to get in the final shape" would
certainly be interesting ones, but that information is often not
found in the repository.
-
Just for fun, I took a look at what we might see by ordering commits by
their "amount of blamedness". That is, the count of lines introduced by
a commit which were not later superseded. The script I used is below:
#!/bin/sh

start=$1; shift
end=$1; shift

start_sha1=`git-rev-parse $start^{}`
git-rev-list --parents $start..$end >revs
echo $start_sha1 >>revs

for i in `git-diff --raw -r $start $end | cut -f2`; do
  echo blaming $i... >&2
  git-blame -l -S revs $i | cut -d' ' -f1
done |
  grep -v $start_sha1 |
  sort | uniq -c | sort -rn |
  while read count hash; do
    echo "$count `git-rev-list --max-count=1 --pretty=oneline $hash`"
  done
The top 15 for v1.4.3 to v1.4.4 are:
1604 6973dcaee76ef7b7bfcabd2f26e76205aae07858 Libify diff-files.
1100 9f613ddd21cbd05bfc139d9b1551b5780aa171f6 Add git-for-each-ref: helper for language bindings
1050 cee7f245dcaef6dade28464f59420095a9949aac git-pickaxe: blame rewritten.
700 58e60dd203362ecb9fdea765dcc2eb573892dbaf Add support for pushing to a remote repository using HTTP/DAV
571 9f1afe05c3ab7228e21ba3666c6e35d693149b37 gitk: New improved gitk
524 197e8951abd2ebf2c70d0847bb0b38b16b92175b http-push: support for updating remote info/refs
504 83b5d2f5b0c95fe102bc3d1cc2947abbdf5e5c5b builtin-grep: make pieces of it available as library.
462 aa1dbc9897822c8acb284b35c40da60f3debca91 Update http-push functionality
344 a57a9493df00b6fbb3699fda8ceedf4ac0783ac6 Added Perl git-cvsimport-script
343 f8b28a4078a29cbf93cac6f9edd8d5c203777313 gitk: Add a tree-browsing mode
323 00449f992b629f7f7884fb2cf46ff411a2a4f381 Make git-fmt-merge-msg a builtin
285 fd8ccbec4f0161b14f804a454e68b29e24840ad3 gitk: Work around Tcl's non-standard names for encodings
283 9cf6d3357aaaaa89dd86cc156221b7b604e9358c Add git-index-pack utility
277 e4fbbfe9eccd37c0f9c060eac181ce05988db76c Add git-zip-tree
256 da7c24dd9c75d014780179f8eb843968919e4c46 gitk: Basic support for highlighting one view within another
The bottom 15 are:
1 ...

Hi,

I was surprised that not more of my stuff was in the top-15, since I
submitted less-than-finished patches quite often. Especially
merge-recursive was quite a bit of work for Alex and me.

BTW merge-recursive is a perfect example why this approach will break
down: most of the rewrite in C took place in a private repository with
quite some commits. This does not show in the git repository. I fully
expect the linux repository to behave similarly, since most of the
features are cooked elsewhere, and not all of them are pulled, but
some are applied (i.e. they appear out of nowhere from the
repository's viewpoint).

Ciao,
Dscho
-
Yes, I think this would be more useful in concert with some sort of
grouping. If we can make a group of commits related to
merge-recursive, and score them as a single item, then they can be
compared to other groups (which may consist of a single commit or
several).

-Peff
-
Something is SERIOUSLY wrong. That commit is not even between
v1.4.3 and v1.4.4.
-
Hmm, you're right. I haven't quite figured out what went wrong with the
script I posted. However, a somewhat simpler approach is to just use the
revision limiting in git-blame. The problem with this is that commits
whose parents aren't in the revision range end up getting blamed for a
lot of lines they're not responsible for.
As a quick hack, I just threw out any revisions whose parents weren't in
range. This is wrong, since those revisions probably _do_ have some
correctly blamed lines. It made me wonder about a possible feature for
git-blame: when we can't pass the blame up further, instead of taking
responsibility, output a "no responsibility line" (blaming on commit
0{40}, or some other format). I think this should be more informative
when there is a limit on the range of revisions.
The top of the "blamedness" list for v1.4.3..v1.4.4 is below. Important
things do seem to float to the top, but it would probably be much more
accurate if we were scoring groups of commits (generated by some other
analysis).
-Peff
-- >8 --
1050 cee7f245dcaef6dade28464f59420095a9949aac git-pickaxe: blame rewritten.
223 fe142b3a4577a6692a39e2386ed649664ad8bd20 Rework cvsexportcommit to handle binary files for all cases.
219 c31820c26b8f164433e67d28c403ca0df0316055 Make git-branch a builtin
216 636171cb80255682bdfc9bf5a98c9e66d4c0444a make index-pack able to complete thin packs.
182 b1f33d626501c3e080b324e182f1da76f49b5bf9 Swap the porcelain and plumbing commands in the git man page
173 744d0ac33ab579845808b8b01e526adc4678a226 gitweb: New improved patchset view
169 e30496dfcb98a305a57b835c248cbc3aa2376bfc gitweb: Support for 'forks'
142 5b329a5f5e3625cdc204e3d274c89646816f384c t6022: ignoring untracked files by merge-recursive when they do not matter
134 c0990ff36f0b9b8e806c8f649a0888d05bb22c37 Add man page for git-show-ref
128 780e6e735be189097dad4b223d8edeb18cce1928 make pack data reuse compatible with both delta types
121 2d477051ef260aad352d63fc7d9c07e4ebb4359b add the capability for ...

The way you used "-S rev" was wrong. It is a way to temporarily
install grafts and nothing else; but the grafts you introduced that
way exactly matched the true parenthood except for the bottom
commit, and side branches merged during the timeframe leaked
right through your grafts. The digger started from your HEAD
(whatever that happened to be) along with the true parenthood
and found a way ancient ancestor.
A "bit more correct" script would have been something like this.
-- >8 --
#!/bin/sh
#
# Usage: sh ./run-me v1.4.3 v1.4.4
#
bottom=${1?bottom} top=${2?top}
bottom=$(git rev-parse --verify "$bottom^0")
range="$bottom..$top"
top=$(git rev-parse --verify "$top^0")

for path in $(git diff --name-only -r --diff-filter=AM "$range")
do
	echo >&2 "* $path"
	git blame -l -C "$range" -- "$path"
done | sed -e 's/ .*//' | sort | uniq -c | sort -n -r |
while read num hash
do
	test "$hash" = "$bottom" && continue
	it=$(git rev-list --pretty=oneline --abbrev --abbrev-commit -1 "$hash")
	printf '%6d %s\n' $num "$it"
done
-- 8< --
But as you correctly observed, even the above script is wrong.
The top one blamed with the above script is this commit:
8301 808239a Merge branch 'sk/ftp'
But that is an ancestor of v1.4.3!
What's wrong is that the ancestry graph around that commit
roughly looks like this:
           z---o---o---o
          /             \
    808239a--v1.4.3--o---*---o---v1.4.4
The pickaxe passes the blame around to the parents but does not
allow "boundary" commits to pass the blame to their parents.
As a result, the blame at the commit marked with '*' is split
along both branches, and after the leftmost commit 'z' passes
its blame to its parent, it stops there and ends up blaming
808239a, which is an ancestor of the original "boundary" commit
v1.4.3 given from the command line. What's wrong with my script
quoted above is the filter that checks $hash against $bottom;
it needs to check whether $hash is an ancestor of $bottom.
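
One way such an ancestry check could look is sketched below. It is
not the change made in this thread: it relies on "git merge-base
--is-ancestor", a later addition to Git, and the function name
drop_ancestors_of is made up for illustration.

```shell
# Filter "<count> <hash>" lines (as produced by "uniq -c"), dropping
# every hash that is <bottom> itself or one of its ancestors.
# usage: ... | sort | uniq -c | drop_ancestors_of "$bottom"
drop_ancestors_of() {
    bottom=$1
    while read -r num hash
    do
        # Some versions of git blame prefix boundary commits with "^".
        hash=${hash#^}
        # Exit status 0 means $hash is reachable from $bottom
        # (a commit counts as its own ancestor here).
        git merge-base --is-ancestor "$hash" "$bottom" && continue
        printf '%6d %s\n' "$num" "$hash"
    done
}
```

This replaces the 'test "$hash" = "$bottom"' line with a reachability
test, at the cost of one git invocation per distinct commit.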
With that change, the ...

Gaah. "that is NOT a good workaround". Sorry about the wasted
-
Excellent. This is exactly what I had in mind, and it seems to produce
sensible results out of the box:
git-diff --raw -r --diff-filter=AM $1 | cut -f2 |
  while read f; do
    git-blame -l $1 -- $f | grep -v ^- | cut -d' ' -f1
  done |
  sort | uniq -c | sort -rn |
  while read count hash; do
    echo "$count `git-rev-list --max-count=1 --pretty=oneline $hash`"
  done
Thanks!
-Peff
-
