How recent a Python are we assuming here? Is late 2.4 recent This is a real gem. I really like reading well-written Python programs. When git-rev-list (or "git-log --pretty=raw" that you use in your main()) simplifies the merge history based on subtree, we look at the merge and if the tree matches any of the parent we discard other parents and make the history a single strand of pearls. However for this application that is not what you want, so I can see why you run full "git-log" and prune history by hand here like this. I wonder if using "git-log --full-history -- $project" to let the core side omit commits that do not change the $project (but still give you all merged branches) would have made your job any easier? You are handling grafts by hand because --pretty=raw is special in that it displays the real parents (although traversal does use grafts). Maybe it would have helped if we had a --pretty format that is similar to raw but rewrites the parents? -
I have wondered recently why grafts are hidden in this way. I feel they
are something I want to know is occuring in my history as this history
is being manipulated. Perhaps we could emit a graft record in the
output, indeed have a graft object there? Someone could commit a
change, then graft over it in the history so I'd not see it even though
its in my working copy.
For instance in my historical git tree I have the following, note the
lack of a parent. If I move the graft up one commit, then I get a
parent, but not a parent that points at the next commit.
commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
author Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
committer Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
[...]
commit e7e173af42dbf37b1d946f9ee00219cb3b2bea6a
tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
parent 607899e17218b485a021c6ebb1cff771fd690eec
author Linus Torvalds <torvalds@ppc970.osdl.org> 1112580513 -0700
committer Linus Torvalds <torvalds@ppc970.osdl.org> 1112580513 -0700
It might be nice to have it more like the following, with a graft in
there, N*40 would obviously be fakes in the first instance as the object
isn't modified. M*40 would refer to the old parent if there was one,
else NONE or 0*40 perhaps.
commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
parent NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
author Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
committer Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
[...]
graft NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
parent e7e173af42dbf37b1d946f9ee00219cb3b2bea6a
OLD: MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
commit e7e173af42dbf37b1d946f9ee00219cb3b2bea6a
tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
parent 607899e17218b485a021c6ebb1cff771fd690eec
author Linus Torvalds <torvalds@ppc970.osdl.org> ...Just to make sure we are on the same page, only "raw" format output is special and it is special only on output. Ancestry traversal always honors what you have in grafts. However, you can do: $ git rev-list --parents --pretty=raw which would give you "commit $this_commit $its $parents" lines and "parent $true_parent" lines at the same time. And they will be inconsistent when you have grafts or path limiter. The former honor grafts and path limiter, and the latter show the true set of parents. -
-- >8 --
An illustration of rev-list --parents --pretty=raw
This script creates two separate histories, A and B, each of
which does:
(A0, B0): create fileA and subdir/fileB
(A1, B1): modify fileA
(A2, B2): modify subdir/fileB
and then grafts them together to make B0 a child of A2. So
the final history looks like (time flows from top to bottom):
true parent touches subdir?
A0 none yes (creates it)
A1 A0 no
A2 A1 yes
B0 none yes (different from what's in A2)
B1 B0 no
B2 B1 yes
"git rev-list --parents --pretty=raw B2" would give "fake"
parents on the "commit " header lines while "parent " header
lines show the parent as recorded in the commit object (i.e. B0
appears to have A2 as its parent on "commit " header but there
is no "parent A2" header line in it).
When you have path limiters, we simplify history to omit
commits that do not affect the specified paths.
So "git rev-list --parents --pretty=raw B2 subdir" would return
"B2 B0 A2 A0" (because B1 and A1 do not touch the path). When
it does so, the "commit " header lines have "fake" parents
(i.e. B2 appears to have B0 as its parent on "commit " header),
but you can still get the true parents by looking at "parent "
header.
---
diff --git a/t/t6001-rev-list-graft.sh b/t/t6001-rev-list-graft.sh
new file mode 100755
index 0000000..08a6cff
--- /dev/null
+++ b/t/t6001-rev-list-graft.sh
@@ -0,0 +1,113 @@
+#!/bin/sh
+
+test_description='Revision traversal vs grafts and path limiter'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+ mkdir subdir &&
+ echo >fileA fileA &&
+ echo >subdir/fileB fileB &&
+ git add fileA subdir/fileB &&
+ git commit -a -m "Initial in one history." &&
+ A0=`git rev-parse --verify HEAD` &&
+
+ echo >fileA fileA modified &&
+ git commit -a -m "Second in one history." &&
+ A1=`git rev-parse --verify HEAD` &&
+
+ echo >subdir/fileB fileB modified &&
+ git commit -a -m "Third in one history." ...We ran it with 2.4, so yes. git-split does require at least 2.4, though, because it uses set(), str.rsplit(), and subprocess, none of Thanks. We had some fun writing this; git's elegant repository I don't think it would. We still need to know what commit to use as the parent of any given commit, so we don't want commits in the log output with parents that don't exist in the log output. And rewriting parents in git-log based on which revisions change the specified subdirectory Yes, that would help. We could then avoid dealing with grafts manually. How would you feel about including git-split in the GIT tree? We could certainly write up the necessary documentation for it. - Josh Triplett
Umm.. You didn't realize that git log already _does_ exactly that? You need to rewrite the parents in order to get a nice and readable history, which in turn is needed for any visualizer. So git has long done the parent rewriting in order to be able to do things like gitk drivers/char on the kernel. And yes, that's done by the core revision parsing code, so when you do git log --full-history --parents -- $project you do get the rewritten parent output (of course, it's not actually _simplified_, so you get a fair amount of duplicate parents etc which you'd still have to simplify and which don't do anything at all). Without the "--full-history", you get a simplified history, but it's likely to be _too_ simplified for your use, since it will not only collapse multiple identical parents, it will also totally _remove_ parents that don't introduce any new content. So there are multiple levels of history simplification, and right now the internal git revision parser only gives you two choices: "none" (--full-history) and "extreme" (which is the default when you give a set of filenames). Linus -
No, I didn't, primarily because the git log output I've scrutinized most carefully came from git log --pretty=3Draw, which doesn't rewrite parents= Considering that git-split does exactly that (remove parents that don't introduce new content, assuming they changed things outside the subtree), that might actually work for us. I just checked, and the output of "git log --parents -- $project" on one of my repositories seems to show the same sequence of commits as git log --parents on the head commit printed by git-split $project (apart from the rewritten I don't think we need any middle ground here; why might we want less simplification? - Josh Triplett
Ok. In that case, you're good to go, and just use the current simplification entirely. Although I think that somebody (Dscho?) also had a patch to remove multiple identical parents, which he claimed could happen with There's really three levels of simplification: - none at all ("--full-history"). This is really annoying, but if you want to guarantee that you see all the changes (even duplicate ones) done along all branches, you currently need to do this one. Currently "git whatchanged" uses this one (and that ignores merges by default, making it quite palatable). So with "git whatchanged", you will get _every_ commit that changed the file, even if there are duplicates alogn different histories. - extreme (the current default). This one is really nice, in that it shows the simplest history you can make that explains the end result. But it means that if you had two branches that ended up with the same result, we will pick just one of them. And the other one may have done it differently, and the different way of reaching the same result might be interesting. We'll never know. As an exmple: the extreme simplification can also throw away branches that had work reverted on them - the branch ended up the _same_ as the one we chose, but it did so because it had some experimental work that was deemed to be bad. Extreme simplification may or may not remove that experiment, simply depending on which branch it _happened_ to pick. Currently, this is what most git users see if they ask for pathname simplification, ie "gitk drivers/char" or "git log -p kernel/sched.c" uses this simplification. It's extremely useful, but it definitely culls real history too. - The nice one that doesn't throw away potentially interesting duplicate paths to reach the same end result. We don't have this one, so no git commands do this yet. The way to do this one would be "--full-history", but then ...
t. So, if a commit has more than one parent (a merge), you want to eliminate any parents that end up as ancestors to other parents in the merge (including if their head has the same commit ID), but not eliminate multiple parents with different head commits but the same tree object? That seems simple enough; I *think* git-split actually already does that, though I haven't actually tested that particular case. If git log eliminates all but one of the parents with different commits but the same tree, I believe the commit sequence generated by git-split will differ from that of git log in that case, by including all such parents. I do agree that the behavior you describe seems like the best simplification, and I don't think the alternative you describe as "extreme simplification" makes any sense at all (picking a parent arbitrarily), nor does it seem any simpler to generate; either way, you still have to figure out if one parent has another as an ancestor, while the additional "extreme simplification" just *adds* a comparison of tree hashes. Or have I misunderstood the case you have concerns about? Why would the "nice" format incur additional cost? - Josh Triplett
Try it. The default "extreme" simplification is a _hell_ of a lot faster than doing the full history. [torvalds@g5 linux]$ time git-rev-list --full-history --parents HEAD -- kernel/sched.c >/dev/null real 0m4.660s user 0m4.612s sys 0m0.044s [torvalds@g5 linux]$ time git-rev-list --parents HEAD -- kernel/sched.c >/dev/null real 0m1.684s user 0m1.680s sys 0m0.004s and the "nice" thing will be much slower still: just trying to figure out whether a commit is a parent of another commit is expensive. Doing so for _each_ merge is more expensive still. I think it's O(n^3), but what do I know.. Linus -
[ timings removed ] Btw, the reason it is so much faster is that it can be done early, and allows us to prune out parts of the history that we don't care about. For example, when we hit a merge, and the result of that merge is identical to one of the parents (in the set of filenames that we are interested in), we can simply choose to totally ignore the other parent, and we don't need to traverse that history at _all_. Because clearly, all the actual _data_ came from just the other one. So the "extreme" simplification is way way faster, because in the presense of a lot of merges, it can select to go down just one of the paths, and totally ignore the other ones. In practice, for a fairly "bushy" history tree like the kernel, that can cut down the number of commits you need to compare by a factor of two or more. In many ways, it is also actually a _better_ result, in that it's a "closer to minimal" way of reaching a particular state. So if you're just interested in how something came to be, and want to just cut through the crap, the result extreme simplification really _is_ better. So the branches that were dismissed really _aren't_ important - they might contain real work, but from the point of the end result, that real work might as well not have happened, since the simpler history we chose _also_ explain the end result sufficiently. So I think the default simplification is really a good default: not only because it's fundamentally cheaper, but because it is actually more likely to be distill what you actually care about if you wonder what happened to a file or a set of files. But if you care about all the "side efforts" that didn't actually matter for the end result too, then you'd want the more expensive, and more complete graph. But it _will_ be a lot more expensive to compute. Linus -
Hi, IIRC It only happened when full history was wished for, _and_ path limiting. And Junio said that in that case, culling identical parents would be the wrong thing to do. Ciao, Dscho -
Yeah, with full history, you might as well keep the trivially identical parents too. In the "nice rewrite", you'd get rid of them, but you'd get rid of them not because they are trivially identical, but because they are just the trivial case of "one parent totally dominates the other". Sadly, while it may be the _trivial_ case in --full-history, it's probably not even the common case, at least if you have lots of merges (because you end up having one parent point to one merge, and the other to another, and you need to simplify things in multiple passes). Linus -
So one potential action item that came out from this discussion for me is to either modify --pretty=raw (or add --pretty=rawish) that gives the rewritten parents instead of real parents? With that, you can drop the code to simplify ancestry by hand in your loop, and also you do not have to do the grafts inforamation yourself either? If that is the case I'd be very happy. The only thing left for us to decide is if reporting the true parenthood like the current --pretty=raw makes sense (if so we need to keep it and introduce --pretty=rawfish). The only in-tree user of --pretty=raw seems to be git-svn but it only looks at path-unlimited log/rev-list from one given commit, so the only difference between dumping what is recorded in the commit object and listing what parents we _think_ the commit has is what we read from grafts. I think we are safe to just "fix" the behaviour of --pretty=raw Comments? -
I actually think I want to look further into the idea of just using git --pretty=3Draw --parents -- $project, and see if I can find any corner ca= ses where it generates a different history than what we want. This combinati= on of options seems like it provides everything we need: redundant history simplification, parent rewriting based on simplification and grafts, and = easy parsing. If the only case in which it differs occurs when you have two distinct commits with identical trees, I don't know that I care too much;= that particular scenario seems unlikely to occur in any of the trees I care ab= out, and any sane simplification behavior for it seems OK. :) As long as it ru= ns correctly with various ancestor/descendant/cousin/unrelated relationships= between merged branches (which I want to test further), I think it will d= o the job nicely. - Josh Triplett
I do not mind _coding_ the --pretty=rawfish change if needed but I do not think it is necessary, which is pretty good news. After I wrote the message I realized that I probably do not have to do anything, since --parents would give you the rewritten parents already anyway. -
