Re: [RFC] git-split: Split the history of a git repository by subdirectories and ranges

Previous thread: Re: [PATCH 3/3] diff --stat: sometimes use non-linear scaling. by Johannes Schindelin on Wednesday, September 27, 2006 - 1:35 am. (2 messages)

Next thread: Re: git and time by Junio C Hamano on Wednesday, September 27, 2006 - 3:13 am. (2 messages)
From: Junio C Hamano
Date: Wednesday, September 27, 2006 - 3:13 am

How recent a Python are we assuming here?  Is late 2.4 recent

This is a real gem.  I really like reading well-written Python
programs.

When git-rev-list (or "git-log --pretty=raw" that you use in
your main()) simplifies the merge history based on subtree, we
look at the merge and if the tree matches any of the parent we
discard other parents and make the history a single strand of
pearls.  However for this application that is not what you want,
so I can see why you run full "git-log" and prune history by
hand here like this.

I wonder if using "git-log --full-history -- $project" to let
the core side omit commits that do not change the $project (but
still give you all merged branches) would have made your job any
easier?

You are handling grafts by hand because --pretty=raw is special
in that it displays the real parents (although traversal does
use grafts).  Maybe it would have helped if we had a --pretty
format that is similar to raw but rewrites the parents?


-

From: Andy Whitcroft
Date: Wednesday, September 27, 2006 - 4:59 am

I have wondered recently why grafts are hidden in this way.  I feel they
are something I want to know is occuring in my history as this history
is being manipulated.  Perhaps we could emit a graft record in the
output, indeed have a graft object there?  Someone could commit a
change, then graft over it in the history so I'd not see it even though
its in my working copy.

For instance in my historical git tree I have the following, note the
lack of a parent.  If I move the graft up one commit, then I get a
parent, but not a parent that points at the next commit.

  commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
  tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
  author Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
  committer Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700

  [...]

  commit e7e173af42dbf37b1d946f9ee00219cb3b2bea6a
  tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
  parent 607899e17218b485a021c6ebb1cff771fd690eec
  author Linus Torvalds <torvalds@ppc970.osdl.org> 1112580513 -0700
  committer Linus Torvalds <torvalds@ppc970.osdl.org> 1112580513 -0700

It might be nice to have it more like the following, with a graft in
there, N*40 would obviously be fakes in the first instance as the object
isn't modified.  M*40 would refer to the old parent if there was one,
else NONE or 0*40 perhaps.

  commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
  tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
  parent NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
  author Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700
  committer Linus Torvalds <torvalds@ppc970.osdl.org> 1113690036 -0700

  [...]

  graft NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
  parent e7e173af42dbf37b1d946f9ee00219cb3b2bea6a

    OLD: MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM

  commit e7e173af42dbf37b1d946f9ee00219cb3b2bea6a
  tree 0bba044c4ce775e45a88a51686b5d9f90697ea9d
  parent 607899e17218b485a021c6ebb1cff771fd690eec
  author Linus Torvalds <torvalds@ppc970.osdl.org> ...
From: Junio C Hamano
Date: Wednesday, September 27, 2006 - 12:08 pm

Just to make sure we are on the same page, only "raw" format
output is special and it is special only on output.  Ancestry
traversal always honors what you have in grafts.

However, you can do:

$ git rev-list --parents --pretty=raw

which would give you "commit $this_commit $its $parents" lines
and "parent $true_parent" lines at the same time.

And they will be inconsistent when you have grafts or path
limiter.  The former honor grafts and path limiter, and the
latter show the true set of parents.


-

From: Junio C Hamano
Date: Wednesday, September 27, 2006 - 12:31 pm

-- >8 --
An illustration of rev-list --parents --pretty=raw

This script creates two separate histories, A and B, each of
which does:

      (A0, B0): create fileA and subdir/fileB
      (A1, B1): modify fileA
      (A2, B2): modify subdir/fileB

and then grafts them together to make B0 a child of A2.  So
the final history looks like (time flows from top to bottom):

		true parent	touches subdir?

	A0	none		yes (creates it)
        A1      A0		no
        A2	A1		yes
        B0	none		yes (different from what's in A2)
        B1	B0		no
        B2	B1		yes

"git rev-list --parents --pretty=raw B2" would give "fake"
parents on the "commit " header lines while "parent " header
lines show the parent as recorded in the commit object (i.e. B0
appears to have A2 as its parent on "commit " header but there
is no "parent A2" header line in it).

When you have path limiters, we simplify history to omit
commits that do not affect the specified paths.  

So "git rev-list --parents --pretty=raw B2 subdir" would return
"B2 B0 A2 A0" (because B1 and A1 do not touch the path).  When
it does so, the "commit " header lines have "fake" parents
(i.e. B2 appears to have B0 as its parent on "commit " header),
but you can still get the true parents by looking at "parent "
header.

---
diff --git a/t/t6001-rev-list-graft.sh b/t/t6001-rev-list-graft.sh
new file mode 100755
index 0000000..08a6cff
--- /dev/null
+++ b/t/t6001-rev-list-graft.sh
@@ -0,0 +1,113 @@
+#!/bin/sh
+
+test_description='Revision traversal vs grafts and path limiter'
+
+. ./test-lib.sh
+
+test_expect_success setup '
+	mkdir subdir &&
+	echo >fileA fileA &&
+	echo >subdir/fileB fileB &&
+	git add fileA subdir/fileB &&
+	git commit -a -m "Initial in one history." &&
+	A0=`git rev-parse --verify HEAD` &&
+
+	echo >fileA fileA modified &&
+	git commit -a -m "Second in one history." &&
+	A1=`git rev-parse --verify HEAD` &&
+
+	echo >subdir/fileB fileB modified &&
+	git commit -a -m "Third in one history." ...
From: Josh Triplett
Date: Monday, October 23, 2006 - 3:17 am

We ran it with 2.4, so yes.  git-split does require at least 2.4,
though, because it uses set(), str.rsplit(), and subprocess, none of

Thanks.  We had some fun writing this; git's elegant repository

I don't think it would.  We still need to know what commit to use as the
parent of any given commit, so we don't want commits in the log output
with parents that don't exist in the log output.  And rewriting parents
in git-log based on which revisions change the specified subdirectory

Yes, that would help.  We could then avoid dealing with grafts manually.


How would you feel about including git-split in the GIT tree?  We could
certainly write up the necessary documentation for it.

- Josh Triplett

From: Linus Torvalds
Date: Monday, October 23, 2006 - 8:52 am

Umm.. You didn't realize that git log already _does_ exactly that?

You need to rewrite the parents in order to get a nice and readable 
history, which in turn is needed for any visualizer. So git has long done 
the parent rewriting in order to be able to do things like

	gitk drivers/char

on the kernel.

And yes, that's done by the core revision parsing code, so when you do

	git log --full-history --parents -- $project

you do get the rewritten parent output (of course, it's not actually 
_simplified_, so you get a fair amount of duplicate parents etc which 
you'd still have to simplify and which don't do anything at all).

Without the "--full-history", you get a simplified history, but it's 
likely to be _too_ simplified for your use, since it will not only 
collapse multiple identical parents, it will also totally _remove_ parents 
that don't introduce any new content.

So there are multiple levels of history simplification, and right now the 
internal git revision parser only gives you two choices: "none" 
(--full-history) and "extreme" (which is the default when you give a set 
of filenames). 

			Linus
-

From: Josh Triplett
Date: Monday, October 23, 2006 - 12:27 pm

No, I didn't, primarily because the git log output I've scrutinized most
carefully came from git log --pretty=3Draw, which doesn't rewrite parents=


Considering that git-split does exactly that (remove parents that don't
introduce new content, assuming they changed things outside the
subtree), that might actually work for us.  I just checked, and the
output of "git log --parents -- $project" on one of my repositories
seems to show the same sequence of commits as git log --parents on the
head commit printed by git-split $project (apart from the rewritten

I don't think we need any middle ground here; why might we want less
simplification?

- Josh Triplett


From: Linus Torvalds
Date: Monday, October 23, 2006 - 12:50 pm

Ok. In that case, you're good to go, and just use the current 
simplification entirely.

Although I think that somebody (Dscho?) also had a patch to remove 
multiple identical parents, which he claimed could happen with 

There's really three levels of simplification:

 - none at all ("--full-history"). This is really annoying, but if you 
   want to guarantee that you see all the changes (even duplicate ones) 
   done along all branches, you currently need to do this one.

   Currently "git whatchanged" uses this one (and that ignores merges by
   default, making it quite palatable). So with "git whatchanged", you 
   will get _every_ commit that changed the file, even if there are 
   duplicates alogn different histories.

 - extreme (the current default). This one is really nice, in that it 
   shows the simplest history you can make that explains the end result. 
   But it means that if you had two branches that ended up with the same 
   result, we will pick just one of them. And the other one may have done 
   it differently, and the different way of reaching the same result might 
   be interesting. We'll never know.

   As an exmple: the extreme simplification can also throw away branches 
   that had work reverted on them - the branch ended up the _same_ as the 
   one we chose, but it did so because it had some experimental work that 
   was deemed to be bad. Extreme simplification may or may not remove that 
   experiment, simply depending on which branch it _happened_ to pick.

   Currently, this is what most git users see if they ask for pathname 
   simplification, ie "gitk drivers/char" or "git log -p kernel/sched.c"
   uses this simplification. It's extremely useful, but it definitely 
   culls real history too.

 - The nice one that doesn't throw away potentially interesting 
   duplicate paths to reach the same end result. We don't have this one, 
   so no git commands do this yet.

   The way to do this one would be "--full-history", but then ...
From: Josh Triplett
Date: Monday, October 23, 2006 - 1:52 pm

t.

So, if a commit has more than one parent (a merge), you want to
eliminate any parents that end up as ancestors to other parents in the
merge (including if their head has the same commit ID), but not
eliminate multiple parents with different head commits but the same tree
object?  That seems simple enough; I *think* git-split actually already
does that, though I haven't actually tested that particular case.  If
git log eliminates all but one of the parents with different commits but
the same tree, I believe the commit sequence generated by git-split will
differ from that of git log in that case, by including all such parents.

I do agree that the behavior you describe seems like the best
simplification, and I don't think the alternative you describe as
"extreme simplification" makes any sense at all (picking a parent
arbitrarily), nor does it seem any simpler to generate; either way, you
still have to figure out if one parent has another as an ancestor, while
the additional "extreme simplification" just *adds* a comparison of tree
hashes.

Or have I misunderstood the case you have concerns about?  Why would the
"nice" format incur additional cost?

- Josh Triplett


From: Linus Torvalds
Date: Monday, October 23, 2006 - 2:06 pm

Try it. The default "extreme" simplification is a _hell_ of a lot faster 
than doing the full history.

	[torvalds@g5 linux]$ time git-rev-list --full-history --parents HEAD -- kernel/sched.c >/dev/null

	real    0m4.660s
	user    0m4.612s
	sys     0m0.044s

	[torvalds@g5 linux]$ time git-rev-list --parents HEAD -- kernel/sched.c >/dev/null
	
	real    0m1.684s
	user    0m1.680s
	sys     0m0.004s

and the "nice" thing will be much slower still: just trying to figure out 
whether a commit is a parent of another commit is expensive. Doing so for 
_each_ merge is more expensive still. I think it's O(n^3), but what do I 
know..

			Linus
-

From: Linus Torvalds
Date: Monday, October 23, 2006 - 2:19 pm

[ timings removed ]

Btw, the reason it is so much faster is that it can be done early, and 
allows us to prune out parts of the history that we don't care about.

For example, when we hit a merge, and the result of that merge is 
identical to one of the parents (in the set of filenames that we are 
interested in), we can simply choose to totally ignore the other parent, 
and we don't need to traverse that history at _all_. Because clearly, all 
the actual _data_ came from just the other one.

So the "extreme" simplification is way way faster, because in the presense 
of a lot of merges, it can select to go down just one of the paths, and 
totally ignore the other ones. In practice, for a fairly "bushy" history 
tree like the kernel, that can cut down the number of commits you need to 
compare by a factor of two or more.

In many ways, it is also actually a _better_ result, in that it's a 
"closer to minimal" way of reaching a particular state. So if you're just 
interested in how something came to be, and want to just cut through the 
crap, the result extreme simplification really _is_ better.

So the branches that were dismissed really _aren't_ important - they might 
contain real work, but from the point of the end result, that real work 
might as well not have happened, since the simpler history we chose _also_ 
explain the end result sufficiently.

So I think the default simplification is really a good default: not only 
because it's fundamentally cheaper, but because it is actually more likely 
to be distill what you actually care about if you wonder what happened to 
a file or a set of files.

But if you care about all the "side efforts" that didn't actually matter 
for the end result too, then you'd want the more expensive, and more 
complete graph. But it _will_ be a lot more expensive to compute.

		Linus
-

From: Johannes Schindelin
Date: Tuesday, October 24, 2006 - 7:56 am

Hi,


IIRC It only happened when full history was wished for, _and_ path 
limiting. And Junio said that in that case, culling identical parents 
would be the wrong thing to do.

Ciao,
Dscho

-

From: Linus Torvalds
Date: Tuesday, October 24, 2006 - 8:19 am

Yeah, with full history, you might as well keep the trivially identical 
parents too. In the "nice rewrite", you'd get rid of them, but you'd get 
rid of them not because they are trivially identical, but because they are 
just the trivial case of "one parent totally dominates the other".

Sadly, while it may be the _trivial_ case in --full-history, it's probably 
not even the common case, at least if you have lots of merges (because you 
end up having one parent point to one merge, and the other to another, and 
you need to simplify things in multiple passes).

			Linus
-

From: Junio C Hamano
Date: Tuesday, October 24, 2006 - 5:10 pm

So one potential action item that came out from this discussion
for me is to either modify --pretty=raw (or add --pretty=rawish)
that gives the rewritten parents instead of real parents?  With
that, you can drop the code to simplify ancestry by hand in your
loop, and also you do not have to do the grafts inforamation
yourself either?

If that is the case I'd be very happy.

The only thing left for us to decide is if reporting the true
parenthood like the current --pretty=raw makes sense (if so we
need to keep it and introduce --pretty=rawfish).

The only in-tree user of --pretty=raw seems to be git-svn but it
only looks at path-unlimited log/rev-list from one given commit,
so the only difference between dumping what is recorded in the
commit object and listing what parents we _think_ the commit has
is what we read from grafts.  I think we are safe to just "fix"
the behaviour of --pretty=raw

Comments?

-

From: Josh Triplett
Date: Tuesday, October 24, 2006 - 6:59 pm

I actually think I want to look further into the idea of just using git
--pretty=3Draw --parents -- $project, and see if I can find any corner ca=
ses
where it generates a different history than what we want.  This combinati=
on of
options seems like it provides everything we need: redundant history
simplification, parent rewriting based on simplification and grafts, and =
easy
parsing.  If the only case in which it differs occurs when you have two
distinct commits with identical trees, I don't know that I care too much;=
 that
particular scenario seems unlikely to occur in any of the trees I care ab=
out,
and any sane simplification behavior for it seems OK. :) As long as it ru=
ns
correctly with various ancestor/descendant/cousin/unrelated relationships=

between merged branches (which I want to test further), I think it will d=
o the
job nicely.

- Josh Triplett

From: Junio C Hamano
Date: Tuesday, October 24, 2006 - 7:13 pm

I do not mind _coding_ the --pretty=rawfish change if needed but
I do not think it is necessary, which is pretty good news.

After I wrote the message I realized that I probably do not have
to do anything, since --parents would give you the rewritten
parents already anyway.

-

Previous thread: Re: [PATCH 3/3] diff --stat: sometimes use non-linear scaling. by Johannes Schindelin on Wednesday, September 27, 2006 - 1:35 am. (2 messages)

Next thread: Re: git and time by Junio C Hamano on Wednesday, September 27, 2006 - 3:13 am. (2 messages)