I'm in the process of converting, stitching and patching vast amounts of initially disjunct CVS and SVN repositories into larger complete histories inside a single git repository, recreating history as accurately as possible.

The problem I encounter is that any number of times I have to "edit" history in a non-parameterisable fashion, in any of the following ways:
- Change parents.
- Add merges.
- Change author, committer, commitdate, authordate.
- Change the tree (because of conversion errors in the automated conversion process) belonging to a single commit.
- Retrofit a patch which has to ripple through all of history until the present.

The only things which are easily done at the moment are changing parents and adding merges. This can be accomplished fairly easily using the grafts file. The other changes are messy at best and need to be parameterised into the form of a shell script so that git filter-branch can have a go at it. This parameterisation is doable for author/committer/dates in most cases (but not pretty), but is rather (too) convoluted for ripple-through patches. You have to imagine that the whole tree has lots of interconnects already (merges), and changing the tree at a point in history which has to ripple through is a mess, because all references and interconnects need to be rewritten as well.

I propose the following:
- Extend git fsck to do more sanity checks on the content of the grafts file (to make it more difficult to shoot yourself in the foot with that file; my feet will be grateful).
- Extend the grafts file format to support something like the following syntax:

commit eb03813cdb999f25628784bb4f07b3f4c8bfe3f6
Parent: 7bc72e647d54c2f713160b22e2e08c39d86c7c28
Merge: 3b3da24960a82a479b9ad64affab50226df02abe 13b8f53e8ccec3b08eeb6515e6a10a2a
Merge: ac719ed37270558f21d89676fce97eab4469b0f1
Tree: 32fc99814b97322174dbe97ec320cf32314959e2
Author: Foo Bar (FooBar) <foo@bar>
AuthorDate: Sat Jun 6 13:50:44 1998 +0000
Commit: Foo Bar (FooBar) ...
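As a rough illustration of the kind of sanity check on the grafts file proposed above, here is a minimal sketch written as a standalone shell function rather than as part of git fsck. It assumes the existing grafts format (one commit per line followed by its parents, each a 40-hex-digit SHA-1, with blank lines and lines starting with # skipped). The name check_grafts is hypothetical, and the check is purely syntactic: it does not verify that the objects exist.

```shell
#!/bin/sh
# check_grafts FILE (hypothetical helper, syntax check only)
# Print every line that is neither blank, nor a comment, nor a
# space-separated list of 40-hex-digit SHA-1s, prefixed by its line number.
check_grafts() {
    # grep -v selects the lines that do NOT match the "valid line" pattern;
    # "|| true" keeps an all-valid file from looking like an error.
    bad=$(grep -nEv '^(#.*|[0-9a-f]{40}( [0-9a-f]{40})*)?$' "$1" || true)
    if [ -n "$bad" ]; then
        printf 'malformed graft lines in %s:\n%s\n' "$1" "$bad"
        return 1
    fi
    return 0
}
```

Running it as check_grafts .git/info/grafts before starting gitk would catch a truncated SHA-1 such as a 32-digit one, which is an easy mistake when editing the file by hand.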
[...] First, if I remember correctly (from KernelTrap and the now defunct Kernel Traffic and one issue of Git Traffic) the 'graft' mechanism was created so it would be possible to "graft" (join) a historical conversion repository with the "current work" git repository (started from zero when git was deemed good enough for Linux kernel development). The same mechanism is used for shallow clone, where one goes in the opposite direction, shortening history instead of joining two repositories (two histories). The fact that git-filter-branch (and earlier cg-admin-rewrite-hist) respects grafts, and rewrites history so that grafts become no-ops and are no longer needed, is a bit of a side effect. So I think that it would be better to provide a generic git-filter-branch filter which can understand this "generalized grafts" file format, or rather 'description of changes' file. Put it in contrib/, and here you go... -- Jakub Narebski Poland ShadeHawk on #git --
Maybe the upcoming git-sequencer could be the appropriate place? It tries to achieve just that: edit history by specifying a list of commands. The currently planned set of commands would need to be amended, but the framework should be in place. Michael --
That's the problem. Like git filter-branch, git sequencer needs you to
parameterise the changes, which, in my case, is hardly possible, since
the changes are randomlike.
Also, having to run the sequencer to dig 20000 commits into the past,
then change something, then come back up and rewrite all following
history and relations (parents/tags/merges) will take a sizeable amount
of time. I need something that can be changed at will, then viewed with
gitk a second later.
These edits are numerous and spread over many months, so the typical
history fixup-sessions involve periods where you make 30 random
historical edits per hour (which need to be viewed and checked every time
immediately after making the change). And say once every 4 months, you
run it through git filter-branch to cast everything into stone. A
typical git filter-branch run takes 15 minutes on a repository this
size.
--
Sincerely,
Stephen R. van den Berg.
You are confused; but this is your normal state.
--
I think the point was more about making a tool to do exactly what you want, based on the new git sequencer. Note that git filter-branch could also be rewritten to use the sequencer. Mike --
Yes, that was at least my point. As I understand, git filter-branch -i is a candidate for that rewrite. But I understand now that OP wants to do lots of history edits and see them immediately before doing the actual (time-consuming) rewrite; and then do the rewrite occasionally. Rewriting is surprisingly slow even on tmpfs. Michael --
As far as I understood it, the new git sequencer rewrites history
proper. That is time-consuming by definition, and thus it is *not*
possible to make a tool based on the sequencer that supports the desired
iterative-history-rewrite workflow.
--
Sincerely,
Stephen R. van den Berg.
You are confused; but this is your normal state.
--
Hi, I'm somehow quite confused about the desired workflow but I try an answer. If I got the problem right, it is possible. But you have to rewrite and cannot just fake history, of course. ...for example, a "pick <commit>" that just picks the _tree_ of the commit and not the _introduced changes_. (I've never used info/grafts, but if I get the principle right, such tree-picks could realize a linear list of info/grafts history fakes.) sequencer doesn't allow to change committer data, but this could be an easy change if you really need that. The same with the author timestamp, that could only be reused from "pause" instruction, and then do manual changes, then git sequencer --continue I wonder if grafts can be used in combination with sequencer in such a way that you rewrite foo~20000..foo~19950 and then fake the parents of You can run gitk whenever you did "pause" in the sequencer file. [Btw, an integration of sequencer into gitk is also on the TODO list, but that's OT here.] Regards, Stephan -- Stephan Beyer <s-beyer@gmx.net>, PGP 0x6EDDD207FCC5040F --
s/once/ones/ To give it some sense. Sorry ;) -- Stephan Beyer <s-beyer@gmx.net>, PGP 0x6EDDD207FCC5040F --
I don't think we are talking about any normal workflow but about importing "initially disjunct CVS and SVN repositories into larger complete histories inside a single git repository." This is one-time work, not Using grafts allows you to fake history, which is very useful during import, because it allows you to edit history without running any filter-branch, which is very time-consuming. Of course, at the end you have to run git filter-branch to have the "true" history, otherwise anyone who clones from you will end up with a broken repo. The purpose of rebase (and I believe the sequencer too) is rather different -- to allow you to keep your changes as patches to the I don't think it is a good idea. During normal work you should never use grafts. Well, you can use grafts to add old history, but using them for anything else is really dangerous, because it *fakes* history. git rebase (and AFAIK sequencer too) just rewrites the history of some branch. IOW, it creates another branch from a different starting point using patches from some existing branch and then reassigns the branch name to it. Dmitry --
Hi, I wrote this in the context that Stephen only changes some commits from a long time ago (foo~20000), and I showed a way to avoid having the sequencer rewrite the rest, which takes so long. This is not related to "normal work", but to Stephen's use case (if I got it right). What I meant was: Instead of faking a lot of parents, changes and even merges using an extended grafts file, he could rewrite some patches - which can be fast - and then use _only one_ graft to change the parent to the changed and rewritten commit. This can be done iteratively and seems to be a good compromise between speed and reliability. Regards, Stephan -- Stephan Beyer <s-beyer@gmx.net>, PGP 0x6EDDD207FCC5040F --
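The "only one graft" step described above could be sketched roughly as follows, assuming the current grafts format (a commit followed by its parents on one line). The helper name set_graft is hypothetical; it only edits the text file and leaves creating the rewritten commits to whatever tool is used for that.

```shell
#!/bin/sh
# set_graft FILE COMMIT [PARENT...] (hypothetical helper)
# Record COMMIT's fake parents in a grafts file, replacing any earlier
# entry for the same commit so that each commit is listed at most once.
set_graft() {
    file=$1
    shift
    commit=$1
    tmp="$file.tmp"
    if [ -f "$file" ]; then
        # Keep every existing entry except the one for this commit.
        # grep exits non-zero when nothing is left, hence "|| true".
        grep -v "^$commit" "$file" > "$tmp" || true
    else
        : > "$tmp"
    fi
    # "$*" joins the commit and its new parents with single spaces.
    echo "$*" >> "$tmp"
    mv "$tmp" "$file"
}
```

After rewriting, say, foo~20000..foo~19950 into new commits, a single call such as set_graft .git/info/grafts <first-untouched-child> <last-rewritten-commit> would attach the untouched remainder of history to the rewritten part, and gitk would show the result straight away.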
Indeed.
--
Sincerely,
Stephen R. van den Berg.
This is a day for firm decisions! Or is it?
--
A second later might be too much, but for the case where you need to add a patch in the middle (which I suspect is the most timeconsuming and tricky part at the moment), you might want to use a temporary branch checked out where you need to apply the patch, apply the patch and then rebase the rest of the history onto that new commit. Rebase is fairly quick (although not a one-second thing for 20k commits), so you'll get the time down quite a bit, I imagine. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 --
Not really.
Rebase does two things:
a. Apply every patch/commit again, which takes too long for 20k commits.
b. Mess up carefully grafted parent/merge relationships.
Rebase is only suitable for short linear strands of commits.
The history I'm dealing with is neither short, nor linear.
--
Sincerely,
Stephen R. van den Berg.
A truly wise man never plays leapfrog with a unicorn.
--
Quite. Which is exactly the spirit I'm extending here.
I need it to stitch together history, but it needs to be more perfect
than mere connecting parents.
Also, the graft mechanism specifically is intended as a temporary
solution until one uses filter-branch to "finalise" the result into a
I beg to differ. It's not a side effect, it's the proper way to get
rid of the grafts file. Grafts are temporary and ugly. In proper
repositories they are a sign of transition to a proper state.
The problem is that the process of fixing history is an iterative one,
which can take many months, and every time you make a change, the
correctness needs to be viewed using gitk.
For argument's sake, consider the repository at hand which I'm trying to
"fix": it has 33000 commits, distributed over eight branches with
roughly 3500 merges over a time period of 13 years.
The eight branches were eight separate CVS repositories which have
intersecting histories, and 3500 merges between CVS repositories (i.e.
branches).
If I need to backpatch a certain patch into history, it is likely that
in order to let the change ripple through, it will take 20000 commits to
be rewritten every time I make a slight change to history.
It's not really workable to ripple through 20000 commits every time I
make a historical change, yet I need to view the change in gitk.
Using git filter-branch, or git sequencer basically has the same
problem, I need to ripple through most of history to get to a state
which is viewable using gitk again. That is too long a turnaround
cycle.
Using the proposed grafts format, I can make changes incrementally, and
immediately viewable (though not cloneable) on the local repository using gitk.
Then after making all the necessary changes, one git filter-branch run
will "burn" the changes into the repository proper in one go
(renumbering all tags, branches and merges along the way).
--
Sincerely,
Stephen R. van den Berg.
You are confused; but this is your normal ...
Grafts are _much_ older than filter-branch and I'm not sure where did There's nothing ugly or necessarily temporary about grafts. One example of a completely valid usage is adding previous history of a project to it later. First, you don't need to carry around all the archived baggage you are probably rarely going to access anyway if you don't need to; changing a VCS is an ideal cutoff point. Second, you don't need to worry about doing perfect conversion at the moment of the switch. Third, even if you think you have done it perfectly, it will turn out later that something is wrong anyway. Fourth, it may not be actually _clear_ what the canonical history should be. Consider linux-kernel, you can graft the BitKeeper history (or one of possible candidates for the ideal conversion, though one is AFAIK clearly favoured), or you could also graft commit-per-tarball history even from the times before BitKeeper; you certainly don't want either in the current main history DAG. -- Petr "Pasky" Baudis The last good thing written in C++ was the Pachelbel Canon. -- J. Olson --
Not in direct documentation, but it is what comes across in posts on the mailing list like: http://kerneltrap.org/mailarchive/git/2008/6/10/2085624 That depends on the project, of course, and is not a valid statement in general. Part of the charm of full history is that git-blame and Not necessarily. I have automated the checkout-verification-process which basically checks out every revision from the respective old repository and binary-compares it with the corresponding revision in the git repository. This ensures a full binary match across the board. With respect to historical merges, I agree, those might not be completely correctly grafted, but the level of correctness can be determined at will, and once we achieve somewhere around 99% accuracy, That depends on the project. In my project it *is* clear, so this point doesn't make any difference. -- Sincerely, Stephen R. van den Berg. This is a day for firm decisions! Or is it? --
[...] I wanted to propose that a generic git-filter-branch filter based on a "generalized grafts" file should be accompanied by extending gitk so it understands this format to... ...but after reading the wonderful suggestion to create new commits with corrected contents, and insert them (replacing the older versions) using grafts, thought up and brought up independently by Dmitry Potapov and Petr Baudis, I think that you would be best off extending gitk to support this approach instead. You would have to extend gitk to maintain a reverse revision mapping (from revision to its children), and then you would be able to edit history interactively from within gitk, with gitk correcting its internal structures to redisplay changed commits, and creating commits and doing grafting behind the scenes for a later git-filter-branch run. -- Jakub Narebski Poland --
I don't think that the grafts file is the right place for this kind of information. Perhaps it would be better to have a separate file or even a directory with files where the commit-id identifies a text file with a new commit object, which should be used in place of the old one. That way it will be easy to tell git filter-branch to use this new information. However, if you want more than just the ability to edit commits in a text file, and also want to inspect changes using normal git commands and gitk (as is possible with grafts), it will require changes to the git core, which is perhaps not difficult to implement using pretend_sha1_file(), but I am not sure that everyone will welcome that... Dmitry --
On second thought, it may not be necessary. You can extract an old commit object, edit it, put it into Git with a new SHA1, and then use the graft file to replace all references from the old one to the new one. And you will be able to see changes immediately in gitk. Dmitry --
Hmmmm, interesting thought. That just might solve my problem.
In that case, I will stick to extending git fsck to check grafts more
rigorously and fix git clone to *refrain* from looking at grafts.
If anyone still wants the extended format, I'd be willing to implement
it, but my immediate itch for it is gone.
--
Sincerely,
Stephen R. van den Berg.
You are confused; but this is your normal state.
--
This script is just a proof of concept. It seems to work for me, but I haven't really tested it.
===========================================
#!/bin/bash
set -e

# create a throwaway repo
git init

# create some history
for ((i=0; i<10; i++))
do
	echo foo$i > foo$i
	git add foo$i
	git commit -m "add foo$i"
done

# run gitk to see it
gitk --all &

# dump all parent info to the grafts file
git rev-list --parents --all > .git/info/grafts.tmp
mv .git/info/grafts.tmp .git/info/grafts

# choose which commit you want to edit
echo
while read -p 'Edit commit: ' C
do
	C=$(git rev-parse "$C") || continue
	# edit commit C
	git cat-file commit "$C" > .git/COMMIT_OBJ
	vim .git/COMMIT_OBJ
	C2=$(git hash-object -w -t commit .git/COMMIT_OBJ)
	# replace all references from C to C2
	sed -e 's/\<'$C'\>/'$C2'/g' < .git/info/grafts > .git/info/grafts.tmp
	mv .git/info/grafts.tmp .git/info/grafts
done
===========================================
Dmitry
--
Linus suggested that "git-fsck and repacking should just consider it[grafts] to be an _additional_ source of parenthood rather than a _replacement_ source." http://article.gmane.org/gmane.comp.version-control.git/84686 Dmitry --
Yes, I know that's what he suggested, the way it should be implemented
IMO though is by checking once without and once with regard to grafts.
And still it should be such that git clone disregards grafts completely.
I'll fix both, eventually, since I need this functionality to verify
correctness for the projects I'm working on at the moment.
As for repack, it should probably ignore grafts, except for reference.
I.e. repack/gc should consider all mentioned SHA1s in the grafts file
to be referenced and undeletable.
--
Sincerely,
Stephen R. van den Berg.
You are confused; but this is your normal state.
--
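To make the last point concrete (repack/gc treating every SHA1 mentioned in the grafts file as referenced), the following is a minimal sketch that merely enumerates those SHA1s from the text file. The name graft_objects is hypothetical, and how such a list would be fed into the reachability walk is deliberately left out, since that is a change to git itself.

```shell
#!/bin/sh
# graft_objects FILE (hypothetical helper)
# Print every distinct SHA-1 mentioned in a grafts file, one per line.
graft_objects() {
    grep -v '^#' "$1" |              # drop comment lines
        tr -s ' ' '\n' |             # one token per line
        grep -E '^[0-9a-f]{40}$' |   # keep only well-formed SHA-1s
        sort -u                      # each object once
}
```

Both the grafted commits and their fake parents appear in the output, which matches the idea that none of them should be pruned while the grafts file still mentions them.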
I could see an argument that the only modes you really need are a) use grafts as replacements, and b) use grafts as additions. There is perhaps no need for c) ignore grafts. For example, say I wanted to give someone a copy of my repo that includes grafts (ignoring the fact that this is probably bad to do in general). He could git-clone it and then install a copy of my grafts file, as long as git-clone does (a) or (b) but not (c). On the other hand, if he just wants a copy of the "real" (graft-free) repo, then git-clone needs to do (b) or (c) but not (a). git-fsck needs (b), and most normal git operations want (a) (since that was the original purpose of grafts). Based on that, (c) is redundant, unless you're really concerned about not sending redundant objects to people who clone your repo that has grafts installed. But I think you probably shouldn't have people cloning your grafted repository anyway unless you know what you're doing, and if you know what you're doing, you probably want (b). If you see what I mean. Have fun, Avery --
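The three modes (a), (b) and (c) can be illustrated with a small sketch that computes the parent list a tool would see for one commit. This is only a model of the idea in shell, not how git implements grafts; the function name effective_parents and the mode names replace, add and ignore are made up for the example.

```shell
#!/bin/sh
# effective_parents MODE REAL GRAFT (hypothetical model, not git's code)
#   MODE  is replace (a), add (b) or ignore (c)
#   REAL  is the space-separated parent list stored in the commit object
#   GRAFT is the space-separated parent list from the grafts file
effective_parents() {
    case $1 in
        replace) echo "$3" ;;   # (a) the graft overrides the real parents
        ignore)  echo "$2" ;;   # (c) the graft is not consulted at all
        add)
            # (b) union of both lists, first occurrence wins, order kept
            echo "$2 $3" | tr -s ' ' '\n' |
                awk 'NF && !seen[$0]++' | paste -sd' ' -
            ;;
        *) echo "unknown mode: $1" >&2; return 2 ;;
    esac
}
```

For example, effective_parents add "p1 p2" "p2 p3" yields "p1 p2 p3", which is the view git-fsck would want under the "additional source of parenthood" reading, while replace yields "p2 p3".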
Yeah, thanks for a reminder.
http://thread.gmane.org/gmane.comp.version-control.git/37744/focus=37866
is still on my "things to look at" list.
--
This shows how the "object transfer ignores grafts" side of the earlier
suggestion by Linus would look like to get people started. Totally
untested.
I threw in for_each_commit_graft() in the patch so that updates to the
reachability walker can add otherwise hidden objects, but otherwise it is
not used yet.
builtin-pack-objects.c | 5 +++++
builtin-send-pack.c | 3 ++-
cache.h | 1 +
commit.c | 10 ++++++++++
commit.h | 2 ++
environment.c | 1 +
upload-pack.c | 1 +
7 files changed, 22 insertions(+), 1 deletions(-)
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 28207d9..53b0b33 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -30,6 +30,7 @@ git-pack-objects [{ -q | --progress | --all-progress }] \n\
[--threads=N] [--non-empty] [--revs [--unpacked | --all]*] [--reflog] \n\
[--stdout | base-name] [--include-tag] \n\
[--keep-unreachable | --unpack-unreachable] \n\
+ [--ignore-graft] \n\
[<ref-list | <object-list]";
struct object_entry {
@@ -2160,6 +2161,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
die("bad %s", arg);
continue;
}
+ if (!strcmp(arg, "--ignore-graft")) {
+ honor_graft = 0;
+ continue;
+ }
usage(pack_usage);
}
diff --git a/builtin-send-pack.c b/builtin-send-pack.c
index d76260c..d932352 100644
--- a/builtin-send-pack.c
+++ b/builtin-send-pack.c
@@ -27,6 +27,7 @@ static int pack_objects(int fd, struct ref *refs)
*/
const char *argv[] = {
"pack-objects",
+ "--ignore-graft",
"--all-progress",
"--revs",
"--stdout",
@@ -36,7 +37,7 @@ static int pack_objects(int fd, struct ref *refs)
struct child_process po;
if (args.use_thin_pack)
- argv[4] = "--thin";
+ argv[5] = "--thin";
memset(&po, 0, sizeof(po));
po.argv = argv;
po.in = -1;
diff --git a/cache.h b/cache.h
index 188428d..00858f9 100644
--- a/cache.h
+++ ...
This updates the earlier patch to teach the object transfer side to ignore
grafts, which makes things consistent between dumb commit walkers and
native transport. It is not meant for application as I haven't thought
about[*1*] nor looked into how this may interact with the "shallow clone"
stuff (which is graft in disguise but implemented separately).
Footnote. *1* I also suspect Linus did not think about interactions with
"shallow" when he made the suggestion referenced above, as "shallow" was
still a relatively new curiosity back then.
I am not sure if the addition of --ignore-graft to revision.c should be
there when this becomes real. I added it primarily for debugging
purposes, as it is something the end users should never trigger in the
normal workflow.
--
builtin-pack-objects.c | 5 +++
builtin-send-pack.c | 3 +-
cache.h | 1 +
commit.c | 10 +++++++
commit.h | 2 +
environment.c | 1 +
revision.c | 4 +++
t/t6500-graft.sh | 70 ++++++++++++++++++++++++++++++++++++++++++++++++
upload-pack.c | 2 +
9 files changed, 97 insertions(+), 1 deletions(-)
create mode 100755 t/t6500-graft.sh
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 28207d9..53b0b33 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -30,6 +30,7 @@ git-pack-objects [{ -q | --progress | --all-progress }] \n\
[--threads=N] [--non-empty] [--revs [--unpacked | --all]*] [--reflog] \n\
[--stdout | base-name] [--include-tag] \n\
[--keep-unreachable | --unpack-unreachable] \n\
+ [--ignore-graft] \n\
[<ref-list | <object-list]";
struct object_entry {
@@ -2160,6 +2161,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
die("bad %s", arg);
continue;
}
+ if (!strcmp(arg, "--ignore-graft")) {
+ honor_graft = 0;
+ continue;
+ }
usage(pack_usage);
}
diff --git a/builtin-send-pack.c ...
I don't think it would. You want to apply a patch through a part of the history. To do that, it is not sufficient to apply the patch to only one commit/tree and then fake parenthood of its child commits. You still need to apply the patch to all children. -- Hannes --
I am aware of that.
There are actually two common cases:
- Historical changes which are confined and don't ripple through. The
above solution works just fine for that.
- Ripple-through changes. They indeed need to be applied to every tree
in the first-parent chain. Even though this is going to take a
considerable amount of time, there still are certain advantages to
doing this using the method described above:
+ You can apply the patch to every commit/tree "interactively" if you want.
(Yes, I know, git-sequencer supports this one as well, but not the
next point).
+ You can view the change at any point in time (including in relation to the
tree that follows it), right after making the amendments (without letting
it ripple through to the end).
+ The ripple-through does not need to be performed in topological order,
i.e. eventually you'll have to touch everything, but you can do it
in the order you see fit (whatever is most efficient to work on).
+ If, at some point during the ripple-through process, you find out
that you forgot some change(s), you can abort or restart the
ripple-through without having spent all that time waiting for a
full-ripple-through.
Actually, ripple-through changes are rare. In the current project it
seems I need exactly one, but it's buried deep in the past (sadly).
The reason why I need it, is to make sure that git-bisect will work for
any revision in the past (i.e. the tree contained/contains some
too-clever-for-their-own-good $Revision$-expansion dependencies)
--
Sincerely,
Stephen R. van den Berg.
This is a day for firm decisions! Or is it?
--
But you do know that you don't need to apply the change *now*; you can apply it at bisect-time? Unless you expect you or your mere mortal coworkers are going to do dozens of bisects into that part of the history, I wouldn't change history *like*this*. But of course, I don't understand the circumstances enough, so... just my 2 cents. -- Hannes --
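Applying the change at bisect time, as suggested above, might look roughly like the sketch below. It follows the exit-code convention of git bisect run (0 means good, 125 means skip, other codes from 1 to 127 mean bad). The function name run_patched and the patch path in the usage note are hypothetical, and git reset --hard only restores tracked files, so files newly created by the patch or the test would need extra cleanup.

```shell
#!/bin/sh
# run_patched PATCH CMD [ARG...] (hypothetical helper for "git bisect run")
# Apply PATCH to the working tree, run the test command, then undo the patch.
run_patched() {
    patch=$1
    shift
    if git apply "$patch"; then
        # Run the real test and remember its verdict.
        status=0
        "$@" || status=$?
    else
        # The fix does not apply here, so this revision cannot be judged.
        status=125
    fi
    # Restore tracked files so bisect can check out the next revision.
    git reset -q --hard
    return $status
}
```

With the function saved in a file, a bisect session could then be driven by something like git bisect run sh -c '. ./run_patched.sh; run_patched /path/to/fix.patch make test', leaving the recorded history untouched.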
That is exactly the case, I do expect dozens of bisects.
--
Sincerely,
Stephen R. van den Berg.
This is a day for firm decisions! Or is it?
--
Yet the grafts file is exactly the place where this type of
Not quite sure why this makes it easier. The point is that there
is not supposed to be a grafts file in a proper repository. Thus,
having a lot of these files means a larger disruption to the core, and
I'd like the core to be as efficient and lean as possible given an empty
I'd want to avoid a plethora of files, and the changes that can be
specified are supposed to be partial overrides, not complete rewrites.
So using pretend_sha1_file() is a bit overkill and more than I was
aiming for.
The point is, that the changes in grafts (as they are now) are *not*
used when cloning. I.e. the only thing you mess up is your *own*
repository, not someone else's. I.e. you can't make someone remote
think that the repository has been altered. That would require git
filter-branch, which immediately changes all the historical SHA1s, and
makes the changes in history blatantly visible.
--
Sincerely,
Stephen R. van den Berg.
You are confused; but this is your normal state.
--
Please, don't. It adds completely unnecessary complexity and it is _not_ grafting anymore - look the word up in a dictionary. :-) Have a look at what you wrote above - now, Git already has a way to store all this information, right? In the commit objects! So, the real solution is to take the commit objects you want to modify, create new commit objects, then graft the new commit on all the old commit children. It fits neatly in the Git philosophy, there is no need at all to tweak the current infrastructure for this and it should be trivial to automate, too. -- Petr "Pasky" Baudis The last good thing written in C++ was the Pachelbel Canon. -- J. Olson --
Oops, sorry; I stopped reading the branch of the thread I thought was going off on a different tangent one post too early. :-) -- Petr "Pasky" Baudis The last good thing written in C++ was the Pachelbel Canon. -- J. Olson --
