We are running a rather complex Git tree with heavy use of git-rerere
(the -tip kernel tree, with more than 80 topic branches). git-rerere is
really nice in that it caches conflict resolutions, but there are a few
areas where it would be nice to have improvements:
- Fixing resolutions: currently, when i do an incorrect conflict
resolution, and fix it on the next run, git-rerere does not pick up
the new resolution but uses the old (buggy) one on the next run. To
fix it up i have to find the right entries in .git/rr-cache/* and
manually erase them. Would be nice to have "git-rerere gc <pathspec>"
to flush out a single bad resolution.
- File deletion: would be nice if git-rerere picked up git-rm
resolutions. We hit this every now and then and right now i know
which ones need an extra git-rm pass.
- Automation: would be nice to have a git-rerere modus operandi where
it would auto-commit things if and only if all conflicting files were
resolved.
- Sharing .git/rr-cache. It's quite a PITA to share the .git/rr-cache
amongst -tip maintainers right now. It seems to have dependencies on
the index file, so if we want to share the conflict resolution data,
we have to copy our index file (which is dangerous anyway and assumes
very similar repositories).
It would be much nicer if we could share conflict resolutions with
each other - and with others as well. For example linux-next could
re-use our conflict resolution data as well - often Stephen Rothwell
has to re-do the same conflict resolution as well, creating
duplicated work.
( Also, it's a GPL nitpicky issue: the conflict resolution database
can be argued to be part of "source code" and as such it should be
shared with everyone who asks. With trivial merges the data is
probably not copyrightable hence probably falls outside the scope
of the GPL, but with a complex topic tree like -tip with dozens of
conflict resolutions, the ...

- At least, compress the data in the rr-cache. It can grow big quite easily.
Also, I wonder if keeping the entire files is not overkill...

Mike
--
Actually it would be rather straightforward to put it in the usual git store, and represent the current rr-cache with a flat file that points to the in-git preimage/postimages, and make git-gc aware of those. This would deal with the huge number of files + compression quite easily. I'm quite sure it's pretty straightforward actually :) -- Pierre Habouzit, madcoder@debian.org http://www.madism.org --
Actually, this is probably a required step in the direction of sharing such things btw. -- Pierre Habouzit, madcoder@debian.org http://www.madism.org --
Perhaps an approach similar to the 'notes' implementation can be used, in which a separate branch is created to contain the notes. This way the rerere information (being the 'rerere' branch) can be shared easily (by just pulling the branch), and as said we get free compression. Another advantage would be that you automagically get the ability to unlearn a bad rerere by simply (partially) reverting a commit on the rerere branch! -- Cheers, Sverre Rabbelier --
FWIW, StGit is well on its way to store its patch metadata in a git branch, for much the same reasons. -- Karl Hasselström, kha@treskal.com www.treskal.com/kalle --
For a more complex merge resolution, granted that it rises to the level of being "copyrightable", but I think it would be a huge stretch to call the rr-cache the "preferred form for modifications"! :-) - Ted --
yeah - i'm not really arguing any detail of the GPL here. I'm arguing the principle: there should be no technical asymmetry between maintainer and contributor. So if i am able to run an effort-free integration of 85 topic branches, i'd like contributors (who will eventually grow up into co-maintainer roles in the future) to be able to do the same, if they want to do so. right now that is simply not possible technically - it's even very hard to share a .git/rr-cache with a co-maintainer whom i can trust with my index file. (which is an otherwise unsafe private binary cache that i'd not put into a public repository as it could in theory contain lots of unrelated data and is not endian-safe, etc.) Ingo --
Where did you get the idea that .git/index is involved in any way, I wonder... --
so it's only the rr-cache metadata that is involved? We had a few cases
where git-rerere sessions were not repeatable by copying the
.git/rr-cache, so i just assumed that there's some extra metadata in the
index file. When that happened i took a look at git/builtin-rerere.c:
	static int find_conflict(struct path_list *conflict)
	{
		int i;
		if (read_cache() < 0)
			return error("Could not read index");
and (mistakenly) assumed that git-rerere depends on having something in
the index file - but on a second look it just checks out the conflicting
file(s) from the index file, right?
Ingo
--
The binary part of the index should be in network byte order and endian
safe. But it is not necessary to share the index. Well, if you think
about it, it would be mighty silly if index had any long term effect on
the operation of rerere, which is all about "I've done many conflict
resolutions in the past. My work tree state (including the index) came
back to a state similar to the conflicted state I saw some time ago.
Let's reuse the previous resolution if we can." You might have switched
branches, run "reset --hard" and done 47 thousand different things to your
index since you resolved the conflict you are about to re-resolve ;-).
The replay and conflict recording codepath of rerere goes like this:
* read the index, list the paths that have conflicts;
* inspect the conflicted blob to compute the conflict signature $sig and
store the sig and path in MERGE_RR;
* look into rr-cache/$sig; does it have already a conflict resolution
recorded?
- If so, modify the file in the working tree the same way to bring
rr-cache/$sig/preimage to rr-cache/$sig/postimage by 3-way merge.
- if not, record the file in the working tree as rr-cache/$sig/preimage
The resolution recording codepath goes like:
* see if any paths listed in MERGE_RR are resolved in the index;
* look into rr-cache/$sig for such resolved path. Does it already record
a resolution?
- If not, we have a new resolution we can use. Record it as
rr-cache/$sig/postimage for later use.
So rerere _does_ look at the index to decide what entries in rr-cache are
relevant and applicable. But other than that, it is not used. I do not
think there is any reason to copy the index to be able to reuse rr-cache.
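
The two codepaths above can be sketched in shell as follows. This is only an illustration of the bookkeeping, under some assumptions: the function names are made up, the conflict signature is passed in rather than computed, and the actual replay (the 3-way merge from preimage to postimage) is not performed, only reported.

```sh
#!/bin/sh
# Sketch only: models the rr-cache/$sig/{preimage,postimage} bookkeeping
# described above. on_conflict and on_resolved are hypothetical names.

# A conflicted path was found.
#   $1 = rr-cache directory, $2 = conflict signature, $3 = conflicted file
on_conflict() {
	if [ -f "$1/$2/postimage" ]; then
		# A resolution is already recorded; rerere would now replay it.
		echo "replay"
	else
		# First sighting: remember the conflicted file as the preimage.
		mkdir -p "$1/$2" && cp "$3" "$1/$2/preimage" &&
		echo "recorded preimage"
	fi
}

# A path listed in MERGE_RR was resolved by the user.
#   $1 = rr-cache directory, $2 = conflict signature, $3 = resolved file
on_resolved() {
	if [ -f "$1/$2/postimage" ]; then
		echo "already recorded"
	else
		mkdir -p "$1/$2" && cp "$3" "$1/$2/postimage" &&
		echo "recorded postimage"
	fi
}
```

Nothing in the sketch touches the index, which matches the point above: the index only tells rerere which paths to look at.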
--
I agree this is a real issue (I sometimes know that the resolution is iffy and say "rerere clear" to choose not to record it, but that is working around the issue with perfect foresight and is not a solution). I think (and I think you would agree) "gc" is not the right word but rather you would want to more actively discard the wrong one. I agree that it is the right UI to do this to specify paths right after you found that a bad resolution that was recorded previously was used by rerere (I think that is what you are suggesting). Upon such a request, we should undo the bad resolution and bring the working tree copy to the ...

I originally did not have need for anything other than three-way conflict resolving to a result. I do not know how safe reapplying a removal to ...

I am not sure how safe this is. rerere as originally designed does not even update the index with merge results so that the application of earlier resolution can be manually inspected, and this is exactly because I consider a blind textual reapplication of previous resolution always iffy, even though I invented the whole mechanism.
--
We use a 'safe, lazy integration' method in -tip, that basically has external checks against any integration bugs. Basically, we integrate only about once a day, and we advance the topic branches but do not reintegrate on every topic merge. We merge commits _both_ to their target topic branches, and to the (previous) integration branch.

Then once a day (or every second day) we 'reintegrate': we propagate the topic branches to the linux-next auto-*-next branches [recreating them from scratch] and flush out the messy criss-cross merges from the integration tree. But that is always an identity transformation as far as the integration result is concerned: the result of the integration run must be exactly the same content (obviously it results in a very different tree structure) as the previous one.

We only run it on a perfectly tested tree so we know none of our previous merges were wrong, and we want the git-rerere result to be the same. We repeat the integration until the end result matches. In fact sometimes git-rerere is able to pick up a conflict resolution from our 'messy' delta-merge into the integration tree, which is an added bonus. (this doesn't always work if the merge order differs from integration order)

Anyway, the gist is that in this workflow it does not hurt at all if git-rerere is "unsafe", and we'd love to have the integration as fast as possible. Right now most of my manual overhead is in making sure that git-rerere has not missed some file. With ~100 conflicting files tracked, that is rather error-prone, and i'd love to have further automation here besides a rather lame method of grepping for: "Resolved 'kernel/Makefile' using previous resolution." type of patterns in git-merge output.

So i'd not mind if git-rerere was safe by default, but it would be nice to have some knob to turn it into something fast and automatic. For us it would be much _safer_, because right now most of our manual energy is spent on ...
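
The grepping described above could be scripted roughly like this. It is a sketch, not the -tip tooling: the function names are hypothetical, and it assumes the merge output was saved to a file and that the list of conflicting paths is kept in another file, one path per line. Only the "Resolved '...' using previous resolution." message format is taken from the message above.

```sh
#!/bin/sh
# Sketch: report conflicting paths that rerere did NOT resolve.

# Read git-merge output on stdin, print the paths rerere resolved.
rerere_resolved_paths() {
	sed -n "s/^Resolved '\(.*\)' using previous resolution\.\$/\1/p"
}

# $1 = file listing the expected conflicting paths, one per line
# $2 = file holding the saved git-merge output
missed_by_rerere() {
	resolved=$(mktemp) || return 1
	rerere_resolved_paths < "$2" | sort > "$resolved"
	# comm -23 keeps lines that appear only in the first input.
	sort "$1" | comm -23 - "$resolved"
	rm -f "$resolved"
}
```

An empty result means every tracked conflict was replayed; anything printed needs a manual look.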
Oh, "unsafe switch" that is off by default will not hurt anybody, and I do
not mind it as a new feature. We are in agreement in that sense.
Perhaps the way forward would be (and this is independent of the issue of
recording removal as a possible form of resolution):
(1) Introduce a new configuration rerere.autoupdate that is off by
default, but when it is on, paths cleanly resolved by rerere will
also be updated in the index (if we have capability to record
removal, this may remove such a path from the index as the result).
(2) The callers of rerere that expect rerere to resolve need to be
changed to see if the resulting index after rerere is fully merged,
and continue. Currently the callers are "merge", "rebase" and "am",
I think. This step might be a bit more involved than you might
think, as rerere currently happens in the codepath that knows the
caller does _not_ go further than leaving the failed conflict to be
sorted out by the user (rerere is designed as merely a way to help).
Also you _might_ want a separate configuration rerere.autocommit to
control this --- the user (but not you) might be willing to allow
autoupdate but you may still want to eyeball the result.
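
As a rough illustration of step (2), a caller could check the index and continue only when nothing is left unmerged. This is a sketch under assumptions: rerere.autocommit is the hypothetical knob named above, the function name is made up, and "git commit --no-edit" is taken from later git versions than the one discussed here.

```sh
#!/bin/sh
# Sketch: conclude the merge only if rerere resolved every path and the
# user opted in. commit_if_fully_merged is a hypothetical helper.
commit_if_fully_merged() {
	# Hypothetical configuration key; proceed only when it is set to true.
	[ "$(git config --bool rerere.autocommit)" = "true" ] || return 1

	# "git ls-files -u" lists unmerged index entries; any output means
	# at least one conflict still needs a human.
	if [ -n "$(git ls-files -u)" ]; then
		echo "unmerged paths remain; leaving the merge to the user" >&2
		return 1
	fi

	# Everything is resolved in the index: record the merge commit.
	git commit --no-edit
}
```

With the knob off, or with any unmerged entry, the function returns non-zero and the caller falls back to today's behaviour of stopping for the user.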
Independent of the above, we have two potential new features:
* Introduce "git rerere revert paths..." that brings the index and
working tree back to the conflicted state after a previous resolution
is applied, because that resolution is incorrect. The old resolution
cached in rr-cache is also removed.
This however will become much less useful if you allow autoresolution
to be committed automatically, as the caller will move ahead without
giving you a chance to say "oh, that one is bad -- do not proceed".
* Somehow record the fact that the resolution for a particular conflict
signature is to remove the resulting path.
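
The cache-removal half of the proposed "git rerere revert" amounts to what is done by hand today. A sketch, with assumptions: the function names are hypothetical, and MERGE_RR is assumed to hold NUL-terminated "signature<TAB>path" records. Restoring the index and working tree to the conflicted state is not shown.

```sh
#!/bin/sh
# Sketch: drop the cached resolution for one path.

# Print the conflict signature recorded for a path.
#   $1 = MERGE_RR-style file, $2 = path
sig_for_path() {
	tr '\000' '\n' < "$1" |
	awk -F '\t' -v p="$2" '$2 == p { print $1 }'
}

# Remove rr-cache/$sig for the given path.
#   $1 = git directory (e.g. .git), $2 = path
forget_resolution() {
	sig=$(sig_for_path "$1/MERGE_RR" "$2")
	if [ -z "$sig" ]; then
		echo "no rerere entry for $2" >&2
		return 1
	fi
	rm -rf "$1/rr-cache/$sig"
}
```

Because MERGE_RR only exists while a conflicted merge is in progress, this has to be run right after the bad resolution was applied, which is the same timing suggested for the proposed command.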
--
just to demonstrate it, i tried today to do an octopus merge of 87 topic branches:

  git-merge build checkme core/checkme core/debugobjects core/futex-64bit core/iter-div core/kill-the-BKL core/locking core/misc core/percpu core/printk core/rcu core/rodata core/softirq core/softlockup core/stacktrace core/topology core/urgent cpus4096 genirq kmemcheck kmemcheck2 mm/xen out-of-tree pci-for-jesse safe-poison-pointers sched sched-devel scratch stackprotector timers/clockevents timers/hpet timers/hrtimers timers/nohz timers/posixtimers tip tracing/ftrace tracing/ftrace-mergefixups tracing/immediates tracing/markers tracing/mmiotrace tracing/mmiotrace-mergefixups tracing/nmisafe tracing/sched_markers tracing/stopmachine-allcpus tracing/sysprof tracing/textedit x86/apic x86/apm x86/bitops x86/build x86/checkme x86/cleanups x86/cpa x86/cpu x86/defconfig x86/delay x86/gart x86/i8259 x86/idle x86/intel x86/irq x86/irqstats x86/kconfig x86/ldt x86/mce x86/memtest x86/mmio x86/mpparse x86/nmi x86/numa x86/numa-fixes x86/pat x86/pebs x86/ptemask x86/resumetrace x86/scratch x86/setup x86/smpboot x86/threadinfo x86/timers x86/urgent x86/urgent-undo-ioapic x86/uv x86/vdso x86/xen x86/xsave

it failed miserably:

  warning: ignoring 066519068ad2fbe98c7f45552b1f592903a9c8c8; cannot handle more than 25 refs
  [...]
  fatal: merge program failed
  Automated merge did not work.
  Should not be doing an Octopus.
  Merge with strategy octopus failed.

this wasn't even for purposes of an integration run: all i wanted to do was to pick up 2-3 new commits i have queued into 2-3 topic branches, into the (throw-away) integration branch. All the other branches were unmodified and already merged into the integration branch. Hence i believe that the suggestions above by Git that i'm doing something wrong are ... wrong :-)

My scripting around this would be a lot faster (less than 10 seconds runtime versus a minute currently) and more robust if we could do such higher-order ...
The upcoming builtin-merge won't have this problem. I have added a testcase for this in my working branch: http://repo.or.cz/w/git/vmiklos.git?a=commit;h=7eef40b3cd772692c6eb7520686300533f35f10c
cool, thanks a ton! stupid question: does this mean that if i install the latest Git devel snapshot (v1.5.6-rc3-21-g8c6b578 or later), i'll be able to experiment around with it right now? Ingo --
Nope. It is currently in the 'builtin-merge' branch of git://repo.or.cz/git/vmiklos.git, and I'm working on getting it merged after 1.5.6 is out.
some hard numbers. Doing a scripted loop of 80 git-merges is 16.2 seconds:

  earth4:~/tip> time ( for N in $(cat 11 12 13 14); do git-merge $N; done )
  [...]
  Already up-to-date.

  real 0m16.211s
  user 0m10.719s
  sys 0m5.604s

doing the octopus merge of 4x 20 branch octopus merges is 11.6 seconds:

  earth4:~/tip> time ( for N in 1 2 3 4; do git-merge $(cat 1$N); done )
  Already up-to-date. Yeeah!
  Already up-to-date. Yeeah!
  Already up-to-date. Yeeah!
  Already up-to-date. Yeeah!

  real 0m11.580s
  user 0m8.617s
  sys 0m2.895s

a 40% speedup - and would be another 10% faster with an order-of-80 merge as well i think. Not to be sniffed at.

Ingo
--
As a part of the patch series introducing new fast-forward strategies (--ff=never, --ff=only) there was a patch which did merge reduction before selecting the merge strategy, by Sverre Hvammen Johansen: "[PATCH 4/5] Head reduction before selecting merge strategy" http://thread.gmane.org/gmane.comp.version-control.git/80288/focus=80335 (I'm not sure if the link above is to the newest version of the patch series). It is now part of the 'pu' branch, as commit 59171adb9c. It didn't make it into 'next' as it conflicts with builtin merge by Miklos Vajna, which (as he wrote) also includes head reduction. So you would either have to compile git from the builtin-merge repository, compile git from 'pu', just use git-merge.sh from the 'pu' branch, or apply or cherry-pick the appropriate commit and compile git. -- Jakub Narebski Poland ShadeHawk on #git --
Side note: builtin-merge does not have a problem with merging 25+ refs even in case every ref contains "new" commits. The patch by Sverre Hvammen Johansen is useful if some of the refs have no "new" commits, so it will help here, but I think it does not help in all cases.
So how many parents can a commit have, exactly? Is there a hard limit somewhere, or just a point beyond which some git tools will start behaving strangely? -- Karl Hasselström, kha@treskal.com www.treskal.com/kalle --
On Thu, Jun 19, 2008 at 09:23:08AM +0200, Karl Hasselström <kha@treskal.com> wrote:
AFAIK there is no limit at a core level. git-show-branch has a limit of 25 refs (it can't show more than 25 refs at one time) and git-merge.sh uses show-branch, while builtin-merge does not.
There is no hard limit at the data structure level. git-commit-tree has a hard limit of accepting 16 parents. git-blame has the same 16-parent limit while following the history (but the one in 'next' has lifted the latter limitation). But that is purely academic. Anybody who does an octopus with more than 8 legs should get his head examined ;-). --
Catalin and I are tossing ideas around for how to represent the history of an StGit patch stack (using a git commit for each log entry). One complication is that we have to keep references to all unapplied patches so that gc will leave them alone (and so that they will get carried along during a pull, in the future). And the number of unapplied patches is potentially large, so I thought we'd be going to have to make a tree of "merge" commits to connect them all up. (What we'd really like, of course, is a way to refer to a set of commits such that they are guaranteed to be reachable (in the gc and pull sense), but not considered "parents".) -- Karl Hasselström, kha@treskal.com www.treskal.com/kalle --
On Thu, Jun 19, 2008 at 10:21:56AM +0200, Karl Hasselström <kha@treskal.com> wrote:
I had a similar problem in git/vmiklos.git on repo.or.cz, while working on builtin-rebase: I squash several patches using rebase -i before sending a series, but it's nice to have the old long list of small patches in case I would need them later. What I did is to have a rebase-history branch: each commit in it is an octopus merge:
- The first parent is the previous rebase-history ref
- The second is the old HEAD
- The third is the new HEAD
This way I can use git rebase -i without worrying about losing history, even if reflogs are not shared among machines. (It may or may not be a good idea to do something like this in StGit, I just thought I'd share this idea here.)
What you're describing is pretty much what we're thinking about doing -- have a log branch where each commit contains enough metadata to recreate the complete patch stack state at that point in time, and has all the parents it needs to be safe from gc. The particular problem I'm asking about here is that due to StGit's concept of "unapplied" patches that are per definition not reachable from the current branch head, a given log entry might have to keep an unbounded number of commits from being gc'ed. Thus my question about what would blow up if we were to make a commit with 50 parents. Or 100. Or 1000, if our users are crazy enough. (The alternative being, of course, to make a tree of octopuses with a fixed maximum fan-out.) -- Karl Hasselström, kha@treskal.com www.treskal.com/kalle --
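
The "tree of octopuses with a fixed maximum fan-out" alternative could be built level by level, as in the sketch below. The names are hypothetical, and make_merge only fabricates a label so the grouping can be tried without a repository; a real version would create a commit with the given parents (for instance with git commit-tree and one -p option per parent) and print its id.

```sh
#!/bin/sh
# Sketch: connect any number of commits under one root while giving no
# commit more than a fixed number of parents.

# Hypothetical stand-in for "create a commit with these parents and print
# its id". Here it just prints a label such as M(a,b,c).
make_merge() {
	echo "M($(echo "$*" | tr ' ' ','))"
}

# $1 = maximum number of parents per commit (at least 2), rest = commit ids
octopus_tree() {
	fanout=$1; shift
	[ "$fanout" -ge 2 ] || { echo "fan-out must be at least 2" >&2; return 1; }
	# Keep collapsing groups of up to $fanout ids into one merge each
	# until the remaining ids fit under a single commit.
	while [ "$#" -gt "$fanout" ]; do
		next=
		while [ "$#" -gt 0 ]; do
			group=
			n=0
			while [ "$#" -gt 0 ] && [ "$n" -lt "$fanout" ]; do
				group="$group $1"; shift; n=$((n + 1))
			done
			if [ "$n" -eq 1 ]; then
				# A lone leftover needs no wrapper commit.
				next="$next$group"
			else
				next="$next $(make_merge $group)"
			fi
		done
		set -- $next
	done
	make_merge "$@"
}
```

For example, "octopus_tree 3 a b c d" groups a, b and c under one intermediate merge and then joins that merge with d under the root, so the number of commits added grows only slowly with the number of unapplied patches.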
On Thu, Jun 19, 2008 at 11:19:03AM +0200, Karl Hasselström <kha@treskal.com> wrote:
I may be missing something, but you have (at least) two options to store "patches". You can store them as blobs, make a tree of them and make a commit in the log branch point to the tree. This one has the advantage of being able to do a 'git log' on a particular patch of the patch set. The other one is to create n+1 trees (and commits, where the first commit has no parent) for n patches, and point to the last commit from the log branch.
If I don't store the pre or post tree in its entirety, I lose the ability to do patch application by three-way merge. (The current StGit design assumes that we can always make a three-way merge as a last resort when applying patches. Basically, StGit is just a fancy way to rebase.) But yes, this is a viable idea. (Though once I have to store one of the trees, I believe it's actually simpler and cheaper to just store the other tree as well, instead of having to compute the diff and ...

There's actually no point in making more than one commit. A tree can easily hold a lot of sub-trees. I have an existing implementation that stores the pre and post tree for each patch, plus some metadata (message, author).

The issue with this format is that every time we write a new log entry (that is, for every StGit command), we have to call git multiple times in order to write several new trees and blobs. StGit normally represents each patch by a commit object, so it should be faster to simply write a single new commit to the log that has some metadata in its commit message and just refers to all the patches' commit objects (by having them as parents). Which is why I was inquiring about the maximum number of parents of a commit object.

( Some background: At a given point in time, your StGit stack consists of a few applied patches, and a few unapplied patches. The applied patches are just a linear sequence of commits at the top of your current branch, so we can trivially save them all from the garbage collector by making the stack top a parent of our log commit. The unapplied patches, however, are commits that are not reachable from the stack top -- they can be "pushed" onto the stack by rebasing, at which point they become applied, but until then we can't make any assumptions about them being ancestors of anything. So a log commit potentially has to have _every_ unapplied patch as a parent. (If we know that the commit of an unapplied patch used to be applied, we ...
By the way, this safety is not a theoretical issue but has been a real one. I had two topics that changed the calling convention of the same function in different ways, and when they were merged to 'pu', the declaration, definition, and call sites that existed on both of these branches were handled beautifully by rerere. Recording autoresolution would have been the wrong thing to do. One of the branches added a new call site to a file that was not among the ones that conflicted in the merge between the two branches. That call site, which uses the calling convention of one branch, needed to be adjusted to accommodate the change of calling convention from the other branch (from textual merge's point of view, this has to be an evil merge). I had to make and keep a mental note about that new call site until both topics graduated to 'master' (similar to your need to remember a particular merge is resolved to removal right now).

To safely automate reapplication of such a merge, rerere needs to become much more clever. The conflicts rerere notices and records are strictly per blob. A conflicted merge to a blob is inspected and a "conflict signature", which becomes the directory name under rr-cache, is computed. We record the conflicted blob as a whole as the preimage, and your hand resolution as a whole as the postimage. Next time when you have a conflicted merge to a blob, and the conflict has the exact same conflict signature, we run three-way merge between the recorded preimage, postimage and the new conflicted result.

If we want to handle new call sites added only on a single side, you should be able to express something like "when a merge has a conflicted blob with this conflict signature, look in the whole tree, even outside the set of conflicted paths, and change this text to that". This is too much automation and I somehow think the potential for errors (both from the tool and from the user) is too high.
--
in our workflow, we don't ever do any semantic things during the integration run. I.e. we don't put more complex merge changes into the integration merge commits. Such integration effects do come up occasionally (especially when a topic changes some widely used infrastructure), and we handle them via separate merge branches. The current ones in -tip are tip/tracing/ftrace-mergefixups and tip/tracing/mmiotrace-mergefixups. They are one or two orders of magnitude more rare than regular conflicts, and they show up immediately during testing. (or we anticipate them beforehand) i.e. we'd like to have a 'dumb' phase of integration, as much cached and automated as possible. Things that need more thought need to go into separate branches anyway, for better reviewability - merge commits are rather hard to debug as they hide their true contents, so we try to keep them simple and contextual only. Ingo --
another git-rerere observation: occasionally it happens that i
accidentally commit a merge marker into the source code.
That's obviously stupid, and it normally gets found by testing quickly,
but still it would be a really useful avoid-shoot-self-in-foot feature
if git-commit could warn about such stupidities of mine.
( and if i could configure git-commit to outright reject a commit like
that - i never want to commit lines with <<<<<< or >>>>> markers)
Another merge conflict observation is that Git is much worse at figuring
out the right merge resolution than our previous Quilt based workflow
was. I eventually found it to be mainly due to the following detail:
sometimes it's more useful to first apply the merged branch and then
attempt to merge HEAD, as a patch.
I've got a script for that which also combines it with the "rej" tool,
and in about 70%-80% of the cases where Git is unable to resolve a merge
automatically it figures things out. ('rej' is obviously a more relaxed
merge utility, but it's fairly robust in my experience, with a very low
false positive rate.)
The ad-hoc "tip-mergetool" script we are using is attached below. It's
really just for demonstration purposes - it doesn't work when there's a
rename related conflict, etc.
Peter Zijlstra also wrote a git-mergetool extension for the 'rej' tool
btw., he might want to post that patch. I've attached Chris Mason's rej
tool too.
Ingo
# Pick the first unmerged path; with an argument, the first one matching it.
[ "$#" = 0 ] && {
	SRC=`git-ls-files -u | cut -f2 | head -1`
} || {
	SRC=`git-ls-files -u | grep "$1" | cut -f2 | head -1`
}
[ "$SRC" = "" -o ! -f "$SRC" ] && { echo "$1 has no conflicts!"; exit 1; }
SRC_SED=`echo "$SRC" | sed 's/\//\\\\\//g'`
# Blob IDs of the three index stages: 1 = base, 2 = ours, 3 = theirs.
SHA_1=`git-ls-files -u | grep "$SRC" | grep '^.* .* 1\>' | cut -d' ' -f2`
SHA_2=`git-ls-files -u | grep "$SRC" | grep '^.* .* 2\>' | cut -d' ' -f2`
SHA_3=`git-ls-files -u | grep "$SRC" | grep '^.* .* 3\>' | cut -d' ' -f2`
# Keep the conflicted file around under a new name.
mv -b "$SRC" "$SRC.automerge" || { echo error1; exit 1; }
git-diff $SHA_1 $SHA_2 ...

This is what I run with.
I added the cp to the 3-way merge tools because I think it's stupid to
see the messed up merge markers instead of the original file.
The rej target basically takes the local version and takes the diff
between base and remote and applies that as a patch, upon failure it
invokes rej to fix up the mess.
--- /usr/bin/git-mergetool 2008-04-08 19:01:37.000000000 +0200
+++ git-mergetool 2008-06-02 19:00:55.000000000 +0200
@@ -214,12 +214,14 @@ merge_file () {
;;
meld|vimdiff)
touch "$BACKUP"
+ cp -- "$BASE" "$path"
"$merge_tool_path" -- "$LOCAL" "$path" "$REMOTE"
check_unchanged
save_backup
;;
gvimdiff)
touch "$BACKUP"
+ cp -- "$BASE" "$path"
"$merge_tool_path" -f -- "$LOCAL" "$path" "$REMOTE"
check_unchanged
save_backup
@@ -271,6 +273,13 @@ merge_file () {
status=$?
save_backup
;;
+ rej)
+ touch "$BACKUP"
+ cp -- "$LOCAL" "$path"
+ diff -up "$BASE" "$REMOTE" | patch "$path" || rej "$path"
+ check_unchanged
+ save_backup
+ ;;
esac
if test "$status" -ne 0; then
echo "merge of $path failed" 1>&2
@@ -311,7 +320,7 @@ done
valid_tool() {
case "$1" in
- kdiff3 | tkdiff | xxdiff | meld | opendiff | emerge | vimdiff | gvimdiff | ecmerge)
+ kdiff3 | tkdiff | xxdiff | meld | opendiff | emerge | vimdiff | gvimdiff | ecmerge | rej)
;; # happy
*)
return 1
--
While we're on the subject, I only found one tool that 'digs' these merge markers and that is xxdiff --unmerge. One would think more tools understand these merge markers, but I couldn't find any. --
The right place for this is in a pre-commit hook, which can look at what you are about to commit and decide if it is OK. In fact, the default pre-commit hook that ships with git performs this exact check. You just need to turn it on with:

  chmod +x .git/hooks/pre-commit

-Peff
--
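
For illustration, a check of this kind could look like the sketch below. It is not the hook that ships with git: the function name is hypothetical, and it only looks for the "<<<<<<<" and ">>>>>>>" marker lines, since a bare "=======" line can appear legitimately in some files.

```sh
#!/bin/sh
# Sketch: succeed (exit status 0) if any of the given files contains a
# conflict marker line, printing file:line:text for each hit.
has_conflict_markers() {
	# /dev/null is appended so that grep always prints the file name,
	# even when only one file is given.
	grep -nE '^(<<<<<<<|>>>>>>>)( |$)' "$@" /dev/null
}

# In a pre-commit hook this might be used roughly as:
#   if has_conflict_markers $(git diff --cached --name-only); then
#       echo "conflict markers found, refusing to commit" >&2
#       exit 1
#   fi
```

The commented usage splits file names on whitespace, so paths containing spaces would need more careful handling.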
From what I remember, some time ago on the git mailing list there was an idea for git-rerere2, which would record resolutions on the tree level, i.e. record file renames. It could probably record file deletion as well... Did someone implement it, or did it remain a loose idea? -- Jakub Narebski Poland ShadeHawk on #git --
Hi,

I was dreaming about having "git rerere infer-from <merge-commit>". This would be
- more versatile, as you do not have to ask the guy to share the cache,
- would avoid transmitting lots of data that can be inferred from the data,
- would avoid relying on the honesty of the person sharing the cache, and
- it would put all license wieners^Wissues at rest.

FWIW this is in my TODO list, but I am unlikely to get to it, least of all before 1.5.6 comes out.

Ciao,
Dscho
--
