First, sorry for joining the thread late. As far as I can see, the idea I
want to share here has not been mentioned by anybody yet.

So, git revert already includes the "origin" of the commit in the commit
message, and I think that is fine for most people.

What about adding an option to cherry-pick to add a similar
"commit 7b27718bdb1b70166383dec91391df5534d449ee upstream" or similar
string to the commit message?

As far as I can see, the kernel -stable tree already has this, but it is
added manually and in many different forms, like:

[ Upstream commit 5f3a9a207f1fccde476dd31b4c63ead2967d934f ]
commit 7b27718bdb1b70166383dec91391df5534d449ee upstream
Already in Linus' tree:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...etc.
If git provided a standard way to do this, it could be used to
avoid this inconsistency.
It's already there: -x
Nicolas
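
For readers who want to see what the different forms mean for a tool: "git cherry-pick -x" appends a line of the form "(cherry picked from commit <sha1>)" to the message, while the -stable forms quoted above are hand-written. A minimal sketch of extracting the referenced commit from any of these forms follows. The function name origin_commits and the exact patterns are assumptions made for illustration, not part of git.

```python
from __future__ import annotations

import re

SHA = r"[0-9a-f]{40}"

# One pattern per form. The first is the line appended by "git cherry-pick -x";
# the other two are the hand-written -stable forms quoted in this thread.
PATTERNS = [
    re.compile(rf"^\(cherry picked from commit ({SHA})\)$"),
    re.compile(rf"^\[ Upstream commit ({SHA}) \]$"),
    re.compile(rf"^commit ({SHA}) upstream\.?$"),
]


def origin_commits(message: str) -> list[str]:
    """Return every commit id referenced by a recognised origin line."""
    found = []
    for line in message.splitlines():
        line = line.strip()
        for pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                found.append(match.group(1))
                break
    return found
```

Every new hand-written variant needs another pattern, which is the maintenance cost the original question was getting at.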
--
Thanks, and sorry for not reading the manpage carefully before sending
the mail.
I think this is misguided. In the general case, cherry-picks can be from
completely unrelated histories, and if you are doing the cherry-pick,
you are saying that actually, the history *does not matter*. In that
case, this kind of link tries to impose a meaning where there is none,
and in an ill-defined way, since whether the commit is actually around
anywhere is essentially random.

Why do you actually *follow* the origin link at all anyway? Without its
parents, the associated tree etc., the object is essentially useless for
you; the authorship information and commit message should've been
preserved by a proper cherry-pick anyway. You're cluttering the object
store with invalid objects, which also breaks quite some fundamental
logic within Git (which assumes that if an object exists, all its
references are valid - give or take a few special cases like shallow
repositories, but this would have very different characteristics).

Having history browsers draw fancy lines is fine but I see nothing wrong
with them extracting this from the free-form part of the commit message.
For informative purposes, we don't shy away from heuristics anyway, cf.
our rename detection (heck, we are even brave enough to use that for
merges).

--
Petr "Pasky" Baudis
The next generation of interesting software will be done
on the Macintosh, not the IBM PC. -- Bill Gates
--
As my workflow sometimes makes me do cherry-picks where history does
matter, in the form of a "partial merge" of one or more commits
from branch A into branch B, which do not necessarily have to
be directly related, is there a way to perform something like that
while keeping history?
Perhaps I'm damaged by having used CVS for too long, and merging just
some files, or abusing CVS internals to make some files on branch A
also be part of branch B by having their branches point to the same RCS
branch revision number, but sometimes I find that I miss being able to
do it in Git.

Not really a big deal, just curious.
--
\\// Peter - http://www.softwolves.pp.se/
--
... and it works only because false positives (which make the merge
actively wrong) are extremely rare. If there is the occasional false
negative, well, it just makes merges more complicated and screws up
visualization a bit, so you live with it.

Move/copy detection almost always worked for me, but there are two cases
where it didn't:

1) empty files. Each of them is marked as copied from a
seemingly-picked-at-random one.

2) renaming a Gtk+ class. You rename it (e.g. from gtkclassnamea.c to
gtkclassnameb.c) and at the same time do

s/GtkClassNameA/GtkClassNameB/
s/GTK_CLASS_NAME_A(?:\>|?=_)/GTK_CLASS_NAME_B/
s/gtk_class_name_a(?:\>|?=_)/gtk_class_name_b/

and reindent everything. Guaranteed to have a similarity index around
30-40%, not more.

I don't care much about it, but face it, it is *not* perfect.
Paolo
--
The purpose I'd use the origin links for is to manage software projects
that consist of 7 main branches which have branched in (on average) two
year intervals, which never get merged anymore. The only thing that
happens is that there are backports amongst the branches, about two per
week.

The only way to perform the backports is by using cherry-pick.
The history of each backport *is* important though.
Since all the developers who care about the multiple release branches
have all the relevant branches in their repository, the presence of

I'd prefer to formalise the (weak) relationship of an origin link, instead of
relying on vague assumptions when parsing the free-form commit message

It's not just that. If I make a change to an area that was cherry-picked
from another branch, then I find it rather important to check if any
changes to this area need to be backported/forwardported to the branches
the origin links are pointing to.
I.e. the origin link allows me to improve my efficiency as a programmer.
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
I'd argue that the origin link is a bit too general for your proposed
use. One of the problems with the origin link is that it is only a
one way pointer. Given a newer commit, you know that it is (somehow)
weakly related to an older commit. So your proposed workflow only
works if cherry-picks only happen in one direction. That isn't always
true, especially in distributed environments where the bugfix might
happen on someone else's development branch, and then it gets pulled
in, or perhaps rebased in, and you want to know they are related.

I would argue the best way to do that is to store (either in the
object or in the free-form text area) not the link, which would have
to get renumbered, but rather the identifier for the bug(s) that this
commit fixes. So for example, consider a convention where in the body
of the free-form text area, before the Signed-off-by:, Acked-by:, and
CC: headers for those projects that use them, we add something like
the following:

Addresses-Bug: Red_Hat/149480, Sourceforge_Feature/120167
or
Addresses-Bug: Debian/432865, Launchpad/203323, Sourceforge_Bug/1926023
Once you have this information, it is not difficult to maintain a
berk_db database which maps a particular Bug identifier (i.e.,
Red_Hat/149480, or Debian/471977, or Launchpad/203323) to a series of
commits.

The advantage of this scheme is that if a bug has been fixed in
multiple branches, you can see the association between two commits in
two different branches very easily. Furthermore, you get a link back
to the actual bug in one or more bug tracking systems, which some
porcelain program could transform into a hot-link which when
clicked opens up a browser window to the bug in question.

In contrast, using your proposed origin scheme, if the bug was
originally created in some development branch, and then cherry picked
into two separate maintenance branches, if you don't have the
development branch in your repository (maybe for some reason that
development branch wasn...
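
The Addresses-Bug convention described above can be sketched as a small index from bug identifier to commits. This is only an illustration: an in-memory dictionary stands in for the berk_db database mentioned in the message, and the names bug_ids and build_index, as well as the commit ids used in the example, are hypothetical.

```python
from collections import defaultdict


def bug_ids(message):
    """Collect the identifiers from every Addresses-Bug: line of a message."""
    ids = []
    for line in message.splitlines():
        if line.startswith("Addresses-Bug:"):
            value = line[len("Addresses-Bug:"):]
            ids.extend(part.strip() for part in value.split(",") if part.strip())
    return ids


def build_index(commits):
    """Map each bug identifier to the commits that mention it.

    commits is an iterable of (commit_id, message) pairs, for example
    produced by walking several branches.
    """
    index = defaultdict(list)
    for commit_id, message in commits:
        for bug in bug_ids(message):
            index[bug].append(commit_id)
    return index


commits = [
    ("c1", "Fix crash\n\nAddresses-Bug: Debian/432865, Launchpad/203323\n"),
    ("c2", "Backport crash fix\n\nAddresses-Bug: Debian/432865\n"),
]
index = build_index(commits)
```

Looking up index["Debian/432865"] then yields both commits regardless of which branch each lives on and of which one was written first, which is the two-way association the message argues for.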
Well, the definition of the origin link (and a back/forwardport) is that:
- You (as a developer) consider the link relevant for posterity (IOW,
you consider it to be a proper back/forwardport which should be
recognisable as such).
- The back/forwardport always has to reference some existing (stable) commit.

Especially the second condition always holds at the time of creation of
the backport (or forwardport, for that matter). I'm not quite sure
which circumstances you allude to above which would violate this

The renumbering is not a problem; renumbering is a rare operation since
a project's history is supposed to be stable. And even if renumbering
is performed, it is a well understood operation of which the renumbering
of the origin links imposes a negligible overhead on top of the existing

This is nice, I admit, but it has the following downsides:
- It is nontrivial to automate this on execution of "git cherry-pick".
- In a distributed environment this requires a network-reachable bug
database.
- A network-reachable bug database means that suddenly git needs network
access for e.g. cherry-pick, revert, gitk, log --graph, blame.
- Network queries for commits containing references kind of kill
performance.
- Some backports don't have entries in a bug database because they
weren't bugs to begin with, in which case it becomes impossible to add
an identifier to the commit message after the fact.
- It relies heavily on tools outside of git-core, which raises the

I'm not opposed to links like this, but I consider them a useful extra.
Finding the link back is computationally of the same order of magnitude as
finding all existing children of a certain commit; which is well understood and

Yes, you would. You'd notice that either:
- One origin will point to the other commit (recommended practice,
cherry-pick ripple-through, so to speak).

True. The point is that specifying a definition for an origin
headerfield will narrow down how it is and can be used. Free-form is...
The database can just live in a special branch, with trees organized the
same way the object database is, possibly in a more optimized way
(having the HEAD trees cached around inside Git, etc.). This should be
no rocket science if the design is given a little thought, and should be
fairly fast afterwards.

I'm not endorsing assigning UUIDs to commits now at all (but I don't
have time to formulate a comprehensive argument against that either).

However, having a commit -> nonessential_volatile_metadata database
would be useful for many other things as well! For example amending
commit messages later, maintaining general linkage between related
commits, tracking explicit rename hints for Git (like the Samba guys
would appreciate right now, and me many times in the past - note that
this is NOT the same as directly tracking renames within Git history)
or caching expensive computations with mostly static results (like the
rename detection or maybe pickaxe indexes - that could be quite large,
so we might want to actually separate different kinds of data to
separate branches).

--
Petr "Pasky" Baudis
The next generation of interesting software will be done
on the Macintosh, not the IBM PC. -- Bill Gates
--
100 points to Petr. :-)
Paolo
--
Which brings us back to the "commit notes" proposal.
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
--
Well *you* were the one using this as an argument for using the origin
link. But I'll note that in some workflows, rebasing happens all the
time when a patch is being developed and moved around. Sometimes
patches are created in git, exported as a patch, and then it re-enters
git again later (which is another reason why using an external UUID or

It's trivial if it's in the free-form text. In fact, it happens
automatically. If it's stored within the git commit object, then it
will be done in the C code (if you've updated to the latest git;

No, because you don't need to look up the bug identifier unless you
want to, you know, actually look at the bug. Otherwise, we are just
using something like "debian/432865" as an identifier; you only need
to look them up if you want to look up the bug. Any time you have a
collaborative development environment, you will need either a
centralized, network accessible bug tracking system, or use a
distributed bug tracking system. Either way, though, if it's just
a matter of seeing whether or not a bug fix such as debian/432865 is
fixed by some commit in some branch, using the bug identifier actually

This is true. The transition is a little easier if you are pointing
to a pre-existing commit than if you need some kind of rendezvous
identifier (whether it is a bug ID or some UUID). On the other hand,
you've cherry-picked some bug fix using a git that didn't support the

Well, it relies on changes to git --- just like the origin link
requires changes to git. If it is implemented using free-form
text, which is a great way to prototype it, you have the *option* of
implementing it via either git porcelain changes or outside tools like
emacs or vi macros (just as most of us who are kernel developers have
editor macros that insert Signed-off-by: into git commit messages, as
well as changes in git porcelain such that "git am -s" automatically
adds the Signed-off-by header). But given the wildly successful use
of Signed-off-by in the kernel ...
True.
But maybe Ted is on to something here. Rather than adding the
information to the commit object itself, why not maintain a separate
mapping, but keep it _within git_. That is how most of the DBTS's work
that I have seen. Maybe it is possible to implement some subset of the
features in a tool that could become part of core git.

There was a proposal at some point for a "notes" feature which would
allow after-the-fact annotation of commits. I don't recall the exact
details, but I think it stored its information as a git tree of blobs.
You could choose whether or not to transfer the notes based on
transferring a ref pointing to the notes tree.

I'm not sure how applicable this is to your problem, but if you want to
investigate you can find discussion in the list archive under the name
"notes".

-Peff
--
Yes, that works, but it is non-trivial, especially since it needs to
The idea is nice, but if we were to use it to store the origin link
information, the following happens:
- Origin link information is rare.
- Yet during a log/gitk/blame run the information might need to
be queried for at every commit.
- Since in most cases the origin information does not exist, this
will cause misses to fill the dentry cache for directory lookups,
thus killing performance.
- In order to make this efficient, a different database lookup system is
needed that is fast for misses.

Whereas if the information is part of the commit, it costs nothing in
the typical case (no origin information present).
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
--
I think you are misunderstanding what I meant by "git tree" here. It is
literally a git tree object, so you don't ask the filesystem at all. You
are looking up within the single object file. If it's a miss, you know
after seeing that object. If not, then you dereference the blob object
that contains the notes.

-Peff
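
The lookup described in this reply can be modelled in a few lines. The sketch below is a toy model and not git's implementation: the object store is a plain dictionary, a tree is a dictionary from entry name to object id, and all ids and the function name read_note are made up for illustration.

```python
def read_note(objects, notes_tree_id, commit_id):
    """Return the note text for commit_id, or None if there is none."""
    tree = objects[notes_tree_id]   # read the single notes tree object
    blob_id = tree.get(commit_id)   # look the commit id up inside it
    if blob_id is None:
        return None                 # miss: known after seeing that one object
    return objects[blob_id]         # hit: dereference the blob with the note


# Toy object store: object id -> content (tree = dict, blob = str).
objects = {
    "tree-1": {"commit-L": "blob-1"},
    "blob-1": "cherry-picked from commit-G",
}
```

A miss costs one object read and no filesystem directory lookups, which is the point being made about the dentry cache concern.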
--
I see. Indeed. That's a lot better.
Did the binary search inside tree objects ever get implemented?

It is unclear why the latest commit notes proposal didn't make it,
though I admit that storing the origin link information in there seems
feasible.

The downsides when doing that are:
- The lookup cost is small, but still noticeable, since it is sometimes
done on every commit; using the in-commit origin headerfield solves
this at negligible cost.
- The origin information is no longer cryptographically protected (under
certain circumstances this could be considered an advantage and a
disadvantage at the same time).
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
--
I believe it's still linear (and skimming tree-walk.c:find_tree_entry
seems to confirm). However, one advantage of such an approach is that it
will improve as tree lookup improves (e.g., I believe the pack v4 work

Yes, those are inherent in the scheme, as is the upside that one can
make and distribute such annotations separately from commit creation.

I haven't thought enough about it to decide whether there is a scenario
where making such a "cherry-picked from" annotation might make use of
that property.

-Peff
--
No, not yet. Actually that's the part that still needs serious
thinking.

Nicolas
--
Being able to subvert the authenticity of git blame by providing fake
origin information is not very appealing.
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
--
You could use a dummy submodule to ensure that each commit pointed to
the right set of notes. It would force you to create a separate commit
whenever you modified the notes, which is actually not bad.

Alternatively, the header of the commit can be modified to add a pointer
to a tree object for the notes; I suppose this is more palatable than
the origin link. The tree could be organized in directories+blobs like
.git/objects to speed up the lookup.

I actually like the commit notes idea, but then I wonder: why are the
author and committer part of the commit object? How does the plumbing
use them? Isn't that metadata that could live in the "notes"? And so,
why should the origin link have fewer privileges?

Paolo
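
The "directories+blobs like .git/objects" layout mentioned above refers to the loose-object fan-out, where the first two hex digits of an object name form the directory and the rest form the file name. A minimal sketch of a notes tree organized that way follows. Nested dictionaries stand in for tree objects, and the helper names fanout, add_note and find_note are hypothetical.

```python
def fanout(sha, width=2):
    """Split a hex object name into (directory, entry), as .git/objects does."""
    return sha[:width], sha[width:]


def add_note(root, sha, note):
    """Store a note under root[directory][entry]."""
    directory, entry = fanout(sha)
    root.setdefault(directory, {})[entry] = note


def find_note(root, sha):
    """Return the note for sha, or None; at most two small lookups."""
    directory, entry = fanout(sha)
    return root.get(directory, {}).get(entry)


notes = {}
add_note(notes, "7b27718bdb1b70166383dec91391df5534d449ee", "origin: 5f3a9a20")
```

With this layout a lookup touches one top-level tree and at most one subtree, so each tree stays small even when many commits carry notes.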
--
Possibly, yes. But we'd have to be careful not to incur too much
overhead because every indirection will cost, especially since the
origin link sometimes is checked for on every commit during a treewalk.
The fact that it rarely exists means that it should be fast to find out

This won't work for the original notes concept, because it makes the
notes immutable after commit. For the origin links this would be fine,
since they don't change once committed.
The problem with fitting the origin links in the notes is twofold:
- They become mutable, which is undesirable, I'd like to preserve
history as is (just like parent links).
- There is a performance hit, since origin links need to be found not to

It would fit with a non-mutable version of the notes. Then again, we
already *have* the non-mutable version of the notes, it's called the

They both belong in the non-mutable notes, and those happen to live in
the header of the commit (which *is* the most efficient spot, of course).
--
Sincerely,
Stephen R. van den Berg.
"There are three types of people in the world;
those who can count, and those who can't."
--
Almost correct. Remove "header of" from the above and you'd be correct.
--
Yes, that was my point. I don't see how the author and committer fit in
the header of the commit message, if the origin does not.

Paolo
--
Recording cherry-picks in your workflow certainly makes sense, but I'm
not talking about workflow-level issues here. You are adding an extra
header to the commit object. I'm talking about the object database and
low-level Git model implications this has.

In other words, I think this is purely a porcelain matter and recording
And why are the notes created by git cherry-pick -x insufficient for that?
--
Petr "Pasky" Baudis
The next generation of interesting software will be done
on the Macintosh, not the IBM PC. -- Bill Gates
--
Stephen posed the origin links as weak, but it is not necessarily true
that you don't have the parents and the associated tree. For example,
if you download a repository that includes a "master" branch and a few
stable branches, you *will* have the objects cherry-picked into stable
branches, because they are commits in the master branch.

Junio explained that the way he achieves the same effect in git is by
forking the topic branch off the "oldest" branch where the patch will
possibly be of interest. Then he can merge it in that branch and all
the newest ones. That's great, but not all people are as
forward-looking (he did say that sometimes he needs to cherry-pick).

Another problem is that in some projects actually there are two "maint"
branches (e.g. currently GCC 4.2 and GCC 4.3), and most developers do
not care about what goes in the older "maint" branch; they develop for
trunk and for the newer "maint" branch, and then one person comes and
cherry-picks into the older "maint" branch. This has two problems:

1) Having to fork topic branches off the older branch would force extra
testing on the developers.

2) Besides this, topic branches are not cloned, so if I am the
integrator on the older "maint" branch, I need to dig manually in the
commits to find bugfixes. True, I could use Bugzilla, but what if I
want to use git instead? There is "git cherry -v ... | grep -w ^+.*PR",
except that it has too many false negatives (fixes that have already

For example, these notes (or the ones created by "git revert") are
*wrong* because they talk about commits instead of changesets (deltas
between two commits).

Why is only one commit present? Because these messages are meant for
users, not for programs. That's easy to show: users think of commits as
deltas anyway, even though git stores them as snapshots---"git show
HEAD" shows a delta, not a snapshot.

And what does this mean for programs? That they must resort to
commit-message scraping to distinguish the two cases. (*)...
Those who base their work on the newest ones must be very forward-
looking :) but, seriously, cherry-picking is *not* a normal workflow
with Git. Git is optimized for easy merging while cherry-picking is

If a branch is meant to be included in the oldest version, it must be
tested with that version anyway, and it is better when it is written for
the old version, because functions tend to be more backward compatible
than forward compatible. In other words, functions may often acquire
some extra functionality over time without changing their signature, so
the code written for a new version will merge without any conflict to
the old one, but it won't work correctly under some conditions. It is
certainly possible to have a problem in the opposite direction, but it
is much less likely, and usually bugs introduced in the development
version are not as bad as destabilizing a stable branch. Thus starting
a branch that is clearly meant for inclusion in the old version from that
version is the right thing to do.

Of course, if you have more than one stable branch for a long time then
you may want some branches forked from the new stable. You can do that
by merging uninteresting changes from the new stable with the 'ours'
strategy (so they will be ignored), and after that merging actually
interesting features from the new stable.

In contrast to cherry-picking, the real merge creates the history that
If you clearly mark all bugs in the commit message, there will be no
problem to find them by grepping log. There is a lot of potentially
useful information, and the 'origin' link is just one of many. It may
be okay to do some general mechanism for custom commit attributes (if
it's really necessary), but making a hack for one specific item of
information feels very wrong. In fact, I am not convinced at all
that the free-form text is not suitable to store this information.

Dmitry
--
Could you explain how the above mechanisms work based on the following
cherry-pick action:

A -- B -- C -- D -- L
\ /
E -- F -- G -- H -- K

D is the stable branch.
K is the development branch.
G is cherry-picked and applied to D producing L.
The origin link of L would have contained (G, F).

How would such a workflow be implemented using the temporary branches
Sometimes they're not bugs, yet they still are backported and thus carry
That's the problem, a general mechanism is undesirable, that we already
It's a rather well-defined useful property (which precludes it from
being a hack, I suppose).
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
--
You don't. You do everything in topic branches based off the stable
branch, and you merge them. That's the other way round, compared to
what you (and I) are used to.

Paolo
--
But that is irrelevant. If you already have the objects, whether to
follow the origin link does not matter at all.

I argue that following the origin link by one step is harmful as it
violates the internal Git object model and does not have real benefits.
If you want to have the origin links, do not follow them at all - the
commit objects themselves are not useful. (Or, optionally, follow them

(BTW, I don't feel strongly enough about the header-freeform distinction
to argue about it and some of your and others' points are good. But even
if we have the origin links, I think we should either follow them not at
all or fully.)

--
Petr "Pasky" Baudis
The next generation of interesting software will be done
on the Macintosh, not the IBM PC. -- Bill Gates
--
The origin links are rarely followed, not even by one step. They are
Maybe we have a misunderstanding about what "follow a link" means and
when it is done.
During most normal git operation, the origin links are just read, but
not followed.
The only commands that I expect to follow them are log --graph, gitk, fsck
and blame. I may have missed some corner use-cases, but this should
cover most of it; i.e. most of git ignores them or just makes note of
the hashvalues provided.
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
--
Oh, I'm sorry. By
- During fetch/push/pull the full commit including the origin fields is
transmitted, however, the objects the origin links are referring to
are not (unless they are being transmitted because of other reasons).

I had understood that you fetch the origin target but not the commits
referred to from it, but instead you meant that you do not follow the
origin link at all.

Petr "Pasky" Baudis
--
Indeed.
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
--
Using special references in the free-form area of a commit is akin to
using X-... headerfields in E-mail with all the assorted mess:
- No strict definition of what it means.
- Diverging porcelain implementations making use of the field in ever so
slightly changing ways over the years.
- You cannot rely on the field being always available.
- Automated "renumbering" becomes difficult at best.

What we want are concise and unambiguous definitions which allow us to
build tools that operate predictably on them now, and will operate

Things like rebase/filter-branch/stgit mess that up because they don't
know if the hash in the free-form should be altered.
Also, there is no automated way to actually fetch missing branches we
cherry-picked from this way.
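
To make the renumbering difficulty concrete, here is a naive sketch of what a history-rewriting tool would have to do with hashes in free-form text. It assumes the tool already has a mapping from old to new commit ids; the name renumber and the ids in the example are hypothetical.

```python
import re

# Any run of exactly 40 lowercase hex digits is treated as a commit id.
# That is a guess: the text gives no guarantee that such a string really
# names a commit, and abbreviated ids are not matched at all.
FULL_SHA = re.compile(r"\b[0-9a-f]{40}\b")


def renumber(message, old_to_new):
    """Replace every full-length id found in old_to_new; leave the rest."""
    return FULL_SHA.sub(lambda m: old_to_new.get(m.group(0), m.group(0)), message)


old = "7b27718bdb1b70166383dec91391df5534d449ee"
new = "5f3a9a207f1fccde476dd31b4c63ead2967d934f"
rewritten = renumber(f"commit {old} upstream", {old: new})
```

The sketch works only for full-length ids written in a form the pattern recognises, which illustrates why a defined field is easier for tools to rewrite than free-form text.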
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
I understand that multiple origin fields occur if you do a squash
merge, or if you cherry-pick multiple commits into single commit.
For example:
$ git cherry-pick -n <a1>
$ git cherry-pick <a2>
$ git commit --amend   # to correct commit message

I'm not sure if you plan to automatically add 'origin' field for
I think you wanted to use "(B, B^2)", which means B and the second parent
of B. B~2 means the grandparent of B in the straight line:

... <--- B~2 <--- B^1 = B^ = B~1 <--- B
                                     /
                    ... <--- B^2 <--/

Besides I very much prefer using 'origin <sha1> <sha2>' (as proposed
in the neighbouring subthread), which would mean together with
'parent <parent>' (assuming that there are no other parents; if there
are it gets even more complicated), that the following is true

<current> ~= <parent> + (<sha2> - <sha1>),

where '<rev1> ~= <rev2>' means that <rev1> is based on <rev2> (perhaps
with some fixups, corrections or the like). Perhaps 'origin' should
be then called 'changeset'.

It would also be easier on implementation to check if
'origin'/'changeset' weak links are not broken, and to get to know
which commits are to be protected against pruning than your proposal
of

origin <"cousin" id> [<mainline = parent number>]

where <mainline> can be omitted if it is 1 (the default).
This can also lead to replacing

origin <b> <a>
origin <c> <b>

by

origin <c> <a>
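
Reading 'origin <sha1> <sha2>' as the delta (<sha2> - <sha1>), the replacement above is a telescoping sum: (a - b) + (b - c) = (a - c). A small sketch of that bookkeeping follows. It treats deltas as idealized arithmetic, which real patches only approximate, and the function name collapse is hypothetical.

```python
def collapse(pairs):
    """Merge chained origin pairs.

    Each pair (start, end) stands for the delta "end - start". Two pairs
    where one ends at the commit the other starts from are merged into one.
    """
    pairs = list(pairs)
    merged = True
    while merged:
        merged = False
        for i, (start1, end1) in enumerate(pairs):
            for j, (start2, end2) in enumerate(pairs):
                if i != j and end2 == start1:
                    # (end1 - start1) + (end2 - start2) with end2 == start1
                    # telescopes to (end1 - start2).
                    rest = [p for k, p in enumerate(pairs) if k not in (i, j)]
                    pairs = rest + [(start2, end1)]
                    merged = True
                    break
            if merged:
                break
    return pairs
```

For the example in the message, collapse([("b", "a"), ("c", "b")]) gives [("c", "a")], while unrelated pairs are left untouched.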
The above means that it is a 'weak' link, i.e. it is protecting
against pruning (perhaps influenced by some configuration variable),

Errr... shouldn't objects referenced by 'origin' links be reachable in
order for "cherry-pick" or "revert" to succeed?

On the other hand this leads to the following question: what happens
if you cherry-pick or revert a commit which has its own 'o...
That is not part of the plan so far.
The simplicity sounds inviting. I'd like to hear from others who have
more experience (than I have) with the git vs. changeset paradigms about
this. This allows a bit more flexibility in specifying the origin, the

On the contrary, my current proposal only needs to verify the validity
of a single commit, changing it like this will require the system to
verify the validity of two commits. Given the rareness of the origin
links this will hardly present a problem, but it *does* increase

Ok, *that* is not possible with the original proposal. This might just
True. But sometimes it's necessary to emphasize the obvious; call it a
Nothing special. cherry-pick/revert behave as if the existing origin links
The order in which commits are listed is defined by the fact that
descendant commits are shown before any of their parents. The presence
of an origin link will make sure that the current commit will always
appear *before* the origin-commit it is referring to (if the

Quite. Also, having them in a well-defined place will allow for easy
Care to explain your doubts?
The reason I want this behaviour, is because it's all about tracking
content, and that part of the content happens to come from somewhere

Quite, but that is not a part of the definition of the origin field.
I can only try and make sure that we have a well-defined, well-behaved
mechanism in core git. If someone wants to get creative with the

As far as I can imagine, git rebase should alter the origin links during
rebase if they point to a commit within the strain being rebased.
Are there any other desirable use cases (for rebase)?
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
Quite frankly, recording the origins for _any_ of the above sounds like a
horrible mistake.

All those operations are commonly used (along with "git rebase -i") to
clean up history in order to show a nicer version.

The whole point of "origin" seems to be to _destroy_ that.

I would refuse to ever touch anything that had an "origin" pointer, so if
git were to add that feature, it would be a huge disappointment to me. I'd
have to have a version that makes sure that anything it pulls hasn't been
crapped on by somebody who added a stupid link to some dirty history that
I'm not at all interested in seeing.

IOW, I'm seeing a _lot_ of downsides, and not any actual upsides. What are
the upsides again?

Linus
--
Actually the above is _not_ a good example for using 'origin', and why
using 'origin'; just a bit convoluted example of multiple 'origin'

If I understand correctly the point is to record those 'origin' headers
for git-revert (when 'origin'-ed commit is somewhere in the history),
and for git-cherry-pick from another long-lived branch and thus require
an additional option to git-cherry-pick to record 'origin' (denoting that
this is a "true" cherry-pick, and not reordering of commits and
cleaning up a history, better done with interactive rebase).

/me is playing advocatus diaboli here, 'cause I'm not that convinced
of the necessity of this feature.

--
Jakub Narebski
Poland
--
Actually, I'd suggest that cherry-pick takes an -o flag which turns on
the origin link. This needs to be a conscious decision because one deems
the history relevant. This typically is a Good Thing when the
development has several long-term-stable branches which never get merged
with each other, yet they receive frequent backports (using cherry-pick)

Only in the case where the committer thinks the history is of interest,
and even then, since the origin link is in the header, displaying it or
not suddenly is under the control of git.
Had it been in the free-form textarea, there'd be no way to suppress the

As you might have noticed, the actual process of pulling/fetching
explicitly does *not* pull in the objects being pointed to.
That, in turn, will cause the origin link output to be automatically
suppressed. I.e. you'll never know the difference.

OTOH, if someone adds a free-form link to the commit message, you
essentially cannot hide that and are just suffering the clutter without

The upsides are:
- If your repository contains the proper branches, it will show a richer
content.
- If your repository lacks the proper branches, it will show a *reduced*
clutter content (because actual free-text references in the commit
messages will decrease).

I see a lot of upsides, what were the downsides again?
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
.. which makes them _local_ data, which in turn means that they should not
be in the object database at all.IOW, i you want this for local reasons, you should use a local database,
like the index or the reflogs (and I don't mean "like the index" in the
sense that it would look _anything_ like that file, but in the sense that
it's a purely local thing and doesn't show up in the object database).
Linus
--
Not really local data. More like _weakly referenced_ data. If it is
there, cool. If it is not there, no big deal.
Paolo
--
You think it's "cool".
I think it is "unreliable, random, and depends on the phase of the moon".
My definition of "cool" is a totally different thing. What you describe is
the very antithesis of cool.

If you want unreliable and random, use CVS. Please.
Linus
--
I think that shallow clones are not any different from this. If the
required piece of history is there, cool. If they're not there, no big
deal.

I understood the hyperbole, but I think that it's not unreliable,
because all it relies on is the uniqueness of SHA1 values.
Paolo
--
Sure. I don't use them either. But because I don't use them, it doesn't
affect me. It also doesn't change the core git data structures in any way
to introduce any new problems.

Also, if there isn't a required piece of history, things generally break
very loudly. IOW, there are only certain things you can do with a shallow
repo. In general it's absolutely _not_ a "no big deal" issue, quite the
reverse - it's a deal-breaker.
Linus
--
Btw, so far nobody has even _explained_ what the advantage of the origin
link is. It apparently has no effect for most things, and for other things
it has some (unspecified) effect when it can be resolved.

Apart from the "dotted line" in graphical history viewers, I haven't
actually heard any single concrete example of exactly what it would *do*.

And that dotted line really does sound like something you could do with
just the existing "hyperlink" functionality in the commit message.
Linus
--
As far as I understand (note: I'm neither for, nor against the proposal;
although I think it has thin chance to be accepted, especially soon),
it is for graphical history viewers, for git-cherry to make it more
precise (to detect duplicated/cherry-picked changes better), and in
the future possibly to help history-aware merge strategies. And probably
help patch management interfaces.

On the theoretical front it looks like an extension/generalization of
a parent link, marking a given commit to be derived not only from some
set of trees, or some line of history, but also from some changeset.
--
Jakub Narebski
Poland
--
Can I suggest,
1. bury this origin link idea
2. make git-cherry-pick have a similar option to '-x', but instead of
recording the original commit ID, record the original *patch* ID,
*if* there was a merge conflict for that cherry pick.
3. tools can build indexes from patch ID => (commit IDs) to make this
other form of history navigation fast.
Sam
--
Actually, don't make it dependent on merge conflicts. Just make it depend
on whether the patch ID is _different_.

It can happen even without any conflicts, just because the context
changed. So it really isn't about merge conflicts per se, just the fact
that a patch can change when it is applied in a new area with a three-way
diff - or because it got applied with fuzz.

You could add it as a
Original-patch-id: <sha1>
or something. And then you just need to teach "git cherry/rebase" to take
both the original ID and the new one into account when deciding whether it
has already seen that patch.
Linus
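
The matching rule suggested above can be sketched in Python. This is only an illustration under stated assumptions: the `Original-patch-id:` trailer is the hypothetical field proposed in this message (git does not define it), each commit's current patch ID is assumed to have been computed already (for example with `git patch-id`), and the names `Commit`, `known_patch_ids` and `already_applied` are made up for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical trailer name taken from the suggestion above; git does not define it.
TRAILER = re.compile(r"^Original-patch-id:\s*([0-9a-f]{40})\s*$", re.MULTILINE)


@dataclass
class Commit:
    patch_id: str   # patch ID of the commit as it exists now
    message: str    # full commit message, possibly carrying the trailer


def known_patch_ids(commit):
    """Return every patch ID under which this commit should be recognised."""
    ids = {commit.patch_id}
    ids.update(TRAILER.findall(commit.message))
    return ids


def already_applied(candidate, upstream):
    """True if any upstream commit shares a current or original patch ID."""
    wanted = known_patch_ids(candidate)
    return any(wanted & known_patch_ids(c) for c in upstream)
```

A cherry-pick whose patch changed because of context or fuzz would then still be recognised, as long as the trailer records the patch ID it started from.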
--
That will probably work fine when operating locally on (short) temporary
branches.

It would probably become computationally prohibitive to use it between
long lived permanent branches. In that case it would need to be
augmented by the sha1 of the originating commit. Which gives you two
hashes as reference, and in that case you might as well use the two
commit hashes of which the difference yields the patch.
--
Sincerely,
Stephen R. van den Berg.
"Father's Day: Nine months before Mother's Day."
--
Nope, as Sam suggested in his original message (but which got clipped
by Linus when he was replying) all you have to do is to have a
separate local database which ties commits and patch-id's together as
a cache/index.

I know you seem to be resistant to caches, but caches are **good**
because they are local information, which by definition can be
implementation-dependent; you can always generate the cache from the
git repository if for some reason you need to extend it. It also
means that if it turns out you need to index relationships a different
way, you can do that without having to make fundamental (incompatible)
changes in the git object.

It's much like SQL databases; you have your database tables, where
making changes to the database schema is painful --- and indexes,
which can be added and dropped with much less effort. Think of these
local caches as database indexes. Just because you need an index in
a particular direction to optimize a query or lookup operation does
***not*** imply that you need to make a fundamental, globally visible,
database schema change or git object layout which breaks compatibility
for everybody.
- Ted
--
True. But repopulating this cache after cloning means that you have to
calculate the patch-id of *every* commit in the repository. It sounds
like something to avoid, but maybe I'm overly concerned, I have only a

I fully agree that caches are good.
And yes I seem to resist the idea to create a cache at every whim, but
that mostly is because I want to avoid that everyone invents their own
mini-database for each and every data access they want to accelerate.

I mean, ideally, any database/index/accelerator structure you'd need
can reuse the SHA1 object database index, or maybe one or two other
semi-standard index types, and git would provide suitable library
functions for all three solutions. And if that would be the case, I'll
gladly throw in an extra cache or index at anytime to speed up the
particular access pattern I'm trying to make useable. But as far as I
can see, those library functions have not materialised yet, so I'm
hesitant to create yet another private database structure just for my
access patterns; and simply pulling in libdb or sqlite without agreement
that those libs are (re)used in a lot of places in git seems a bit

It's not a certainty that changing the git object layout has to break
compatibility (it should be reasonably possible to add columns to the
schema without breaking anything, to stay with the database paradigm),
but I agree that creating another index can be considered better than
extending the schema.
--
Sincerely,
Stephen R. van den Berg.
"Father's Day: Nine months before Mother's Day."
--
You don't necessarily need to do that. If the tool decides that the
sha1 it finds in the message is a patch-id reference, well it can just
start hunting around, caching the patch-ids it calculates as it finds
them, until it either finds one that matches, or determines you don't
have it. You can probably find it first try just based on the author
name and date 90% of the time anyway.

Maybe the machinery could be adequately tilted such that if someone is
really desperate to make sure they are found quickly they can put the
information at refs/patches/PATCHID/COMMITID, but that sounds a bit
abusive.
Sam.
--
For a rough estimate, try:
time git log -p | git patch-id >/dev/null
-Peff
--
On my system that results in 2ms per commit on average. Not huge, but
not small either, I guess. Running it results in real waiting time, it
all depends on how patient the user is.
--
Sincerely,
Stephen R. van den Berg.
"Father's Day: Nine months before Mother's Day."
--
For a local clone, git could be taught to copy the cache file. For a
network-based clone, the percentage of time needed to download is
roughly 2-3 times that (although that will obviously depend on your
network connectivity). Building this cache can be done in the
background, though, or delayed until the first time the cache is
needed.
- Ted
--
Fair enough. If no one beats me to it, I'll probably take a stab at
implementing something like this and see how it fares for my own
application.
--
Sincerely,
Stephen R. van den Berg.
"Father's Day: Nine months before Mother's Day."
--
But it is not true that "you can always generate the cache from the
git repository" in this case; the patch-id that is to be saved is
_original_ patch-id of cherry-picked (or reverted) changeset.

OTOH it is not much different from reflog information, which also
cannot be regenerated from object database.
--
Jakub Narebski
Poland
--
He's proposing storing the original patch id in the commit message, and
caching the commit SHA->patch id association on the side.
Paolo
--
Actually it's the association in the other direction which you'd want
to cache. It's fast given the commit SHA to dig the original patch id
out of the commit message. What is harder is given a patch id X, to
find all of the commits which either (a) have a patch id of X, or (b)
have a commit message indicating that the original patch-id was X. So
having a database which caches this information, so given a patch-id,
you can quickly look up the related commits, is what I believe Sam was
proposing, and which I think would solve the problem quite nicely.
- Ted
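
A rough Python sketch of the reverse lookup described above: given a patch ID, find the commits that either (a) have that patch ID or (b) name it as their original patch ID. It assumes the hypothetical `Original-patch-id:` trailer discussed earlier in the thread and takes precomputed patch IDs as input; `build_patch_index` is an illustrative name, not an existing git facility.

```python
import re
from collections import defaultdict

# Hypothetical trailer recording the pre-cherry-pick patch ID.
ORIGINAL = re.compile(r"^Original-patch-id:\s*([0-9a-f]{40})\s*$", re.MULTILINE)


def build_patch_index(commits):
    """Map patch ID -> set of commit IDs.

    `commits` is an iterable of (commit_id, patch_id, message) tuples.
    A commit is indexed under (a) its own patch ID and (b) any original
    patch ID named in its message.
    """
    index = defaultdict(set)
    for commit_id, patch_id, message in commits:
        index[patch_id].add(commit_id)
        for original in ORIGINAL.findall(message):
            index[original].add(commit_id)
    return index
```

Because the index is derived purely from commit contents, it can be thrown away and rebuilt locally, which is the property argued for above.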
--
Yeah, I must admit I am okay with *this* cache.
Paolo
--
Pardon my confusion, but why include two commit hashes? Surely the
commit already has its parent, so there is no need to include that in
your "cherry pick". And if the commit has more than one parent, then I
doubt you could/should really cherry-pick it anyway.

Besides, you could always augment your local repo with a mapping of
patch ids to commits/commit pairs to reduce lookup time.
Rogan
--
Well, actually, sometimes cherry-pick does pick just one of the
(multiple) parents to diff with; also, some people (not I) envisioned
using two commits which were not a direct parent and child of one
another (I'm not quite sure how that would work, but the model would

Yes, possible. But then after cloning, this mapping-cache needs to be
recreated, and that would mean that one would have to walk through all
commits and calculate all patch-id's, of which then only those few which are
referenced need to be stored.
--
Sincerely,
Stephen R. van den Berg.
"Father's Day: Nine months before Mother's Day."
--
Yes, right - it's the patch ID changing that's the problem for
git-cherry / rev-list --cherry-pick to be able to spot changes as the
'same'.

Someone else pointed out that git-rebase -i might want to have this as well.
I actually looked into coding this, but there was a little problem with
the way git-revert worked - it builds the commit message before the diff
is calculated. So there would probably need to be a little trivial
refactoring first before this can be implemented.
Sam.
--
I mentioned git-cherry as an additional use case. Automatic rename
detection works because it might have the occasional false negative, but
it has practically no false positives, and those are what screw up
merges. But automatic changeset detection a la git-patch-id has too
many false negatives to make the current implementation of git-cherry
practical, and here's where the origin link comes in. Also, automatic
changeset detection does not work with reverts, only with cherry-picks.

Blame could also use the origin link to go backwards in the history and
find the origin of the code, without being fooled by reverts.

(Note to Linus: reinforcement of your disagreement will be implicitly
Paolo
--
It allows one to follow and view the evolution of a patch over time during
the various backports.
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
--
But then how would someone who clones the repository get at the information?
The information is essential to understand backports between the various
stable branches.

The origin links describe the evolving state of a patch (i.e. just like
regular commits/parents store snapshots of the whole tree, the origin
links store snapshots of a patch as it evolves through time).
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
--
You just said it wouldn't get there with fetches.
If clone acts differently from a "full" fetch, something is really really
No it's not. You can mention the backport explicitly in the commit
message, and then you get hyperlinks in the graphical viewers. That works
when people _want_ it to work, instead of in some hidden automatic manner
that does entirely the wrong thing in all the common cases.

What more do you want?
Linus
--
It does not act differently.
Let me elaborate:
- The origin field is part of the commit (and only present if
*consciously* added by the committer), and therefore is transmitted
along with the rest of a commit upon a fetch.
- The commits being referred to by the origin field are *not*
transmitted upon a fetch.
- Given a repository with 4 long lived published branches called A, B, C and D
and a backport from development branch D cherry-picked -o into branch A
which creates an origin field pointing back to (D^,D^^)
- Now you fetch just branch A from this repository. This will not cause
branch D to be pulled in as well.
- However, if you explicitly pull D, the origin information from A to D can
be used. People doing a generic clone get all four branches, and
therefore have all the important commits which normally could contain
origin links. Note that even during a clone, commits pointed to by
origin links are not being transmitted (unless there already are other

Could you spell out one of the common cases where it would do entirely
the wrong thing?
--
Sincerely,
Stephen R. van den Berg.
"Am I paying for this abuse or is it extra?"
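
The fetch behaviour described in the list above can be modelled with a small Python sketch. It is a toy model only, not how git's transport works: the repository is a plain dictionary, the `origin` key stands for the proposed (hypothetical) origin field, and `fetch` and `resolvable_origins` are names invented for the illustration.

```python
def fetch(repo, tips):
    """Return the set of commit IDs transferred when fetching `tips`.

    `repo` maps commit ID -> {"parents": [...], "origin": [...]}.
    Only parent links are followed; origin links are carried inside the
    commit but the commits they name are not pulled in.
    """
    have, todo = set(), list(tips)
    while todo:
        cid = todo.pop()
        if cid in have:
            continue
        have.add(cid)
        todo.extend(repo[cid]["parents"])
    return have


def resolvable_origins(repo, have):
    """List (commit, origin) pairs whose origin target is present locally."""
    return sorted(
        (cid, target)
        for cid in have
        for target in repo[cid]["origin"]
        if target in have
    )
```

Fetching only branch A leaves the origin links unresolved; fetching D as well makes the same links usable without changing any commit.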
--
IOW, it's not actually transferring them and saving them, since a simple
delete of the origin branch will basically make them unreachable.

Fine. At least it works the same way as fetch, then. But it's still a huge
mistake, because it really does mean that it is technically no different
at all to just mentioning the SHA1 in the commit message, the way we
already do for backports.

It carries along information that is worthless and meaningless and hidden.
I refuse to touch such an obviously braindamaged design. It has no sane
_semantics_. If it doesn't have semantics, it shouldn't exist, certainly
not as some architected feature.

Nobody has shown any actual sane meaning for it. The only ones that have
been mentioned have been for things like avoiding re-picking commits
during a "git rebase", but (a) the patch SHA1 does that already for things
that are truly identical and (b) since that information isn't reliable
_anyway_, and since it's apparently a user choice, it's just "random".

I'm sorry, but "good design" is a hell of a lot more important than some
made-up use case that isn't even reliable, and doesn't match any actual
real problems that anybody can explain.
Linus
--
False.
If you fetch just branches A, B and C, but not D, the origin link from A
to D is dangling. Once you have fetched D as well, the origin link from
A to D is not dangling anymore. Subsequently deleting branch D but
keeping branch A will keep everything in branch D up till the commits
the origin link is pointing to alive and prevent those from being

Git will keep alive commits based on origin links once you (the fetcher)
have shown interest by fetching the appropriate branches.

As to "meaning" for git, it's there in the form of:
- --topo-order uses the information to order the output (but only if the

The common cases would be:
a. "hidden": It doesn't need to be hidden. It can be hidden if you want it
to be. We can decide if git hides it sometimes, always or never.
So this point is moot.
b. "meaningless": Git is all about taking snapshots of sourcetrees and
linking them in an orderly fashion. The origin link is all about
taking snapshots of patches and linking them in an orderly fashion.
This allows you to see the patch evolve over time, and it allows for
diffs between patches. We're not actually storing patches, we merely
store snapshots. As it happens, the snapshot of a patch is defined
by two commit hashes.
Doesn't sound meaningless to me. Just as one needs normal history
between commits in a branch to follow development, there is a history
of a backport as it "travels" from stable branch to stable branch.
c. "worthless": Without the tracking of a backport through a series of
well-defined patch-snapshots, it becomes kind of haphazard to
actually figure out which piece of code came from where. Having this
information in the form of a series of origin links increases the
efficiency of a developer maintaining the backports between branches.
Maybe you consider that worthless, I consider anything that improves
code quality because having access to a concise history of how the
code evolved a Good Thing....
Stephen, here's a f*cking clue:
So I just said we deleted branch 'D', so there's no way to ever fetch it
again.

Get it?
The fact is, a big part of git is temporary branches. It's one of the
*best* features of git. Throw-away stuff. Those throw-away branches are
often done for initial development, and then the final result is often a
cleaned-up version. Often using rebase or cherry-picking or any number of
things.

And this is why "git cherry-pick" DOES NOT PUT THE ORIGINAL SHA1 IN THE
COMMENT FIELD BY DEFAULT.

(Although you can use "-x" to make it do so for when you actually _want_
to say "cherry-picked from xyzzy")

Can you not understand that? The "origin" field is _garbage_. It's garbage
for all normal cases. The original commit will not ever even EXIST in the
result, because it has long since been thrown away and will never exist
anywhere else.

Garbage should be _avoided_, not added.
Linus
--
I'd presume you do, but that doesn't mean you always accurately express
You did not state you deleted branch 'D' on the repository being fetched
*FROM*. I assumed you meant you deleted branch 'D' on the repository

Indeed, features I value in git very much, and use every day, thanks.
The origin field will *not* be created on regular cherry-picks, this
*would* create garbage. The origin field is not meant to be generated
when doing things with temporary branches. The origin field is meant to
be filled *ONLY* when cherry-picking from one permanent branch to

Quite.
I do understand that "normal cases" in your case mean cherry-picks among
temporary branches.
Well, you are completely right that *your* normal cases should not (and
will not) generate an origin field.
The origin field is intended for the *abnormal* cases, which means
cherry-picking between permanent branches (which, apparently, you rarely
do, if ever), this is something that (depending on your workflow) can be
a more frequent event. For *those* cases, the origin field will not
contain garbage.
--
Sincerely,
Stephen R. van den Berg.
"There are three types of people in the world;
those who can count, and those who can't."
--
... and therefore you might as well just have a separate file (which
might or might not be tracked by git like the .gitignore files are)
to keep that information? Since this is a rare operation, modifying the
core database structure for this doesn't appear that appealing to most
so far.

And, while recording this origin link is optional, you are likely to
make mistakes like forgetting to record it, or you might even wish to
fix it with better links after the fact. Having it versioned also
means that older git versions will be able to carry that information
even if they won't make any use of it, and that also solves the
cryptographic issue since that data is part of the top commit SHA1.
Nicolas
--
For various reasons, the best alternate place would be at the trailing
end of the free-form field. Using a separate structure causes

That is not possible for commit messages, and should not be possible for
It would allow the data to be faked, that is undesirable for "git blame".
--
Sincerely,
Stephen R. van den Berg.
"There are three types of people in the world;
those who can count, and those who can't."
--
Why would this matter? The information is largely
self-authenticating. If a commit claims to have come from some other
cherry-pick, a human taking a quick look at it would know instantly
that this wasn't true. So what's the harm done if some incorrect
information gets introduced? "git blame" is something which is
generally used by humans, not by automated programs.

Also, what's the attack scenario? The person who originally makes the
commit can easily fake the origin link information. They can hack git
to fill in some other commit ID, for example. So what you are
protecting against is someone after the fact adding the annotation
that this commit was related to this other commit. When would this be
a bad thing to do? If they are adding correct information, it's a
good thing. If they add incorrect information, what's the harm they
can do as a result of being able to add the incorrect information?
(Noting that if this annotation file is kept under git control, you
can use whatever access controls and/or process controls that verify
that a new cherry-pick --- or a commit claiming to be a cherry-pick
--- is valid and should be accepted into the master git repository for
that project.)
- Ted
--
Attack-wise, you're right, it's not a big deal.
I think the comforting feeling one gets about the hashes protecting
integrity is what matters more for me here.
--
Sincerely,
Stephen R. van den Berg.
"Father's Day: Nine months before Mother's Day."
--
Did you try it? I don't particularly buy this performance argument, and
the bulk of my contributions to git so far were about performance. It
is quite easy to load a flat file with sorted commit SHA1s, and given
that origin links are the result of a rare operation, then there
shouldn't be too many entries to search through. Hell, doing 213647
lookups (and many other things like inflating zlib deflated data) with
each of them for commit objects in my Linux repository which has 1355167
total entries takes only 6 seconds here, or about a quarter of a
millisecond for each lookup. I doubt doing an extra lookup in a much
smaller table would show on the radar.
Nicolas
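
As a hedged illustration of the "flat file with sorted commit SHA1s" idea above, here is a minimal Python sketch. The file layout (one `<commit-sha1> <origin-sha1>` pair per line) and the names `load_origin_table` and `lookup_origin` are assumptions made for the example; nothing like this file exists in git.

```python
from bisect import bisect_left


def load_origin_table(lines):
    """Parse lines of '<commit-sha1> <origin-sha1>' into a sorted list of pairs."""
    return sorted(tuple(line.split()) for line in lines if line.strip())


def lookup_origin(table, commit_id):
    """Binary-search the sorted table; return the origin ID or None."""
    # A 1-tuple sorts just before any pair that starts with the same commit ID.
    pos = bisect_left(table, (commit_id,))
    if pos < len(table) and table[pos][0] == commit_id:
        return table[pos][1]
    return None
```

A miss, which would be the common case if origin links are rare, costs one binary search over a small table.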
--
Maybe you're right. The reason why my first knee-jerk reaction is
"performance problem" is because:
- The field is rarely present.
- When it is used, we look for it on every commit we traverse.
- This means that finding out the field does *not* exist is the most
common operation, and that effort rises linearly with the number of
commits visited.

Whereas if the information is present in the header or trailer of the
commit, finding out that the field does not exist there is rather
cheap. But you could very well be right, that the absolute extra time
spent might be negligible for all intents and purposes.

Nonetheless, the data-integrity argument still holds, i.e. placing it in
the commit (header or trailer) automatically protects it. External
files need extra care if you want the same integrity protection.
--
Sincerely,
Stephen R. van den Berg.
"Father's Day: Nine months before Mother's Day."
--
And that is why the proposal was to use "-o" option to git-cherry-pick
to add 'origin'/'changeset' header, exactly because git-cherry-pick is
_abused_ to clean up branches and reorder commits; although I think that
"git rebase --interactive" (and patch management interfaces) do replace
using git-cherry-pick for that purpose.

git-revert would add 'origin'/'changeset' header unconditionally,
just like by default it seeds commit message with SHA-1 id of reverted

Hmmm... the difference between having 'origin' in a commit object header,
and having it in commit message is like difference between 'Signed-off-by:'
convention and 'author' header. First is the matter of workflow, second
is inherent, required and non-avoidable part of revision information.

On the other hand git-cherry and git-blame would then have to rely on
correctly parsing the free-form part of a commit object to take advantage
of 'origin' information, which is what the 'origin' info is for.

P.S. 'generation' header was not added... just saying... :-)
--
Jakub Narebski
Poland
--
Yes, but you should not have used Stephen's proposed new option to git
cherry-pick, just like you shouldn't have used the existing -x option.
"-x" would not have created a dangling reference, but it would have

These days I doubt people would use cherry-pick, they would probably use
... neither should cherry-picking create the origin link by default.
Only if requested by the user, using a new option that is basically "-x"
done in a different way. Just like "-x", it should not be used when
cherry-picking from private branches.

But say someone does it, then what happens? If people clone the branch,
the reference will be basically unusable. But since "git gc" does not
delete the referenced commit, at least the origin commit is still
available in the repository where the cherry-pick was made. It is
debatable whether it is better or worse than "-x".

Can we discuss instead a generic way to have porcelain-level metadata,
immutable or at least versioned, for the commit objects? (This is the
same kind of metadata as the author or committer, which clearly have
nothing to do with the git plumbing.) Do you have any proposal of saner
semantics, not for the origin link but for commit references within this
kind of metadata in general?
Paolo
--
But my point is, _none_ of what Stephen proposes has _any_ advantage over
the already existing functionality.

IOW, absolutely *everything* is actually done better with existing data
structures, and then just adding tools to perhaps follow those SHA1's in
the commit message.

The whole "origin" field doesn't have any semantics that make sense for
core git. It's basically ignored by all normal git operations, and the
_only_ things that people seem to point out as being features are things
that can - and obviously in my opinion should - be done by much higher
levels.

For example, the claim was that it's hard to follow the chain of
cherry-picks. That's not _true_. Use gitweb and gitk, and you can already
see them. Sure, you need to use "-x", BUT YOU'D HAVE TO USE THAT WITH
Stephen's MODEL TOO!

Exactly because it would be a frigging _disaster_ if that "origin" field
was done by default.

And the only thing that "origin" does is:
- hide the information
- make it easier to make mistakes (either enable the feature by default,
or not notice that you didn't enable it when you wanted to)
- add a requirement for a backwards-incompatible field that is just
guaranteed to confuse any old git binaries.
- make it _harder_ to do things like send revert/cherry-pick information
by email.

See? There are only downsides.
Look at the kernel -stable trees. They explicitly add that cherry-pick
information, and can add *more*. For example, they go look at
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.26.y.git;a=co...
and then go to its parent commit (just click on the parent SHA). And
notice how the stable kernel tree commits talk about where they were
back-ported from, or _why_ they aren't back-ports at all!

IOW, there are really two main cases:
- the common case for cherry-picking: you do not want any origin
information, because it's irrelevant, pointless, and *wrong*.
- you _d...
I think you're missing some of the advantages because you don't have a
lot of experience with cherry-pick workflows between multiple permanent

The best way to explain the difference is probably by implementing the
The existing cherry-pick -x option doesn't cut it, it helps for the
simple cases, yes, but there are cherry-pick situations where it just

This is a problem, I admit, but maybe this can be solved in the future.
Then again, since use of the feature is a *very* conscious decision, anyone

Not necessarily, adding an Origin field in the patch sent by mail is
easy. I don't see how it would be more difficult otherwise. Please

I think I just neutralised all but one of the mentioned downsides, and
And this is impossible when using the origin link? The usage with an
Yes, and that *extra* information can and should go into the free-form
commit message, alongside of the origin field inside the header (or
trailer), just edit the commit message before committing after a
cherry-pick -o. What's your point?
--
Sincerely,
Stephen R. van den Berg.
"There are three types of people in the world;
those who can count, and those who can't."
--
I do not understand how this can be considered an acceptable behavior.
If an object ID is referenced in an object header, particularly commit
objects, fetch must gather those objects also because to do otherwise
breaks the cryptographic authentication in git.
--
No it does not.
The cryptographic seal is calculated over the content of the commit,
which includes the hashes of all referenced objects, but doesn't include
the objects themselves.
The content of the commit is not violated.

Do not forget though:
- origin links are a rare occurrence.
- When they occur, they usually were made to point into other (deemed)
important public branches.
- Due to the fact that the branches they are pointing into are important
and public, in most cases the origin links *will* point to objects you
actually already have (even if you fetched from someone else).
- The only time you're going to have dangling origin links is when
they were pointing at someone's private branches, in which case it was
not very prudent of the committer to actually record the link in the
first place. But nothing breaks if you don't have his private branch
locally.
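
The point about the cryptographic seal can be illustrated with a short Python sketch. It relies on the way git names a commit object: the SHA-1 of the header `commit <size>` plus a NUL byte, followed by the commit's content. The `origin` header is the hypothetical field under discussion, and the serialisation below is simplified (author and committer lines are left out), so the resulting IDs will not match real git commits.

```python
import hashlib


def commit_id(content: bytes) -> str:
    """SHA-1 over the object header ('commit <size>' plus NUL) and the content."""
    header = b"commit " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def make_commit(tree, parents, message, origin=None):
    """Serialise a simplified commit; `origin` is the hypothetical extra header."""
    lines = ["tree " + tree] + ["parent " + p for p in parents]
    if origin is not None:
        lines.append("origin " + origin)
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()
```

The ID covers the hash written in the `origin` line, so the link cannot be altered without changing the commit ID, yet computing or checking the ID never requires the referenced object to be present.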
--
Sincerely,
Stephen R. van den Berg.
"There are three types of people in the world;
those who can count, and those who can't."
--
The fetch MUST gather the referenced objects ALWAYS or I can't verify
the history. To do otherwise means that ID strings on the origin lines
are nothing more than an arbitrary text tag and not pointer to a

How do I verify (think git-fsck) that what the origin lines refer to
are, in fact, commits with the proper relationships? Either they HAVE to
be in the repository or the references do not belong in the header.
--
To fetch, by default, the origin lines *are* nothing more than arbitrary
If the origin hashes are not reachable, then fsck is required to silently
skip them, according to spec.
If the origin hashes *are* reachable, then fsck is required to verify
that they refer to proper commits with a normal history.
--
Sincerely,
Stephen R. van den Berg.
"There are three types of people in the world;
those who can count, and those who can't."
--
Stephen R. van den Berg wrote:
By the way, I would really consider trying first to host 'origin' links
not in repository database itself, but in some extra database inside
git repository, like reflog or index. Git community is _very_
reluctant to modifying / extending format of persistent objects. From
all the proposals to add some extra header to a 'commit' object:
the 'prior' link to previous version of rebased, cherry-picked or redone
commit (superseded somewhat by local reflog, on by default in modern
git); the generic 'note' header, with examples of usage including
_non-linking_ cherry-pick and reverted commit-id, merge strategy used,
and hints for rename detection, i.e. something like #pragma in C
(rejected on the grounds that it was too generic and didn't have well
defined semantic); the 'generation' header which was meant to help and
speed up sorting commits, with root (parentless) commit having
generation of 1, and each commit having generation being 1 more than
maximum of generations of its parents (I think that backwards
compatibility killed it, and the fact that date-based heuristics was
improved); only the 'encoding' header was accepted.

So I think you should go the route of externally (outside 'commit'
objects) maintaining 'origin'/'changeset'/'cset' links (like XLink
extended links ;-)) as a prototype to examine consequences of the idea.
That was the way _submodule_ support was added to Git, by the way.
First there were (at least) two implementations maintaining submodules
outside object database (see http://git.or.cz/gitwiki/SubprojectSupport
especially "References" section), then it was officially added first at
the level of plumbing support, as extension of a 'tree' object (and
index format, I think).
--
Jakub Narebski
Poland
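
The 'generation' header mentioned above is described only by its rule: a
root (parentless) commit has generation 1, and every other commit has a
generation one more than the maximum over its parents. A minimal Python
sketch of that rule on a toy history follows; the `parents` mapping, the
function name and the commit names are made up for illustration and are
not git's data structures.

```python
from typing import Dict, List


def generation_numbers(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each commit to its generation: 1 for roots, else 1 + max over parents.

    Assumes every parent named in the mapping is itself a key of the mapping.
    """
    gen: Dict[str, int] = {}
    for start in parents:
        stack = [start]  # explicit stack, so deep histories do not hit recursion limits
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in gen]
            if pending:
                # Parents must be numbered before the commit itself.
                stack.extend(pending)
            else:
                gen[commit] = 1 + max((gen[p] for p in parents[commit]), default=0)
                stack.pop()
    return gen


# Toy history (hypothetical): A is the root, D merges B and C, E merges D and A.
history = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["D", "A"]}
print(generation_numbers(history))
```

A commit's generation is always larger than that of any ancestor, which
is what would let a sort skip date-based heuristics.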
--
Well, the train of thought here goes as follows:
1. Sure, why not add a field (zero or more) at the bottom of the free-form
commit message reading like:
Origin: bbb896d8e10f736bfda8f587c0009c358c9a8599 ee837244df2e2e4e9171f508f83f353730db9e53
2. Add support to cherry-pick/revert to actually generate the field upon
demand.
3. Then add support to prune/gc/fsck/blame/log --graph to take the field
into account.
4. Add support to filter-branch/rebase to renumber the field if necessary.
5. Add support to --topo-order to use the field if present and reachable.
6. For bonus points: add support to log to suppress the display of the
field at the end of the commit message, and redisplay the field
as Origin: bbb896d..ee83724
next to the Parent/Merge fields.

Well, and after having done steps 1 to 5, the net result is that it
works almost as if the field is present in the header, except that:
- It is now at the end of the body in the commit message.
- It takes more time to find and parse it.

So that gives two minuses, and no pluses.
So short-circuiting the reasoning suggests that since the only thing
that actually changes now is the position of the field (at the top or
end of the commit message), we might as well do it right and put it in
the top, that gets rid of the two minuses.

Anything I missed?
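
To make the free-form variant in steps 1 and 6 concrete, here is a rough
Python sketch of finding "Origin:" lines at the end of a commit message
and producing the abbreviated display. The function names and the sample
message are assumptions for illustration; only the "Origin: <hash> <hash>"
layout and the hashes come from the proposal above.

```python
import re
from typing import List, Tuple

# One trailer line: the keyword followed by a pair of full 40-hex-digit hashes.
ORIGIN_RE = re.compile(r"^Origin: ([0-9a-f]{40}) ([0-9a-f]{40})$")


def parse_origin_trailers(message: str) -> List[Tuple[str, str]]:
    """Return the hash pairs from the run of Origin: lines ending the message."""
    pairs = []
    # Walk backwards from the last line and stop at the first non-Origin line,
    # so an "Origin:" mention in the middle of the body is not picked up.
    for line in reversed(message.rstrip("\n").split("\n")):
        match = ORIGIN_RE.match(line.strip())
        if match is None:
            break
        pairs.append((match.group(1), match.group(2)))
    pairs.reverse()
    return pairs


def abbreviate(pair: Tuple[str, str], width: int = 7) -> str:
    """Render a pair in the short form suggested for log output."""
    return f"Origin: {pair[0][:width]}..{pair[1][:width]}"


# Hypothetical commit message carrying the example field from step 1.
message = (
    "Fix overflow in parser\n"
    "\n"
    "Backport of the fix from the development branch.\n"
    "\n"
    "Origin: bbb896d8e10f736bfda8f587c0009c358c9a8599 "
    "ee837244df2e2e4e9171f508f83f353730db9e53\n"
)
for pair in parse_origin_trailers(message):
    print(abbreviate(pair))
```

The scan has to read the whole message before it can find the field,
which is the "takes more time to find and parse" cost listed above.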
Basically it means that:
a. If there is a better solution to tracking the backports, I'll gladly
use that instead, but simply using the current really freeform
approach doesn't cut it (it currently refers to a single commit,
instead of a pair of commits, and takes too long to parse out in a
--topo-order or blame command). Better solutions I haven't heard so
far.
b. I need the integrity protection of a commit to make sure that the
origin fields cannot be altered later; blame would be too easy to fool
otherwise. So using the notes solution seems to be out (it would also
be quite a performance hit again).
c. I consider the Or...
A good convincing demonstration that this is actually worth doing in the
first place. And here I'm talking about the _feature_ and not the

Technically speaking, implementation d is obviously the most efficient,
but, as mentioned above, the actual need for this feature has not been
convincing so far. Until then, it is not wise to add random stuff to
the very structure of a commit object, while c can be done even
externally from git which is a good way to demonstrate and convince
people about the usefulness of such feature.

Nicolas
--
The actual need for the feature seems to be dependent on one's workflow
habits. This is also the problem I sense throughout the thread: some
people know exactly what I'm talking about, and would come up with the
almost identical design specs for the feature independent of myself, and
others need every tiny detail of the spec explained because they
are not familiar with the concept and can't imagine why/how it would be
used.

Let me try and describe once more the typical environment this origin field
is vital in:

Imagine a repository with:
- 33774 commits total
- 13 years of history
- 1 development branch
- 9 stable branches (forked off of the development branch at regular
intervals during the past 13 years).
- The stable branches are never merged with each other or with the
development branch.
- 2787 individual back/forward ports between the development and stable
branches.

In order to have meaningful output for git-blame, it needs to follow the
chain across cherry-picks reliably.
Once you alter a piece of code, in order to figure out what more to alter,
you need to verify if this piece of code was or wasn't forward/backported.
Reliable and fast reporting of this, and actual comparison of the
different forward/backports between the 9 branches is essential. It
basically means that you need to view the diffs of the patches across 9
branches on a regular basis.

Without the origin links, this workflow will cost a lot more time to
pursue (I know it, because I'm living it at the moment, and no, I'm not
the only developer, it's a development team).

This development model is not unique to my situation, it occurs in more
places.
--
Sincerely,
Stephen R. van den Berg.
--
OK. I think I might be able to believe you.
Where I feel uncomfortable is with the real semantics of your "origin"
link proposal.

First, its name. The word "origin" probably has too narrow a meaning
that creates confusion. I'd suggest something like a
"may-be-related-to" field that would be like a weak link.

The format of a may-be-related-to field would be the same as the parent
field, except that the object pointed to by the sha1 could have its type
relaxed, i.e. it could be anything like a blob or a tag.

The semantics of a "may-be-related-to" link would be defined for object
reachability only:

- If the may-be-related-to link is dangling then it is ignored.
- If it is not dangling then usual reachability rules apply.
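
The two rules above can be sketched on a toy object graph. This is only
an illustration in Python under stated assumptions: `parents`,
`weak_links` and `present` are hypothetical stand-ins for the object
store, not anything git exposes.

```python
from typing import Dict, Iterable, List, Set


def reachable(
    tips: Iterable[str],
    parents: Dict[str, List[str]],
    weak_links: Dict[str, List[str]],
    present: Set[str],
) -> Set[str]:
    """Objects reachable from the tips, following weak links only when not dangling."""
    seen: Set[str] = set()
    stack = list(tips)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        # Parent links are followed unconditionally, as usual.
        stack.extend(parents.get(obj, []))
        # A weak link whose target is absent from the store is simply ignored.
        stack.extend(t for t in weak_links.get(obj, []) if t in present)
    return seen


# Toy histories: A---B on one line, X---Y---Z on another; Y has a weak link to B.
parents = {"A": [], "B": ["A"], "X": [], "Y": ["X"], "Z": ["Y"]}
weak_links = {"Y": ["B"]}

# With B in the store, the link is followed and B's ancestors come along.
full = reachable(["Z"], parents, weak_links, present={"A", "B", "X", "Y", "Z"})
# Without B, the link dangles and is skipped rather than reported as an error.
partial = reachable(["Z"], parents, weak_links, present={"X", "Y", "Z"})
print(sorted(full), sorted(partial))
```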
That's all the core git might care about, and the only real argument for
not having this information in the free form commit message.

Still, in your case, you probably won't get rid of your stable branches,
hence the reachability argument is rather weak for your usage scenario,
meaning that you could as well have that info in the free form text
(like cherry-pick -x), and even generate a special graft file from that
locally for visualization/blame purposes. Sure the indirection will add
some overhead, but I doubt it'll be measurable.

People fetching your main branch won't have to carry the whole
repository because those weak links would otherwise be followed if
they're formally part of the commit header. And if they want
to benefit from the information those weak links carry then they just
have to also fetch the branch(es) where those links are pointing. At
that point it is trivial to regenerate the special graft file locally
which would also have the benefit of only containing links to actually
reachable commits, hence you'd never have dangling "origin" links.

Conclusion: the only fundamental reason for having this weak link
information in the commit header is for reachability convenience for
when the actual branch that ...
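
As a rough illustration of regenerating link information locally from
the free-form text, the Python sketch below scans commit messages for
the "(cherry picked from commit <sha>)" line that cherry-pick -x appends
and keeps only links whose target is actually present. The function
name, the in-memory `messages` mapping and the output shape are
assumptions for illustration; this does not write a real graft file.

```python
import re
from typing import Dict, List, Set

# The line appended by "git cherry-pick -x", matched on a line of its own.
PICKED_RE = re.compile(
    r"^\(cherry picked from commit ([0-9a-f]{40})\)$", re.MULTILINE
)


def local_origin_links(
    messages: Dict[str, str], present: Set[str]
) -> Dict[str, List[str]]:
    """Map each commit to the cherry-pick sources that exist in this repository."""
    links: Dict[str, List[str]] = {}
    for commit, message in messages.items():
        targets = [t for t in PICKED_RE.findall(message) if t in present]
        if targets:
            links[commit] = targets  # dangling references never enter the table
    return links


# Hypothetical data: c1 points at a commit we have, c3 at one we do not.
have = "a" * 40
missing = "b" * 40
messages = {
    "c1": f"Fix crash\n\n(cherry picked from commit {have})\n",
    "c2": "Unrelated change\n",
    "c3": f"Backport\n\n(cherry picked from commit {missing})\n",
}
print(local_origin_links(messages, present={have}))
```

Because the table is rebuilt from whatever branches have been fetched,
it only ever contains links to commits that are locally available.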
I keep hearing "blame" in this discussion, but I do not understand why
people think blame should _follow_ this "origin" information (in the usual
sense of "following").

Suppose you cherry-pick an existing commit from unrelated context:

  ...---A---B
             . (origin)
              .
...---o---X---Y---Z

i.e. on top of X the difference to bring A to B is applied to produce Y,
and a new development Z is made on top. You start digging from Z.

Without any "origin", here is how blame works:
* What Z did is blamed on Z; what Z did not change is passed to Y;
* Y needs to:
(1) take responsibility for what it changed; and/or
(2) the remaining contents came from X --- pass the blame to it.
Let's see how we would want "origin" to get involved. Instead of the above,
what Y would do would be:
(1) if the contents (excluding the part Z changed) is different from X,
instead of taking the blame itself, give the _final_ blame to B.
(2) the remainder is passed to X as usual.
This is different from the normal "following" in that B is not allowed to
pass the blame to its parents (should it be allowed to pass it to its
"origin"?), because the _only thing_ cherry-pick did was to transport what
B did (relative to A) to the unrelated history that led to X.

IOW, you did not look at the contents outside "diff A..B" when you made
the cherry-pick. There could well be parts of the content that are common
across all of A, Y, X and Z, but as far as Y and Z are concerned, they did
not get any part of that common content from A (otherwise "origin"
is no different from "parent", but you did not merge).

The output from "origin" aware blame would be identical to the normal
blame, except that lines that usually are labeled with Y are labeled with
B. However:

(1) If you _are_ interested in the line that says Y, you can look at
the commit object Y and see "cherry-pick -x" information to learn
it came from B ...
Well, I'd expect:
a. That B should be able to pass blame onto its origin.
b. That B should be able to pass blame onto A (and deeper).

Let me show another example:
...-C---D---E---F---G
. (origin)
.
...---A---B
. (origin)
.
...---o---X---Y---Z

Now suppose there is a piece of sourcecode which evolves from C to F,
then when I dig into G using blame I get something like: CCCFFEGGDDDCC
(Every letter represents a line in the sourcecode)

Digging into Z I'd expect to see the following: ZZCCCFFEDDYDCCB
All this assumes that there were minimal changes to the patch when
creating B, and also minimal changes to the patch when creating Y.

I.e. large parts of that code were developed during C, D, E and F, so
that is what I expect to see; is that illogical?
--
Sincerely,
Stephen R. van den Berg.
--
I'm sorry, you're right, I'm confusing things here. The case I'm describing
here can only happen when you do this:

....-C---D---E---F---G
\...\...\..\ (origin)
.
...---A---B
. (origin)
.
...---o---X---Y---Z

I.e. the first cherry-pick needs to cherry-pick C, D, E *and* F into B,
that will result in four origin fields there.
And yes, that means that:
- blame follows origin links (repeatedly).
- blame does *not* travel to parents of commits found through an origin
link.

Does that mean that blame uses origin fields? Yes, it does, and it has
to check for origin links at every commit it traverses.
--
Sincerely,
Stephen R. van den Berg.
--
Well, the important properties of the name/field would be:
- It should be as specific as possible, in order to minimise the
potential for abuse in the future. I distill the desirability of this
requirement out of the various earlier discussions about commit headers
in the past on this mailing list held by others.
- It should convey a sense of direction (it's a directed graph).

The origin field as currently proposed tightens the requirements that
it either is dangling and ignored or points to a commit.
rev-list --topo-order should use the origin links to order the output.
gc/prune won't delete commits referenced *by* an origin link.

The only two other arguments one might give to actually keep the field
in the header of the commit as opposed to the trailer are that the
physical field can be kept machine readable, and the actual display can be
beautified like: Origin: 2abcdef..1234567
The output of the field could be suppressed (if so desired) if the
target commit isn't reachable.
All this is of course possible for a trailer field in the free-form

Then again, I don't want to be bothered by stupid free-form origin links
made to local branches by a developer. If the developer creates them
using cherry-pick -o which creates an origin link, I'll never have to
see his silly commit hashes where he is referring to commits in his

The free-form equivalent looks like:
Origin: df85f7855da44c730f942b330ada181209d09d7a ff1e8bfcd69e5e0ee1a3167e80ef75b611f72123
You need a pair of hashes, which is a bit bulky for my taste.

What special graft file would I need to visualise? Isn't having the

You lost me here somewhere. Could you give a concrete example with one

Erm. Quite the opposite, actually.
The practical use for the origin link in case the target is unreachable
is zero to none, so it can gleefully be ignored in that case.
But maybe the semantics of your "related" link and my origin link are
sufficiently distinct.

Agreed. But this is in reference to your "related" ...
Well, sure. But being too specific sometimes limits its usefulness.
There could be other usages for such a link which IMHO should be defined

Well, my whole argument is that if it has no generic purpose then it

And I disagree on the gc/prune point, as mentioned previously.
As to rev-list --topo-order, it doesn't need for this link to actually

And I think you should simply create a file within the repository with
that info instead of either the commit header or the free form text. It
gives all the usability advantages you wish for and more.

Nicolas
--
Um, why should "git fsck", or "git prune" or "git gc" need to
understand about this field? What were you saying about unclean
semantics, again? I thought you claimed that dangling origin links
were OK? So why the heck should git fsck care? And why shouldn't

As we discussed earlier in some cases renumbering the field is not the
right thing to do, especially if the commit in question has already
been cherry-picked --- and you don't know that. Again, this is why
prototyping it outside of the core git is so useful; it will show up

A proof of concept, even if it isn't fully performant, is useful to
prove that an idea actually has merit --- which clearly not everyone
believes at this point.

I'll also note that having a ***local*** database to cache the origin
link is a great way of short-circuiting the performance difficulties.
If it works, then it will be a lot easier to convince people that
perhaps it should be done in git-core, and by modifying core git functions.

Alternatively, if you think this is such a great idea, why don't you
grab a copy of the git repository, and start hacking the idea
yourself? If you have running code, it tends to make the idea much
more concrete, and much easier to evaluate. Or were you hoping to
convince other people to do all of this programming for you?

- Ted
--
Well, sort of. In order for swift parsing it should be a real field,
i.e. it should not be an English sentence (in order to avoid people
accidentally translating it); and it should list a pair of hashes
(patches/changesets are defined by the difference between two tree
snapshots). So it would be a -o option most likely, in order to provide

Dangling origin links are ok only if the developer in charge of the
repository doesn't care about the commits/branches they point to.
The definition of a "caring developer" is formalised by whether
the offending commits are already present in the repository or not.

This implies that fsck will skip the field if the hashes in question are
unreachable in the current repository.
If they are reachable though, fsck will follow the link and check the
whole tree referenced by the origin link. Obviously there are only two
conditions for an origin link: either the hash points to an unreachable
object or the hash points to a reachable object of type commit (and all
associated checks that go with any commit).

gc will preserve the commits the origin links point to once they are
reachable. I.e. if the developer doesn't care about the commits the
origin links point to (i.e. if the branches are not reachable) then gc
just skips them, if the developer *does* care, the origin links are used

I agree that the behaviour of especially rebase with respect to the
origin links is still something that needs to be thought through.
I'm not convinced you are right, but I'm not convinced you are wrong

Creating local databases for these kinds of structures feels kludgy
somehow, since the git hash objects essentially *are* a working
database. I have not checked yet if git already has some kind of

Actually, in the first hour after posting the initial mail/proposal I
already had altered a local version of git to support the origin links
in commit.[ch], --topo-order and fsck. Before hacking further I decided
to get some feedback first to see if someone wou...
This seems wrong. OK, suppose you have branches A, B, C, and D, while
you are on branch C, you cherry pick commit 'p' from branch B, so that
there is a new commit q on branch C which has an origin link
containing the commit IDs p^ and p.

Now suppose branch B gets deleted, and you do a "git gc". All of the
commits that were part of branch B will vanish except for p^ and p,
which in your model will stick around because they are origin links of
commit q on branch C. But what good are these two commits? They
represent two snapshots in time, with no context now that branch B has
been deleted. 99% of the time, the diff between p^ and p will result
in the equivalent of the diff between q^ and q. But even if they
aren't, what use are these isolated, disconnected commits? So having
"git gc" retain the commits that are pointed to by this proposed
origin link doesn't seem to make any sense, and doesn't seem to be
well thought through.

Oh, BTW, suppose you then further do a "git cherry-pick -o" of commit
q while you are on branch D. Presumably this will create a new
commit, r. But will the origin-link of commit r be p^ and p, or q^
and q? And will this change depending on whether or not -o is

Gitk already keeps a cache (.git/gitk.cache) to speed up some of its
operations. And in some ways the index file is a cache, although it
does far more than that.

- Ted
--
Not quite. Obviously all parents of p and p^ will continue to exist.
I.e. deleting branch B will cause all commits from p till the tip of B
(except p itself) to vanish. Keeping p implies that the whole chain of
parents below p will continue to exist and be reachable. That's the way

The context is all their ancestors, which continue to exist, and that

It will be q^..q, and specifically not p^..p, using p^..p would be
lying. We aim to document the evolution of the patch in time.
Cherry-pick itself will always ignore the origin links present on the

No. Actually, cherry-pick will never generate origin links unless -o is
specified.
--
Sincerely,
Stephen R. van den Berg.
--
That's still not very useful, since you still don't have a label for
this anonymous chain of commits that just dead ends at commit p.

So if you never pull branch C (where commit q resides), there is no
way for you to know that commits p and r are related. How.... not
useful.

If the scenario was being able to tell which stable branches had a
particular bug fix, I think my proposal of attaching a bug
identifier is a far superior solution.

Again, what's the use case of "trying to document the development of
the patch in time?" Aside from drawing pretty dotted lines
everywhere, what *good* does this actually achieve? How would it
affect other git commands' behavior, and how would this change in
behavior actually be considered a net improvement over what we have
now?

- Ted
--
That is a good point. Stephen has explained his workflow, and I can see
why he wants to reference the cherry-picked commits, and how he thinks
that the referenced commits will always be available in that workflow.
And obviously in Linus's workflow such references are basically useless,
and they should just not be generated.

But what about workflows in between? When I pull from some developer who
has added a weak reference to a particular commit SHA1, but I _don't_
have that commit, my next question is "OK, so what was in that commit?".
What is the mechanism by which I find out more information on that SHA1?

Using a key that is meaningful to an external database (like a bug
tracker) means that you can go to that database to look up more
information.

-Peff
--
This has _nothing_ to do with workflows or anything else.
Why are people claiming these total red herrings?
I have asked several times what it is that makes it so important that the
"origin" information be in the headers. Nobody has been able to explain
why it's so different from just doing it in the free-form part. NOBODY.

If somebody has a workflow where they want to track "origin" commits, then
they can do it today with the in-body approach. But that has nothing
what-so-ever to do with the question of "let's change object file format
to some odd special-case that we just made up and is only apparently
useful for some special workflow that uses special tools and special
rules".

I want the git object database to have really clear semantics. The fields
we have now, we have because we _require_ them. There is nothing unclear
what-so-ever about the semantics of author/committer-ship, parenthood,
trees, or anything else.

And there are _zero_ issues about "workflow". The workflow doesn't matter,
the objects always make sense, and they always work exactly the same way.
There are no special magic cases that are in the least questionable in any
way.

So this argument is about more than just "minimalism", although I'll also
admit to that being an issue - I want to be able to basically explain how
git data structures work to any CS student, and not have any extra fat or
any gray areas. It's about everything having a clear design, and a clear
meaning, and there never being any question what-so-ever about what the
real "meaning" of something is.

Then, if you have some special use case or rules for your particular
project, well that's where you can have things like formatting rules for
how the commit messages should look like. If somebody wants to use fixed
format rules for their project, that's fine. And THAT is where "workflow"
issues come up.

But "workflow" has nothing to do with core git data structures. They were
designed for speed, stability, si...
That's because the difference is small:
In the header is slightly faster and more elegant (both designwise and
displaywise), that's it.

Of course.
In any case, I think I got enough feedback from the list to create
a working implementation/concept which is going to use the free-form
trailer to implement the origin field.
--
Sincerely,
Stephen R. van den Berg.
--
The message you are responding to has nothing to do with an origin
header versus putting it in the free-form part. It is equally a problem
with both approaches.

I was purely commenting on the "if I mention an arbitrary sha-1, what is
the person reading it supposed to _do_ with it, if they may never have
seen that sha-1" issue.

So yes, it has _everything_ to do with workflows. In Stephen's case, he
claims that all references will be to commits on long-lived branches. In
which case, it is a non-issue because they will have the referenced
commits.

But in the general case, people will not have them, and there is
potential head-scratching. My point is that even if a feature works for
Stephen's workflow, it may not be a good feature for everyone, since
other solutions handle the general case (as well as his case) much

Yes, and I totally agree with everything you said. If you read the mail
you are responding to carefully, you will see that I never mention an
origin header versus the free-form commit.

-Peff
--
Well, the usual way to fix this is to actually start up fetch and tell it
to try and fetch all the weak links (or just fetch a single hash (the
offending origin link)) from upstream; this is by no means the default
operating mode of fetch, but I don't see any harm in allowing to fetch

True. And also a Good Thing, I concur.
--
Sincerely,
Stephen R. van den Berg.
--
Maybe I am misremembering the details of fetching, but I believe you
cannot fetch an arbitrary SHA-1, and that is by design. So:

1. You would have to argue the merits of changing that design. I
believe the rationale relates to exposing some subset of the
content via refs, but I have personally never felt that is very
compelling.

2. Even if we did make a change, that means that _both_ sides need the
upgraded version.

-Peff
--
If you're using origin links, you'd need that anyway, so that's a given.
I could imagine the minimum would be something like:

Allow direct SHA1 fetches (which obviously pull in all parents as well)
if the ref is part of one of the public branches (either as a commit,
or as an origin link).
--
Sincerely,
Stephen R. van den Berg.
--
And that's what I called stupid in my earlier reply to you. Either you
have proper branches or tags keeping P around, or deleting B brings
everything not reachable through other branches or tags (or reflog)
away too. Otherwise there is no point making a dangling origin link
valid.

Nicolas
--
Well, the principle of least surprise dictates that they should be kept
by gc as described above, however...
I can envision an option to gc say "--drop-weak-links" which does
exactly what you describe.
--
Sincerely,
Stephen R. van den Berg.
--
Well, IIRC the need for this was one of the causes of "death" of 'prior'
header link proposal...
--
Jakub Narebski
Poland
--
As I understood it, one of the causes of death of the "prior" link
proposal was that it was unclear if it pulled in the linked-to commits
upon fetch. In the "origin" case, the default is *not* to fetch them.
--
Sincerely,
Stephen R. van den Berg.
--
And that's WRONG. Both prior and origin must fetch them if they're
referenced in the header.
--
By definition of the origin header field that is not wrong, there are no
other rules. But the point is moot at the moment, since I'm going to
create a proof of concept which puts the field in the free-form trailer.
--
Sincerely,
Stephen R. van den Berg.
"Father's Day: Nine months before Mother's Day."
--
Don't you think this starts to look silly at that point?
Nicolas
--
No, it's the developer's vote controlling his own repository saying:
Ok, I expressed interest in the other branches and their
backport/forwardport relationships, but I changed my mind. Drop all
backport/forwardport information on branches I don't explicitly have.
--
Sincerely,
Stephen R. van den Berg.
--
After thinking about this a bit, I don't think that recording
origin(al) commits for rebased commits would be a good idea. While one
can reasonably expect that cherry-picked changes should stay, and
reverted changes even more so (usually one reverts a commit from
a history), usually the original commits being rebased are meant

It is the simplicity that is the most compelling aspect of this solution.
For revert we have "origin B B^", for cherry-pick we have "origin A^ A"
(or 'changeset'); and always we have <rev> =~ <rev>^ + (<r2> - <r1>),
where '-' denotes the diff operation (<diff> = <tree1> - <tree2>), and '+'
denotes patch application (<tree1> = <tree2> + <diff>).

[ADDED LATER]
Also it could be useful for patch management interfaces using Git
as engine, such as StGIT, Guilt (formerly gq), TopGit, or the now defunct,
obsoleted and no longer maintained Patchy Git aka 'pg'.

The "weak" 'origin'/'changeset' header would allow some sort of

Errr... weren't you proposing to keep/protect against pruning <cousin>
AND <cousin>^<mainline>? You want to have _diff_ (changeset) protected,
not a single tree state.

And having "origin <r1> <r2>" makes it easier then to check validity; you
don't need to get <r1>, check if it has <mainline> parent and what it is,
and then check if <r1>^<mainline> exists (and is not for example behind

I don't think that it is true in this case. This sentence _looks_ like
it offers / requires additional protection, while this "protection" is

Hmmm... while I think it might be a good idea, I'm not sure about its

But blame is all about what commit brought some line to the current version.
So the cherry-pick itself, or revert of a commit itself would be blamed,
and should be blamed, not its parents, nor commit which got cherry-picked,
or commit which got reversed.

It would be nice to be able to follow 'origin'/'changeset' lines in the
_graphical_ blame b...
Actually, making sure that the commit we reference in the origin link
exists, we implicitly prove that all the parents of that commit exist as
well. Then again, this point is moot since I already conceded (in a

Actually, I have already programmed this part, and the overhead is close
Well, it depends, I guess.
If you'd go for a "committer" based display, then following origin links
is bad.
If you'd go for an "author" based display, then following origin links
should be the default (IMHO).
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
One thing to keep in mind is that you are not just proposing some new
behavior for a command, but rather a new header for the data structure
that we will live with from now until eternity. So I think it makes
sense to allow the general case even if nobody is generating it yet, if
there is some chance that it may be useful for somebody to generate in
the future.

And yes, you can get _too_ general to the point where your semantics
become meaningless. But I don't think that is the case here. You are
defining the origin field as "by the way, the difference between state X
and state Y was used to make this commit". cherry-pick just happens to
make Y=X^, but something like rebase could use a series.

As for "git vs changeset": this is git. So you have a sequence of tree
states whether that is what you want or not. Thus you are specifying
the difference between _some_ pair of commits. I don't see any benefit

Actually, it could decrease it. If I tell you that you must have "X" and
"X^2", then you could get away with just checking if you have "X". But
you might also want to check whether "X" even _has_ a second parent. And
that means not just looking up the object, but accessing it (resolving
deltas if need be, uncompressing, parsing the object). With "X" and
"Y", it is just two object lookups.

Now obviously you don't have to be quite so careful in the "hash plus
parent" case. And if you are going to _do_ anything with the origin
field, you will end up accessing those objects anyway. But in that case,
you end up with the same number of lookups and accesses anyway: 2 of

I think that is smart; if somebody wants to drill down into the history
of origin links, they can do so at lookup time.

-Peff
--
Another thing that made me wonder...
To be consistent, when you are at HEAD and are merging side branch B,
because that merge is to incorporate what happened on the side branch
while you are looking the other way, we should say "by the way, the
difference between state $(git merge-base HEAD B) and state B was used to
make this commit." in the resulting merge commit, shouldn't we?

What happens if there is more than one merge base?
--
I suppose you could, though in that case you can obviously calculate the
merge base yourself from the parents of the merge. The difference with
cherry-picking (or rebasing) is that you might not otherwise know about
the "by the way" commits.

-Peff
--
Quite. The origin link is primarily intended for cherry-picks/reverts
which are otherwise difficult to find. Anything that can use the normal
parent mechanism has no business using the origin links.
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
As for "by the way ... was used to make this commit": this is git. So how
you arrived at the tree state you record in a commit *does not matter*.

Not only that, it is not just "the difference between state X and Y" that
you used to come to that tree. Another thing that is involved is the
specific cherry-pick implementation back when the commit was made. That
was what gave you the tree.

To my ears, it rhymes rather well with a famous quote from $gmane/217:

    You're freezing your (crappy) algorithm at tree creation time, and
    basically making it pointless to ever create something better later,
    because even if hardware and software improves, you've codified that
    "we have to have crappy information".

After reading the discussion so far, I am still not convinced that this is a
good idea, nor that this time around it is that much different from what the
previous "prior" link discussion tried to do.
--
The typical use case for the origin links is in a project with several
long-lived branches which use cherry-picks to backport amongst them.
There is no real other way to solve this case, except for some rather
kludgy stuff in the free-form commit message which doesn't mesh well
with rebase/filter-branch/stgit etc.

I tried to accommodate this approach by overloading the parent link and
then making git more intelligent to figure out if it is a cherry-pick or
not. That was deemed undesirable, so using the origin links is the next

It is well-defined this time, and doesn't bleed across fetch/pull.
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
But it _does_ matter, which is why we have commit messages to explain
how you arrived at this tree state.

For the record, I am not convinced it is a good idea either; I was
hoping to steer it in a direction where somebody could say "and now this
is the useful thing we can do now that we could not do before." If the
ultimate goal is to put links to other commits into history viewers,
then the commit message is a reasonable place to do so. The only thing I
see improving with a header is that it makes more sense for pruning and
object transfer.

-Peff
--
Well, that is why I was careful to say that "origin <rev1> <rev2>"
(or 'changeset', or 'cset') means that the tree state for the given commit
is created out of the parent commit (or parent commits in the case of a
merge) and the (<rev2> - <rev1>) patch. This slightly extends
"parent <rev>", which means that the tree state for the current commit is
derived from the tree state of <rev>.

I'm also not at all convinced that a 'cousin'/'origin'/'changeset'/'cset'
header is a good idea. I only tried to steer the discussion in a good
direction in case it is a somewhat good idea.

First, if the only goal were to add extra links (extra edges) to a
[graphical] history viewer, then the full sha-1 of a commit, which can be
recorded in the commit message for cherry-picks and reverts, should be
enough. It does mean parsing the commit message, with all the
possibilities for mistakes that come with using conventions in the
free-form part of the commit object; on the other hand it is not _that_
critical.

If however 'origin' links are more (perhaps only a tiny bit more),
for example the discussed "weak" links... then I'm not sure if
the tradeoffs are worth it. First, if it is full connectivity like
in the 'parent' header case, then a) why not use 'parent' anyway,
b) it pins the history indefinitely. Second, if it is a "weak"
link, i.e. protected only locally on prune, then a) there are problems
with transferring the data, and protecting links on transfer,
as somewhere in the middle or at the end there might be a repository
which uses an older git (backwards compatibility strikes again),
b) git in many, many places assumes that an object is valid if it passes,
and that all objects linked to from that object are valid; we would have
to either use some kind of separate 'not strictly checked'
packfile/storage, or have a grafts-like thingy.

So I'm not sure if 'origin' links are worth the trouble.
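For comparison, the free-form approach discussed above can be sketched in a few lines. This is an illustrative Python sketch, not part of any git tool: it looks for the line that `git cherry-pick -x` appends, the sentence that `git revert` writes, and two of the hand-written -stable forms quoted at the start of this thread. The names `PATTERNS` and `find_origins` are made up for the example, and any other hand-written variant would simply be missed, which is exactly the fragility being discussed.

```python
import re

SHA = r"[0-9a-f]{40}"

# (kind, pattern) pairs; the last two mirror the hand-written -stable forms.
PATTERNS = [
    ("cherry-pick", re.compile(r"\(cherry picked from commit (%s)\)" % SHA)),
    ("revert", re.compile(r"This reverts commit (%s)" % SHA)),
    ("upstream", re.compile(r"\[ Upstream commit (%s) \]" % SHA)),
    ("upstream", re.compile(r"^commit (%s) upstream" % SHA, re.MULTILINE)),
]


def find_origins(message):
    """Return (kind, sha) pairs for every origin-like reference found."""
    found = []
    for kind, pattern in PATTERNS:
        for match in pattern.finditer(message):
            found.append((kind, match.group(1)))
    return found


picked = "Fix foo\n\n(cherry picked from commit " + "a" * 40 + ")\n"
reverted = 'Revert "Fix foo"\n\nThis reverts commit ' + "b" * 40 + ".\n"
stable = "Fix bar\n\ncommit " + "c" * 40 + " upstream\n"
```

A history viewer could draw its extra edges from the pairs returned by `find_origins`, without any change to the commit object format.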
About the much, much earlier "prior" link discussion: I think the
discussion about the "prior" header link was done before reflogs, or at
least before ref...

By the way, besides graphical history viewers it would also help rebase
(and git-cherry) better notice when a patch was already applied.

--
Jakub Narebski
Poland
--
I think that rebase had better not trust the origin links in deciding
whether a patch was already applied; it already does it well enough.

git-cherry is another story, as that tool is not faking "changeset mode"
so well (because it cannot attempt merges, and these are what allows
git-rebase to fake changesets much better).

Paolo
--
Yup. Same here.
I didn't see any information about why this "origin" link is
needed here, just how it might work.

And some of that "how" scared me because it was doing some sort of
"soft" reachability, where errors aren't noticed but we are expected
to protect the data from prune/repack forever once it has entered
the repository.

--
Shawn.
--
Quite. I'll drop the old format and adapt my proposal to use the double
hash.

As far as the naming of the field is concerned: a changeset is what the
field describes, but changeset implies no sense of direction; origin
makes it clear that the current commit was derived *from* the changeset
represented by "origin".
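To make the double-hash form concrete, here is a rough Python sketch of how a tool might read such a header and tell a cherry-pick from a revert by the direction of the pair. The `origin <rev1> <rev2>` header is only the proposal under discussion and does not exist in git; the helper names, the short placeholder hashes, and the `parents` lookup table are all assumptions made for illustration.

```python
def parse_headers(raw_commit):
    """Split a raw commit object into (header pairs, message)."""
    header_text, _, message = raw_commit.partition("\n\n")
    headers = []
    for line in header_text.splitlines():
        key, _, value = line.partition(" ")
        headers.append((key, value))
    return headers, message


def origin_pairs(headers):
    """Collect (rev1, rev2) from every hypothetical 'origin' header."""
    return [tuple(value.split()) for key, value in headers if key == "origin"]


def classify(pair, parents):
    """Guess the operation from the direction of the recorded pair."""
    rev1, rev2 = pair
    if rev1 in parents.get(rev2, []):
        return "cherry-pick"   # B~1..B: applied the change forwards
    if rev2 in parents.get(rev1, []):
        return "revert"        # B..B~1: applied the change backwards
    return "range"             # anything else: an arbitrary difference


# Placeholder hashes; b0 is the parent of b1.
parents = {"b1": ["b0"]}
raw = (
    "tree t1\n"
    "parent p1\n"
    "origin b0 b1\n"
    "author A U Thor <author@example.com> 0 +0000\n"
    "\n"
    "Backport fix\n"
)
headers, message = parse_headers(raw)
pairs = origin_pairs(headers)
```

Here `classify(pairs[0], parents)` reports a cherry-pick, while the swapped pair `("b1", "b0")` would be reported as a revert.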
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
And, incidentally, the above representation will potentially mesh well
with svn integration, making it possible to cleanly represent svn 1.5

It does intuitively (but perhaps incorrectly) seem like the origin
information could be used to make more intelligent decisions about
automatic conflict resolution, if nothing else. Though obviously that
might, as you suggest, be a pretty big departure from the way git
merges currently work.

-Steve
--
I think I forgot two:
- git rebase will fix up any origin pointers which point back into the
  strain being rebased.
- git filter-branch will rewrite origin pointers which point to commits
  that receive a new hash.
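Both items amount to the same bookkeeping: once a rewrite has produced an old-hash to new-hash map, every origin pointer that names a rewritten commit is redirected. A small Python sketch of that step follows, under the assumption that origin pointers are available as a plain mapping; `rewrite_origins` and the placeholder hashes are hypothetical and not taken from git.

```python
def rewrite_origins(origins, rewritten):
    """Apply an old->new hash map to commits and to their origin pointers.

    origins:   commit hash -> list of origin hashes it points at
    rewritten: old hash -> new hash for every commit that was rewritten
    """
    return {
        rewritten.get(commit, commit): [rewritten.get(o, o) for o in targets]
        for commit, targets in origins.items()
    }


# Placeholder hashes: c2 was cherry-picked from c1, c3 from an untouched e9.
origins = {"c2": ["c1"], "c3": ["e9"]}
# A rebase or filter-branch run rewrote c1 and c2, leaving c3 and e9 alone.
rewritten = {"c1": "n1", "c2": "n2"}

updated = rewrite_origins(origins, rewritten)
```

Pointers to commits outside the rewritten set, such as `e9` here, are left as they are.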
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
What about just storing *two* hashes? This way cherry-pick can store
B~1..B and revert can store B..B~1. The two cases can be distinguished

Will cherry-pick -x create origin links? Also, does the origin link
propagate through multiple cherry picks? If not, how can the origin

git cherry will use origin links to mark a commit as present, and will
only use patch-ids for commits that have no origin links. Bonus points
for an extra command-line/configuration option to only use origin links:

  --source=default   << default: get setting from core.cherrysource
  --source=patch-id
  --source=origin
  --source=origin,patch-id

  core.cherrysource = patch-id
  core.cherrysource = origin
  core.cherrysource = origin,patch-id

Thanks!
Paolo
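The selection rule proposed above could look roughly like the following Python sketch. Neither `--source` nor `core.cherrysource` exists in git; they are the options suggested in this message, and the commit dictionaries, function names, and the simplification that a commit's own origin link names an upstream commit are assumptions made only for this example.

```python
VALID_SOURCES = ("origin", "patch-id")


def resolve_sources(cli_value=None, config_value="patch-id"):
    """Pick the source list: --source wins unless it is absent or 'default'."""
    value = config_value if cli_value in (None, "default") else cli_value
    sources = value.split(",")
    for source in sources:
        if source not in VALID_SOURCES:
            raise ValueError("unknown source: %r" % source)
    return sources


def is_present(commit, upstream_hashes, upstream_patch_ids, sources):
    """Decide whether `commit` already has an equivalent upstream.

    Origin links take precedence; patch-ids are consulted only for
    commits that carry no origin link.
    """
    origin = commit.get("origin")
    if "origin" in sources and origin is not None:
        return origin in upstream_hashes
    if "patch-id" in sources:
        return commit["patch_id"] in upstream_patch_ids
    return False


# Placeholder data: one commit with an origin link, one without.
picked = {"origin": "u1", "patch_id": "p9"}
plain = {"origin": None, "patch_id": "p2"}
upstream_hashes = {"u1"}
upstream_patch_ids = {"p2"}

both = resolve_sources("default", "origin,patch-id")
```

With `both`, the first commit is matched through its origin link and the second through its patch-id; with `--source=origin` alone, the second commit would not be reported as present.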
--
Valid point, but consider:
The new commit to receive hash A. The diff between A~1..A and B~1..B
actually defines the relation. Revert and cherry-pick are symmetrical
operations as far as git is concerned since git tracks content.
So I'm not quite sure if we actually need this extra information, git

I'd propose a cherry-pick -o and revert -o for that.
I wouldn't want to force the text which -x generates into the commit

The origin link is created point-to-point from the object referenced by
cherry-pick/revert to the new commit. The link creation specifically
does not follow any existing origin links. If you want the origin link
to point to a deeper origin behind the current, then cherry-pick from

That can only happen during a fetch/pull, which doesn't use origin links
to determine transmittability by default.
--
Sincerely,
Stephen R. van den Berg.
"Be spontaneous!"
--
