Re: git-submodule getting submodules from the parent repository

Previous thread: Summer of Code 2008 project application draft: Pack v4 by Peter Eriksen on Saturday, March 29, 2008 - 1:50 pm. (3 messages)

Next thread: [PATCH] git-svn: remove redundant slashes from show-ignore by Eric Wong on Saturday, March 29, 2008 - 4:37 pm. (4 messages)
From: Avery Pennarun
Date: Saturday, March 29, 2008 - 3:35 pm

Hi all,

I have a fairly pressing need for git-submodule-like behaviour, but
having tried git-submodule, it doesn't really work the way I'd like.

A super-simplified example of what I do:

- Project A (app) includes project B (build environment), which
includes project C (tool library)

- The projects are all open source, but B includes some binary
packages so it's a big download.  If you don't need the binary
packages, people want to just download C (hence the separation). But
everyone using A wants B and C, because they're lazy and bandwidth
isn't a problem.

- We have a local repository at work with mirrors of A, B, and C,
which are also available publicly (but there's no reason for everyone
in our office to be uploading/downloading the same big blobs all the
time).

- We frequently change B and C as part of building A (as well as other
A-like applications).

Here are the main problems, all in a jumble:

It's a pain to check out / mirror / check in / push.  git-submodule
doesn't even init automatically when you check out A, so you have to
run it yourself.  The relative paths of A, B, and C on your mirror
have to be the same as upstream.  You can't make a local mirror of A
without mirroring B and C.  B and C start out with a disconnected
HEAD, so if you check in, it goes nowhere, and then when you push,
nothing happens, and if you're unlucky enough to pull someone else's
update to A and then "git-submodule update", it forgets your changes
entirely.  When you check in to C, you then have to check in to B, and
then to A, all by hand; and when you git-pull, you'd better do C, then
B, then A, or risk having A try to check out a revision from B that
you haven't pulled, etc.

...phew.

It would probably be possible to fix each of these problems
individually, but it would be a whole series of different fixes.  I'd
like to propose a rather different way of doing things that I think
would solve most of these problems, and get some feedback:

What if *all* the objects ...
From: Sam Vilain
Date: Saturday, March 29, 2008 - 4:22 pm

Well, that would create a lot of unnecessary work when cloning.
Partitioning by project is a natural way to divide the projects up.
It's worth noting that the early implementations of submodules were
based on this design, of keeping everything together.

However, what you are suggesting should IMHO be allowed to work.  In
particular, if the submodule path is ".", then I think there's a good
case that they should come from within the same project.  If it's a
relative URL, it should initialize based on the remote URL that was used
for the original fetch (or, rather, the remote URL for the current branch).

And if, after a checkout, the commit of a submodule is already in the
object directory (i.e., there's another branch), then

It could easily, if someone would allow clone to have a --track option
like git remote does:

  git init

This push failure thing is regrettable; however it's not clear which
branch name the submodules should get.  A given commit might exist on

I think this could be a switch to git clone/pull, configurable to be the

There is also a Google Summer of Code project for this - see

Well, no, it's true that the current workflow has interface niggles;
however it's important to understand why the current implementation is
the way it is, and make sure that new designs build on top of the parts
which are already designed well, where they can.

Sam
--

From: Eyvind Bernhardsen
Date: Sunday, March 30, 2008 - 6:32 am

I solved that by adding a "submodule push" that pushes the detached
head of each submodule to its own ref
("refs/submodule-update/commit-$sha1", imaginatively).  I also made
"submodule update" try to fetch that ref when looking for a sha1.
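
The "submodule push" patch itself was never posted in this thread, so the following is only a minimal sketch of the idea using plain git commands. The ref name follows the message above; the temporary repositories and the `demo` identity are assumptions made up for the illustration.

```shell
#!/bin/sh
# Sketch: publish a detached-HEAD commit under a per-commit ref,
# then fetch that exact commit from somewhere else.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q --bare "$work/remote.git"   # stands in for the shared submodule repo
git init -q "$work/sub"                 # stands in for a submodule checkout
cd "$work/sub"
git remote add origin "$work/remote.git"

echo one > file
git add file
git commit -q -m "first"

# Detach HEAD to imitate the state "git submodule update" leaves behind,
# then commit on top of it.
git checkout -q --detach
echo two > file
git commit -q -am "change made on a detached HEAD"

sha1=$(git rev-parse HEAD)

# Push the commit itself; no branch name is needed on either side.
git push -q origin "$sha1:refs/submodule-update/commit-$sha1"

# A different repository can now fetch the commit by asking for that ref.
git init -q "$work/other"
git -C "$work/other" fetch -q "$work/remote.git" \
    "refs/submodule-update/commit-$sha1"
fetched=$(git -C "$work/other" rev-parse FETCH_HEAD)
echo "fetched $fetched"
```

Because the ref name contains the sha1, two people pushing different commits never compete for the same ref, which is the property the message is after.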

I ran into trouble trying to avoid pushing every submodule for each  
"submodule push", and then more or less decided not to use submodules,  
so it's not quite fit for public consumption.  I still think it's a  
sound idea in principle, so I'll clean it up and send it to the list  
if there's any interest.
-- 
Eyvind Bernhardsen

--

From: Sam Vilain
Date: Sunday, March 30, 2008 - 10:48 am

Hmm, a reasonable decision, but I think it would be better to force the
user to choose which branch they want to push to.  Leaving breadcrumbs

Indeed - it can only become "fit for public consumption" if people
submit their usability enhancements!

Sam.
--

From: Eyvind Bernhardsen
Date: Sunday, March 30, 2008 - 12:50 pm

Well, the point of "submodule push" was to avoid having to push in  
each submodule manually; not enforcing the requirement that commits in  
submodules must be publicly available before pushing from the main  
module is a recipe for disaster, or at least annoyance.  And nobody  
likes an annoying git.

Pushing to a branch works except that I couldn't figure out what to do  
if the push doesn't succeed, ie, the branch has advanced on the remote  
end.  That's a problem if more than one module references the  
submodule or there are multiple branches in the main module.

One solution that occurred to me was to have a branch in each  
submodule for every main module and branch.  A branch name would be  
provided for each submodule in .gitmodules, used by "submodule push"  
but not "submodule update".  In this case, if the push to the branch  
fails, the main module branch is probably behind too.

This seemed like a good idea, but it's racy.  If two simultaneous  
"submodule push"es try to push to the same branch on a submodule, one  
of them will be rejected, but it might already have updated branches  
on other submodules.  Ick.

I briefly toyed with creating tags named after the main module and its
branch, with the submodule sha1 included for good measure, but that
leaves a _real_ mess in refs/tags.  Figuring out that I could use
refs/submodule-push instead seemed like an epiphany at the time.


As an aside, my mental model of what the submodule needs is a  
fetchable reflog for every main module and branch that uses it,  
containing the history of commits used by that module/branch.

It's a reflog, not a branch, because a submodule can be changed to a  
different branch, rewound, etc between commits in the main module;  
there's no requirement that the old commit is in the new commit's  
history.  You actually don't want to fetch the whole thing, but you  
have to be able to fetch every sha1 contained in it, by sha1.

...so that's what refs/submodule-push is ...
From: Sam Vilain
Date: Sunday, March 30, 2008 - 1:19 pm

It's simple.  You just fail and tell the user what happened, and let

If it is a rewind there is no issue, because you don't even need to push.

But again it comes back to - let the user sort it out, don't try to be
too clever.

Sam.

--

From: Eyvind Bernhardsen
Date: Monday, March 31, 2008 - 3:05 am

I think you misunderstood: what I'm saying is that submodules'  
_current_ behaviour is annoying, since you're guaranteed to forget to  
push a submodule before pushing the main module at least once.  My  
attempt to solve that became too complicated, so I dropped it, and  
since the current behaviour is annoying, I gave up on submodules  

Sure, that solves the annoyance problem, but I wanted something more  

Yep, my problem was wanting to be cleverer than my limited git skills  
will allow.
-- 
Eyvind Bernhardsen

--

From: Avery Pennarun
Date: Sunday, March 30, 2008 - 4:03 pm

On Sun, Mar 30, 2008 at 3:50 PM, Eyvind Bernhardsen

That's easy: just error out in that case.  If the current system would
just error out when I screwed up, I'd at least be able to deal with
it.  Right now I silently create un-check-outable parent repositories
because I failed silently to upload my latest checkins to the child

What is unsafe about "submodule update"?

Thanks,

Avery
--

From: Eyvind Bernhardsen
Date: Monday, March 31, 2008 - 2:29 am

As I tried to explain, all the automatic push solutions I could come  
up with were flawed, so I decided not to use submodules at all and  
just have the build tool check out every module (that's what we  
currently do with CVS, so it's the easy way out anyway).

If I understand you correctly, you want to be forced to create a  
branch and push to that?  I don't think that works well with many  
developers pushing to a shared repository (my situation), and is in  
any case not the "automagical push" solution that I want.  I agree  

If you have local changes committed in a submodule that is updated by  
a pull in the main module, "submodule update" will silently overwrite  
them.  I was wrong, though, because you can fix that just by making  
"submodule update" error out when a submodule doesn't have its HEAD  
where the main module thinks it should be.
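
A rough sketch of that check, to be run before the pull, might look like the following. `check_submodule` is a hypothetical helper, not an existing git command, and the nested repository is added directly as a gitlink only to keep the example self-contained.

```shell
#!/bin/sh
# Sketch: detect a submodule whose HEAD has moved away from the commit
# recorded in the superproject, so an update could refuse to clobber it.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/super"
cd "$work/super"
git init -q sub
( cd sub && echo one > file && git add file && git commit -q -m "sub: first" )

# Adding a nested repository records its HEAD as a gitlink entry.
git add sub 2>/dev/null
git commit -q -m "super: bind sub"

# check_submodule PATH (hypothetical): succeeds only if the submodule's
# HEAD is exactly the commit recorded in the superproject's HEAD tree.
check_submodule() {
    recorded=$(git ls-tree HEAD -- "$1" | awk '{print $3}')
    actual=$(git -C "$1" rev-parse HEAD)
    [ "$recorded" = "$actual" ]
}

if check_submodule sub; then before=match; else before=mismatch; fi

# A local commit in the submodule moves its HEAD away from the gitlink.
( cd sub && echo two > file && git commit -q -am "sub: local change" )

if check_submodule sub; then after=match; else after=mismatch; fi
echo "before: $before, after: $after"
```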
-- 
Eyvind Bernhardsen

--

From: Avery Pennarun
Date: Monday, March 31, 2008 - 2:36 pm

On Mon, Mar 31, 2008 at 5:29 AM, Eyvind Bernhardsen

I even *use* git-submodule and had to modify my build scripts because
"git submodule init" and "git submodule update" don't seem to kick in
automatically for some reason.  The ideal situation would be to have
git just manage the version control without having to babysit it, of
course.  That's hard to do in the general case, but should be quite

Hmm, this is curious.  If you're *not* using submodules, then I don't
think you can push successfully without being on a branch, can you?
So the suggestion merely extends this behaviour to submodules.

(To be more precise, 'git push' seems only to be able to push branch
heads.  When you're not using git-submodule, commits are by default
attached to branch heads, so this doesn't cause a problem.  If you
disconnect your HEAD, trying to push will silently do nothing, because
it'll push some other branch head that hasn't changed, or maybe no
branch at all.  But with git-submodule, the *default* is a
disconnected HEAD, which is too dangerous.  I propose to simply have
it fail out in this case.)

If you 'git checkout -b branchname' inside a submodule, then 'git
push' will do the right thing, so I'm not sure what you'd want to be

Shouldn't "git merge" get a merge conflict if you've made a checkin
that changed the submodule pointer, then try to pull someone else's
checkin that changes the submodule pointer to something else?  It
would seem there's no better option than that.

While we're here, it's inconvenient to have to call "git submodule
update" at all when there *isn't* a conflict.  It should always be
safe for git checkout or git merge to do that for you, no?

Thanks,

Avery
--

From: Sam Vilain
Date: Tuesday, April 1, 2008 - 4:05 pm

The reason is that not everyone wants that by default.  Perhaps it is a
good idea for it to be default behaviour; but all in good time.  It can

I can understand the motivation to write such disparaging remarks;
however it may be more productive to come up with good ideas about how
it can be made to work better for you, without getting in the way of

Sure, you could;

  git push origin HEAD:branchname

However I think the right solution to this is to name the branch

Well, where did you get the branch name from?  That's the part that
requires user intervention.  You could make an educated guess, such as
with git name-rev, but it would not necessarily be the right guess - so
user confirmation of the choice would be desirable.

Sam.
--

From: Avery Pennarun
Date: Tuesday, April 1, 2008 - 4:56 pm

I didn't mean anything disparaging.  I have nothing against babysitters :)

I'll be happy to work on patches once we have some sort of consensus

Okay, yes.  But that's just arbitrarily avoiding a local branch and
creating a remote one instead.  I can't imagine a situation where
you'd really want the local branch to be anonymous while the remote
one is not.

When doing a normal "git clone" without submodules, git automatically
creates you a local branch with the same name as the remote's
.git/HEAD - which is rather arbitrary, but even an arbitrary local
name is better than no name, and when checking out a brand new
submodule, there are *no* local branches, so a name conflict is

Here's a paraphrase of what I suggested earlier.  I don't think it got
a response:

Instead of storing only the commitid of each submodule in the parent
tree, store the current branch name as well.  Use this as a hint to
'submodule update' so that when it checks out commitid, it names the
local branch with the same name as it used to have.  (This is rather
user-friendly since if I check in, push, and clone, my new submodule
checkout will have the same branchname as it used to have.)

Note that the newly checked-out submodule branch will probably have
the same name as a remote branch.  However, the remote branch may
refer to a different commitid (for example, if someone has pushed to
that branch after the parent repo was last updated).  This is exactly
right; it means that if I cd into the submodule and "git push", it'll
fail because I'm not up to date (I can always switch to a new branch
if I want), and if I "git pull", it'll pull from the place where it
should.

This way, cloning a project with submodules will work much like
cloning the parent project; pushing and pulling the parent and the
submodules will do as you expect.

The bad news is that this would require a change to the tree format
for submodules (to contain the branch name).  Is that a problem?  Can
it be done in a ...
From: Junio C Hamano
Date: Tuesday, April 1, 2008 - 5:35 pm

That goes quite against the fundamental design of git submodules in that
the submodules are by themselves independent entities.  An often-cited
example is an appliance project, where superproject bundles a clone of
Linux kernel and a clone of busybox repositories as its submodules.

Each submodule is an independent project, and as such, must not know
anything about the containing superproject (iow, the superproject can know
what the submodules are doing, but submodules should not know they are
contained within a particular superproject).

If your superproject (i.e. the appliance product) uses two branches to
manage two product lines, named "v1" and "v2", these names are local to
the superproject.  It should not force the projects you borrow your
submodules from to have branches with corresponding name.

Also, the submodules and the superproject are meant to be loosely coupled.
A single branch in superproject (say "v1") may have many branches in a
submodule ("add frotz to v1 product", "improve nitfol in v1 product") that
can potentially be merged and bound to.

The work flow for updating a tree would look like:

 - People "git fetch" in superproject and in its submodules. They
   obviously prime their tree with "git clone" of superproject, and the
   submodules they are interested in, and a single fetch will update all
   the remote tracking branches, so it does not really matter which branch
   is checked out.  However, if you employ a central repository model to
   keep them, an invariant must hold: all the necessary commits in
   submodules must be _reachable_ from some branch in them.

 - When not working in a particular submodule, but using it as a component
   to build the superproject, it would be better to leave its HEAD
   detached to the version the superproject points at.  IOW, usually you
   won't have to be on any branch in submodules unless you are working in
   them.

 - Sometimes you need to work in a submodule; e.g. you would want to add
   'frotz' tool ...
From: Avery Pennarun
Date: Tuesday, April 1, 2008 - 7:03 pm

Not sure what you mean here; the supermodule already stores the
commitid of the submodule.  All I'm proposing is that it also store
the default branchname (ie. the branchname that the submodule was
using when its gitlink was checked into the supermodule) along with
that commitid.  The submodule never knows anything about the


I meant that we should store the submodule's branch name when
committing the superproject, and put it back when checking out the

I agree that the submodule should have its HEAD pointing at exactly
the superproject-specified commit.  However, I believe this commit
should have a local branch name (in the subproject) attached to it, or
else (as I and my co-workers have frequently experienced) people will
accidentally check in to a nameless branch, causing 'git push' to
silently not upload anything, and thus lose track of their commits.  I
have lost work this way.

The idea of naming the local-subproject-branch with the same name as
it had on checkin is that then "git pull" in the subproject will work
exactly as expected: it'll get you the latest version of the branch
the superproject developer was on.  But if you *don't* explicitly "git
pull" in the subproject, I'd expect (of course) the checkout to stick
to the commit specified by the superproject - and also to leave its

This is where my workflow is a bit different.  One of my subprojects
is a library that gets used by several application superprojects.  I
often add features to my library in the process of editing a
particular superproject.  I also expect my co-developers to want to do
the same.  Thus, the difference from your example is that I want to
streamline the process of working in a subproject as well as a
superproject, and minimize the chances of losing data in this case.

With the current system the way it is, it's too easy to make mistakes,

As an orthogonal secondary wish, I'd like to have the subproject and
superproject hosted in the same remote repository.  This appears to ...
From: Sam Vilain
Date: Wednesday, April 2, 2008 - 1:06 pm

How about this.

This could be an optional disambiguator in .gitmodules in the 
superproject, to allow you to "store the branch it was made from".  Glue 
to make this automatic/easy optional.

When updating a submodule, with an option set (or configured; which 
might even later become a default if people like it enough), it will try 
to figure out a reasonable branch for that commit, using git-name-rev, 
and check out the branch with that name.  It first uses the hint above 
as an argument to git-name-rev --refs=XX, and if that doesn't provide a 
reasonable answer then look for any branch.
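
The lookup described above could be sketched roughly as follows. The `hint` variable stands in for whatever value the superproject would store; no .gitmodules key for it is defined in this thread, so where it comes from is left as an assumption. The branch names are likewise invented.

```shell
#!/bin/sh
# Sketch: guess a branch name for a commit, trying a hinted ref first
# and falling back to any local branch.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/sub"
cd "$work/sub"
git symbolic-ref HEAD refs/heads/stable      # name the initial branch
echo one > file
git add file
git commit -q -m "first"

git checkout -q -b topic
echo two > file
git commit -q -am "second"
sha=$(git rev-parse HEAD)
git checkout -q --detach                     # what a submodule update leaves

hint='refs/heads/stable'                     # assumed to come from the superproject

# Try the hint first; --no-undefined makes name-rev fail instead of
# printing "undefined" when the commit is not reachable from that ref.
guess=$(git name-rev --name-only --no-undefined --refs="$hint" "$sha" 2>/dev/null) ||
    guess=$(git name-rev --name-only --refs='refs/heads/*' "$sha")
echo "guessed branch: $guess"
```

Here the hint does not contain the commit, so the fallback finds `topic`. For a commit that is not a branch tip, name-rev returns something like `branch~2`, which is one reason user confirmation is still worth having.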

I think this approach would not get in the way of people who don't want 

I think this is a separate argument against git-push, the default 
behaviour of which also causes me to tell people not to use the 
argument-less form of git-push until they understand how to use the 
two-argument form.

In the context of git-submodule, adding features to it to avoid this if 

It's not really the local branch name anyway, it's how the default push 
gets configured; perhaps it's worth distinguishing which part you are 

Yes - you've already seen the SoC plan for that, although I believe no 
students applied for that one, and if you think it's minor enough to do, 


I'd appreciate that feature - though I'm more interested in making sure 
that I don't push anything where the submodule commit is not available 
via the URL listed in .gitmodules.

Presuming such a check, would that check happen at push time, or do you 
check at a different time, such as when committing, or when adding the 
submodule to the index?

I think checking that referential integrity is something perhaps easier 
to bite off and get people to agree on.  I think it would solve the 
overall process problem, by forcing people to push the submodule before 
the commit of the superproject can succeed without forcing.
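
As a rough illustration of such a check, the sketch below asks whether the commit recorded for a submodule is contained in any of that submodule's remote-tracking branches. `is_published` is a hypothetical helper, the repositories are throwaway ones, and the answer is only as fresh as the submodule's last fetch or push.

```shell
#!/bin/sh
# Sketch: refuse to trust a gitlink unless the commit it names is
# reachable from some remote-tracking branch of the submodule.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q --bare "$work/sub-remote.git"
git init -q "$work/super"
cd "$work/super"
git init -q sub
git -C sub remote add origin "$work/sub-remote.git"

# is_published PATH (hypothetical): true if the commit bound in the
# superproject's HEAD is contained in a remote-tracking branch.
is_published() {
    sha=$(git ls-tree HEAD -- "$1" | awk '{print $3}')
    [ -n "$(git -C "$1" branch -r --contains "$sha")" ]
}

# First submodule commit: pushed, then bound in the superproject.
( cd sub && echo one > file && git add file && git commit -q -m "sub: first" )
git -C sub push -q origin HEAD:refs/heads/published
git -C sub fetch -q origin
git add sub 2>/dev/null
git commit -q -m "super: bind pushed commit"
if is_published sub; then first=published; else first=unpublished; fi

# Second submodule commit: bound in the superproject but never pushed.
( cd sub && echo two > file && git commit -q -am "sub: second" )
git add sub
git commit -q -m "super: bind unpushed commit"
if is_published sub; then second=published; else second=unpublished; fi
echo "first: $first, second: $second"
```

The same function could run from a hook at commit time (reading the index instead of HEAD) or from a wrapper around push; which point is best is exactly the open question above.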

Thoughts/comments?
Sam
--

From: Junio C Hamano
Date: Wednesday, April 2, 2008 - 2:32 pm

It's not just racy, but I think it's wrong to limit to _one_ branch in
each submodule.

A submodule is an independent project on its own.

Suppose the commit DAG in the submodule looked like this:

                 o---o
                /     \
     --o---o---o---o---o---X---o---Z
            \                 /
             o---o---o---o---o---o
                  \     /
                   o---o

and the superproject points at commit X. You may need to tweak the
submodule to make it work better with the change you are making to the
superproject.

You have two choices:

 (1) update to some "stable" branch head that is descendant of X first,
     and make sure it works with the superproject.  Then develop on top of
     it, and bind the tip of such development trail to the superproject:

                 o---o
                /     \
     --o---o---o---o---o---X---o---Z---o---o---Y (your changes are Z..Y)
            \                 /
             o---o---o---o---o---o
                  \     /
                   o---o

I think this is what you are suggesting.  But the superproject may not be
ready to use the submodule with the history from the lower side branch
merged in.  You would

 (2) fork off of X and develop; bind the tip of such development trail to
     the superproject.  IOW, you make the submodule DAG like this, and
     then "git add" commit Y in superproject.

                 o---o       o---o---Y (your changes)
                /     \     /
     --o---o---o---o---o---X---o---Z
            \                 /
             o---o---o---o---o---o
                  \     /
                   o---o
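
Choice (2) can be sketched with ordinary commands along these lines. The branch name `topic` and the file names are invented for the example, and the submodule is a nested repository added directly as a gitlink to keep the sketch self-contained.

```shell
#!/bin/sh
# Sketch: fork a topic branch off X in the submodule, develop on it,
# and bind its tip Y in the superproject.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/super"
cd "$work/super"
git init -q sub

# Submodule history: X, then Z on top of it.
( cd sub
  echo x > file && git add file && git commit -q -m "X"
  echo z > other && git add other && git commit -q -m "Z" )
X=$(git -C sub rev-parse HEAD~1)
Z=$(git -C sub rev-parse HEAD)

# The superproject currently points at X.
git -C sub checkout -q --detach "$X"
git add sub 2>/dev/null
git commit -q -m "super: bind sub at X"

# Fork off of X rather than building on Z.
git -C sub checkout -q -b topic "$X"
( cd sub && echo y > feature && git add feature && git commit -q -m "Y" )
Y=$(git -C sub rev-parse HEAD)

# "git add" on the submodule path records Y as the new gitlink.
git add sub
git commit -q -m "super: bind sub at Y"
bound=$(git ls-tree HEAD -- sub | awk '{print $3}')
echo "superproject now binds $bound"
```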

Sometimes forked branches need to be maintained until they prove stable
(and then your "tip" Y may be merged back to the tip of a public branch
Z).  So you would at least need to allow a set of topic branches in
submodules that corresponds to a single lineage of superproject history.

Then when both Z (with the changes from the lower side branch) and ...
From: Avery Pennarun
Date: Sunday, March 30, 2008 - 4:00 pm

What unnecessary work do you mean?  Certainly fetching only a
particular set of refs from a remote repository is possible, as that's
what 'git pull' does.

I agree that partitioning by project makes sense... but it also seems
to me that throwing extra objects into a repository that requires them
anyhow shouldn't have any major negative results.  After all, if you
can't build A without B, then downloading A might as well download the
objects from B too.  Which is not to say that B shouldn't *also* have

I'd like to read about the rationale behind this change.  Is there a

I agree, there's no reason to take away the existing functionality of
allowing split repos.  I was more suggesting a new functionality so

One option is to make a simple "git push origin" operation fail if
you're not on any branch; iirc, if you try that now, it just silently
*succeeds* without uploading anything at all, which is one reason I so
frequently screw it up.  Alternatively, is there a reason I can't
upload an object *without* giving it a branch name?  I guess that
would cause problems with garbage collection.

Now, the fail-on-branchless-push option still isn't really perfect,
because then I'll screw up like this:
- make change
- check in
- try to push: fails
- switch to branch
- realize I've lost my checkin(s) and have to go scrounge in the
reflog to try to find it

If we could disallow checkins to disconnected heads, then I'd get an
error at step 1, before I had a chance to screw up.  I think that
would be a usability improvement to git in general.  For example, if I
screw up a git-rebase and forget to abort, my HEAD ends up
disconnected and I occasionally check things in by accident and then
lose them (only to be saved by the reflog).  Perhaps an extra option
to git-commit that must be used if you want to check into a
non-branch?  Is that too harsh?
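
Something close to this can be approximated today with a pre-commit hook; the sketch below is one possible shape for it, not a proposed change to git-commit. The existing `git commit --no-verify` option skips the hook, which would play the part of the "extra option".

```shell
#!/bin/sh
# Sketch: a pre-commit hook that rejects commits made on a detached HEAD.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
echo one > file
git add file
git commit -q -m "first"

mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
# HEAD is a symbolic ref only while a branch is checked out.
if ! git symbolic-ref -q HEAD >/dev/null; then
    echo "refusing to commit on a detached HEAD; create a branch first" >&2
    exit 1
fi
EOF
chmod +x .git/hooks/pre-commit

# On a branch the hook lets the commit through.
echo two > file
if git commit -q -am "on a branch"; then
    on_branch=accepted
else
    on_branch=rejected
fi

# With HEAD detached the same commit is refused.
git checkout -q --detach
echo three > file
if git commit -q -am "detached" 2>/dev/null; then
    detached=accepted
else
    detached=rejected
fi
echo "on branch: $on_branch, detached: $detached"
```

Hooks are per-repository and are not copied by clone, so each submodule checkout would need the hook installed separately.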

Another option would be to simply *always* create/update a branch tag
when doing "git submodule update".  But then the question is ...
From: Sam Vilain
Date: Tuesday, April 1, 2008 - 4:10 pm

A full clone takes a few shortcuts, especially over dumb transports like
HTTP.  I think there might be shortcuts in the git-daemon code as well.
Forcing these to be partial might make these full fetches involve more

If you think it is simpler, then I'm sure that submodules users would
appreciate you sharing your ideas as a patch.  Sorry if I am starting to
sound like a parrot ;-).

Sam.
--

From: Johannes Sixt
Date: Sunday, March 30, 2008 - 11:22 pm

Would a "recurse" sub-command help your workflow?

http://thread.gmane.org/gmane.comp.version-control.git/69834

-- Hannes
--

From: Avery Pennarun
Date: Monday, March 31, 2008 - 2:24 pm

Well, typing "git submodule recurse push" or something would allow me
to lose the same data without typing quite as much, so strictly
speaking I guess it would be an improvement :)

I'd like it even more if "git push" actually somehow refused to push
at all if I forgot to push in the submodules.

Have fun,

Avery
--
