Re: [PATCH 6/6] Teach core object handling functions about gitlinks

To: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:12 am

Ok, the following is a series of six patches that implement some very
low-level plumbing for what I consider sane subproject support.

NOTE! I want to make it very clear that this series of patches does not
make subprojects "usable". They are very core plumbing that allows people
to think about the issues, and shows how the low-level code could (and in
my opinion, should) be done.

Some of the early patches are just cleanups and very basic stuff required
to actually get to the meat of it all. I actually think that they are all
in a state where they could be applied, if only because they don't
actually really *do* anything unless you start generating index file
entries (and trees) that have the "gitlink" entries in them.

I've actually done some testing with a repository that has these kinds of
subproject pointers in them, and no, it's really not fully fleshed out
yet, but yes, I can actually do a commit in one of the subprojects, and
when I do that, the "raw" diff literally looks like this:

[torvalds@woody superproject]$ git diff --raw
:160000 160000 5813084832d3c680a3436b0253639c94ed55445d 0000000... M sub-B

and I can do a "git commit -a" in the superproject to commit the new
state.
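
For readers unfamiliar with the raw format: each line carries the old and new modes, the old and new object names, a status letter and the path, and 160000 is the mode that marks a gitlink entry. A minimal sketch in C of picking such a line apart follows. It assumes paths contain no whitespace, and `parse_raw_diff_line`, `is_gitlink_change` and `struct raw_diff_entry` are hypothetical names made up for illustration, not functions from git itself.

```c
#include <stdio.h>

/* Assumed gitlink mode, taken from the 160000 shown in the raw diff. */
#define GITLINK_MODE 0160000

/* Hypothetical container for one line of "git diff --raw" output. */
struct raw_diff_entry {
	unsigned int old_mode;
	unsigned int new_mode;
	char old_sha1[41];	/* may hold an abbreviated name such as "0000000..." */
	char new_sha1[41];
	char status;		/* 'M', 'A', 'D', ... */
	char path[256];		/* sketch only: paths with whitespace are not handled */
};

/* Returns 0 on success, -1 if the line does not look like a raw diff line. */
int parse_raw_diff_line(const char *line, struct raw_diff_entry *e)
{
	int n = sscanf(line, ":%o %o %40s %40s %c %255s",
		       &e->old_mode, &e->new_mode,
		       e->old_sha1, e->new_sha1, &e->status, e->path);
	return n == 6 ? 0 : -1;
}

/* True if either side of the change is a gitlink (subproject pointer). */
int is_gitlink_change(const struct raw_diff_entry *e)
{
	return e->old_mode == GITLINK_MODE || e->new_mode == GITLINK_MODE;
}
```

Fed the `sub-B` line above, the parser would report status `M` with both modes equal to 0160000, which is how a tool could tell a subproject pointer update apart from an ordinary file change.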

NOTE! This series of six patches does not actually contain everything you
need to do that - in particular, this series will not actually connect up
the magic to make "git add" (and thus "git commit") actually create the
gitlink entries for subprojects. That's another (quite small) patch, but I
haven't cleaned it up enough to be submittable yet.

I split my original larger patch up into more manageable pieces, so that
you should be able to actually just read the patches themselves and get a
reasonable idea about what it's doing, even *without* actually testing it.
And obviously, "make test" still completes happily, if only because none
of the tests actually trigger any of the new code.

The patches are all fairly small, and the two first ones are really just
totally...

To: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:46 am

Here is, for your enjoyment, the last patch I used to actually test this
all. I do *not* submit it as a patch for actual inclusion - the other
patches in the series are, I think, ready to actually be merged. This one
is not.

It's broken for a few reasons:

- it allows you to do "git add subproject" to add the subproject to the
index (and then use "git commit" to commit it), but even something as
simple as "git commit -a" doesn't work right, because the sequence that
"git commit -a" uses to update the index doesn't work with the current
state of the plumbing (ie the

git-diff-files --name-only -z |
git-update-index --remove -z --stdin

thing doesn't work right).

- even for "git add", the logic isn't really right. It should take the
old index state into account to decide if it wants to add it as a
subproject.

so this patch really isn't very good, but it allows people who are
interested to perhaps actually test something. For example, my test repo
was actually created with this:

[torvalds@woody superproject]$ git log --raw
commit 649ad968bdd79cb3b0f50feb819b7e9b134d3a1a
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date: Mon Apr 9 21:36:53 2007 -0700

This commits the modification to sub-project B

:160000 160000 5813084832d3c680a3436b0253639c94ed55445d 17d246a35f27a46762328281eb6e9d4558f91e9d M sub-B

commit f3c55ffcc000a8c0fecc6801e8909d084e3d419e
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date: Mon Apr 9 16:12:29 2007 -0700

Superproject with two subprojects

:000000 160000 0000000... c0daf4c85d48879ab450a6a887bbb241eb0de00a A sub-A
:000000 160000 0000000... 5813084832d3c680a3436b0253639c94ed55445d A sub-B

commit 45eb14edb43b10e3d3ac7a495a1ec861e85dc36f
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date: Mon Apr 9 15:36:24 2007 -0700

Add top-level Makefile for super-project

:000000 100644 0000000......

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 9:04 am

The other thing which will be missed a lot (I miss it that much)
is a subproject-recursive git-commit and git-status.
It is very possible that the default should be different for
the git-commit and git-status: git-commit is likely to have it
off whereas git-status will very much depend on how fast
the usual response is (or wished for). An integrator on a very fast
machine may like it on for both, a subproject developer can have
it off for both (to avoid accidental commits and generally being
not interested in anything besides his code), an occasional person
can have the status defaulting to on and commit to off - to avoid
accidental commits in subprojects which are just tracked.

A separate config option and a command-line switch, probably.
-

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 4:32 am

hoi :)

git-status should really point out if a subproject has any changes,
as it does for files. Only that a submodule may have more types of
possible changes: has new commits which are not yet in the supermodule
index, has a dirty index of its own, dirty working directory.

But for commit it really does not make any sense. The commit in the
submodule is totally independent of the commit in the supermodule.
You'd want the submodule commit message to not refer to any
supermodule stuff (as you likely want to reuse the submodule in other
supermodules), while the supermodule commit is much more high-level and
only records that the submodule got changed.

When viewed from the supermodule, a submodule is just part of its tree,
just as normal files. So a submodule commit is conceptually similar to
changing a file, and you don't change files while you commit, either ;-).

--
Martin Waitz

To: Martin Waitz <tali@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 4:42 am

Only if I want it to. HEAD change check (which is cheap enough

Right. Perhaps not a commit in submodule but a recursive check
for working directory changes in submodules. So that you can
make sure that you don't make a superproject commit which cannot
be resolved to what you had in all the working directories:

git commit -a --check-clean-subprojects
-

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 4:57 am

hoi :)

Yes, that's the equivalent of checking normal files.

For -a such a check may even make sense unconditionally.
And without -a I don't see any value in such a check.
So we can just add that check to -a if we see that dirty submodules
are a problem for users.

--
Martin Waitz

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 11:13 am

Note that I was definitely planning on adding them too, but they are at a
higher level.

So the long-term plan is/was to add a flag to "git diff" (and "git
ls-tree" etc) to say "recurse into subprojects".

You could perhaps even make that flag the default with some .git/config
option, if your superproject is small enough.

But this series of 6 (and the seventh ugly hack) is literally meant for
just the really core object-handling stuff, and even there it's not really
complete.

For example, you cannot even clone a superproject yet, simply because
git-upload-pack doesn't know that it's not supposed to follow the gitlink
things etc. So there's a lot of details left even for the really *core*
stuff, but I wanted to post the series of six patches because those six
patches are actually enough to reach the point where you can start looking
at individual problems (like "git upload-pack") and fix them
incrementally.

So I'd like this to be merged somewhere, not because "it works" or "it's
complete", but because it's in a shape where I think a lot of people can
start fixing small details.

For example, with just two smallish updates:
- teach "git upload-pack" not to try to follow gitlinks
- teach "git read-tree" to check out a git-link as just an empty
subdirectory
you should already be pretty close to being able to clone a superproject.
You'd still have to clone the subprojects one-by-one manually, and that
would be more of a porcelain'ish issue to teach git clone to fetch
submodules too (with some ".gitmodules" file that contains the rules for
that!)
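
As a rough illustration of the first bullet, a tree walker that decides which objects to follow could simply skip gitlink entries, presumably because the commit a gitlink names lives in the subproject's own repository rather than in the superproject's object store. The sketch below is not git's upload-pack code: `struct tree_entry_sketch` and `entries_to_follow` are hypothetical, and the mode constants are written out locally with 0160000 assumed from the raw diffs in this thread.

```c
#include <stddef.h>

/* Locally defined mode bits; 0160000 is the assumed gitlink mode. */
#define SK_IFMT     0170000
#define SK_IFDIRLNK 0160000

/* Hypothetical, simplified view of one tree entry. */
struct tree_entry_sketch {
	unsigned int mode;
	const char *sha1;
	const char *path;
};

/*
 * Stores in `out` pointers to the entries whose objects should be
 * followed (blobs and subtrees), leaving out gitlinks.
 * Returns the number of entries stored, at most `out_cap`.
 */
size_t entries_to_follow(const struct tree_entry_sketch *entries, size_t nr,
			 const struct tree_entry_sketch **out, size_t out_cap)
{
	size_t i, n = 0;

	for (i = 0; i < nr && n < out_cap; i++) {
		if ((entries[i].mode & SK_IFMT) == SK_IFDIRLNK)
			continue;	/* gitlink: do not follow into the subproject */
		out[n++] = &entries[i];
	}
	return n;
}
```

A checkout along the lines of the second bullet could use the same mode test to create an empty directory for the skipped entries instead of trying to read an object.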

But no, I didn't do any of that. I literally did just the "tree object
format change" to support the *notion* of gitlinks - not all the pieces to
then actually *implement* the notion are done by a long shot.

I think everybody agrees that we need some kind of subproject support, and
the KDE repository certainly shows that subprojects need to be truly
independent (because if they aren't, you end up ...

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 11:48 am

It is already "merged somewhere": as soon as the patches landed
on vger, it became impossible to lose (or even destroy) them.

which also should fix switching between the branches with subprojects.
-

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:07 pm

Well, unless it hits something like Junio's 'pu' (or 'next') branch, or
somebody (like you?) ends up maintaining a repo with this, it's just
unnecessarily hard to have lots of people working together on it..

I'm obviously interested in working on it, but at the same time, I don't
expect to be a primary *user* of it, so I'm hoping others will come in and
start looking at it.

It looks promising that you're getting involved, but I suspect you may be
a bit too optimistic when you say "just too much sought after". We've been
*talking* about subprojects for a long long time, and we've had other

Yes. It would require either git-read-tree or the git-checkout script
around it knowing to then also check out the subproject branches.

It's actually not *entirely* obvious what you should do when you switch
branches (or even just do a "git reset --hard") in the superproject. The
branches in the subprojects are likely to be totally different from the
superproject, so as far as I can see, you end up having a few choices when
you reset a subproject:

- either basically create a "disconnected HEAD" in the subproject(s) when
you switch them around as a consequence of resetting/switching the
branch in the superproject.

- or you'd stay on the same branch in the subproject, and just reset that
branch..

- or you describe the branch name in the ".gitmodules" file in the
superproject, and use whatever branch in the submodule that is
described in the supermodule that you reset/check-out.

- or possibly other policies.

So there is bound to be various "policy" issues like this worth sorting
out. I don't think they matter that deeply.

I would _personally_ tend to like the notion of using ".gitmodules" in the
supermodule to describe things like this, exactly because it's a policy
decision - not something that git itself should really decide about, but
that the supermodule maintainers can just decide to agree on.

But I haven't really even thought about ...

To: Linus Torvalds <torvalds@...>
Cc: Alex Riesen <raa.lkml@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 3:32 pm

Well, I was planning to apply this directly on 'master' after
giving them another pass.

-

To: Junio C Hamano <junkio@...>
Cc: Alex Riesen <raa.lkml@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 4:11 pm

Goodie. I gave them another pass myself, and noticed a small leak and a
stupid copy-paste problem, fixed thus..

Linus

---
diff --git a/read-cache.c b/read-cache.c
index 8fe94cd..f458f50 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -279,7 +279,7 @@ int base_name_compare(const char *name1, int len1, int mode1,
 	c2 = name2[len];
 	if (!c1 && (S_ISDIR(mode1) || S_ISDIRLNK(mode1)))
 		c1 = '/';
-	if (!c2 && (S_ISDIR(mode2) || S_ISDIRLNK(mode1)))
+	if (!c2 && (S_ISDIR(mode2) || S_ISDIRLNK(mode2)))
 		c2 = '/';
 	return (c1 < c2) ? -1 : (c1 > c2) ? 1 : 0;
 }
diff --git a/refs.c b/refs.c
index 229da74..11a67a8 100644
--- a/refs.c
+++ b/refs.c
@@ -229,6 +229,7 @@ static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refna
 	if (!f)
 		return -1;
 	read_packed_refs(f, &refs);
+	fclose(f);
 	ref = refs.packed;
 	retval = -1;
 	while (ref) {
-
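
To see why the mode1/mode2 mix-up in the read-cache.c hunk matters, it may help to look at the comparison in isolation. The following is a minimal sketch modelled on that hunk, not a copy of git's source: the function is renamed `base_name_compare_sketch`, and the mode macros are defined locally (0040000 for directories, 0160000 assumed for gitlinks) so that the block does not depend on git's headers. Names are expected to be NUL-terminated.

```c
#include <string.h>

/* Locally defined mode bits so the sketch stands alone. */
#define SK_IFMT     0170000
#define SK_IFDIR    0040000
#define SK_IFDIRLNK 0160000	/* assumed gitlink mode (160000 in the raw diffs) */
#define SK_ISDIR(m)    (((m) & SK_IFMT) == SK_IFDIR)
#define SK_ISDIRLNK(m) (((m) & SK_IFMT) == SK_IFDIRLNK)

/*
 * Compare two entry names the way trees are sorted: a directory or
 * gitlink behaves as if its name ended in '/'.
 */
int base_name_compare_sketch(const char *name1, int len1, int mode1,
			     const char *name2, int len2, int mode2)
{
	unsigned char c1, c2;
	int len = len1 < len2 ? len1 : len2;
	int cmp = memcmp(name1, name2, len);

	if (cmp)
		return cmp;
	/* Common prefix is equal: compare the first byte after it. */
	c1 = name1[len];
	c2 = name2[len];
	if (!c1 && (SK_ISDIR(mode1) || SK_ISDIRLNK(mode1)))
		c1 = '/';
	if (!c2 && (SK_ISDIR(mode2) || SK_ISDIRLNK(mode2)))
		c2 = '/';
	return (c1 < c2) ? -1 : (c1 > c2) ? 1 : 0;
}
```

With a gitlink named `sub` and a regular file named `sub.c`, the gitlink sorts after the file because '/' compares greater than '.'. With the pre-fix test on `mode1`, calling the function with the arguments swapped would not append the '/' to the gitlink, so both argument orders would claim their first argument sorts later, which is not a consistent ordering.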

To: Linus Torvalds <torvalds@...>
Cc: Alex Riesen <raa.lkml@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 4:52 pm

By the way,...

People occasionally ask "how would I make a small fix to a
commit that is buried in the history", so let me take a moment
to give them a recipe.

Let's say while reviewing the code after applying all of the
6-series, you noticed the above thinko. First find out which
commit caused it:

$ git checkout lt/gitlink
$ git blame -L229,+7 master.. -- refs.c
b60108a1 (Linus Torvalds 2007-04-09 21:14:26 -0700 229) if (!f)
b60108a1 (Linus Torvalds 2007-04-09 21:14:26 -0700 230) re..
b60108a1 (Linus Torvalds 2007-04-09 21:14:26 -0700 231) read_packe..
b60108a1 (Linus Torvalds 2007-04-09 21:14:26 -0700 232) ref = refs..
b60108a1 (Linus Torvalds 2007-04-09 21:14:26 -0700 233) retval = -..
b60108a1 (Linus Torvalds 2007-04-09 21:14:26 -0700 234) while (ref..
b60108a1 (Linus Torvalds 2007-04-09 21:14:26 -0700 235) if..

The commit to fix is b60108a1 (this is what I have in my private
repo, and I'll be rebuilding the series with this example, so
you will never see this commit object name in the end result
I'll be pushing out). So I detach the HEAD at that commit and
make a fix:

$ git checkout b60108a1
$ edit refs.c
$ git diff; # just to make sure
$ git commit -a --amend

At this point, the detached HEAD and the original branch look
like this:

$ git show-branch lt/gitlink HEAD
! [lt/gitlink] Teach core object handling functions about gitlinks
* [HEAD] Add 'resolve_gitlink_ref()' helper function
--
* [HEAD] Add 'resolve_gitlink_ref()' helper function
+ [lt/gitlink] Teach core object handling functions about gitlinks
+ [lt/gitlink^] Teach "fsck" not to follow subproject links
+ [lt/gitlink~2] Add "S_IFDIRLNK" file mode infrastructure for git links
+ [lt/gitlink~3] Add 'resolve_gitlink_ref()' helper function
+* [HEAD^] Avoid overflowing name buffer in deep directory structures

We fixed lt/gitlink~3 and the fixed-up commit is at HEAD. We
want to rebase the rest of lt/gitlink on top of HEAD, like this:

$ git rebase HEAD lt/gitlink

This will tak...

To: Junio C Hamano <junkio@...>
Cc: Alex Riesen <raa.lkml@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 5:02 pm

That recipe looks ummm complicated...
What I usually do is:

git format-patch HEAD~4..HEAD
git reset --hard HEAD~4
patch -p1 < 0004*
...edit...
delete diff from 0004*
git diff >> 0004*
git reset --hard
git am 000*

Maybe this is as complicated as your example but this
is very simple to deal with.
And I do not destroy history or anything.

But that said I do not use topic branches but simply
clone my local repository as needed.
And I always deal with a linear history.

[I post this mostly to check if this is insane
and I need to understand the way you propose to do stuff]

Sam
-

To: Sam Ravnborg <sam@...>
Cc: Alex Riesen <raa.lkml@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 5:27 pm

It's really the same. You keep the 000* files, I keep them in the
original branch and have "git rebase" take care of the details.

-

To: Junio C Hamano <junkio@...>
Cc: Alex Riesen <raa.lkml@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 5:03 pm

This is definitely good Documentation/howto/ material.

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <junkio@...>, Alex Riesen <raa.lkml@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Sunday, April 15, 2007 - 7:21 pm

There's actually something similar already in "modifying a single
commit" in the "user manual":

http://www.kernel.org/pub/software/scm/git/docs/user-manual.html#id276844

But it uses a throw-away branch instead of the detached head, and uses
rebase --onto instead of rebasing and then --skip'ing.

--b.
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:43 pm

The people who need the feature are still using other VCS.
Some do not even know about git, the others are more interested
in their own projects than in hacking on git (like KDE or Ubuntu
people). And then there are commercial projects with third-party
libraries, components or data. The other VCS' provide the feature,
even if they do it wrong and badly (I never could go back in time in my
day-work project, always asked myself what was the point of using
Perforce at all).
So, I suspect it is the people who are unable or unwilling
to contribute to git (to anything, really) who need the feature most.
-

To: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:20 am

This teaches the really fundamental core SHA1 object handling routines
about gitlinks. We can compare trees with gitlinks in them (although we
can not actually generate patches for them yet - just raw git diffs),
and they show up as commits in "git ls-tree".

We also know to compare gitlinks as if they were directories (ie the
normal "sort as trees" rules apply).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Ok, that's it for now.

NOTE NOTE NOTE! I'd like to note once more that this doesn't actually get
you working subproject support. Not only do I need to connect up a few
more low-level helper functions (things like "git diff" don't know how to
generate even rudimentary "subproject X changed" patches, nor can you
actually yet *add* subprojects), but quite apart from that low-level
stuff, anything more high-level (like "git fetch" and friends) will need
to know about subprojects.

In general, think of this like the early git plumbing: it's the early
"content-addressable filesystem" part. The actual SCM parts going on top
of it are yet to be done.

I'm hoping/expecting that there are more people who have the ability and
the interest to work on the higher-level interfaces once the core plumbing
support is there. There's still some plumbing to be done, but after that,
maybe more people (and maybe the SoC people) can start filling out the
higher-level details..

Comments on the patches/approach so far?

builtin-ls-tree.c | 20 +++++++++++++++++++-
cache-tree.c | 2 +-
read-cache.c | 35 +++++++++++++++++++++++++++++++----
sha1_file.c | 3 +++
4 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/builtin-ls-tree.c b/builtin-ls-tree.c
index 6472610..1cb4dca 100644
--- a/builtin-ls-tree.c
+++ b/builtin-ls-tree.c
@@ -6,6 +6,7 @@
#include "cache.h"
#include "blob.h"
#include "tree.h"
+#include "commit.h"
#include "quote.h"
#include "builtin.h"

@@ -59,7 +60,24 @@ static int show_tree(con...

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 4:06 am

hoi :)

thanks Linus for your nice implementation. Your core code is so much
nicer than my hacked-up prototype :-).

I only had a little time to actually have a look at it but the core is
very similar to my approach and I'll try to rebase some of my code on
top of yours in the following days.

The only thing I disagree with you on is using HEAD of the submodule:

Always using HEAD of the submodule makes branches in the submodule
useless.

Whenever you do a checkout in the supermodule you also have to update
the submodule and this update has to change the same thing which is read
above.
Updating the branch which HEAD points to is dangerous. You could
overwrite some unrelated branch just because the user forgot to switch
back to his supermodule-tracking-branch. The user would always have to
make sure that all the submodules are in the correct state for an update
of the supermodule.
Updating HEAD directly is possible now and may make some sense, but you
still get problems when you want to switch to some temporary branch in
the submodule. You have no chance to get back to the original supermodule
version and now your temporary submodule branch gets shown as the new
submodule version which should be part of the supermodule.
The submodule version which is stored in the supermodule's tree is kind
of a hidden/remote reference/branch. When working on a remote branch
we first create a local working branch and then sync it with the remote
one. I think that it makes sense to use the same model for submodules:
have one local branch in the submodule which is used for all work that
is done in the supermodule context.

So my advice is:
Always read and write one dedicated branch (hardcoded "master" or
configurable) when the supermodule wants to access a submodule.

Then you have two types of branches:
You can branch the supermodule and have your own branch of the entire
project with all submodules. Use this if you want to commit your
work on the submodule into the supermodule.
You ca...

To: Martin Waitz <tali@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 11:16 am

Well, I don't actually see much choice. HEAD is just shorthand for

No.

Branches in submodules actually in many ways are *more* important than
branches in supermodules - it's just that with the CVS mentality, you
would never actually see that, because CVS obviously doesn't really
support such a notion.

So I'd argue that branches in submodules give you:

- you can develop the submodule *independently* of the supermodule, but
still be able to easily merge back and forth.

Quite often, the submodule would be developed entirely _outside_ of the
supermodule, and the "branch" that gets the most development would thus
actually be the "vendor branch", entirely outside the supermodule. Call
that the "main" branch or whatever, inside the supermodule it would
often be something like the remote "remotes/origin/master" branch.

So inside the supermodule, the HEAD would generally point to something
that is *not* necessarily the "main development" branch, because the
supermodule maintainer would quite logically and often have his own
modifications to the original project on that branch. It might be a
detached branch, or just a local branch inside the submodule.

- branches inside submodules are *also* very useful even inside the
supermodule, ie they again allow topic work to be fetched into the
submodule *without* having to actually be part of the supermodule,
or as a way to track a certain experimental branch of the supermodule.

I suspect that most supermodule usage is as an "integrator" branch,
which means that the supermodule tends to follow the "main
development", and the whole point of the supermodule is largely to have
a collection of "stable things that work together".

In contrast, branches within submodules are useful for doing all the
development that is *not* yet ready to be committed to the supermodule,
exactly because it's not yet been tested in the full "make World" kind

I suspect...

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 7:54 pm

hoi :)

I fully agree with you about the importance of submodule branches.
In fact, I want to make them even more important and useable!

I fully agree.

If you use a detached HEAD then you can no longer switch back to it
once you used some other (independent) branch (for testing or whatever).
This is my main argument: If you just update some 'special'
refs/heads/from-supermodule (or whatever, maybe get it from
.gitmodules/config) you can still switch between branches, making them
more useful IMHO.

If we create some other way to easily get to the commit referenced by
the index of the supermodule then a detached HEAD is ok for me, too.
But why create two things (this not-yet-existing way to get the
supermodule index entry, plus submodule's HEAD) for the same thing?
Why not simply create a new refs/heads/whatever?

Fully agree.

Please don't confuse my "I always want to use one dedicated branch" with
"I always want to use one special branch from the submodule project".
This refs/heads/whatever I am talking about is _purely_ for ease of
use of the submodule inside the supermodule. It is in no way linked
to the branchnames that are used by the submodule project.

So you now have this nice "my-integration" branch lying next to other
independent (not-supermodule-related) branches.
If you want to _switch_ to one of these unrelated branches you obviously
have to change HEAD, and suddenly your unrelated branches are
considered to be part of the supermodule (ok, not yet part of its
index of course, but now all supermodule operations would work on
this unrelated branch).

I want to preserve these unrelated branches and see them as a strong
feature. Branches in submodules should be independent from the
supermodule _because_ the supermodule has no notion of which branch

Only that you lose your nice detached HEAD view once you start using

In terms of flexibility it is important what you can do with the
submodule. Being able to use branches just like in a normal
reposi...

To: Martin Waitz <tali@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Thursday, April 12, 2007 - 11:12 am

The supermodule checkout could create a .git/SUPER_HEAD for this.
OK, that is a special kind of reference.

Or introduce "git --super ..." which works with the superproject.
From a submodule directory, a "git --super checkout ." could reset the
submodule checkout.

Josef
-

To: Martin Waitz <tali@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 9:57 pm

Why can't we extend checkout with an option to look for an
enclosing git project, find the gitlink in the index, and check out
that commit? That allows you to return to the original state without
needing to bother with new special branches.

And instead of recording the path in a .gitmodules file, why not a
list of git directories we search for the commit? Allows moving of
subprojects without suddenly breaking configuration files. When we
find the appropriate git dir, we can use a .gitlink file or symlinks
to attach the directory to its repository.

I dislike moving git in the direction of enforcing more policy
instead of less, and of making it less capable of handling content
movement instead of more.

~~ Brian
-

To: Linus Torvalds <torvalds@...>
Cc: Martin Waitz <tali@...>, Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 6:49 pm

To discuss this detail, what about keeping refs, such as
refs/submodules/branch/path/* (or some other convention) which are
updated on commit? Then you can also easily clone just the submodule.

Sam.
-

To: <git@...>
Cc: Martin Waitz <tali@...>, Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>
Date: Wednesday, April 11, 2007 - 5:47 am

I know we've had this discussion before, but I'm going to bring it up again -
mainly because Linus's implementation exactly matches what I envisaged when
we originally spoke of this. I think in your "Updating the branch which HEAD
points to is dangerous" section, the main thing you're not taking into
account is that git can make detached checkouts. Updating HEAD is not
dangerous - updating refs is; and I don't think anyone is proposing that a
submodule ref should ever be updated by a supermodule.

I think you're also too strongly focussed on the idea that the supermodule
tracks submodule branches - it cannot: branches are not part of "the"
repository, they point at "a" repository. References are outside the
repository pointing in, and hence the supermodule cannot refer to them at its
core.

Now, if you check out a revision in the supermodule, that's going to look up
the submodule revision stored in the DIRLINK tree entry which will recurse
into the submodule and checkout that revision - almost certainly as a
detached HEAD. There are three possibilities then:
- The submodule revision is in the past and no submodule branch points at it
- The submodule revision is current and a submodule branch points at it
- The submodule revision is current and multiple submodule branches point at
it
The supermodule checkout will have to make a decision whether to update the
submodule HEAD (in one case it's obvious: a revision in the past has to be
detached HEAD as there is no suitable branch). It's also possible that the
single submodule branch case is easy - undetach HEAD; however I don't think
that is universally correct.
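
The three possibilities can be told apart mechanically by counting how many submodule branch tips equal the commit recorded in the supermodule. The sketch below does only that classification and leaves the policy decision (detach, or move onto a branch) to the caller. Everything in it is hypothetical: git has no `classify_submodule_commit`, and `struct branch_tip` is a stand-in for however branch tips would actually be enumerated.

```c
#include <string.h>

/* Hypothetical result type mirroring the three cases listed above. */
enum submodule_head_case {
	HEAD_NO_BRANCH,		/* no branch points at the commit: detached HEAD */
	HEAD_ONE_BRANCH,	/* exactly one branch points at it */
	HEAD_MANY_BRANCHES	/* several branches point at it: ambiguous */
};

/* Hypothetical description of one branch in the submodule. */
struct branch_tip {
	const char *name;	/* e.g. "master" */
	const char *sha1;	/* commit name the branch currently points at */
};

/*
 * Classify `wanted_sha1` against the submodule's branch tips.
 * `*only_match` is set to the branch name in the single-branch case
 * and to NULL otherwise.
 */
enum submodule_head_case classify_submodule_commit(const char *wanted_sha1,
						   const struct branch_tip *tips,
						   int nr_tips,
						   const char **only_match)
{
	int i, matches = 0;

	*only_match = NULL;
	for (i = 0; i < nr_tips; i++) {
		if (strcmp(tips[i].sha1, wanted_sha1))
			continue;
		if (!matches++)
			*only_match = tips[i].name;
	}
	if (matches == 0)
		return HEAD_NO_BRANCH;
	if (matches == 1)
		return HEAD_ONE_BRANCH;
	*only_match = NULL;	/* more than one candidate: no single answer */
	return HEAD_MANY_BRANCHES;
}
```

Even in the single-branch case the caller may still prefer a detached HEAD, since, as noted above, undetaching is probably not universally correct.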

I know you're very much in favour of making branches in the submodule
correspond to branches in the supermodule, but I just don't see a way of
making it work - the supermodule cannot know about submodule branches,
branches are not part of the repository, they just point at the repository.
My branches could be different from your branches.

It...

To: Andy Parkins <andyparkins@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, <git@...>
Date: Wednesday, April 11, 2007 - 7:31 am

hoi :)

Then we already agree on the most important part.
My argument is mostly against updating the ref which is behind HEAD, not
HEAD per se. And I haven't thought about using detached HEADs until I

No, that may be a misunderstanding because my very first prototype
really did track branches. In the meantime I changed my mind, my
current prototypes all track submodule commits directly.
But in doing so we create a branch of its own: remember, a branch in
git is just a moving reference into the history. Such a reference
can be stored in .git/refs/heads or it can be stored in the index/tree of
the supermodule. The difference is not really big.

I don't like to guess which branches to update.

That would not work, you are right.

Again, doing things conditionally here just adds to confusion.
Just have one dedicated branch and be done with it.

--
Martin Waitz

To: Martin Waitz <tali@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 4:29 am

In this case it does not correspond to the working tree anymore.
HEAD is the "closest" to working tree of submodule.
-

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 4:36 am

hoi :)

yes.

This has been discussed at length already.
Please have a look at the archives.

Your working tree now contains a complete git repository which has
features which are not available for normal files. Notably, you
have the possibility to create branches in the submodule.
If you insist on using HEAD you throw away those submodule capabilities.

--
Martin Waitz

To: Martin Waitz <tali@...>
Cc: Alex Riesen <raa.lkml@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 5:15 am

Why? If you are working in the parent module (e.g integration)
and notice breakage due to a bug in a submodule, it is very
plausible that you would want to cd into the directory you have
the submodule checked out, which has its own .git/ as its
repository, and perform a fix-up there, with the goal of coming
up with a commit usable by the parent project pointed at by the
HEAD of the submodule repository. And while working toward that
goal, you will use branches, rebase, rewind or use StGIT there
in that submodule repository. It does not forbid you from using
any of these things -- as long as you end up with a good commit
at HEAD that the supermodule can use.

Once you come up with a suitable commit sitting at HEAD of the
submodule repository, you cd up to the parent module. Top-level
git-diff would notice that the commit recorded at the submodule
path has been updated (because you now have a good commit at
HEAD of the submodule repository, while earlier the one in your
index was a dud).

So it is not clear to me what your argument about throwing away
capabilities is.

-

To: Junio C Hamano <junkio@...>
Cc: Alex Riesen <raa.lkml@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 6:03 am

hoi :)

that's perfectly fine.
I only require one more thing: make sure that your commit is on
one dedicated branch (simply by merging your working/rebased/whatever
branch into the dedicated one) and not on some random one.

Again: for your above example this is not necessary and using HEAD
would indeed be perfectly fine.

But you also have to update the submodule when you do a checkout in
the supermodule. So what do you update? Updating 'HEAD' is not
very concrete, please have a look at my initial mail to Linus.

What is stored in the supermodule? It stores a reference to a specific
point in the history of the submodule. As such I am convinced that
the right counterpart inside the submodule is a refs/heads/whatever,
and not the branch selector HEAD.
You can have other branches next to the one which is tracked by the
supermodule. If you always update HEAD you don't have a clear

If the supermodule just updates some random submodule branch I happen to
use at the time of a supermodule pull then submodule branches are
of much lower value.
Suddenly you have to make sure for yourself that the correct branch
gets updated.
For me, different branches should be independent and I want git to
always update the correct one.

-- 
Martin Waitz

To: Martin Waitz <tali@...>
Cc: Alex Riesen <raa.lkml@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 4:01 pm

Because 'submodule' is a project on its own, it can make
progress while the parent project is still using the stable
commit. Think of this:

- Your application uses product of another project as a
library (e.g. you are doing video application and embedding
ffmpeg).

- Your 'master' commit records a commit in the library
subproject. Maybe library subproject declared stable 1.0 and
that is what you used to integrate.

- But being an independent project on its own, the library
project can make progress, outside the context of this
aggregated work (i.e. your application). Next time you do:

$ cd ffmpeg ; git fetch

there may not be any branch that points at the exact "stable 1.0"
commit.

When you do a "checkout -f --recurse-into-subprojects" from the
toplevel, I suspect that you would need to detach HEAD in the
subproject repository grafted in your application tree to move
it to the exact commit the toplevel project (i.e. your
application) wants, and match the working tree to that commit.
The toplevel simply should _not_ have to care what branch that
commit comes from.

-

To: Junio C Hamano <junkio@...>
Cc: Alex Riesen <raa.lkml@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 6:19 pm

hoi :)

yes.

But why does everybody want to detach the submodule HEAD, instead
of creating one 'special' branch which holds the commit which is
used by the supermodule?

If you then want to switch to another submodule branch you lose
the reference that comes from the supermodule.

I want to create the extra branch exactly _because_ there is
independent work going on in the submodule (or the project it is
based on). As you can switch between detached HEAD and an
independent branch you can also switch between the 'supermodule branch'
and independent branches -- only that you can easily switch back
if you have a branch of your own.

BTW: I also think that your --recurse-into-subprojects should
be implied.
If you check out one index entry, you should be able to read it
back afterwards. That is a nice property everyone expects from
normal files and we should try to keep that for submodules.
When checkout_entry wants to touch a submodule we can simply rewrite
the 'supermodule branch' in the submodule. If HEAD happens to point
to it we also read-tree the submodule.
This is easy to understand and implement and I have some good experience
with this model.

-- 
Martin Waitz

To: Martin Waitz <tali@...>
Cc: Junio C Hamano <junkio@...>, Alex Riesen <raa.lkml@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 6:36 pm

I don't think "everybody" wants it.

But the point is, *regardless* of whether you want a "detached HEAD" or
you want a "'special' branch", you should always use HEAD to look up the
commit, and using HEAD *allows* both (ie just make HEAD a symref to the
'special' branch if you want that behaviour).

And if you *do* use a special branch, HEAD *must* match that special
branch anyway, since when you commit in the supermodule, the only

And that is entirely appropriate.

But that still means that HEAD must point to that branch (when in the
submodule), since that branch must be the one that is checked out. If it
isn't the branch that is checked out, normal operations like "git diff"
etc wouldn't make sense from the supermodule.

And that is why *regardless* of whether you use a special branch or not,
HEAD is the right thing to look up.

Linus
-
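
The point made above, that looking up HEAD covers both the detached case and the "special branch" case, can be illustrated with a minimal sketch. It assumes the HEAD file holds either a symref line of the form "ref: <refname>" or a 40-digit hex commit name; the names head_kind, is_sha1_hex and classify_head are hypothetical and are not taken from the git sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch; not a git data structure. */
enum head_kind { HEAD_INVALID, HEAD_SYMREF, HEAD_DETACHED };

/* Return 1 if s starts with 40 lowercase hex digits. */
static int is_sha1_hex(const char *s)
{
	size_t i;

	for (i = 0; i < 40; i++) {
		char c = s[i];
		if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')))
			return 0;
	}
	return 1;
}

/*
 * Classify the contents of a HEAD file.
 * HEAD_SYMREF:   *target points at the ref name (e.g. the 'special' branch).
 * HEAD_DETACHED: *target points at the commit name itself.
 * Either way the caller starts from HEAD and ends up at one commit.
 */
static enum head_kind classify_head(const char *buf, const char **target)
{
	assert(buf != NULL && target != NULL);

	if (!strncmp(buf, "ref: ", 5)) {
		*target = buf + 5;
		return HEAD_SYMREF;
	}
	if (is_sha1_hex(buf)) {
		*target = buf;
		return HEAD_DETACHED;
	}
	*target = NULL;
	return HEAD_INVALID;
}
```

With this shape the supermodule never needs to know which policy the submodule uses: a symref HEAD is resolved through the named branch, a detached HEAD is used as is.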

To: Martin Waitz <tali@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 4:49 am

I should. But at least a short summary of the reasons

In this (a very special, I believe) case, why not use git update-index
--cacheinfo?
-

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 5:20 am

hoi :)

Not necessarily, yes.

Branches in the submodule make no sense unless they are independent
from supermodule branches. And then changing to another branch in
the submodule automatically means that your current submodule working
directory should be independent of the supermodule.

git-status in the supermodule should of course warn when a submodule
is on a different branch, so that you don't accidentally lose submodule

I think we misunderstood each other.
For me branching is not a special case.

-- 
Martin Waitz

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:28 pm

So this does mean that the SHA1 of a gitlink entry corresponds
to the commit in the subproject?

I wonder if it is not useful to be able to add some attribute(s)
to a gitlink, i.e. first reference a gitlink object in the superproject,
which then references the submodule commit, and also holds some
further attributes. These attributes can not be put into the subproject,
as it should be independent.

An example for such an attribute would be a subproject name/ID.
An argument for this: The user should be able to specify some policies
for submodules, like "do not clone/checkout this submodule". But the
path where the submodule resides in a given commit is not useful here,
as a submodule can reside at different paths in the history of the
supermodule.

Josef
-

To: Josef Weidendorfer <Josef.Weidendorfer@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 7:36 pm

I mentioned this briefly on another strand of this thread, but I think
that the simplest way to do this would be to just make refs/subproject/*
populate itself sensibly when you commit in the superproject.

I mentioned refs/subprojects/path/branch before, but I think it would
probably be the sort of thing that should be in the .git/config

Sam.
-

To: Josef Weidendorfer <Josef.Weidendorfer@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 2:45 pm

The special "link" object has come up before, and I actually thought I'd
do it that way first, but there were a few reasons why I didn't:

- I tend to like "minimal", and the patches I sent out really are pretty
minimal, in the sense that they introduce just _one_ new concept, in
one place (it's basically a "tree entry" - so it shows up in tree
reading and writing, and nowhere else. The index, of course, is the
staging area for trees, so the index was also affected, but that was
really a very direct result of that "it's a new tree entry" thing).

- in a "link" object, the only thing that would normally *change* is
really just the commit SHA1. Everything else is really pretty static.
As such, I decided that it's just a waste of a perfectly fine object to
have several thousands of the "link" objects that really only differ in
the pointer to the commit.

- the "static" part, which you might as well have somewhere else, tends
to be stuff that you would need to be able to override locally, and as
such it does *not* really have a global meaning that is useful
historically.

For example, the things that you'd want to associate with the gitlink
are things like "where would I find the repository that the commit is
part of" and "what is a description of that submodule" and "what are
the relationships between the submodules". These are things that aren't
necessarily even totally independent: in CVS, for example, you have
module names that are really not submodules themselves, but are really
just aliases for *collections* of submodules.

So a 1:1 link object simply wouldn't make much sense anyway, and you'd
want to override those defaults with site-specific ones (maybe there is
a "canonical" address for the submodule repository, but if you have a
copy of it locally on-site, when you clone, you'd rather use the
*local* copy over the standard site, for example).

So all of this just made ...

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 8:42 pm

I guess this file could also cover the case where the superproject is
only interested in a small subset of the subproject. For example if I
only use some header files in a library and want
"/lib1/src/interface" in the subproject to end up as "/includes/lib1" in
the superproject. Could single files be handled in a similar way?

Although this is just an example, external links shouldn't be
specified in the same configuration file as project internal things
(which should be version-controlled). If the url configuration gets
overwritten with checkouts there will be problems bisecting if the url
changes over time.
-

To: Torgil Svensson <torgil.svensson@...>
Cc: Junio C Hamano <junkio@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 8:56 pm

hoi :)

Conceptually this information would have to be part of the
supermodule tree (after all it changes how your tree is set up).

I think it makes more sense to make users think about which part
of their tree can be reused and make them choose submodule boundaries

Most of the time we may not need to add any per-submodule URL
information anyway. If you fetch a new supermodule version, you
can get the new submodule from the same source (or from a per-submodule
source which can be determined by looking at and munching the supermodule
URL).

-- 
Martin Waitz

To: Martin Waitz <tali@...>
Cc: Junio C Hamano <junkio@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Thursday, April 12, 2007 - 5:23 pm

I agree. This could be included in the module config file which in

Sometimes you can't control upstream projects the way you want it.
Also, splitting up projects for the potential need of future
superprojects has several obvious disadvantages (multiple changelogs,
versions etc). I don't see the subfolder checkout thing as a problem
since the core plumbing in Linus's implementation doesn't care what's
beneath the commit link. The subfolder checkout can "easily" be done
in a porcelain.

It's more problematic if you want to cherry-pick individual files in a
subproject. Here, I think the tight connection between links and
directories is too restrictive. Why does a subproject commit-link
have to be represented as a folder?

//Torgil
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 3:29 pm

So when moving the kdelibs submodule around, you would
have to update the .gitmodules file.

I like it.

Josef
-

To: Josef Weidendorfer <Josef.Weidendorfer@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 3:45 pm

Right. The assumption here is:
- submodules almost never actually change. You might add a new one
occasionally, and once a decade you might do some bigger
re-organization, but in general it's pretty much static.
- when you do move submodules around, it's probably a big flag-day anyway
(ie I would expect that it's a big reorg, and that you'd quite likely
expect developers to have to re-check out their tree if you did major
surgery).

That's certainly how it works under CVS. I bet we can make it much nicer
than CVS, but the point is, people really don't expect submodules to be
something that you move around very dynamically. You want to be *able* to

The advantage with splitting things out like this is that it allows you
much more flexibility than something automatic and deeply integrated does.

You can still edit the modules setup even if you yourself might not even
have that particular module checked out! That may sound insane, but it's
actually *required* for things like "oh, the standard server for that
module went away, I need to edit the module settings to get it from xyz
instead".

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 7:47 pm

Also, in the Perl 5 Perforce conversion there are a number of
"submodules" (ie, bundled modules with their own history) that move
around a lot. In some tree representations used during the conversion
process they might even appear twice in a given tree with differing
versions.

Sam.
-

To: Sam Vilain <sam@...>
Cc: Junio C Hamano <junkio@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 8:13 pm

That should actually be something that is fairly natural to handle with
the current git submodule design - there's absolutely no problem with
having the same subproject showing up in multiple different places in the
tree (and each place obviously will have its own commit).

However, it causes some questions at two points:

- What do you do in the ".gitmodules" file, where you describe the
submodule setup?

This is not so much a _problem_ as a "how do you want to handle it"
issue.

Would people want such a module to show up as "one module" that is just
visible in the tree in multiple places? Or do people prefer to think of
it as completely separate modules that just happen to have the same
base repository?

I don't think it's clear that one or the other is the "right way" to
see things, and I don't think git really should care. I suspect it's
more likely to be a detail that some importer script just has to
resolve one way or the other.

The core git infrastructure needs to be able to have one module show up
in multiple places over time anyway, so I don't think there is any real
reason not to allow the same module to show up in multiple places even
within one single commit. (Ie it's really mostly about the .gitmodules
file *syntax* - but if we use the config file syntax, it's actually
very natural to allow multiple entries for the module directory name)

At the same time, there are reasons why you might want to consider them
separate modules too - maybe you want to *describe* them separately, and
maybe one of the copies is used for "legacy support", and you might be
in a situation where you want to check out only one of the copies and
not the other (and thus describing them as two *different* modules
rather than two versions of the *same* module actually makes sense!).

So I think this is something where we are technically neutral, but
where we may have non-technical issue...

To: <git@...>
Cc: Junio C Hamano <junkio@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, April 10, 2007 - 3:04 pm

Would it be nicer if .gitmodules were line-based to aid in merging?

Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
-

To: Andy Parkins <andyparkins@...>
Cc: Junio C Hamano <junkio@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, Linus Torvalds <torvalds@...>, <git@...>
Date: Tuesday, April 10, 2007 - 3:41 pm

This is very similar to the problem I asked about with merging config files a
couple of weeks ago. The answer then was that when we get .gitattributes we should
be able to specify content-specific merge programs that could deal with this
sort of thing on a per-file basis. That sounds like the answer to your concern
as well, rather than making things order-dependent and otherwise harder to read
so that they can be merged with the current tools (which assume line-based,
order-dependent content).

David Lang
-

To: Andy Parkins <andyparkins@...>
Cc: Josef Weidendorfer <Josef.Weidendorfer@...>, Linus Torvalds <torvalds@...>, <git@...>
Date: Tuesday, April 10, 2007 - 4:06 pm

I personally feel that if there are cases where a merge conflict is
hard to resolve, there is something wrong in the communication
between project members. In other words, merging this *should*
be hard.

Really, if somebody wants to have project X at directory sub/X/
and somebody else wants the same at directory X/, merging the
modules file would be the least of your concerns -- the resulting
toplevel would not build correctly until you decide which tree
hierarchy should be picked, and later exchange of results among
project members would not be easily usable by the half of the people
who picked a different hierarchy than you did.

-

To: Andy Parkins <andyparkins@...>
Cc: Junio C Hamano <junkio@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, <git@...>
Date: Tuesday, April 10, 2007 - 3:20 pm

I seriously doubt you'll ever be merging or changing this a lot. So I
don't think it's a huge concern.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Andy Parkins <andyparkins@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, <git@...>
Date: Tuesday, April 10, 2007 - 4:19 pm

I think Andy's comment comes from our earlier discussion on the
other in-tree configuration, .gitattributes file.

We were talking about using in-tree .gitattributes for deciding
if we apply crlf to each path and other things like which 3-way
file-level merge backend to apply, and need to make the system
gracefully degrade even when the in-tree .gitattributes has
conflict markers during a merge. And for that purpose, it is
certainly easier to arrange "pick each line, while ignoring <<<
or === or >>>, and if there are conflicting duplicates do
something sensible about them", if the file is line oriented.

But I do not think the .gitmodules thing needs that. If we have
conflicting (or non-conflicting for that matter) submodule
moves, that's a _MAJOR_ project re-organization, and I do not
think we would even want to automatically descend into
submodules for merging or checking-out when we have such a
situation in the higher level project.

-
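
The line-oriented fallback mentioned above ("pick each line, while ignoring <<< or === or >>>") might look roughly like the following sketch. It assumes conflict markers are lines starting with a run of seven identical '<', '=' or '>' characters, and it leaves the "do something sensible about conflicting duplicates" step to the caller. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the line starts with seven identical '<', '=' or '>'
 * characters (the assumed conflict-marker shape), 0 otherwise.
 */
static int is_conflict_marker(const char *line)
{
	char c = line[0];
	size_t i;

	if (c != '<' && c != '=' && c != '>')
		return 0;
	for (i = 1; i < 7; i++)
		if (line[i] != c)
			return 0;
	return 1;
}

/*
 * Copy every line of `in` that is not a conflict marker into `out`,
 * which must hold at least strlen(in) + 1 bytes. Lines from both sides
 * of a conflict are kept, in file order.
 * Returns the number of lines kept.
 */
static int strip_conflict_markers(const char *in, char *out)
{
	int kept = 0;

	assert(in != NULL && out != NULL);

	while (*in) {
		const char *eol = strchr(in, '\n');
		size_t len = eol ? (size_t)(eol - in) + 1 : strlen(in);

		if (!is_conflict_marker(in)) {
			memcpy(out, in, len);
			out += len;
			kept++;
		}
		in += len;
	}
	*out = '\0';
	return kept;
}
```

This only works because each surviving line is meaningful on its own, which is the argument for keeping such a file line-oriented; a nested or order-dependent syntax could not be salvaged this way.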

To: Junio C Hamano <junkio@...>
Cc: Andy Parkins <andyparkins@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, <git@...>
Date: Tuesday, April 10, 2007 - 4:33 pm

100% agreed.

Also, note that while the ".gitmodules" (or whatever) file will be
required to do things like "git pull", the basic tree-level logic that I
sent out obviously doesn't need/use .gitmodules at all.

So there's a very real issue where a repository with submodules still
"works", even with a .gitmodules file that is totally scrogged and doesn't
have the right information (yet), it's just that it may simply not be able
to do all the operations because it cannot figure out where to pull
missing subproject data from etc..

So there is no reason to believe that we need to magically and
automatically resolve conflicts - if conflicts happen, functionality is
reduced, but it's not reduced so much that you cannot use the tree and try
to resolve them (which is important, btw, since often before you commit
your fix for the conflicts you'd want to *test* that fix, so we definitely
don't want these kinds of files to be so central that it gets hard to get
normal work done without them).

It really boils down to the same design issue: the way I think submodules
should work is that they are very loosely coupled with the supermodule.
The fact that the ".gitmodules" file isn't *that* critical comes largely
from that loose coupling.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Andy Parkins <andyparkins@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, <git@...>
Date: Wednesday, April 11, 2007 - 8:12 pm

Whoa... "missing" subproject data?

Surely, unless you're doing lightweight/shallow clones, if you have a
gitlink you've also got the dependent repository? Otherwise the
reachability rule will be broken.

Sam.
-

To: Sam Vilain <sam@...>
Cc: Junio C Hamano <junkio@...>, Andy Parkins <andyparkins@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, <git@...>
Date: Wednesday, April 11, 2007 - 10:01 pm

[ Dang. Power failure in the middle of writing emails. Can't remember
which one was lost. Am rewriting some of this reply in abbreviated form. ]

Absolutely. Not just subproject data. The whole subproject is often
missing.

If I fetch the KDE superproject, I generally do *not* want every single
subproject. In fact, I'd likely just want one or two subprojects.

The notion that all subprojects are populated is a *bug*. I would
personally refuse to use such a setup. Even CVS can handle that just fine,
we certainly don't want to be worse than CVS here.

If you just track a project, it's quite common to only check out the "src"
module, and *not* fetch things like the "validation" or "test" module if
you're just following along.

Or you might fetch the "kdebase" module, but that sure doesn't mean that
you want all the other ones (kdevelop source code? full kdelibs sources?

The reachability rule *must* be breakable. That's why fsck currently
doesn't care AT ALL.

It's much better to break that rule than to even check it! I'd rather
leave fsck like it is now, than to *ever* fix it, if the "fix" involves
"you have to always fetch all submodules to shut fsck up".

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Andy Parkins <andyparkins@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, <git@...>
Date: Wednesday, April 11, 2007 - 11:56 pm

Ok, but couldn't this be considered a variation of a lightweight checkout?

The only reason I'm worried about this is the case where the
superproject contains *thousands* of subprojects. Eg, a superproject for
all repo.or.cz projects. Say in a day 200 projects get updated with a
few commits - do you have to do 200 pulls or just one? But maybe that
problem can be solved in another way, or maybe it won't really hurt so
much in practice and still be faster/more efficient than rsync mirroring.

This is especially the case in concert with gittorrent, which will need
modifications to support sharing multiple repositories (not that that's

Well fsck can be fixed easily enough to not descend, like lightweight
checkouts.

What I really want to avoid is the situation where you can't checkout,
even though you didn't indicate a shallow/lightweight clone.

What else might this decision impact? Obviously with a smaller base you
have fewer delta targets, though that's probably not a real issue.

Sam.
-

To: Sam Vilain <sam@...>
Cc: Junio C Hamano <junkio@...>, Andy Parkins <andyparkins@...>, Josef Weidendorfer <Josef.Weidendorfer@...>, Linus Torvalds <torvalds@...>, <git@...>
Date: Wednesday, April 11, 2007 - 8:35 pm

hoi :)

With submodules you actually have a natural cutting point where
you can say: no, I don't want to get that.
So for submodules the reachability rule is a little bit more relaxed.

And when you fetch the superproject you now need some way to fetch
the new submodule objects. They may be in the same upstream repository
but it may make sense to have this configurable.

-- 
Martin Waitz

To: Josef Weidendorfer <Josef.Weidendorfer@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:50 pm

These attributes can be put into a file in superproject tree and
checked in at the same time as the gitlink. No real need for introducing
another object type (right now there is no gitlink object type, just
an entry in tree with special mode).
-

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 1:23 pm

Like... .gitattributes ? ;-)
Ok, this could work; however, there of course is the possibility of
inconsistencies when e.g. manually moving subprojects around.

How is consistency ensured for .gitattributes ?
I see that for .gitignore consistency, the user is responsible.

Josef
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 4:40 am

Not that I have time right now to look up the exact context (only read
the patch), but I would've expected a "case S_IFDIRLNK:" here?

Gruesse,
--
Frank Lichtenheld <frank@lichtenheld.de>
www: http://www.djpig.de/
-

To: Frank Lichtenheld <frank@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 10:55 am

So we have this strange (and worrying) dualism inside git: we use the same
macros *both* for "stat data" *and* for "git-internal file modes".

So sometimes a mode is the result of a [l]stat() call like above, and then
a gitlink is just a directory and we use S_IFDIR. And if it comes from the
index, then it uses the internal git representation, and is S_IFDIRLNK.

I'm not very happy about it, but I'm actually most unhappy about it since
I could imagine that the constants themselves are different on different
OS's (eg VMS - a Unix-related OS will use the same constants for
historical reasons).

In this particular place (index-path), we obviously not only have a stat()
result, but more importantly, we never come here for a "normal" directory,
since a normal directory would have been expanded into its component paths
by the "read_directory()" logic.

So that interaction with directory expansion is somewhat non-obvious:
normal directories are expanded recursively into the files they contain,
while git directories end up being visible to internals as real
directories, and are turned into gitlinks by code like the above.

Linus
-

To: Junio C Hamano <junkio@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 7:31 am

No, the st_mode comes directly from file system. It knows nothing about
dirlinks.
-

To: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:15 am

Since the subprojects don't necessarily even exist in the current tree,
much less in the current git repository (they are totally independent
repositories), we do not want to try to follow the chain from one git
repository to another through a gitlink.

This involves teaching fsck to ignore references to gitlink objects from
a tree and from the current index.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 builtin-fsck.c |    9 ++++++++-
 tree.c         |   15 ++++++++++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/builtin-fsck.c b/builtin-fsck.c
index 4d8b66c..f22de8d 100644
--- a/builtin-fsck.c
+++ b/builtin-fsck.c
@@ -253,6 +253,7 @@ static int fsck_tree(struct tree *item)
 		case S_IFREG | 0644:
 		case S_IFLNK:
 		case S_IFDIR:
+		case S_IFDIRLNK:
 			break;
 		/*
 		 * This is nonstandard, but we had a few of these
@@ -695,8 +696,14 @@ int cmd_fsck(int argc, char **argv, const char *prefix)
 		int i;
 		read_cache();
 		for (i = 0; i < active_nr; i++) {
-			struct blob *blob = lookup_blob(active_cache[i]->sha1);
+			unsigned int mode;
+			struct blob *blob;
 			struct object *obj;
+
+			mode = ntohl(active_cache[i]->ce_mode);
+			if (S_ISDIRLNK(mode))
+				continue;
+			blob = lookup_blob(active_cache[i]->sha1);
 			if (!blob)
 				continue;
 			obj = &blob->object;
diff --git a/tree.c b/tree.c
index d188c0f..dbb63fc 100644
--- a/tree.c
+++ b/tree.c
@@ -143,6 +143,14 @@ struct tree *lookup_tree(const unsigned char *sha1)
 	return (struct tree *) obj;
 }
 
+/*
+ * NOTE! Tree refs to external git repositories
+ * (ie gitlinks) do not count as real references.
+ *
+ * You don't have to have those repositories
+ * available at all, much less have the objects
+ * accessible from the current repository.
+ */
 static void track_tree_refs(struct tree *item)
 {
 	int n_refs = 0, i;
@@ -152,8 +160,11 @@ static void track_tree_refs(struct tree *item)

/* Count how many entries there...

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 6:41 pm

Does this consider the case where the intent of the subprojects is to
collate multiple, small projects into one bigger project?

In that case, you might want to keep all of the subprojects in the same
git repository.

Sam.
-

To: Sam Vilain <sam@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 6:48 pm

I assume you mean "you might want to keep all of the subprojects' objects
in the same git object directory".

And yes, that's absolutely true, but it's technically no different from
just using GIT_OBJECT_DIRECTORY to share objects between totally unrelated
projects, or using git/alternates to share objects between (probably
*less* unrelated repositories, but still clearly individual repos).

So the main point of superproject/subprojects is to allow independence
(because independence is what allows it to scale), but there is nothing to
say that things *have* to be kept totally isolated.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 6:59 pm

Would that be the only distinction?

I'm particularly interested in repositories with, say, thousands of
submodules but only a few hundred meg. I really want to avoid the
situation where each of those submodules gets checked or descended into
separately for updates etc.

Sam.
-

To: Sam Vilain <sam@...>
Cc: Junio C Hamano <junkio@...>, <danahow@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 7:30 pm

This seems slightly related to the hazy picture I'm forming of how
I'd like to use git at our site. Essentially, everyone would have their
own working tree with .git directory, but .git/objects is a symlink
to a shared object repository. How do you fully run git-fsck on this
shared object repository? The actual heads (roots) are distributed amongst
many .git/refs directories (I suppose you could do something akin
to git-fsck $(cat /somepaths*/.git/refs/*), but that means you know
where all the repositories are). So in this setup, maybe I'd want to run
fsck twice: the first time checking everything but not complaining about
dangling commit objects [but listing them?], and maybe a 2nd finding
all these in the users' repos [still need to know where these are].
Please note this is just a thought experiment at this point.

Anyway, git started out with a 1:1 relationship between working tree,
index, and object repository. Various things could weaken that --
alternates, subprojects with different relationships to their object
repositories, etc. -- so special commands like git fsck which
focus mostly on the object repository may need a little tweaking eventually.

--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
-

To: Sam Vilain <sam@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 7:16 pm

I think we'll eventually want that *regardless* of how the object handling
is done (a kind of "cross-submodule boundary check"), but I think that's
actually outside of the scope of the current fsck.

The current fsck goes to great lengths to make sure that the internal
consistency of a repository is good. That's also why it takes so long, and
why it is such an expensive operation to do (notably when you do a
"--full" check).

In contrast, the "cross-submodule boundary check" is a much cheaper
operation, *if* you have already verified that the projects are internally
consistent. It literally boils down to doing a very simplified commit
chain walker that only parses tree objects and simply spits out the
SHA1's of the sub-tree commits (and their location in the tree), and then
a separate phase that just verifies those against the submodules.

And that separate phase - once you've done the fsck for all the
*individual* repositories - is truly trivial. It's literally just a matter
of "is that SHA1 a valid commit object". That's *cheap*.

So I think that the way to verify a superproject is:

- fsck each and every project totally independently. This is something
you have to do *anyway*.

- either as you fsck, or as a separate phase after the fsck, just
traverse the trees and spit out "these are the SHA1's of subprojects"

- finally, just go through the list of SHA1's (after every project has
been fsck'd) and verify that they exist (since if they exist, they will
have everything that is reachable from them, as that's one of the
things that the *local* fsck verifies)

Notice? At no point do you actually need to do a "global fsck". You can do
totally independent local fsck's, and then a really cheap test of
connectedness once those fsck's have completed.
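
The middle step above (walk the trees and spit out the subproject SHA1's) can be sketched in C. This is only an illustration, not code from this series: the name for_each_gitlink() and its callback are made up for the example, and it assumes the caller has already read and inflated one tree object into memory. The entry layout it parses ("<octal mode> <name>", a NUL, then 20 raw SHA1 bytes) is the regular git tree format, and 0160000 is the gitlink mode introduced in patch 6/6.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODE_TYPE_MASK 0170000u	/* file-type bits of a mode */
#define GITLINK_MODE   0160000u	/* the value the series calls S_IFDIRLNK */
#define RAW_SHA1_LEN   20

/* Hypothetical callback, invoked once per gitlink entry found in the tree. */
typedef void (*gitlink_fn)(const char *name, const unsigned char *sha1, void *data);

/*
 * Walk one raw tree object body: repeated "<octal mode> <name>\0<20-byte SHA1>".
 * Returns the number of gitlink entries seen, or -1 if the buffer is malformed.
 */
int for_each_gitlink(const unsigned char *buf, size_t len, gitlink_fn fn, void *data)
{
	size_t pos = 0;
	int found = 0;

	assert(buf != NULL || len == 0);
	while (pos < len) {
		unsigned int mode = 0;
		const unsigned char *name, *nul;

		/* Parse the octal mode up to the separating space. */
		while (pos < len && buf[pos] != ' ') {
			if (buf[pos] < '0' || buf[pos] > '7')
				return -1;
			mode = (mode << 3) | (unsigned int)(buf[pos] - '0');
			pos++;
		}
		if (pos >= len)
			return -1;
		pos++;	/* skip the space */

		/* The name runs to the next NUL; 20 SHA1 bytes must follow it. */
		name = buf + pos;
		nul = memchr(name, '\0', len - pos);
		if (!nul || (size_t)(buf + len - nul - 1) < RAW_SHA1_LEN)
			return -1;

		if ((mode & MODE_TYPE_MASK) == GITLINK_MODE) {
			if (fn)
				fn((const char *)name, nul + 1, data);
			found++;
		}
		pos = (size_t)(nul - buf) + 1 + RAW_SHA1_LEN;
	}
	return found;
}
```

A caller would run this over every tree reachable from the superproject's commits, collect the reported SHA1's, and then check in each already-fsck'd subproject that each one names a commit object.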

The reason a *full* global fsck is so expensive is that it would have an
absolutely humungous working set, and effectively keep everything in
memory through it all. Doing it in stages ("...

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 8:34 pm

The small detail in the last step is wrong, though. Even if
they EXIST, they may be isolated commits that are not connected
to refs, and fsck in the repository would not have warned about
unreachable trees from such unconnected commits. So you would
need to do a reachability from these commits to the refs in the
subproject.

This would be similar to the quick-fetch topic I sent out a
couple of patches for, that implements logic to skip fetching
objects from your alternate. You would have rev-list --objects
traverse from them with "--not --all" in the subproject
repository and make sure it does not trigger "I could not list
all objects reachable from the commits you wanted because such
and such tree/blob are missing".

That reminds me of one thing I haven't verified. I am not
absolutely sure that rev-list --objects makes sure that
blobs it lists exist (trees are checked as it needs to read
them, and if they are missing or corrupt it would notice and
barf). When it is used for the purpose of this "subproject
boundary fsck" and the quick-fetch, it should. Perhaps a
specialized option to check deeper than usual is needed. I

This is still true.

-

To: Junio C Hamano <junkio@...>
Cc: Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 9:52 pm

The superproject *is* a ref.

You cannot prune the subprojects on their own. That's the *only* real
special rule about subprojects. Exactly because pruning them on their own
is not a valid op to do.

It's the same way with a source of "alternate" objects (or a shared
object directory) - you'd better not prune them, because other projects
may have refs to them that you don't know about locally. So this isn't
something new to subprojects.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 10:00 pm

But when you fsck the subproject repository in isolation in the
earlier step in your procedure, that is not taken into account,
is it?

The situation I had in mind was not about pruning, but an
earlier fetch, either the native one that unpacks the objects
into loose form or a http walker, fetched a commit near the tip
but was interrupted/killed before finishing the fetch or
updating the ref. The tip of such an incomplete commit chain
would be reported dangling. They are ahead of your refs but
they may lack commits and trees to complete the chain back to
your refs yet. When the higher-level project points at such a
commit, the existence of the commit is not a proof that
everything needed to complete the commit is available.

We need to prove that separately, and that was my suggestion to
run a "rev-list --objects $those-commits --not --all" in the
subproject repository, similar to what the quick-fetch topic
does.

-

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 10:06 pm

Ah, forget about this. The HEAD, which is in the tree of the
higher-level project, is a ref. Silly me.

-

To: Junio C Hamano <junkio@...>
Cc: Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 10:28 pm

Well, not entirely "silly you".

If you do a "git reset" in the superproject, that will obviously have to
rewrite the heads in the subproject.

I do suspect that we should always enable reflogs for the subprojects, so
that pruning is safe even for these kinds of situations, but that
doesn't resolve all issues.

For example: to manage *cloning* of the extra stuff, you might actually
want to have externally visible refs, and while I suspect the main
solution will always be to just do good maintenance (ie "don't do 'git
bisect' and _never_ rewrite history in the main superproject!!"), I don't
think it's out of the question to add other safety nets too..

So for example, while I'm not sure it's necessary, I don't think it would
be *wrong* if we might eventually end up having *other* safety features
like adding a totally separate "refs/superprojects/xyzzy" ref structure.

Or something like that.. Just to make the refs more visible both
externally and internally, and to make it much harder to make stupid
mistakes without realizing it.

I suspect a lot of this will depend on just how many mistakes people make.
I don't think we've so far had a single problem with alternates files,
re-basing, and people then pruning away objects used by other repositories
by mistake, so maybe people really don't make those kinds of mistakes.

So maybe we don't need any extra safety nets at all. But who knows..

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 7:05 pm

would it make sense to have a --multiple-project option for fsck that would let
you specify multiple 'projects' that share an object set and have the default
checking not do the reachability checks that cause problems in this case?

Then people can share the objects if they want to and still do a full check, but
would get warned that the full check would take a lot of time. which is not a
big problem for a housekeeping thing that's run infrequently to find unreachable
objects (which is something that should seldom happen in a well managed project)

David Lang
-

To: David Lang <david.lang@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 7:53 pm

Well, the thing is, sharing object directories actually makes things
*harder* to check, rather than easier.

It can be a nice space optimization, and yes, if there really is a lot of
shared state, it can make it much cheaper to do some of the checks, but
right now we have absolutely *no* way for fsck to then do the reachability
check, because there is no way to tell fsck where all the refs are (since
now the refs come in from multiple repositories!)

So the individual objects get cheaper to fsck (no need to fsck shared
objects over and over again), but the reachability gets much harder to
fsck.

It's not an insurmountable problem, or even necessarily a very large one,
but it boils down to one very basic issue:

- nobody seems to actually *use* the shared object directory model!

The thing is, with pack-files and alternates directories, a lot of the
original reasons for shared object directories simply don't exist..

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, David Lang <david.lang@...>, Git Mailing List <git@...>
Date: Wednesday, April 11, 2007 - 8:03 pm

I think that's just the chicken-and-egg problem. Once this happens I
think we'll see people aggregating all sorts of related repositories
with this feature, and possibly making much richer histories by tracking
portions of their trees as subprojects rather than just a subdirectory.

Sam.

-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 7:30 pm

this is why I was suggesting a --multiple-project option to let you tell fsck

I suspect that if it could be checked it would be used more, especially with the
subproject support.

David Lang
-

To: David Lang <david.lang@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 10:14 pm

Well, just from a personal observation:
- I would *personally* actually refuse to share objects with anybody
else.

I just find the idea too scary. Somebody doing something bad to their
object store by mistake (running "git prune" without realizing that there
are *my* objects there too, or just deciding that they want to play with
the object directory by hand, or running a new fancy experimental importer
that has a subtle bug wrt object handling or anything like that).

I'll endorse using "alternates" files, but partly because I know the main
project is safe (any alternates usage is in the "satellite" clones anyway,
and they will never write to the alternate object directory), and partly
because at least for the kernel, we don't have branches that get reset in
the main project, so there's no reason to fear that a "git repack -a -d"
will ever screw up any of the satellite repositories even by mistake.

But for git projects, even alternates isn't safe, in case somebody bases
their own work on a version of "pu" that eventually goes away (even with
reflogs, pruning *eventually* takes place).

So I tend to think that alternates and shared object directories are
really for "temporary" stuff, or for *managed* repositories that are at
git *hosting* sites (eg repo.or.cz), and where there is some other safety
involved, ie users don't actually access the object directories directly
in any way.

So I've at least personally come to the conclusion that for a *developer*
(as opposed to a hosting site!), shared object directories just never make
sense. The downsides are just too big. Even alternates is something where
you just need to be fairly careful!

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, David Lang <david.lang@...>, <danahow@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Thursday, April 12, 2007 - 2:32 pm

These arguments all seem pretty convincing to me --
maybe the problem is that I'm not a "*developer*" right now.
Instead I'm part of a multi-developer *site*.
Below I talk about a possible way we could use git
without changing it (since I recognize this would be a minority usage pattern).

We use perforce to manage a mixed hardware/software project
(I'm the 55GB check-out guy, remember?). We have at least 3 different
kinds of data with different usage patterns, and using perforce for
everything in one centralized server was not the best solution.

Each user ("client") has their own worktree and the perforce
repository is on a shared central server. You can consider perforce
to have the equivalent of git's index, but it is stored on the server,
in one file ("db.have") covering all clients. Obviously that becomes a
bottleneck -- and recently db.have got larger than the total cache RAM on
the server, which really slowed things down until we moved to a larger
server. But repository architecture aside, the real problem has been
perforce's usability. Frequently one contributor, having gotten ahead
of the team, needs to share this more recent work with only a few
people. This could be done with p4 branching, but this is really clunky.
So instead the work is pushed out (submitted) to everyone, causing
instability; this is partially remedied by doing it in smaller chunks.
Another perforce problem is that tagging consumes a lot of server
space (and may slow things down as well).

Some of this data will stay in perforce, some will move into revision
control built-in to some of our other tools, and I'd like to try to move some
of it into git. The main attraction for the last group is the lightweight
branching that would allow early/tentative work to be easily shared.
I think the subproject work currently being discussed is going to
be very helpful as well -- the perforce equivalent is chaotic.

We could give each user a work tree and an object repository,
and then have a "releas...

To: Dana How <danahow@...>
Cc: Junio C Hamano <junkio@...>, David Lang <david.lang@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Thursday, April 12, 2007 - 3:17 pm

Yes.

The issues for hosting sites are very different from the issues of
individual developers having their own git repositories, and I agree 100%
that both alternates and shared object directories make tons of sense for

I hope it wouldn't even be a minority usage pattern. I am a firm believer
that distributed SCM's and git in particular make a lot more sense for
source control hosting than CVS or SVN do. I'm really disappointed with
things like sourceforge, and part of the problem is literally that a
centralized SCM is really *fundamentally* wrong for a hosting entity.

Using a distributed SCM just makes _so_ much more sense for hosting
projects, and I've actually very much wanted to try to make sure that git
can help people who host things.

It's not my *own* primary use, but I think it's a very important usage
pattern, even though it's very different from a "normal developer" private
sandbox case.

So I think your case is really very interesting. I'd love to help figure
out how to help you guys with git, but because it's not how I personally
work, I can really just try to help when you actually hit a problem -
you'll have to figure out what your usage patterns actually are on your
own ;)

And btw, I think the shared object model really works very well, but I
think it has to be paired with some stricter rules than people who use
their own repos tend to have. For example, end-point developers have
become very used to rebasing and generally rewriting history (or just
resetting to an older state), and that's something that works fine in a
"local repository" setup, but it's also the kinds of patterns that can
really screw you in a hosted and shared-object environment.

As to your two setups: I would suggest you go with the "hidden" shared
version (ie people use the remote access pull/push to a server, and the
*server* uses a shared object repository for multiple repositories),
rather than having a user-visible globally shared object directory. Ev...

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, David Lang <david.lang@...>, <danahow@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Sunday, April 15, 2007 - 2:50 am

For clarity I should have written *office* instead of *site* to
describe my situation.
We did go down the local disk route, but after two significant losses of
individuals' work, it was decreed that (perforce) work trees must be
on the NetApp. So we already made the investment in beefiness --
for different reasons -- and I need to conform to these decisions for
the moment.

After reliability, the other big criterion (especially with our
penchant for large files) will be speed. With perforce, users now see
submit={1 copy to server},
sync={1 copy from server}. In the short term I can't get away with changing
this to submit={copy working to indiv repo, copy indiv repo to shared repo}
and sync={copy shared repo to indiv repo, copy indiv repo to working},
because at first everyone will be trying to emulate what they did in perforce.

So probably I'll start out with either a very small testgroup,
or one shared object repository with sticky/group tricks on the NetApp.
Once git's collaboration advantages are apparent,
I'll switch to the hidden repository model which I prefer as well.
And hopefully these collaboration advantages will also mean people
will commit more often and local disks can come back into favor --
and then the "extra" local repo file copy operations will be less noticeable.

In any event, I have some scripting to do to learn more about our usage
patterns and pushing our datasets through git. I also need to finish
the pack-splitting patch (after 64b index goes in). Finally, before all that,
I'll be out of the country for the next ~10 days...

Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, David Lang <david.lang@...>, Dana How <danahow@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Friday, April 13, 2007 - 5:00 am

Would it not make sense for a hosting environment to say, if you are
using alternates, or shared object directories, then you need to include
*all* the refs in *all* the projects if you ever do an fsck?

I'm not sure how well git will scale in this case, although it just
should be a matter of how well git scales to dealing with a single
project with tens of thousands of refs/tags/etc. The only problem might
be in passing all those refs/tags to fsck in one go. STDIN, I guess?

Rogan
-

To: Rogan Dawes <lists@...>
Cc: Junio C Hamano <junkio@...>, David Lang <david.lang@...>, Dana How <danahow@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Friday, April 13, 2007 - 11:23 am

Yes. And it shouldn't be hard to add support to do it. It's just not been
done.

A lot of git programs already take refs on stdin, but fsck just doesn't do
it (it can do it from the command line, but you'd run out of command line
space very quickly).

More natural would be to just list all the git repos by git repo pathname
(and there, usually the command line probably *is* long enough), but
somebody would just have to do it. It's probably not very much code: just
iterate over each repo both when adding refs and when actually doing the

For a real shared object directory, passing the refs to stdin (and
teaching fsck about a "--stdin" flag) would be consistent with what we do
for many other commands, so yes, that would work.

However, fsck actually tends to want not just the refs, but actually
things like the index files and reflog files too, because those add other
reachability info, which is why it's probably more natural to just give
fsck the list of related repositories and let it figure them out.

That's also what you'd want to do for "alternates", since now there is no
longer a single object directory either, but multiple separate (but
related) ones.

Somebody would just have to write the code.. The basic rules are really
all in "git/builtin-fsck.c": cmd_fsck(). Hint hint.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Thursday, April 12, 2007 - 1:18 pm

I was actually thinking that hosting sites (and things like gitorrent) would be
the ones that would get the most benefit from sharing objects. the amount saved
for any individual developer is probably fairly minor (and the individual
developer could run a script to look across their objects and hard-link them
together if they care about the space)

David Lang
-

To: Linus Torvalds <torvalds@...>
Cc: David Lang <david.lang@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 10:30 pm

Actually that is not even true for repo.or.cz -- the site lets
people create *forks* of the main project, and I recall it is
implemented in terms of alternates.

That's one of the reasons I never asked to take over git.git
repository there. I have alt-git.git instead, which does not
allow forks.

-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, David Lang <david.lang@...>, <danahow@...>, Git Mailing List <git@...>, Sam Vilain <sam@...>
Date: Wednesday, April 11, 2007 - 8:00 pm

Cool -- my previous email makes me either a git idiot or a git pioneer!

So I'll think through my usage model some more and
look over the fsck source.

Until then,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
-

To: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:14 am

This just adds the basic helper functions to recognize and work with git
tree entries that are links to other git repositories ("subprojects").
They still aren't actually connected up to any of the code-paths, but
now all the infrastructure is in place.

The next commit will start actually adding actual subproject support.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
cache.h | 20 +++++++++++++++++++-
1 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/cache.h b/cache.h
index eb57507..1b3d00e 100644
--- a/cache.h
+++ b/cache.h
@@ -25,6 +25,22 @@
#endif

/*
+ * A "directory link" is a link to another git directory.
+ *
+ * The value 0160000 is not normally a valid mode, and
+ * also just happens to be S_IFDIR + S_IFLNK
+ *
+ * NOTE! We *really* shouldn't depend on the S_IFxxx macros
+ * always having the same values everywhere. We should use
+ * our internal git values for these things, and then we can
+ * translate that to the OS-specific value. It just so
+ * happens that everybody shares the same bit representation
+ * in the UNIX world (and apparently wider too..)
+ */
+#define S_IFDIRLNK 0160000
+#define S_ISDIRLNK(m) (((m) & S_IFMT) == S_IFDIRLNK)
+
+/*
* Intensive research over the course of many years has shown that
* port 9418 is totally unused by anything else. Or
*
@@ -104,6 +120,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
{
if (S_ISLNK(mode))
return htonl(S_IFLNK);
+ if (S_ISDIR(mode) || S_ISDIRLNK(mode))
+ return htonl(S_IFDIRLNK);
return htonl(S_IFREG | ce_permissions(mode));
}
static inline unsigned int ce_mode_from_stat(struct cache_entry *ce, unsigned int mode)
@@ -121,7 +139,7 @@ static inline unsigned int ce_mode_from_stat(struct cache_entry *ce, unsigned in
}
#define canon_mode(mode) \
(S_ISREG(mode) ? (S_IFREG | ce_permissions(mode)) : \
- S_ISLNK(mode) ? S_IFLNK : S_IFDIR)
+ S_ISLNK(mode) ? S_IFLNK : S_ISDIR(mode) ? S_IFDIR : S_IFDIRLNK)

#def...
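
As a rough illustration of the NOTE in that comment, here is a sketch that classifies modes using spelled-out constants instead of the S_IFxxx macros from the OS headers. The GIT_IF* names, the enum and git_mode_kind() are hypothetical and do not appear in the patch; the numeric values are the usual UNIX ones the comment refers to.

```c
#include <assert.h>

/* Hypothetical "internal git values", written out rather than taken from <sys/stat.h>. */
#define GIT_IFMT     0170000u	/* mask for the file-type bits */
#define GIT_IFREG    0100000u
#define GIT_IFDIR    0040000u
#define GIT_IFLNK    0120000u
#define GIT_IFDIRLNK 0160000u	/* same value as S_IFDIRLNK in the patch */

enum git_kind { KIND_UNKNOWN, KIND_BLOB, KIND_TREE, KIND_SYMLINK, KIND_GITLINK };

/* Map the file-type bits of a mode to the kind of entry it describes. */
enum git_kind git_mode_kind(unsigned int mode)
{
	/* The patch comment notes that 0160000 happens to be S_IFDIR + S_IFLNK. */
	assert(GIT_IFDIR + GIT_IFLNK == GIT_IFDIRLNK);

	switch (mode & GIT_IFMT) {
	case GIT_IFDIRLNK:
		return KIND_GITLINK;
	case GIT_IFDIR:
		return KIND_TREE;
	case GIT_IFLNK:
		return KIND_SYMLINK;
	case GIT_IFREG:
		return KIND_BLOB;
	default:
		return KIND_UNKNOWN;
	}
}
```

Comparing the masked value for equality matters here: a plain bit test for the directory bit would also match a gitlink, since 0160000 has that bit set.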

To: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:14 am

This new function resolves a ref in *another* git repository. It's
named for its intended use: to look up the git link to a subproject.

It's not actually wired up to anything yet, but we're getting closer to
having fundamental plumbing support for "links" from one git directory
to another, which is the basis of subproject support.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
refs.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
refs.h | 3 ++
2 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/refs.c b/refs.c
index d2b7b7f..229da74 100644
--- a/refs.c
+++ b/refs.c
@@ -215,6 +215,85 @@ static struct ref_list *get_loose_refs(void)

/* We allow "recursive" symbolic refs. Only within reason, though */
#define MAXDEPTH 5
+#define MAXREFLEN (1024)
+
+static int resolve_gitlink_packed_ref(char *name, int pathlen, const char *refname, unsigned char *result)
+{
+ FILE *f;
+ struct cached_refs refs;
+ struct ref_list *ref;
+ int retval;
+
+ strcpy(name + pathlen, "packed-refs");
+ f = fopen(name, "r");
+ if (!f)
+ return -1;
+ read_packed_refs(f, &refs);
+ ref = refs.packed;
+ retval = -1;
+ while (ref) {
+ if (!strcmp(ref->name, refname)) {
+ retval = 0;
+ memcpy(result, ref->sha1, 20);
+ break;
+ }
+ ref = ref->next;
+ }
+ free_ref_list(refs.packed);
+ return retval;
+}
+
+static int resolve_gitlink_ref_recursive(char *name, int pathlen, const char *refname, unsigned char *result, int recursion)
+{
+ int fd, len = strlen(refname);
+ char buffer[128], *p;
+
+ if (recursion > MAXDEPTH || len > MAXREFLEN)
+ return -1;
+ memcpy(name + pathlen, refname, len+1);
+ fd = open(name, O_RDONLY);
+ if (fd < 0)
+ return resolve_gitlink_packed_ref(name, pathlen, refname, result);
+
+ len = read(fd, buffer, sizeof(buffer)-1);
+ close(fd);
+ if (len < 0)
+ return -1;
+ while (len && isspace(buffer[len-1]))
+ len--;
+ buffer[len] = 0;
+
+ /* Was it...
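
The patch text is cut off above, so the following is not the elided code. It is a standalone sketch of the general idea of resolving a ref file: strip trailing whitespace, then decide whether the contents are a 40-character hex SHA1 or a symbolic "ref:" pointer to follow. The enum and classify_ref() are hypothetical names invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum ref_kind { REF_INVALID, REF_SHA1, REF_SYMBOLIC };

/*
 * Classify the NUL-terminated contents of a ref file. Trailing whitespace
 * is stripped in place. For REF_SYMBOLIC, *target points at the name of
 * the ref to follow (inside buf); otherwise *target is NULL.
 */
enum ref_kind classify_ref(char *buf, const char **target)
{
	size_t len, i;

	assert(buf != NULL && target != NULL);
	*target = NULL;

	len = strlen(buf);
	while (len && isspace((unsigned char)buf[len - 1]))
		len--;
	buf[len] = '\0';

	/* A direct ref is exactly 40 hex digits. */
	if (len == 40) {
		for (i = 0; i < 40; i++)
			if (!isxdigit((unsigned char)buf[i]))
				break;
		if (i == 40)
			return REF_SHA1;
	}

	/* A symbolic ref looks like "ref: refs/heads/master". */
	if (len > 4 && !memcmp(buf, "ref:", 4)) {
		const char *p = buf + 4;

		while (isspace((unsigned char)*p))
			p++;
		if (*p) {
			*target = p;
			return REF_SYMBOLIC;
		}
	}
	return REF_INVALID;
}
```

A resolver built on this would recurse on the target for REF_SYMBOLIC, with a depth limit along the lines of the MAXDEPTH check in the patch.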

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 5:38 am

Can't a subproject be bare?
-

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 10:58 am

Not when it is checked out, no. That's what "checked out" means ;)

If a subproject is bare, it never gets resolved, because it's never
checked out in a superproject.

So a subproject *can* be bare, but when it's bare it is just a totally
regular independent git project, simply by *definition* of not being
checked out inside a superproject.

But hey, that was just a design decision of mine, and if people can argue
for it being wrong, I don't think I'm married to it ;)

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Alex Riesen <raa.lkml@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 11:54 am

It would be nice if a redirection via a "gitdir = ..." line
in .git/link of the subproject (when existing) would be possible.
This was part of the light-weight checkout proposal.

In contrast to contrib/workdir/git-new-workdir, this would allow
for (to be implemented) magic symlinks to stay intact when
moving the submodule directory around.

However, this can be added later.

Josef

PS: I wonder how long it takes to move the official KDE repository over to git ;-)
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 11:35 am

I didn't actually have a use case in mind when I asked it.
After a bit of thinking I could imagine a repo which is
used for integration exclusively (no compilation or looking
at the files at all).
-

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 11:52 am

Well, you also cannot *commit* to a bare repository, so it's a bit
pointless for integration reasons. You'd still have to commit all changes
somewhere else.

That said, it's definitely designed so that if you want to automate
tracking other people's bare repositories, you can do so: you'd just have
to *really* script it with something like

git update-index --cacheinfo 0160000 <sha1> <dirname>

(which is how you could create those commits to a bare repo too, so it's
not like this is really even any different)

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 11:57 am

Yes. Subprojects are push-only for storing and reference purposes.

Nice :)
-

To: Alex Riesen <raa.lkml@...>
Cc: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:16 pm

Well, the *really* nice thing about doing it like this is that you can
actually update subprojects without even having them be *local* to
where you do the superproject.

IOW, you could literally build up the superproject by saying that you want
to track "all git projects I care about" somewhere else, and do a series
of automated

git ls-remote sub-project-xyzzy tracking-branch-xyzzy | ...

and basically create the "superproject" without ever actually downloading
or populating the subprojects at all.

Then, if everything is set up correctly, you can basically use the
superproject as an "auto-mirror" - whenever you want to get all the
projects you care about, you just clone that superproject, and (once
you've taught "git clone" to fetch the subprojects, of course ;^) you'd
basically fetch them all from their appropriate locations - without ever
having the actual superproject have to even *really* care about it.

So basically, a superproject could be used as just a "gathering point",
without having to actually *contain* any of the subprojects. The actual
sources for subprojects may be on totally different servers. That's what
real distribution is all about.

Linus
-

To: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:13 am

This just makes sure that when we do a read_directory(), we check
that the filename fits in the buffer we allocated (with a bit of
slop)

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
dir.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/dir.c b/dir.c
index 7426fde..4f5a224 100644
--- a/dir.c
+++ b/dir.c
@@ -353,6 +353,9 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
!strcmp(de->d_name + 1, "git")))
continue;
len = strlen(de->d_name);
+ /* Ignore overly long pathnames! */
+ if (len + baselen + 8 > sizeof(fullname))
+ continue;
memcpy(fullname + baselen, de->d_name, len+1);
if (simplify_away(fullname, baselen + len, simplify))
continue;
--
1.5.1.110.g1e4c
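
The same check can be shown in isolation. This is a minimal sketch, assuming a caller-supplied buffer: append_name() and its parameters are hypothetical, and only the "len + baselen + slop > size" test mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Append name after the first baselen bytes of fullname, but only if the
 * result plus `slop` spare bytes fits in bufsize. Returns the new length,
 * or -1 if the name was skipped and the buffer left untouched.
 */
int append_name(char *fullname, size_t bufsize, size_t baselen,
		const char *name, size_t slop)
{
	size_t len = strlen(name);

	assert(baselen < bufsize);
	if (len + baselen + slop > bufsize)
		return -1;	/* ignore overly long pathnames */
	memcpy(fullname + baselen, name, len + 1);	/* len+1 copies the NUL too */
	return (int)(baselen + len);
}
```

The slop leaves room for whatever gets appended to the path later, so the comparison is deliberately stricter than "does the name itself fit".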

-

To: Junio C Hamano <junkio@...>, Git Mailing List <git@...>
Date: Tuesday, April 10, 2007 - 12:13 am

The diff helpers used to do the magic mode canonicalization and all the
other special mode handling by hand ("trust executable bit" and "has
symlink support" handling).

That's bogus. Use "ce_mode_from_stat()" that does this all for us.

This is also going to be required when we add support for links to other
git repositories.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
diff-lib.c | 15 +++------------
1 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/diff-lib.c b/diff-lib.c
index 5c5b05b..c6d1273 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -357,7 +357,7 @@ int run_diff_files(struct rev_info *revs, int silent_on_removed)
continue;
}
else
- dpath->mode = canon_mode(st.st_mode);
+ dpath->mode = ntohl(ce_mode_from_stat(ce, st.st_mode));

while (i < entries) {
struct cache_entry *nce = active_cache[i];
@@ -374,8 +374,7 @@ int run_diff_files(struct rev_info *revs, int silent_on_removed)
int mode = ntohl(nce->ce_mode);
num_compare_stages++;
hashcpy(dpath->parent[stage-2].sha1, nce->sha1);
- dpath->parent[stage-2].mode =
- canon_mode(mode);
+ dpath->parent[stage-2].mode = ntohl(ce_mode_from_stat(nce, mode));
dpath->parent[stage-2].status =
DIFF_STATUS_MODIFIED;
}
@@ -424,15 +423,7 @@ int run_diff_files(struct rev_info *revs, int silent_on_removed)
if (!changed && !revs->diffopt.find_copies_harder)
continue;
oldmode = ntohl(ce->ce_mode);
-
- newmode = canon_mode(st.st_mode);
- if (!trust_executable_bit &&
- S_ISREG(newmode) && S_ISREG(oldmode) &&
- ((newmode ^ oldmode) == 0111))
- newmode = oldmode;
- else if (!has_symlinks &&
- S_ISREG(newmode) && S_ISLNK(oldmode))
- newmode = oldmode;
+ newmode = ntohl(ce_mode_from_stat(ce, st.st_mode));
diff_change(&revs->diffopt, oldmode, newmode,
ce->sha1, (changed ? null_sh...
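
For reference, the special-case handling removed in the last hunk can be written as one small function. This sketch follows the deleted lines only; it is not the body of ce_mode_from_stat(), and struct mode_policy and effective_mode() are hypothetical names standing in for the trust_executable_bit and has_symlinks settings.

```c
#include <assert.h>
#include <stddef.h>

#define MODE_IFMT  0170000u	/* mask for the file-type bits */
#define MODE_IFREG 0100000u
#define MODE_IFLNK 0120000u

/* Hypothetical stand-in for the two configuration flags in the removed lines. */
struct mode_policy {
	int trust_executable_bit;
	int has_symlinks;
};

/*
 * Decide which mode to report for a path, given the mode recorded in the
 * index (oldmode) and the canonicalized mode from the filesystem (newmode).
 */
unsigned int effective_mode(unsigned int oldmode, unsigned int newmode,
			    const struct mode_policy *p)
{
	int new_is_reg = (newmode & MODE_IFMT) == MODE_IFREG;
	int old_is_reg = (oldmode & MODE_IFMT) == MODE_IFREG;
	int old_is_lnk = (oldmode & MODE_IFMT) == MODE_IFLNK;

	assert(p != NULL);
	/* Filesystem can't be trusted on x bits: ignore a pure 0111 difference. */
	if (!p->trust_executable_bit && new_is_reg && old_is_reg &&
	    ((newmode ^ oldmode) == 0111))
		return oldmode;
	/* No symlink support: a symlink shows up as a regular file on disk. */
	if (!p->has_symlinks && new_is_reg && old_is_lnk)
		return oldmode;
	return newmode;
}
```

Keeping this logic in one helper is the point of the cleanup: every diff code path then treats the executable bit and symlinks the same way.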
