(I wrote most of this a couple of days ago, so it's not at the tip of the conversational tree, so to speak. But it's effectively a response to Linus's "what do you want to do with submodules" question, with some thoughts on implementation. Sorry it's so long; like Blaise Pascal, "I would have written a shorter letter, but I did not have the time.")

The supermodule concept, implemented right, could really improve cooperation among embedded platform integrators, boutique distro publishers, and other editorial contributors to sprawling metaprojects who don't want to run kernel.org-scale mirrors. To make this work, you need sparse repositories (conserving resources when fetching, by omitting the bulk of currently un-needed submodules that can reliably be obtained later from elsewhere) and shallow cloning (conserving resources when publishing, by referring cloners to a third-party repository for universally available content).

For instance, it would be a wonderful thing if the pile-o-patches nightmare that is PTXdist (and crosstool and buildtool and every other approach I have seen for ongoing maintenance of embedded toolchains and userlands) were obsoleted by a git supermodule. Its submodules would mostly track external projects, but would also logically contain the fix-up patches worked out during platform integration, checked in to branches anchored at each upstream release point. The supermodule would contain all of the build automation, log auditing, and remote unit testing stuff, as well as the metadata for each submodule involved in this platform build cycle.

At a content level, the sparsely populated / shallowly published supermodule wouldn't be much different from today's PTXdist. But the pay-off comes when you merge forward to a new release of some base component (compiler, library, etc.)
and discover that some of your fix-ups have been adopted or obsoleted upstream, and new fix-ups are needed for components that depend on the updated bit, and the set of configurables has changed (for which you need to compensate in the meta-configurator). Instead of piling up versioned patch directories, you commit fix-ups to the submodules, which other integration branches can ignore (if they aren't affected), merge, or cherry-pick.

As I understand it, in today's git, every content object is a patch to the _data_ of one and only one git repository, containing the label of the preceding _data_ state plus a diff of file contents and attributes. Assuming this model is retained, any clean state of a "leaf" module (one with no submodules) can be reached by replaying a series of patches, starting from the repository's root node (an empty directory with the hopefully unique label generated by init-db). The label (SHA1) of the last patch is therefore a perfectly good label for this _data_ state.

If all we were trying to do with supermodules was to capture and track various states of the submodules' data, we could extend the format of content objects to include "state X of submodule with init-db label Y". That would have the effect of capturing submodule states as _data_ in non-"leaf" modules. We would have to help cloners find a place from which to pull these states, of course; and it's easy to get sidetracked onto that part of the problem. But that's not where the bang for the buck is in supermodules.

The whole model of distributed supermodules, with references to slightly diverging submodules whose content should mostly be fetched from external sources, smells to me just like LVM. The external sources (like an LVM volume of which you have taken a "snapshot") make up the bulk of the content pool. They also give you a window into developments on the submodule's own branches (like being able to peek forward and merge changes from the original volume).
The supermodule (the snapshot volume) provides most of the interesting refs (submodule commits referenced by supermodule tags and branch heads), along with enough "journaled" content to replay forward from some checkpoint guaranteed to be available in each external source to any of these refs.

The implication here is that submodule states are not just SHA1 labels to be embedded within supermodule data diffs. One ought to be able to clone a supermodule without immediately cloning full copies of any of its submodules. This ought to populate the clone's content database with all of the quanta of submodule content that aren't guaranteed to be available from any not-too-stale submodule mirror.

When cloning, you don't want to have to inspect every supermodule state for submodule states that are outside the global subset. So the supermodule needs to maintain a set of supplemental refs from which all referenced submodule states can be reached. This allows you to traverse the portion of the pool of submodule content that can't be reached from true submodule branch heads.

On 12/1/06, Linus Torvalds <torvalds@osdl.org> wrote:

This is not a defect; it's a virtue. It's important for every commit to the supermodule to contain the information of which submodule branches you're currently on and how far along them you've crawled. Any particular supermodule commit point is likely to reflect an integration milestone visible only to the person working at the supermodule level. No content object should ever cross a submodule boundary, because then you wouldn't be able to apply it to the submodule in isolation (or in another supermodule state) or identify it when it is applied upstream and propagates back to you in a pull.
But the supermodule can also contain supplemental refs (heads and tags) that don't exist in the submodule (and shouldn't necessarily be pushed to it); the commits they refer to are localized to the submodule but may not be reachable from any of the submodule's branch heads.

There is an opportunity for useful deep integration here. The same algorithm that does reachability analysis for "git prune" can dig from supermodule down to submodules, copying objects into the supermodule database until it hits a commit that is advertised as "global" by the submodule. "git clone" of the supermodule can then pull the bulk of the submodules (a superset of the "global" subset) from (a mirror of) the canonical place for each, and use the supermodule object database as an alternate source for commits that don't exist in the "canonical" submodule.

As simple as possible; but no simpler. The "alternates" / "git clone --reference" model is already almost powerful enough for the supermodule to contain a "journal" of submodule commits that haven't yet been retired to the canonical subset (guaranteed present in each mirror). The only difference is that the supermodule should be considered a "weak alternates" source. Commit objects in the supermodule's database should be visible to submodule-level operations (so that commits which are accepted upstream get flowed in nicely during "git pull"). But if a commit becomes reachable from a ref that is really in the submodule (not just one of the supermodule's "supplemental refs", which should _not_ be visible to submodule operations), then it should be copied into the submodule's object database. (The refs internal to the submodule should retain their integrity even if the supermodule is inaccessible.) The existing "strong alternates" mechanism should be reserved for repos which are at least as public and persistent as the submodule, and supermodules don't qualify (e.g., Linus's transmeta scenario).
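
To make the prune-style walk above a bit more concrete, here is a rough sketch in Python over a toy in-memory commit graph. It is only an illustration of the idea, not how git does (or would do) it: `journal_commits`, the `parents` table, and the `is_global` predicate are all names I made up, and a real implementation would read commit objects from the object database rather than from a dict.

```python
from collections import deque


def journal_commits(supplemental_refs, parents, is_global):
    """Collect the submodule commits the supermodule has to carry itself.

    supplemental_refs: commit ids referenced by supermodule heads and tags
    parents:           dict mapping a commit id to its parent commit ids
    is_global:         predicate; True if any mirror is assumed to have it
    (All three are hypothetical stand-ins for illustration.)
    """
    journal = set()
    queue = deque(supplemental_refs)
    while queue:
        commit = queue.popleft()
        if commit in journal or is_global(commit):
            # Stop digging: already copied, or obtainable from any mirror.
            continue
        journal.add(commit)
        queue.extend(parents.get(commit, []))
    return journal


# Toy history: a <- b <- c upstream, with local fix-ups x <- y on top of b.
parents = {"a": [], "b": ["a"], "c": ["b"], "x": ["b"], "y": ["x"]}
global_set = {"a", "b"}  # assumption: what every mirror is guaranteed to hold

print(sorted(journal_commits(["y", "c"], parents, global_set.__contains__)))
```

The set that comes back is the "journal": everything between the supplemental refs and the nearest globally available checkpoint. Everything behind that checkpoint is left for the clone to fetch from whichever mirror it likes.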
I think "global resource, local provider" is the way to go, with each provider advertising what checkpoints of what resources it can supply. When I clone or pull, I should be able to consult a local mapping of submodule URIs to "mirrors" (which may well be local repositories containing content and branches that aren't in the "official" upstream). The only thing that may need "global" agreement is the boundaries of the "global" subset for each submodule, i.e., the set of commit objects that can reliably be obtained from any mirror of the "official" upstream repository. That doesn't need to be terribly clever; "at least three days old on a globally published branch" would probably be a perfectly good heuristic.

I think the implication of "submodule objects" is that supermodule diffs would say "roll submodule X from commit-id A to commit-id B". I don't think that would work very well for pulls/merges in the sparsely populated scenario, because you want to be able to pull the non-canonical subset of the individual diffs between states A and B into the supermodule's object pool. When you decide later to flesh out submodule X, you should only have to clone some canonical mirror and then fast-forward to state B using objects you already have in the supermodule pool.

The merge case is even clearer. Suppose I pull updates from two remote branches of the supermodule onto my master branch. Each remote branch has added the same submodule, cloned from third-party repositories whose clone history goes back to the same origin. (The example I have in mind is when some project switches to git from some other SCM, and the maintainers of the remote branches port their integration patches over from their git-svn tracker submodule to a clone of upstream's new git repo.) I should be able to postpone the merge effort, come back later and clone the upstream repo, then merge the non-canonical commits that were pulled earlier.
I might want to decide at supermodule pull time to postpone pulling the bodies of the submodule commits; but I want the full sequence of submodule commit IDs in the supermodule commit object.

So it's not so much the supermodule _state_ that has a hierarchical structure; it's the supermodule _diffs_ and _object_pool_ that become hierarchical.

I think the only global-to-local-namespace mapping applies to the different labels for the "empty repository" state generated at init-db time. Given the init-db SHA1 of the linux kernel repository, I should be able to choose any mirror or clone of that repository as a source for objects in its "global set". I expect this provider not to scribble on globally published branches, but that isn't even all that critical; anything outside the canonical set is kept in the supermodule's object pool, so I can always blow the submodule away and regenerate it from a different mirror.

Sure. But all you really need from the canonical place is its init-db SHA1 (permanent) and its list of globally published branches (monotonically expanding). A URL for it is a convenient shorthand but doesn't have to be persistent.

Cheers,
- Michael
