Btw, just in case you don't understand _why_ this is true, the fact is, in
a git repository, quite fundamentally, because we don't have "backlinks"
at any stage at all, we don't know - and fundamentally _cannot_ know -
whether we're going to see the same object in the future.
So operations like "git-rev-list --objects" (or, these days, more commonly
anything that just does the equivalent of that internally using the
library interfaces - ie "git pack-objects" and friends) VERY FUNDAMENTALLY
have to hold on to the object flags for the whole lifetime of the whole
operation.
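
To make that concrete, here is a toy sketch in Python (not git's actual code; the graph, the object names and the walk() helper are all made up for illustration). Objects only know what they point _to_, so the only way to avoid visiting a shared object twice is to remember everything already seen until the walk is over:

```python
# Hypothetical toy object graph: each object lists only the objects it
# points TO. Nothing records who points AT an object (no backlinks).
GRAPH = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],          # blob_a is shared by both trees
    "blob_a": [],
    "blob_b": [],
}

def walk(graph, tips):
    """Visit every object reachable from tips exactly once."""
    seen = set()      # plays the role of the per-object "seen" flag
    order = []
    stack = list(tips)
    while stack:
        name = stack.pop()
        if name in seen:
            # We could not have known in advance that this object
            # would show up again; only the remembered flag tells us.
            continue
        seen.add(name)
        order.append(name)
        stack.extend(graph[name])
    return order, seen

order, seen = walk(GRAPH, ["commit2"])
print(len(seen), "objects remembered for", len(GRAPH), "objects in the graph")
```

Note that nothing can be dropped from "seen" early: at no point in the walk do we know that blob_a won't be reached again through some tree we haven't looked at yet.
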
And you should realize that this is really really fundamental. You can't
fix it with "smarter memory management". You can't fix it with "garbage
collection". This is _not_ a result of the fact that we use C and malloc,
and we don't free those objects, like some people sometimes seem to
believe.
So garbage collection will never help this kind of situation. It flows
_directly_ from the fact that our objects are immutable: because they are
immutable, they don't have any backpointers, because we cannot (and must
not) add backpointers to an old existing object when a new object is
created that points to it.
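
One way to see why, as a small Python sketch: a git object is named by the SHA-1 of a short header plus its content, so any change to the content gives you a different object. The "referenced-by" line below is a purely hypothetical back-pointer format, invented for the example:

```python
import hashlib

def object_id(kind, body):
    """Name an object the way git does: SHA-1 over '<type> <size>\\0' + content."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

blob = b"hello\n"
old_id = object_id("blob", blob)

# Hypothetical: try to record "who points at me" inside the old object.
with_backlink = blob + b"referenced-by: some-new-tree\n"
new_id = object_id("blob", with_backlink)

# The "same" object with a back-pointer is simply a different object,
# and everything that referred to old_id still refers to old_id.
print(old_id != new_id)
```
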
So this really isn't a memory management issue. You could somewhat work
around it by adding a "caching layer" on top of git, and allow that
caching layer to modify its cache of old objects (so that they can
contain back-pointers), but for 99% of all users that would actually make
performance MUCH WORSE, and it would also be a serious coherency problem
(one of the things that immutable objects give you is that
there are basically never any race conditions, while a "caching layer"
like this would have some serious issues with serialization).
So: the very fundamental nature and choices that were made in git also
mean that when you have something like git-pack-objects that wants to
walk the whole repo, you will end up with something that remembers EVERY
SINGLE OBJECT it walked.
And while I've worked very hard to make the memory footprint of individual
objects as small as possible, and this means that this all works fine even
for fairly large databases (especially since very few operations actually
do this "traverse the whole friggin tree" thing), it does mean that
there's a very fundamental limit to scalability. You can't just make a
whole repository a hundred times bigger - because the operations that
traverse the whole thing will require a hundred times more memory!
Now, in "real" projects, this is not a problem. I can pretty much
_guarantee_ that memory sizes and hardware will grow faster than projects
grow. I'm not AT ALL worried about the fact that in ten years, the linux
kernel repository will likely be two or three times the size it is now.
Because I'm absolutely convinced that in ten years, the machines we have
now will be obsolete.
So on any "individual project" basis, the fact that memory requirements
scale roughly as O(n) in the total repository size is simply not a
problem. In fact, O(n) is pretty damn good, especially since the constant
is pretty small (basically 28 bytes per object - and 20 of those bytes
are the SHA1 that you simply cannot avoid).
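
As a back-of-the-envelope sketch of that O(n) cost in Python (the 28 bytes per object is the figure above; the object counts are made-up round numbers, not measurements of any real repository):

```python
BYTES_PER_OBJECT = 28   # figure quoted above, 20 of which are the SHA1

def walk_memory(num_objects):
    """Rough memory needed to remember every object in a full walk."""
    return BYTES_PER_OBJECT * num_objects

# Hypothetical repository sizes, purely for illustration.
small = walk_memory(1_000_000)        # one million objects
huge = walk_memory(100 * 1_000_000)   # the same repo made 100x bigger

print(small / 1e6, "MB vs", huge / 1e6, "MB")
```

A hundred times the objects is a hundred times the memory, no more and no less.
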
But it does mean that supermodules really should NOT be so seamless that
doing a "git clone" on a supermodule does one _large_ clone. Because it's
simply going to be better to:
 - when you clone the supermodule, track the commits you need on all
   submodules (this _may_ be a reason in itself for the "link" object,
   just so that you can traverse the supermodule object dependencies and
   know what subobject you are looking at even _without_ having to look at
   the path you got there from)
 - clone submodules one-by-one, using the list of objects you gathered.
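
The two steps above could be sketched roughly like this in Python. Everything here is hypothetical: the SUPER_TREE layout, the "link" entry format, and clone_one() are stand-ins made up for the example, not anything git implements:

```python
# Hypothetical supermodule tree: path -> ("blob", id) or
# ("link", submodule name, commit needed in that submodule).
SUPER_TREE = {
    "Makefile": ("blob", "b1"),
    "kernel": ("link", "linux", "c-101"),
    "libc": ("link", "glibc", "c-202"),
}

def collect_links(tree):
    """Step 1: walk the supermodule and note the commit each submodule needs."""
    return [
        (entry[1], entry[2])
        for _path, entry in sorted(tree.items())
        if entry[0] == "link"
    ]

def clone_one(name, commit):
    """Stand-in for a real per-submodule clone; only reports what it would do."""
    return f"clone {name} at {commit}"

def clone_all(tree):
    """Step 2: clone the submodules one at a time, using the gathered list."""
    return [clone_one(name, commit) for name, commit in collect_links(tree)]

for line in clone_all(SUPER_TREE):
    print(line)
```

The point of the "link" entry is that step 1 knows which submodule a commit belongs to from the entry itself, without having to look at the path.
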
Maybe there are other solutions, but quite frankly, I doubt it. Yes,
you'll end up "traversing" exactly as many objects either way, but the
"clone submodules one by one" is going to be a _hell_ of a lot more
memory-efficient, and quite frankly, "memory usage" == "performance"
under many loads (notably, any load that uses too much memory will _suck_
performance-wise, either because of swapping or simply because it will
throw out caches that "many small invocations" would not have thrown out).
So I guarantee that it's going to be better to do five clones of five
small repositories over one clone of one big one. If only because you need
less memory to do the five smaller clones.
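
In rough numbers, as a Python sketch: one big clone has to remember the _sum_ of all the objects at once, while one-by-one clones only ever need the _largest_ single repository in memory. The five object counts below are invented for the example, and the 28 bytes per object is the per-object cost mentioned earlier:

```python
BYTES_PER_OBJECT = 28   # per-object cost of remembering a walked object

def peak_one_big_clone(object_counts):
    """One walk over everything: all objects are remembered at once."""
    return BYTES_PER_OBJECT * sum(object_counts)

def peak_one_by_one(object_counts):
    """Separate walks: memory is released between clones, so only the
    biggest single repository determines the peak."""
    return BYTES_PER_OBJECT * max(object_counts)

# Hypothetical object counts for five small repositories.
repos = [200_000, 150_000, 300_000, 250_000, 100_000]

print(peak_one_big_clone(repos), "bytes vs", peak_one_by_one(repos), "bytes")
```

The total number of objects walked is the same either way; only the peak differs.
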
Linus