And that's actually ignoring inode sizes and directory sizes (well, it
doesn't "ignore" directory sizes - it counts them - but if you compare it
to a straight packed format, it's still overhead).
Anyway, looks like it's about 2:1, not 3:1 like I claimed, but the point
being that blocking factors tend to be at least on the same order of
magnitude as just plain compression (which also tends to be in the 2:1
area for normal, fairly easily compressible, stuff).
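To make the blocking arithmetic concrete, here is a rough Python sketch. It assumes a 4096-byte filesystem block and, like the estimate above, ignores inode and directory overhead. The function names and the sample sizes are made up for illustration, not measured from any real repository.

```python
def on_disk_size(file_sizes, block_size=4096):
    """Bytes actually allocated when each file is rounded up to whole blocks."""
    # -(-a // b) is integer ceiling division.
    return sum(-(-size // block_size) * block_size for size in file_sizes)


def blocking_factor(file_sizes, block_size=4096):
    """Ratio of allocated bytes to payload bytes (1.0 means no waste)."""
    payload = sum(file_sizes)
    if payload == 0:
        raise ValueError("no payload bytes to compare against")
    return on_disk_size(file_sizes, block_size) / payload


# Hypothetical object sizes: three small files, one block each.
sizes = [1024, 2048, 3072]
print(on_disk_size(sizes), sum(sizes), blocking_factor(sizes))
```

With lots of small files the ratio drifts towards 2:1 or worse, which is the same ballpark as what plain compression buys you.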
The delta-packing obviously is much bigger for any project with real
history. In traditional setups (where you always delta-pack within one
thing, ie at the level of individual SCCS/RCS files), the delta packing
obviously _also_ avoids blocking issues, since it means that a thousand
revisions of the same file will all share the same inode.
So because git uses a whole-file model, it obviously makes the blocking
issues with its unpacked format _much_ higher than for any traditional
medium - no conglomeration of different versions of the file in the same
filesystem object. On the other hand, the packed format also tends to be
even _more_ efficient than a traditional one, so the end result of it all
is apparently a pretty big net win even in space consumption.
Side note: I realize that some people think the packs are ugly and
strange. They aren't linear versions of a file, and instead appear as a
fairly random "jumble". And they can't be incrementally re-packed, and you
have to generate a whole new pack-file (which can be incremental in
_content_, of course). So people think they are ugly.
I'd argue that they are beautiful. They are beautiful because they _don't_
contain history in themselves (the objects they contain encode the history
of course, but the pack-file itself does not).
And they are beautiful because we can use the exact same format for
streaming data over the network as for the database itself (that, of
course, was just about _the_ design consideration). Show me another system
that has exactly the same (not "similar", not "same concepts": _same_)
network protocol as its internal database.
And they are beautiful exactly because their lack of any internal
structure allows you to pack things by criteria _you_ care about, ie the
whole "sort things by recency" thing, so that commonly accessed data can
be packed at the head of the pack-file - exactly because the pack-file
doesn't have any internal structure of its own that you need to worry
about and that constrains your sorting.
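As a toy illustration of "pack by criteria you care about", the sketch below orders entries by recency unless the caller supplies some other key. `PackEntry`, its fields and `pack_order` are hypothetical names for this example only; this is not how git's packing code is actually structured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackEntry:
    name: str        # object name (hypothetical placeholder, not a real hash)
    last_used: int   # hypothetical recency stamp, larger means more recent
    size: int        # object size in bytes


def pack_order(entries, key=None):
    """Return entries in the order they would be written to the pack.

    Nothing about the container dictates an order, so any sort key works.
    The default puts the most recently used objects at the head.
    """
    if key is None:
        key = lambda e: -e.last_used
    return sorted(entries, key=key)


entries = [
    PackEntry("a", last_used=10, size=300),
    PackEntry("b", last_used=30, size=100),
    PackEntry("c", last_used=20, size=200),
]
print([e.name for e in pack_order(entries)])
print([e.name for e in pack_order(entries, key=lambda e: e.size)])
```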
The same thing is what allows you to delta any blob against any other
blob - without worrying about history or other random pack-file rules. You
can do packing purely by how well you want to pack, not by any secondary
constraints.
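A minimal sketch of "delta against whatever packs best", assuming a crude copy/insert cost model built on Python's `difflib.SequenceMatcher`. This is a stand-in chosen for illustration, not git's actual delta algorithm, and `delta_cost`, `best_base` and the sample blobs are all made up.

```python
from difflib import SequenceMatcher


def delta_cost(base: bytes, target: bytes) -> int:
    """Rough size of a copy/insert delta that rebuilds target from base."""
    cost = 0
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            cost += 1                # one "copy from base" instruction
        elif tag in ("replace", "insert"):
            cost += 1 + (j2 - j1)    # one "insert" instruction plus the literal bytes
        # "delete" is free: those base bytes are simply never copied
    return cost


def best_base(target: bytes, candidates: dict) -> str:
    """Pick the candidate blob that gives the cheapest delta, history ignored."""
    return min(candidates, key=lambda name: delta_cost(candidates[name], target))


# Hypothetical blobs: nothing says the base has to be an older version
# of the same file, only that it deltas well.
candidates = {
    "old_readme": b"hello world\n",
    "unrelated": b"int main(void) { return 0; }\n",
}
target = b"hello brave world\n"
print(best_base(target, candidates))
```

The only selection criterion here is the resulting size, which is the point: no rule about file names or ancestry gets in the way.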
And the "no incremental updates" may sound like a huge downside, but it's
all the same basic git logic: objects and filesystem contents are
immutable, and that allows us to avoid a lot of locking overhead. Locking
is _hard_. Locking is _inefficient_. And locking really really screws you
when you miss it.
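To show why immutability lets you skip locking, here is a simplified write-once object store in Python. It is a sketch under assumptions, not git's implementation: `store_blob` is a hypothetical helper, and the write-to-temp-then-rename approach is just one common way to make the final file appear atomically.

```python
import hashlib
import os
import tempfile
import zlib


def store_blob(objects_dir: str, content: bytes) -> str:
    """Store content under a name derived from its hash, and never rewrite it."""
    header = b"blob %d\0" % len(content)
    name = hashlib.sha1(header + content).hexdigest()
    subdir = os.path.join(objects_dir, name[:2])
    path = os.path.join(subdir, name[2:])
    if os.path.exists(path):
        # Same name means same content, so there is nothing to update.
        return name
    os.makedirs(subdir, exist_ok=True)
    # Write to a private temp file, then rename it into place. Readers
    # see either no object or the complete object, without taking a lock.
    fd, tmp = tempfile.mkstemp(dir=subdir)
    with os.fdopen(fd, "wb") as f:
        f.write(zlib.compress(header + content))
    os.replace(tmp, path)
    return name


with tempfile.TemporaryDirectory() as objects_dir:
    first = store_blob(objects_dir, b"hello\n")
    second = store_blob(objects_dir, b"hello\n")
    print(first == second)
```

If two writers race on the same object they both produce identical bytes, so whichever rename lands last changes nothing. That is why no lock is needed.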
So I'll happily say that pack-files are strange, and that you have to get
a bit used to the notion that they should be repacked "asynchronously".
But it's really a matter of "getting used to it", because once you do,
you'll see that it's actually an absolutely huge deal, and you'll learn to
love the bomb^H^H^H^Hpack-file.
Linus "pack-files rule" Torvalds