Re: Unresolved issues #2 (shallow clone again)

Previous thread: [PATCH] Fix for config file section parsing. by sean on Friday, May 5, 2006 - 6:49 am. (1 message)

Next thread: Re: [ANNOUNCE] Git wiki by Jakub Narebski on Friday, May 5, 2006 - 9:47 am. (1 message)
From: Jakub Narebski
Date: Friday, May 5, 2006 - 8:18 am

So we would get 'skin-deep clone' rather than 'shallow' one?

-- 
Jakub Narebski
Warsaw, Poland

-

From: Linus Torvalds
Date: Friday, May 5, 2006 - 8:59 am

Well, it's really shallow, but perhaps more importantly, I think it should 
be really easy, and have totally unambiguous semantics. Never any question 
of how far back to go, and I think we already really do have all the 
support logic for doing it.

Now, we don't actually expose the internal "no_walk" flag with a 
"--no-walk" command line argument parsing, but that's a one-liner.

There's another approach that might be a bit friendlier, which is again to 
walk only the objects of the WANT/HAVE things, but then _do_ walk the 
history for just commit objects. Something close to what I think the http 
fetch thing does if you pass it "-c -t". That too shouldn't require too 
much extra complexity, and it would mean that "git log" at least works.

Of course, that would require another slight difference to "rev-list.c", 
where we'd only recurse into trees of selected commit objects (ie we'd 
have to mark the HAVE/WANT commits specially, but it's not exactly 
complex either).

Of course, the complexity of _both_ of these approaches is really in the 
fsck stage, and all the crud you need to then do other things with these 
pared-down repos. For example, do you allow cloning? And do you just 
automatically notice that you're cloning a shallow repo, and only do a 
shallow clone. Etc etc..

		Linus
-

From: Martin Langhoff
Date: Friday, May 5, 2006 - 11:23 pm

Would it make sense to make all the shallow clone clone machinery walk
everything and trim only blob objects? In that case, all the machinery
that walks commits/trees would remain intact -- we only have to deal
with the case of not having blob objects, which affects less
codepaths.

It means that for a merge or checkout involving stuff we "don't have",
it's trivial to know we are missing, and so we can  attempt a fetch of
the missing objects or tell the user how to request them them before
retrying.


Definitely.

cheers,


martin
-

From: Junio C Hamano
Date: Saturday, May 6, 2006 - 12:10 am

Commit maybe, but is this based on a hard fact?  

Earlier Linus said something about "git log" working on
commit-only copy, but obviously you would want at least trees
for the path limiting part to work, so having commits and trees
would be handy, but my impression was that at least for deep
project like the kernel trees tend to be nonnegligible (a commit
consists of 18K paths and 1200 trees or something like that).

-

From: Martin Langhoff
Date: Saturday, May 6, 2006 - 11:08 pm

No hard facts here :( but I think it's reasonable to assume that the
trees delta/compress reasonably well, as a given commit will change
just a few entries in each tree.

I might try and hack a shallow local clone of the kernel and pack it
tightly to see what it yields.

cheers,



martin
-

From: Jeff King
Date: Sunday, May 7, 2006 - 12:56 am

A few hard facts (using Linus' linux-2.6 tree):
  - original packsize: 120996 kilobytes
  - unpacked: 233338 objects, 1417476 kilobytes
    This is an 11.7:1 compression ratio (of course, much of this is
    wasted space from the 4k block size in the filesystem)
  - There were 87915 total blob objects, of which 19321 were in the
    current tree. I removed all non-current blobs to produce a "shallow"
    tree.
  - The shallow tree unpacked: 164744 objects, 761960 kilobytes
    IOW, about half of the unpacked disk usage was old blobs.
  - Shallow commit/tree/tag objects packed (using 1.3.1
    git-pack-objects):
      Total 164744, written 164744 (delta 92322), reused 0 (delta 0)
      size: 108088
    The compression ratio here is only 7.0:1
  - Total savings by going shallow: 10.7%

So basically, trees and commits DON'T compress as well as historical
blobs (potentially because git-pack-objects isn't currently optimized
for this -- I haven't checked). As a result, we're saving only 10% by
going shallow instead of a potential 50%.

-Peff
-

From: Linus Torvalds
Date: Sunday, May 7, 2006 - 8:27 am

The biggest size savers from packing is (in rough order of relevance, if 
I recall the rough statistics I did):

 - avoiding block boundaries. 
 - delta packing of blobs
 - delta packing of trees
 - regular compression

The block boundaries are huge, we have tons of small objects, and that was 
one of the primary reasons for packing. I'd suspect that this is a 3:1 
factor for a lot of things for many "common" filesystem setups. You 
probably didn't even account for the size of inodes in your "du" setup.

And blobs with history generally delta very well (_much_ better than 
regular compression).

Trees should _delta_ very well, but they basically don't compress, 
especially after deltaing. The SHA1's are totally incompressible (in a 
tree they aren't even ASCII), and as a deta, the names won't compress much 
either because they are short.

Commits are fairly small, shouldn't delta all that much, and they don't 
even compress _that_ well either (they're normal text and often have some 
redundancy with the committer and author being the same, but they are 
short and have some fairly incompressible elements, so..)

The thing with trees in particular is that they are very common for the 
kernel (and probably not so much for many other projects). A single commit 
ends up quite commonly being just one commit object, one blob (that deltas 
really well), and three or four trees. Merges often have no new blobs at 
all, just several new trees and the commit object.

So a huge amount of the wins from packing come from the file _history_, 
the part that a shallow clone (on purpose) leaves behind. 

The regular compression will pick up a fair amount of slack with the 
blobs, but it's a much smaller factor than the delta compression for 
something that has a long history.

It's somewhat interesting to note that over the year that we've used git, 
the kernel pack-size hasn't even increased all that much. I forget exactly 
what it was when we started packing, but it was on the ...
From: Jeff King
Date: Sunday, May 7, 2006 - 9:24 pm

My numbers came from git-count-objects, which uses the st_blocks sum for
all objects. The actual du numbers showing space wasted by block
boundaries are:
  du -c ??: 1429216
  du -c --apparent-size ??: 792277
So it's about 45% wasted space.


As Linus indicated, that assumes a uniform distribution of file sizes
(and my numbers above show that it is, in fact, somewhat higher). FYI,
the mean and median of usage of the final 4K block in the linux-2.6
repository are 1309 and 912 bytes, respectively.

-Peff
-

From: Linus Torvalds
Date: Monday, May 8, 2006 - 8:32 am

And that's actually ignoring inode sizes and directory sizes (well, it 
doesn't "ignore" directory sizes - it counts them - but if you compare it 
to a straight packed format, it's still overhead).

Anyway, looks like it's about 2:1, not 3:1 like I claimed, but the point 
being that blocking factors tend to be at least on the same order of 
magnitude as just plain compression (which also tends to be in the 2:1 
area for normal, fairly easily compressible, stuff).

The delta-packing obviously is much bigger for any project with real 
history. In traditional setups (where you always delta-pack within one 
thing, ie at the level of individual SCCS/RCS files), the delta packing 
obviously _also_ avoids blocking issues, since it means that a thousand 
revisions of the same file will all share the same inode.

So because git uses a whole-file model, it obviously makes the blocking 
issues with its unpacked format _much_ higher than for any traditional 
medium - no conglomeration of different versions of the file in the same 
filesystem object. On the other hand, the packed format also tends to be 
even _more_ efficient than a traditional one, so the end result of it all 
is apparently a pretty big net win even in space consumption).

Side note: I realize that some people think the packs are ugly and 
strange. They aren't linear versions of a file, and instead appear as a 
fairly random "jumble". And they can't be incrementally re-packed, and you 
have to generate a whole new pack-file (which can be incremental in 
_content_, of course). So people think they are ugly.

I'd argue that they are beautiful. They are beautiful because they _don't_ 
contain history in themselves (the objects they contain encode the history 
of course, but the pack-file itself does not).

And they are beautiful because we can use the exact same format for 
streaming data over the network as for the database itself (that, of 
course, was just about _the_ design consideration). Show me another system ...
From: Theodore Tso
Date: Sunday, May 7, 2006 - 5:33 pm

If there are 233338 objects, then the average wasted space due to
internal fragmentation is 233338 * 2k, or 466676 kilobytes, or only
36% of the wasted space.  Most of the savings is probably coming from
the compression and delta packing.

						- Ted
-

From: Linus Torvalds
Date: Sunday, May 7, 2006 - 5:50 pm

That's not necessarily true.

That assumes a randomly distributed filesize. File sizes are _not_ random, 
and in particular if you have the distribution leaning towards <2kB being 
common, you can actually get >50% fragmentation.

Btw, I hit this when some people argued that the page size should be made 
64kB. The above (incorrect) logic implies that you waste 32kB on average 
per file. That's not true, if a large fraction of your files are small, in 
which case you may actually be wastign closer to 60kB on average from 
using a big page-size, because about half of the kernel files are actually 
smaller than 4kB (or something. I forget the exact statistics, I did them 
with a script at some point).

Anyway, with inode overhead and a lot of objects being just a couple of 
hundred bytes, I think I estimated at some point that you actually lost 
closer to 3kB per object.

Many of the objects actually end up being smaller than the inode they end 
up allocating ;(

			Linus
-

From: Theodore Tso
Date: Sunday, May 7, 2006 - 6:26 pm

I just ran the numbers on filesizes of a kernel tree I had handy,
which happened to be 2.6.16.11.  With no object files, git files,
etc. the average loss was 2351 bytes --- not that far away from the
average of 2048 bytes.  Granted, it may be there is more different
versions of small objects causing a skewing of the distributions of
git objects in the 2.6 tree, but I'm not familiar enough with the git
porcelain to be able to make it disgorge the sizes of the repository
to do the math.

						- Ted

-

From: Linus Torvalds
Date: Sunday, May 7, 2006 - 7:04 pm

Is that without compression?

git objects are compressed, and common types (trees) tend to be smaller 
than your normal C file.

So git objects tend to be _smaller_ than the regular files. By about 30%. 
In addition, the non-blob git objects themselves tend to be smaller still.

So for example, right now I have just 58 unpacked objects (I repack pretty 
often). But of those 58 objects, exactly _fifty_ are smaller than 2kB, and 
38 are smaller than 1kB. The median size is 771 bytes.

On master.kernel.org, I've not repacked as recently, so I've got 2268 
unpacked objects. But the median size there is 773 bytes, so it looks like 
the numbers are statistically pretty stable.

			Linus
-

From: Theodore Tso
Date: Sunday, May 7, 2006 - 7:24 pm

Yes, without compression.  So yes, that probably explains the
difference between your numbers and mine. 

That brings up an interesting question though --- why not skip
compressing files that are under 4k (or whatever the filesystem
blocksize happens to be) if they are unpacked?  It burns CPU time;
maybe not enough to be human-noticeable, but it's still not buying you
anything.

						- Ted
-

From: Linus Torvalds
Date: Sunday, May 7, 2006 - 7:42 pm

Well, other filesystems don't have 4kB issues. Reiser can do smaller 
things iirc, and you might obviously have a ext3 filesystem with a 1kB 
blocksize too. And with tails on FFS, you might have a filesystem with a 
8kB blocksize, but despite that it might lay out <1kB files well.

Anyway, packing makes all this basically a non-issue. There are no block 
boundaries in a pack-file, and you only use a single inode. And you'd 
obviously want to pack for other reasons anyway (ie the delta compression 
will makea huge difference over time).

		Linus
-

From: Sergey Vlasov
Date: Sunday, May 7, 2006 - 1:01 am

For linux v2.6.16:

7,3M commits-b41b04a36afebdba3b70b74f419fc7d97249bd7f.pack
 24M commits_trees-8397f1c2a885527acd07e2caa8c95df626451493.pack
 97M full-c7b2747a674ff55cb4a59dabebe419f191e360df.pack

For comparizon, a single version in packed form:

 51M v2.6.12-rc2-4f3526b6815eb63da6c43ed85be1494bb776e2c5.pack

Made with

git-rev-list v2.6.16 | git-pack-objects commits
git-rev-list --objects --no-blobs v2.6.16 | git-pack-objects commits_trees
git-rev-list --objects v2.6.16 | git-pack-objects full

and this hack to git-rev-list:

diff --git a/revision.c b/revision.c
index f2a9f25..b5a929e 100644
--- a/revision.c
+++ b/revision.c
@@ -636,6 +636,10 @@ int setup_revisions(int argc, const char
 				revs->blob_objects =3D 1;
 				continue;
 			}
+			if (!strcmp(arg, "--no-blobs")) {
+				revs->blob_objects =3D 0;
+				continue;
+			}
 			if (!strcmp(arg, "--objects-edge")) {
 				revs->tag_objects =3D 1;
 				revs->tree_objects =3D 1;

So trees are definitely not lightweight, and commits are rather large
too.
From: Martin Langhoff
Date: Sunday, May 7, 2006 - 4:27 pm

With this pack arrangement, do you get any noticeable difference in
walking commits? How about walking commits+trees with git-log <path> ?

I wonder whether segregating packs by object type would make things better...

cheers,



martin
-

From: Junio C Hamano
Date: Sunday, May 7, 2006 - 4:35 pm

It shouldn't.  The existing packfile is designed to make "git
log" very efficient, by making it cheap to look only at the
commit message and ancestry information.

The objects are sorted first by type in the pack with the
existing code already, and commits come first.  Try this.

        git repack -a -d
        git show-index <.git/objects/pack/pack-*.idx |
        sort -n |
        while read offset objectname
        do
                git cat-file -t "$objectname"
        done

-

From: Martin Langhoff
Date: Sunday, May 7, 2006 - 4:44 pm

Thanks for the explanation! My ignorance in the pre-coffee stage of
Monday is... exemplary. I should know better than assume that there
are trivial optimizations for git that I can come up with ;-)

I guess I should get more into the pack stuff and educate myself --
everyone's having fun with it... grumble, will have to brush up my C
to get to play.

cheers,


martin
-

Previous thread: [PATCH] Fix for config file section parsing. by sean on Friday, May 5, 2006 - 6:49 am. (1 message)

Next thread: Re: [ANNOUNCE] Git wiki by Jakub Narebski on Friday, May 5, 2006 - 9:47 am. (1 message)