So we would get 'skin-deep clone' rather than 'shallow' one? -- Jakub Narebski Warsaw, Poland -
Well, it's really shallow, but perhaps more importantly, I think it should be really easy, and have totally unambiguous semantics. Never any question of how far back to go, and I think we already really do have all the support logic for doing it. Now, we don't actually expose the internal "no_walk" flag with a "--no-walk" command line argument parsing, but that's a one-liner. There's another approach that might be a bit friendlier, which is again to walk only the objects of the WANT/HAVE things, but then _do_ walk the history for just commit objects. Something close to what I think the http fetch thing does if you pass it "-c -t". That too shouldn't require too much extra complexity, and it would mean that "git log" at least works. Of course, that would require another slight difference to "rev-list.c", where we'd only recurse into trees of selected commit objects (ie we'd have to mark the HAVE/WANT commits specially, but it's not exactly complex either). Of course, the complexity of _both_ of these approaches is really in the fsck stage, and all the crud you need to then do other things with these pared-down repos. For example, do you allow cloning? And do you just automatically notice that you're cloning a shallow repo, and only do a shallow clone. Etc etc.. Linus -
Would it make sense to make all the shallow clone clone machinery walk everything and trim only blob objects? In that case, all the machinery that walks commits/trees would remain intact -- we only have to deal with the case of not having blob objects, which affects less codepaths. It means that for a merge or checkout involving stuff we "don't have", it's trivial to know we are missing, and so we can attempt a fetch of the missing objects or tell the user how to request them them before retrying. Definitely. cheers, martin -
Commit maybe, but is this based on a hard fact? Earlier Linus said something about "git log" working on commit-only copy, but obviously you would want at least trees for the path limiting part to work, so having commits and trees would be handy, but my impression was that at least for deep project like the kernel trees tend to be nonnegligible (a commit consists of 18K paths and 1200 trees or something like that). -
No hard facts here :( but I think it's reasonable to assume that the trees delta/compress reasonably well, as a given commit will change just a few entries in each tree. I might try and hack a shallow local clone of the kernel and pack it tightly to see what it yields. cheers, martin -
A few hard facts (using Linus' linux-2.6 tree):
- original packsize: 120996 kilobytes
- unpacked: 233338 objects, 1417476 kilobytes
This is an 11.7:1 compression ratio (of course, much of this is
wasted space from the 4k block size in the filesystem)
- There were 87915 total blob objects, of which 19321 were in the
current tree. I removed all non-current blobs to produce a "shallow"
tree.
- The shallow tree unpacked: 164744 objects, 761960 kilobytes
IOW, about half of the unpacked disk usage was old blobs.
- Shallow commit/tree/tag objects packed (using 1.3.1
git-pack-objects):
Total 164744, written 164744 (delta 92322), reused 0 (delta 0)
size: 108088
The compression ratio here is only 7.0:1
- Total savings by going shallow: 10.7%
So basically, trees and commits DON'T compress as well as historical
blobs (potentially because git-pack-objects isn't currently optimized
for this -- I haven't checked). As a result, we're saving only 10% by
going shallow instead of a potential 50%.
-Peff
-
The biggest size savers from packing is (in rough order of relevance, if I recall the rough statistics I did): - avoiding block boundaries. - delta packing of blobs - delta packing of trees - regular compression The block boundaries are huge, we have tons of small objects, and that was one of the primary reasons for packing. I'd suspect that this is a 3:1 factor for a lot of things for many "common" filesystem setups. You probably didn't even account for the size of inodes in your "du" setup. And blobs with history generally delta very well (_much_ better than regular compression). Trees should _delta_ very well, but they basically don't compress, especially after deltaing. The SHA1's are totally incompressible (in a tree they aren't even ASCII), and as a deta, the names won't compress much either because they are short. Commits are fairly small, shouldn't delta all that much, and they don't even compress _that_ well either (they're normal text and often have some redundancy with the committer and author being the same, but they are short and have some fairly incompressible elements, so..) The thing with trees in particular is that they are very common for the kernel (and probably not so much for many other projects). A single commit ends up quite commonly being just one commit object, one blob (that deltas really well), and three or four trees. Merges often have no new blobs at all, just several new trees and the commit object. So a huge amount of the wins from packing come from the file _history_, the part that a shallow clone (on purpose) leaves behind. The regular compression will pick up a fair amount of slack with the blobs, but it's a much smaller factor than the delta compression for something that has a long history. It's somewhat interesting to note that over the year that we've used git, the kernel pack-size hasn't even increased all that much. I forget exactly what it was when we started packing, but it was on the ...
My numbers came from git-count-objects, which uses the st_blocks sum for all objects. The actual du numbers showing space wasted by block boundaries are: du -c ??: 1429216 du -c --apparent-size ??: 792277 So it's about 45% wasted space. As Linus indicated, that assumes a uniform distribution of file sizes (and my numbers above show that it is, in fact, somewhat higher). FYI, the mean and median of usage of the final 4K block in the linux-2.6 repository are 1309 and 912 bytes, respectively. -Peff -
And that's actually ignoring inode sizes and directory sizes (well, it doesn't "ignore" directory sizes - it counts them - but if you compare it to a straight packed format, it's still overhead). Anyway, looks like it's about 2:1, not 3:1 like I claimed, but the point being that blocking factors tend to be at least on the same order of magnitude as just plain compression (which also tends to be in the 2:1 area for normal, fairly easily compressible, stuff). The delta-packing obviously is much bigger for any project with real history. In traditional setups (where you always delta-pack within one thing, ie at the level of individual SCCS/RCS files), the delta packing obviously _also_ avoids blocking issues, since it means that a thousand revisions of the same file will all share the same inode. So because git uses a whole-file model, it obviously makes the blocking issues with its unpacked format _much_ higher than for any traditional medium - no conglomeration of different versions of the file in the same filesystem object. On the other hand, the packed format also tends to be even _more_ efficient than a traditional one, so the end result of it all is apparently a pretty big net win even in space consumption). Side note: I realize that some people think the packs are ugly and strange. They aren't linear versions of a file, and instead appear as a fairly random "jumble". And they can't be incrementally re-packed, and you have to generate a whole new pack-file (which can be incremental in _content_, of course). So people think they are ugly. I'd argue that they are beautiful. They are beautiful because they _don't_ contain history in themselves (the objects they contain encode the history of course, but the pack-file itself does not). And they are beautiful because we can use the exact same format for streaming data over the network as for the database itself (that, of course, was just about _the_ design consideration). Show me another system ...
If there are 233338 objects, then the average wasted space due to internal fragmentation is 233338 * 2k, or 466676 kilobytes, or only 36% of the wasted space. Most of the savings is probably coming from the compression and delta packing. - Ted -
That's not necessarily true. That assumes a randomly distributed filesize. File sizes are _not_ random, and in particular if you have the distribution leaning towards <2kB being common, you can actually get >50% fragmentation. Btw, I hit this when some people argued that the page size should be made 64kB. The above (incorrect) logic implies that you waste 32kB on average per file. That's not true, if a large fraction of your files are small, in which case you may actually be wastign closer to 60kB on average from using a big page-size, because about half of the kernel files are actually smaller than 4kB (or something. I forget the exact statistics, I did them with a script at some point). Anyway, with inode overhead and a lot of objects being just a couple of hundred bytes, I think I estimated at some point that you actually lost closer to 3kB per object. Many of the objects actually end up being smaller than the inode they end up allocating ;( Linus -
I just ran the numbers on filesizes of a kernel tree I had handy, which happened to be 2.6.16.11. With no object files, git files, etc. the average loss was 2351 bytes --- not that far away from the average of 2048 bytes. Granted, it may be there is more different versions of small objects causing a skewing of the distributions of git objects in the 2.6 tree, but I'm not familiar enough with the git porcelain to be able to make it disgorge the sizes of the repository to do the math. - Ted -
Is that without compression? git objects are compressed, and common types (trees) tend to be smaller than your normal C file. So git objects tend to be _smaller_ than the regular files. By about 30%. In addition, the non-blob git objects themselves tend to be smaller still. So for example, right now I have just 58 unpacked objects (I repack pretty often). But of those 58 objects, exactly _fifty_ are smaller than 2kB, and 38 are smaller than 1kB. The median size is 771 bytes. On master.kernel.org, I've not repacked as recently, so I've got 2268 unpacked objects. But the median size there is 773 bytes, so it looks like the numbers are statistically pretty stable. Linus -
Yes, without compression. So yes, that probably explains the difference between your numbers and mine. That brings up an interesting question though --- why not skip compressing files that are under 4k (or whatever the filesystem blocksize happens to be) if they are unpacked? It burns CPU time; maybe not enough to be human-noticeable, but it's still not buying you anything. - Ted -
Well, other filesystems don't have 4kB issues. Reiser can do smaller things iirc, and you might obviously have a ext3 filesystem with a 1kB blocksize too. And with tails on FFS, you might have a filesystem with a 8kB blocksize, but despite that it might lay out <1kB files well. Anyway, packing makes all this basically a non-issue. There are no block boundaries in a pack-file, and you only use a single inode. And you'd obviously want to pack for other reasons anyway (ie the delta compression will makea huge difference over time). Linus -
For linux v2.6.16:
7,3M commits-b41b04a36afebdba3b70b74f419fc7d97249bd7f.pack
24M commits_trees-8397f1c2a885527acd07e2caa8c95df626451493.pack
97M full-c7b2747a674ff55cb4a59dabebe419f191e360df.pack
For comparizon, a single version in packed form:
51M v2.6.12-rc2-4f3526b6815eb63da6c43ed85be1494bb776e2c5.pack
Made with
git-rev-list v2.6.16 | git-pack-objects commits
git-rev-list --objects --no-blobs v2.6.16 | git-pack-objects commits_trees
git-rev-list --objects v2.6.16 | git-pack-objects full
and this hack to git-rev-list:
diff --git a/revision.c b/revision.c
index f2a9f25..b5a929e 100644
--- a/revision.c
+++ b/revision.c
@@ -636,6 +636,10 @@ int setup_revisions(int argc, const char
revs->blob_objects =3D 1;
continue;
}
+ if (!strcmp(arg, "--no-blobs")) {
+ revs->blob_objects =3D 0;
+ continue;
+ }
if (!strcmp(arg, "--objects-edge")) {
revs->tag_objects =3D 1;
revs->tree_objects =3D 1;
So trees are definitely not lightweight, and commits are rather large
too.
With this pack arrangement, do you get any noticeable difference in walking commits? How about walking commits+trees with git-log <path> ? I wonder whether segregating packs by object type would make things better... cheers, martin -
It shouldn't. The existing packfile is designed to make "git
log" very efficient, by making it cheap to look only at the
commit message and ancestry information.
The objects are sorted first by type in the pack with the
existing code already, and commits come first. Try this.
git repack -a -d
git show-index <.git/objects/pack/pack-*.idx |
sort -n |
while read offset objectname
do
git cat-file -t "$objectname"
done
-
Thanks for the explanation! My ignorance in the pre-coffee stage of Monday is... exemplary. I should know better than assume that there are trivial optimizations for git that I can come up with ;-) I guess I should get more into the pack stuff and educate myself -- everyone's having fun with it... grumble, will have to brush up my C to get to play. cheers, martin -
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linu |
