On Thu, Apr 01, 2010 at 08:01:59PM -0400, Jeff King wrote:
Since this is a space-time tradeoff, I thought it would make sense to
show some size numbers as a followup.
To get a sense of the size of the repo (it's almost all photos and
videos):
[size of the repo, already fully packed]
$ du -sh .git/objects
4.0G .git/objects
[the number of unique blobs through all history; most are binary media]
$ git log --raw --no-abbrev | awk '/^:/ {print $3 "\n" $4}' | sort -u | wc -l
10605
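For anyone who wants to reproduce that count, here is a minimal sketch
of the same pipeline wrapped in a function. The name count_unique_blobs
is made up for illustration. Unlike the one-liner above, it drops the
all-zero id that --raw prints for the missing side of an add or delete,
so its result can be one lower.

```shell
# Hypothetical helper (not part of git): count the unique blob ids found in
# `git log --raw --no-abbrev` output read from stdin.
#
# Raw diff lines start with ':'; fields 3 and 4 are the old and new blob ids.
# The all-zero id (missing side of an add or delete) is filtered out, and the
# final awk counts lines so the result carries no padding.
count_unique_blobs() {
    awk '/^:/ { print $3; print $4 }' |
        grep -v '^0*$' |
        sort -u |
        awk 'END { print NR }'
}

# Usage: git log --raw --no-abbrev | count_unique_blobs
```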
In comparison, the metadata for a given file (produced by the textconv)
is about 200 bytes of text.
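For context, a setup along these lines would produce the notes ref used
in the commands below. This is only a sketch: the driver name mfo is
inferred from the ref name notes/textconv/mfo, the file patterns and
the dump-media-metadata command are placeholders not taken from this
thread, and cachetextconv is the option name used by released Git
versions, which may not match the patches under discussion.

```
# .gitattributes (patterns are assumptions)
*.jpg diff=mfo
*.avi diff=mfo

# .git/config
[diff "mfo"]
	# hypothetical command that prints a file's metadata as text
	textconv = dump-media-metadata
	# cache textconv output in refs/notes/textconv/mfo
	cachetextconv = true
```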
So I did a big cache priming:
$ time git log -p >/dev/null
real 39m29.748s
user 23m1.090s
sys 3m46.642s
Slow, and unsurprisingly it spends quite a bit of time waiting on I/O
(real time is well above user plus sys). The result is a notes tree
with almost one textconv entry per blob:
$ git ls-tree -r notes/textconv/mfo | wc -l
10317
We're now using almost 200M:
$ git count-objects
39513 objects, 198604 kilobytes
But wait. Many of those objects are trees for stale versions of the
cache.
$ git repack -d
$ (cd .git/objects/pack && du -k *.pack)
2056 pack-34170e72ec40a07e98aae044479abccc9e02751b.pack
4089224 pack-81797628f3aebf6a0bdc082fa05ec14932910534.pack
$ git count-objects
30685 objects, 163288 kilobytes
In actuality, the fully packed cache is only about 2M (the small pack
above, down from about 35M of loose objects; it deltas quite well
because there is a lot of overlap in my metadata). And we can prune
away the other 160M of cruft:
$ git prune
$ git count-objects
0 objects, 0 kilobytes
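The sizes quoted above follow from the count-objects and du output. A
small sketch of the arithmetic, using only numbers from the transcripts
(the variable names are just for illustration):

```shell
before_kb=198604   # loose objects right after priming
after_kb=163288    # loose objects left after `git repack -d`
pack_kb=2056       # size of the new pack holding the reachable cache objects

# Loose objects that the repack moved into the new pack.
cache_loose_kb=$((before_kb - after_kb))
# How many times smaller the packed cache is (integer division).
pack_ratio=$((cache_loose_kb / pack_kb))

echo "cache as loose objects: ${cache_loose_kb} KB"
echo "cache packed:           ${pack_kb} KB (about ${pack_ratio}x smaller)"
echo "unreachable cruft:      ${after_kb} KB"
```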
And of course, the final speed result:
$ time git log -p >/dev/null
real 0m7.606s
user 0m6.084s
sys 0m0.788s
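To put the two timings side by side, the sketch below converts the
wall-clock figures to seconds and divides them. The helper name
to_seconds is hypothetical. Keep in mind that the first run also paid
for running every textconv and writing the cache.

```shell
# Hypothetical helper: convert a `time`-style value such as 39m29.748s
# to seconds by splitting on the 'm' and 's' unit letters.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

cold=$(to_seconds 39m29.748s)   # priming run
warm=$(to_seconds 0m7.606s)     # run with a primed cache

# Ratio of the two wall-clock times, rounded to a whole number.
awk -v c="$cold" -v w="$warm" 'BEGIN { printf "speedup: about %.0fx\n", c / w }'
```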
So what I take away from this is two things:
1. The size tradeoff is definitely worthwhile for some workloads. In
this case, the packed textconv cache (about 2M) is roughly three
orders of magnitude smaller than the original (4.0G). I'd be
interested to see numbers for something like a repository of
documents that get textconv'd to pure ASCII.
2. We had about 460% loose object overhead (roughly 160M of cruft
against 35M of useful objects), just from tree objects in
intermediate versions of the cache. While it was not too hard to
get rid of with a repack and a prune, we are probably better off
not generating it in the first place. In theory we could have
written only one notes tree, and kept the intermediate state in
memory. In practice, flushing once per commit-diff (instead of once
per file) would probably be fine, and would be simpler to
implement.
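The overhead figure in point 2 can be checked from the count-objects
output. A minimal sketch, with variable names chosen for illustration:

```shell
cruft_kb=163288   # unreachable loose objects left after the repack
cache_kb=35316    # loose objects that were part of the cache (198604 - 163288)

# Cruft relative to useful loose objects, as an integer percentage.
overhead_pct=$((cruft_kb * 100 / cache_kb))
echo "loose object overhead: ${overhead_pct}%"
```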
And of course, now that I have a completely primed cache, I can push it
around with "git push $dest notes/textconv/mfo". Yay for storing notes
as git objects.
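On the receiving side, notes refs are not covered by the default fetch
refspec, so the cache has to be asked for by name. A sketch, assuming
the driver is called mfo as in the ref above; the helper name
textconv_cache_refspec is hypothetical:

```shell
# Hypothetical helper: print the fetch refspec for a driver's textconv cache.
textconv_cache_refspec() {
    ref="refs/notes/textconv/$1"
    echo "$ref:$ref"
}

# Usage (with $src naming the repository that holds the primed cache):
#   git fetch "$src" "$(textconv_cache_refspec mfo)"
```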
-Peff