Re: [PATCH take 2] change the unpack limit treshold to a saner value

Previous thread: Fast access git-rev-list output: some OS knowledge required by Marco Costalba on Wednesday, December 6, 2006 - 3:24 pm. (17 messages)

Next thread: What's in git.git (stable) by Junio C Hamano on Wednesday, December 6, 2006 - 5:18 pm. (3 messages)
To: Junio C Hamano <junkio@...>
Cc: <git@...>
Date: Wednesday, December 6, 2006 - 5:08 pm

Let's assume the average object size is x. Given n objects, the needed
storage size is n*(x + b), where b is the average wasted block size on
disk.

If those objects are packed, the needed storage becomes n*x + b so we
save on the block overhead. But there is the pack index which is
1024 + n*24 + b.

Trying to find the value of n where packed objects become an advantage,
we have:

n*x + n*b > n*x + n*24 + 2*b + 1024

n*b - 2*b > n*24 + 1024

(n - 2)*b > n*24 + 1024

So given this we need at least 3 objects for the whole to use more space
than a pack, and only if b is greater than 1096. Since a common block
size is 4096 then the value of b is likely to converge towards 2048. 3
objects is also where more directory entries are used over an constant
of 2 for a pack (assuming that both objects would end up with the same
first 2 bytes of hash which is overly optimistic). So 3 should be the
optimal number of objects for not exploding a pack. And of course
larger packs are likely to take even less space due to delta compression
kicking in.

This is why I think the current default treshold should be 3 instead of
the insane value of 5000. But since it feels a bit odd to go from 5000
to 3 I setled on 10.

Signed-off-by: Nicolas Pitre <nico@cam.org>

---

Sidenote: I think it is OK for pushes to _not_ use thin packs. When
not exploding thin packs, they must be completed by adding objects
creating duplicates and using more disk space. The penalty for not
using thin packs is a slight increase in bandwidth for push operations,
but since pushes are normally much less frequent than fetches it seems
OK to penalize the push a bit for a better disk usage on servers.

diff --git a/receive-pack.c b/receive-pack.c
index d62ed5b..9140312 100644
--- a/receive-pack.c
+++ b/receive-pack.c
@@ -11,7 +11,7 @@
static const char receive_pack_usage[] = "git-receive-pack <git-dir>";

static int deny_non_fast_forwards = 0;
-static int unpack_l...

To: Junio C Hamano <junkio@...>
Cc: <git@...>
Date: Thursday, December 7, 2006 - 12:01 am

Currently the treshold is 5000. The likelihood of this value to ever be
crossed for a single push is really small making it not really useful.

The optimal treshold for a pure space saving on a filesystem with 4kb
blocks is 3. However this is likely to create many small packs
concentrating a large number of files in a single directory compared to
the same objects which are spread over 256 directories when loose. This
means we would need 512 objects per pack on average to approximagte the
same directory cost (a pack has 2 files because of the index).

But 512 is a really high value just like 5000 since most pushes are
unlikely to have that many objects. So let's try with a value of 100
which should have a good balance between small pushes going to be
exploded into loose objects and large pushes kept as whole packs.

This is not a replacement for periodic repacks of course.

Signed-off-by: Nicolas Pitre <nico@cam.org>

---

diff --git a/receive-pack.c b/receive-pack.c
index d62ed5b..9140312 100644
--- a/receive-pack.c
+++ b/receive-pack.c
@@ -11,7 +11,7 @@
static const char receive_pack_usage[] = "git-receive-pack <git-dir>";

static int deny_non_fast_forwards = 0;
-static int unpack_limit = 5000;
+static int unpack_limit = 100;
static int report_status;

static char capabilities[] = " report-status delete-refs ";
-

To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <junkio@...>, <git@...>
Date: Thursday, December 7, 2006 - 3:59 am

I started that out at 5000 because it was a reasonably safe
threshold. Users who didn't explicitly lower this value didn't
enable the feature. But it was still useful on initial pushes into
a brand new repository when the project was very large. For example
pushing git.git into a bare repository *sucked* before this change.

Yea, 5000 is probably too high. Nice to see it drop.

--
Shawn.
-

To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <junkio@...>, <git@...>
Date: Wednesday, December 6, 2006 - 9:08 pm

Definitely not.

We have a much easier time handling many loose packed objects than many
pack-files. For many reasons, but two really obvious ones:

- pack-file indexes get read in on startup, and we maintain an explicit
list of them. Having lots of pack-files adds overhead that doesn't
exist for lots of loose objects.

- loose files are spread out over 256 subdirectories to make lookup
easier, packfiles are not (and always create an index file too).

So in general, as a trivial heuristic, you probably want about 512 times
as many loose objects as you want pack-files, i fonly because of the
latter issue, because you can much more easily handle lots of loose
objects than lots of pack-files. So it's _not_ a factor of 3. Or even 10.

But since there _is_ reason to do pack-files too, and since using too big
a value means that you never end up keeping a pack-file _at_all_ if you
pull often, I'd suggest that rather than use a limit of 512 you go for
something like 100-200 objects as the threshold (of course, the proper one
would depend on the distribution of the size of your pack-files, but I'll
just hand-wave and say that together with occasional re-packing, something
in that range is _generally_ going to be a good idea).

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, <git@...>
Date: Wednesday, December 6, 2006 - 11:24 pm

Note that this setting is currently observed for pushes not pulls.
On the pull side you currentli need to provide -k for not exploding
packs.

So the question is what number of objects on average do pushes have? If
most pushes are below the treshold this is not going to be really
useful.

And I think 5000 is definitely way too high. 10 might be too small
indeed. 100 is maybe a good default to try out.

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <junkio@...>, <git@...>
Date: Wednesday, December 6, 2006 - 11:39 pm

It will depend a lot on the project, and "where" in the project you are.

For example, for most end developers, the "push" is likely going to be a
few commits (say, a days work). Probably on the order of a few tens to
maybe a few hundred objects. It's actually hard to create a pack with less
than ten objects if you have a few directory levels (a single small commit
in the kernel is usually 5-7 objects: commit + 2-3 levels of directory + a
couple of blobs).

For me, as I pull a big merge and push it out, a push can easily be in the
thousands of objects, just because I merged other peoples combined work
over several weeks.

And for a "mirror" server, it will depend on the granularity of the

I think 100 is a nice round number for humans. Worth trying.

Linus
-

To: Nicolas Pitre <nico@...>
Cc: <git@...>
Date: Wednesday, December 6, 2006 - 6:24 pm

I see you are optimizing for disk footprint, and this will
result in tons of tiny packs left between "repack -a".

I have not benched it yet, but the runtime pack handling code
was written assuming we have only a handful of big packs; I
suspect this change would affect the performance at runtime in
quite a bad way.

-

To: Junio C Hamano <junkio@...>
Cc: <git@...>
Date: Wednesday, December 6, 2006 - 8:19 pm

Possibly.

Still a treshold of 5000 is way too large IMHO.

Nicolas
-

Previous thread: Fast access git-rev-list output: some OS knowledge required by Marco Costalba on Wednesday, December 6, 2006 - 3:24 pm. (17 messages)

Next thread: What's in git.git (stable) by Junio C Hamano on Wednesday, December 6, 2006 - 5:18 pm. (3 messages)