Re: git-index-pack really does suck..

Previous thread: StGit patchname numeric prefix? by Jamie Border on Tuesday, April 3, 2007 - 4:49 am. (3 messages)

Next thread: Re: [PATCH] gitweb: Support comparing blobs (files) with different names by Jakub Narebski on Tuesday, April 3, 2007 - 7:57 am. (3 messages)
From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 8:15 am

Junio, Nico,
 I think we need to do something about it.

CLee was complaining about git-index-pack on #irc with the partial KDE 
repo, and while I don't have the KDE repo, I decided to investigate a bit.

Even with just the kernel repo (with a single 170MB pack-file), I can do

	git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack

and it uses 52s of CPU-time, and on my 4GB machine it actually started 
doing IO and swapping, because git-index-pack grew to 4.8GB in size. So 
while I initially thought I'd want a bigger test-case to see the problem, 
I sure as heck don't.

The 52s of CPU time exploded into almost three minutes of actual 
real-time:

	47.33user 5.79system 2:41.65elapsed 32%CPU
	2117major+1245763minor

And that's on a good system with a powerful CPU, "enough memory" for any 
reasonable development, and good disks! Very much ungood-plus-plus.

I haven't looked into exactly why yet, but I bet it's just that we keep 
every single object expanded in memory. We do need to keep the objects 
around, so that we can resolve delta's, but we can certainly do it other 
ways. 

Two suggestion for other ways:

 - simple one: don't keep unexploded objects around, just keep the deltas, 
   and spend tons of CPU-time just re-expanding them if required.

   We *should* be able to do it with just keeping the original 170MB 
   pack-file in memory, not expanding it to 3.8GB! 

   Still, even this will be painful once you have a big pack-file, and the 
   CPU waste is nasty (although a delta-base cache like we do in 
   sha1_file.c would probably fix it 99% - at that point it's getting 
   less simple, and the "best" solution below looks more palatable)

 - best one: when writing out the pack-file, we incrementally keep a 
   "struct packed_git" around, and update the index for it dynamically, 
   and totally get rid of all objects that we've written out, because we 
   can re-create them.

   This means that we should have _zero_ memory footprint ...
From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 9:21 am

Ahh. False alarm.

The problem is actually largely a really stupid memory leak in the SHA1 
collision checking (which wouldn't trigger on a normal pull, but obviously 
does trigger for every single object when testing!)

This trivial patch fixes most of it. git-index-pack still uses too much 
memory, but it does a *lot* better.

Junio, please get this into 1.5.1 (I *think* the SHA1 checking is new, but 
if it exists in 1.5.0 too, it obviously needs the same fix).

It still grows, but it grew to just 287M in size now for the 170M kernel 
object:

	41.59user 1.39system 0:43.64elapsed
	0major+73552minor

which is quite a lot better.

Duh.

		Linus

---
 index-pack.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/index-pack.c b/index-pack.c
index 6284fe3..3c768fb 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
 		if (size != has_size || type != has_type ||
 		    memcmp(data, has_data, size) != 0)
 			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+		free(has_data);
 	}
 }
 
-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 9:40 am

Damn!

Please don't report those things when I'm out for lunch so I could have 


Indeed.



Nicolas
-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 9:33 am

Even better:

  - Fix my own stupid mistake with a _single_ line of code:

diff --git a/index-pack.c b/index-pack.c
index 6284fe3..3c768fb 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
 		if (size != has_size || type != has_type ||
 		    memcmp(data, has_data, size) != 0)
 			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+		free(has_data);
 	}
 }

The thing is, that code path is executed _only_ when index-pack is 
encountering an object already in the repository in order to protect 
against possible SHA1 collision attacks.  See commit 8685da42561d log 
for the full story.

Normally this should not happen in normal usage scenarios because the 
objects you fetch are those that you don't already have.  But if you 
manually run index-pack inside an existing repository then you'll 
already have _all_ those objects already explaining the high CPU usage.

But this is no excuse for not freeing the data though.


Nicolas
-

From: Chris Lee
Date: Tuesday, April 3, 2007 - 12:27 pm

There's another issue here.

I'm running git-index-pack as part of a workflow like so:

$ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
$ grep 'blob' /tmp/all-objects > /tmp/blob-objects
$ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
--delta-base-offset --all-progress --stdout > blob.pack
$ git-index-pack -v blob.pack

Now, when I run 'git-index-pack' on blob.pack in the current
directory, memory usage is pretty horrific (even with the applied
patch to not leak all everything). Shawn tells me that index-pack
should only be decompressing the object twice - once from the repo and
once from blob.pack - iff I call git-index-pack with --stdin, which I
am not.

If I move the blob.pack into /tmp, and run git-index-pack on it there,
it completes much faster and the memory usage never exceeds 200MB.
(Inside the repo, it takes up over 3G of RES according to top.)

By "much faster", I mean: the entire pack was indexed and completed in
17:42.40, whereas I cancelled the inside-the-repo index because after
56 minutes it was only at 46%.

(And, as far as getting this huge repo published - I have it burned to
a DVD, and I'm going to drop it in the mail to hpa today on my lunch
break, which I should be taking soon.)

-clee
-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 12:49 pm

Instead of using --stdout with git-pack-object, you should provide it 
with a suitable base name for the resulting pack and the index will be 
created automatically along side the pack for you.  No need to use 

The 3G should definitely be fixed with the added free().

The CPU usage is explained by the fact that you're running index-pack on 
objects that are all already found in your repo so the collision check 
is triggered.  This is more or like the same issue as if you tried to 
run unpack-objects on the same pack where none of your objects will 
actually be unpacked.


Nicolas
-

From: Chris Lee
Date: Tuesday, April 3, 2007 - 12:54 pm

Not really. This packfile is 2.6GB in size, and apparently it gets mmap'd.

(Yesterday, my machine ran out of memory trying to do index-pack when
the memleak still existed; I have 4G of RAM and, normally, 4G of swap,

Right, and if I was using --stdin, I would expect that. But I'm not.
And, according to Shawn anyway, the current behaviour is not what was
intended.

-clee
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 1:18 pm

Yeah. What happens is that inside the repo, because we do all the 
duplicate object checks (verifying that there are no evil hash collisions) 
even after fixing the memory leak, we end up keeping *track* of all those 
objects.

And with a large repository, it's quite the expensive operation.

That whole "verify no SHA1 hash collision" code is really pretty damn 
paranoid. Maybe we shouldn't have it enabled by default.

So how about this updated patch? We could certainly make "git pull" imply 
"--paranoid" if we want to, but even that is likely pretty unnecessary. 
It's not like anybody has ever shown a SHA1 collision, and if the *local* 
repository is corrupt (and has an object with the wrong SHA1 - that's what 
the testsuite checks for), then it's probably good to get the valid object 
from the remote..

This includes the previous one-liner, but also adds the "--paranoid" flag 
and fixes up the Documentation and tests to match.

Junio, your choice, but regardless which one you choose:

	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Thanks,

		Linus

---
 Documentation/git-index-pack.txt |    3 +--
 index-pack.c                     |    6 +++++-
 t/t5300-pack-object.sh           |    4 ++--
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 2229ee8..7d8d33b 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -8,8 +8,7 @@ git-index-pack - Build pack index file for an existing packed archive
 
 SYNOPSIS
 --------
-'git-index-pack' [-v] [-o <index-file>] <pack-file>
-'git-index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>] [<pack-file>]
+'git-index-pack' [--stdin [--fix-thin] [--keep]] [-v] [--paranoid] [-o <index-file>] <pack-file>
 
 
 DESCRIPTION
diff --git a/index-pack.c b/index-pack.c
index 6284fe3..8a4c27a 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -45,6 +45,7 @@ static int ...
From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 1:32 pm

Maybe we shouldn't run index-pack on packs for which we _already_ have 
an index for which is the most likely reason for the collision check to 
trigger in the first place.

This is in the same category as trying to run unpack-objects on a pack 

I'm of the opinion that this patch is unnecessary.  It only helps in 
bogus workflows to start with, and it makes the default behavior unsafe 
(unsafe from a paranoid pov, but still).  And in the _normal_ workflow 
it should never trigger.

So I wouldn't merge it.


Nicolas
-

From: Junio C Hamano
Date: Tuesday, April 3, 2007 - 1:38 pm

Hmmmm.  You may have a point.

-

From: Junio C Hamano
Date: Tuesday, April 3, 2007 - 1:40 pm

[sorry, sent a message without finishing]


Hmmmm.  You may have a point.

So maybe we should retitle this thread from "git-index-pack
really does suck.." to "I used git-index-pack in a stupid way"?
 

-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 2:00 pm

See my separate timing numbers, although I bet that Chris can give even 
better ones..

Chris, try applying my patch, and then inside the KDE repo you have, do

	git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack

(ie index the objects of the *git* repository, not the KDE one). That 
should approximate doing a fair-sized "git pull" - getting new objects. Do 
it with and without --paranoid, and time it.

I bet that what I see as a 7% slowdown will be much bigger for you, just 
because the negative lookups will be all that much more expensive when you 
have tons of objects.

			Linus
-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 2:28 pm

Like I said this is bogus since the index-pack is throttled by the 
network making this overhead a non issue in real life.

And like I said there should _not_ be such a memory usage difference 

And I bet your newton-raphson lookup idea would shine and bring that 
overhead down considerably in that case.


Nicolas
-

From: Chris Lee
Date: Tuesday, April 3, 2007 - 3:49 pm

% time git-index-pack --paranoid --stdin --fix-thin paranoid.pack <
/usr/local/src/git/.git/objects/pack/*pack
pack	bf8ba7897da9c84d1981ecdc92c0b1979506a4b9
git-index-pack --paranoid --stdin --fix-thin paranoid.pack <   5.28s
user 0.24s system 98% cpu 5.592 total

% time git-index-pack --stdin --fix-thin trusting.pack <
/usr/local/src/git/.git/objects/pack/*pack
pack	bf8ba7897da9c84d1981ecdc92c0b1979506a4b9
git-index-pack --stdin --fix-thin trusting.pack <   5.07s user 0.12s
system 99% cpu 5.202 total


I applied exactly the patch you sent, and it applied perfectly cleanly
- no failures.

I also mailed out the DVD with the repo on it to hpa today, so
hopefully by tomorrow he'll get it. (He's not even two cities over,
and I suspect I could have just driven it to his place, but that might
have been a little awkward since I've never met him.)

Anyway, so, hopefully once he gets it he can put it up somewhere that
you guys can grab it. For reference, the KDE repo is pretty big, but a
"real" conversion of the repo would be bigger; the one that I've been
playing with only has the KDE svn trunk, and only the first 409k
revisions - there are, as of right now, over 650k revisions in KDE's
svn repo. So, realistically speaking, a fully-converted KDE git repo
would probably take up at least 6GB, packed, if not more. Subproject
support would probably be *really* helpful to mitigate that.

-clee
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 4:12 pm

It's entirely possible that the object lookup is good enough to not be a 
problem even for huge packs, and it really only gets to be a problem when 
you actually unpack all the objects.

In that case, the only real case to worry about is indeed the "alternates" 
case (or if people actually use a shared git object directory, but I don't 
think anybody really does - alternates just work well enough, and shared 
object directories are painful enough that I doubt anybody *really* uses 


Sure. I think subproject support is likely the big "missing feature" of 
git right now. The rest is "details", even if they can be big and involved 
details.

But even at only 409k revisions, it's still going to be an order of 
magnitude bigger than what the kernel is, exactly *because* it's such a 
disaster from a maintenance setup standpoint, and it's going to be a 
useful real-world test-case. So whether that is a "good" git archive or 
not, it's going to be useful.

Long ago we used to be able to look at the historic Linux archive as an 
example of a "big" archive, but it's not actually all that much bigger 
than the normal Linux archive any more, and we've pretty much fixed the 
problems we used to have.

[ The historical pack-file is actually smaller, but that's because it was 
  done with a much deeper delta-chain to make it small: the historical 
  archive still has more objects in it than the current active git kernel 
  tree - but it's only in the 20% range, not "20 *times* bigger" ]

The Eclipse tree was useful (and I think we already improved performance 
for you thanks to working with it - I don't know how much faster the 
delta-base cache made things for you, but I'd assume it was at *least* by 
the factor-of-2.5 that we saw on Eclipse), but the KDE is bigger *and* 
deeper (the eclipse tree is 1.7GB, and 136k revisions in the main branch, 
so the KDE tree is more than twice the revisions).

		Linus
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 1:56 pm

Look at what we have to do to look up a SHA1 object.. We create all the 
lookup infrastructure, we don't *just* read the object. The delta base 

Actually, even in the normal workflow it will do all the extra unnecessary 
work, if only because the lookup costs of *not* finding the entry.

Lookie here:

 - git index-pack of the *git* pack-file in the v2.6/linux directory (zero 
   overlap of objects)

   With --paranoid:

	2.75user 0.37system 0:03.13elapsed 99%CPU
	0major+5583minor pagefaults

   Without --paranoid:

	2.55user 0.12system 0:02.68elapsed 99%CPU
	0major+2957minor pagefaults

See? That's the *normal* workflow. Zero objects found. 7% CPU overhead 
from just the unnecessary work, and almost twice as much memory used. Just 
from the index file lookup etc for a decent-sized project.

Now, in the KDE situation, the *unnecessary* lookups will be about ten 
times more expensive, both on memory and CPU, just because the repository 
is about 20x the size. Even with no actual hits.

		Linus
-

From: Shawn O. Pearce
Date: Tuesday, April 3, 2007 - 2:03 pm

OK, but what about that case with unpack-objects?  Didn't we there
do all this work to also check for the object already existing?
During update-index, write-tree and commit-tree don't we also do
a lot of work (per object anyway) to check for a non-existing object?

So even with --paranoid (aka what we have now) index-pack still
should be faster than unpack-objects for any sizeable transfer,
and is just as "safe".

If its the missing-object lookup that is expensive, maybe we should
try to optimize that.  We do it enough already in other parts of
the code...

-- 
Shawn.
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 2:13 pm

I agree 100% that index-pack is much much MUCH better than unpack-objects 

Well, for all other cases it's really the "object found" case that is 
worth optimizing for, so I think optimizing for "no object" is actually 
wrong, unless it also speeds up (or at least doesn't make it worse) the 
"real" normal case.

		Linus
-

From: Shawn O. Pearce
Date: Tuesday, April 3, 2007 - 2:17 pm

Right.  But maybe we shouldn't be scanning for packfiles every
time we don't find a loose object.  Especially if the caller is in
a context where we actually *expect* to not find said object like
half of the time... say in git-add/update-index.  ;-)

-- 
Shawn.
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 2:26 pm

Yes, we could definitely skip the re-lookup if we had a "don't really 
care, I can recreate the object myself" flag (ie anybody who is going to 
write that object)

			Linus
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 2:28 pm

Side note: with "alternates" files, you might well *always* have the 
objects. If you do

	git clone -l -s ...

to create various branches, and then pull between them, you'll actually 
end up in the situation that you'll always find the objects and get back 
to the really expensive case..

		Linus
-

From: Junio C Hamano
Date: Tuesday, April 3, 2007 - 3:31 pm

Ah, that's true.  If you "git clone -l -s A B", create new
objects in A and pull from B, the transfer would not exclude
new objects as they are not visible from B's refs.

In that scenario, the keep-pack behaviour is already worse than
the unpack-objects behaviour.  The former creates a packfile
that duplicates objects that are in A while the latter, although
expensive, ends up doing nothing.

I wonder if we can have a backdoor to avoid any object transfer
in such a case to begin with...




-

From: Shawn O. Pearce
Date: Tuesday, April 3, 2007 - 3:38 pm

Yea, symlink to the corresponding refs directory of the alternate
ODB.  Then the refs will be visible.  ;-)

-- 
Shawn.
-

From: Junio C Hamano
Date: Tuesday, April 3, 2007 - 3:41 pm

I was thinking about going the other way.  When git-fetch is
asked to fetch over a local transport, it could check if the
source is one of its alternate object stores.


-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 2:34 pm

First, I truly believe we should have a 64-bit pack index and fewer 
larger packs than many small packs.

Which leaves us with the actual pack index lookup.  At that point the 
cost of finding an existing object and finding that a given object 
doesn't exist is about the same thing, isn't it?

Optimizing that lookup is going to benefit both cases.


Nicolas
-

From: Shawn O. Pearce
Date: Tuesday, April 3, 2007 - 2:37 pm

Here's the rub:  in the missing object case we didn't find it
in the pack index, but it could be loose.  That's one failed
syscall per object if the object isn't loose.  If the object
isn't loose, it could be that it was *just* removed by a
running prune-packed, and the packfile that it was moved
to was created after we scanned for packfiles, so time to
rescan...

-- 
Shawn.
-

From: Junio C Hamano
Date: Tuesday, April 3, 2007 - 2:44 pm

If that is the only reason we have these reprepare_packed_git()
sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we
should rethink that.  Is there a cheap way to trigger these
rescanning only when a prune-packed is in progress, I wonder...



-

From: Shawn O. Pearce
Date: Tuesday, April 3, 2007 - 2:53 pm

Yea, it is the only reason.  So... if we could have some
magic to trigger that, it would be good.  I just don't
know what magic that would be.

-- 
Shawn.
-

From: Jeff King
Date: Tuesday, April 3, 2007 - 3:10 pm

It doesn't have to be in progress; it might have started and completed
between pack-scanning and object lookup. Something like:

  1. start git-repack -a -d
  2. start git-rev-list, scan for packs
  3. repack moves finished pack into place, starts git-prune-packed
  4. git-prune-pack completes
  5. git-rev-list looks up object

So you would need to have some sort of incremented counter that says
"there's a new pack available."

Perhaps instead of rescanning unconditionally, it should simply stat()
the pack directory and look for a change? That should be much cheaper.

-Jeff
-

From: Dana How
Date: Tuesday, April 3, 2007 - 3:40 pm

Do you get what you want if you move to fewer larger INDEX files
but not pack files -- in the extreme, one large index file?
A "super index" could be built from multiple .idx files.
This would be a new file (format) unencumbered by the past,
so it could be tried out more quickly.
Just like objects are pruned when packed,
.idx files could be pruned when the super index is built.

Perhaps a number of (<2GB) packfiles and a large index
file could work out.

Larger and larger pack files make me nervous.
They are expensive to manipulate,
and >2GB requires a file format change.

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 3:52 pm

It sometimes also requires a new filesystem. There are a lot of 
filesystems that can handle more than 4GB total, but not necessarily in a 
single file.

The only really useful such filesystem is probably FAT, which is still 
quite useful for things like USB memory sticks. But that is probably 
already worth supporting.

So I think we want to support 64-bit (or at least something like 40+ bit) 
pack-files, but yes, I think that even if/when we support it, we still 
want to support the "multiple smaller pack-files" schenario exactly 
because for some uses it's much *better* to have ten 2GB packfiles rather 
than one 20GB pack-file.

			Linus
-

From: David Lang
Date: Tuesday, April 3, 2007 - 3:31 pm

however, for historical archives you may end up with wanting to do a 2GB 
packfile of 'recent' stuff, and a 14GB packfile of 'ancient' stuff (with the 
large one built with all space-saving options turned up all the way, no matter 
how much time it takes to build the pack)

David Lang
-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 4:00 pm

No.  Larger pack files are not more expensive than the same set of 
objects spread into multiple packs.  In fact I'd say it's quite the 
opposite.  And larger pack files do not require a pack format change -- 
it's just the index that has to change and the index is a local matter.


Nicolas
-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 2:21 pm

7% overhead over 2 second and a half of CPU which, _normally_, happens 
when cloning the whole thing over a network connection which, if you're 
lucky and have a 6mbps cable connection, will still be spread over 5 
minutes of real time.  And that is assuming that you're cloning a big 
project inside itself which wouldn't work anyway.  Otherwise a big clone 
wound run index-pack in an empty repo where the lookup of exinsting 
object is zero.  Remains git-fetch which should concern itself with much 

So?  When would you really perform such an operation in a meaningful 
way?

The memory usage worries me.  I still cannot explain nor justify it.  
But the CPU overhead is certainly not of any concern in _normal_ usage 
scenarios, is it?

If anything that might be a good test case for the newton-raphson pack 
lookup idea.


Nicolas
-

From: Junio C Hamano
Date: Tuesday, April 3, 2007 - 1:34 pm

I agree with that reasoning. We did not do paranoid in git-pull
long after we introduced the .keep thing anyway, so I do not
think the following patch is even needed, but I am throwing it
out just for discussion.



diff --git a/fetch-pack.c b/fetch-pack.c
index 06f4aec..c687f9f 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -522,6 +522,7 @@ static int get_pack(int xd[2])
 
 	if (do_keep) {
 		*av++ = "index-pack";
+		*av++ = "--paranoid";
 		*av++ = "--stdin";
 		if (!quiet && !no_progress)
 			*av++ = "-v";

-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 1:53 pm

1) None of the objects in a pack should exist in the local repo when 
   fetching, meaning that the paranoia code should not be executed 
   normally.

2) Running index-pack on a pack _inside_ a repository is a dubious thing 
   to do with questionable usefulness already.

3) It is unefficient to run pack-objects with --stdout just to feed the 
   result to index-pack afterwards while repack-objects can create the 
   index itself, which is the source of this discussion.
   
4) I invite you to read the commit log for 8685da42561 where the 
   _perception_ of GIT's security is discussed which led to the paranoia 
   check, and sometimes the perception is more valuable than the 
   reality, especially when it is free.

Therefore Linus' patch and this one are working around the wrong issue 
as described in (3) IMHO.

What could be done instead, if really really needed, is to have the 
paranoia test be made conditional on index-pack --stdin instead.  But 
please no bogus extra switches pretty please.


Nicolas
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 1:33 pm

Some trivial timings for indexing just the kernel pack..

Without --paranoid:

	24.61user 2.16system 0:27.04elapsed 99%CPU
	0major+14120minor pagefaults

With --paranoid:

	42.74user 3.04system 0:46.36elapsed 98%CPU
	0major+72768minor pagefaults

so it's a noticeable CPU issue, but it's even more noticeable in memory 
usage (55MB vs 284MB - pagefaults give a good way to look at how much 
memory really got allocated for the process).

All that extra memory is just for SHA1 commit ID information. 

Now, clearly the usage scenario here is a big odd (ie the case where we 
have all the objects already), so in that sense this is very much a 
worst-case situation, and you simply shouldn't *do* something like this, 
but at the same time, I'm just not convinced a very theoretical SHA1 
collision check is worth it. 

Btw, even if we don't have any of the objects, if you have tons and tons 
of objects and do a "git pull", just the *lookup* of the nonexistent 
objects will be expensive: first we won't find it in any pack, then we'll 
look at the loose objects, and then we'll look int he pack *again* due to 
the race avoidance. So looking up nonexistent objects is actually pretty 
expensive.

In fact, "--paranoid" takes one second more for me even totally outside of 
a git repository, just because we waste so much time trying to look up 
non-existent object files ;)

			Linus


-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 2:05 pm

I don't see where that might be.  The only thing that the paranoia check 
triggers is:

	foo = read_sha1_file(blah);
	memcmp(foo with bar);
	free(foo);

So where is that commit ID information actually stored when using 

Not if you consider that it is performed _while_ receiving (and waiting 
for) the pack data over the net in the normal case.


Nicolas
-

From: Shawn O. Pearce
Date: Tuesday, April 3, 2007 - 2:11 pm

That might just be the mmap code in sha1_file.c.  We are mmaping
the existing packfile, to fetch the object.  That counts as virtual
memory.  ;-)

-- 
Shawn.
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 2:24 pm

I've got the numbers: it uses much more memory when doing even failing 
lookups, ie:

	With --paranoid: 5583 minor pagefaults (21MB)
	Without --paranoid: 2957 minor pagefaults (11MB)

(remember, this was *just* the git pack, not the kernel pack)

It could be as simple as just the index file itself. That's 11MB for the 
kernel. Imagine if the index file was 20 times bigger - 200MB of memory 
paged in with bad access patterns just for unnecessary lookups.

Running valgrind shows no leak at all without --paranoid. With --paranoid, 
there's some really trivial stuff (the "packed_git" structure etc, so I 

..which is why I think it makes sense for "pull" to be paranoid. I just 
don't think it makes sense to be paranoid *all* the time, since it's 
clearly expensive.

		Linus
-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 2:42 pm

Again that presumes you have to page in the whole index, which should 
not happen when pulling (much smaller packs) and with a better lookup 

Make it conditionnal on --stdin then.  This covers all cases where we 
really want the secure thing to happen, and the --stdin case already 
perform the atomic rename-and-move thing when the pack is fully indexed.


Nicolas
-

From: Junio C Hamano
Date: Tuesday, April 3, 2007 - 3:07 pm

Repacking objects in a repository uses pack-objects without
using index-pack, as you suggested Chris.  Is there a sane usage
of index-pack that does not use --stdin?  I do not think of any.

If there isn't, the "conditional on --stdin" suggestion means we
unconditionally do the secure thing for all the sane usage, and
go unsecure for an insane usage that we do not really care about.

If so, it seems to me that it would be the simplest not to touch
the code at all, except that missing free().

Am I missing something?


-

From: Shawn O. Pearce
Date: Tuesday, April 3, 2007 - 3:11 pm

Nope. I agree with you completely.

-- 
Shawn.
-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 3:34 pm

Not from my point of view.


Nicolas
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 3:14 pm

Actually, since SHA1's are randomly distributed, together with the binary 
search, you probably *will* pull in the bulk of the index, even with a 

I don't care *what* it is conditional on, but your arguments suck. You 
claim that it's not a normal case to already have the objects, when it 
*is* a normal case for alternates, etc.

I don't understand why you argue against hard numbers. You have none of 
your own.

		Linus
-

From: Nicolas Pitre
Date: Tuesday, April 3, 2007 - 3:55 pm

Are hard numbers like 7% overhead (because right now that's all we have) 
really worth it against bad _perceptions_?

Sure, the SHA1 collision attack is paranoia.  But it is becoming 
increasingly *possible*.

And when we only had unpack-objects on the receiving end of a fetch, you 
yourself bragged about the implied security of GIT in the presence of a 
SHA1 collision attack.  Because let's admit it: when a SHA1 collision 
will happen it is way more probable to come on purpose than from pure 
accident.  But as you said at the time, it is not a problem because GIT 
trusts local objects more than remote ones and incidentally 
unpack-objects doesn't overwrite existing objects.

The keeping of fetched packs broke that presumption of trust towards 
local objects and it opened a real path for potential future attacks.  
Those attacks are still fairly theoretical of course.  But for how 
_long_?  Do we want GIT to be considered backdoor prone in a couple 
years from now just because we were obsessed by a 7% CPU overhead?

I think we have much more to gain by playing it safe and being more 
secure and paranoid than trying to squeeze some CPU cycles out of an 
operation that is likely to ever be bounded by network speed for most 
people.

And we _know_ that the operation can be optimized further anyway.

So IMHO in this case hard numbers alone aren't the end of it.  Not as 
long as they're reasonably low.  And especially not for a command which 
is 1) rather infrequent and 2) not really interactive like git-log might 
be.


Nicolas
-

From: David Lang
Date: Tuesday, April 3, 2007 - 3:36 pm

this is why -paranoid should be left on for network pulls, but having it on for 
the local uses means that the cost isn't hidden in the network limits isn't 
good.

David Lang
-

From: Alex Riesen
Date: Wednesday, April 4, 2007 - 2:51 am

You never know what pull is networked (or should I say: remote enough
to cause a collision).
-

From: David Lang
Date: Friday, April 6, 2007 - 2:56 pm

so leave it on for all pulls, but for other commands don't turn it on.

remember that the command that linus ran into at the start of the thread wasn't 
a pull.

David Lang
-

From: Junio C Hamano
Date: Friday, April 6, 2007 - 3:47 pm

Are you referring to this command

 $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack

in this message?

-

From: Junio C Hamano
Date: Friday, April 6, 2007 - 3:49 pm

From: Linus Torvalds <torvalds@linux-foundation.org>
  Subject: git-index-pack really does suck..
  Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT)
  Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org>

(sorry, chomped the message).



-

From: David Lang
Date: Friday, April 6, 2007 - 3:22 pm

probably (I useually don't keep the mail after I read or reply to it)

David Lang
-

From: Junio C Hamano
Date: Friday, April 6, 2007 - 3:55 pm

Well, then you should remember that the command linus ran into
was pretty much about pull, nothing else.

The quoted command was only to illustrate what 'git-pull'
invokes internally.  I do not think of any reason to use that
command for cases other than 'git-pull'.  What's the use case
you have in mind to run that command outside of the context of
git-pull?



-

From: David Lang
Date: Friday, April 6, 2007 - 3:28 pm

I guess I'm not remembering the thread accurately then (and/or am mising it up 
with a different thread). I thought that Linus had identified other cases that 
were impacted (something about proving that the object doesn't exist)

David Lang
-

From: Linus Torvalds
Date: Tuesday, April 3, 2007 - 4:29 pm

If it actually stays at just 7% even with large repos (and the numbers 
from Chris seem to say that it doesn't get worse - in fact, it may be that 
the lookup gets relatively more efficient for a large repo thanks to the 
log(n) costs), I agree that 7% probably isn't worth worrying about when 
weighed against "guaranteed no SHA1 collision". Especially as long as 
you'd normally only hit it when your real performance issue is going to be 
the network.

So especially if we can make sure that the *local* case is ok, where the 
network isn't going to be the bottleneck, I think we can/should do the 
paranoia.

That's especially true as it is also the local case where the 7% has 
already been shown to be just the best case, with the worst case being 
many hundred percent (and memory use going up from 55M to 280M in one 
example), thanks to us actually *finding* the objects.

			Linus
-

Previous thread: StGit patchname numeric prefix? by Jamie Border on Tuesday, April 3, 2007 - 4:49 am. (3 messages)

Next thread: Re: [PATCH] gitweb: Support comparing blobs (files) with different names by Jakub Narebski on Tuesday, April 3, 2007 - 7:57 am. (3 messages)