Junio, Nico, I think we need to do something about it. CLee was complaining about git-index-pack on #irc with the partial KDE repo, and while I don't have the KDE repo, I decided to investigate a bit. Even with just the kernel repo (with a single 170MB pack-file), I can do git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack and it uses 52s of CPU-time, and on my 4GB machine it actually started doing IO and swapping, because git-index-pack grew to 4.8GB in size. So while I initially thought I'd want a bigger test-case to see the problem, I sure as heck don't. The 52s of CPU time exploded into almost three minutes of actual real-time: 47.33user 5.79system 2:41.65elapsed 32%CPU 2117major+1245763minor And that's on a good system with a powerful CPU, "enough memory" for any reasonable development, and good disks! Very much ungood-plus-plus. I haven't looked into exactly why yet, but I bet it's just that we keep every single object expanded in memory. We do need to keep the objects around, so that we can resolve delta's, but we can certainly do it other ways. Two suggestion for other ways: - simple one: don't keep unexploded objects around, just keep the deltas, and spend tons of CPU-time just re-expanding them if required. We *should* be able to do it with just keeping the original 170MB pack-file in memory, not expanding it to 3.8GB! Still, even this will be painful once you have a big pack-file, and the CPU waste is nasty (although a delta-base cache like we do in sha1_file.c would probably fix it 99% - at that point it's getting less simple, and the "best" solution below looks more palatable) - best one: when writing out the pack-file, we incrementally keep a "struct packed_git" around, and update the index for it dynamically, and totally get rid of all objects that we've written out, because we can re-create them. This means that we should have _zero_ memory footprint ...
Ahh. False alarm. The problem is actually largely a really stupid memory leak in the SHA1 collision checking (which wouldn't trigger on a normal pull, but obviously does trigger for every single object when testing!) This trivial patch fixes most of it. git-index-pack still uses too much memory, but it does a *lot* better. Junio, please get this into 1.5.1 (I *think* the SHA1 checking is new, but if it exists in 1.5.0 too, it obviously needs the same fix). It still grows, but it grew to just 287M in size now for the 170M kernel object: 41.59user 1.39system 0:43.64elapsed 0major+73552minor which is quite a lot better. Duh. Linus --- index-pack.c | 1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/index-pack.c b/index-pack.c index 6284fe3..3c768fb 100644 --- a/index-pack.c +++ b/index-pack.c @@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size, if (size != has_size || type != has_type || memcmp(data, has_data, size) != 0) die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1)); + free(has_data); } } -
Damn! Please don't report those things when I'm out for lunch so I could have Indeed. Nicolas -
Even better: - Fix my own stupid mistake with a _single_ line of code: diff --git a/index-pack.c b/index-pack.c index 6284fe3..3c768fb 100644 --- a/index-pack.c +++ b/index-pack.c @@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size, if (size != has_size || type != has_type || memcmp(data, has_data, size) != 0) die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1)); + free(has_data); } } The thing is, that code path is executed _only_ when index-pack is encountering an object already in the repository in order to protect against possible SHA1 collision attacks. See commit 8685da42561d log for the full story. Normally this should not happen in normal usage scenarios because the objects you fetch are those that you don't already have. But if you manually run index-pack inside an existing repository then you'll already have _all_ those objects already explaining the high CPU usage. But this is no excuse for not freeing the data though. Nicolas -
There's another issue here.
I'm running git-index-pack as part of a workflow like so:
$ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
$ grep 'blob' /tmp/all-objects > /tmp/blob-objects
$ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
--delta-base-offset --all-progress --stdout > blob.pack
$ git-index-pack -v blob.pack
Now, when I run 'git-index-pack' on blob.pack in the current
directory, memory usage is pretty horrific (even with the applied
patch to not leak all everything). Shawn tells me that index-pack
should only be decompressing the object twice - once from the repo and
once from blob.pack - iff I call git-index-pack with --stdin, which I
am not.
If I move the blob.pack into /tmp, and run git-index-pack on it there,
it completes much faster and the memory usage never exceeds 200MB.
(Inside the repo, it takes up over 3G of RES according to top.)
By "much faster", I mean: the entire pack was indexed and completed in
17:42.40, whereas I cancelled the inside-the-repo index because after
56 minutes it was only at 46%.
(And, as far as getting this huge repo published - I have it burned to
a DVD, and I'm going to drop it in the mail to hpa today on my lunch
break, which I should be taking soon.)
-clee
-
Instead of using --stdout with git-pack-object, you should provide it with a suitable base name for the resulting pack and the index will be created automatically along side the pack for you. No need to use The 3G should definitely be fixed with the added free(). The CPU usage is explained by the fact that you're running index-pack on objects that are all already found in your repo so the collision check is triggered. This is more or like the same issue as if you tried to run unpack-objects on the same pack where none of your objects will actually be unpacked. Nicolas -
Not really. This packfile is 2.6GB in size, and apparently it gets mmap'd. (Yesterday, my machine ran out of memory trying to do index-pack when the memleak still existed; I have 4G of RAM and, normally, 4G of swap, Right, and if I was using --stdin, I would expect that. But I'm not. And, according to Shawn anyway, the current behaviour is not what was intended. -clee -
Yeah. What happens is that inside the repo, because we do all the duplicate object checks (verifying that there are no evil hash collisions) even after fixing the memory leak, we end up keeping *track* of all those objects. And with a large repository, it's quite the expensive operation. That whole "verify no SHA1 hash collision" code is really pretty damn paranoid. Maybe we shouldn't have it enabled by default. So how about this updated patch? We could certainly make "git pull" imply "--paranoid" if we want to, but even that is likely pretty unnecessary. It's not like anybody has ever shown a SHA1 collision, and if the *local* repository is corrupt (and has an object with the wrong SHA1 - that's what the testsuite checks for), then it's probably good to get the valid object from the remote.. This includes the previous one-liner, but also adds the "--paranoid" flag and fixes up the Documentation and tests to match. Junio, your choice, but regardless which one you choose: Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Thanks, Linus --- Documentation/git-index-pack.txt | 3 +-- index-pack.c | 6 +++++- t/t5300-pack-object.sh | 4 ++-- 3 files changed, 8 insertions(+), 5 deletions(-) diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt index 2229ee8..7d8d33b 100644 --- a/Documentation/git-index-pack.txt +++ b/Documentation/git-index-pack.txt @@ -8,8 +8,7 @@ git-index-pack - Build pack index file for an existing packed archive SYNOPSIS -------- -'git-index-pack' [-v] [-o <index-file>] <pack-file> -'git-index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>] [<pack-file>] +'git-index-pack' [--stdin [--fix-thin] [--keep]] [-v] [--paranoid] [-o <index-file>] <pack-file> DESCRIPTION diff --git a/index-pack.c b/index-pack.c index 6284fe3..8a4c27a 100644 --- a/index-pack.c +++ b/index-pack.c @@ -45,6 +45,7 @@ static int ...
Maybe we shouldn't run index-pack on packs for which we _already_ have an index for which is the most likely reason for the collision check to trigger in the first place. This is in the same category as trying to run unpack-objects on a pack I'm of the opinion that this patch is unnecessary. It only helps in bogus workflows to start with, and it makes the default behavior unsafe (unsafe from a paranoid pov, but still). And in the _normal_ workflow it should never trigger. So I wouldn't merge it. Nicolas -
[sorry, sent a message without finishing] Hmmmm. You may have a point. So maybe we should retitle this thread from "git-index-pack really does suck.." to "I used git-index-pack in a stupid way"? -
See my separate timing numbers, although I bet that Chris can give even better ones.. Chris, try applying my patch, and then inside the KDE repo you have, do git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack (ie index the objects of the *git* repository, not the KDE one). That should approximate doing a fair-sized "git pull" - getting new objects. Do it with and without --paranoid, and time it. I bet that what I see as a 7% slowdown will be much bigger for you, just because the negative lookups will be all that much more expensive when you have tons of objects. Linus -
Like I said this is bogus since the index-pack is throttled by the network making this overhead a non issue in real life. And like I said there should _not_ be such a memory usage difference And I bet your newton-raphson lookup idea would shine and bring that overhead down considerably in that case. Nicolas -
% time git-index-pack --paranoid --stdin --fix-thin paranoid.pack < /usr/local/src/git/.git/objects/pack/*pack pack bf8ba7897da9c84d1981ecdc92c0b1979506a4b9 git-index-pack --paranoid --stdin --fix-thin paranoid.pack < 5.28s user 0.24s system 98% cpu 5.592 total % time git-index-pack --stdin --fix-thin trusting.pack < /usr/local/src/git/.git/objects/pack/*pack pack bf8ba7897da9c84d1981ecdc92c0b1979506a4b9 git-index-pack --stdin --fix-thin trusting.pack < 5.07s user 0.12s system 99% cpu 5.202 total I applied exactly the patch you sent, and it applied perfectly cleanly - no failures. I also mailed out the DVD with the repo on it to hpa today, so hopefully by tomorrow he'll get it. (He's not even two cities over, and I suspect I could have just driven it to his place, but that might have been a little awkward since I've never met him.) Anyway, so, hopefully once he gets it he can put it up somewhere that you guys can grab it. For reference, the KDE repo is pretty big, but a "real" conversion of the repo would be bigger; the one that I've been playing with only has the KDE svn trunk, and only the first 409k revisions - there are, as of right now, over 650k revisions in KDE's svn repo. So, realistically speaking, a fully-converted KDE git repo would probably take up at least 6GB, packed, if not more. Subproject support would probably be *really* helpful to mitigate that. -clee -
It's entirely possible that the object lookup is good enough to not be a problem even for huge packs, and it really only gets to be a problem when you actually unpack all the objects. In that case, the only real case to worry about is indeed the "alternates" case (or if people actually use a shared git object directory, but I don't think anybody really does - alternates just work well enough, and shared object directories are painful enough that I doubt anybody *really* uses Sure. I think subproject support is likely the big "missing feature" of git right now. The rest is "details", even if they can be big and involved details. But even at only 409k revisions, it's still going to be an order of magnitude bigger than what the kernel is, exactly *because* it's such a disaster from a maintenance setup standpoint, and it's going to be a useful real-world test-case. So whether that is a "good" git archive or not, it's going to be useful. Long ago we used to be able to look at the historic Linux archive as an example of a "big" archive, but it's not actually all that much bigger than the normal Linux archive any more, and we've pretty much fixed the problems we used to have. [ The historical pack-file is actually smaller, but that's because it was done with a much deeper delta-chain to make it small: the historical archive still has more objects in it than the current active git kernel tree - but it's only in the 20% range, not "20 *times* bigger" ] The Eclipse tree was useful (and I think we already improved performance for you thanks to working with it - I don't know how much faster the delta-base cache made things for you, but I'd assume it was at *least* by the factor-of-2.5 that we saw on Eclipse), but the KDE is bigger *and* deeper (the eclipse tree is 1.7GB, and 136k revisions in the main branch, so the KDE tree is more than twice the revisions). Linus -
Look at what we have to do to look up a SHA1 object.. We create all the lookup infrastructure, we don't *just* read the object. The delta base Actually, even in the normal workflow it will do all the extra unnecessary work, if only because the lookup costs of *not* finding the entry. Lookie here: - git index-pack of the *git* pack-file in the v2.6/linux directory (zero overlap of objects) With --paranoid: 2.75user 0.37system 0:03.13elapsed 99%CPU 0major+5583minor pagefaults Without --paranoid: 2.55user 0.12system 0:02.68elapsed 99%CPU 0major+2957minor pagefaults See? That's the *normal* workflow. Zero objects found. 7% CPU overhead from just the unnecessary work, and almost twice as much memory used. Just from the index file lookup etc for a decent-sized project. Now, in the KDE situation, the *unnecessary* lookups will be about ten times more expensive, both on memory and CPU, just because the repository is about 20x the size. Even with no actual hits. Linus -
OK, but what about that case with unpack-objects? Didn't we there do all this work to also check for the object already existing? During update-index, write-tree and commit-tree don't we also do a lot of work (per object anyway) to check for a non-existing object? So even with --paranoid (aka what we have now) index-pack still should be faster than unpack-objects for any sizeable transfer, and is just as "safe". If its the missing-object lookup that is expensive, maybe we should try to optimize that. We do it enough already in other parts of the code... -- Shawn. -
I agree 100% that index-pack is much much MUCH better than unpack-objects Well, for all other cases it's really the "object found" case that is worth optimizing for, so I think optimizing for "no object" is actually wrong, unless it also speeds up (or at least doesn't make it worse) the "real" normal case. Linus -
Right. But maybe we shouldn't be scanning for packfiles every time we don't find a loose object. Especially if the caller is in a context where we actually *expect* to not find said object like half of the time... say in git-add/update-index. ;-) -- Shawn. -
Yes, we could definitely skip the re-lookup if we had a "don't really care, I can recreate the object myself" flag (ie anybody who is going to write that object) Linus -
Side note: with "alternates" files, you might well *always* have the objects. If you do git clone -l -s ... to create various branches, and then pull between them, you'll actually end up in the situation that you'll always find the objects and get back to the really expensive case.. Linus -
Ah, that's true. If you "git clone -l -s A B", create new objects in A and pull from B, the transfer would not exclude new objects as they are not visible from B's refs. In that scenario, the keep-pack behaviour is already worse than the unpack-objects behaviour. The former creates a packfile that duplicates objects that are in A while the latter, although expensive, ends up doing nothing. I wonder if we can have a backdoor to avoid any object transfer in such a case to begin with... -
Yea, symlink to the corresponding refs directory of the alternate ODB. Then the refs will be visible. ;-) -- Shawn. -
I was thinking about going the other way. When git-fetch is asked to fetch over a local transport, it could check if the source is one of its alternate object stores. -
First, I truly believe we should have a 64-bit pack index and fewer larger packs than many small packs. Which leaves us with the actual pack index lookup. At that point the cost of finding an existing object and finding that a given object doesn't exist is about the same thing, isn't it? Optimizing that lookup is going to benefit both cases. Nicolas -
Here's the rub: in the missing object case we didn't find it in the pack index, but it could be loose. That's one failed syscall per object if the object isn't loose. If the object isn't loose, it could be that it was *just* removed by a running prune-packed, and the packfile that it was moved to was created after we scanned for packfiles, so time to rescan... -- Shawn. -
If that is the only reason we have these reprepare_packed_git() sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we should rethink that. Is there a cheap way to trigger these rescanning only when a prune-packed is in progress, I wonder... -
Yea, it is the only reason. So... if we could have some magic to trigger that, it would be good. I just don't know what magic that would be. -- Shawn. -
It doesn't have to be in progress; it might have started and completed between pack-scanning and object lookup. Something like: 1. start git-repack -a -d 2. start git-rev-list, scan for packs 3. repack moves finished pack into place, starts git-prune-packed 4. git-prune-pack completes 5. git-rev-list looks up object So you would need to have some sort of incremented counter that says "there's a new pack available." Perhaps instead of rescanning unconditionally, it should simply stat() the pack directory and look for a change? That should be much cheaper. -Jeff -
Do you get what you want if you move to fewer larger INDEX files but not pack files -- in the extreme, one large index file? A "super index" could be built from multiple .idx files. This would be a new file (format) unencumbered by the past, so it could be tried out more quickly. Just like objects are pruned when packed, .idx files could be pruned when the super index is built. Perhaps a number of (<2GB) packfiles and a large index file could work out. Larger and larger pack files make me nervous. They are expensive to manipulate, and >2GB requires a file format change. Thanks, -- Dana L. How danahow@gmail.com +1 650 804 5991 cell -
It sometimes also requires a new filesystem. There are a lot of filesystems that can handle more than 4GB total, but not necessarily in a single file. The only really useful such filesystem is probably FAT, which is still quite useful for things like USB memory sticks. But that is probably already worth supporting. So I think we want to support 64-bit (or at least something like 40+ bit) pack-files, but yes, I think that even if/when we support it, we still want to support the "multiple smaller pack-files" schenario exactly because for some uses it's much *better* to have ten 2GB packfiles rather than one 20GB pack-file. Linus -
however, for historical archives you may end up with wanting to do a 2GB packfile of 'recent' stuff, and a 14GB packfile of 'ancient' stuff (with the large one built with all space-saving options turned up all the way, no matter how much time it takes to build the pack) David Lang -
No. Larger pack files are not more expensive than the same set of objects spread into multiple packs. In fact I'd say it's quite the opposite. And larger pack files do not require a pack format change -- it's just the index that has to change and the index is a local matter. Nicolas -
7% overhead over 2 second and a half of CPU which, _normally_, happens when cloning the whole thing over a network connection which, if you're lucky and have a 6mbps cable connection, will still be spread over 5 minutes of real time. And that is assuming that you're cloning a big project inside itself which wouldn't work anyway. Otherwise a big clone wound run index-pack in an empty repo where the lookup of exinsting object is zero. Remains git-fetch which should concern itself with much So? When would you really perform such an operation in a meaningful way? The memory usage worries me. I still cannot explain nor justify it. But the CPU overhead is certainly not of any concern in _normal_ usage scenarios, is it? If anything that might be a good test case for the newton-raphson pack lookup idea. Nicolas -
I agree with that reasoning. We did not do paranoid in git-pull
long after we introduced the .keep thing anyway, so I do not
think the following patch is even needed, but I am throwing it
out just for discussion.
diff --git a/fetch-pack.c b/fetch-pack.c
index 06f4aec..c687f9f 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -522,6 +522,7 @@ static int get_pack(int xd[2])
if (do_keep) {
*av++ = "index-pack";
+ *av++ = "--paranoid";
*av++ = "--stdin";
if (!quiet && !no_progress)
*av++ = "-v";
-
1) None of the objects in a pack should exist in the local repo when fetching, meaning that the paranoia code should not be executed normally. 2) Running index-pack on a pack _inside_ a repository is a dubious thing to do with questionable usefulness already. 3) It is unefficient to run pack-objects with --stdout just to feed the result to index-pack afterwards while repack-objects can create the index itself, which is the source of this discussion. 4) I invite you to read the commit log for 8685da42561 where the _perception_ of GIT's security is discussed which led to the paranoia check, and sometimes the perception is more valuable than the reality, especially when it is free. Therefore Linus' patch and this one are working around the wrong issue as described in (3) IMHO. What could be done instead, if really really needed, is to have the paranoia test be made conditional on index-pack --stdin instead. But please no bogus extra switches pretty please. Nicolas -
Some trivial timings for indexing just the kernel pack.. Without --paranoid: 24.61user 2.16system 0:27.04elapsed 99%CPU 0major+14120minor pagefaults With --paranoid: 42.74user 3.04system 0:46.36elapsed 98%CPU 0major+72768minor pagefaults so it's a noticeable CPU issue, but it's even more noticeable in memory usage (55MB vs 284MB - pagefaults give a good way to look at how much memory really got allocated for the process). All that extra memory is just for SHA1 commit ID information. Now, clearly the usage scenario here is a big odd (ie the case where we have all the objects already), so in that sense this is very much a worst-case situation, and you simply shouldn't *do* something like this, but at the same time, I'm just not convinced a very theoretical SHA1 collision check is worth it. Btw, even if we don't have any of the objects, if you have tons and tons of objects and do a "git pull", just the *lookup* of the nonexistent objects will be expensive: first we won't find it in any pack, then we'll look at the loose objects, and then we'll look int he pack *again* due to the race avoidance. So looking up nonexistent objects is actually pretty expensive. In fact, "--paranoid" takes one second more for me even totally outside of a git repository, just because we waste so much time trying to look up non-existent object files ;) Linus -
I don't see where that might be. The only thing that the paranoia check triggers is: foo = read_sha1_file(blah); memcmp(foo with bar); free(foo); So where is that commit ID information actually stored when using Not if you consider that it is performed _while_ receiving (and waiting for) the pack data over the net in the normal case. Nicolas -
That might just be the mmap code in sha1_file.c. We are mmaping the existing packfile, to fetch the object. That counts as virtual memory. ;-) -- Shawn. -
I've got the numbers: it uses much more memory when doing even failing lookups, ie: With --paranoid: 5583 minor pagefaults (21MB) Without --paranoid: 2957 minor pagefaults (11MB) (remember, this was *just* the git pack, not the kernel pack) It could be as simple as just the index file itself. That's 11MB for the kernel. Imagine if the index file was 20 times bigger - 200MB of memory paged in with bad access patterns just for unnecessary lookups. Running valgrind shows no leak at all without --paranoid. With --paranoid, there's some really trivial stuff (the "packed_git" structure etc, so I ..which is why I think it makes sense for "pull" to be paranoid. I just don't think it makes sense to be paranoid *all* the time, since it's clearly expensive. Linus -
Again that presumes you have to page in the whole index, which should not happen when pulling (much smaller packs) and with a better lookup Make it conditionnal on --stdin then. This covers all cases where we really want the secure thing to happen, and the --stdin case already perform the atomic rename-and-move thing when the pack is fully indexed. Nicolas -
Repacking objects in a repository uses pack-objects without using index-pack, as you suggested Chris. Is there a sane usage of index-pack that does not use --stdin? I do not think of any. If there isn't, the "conditional on --stdin" suggestion means we unconditionally do the secure thing for all the sane usage, and go unsecure for an insane usage that we do not really care about. If so, it seems to me that it would be the simplest not to touch the code at all, except that missing free(). Am I missing something? -
Nope. I agree with you completely. -- Shawn. -
Actually, since SHA1's are randomly distributed, together with the binary search, you probably *will* pull in the bulk of the index, even with a I don't care *what* it is conditional on, but your arguments suck. You claim that it's not a normal case to already have the objects, when it *is* a normal case for alternates, etc. I don't understand why you argue against hard numbers. You have none of your own. Linus -
Are hard numbers like 7% overhead (because right now that's all we have) really worth it against bad _perceptions_? Sure, the SHA1 collision attack is paranoia. But it is becoming increasingly *possible*. And when we only had unpack-objects on the receiving end of a fetch, you yourself bragged about the implied security of GIT in the presence of a SHA1 collision attack. Because let's admit it: when a SHA1 collision will happen it is way more probable to come on purpose than from pure accident. But as you said at the time, it is not a problem because GIT trusts local objects more than remote ones and incidentally unpack-objects doesn't overwrite existing objects. The keeping of fetched packs broke that presumption of trust towards local objects and it opened a real path for potential future attacks. Those attacks are still fairly theoretical of course. But for how _long_? Do we want GIT to be considered backdoor prone in a couple years from now just because we were obsessed by a 7% CPU overhead? I think we have much more to gain by playing it safe and being more secure and paranoid than trying to squeeze some CPU cycles out of an operation that is likely to ever be bounded by network speed for most people. And we _know_ that the operation can be optimized further anyway. So IMHO in this case hard numbers alone aren't the end of it. Not as long as they're reasonably low. And especially not for a command which is 1) rather infrequent and 2) not really interactive like git-log might be. Nicolas -
this is why -paranoid should be left on for network pulls, but having it on for the local uses means that the cost isn't hidden in the network limits isn't good. David Lang -
You never know what pull is networked (or should I say: remote enough to cause a collision). -
so leave it on for all pulls, but for other commands don't turn it on. remember that the command that linus ran into at the start of the thread wasn't a pull. David Lang -
Are you referring to this command $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack in this message? -
From: Linus Torvalds <torvalds@linux-foundation.org> Subject: git-index-pack really does suck.. Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT) Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org> (sorry, chomped the message). -
probably (I useually don't keep the mail after I read or reply to it) David Lang -
Well, then you should remember that the command linus ran into was pretty much about pull, nothing else. The quoted command was only to illustrate what 'git-pull' invokes internally. I do not think of any reason to use that command for cases other than 'git-pull'. What's the use case you have in mind to run that command outside of the context of git-pull? -
I guess I'm not remembering the thread accurately then (and/or am mising it up with a different thread). I thought that Linus had identified other cases that were impacted (something about proving that the object doesn't exist) David Lang -
If it actually stays at just 7% even with large repos (and the numbers from Chris seem to say that it doesn't get worse - in fact, it may be that the lookup gets relatively more efficient for a large repo thanks to the log(n) costs), I agree that 7% probably isn't worth worrying about when weighed against "guaranteed no SHA1 collision". Especially as long as you'd normally only hit it when your real performance issue is going to be the network. So especially if we can make sure that the *local* case is ok, where the network isn't going to be the bottleneck, I think we can/should do the paranoia. That's especially true as it is also the local case where the 7% has already been shown to be just the best case, with the worst case being many hundred percent (and memory use going up from 55M to 280M in one example), thanks to us actually *finding* the objects. Linus -
