Re: Git and GCC

To: <dberlin@...>
Cc: <ismail@...>, <gcc@...>, <git@...>
Date: Wednesday, December 5, 2007 - 10:52 pm

From: "Daniel Berlin" <dberlin@dberlin.org>

I find it ironic that you were even willing to write tools to
facilitate your hg based gcc workflow.  That really shows what your
thinking is on this matter, in that you're willing to put effort
towards making hg work better for you but you're not willing to expend
that level of effort to see if git can do so as well.

This is what really eats me from the inside about your dissatisfaction
with git.  Your analysis seems to be a self-fulfilling prophecy, and
that's totally unfair to both hg and git.
-
To: David Miller <davem@...>
Cc: <dberlin@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 7:57 am

Hi,



... I actually appreciate people complaining -- in the meantime.  It shows 
right away what group you belong to in the "Those who can do, do, those 
who can't, complain.".

You can see that very easily on the git list, or on the #git channel on 
irc.freenode.net.  There is enough data for a study which yearns to be 
written, that shows how quickly we resolve issues with people that are 
sincerely interested in a solution.

(Of course, on the other hand, there are also quite a few cases which show 
how frustrating (for both sides) and unfruitful discussions started by a 
complaint are.)

So I fully expect an issue like Daniel's to be resolved in a matter of 
minutes on the git list, if the OP gives us a chance.  If we are not even 
Cc'ed, you are completely right, she or he probably does not want the 
issue to be resolved.

Ciao,
Dscho

-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: David Miller <davem@...>, <dberlin@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 8:04 am

On Thursday 06 December 2007 13:57:06, Johannes Schindelin wrote:

Let's be fair about this, Ollie Wild already sent a mail about git-svn disk 
usage and there is no concrete solution yet, though it seems the bottleneck 
is known.

Regards,
ismail


-- 
Never learn by your mistakes, if you do you may never dare to try again.
-
To: David Miller <davem@...>
Cc: <ismail@...>, <gcc@...>, <git@...>
Date: Wednesday, December 5, 2007 - 11:47 pm

See, now you claim to know my thinking.
I went back to hg because the GIT's space usage wasn't even in the
ballpark, i couldn't get git-svn rebase to update the revs after the
initial import (even though i had properly used a rewriteRoot).

The size is clearly not just svn data, it's in the git pack itself.

I spent a long time working on SVN to reduce its space usage (repo
side and cleaning up the client side and giving a path to svn devs to
reduce it further), as well as ui issues, and I really don't feel like
having to do the same for GIT.

I'm tired of having to spend a large amount of effort to get my tools
to work.  If the community wants to find and fix the problem, i've
already said repeatedly i'll happily give over my repo, data,
whatever.  You are correct i am not going to spend even more effort
when i can be productive with something else much quicker.  The devil
i know (committing to svn) is better than the devil i don't (diving
into git source code and finding/fixing what is causing this space
blowup).
The python extension took me a few hours (< 4).
Oh?
You seem to be taking this awfully personally.
I came into this completely open minded. Really, I did (i'm sure
you'll claim otherwise).
GIT people told me it would work great and i'd have a really small git
repo and be able to commit back to svn.
I tried it.
It didn't work out.
It doesn't seem to be usable for whatever reason.
I'm happy to give details, data, whatever.

I made the engineering decision that my effort would be better spent
doing something I knew i could do quickly (make hg commit back to svn
for my purposes) than trying to improve larger issues in GIT (UI and
space usage).  That took me a few hours, and I was happy again.

I would have been incredibly happy to have git just have come up with
a 400 meg gcc repository, and to be happily committing away from
git-svn to gcc's repository  ...
But it didn't happen.
So far, you have yet to actually do anything but incorrectly tell me
what I am thi...
To: Daniel Berlin <dberlin@...>
Cc: David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 12:25 am

I fought with this a few months ago when I did my own clone of gcc svn.
My bad for only discussing this on #git at the time.  Should have put
this to the list as well.

If anyone recalls my report was something along the lines of
git gc --aggressive explodes pack size.

git repack -a -d --depth=100 --window=100 produced a ~550MB packfile;
immediately afterwards, a git gc --aggressive produced a 1.5G packfile.

This was for all branches/tags, not just trunk like Daniel's repo.

The best theory I had at the time was that the gc doesn't find as good
deltas or doesn't allow the same delta chain depth and so generates a 
new object in the pack, rather than reusing a good delta it already has
in the well-packed pack.

Cheers,

Harvey

-
To: Harvey Harrison <harvey.harrison@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 12:54 am

Yes, --aggressive is generally a bad idea. I think we should remove it or 
at least fix it. It doesn't do what the name implies, because it actually 
throws away potentially good packing, and re-does it all from a clean 
slate.

That said, it's totally pointless for a person who isn't a git proponent 
to do an initial import, and in that sense I agree with Daniel: he 
shouldn't waste his time with tools that he doesn't know or care about, 
since there are people who *can* do a better job, and who know what they 
are doing, and understand and like the tool.

While you can do a half-assed job with just mindlessly running "git 
svnimport" (which is deprecated these days) or "git svn clone" (better), 
the fact is, to do a *good* import does likely mean spending some effort 
on it. Trying to make the user names / emails to be better with a mailmap, 
for example. 

[ By default, for example, "git svn clone/fetch" seems to create those 
  horrible fake email addresses that contain the ID of the SVN repo in 
  each commit - I'm not talking about the "git-svn-id", I'm talking about 
  the "user@hex-string-goes-here" thing for the author. Maybe people don't 
  really care, but isn't that ugly as hell? I'd think it's worth it doing 
  a really nice import, spending some effort on it.

  But maybe those things come from the older CVS->SVN import, I don't 
  really know. I've done a few SVN imports, but I've done them just for 
  stuff where I didn't want to touch SVN, but just wanted to track some 
  project like libgpod. For things like *that*, a totally mindless "git 
  svn" thing is fine ]

Of course, that does require there to be git people in the gcc crowd who 
are motivated enough to do the proper import and then make sure it's 
up-to-date and hosted somewhere. If those people don't exist, I'm not sure 
there's much idea to it.

The point being, you cannot ask a non-git person to do a major git import 
for an actual switch-over. Yes, it *can* be as simple as just doing a

...
To: Linus Torvalds <torvalds@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 1:04 am

git svn does accept a mailmap at import time with the same format as the
cvs importer I think.  But for someone that just wants a repo to check
out this was easiest.  I'd be willing to spend the time to do a nicer
job if there was any interest from the gcc side, but I'm not that
invested (other than owing them for an often-used tool).

Harvey

-
To: <dberlin@...>
Cc: <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 12:20 am

From: "Daniel Berlin" <dberlin@dberlin.org>

And other users have shown much smaller metadata from a GIT import,
and yes those are including all of the repository history and branches
not just the trunk.
-
To: David Miller <davem@...>
Cc: <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 12:32 am

I followed the instructions in the tutorials.
I followed the instructions given to me by the people who created these.
I came up with a 1.5 gig pack file.
Do you want to help, or do you want to argue with me?
Right now it sounds like you are trying to blame me or make it look
like i did something wrong.

You are of course, welcome to try it yourself.
I can give you the exact commands I gave, and with git
1.5.3.7, it will give you a 1.5 gig pack file.
-
To: <dberlin@...>
Cc: <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 12:48 am

From: "Daniel Berlin" <dberlin@dberlin.org>

Several people replied in this thread showing what options can lead to
smaller pack files.

They also listed what the GIT limitations are that would affect the
kind of work you are doing, which seemed to mostly deal with the high
space cost of branching and tags when converting to/from SVN repos.
-
To: David Miller <davem@...>
Cc: <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 1:11 am

Actually, one person did, but that's okay, let's assume it was several.
I am currently trying Harvey's options.

I asked about using the pre-existing repos so i didn't have to do
this, but they were all
1. Done using read-only imports or
2. Don't contain full history
(IE the one that contains full history that is often posted here was

Actually, it turns out that git-gc --aggressive does this dumb thing
to pack files sometimes regardless of whether you converted from an
SVN repo or not.
-
To: Daniel Berlin <dberlin@...>
Cc: David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 2:09 am

[Empty message]
To: Linus Torvalds <torvalds@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, Git List <git@...>
Date: Thursday, December 6, 2007 - 3:12 pm

I'd like to learn more about that.  Can someone point me to
more documentation on it?  In the absence of that,
perhaps a pointer to the source code that implements it?

I guess one question I posit is, would it be more accurate
to think of this as a "delta net" in a weighted graph rather
than a "delta chain"?

Thanks,
jdl


-
To: Jon Loeliger <jdl@...>
Cc: Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, Git List <git@...>
Date: Thursday, December 6, 2007 - 4:04 pm

See Documentation/technical/pack-heuristics.txt,
but the document predates and does not talk about delta
reusing, which was covered here:


Yes.
-
To: Jon Loeliger <jdl@...>
Cc: Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, Git List <git@...>
Date: Thursday, December 6, 2007 - 5:02 pm

A somewhat funny thing about this is ...

$ git show --stat --summary b116b297
commit b116b297a80b54632256eb89dd22ea2b140de622
Author: Jon Loeliger <jdl@jdl.com>
Date:   Thu Mar 2 19:19:29 2006 -0600

    Added Packing Heursitics IRC writeup.
    
    Signed-off-by: Jon Loeliger <jdl@jdl.com>
    Signed-off-by: Junio C Hamano <junkio@cox.net>

 Documentation/technical/pack-heuristics.txt |  466 +++++++++++++++++++++++++++
 1 files changed, 466 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/technical/pack-heuristics.txt
-
To: Junio C Hamano <gitster@...>
Cc: Jon Loeliger <jdl@...>, Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, Git List <git@...>
Date: Thursday, December 6, 2007 - 6:26 pm

Ah, fishing for compliments.  The cookie baking season...

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To: David Kastrup <dak@...>
Cc: Junio C Hamano <gitster@...>, Jon Loeliger <jdl@...>, Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, Git List <git@...>
Date: Thursday, December 6, 2007 - 6:38 pm

Indeed.  Here are some really good & sweet recipes (IMHO).

http://www.xenotime.net/linux/recipes/


---
~Randy
Features and documentation: http://lwn.net/Articles/260136/
-
To: Jon Loeliger <jdl@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, Git List <git@...>
Date: Thursday, December 6, 2007 - 3:39 pm

Well, in a very real sense, what the delta code does is:
 - just list every single object in the whole repository
 - walk over each object, trying to find another object that it can be 
   written as a delta against
 - write out the result as a pack-file

That's simplified: we may not walk _all_ objects, for example: only a 
global repack does that (and most pack creations are actually for pushing 
and pulling between two repositories, so we only walk the objects that are 
in the source but not the destination repository).

The interesting phase is the "walk each object, try to find a delta" part. 
In particular, you don't want to try to find a delta by comparing each 
object to every other object out there (that would be O(n^2) in objects, 
and with a fairly high constant cost too!). So what it does is to sort the 
objects by a few heuristics (type of object, base name that object was 
found as when traversing a tree and size, and how recently it was found in 
the history).

And then over that sorted list, it tries to find deltas between entries 
that are "close" to each other (and that's where the "--window=xyz" thing 
comes in - it says how big the window is for objects being close. A 
smaller window generates somewhat less good deltas, but takes a lot less 
effort to generate).

The source is in git/builtin-pack-objects.c, with the core of it being

 - try_delta() - try to generate a *single* delta when given an object 
   pair.

 - find_deltas() - do the actual list traversal

 - prepare_pack() and type_size_sort() - create the delta sort list from 
   the list of objects.
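
The sort-then-window search described above can be sketched roughly as follows. This is a toy model and not git's implementation: the `Obj` record, the `toy_delta_size` cost function (which only credits a shared prefix and suffix) and the exact sort key are illustrative assumptions, and the delta depth limit is not modelled at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: str     # hypothetical object id
    kind: str    # "blob", "tree", "commit", ...
    name: str    # base name the object was found under
    data: bytes  # object contents


def toy_delta_size(base: bytes, target: bytes) -> int:
    """Bytes of `target` not covered by a prefix/suffix shared with `base`."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and base[-1 - suffix] == target[-1 - suffix]:
        suffix += 1
    return len(target) - prefix - suffix


def find_deltas(objects, window=10):
    """Map each object id to (chosen base id or None, stored size)."""
    # Sort so that objects likely to be similar end up next to each other.
    ordered = sorted(objects, key=lambda o: (o.kind, o.name, -len(o.data)))
    chosen = {}
    for i, target in enumerate(ordered):
        # Default: store the object whole.
        best_base, best_size = None, len(target.data)
        # Only the `window` preceding entries are considered as bases.
        for base in ordered[max(0, i - window):i]:
            if base.kind != target.kind:
                continue
            size = toy_delta_size(base.data, target.data)
            if size < best_size:
                best_base, best_size = base.oid, size
        chosen[target.oid] = (best_base, best_size)
    return chosen
```

Because a base is only ever taken from earlier in the sorted list, the chosen deltas cannot form a cycle. A larger `window` means more candidates per object and correspondingly more work.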


It's certainly not a simple chain, it's more of a set of acyclic directed 
graphs in the object list. And yes, it's weighted by the size of the delta 
between objects, and the optimization problem is kind of akin to finding 
the smallest spanning tree (well, forest - since you do *not* want to 
create one large graph, you also want to make the individual trees shallow 
enough that you don't h...
To: Linus Torvalds <torvalds@...>
Cc: Jon Loeliger <jdl@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, Ismail Donmez <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 8:29 pm

Hmmm...

I think that these two problems (find minimal spanning forest with
limited depth and traverse graph) with the additional constraint to
avoid calculating weights / avoid calculating whole graph would be
a good problem to present at CompSci course.

Just a thought...
-- 
Jakub Narebski
Poland
ShadeHawk on #git
-
To: Linus Torvalds <torvalds@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 2:24 pm

No disrespect is meant by this reply.  I am just curious (and I am
probably misunderstanding something)..  Why remove all of the
documentation entirely?  Wouldn't it be better to just document it
more thoroughly?  I thought you did a fine job in this post in
explaining its purpose, when to use it, when not to, etc.  Removing
the documentation seems counter-intuitive when you've already gone to
the trouble of creating good documentation here in this post.
-
To: NightStrike <nightstrike@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 2:45 pm

Well, part of it is that I don't think "--aggressive" as it is implemented 
right now is really almost *ever* the right answer. We could change the 
implementation, of course, but generally the right thing to do is to not 
use it (tweaking the "--window" and "--depth" manually for the repacking 
is likely the more natural thing to do).

The other part of the answer is that, when you *do* want to do what that 
"--aggressive" tries to achieve, it's such a special case event that while 
it should probably be documented, I don't think it should necessarily be 
documented where it is now (as part of "git gc"), but as part of a much 

I'm so used to writing emails, and I *like* trying to explain what is 
going on, so I have no problems at all doing that kind of thing. However, 
trying to write a manual or man-page or other technical documentation is 
something rather different.

IOW, I like explaining git within the _context_ of a discussion or a 
particular problem/issue. But documentation should work regardless of 
context (or at least set it up), and that's the part I am not so good at.

In other words, if somebody (hint hint) thinks my explanation was good and 
readable, I'd love for them to try to turn it into real documentation by 
editing it up and creating enough context for it! But I'm not personally 
very likely to do that. I'd just send Junio the patch to remove a 
misleading part of the documentation we have.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 1:36 am

hehe.. I'd love to, actually.  I can work on it next week.
-
To: Linus Torvalds <torvalds@...>
Cc: David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 2:04 pm

I worked on Monotone and other systems that use object stores. for a
little while :)
In particular, I believe GIT's original object store was based on

Sure. SVN actually supports this (surprisingly), it just never happens
to choose delta bases that aren't related by ancestry.  (IE it would
have absolutely no problem with you using random other parts of the
repository as delta bases, and i've played with it before).

I actually advocated we move towards an object store model, as
ancestry can be a  crappy way of approximating similarity when you
I gave this a try overnight, and it definitely helps a lot.

If your forever and a day is spent figuring out which deltas to use,
you can reduce this significantly.
If it is spent writing out the data, it's much harder. :)
-
To: Daniel Berlin <dberlin@...>
Cc: Linus Torvalds <torvalds@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 10:42 pm

I've updated the public mirror repo with the very-packed version.

People cloning it now should get the just over 300MB repo now.

git.infradead.org/gcc.git


Cheers,

Harvey

-
To: Harvey Harrison <harvey.harrison@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 11:01 pm

Side note: it might be interesting to compare timings for 
history-intensive stuff with and without this kind of very-packed 
situation.

The very density of a smaller pack-file might be enough to overcome the 
downsides (more CPU time to apply longer delta-chains), but regardless, 
real numbers talks, bullshit walks. So wouldn't it be nice to have real 
numbers?

One easy way to get real numbers for history would be to just time some 
reasonably costly operation that uses lots of history. Ie just do a 

	time git blame -C gcc/regclass.c > /dev/null

and see if the deeper delta chains are very expensive.

(Yeah, the above is pretty much designed to be the worst possible case for 
this kind of aggressive history packing, but I don't know if that choice 
of file to try to annotate is a good choice or not. I suspect that "git 
blame -C" with a CVS import is just horrid, because CVS commits tend to be 
pretty big and nasty and not as localized as we've tried to make things in 
the kernel, so doing the code copy detection is probably horrendously 
expensive)

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Harvey Harrison <harvey.harrison@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 12:06 am

jonsmirl@terra:/video/gcc$ time git blame -C gcc/regclass.c > /dev/null

real    1m21.967s
user    1m21.329s
sys     0m0.640s

The Mozilla repo is at least 50% larger than the gcc one. It took me
23 minutes to repack the gcc one on my $800 Dell. The trick to this is
lots of RAM and 64b. There is little disk IO during the compression
phase, everything is cached.

I have a 4.8GB git process with 4GB of physical memory. Everything
started slowing down a lot when the process got that big. Does git
really need 4.8GB to repack? I could only keep 3.4GB resident. Luckily
this happened at 95% completion. With 8GB of memory you should be able
to do this repack in under 20 minutes.

jonsmirl@terra:/video/gcc$ time git repack -a -d -f --depth=250 --window=250
real    22m54.380s
user    69m18.948s


-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Harvey Harrison <harvey.harrison@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 1:21 am

Well, I was also hoping for a "compared to not-so-aggressive packing" 
number on the same machine.. IOW, what I was wondering is whether there is 
a visible performance downside to the deeper delta chains in the 300MB 
pack vs the (less aggressive) 500MB pack.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Harvey Harrison <harvey.harrison@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 3:08 am

Same machine with a default pack

jonsmirl@terra:/video/gcc/.git/objects/pack$ ls -l
total 2145716
-r--r--r-- 1 jonsmirl jonsmirl   23667932 2007-12-07 02:03 pack-bd163555ea9240a7fdd07d2708a293872665f48b.idx
-r--r--r-- 1 jonsmirl jonsmirl 2171385413 2007-12-07 02:03 pack-bd163555ea9240a7fdd07d2708a293872665f48b.pack
jonsmirl@terra:/video/gcc/.git/objects/pack$

Delta lengths have virtually no impact. The bigger pack file causes
more IO which offsets the increased delta processing time.

One of my rules is smaller is almost always better. Smaller eliminates
IO and helps with the CPU cache. It's like the kernel being optimized
for size instead of speed ending up being  faster.

time git blame -C gcc/regclass.c > /dev/null
real    1m19.289s
user    1m17.853s


-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Linus Torvalds <torvalds@...>, Harvey Harrison <harvey.harrison@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 3:36 pm

I can confirm this.

I just did a repack keeping the default depth of 50 but with window=100 
instead of the default of 10, and the pack shrunk from 2171385413 bytes 
down to 410607140 bytes.

So our default window size is definitely not adequate for the gcc repo.

OTOH, I recall tytso mentioning something about not having much return 
on  a bigger window size in his tests when he proposed to increase the 
default delta depth to 50.  So there is definitely some kind of threshold 
at which point the increased window size stops being advantageous wrt 
the number of cycles involved, and we should find a way to correlate it 
to the data set to have a better default window size than the current 
fixed default.
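
To put the two pack sizes quoted above side by side, a small worked calculation helps. The function name `shrink_stats` is made up for this illustration; the byte counts are the ones given in the message.

```python
def shrink_stats(before_bytes: int, after_bytes: int):
    """Return (new/old ratio, fraction saved, shrink factor)."""
    ratio = after_bytes / before_bytes
    return ratio, 1.0 - ratio, before_bytes / after_bytes


# Pack sizes from the message: window=10 (default) vs window=100.
ratio, saved, factor = shrink_stats(2171385413, 410607140)
print(f"new pack is {ratio:.1%} of the old size "
      f"({saved:.1%} saved, {factor:.1f}x smaller)")
```

In other words, the window=100 pack is a bit under a fifth of the size of the default one.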


Nicolas
-
To: Jon Smirl <jonsmirl@...>
Cc: Linus Torvalds <torvalds@...>, Harvey Harrison <harvey.harrison@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 12:21 am

Probably you have too many cached delta results.  By default, every 
delta smaller than 1000 bytes is kept in memory until the write phase.  
Try using pack.deltacachesize = 256M or lower, or try disabling this 
caching entirely with pack.deltacachelimit = 0.


Nicolas
-
To: Daniel Berlin <dberlin@...>
Cc: David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 2:29 pm

Yes and no. 

Monotone does what git does for the blobs. But there is a big difference 
in how git then does it for everything else too, ie trees and history. 
Trees being in that object store in particular is very important, and one 
of the biggest deals for deltas (actually, for two reasons: most of the 
time they don't change AT ALL if some subdirectory gets no changes and you 
don't need any delta, and even when they do change, it's usually going to 

It's almost all about figuring out the delta. Which is why *not* using 
"-f" (or "--aggressive") is such a big deal for normal operation, because 
then you just skip it all.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>, <gitster@...>
Date: Thursday, December 6, 2007 - 8:03 am

The default was not to change the window or depth at all.  As suggested
by Jon Smirl, Linus Torvalds and others, default to

	--window=250 --depth=250

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---

	On Wed, 5 Dec 2007, Linus Torvalds wrote:

	> On Thu, 6 Dec 2007, Daniel Berlin wrote:
	> > 
	> > Actually, it turns out that git-gc --aggressive does this dumb 
	> > thing to pack files sometimes regardless of whether you 
	> > converted from an SVN repo or not.
	> 
	> Absolutely. git --aggressive is mostly dumb. It's really only 
	> useful for the case of "I know I have a *really* bad pack, and I 
	> want to throw away all the bad packing decisions I have done".
	>
	> [...]
	> 
	> So the equivalent of "git gc --aggressive" - but done *properly* 
	> - is to do (overnight) something like
	> 
	> 	git repack -a -d --depth=250 --window=250

	How about this, then?
	
 builtin-gc.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/builtin-gc.c b/builtin-gc.c
index 799c263..c6806d3 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -23,7 +23,7 @@ static const char * const builtin_gc_usage[] = {
 };
 
 static int pack_refs = 1;
-static int aggressive_window = -1;
+static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 20;
 
@@ -192,6 +192,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 
 	if (aggressive) {
 		append_option(argv_repack, "-f", MAX_ADD);
+		append_option(argv_repack, "--depth=250", MAX_ADD);
 		if (aggressive_window > 0) {
 			sprintf(buf, "--window=%d", aggressive_window);
 			append_option(argv_repack, buf, MAX_ADD);
-- 
1.5.3.7.2157.g9598e

-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>, <gitster@...>
Date: Thursday, December 6, 2007 - 11:30 am

Wow

/usr/bin/time git repack -a -d -f --window=250 --depth=250


23266.37user 581.04system 7:41:25elapsed 86%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (419835major+123275804minor)pagefaults 0swaps

-r--r--r-- 1 hharrison hharrison  29091872 2007-12-06 07:26 pack-1d46ca030c3d6d6b95ad316deb922be06b167a3d.idx
-r--r--r-- 1 hharrison hharrison 324094684 2007-12-06 07:26 pack-1d46ca030c3d6d6b95ad316deb922be06b167a3d.pack


That extra delta depth really does make a difference.  Just over a
300MB pack in the end, for all gcc branches/tags as of last night.

Cheers,

Harvey

-
To: Harvey Harrison <harvey.harrison@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, Git Mailing List <git@...>, Junio C Hamano <gitster@...>
Date: Thursday, December 6, 2007 - 12:19 pm

Heh. And this is why you want to do it exactly *once*, and then just 

But yeah, especially if you allow longer delta chains, the end result can 
be much smaller (and what makes the one-time repack more expensive is the 
window size, not the delta chain - you could make the delta chains longer 
with no cost overhead at packing time)

HOWEVER. 

The longer delta chains do make it potentially much more expensive to then 
use old history. So there's a trade-off. And quite frankly, a delta depth 
of 250 is likely going to cause overflows in the delta cache (which is 
only 256 entries in size *and* it's a hash, so it's going to start having 
hash conflicts long before hitting the 250 depth limit).
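
As a rough illustration of why deeper chains cost more when reading old history (this sketch is not part of the original message), consider a deliberately pessimistic model: one linear delta chain, no delta cache at all, and every object in the chain read once. All function names here are hypothetical.

```python
def deltas_to_read(depth: int) -> int:
    """Delta applications needed to rebuild one object stored at `depth`.

    Depth 0 is the full base object; depth d sits d deltas away from it.
    """
    return depth


def cost_to_read_whole_chain(max_depth: int) -> int:
    """Total delta applications to read every object in one chain,
    assuming nothing is cached between reads."""
    return sum(deltas_to_read(d) for d in range(max_depth + 1))


for limit in (50, 250):
    print(limit, cost_to_read_whole_chain(limit))
```

Under these assumptions the total grows quadratically with the depth limit, which is the cost a delta base cache is meant to hide. Real access patterns and a working cache will look considerably better than this upper bound.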

So when I said "--depth=250 --window=250", I chose those numbers more as 
an example of extremely aggressive packing, and I'm not at all sure that 
the end result is necessarily wonderfully usable. It's going to save disk 
space (and network bandwidth - the delta's will be re-used for the network 
protocol too!), but there are definitely downsides too, and using long 
delta chains may simply not be worth it in practice.

(And some of it might just want to have git tuning, ie if people think 
that long deltas are worth it, we could easily just expand on the delta 
hash, at the cost of some more memory used!)

That said, the good news is that working with *new* history will not be 
affected negatively, and if you want to be _really_ sneaky, there are ways 
to say "create a pack that contains the history up to a version one year 
ago, and be very aggressive about those old versions that we still want to 
have around, but do a separate pack for newer stuff using less aggressive 
parameters"

So this is something that can be tweaked, although we don't really have 
any really nice interfaces for stuff like that (ie the git delta cache 
size is hardcoded in the sources and cannot be set in the config file, and 
the "pack old history more aggressively" involves some manual scripting 
and k...
To: Harvey Harrison <harvey.harrison@...>
Cc: Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>, <gitster@...>
Date: Thursday, December 6, 2007 - 11:56 am

Hi,


Wow.

Ciao,
Dscho
-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>, <gitster@...>
Date: Thursday, December 6, 2007 - 10:22 am

Well, this will explode on many quite reasonably sized systems. This
should also use a memory limit that could be auto-guessed from the
system's total physical memory (50% of the actual memory could be a good
idea, for example).

  On very large repositories, e.g. the Linux kernel, using that swaps
like hell on a machine with 1GB of RAM and almost nothing running on it
(less than 200MB of RAM actually used).
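
The "auto-guess from physical memory" idea could look something like the sketch below. It is only an illustration: the helper names are invented, the `sysconf` names are the ones available on Linux, and the thread itself does not mention `--window-memory`. That option is documented for `git repack` in current git releases and bounds only the delta window, not the memory of the whole process.

```python
import os


def physical_memory_bytes() -> int:
    """Total physical memory, via POSIX sysconf names available on Linux."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def suggested_limit_bytes(total_bytes: int, fraction: float = 0.5) -> int:
    """Memory budget as a fraction of physical memory (50% by default)."""
    return int(total_bytes * fraction)


def repack_args(total_bytes: int, window: int = 250, depth: int = 250):
    """Build a repack command line with a window memory cap in MiB."""
    limit_mib = max(1, suggested_limit_bytes(total_bytes) // (1024 * 1024))
    return [
        "git", "repack", "-a", "-d", "-f",
        f"--window={window}",
        f"--depth={depth}",
        f"--window-memory={limit_mib}m",
    ]
```

Calling `repack_args(physical_memory_bytes())` on the machine that will do the repack yields the argument list; on a 1GB machine the cap comes out at 512m.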
To: Pierre Habouzit <madcoder@...>
Cc: Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>, <gitster@...>
Date: Thursday, December 6, 2007 - 11:55 am

Hi,


Yes.

However, I think that --aggressive should be aggressive, and if you decide 
to run it on a machine which lacks the muscle to be aggressive, well, you 
should have known better.

The upside: if you run this on a strong machine and clone it to a weak 
machine, you'll still have the benefit of a small pack (and you should 
mark it as .keep, too, to keep the benefit...)

Ciao,
Dscho

-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Pierre Habouzit <madcoder@...>, Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>, <gitster@...>
Date: Thursday, December 6, 2007 - 1:05 pm

That's a rather cheap shot.  "you should have known better" than
expecting to be able to use a documented command and option because the
git developers happened to have a nicer machine...

_How_ is one supposed to have known better?

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>, <gitster@...>
Date: Thursday, December 6, 2007 - 9:42 am

I'd also suggest adding a comment in the man pages that this should
only be done rarely, and that it can potentially take a *long* time
(i.e., overnight) for big repositories, and in general it's not worth
the effort to use --aggressive.

Apologies to Linus and to the gcc folks, since I was the one who
originally coded up gc --aggressive, and at the time my intent was
"rarely does it make sense, and it may take a long time".  The reason
why I didn't make the default --window and --depth larger is because
at the time the biggest repo I had easy access to was the Linux
kernel's, and there you rapidly hit diminishing returns at much
smaller numbers, so there was no real point in using --window=250
--depth=250.

Linus later pointed out that what we *really* should do at some
point is change repack -f to potentially retry to find a better
delta, but to reuse the existing delta if it was no worse.  That
automatically does the right thing in the case where you had
previously done a repack with --window=<large n> --depth=<large n>,
but then later try using "gc --aggressive", which ends up doing a worse
job and throwing away the information from the previous repack with
large window and depth sizes.  Unfortunately no one ever got around to
implementing that.

Regards,

						- Ted
-
To: Theodore Tso <tytso@...>
Cc: Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>, <gitster@...>
Date: Thursday, December 6, 2007 - 10:15 am

I did start looking at it, but there are subtle issues to consider, such 
as making sure not to create delta loops.  Currently this is avoided by 
never involving already reused deltas in new delta chains, except for 
edge base objects.
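
To make the delta-loop concern concrete: a loop appears when an object
would be stored as a delta against a base whose own chain leads back to
that object. The sketch below is not git's implementation; it is a
hypothetical check over a plain list of "object base" pairs, and the
name would_loop is invented here.

```shell
# would_loop OBJ BASE: stdin holds existing delta edges, one "object base"
# pair per line. Prints "loop" if storing OBJ as a delta against BASE would
# close a cycle, "ok" otherwise. Illustration only.
would_loop() {
    awk -v obj="$1" -v base="$2" '
        { parent[$1] = $2 }
        END {
            cur = base
            # A chain over NR edges has at most NR+1 nodes to visit.
            for (i = 0; i <= NR; i++) {
                if (cur == obj) { print "loop"; exit }
                if (!(cur in parent)) break
                cur = parent[cur]
            }
            print "ok"
        }'
}

# B is a delta against A, C against B; making A a delta against C loops.
printf 'B A\nC B\n' | would_loop A C    # prints: loop
```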

IOW, this requires some head scratching which I didn't have the time for 
so far.


Nicolas
-
To: Linus Torvalds <torvalds@...>
Cc: Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 3:49 am

Since I have the whole gcc repo locally I'll give this a shot overnight
just to see what can be done at the extreme end of things.

Harvey

-
To: Harvey Harrison <harvey.harrison@...>
Cc: Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 10:01 am

Don't forget to add -f as well.


Nicolas
-
To: Harvey Harrison <harvey.harrison@...>
Cc: Linus Torvalds <torvalds@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 4:11 am

When I tried this on a very large repo, at least one with some large files
in it, git quickly exceeded my physical memory and started thrashing the
machine.  I had good results with

  git config pack.deltaCacheSize 512m
  git config pack.windowMemory 512m

of course adjusting based on your physical memory.  I think changing the
windowMemory will affect the resulting compression, so changing these
ratios might get better compression out of the result.

If you're really patient, though, you could leave the unbounded window,
hope you have enough swap, and just let it run.
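
One way to adjust the limit to the machine, following the "50% of
physical memory" idea raised earlier in the thread, is sketched below.
It assumes a Linux /proc/meminfo and only prints the command rather than
running it; half_ram_mb is a made-up helper name. Current git documents
pack.windowMemory as a per-thread limit, so with several threads the
figure would need dividing further.

```shell
# half_ram_mb KB: print half of a kilobyte figure, expressed in megabytes.
half_ram_mb() {
    echo $(( $1 / 2 / 1024 ))
}

# Linux-only: MemTotal is reported in kB.
if [ -r /proc/meminfo ]; then
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    if [ -n "$mem_kb" ]; then
        # Print the suggested setting instead of changing any repository.
        echo "git config pack.windowMemory $(half_ram_mb "$mem_kb")m"
    fi
fi
```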

Dave
-
To: Daniel Berlin <dberlin@...>
Cc: David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 1:15 am

While you won't get the git svn metadata if you clone the infradead
repo, it can be recreated on the fly by git svn if you want to start
committing directly to gcc svn.

Harvey

-
To: Harvey Harrison <harvey.harrison@...>
Cc: David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 1:17 am

I will give this a try :)
-
To: Daniel Berlin <dberlin@...>
Cc: Harvey Harrison <harvey.harrison@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 2:47 am

Back when I was working on the Mozilla repository we were able to
convert the full 4GB CVS repository complete with all history into a
450MB pack file. That work is where the git-fastimport tool came from.
But it took a month of messing with the import tools to achieve this
and Mozilla still chose another VCS (mainly because of poor Windows
support in git).

Like Linus says, this type of command will yield the smallest pack file:
 git repack -a -d --depth=250 --window=250

I do agree that importing multi-gigabyte repositories is not a daily
occurrence nor a turn-key operation. There are significant issues when
translating from one VCS to another. The lack of global branch
tracking in CVS causes extreme problems on import. Hand editing of CVS
files also caused endless trouble.

The key to converting repositories of this size is RAM. 4GB minimum,
more would be better. git-repack is not multi-threaded. There were a
few attempts at making it multi-threaded but none were too successful.
If I remember right, with loads of RAM, a repack on a 450MB repository
was taking about five hours on a 2.8GHz Core2. But this is something
you only have to do once for the import. Later repacks will reuse the
original deltas.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Daniel Berlin <dberlin@...>, Harvey Harrison <harvey.harrison@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 3:15 am

Actually, Nicolas put quite a bit of work into multi-threading the
repack process; the results have been in master for some time, and will
be in the soon-to-be-released v1.5.4.

The downside is that the threading partitions the object space, so the
resulting size is not necessarily as small (but I don't know that
anybody has done testing on large repos to find out how large the
difference is).

-Peff
-
To: Jeff King <peff@...>
Cc: Jon Smirl <jonsmirl@...>, Daniel Berlin <dberlin@...>, Harvey Harrison <harvey.harrison@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 10:18 am

Quick guesstimate is in the 1% ballpark.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Jon Smirl <jonsmirl@...>, Daniel Berlin <dberlin@...>, Harvey Harrison <harvey.harrison@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 1:39 pm

Fortunately, we now have numbers. Harvey Harrison reported repacking the

I tried the threaded repack with pack.threads = 3 on a dual-processor
machine, and got:

  time git repack -a -d -f --window=250 --depth=250

  real    309m59.849s
  user    377m43.948s
  sys     8m23.319s

  -r--r--r-- 1 peff peff  28570088 2007-12-06 10:11 pack-1fa336f33126d762988ed6fc3f44ecbe0209da3c.idx
  -r--r--r-- 1 peff peff 339922573 2007-12-06 10:11 pack-1fa336f33126d762988ed6fc3f44ecbe0209da3c.pack

So it is about 5% bigger. What is really disappointing is that we saved
only about 20% of the time. I didn't sit around watching the stages, but
my guess is that we spent a long time in the single threaded "writing
objects" stage with a thrashing delta cache.
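
A rough way to read the "time" figures above is to divide CPU time
(user + sys) by wall-clock time, which gives the average number of busy
cores. The helper names below are invented for this sketch; it only does
arithmetic on the strings that "time" prints.

```shell
# to_seconds 309m59.849s -> 18599.849
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# avg_busy_cpus REAL USER SYS: (user + sys) / real, two decimals.
avg_busy_cpus() {
    real=$(to_seconds "$1")
    user=$(to_seconds "$2")
    sys=$(to_seconds "$3")
    awk -v r="$real" -v u="$user" -v s="$sys" \
        'BEGIN { printf "%.2f\n", (u + s) / r }'
}

# The run quoted above:
avg_busy_cpus 309m59.849s 377m43.948s 8m23.319s    # prints: 1.25
```

An average of about 1.25 busy cores on a dual-processor machine is
consistent with a long stretch of the run being effectively single
threaded, though it does not say which stage that was.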

-Peff
-
To: <peff@...>
Cc: <nico@...>, <jonsmirl@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 11:31 pm

From: Jeff King <peff@peff.net>

If someone can give me a good way to run this test case I can
have my 64-cpu Niagara-2 box crunch on this and see how fast
it goes and how much larger the resulting pack file is.
-
To: David Miller <davem@...>
Cc: <nico@...>, <jonsmirl@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 2:38 am

That would be fun to see. The procedure I am using is this:

# compile recent git master with threaded delta
cd git
echo THREADED_DELTA_SEARCH = 1 >>config.mak
make install

# get the gcc pack
mkdir gcc && cd gcc
git --bare init
git config remote.gcc.url git://git.infradead.org/gcc.git
git config remote.gcc.fetch \
  '+refs/remotes/gcc.gnu.org/*:refs/remotes/gcc.gnu.org/*'
git remote update

# make a copy, so we can run further tests from a known point
cd ..
cp -a gcc test

# and test multithreaded large depth/window repacking
cd test
git config pack.threads 4
time git repack -a -d -f --window=250 --depth=250

-Peff
-
To: Jeff King <peff@...>
Cc: David Miller <davem@...>, <nico@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 3:10 am

64 threads with 64 CPUs, if they are multicore you want even more.


-- 
Jon Smirl
jonsmirl@gmail.com
-
To: <jonsmirl@...>
Cc: <peff@...>, <nico@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 8:53 am

From: "Jon Smirl" <jonsmirl@gmail.com>


Didn't work very well, even with the one-liner patch for
chunk_size it died.  I think I need to build 64-bit
binaries.

davem@huronp11:~/src/GCC/git/test$ time git repack -a -d -f --window=250 --depth=250
Counting objects: 1190671, done.
fatal: Out of memory? mmap failed: Cannot allocate memory

real    58m36.447s
user    289m8.270s
sys     4m40.680s
davem@huronp11:~/src/GCC/git/test$ 

While it did run the load was anywhere between 5 and 9, although it
did create 64 threads, and the size of the process was about 3.2GB.
This may be in part why it wasn't able to use all 64 threads
effectively.  Like I said it seemed to have 9 active at best, at any
one time, most of the time only 4 or 5 were busy doing anything.

Also I could end up being performance limited by SHA, it's not very
well tuned on Sparc.  It's been on my TODO list to code up the crypto
unit support for Niagara-2 in the kernel, then work with Herbert Xu on
the userland interfaces to take advantage of that in things like
libssl.  Even a better C/asm version would probably improve GIT
performance a bit.

Is SHA a significant portion of the compute during these repacks?
I should run oprofile...
-
To: <jonsmirl@...>
Cc: <peff@...>, <nico@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Monday, December 10, 2007 - 5:57 am

From: David Miller <davem@davemloft.net>

While doing the initial object counting, most of the time is spent in
lookup_object(), memcmp() (via hashcmp()), and inflate().  I tried to
see if I could do some tricks on sparc with the hashcmp() but the sha1
pointers are very often not even 4 byte aligned.

I suspect lookup_object() could be improved if it didn't use a hash
table without chaining, but I can see why 'struct object' size is a
concern and thus why things are done the way they are.

samples  %        app name                 symbol name
504      13.7517  libc-2.6.1.so            memcmp
386      10.5321  libz.so.1.2.3.3          inflate
288       7.8581  git                      lookup_object
248       6.7667  libz.so.1.2.3.3          inflate_fast
201       5.4843  libz.so.1.2.3.3          inflate_table
175       4.7749  git                      decode_tree_entry
 ...

Deltifying is 94% consumed by create_delta(), the rest is completely
in the noise.

samples  %        app name                 symbol name
10581    94.8373  git                      create_delta
181       1.6223  git                      create_delta_index
72        0.6453  git                      prepare_pack
55        0.4930  libc-2.6.1.so            loop
34        0.3047  libz.so.1.2.3.3          inflate_fast
33        0.2958  libc-2.6.1.so            _int_malloc
22        0.1972  libshadow.so             shadowUpdatePacked
21        0.1882  libc-2.6.1.so            _int_free
19        0.1703  libc-2.6.1.so            malloc
 ...
-
To: David Miller <davem@...>
Cc: <jonsmirl@...>, <peff@...>, <nico@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 1:23 pm

I doubt you can use the hardware support. Kernel-only hw support is 
inherently broken for any sane user-space usage, the setup costs are just 
way way too high. To be useful, crypto engines need to support direct user 
space access (ie a regular instruction, with all state being held in 

SHA1 is almost totally insignificant on x86. It hardly shows up. But we 
have a good optimized version there.

zlib tends to be a lot more noticeable (especially the uncompression: it 
may be faster than compression, but it's done _so_ much more that it 
totally dominates).

			Linus
-
To: <torvalds@...>
Cc: <jonsmirl@...>, <peff@...>, <nico@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 9:55 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Unfortunately they are hypervisor calls, and you have to give
the thing physical addresses for the buffer to work on, so
letting userland get at it directly isn't currently doable.

I still believe that there are cases where userland can take
advantage of in-kernel crypto devices, such as when we are
streaming the data into the kernel anyways (for a write()
or sendmsg()) and the user just wants the transformation to
be done on that stream.

As a specific case, hardware crypto SSL support works quite
well for sendmsg() user packet data.  And this the kind of API


zlib is really hard to optimize on Sparc, I've tried numerous times.
Actually compress is the real cycle killer, and in that case the inner
loop wants to dereference 2-byte shorts at a time but they are
unaligned half of the time, and any check for alignment nullifies
the gains of avoiding the two byte loads.

Uncompress I don't think is optimized at all on any platform with
asm stuff like the compress side is.  It's a pretty straightforward
transformation and the memory accesses dominate the overhead.

I'll do some profiling to see what might be worth looking into.
-
To: Linus Torvalds <torvalds@...>
Cc: David Miller <davem@...>, <jonsmirl@...>, <peff@...>, <nico@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 4:26 pm

Have you considered alternatives, like:
http://www.oberhumer.com/opensource/ucl/
-- 
Giovanni Bajo
-
To: Giovanni Bajo <rasky@...>
Cc: Linus Torvalds <torvalds@...>, David Miller <davem@...>, <jonsmirl@...>, <peff@...>, <nico@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 6:14 pm

<quote>
  As compared to LZO, the UCL algorithms achieve a better compression
  ratio but *decompression* is a little bit slower. See below for some
  rough timings.
</quote>

It is uncompression speed that is more important, because it is used
much more often.

-- 
Jakub Narebski
ShadeHawk on #git

-
To: Jakub Narebski <jnareb@...>
Cc: Linus Torvalds <torvalds@...>, David Miller <davem@...>, <jonsmirl@...>, <peff@...>, <nico@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 7:14 pm

I know, but the point is not what is the fastest, but if it's fast
enough to get off the profiles. I think UCL is fast enough since it's
still times faster than zlib. Anyway, LZO is GPL too, so why not
consider it too? They are good libraries.
-- 
Giovanni Bajo

-
To: Giovanni Bajo <rasky@...>
Cc: Jakub Narebski <jnareb@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, <jonsmirl@...>, <peff@...>, <nico@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 7:33 pm

At worst, you could also use fastlz (www.fastlz.org), which is faster
than all of these by a factor of 4 (and compression wise, is actually
sometimes better, sometimes worse, than LZO).
-
To: Daniel Berlin <dberlin@...>
Cc: Giovanni Bajo <rasky@...>, Jakub Narebski <jnareb@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, <jonsmirl@...>, <peff@...>, <nico@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Saturday, December 8, 2007 - 8:00 am

Hi,


fastLZ is awfully short on details when it comes to a comparison of the 
resulting file sizes.

The only result I saw was that for the (single) example they chose, 
compressed size was 470MB as opposed to 361MB for zip's _fastest_ mode.

Really, that's not acceptable for me in the context of git.

Besides, if you change the compression algorithm you will have to add 
support for legacy clients to _recompress_ with libz.  Which most likely 
would make Sisyphos grin watching them servers.

Ciao,
Dscho

-
To: Jakub Narebski <jnareb@...>
Cc: Giovanni Bajo <rasky@...>, Linus Torvalds <torvalds@...>, David Miller <davem@...>, <jonsmirl@...>, <peff@...>, <nico@...>, <dberlin@...>, <harvey.harrison@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 7:04 pm

So why didn't we consider lzo then? It's much faster than zlib.

__Luke

  
-
To: Jeff King <peff@...>
Cc: Nicolas Pitre <nico@...>, Jon Smirl <jonsmirl@...>, Daniel Berlin <dberlin@...>, Harvey Harrison <harvey.harrison@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 2:35 pm

I don't think you spent all that much time writing the objects. That part 
isn't very intensive, it's mostly about the IO.

I suspect you may simply be dominated by memory-throughput issues. The 
delta matching doesn't cache all that well, and using two or more cores 
isn't going to help all that much if they are largely waiting for memory 
(and quite possibly also perhaps fighting each other for a shared cache? 
Is this a Core 2 with the shared L2?)

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Jeff King <peff@...>, Nicolas Pitre <nico@...>, Jon Smirl <jonsmirl@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 8:47 pm

Some interesting stats from the highly packed gcc repo.  The long chain
lengths very quickly tail off.  Over 60% of the objects have a chain
length of 20 or less.  If anyone wants the full list let me know.  I
also have included a few other interesting points, the git default
depth of 50, my initial guess of 100 and every 10% in the cumulative
distribution from 60-100%.

This shows the git default of 50 really isn't that bad, and after
about 100 it really starts to get sparse.  

Harvey

chain length:	objects	cumulative	cumulative %	total
1:	103817	103817	10.20%	1017922
2:	67332	171149	16.81%
3:	57520	228669	22.46%
4:	52570	281239	27.63%
5:	43910	325149	31.94%
6:	37520	362669	35.63%
7:	35248	397917	39.09%
8:	29819	427736	42.02%
9:	27619	455355	44.73%
10:	22656	478011	46.96%
11:	21073	499084	49.03%
12:	18738	517822	50.87%
13:	16674	534496	52.51%
14:	14882	549378	53.97%
15:	14424	563802	55.39%
16:	12765	576567	56.64%
17:	11662	588229	57.79%
18:	11845	600074	58.95%
19:	11694	611768	60.10%
20:	9625	621393	61.05%
34:	5354	719356	70.67%
50:	3395	785342	77.15%
60:	2547	815072	80.07%
100:	1644	898284	88.25%
113:	1292	917046	90.09%
158:	959	967429	95.04%
200:	652	997653	98.01%
219:	491	1008132	99.04%
245:	179	1017717	99.98%
246:	111	1017828	99.99%
247:	61	1017889	100.00%
248:	27	1017916	100.00%
249:	6	1017922	100.00%
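
A table of this shape can be rebuilt from "git verify-pack -v" output.
The sketch below is one possible way, not necessarily how the numbers
above were produced. It relies on the documented per-object line format,
where deltified objects carry seven fields with the chain depth in the
sixth; chain_histogram is a made-up name.

```shell
# chain_histogram: read `git verify-pack -v` output on stdin and print
# "depth: count cumulative cumulative%" for every depth that occurs.
chain_histogram() {
    awk 'NF == 7 && $6 ~ /^[0-9]+$/ {
             n[$6]++; total++
             if ($6 > max) max = $6
         }
         END {
             for (d = 1; d <= max; d++) {
                 if (!(d in n)) continue
                 cum += n[d]
                 printf "%d:\t%d\t%d\t%.2f%%\n", d, n[d], cum, 100 * cum / total
             }
         }'
}

# Typical use:
#   git verify-pack -v .git/objects/pack/pack-*.idx | chain_histogram
```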

-
To: Harvey Harrison <harvey.harrison@...>
Cc: Linus Torvalds <torvalds@...>, Jeff King <peff@...>, Nicolas Pitre <nico@...>, Jon Smirl <jonsmirl@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Monday, December 10, 2007 - 5:54 am

Do you have a way to know which files have the longest chains?

I have a suspicion that the ChangeLog* files are among them,
not only because they are, almost without exception, only modified
by prepending text to the previous version (and a fairly small amount
compared to the size of the file), and therefore the diff is simple
(a single hunk) so that the limit on chain depth is probably what
causes a new copy to be created. 

Besides that these files grow quite large and become some of the 
largest files in the tree, and at least one of them is changed 
for every commit. This leads again to many versions of fairly 
large files.

If this guess is right, this implies that most of the size gains
from longer chains come from having fewer copies of the ChangeLog*
files. From a performance point of view, it is rather favourable
since the differences are simple. This would also explain why
the window parameter has little effect.

	Regards,
	Gabriel
-
To: Gabriel Paubert <paubert@...>
Cc: Harvey Harrison <harvey.harrison@...>, Linus Torvalds <torvalds@...>, Jeff King <peff@...>, Jon Smirl <jonsmirl@...>, Daniel Berlin <dberlin@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Monday, December 10, 2007 - 11:35 am

With 'git verify-pack -v' you get the delta depth for each object.

My gcc repo is currently repacked with a max delta depth of 50, and 
a quick sample of those objects at the depth limit does indeed show the 
content of the ChangeLog file.  But I have occurrences of the root 
directory tree object too, and the "GCC machine description for IA-32" 
content as well.
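
Sampling the objects that sit at the depth limit could look roughly like
this. It is a sketch under the same assumption about the
"git verify-pack -v" line format (seven fields for deltified objects,
depth in the sixth); objects_at_depth is a hypothetical helper.

```shell
# objects_at_depth DEPTH: read `git verify-pack -v` output on stdin and
# print the ids of the objects whose delta chain depth equals DEPTH.
objects_at_depth() {
    awk -v d="$1" 'NF == 7 && $6 == d { print $1 }'
}

# Typical use, peeking at a few objects at depth 50:
#   git verify-pack -v .git/objects/pack/pack-*.idx |
#       objects_at_depth 50 | head -n 5 |
#       while read -r id; do git cat-file -p "$id" | head -n 3; done
#
# To see path names instead, look the ids up in
#   git rev-list --objects --all
```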

But yes, the really deep delta chains are most certainly going to 

Well, actually the window parameter does have big effects.  For instance 
the default of 10 is completely inadequate for the gcc repo, since 
changing the window size from 10 to 100 made the corresponding pack 
shrink from 2.1GB down to 400MB, with the same max delta depth.


Nicolas
-
To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, Jon Smirl <jonsmirl@...>, Daniel Berlin <dberlin@...>, Harvey Harrison <harvey.harrison@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Friday, December 7, 2007 - 3:31 am

It can get nasty with super-long deltas thrashing the cache, I think.
But in this case, I think it ended up being just a poor division of
labor caused by the chunk_size parameter using the quite large window

I think the chunk_size more or less explains it. I have had reasonable
success keeping both CPUs busy on similar tasks in the past (but with
smaller window sizes).

For reference, it was a Core 2 Duo; do they all share L2, or is there
something I can look for in /proc/cpuinfo?

-Peff
-
To: Linus Torvalds <torvalds@...>
Cc: Jeff King <peff@...>, Nicolas Pitre <nico@...>, Daniel Berlin <dberlin@...>, Harvey Harrison <harvey.harrison@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 2:55 pm

When I last looked at the code, the problem was in evenly dividing
the work. I was using a four core machine and most of the time one
core would end up with 3-5x the work of the lightest loaded core.
Setting pack.threads up to 20 fixed the problem. With a high number of
threads I was able to get a 4hr pack to finish in something like
1:15.
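
The workaround described here, more threads than cores so that uneven
chunks average out, might be scripted as below. The factor of 5 simply
mirrors the 20-threads-on-4-cores figure above and is not a
recommendation; suggest_threads is an invented name, and the command is
only printed, not run.

```shell
# suggest_threads CORES [FACTOR]: oversubscribe cores by FACTOR (default 5).
suggest_threads() {
    echo $(( $1 * ${2:-5} ))
}

# Number of online CPUs; fall back to 1 where getconf lacks this variable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "git config pack.threads $(suggest_threads "$cores")"
```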

A scheme where each core could work a minute without communicating to
the other cores would be best. It would also be more efficient if the
cores could avoid having sync points between them.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Linus Torvalds <torvalds@...>, Jeff King <peff@...>, Daniel Berlin <dberlin@...>, Harvey Harrison <harvey.harrison@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 3:08 pm

But as far as I know you didn't try my latest incarnation which has been
available in Git's master branch for a few months already.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Linus Torvalds <torvalds@...>, Jeff King <peff@...>, Daniel Berlin <dberlin@...>, Harvey Harrison <harvey.harrison@...>, David Miller <davem@...>, <ismail@...>, <gcc@...>, <git@...>
Date: Thursday, December 6, 2007 - 5:39 pm

I've deleted all my giant packs. Using the kernel pack:
4GB Q6600

Using the current thread pack code I get these results.

The interesting case is the last one. I set it to 15 threads and
monitored with 'top'.
For 0-60% compression I was at 300% CPU, 60-74% was 200% CPU and
74-100% was 100% CPU. It never used all four cores. The only other
things running were top and my desktop. This is the same load
balancing problem I observed earlier. Much more clock time was spent
in the 2/1 core phases than the 3 core one.
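
The clock-time claim can be checked with a little arithmetic, under the
loud assumption that the progress percentage is proportional to the work
done (it counts objects, so this is only approximate). Each phase then
costs its share of the work divided by the number of busy cores;
phase_wall_time is a made-up helper.

```shell
# phase_wall_time: stdin lines are "work_percent busy_cores"; prints the
# relative wall-clock cost of each phase and the total.
phase_wall_time() {
    awk '{ t = $1 / $2; total += t
           printf "%s%% of work on %s cores: %.1f units\n", $1, $2, t }
         END { printf "total: %.1f units\n", total }'
}

# 0-60% at 3 cores, 60-74% at 2 cores, 74-100% at 1 core
printf '60 3\n14 2\n26 1\n' | phase_wall_time
```

Under that assumption the three phases cost 20, 7 and 26 units, 53 in
total, against 25 units if all four cores had stayed busy throughout.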

Threaded, threads = 5

jonsmirl@terra:/home/linux$ time git repack -a -d -f
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objects: 100% (648366/648366), done.
Total 648366 (delta 528994), reused 0 (delta 0)

real    1m31.395s
user    2m59.239s
sys     0m3.048s
jonsmirl@terra:/home/linux$

12 seconds counting
53 seconds compressing
38 seconds writing

Without threads,

jonsmirl@terra:/home/linux$ time git repack -a -d -f
warning: no threads support, ignoring pack.threads
Counting objects: 648366, done.
Compressing objects: 100% (647457/647457), done.
Writing objec