Re: Something is broken in repack

To: Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 7:05 pm

Using this config:
[pack]
        threads = 4
        deltacachesize = 256M
        deltacachelimit = 0

And the 330MB gcc pack for input
 git repack -a -d -f  --depth=250 --window=250

complete seconds RAM
10%   47    1GB
20%   29    1GB
30%   24    1GB
40%   18    1GB
50%   110   1.2GB
60%   85    1.4GB
70%   195   1.5GB
80%   186   2.5GB
90%   489   3.8GB
95%   800   4.8GB
I killed it because it started swapping

The mmaps are only about 400MB in this case.
At the end the git process had 4.4GB of physical RAM allocated.

Starting from a highly compressed pack greatly aggravates the problem.
Starting with a 2GB pack of the same data my process size only grew to
3GB with 2GB of mmaps.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Git Mailing List <git@...>
Date: Monday, December 10, 2007 - 3:56 pm

You said you had reproduced the issue, albeit not as severely, with the 
Linux kernel repo.  I did just that:

# to get the default pack:
$ git repack -a -f -d

# first measurement with a repack from a default pack
$ /usr/bin/time git repack -a -f --window=256 --depth=256
2572.17user 5.87system 22:46.80elapsed 188%CPU (0avgtext+0avgdata 0maxresident)k
15720inputs+356640outputs (71major+264376minor)pagefaults 0swaps

# do it again to start from a highly packed pack
$ /usr/bin/time git repack -a -f --window=256 --depth=256
2573.53user 5.62system 22:45.60elapsed 188%CPU (0avgtext+0avgdata 0maxresident)k
29176inputs+356664outputs (210major+274887minor)pagefaults 0swaps

This is with pack.threads=2 on a P4 with HT, and I'm using the machine 
for other tasks as well, but all measured times are roughly the same for 
both cases.  Virtual memory allocation never reached 700MB in either 
case.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Git Mailing List <git@...>
Date: Monday, December 10, 2007 - 4:05 pm

This is the mail about the kernel pack; the one you quoted is a gcc run.

The kernel repo has the same problem but not nearly as bad.

Starting from a default pack
 git repack -a -d -f  --depth=1000 --window=1000
Uses 1GB of physical memory

Now do the command again.
 git repack -a -d -f  --depth=1000 --window=1000
Uses 1.3GB of physical memory

I suspect the gcc repo has much longer revision chains than the kernel
one since the kernel repo is only a few years old. The Mozilla repo
contained revision chains with over 2,000 revisions. Longer revision
chains result in longer delta chains.

So what is allocating the extra memory? Either a function of the
number of entries in the chain, or related to accessing the chain
since a chain with more entries will need to be accessed more times.

I have a 168MB kernel pack now after 15 minutes of four cores at 100%.

Here's another observation, the gcc objects are larger. Kernel has
650K objects in 190MB, gcc has 870K objects in 330MB. Average gcc
object is 30% larger. How should the average kernel developer
interpret this?


-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Nicolas Pitre <nico@...>, Git Mailing List <git@...>
Date: Monday, December 10, 2007 - 4:16 pm

Could this be explained by the ChangeLog file?  It's large; it has tons of
revisions; it is a prime candidate for delta compression.

Morten
-
To: Jon Smirl <jonsmirl@...>
Cc: Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 9:46 pm

Since you have a different result according to the source pack used then 
those cache settings, even if there was a bug with them, are not 



Which is quite reasonable, even if the same issue might still be there.

So the problem seems to be related to the pack access code and not the 
repack code.  And it must have something to do with the number of deltas 
being replayed.  And because the repack is attempting delta compression 
roughly from newest to oldest, and because old objects are typically in 
a deeper delta chain, then this might explain the logarithmic slowdown.

So something must be wrong with the delta cache in sha1_file.c somehow.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 10:22 pm

What could be wrongly allocating 4GB of memory? Figure that out and
you should have your answer. The slowdown may be coming from having
to search through more and more objects in memory.

Memory consumption seems to be correlated to the depth of the delta
chain being accessed. It blows up tremendously right at the end. It
may even be the square of the chain length. For the normal
default case the square didn't hurt, but 250*250 = 62,500 which would
eat a huge amount of memory.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Nicolas Pitre <nico@...>
Cc: Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 10:04 pm

I applied the delta accounting patch. It took about 200MB off the
memory use but that doesn't make a dent in 4GB of allocations.


-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 10:28 pm

Right.  I didn't expect much from that fix.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 11:29 pm

The kernel repo has the same problem but not nearly as bad.

Starting from a default pack
 git repack -a -d -f  --depth=1000 --window=1000
Uses 1GB of physical memory

Now do the command again.
 git repack -a -d -f  --depth=1000 --window=1000
Uses 1.3GB of physical memory

I suspect the gcc repo has much longer revision chains than the kernel
one since the kernel repo is only a few years old. The Mozilla repo
contained revision chains with over 2,000 revisions. Longer revision
chains result in longer delta chains.

So what is allocating the extra memory? Either a function of the
number of entries in the chain, or related to accessing the chain
since a chain with more entries will need to be accessed more times.

I have a 168MB kernel pack now after 15 minutes of four cores at 100%.

Here's another observation, the gcc objects are larger. Kernel has
650K objects in 190MB, gcc has 870K objects in 330MB. Average gcc
object is 30% larger. How should the average kernel developer
interpret this?
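
The 30% figure follows from the object counts and pack sizes quoted above. A minimal sketch of that arithmetic, assuming 1MB means 10^6 bytes; the function names are illustrative and not part of git:

```c
#include <assert.h>

/* Average packed object size in bytes (hypothetical helper, not a git API). */
double avg_object_bytes(double pack_bytes, double object_count)
{
	assert(object_count > 0);
	return pack_bytes / object_count;
}

/* How much larger, in percent, avg_b is compared with avg_a. */
double percent_larger(double avg_a, double avg_b)
{
	assert(avg_a > 0);
	return (avg_b / avg_a - 1.0) * 100.0;
}
```

With 650K objects in 190MB the kernel averages roughly 292 bytes per packed object, and with 870K objects in 330MB gcc averages roughly 379 bytes, which is close to 30% larger.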

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Nicolas Pitre <nico@...>, Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 11:37 pm

With my repo that contains a bunch of 50MB tarfiles, I've found I must
specify --window-memory as well to keep repack from using nearly unbounded
amounts of memory.  Perhaps it is the larger files found in gcc that
provokes this.

A window size of 1000 can take a lot of memory if the objects are large.
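
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes every window slot holds one fully inflated object and ignores delta indexes and allocator overhead, so real usage would be higher. The helper names are hypothetical, and the second function is only a simplified model of how a memory limit shrinks the window:

```c
#include <assert.h>

/* Worst-case bytes held if every window slot contains one inflated object. */
unsigned long long window_bytes(unsigned window, unsigned long long object_bytes)
{
	assert(window > 0);
	return (unsigned long long)window * object_bytes;
}

/*
 * Number of slots that fit under a byte limit: never fewer than one and
 * never more than the requested window.  A limit of 0 means "no limit".
 */
unsigned slots_under_limit(unsigned window, unsigned long long object_bytes,
			   unsigned long long limit)
{
	unsigned long long fit;

	assert(window > 0 && object_bytes > 0);
	if (!limit)
		return window;
	fit = limit / object_bytes;
	if (fit < 1)
		fit = 1;
	return fit < window ? (unsigned)fit : window;
}
```

For 50MB objects and a window of 1000 the first function gives 50GB, while a 256MB limit leaves room for only about 5 such objects in the window.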

Dave
-
To: David Brown <git@...>, Nicolas Pitre <nico@...>, Git Mailing List <git@...>
Date: Saturday, December 8, 2007 - 12:22 am

This is a partial solution to the problem. Adding a window memory limit of 256M
took memory consumption down from 4.8GB to 2.8GB. It took an hour to
run the test.

It is not the complete solution since my git process is still using 2.4GB
of physical memory. I am also still experiencing a lot of slowdown in the
last 10%.

Does the gcc repo contain some giant objects? Why wasn't the memory
freed after their chain was processed?

Most of the last 10% is being done on a single CPU. There must be a
chain of giant objects that is unbalancing everything.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: David Brown <git@...>, Git Mailing List <git@...>
Date: Saturday, December 8, 2007 - 12:30 am

I'm about to send a patch to fix the thread balancing for real this 
time.


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: David Brown <git@...>, Git Mailing List <git@...>
Date: Saturday, December 8, 2007 - 1:01 am

Something is really broken in the last 5% of that repo. I have been
processing at 97% for 30 minutes without moving to 98%.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Git Mailing List <git@...>
Date: Saturday, December 8, 2007 - 1:12 am

This is a clear sign of a problem, indeed.

I'll be away for the weekend, so here's a few things to try out if you 
feel like it:

1) Make sure the problem occurs with the thread code disabled.  That 
   would eliminate one variable, and will help for #2.

2) Try bisecting the issue.  If you can find an old Git version where 
   the issue doesn't appear then simply run "git bisect" to find the 
   exact commit causing the problem.  Best with a repo that doesn't take
   ages to repack.

3) Compile Git against the dmalloc library in order to identify where
   the huge memory leak is happening.


Nicolas
-
To: Jon Smirl <jonsmirl@...>
Cc: Nicolas Pitre <nico@...>, Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 11:48 pm

I sent out a partial delta breakdown for the gcc repo earlier, here's
the whole list.

breakdown of the gcc packfile:

Total objects
1017922

ChainLength	Objects	Cumulative
1:	103817	103817
2:	67332	171149
3:	57520	228669
4:	52570	281239
5:	43910	325149
6:	37520	362669
7:	35248	397917
8:	29819	427736
9:	27619	455355
10:	22656	478011
11:	21073	499084
12:	18738	517822
13:	16674	534496
14:	14882	549378
15:	14424	563802
16:	12765	576567
17:	11662	588229
18:	11845	600074
19:	11694	611768
20:	9625	621393
21:	9031	630424
22:	8437	638861
23:	8217	647078
24:	7927	655005
25:	7955	662960
26:	7092	670052
27:	7004	677056
28:	6724	683780
29:	6626	690406
30:	5875	696281
31:	5970	702251
32:	5726	707977
33:	6025	714002
34:	5354	719356
35:	6413	725769
36:	4933	730702
37:	4888	735590
38:	4561	740151
39:	4366	744517
40:	4166	748683
41:	4531	753214
42:	4029	757243
43:	3701	760944
44:	3647	764591
45:	3553	768144
46:	3509	771653
47:	3473	775126
48:	3442	778568
49:	3379	781947
50:	3395	785342
51:	3315	788657
52:	3168	791825
53:	3345	795170
54:	3166	798336
55:	3237	801573
56:	2795	804368
57:	2768	807136
58:	2666	809802
59:	2723	812525
60:	2547	815072
61:	2565	817637
62:	2622	820259
63:	2521	822780
64:	2492	825272
65:	2529	827801
66:	2566	830367
67:	2685	833052
68:	2458	835510
69:	2457	837967
70:	2440	840407
71:	2410	842817
72:	2337	845154
73:	2301	847455
74:	2201	849656
75:	2127	851783
76:	2256	854039
77:	2038	856077
78:	1925	858002
79:	1965	859967
80:	1929	861896
81:	1890	863786
82:	1873	865659
83:	1964	867623
84:	1898	869521
85:	1839	871360
86:	1933	873293
87:	1876	875169
88:	1851	877020
89:	1789	878809
90:	1790	880599
91:	1804	882403
92:	1696	884099
93:	1863	885962
94:	1889	887851
95:	1766	889617
96:	1731	891348
97:	1775	893123
98:	1750	894873
99:	1767	896640
100:	1644	898284
101:	1642	899926
102:	1489	901415
103:	1532	902947
104:	1564	904511
105:	1477	905988
106:	1461	907449
107:	1383	908832
108:	1422	910254
109:	131...
-
To: Nicolas Pitre <nico@...>
Cc: Jon Smirl <jonsmirl@...>, Git Mailing List <git@...>
Date: Saturday, December 8, 2007 - 6:18 pm

I was reaching the same conclusion but haven't managed to spot anything
blatantly wrong in that area.  Will need to dig more.
-
To: Junio C Hamano <gitster@...>
Cc: Jon Smirl <jonsmirl@...>, Git Mailing List <git@...>
Date: Sunday, December 9, 2007 - 10:49 pm

I didn't find anything wrong there either. I'll have to run some more 
gcc repacking tests myself, despite not having a blazingly fast machine 
making for rather long turnarounds.


Nicolas
-
To: Jon Smirl <jonsmirl@...>
Cc: Nicolas Pitre <nico@...>, Git Mailing List <git@...>
Date: Sunday, December 9, 2007 - 4:05 am

Does this problem have correlation with the use of threads?  Do you see
the same bloat with or without THREADED_DELTA_SEARCH defined?
-
To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Git Mailing List <git@...>
Date: Sunday, December 9, 2007 - 2:25 pm

Something else seems to be wrong.

With threading turned off,  5000 CPU seconds and 13% done.
With threading turned on, threads = 1, 5000 CPU seconds, 13%
With threading turned on, threads = 2, 180 CPU seconds, 13%
With threading turned on, threads = 4, 150 CPU seconds, 13%

This can't be right; four cores are not 40x one core. So maybe the
observed logarithmic slowdown is because the percent complete is
being reported wrong in the threaded case. If that's the case we may
be looking in the wrong place for problems.

The times are only approximate, I'm using the CPU for other things.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Sunday, December 9, 2007 - 9:07 pm

It may be right.  The object list to apply delta compression on doesn't 
necessarily require a uniform amount of cycles throughout.  When using 
multiple threads, the list is broken in parts for each thread, and later 
parts might end up being simply much easier to process, therefore 

I really doubt it.


Nicolas
-
To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Git Mailing List <git@...>
Date: Sunday, December 9, 2007 - 11:19 am

I just started a non-threaded one. It will be four or five hours
before it finishes.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Nicolas Pitre <nico@...>
Cc: Jon Smirl <jonsmirl@...>, Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 11:44 pm

All I have is a qualitative observation, but during the process of
creating the pack, there was a _huge_ slowdown between 10-15%
(hundreds/dozens per second to single object per second and a
corresponding increase in process size).  Didn't keep any numbers
at the time, but it was noticeable.

I wonder if there are a bunch of huge objects somewhere in gcc's
history?

Harvey

-
To: Jon Smirl <jonsmirl@...>, Nicolas Pitre <nico@...>
Cc: Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 8:37 pm

I think deltacachesize is broken.

The code in try_delta() that replaces a delta cache entry with another one 
seems very buggy wrt that whole "delta_cache_size" update. It does

	delta_cache_size -= trg_entry->delta_size;

to account for the old delta going away, but it does this *after* having 
already replaced trg_entry->delta_size with the new delta entry.

I suspect there are other issues going on too, but that's the one that I 
noticed from a quick look-through.
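
The ordering problem can be reduced to a few lines. This is a simplified sketch and not the real try_delta(): the struct and function names are made up for illustration, and locking is left out.

```c
#include <assert.h>

/* Hypothetical stand-in for the real object entry; only the field needed here. */
struct entry {
	unsigned long delta_size;  /* size of the currently cached delta */
};

/* Buggy order: the entry is updated first, so the old size is already gone. */
void replace_buggy(struct entry *e, unsigned long new_size,
		   unsigned long *cache_total)
{
	assert(e && cache_total);
	e->delta_size = new_size;
	*cache_total -= e->delta_size;  /* subtracts the NEW size, not the old one */
	*cache_total += new_size;
}

/* Fixed order: account for the old delta before overwriting its size. */
void replace_fixed(struct entry *e, unsigned long new_size,
		   unsigned long *cache_total)
{
	assert(e && cache_total);
	*cache_total -= e->delta_size;  /* old size is still intact here */
	*cache_total += new_size;
	e->delta_size = new_size;
}
```

Replacing a 100-byte cached delta with a 40-byte one leaves the buggy total at 100 instead of 40, so the accounted cache size drifts away from what is really cached.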

Nico? I think this one is yours..

		Linus


-
To: Junio C Hamano <gitster@...>
Cc: Linus Torvalds <torvalds@...>, Jon Smirl <jonsmirl@...>, Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 9:27 pm

The wrong value was subtracted from delta_cache_size when replacing
a cached delta, as trg_entry->delta_size was used after the old size
had been replaced by the new size.

Noticed by Linus.

Signed-off-by: Nicolas Pitre <nico@cam.org> 
---


Doh!  Mea culpa.

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 4f44658..350ece4 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1422,10 +1422,6 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 		}
 	}
 
-	trg_entry->delta = src_entry;
-	trg_entry->delta_size = delta_size;
-	trg->depth = src->depth + 1;
-
 	/*
 	 * Handle memory allocation outside of the cache
 	 * accounting lock.  Compiler will optimize the strangeness
@@ -1439,7 +1435,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 		trg_entry->delta_data = NULL;
 	}
 	if (delta_cacheable(src_size, trg_size, delta_size)) {
-		delta_cache_size += trg_entry->delta_size;
+		delta_cache_size += delta_size;
 		cache_unlock();
 		trg_entry->delta_data = xrealloc(delta_buf, delta_size);
 	} else {
@@ -1447,6 +1443,10 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 		free(delta_buf);
 	}
 
+	trg_entry->delta = src_entry;
+	trg_entry->delta_size = delta_size;
+	trg->depth = src->depth + 1;
+
 	return 1;
 }
 
-
To: Git Mailing List <git@...>, Nicolas Pitre <nico@...>
Date: Monday, December 10, 2007 - 10:25 pm

New run using same configuration. With the addition of the more
efficient load balancing patches and delta cache accounting.

Seconds are wall clock time. They are lower since the patch made
threading better at using all four cores. I am stuck at 380-390% CPU
utilization for the git process.

complete seconds RAM
10%   60    900M (includes counting)
20%   15    900M
30%   15    900M
40%   50    1.2G
50%   80    1.3G
60%   70    1.7G
70%   140  1.8G
80%   180  2.0G
90%   280  2.2G
95%   530  2.8G - 1,420 total to here, previous was 1,983
100% 1390 2.85G
During the writing phase RAM fell to 1.6G
What is being freed in the writing phase??

I have no explanation for the change in RAM usage. Two guesses come to
mind: memory fragmentation, or the change in the way the work was
split up altered RAM usage.

Total CPU time was 195 minutes in 70 minutes clock time. About 70%
efficient. During the compress phase all four cores were active until
the last 90 seconds. Writing the objects took over 23 minutes CPU
bound on one core.
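
The 70% figure is CPU time divided by the CPU time that four fully busy cores could have delivered over the same wall-clock interval. A small sketch of that calculation, with an illustrative function name:

```c
#include <assert.h>

/* Fraction of the available core time that was actually spent on CPU. */
double parallel_efficiency(double cpu_minutes, double wall_minutes,
			   unsigned cores)
{
	assert(wall_minutes > 0 && cores > 0);
	return cpu_minutes / (wall_minutes * cores);
}
```

For 195 CPU minutes over 70 wall-clock minutes on 4 cores this gives 195 / 280, or just under 0.70.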

New pack file is: 270,594,853
Old one was: 344,543,752
It still has 828,660 objects




-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Git Mailing List <git@...>
Date: Monday, December 10, 2007 - 11:49 pm

The cached delta results, but you put a cap of 256MB for them.

Could you try again with that cache disabled entirely, with 
pack.deltacachesize = 1 (don't use 0 as that means unbounded).

And then, while still keeping the delta cache disabled, could you try 
with pack.threads = 2, and pack.threads = 1 ?
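
The reason 1 disables the cache while 0 does not can be seen in a sketch of the cap check. This is a simplified model assuming the cap is tested as "zero means no cap"; the real code applies further heuristics on top, and the function name here is made up:

```c
#include <assert.h>

/*
 * Returns 1 if a delta of delta_size bytes may be added to the cache.
 * max_size == 0 means "no cap"; any other value is a hard byte limit.
 */
int fits_delta_cache(unsigned long cached_bytes, unsigned long delta_size,
		     unsigned long max_size)
{
	assert(delta_size > 0);
	if (max_size && cached_bytes + delta_size > max_size)
		return 0;
	return 1;
}
```

With a cap of 1 byte, any realistic delta is rejected, so nothing accumulates in the cache; with a cap of 0 the check is skipped entirely.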

I'm sorry to ask you to do this but I don't have enough ram to even 
complete a repack with threads=2 so I'm reattempting single threaded at 
the moment.  But I really wonder if the threading has such an effect on 

You mean the pack for the gcc repo is now less than 300MB?  Wow.


Nicolas
-
To: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>
Cc: Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 1:25 am

I already have a threads = 1 run going with this config. The binary and
config were the same as in the threads=4 run.

10% 28min 950M
40% 135min 950M
50% 157min 900M
60% 160min 830M
100% 170min 830M

Something is hurting bad with threads. 170 CPU minutes with one
thread, versus 195 CPU minutes with four threads.

Is there a different memory allocator that can be used when
multithreaded on gcc? This whole problem may be coming from the memory
allocation function. git is hardly interacting at all on the thread
level so it's likely a problem in the C run-time.

[core]
        repositoryformatversion = 0
        filemode = true
        bare = false
        logallrefupdates = true
[pack]
        threads = 1
        deltacachesize = 256M
        windowmemory = 256M
        deltacachelimit = 0
[remote "origin"]
        url = git://git.infradead.org/gcc.git
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "trunk"]
        remote = origin


-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 2:01 am

On Tue, 11 Dec 2007 00:25:55 -0500

You might want to try Google's malloc, it's basically a drop in replacement
with some optional built-in performance monitoring capabilities.  It is said
to be much faster and better at threading than glibc's:

  http://code.google.com/p/google-perftools/wiki/GooglePerformanceTools
  http://google-perftools.googlecode.com/svn/trunk/doc/tcmalloc.html


You can LD_PRELOAD it or link directly.

Cheers,
Sean
-
To: Sean <seanlkml@...>
Cc: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 2:20 am

I'm 45 minutes into a run using it. It doesn't seem to be any faster
but it is reducing memory consumption significantly. The run should be


-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, <gcc@...>
Cc: Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 1:29 am

I added the gcc people to the CC, it's their repository. Maybe they
can help us sort this out.



-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 9:31 am

Unless there is a Git expert amongst the gcc crowd, I somehow doubt it. 
And gcc people with an interest in Git internals are probably already on 
the Git mailing list.


Nicolas
-
To: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, <gcc@...>
Cc: Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 3:01 am

Switching to the Google perftools malloc
http://goog-perftools.sourceforge.net/

10%   30  828M
20%   15  831M
30%   10  834M
40%   50  1014M
50%   80  1086M
60%   80  1500M
70% 200  1.53G
80% 200  1.85G
90% 260  1.87G
95% 520  1.97G
100% 1335 2.24G

Google allocator knocked 600MB off from memory use.
Memory consumption did not fall during the write out phase like it did with gcc.

Since all of this is with the same code except for changing the
threading split, those runs where memory consumption went to 4.5GB
with the gcc allocator must have triggered an extreme problem with
fragmentation.

Total CPU time 196 CPU minutes vs 190 for gcc. Google's claims of
being faster are not true.

So why does our threaded code take 20 CPU minutes longer (12%) to run
than the same code with a single thread? Clock time is obviously
faster. Are the threads working too close to each other in memory and
bouncing cache lines between the cores? Q6600 is just two E6600s in
the same package, the caches are not shared.
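
The overhead percentage is the extra CPU time relative to the single-threaded run. A tiny sketch of that calculation, with an illustrative function name:

```c
#include <assert.h>

/* Extra CPU time of the threaded run, as a percentage of the single-threaded run. */
double overhead_percent(double single_cpu_min, double threaded_cpu_min)
{
	assert(single_cpu_min > 0);
	return (threaded_cpu_min - single_cpu_min) / single_cpu_min * 100.0;
}
```

With 170 CPU minutes single threaded and 190 threaded, this comes to about 11.8%, which matches the 12% quoted above.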

Why does the threaded code need 2.24GB (google allocator, 2.85GB gcc)
with 4 threads, but only 950MB with one thread? Where's the extra
gigabyte going?

Is there another allocator to try? One that combines Google's
efficiency with gcc's speed?




-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 9:49 am

Of course there'll always be a certain amount of wasted cycles when 
threaded.  The locking overhead, the extra contention for IO, etc.  So 
12% overhead (3% per thread) when using 4 threads is not that bad I 

I really don't know.

Did you try with pack.deltacachesize set to 1 ?

And yet, this is still missing the actual issue.  The issue being that 
the 2.1GB pack as a _source_ doesn't cause as much memory to be 
allocated even if the _result_ pack ends up being the same.

I was able to repack the 2.1GB pack on my machine which has 1GB of ram. 
Now that it has been repacked, I can't repack it anymore, even when 
single threaded, as it starts crawling into swap fairly quickly.  It is 
really non-intuitive and actually senseless that Git would require twice 
as much RAM to deal with a pack that is 7 times smaller.


Nicolas (still puzzled)
-
To: Jon Smirl <jonsmirl@...>
Cc: Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 11:00 am

OK, here's something else for you to try:

	core.deltabasecachelimit=0
	pack.threads=2
	pack.deltacachesize=1

With that I'm able to repack the small gcc pack on my machine with 1GB 
of ram using:

	git repack -a -f -d --window=250 --depth=250

and top reports a ~700m virt and ~500m res without hitting swap at all.
It is only at 25% so far, but I was unable to get that far before.

Would be curious to know what you get with 4 threads on your machine.


Nicolas
-
To: Jon Smirl <jonsmirl@...>
Cc: Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 12:20 pm

Well, at around 55%, memory usage skyrocketed to 1.6GB and the system went 
deep into swap.  So I restarted it with no threads.

Nicolas (even more puzzled)
-
To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 12:21 pm

On the plus side you are seeing what I see, so it proves I am not imagining it.


-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Wednesday, December 12, 2007 - 1:12 am

Well... This is weird.

It seems that memory fragmentation is really really killing us here.  
The fact that the Google allocator managed to waste quite a bit less memory 
is a good indicator already.

I did modify the progress display to show accounted memory that was 
allocated vs memory that was freed but still not released to the system.  
At least that gives you an idea of memory allocation and fragmentation 
with glibc in real time:

diff --git a/progress.c b/progress.c
index d19f80c..46ac9ef 100644
--- a/progress.c
+++ b/progress.c
@@ -8,6 +8,7 @@
  * published by the Free Software Foundation.
  */
 
+#include <malloc.h>
 #include "git-compat-util.h"
 #include "progress.h"
 
@@ -94,10 +95,12 @@ static int display(struct progress *progress, unsigned n, const char *done)
 	if (progress->total) {
 		unsigned percent = n * 100 / progress->total;
 		if (percent != progress->last_percent || progress_update) {
+			struct mallinfo m = mallinfo();
 			progress->last_percent = percent;
-			fprintf(stderr, "%s: %3u%% (%u/%u)%s%s",
-				progress->title, percent, n,
-				progress->total, tp, eol);
+			fprintf(stderr, "%s: %3u%% (%u/%u) %u/%uMB%s%s",
+				progress->title, percent, n, progress->total,
+				m.uordblks >> 18, m.fordblks >> 18,
+				tp, eol);
 			fflush(stderr);
 			progress_update = 0;
 			return 1;

This shows that at some point the repack goes into a big memory surge.  
I don't have enough RAM to see how fragmented memory gets though, since 
it starts swapping around 50% done with 2 threads.

With only 1 thread, memory usage grows significantly at around 11% with 
a pretty noticeable slowdown in the progress rate.

So I think the theory goes like this:

There is a block of big objects together in the list somewhere.  
Initially, all those big objects are assigned to thread #1 out of 4.  
Because those objects are big, they get really slow to delta compress, 
and storing them all in a window with 250 slots takes s...
-
To: Jon Smirl <jonsmirl@...>
Cc: Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Wednesday, December 12, 2007 - 12:13 pm

Note: I didn't know what unit of memory those blocks represent, so the 
shift is most probably wrong.
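
For reference, the glibc manual page documents the uordblks and fordblks counters as byte counts. Assuming that, a shift of 20 gives whole megabytes, and a shift of 18 reports units of 256KB, four times too high. A small sketch of the difference, with an illustrative helper name:

```c
#include <assert.h>

/* Value reported when a byte count is shifted right by `shift` bits. */
unsigned long shifted(unsigned long bytes, unsigned shift)
{
	assert(shift < sizeof(unsigned long) * 8);  /* larger shifts are undefined */
	return bytes >> shift;
}
```

For 512MB of allocated memory, a shift of 20 yields 512 while a shift of 18 yields 2048.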


Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Jon Smirl <jonsmirl@...>, Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Thursday, December 13, 2007 - 3:32 am

Me neither, but it appears to me as if hblkhd holds the actual memory
consumed by the process. It seems to store the information in bytes,
which I find a bit dubious unless glibc has some internal multiplier.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-
To: <ae@...>
Cc: <nico@...>, <jonsmirl@...>, <gitster@...>, <gcc@...>, <git@...>
Date: Friday, December 14, 2007 - 12:03 pm

mallinfo() will only give you the used memory for the main arena.
When you have separate arenas (likely when concurrent threads have
been used), the only way to get the full picture is to call
malloc_stats(), which prints to stderr.
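
A minimal sketch of calling it, assuming a glibc system. malloc_stats() is a glibc extension, so the call is guarded and simply skipped elsewhere; the surrounding function is a made-up example that churns the allocator a little first:

```c
#include <assert.h>
#include <stdlib.h>
#ifdef __GLIBC__
#include <malloc.h>
#endif

/*
 * Allocate and free `count` blocks of `size` bytes, then ask glibc to
 * print per-arena statistics to stderr.  Returns the number of
 * allocations that succeeded.
 */
int churn_and_report(int count, size_t size)
{
	int ok = 0;

	assert(count >= 0);
	for (int i = 0; i < count; i++) {
		void *p = malloc(size);
		if (p) {
			ok++;
			free(p);
		}
	}
#ifdef __GLIBC__
	malloc_stats();  /* reports each arena, not only the main one */
#endif
	return ok;
}
```

In a threaded program the report would need to be triggered while the worker arenas still exist, for example just before the threads are joined.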

Regards,
Wolfram.

-
To: Jon Smirl <jonsmirl@...>
Cc: Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Wednesday, December 12, 2007 - 11:48 am

OK scrap that.

When I returned to the computer this morning, the repack was 
completed... with a 1.3GB pack instead.

So... The gcc repo apparently really needs a large window to efficiently 
compress those large objects.

But when those large objects are already well deltified and you repack 
again with a large window, somehow the memory allocator is way more 
involved, probably even more so when there are several threads in 
parallel amplifying the issue, and things probably get to a point of no 
return with regard to memory fragmentation after a while.

So... my conclusion is that the glibc allocator has fragmentation issues 
with this work load, given the notable difference with the Google 
allocator, which itself might not be completely immune to fragmentation 
issues of its own.  And because the gcc repo requires a large window of 
big objects to get good compression, then you're better off not using 4 
threads to repack it with -a -f.  The fact that the size of the source 
pack has such an influence is probably only because the increased usage 
of the delta base object cache is playing a role in the global memory 
allocation pattern, allowing for the bad fragmentation issue to occur.

If you could run one last test with the mallinfo patch I posted, without 
the pack.windowmemory setting, and add the reported values along with 
those from top, then we could formally attribute this to memory 
fragmentation.

So I don't think Git itself is actually bad.  The gcc repo most 
certainly constitutes a nasty use case for memory allocators, but I don't 
think there is much we can do about it besides possibly implementing our 
own memory allocator with active defragmentation where possible (read 
memcpy) at some point to give glibc's allocator some chance to breathe a 
bit more.

In the meantime you might have to use only one thread and lots of 
memory to repack the gcc repo, or find the perfect memory allocator to 
be used with Git.  After all, packing the whole gcc history...
-
To: Nicolas Pitre <nico@...>
Cc: Jon Smirl <jonsmirl@...>, Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Thursday, December 13, 2007 - 9:32 am

Is there an alternative to "git repack -a -d" that repacks everything
but the first pack?
-- 
Duy
-
To: <git@...>
Cc: <gcc@...>
Date: Thursday, December 13, 2007 - 11:32 am

That would be a pretty good idea for big repositories.  If I were to 
implement it, I would actually add a .git/config option like 
pack.permanent so that more than one pack could be made permanent; then 
to repack really really everything you'd need "git repack -a -a -d".

Paolo

-
To: Paolo Bonzini <bonzini@...>
Cc: <git@...>, <gcc@...>
Date: Thursday, December 13, 2007 - 12:39 pm

It's already there: If you have a pack .git/objects/pack/pack-foo.pack, then
"touch .git/objects/pack/pack-foo.keep" marks the pack as precious.

-- Hannes

-
To: <unlisted-recipients@...>, <@...>
Cc: <git@...>, <gcc@...>
Date: Thursday, December 13, 2007 - 12:29 pm

Actually there is something like this, as seen from the source of 
git-repack:

             for e in `cd "$PACKDIR" && find . -type f -name '*.pack' \
                      | sed -e 's/^\.\///' -e 's/\.pack$//'`
             do
                     if [ -e "$PACKDIR/$e.keep" ]; then
                             : keep
                     else
                             args="$args --unpacked=$e.pack"
                             existing="$existing $e"
                     fi
             done

So, just create a file named as the pack, but with extension ".keep".
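
The naming rule is simply "same path, .pack replaced by .keep". A small sketch of deriving that name, using a hypothetical helper that is not part of git:

```c
#include <assert.h>
#include <string.h>

/*
 * Write the ".keep" companion name for a ".pack" file into out.
 * Returns 0 on success, -1 if the name does not end in ".pack" or
 * out is too small.
 */
int keep_name_for_pack(const char *pack, char *out, size_t outlen)
{
	size_t len;

	assert(pack && out);
	len = strlen(pack);
	if (len < 5 || strcmp(pack + len - 5, ".pack") != 0)
		return -1;
	if (len + 1 > outlen)
		return -1;
	memcpy(out, pack, len - 5);
	memcpy(out + len - 5, ".keep", 6);  /* 5 characters plus the NUL */
	return 0;
}
```

Creating an empty file at the resulting path is what marks the pack as one that repack should leave alone.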

Paolo
-
To: Nicolas Pitre <nico@...>
Cc: Jon Smirl <jonsmirl@...>, Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Wednesday, December 12, 2007 - 12:37 pm

Yes.

Note that delta following involves patterns something like

   allocate (small) space for delta
   for i in (1..depth) {
	allocate large space for base
	allocate large space for result
	.. apply delta ..
	free large space for base
	free small space for delta
   }

so if you have some stupid heap algorithm that doesn't try to merge and 
re-use free'd spaces very aggressively (because that takes CPU time!), you 
might have memory usage be horribly inflated by the heap having all those 
holes for all the objects that got free'd in the chain that don't get 
aggressively re-used.

Threaded memory allocators then make this worse by probably using totally 
different heaps for different threads (in order to avoid locking), so they 
will *all* have the fragmentation issue.

And if you *really* want to cause trouble for a memory allocator, what you 
should try to do is to allocate the memory in one thread, and free it in 
another, and then things can really explode (the freeing thread notices 
that the allocation is not in its thread-local heap, so instead of really 
freeing it, it puts it on a separate list of areas to be freed later by 
the original thread when it needs memory - or worse, it adds it to the 
local thread list, and makes it effectively totally impossible to then 
ever merge different free'd allocations ever again because the freed 
things will be on different heap lists!).

I'm not saying that particular case happens in git, I'm just saying that 
it's not unheard of. And with the delta cache and the object lookup, it's 
not at _all_ impossible that we hit the "allocate in one thread, free in 
another" case!

		Linus
-
To: <torvalds@...>
Cc: <nico@...>, <jonsmirl@...>, <gitster@...>, <gcc@...>, <git@...>
Date: Friday, December 14, 2007 - 12:12 pm

ptmalloc2 (in glibc) _per arena_ is basically best-fit.  This is the
best known general strategy, but it certainly cannot be the best in

It depends how large 'large' is -- if it exceeds the mmap() threshold
(settable with mallopt(M_MMAP_THRESHOLD, ...))
the 'large' spaces will be allocated with mmap() and won't cause
any internal fragmentation.
It might pay to experiment with this parameter if it is hard to
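
For anyone who wants to try that experiment, here is a minimal sketch,
assuming glibc (mallopt() and M_MMAP_THRESHOLD are glibc-specific). The
helper name set_mmap_threshold is made up for illustration and the value
is something to tune, not a recommendation:

```c
#include <assert.h>
#include <limits.h>
#include <malloc.h>   /* glibc-specific: mallopt(), M_MMAP_THRESHOLD */
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: ask glibc malloc to serve every request of at
 * least `bytes` bytes directly with mmap() instead of from the heap.
 * Returns 1 on success and 0 on error, which is mallopt()'s convention.
 */
int set_mmap_threshold(size_t bytes)
{
    assert(bytes <= INT_MAX);   /* mallopt() takes an int value */
    return mallopt(M_MMAP_THRESHOLD, (int)bytes);
}
```

Calling something like set_mmap_threshold(64 * 1024) early in main()
would push the "large" base and result buffers out of the heap. Note
that setting the threshold explicitly also turns off glibc's dynamic
adjustment of it.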

Indeed.

Could someone perhaps try ptmalloc3
(http://malloc.de/malloc/ptmalloc3-current.tar.gz) on this case?

Thanks,
Wolfram.

-
To: Wolfram Gloger <wmglo@...>
Cc: <torvalds@...>, <nico@...>, <jonsmirl@...>, <gitster@...>, <gcc@...>, <git@...>
Date: Friday, December 14, 2007 - 12:45 pm

Uh what?  Someone crank out his copy of "The Art of Computer
Programming", I think volume 1.  Best fit is known (analyzed and proven
and documented decades ago) to be one of the worst strategies for memory
allocation.  Exactly because it leads to huge fragmentation problems.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To: <dak@...>
Cc: <wmglo@...>, <torvalds@...>, <nico@...>, <jonsmirl@...>, <gitster@...>, <gcc@...>, <git@...>
Date: Friday, December 14, 2007 - 12:59 pm

Well, quoting http://gee.cs.oswego.edu/dl/html/malloc.html:

"As shown by Wilson et al, best-fit schemes (of various kinds and
approximations) tend to produce the least fragmentation on real loads
compared to other general approaches such as first-fit."

See [Wilson 1995] ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps for
more details and references.

Regards,
Wolfram.
-
To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Wednesday, December 12, 2007 - 1:12 pm

Is it hard to hack up something that statically allocates a big block
of memory per thread for these two and then just reuses it?
   allocate (small) space for delta
   allocate large space for base
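
A rough sketch of that idea, assuming each thread owns its own scratch
buffers (one per role, e.g. one for the delta and one for the base) so
no locking is needed. struct scratch and the two functions are
hypothetical names, not anything that exists in git:

```c
#include <assert.h>
#include <stdlib.h>

/* One reusable buffer; start with { NULL, 0 }. */
struct scratch {
    void *buf;
    size_t cap;
};

/*
 * Return a buffer of at least n bytes. The buffer is only reallocated
 * when it is too small, so a steady state does no malloc/free at all.
 * Previous contents are NOT preserved. Returns NULL on failure.
 */
void *scratch_reserve(struct scratch *s, size_t n)
{
    assert(n > 0);
    if (n > s->cap) {
        free(s->buf);
        s->buf = malloc(n);
        s->cap = s->buf ? n : 0;
    }
    return s->buf;
}

/* Give the memory back, e.g. when the thread finishes. */
void scratch_release(struct scratch *s)
{
    free(s->buf);
    s->buf = NULL;
    s->cap = 0;
}
```

The buffer grows to the largest object seen and then stays there, so
the cost is holding that much memory per thread for the whole run.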

The alternating between long term and short term allocations


-- 
Jon Smirl
jonsmirl@gmail.com
-
To: <torvalds@...>
Cc: <nico@...>, <jonsmirl@...>, <gitster@...>, <gcc@...>, <git@...>
Date: Wednesday, December 12, 2007 - 12:42 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

One thing that supports these theories is that, while running
these large repacks, I notice that the RSS is roughly 2/3 of
the amount of virtual address space allocated.

I personally don't think it's unreasonable for GIT to have its
own customized allocator at least for certain object types.
-
To: David Miller <davem@...>
Cc: <nico@...>, <jonsmirl@...>, <gitster@...>, <gcc@...>, <git@...>
Date: Wednesday, December 12, 2007 - 12:54 pm

Well, we actually already *do* have a customized allocator, but currently 
only for the actual core "object descriptor" that really just has the SHA1 
and object flags in it (and a few extra words depending on object type).

Those are critical for certain loads, and small too (so using the standard 
allocator wasted a _lot_ of memory). In addition, they're fixed-size and 
never free'd, so a specialized allocator really can do a lot better than 
any general-purpose memory allocator ever could.
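
To illustrate why fixed-size, never-freed objects are so easy to do
well, here is a simplified sketch of that style of allocator. It is not
git's actual code; struct fixed_pool, SLAB_NODES and the function names
are invented for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_NODES 1024   /* objects carved from each slab; arbitrary */

struct fixed_pool {
    size_t obj_size;    /* size of every object, rounded for alignment */
    size_t remaining;   /* objects still free in the current slab */
    char *next;         /* next unused object in the current slab */
};

void fixed_pool_init(struct fixed_pool *p, size_t obj_size)
{
    size_t align = _Alignof(max_align_t);

    assert(obj_size > 0);
    p->obj_size = (obj_size + align - 1) / align * align;
    p->remaining = 0;
    p->next = NULL;
}

/*
 * Hand out one zeroed object. There is no per-object header and no
 * free list: slabs are intentionally never released.
 */
void *fixed_pool_alloc(struct fixed_pool *p)
{
    void *ret;

    if (p->remaining == 0) {
        p->next = calloc(SLAB_NODES, p->obj_size);
        if (!p->next)
            return NULL;
        p->remaining = SLAB_NODES;
    }
    ret = p->next;
    p->next += p->obj_size;
    p->remaining--;
    return ret;
}
```

Because nothing is ever freed, there is nothing to fragment, and the
malloc bookkeeping is paid once per slab instead of once per object.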

But the actual object *contents* are currently all allocated with whatever 
the standard libc malloc/free allocator is that you compile for (or load 
dynamically). Having a specialized allocator for them is a much more 
involved issue, exactly because we do have interesting allocation patterns 
etc.

That said, at least those object allocations are all single-threaded (for 
right now, at least), so even when git does multi-threaded stuff, the core 
sha1_file.c stuff is always run under a single lock, and a simpler 
allocator that doesn't care about threads is likely to be much better than 
one that tries to have thread-local heaps etc.

I suspect that is what the google allocator does. It probably doesn't have 
per-thread heaps, it just uses locking (and quite possibly things like 
per-*size* heaps, which is much more memory-efficient and helps avoid some 
of the fragmentation problems). 

Locking is much slower than per-thread accesses, but it doesn't have the 
issues with per-thread-fragmentation and all the problems with one thread 
allocating and another one freeing.

			Linus
-
To: Nicolas Pitre <nico@...>
Cc: Jon Smirl <jonsmirl@...>, Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Wednesday, December 12, 2007 - 4:05 am

Maybe an malloc/free/mmap wrapper that records the requested sizes and
alloc/free order and dumps them to file so that one can make a compact
git-free standalone test case for the glibc maintainers might be a good
thing.
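
Something along these lines, as a bare-bones sketch: explicit wrapper
functions rather than a real LD_PRELOAD interposer, with no mmap
tracking and no thread safety. All names (trace_open, traced_malloc,
traced_free) and the log format are made up for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static FILE *trace_out;          /* where events go; NULL until opened */
static unsigned long trace_seq;  /* event counter, preserves the order */

void trace_open(FILE *out)
{
    assert(out != NULL);
    trace_out = out;
    trace_seq = 0;
}

/* Log "<seq> malloc <size> <ptr>" and return the allocation. */
void *traced_malloc(size_t size)
{
    void *p = malloc(size);

    if (trace_out)
        fprintf(trace_out, "%lu malloc %zu %p\n", trace_seq++, size, p);
    return p;
}

/* Log "<seq> free <ptr>" and release the allocation. */
void traced_free(void *p)
{
    if (trace_out)
        fprintf(trace_out, "%lu free %p\n", trace_seq++, p);
    free(p);
}
```

A replay tool would only need the sizes and the malloc/free pairing
from such a log to reproduce the pattern without git.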

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To: <dak@...>
Cc: <nico@...>, <jonsmirl@...>, <gitster@...>, <gcc@...>, <git@...>
Date: Friday, December 14, 2007 - 12:18 pm

I already have such a wrapper:

http://malloc.de/malloc/mtrace-20060529.tar.gz

But note that it does interfere with the thread scheduling, so it
can't record the exact same allocation pattern as when not using the
wrapper.

Regards,
Wolfram.
-
To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 11:36 am

Changing those parameters really slowed down counting the objects. I
used to be able to count in 45 seconds; now it took 130 seconds. I
still have the Google allocator linked in.

4 threads, cumulative clock time
25%     200 seconds, 820/627M
55%     510 seconds, 1240/1000M - little late recording
75%     15 minutes, 1658/1500M
90%      22 minutes, 1974/1800M
it's still running but there is no significant change.

Are two types of allocations being mixed?
1) long term, global objects kept until the end of everything
2) volatile, private objects allocated only while the object is being
compressed and then freed

Separating these would make a big difference to the fragmentation
problem. Single threading probably wouldn't see a fragmentation
problem from mixing the allocation types.

When a thread is created it could allocate a private 20MB (or
whatever) pool. The volatile, private objects would come from that
pool. Long term objects would stay in the global pool. Since they are
long term they will just get laid down sequentially in memory.
Separating these allocation types makes things way easier for malloc.
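
A minimal sketch of such a private pool, assuming one struct arena per
thread (so no locking) and that the volatile objects can all be dropped
together once an object has been processed. The names are hypothetical
and nothing like this exists in git today:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct arena {
    char *base;    /* start of the private pool */
    size_t size;   /* total bytes in the pool */
    size_t used;   /* bytes handed out so far */
};

/* Grab the whole pool up front. Returns 1 on success, 0 on failure. */
int arena_init(struct arena *a, size_t size)
{
    assert(size > 0);
    a->base = malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base != NULL;
}

/* Bump-allocate n bytes; NULL means "pool full, fall back to malloc". */
void *arena_alloc(struct arena *a, size_t n)
{
    size_t align = _Alignof(max_align_t);
    size_t start = (a->used + align - 1) / align * align;

    if (start > a->size || n > a->size - start)
        return NULL;
    a->used = start + n;
    return a->base + start;
}

/* Drop every volatile object at once; the pool itself is kept. */
void arena_reset(struct arena *a)
{
    a->used = 0;
}

void arena_destroy(struct arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

Long-term objects would keep going through plain malloc, so the two
lifetimes never interleave in the same heap.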

CPU time would be helped by removing some of the locking if possible.

-- 
Jon Smirl
jonsmirl@gmail.com
-
To: Jon Smirl <jonsmirl@...>
Cc: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 12:33 pm

Threaded code *always* takes more CPU time. The only thing you can hope 
for is a wall-clock reduction. You're seeing probably a combination of 
 (a) more cache misses
 (b) bigger dataset active at a time
and a probably fairly minuscule

Sure they are shared. They're just not *entirely* shared. But they are 
shared between each two cores, so each thread essentially has only half 
the cache they had with the non-threaded version.

Threading is *not* a magic solution to all problems. It gives you 
potentially twice the CPU power, but there are real downsides that you 

I suspect that it's really simple: you have a few rather big files in the 
gcc history, with deep delta chains. And what happens when you have four 
threads running at the same time is that they all need to keep all those 
objects that they are working on - and their hash state - in memory at the 
same time!

So if you want to use more threads, that _forces_ you to have a bigger 
memory footprint, simply because you have more "live" objects that you 
work on. Normally, that isn't much of a problem, since most source files 
are small, but if you have a few deep delta chains on big files, both the 
delta chain itself is going to use memory (you may have limited the size 
of the cache, but it's still needed for the actual delta generation, so 
it's not like the memory usage went away).

That said, I suspect there are a few things fighting you:

 - threading is hard. I haven't looked a lot at the changes Nico did to do 
   a threaded object packer, but what I've seen does not convince me it is 
   correct. The "trg_entry" accesses are *mostly* protected with 
   "cache_lock", but nothing else really seems to be, so quite frankly, I 
   wouldn't trust the threaded version very much. It's off by default, and 
   for a good reason, I think.

   For example: the packing code does this:

	if (!src->data) {
		read_lock();
		src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz);
		read_unlock...
To: Linus Torvalds <torvalds@...>
Cc: Jon Smirl <jonsmirl@...>, Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 1:21 pm

I beg to differ (of course, since I always know precisely what I do, and 
like you, my code never has bugs).

Seriously though, the trg_entry has not to be protected at all.  Why? 
Simply because each thread has its own exclusive set of objects which no 
>    see a NULL src->data, they
To: <nico@...>
Cc: <torvalds@...>, <jonsmirl@...>, <gitster@...>, <gcc@...>, <git@...>
Date: Tuesday, December 11, 2007 - 1:24 pm

From: Nicolas Pitre <nico@cam.org>

If you repack on the smaller pack file, git has to expand more stuff
internally in order to search the deltas, whereas with the larger pack
file I bet git less often has to undelta'ify to get base object blobs
for delta search.

In fact that behavior makes perfect sense to me and I don't understand
GIT internals very well :-)
-
To: David Miller <davem@...>
Cc: Linus Torvalds <torvalds@...>, <jonsmirl@...>, Junio C Hamano <gitster@...>, <gcc@...>, <git@...>
Date: Tuesday, December 11, 2007 - 1:44 pm

Of course.  I came to that conclusion two days ago.  And despite being 
pretty familiar with the involved code (I wrote part of it myself) I 
just can't spot anything wrong with it so far.

But somehow the threading code keeps distracting people from that issue 
since it gets to do the same work whether or not the source pack is 
densely packed or not.

Nicolas 
(who wishes he had access to a much faster machine to investigate this issue)
-
To: Nicolas Pitre <nico@...>
Cc: David Miller <davem@...>, Linus Torvalds <torvalds@...>, <jonsmirl@...>, Junio C Hamano <gitster@...>, <gcc@...>, <git@...>
Date: Tuesday, December 11, 2007 - 4:26 pm

If it's still an issue next week, we'll have a 16 core (8 dual-core cpu's)
machine with some 32gb of ram in that'll be free for about two days.
You'll have to remind me about it though, as I've got a lot on my mind
these days.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-
To: Jon Smirl <jonsmirl@...>
Cc: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 1:28 pm

Depends on your allocation patterns. For our apps, it certainly is :)
Of course, I don't know if we've updated the external allocator in a
while; I'll bug the people in charge of it.
-
To: Jon Smirl <jonsmirl@...>
Cc: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, <gcc@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 3:34 am

Did you use the tcmalloc with heap checker/profiler, or tcmalloc_minimal?

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-
To: Jon Smirl <jonsmirl@...>
Cc: Git Mailing List <git@...>, Nicolas Pitre <nico@...>
Date: Monday, December 10, 2007 - 10:55 pm

entry->delta_data is the only thing I can think of that is freed
in the function and was allocated much earlier, before entering
the function.
-
To: Junio C Hamano <gitster@...>
Cc: Jon Smirl <jonsmirl@...>, Git Mailing List <git@...>
Date: Monday, December 10, 2007 - 11:27 pm

Yet all ->delta_data instances are limited to 256MB according to Jon's 
config.

Nicolas
-
To: Nicolas Pitre <nico@...>
Cc: Junio C Hamano <gitster@...>, Jon Smirl <jonsmirl@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 7:08 am

Maybe address space fragmentation is involved here?  malloc/free for
large areas works using mmap in glibc.  There must be enough
_contiguous_ space for a new allocation to succeed.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To: David Kastrup <dak@...>
Cc: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, Jon Smirl <jonsmirl@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 8:08 am

Well, that's interesting, but there is a way to know for sure instead
of taking bets. Just use valgrind --tool=massif and look at the pretty
picture, it'll tell what was going on very accurately.

  Note that I find your explanation unlikely: glibc uses mmap for sizes
over 128k by default (IIRC), and as soon as you use mmaps, that's the
kernel that deals with the address space, and it's not necessarily
contiguous, that's only true for the heap.
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
To: Pierre Habouzit <madcoder@...>
Cc: Nicolas Pitre <nico@...>, Junio C Hamano <gitster@...>, Jon Smirl <jonsmirl@...>, Git Mailing List <git@...>
Date: Tuesday, December 11, 2007 - 8:18 am

Every single allocation needs to be contiguous in virtual address space
and must not collide with existing virtual address space allocations.
So fragmentation is at least a logistical issue.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To: Jon Smirl <jonsmirl@...>
Cc: Git Mailing List <git@...>
Date: Friday, December 7, 2007 - 10:56 pm

Just out of curiosity, does adding

         [pack]
                 windowmemory = 256M

help?  I've found this to grow very large when there are large blobs.

Dave
-