Using this config:
[pack]
threads = 4
deltacachesize = 256M
deltacachelimit = 0
And the 330MB gcc pack for input
git repack -a -d -f --depth=250 --window=250
complete seconds RAM
10% 47 1GB
20% 29 1GB
30% 24 1GB
40% 18 1GB
50% 110 1.2GB
60% 85 1.4GB
70% 195 1.5GB
80% 186 2.5GB
90% 489 3.8GB
95% 800 4.8GB
I killed it because it started swapping
The mmaps are only about 400MB in this case.
At the end the git process had 4.4GB of physical RAM allocated.
Starting from a highly compressed pack greatly aggravates the problem.
Starting with a 2GB pack of the same data my process size only grew to
3GB with 2GB of mmaps.
--
Jon Smirl
jonsmirl@gmail.com
-You said you had reproduced the issue, albeit not as severely, with the Linux kernel repo. I did just that:
# to get the default pack:
$ git repack -a -f -d
# first measurement with a repack from a default pack
$ /usr/bin/time git repack -a -f --window=256 --depth=256
2572.17user 5.87system 22:46.80elapsed 188%CPU (0avgtext+0avgdata 0maxresident)k
15720inputs+356640outputs (71major+264376minor)pagefaults 0swaps
# do it again to start from a highly packed pack
$ /usr/bin/time git repack -a -f --window=256 --depth=256
2573.53user 5.62system 22:45.60elapsed 188%CPU (0avgtext+0avgdata 0maxresident)k
29176inputs+356664outputs (210major+274887minor)pagefaults 0swaps
This is with pack.threads=2 on a P4 with HT, and I'm using the machine for other tasks as well, but the measured times are essentially the same in both cases. Virtual memory allocation never reached 700MB in either case.
Nicolas -
This is the mail about the kernel pack; the one you quoted is a gcc run. The kernel repo has the same problem but not nearly as bad.
Starting from a default pack:
git repack -a -d -f --depth=1000 --window=1000
Uses 1GB of physical memory.
Now do the command again:
git repack -a -d -f --depth=1000 --window=1000
Uses 1.3GB of physical memory.
I suspect the gcc repo has much longer revision chains than the kernel one since the kernel repo is only a few years old. The Mozilla repo contained revision chains with over 2,000 revisions. Longer revision chains result in longer delta chains.
So what is allocating the extra memory? Either it is a function of the number of entries in the chain, or it is related to accessing the chain, since a chain with more entries will need to be accessed more times.
I have a 168MB kernel pack now after 15 minutes of four cores at 100%.
Here's another observation: the gcc objects are larger. Kernel has 650K objects in 190MB, gcc has 870K objects in 330MB. Average gcc object is 30% larger. How should the average kernel developer interpret this? -- Jon Smirl jonsmirl@gmail.com -
Could this be explained by the ChangeLog file? It's large; it has tons of revisions; it is a prime candidate for delta compression. Morten -
Since you have a different result according to the source pack used then those cache settings, even if there was a bug with them, are not Which is quite reasonable, even if the same issue might still be there. So the problem seems to be related to the pack access code and not the repack code. And it must have something to do with the number of deltas being replayed. And because the repack is attempting delta compression roughly from newest to oldest, and because old objects are typically in a deeper delta chain, then this might explain the logarithmic slowdown. So something must be wrong with the delta cache in sha1_file.c somehow. Nicolas -
What could be wrongly allocating 4GB of memory? Figure that out and you should have your answer. The slowdown may be coming from having to search through more and more objects in memory. Memory consumption seems to be correlated with the depth of the delta chain being accessed. It blows up tremendously right at the end. It may even grow as the square of the chain length. For the normal default case the square didn't hurt, but 250*250 = 62,500, which would eat a huge amount of memory. -- Jon Smirl jonsmirl@gmail.com -
I applied the delta accounting patch. It took about 200MB off the memory use, but that doesn't make a dent in 4GB of allocations. -- Jon Smirl jonsmirl@gmail.com -
Right. I didn't expect much from that fix. Nicolas -
With my repo that contains a bunch of 50MB tarfiles, I've found I must specify --window-memory as well to keep repack from using nearly unbounded amounts of memory. Perhaps it is the larger files found in gcc that provokes this. A window size of 1000 can take a lot of memory if the objects are large. Dave -
This is a partial solution to the problem. Adding a window memory limit of 256M took memory consumption down from 4.8GB to 2.8GB. It took an hour to run the test. It is not the complete solution, since my git process is still using 2.4GB of physical memory. I am also still experiencing a lot of slowdown in the last 10%. Does the gcc repo contain some giant objects? Why wasn't the memory freed after their chain was processed? Most of the last 10% is being done on a single CPU. There must be a chain of giant objects that is unbalancing everything. -- Jon Smirl jonsmirl@gmail.com -
I'm about to send a patch to fix the thread balancing for real this time. Nicolas -
Something is really broken in the last 5% of that repo. I have been processing at 97% for 30 minutes without moving to 98%. -- Jon Smirl jonsmirl@gmail.com -
This is a clear sign of a problem, indeed. I'll be away for the weekend, so here are a few things to try out if you feel like it:
1) Make sure the problem occurs with the thread code disabled. That would eliminate one variable, and will help for #2.
2) Try bisecting the issue. If you can find an old Git version where the issue doesn't appear then simply run "git bisect" to find the exact commit causing the problem. Best with a repo that doesn't take ages to repack.
3) Compile Git against the dmalloc library in order to identify where the huge memory leak is happening.
Nicolas -
I sent out a partial delta breakdown for the gcc repo earlier, here's the whole list.
breakdown of the gcc packfile:
Total objects 1017922
ChainLength Objects Cumulative
1: 103817 103817
2: 67332 171149
3: 57520 228669
4: 52570 281239
5: 43910 325149
6: 37520 362669
7: 35248 397917
8: 29819 427736
9: 27619 455355
10: 22656 478011
11: 21073 499084
12: 18738 517822
13: 16674 534496
14: 14882 549378
15: 14424 563802
16: 12765 576567
17: 11662 588229
18: 11845 600074
19: 11694 611768
20: 9625 621393
21: 9031 630424
22: 8437 638861
23: 8217 647078
24: 7927 655005
25: 7955 662960
26: 7092 670052
27: 7004 677056
28: 6724 683780
29: 6626 690406
30: 5875 696281
31: 5970 702251
32: 5726 707977
33: 6025 714002
34: 5354 719356
35: 6413 725769
36: 4933 730702
37: 4888 735590
38: 4561 740151
39: 4366 744517
40: 4166 748683
41: 4531 753214
42: 4029 757243
43: 3701 760944
44: 3647 764591
45: 3553 768144
46: 3509 771653
47: 3473 775126
48: 3442 778568
49: 3379 781947
50: 3395 785342
51: 3315 788657
52: 3168 791825
53: 3345 795170
54: 3166 798336
55: 3237 801573
56: 2795 804368
57: 2768 807136
58: 2666 809802
59: 2723 812525
60: 2547 815072
61: 2565 817637
62: 2622 820259
63: 2521 822780
64: 2492 825272
65: 2529 827801
66: 2566 830367
67: 2685 833052
68: 2458 835510
69: 2457 837967
70: 2440 840407
71: 2410 842817
72: 2337 845154
73: 2301 847455
74: 2201 849656
75: 2127 851783
76: 2256 854039
77: 2038 856077
78: 1925 858002
79: 1965 859967
80: 1929 861896
81: 1890 863786
82: 1873 865659
83: 1964 867623
84: 1898 869521
85: 1839 871360
86: 1933 873293
87: 1876 875169
88: 1851 877020
89: 1789 878809
90: 1790 880599
91: 1804 882403
92: 1696 884099
93: 1863 885962
94: 1889 887851
95: 1766 889617
96: 1731 891348
97: 1775 893123
98: 1750 894873
99: 1767 896640
100: 1644 898284
101: 1642 899926
102: 1489 901415
103: 1532 902947
104: 1564 904511
105: 1477 905988
106: 1461 907449
107: 1383 908832
108: 1422 910254
109: 131...
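For readers who want to check or extend the breakdown above, here is a minimal sketch of how the Cumulative column follows from the Objects column. It only uses the first five rows and the object total quoted in the mail; the variable names are illustrative and not taken from any git tool.

```python
from itertools import accumulate

# Objects per chain length 1..5, copied from the breakdown in the mail.
counts = [103817, 67332, 57520, 52570, 43910]
total_objects = 1017922  # "Total objects" line from the same mail

# Running sum reproduces the "Cumulative" column.
cumulative = list(accumulate(counts))

# Fraction of all objects whose delta chain length is 5 or less.
share_short_chains = cumulative[-1] / total_objects

for length, (n, c) in enumerate(zip(counts, cumulative), start=1):
    print(f"{length}: {n} {c}")
print(f"chains of length <= 5: {share_short_chains:.1%} of objects")
```

Roughly a third of the objects sit in chains of length five or less, so the long tail of deep chains is where most of the delta replay work lands.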
I was reaching the same conclusion but haven't managed to spot anything blatantly wrong in that area. Will need to dig more. -
I didn't find anything wrong there either. I'll have to run some more gcc repacking tests myself, despite not having a blazingly fast machine making for rather long turnarounds. Nicolas -
Does this problem have correlation with the use of threads? Do you see the same bloat with or without THREADED_DELTA_SEARCH defined? -
Something else seems to be wrong.
With threading turned off, 5000 CPU seconds and 13% done.
With threading turned on, threads = 1, 5000 CPU seconds, 13%
With threading turned on, threads = 2, 180 CPU seconds, 13%
With threading turned on, threads = 4, 150 CPU seconds, 13%
This can't be right, four cores are not 40x one core. So maybe the observed logarithmic slow down is because the percent complete is being reported wrong in the threaded case. If that's the case we may be looking in the wrong place for problems. The times are only approximate, I'm using the CPU for other things. -- Jon Smirl jonsmirl@gmail.com -
It may be right. The object list to apply delta compression on doesn't necessarily require a uniform amount of cycles throughout. When using multiple threads, the list is broken in parts for each thread, and later parts might end up being simply much easier to process, therefore I really doubt it. Nicolas -
I just started a non-threaded one. It will be four or five hours before it finishes. -- Jon Smirl jonsmirl@gmail.com -
All I have is a qualitative observation, but during the process of creating the pack, there was a _huge_ slowdown between 10-15% (hundreds/dozens per second to single object per second and a corresponding increase in process size). Didn't keep any numbers at the time, but it was noticeable. I wonder if there are a bunch of huge objects somewhere in gcc's history? Harvey -
I think deltacachesize is broken. The code in try_delta() that replaces a delta cache entry with another one seems very buggy wrt that whole "delta_cache_size" update. It does
delta_cache_size -= trg_entry->delta_size;
to account for the old delta going away, but it does this *after* having already replaced trg_entry->delta_size with the new delta entry. I suspect there are other issues going on too, but that's the one that I noticed from a quick look-through. Nico? I think this one is yours.. Linus -
The wrong value was subtracted from delta_cache_size when replacing
a cached delta, as trg_entry->delta_size was used after the old size
had been replaced by the new size.
Noticed by Linus.
Signed-off-by: Nicolas Pitre <nico@cam.org>
---
Doh! Mea culpa.
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 4f44658..350ece4 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1422,10 +1422,6 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
}
}
- trg_entry->delta = src_entry;
- trg_entry->delta_size = delta_size;
- trg->depth = src->depth + 1;
-
/*
* Handle memory allocation outside of the cache
* accounting lock. Compiler will optimize the strangeness
@@ -1439,7 +1435,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
trg_entry->delta_data = NULL;
}
if (delta_cacheable(src_size, trg_size, delta_size)) {
- delta_cache_size += trg_entry->delta_size;
+ delta_cache_size += delta_size;
cache_unlock();
trg_entry->delta_data = xrealloc(delta_buf, delta_size);
} else {
@@ -1447,6 +1443,10 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
free(delta_buf);
}
+ trg_entry->delta = src_entry;
+ trg_entry->delta_size = delta_size;
+ trg->depth = src->depth + 1;
+
return 1;
}
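The effect of the ordering fixed by the patch above can be illustrated with a small model. This is only a sketch under simplifying assumptions: the dictionary stands in for trg_entry, the function names are hypothetical, and locking and the cacheability check are left out.

```python
def replace_buggy(cache_size, entry, new_size):
    """Pre-patch ordering: the entry is overwritten before the accounting."""
    entry["delta_size"] = new_size        # old size is lost here
    cache_size -= entry["delta_size"]     # subtracts the NEW size by mistake
    cache_size += new_size
    return cache_size


def replace_fixed(cache_size, entry, new_size):
    """Post-patch ordering: account first, update the entry last."""
    cache_size -= entry["delta_size"]     # still holds the OLD size
    cache_size += new_size
    entry["delta_size"] = new_size
    return cache_size


if __name__ == "__main__":
    # One cached delta of 100 bytes is replaced by a 40-byte delta.
    print(replace_buggy(100, {"delta_size": 100}, 40))
    print(replace_fixed(100, {"delta_size": 100}, 40))
```

In the buggy ordering the subtraction and the addition cancel out, so the counter never reflects the replacement and drifts away from the real amount of cached delta data.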
-New run using same configuration. With the addition of the more efficient load balancing patches and delta cache accounting.
Seconds are wall clock time. They are lower since the patch made threading better at using all four cores. I am stuck at 380-390% CPU utilization for the git process.
complete seconds RAM
10% 60 900M (includes counting)
20% 15 900M
30% 15 900M
40% 50 1.2G
50% 80 1.3G
60% 70 1.7G
70% 140 1.8G
80% 180 2.0G
90% 280 2.2G
95% 530 2.8G - 1,420 total to here, previous was 1,983
100% 1390 2.85G
During the writing phase RAM fell to 1.6G. What is being freed in the writing phase??
I have no explanation for the change in RAM usage. Two guesses come to mind. Memory fragmentation. Or the change in the way the work was split up altered RAM usage.
Total CPU time was 195 minutes in 70 minutes clock time. About 70% efficient. During the compress phase all four cores were active until the last 90 seconds. Writing the objects took over 23 minutes CPU bound on one core.
New pack file is: 270,594,853
Old one was: 344,543,752
It still has 828,660 objects -- Jon Smirl jonsmirl@gmail.com -
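The "about 70% efficient" figure and the pack size change in the run above can be checked with a few lines of arithmetic. A minimal sketch, assuming efficiency means CPU time divided by (wall time x cores); the helper names are illustrative.

```python
def parallel_efficiency(cpu_minutes, wall_minutes, cores):
    """Fraction of the available core time that was actually used."""
    return cpu_minutes / (wall_minutes * cores)


def size_reduction(old_bytes, new_bytes):
    """Relative shrinkage of the new pack compared with the old one."""
    return 1 - new_bytes / old_bytes


# Numbers quoted in the mail: 195 CPU minutes, 70 wall-clock minutes, 4 cores.
efficiency = parallel_efficiency(195, 70, 4)
# Pack sizes quoted in the mail, in bytes.
reduction = size_reduction(344_543_752, 270_594_853)

print(f"efficiency: {efficiency:.0%}")
print(f"pack is {reduction:.1%} smaller")
```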
The cached delta results, but you put a cap of 256MB for them. Could you try again with that cache disabled entirely, with pack.deltacachesize = 1 (don't use 0 as that means unbounded). And then, while still keeping the delta cache disabled, could you try with pack.threads = 2, and pack.threads = 1 ? I'm sorry to ask you to do this but I don't have enough ram to even complete a repack with threads=2 so I'm reattempting single threaded at the moment. But I really wonder if the threading has such an effect on You mean the pack for the gcc repo is now less than 300MB? Wow. Nicolas -
I already have a threads = 1 running with this config. Binary and
config were same from threads=4 run.
10% 28min 950M
40% 135min 950M
50% 157min 900M
60% 160min 830M
100% 170min 830M
Something is hurting bad with threads. 170 CPU minutes with one
thread, versus 195 CPU minutes with four threads.
Is there a different memory allocator that can be used when
multithreaded on gcc? This whole problem may be coming from the memory
allocation function. git is hardly interacting at all on the thread
level so it's likely a problem in the C run-time.
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
[pack]
threads = 1
deltacachesize = 256M
windowmemory = 256M
deltacachelimit = 0
[remote "origin"]
url = git://git.infradead.org/gcc.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "trunk"]
remote = origin
--
Jon Smirl
jonsmirl@gmail.com
-On Tue, 11 Dec 2007 00:25:55 -0500 You might want to try Google's malloc, it's basically a drop in replacement with some optional built-in performance monitoring capabilities. It is said to be much faster and better at threading than glibc's: http://code.google.com/p/google-perftools/wiki/GooglePerformanceTools http://google-perftools.googlecode.com/svn/trunk/doc/tcmalloc.html You can LD_PRELOAD it or link directly. Cheers, Sean -
I'm 45 minutes into a run using it. It doesn't seem to be any faster but it is reducing memory consumption significantly. The run should be -- Jon Smirl jonsmirl@gmail.com -
I added the gcc people to the CC, it's their repository. Maybe they can help us sort this out. -- Jon Smirl jonsmirl@gmail.com -
Unless there is a Git expert amongst the gcc crowd, I somehow doubt it. And gcc people with an interest in Git internals are probably already on the Git mailing list. Nicolas -
Switching to the Google perftools malloc
http://goog-perftools.sourceforge.net/
10% 30 828M
20% 15 831M
30% 10 834M
40% 50 1014M
50% 80 1086M
60% 80 1500M
70% 200 1.53G
80% 200 1.85G
90% 260 1.87G
95% 520 1.97G
100% 1335 2.24G
Google allocator knocked 600MB off from memory use. Memory consumption did not fall during the write out phase like it did with gcc.
Since all of this is with the same code except for changing the threading split, those runs where memory consumption went to 4.5GB with the gcc allocator must have triggered an extreme problem with fragmentation.
Total CPU time 196 CPU minutes vs 190 for gcc. Google's claims of being faster are not true.
So why does our threaded code take 20 CPU minutes longer (12%) to run than the same code with a single thread? Clock time is obviously faster. Are the threads working too close to each other in memory and bouncing cache lines between the cores? Q6600 is just two E6600s in the same package, the caches are not shared.
Why does the threaded code need 2.24GB (google allocator, 2.85GB gcc) with 4 threads? But only need 950MB with one thread? Where's the extra gigabyte going?
Is there another allocator to try? One that combines Google's efficiency with gcc's speed? -- Jon Smirl jonsmirl@gmail.com -
Of course there'll always be a certain amount of wasted cycles when threaded. The locking overhead, the extra contention for IO, etc. So 12% overhead (3% per thread) when using 4 threads is not that bad I
I really don't know. Did you try with pack.deltacachesize set to 1 ?
And yet, this is still missing the actual issue. The issue being that the 2.1GB pack as a _source_ doesn't cause as much memory to be allocated even if the _result_ pack ends up being the same.
I was able to repack the 2.1GB pack on my machine which has 1GB of ram. Now that it has been repacked, I can't repack it anymore, even when single threaded, as it starts crawling into swap fairly quickly. It is really non-intuitive and actually senseless that Git would require twice as much RAM to deal with a pack that is 7 times smaller. Nicolas (still puzzled) -
OK, here's something else for you to try:
core.deltabasecachelimit=0
pack.threads=2
pack.deltacachesize=1
With that I'm able to repack the small gcc pack on my machine with 1GB of ram using:
git repack -a -f -d --window=250 --depth=250
and top reports a ~700m virt and ~500m res without hitting swap at all. It is only at 25% so far, but I was unable to get that far before. Would be curious to know what you get with 4 threads on your machine. Nicolas -
Well, around 55% memory usage skyrocketed to 1.6GB and the system went deep into swap. So I restarted it with no threads. Nicolas (even more puzzled) -
On the plus side you are seeing what I see, so it proves I am not imagining it. -- Jon Smirl jonsmirl@gmail.com -
Well... This is weird.
It seems that memory fragmentation is really really killing us here.
The fact that the Google allocator managed to waste considerably less memory
is a good indicator already.
I did modify the progress display to show accounted memory that was
allocated vs memory that was freed but still not released to the system.
At least that gives you an idea of memory allocation and fragmentation
with glibc in real time:
diff --git a/progress.c b/progress.c
index d19f80c..46ac9ef 100644
--- a/progress.c
+++ b/progress.c
@@ -8,6 +8,7 @@
* published by the Free Software Foundation.
*/
+#include <malloc.h>
#include "git-compat-util.h"
#include "progress.h"
@@ -94,10 +95,12 @@ static int display(struct progress *progress, unsigned n, const char *done)
if (progress->total) {
unsigned percent = n * 100 / progress->total;
if (percent != progress->last_percent || progress_update) {
+ struct mallinfo m = mallinfo();
progress->last_percent = percent;
- fprintf(stderr, "%s: %3u%% (%u/%u)%s%s",
- progress->title, percent, n,
- progress->total, tp, eol);
+ fprintf(stderr, "%s: %3u%% (%u/%u) %u/%uMB%s%s",
+ progress->title, percent, n, progress->total,
+ m.uordblks >> 18, m.fordblks >> 18,
+ tp, eol);
fflush(stderr);
progress_update = 0;
return 1;
This shows that at some point the repack goes into a big memory surge.
I don't have enough RAM to see how fragmented memory gets though, since
it starts swapping around 50% done with 2 threads.
With only 1 thread, memory usage grows significantly at around 11% with
a pretty noticeable slowdown in the progress rate.
So I think the theory goes like this:
There is a block of big objects together in the list somewhere.
Initially, all those big objects are assigned to thread #1 out of 4.
Because those objects are big, they get really slow to delta compress,
and storing them all in a window with 250 slots takes s...
Note: I didn't know what unit of memory those blocks represent, so the shift is most probably wrong. Nicolas -
Me neither, but it appears to me as if hblkhd holds the actual memory consumed by the process. It seems to store the information in bytes, which I find a bit dubious unless glibc has some internal multiplier. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 -
mallinfo() will only give you the used memory for the main arena. When you have separate arenas (likely when concurrent threads have been used), the only way to get the full picture is to call malloc_stats(), which prints to stderr. Regards, Wolfram. -
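On the question of units in the progress patch: the glibc mallinfo fields uordblks and fordblks are byte counts, so a shift by 18 yields units of 256KB rather than megabytes. A small sketch of the difference, with hypothetical helper names:

```python
def to_megabytes(byte_count):
    """Convert a byte count to whole megabytes (2**20 bytes each)."""
    return byte_count >> 20


def patch_shift(byte_count):
    """The shift used in the progress patch: units of 2**18 bytes (256KB)."""
    return byte_count >> 18


# Example value: exactly 512MB of allocated memory, expressed in bytes.
allocated = 512 * 1024 * 1024

print(to_megabytes(allocated), "MB with >> 20")
print(patch_shift(allocated), "units with >> 18 (four times too large)")
```

As Wolfram notes, this still only covers the main arena, so the numbers understate usage once several threads have their own arenas.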
OK scrap that. When I returned to the computer this morning, the repack was completed... with a 1.3GB pack instead.
So... The gcc repo apparently really needs a large window to efficiently compress those large objects. But when those large objects are already well deltified and you repack again with a large window, somehow the memory allocator is way more involved, probably even more so when there are several threads in parallel amplifying the issue, and things probably get to a point of no return with regard to memory fragmentation after a while.
So... my conclusion is that the glibc allocator has fragmentation issues with this workload, given the notable difference with the Google allocator, which itself might not be completely immune to fragmentation issues of its own. And because the gcc repo requires a large window of big objects to get good compression, you're better off not using 4 threads to repack it with -a -f. The fact that the size of the source pack has such an influence is probably only because the increased usage of the delta base object cache is playing a role in the global memory allocation pattern, allowing for the bad fragmentation issue to occur.
If you could run one last test with the mallinfo patch I posted, without the pack.windowmemory setting, and add the reported values along with those from top, then we could formally conclude that memory fragmentation is the issue.
So I don't think Git itself is actually bad. The gcc repo most certainly constitutes a nasty use case for memory allocators, but I don't think there is much we can do about it besides possibly implementing our own memory allocator with active defragmentation where possible (read memcpy) at some point to give glibc's allocator some chance to breathe a bit more.
In the meantime you might have to use only one thread and lots of memory to repack the gcc repo, or find the perfect memory allocator to be used with Git. After all, packing the whole gcc history...
Is there an alternative to "git repack -a -d" that repacks everything but the first pack? -- Duy -
That would be a pretty good idea for big repositories. If I were to implement it, I would actually add a .git/config option like pack.permanent so that more than one pack could be made permanent; then to repack really really everything you'd need "git repack -a -a -d". Paolo -
It's already there: If you have a pack .git/objects/pack/pack-foo.pack, then "touch .git/objects/pack/pack-foo.keep" marks the pack as precious. -- Hannes -
Actually there is something like this, as seen from the source of
git-repack:
for e in `cd "$PACKDIR" && find . -type f -name '*.pack' \
| sed -e 's/^\.\///' -e 's/\.pack$//'`
do
if [ -e "$PACKDIR/$e.keep" ]; then
: keep
else
args="$args --unpacked=$e.pack"
existing="$existing $e"
fi
done
So, just create a file named as the pack, but with extension ".keep".
Paolo
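The selection logic in the git-repack shell loop above can be restated as a short sketch: every pack without a matching .keep file is a candidate for repacking. The function name is hypothetical, and the demo builds a throwaway directory rather than touching a real repository.

```python
import tempfile
from pathlib import Path


def repackable_packs(pack_dir):
    """Return base names of packs that have no matching .keep file."""
    names = []
    for pack in sorted(Path(pack_dir).glob("*.pack")):
        if pack.with_suffix(".keep").exists():
            continue  # marked as precious, leave it alone
        names.append(pack.stem)
    return names


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # pack-a is protected by a .keep file, pack-b is not.
        for name in ("pack-a.pack", "pack-a.keep", "pack-b.pack"):
            Path(d, name).touch()
        print(repackable_packs(d))
```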
-Yes.
Note that delta following involves patterns something like
allocate (small) space for delta
for i in (1..depth) {
allocate large space for base
allocate large space for result
.. apply delta ..
free large space for base
free small space for delta
}
so if you have some stupid heap algorithm that doesn't try to merge and
re-use free'd spaces very aggressively (because that takes CPU time!), you
might have memory usage be horribly inflated by the heap having all those
holes for all the objects that got free'd in the chain that don't get
aggressively re-used.
Threaded memory allocators then make this worse by probably using totally
different heaps for different threads (in order to avoid locking), so they
will *all* have the fragmentation issue.
And if you *really* want to cause trouble for a memory allocator, what you
should try to do is to allocate the memory in one thread, and free it in
another, and then things can really explode (the freeing thread notices
that the allocation is not in its thread-local heap, so instead of really
freeing it, it puts it on a separate list of areas to be freed later by
the original thread when it needs memory - or worse, it adds it to the
local thread list, and makes it effectively totally impossible to then
ever merge different free'd allocations ever again because the freed
things will be on different heap lists!).
I'm not saying that particular case happens in git, I'm just saying that
it's not unheard of. And with the delta cache and the object lookup, it's
not at _all_ impossible that we hit the "allocate in one thread, free in
another" case!
Linus
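To make the hole argument above concrete, here is a toy model of a heap that never merges or splits freed blocks. It is only a sketch: the first-fit policy, the sizes and the function name are assumptions chosen for illustration, and glibc's real allocator is far more sophisticated.

```python
def replay_chain(depth, first_size, growth, delta_size=64):
    """Replay a delta chain on a naive heap; return (heap_top, live_bytes)."""
    heap_top = 0
    holes = []  # freed (offset, size) blocks, never merged or split

    def alloc(size):
        nonlocal heap_top
        for i, (_, hole_size) in enumerate(holes):
            if hole_size >= size:
                return holes.pop(i)      # first fit, whole hole is consumed
        block = (heap_top, size)         # nothing fits: grow the heap
        heap_top += size
        return block

    base = alloc(first_size)
    size = first_size
    for _ in range(depth):
        delta = alloc(delta_size)        # small space for the delta
        size += growth                   # each result is a bit larger
        result = alloc(size)             # large space for the result
        holes.append(base)               # free large space for base
        holes.append(delta)              # free small space for delta
        base = result                    # the result is the next base
    return heap_top, base[1]


if __name__ == "__main__":
    top, live = replay_chain(depth=250, first_size=1000, growth=100)
    print(f"heap grew to {top} bytes for {live} live bytes")
```

When every object in the chain is slightly larger than the previous one, no freed hole is ever big enough for the next result, so the heap keeps growing even though only one object is live at the end.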
-ptmalloc2 (in glibc) _per arena_ is basically best-fit. This is the best known general strategy, but it certainly cannot be the best in It depends how large 'large' is -- if it exceeds the mmap() threshold (settable with mallopt(M_MMAP_THRESHOLD, ...)) the 'large' spaces will be allocated with mmap() and won't cause any internal fragmentation. It might pay to experiment with this parameter if it is hard to Indeed. Could someone perhaps try ptmalloc3 (http://malloc.de/malloc/ptmalloc3-current.tar.gz) on this case? Thanks, Wolfram. -
Uh what? Someone crank out his copy of "The Art of Computer Programming", I think volume 1. Best fit is known (analyzed and proven and documented decades ago) to be one of the worst strategies for memory allocation. Exactly because it leads to huge fragmentation problems. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
Well, quoting http://gee.cs.oswego.edu/dl/html/malloc.html: "As shown by Wilson et al, best-fit schemes (of various kinds and approximations) tend to produce the least fragmentation on real loads compared to other general approaches such as first-fit." See [Wilson 1995] ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps for more details and references. Regards, Wolfram. -
Is it hard to hack up something that statically allocates a big block of memory per thread for these two and then just reuses it? allocate (small) space for delta allocate large space for base The alternating between long term and short term allocations -- Jon Smirl jonsmirl@gmail.com -
From: Linus Torvalds <torvalds@linux-foundation.org> One thing that supports these theories is that, while running these large repacks, I notice that the RSS is roughly 2/3 of the amount of virtual address space allocated. I personally don't think it's unreasonable for GIT to have its own customized allocator at least for certain object types. -
Well, we actually already *do* have a customized allocator, but currently only for the actual core "object descriptor" that really just has the SHA1 and object flags in it (and a few extra words depending on object type).
Those are critical for certain loads, and small too (so using the standard allocator wasted a _lot_ of memory). In addition, they're fixed-size and never free'd, so a specialized allocator really can do a lot better than any general-purpose memory allocator ever could.
But the actual object *contents* are currently all allocated with whatever the standard libc malloc/free allocator is that you compile for (or load dynamically). Having a specialized allocator for them is a much more involved issue, exactly because we do have interesting allocation patterns etc.
That said, at least those object allocations are all single-threaded (for right now, at least), so even when git does multi-threaded stuff, the core sha1_file.c stuff is always run under a single lock, and a simpler allocator that doesn't care about threads is likely to be much better than one that tries to have thread-local heaps etc.
I suspect that is what the google allocator does. It probably doesn't have per-thread heaps, it just uses locking (and quite possibly things like per-*size* heaps, which is much more memory-efficient and helps avoid some of the fragmentation problems).
Locking is much slower than per-thread accesses, but it doesn't have the issues with per-thread-fragmentation and all the problems with one thread allocating and another one freeing. Linus -
Maybe an malloc/free/mmap wrapper that records the requested sizes and alloc/free order and dumps them to file so that one can make a compact git-free standalone test case for the glibc maintainers might be a good thing. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
I already have such a wrapper: http://malloc.de/malloc/mtrace-20060529.tar.gz But note that it does interfere with the thread scheduling, so it can't record the exact same allocation pattern as when not using the wrapper. Regards, Wolfram. -
Changing those parameters really slowed down counting the objects. I used to be able to count in 45 seconds; now it took 130 seconds. I still have the Google allocator linked in.
4 threads, cumulative clock time:
25% 200 seconds, 820/627M
55% 510 seconds, 1240/1000M - little late recording
75% 15 minutes, 1658/1500M
90% 22 minutes, 1974/1800M
It's still running but there is no significant change.
Are two types of allocations being mixed?
1) long term, global objects kept until the end of everything
2) volatile, private objects allocated only while the object is being compressed and then freed
Separating these would make a big difference to the fragmentation problem. Single threading probably wouldn't see a fragmentation problem from mixing the allocation types.
When a thread is created it could allocate a private 20MB (or whatever) pool. The volatile, private objects would come from that pool. Long term objects would stay in the global pool. Since they are long term they will just get laid down sequentially in memory. Separating these allocation types makes things way easier for malloc.
CPU time would be helped by removing some of the locking if possible. -- Jon Smirl jonsmirl@gmail.com -
Threaded code *always* takes more CPU time. The only thing you can hope
for is a wall-clock reduction. You're seeing probably a combination of
(a) more cache misses
(b) bigger dataset active at a time
and a probably fairly minuscule
Sure they are shared. They're just not *entirely* shared. But they are
shared between each two cores, so each thread essentially has only half
the cache they had with the non-threaded version.
Threading is *not* a magic solution to all problems. It gives you
potentially twice the CPU power, but there are real downsides that you
I suspect that it's really simple: you have a few rather big files in the
gcc history, with deep delta chains. And what happens when you have four
threads running at the same time is that they all need to keep all those
objects that they are working on - and their hash state - in memory at the
same time!
So if you want to use more threads, that _forces_ you to have a bigger
memory footprint, simply because you have more "live" objects that you
work on. Normally, that isn't much of a problem, since most source files
are small, but if you have a few deep delta chains on big files, the
delta chain itself is going to use memory (you may have limited the size
of the cache, but it's still needed for the actual delta generation, so
it's not like the memory usage went away).
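As a back-of-the-envelope illustration of the point above, here is a small sketch of how the live working set scales with the thread count. It assumes that every thread keeps its own window of objects, each held together with its hash state (delta index). The function name live_window_bytes and all of its parameters are made up for this example; the inputs are whatever the reader wants to plug in, not values measured from git.

```c
#include <assert.h>

/*
 * Rough upper bound on the bytes kept live by the delta search,
 * assuming each thread holds up to `window` objects and each object
 * costs its data plus its delta index (the "hash state").
 * Hypothetical helper, not part of git.
 */
static unsigned long long live_window_bytes(unsigned threads,
					    unsigned window,
					    unsigned long long object_bytes,
					    unsigned long long index_bytes)
{
	return (unsigned long long)threads * window *
	       (object_bytes + index_bytes);
}
```

Under these assumptions the bound is linear in the number of threads: going from one thread to four multiplies it by four, and a few large files with deep delta chains push object_bytes up for every slot of every window at once.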
That said, I suspect there are a few things fighting you:
- threading is hard. I haven't looked a lot at the changes Nico did to do
a threaded object packer, but what I've seen does not convince me it is
correct. The "trg_entry" accesses are *mostly* protected with
"cache_lock", but nothing else really seems to be, so quite frankly, I
wouldn't trust the threaded version very much. It's off by default, and
for a good reason, I think.
For example: the packing code does this:
	if (!src->data) {
		read_lock();
		src->data = read_sha1_file(src_entry->idx.sha1, &type, &sz);
		read_unlock...
I beg to differ (of course, since I always know precisely what I do, and like you, my code never has bugs). Seriously though, the trg_entry does not have to be protected at all. Why? Simply because each thread has its own exclusive set of objects which no > see a NULL src->data, they
From: Nicolas Pitre <nico@cam.org> If you repack on the smaller pack file, git has to expand more stuff internally in order to search the deltas, whereas with the larger pack file I bet git less often has to undeltify to get base object blobs for delta search. In fact that behavior makes perfect sense to me and I don't understand GIT internals very well :-) -
Of course. I came to that conclusion two days ago. And despite being pretty familiar with the involved code (I wrote part of it myself) I just can't spot anything wrong with it so far. But somehow the threading code keeps distracting people from that issue, since it gets to do the same work whether or not the source pack is densely packed. Nicolas (who wishes he had access to a much faster machine to investigate this issue) -
If it's still an issue next week, we'll have a 16 core (8 dual-core cpu's) machine with some 32gb of ram in that'll be free for about two days. You'll have to remind me about it though, as I've got a lot on my mind these days. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 -
Depends on your allocation patterns. For our apps, it certainly is :) Of course, i don't know if we've updated the external allocator in a while, i'll bug the people in charge of it. -
Did you use the tcmalloc with heap checker/profiler, or tcmalloc_minimal? -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 -
entry->delta_data is the only thing I can think of that is freed in the function but was allocated much earlier, before entering the function. -
Yet all ->delta_data instances are limited to 256MB according to Jon's config. Nicolas -
Maybe address space fragmentation is involved here? malloc/free for large areas works using mmap in glibc. There must be enough _contiguous_ space for a new allocation to succeed. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
Well, that's interesting, but there is a way to know for sure instead of taking bets. Just use valgrind --tool=massif and look at the pretty picture, it'll tell what was going on very accurately. Note that I find your explanation unlikely: glibc uses mmap for sizes over 128k by default (IIRC), and as soon as you use mmaps, that's the kernel that deals with the address space, and it's not necessarily contiguous, that's only true for the heap. -- ·O· Pierre Habouzit ··O madcoder@debian.org OOO http://www.madism.org
Every single allocation needs to be contiguous in virtual address space and must not collide with existing virtual address space allocations. So fragmentation is at least a logistical issue. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
Just out of curiosity, does adding
[pack]
windowmemory = 256M
help? I've found this to grow very large when there are large blobs.
Dave
-
