Hi,
This is my attempt to implement the 'lazy clone' I've read about a bit in the
git mailing list archive, but did not see implemented anywhere - the clone
that fetches a minimal amount of data with the possibility to download the
rest later (transparently!) when necessary. I am sorry to send it as a huge
patch, not as a series of patches, but as I don't know if I chose a way that is
acceptable for you [I'm new to the git code ;-)], I'd like to hear some
feedback first, and then I'll split it into smaller pieces for easier
integration - if OK.

Background:
Currently we are evaluating the usage of git for OpenOffice.org as one of the
candidates (SVN is the other one), see http://wiki.services.openoffice.org/wiki/SCM_Migration
I've provided a git import of OOo with the entire history; the problem is that
the pack has 2.5G, so it's not too convenient to download for casual
developers that just want to try it. Shallow clone is not a possibility - we
don't get patches through mailing lists, so we need the pull/push, and also
thanks to the OOo development cycle, we have too many living heads which
causes the shallow clone to download about 1.5G even with --depth 1. Lazy
clone sounded like the right idea to me. With this proof-of-concept
implementation, just about 550M from the 2.5G is downloaded, which is still
about twice as much in comparison with downloading a tarball, but bearable.

The principle:
During the initial clone, just the commit objects are downloaded. Then, any
time an object is requested, it is downloaded from the remote repository if
not available locally. To make this usable and performant, when a tree is
requested, it is downloaded together with all the subtrees and blobs at which
it points. Every subsequent pull (of stuff newer than what was cloned) is
supposed to use the normal git mechanisms.

Protocol extensions:
I've extended the git protocol in 2 ways:
- added the 'commits-only' flag that is used during the clone to get a pack
containin...
Hi,
2nd part of my review:
I thought about this function again. It seems we have something similar
in builtin-pack-objects.c, which is easier to read. The equivalent would
be:

	static void read_from_stdin(int *num, char ***records)
	{
		char line[4096];
		int alloc = 0;

		*num = 0;
		*records = NULL;
		for (;;) {
			if (!fgets(line, sizeof(line), stdin)) {
				if (feof(stdin))
					break;
				if (!ferror(stdin))
					die("fgets returned NULL, not EOF, nor error!");
				if (errno != EINTR)
					die("fgets: %s", strerror(errno));
				/* interrupted by a signal: clear the error and retry */
				clearerr(stdin);
				continue;
			}
			if (!line[0])
				continue;
			ALLOC_GROW(*records, *num + 1, alloc);
			(*records)[(*num)++] = xstrdup(line);
		}
	}

Please have a different option than --shared for lazy clones. Maybe
--lazy? ;-)

I can see why you reused --shared, though. But let's make this more
I might be missing something, but I do not believe this is necessary.
I think it would make sense. For example if you have a local machine
<bikeshedding>maybe remote-alternates (note the dash instead
Seems that you do something like the read_from_stdin() here, only from a
file. It appears to me as if the function wants to be a library function
(taking a FILE * parameter, and maybe closing it after use, or even

It'd be probably better to make this an array which uses ALLOC_GROW() in
That is a
return error("Error %d while calling fetch-pack", err);
And it does not really matter what type of error it is: you must report
The curly brackets are not necessary. Plus, with fill_remote_list() as
you defined it, it will break down with submodules (see 481f0ee6(Fix

This just cries out loud for a non-recursive approach: have two arrays,
clear the second, fetch the objects in the first array, then fill the
second with the objects referred to by the first array's objects. Then

Maybe it would be nicer to have the has_remote_alternates() check only in
Or
revs.tag_objects = revs.tree_objects = revs.blob_ob...
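
The two-array, non-recursive approach suggested in this review could look roughly like the sketch below. It is only an illustration on a toy in-memory object graph: struct toy_object, fetch_reachable() and the fixed MAX_OBJECTS limit are made-up names for this example, not anything from the patch or from git, and the "fetch" is just a counter where a real implementation would call fetch-pack once per round.

```c
#include <assert.h>
#include <string.h>

#define MAX_OBJECTS 64	/* hypothetical size limit of the toy object store */
#define MAX_REFS 4	/* hypothetical limit of references per object */

/* Toy object: refs[] lists the ids of the objects it points at. */
struct toy_object {
	int nr_refs;
	int refs[MAX_REFS];
};

/*
 * "Fetch" everything reachable from root without recursion.
 * current[] holds the ids to fetch in this round, next[] collects the
 * ids they refer to.  have[] marks objects that are already local or
 * already queued, so no object is queued twice.
 * Returns the number of objects fetched.
 */
static int fetch_reachable(const struct toy_object *store, int nr_objects,
			   int root, unsigned char *have)
{
	int current[MAX_OBJECTS], next[MAX_OBJECTS];
	int nr_current = 0, nr_next, fetched = 0, i, j;

	assert(nr_objects <= MAX_OBJECTS);
	assert(root >= 0 && root < nr_objects);

	if (!have[root]) {
		have[root] = 1;
		current[nr_current++] = root;
	}
	while (nr_current) {
		nr_next = 0;		/* clear the second array */
		fetched += nr_current;	/* one batched "fetch" of the first array */
		for (i = 0; i < nr_current; i++) {
			const struct toy_object *obj = &store[current[i]];
			for (j = 0; j < obj->nr_refs; j++) {
				int ref = obj->refs[j];
				if (!have[ref]) {
					have[ref] = 1;
					next[nr_next++] = ref;
				}
			}
		}
		/* the second array becomes the first one for the next round */
		memcpy(current, next, nr_next * sizeof(int));
		nr_current = nr_next;
	}
	return fetched;
}
```

The number of rounds equals the depth of the object graph below the root, so a remote round trip would be paid per level rather than per object.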
It was not implemented because it was thought to be hard; git assumes
in many places that if it has an object, it has all objects referenced
by it.

But it is very nice of you to [try to] implement 'lazy clone'/'remote
alternates'.

Could you provide some benchmarks (time, network throughput, latency)
One of the reasons why 'lazy clone' was not implemented was the fact
that by using a large enough window, and a larger than default delta
chain length, you can repack an "archive pack" (and keep it from being
repacked again using .keep files, see git-config(1)) much tighter than with
the default (time and CPU conserving) options, and much, much tighter than
a pack which is the result of a fast-import driven import.

Both the Mozilla import and the GCC import were packed below 0.5 GB. Warning:
you would need a machine with a large amount of memory to repack it

Wouldn't it be easier to try to fix the shallow clone implementation to allow
for pushing from shallow to full clone (fetching from full to shallow
is implemented), and perhaps also push/pull between two shallow
clones?

As to many living heads: first, you don't need to fetch all
heads. Currently git-clone has no option to select a subset of heads to
clone, but you can always use git-init + hand configuration +
git-remote and git-fetch for actual fetching.

By the way, did you try to split the OpenOffice.org repository at the
component boundaries into submodules (subprojects)? This would also
limit the amount of needed download, as you don't need to download and
check out all subprojects.

The problem of course is _how_ to split the repository into
submodules. Submodules should be enough self contained so the

Do you have any numbers for the OOo repository like number of revisions,
depth of DAG of commits (maximum number of revisions in one line of
commits), number of files, size of checkout, average size of file,
etc.?

--
Jakub Narebski
Poland
ShadeHawk on #git
-
Hi Jakub,
Unfortunately not yet :-( The only data I have is that a clone done on
git://localhost/ooo.git took 10 minutes without the lazy clone, and 7.5
minutes with it - and then I sent the patch for review here ;-) The deadline
for our SVN vs. git comparison for OOo is the next Friday, so I'll definitely

As I answered elsewhere, unfortunately it goes out of memory even on 8G

I tried to look into it a bit, but unfortunately did not see a clear way how
to do it transparently for the user - say you pull a branch that is based off

Right, might be interesting as well. But still the missing push/pull is

Yes, and got to much nicer repositories by that ;-) - by only moving some
binary stuff out of the CVS to a separate tree. The problem is that the deal

I'll try to provide the data ASAP.
Regards,
Jan
-
Hi, Jan!
Here is perhaps another optimization which wasn't done because git is
fast enough on moderately-sized repositories, namely that IIRC git-clone
(and git-fetch for sure) over the native (smart) protocol recreates the pack,
even if sometimes it would be better and simpler to just copy (transfer)
the existing pack.

But this would need a multi-pack "extension". (it should work just now
without transport protocol extension, receiver must only be aware

If I remember correctly, fetching _into_ a shallow clone works correctly,
as does deepening the depth of a shallow clone. What is not implemented AFAIK, but
should not be too hard, would be to allow pushing from a shallow clone
to a full clone. This way the network of full clones (functioning as
centres to publish your work) and shallow + few branches repos (working
repositories).

I don't know if that would be enough.
For better support git would need to exchange graft-like information,
and use the union of restrictions to get correct commits.

You can configure separate 'remote's for the same repository
with different heads. This would work both for pull and for push.

I think the solution proposed by Marco Costalba, namely of creating an
"archive" repository and a "live" repository, joining them if needed
by grafts, similarly to how the linux kernel has a live repo and a historical
import repo, would be a good alternative to shallow or lazy clone.

There would be an "archive" repo (or repos), read only, with whole history,
very tightly packed with kept packs, with all branches and all tags,
and a "live" repo, with only current history (a year, or since a major
API change, or from today, or something like that), with only important
branches (or repos, each containing the set of branches important for a team).
There would be a prepared graft file to join the two histories, if you have

Sidenote: due to (from what I have read) heavy use of topic branches
in OOo development, Subversion would have to be used with svnmerge
extension, or together with SVK, to make work with it not compl...
Try setting the following config variables as follows:
git config pack.deltaCacheLimit 1
git config pack.deltaCacheSize 1
git config pack.windowMemory 1g

That should help keep memory usage somewhat bounded.
Nicolas
-
Hi,
I tried that:
$ git config pack.deltaCacheLimit 1
$ git config pack.deltaCacheSize 1
$ git config pack.windowMemory 2g
$ #/usr/bin/time git repack -a -d -f --window=250 --depth=250
$ du -s objects/
2548137 objects/
$ /usr/bin/time git repack -a -d -f --window=250 --depth=250
Counting objects: 2477715, done.
fatal: Out of memory, malloc failed411764)
Command exited with non-zero status 1
9356.95user 53.33system 2:38:58elapsed 98%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (31929major+18088744minor)pagefaults 0swaps

Note that this is on a 2.4GHz quad-core CPU with 3.5GB RAM.
I'm retrying with smaller values, but at over 2.5 hours per try, this is
getting tedious.

Ciao,
Dscho
-
This has nothing to do with repacking memory usage, but even tighter
packs can be obtained with:

git config repack.usedeltabaseoffset true

This is not the default yet.
Nicolas
-
I have successfully repacked this repo a few times on a 2.1GHz system with 16G.
The smallest attained pack was about 1.45G (1556569742B).
This run took about 7 hours 26 min.
I ran: git repack -a -d -f --window=250 --depth=250
Here are the relevant config entries:
[pack]
threads = 1
compression = 9
[repack]
usedeltabaseoffset = true

Other runs:
* Same as above, but with default compression:
pack size: 1560624388
time: 7 hours 11 min

Not much difference in time or size.
* Multi threaded (250m window)
[pack]
threads = 4
windowmemory = 250m
compression = 9
[repack]
usedeltabaseoffset = true

pack size: 1767405703
time: 3 hours

First >99% took 50min. Last 10000 objects took 2 hours.
* Multi threaded (500m window)
[pack]
threads = 4
windowmemory = 500m
compression = 9
[repack]
usedeltabaseoffset = true

pack size: 1640820903
time: forgot to time, but between 3-4 hours based on file time

I just received Dscho's email, this is interesting to compare
with his single threaded result of 1638490531. I wonder if he
used deltabaseoffset? I think his machine is a little faster
than this one. So using 4 threads finished twice as fast and
produced a similar pack size. Actually, the difference could
just be the compression setting.

* Deeper (git repack -a -d -f --window=250 --depth=500)
[pack]
threads = 1
compression = 9
[repack]
usedeltabaseoffset = true

pack size: 1578263745
time: 7 hours 58 min

Larger pack compared to --depth=250.
-brandon
-
Right. That's because the algorithm to distribute the load between
threads ends up stealing work from other threads whenever a thread is
done with its own share. So the easy objects are quickly done with by a
few threads until they all converge onto the hard ones. In the non
threaded case, the slowdown occurs around 12%.

It looks like those hard objects are huge binary blobs. If they could
be removed from the repository entirely and regenerated as needed
instead of being carried around then I expect the repository size would
fall below the 500MB mark.

Nicolas
-
Hi,
Nope. Wanted it to be as compatible as possible.
Ciao,
Dscho
-
Hi,
Now, _that_ is strange. Using 150 instead of 250 brings it down even
quicker!

$ /usr/bin/time git repack -a -d -f --window=150 --depth=150
Counting objects: 2477715, done.
Compressing objects: 19% (481551/2411764)
Compressing objects: 19% (482333/2411764)
fatal: Out of memory, malloc failed411764)
Command exited with non-zero status 1
7118.37user 54.15system 2:01:44elapsed 98%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (29834major+17122977minor)pagefaults 0swaps

(I hit the Return key twice during the time I suspected it would go out of
memory, so it might have been really at 20%.)

Ideas?
Ciao,
Dscho
-
Hi,
I made the window much smaller (512 megabytes), and it still runs, after 27
hours:

Compressing objects: 20% (484132/2411764)

However, it seems that it only worked on about 4000 objects in the last
20(!) hours. So, the first 19% were relatively quick. The next percent
not at all.

Will keep you posted,
Dscho
-
Hi,
Finally!
I updated to newest git+patches (git version 1.5.4.1.1353.g0d5dd), reset
windowMemory to 512m and restarted the process:

$ /usr/bin/time git repack -a -d -f --window=250 --depth=250
Counting objects: 2477715, done.
Compressing objects: 100% (2411764/2411764), done.
Writing objects: 100% (2477715/2477715), done.
Total 2477715 (delta 1876242), reused 0 (delta 0)
21733.55user 175.32system 6:10:37elapsed 98%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (81921major+63880453minor)pagefaults 0swaps

A little over 6 hours, with one core (of the four available). Not bad, I
say.

The result is:
$ ls -la objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack
-rwxrwxrwx 1 root root 1638490531 2008-02-14 17:51
objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack

1.6G looks much better than 2.4G, wouldn't you say? Jan, if you want it,
please give me a place to upload it to.

Ciao,
Dscho
-
Hi Johannes,
Thank you! In the meantime, I happened to produce something similar.
Unfortunately even mine was too late for another round of tests to present it
in our git vs. svn comparison (with today's deadline) - so we just mentioned
in the report that the tested repository still had reserves [but the numbers
celkem 1636608
-r--r--r-- 1 kendy users 59264432 2008-02-10 15:22
pack-909b501d3d673f10a66adfefdf8371933e7a6f3e.idx
-r--r--r-- 1 kendy users 1614968445 2008-02-10 15:22
celkem 1644160
-r--r--r-- 1 kendy users 59264432 2008-02-11 16:09
pack-909b501d3d673f10a66adfefdf8371933e7a6f3e.idx
-rw-r--r-- 1 kendy users 0 2008-02-11 16:29
pack-909b501d3d673f10a66adfefdf8371933e7a6f3e.keep
-r--r--r-- 1 kendy users 1622697708 2008-02-11 16:09
pack-909b501d3d673f10a66adfefdf8371933e7a6f3e.pack

The 'minimal3' case was with '--window=250 --depth=250', 'minimal4' was
with '--window=250 --depth=50'

I tried the --depth=50 because I read 'making it too deep affects the
performance on the unpacker side' in the man page. How big could the
difference be in practice, please?

Regards,
Jan
-
Do you perchance know why OOo needs such a large pack? Perhaps you could
try running contrib/stats/packinfo.pl on this pack to examine it and
find out what takes the most space.

What is the size of the checkout, by the way?
Hmmm... I wonder if packv4 would help...
--
Jakub Narebski
Poland
ShadeHawk on #git
-
Earlier in this thread Sean did some analysis and found lots of large
objects, and he mentioned that he sent a listing to Jan for inspection.

2.4G
-brandon
-
Hi,
$ ~/git/contrib/stats/packinfo.pl < \
objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack 2>&1 | \
tee packinfo.txt
Illegal division by zero at /home/imaging/git/contrib/stats/packinfo.pl

I work on a bare repository, but:
$ git archive origin/master | wc -c
2010060800

Or more precisely:
$ echo $(($(git ls-tree -l -r origin/master | sed -n 's/^[^ ]* [^ ]* [^ ]*
*\([0-9]*\).*$/\1/p' | tr '\012' +)0))
1947839459

So yes, we still have the crown of the _whole_ repository being _smaller_
than a single checkout.

I could imagine that it does, what with it being so much better with
strings. But it would come at a price of performance, I guess, as the
string table should be well over 64k.

Ciao,
Dscho
-
Errr... sorry, I should have been more explicit. What I meant here
is the result of

$ git verify-pack -v <packfile> | \

That's a huuuuge tree. Compared to that, a 1.6G (or 1.4G) packfile doesn't
look large.

I wonder if proper subdivision into submodules (which should encourage
better code by the way, see TAOUP), and perhaps partial checkouts
wouldn't be a better solution than lazy clone. But it is nice to have
a long-discussed feature, even if only at RFC stage, with some code.

--
Jakub Narebski
Poland
-
Hi Jakub,
Yes, I'd love to see the OOo tree split into several parts, I've already
proposed a division (http://www.nabble.com/OOo-source-split-td13096065.html),
but it'll take some more time I'm afraid :-(

Regards,
Jan
-
Hi,
Heh. I was too lazy to look up the usage, so I just did what I thought
would make sense...

So here it goes:
$ git verify-pack -v
objects/pack/pack-e4dc6da0a10888ec4345490575efc587b7523b45.pack |
~/git/contrib/stats/packinfo.pl | tee packinfo.txt
all sizes: count 601473 total 2855826280 min 0 max 62173032 mean
4748.05 median 232 std_dev 221254.37
all path sizes: count 601473 total 2855826280 min 0 max 62173032 mean
4748.05 median 232 std_dev 221254.37
tree sizes: count 601473 total 2855826280 min 0 max 62173032 mean
4748.05 median 232 std_dev 221254.37
tree path sizes: count 601473 total 2855826280 min 0 max 62173032 mean
4748.05 median 232 std_dev 221254.37
depths: count 2477715 total 70336238 min 0 max 250 mean 28.39
median 4 std_dev 55.49

Something in my gut tells me that those four repetitive lines are not
I think partial checkouts are wrong. If you can work on partial
checkouts, chances are that what you work on should be a submodule.

Having said that, I can understand if some people do not want to have the
hassle of test^H^H^H^Husing submodules...

Ciao,
Dscho
-
IMHO there is a place for submodules, there is a place for partial
checkouts, and perhaps there is even a place for the combination of the two.

For example, Documentation/ isn't a good candidate for a submodule,
because as you add a new feature you want to add to the documentation, and if
you change some feature you want to change the documentation: there are
whole-tree commits which contain changes outside Documentation/.
Nevertheless there are some people (technical writers) who are
interested only in Documentation; perhaps only in a few files there.
They would want to have a partial checkout, I guess.

On the other hand cgit and msysgit use submodules, and I think it is
a good solution. I wonder if the Sourcemage Linux distro uses submodules...
In the case of cgit I think having git.git or its clone/fork as a
submodule is a good idea, but perhaps even better would be to check out
only part of it: libgit or libgitthin

--
Jakub Narebski
Poland
-
Do you by chance have repack.usedeltabaseoffset turned on? That has the
unfortunate side effect of changing the output of verify-pack -v to be
almost useless for my packinfo script (specifically, it no longer
reports the parent SHA1 hash for deltas, and the script is basically all
about delta tree statistics.) I suppose that should probably be fixed,
but I never looked into it.

-bcd
-
Hi,
Ouch. That must have been a leftover from earlier attempts. I did not
_mean_ to keep it, but now that I have a pretty packed repository, I think
I'll just keep it as-is.

Ciao,
Dscho
-
I should really come around to fixing packed_object_info_detail() for
the OBJ_OFS_DELTA case one day.

Nicolas
-
Please don't.
Obtaining the SHA-1 of your delta base would require unpacking your
delta base and then doing a SHA-1 hash of it. Or alternatively
doing a search through the .idx for the object that starts at the
requested OFS. Either way, it's really expensive for a minor detail
of output in verify-pack. Something that any script can produce
with a simple reverse lookup table.

It's also run after we just spent a hell of a lot of time and disk
IO trying to verify the packfile. We slammed through the pack
once to do its overall SHA-1, and then god knows how many times as
we iterate the objects in pack order, not delta base order, thus
causing the delta base cache to become overwhelmed and constantly
fault out entries. Pack verification is stupid and slow. This
would make -v even worse.

But if you are going to do that, you may also want to fix the
"*store_size = 0 /* notyet */" that's like 5 lines above. :)

BTW, why does this return const char* from typename(type) instead
of just returning the enum object_type and letting the caller do
typename() if they want it? Most of our other code that returns
types returns the enum, not the string. :-\

--
Shawn.
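
The "simple reverse lookup table" mentioned above could be sketched as follows: sort the (offset, SHA-1) pairs by offset once, then binary-search for the base offset of each delta. This is only an illustration; struct idx_entry and both helper functions are hypothetical names, the entries are assumed to be already loaded in memory, and nothing here reads the real .idx format.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical in-memory view of one index entry. */
struct idx_entry {
	unsigned long offset;	/* where the object starts in the pack */
	const char *sha1_hex;	/* its object name */
};

/* Order entries by pack offset. */
static int cmp_offset(const void *a, const void *b)
{
	const struct idx_entry *x = a, *y = b;
	return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Build the reverse table: one O(n log n) sort up front. */
static void sort_by_offset(struct idx_entry *entries, size_t n)
{
	assert(entries || !n);
	qsort(entries, n, sizeof(*entries), cmp_offset);
}

/*
 * Map a delta's base offset back to an object name in O(log n).
 * Returns NULL if no object starts exactly at that offset.
 */
static const char *sha1_at_offset(const struct idx_entry *sorted, size_t n,
				  unsigned long offset)
{
	struct idx_entry key = { offset, NULL };
	const struct idx_entry *hit;

	hit = bsearch(&key, sorted, n, sizeof(*sorted), cmp_offset);
	return hit ? hit->sha1_hex : NULL;
}
```

The sort is paid once per pack, so the per-delta cost stays small compared to unpacking and hashing the base object.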
-
I intended to use the pack index of course. And the code already exists
Not _that_ expensive actually. Like I say, in pack-objects we do it all
Yeah, that's easy too.
Nicolas
-
It just was not converted from the old string interface. I
thought you are old enough to remember ;-)
-
That being said, the most useful output for figuring out where all the
space in the pack is going in my experience is gotten from:

git-verify-pack -v | packinfo.pl -tree -filenames

That will produce a huge amount of output, which is basically the tree
structure of the delta chains in the file. If things aren't being
deltified together properly, it's usually pretty obvious.

A delta chain in this output looks approximately like this:
# 0 blob 03156f21... 1767 1767 Documentation/git-lost-found.txt @ tags/v1.2.0~142
# 1 blob f52a9d7f... 10 1777 Documentation/git-lost-found.txt @ tags/v1.5.0-rc1~74
# 2 blob a8cc5739... 51 1828 Documentation/git-lost+found.txt @ tags/v0.99.9h^0
# 3 blob 660e90b1... 15 1843 Documentation/git-lost+found.txt @ master~3222^2~2
# 4 blob 0cb8e3bb... 33 1876 Documentation/git-lost+found.txt @ master~3222^2~3
# 2 blob e48607f0... 311 2088 Documentation/git-lost-found.txt @ tags/v1.5.2-rc3~4
# size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85
# path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26

# The first number after the sha1 is the object size, the second
# number is the path size. The statistics are across all objects in
# the previous delta tree. Obviously they are omitted for trees of
# one object.

# A path size is the sum of the size of the delta chain, including the
# base object. In other words, it's how many bytes need be read to
# reassemble the file from deltas.

This is also quite slow, as it runs git-ls-tree -t -r on every commit in
the repository to assign file names to blobs. You can leave out the
-filenames option to not do this (if you don't care about seeing
filenames, that is).

-bcd
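
To make the "path size" definition concrete, here is a small sketch that recomputes the second number of each line from the first, using the depth column of the listing above. It is not taken from packinfo.pl; struct chain_entry, compute_path_sizes() and MAX_DEPTH are names invented for this example, and the entries are assumed to come in the same parent-before-child order as in the listing.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 4096	/* hypothetical cap on the delta chain depth */

/* One line of the tree listing: nesting depth and object size. */
struct chain_entry {
	int depth;
	unsigned long size;
};

/*
 * Fill path_size[i] for every entry: its own size plus the path size
 * of its delta base, i.e. the closest preceding entry one level up.
 * Returns 0 on success, -1 if the listing is malformed.
 */
static int compute_path_sizes(const struct chain_entry *e, size_t n,
			      unsigned long *path_size)
{
	unsigned long at_depth[MAX_DEPTH];	/* path size of the last entry seen per depth */
	int prev_depth = -1;
	size_t i;

	assert(e && path_size);
	for (i = 0; i < n; i++) {
		int d = e[i].depth;

		/* must start at depth 0 and never skip a level */
		if (d < 0 || d >= MAX_DEPTH || d > prev_depth + 1)
			return -1;
		at_depth[d] = e[i].size + (d ? at_depth[d - 1] : 0);
		path_size[i] = at_depth[d];
		prev_depth = d;
	}
	return 0;
}
```

Fed with the six (depth, size) pairs of the git-lost-found example, this reproduces the path sizes 1767, 1777, 1828, 1843, 1876 and 2088 shown in the listing.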
-
No. Well, it would help a bit, maybe in the 10-20% range, but nothing
as significant as going from 2.6G to 1.5G, or like in the GCC case, from
1.3G to 230M.

Nicolas
-
I found that out with gcc. 95% went down in no time and the last 5%
took two hours. The 5% that got stuck were chains with 2000+ entries.

The neat thing about the multithread code is that it will keep
splitting the work load. That lets all of the easy deltas finish and
not get stuck behind the problem objects.

With quad core on gcc one core would get stuck on the problem objects.
The other three would finish their list and start splitting the
problem list. This effectively sorts the problems to the end of the
work load. By printing the object hash out as they are completed you
can easily identify the problem objects. If I recall right on gcc the
problem was a configure file that had 2000 entries in its delta chain.
That one delta chain took over an hour to process.

Could there be an N squared type problem when 2000 entry delta chains
are encountered? Maybe something that just isn't noticeable when
depth/window=50. Has testing been done with really long object chains
to make sure that only the minimal amount of work is being done? It
seems like something is breaking down when the chain length exceeds
the window size.

--
Jon Smirl
jonsmirl@gmail.com
-
I'd suggest making the memory window smaller yet.
512MB is a *big* amount of memory, if you fill it up, and end up using an
O(n**2) algorithm on the objects within the window (which it is: the
repacking algorithm is O(n) in _total_ objects, but the constant part is
basically O(winsize^2)).

I'd suggest that a reasonable window memory limit is around just a few
megabytes (eg 4MB to maybe 64MB). If you have "normal" source files,
you're still going to be limited by the window _count_ size (assume normal
source files are in the few tens of kB), and for those occasional large
files, you'd better hope that the sort heuristics are good enough.

Linus
-
How many diffs should it take to compress a 2000 delta chain with
window/depth=250?

--
Jon Smirl
jonsmirl@gmail.com
-
There's no fixed answer. We do various culling heuristics to avoid actually
generating a diff at all if it looks unlikely to succeed etc. But in
general, the way the window works is that
 (a) we only need to generate the _unpacked_ object once
 (b) we compare each object to the "window-1" preceding objects, which is
     how I got the O(windowsize^2)
 (c) but then that "compare" relatively seldom involves actually
     generating a whole diff!

So the answer is: in _theory_ each object may be compared to
(windowsize-1) other objects, but in practice it's much less than that.

Linus
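
As a rough illustration of the theoretical upper bound from (b), the sketch below just counts how many compare attempts there would be if every object really were compared to all "window-1" objects before it. It deliberately ignores the culling heuristics from (c), so it says nothing about how many diffs git actually generates; max_comparisons() is a made-up helper for this example.

```c
#include <assert.h>

/*
 * Upper bound on compare attempts: the i-th object (counting from 0)
 * has only i predecessors, and at most window - 1 of them are still
 * inside the window.
 */
static unsigned long long max_comparisons(unsigned long nr_objects,
					  unsigned long window)
{
	unsigned long long total = 0;
	unsigned long i;

	assert(window >= 1);
	for (i = 0; i < nr_objects; i++)
		total += (i < window - 1) ? i : window - 1;
	return total;
}
```

For 2000 objects and a window of 250 the bound comes to 466875 attempts, so it grows linearly with the number of objects once the window is full.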
-
That's not really true, of course. But my (broken and inexact) logic is
that we get one cost multiplier from the number of objects, and one from
the size of the objects.

So *if* we have the situation of not limiting the window size, we
basically have a big slowdown from raising the window in number of
objects: not only do we get a slowdown from comparing more objects, we
spend relatively more time comparing the *large* ones to begin with and
having more of them just makes it even more skewed - when we hit a series
of big blocks, the window will also contain more big blocks, so it's kind of
a double whammy.

But I don't think calling it O(windowsize^2) is really correct. It's still
O(windowsize), it's just that the purely "number-of-object" thing doesn't
account for big objects being much more expensive to diff. So you really
want to make the *memory* limiter the big one, because that's the one that
actually approximates how much time you end up spending.

So ignore that O(n^2) blather. It's not correct. What _is_ correct is that
we want to aggressively limit memory size, because CPU cost goes up
linearly not just with number of objects, but also super-linearly with
size of the object ("super-linear" due to bad cache behavior and in worst
case due to paging).

Linus
-
In the gcc case I wasn't running out of memory. I believe I was CPU bound
for an hour processing a single object chain with 2000 entries. That
sure doesn't feel like O(windowsize).

Maybe someone playing with the OO repo can stick in an appropriate
printf and see how many diffs are really being done just to make sure

--
Jon Smirl
jonsmirl@gmail.com
-
Well, there's another - and totally unrelated - issue with *pre-existing*
delta chains that are very deep.

Namely the fact that since such a deep delta chain will exhaust the
delta-cache, you will now have a O(n*chaindepth) behaviour when you unpack
the objects (in order to generate the deltas) in the first place!

So that really has nothing to do with the new window (or delta) depth at
all, just with the _previous_ window depth.

See sha1_file.c: MAX_DELTA_CACHE.
If you have a 2000-deep delta chain, then the delta-cache should be big
enough that you hit in it regularly without flushing it when you traverse
down the chain. So MAX_DELTA_CACHE should generally be at _least_ as much
as the max delta chain length, which is obviously normally the case
(default max delta chain length: 10).

We could probably fairly easily make that MAX_DELTA_CACHE be a config
option, but right now you have to recompile to test that theory of mine.

Or just limit your delta depth to something much smaller (ie ~100 or so)
Linus
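
To put numbers on the O(n*chaindepth) remark, here is a toy model of the two extremes for a single chain: either the base of every delta is always found in the cache, or it never is and the whole chain below it has to be replayed. It does not model how the delta cache in sha1_file.c really behaves; unpack_cost() and its flag are invented for this illustration.

```c
#include <assert.h>

/*
 * Delta applications needed to unpack every object of one chain of
 * chain_len objects (one base plus chain_len - 1 deltas).
 * base_always_cached != 0: each delta is applied once on top of its
 * already unpacked base.  Otherwise an object at depth d has to
 * replay all d deltas below it.
 */
static unsigned long long unpack_cost(unsigned long chain_len,
				      int base_always_cached)
{
	unsigned long long total = 0;
	unsigned long depth;

	assert(chain_len >= 1);	/* a chain has at least its base object */
	for (depth = 1; depth < chain_len; depth++)
		total += base_always_cached ? 1 : depth;
	return total;
}
```

In this model a 2000-deep chain costs 1999 delta applications when the cache always hits and 1999000 when it never does, which is the kind of gap a too-small cache would produce.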
-
Yeah... this repo is really a pain to repack. I have access to a
8-processor machine with 8GB of ram and all my repack attempts so far
were killed after using too much memory, despite the window memory
limit. Those were threaded repack attempts, so the first 98% was really
quick, like less than 15 minutes, but then all threads converged on this
small fraction of the object space which appears to cause problems.
And then I'm presuming I ran into the same threaded memory fragmentation
issue. Might be worth attaching gdb to it and extract a sample of the
object SHA1's populating the delta window when the slowdown occurs to
see what they actually are...

I'm attempting a single-threaded repack now.
Nicolas
-
You're probably hitting the same memory allocator fragmentation issue I
had with the gcc repo. On my machine with 1GB of ram, I was able to
repack the 1.5GB source pack just fine, but repacking the 300MB source
pack was impossible due to memory exhaustion.

My theory is that the smaller pack has many more deltas with deeper
delta chains, and this is stomping much harder on the memory allocator
which fails to prevent fragmentation at some point. When Jon Smirl
tested Git using the Google memory allocator there was around 1GB less
allocated, which might indicate that the glibc allocator has issues with
some of Git's workloads.

Nicolas
-
I'm forgetting everything again, but I seem to recall that the Google
allocator only made a significant difference with multithreading. It
is much better at keeping the threads from fragmenting each other.
It's very easy to try it, all you have to do is add another lib the

--
Jon Smirl
jonsmirl@gmail.com
-
Turning on multi-core support greatly increases the memory
consumption; at least double the single thread case.

Going over the original repository and deleting (get all copies out of
the history) those giant i18n files generated by programs that Sean
refers to would be my first step. If you have 5,000 revisions of a
10MB file I suspect it would take a huge amount of memory to pack.

--
Jon Smirl
jonsmirl@gmail.com
-
Hi,
That's why I did not do it.
Ciao,
Dscho
-
On Sat, 09 Feb 2008 22:10:06 -0500 (EST)
Hi Nicolas,
Tried that earlier today and got a 1.6G pack (on a 2G machine). There are
some big objects in that repo.. over 100 are 30 to 62M in size, 400 more
over 10M, and ~40,000 over 100K. Would you expect a larger memory window
(on a better machine) to help shrink the repo down any more?

Sean
-
Well, I don't think so. Anyway, with the above pack.windowMemory
setting, the window probably gets shrunk if those big objects are all
to be found in the same window. So that would be the setting to
increase if you have lots of ram.

Finding out what those huge objects are, and if they actually need to be
there, would be a good thing to do to reduce any repository size.

Nicolas
-
On Sun, 10 Feb 2008 00:22:09 -0500 (EST)
Sounds like it would be worthwhile then for Jan to try on that 8G machine
Okay, I've sent the sha1's of the top 500 to Jan for inspection. It appears
that many of the largest objects are automatically generated i18n files that
could be regenerated from source files when needed rather than being checked
in themselves; but that's for the OO folks to decide.

Thanks,
Sean
-
Good practice is to not add generated files to version control.
But sometimes such files are stored if regenerating them is costly
(./configure file in some cases, 'man' and 'html' branches in git.git).

IIRC Dana How also tried to deal with a repository with large binary
files, although in that case those had shallow history. IIRC
the proposed solution was to pack all such large objects undeltified
into a separate "large-objects" kept pack.

You can mark large files with the (undocumented except for RelNotes)
'delta' gitattribute, but I don't know if it would help in your
case.

--
Jakub Narebski
Poland
-
On Mon, 11 Feb 2008, Jakub Narebski wrote:
> On Sun, 10 Feb 2008, Sean wrote
Sorry, my mistake.
Although in Dana's case, separating large blobs into non-packed loose
objects (her patches), or a separate kept non-delta large-blobs-only
pack (proposed solution), were shared over a networked filesystem.
So the amortized size of the repository was smaller... ;-ppp

--
Jakub Narebski
Poland
ShadeHawk on #git
-
A lot of memory is 2-4GB. Without this much memory you will trigger
swapping and the pack process will finish in about a month. Note that
only one machine needs to have this kind of memory. It can be used to
make the optimized pack of the project history and mark it with .keep
files. It doesn't take a lot of memory to use the optimized packs,
only to make them.

There are some patches for making repack work multi-core. Not sure if
they made it into the main git tree yet. These patches scale almost
linearly. An eight-hour repack will take 2.5 hours on a quad-core
machine.

There is a very good chance your 1.5GB repo will turn into 300MB if it
is extremely packed. This is something you only need to do once, but
you'll probably end up doing it a dozen times trying to get it just
right.

--
Jon Smirl
jonsmirl@gmail.com
-
Well, my modest little Celeron M laptop w/ 1GB of ram did the full
repack overnight on the gcc repo, so a month is a bit of an
exaggeration.

Cheers,
Harvey
-
Try it again with window=250 and depth=250. That's how you get the
--
Jon Smirl
jonsmirl@gmail.com
-
Yes, I know, and I did if you remember back to the gcc discussion.
Harvey
-
Now that you mention it I seem to recall some changes were made to git
during that discussion that reduced the memory footprint and made the
optimized gcc repack fit into 1GB. I've forgotten the exact timings
and git is a moving target. When I was working on Mozilla it needed
2.4GB to avoid swapping but that was with a much older git.

The rule is: if it starts swapping it is going to take way longer than
you are probably willing to wait. Buying more RAM is a cheap and easy
fix.

If people are having trouble with large repositories please let the
git community know and your issues will probably get quickly fixed.
We can't fix something we don't know about.

--
Jon Smirl
jonsmirl@gmail.com
-
Yes, they are. You need to compile with "make THREADED_DELTA_SEARCH=yes"
or add THREADED_DELTA_SEARCH=yes to config.mak for it to be enabled
though. Then you have to set the pack.threads configuration variable
appropriately to use it.

Nicolas
-
I sent a patch to get it to auto-detect multi-core machines, but I see
now that it was commented upon for finalization (by Nicolas, actually)
and I must have missed that, thinking it had been applied because I got
an accidental merge in my own tree.

As such, I've been using that patch for the last several months without
problems. I'll rework it as per Nicolas' suggestions and resend.

--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
-
Allow pack.threads config option and --threads command line option to
accept '0' as an argument and set the number of created threads equal
to the number of online processors in this case.

Signed-off-by: Brandon Casey <casey@nrlssc.navy.mil>
---

I was preparing this patch when I saw your email. I looked up the
old email you were talking about. Your function is better since
it is cross-platform.

When you redo your patch, you may want to adopt one aspect of this
one. I used a setting of zero to imply "set number of threads to
number of cpus". This allows the user to specifically set pack.threads
in the config file to zero with the above mentioned meaning, or to
override a setting in the config file from the command line with
--threads=0. This is rather than having to delete the option from the
config file.

-brandon
builtin-pack-objects.c | 22 ++++++++++++++++++----
1 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 692a761..5c55c11 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1852,11 +1852,11 @@ static int git_pack_config(const char *k, const char *v)
 	}
 	if (!strcmp(k, "pack.threads")) {
 		delta_search_threads = git_config_int(k, v);
-		if (delta_search_threads < 1)
+		if (delta_search_threads < 0)
 			die("invalid number of threads specified (%d)",
 			    delta_search_threads);
 #ifndef THREADED_DELTA_SEARCH
-		if (delta_search_threads > 1)
+		if (delta_search_threads != 1)
 			warning("no threads support, ignoring %s", k);
 #endif
 		return 0;
@@ -2121,10 +2121,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		if (!prefixcmp(arg, "--threads=")) {
 			char *end;
 			delta_search_threads = strtoul(arg+10, &end, 0);
-			if (!arg[10] || *end || delta_search_threads < 1)
+			if (!arg[10] || *end || delta_search_threads < 0)
 				usage(pack_usage);
 #ifndef THREADED_DELTA_SEARCH
- if (delta_search_threa...
That makes sense. Perhaps even go so far as to allow 'auto' as a
But this is not so good. For one thing you've dropped Windows support
entirely. The last comment on my own patch was that get_num_active_cpus()
should live in a file of its own. You've taken one step back from that
and not even kept it in its own function.

I think perhaps it's time to introduce thread-compat.[ch] to deal with
thread-related cross-platform things like this.

I'll recook my patch and send it in a few minutes, using your suggestions
and Nicolas' combined.

--
Andreas Ericsson andreas.ericsson@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
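
The "zero means use every online CPU" convention discussed in this exchange
could be sketched roughly as follows. This is only an illustration, not the
code from either patch: the helper name resolve_thread_count() is
hypothetical, and it assumes a Unix-like system where
sysconf(_SC_NPROCESSORS_ONLN) is available. A Windows build would need a
different query, which is the kind of detail a thread-compat file could hide.

```c
#include <unistd.h>

/*
 * Hypothetical helper, not taken from either patch.
 *   requested > 0  : use exactly that many threads
 *   requested == 0 : use the number of online processors
 *   requested < 0  : invalid, reported as -1
 */
static int resolve_thread_count(int requested)
{
	long ncpus;

	if (requested < 0)
		return -1;
	if (requested > 0)
		return requested;

	/* Assumption: _SC_NPROCESSORS_ONLN exists here (Linux, BSD, macOS). */
	ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	/* Fall back to a single thread if the query fails. */
	return ncpus > 0 ? (int)ncpus : 1;
}
```

Keeping the lookup in one small function means the config parser and the
command line parser can both call it, so the platform-specific part stays in
a single place.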
-
<snip>
There are 2 things, here:
- Probably, you can make your pack smaller with proper window sizing.
Try taking a look at the "Git and GCC" thread that crossed borders between
the gcc and the git mailing lists.
- There are tricks to do roughly what you want without modifying git.
For example, you can prepare several "shared" clones of your repo (git
clone -s) and leave only a few branches in each. Cloning from these will
only pull the needed data.

Mike
-
Hi Mike,
Good to know about this, thank you! The problem currently is that we are
trying to produce SVN and git trees containing the same data, the same number
of branches, etc. for the sake of comparison. If git wins and is
chosen for OOo, we'll hopefully be able to do more tuning - and I'm sure I'll
ask here for help ;-)

Regards,
Jan
-
Hi,
The problem is, of course, that the shared clones are not updated
automatically whenever the big repository is updated.

Ciao,
Dscho
-
Hi,
You might want to make the full_info static, and only send the options the
This chunk could use ALLOC_GROW() quite nicely (would make it more
readable, and avoid errors). Also, I'd use alloc_nr() instead of the
You can initialise it to 0 right away...
Unfortunately, I have to go now... so I will review the rest
(from builtin-fetch.c on) later.

It's great seeing that you work on this!
Thanks,
Dscho
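
For readers unfamiliar with the grow-on-demand idiom that the ALLOC_GROW()
and alloc_nr() remark refers to, here is a standalone sketch of the pattern.
It is not git's macro: GROW_NR() is modelled on alloc_nr() as I understand
it, the int_list names are hypothetical, and plain realloc() with an error
return stands in for git's own allocation wrapper.

```c
#include <stdlib.h>

/* Growth policy modelled on git's alloc_nr(): about 1.5x plus some slack. */
#define GROW_NR(x) (((x) + 16) * 3 / 2)

/* Hypothetical growable array of ints, for illustration only. */
struct int_list {
	int *items;
	size_t nr;    /* entries in use */
	size_t alloc; /* entries allocated */
};

/* Make room for at least `want` entries; 0 on success, -1 on failure. */
static int int_list_reserve(struct int_list *l, size_t want)
{
	size_t new_alloc;
	int *p;

	if (want <= l->alloc)
		return 0;

	/* Grow geometrically, but never to less than what was asked for. */
	new_alloc = GROW_NR(l->alloc);
	if (new_alloc < want)
		new_alloc = want;

	p = realloc(l->items, new_alloc * sizeof(*p));
	if (!p)
		return -1;
	l->items = p;
	l->alloc = new_alloc;
	return 0;
}

/* Append one value, growing the array first if it is full. */
static int int_list_push(struct int_list *l, int value)
{
	if (int_list_reserve(l, l->nr + 1) < 0)
		return -1;
	l->items[l->nr++] = value;
	return 0;
}
```

Centralising the size arithmetic like this is what makes call sites shorter
and less error-prone than hand-written realloc bookkeeping.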
-
For comparison, how big was the svn repo you're testing? In my experience
the git repo ends up about 15-20 times smaller than SVN once a tuned repack
has been done.

Cheers,
Harvey
-
Hi Harvey,
Another guy created the SVN repo; IIRC he said it was 55G.
Regards,
Jan
-
How did you repack your repository?
We know that current defaults are not suitable for large projects. For
example, the gcc git repository shrank from a 1.5GB pack down to 230MB
after some tuning.

Nicolas
-
Hi Nicolas,
After the suggestions in this thread I tried to experiment with the --window
and --depth options of git-repack, and indeed, there is still room for
improvement.

So far I'm at 2G (saved 500M); unfortunately the aggressive values like
--window=250 --depth=250 that someone mentioned here cause out-of-memory on a
machine with 8G :-( If there's anybody brave enough here to try as well, I'd
be grateful. Maybe it would also be interesting to locate _exactly_ what
causes the OOM, and e.g. exclude the object from the pack if possible.

The tree is available here:
git clone git://o3-build.services.openoffice.org/git/ooo.git
git clone http://o3-build.services.openoffice.org/~svn/ooo.git (the same over
http://)

Thank you in advance!
Regards,
Jan
-
Sorry to enter so late in this thread. I would just like to ask if you
have evaluated a different approach for casual developers.

The approach is the one used by the Linux tree.
The Linux git repository is not very big and can be downloaded with ease.
On the other hand, Linux history spans many more years than the repo
does.

The design choice here is to have *two repositories*, one with recent
stuff and one historical, with stuff older than version 2.6.12.

We have to say that this choice came about by accident, due to Linus
switching from BitKeeper to git around 2.6.12, but today it's more or
less a conscious choice: the historical git repo, converted from bk,
exists and is still kept separate, even though technically it could be
grafted onto the main one to create a super big Linux repo.

Advantages of this approach are:
- Lean and fast everyday repos, where actual development occurs
- Easy clone also for casual users
- Possibility to have the whole history anyway when needed

A variation on this theme could be to always have two repos, one with
recent stuff, say the last 5 years of development, and one with *the
whole* history, not only the old stuff as in the historical Linux
tree. In this case it's easier for people who need to dig through very old
changes, since they avoid browsing two repos as happens now with
Linux.

Marco
-
Hi,
I do not think that this is an option: Jan already tried a shallow clone
(which would amount to something like what you propose), and it was still
too large.

Ciao,
Dscho
-
I think that was still pulling all the branches, so a shallow clone of
just a couple of branches might be feasible.

Dave.
-
Hi,
Indeed:
$ git ls-remote git://o3-build.services.openoffice.org/git/ooo.git|wc -l
3970
$ git ls-remote --heads git://o3-build.services.openoffice.org/git/ooo.git | wc -l
751

Fetching just master is a little hard on the server (it spends quite a
lot of time deltifying -- minutes! -- especially between 80% and 95%,
and indexing is even slower), but other than that:

$ /usr/bin/time git fetch --depth=1 \
git://o3-build.services.openoffice.org/git/ooo.git \
master:refs/remotes/origin/master
warning: no common commits
remote: Generating pack...
remote: Done counting 79934 objects.
remote: Deltifying 79934 objects...
remote: 100% (79934/79934) done
Indexing 79934 objects...
remote: Total 79934 (delta 34549), reused 51323 (delta 20737)
100% (79934/79934) done
Resolving 34549 deltas...
100% (34549/34549) done
* refs/remotes/origin/master: storing branch 'master' of
git://o3-build.services.openoffice.org/git/ooo
commit: 29990e4
46.48user 4.60system 16:48.29elapsed 5%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+941205minor)pagefaults 0swaps

$ du .git/objects/pack/
464688 .git/objects/pack/
$ /usr/bin/time git repack -a -d -f --window=250 --depth=250
Generating pack...
Done counting 79934 objects.
Deltifying 79934 objects...
100% (79934/79934) done
Writing 79934 objects...
100% (79934/79934) done
Total 79934 (delta 40013), reused 0 (delta 0)
Pack pack-350e4edca93ee75ef3d85269284a24775bf6b24f created.
Removing unused objects 100%...
Done.
1869.78user 6.66system 31:36.50elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2031major+1753824minor)pagefaults 0swaps
$ du .git/objects/pack/
454636 .git/objects/pack/

Of course, the clone time would be reduced dramatically if the repository
you clone from has only "master", and is fully (re-)packed.

So I was not completely correct in my assumption that a clear cut a la
linux-2.6 (possibly grafting historical-linux) would not help.

Ciao,
Dscho
-
What Git version is this?
You'd better try out 1.5.4 for packing comparisons. It produces slightly
tighter packs than 1.5.3.

Nicolas
-
Hi,
Ooops. I thought I updated, but no: 1.5.3.6.2835.gf9ebf
Ciao,
Dscho
-
Speaking of which, I haven't looked at builtin-pack-objects.c deeply
enough, but shouldn't it be possible to do prepare_pack and
write_pack_file in one pass?

Mike
-
No.
Nicolas
-
