OpenOffice.org is looking for a new SCM (Software Configuration
Management) tool, or at least was on Friday, 19 Jan 2007;
see: http://blogs.sun.com/GullFOSS/entry/openoffice_org_scm

One of the SCMs considered is Git. Another is Subversion.
There is a functional git tree with the entire OOo history for testing
purposes that can be found at: http://go-oo.org/git.

What I am concerned about is some of the git benchmark results at the Git page
on OpenOffice.org wiki:
http://wiki.services.openoffice.org/wiki/Git#Comparison
Actually it is a comparison with CVS and Subversion, although most
benchmarks are done only for git.

In 'Size of data on the server' git has CVS beat hands down: 1.3G vs
8.5G for sources, 591M vs 1.1G for third party. I think it is similar
for Subversion. I hope that repository is fully packed: IIRC the Mozilla
CVS repository import was about 0.6GB pack file, not 1.3GB.

The problem is with 'Size of checkout': to start working in repository
one needs 1.4G (sources) and 98M (third party) for CVS checkout (it is
1.5G for sources for Subversion checkout). Ordinarily for a distributed SCM
you would need size of repository + size of sources (working area),
which is 2.8G for sources and 688M for third party stuff [the files you can
hack on + the history]. This makes some prefer to go the centralized SCM
route, i.e. Subversion as replacement for CVS (+ CWS, ChildWorkSpace).

What might help here is splitting repository into current (e.g. from
OOo 2.0) and historical part, and / or using shallow clone. Implementing
partial checkouts, i.e. checking out only part of working area (and
using 'theirs' strategy for merging not-checked-out part for merges)
would help. Splitting repository into submodules, and submodule
support -- it depends on organization of OOo sources -- would certainly
help for third party stuff repository.

'Checkout time' (which should be renamed to 'Initial checkout time'),
in which git also loses with 130 minutes (Linux, 2MBit DSL) [from
go-oo.org], 1...
The text bases for Subversion really should take another 1.4 GiB.
You could also split along project boundaries, but this is probably
IIRC, GIT accesses every file in the tree, not just the ones that need
updating. How many files were actually updated when you changed
branches in your experiment?
-
Hi,
No. Git does not access every file, but rather all stats. That is a huge
difference. And it should not take _that_ long for ~64000 files. Granted,
it will cause a substantial delay, but not in the range of minutes.

Ciao,
Dscho
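
The difference between reading every file and only stat-ing every file can be sketched roughly as below. This is a minimal Python illustration, not git's actual index code: the function names `snapshot` and `changed_paths` are made up for the example, and it assumes a file counts as changed when its size or modification time differs from the cached values.

```python
import os

def snapshot(root):
    """Record (size, mtime_ns) for every regular file under root."""
    cache = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            cache[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return cache

def changed_paths(root, cache):
    """Return the cached paths whose stat data no longer matches.

    Only lstat() is called per file; file contents are never opened,
    so the cost is one metadata lookup per tracked file.
    """
    changed = []
    for rel, cached in cache.items():
        try:
            st = os.lstat(os.path.join(root, rel))
        except FileNotFoundError:
            changed.append(rel)  # file was deleted since the snapshot
            continue
        if (st.st_size, st.st_mtime_ns) != cached:
            changed.append(rel)
    return sorted(changed)
```

With ~64000 files this is still ~64000 metadata lookups, which is why it is fast when the kernel has the stat data cached and slow when it has to go to disk.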
-
It's worse... On my laptop the switch took ~ten minutes, not three.
A diff --stat takes over six minutes!! For reference, dd'ing the pack
file with my disk takes ~50 seconds.

The reason is simple. I have a lousy one gigabyte RAM only, while
git wants 1.7GB virtual to do the diff-stat, and 800 MB resident. The swap is having a party:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b    swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  2 1861632  14108    428 126816   70  347   605   647  594 1041 11  2 74 13
 0  2 1861204  12096    420 125724 3096    8  3096    24  625 1171  5  1  0 94
 0  2 1860896  18972    404 115836 3524  292  3524   292  671 1474  7  4  0 89
 0  2 1860820  18668    364 113736 3556  784  3556   784  669 1384  7  5  0 88
 0  3 1860420  19692    300 109904 3008  180  3156   180  684 1325  8  5  0 87
 0  3 1860184  18560    300 108596 3316  232  3396   232  643 1246  8  4  0 88
 0  2 1859856  21808    292 103744 2108   32  2356    32  637 1319  9  1  0 90

-- robin
-
[resend - correcting a couple of typos and addressing git@vger
correctly - apologies]

That is true, unfortunately. git will fly if it can fit its working
set plus the kernel stat cache for your working tree in memory. And
the underlying assumption is that for large trees you'll have gobs of
RAM. If things don't fit, it does get rather slow...

But... just to put things in perspective, how long does it take to
*compile* that checkout on that same laptop? I remember reading
instructions to the tune of "don't even try to compile this with less
than 4GB RAM, a couple of CPUs and 12hs". Those were for the OSX build
IIRC.

Ah - it's moved to the general instructions: "Building OOo takes some
time (approx 10-12 hours on standard desktop PC) ":
http://wiki.services.openoffice.org/wiki/Building_OpenOffice.org#Startin...

So I don't think anyone working on projects the size of the kernel or
OO.org is going to be happy with 1GB RAM.

cheers
m
-
No idea. I wouldn't try it without distcc and ccache anyway which makes the
The kernel 2.6 repo isn't in the same ball park wrt size. Hacking the kernel is quite fine
on this machine and even smaller though the first compile takes some time. Having more
is always fun though.

Consider another huge project like Eclipse. Similar operations take a loong time (not anywhere
near the eons that CVS needs, but...) and building Eclipse with 1GB is very reasonable so sheer
project size does not per se demand powerful computers. KDE is another huge project that
is reasonable to build with 1GB. The first time is somewhat painful, but rebuilding is not.

-- robin
-
Hi Jakub,
I did the git numbers, so if they are wrong - blame me :-) I am also curious
about the SVN numbers, because the SVN conversion [from my point of view]
cheats a lot. From what I know, it does not contain the historical branches
(yes, the >3000 of them that are in the git tree), and if I understood that
correctly, instead of history in the branches, they commit just
'integration commits' [one commit for all the changes in the branch] which
breaks 'svn blame' completely.

Unfortunately, I did not have a chance to try the SVN tree yet to see it
Considering the size OOo needs for build (>8G without languages),
the ~1.4G overhead for history is very well bearable. I am surprised about
the 100M overhead for SVN as well - from my experience it is usually about
the size of the project itself; but maybe they improved something in SVN

We should better split the OOo sources; it's a process that already started
[UNO runtime environment vs. OOo without URE], and I proposed some more

Good point, and I already changed the page in the morning. I also added the
I am really curious about the SVN tree. As I said, I did not see it yet.
There is just some info about it here:
http://wiki.services.openoffice.org/wiki/SVNMigration, but I cannot check it

For the git tests, it was:
CPU: AMD Athlon(tm) 64 Processor 3200+
RAM: 1G RAM
Disk (info from bonnie):
              ---Sequential Output (nosync)--- ---Sequential Input-- --Rnd Seek-
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --04k (03)-
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU   /sec %CPU
one    1*2000 37819 77.6 44296 16.8 16982  5.1 35203 63.9 45915  6.6  152.4  0.4

Regards,
Jan
-
I think the supposition that SVN uses hardlinks for pristine copy
of sources (HEAD version) seems probable; then there is 100M overhead
plus size of changed files, and of course this trick works only on
filesystems which support hardlinks, and assumes either hardlinks being

I forgot to add that it is possible to graft historical repository to the
current work repository, resulting in full history available. For example
Linux kernel repository has backported from BK historical repository, and

git-clone(1):
--depth <depth>::
        Create a 'shallow' clone with a history truncated to the
        specified number of revs. A shallow repository has a
        number of limitations (you cannot clone or fetch from
        it, nor push from nor into it), but is adequate if you
        want to only look at near the tip of a large project
        with a long history, and would want to send in fixes
        as patches.

It is possible that those limitations will be lifted in the future
(if possible), so there is alternate possibility to reduce needed

The problem with implementing this feature (you can do partial checkout
using low level commands, but this feature is not implemented [yet?]
per se) is with doing merge on part which is not checked out. Might
not be a problem for OOo; but this might be also not needed for OOo.
Sometimes submodules are better, sometimes partial checkout is the

In my opinion each submodule should be able to compile and test by
itself. You can go X.Org route with splitting sources into modules...
or you can make use of the new submodules support (currently plumbing
level, i.e. low level commands), aka. gitlinks.

The submodules support makes it possible to split sources into
independent modules (parts), which can be developed independently,
and which you can download (clone, fetch) or not, while making it
possible to bind it all together into one superproject.

See (somewhat not up to date) http://git.or.cz/gitwiki/SubprojectSupport
By the ...
Hi Jakub,
It's probably too tight limitation for regular developers; for random hackers
Indeed, this is the case of URE - it is supposed to run by separately & be
From what I know, it does not.
Thank you and others for all the input!
Last question: what is the status of the Win32 support? I got a full clone
using the Cygwin git 1.5.0 [it took 6hrs 20min on a Xen virtual machine; I
have to try it with real hardware], MinGW version did not work for me too
well :-( Are there any other options? Is
http://git.or.cz/gitwiki/WindowsInstall up-to-date?

Regards,
Jan
-
By the way, even without submodule support, which for now is plumbing
level only, it would be possible to pull separate subprojects into main
project, like git repository does now with gitk repository, and with
git-gui repository. The latter is merged putting git-gui files in separate
directory in git.git repository, using the 'subtree' merge strategy.

Submodules / subprojects are something similar to Subversion svn:externals
done right.

--
Jakub Narebski
Poland
-
It kind of works. Performance is horrible, but still better
than almost everything comparable (and there isn't anything
comparable). You have to be very careful not to push it
(them, actually: cygwin and windows) too hard: it is quick
to fall over taking down the whole machine with it (yes,
avoid Ctrl-C at all costs).

Avoid Win32 if possible, work somewhere in a sane environment,
Yes.
-
Are you sure? Using the graft mechanism, Git can make this very easy and
almost transparent for the user - when he clones he gets no history but
he can use say some simple vendor-provided script to download the
historical packfile and graft it to the 'current' tree. After that, the
graft acts completely transparently and it 'seems' like the history
goes on continuously from OOo prehistory up to the latest commit.

Besides, in case you discover a year later that the conversion was
broken in some places etc., you can just fix this, re-run the conversion
and simply regraft your history to point at the 'new' historical commit,
without affecting your current development and commit ids at all. For
this reason alone, I'd seriously consider grafting history separately
when migrating any non-trivial project from other SCM to Git.

Then again, due to the sheer tree sizes etc., I'm not sure how much
would throwing the history away actually reduce the packfile size.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Ever try. Ever fail. No matter. // Try again. Fail again. Fail better.
-- Samuel Beckett
-
Hi Pasky,
Interesting, I did not know that it is possible to do it so that it appears
transparently; this would be indeed a tremendous win - we could start nearly
from scratch ;-)

Please - where could I find more info? Like what does the script have to do,
Thanks a lot,
Jan
-
Hi,
you can see an example script at

http://repo.or.cz/w/elinks.git?a=blob;f=contrib/grafthistory.sh

and I have tried in vain a few times to get a similar script into the kernel
too

http://lists.zerezo.com/linux-kernel/msg6599002.html

that can use both wget and curl and will also download tag refs for the
history.

The format of the grafts file itself (.git/info/grafts) is pretty
simple (just one-graft-per-line where you first say the commit id and
then the parent commit(s) to be grafted onto it), please see
Documentation/repository-layout.txt for details.

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Ever try. Ever fail. No matter. // Try again. Fail again. Fail better.
-- Samuel Beckett
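
The one-graft-per-line format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `format_graft` and `parse_grafts` are invented for the example, object ids are assumed to be 40-character lowercase hex SHA-1 values, and the ids used in any call are placeholders rather than real OOo commits.

```python
import re

# A SHA-1 object id written as 40 lowercase hex digits.
SHA1_RE = re.compile(r"[0-9a-f]{40}")

def format_graft(commit, parents):
    """Build one grafts line: the commit id followed by its new parent ids."""
    ids = [commit, *parents]
    for object_id in ids:
        if not SHA1_RE.fullmatch(object_id):
            raise ValueError(f"not a 40-hex object id: {object_id!r}")
    return " ".join(ids)

def parse_grafts(text):
    """Map each grafted commit id to its list of parent ids."""
    grafts = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        commit, *parents = fields
        grafts[commit] = parents
    return grafts
```

For the history split discussed in this thread, the commit would be the first commit of the 'current' repository and the single parent would be the last commit of the historical one.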
-
Hi,
by the way, this script goes back to very ancient Git times, maybe by
now git-fetch could be convinced to do all the hard work for you.
Actually, maybe just something (totally untested) like

git remote add -f historical {http,git}://historical_repository_url
cat <<EOF >>.git/info/grafts
... the graft specs go here ...
EOF

might work perfectly fine nowadays that git keeps the remote refs in a
separate namespace tidily. This way you don't have to care about all the
manual wgetting, ls-remote magic etc. The downside is that this is
available only since git-1.5.0 (Debian stable has older version; maybe
even newer git version is required, I'm not sure).

--
Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
Ever try. Ever fail. No matter. // Try again. Fail again. Fail better.
-- Samuel Beckett
-
Hi,
It took me longer here, but the reason might be that my "local" repository
I imagine that might be related to the vast amount of remote branches.
IIRC we do not pack them with git-gc, and ext3 is not that good with big
directories (remember: 3464 branches!).

Maybe oprofile knows a bit more where the hotspots are.
Ciao,
Dscho
-
Hi,
FWIW I can confirm the number "100min".
Something I realized with pain is that the refs/ directory is 24MB big.
Yep. Really. They have 3464 heads and 2639 tags. I suspect that this is
the reason why.

Will play with it.
Ciao,
Dscho
-
Hi Johannes,
I should probably even produce a tree where the merged branches would be
deleted, right...

Regards,
Jan
-
Hi,
FWIW, I just deleted all branches except for one, packed the tags, and did
a local clone (via NFS, urgh) _without_ checking the files out.

Now it takes 25 minutes vs 50 minutes before (in an _extremely_
unscientific test, mind you).

So, this issue is worth looking at, probably.
Ciao,
Dscho
-
Hi!
Then packed refs would certainly help with speed and a bit with size.
--
Jakub Narebski
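
A back-of-the-envelope check of the 24MB figure for refs/ quoted earlier in the thread, as a small Python sketch. The assumptions are mine, not from the thread: a 4 KiB filesystem block size, one 41-byte file per loose ref (40 hex digits plus a newline), and directory entries ignored. The function name is made up for the example.

```python
import math

def loose_refs_disk_bytes(ref_count, block_size=4096, ref_file_bytes=41):
    """Disk space used by loose refs when each tiny file occupies whole blocks."""
    blocks_per_ref = math.ceil(ref_file_bytes / block_size)
    return ref_count * blocks_per_ref * block_size

heads, tags = 3464, 2639          # counts reported in the thread
total = loose_refs_disk_bytes(heads + tags)
print(f"{total / 2**20:.1f} MiB")  # prints "23.8 MiB"
```

Under these assumptions almost all of the space is block-size slack rather than ref data, which is consistent with packing helping "a bit with size" in absolute terms while removing thousands of small files.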
-
Btw, this reminds me: we really should start out clones with a fully
packed set of refs. It seems stupid to get the refs in one go, and then
explode them into thousands of files.

A trivial patch is to just do

git pack-refs --all --prune

in the "git-clone.sh" script rather than force people to do it themselves,
but we really probably shouldn't have ever even unpacked them in the first
place. That is kind of stupid, but especially since that thing is written
in shell, it's hard to do anything smarter.

Of course, I don't know what the hell openoffice is doing with that many
branches and tags, but I guess it's a normal result of having used CVS/SVN
- you want to tag every single merge you do, and all branches stay around
forever, because you can never merge them back and get rid of them.

It's always sad to see the crap that is CVS, and how bad decisions in CVS
end up resulting in pain downstream.

Linus
-
It is not just being in shell.
Although I do agree that the initial clone is special, I would
rather make clone just a thin wrapper to fetch that also happens
to perform necessary initial setup.

Keeping fetched and updated refs in core and writing a packed refs
out in one go in git-fetch--tool (and later, git-fetch all in C)
would be much simpler if we do not have to worry about existing
refs (aka "git clone" special case); I am not sure if packing
refs is desirable in general for incremental "git-fetch".
-
Fair enough. It's true that for the general case of "git fetch", it's much
less obvious how to keep things packed.

So maybe the right thing really *is* to just add the
git pack-refs --all --prune
to the git-clone wrapper.
Linus
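
What packing refs amounts to can be sketched as collapsing the many one-line files under refs/ into a single text with one "<object id> <refname>" line per ref. The Python below is a simplified illustration only, not what `git pack-refs` actually does internally: the function name `pack_loose_refs` is invented, and it writes no header line, records no peeled values for annotated tags, and does not delete the loose files (the `--prune` part).

```python
import os

def pack_loose_refs(git_dir):
    """Return packed-refs style text for all loose refs under git_dir/refs."""
    entries = []
    refs_root = os.path.join(git_dir, "refs")
    for dirpath, _dirnames, filenames in os.walk(refs_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path) as ref_file:
                object_id = ref_file.read().strip()
            # Ref names always use forward slashes, e.g. refs/heads/master.
            refname = os.path.relpath(path, git_dir).replace(os.sep, "/")
            entries.append((refname, object_id))
    entries.sort()  # sorted by ref name
    return "".join(f"{object_id} {refname}\n" for refname, object_id in entries)
```

The point of doing this at clone time is that the refs arrive as one list anyway, so writing them as one file avoids creating thousands of tiny files in the first place.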
-
Hi,
Indeed for size: du -h reported 11 megabyte for the tags directory. After
packing them, a 265 kilobyte file is left. Of course, git-show-ref now
becomes a speed demon again.

Ciao,
Dscho
-
I'm fairly sure it's not. If so that would also affect the speed of
operations wouldn't it?

I also doubt the subversion checkout size - subversion keeps a pristine copy
of the HEAD file - so a subversion checkout is usually over twice the size of

I wonder if they are measuring the time for the generation of the commit
message or something? Or perhaps using "git-commit -a" is causing a check

I'd also like to see some of the numbers for the other systems, I tried to use
subversion with the linux kernel once and got fed up waiting for it to do
anything. I suspect the reason numbers aren't shown for the others is that

Wasn't there a recent change that made repacking after a clone unnecessary?
That would certainly reduce the checkout size.

Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
-
A fully packed clone of the OOo git repo was indeed 1.3G, and the entire
checkout + repo was indeed 8.5G (using git 1.5.1.2).

Took about 46m to clone on a server with decent bandwidth, ~5.5m user time,

Not from the numbers that are quoted it won't, they are fully packed
sizes.

--
Julian

---
"Do you have blacks, too?"

George W. Bush
To Brazilian president Fernando Cardoso
November 8, 2001
Washington, D.C.
-
I'm more confused now then. I assumed the figures were accurate, but they
cannot be:

                            CVS    git    SVN
Size of data on the server  8.5G   1.3G   n/a
Size of checkout            1.4G   2.8G   1.5G

I don't doubt the 1.3G on the server - and assume that is fully packed. The
checkout sizes are suspicious though. Is that 2.8G packed?
- If it is, then we can deduce that this is a repo+source size, since the
server is packed size+0 therefore the size of the source tree is
2.8G - 1.3G = 1.5G
In which case the other figures are wrong:
- CVS checkout is 1.4G - impossible, the source tree is 1.5G. And where is
the overhead of the CVS directories which would make it more than 1.5G?
- SVN checkout overhead is always _at least_ the size of the source tree
because it keeps a pristine copy of HEAD. If the source tree is 1.5G,
then this figure should be at least 3G.
- If it is not, then we're back to "I don't believe that git was packed"

Something smells fishy here - either the source tree size is included in some,
but not in others or the git repository wasn't packed.

Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
-
Could it be that there is a mode in svn checkout that allows
pristine to be hardlinked to the working tree copies? It
requires an editor that can be told to break hardlinks when
making modifications (and the user obviously needs to know about
it), but to save 1.5G it is worth it and if _I_ were hacking on
SVN that would be an obvious optimization to add.
-
Hi Andy,
Unfortunately I don't have the _exact_ numbers here any more so I cannot prove
it ;-) - but this is a rounding problem [CVS checkout is slightly more than
1.4G]. Similarly, overhead of CVS directories is 0 when we count in

Yes, this surprises me as well. I've heard about some improvements in the

As I wrote, I am looking forward to seeing the SVN tree myself for further
testing.

Regards,
Jan
-
0.1G would have been an awfully big rounding error. Regardless, Julian has
put me right on that - the git checked out size was actually 2.7GB - this

Very much so - I've tried with a 1.4.2 and my own small repository and the
pristine copies are stored uncompressed as always. 0.1G now sounds plain
wrong. Maybe there are some switches I should be using to svn checkout.

Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
-
oops, meant 2.7G not 8.5G there ... sorry, was working from memory.
jp3@electron: ooo(unxsplash)>du -sh .git
1.3G .git
jp3@electron: ooo(unxsplash)>du -sh .
2.7G .
jp3@electron: ooo(unxsplash)>ls .git/objects/

1.3G is the packed size ...
jp3@electron: ooo(unxsplash)>ls -sh .git/objects/pack/
total 1.3G
 37M pack-87efcac9bcb117328e8a1b0c1b42c88c3603c5b7.idx
1.2G pack-87efcac9bcb117328e8a1b0c1b42c88c3603c5b7.pack

--
Julian

---
To err is humor.
-
Not a problem. That fixes one ambiguity:
2.7G - 1.3G = 1.4G
Which is the same as the CVS checkout size. Both the CVS and git figures are
now consistent:
                            CVS    git    SVN
Size of data on the server  8.5G   1.3G   n/a
Size of checkout            1.4G   2.7G   1.5G
Overhead in checkout        0G     1.3G   0.1G

Could be I suppose. Although, in that case CVS should have suffered the same
because the disparity was in the source tree size. Packed git shouldn't
suffer any filesystem overhead (relatively) because the majority of its
space is taken up by one large pack file (which of course only suffers file

I've just checked using subversion 1.4.2 and the .svn/text-base/*.svn-base
files are all uncompressed copies of the working tree files. Doesn't look

Thanks for your help. It's all looking more consistent to me now; only the
subversion figures seem wrong.

I wonder when they're going to get timing numbers for the non-git systems.
That must be a monster of a repository for them to deal with.

Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
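
The consistency check in this message can be written out as a short Python sketch. It uses the checkout sizes from the table in this message, takes the 1.4G CVS checkout as the size of the bare working files, and assumes (as argued in this thread) that a Subversion checkout keeps one uncompressed pristine copy of every file. Sizes are held as integers in tenths of a gigabyte to avoid floating-point noise; the variable and function names are made up for the example.

```python
# Checkout sizes from the table, in tenths of a gigabyte (14 means 1.4G).
CHECKOUT = {"CVS": 14, "git": 27, "SVN": 15}
SOURCE_TREE = 14  # working files alone, taken from the CVS checkout

def overhead(system):
    """Space a checkout needs beyond the working files themselves."""
    return CHECKOUT[system] - SOURCE_TREE

def fmt(tenths):
    """Render tenths of a gigabyte as e.g. '1.3G'."""
    return f"{tenths / 10:.1f}G"

for system in CHECKOUT:
    print(system, fmt(overhead(system)))

# With an uncompressed pristine copy per file, an SVN checkout should be
# at least twice the working tree: 2 * 1.4G = 2.8G, not the reported 1.5G.
svn_expected_min = 2 * SOURCE_TREE
print("SVN expected at least", fmt(svn_expected_min))
```

The git overhead of 1.3G equals the packed repository size, so the CVS and git rows agree with each other; only the SVN row falls well short of the expected minimum.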
-
Except that it's 2.8G, I forgot I had switched branch. I switched to the
unxsplash branch, and _that_ is 2.7G checked out.

(du -s .) - (du -s .git) = 1.49G

--
Julian

---
"Consider a spherical bear, in simple harmonic motion..."
-- Professor in the UCB physics department
-
Yes, depending on where you cut off and how reasonable the
Partial checkouts, perhaps, "theirs", NO.
Consider that you are working on the tip with partial checkout.
Somebody has a bugfix that is applicable to all of ancient, old,
maintenance and current codebase. Naturally you would want the
bugfix to be applied to ancient, merge it to old, and then
maintenance and then current (the last one is what you are
working on).

What happens if you actually pull ancient when you are partially
This is probably the most sane way.
-
