I hate irc.

I'm reading the irc logs, and seeing that people have problems, but (a) it was while I was asleep and (b) irc use doesn't encourage people to actually explain what the problems _are_, so I have no clue.

So now I know that "spyderous" has problems importing some 1GB gentoo CVS archive, but that's pretty much it. Grr.

Are people afraid to post to git@vger.kernel.org, or what? I saw that people tried to suggest posting to the git mailing list, but can any of you who are active on irc be a bit more forceful? And perhaps we don't make this mailing list address well enough known?

As far as I'm aware, the git mailing list isn't closed, so people should be able to post here without even subscribing. I can well understand that you might not want to subscribe and prefer to look over the list through some archive setup (the way I look at the irc logs), and maybe we should just make the git mailing list address more obvious.

Right now, the "community" page at http://git.or.cz/community.html doesn't even mention the git mailing list address directly, it just tells you how you can subscribe and read the archives.

Can we perhaps fix that, and the people who are active on irc please also make it clear to people that if they have some real problems that don't get an immediate answer, the git mailing list ends up where a lot of people can actually look more closely at it.. And tell them what the address is.

Linus -
I hate irc, too. A number of times, easily solvable usage problems come up, and by the time I look at the log and realize that the solutions suggested were waaaaay suboptimal, it is too late (with loops being quite active recently things have improved a lot, but we should not expect him to be there 24/7). Maybe somebody can run a dumb 'bot that notices when somebody said something that ends with a '?' and there is no activity for N minutes, and injects a recorded message that reminds people of the mailing list address ;-). -
FWIW, I have mentioned a problem that may be the same, under
Message-ID <20060107090148.GB32585@nowhere.earth>, that was on January
7th. Namely, when importing a repository with very large files over
pserver or ssh, timeouts can occur and prevent the import from
working. But, as you said, it's not easy to get precise info from the
logs :)
Best regards,
--
Yann Dirson <ydirson@altern.org> |
Debian-related: <dirson@debian.org> | Support Debian GNU/Linux:
| Freedom, Power, Stability, Gratis
http://ydirson.free.fr/ | Check <http://www.debian.org/>
-
For big repositories, you really shouldn't use pserver or ssh anyway. You should try really really hard to just get a local copy, and do it that way. It's going to be tons faster, and will avoid a lot of the problems, including network timeouts etc. Linus -
Hi all, I just subscribed and this post is the only one I've got from the thread, so I'm responding to it instead of the original. Gentoo's an IRC-based community, so I tend to try IRC first for any problems I have and fall back to the list later if I can't get things figured out. Here's a rough summary: Our main repo is actually a bit over 2G (2103621223) now that I check, but it's not very complex. There's actually just one branch, and I don't think anyone would care if we lost the history from it because it's a release branch from a few years ago. Somebody else tried importing it with git-cvsimport, but he said he hit some kind of problem and recalled that it was a cvsps segfault. Sounds about right, since I've never gotten cvsps to run successfully on the whole repo either. I tried with parsecvs, but it runs into OOM even on a machine with 4G RAM after reading in all the ,v files, presumably while it's building some huge tree of changesets in memory. Keith Packard's suggested that there are ways to reduce parsecvs's memory use, because it retains the full tree in memory for each revision rather than just the files that actually changed. But my C skills are pretty weak; I'm an OK reader but not much of a writer yet. Thanks, Donnie
Can you point to it? I'm not a CVS user, but I've played with cvsps before (to get it to work), and I'm a humanitarian - rescuing people from CVS is to me not just a good idea, it's a moral imperative. Linus -
I don't want to post the link publicly for a few reasons, including the huge amount of bandwidth it would suck up for lots of people to download it. I've sent it to you off-list, and if anyone else would also like it, please drop me a note. Thanks, Donnie
Ok. It's still converting (that's a big archive), but it has passed the
cvsps stage without errors for me, and the conversion so far seems ok. But
it has only gotten to
Author: vapier <vapier> 2002-09-23 12:32:42
Changed GPL to GPL-2 in LICENSE and updated SRC_URI to use mirror:
so it has converted only slightly more than the first two years of
history in the roughly 30 minutes I've let it run. So it will take several
hours.
The reason it works for me is likely simply the fact that I had a few
patches to my cvsps already. I'm appending the stupid patches, I'm not
guaranteeing that they are correct at all, although the three _committed_
patches are almost certainly correct (and the last uncommitted one is
almost certainly totally broken). The patches are against clean cvsps 2.1.
Also, when I say "the conversion so far seems ok", I obviously don't
actually know what the hell the archive is supposed to look like, so I can
only say that the end result seems not totally insane.
To do a good conversion, you'll want to make sure that you have an author
name conversion file. See the "-A" flag in "git help cvsimport" (if you
have the man-pages installed).
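A minimal sketch of what such a file might look like, assuming the `login=Full Name <email>` line format described in the git-cvsimport man page. The names and addresses are placeholders, and `check_authors` is a hypothetical helper for this example, not part of git.

```shell
#!/bin/sh
# Write a sample author-conversion file (placeholder identities only).
authors_file=${TMPDIR:-/tmp}/Authors.example.$$
cat > "$authors_file" <<'EOF'
vapier=Example Developer <vapier@example.org>
jdoe=Jane Doe <jdoe@example.org>
EOF

# check_authors FILE: print every line that does not look like
# "login=Full Name <email>"; succeed only if there are no such lines.
check_authors() {
    ! grep -v -E '^[^=[:space:]]+=[^<>]+ <[^<>[:space:]]+>$' "$1"
}

check_authors "$authors_file" && echo "author file looks well-formed"
```

The file would then be passed to the import with `-A`, as in the `-A ~/dev/Authors` invocation quoted later in the thread.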
Linus
---
commit 534120d9a47062eecd7b53fd7ac0b70d97feb4fd
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date: Wed Mar 22 11:20:59 2006 -0800
Increase log-length limit to 64kB
Yeah, it should be dynamic. I'm lazy.
---
cvsps_types.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/cvsps_types.h b/cvsps_types.h
index b41e2a9..dba145d 100644
--- a/cvsps_types.h
+++ b/cvsps_types.h
@@ -8,7 +8,7 @@ #define CVSPS_TYPES_H
#include <time.h>
-#define LOG_STR_MAX 32768
+#define LOG_STR_MAX 65536
#define AUTH_STR_MAX 64
#define REV_STR_MAX 64
#define MIN(a, b) ((a) < (b) ? (a) : (b))
commit 82fcf7e31bbeae3b01a8656549e9b8fd89d598eb
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date: Wed Mar 22 11:23:37 2006 -0800
...Btw, trying this import (which got interrupted by a thunderstorm and one of our first power failures in a long time - just a few seconds, but enough to power off everything but my laptops) it became very obvious that "git cvsimport" really _really_ should re-pack the archive every once in a while. The old "repack every month or so" approach doesn't work that well when you try to import several years of history in a few hours. Now, you can just repack after the whole thing is done (it will probably take no more than ~15 minutes or so), but it would probably be best if the import script itself decided to repack every once in a while just to avoid wasting a lot of diskspace _during_ the import itself. So this isn't so much a correctness issue as a "avoid wasting time and space" issue, but still.. Linus -
Fortunately the storms haven't been that bad down in Corvallis. cvsps also worked fine for me, but git-cvsimport broke in the middle. The command I'm using is 'git-cvsimport -P ../gentoo.cvsps -k -d /media/scm_comparison -A ~/dev/Authors -v gentoo-x86 | tee cvsimport.log'

Here's the last bits:

Fetching gnome-base/gnome-applets/gnome-applets-1.4.0.4-r1.ebuild v 1.5
Update gnome-base/gnome-applets/gnome-applets-1.4.0.4-r1.ebuild: 947 bytes
Fetching gnome-base/gnome-applets/gnome-applets-1.4.0.4-r2.ebuild v 1.3
Update gnome-base/gnome-applets/gnome-applets-1.4.0.4-r2.ebuild: 977 bytes
Fetching gnome-base/gnome-applets/gnome-applets-2.0.0-r1.ebuild v 1.2
Update gnome-base/gnome-applets/gnome-applets-2.0.0-r1.ebuild: 2704 bytes
Fetching gnome-base/gnome-applets/gnome-applets-2.0.0.ebuild v 1.2
Update gnome-base/gnome-applets/gnome-applets-2.0.0.ebuild: 3031 bytes
Tree ID 4d19a84efce2de9cfb42ac0397e0036bbed2ad65
Parent ID ecb78bbe30369a76e2599d0d17de8fe922dca211
Committed patch 14615 (origin 2002-07-16 20:13:15)
Commit ID 4dd2179e0c1369e07cd268fb5c8b150c3a2a1094
Delete net-fs/openafs/openafs-1.2.2-r6.ebuild
Delete net-fs/openafs/files/digest-openafs-1.2.2-r6
Tree ID bfc7320883983655d7d2ea2c6d04f85b45365ce1
Parent ID 4dd2179e0c1369e07cd268fb5c8b150c3a2a1094
Committed patch 14616 (origin 2002-07-16 20:15:15)
Commit ID 7a36de9c4c9b93337ed789ae2341cad3d0991c6d
Unknown: error Cannot allocate memory
Fetching profiles/package.mask v 1.992
cat: write error: Broken pipe

Thanks,
Donnie
Hmm. It's actually possible that it did that for me too - I had put the cvsimport in an xterm and forgotten about it, and just assumed that the power failure was what broke it. But maybe it had broken down before that Hmm. I don't actually know perl, and my original "cvsimport" script was actually this funny C program that generated a shell script to do the import. That worked fine, and had no memory leaks, but it was a truly hacky thing of horrible beauty. Or rather, it _would_ have been that, if it had had any beauty to be horrible about. But at least I would have been able to debug it. But the perl one I can't parse any more. That said, the whole "Unknown:" printout seems to come from the subroutine "_line()", which just reads a line from the cvs server. Did you do a "top" at any time just before this all happened? It _sounds_ like it might actually be a memory leak on the CVS server side, and the problem may (or may not) be due to the optimization that keeps a single long-running CVS server instance for the whole process. I wouldn't be in the least surprised if that ends up triggering a slow leak in CVS itself, and then CVS runs out of memory. That would likely have been obvious in any "top" output just before the failure. Smurf, Martin, Dscho.. Any ideas? My old script just ran RCS directly on the files, and had no issues like that. I'll happily admit that my old script generator thing was horrible, but it was a lot easier to debug than the smarter perl script that uses a CVS server connection.. Linus -
Running a few tests right now. Looks like cvs (Debian/etch 1.12.9-13) itself is not leaking any memory. The Perl (Debian/etch 5.8.7-something and now 5.8.8-4) process OTOH is visibly allocating memory. Starts off at 4MB and gets up to ~17MB by the time it has done 6K commits. I am trying to figure out whether the leak is in the script or in the Perl implementation, using PadWalk, Devel::Leak and friends. If the Or a slow leak in Perl? The 5.8.8 release notes do talk about some leaks being fixed, but this 5.8.8 isn't making a difference. Working on it. martin -
Thanks. Looking at what I did convert, that horrid gentoo CVS tree is interesting. The resulting (partial) git history has 93413 commits and 850,000+ objects total, all in a totally linear history. And that's just up to April 2004, so the full tree is probably a million objects. The good news is that git seems to handle that size repo no problem at all. The repack did indeed take a long while, but it packed it all down to a 189MB pack-file (and 20MB pack index). Considering that the bzip2'd tar-file of the CVS history was 157MB, and the actual CVS footprint was about 1.6GB, if git stays at under a quarter gigabyte for the whole archive once converted (which sounds likely, counting indexing), git would basically cut down the disk usage for a live repo by a factor of 7 or so. _And_ I can do a "git log origin > /dev/null" in about 2.4 seconds. Take that, CVS. Linus -
Ok, so there's 3 patches posted that should help narrow down the problem. There's a new -L <limit> so that Donnie can get his stuff done by running it in a while(true) loop. Not proud of it, but hey. And there are two patches that I suspect may fix the leak. After applying them, the cvsimport process grows up to ~13MB and then tapers off, at least as far as my patience has gotten me. It's late on this side of the globe so I'll look at the results tomorrow morning. (BTW, I typo-ed Linus' address in the git-send-email invocation. Will resend to him separately) I'll also prep a patch as Linus suggests to do auto-repacking while Heh. Faster Gitticat, Kill Kill Kill! martin -
Ok, initial results are promising. git-cvsimport appears to be still slowly growing, but it's at 40M (ie pretty tiny, considering that cvsps grew to 800+MB on this archive) and growth seems to actually be slowing. My conversion is only up to September 2002, but if it doesn't suddenly hit some huge growth spurt, I wouldn't expect it to run out of memory. The CVS server process itself is tiny, and doesn't seem to grow at all.

As to packing, doing something like

	while :
	do
		sleep 30
		#
		# repack roughly every 25600 objects
		#
		n=$(ls .git/objects/00 2> /dev/null | wc -l)
		if [ "$n" -gt 100 ]; then
			git repack -a
			#
			# Stupid sleep to make sure that nobody is still
			# using any unpacked objects after the pack got
			# generated
			#
			sleep 10
			git prune-packed
		fi
	done

or similar (the above is totally untested - I've just done it by hand a few times) should work. It's perfectly ok to repack the archive even while the cvsimport script is adding more data and changing it.

Linus -
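The "roughly every 25600 objects" figure follows from the object store layout: loose objects are spread by hash over 256 fan-out directories (`00` to `ff`), so 100 files in one directory suggests about 100 * 256 in total. A small sketch of that estimate, where `estimate_loose_objects` is a hypothetical helper name and the even spread is assumed rather than measured:

```shell
#!/bin/sh
# estimate_loose_objects GIT_DIR: count the loose objects in the "00"
# fan-out directory and scale by 256, the number of such directories.
estimate_loose_objects() {
    n=$(( $(ls "$1/objects/00" 2>/dev/null | wc -l) ))
    echo $((n * 256))
}

# Example: report the estimate when run at the top of a repository.
if [ -d .git ]; then
    echo "approx. loose objects: $(estimate_loose_objects .git)"
fi
```

Sampling a single directory keeps the check cheap enough to run every 30 seconds next to a busy import.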
That's great news. The cvs archive seems to have large commits every once in a while, so I suspect the residual memory growth may be related to those. Or to a smaller leak I haven't nailed. My test box is bloody slow it seems. I'll try and get hold of a faster Given that we are running batch, it is safe and simple to stop the import, repack, prune-packed, and keep going. Don't think we'll win any races by running it in parallel ;-) cheers, martin -
OK, I started a new run without -L, and I'm watching it in top right now. The cvsimport seems to be doing alright, but the cvs server process sucks about another megabyte of virtual every 4-5 seconds. This is a bit concerning since I don't have any swap. Shortly after it hit 670M, I got "Cannot allocate memory" again. I've got a gig of RAM, and around 300M was resident in various processes at the time. So it seems the problem is in cvs itself. I will try another run with -L now. Thanks, Donnie
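To tell virtual growth from resident growth without staring at top, the two numbers can be sampled directly. This is a rough, Linux-only sketch that reads `/proc/<pid>/status`; `mem_kb` is a hypothetical helper name, and the example samples the current shell only because the real cvs PID is not known here.

```shell
#!/bin/sh
# mem_kb PID: print the virtual size and the resident set size of PID,
# both in kB, taken from the VmSize and VmRSS lines of /proc.
mem_kb() {
    awk '/^VmSize:/ { v = $2 } /^VmRSS:/ { r = $2 } END { print v, r }' \
        "/proc/$1/status"
}

# Example: sample this shell itself; for the import one would pass the
# PID of the cvs server or of git-cvsimport instead.
set -- $(mem_kb $$)
echo "virtual: $1 kB, resident: $2 kB"
```

Logging this once a minute during an import would show whether it is the virtual size, the resident size, or both that keep climbing.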
Hmm. My cvs server doesn't really grow at all. It's at 13M RSS. What version of cvs are you running? [torvalds@g5 ~]$ cvs --version Concurrent Versions System (CVS) 1.11.21 (client/server) maybe that matters. (but my import is only up to Jun 22, 2003 so far). Linus -
Yeah, that's the thing. RSS stayed about the same (according to top), Concurrent Versions System (CVS) 1.12.12 (client/server) Looks like there's a .13 out but the zlib interaction is badly broken (-z >=1) so my system didn't get upgraded. I'll try it anyway after the -L run finishes. Thanks, Donnie
Not for me. The virtual size is certainly bigger than RSS, but not by a huge amount. So this might be a regression in CVS, since you seem to have a newer version than I do. The latest stable CVS release is 1.11.21, I think: you seem to be running the "development" version (1.12.x). Linus -
Backed down to the 1.11 series, things seem to be going fine so far. Thanks, Donnie
Finally hit an OOM sometime in the past day (yep, a week later) =\. Not sure whether it was cvsimport or cvs. Anyone else had more luck? Thanks, Donnie
It seemed like it had finished on the machine I was running it, and I assumed it was alright in yours too. Looking closer it only made it till April 2004 -- but it may have been killed by a sysadmin, the captured log talks about 'signal 9', I have no idea what the OOM sends. It had done 285070 of 343822 patchsets. Have you dropped the -a from the git-repack invocation? That should help. Try also Linus' patch for git-rev-list. The other thing hurting us is that the commits are _huge_. I wonder how you guys were managing this with CVS. Now _this_ explains why cvsimport grows humongous. I'll try to rework the commit loop so that we don't need to hold all the filenames in memory. It seems to be choking with the commits after April 2004. But that will have to wait till tonight. cheers, martin -
Looking closer, I see that the memory suckers do appear to be git, from dmesg:

Out of Memory: Kill process 17230 (git-repack) score 97207 and children.
Out of memory: Killed process 17231 (git-rev-list).

Just ends like this:

Tree ID 2cc632e5e1d3a430a2cc891bf33c4a12f19a4d0e
Parent ID ad92d7073a52458e0581633bbd8ccbbec838d9e6
Committed patch 249100 (origin 2005-08-20 05:05:58)
Commit ID 28941f00d714f57ab49f1fd725d1c3ce8a5d0b93
Fetching sys-kernel/ck-sources/ChangeLog v 1.113
Update sys-kernel/ck-sources/ChangeLog: 25425 bytes
Fetching sys-kernel/ck-sources/Manifest v 1.164
Update sys-kernel/ck-sources/Manifest: 252 bytes
Delete sys-kernel/ck-sources/ck-sources-2.6.12_p5-r1.ebuild
Fetching sys-kernel/ck-sources/ck-sources-2.6.12_p6.ebuild v 1.1
New sys-kernel/ck-sources/ck-sources-2.6.12_p6.ebuild: 1438 bytes
Delete sys-kernel/ck-sources/files/digest-ck-sources-2.6.12_p5-r1
Fetching sys-kernel/ck-sources/files/digest-ck-sources-2.6.12_p6 v 1.1
New sys-kernel/ck-sources/files/digest-ck-sources-2.6.12_p6: 279 bytes
Can't fork at /usr/bin/git-cvsimport line 592, <CVS> line 3810053.

I wasn't running with a version that did repacks; I just suspended the

Thanks,
Donnie
With the latest cvsimport in Junio's repo, a lot of RAM and a bit of patience...

  gitview   http://git.catalyst.net.nz/gitweb?p=gentoo.git;a=summary
  fetchable http://git.catalyst.net.nz/git/gentoo.git#cvshead

Still pushing it, will be there in a minute or so. The packed repo weighs about 660MB. Not too bad given the size of the project and the number of commits.

martin -
Heh. I think you should enable caching in your apache config. And maybe we should make that part of the gitweb docs. Without a caching web-server, gitweb is pretty slow, but it caches _beautifully_. That gentoo repo has a lot of "duplicate" commits that cvsps will mark as two separate commits because there's one commit for the files, and one commit for whatever the "Manifest" file is. I wonder if those commits should generally be merged or something. That said, things like that are most easily fixed as a git->git update (along with adding name translation), which can avoid re-writing the trees. Linus -
I know I should -- but I'm hoping to find the time to rework gitweb a bit to actually work fast instead. It bothers me that it is so slow on a basically idle machine, and where I can perform the corresponding git operations in the commandline in a blink. And caching is great for really busy sites (aka kernel.org) but git.catalyst.net.nz only serves a handful of small repos for a small Yep, large projects often have good reasons to run custom imports, merging certain commits, rewriting log messages (like the X.org guys were doing). It can be done at the cvsimport stage or later -- I think Pasky has a rewritehistory tool hidden somewhere in Cogito, but I haven't used it. cheers, martin -
We've got a guy who got a Summer of Code project to work on CVS migration, so this could be something along his lines. Thanks, Donnie
He'll want a fast box to wrangle with this repo ;-) martin -
I have a dual opteron with 4gb of ram "on loan" from work :) It still dies though, using git cvsimport or parsecvs. I talked to Keith Packard about adding support to parsecvs for recording the actual changed changesets, but I haven't yet started on implementing that since he isn't using cvsps in parsecvs. I also haven't had a chance to look at the git-cvsimport sources yet, was hoping to get to that later this week. -
The machine I am running this is more constrained than that, and it doesn't die. It just takes maybe 30hs. Make sure it's not a bad cvs binary you got there (latest from gentoo seems to leak memory). And if it's still dying... give us some more details ;-) cheers, martin -
After reading the whole thread on this, I'm using a git checkout of git, cvsps-2.1 and cvs-1.11.12, running overnight in verbose mode with screen. Hopefully will have a repo in the morning ;) -
Good stuff. I am rerunning it to prove (and bench) a complete and uninterrupted import. So far it's done 4hs 30m, footprint grown to 207MB, 49750 commits. So I think it will be done in approx 30hs on this single-cpu opteron. Most commits are small, but there is a handful that are downright massive -- and we hold the whole file list in memory, which I think explains (most of) the memory growth. I've looked into avoiding holding the whole filelist in memory, but it involves rewriting the cvsps output parsing loop, which is better left for a rainy day, with a test case that doesn't take 30hs to resolve. cheers, martin -
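The "approx 30hs" estimate can be checked with simple proportion, assuming the commit rate stays constant (which the massive commits later in the history may well break) and taking the 343822 patchsets reported earlier in the thread as the total. `eta_hours` is a hypothetical helper name.

```shell
#!/bin/sh
# eta_hours DONE TOTAL ELAPSED_HOURS: estimated total run time in hours,
# assuming the number of commits processed per hour stays constant.
eta_hours() {
    awk -v n="$1" -v total="$2" -v hours="$3" \
        'BEGIN { printf "%.1f\n", total / n * hours }'
}

# 49750 commits done after 4.5 hours, out of 343822 patchsets.
eta_hours 49750 343822 4.5    # prints 31.1
```

That is 343822 / 49750, or about 6.9 times the elapsed 4.5 hours, which lands close to the figure quoted above.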
Ok the box this was running on had issues, so I switched to using pearl.amd64.dev.gentoo.org, a dual core amd64 X2 4600+ with 4 gigs of ram and plenty of disk. The "problem" now is just conversion time...30 hours and I'm into 2004-09-17...but it's been in 2004 all day, seems like most of the commits are in the last three years. Are there architectural issues with doing this in parallel? Since the repository commits are all in cvs, it should be possible to do the work in parallel, since you know what all the commits touch. The concern would be ordering of nodes in the tree; you'd end up building a bunch of subtrees and patching them together? -Alec Warner -
I don't think you can do this in parallel. What I would do is remove the -a from the git-repack invocation. It does hurt import times quite a bit -- just do a git-repack -a -d when it's done. And... having said that, there is still a memory leak somehow, somewhere. It's been evading me for 2 weeks now, so I feel an idiot now. Not too bad in general, but it shows clearly in the gentoo and Well... parsecvs does a bit of this but in sequential fashion... it imports all the files first, and then runs through the history building the tree+commits in order, committing them. It saves a lot of time in the file imports by parsing the RCS file directly. The downside is that it must keep a filename+version=>sha1 mapping -- which I think is why parsecvs won't fit in memory until it's changed to store it on disk somehow ;-) You are forced to do it in a sequence because cvsps only tells you about the files added/removed/changed in a commit -- you need the ancestor to have a view of what the whole tree looked like. The only room for parallelism I see is to fork off new processes to work on branches in parallel. martin -
Only repack at the end then? disk space isn't an issue here so I'll give

30565 antarus   17   0  470m  456m 1640 S   14 11.6 234:23.38 git-cvsimport
30566 antarus   16   0 6753m  147m  752 S    7  3.7 120:27.06 cvs

Not helpful in the Gentoo case, since we only have one branch; minus an accident when a dev branched gentoo-x86 a while back ;)

I'll keep chugging on this one; it won't be the final import as I haven't used the complete Authors file, so I will try the repacking optimization next time I do an import.

-Alec Warner -
On Sun, 04 Jun 2006 22:36:44 -0400 Hi Alec, You may want to go back and do another import for other reasons, but if the only reason is to fix up the author information it would be _much_ faster to simply rewrite the git commit history. Cogito has something called "cg-admin-rewritehist" which should do what you need and there are other scripts floating around specifically for rewriting just the author information. HTH, Sean -
Not exactly -- by removing the -a from the git-repack invocation what you get is cheap "partial" packing rather than a full repack. This is somewhat inefficient disk-wise, perhaps by 10% or so. But full repacks get more and more expensive as the repo grows. So you don't need to run git-repack -a -d at the end, but it will be a

Cool. If it dies for any reason, just do

	git-update-ref refs/heads/master refs/heads/origin
	git-update-ref HEAD origin
	git-checkout

You only need to do this the first time -- after that, the core heads are set. Rerun the script and it will pick up where it left. If it dies again, just do git-checkout to see the latest files.

(Above, replace origin with your -o option if you are using it. I normally use -o cvshead.)

martin -
Sounds like you had the "git repack -a -d" thing in your cvsimport. The current git rev-list should use only about a third of the memory of the one you used, so hopefully you could just update your git version, and then continue with the "git cvsimport" without having to start all over. Linus -
That would mean that you do have Linus' patch then. Grep cvsimport for repack and remove the -a -- and consider using his recent patch to rev-list. My dmesg talks about an earlier cvs segfault. Nasty tree you have here -- it's breaking all sorts of things... and teaching us a thing or two Hmmm? How can you be at patch 249100 and still be a good year ahead of me? Have you told cvsps to cut off old history? Another thing I found is that this import uses a lot of $TMPDIR, so if your TMPDIR is small, you'll hit all sorts of problems. cheers, martin -
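Since a small TMPDIR can derail the import, it may be worth checking the free space up front. A minimal sketch, assuming a POSIX `df -P` whose second output line carries the available space in the fourth column; `tmp_free_kb` is a hypothetical helper, and the 1 GB threshold is an arbitrary placeholder, not a figure from this thread.

```shell
#!/bin/sh
# tmp_free_kb: print the free space, in kB, on the filesystem that
# holds $TMPDIR (falling back to /tmp when TMPDIR is unset).
tmp_free_kb() {
    df -Pk "${TMPDIR:-/tmp}" | awk 'NR == 2 { print $4 }'
}

# Warn when less than about 1 GB is free (placeholder threshold).
free_kb=$(tmp_free_kb)
if [ "$free_kb" -lt 1048576 ]; then
    echo "warning: only ${free_kb} kB free in ${TMPDIR:-/tmp}" >&2
fi
```

Pointing TMPDIR at a roomier disk before starting the import would be the obvious follow-up when the warning fires.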
You certainly would think so, and I did as well, but available evidence indicates otherwise. I'm not sure how the repack got in there.

donnie@supernova ~ $ type git-cvsimport
git-cvsimport is /usr/bin/git-cvsimport
donnie@supernova ~ $ grep repack /usr/bin/git-cvsimport
donnie@supernova ~ $

All I can think of is that I somehow OOM'd when I manually ran a repack and didn't notice it. But that should've at least made me unable to

Nope. I ran the exact cvsps flags you posted earlier to create it.

Thanks,
Donnie
Sounds likely -- and cvsimport restarts gracefully, though you might want to do git checkout HEAD to get a usable checkout if the very first import failed. However, the default head is master, and what you want to look at is origin or whatever you passed as your -o parameter. I use cvshead normally, so I do Oh, that was an earlier PEBKAK at my end: I did git log HEAD instead of git log cvshead. My import is now at 293145 (cvshead +0000 2005-12-25 12:24:42) which looks promising. cheers, martin -
What version of cvs are you using? Perhaps trying a different one? The dev machine where I am running the import is a slug! It's still working on it, only gotten to 7700 commits, with the cvsimport process stable at 28MB RAM and cvs stable at 4MB. cheers, martin -
I have to say, that cvsimport script really does do horrible things. It's basically a fork/exec/exit benchmark, as far as I can tell.

Running oprofile on the thing, the top offenders are (ignore the 45% idle thing: it's just because this was run on a dual-cpu system, so since it's almost completely single-threaded you get ~50% idle by default).

3117654  45.8708  vmlinux         vmlinux         .power4_idle
 802313  11.8046  vmlinux         vmlinux         .unmap_vmas
 632913   9.3122  vmlinux         vmlinux         .copy_page_range
 150359   2.2123  vmlinux         vmlinux         .release_pages
 131330   1.9323  vmlinux         vmlinux         .vm_normal_page
 117836   1.7337  libperl.so      libperl.so      (no symbols)
  74098   1.0902  libgklayout.so  libgklayout.so  (no symbols)
  54680   0.8045  vmlinux         vmlinux         .free_pages_and_swap_cache
  54300   0.7989  libfb.so        libfb.so        (no symbols)
  49052   0.7217  vmlinux         vmlinux         .copy_4K_page
  46559   0.6850  libc-2.4.so     libc-2.4.so     getc
  42677   0.6279  vmlinux         vmlinux         .page_remove_rmap
  41133   0.6052  libc-2.4.so     libc-2.4.so     ferror

.. those kernel functions are all about process create/exit, and COW faulting after the fork.

Now, this is on ppc, so process creation is likely slower (idiotic PPC VM page table hashes), but Linux is actually very good at doing this, and the fact that process create/exit is so high is a very big sign that the script just ends up executing a _ton_ of small simple processes that do almost nothing.

I wonder why those "git-update-index" calls seem to be (assuming I read the perl correctly) done only a few files at a time. We can do hundreds in one go, but it see...
Ahh. stracing the CVS server seems to imply that it forks off a subprocess for every command. It doesn't actually execute any external program, but just does a fork + muck around in the ,v files + exit. Maybe one of the changes in the 1.12.x versions is to not do that, which might explain why Donnie seems to see much better performance, but also sees all the memory leakage? Linus -
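The cost of starting one process per file, as opposed to handing many files to a single process, can be illustrated with a stand-in command. This sketch only counts process starts; `sh -c 'echo started'` is a placeholder for the real per-file command, and the 300 file names are made up.

```shell
#!/bin/sh
# Count how many times a command gets started for 300 file names,
# first with one process per name, then batched through xargs.
names=$(awk 'BEGIN { for (i = 1; i <= 300; i++) print "file" i }')

# One process per file: 300 separate fork+exec pairs.
per_file=$(for f in $names; do sh -c 'echo started' sh "$f"; done | wc -l)

# Batched: xargs packs as many names as fit onto each command line.
batched=$(echo "$names" | xargs sh -c 'echo started' sh | wc -l)

echo "per-file runs: $((per_file)), batched runs: $((batched))"
```

With short names like these the batched count should be far below 300, since each extra run is only needed once a command line fills up.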
Hi,
No, fifty.
The beast *was* mainly written to do this remotely...
--
Matthias Urlichs | {M:U} IT Design @ m-u-it.de | smurf@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
- -
The worst form of inequality is to try to make unequal things equal.
-- Aristotle
-
I don't think the remote usability is valid, except for some really small repositories. The fact that it takes hours even when the CVS server is local doesn't bode well for doing it remotely for any but the most trivial things. I really think it would be better to have local use be the optimized case, with remote being the "it's _possible_" case. Linus -
I really don't think that using the local cvs binary is a problem at all. In my experience, the thing is fairly fast and optimized when you ask it to perform file-oriented questions and that's all we do, really. If you want to try it, you'll see that local checkouts of large trees (like this gentoo one) are fairly fast. Not as fast as GIT itself, but good enough. I think Donnie has hit a bug with a bad version of cvs, but other than that, my experience with it is that it is fairly well behaved -- even if the tool is bad, ubiquity has led to resiliency

Agreed, but I think we won't see much benefit in direct parsing. And we'll have to take the hit of double-implementation. In any case, we have it already -- parsecvs does it quite well (modulo memory leaks!) and I've used it several times in conjunction with cvsimport. Just perform the initial import with parsecvs and then 'track' the remote project with cvsimport. The problem is that they lead to slightly different trees. So their output is not consistent, and I don't think that'll be easy to fix.

cheers, martin -
Fair enough. My worry was mainly that the cvs server was doing something stupid, but I suspect most of the fork/exec's are probably from the I didn't get parsecvs working when I tried it a long time ago, and Donnie reported that it ran out of memory, so I didn't even really consider it. I'd love for it to work well, and it may be reasonable to do really big imports on multi-gigabyte 64-bit machines (after all, they aren't _hard_ to find any more, and you only need to do it once). That said, it still seems pretty stupid to require that much memory just to import from CVS. Linus -
Sorry! s/trees/histories/ there. The trees are (or should be!) the same, and tree differences should be addressed as bugs. Differences in how history is parsed are unavoidable right now. martin -
I think cvsimport predates that option, but these days that loop can be optimized by feeding --index-info from standard input. -
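As a rough illustration of what feeding `--index-info` looks like: each input record has the form `<mode> SP <sha1> TAB <path>`, so many files can be registered through one process. The sketch below only generates such records; `index_info_line` is a hypothetical helper, the object name is a dummy placeholder, and the paths are borrowed from the import log earlier in the thread.

```shell
#!/bin/sh
# index_info_line MODE SHA1 PATH: emit one record in the
# "<mode> SP <sha1> TAB <path>" form read by git update-index --index-info.
index_info_line() {
    printf '%s %s\t%s\n' "$1" "$2" "$3"
}

# Build records for two files, using a dummy 40-digit object name.
blob=0123456789abcdef0123456789abcdef01234567
{
    index_info_line 100644 "$blob" sys-kernel/ck-sources/ChangeLog
    index_info_line 100644 "$blob" sys-kernel/ck-sources/Manifest
}
# In a real import the group above would be piped into a single
#   git update-index --index-info
# process instead of starting one process per handful of files.
```

Whether this fits cvsimport's commit loop as it stands is not something the sketch settles; it only shows the input format.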
Oh, yep, that'd be a good addition. I think we can also cut down on the number of fork+exec calls (as Linus points out they are killing us) by caching some data we should already have that we are repeatedly asking from git-rev-parse.

Other TODOs from my reading of the code last night...

- Switch from line-oriented reads to block reads when fetching files from CVS. This gentoo repo has some large binary blobs in it and we end up slurping them into memory.
- Stop abusing globals in commit() -- pass the commit data as parameters.
- Further profiling? Whatever we are doing, we aren't doing it fast :(

Will be trying to do those things in the next few days, don't mind if someone jumps in as well.

martin -
This patch is relatively simple, and I'll post it in a moment. I also made a few other cleanups to commit() which apply on top of that; Some of the globals actually get modified in commit() (e.g., @old and @new get cleared). So we need to either pass them in as references or remember to do that cleanup each time it is called (which is really only I can look at the line/block CVS file slurping, but not tonight. -Peff -
This change attempts to clean up the commit function to make it a bit
easier to read (or at least the first half of it). It also improves
robustness and performance. Specifically:
- report get_headref errors on opening ref unless the error is ENOENT
- use regex to check for sha1 instead of length
- use lexically scoped filehandles which get cleaned up automagically
- check for error on both 'print' and 'close' (since output is buffered)
- avoid "fork, do some perl, then exec" in commit(). It's not necessary,
and we probably end up COW'ing parts of the perl process. Plus the code
is much smaller because we can use open2()
- avoid calling strftime over and over (mainly a readability cleanup)
---
I know this patch is quite large. I can try to split it if you want, but
I suspect it's not worth the effort (either you like refactoring or you
don't :) ).
9dc9f05ab5e1cbd8765238e7b1da0addd6f4296a
git-cvsimport.perl | 150 ++++++++++++++++++++++------------------------------
1 files changed, 64 insertions(+), 86 deletions(-)
9dc9f05ab5e1cbd8765238e7b1da0addd6f4296a
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index 4efb0a5..f8feb52 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -23,7 +23,7 @@ use File::Basename qw(basename dirname);
use Time::Local;
use IO::Socket;
use IO::Pipe;
-use POSIX qw(strftime dup2);
+use POSIX qw(strftime dup2 :errno_h);
use IPC::Open2;
$SIG{'PIPE'}="IGNORE";
@@ -429,22 +429,25 @@ sub getwd() {
return $pwd;
}
+sub is_sha1 {
+ my $s = shift;
+ return $s =~ /^[a-zA-Z0-9]{40}$/;
+}
-sub get_headref($$) {
+sub get_headref ($$) {
my $name = shift;
my $git_dir = shift;
- my $sha;
- if (open(C,"$git_dir/refs/heads/$name")) {
- chomp($sha = <C>);
- close(C);
- length($sha) == 40
- or die "Cannot get head id for $name ($sha): $!\n";
+ my $f = "$git_dir/refs/heads/$name";
+ if(open(my $fh, $f)) {
+ chomp(my $r = <$fh>);
+ is_sha...
Why run "env" and not just muck with %ENV? -
Oops, that's an obvious fork optimization that I should have caught.
Patch is below. Note that this will now affect the environment of all
sub-processes, but it shouldn't matter since we reset it right before
commit. However, if anyone is worried, we can stash the old %ENV in
another hash temporarily.
-Peff
PS What is the preferred format for throwing patches into replies like
this? Putting the patch at the end (as here) or throwing the reply
comments in the ignored section near the diffstat?
---
cvsimport: set up commit environment in perl instead of using env
---
44c4a9f67322302ca49146a7c143c07ea67da366
git-cvsimport.perl | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)
44c4a9f67322302ca49146a7c143c07ea67da366
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index 41ee9a6..83d7d3c 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -618,14 +618,13 @@ sub commit {
}
my $commit_date = strftime("+0000 %Y-%m-%d %H:%M:%S",gmtime($date));
+ $ENV{GIT_AUTHOR_NAME} = $author_name;
+ $ENV{GIT_AUTHOR_EMAIL} = $author_email;
+ $ENV{GIT_AUTHOR_DATE} = $commit_date;
+ $ENV{GIT_COMMITTER_NAME} = $author_name;
+ $ENV{GIT_COMMITTER_EMAIL} = $author_email;
+ $ENV{GIT_COMMITTER_DATE} = $commit_date;
my $pid = open2(my $commit_read, my $commit_write,
- 'env',
- "GIT_AUTHOR_NAME=$author_name",
- "GIT_AUTHOR_EMAIL=$author_email",
- "GIT_AUTHOR_DATE=$commit_date",
- "GIT_COMMITTER_NAME=$author_name",
- "GIT_COMMITTER_EMAIL=$author_email",
- "GIT_COMMITTER_DATE=$commit_date",
'git-commit-tree', $tree, @commit_args);
# compatibility with git2cvs
--
-Are you two talking about how running git-commit-tree via env costs two fork-execs instead of just one? Does that have a measurable difference? Not that I have anything against the updated code, but I do not
You could do it either way. Although I personally find the former easier to read (it meshes well with the "do not top post" mantra), it appears many other people find that the cover letter material should come after the first '---' separator. If you append the patch to your message, btw, you need to realize that the receiving end has to edit your message to remove the top part before running "git am" to apply it. -
Yes, that's what I was talking about. No, probably not a huge difference. I did some performance measurements of all of the recent cvsimport changes on a small-ish personal repo (I don't have the gentoo repo). The results were not significant (<= 1% improvement for each change). I would expect some of the changes (index-info, fetchfile) to have an impact on a repo with different characteristics (like the gentoo one). -Peff -
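For anyone worried about the modified %ENV leaking into other sub-processes, one way to keep the change scoped is Perl's local, which restores the old values when the enclosing block exits. This is only a sketch under that assumption, not what the patch does; run_with_env is a hypothetical helper, and the child here is just another perl process standing in for git-commit-tree.

```perl
use strict;
use warnings;

# Hypothetical helper: run a child with extra environment variables that
# disappear again when the sub returns, without going through "env".
sub run_with_env {
    my ($env, @cmd) = @_;
    # local on a hash slice: old values (or absence) come back on scope exit.
    local @ENV{keys %$env} = values %$env;
    open(my $fh, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    my $out = do { local $/; <$fh> };
    close $fh or die "child failed: $?";
    return $out;
}

my $name = run_with_env(
    { GIT_AUTHOR_NAME => 'A U Thor' },
    $^X, '-e', 'print $ENV{GIT_AUTHOR_NAME}',
);
print "child saw: $name\n";
print "parent has GIT_AUTHOR_NAME: ",
    (exists $ENV{GIT_AUTHOR_NAME} ? "yes" : "no"), "\n";
```

This still costs only one fork+exec per command, the same as assigning to %ENV directly.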
Jeff, good stuff -- aiming at exactly the things that had been nagging me. Given that we have that -- should we remember it and avoid re-reading the headref from disk? A %seenheads cache would save us 99.9% of the hassle. In related news, I've dealt with file reads from the socket being memory-bound. Should merge ok. cheers, martin -
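A minimal sketch of the %seenheads idea, assuming refs live as plain files under refs/heads as in the patch. cached_headref and the $disk_reads counter are hypothetical names used only for this illustration, and a temporary directory stands in for a real $git_dir.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

my %seenheads;        # branch name => sha1 already read from disk
my $disk_reads = 0;   # counts how often we really opened the ref file

# Hypothetical wrapper around a get_headref-style lookup.
sub cached_headref {
    my ($name, $git_dir) = @_;
    return $seenheads{$name} if exists $seenheads{$name};
    my $f = "$git_dir/refs/heads/$name";
    open(my $fh, '<', $f) or return;   # missing ref: nothing to cache
    $disk_reads++;
    chomp(my $sha1 = <$fh>);
    close $fh;
    return $seenheads{$name} = $sha1;
}

# Throwaway directory standing in for a real $git_dir.
my $git_dir = tempdir(CLEANUP => 1);
make_path("$git_dir/refs/heads");
open(my $out, '>', "$git_dir/refs/heads/master") or die "cannot write ref: $!";
print {$out} 'a' x 40, "\n";
close $out or die "cannot close ref: $!";

cached_headref('master', $git_dir) for 1 .. 3;
print "disk reads: $disk_reads\n";
```

A real importer would also have to update the cache entry whenever it writes a new commit to a branch; otherwise the cached value goes stale.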
Hmm. Is it just me, or does the current "git cvsimport" have new problems:
[torvalds@merom git]$ git cvsimport -d ~/CVS gentoo-x86
causes
Committing initial tree 34bd3dcd4bfd79bad35ce3fb08b2e21108195db8
Server has gone away while fetching BUGS-TODO 1.1, retrying...
Retry failed at /home/torvalds/bin/git-cvsimport line 366, <GEN2656> line 9.
and that's it for the import. I don't see what would have caused it in the changes, but it definitely worked earlier.. Linus -
Martin, that problem seems to go away when I initialize $res to 0 in _fetchfile. I don't know perl, and maybe local variables are pre-initialized to empty. It's entirely possible that the fact that it now seems to work for me is purely timing-related, since I also ended up using "-P cvsps-output" to avoid having a huge cvsps binary in memory at the same time. Linus "perl illiterate" Torvalds -
Strange! Cannot repro here with v5.8.8 (debian/etch 5.8.8-4) but
initialising it doesn't hurt, so let's do it:
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index ace7087..abbfd0b 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -371,7 +371,7 @@ sub file {
}
sub _fetchfile {
my ($self, $fh, $cnt) = @_;
- my $res;
+ my $res = 0;
my $bufsize = 1024 * 1024;
while($cnt) {
if ($bufsize > $cnt) {
cheers,
martin
-I can reproduce with debian perl 5.8.8-4. The bug is only triggered by 0-length files, so presumably your test repo doesn't have any. -Peff -
Given that we are all working off the gentoo repo here, it means that my machine is slower than Linus' unreleased Intel box. And that I am too impatient... In any case, the fix is correct as Junio points out. cheers, martin -
When a new file that is empty is created, sub _line would call sub _fetchfile with $cnt == 0, and it can return $res which is initialized to 'undef'. That explains why sub file says $self->_line() returned an undef and I think what you did is the right fix. -
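The effect described above can be reproduced in isolation. The following is a simplified stand-in for a chunked read loop, not the real _fetchfile; count_bytes and its $init parameter are invented so that both the old and the fixed initialization can be tried side by side.

```perl
use strict;
use warnings;

# Read $cnt bytes from $in in $bufsize pieces and return how many were read.
# $init plays the role of the initial value of $res in the loop under discussion.
sub count_bytes {
    my ($in, $cnt, $init) = @_;
    my $res = $init;          # undef mimics "my $res;", 0 mimics the fix
    my $bufsize = 4;
    while ($cnt) {
        $bufsize = $cnt if $bufsize > $cnt;
        my $got = read($in, my $buf, $bufsize);
        die "short read" unless $got;
        $res += $got;
        $cnt -= $got;
    }
    return $res;              # loop never ran for $cnt == 0
}

my $data = "hello world";
open(my $in, '<', \$data) or die "cannot open in-memory file: $!";

my $full  = count_bytes($in, 11, undef);
my $empty = count_bytes($in, 0, undef);
my $fixed = count_bytes($in, 0, 0);

print "11 bytes, uninitialized: $full\n";
print "0 bytes, uninitialized:  ", (defined $empty ? $empty : 'undef'), "\n";
print "0 bytes, initialized:    $fixed\n";
```

For non-empty files the uninitialized variable is harmless, since += treats undef as 0; only the zero-length case skips the loop body and hands undef back to the caller.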
Meaning...? Perl5 can pass only one flat array, so the above is Merged OK, and I think your last suggestion makes sense. I'll go to bed after pushing out Jeff's two patches and yours. -
Of course. I had actually missed the closing quotes, and thought the error msg wanted to talk about POSIX. 'twas late in the day, it seems. I'll look into caching headrefs tonight if no one beats me to it. martin -
This should reduce the number of git-update-index forks required per
commit. We now do adds/removes in one call, and we are no longer forced to
deal with argv limitations.
---
This is a repost using -z/NUL instead of line feeds.
d82d215430ae5e79210f73a31f5f8a053f36c27f
git-cvsimport.perl | 36 +++++++++++++-----------------------
1 files changed, 13 insertions(+), 23 deletions(-)
d82d215430ae5e79210f73a31f5f8a053f36c27f
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index d257e66..a65bea6 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -565,29 +565,19 @@ my($patchset,$date,$author_name,$author_
my(@old,@new,@skipped);
sub commit {
my $pid;
- while(@old) {
- my @o2;
- if(@old > 55) {
- @o2 = splice(@old,0,50);
- } else {
- @o2 = @old;
- @old = ();
- }
- system("git-update-index","--force-remove","--",@o2);
- die "Cannot remove files: $?\n" if $?;
- }
- while(@new) {
- my @n2;
- if(@new > 12) {
- @n2 = splice(@new,0,10);
- } else {
- @n2 = @new;
- @new = ();
- }
- system("git-update-index","--add",
- (map { ('--cacheinfo', @$_) } @n2));
- die "Cannot add files: $?\n" if $?;
- }
+
+ open(my $fh, '|-', qw(git-update-index -z --index-info))
+ or die "unable to open git-update-index: $!";
+ print $fh
+ (map { "0 0000000000000000000000000000000000000000\t$_\0" }
+ @old),
+ (map { '100' . sprintf('%o', $_->[0]) . " $_->[1]\t$_->[2]\0" }
+ @new)
+ or die "unable to write to git-update-index: $!";
+ close $fh
+ or die "unable to write to git-update-index: $!";
+ $? and die "git-update-index reported error: $?";
+ @old = @new = ();
$pid = open(C,"-|");
die "Cannot fork: $!" unless defined $pid;
--
1.3.3.g3408
-This change attempts to clean up the commit function to make it a bit
easier to read (or at least the first half of it). It also improves
robustness and performance. Specifically:
- report get_headref errors on opening ref unless the error is ENOENT
- use regex to check for sha1 instead of length
- use lexically scoped filehandles which get cleaned up automagically
- check for error on both 'print' and 'close' (since output is buffered)
- avoid "fork, do some perl, then exec" in commit(). It's not necessary,
and we probably end up COW'ing parts of the perl process. Plus the code
is much smaller because we can use open2()
- avoid calling strftime over and over (mainly a readability cleanup)
---
This is a repost with some minor fixups from Junio (and based off of the
fixed 1/2 patch).
3408c8d8364f816a7c4a34a03045f466bf028540
git-cvsimport.perl | 150 ++++++++++++++++++++++------------------------------
1 files changed, 64 insertions(+), 86 deletions(-)
3408c8d8364f816a7c4a34a03045f466bf028540
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index a65bea6..219f6dc 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -23,7 +23,7 @@ use File::Basename qw(basename dirname);
use Time::Local;
use IO::Socket;
use IO::Pipe;
-use POSIX qw(strftime dup2);
+use POSIX qw(strftime dup2 :errno_h);
use IPC::Open2;
$SIG{'PIPE'}="IGNORE";
@@ -429,22 +429,25 @@ sub getwd() {
return $pwd;
}
+sub is_sha1 {
+ my $s = shift;
+ return $s =~ /^[a-f0-9]{40}$/;
+}
-sub get_headref($$) {
+sub get_headref ($$) {
my $name = shift;
my $git_dir = shift;
- my $sha;
- if (open(C,"$git_dir/refs/heads/$name")) {
- chomp($sha = <C>);
- close(C);
- length($sha) == 40
- or die "Cannot get head id for $name ($sha): $!\n";
+ my $f = "$git_dir/refs/heads/$name";
+ if(open(my $fh, $f)) {
+ chomp(my $r = <$fh>);
+ is_sha1($r) or die "Cannot get head id for $name ($r): $!";
+ return $r...