Re: [PATCH 2/2] cvsimport: cleanup commit function

To: Git Mailing List <git@...>
Subject: irc usage..
Date: Saturday, May 20, 2006 - 1:26 pm

I hate irc.

I'm reading the irc logs, and seeing that people have problems, but (a) it 
was while I was asleep and (b) irc use doesn't encourage people to 
actually explain what the problems _are_, so I have no clue.

So now I know that "spyderous" has problems importing some 1GB gentoo CVS 
archive, but that's pretty much it. Grr.

Are people afraid to post to git@vger.kernel.org, or what?

I saw that people tried to suggest posting to the git mailing list, but 
can any of you who are active on irc be a bit more forceful? And perhaps 
we don't make this mailing list address well enough known? 

As far as I'm aware, the git mailing list isn't closed, so people should 
be able to post here without even subscribing. I can well understand that 
you might not want to subscribe and prefer to look over the list through
some archive setup (the way I look at the irc logs), and maybe we should 
just make the git mailing list address more obvious.

Right now, the "community" page at http://git.or.cz/community.html doesn't 
even mention the git mailing list address directly, it just tells you how 
you can subscribe and read the archives.

Can we perhaps fix that, and the people who are active on irc please also 
make it clear to people that if they have some real problems that don't 
get an immediate answer, the git mailing list ends up where a lot of 
people can actually look more closely at it.. And tell them what the 
address is.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: <git@...>
Date: Saturday, May 20, 2006 - 1:50 pm

I hate irc, too.  Number of times easily solvable usage problems
come up and I look at the log to realize when the solutions
suggested were waaaaay suboptimal it is too late (with loops
being quite active recently things have improved a lot, but we
should not expect him to be 24/7).

Maybe somebody can run a dumb 'bot that notices somebody said
something that ends with a '?' and there is no activity there
for N minutes and inject a recorded message that reminds the
mailing list address ;-).
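
A rough sketch of the detection half of such a bot, assuming a hypothetical log format of one "epoch nick text" line per message in time order. The function name and the format are illustrative only; a real bot would watch the channel live and post the reminder itself.

```shell
# List questions that got no follow-up within a quiet period.
# stdin: log lines "epoch nick text..." in time order (hypothetical format)
# $1:    quiet period in seconds
unanswered_questions() {
    awk -v quiet="$1" '
        # Any activity inside the quiet period cancels the pending question.
        pending != "" && $1 - asked < quiet { pending = "" }
        # Otherwise the channel stayed silent long enough: report it.
        pending != "" { print pending; pending = "" }
        # A line ending in "?" becomes the new pending question.
        /\?$/ { pending = $0; asked = $1 }
        # A question still pending at end of log is reported too.
        END { if (pending != "") print pending }
    '
}
```

Each line it prints is a point where the recorded message with the mailing list address could be injected.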

-
To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Saturday, May 20, 2006 - 4:39 pm

FWIW, I have mentioned a problem that may be the same, under
Message-ID <20060107090148.GB32585@nowhere.earth>, that was on January
7th.  Namely, when importing a repository with very large files over
pserver or ssh, timeouts can occur and prevent the import from
working.  But, as you said, it's not easy to get precise info from the
logs :)

Best regards,
-- 
Yann Dirson    <ydirson@altern.org> |
Debian-related: <dirson@debian.org> |   Support Debian GNU/Linux:
                                    |  Freedom, Power, Stability, Gratis
     http://ydirson.free.fr/        | Check <http://www.debian.org/>
-
To: Yann Dirson <ydirson@...>
Cc: Git Mailing List <git@...>
Date: Sunday, May 21, 2006 - 9:45 pm

For big repositories, you really shouldn't use pserver or ssh anyway. You 
should try really really hard to just get a local copy, and do it that 
way. It's going to be tons faster, and will avoid a lot of the problems, 
including network timeouts etc.

		Linus
-
To: Yann Dirson <ydirson@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Saturday, May 20, 2006 - 6:18 pm

Hi all,

I just subscribed and this post is the only one I've got from the
thread, so I'm responding to it instead of the original. Gentoo's an
IRC-based community, so I tend to try IRC first for any problems I have
and fall back to the list later if I can't get things figured out.

Here's a rough summary:

Our main repo is actually a bit over 2G (2103621223) now that I check,
but it's not very complex. There's actually just one branch, and I don't
think anyone would care if we lost the history from it because it's a
release branch from a few years ago.

Somebody else tried importing it with git-cvsimport, but he said he hit
some kind of problem and recalled that it was a cvsps segfault. Sounds
about right, since I've never gotten cvsps to run successfully on the
whole repo either.

I tried with parsecvs, but it runs into OOM even on a machine with 4G
RAM after reading in all the ,v files, presumably while it's building
some huge tree of changesets in memory. Keith Packard's suggested that
there are ways to reduce parsecvs's memory use, because it retains the
full tree in memory for each revision rather than just the files that
actually changed. But my C skills are pretty weak; I'm an OK reader but
not much of a writer yet.

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Yann Dirson <ydirson@...>, Git Mailing List <git@...>
Date: Saturday, May 20, 2006 - 6:45 pm

Can you point to it? I'm not a CVS user, but I've played with cvsps before 
(to get it to work), and I'm a humanitarian - rescuing people from CVS is 
to me not just a good idea, it's a moral imperative.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Yann Dirson <ydirson@...>, Git Mailing List <git@...>
Date: Saturday, May 20, 2006 - 7:12 pm

I don't want to post the link publicly for a few reasons, including the
huge amount of bandwidth it would suck up for lots of people to download
it. I've sent it to you off-list, and if anyone else would also like it,
please drop me a note.

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Yann Dirson <ydirson@...>, Git Mailing List <git@...>
Date: Sunday, May 21, 2006 - 3:24 pm

Ok. It's still converting (that's a big archive), but it has passed the 
cvsps stage without errors for me, and the conversion so far seems ok. But 
it has only gotten to 

	Author: vapier <vapier>  2002-09-23 12:32:42
	Changed GPL to GPL-2 in LICENSE and updated SRC_URI to use mirror:

so it has converted only slightly more than the first two years of 
history in the roughly 30 minutes I've let it run. So it will take several 
hours.

The reason it works for me is likely simply the fact that I had a few 
patches to my cvsps already. I'm appending the stupid patches, I'm not 
guaranteeing that they are correct at all, although the three _committed_ 
patches are almost certainly correct (and the last uncommitted one is 
almost certainly totally broken). The patches are against clean cvsps 2.1.

Also, when I say "the conversion so far seems ok", I obviously don't 
actually know what the hell the archive is supposed to look like, so I can 
only say that the end result seems not totally insane.

To do a good conversion, you'll want to make sure that you have an author 
name conversion file. See the "-A" flag in "git help cvsimport" (if you 
have the man-pages installed).
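
As a minimal sketch of what such a file looks like: each line maps a CVS login to a full name and e-mail address in the form "login=Full Name <email>". The logins, names and addresses below are made up, and both helper functions are hypothetical. Without a mapping, an author shows up the way the log excerpt above does, as "vapier <vapier>".

```shell
# Write a sample author map (all entries are hypothetical).
write_authors_file() {
    cat > "$1" <<'EOF'
alice=Alice Example <alice@example.org>
bob=Bob Example <bob@example.org>
EOF
}

# Print any line that is not of the form login=Full Name <email>.
# Exit status is 0 only when every line is well-formed.
check_authors_file() {
    ! grep -v -E '^[^=]+=.+ <[^<>]+>$' "$1"
}
```

The file is then handed to the import with -A, as in "git cvsimport -A Authors ...".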

		Linus

---
commit 534120d9a47062eecd7b53fd7ac0b70d97feb4fd
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Wed Mar 22 11:20:59 2006 -0800

    Increase log-length limit to 64kB
    
    Yeah, it should be dynamic. I'm lazy.
---
 cvsps_types.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/cvsps_types.h b/cvsps_types.h
index b41e2a9..dba145d 100644
--- a/cvsps_types.h
+++ b/cvsps_types.h
@@ -8,7 +8,7 @@ #define CVSPS_TYPES_H
 
 #include <time.h>
 
-#define LOG_STR_MAX 32768
+#define LOG_STR_MAX 65536
 #define AUTH_STR_MAX 64
 #define REV_STR_MAX 64
 #define MIN(a, b) ((a) < (b) ? (a) : (b))


commit 82fcf7e31bbeae3b01a8656549e9b8fd89d598eb
Author: Linus Torvalds <torvalds@g5.osdl.org>
Date:   Wed Mar 22 11:23:37 2006 -0800

   ...
To: Donnie Berkholz <spyderous@...>
Cc: Yann Dirson <ydirson@...>, Git Mailing List <git@...>
Date: Sunday, May 21, 2006 - 11:59 pm

Btw, trying this import (which got interrupted by a thunderstorm and one 
of our first power failures in a long time - just a few seconds, but 
enough to power off everything but my laptops) it became very obvious that 
"git cvsimport" really _really_ should re-pack the archive every once in a 
while.

The old "repack every month or so" approach doesn't work that well when 
you try to import several years of history in a few hours.

Now, you can just repack after the whole thing is done (it will probably 
take no more than ~15 minutes or so), but it would probably be best if the 
import script itself decided to repack every once in a while just to avoid 
wasting a lot of diskspace _during_ the import itself.

So this isn't so much a correctness issue as a "avoid wasting time and 
space" issue, but still..

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Yann Dirson <ydirson@...>, Git Mailing List <git@...>
Date: Monday, May 22, 2006 - 12:19 am

Fortunately the storms haven't been that bad down in Corvallis. cvsps
also worked fine for me, but git-cvsimport broke in the middle. The
command I'm using is 'git-cvsimport -P ../gentoo.cvsps -k -d
/media/scm_comparison -A ~/dev/Authors -v gentoo-x86 | tee cvsimport.log'

Here's the last bits:

Fetching gnome-base/gnome-applets/gnome-applets-1.4.0.4-r1.ebuild   v 1.5
Update gnome-base/gnome-applets/gnome-applets-1.4.0.4-r1.ebuild: 947 bytes
Fetching gnome-base/gnome-applets/gnome-applets-1.4.0.4-r2.ebuild   v 1.3
Update gnome-base/gnome-applets/gnome-applets-1.4.0.4-r2.ebuild: 977 bytes
Fetching gnome-base/gnome-applets/gnome-applets-2.0.0-r1.ebuild   v 1.2
Update gnome-base/gnome-applets/gnome-applets-2.0.0-r1.ebuild: 2704 bytes
Fetching gnome-base/gnome-applets/gnome-applets-2.0.0.ebuild   v 1.2
Update gnome-base/gnome-applets/gnome-applets-2.0.0.ebuild: 3031 bytes
Tree ID 4d19a84efce2de9cfb42ac0397e0036bbed2ad65
Parent ID ecb78bbe30369a76e2599d0d17de8fe922dca211
Committed patch 14615 (origin 2002-07-16 20:13:15)
Commit ID 4dd2179e0c1369e07cd268fb5c8b150c3a2a1094
Delete net-fs/openafs/openafs-1.2.2-r6.ebuild
Delete net-fs/openafs/files/digest-openafs-1.2.2-r6
Tree ID bfc7320883983655d7d2ea2c6d04f85b45365ce1
Parent ID 4dd2179e0c1369e07cd268fb5c8b150c3a2a1094
Committed patch 14616 (origin 2002-07-16 20:15:15)
Commit ID 7a36de9c4c9b93337ed789ae2341cad3d0991c6d
Unknown: error  Cannot allocate memory
Fetching profiles/package.mask   v 1.992
cat: write error: Broken pipe

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Martin Langhoff <martin@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 12:50 am

Hmm. It's actually possible that it did that for me too - I had put the 
cvsimport in an xterm and forgotten about it, and just assumed that the 
power failure was what broke it. But maybe it had broken down before that 

Hmm. I don't actually know perl, and my original "cvsimport" script was 
actually this funny C program that generated a shell script to do the 
import. That worked fine, and had no memory leaks, but it was a truly 
hacky thing of horrible beauty. Or rather, it _would_ have been that, if 
it had had any beauty to be horrible about. But at least I would have been 
able to debug it.

But the perl one I can't parse any more. That said, the whole "Unknown:" 
printout seems to come from the subroutine "_line()", which just reads a 
line from the cvs server.

Did you do a "top" at any time just before this all happened? It _sounds_ 
like it might actually be a memory leak on the CVS server side, and the 
problem may (or may not) be due to the optimization that keeps a single 
long-running CVS server instance for the whole process.

I wouldn't be in the least surprised if that ends up triggering a slow 
leak in CVS itself, and then CVS runs out of memory.

That would likely have been obvious in any "top" output just before the 
failure.

Smurf, Martin, Dscho.. Any ideas? My old script just ran RCS directly on 
the files, and had no issues like that. I'll happily admit that my old 
script generator thing was horrible, but it was a lot easier to debug than 
the smarter perl script that uses a CVS server connection..

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 3:42 am

Running a few tests right now. Looks like cvs (Debian/etch 1.12.9-13)
itself is not leaking any memory. The Perl (Debian/etch
5.8.7-something and now 5.8.8-4) process OTOH is visibly allocating
memory. Starts off at 4MB and gets up to ~17MB by the time it has done
6K commits.

I am trying to figure out whether the leak is in the script or in the
Perl implementation, using PadWalk, Devel::Leak and friends. If the

Or a slow leak in Perl? The 5.8.8 release notes do talk about some
leaks being fixed, but this 5.8.8 isn't making a difference.

Working on it.



martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 5:13 am

Thanks. Looking at what I did convert, that horrid gentoo CVS tree is 
interesting. The resulting (partial) git history has 93413 commits and 
850,000+ objects total, all in a totally linear history.

And that's just up to April 2004, so the full tree is probably a million 
objects.

The good news is that git seems to handle that size repo no problem at 
all. The repack did indeed take a long while, but it packed it all down to 
a 189MB pack-file (and 20MB pack index).

Considering that the bzip2'd tar-file of the CVS history was 157MB, and 
the actual CVS footprint was about 1.6GB, if git stays at under a quarter 
gigabyte for the whole archive once converted (which sounds likely, 
counting indexing), git would basically cut down the disk usage for a live 
repo by a factor of 7 or so.
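
The arithmetic behind that estimate can be checked with a few lines of shell. This is only a sketch: it takes 1.6GB as 1600MB, and the helper name is made up.

```shell
# Print cvs_mb / git_mb with one decimal place, using integer arithmetic.
shrink_factor() {
    tenths=$(( $1 * 10 / $2 ))
    echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

shrink_factor 1600 209   # 189MB pack + 20MB index so far: prints 7.6
shrink_factor 1600 250   # the "quarter gigabyte" projection: prints 6.4
```

The "factor of 7 or so" sits between the measured partial figure and the projected one.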

_And_ I can do a "git log origin > /dev/null" in about 2.4 seconds. Take 
that, CVS.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 8:54 am

Ok, so there's 3 patches posted that should help narrow down the
problem. There's a new -L <limit> so that Donnie can get his stuff done
by running it in a while(true) loop. Not proud of it, but hey.
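
One way such a loop could know when to stop, sketched under the assumption that the imported branch head stops moving once nothing is left to import. The function name and the idea of passing a "step" and a "probe" command are illustrative, not part of the posted patches.

```shell
# Re-run a batch step until the probe output stops changing.
# $1: command that imports one batch (run via sh -c)
# $2: command that prints the current state, e.g. the branch head
run_until_stable() {
    step=$1
    probe=$2
    prev="__unset__"
    while :; do
        # Give up if a batch fails outright.
        sh -c "$step" || return 1
        cur=$(sh -c "$probe")
        # No progress since the last batch: the import is complete.
        if [ "$cur" = "$prev" ]; then
            return 0
        fi
        prev=$cur
    done
}
```

Here the step would be the git-cvsimport invocation with -L added, and the probe something that prints the head of the import branch. Because each batch runs in a fresh process, any memory leaked during one batch is released before the next begins.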

And there are two patches that I suspect may fix the leak. After
applying them, the cvsimport process grows up to ~13MB and then tapers
off, at least as far as my patience has gotten me. It's late on this
side of the globe so I'll look at the results tomorrow morning.

(BTW, I typo-ed Linus' address in the git-send-email invocation. Will
resend to him separately)

I'll also prep a patch as Linus suggests to do auto-repacking while

Heh. Faster Gitticat, Kill Kill Kill!




martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 1:27 pm

Ok, initial results are promising. git-cvsimport appears to be still 
slowly growing, but it's at 40M (ie pretty tiny, considering that cvsps 
grew to 800+MB on this archive) and growth seems to actually be slowing.

My conversion is only up to September 2002, but if it doesn't suddenly hit 
some huge growth spurt, I wouldn't expect it to run out of memory. The CVS 
server process itself is tiny, and doesn't seem to grow at all.

As to packing, doing something like

	while :
	do
		sleep 30

		#
		# repack roughly every 25600 objects
		#
		n=$(ls .git/objects/00 2> /dev/null | wc -l)
		if [ "$n" -gt 100 ]; then
			git repack -a
			#
			# Stupid sleep to make sure that nobody is still
			# using any unpacked objects after the pack got
			# generated
			#
			sleep 10
			git prune-packed
		fi
	done

or similar (the above is totally untested - I've just done it by hand a 
few times) should work. It's perfectly ok to repack the archive even while 
the cvsimport script is adding more data and changing it.
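
The "roughly every 25600 objects" figure follows from how loose objects are stored: they are spread over 256 directories named after the first byte of the SHA-1, so the count in .git/objects/00 is about 1/256 of the total, and 100 files there suggests 100 * 256 = 25600 loose objects. A small sketch of that estimate (the function name is made up):

```shell
# Estimate the number of loose objects from one fan-out directory.
# $1: path to the .git directory
estimate_loose_objects() {
    # A missing directory simply counts as zero objects.
    n=$(ls "$1/objects/00" 2>/dev/null | wc -l)
    echo $(( n * 256 ))
}
```

Sampling a single directory keeps the check cheap, which matters when it runs every 30 seconds alongside the import.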

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 3:46 pm

That's great news. The cvs archive seems to have large commits every
once in a while, so I suspect the residual memory growth may be
related to those. Or to a smaller leak I haven't nailed.

My test box is bloody slow it seems. I'll try and get hold of a faster

Given that we are running batch, it is safe and simple to stop the
import, repack, prune-packed, and keep going. Don't think we'll win
any races by running it in parallel ;-)

cheers,


martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 3:09 pm

OK, I started a new run without -L, and I'm watching it in top right
now. The cvsimport seems to be doing alright, but the cvs server process
sucks about another megabyte of virtual every 4-5 seconds. This is a bit
concerning since I don't have any swap. Shortly after it hit 670M, I got
"Cannot allocate memory" again. I've got a gig of RAM, and around 300M
was resident in various processes at the time.
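
For anyone who wants numbers instead of eyeballing top, here is a minimal sketch that reads virtual and resident size for one process. It assumes a Linux /proc filesystem, and the function name is made up.

```shell
# Print "VmSize VmRSS" in kB for the process with PID $1 (Linux /proc only).
mem_kb() {
    awk '/^VmSize:/ { v = $2 }
         /^VmRSS:/  { r = $2 }
         END        { print v, r }' "/proc/$1/status"
}

# Possible use: log one sample every 5 seconds while the process lives.
#   while kill -0 "$pid" 2>/dev/null; do
#       echo "$(date +%s) $(mem_kb "$pid")"
#       sleep 5
#   done >> mem.log
```

A log like that makes it easy to tell a process whose virtual size keeps climbing from one whose resident size does.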

So it seems the problem is in cvs itself. I will try another run with -L
now.

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Martin Langhoff <martin.langhoff@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 3:38 pm

Hmm. My cvs server doesn't really grow at all. It's at 13M RSS.

What version of cvs are you running?

	[torvalds@g5 ~]$ cvs --version

	Concurrent Versions System (CVS) 1.11.21 (client/server)

maybe that matters.

(but my import is only up to Jun 22, 2003 so far).

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Martin Langhoff <martin.langhoff@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 3:49 pm

Yeah, that's the thing. RSS stayed about the same (according to top),

Concurrent Versions System (CVS) 1.12.12 (client/server)

Looks like there's a .13 out but the zlib interaction is badly broken
(-z >=1) so my system didn't get upgraded. I'll try it anyway after the
-L run finishes.

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Martin Langhoff <martin.langhoff@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 4:20 pm

Not for me. The virtual size is certainly bigger than RSS, but not by a 
huge amount. So this might be a regression in CVS, since you seem to have 
a newer version than I do.

The latest stable CVS release is 1.11.21, I think: you seem to be running 
the "development" version (1.12.x).

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Martin Langhoff <martin.langhoff@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 5:48 pm

Backed down to the 1.11 series, things seem to be going fine so far.

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Linus Torvalds <torvalds@...>, Martin Langhoff <martin.langhoff@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 29, 2006 - 5:54 pm

Finally hit an OOM sometime in the past day (yep, a week later) =\. Not
sure whether it was cvsimport or cvs. Anyone else had more luck?

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 29, 2006 - 6:21 pm

It seemed like it had finished on the machine I was running it, and I
assumed it was alright in yours too. Looking closer it only made it
till April 2004 -- but it may have been killed by a sysadmin, the
captured log talks about 'signal 9', I have no idea what the OOM
sends.

It had done 285070 of 343822 patchsets.

Have you dropped the -a from the git-repack invocation? That should
help. Try also Linus' patch for git-rev-list. The other thing hurting
us is that the commits are _huge_. I wonder how you guys were managing
this with CVS. Now _this_ explains why cvsimport grows humongous.

I'll try to rework the commit loop so that we don't need to hold all
the filenames in memory. It seems to be choking with the commits after
April 2004. But that will have to wait till tonight.

cheers,



martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 29, 2006 - 6:32 pm

Looking closer, I see that the memory suckers do appear to be git, from
dmesg:

Out of Memory: Kill process 17230 (git-repack) score 97207 and children.
Out of memory: Killed process 17231 (git-rev-list).

Just ends like this:

Tree ID 2cc632e5e1d3a430a2cc891bf33c4a12f19a4d0e
Parent ID ad92d7073a52458e0581633bbd8ccbbec838d9e6
Committed patch 249100 (origin 2005-08-20 05:05:58)
Commit ID 28941f00d714f57ab49f1fd725d1c3ce8a5d0b93
Fetching sys-kernel/ck-sources/ChangeLog   v 1.113
Update sys-kernel/ck-sources/ChangeLog: 25425 bytes
Fetching sys-kernel/ck-sources/Manifest   v 1.164
Update sys-kernel/ck-sources/Manifest: 252 bytes
Delete sys-kernel/ck-sources/ck-sources-2.6.12_p5-r1.ebuild
Fetching sys-kernel/ck-sources/ck-sources-2.6.12_p6.ebuild   v 1.1
New sys-kernel/ck-sources/ck-sources-2.6.12_p6.ebuild: 1438 bytes
Delete sys-kernel/ck-sources/files/digest-ck-sources-2.6.12_p5-r1
Fetching sys-kernel/ck-sources/files/digest-ck-sources-2.6.12_p6   v 1.1
New sys-kernel/ck-sources/files/digest-ck-sources-2.6.12_p6: 279 bytes
Can't fork at /usr/bin/git-cvsimport line 592, <CVS> line 3810053.

I wasn't running with a version that did repacks; I just suspended the

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Tuesday, May 30, 2006 - 6:31 pm

With the latest cvsimport in Junio's repo, a lot of RAM and a bit of patience...

  gitview
  http://git.catalyst.net.nz/gitweb?p=gentoo.git;a=summary

  fetchable
  http://git.catalyst.net.nz/git/gentoo.git#cvshead

Still pushing it, will be there in a minute or so. The packed repo
weighs about 660MB. Not too bad given the size of the project and the
number of commits.


martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Tuesday, May 30, 2006 - 7:07 pm

Heh. I think you should enable caching in your apache config. 

And maybe we should make that part of the gitweb docs. Without a caching 
web-server, gitweb is pretty slow, but it caches _beautifully_.

That gentoo repo has a lot of "duplicate" commits that cvsps will mark as 
two separate commits because there's one commit for the files, and one 
commit for whatever the "Manifest" file is. I wonder if those commits 
should generally be merged or something. 

That said, things like that are most easily fixed as a git->git update 
(along with adding name translation), which can avoid re-writing the 
trees.

			Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Tuesday, May 30, 2006 - 9:04 pm

I know I should -- but I'm hoping to find the time to rework gitweb a
bit to actually work fast instead. It bothers me that it is so slow on
a basically idle machine, and where I can perform the corresponding
git operations in the commandline in a blink.

And caching is great for really busy sites (aka kernel.org) but
git.catalyst.net.nz only serves a handful of small repos for a small

Yep, large projects often have good reasons to run custom imports,
merging certain commits, rewriting log messages (like the X.org guys
were doing). It can be done at the cvsimport stage or later -- I think
Pasky has a rewritehistory tool hidden somewhere in Cogito, but I
haven't used it.

cheers,


martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>, Alec Warner <antarus@...>
Date: Tuesday, May 30, 2006 - 10:49 pm

We've got a guy who got a Summer of Code project to work on CVS
migration, so this could be something along his lines.

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>, Alec Warner <antarus@...>
Date: Wednesday, May 31, 2006 - 2:05 am

He'll want a fast box to wrangle with this repo ;-)


martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Donnie Berkholz <spyderous@...>, Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Wednesday, May 31, 2006 - 9:54 am

I have a dual opteron with 4gb of ram "on loan" from work :)

It still dies though, using git cvsimport or parsecvs.

I talked to Keith Packard about adding support to parsecvs for recording 
the actual changed changesets, but I haven't yet started on implementing 
that since he isn't using cvsps in parsecvs.

I also haven't had a chance to look at the git-cvsimport sources yet, 
was hoping to get to that later this week.
-
To: <antarus@...>
Cc: Donnie Berkholz <spyderous@...>, Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Wednesday, May 31, 2006 - 6:03 pm

The machine I am running this is more constrained than that, and it
doesn't die. It just takes maybe 30hs. Make sure it's not a bad cvs
binary you got there (latest from gentoo seems to leak memory).

And if it's still dying... give us some more details ;-)

cheers,


martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Donnie Berkholz <spyderous@...>, Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Wednesday, May 31, 2006 - 9:42 pm

After reading the whole thread on this, I'm using a git checkout of 
git, cvsps-2.1 and cvs-1.11.12, running overnight in verbose mode with 
screen.  Hopefully will have a repo in the morning ;)
-
To: <antarus@...>
Cc: Donnie Berkholz <spyderous@...>, Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Thursday, June 1, 2006 - 3:47 am

Good stuff. I am rerunning it to prove (and bench) a complete and
uninterrupted import. So far it's done 4hs 30m, footprint grown to
207MB, 49750 commits. So I think it will be done in approx 30hs on
this single-cpu opteron.

Most commits are small, but there is a handful that are downright
massive -- and we hold all the file list in memory, which I think
explains (most of) the memory growth. I've looked into avoiding
holding the whole filelist in memory, but it involves rewriting the
cvsps output parsing loop, which is better left for a rainy day, with
a test case that doesn't take 30hs to resolve.

cheers,



martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Donnie Berkholz <spyderous@...>, Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Sunday, June 4, 2006 - 8:33 pm

Ok the box this was running on had issues, so I switched to using 
pearl.amd64.dev.gentoo.org, a dual core amd64 X2 4600+ with 4 gigs of 
ram and plenty of disk.  The "problem" now is just conversion time...30 
hours and I'm into 2004-09-17...but it's been in 2004 all day, seems 
like most of the commits are in the last three years.  Are there 
architectural issues with doing this in parallel?

Since the repository commits are all in cvs, it should be possible to do 
the work in parallel, since you know what all the commits touch.  The 
concern would be ordering of nodes in the tree; you'd end up building a 
bunch of subtrees and patching them together?

-Alec Warner
-
To: <antarus@...>
Cc: Donnie Berkholz <spyderous@...>, Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Sunday, June 4, 2006 - 10:06 pm

I don't think you can do this in parallel. What I would do is remove
the -a from the git-repack invocation. It does hurt import times quite
a bit -- just do a git-repack -a -d when it's done.

And... having said that, there is still a memory leak somehow,
somewhere. It's been evading me for 2 weeks now, so I feel an idiot
now. Not too bad in general, but it shows clearly in the gentoo and

Well... parsecvs does a bit of this but in sequential fashion... it
imports all the files first, and then runs through the history
building the tree+commits in order, committing them. It saves a lot of
time in the file imports by parsing the RCS file directly. The
downside is that it must keep a filename+version=>sha1 mapping --
which I think is why parsecvs won't fit in memory until it's changed
to store it on disk somehow ;-)

You are forced to do it in a sequence because cvsps only tells you
about the files added/removed/changed in a commit -- you need the
ancestor to have a view of what the whole tree looked like. The only
room for parallelism I see is to fork off new processes to work on
branches in parallel.
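
To illustrate why the order matters, here is a toy sketch in which a "tree" is a file of "path rev" lines and a patchset is a file of "M path rev" or "D path" lines. Both formats and the function name are invented for the example and are not what cvsps or git-cvsimport actually use; paths containing spaces are not handled.

```shell
# Apply one patchset to a parent tree listing and print the child tree.
# $1: tree file with lines "path rev"
# $2: patchset file with lines "M path rev" (add/modify) or "D path" (delete)
apply_patchset() {
    awk '
        # Lines from the first file load the parent tree.
        FILENAME == ARGV[1] { tree[$1] = $2; next }
        # Lines from the second file are the changes in this patchset.
        $1 == "M" { tree[$2] = $3 }
        $1 == "D" { delete tree[$2] }
        END { for (p in tree) print p, tree[p] }
    ' "$1" "$2" | sort
}
```

Every call needs the output of the previous one as its first argument, so patchsets on one branch cannot be applied out of order or side by side.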



martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Donnie Berkholz <spyderous@...>, Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Sunday, June 4, 2006 - 10:36 pm

Only repack at the end then? disk space isn't an issue here so I'll give 

30565 antarus   17   0  470m 456m 1640 S   14 11.6 234:23.38
git-cvsimport
30566 antarus   16   0 6753m 147m  752 S    7  3.7 120:27.06 cvs


Not helpful in the Gentoo case, since we only have one branch; minus an 
accident when a dev branched gentoo-x86 a while back ;)

I'll keep chugging on this one; it won't be the final import as I 
haven't used the complete Authors file, so I will try the repacking 
optimization next time I do an import.

-Alec Warner
-
To: <antarus@...>
Cc: <martin.langhoff@...>, <spyderous@...>, <torvalds@...>, <ydirson@...>, <git@...>, <smurf@...>, <Johannes.Schindelin@...>
Date: Monday, June 5, 2006 - 12:07 pm

On Sun, 04 Jun 2006 22:36:44 -0400

Hi Alec,

You may want to go back and do another import for other reasons, but if
the only reason is to fix up the author information it would be _much_
faster to simply rewrite the git commit history.  Cogito has something
called "cg-admin-rewritehist" which should do what you need and there
are other scripts floating around specifically for rewriting just the
author information.

HTH,
Sean
-
To: <antarus@...>
Cc: Donnie Berkholz <spyderous@...>, Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Sunday, June 4, 2006 - 11:49 pm

Not exactly -- by removing the -a from the git-repack invocation what
you get is cheap "partial" packing rather than a full repack. This is
somewhat inefficient disk-wise, perhaps by 10% or so. But full repacks
get more and more expensive as the repo grows.

So you don't need to run git-repack -a -d at the end, but it will be a


Cool. If it dies for any reason, just do

  git-update-ref refs/heads/master refs/heads/origin
  git-update-ref HEAD origin
  git-checkout

You only need to do this the first time -- after that, the core heads
are set. Rerun the script and it will pick up where it left off. If it
dies again, just do git-checkout to see the latest files.

(Above, replace origin with your -o option if you are using it. I
normally use -o cvshead.)



martin
-
To: Donnie Berkholz <spyderous@...>
Cc: Martin Langhoff <martin.langhoff@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 29, 2006 - 8:43 pm

Sounds like you had the "git repack -a -d" thing in your cvsimport.

The current git rev-list should use only about a third of the memory of 
the one you used, so hopefully you could just update your git version, and 
then continue with the "git cvsimport" without having to start all over.

		Linus
-
To: Donnie Berkholz <spyderous@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 29, 2006 - 8:19 pm

That would mean that you do have Linus' patch then. Grep cvsimport for
repack and remove the -a -- and consider using his recent patch to
rev-list.

My dmesg talks about an earlier cvs segfault. Nasty tree you have here
-- it's breaking all sorts of things... and teaching us a thing or two

Hmmm? How can you be at patch 249100 and still be a good year ahead of
me? Have you told cvsps to cut off old history?

Another thing I found is that this import uses a lot of $TMPDIR, so if
your TMPDIR is small, you'll hit all sorts of problems.

cheers,



martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Tuesday, May 30, 2006 - 1:31 am

You certainly would think so, and I did as well, but available evidence
indicates otherwise. I'm not sure how the repack got in there.

donnie@supernova ~ $ type git-cvsimport
git-cvsimport is /usr/bin/git-cvsimport
donnie@supernova ~ $ grep repack /usr/bin/git-cvsimport
donnie@supernova ~ $

All I can think of is that I somehow OOM'd when I manually ran a repack
and didn't notice it. But that should've at least made me unable to

Nope. I ran the exact cvsps flags you posted earlier to create it.

Thanks,
Donnie
To: Donnie Berkholz <spyderous@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Tuesday, May 30, 2006 - 2:01 am

Sounds likely -- and cvsimport restarts gracefully, though you might want to do

   git checkout HEAD

to get a usable checkout if the very first import failed. However, the
default head is master, and what you want to look at is origin or
whatever you passed as your -o parameter. I use cvshead normally, so I
do


Oh, that was an earlier PEBKAK at my end: I did git log HEAD instead
of git log cvshead. My import is now at  293145 (cvshead +0000
2005-12-25 12:24:42) which looks promising.

cheers,


martin
-
To: Donnie Berkholz <spyderous@...>
Cc: Linus Torvalds <torvalds@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 3:41 pm

What version of cvs are you using? Perhaps trying a different one?

The dev machine where I am running the import is a slug! It's still
working on it, only gotten to 7700 commits, with the cvsimport process
stable at 28MB RAM and cvs stable at 4MB.

cheers,


martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 4:11 pm

I have to say, that cvsimport script really does do horrible things. It's 
basically a fork/exec/exit benchmark, as far as I can tell. Running 
oprofile on the thing, the top offenders are (ignore the 45% idle thing: 
it's just because this was run on a dual-cpu system, so since it's almost 
completely single-threaded you get ~50% idle by default).

	3117654  45.8708  vmlinux                  vmlinux                  .power4_idle
	802313   11.8046  vmlinux                  vmlinux                  .unmap_vmas
	632913    9.3122  vmlinux                  vmlinux                  .copy_page_range
	150359    2.2123  vmlinux                  vmlinux                  .release_pages
	131330    1.9323  vmlinux                  vmlinux                  .vm_normal_page
	117836    1.7337  libperl.so               libperl.so               (no symbols)
	74098     1.0902  libgklayout.so           libgklayout.so           (no symbols)
	54680     0.8045  vmlinux                  vmlinux                  .free_pages_and_swap_cache
	54300     0.7989  libfb.so                 libfb.so                 (no symbols)
	49052     0.7217  vmlinux                  vmlinux                  .copy_4K_page
	46559     0.6850  libc-2.4.so              libc-2.4.so              getc
	42677     0.6279  vmlinux                  vmlinux                  .page_remove_rmap
	41133     0.6052  libc-2.4.so              libc-2.4.so              ferror
	..

those kernel functions are all about process create/exit, and COW faulting 
after the fork.

Now, this is on ppc, so process creation is likely slower (idiotic PPC VM 
page table hashes), but Linux is actually very good at doing this, and the 
fact that process create/exit is so high is a very big sign that the 
script just ends up executing a _ton_ of small simple processes that do 
almost nothing.

I wonder why those "git-update-index" calls seem to be (assuming I read 
the perl correctly) done only a few files at a time. We can do hundreds 
in one go, but it see...
To: Martin Langhoff <martin.langhoff@...>
Cc: Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Matthias Urlichs <smurf@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 4:33 pm

Ahh. stracing the CVS server seems to imply that it forks off a subprocess 
for every command. It doesn't actually execute any external program, but 
just does a fork + muck around in the ,v files + exit.

Maybe one of the changes in the 1.12.x versions is to not do that, which 
might explain why Donnie seems to see much better performance, but also 
sees all the memory leakage?

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Martin Langhoff <martin.langhoff@...>, Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 5:41 pm

Hi,


No, fifty.


The beast *was* mainly written to do this remotely...

-- 
Matthias Urlichs   |   {M:U} IT Design @ m-u-it.de   |  smurf@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
 - -
The worst form of inequality is to try to make unequal things equal.
					-- Aristotle
To: Matthias Urlichs <smurf@...>
Cc: Martin Langhoff <martin.langhoff@...>, Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 6:18 pm

I don't think the remote usability is valid, except for some really small 
repositories. The fact that it takes hours even when the CVS server is 
local doesn't bode well for doing it remotely for any but the most trivial 
things.

I really think it would be better to have local use be the optimized case, 
with remote being the "it's _possible_" case.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Matthias Urlichs <smurf@...>, Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 7:23 pm

I really don't think that using the local cvs binary is a problem at
all. In my experience, the thing is fairly fast and optimized when you
ask it file-oriented questions, and that's all we do, really.

If you want to try it, you'll see that local checkouts of large trees
(like this gentoo one) are fairly fast. Not as fast as GIT itself, but
good enough. I think Donnie has hit a bug with a bad version of cvs,
but other than that, my experience with it is that it is fairly well
behaved -- even if the tool is bad, ubiquity has led to resiliency

Agreed, but I think we won't see much benefit in direct parsing. And
we'll have to take the hit of double-implementation.

In any case, we have it already -- parsecvs does it quite well (modulo
memory leaks!) and I've used it several times in conjunction with
cvsimport. Just perform the initial import with parsecvs and then
'track' the remote project with cvsimport.

The problem is that they lead to slightly different trees. So their
output is not consistent, and I don't think that'll be easy to fix.

cheers,


martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Matthias Urlichs <smurf@...>, Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 7:33 pm

Fair enough. My worry was mainly that the cvs server was doing something 
stupid, but I suspect most of the fork/exec's are probably from the 

I didn't get parsecvs working when I tried it a long time ago, and Donnie 
reported that it ran out of memory, so I didn't even really consider it. 
I'd love for it to work well, and it may be reasonable to do really big 
imports on multi-gigabyte 64-bit machines (after all, they aren't _hard_ 
to find any more, and you only need to do it once).

That said, it still seems pretty stupid to require that much memory just 
to import from CVS.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Matthias Urlichs <smurf@...>, Donnie Berkholz <spyderous@...>, Yann Dirson <ydirson@...>, Git Mailing List <git@...>, Johannes Schindelin <Johannes.Schindelin@...>
Date: Monday, May 22, 2006 - 7:29 pm

Sorry! s/trees/histories/ there. The trees are (or should be!) the
same, and tree differences should be addressed as bugs. Differences in
how history is parsed are unavoidable right now.

martin
-
To: Matthias Urlichs <smurf@...>
Cc: <git@...>
Date: Monday, May 22, 2006 - 6:39 pm

I think cvsimport predates that option, but these days that loop
can be optimized by feeding --index-info from standard input.
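
For a sense of scale, the batching loop under discussion can be modelled as below. The thresholds (55/50 for removals, 12/10 for adds) are taken from the cvsimport code quoted in the index-info patch in this thread; the function name is made up, and the sketch is Python rather than cvsimport's Perl.

```python
def batched_calls(n_items, threshold, batch):
    """Count invocations made by a loop that peels off `batch` items while
    more than `threshold` remain, then sends the rest in one final call."""
    calls = 0
    while n_items > 0:
        if n_items > threshold:
            n_items -= batch   # one call handling `batch` items
        else:
            n_items = 0        # one last call handling whatever is left
        calls += 1
    return calls


# A commit that adds 1000 files, at 10 per call
print(batched_calls(1000, 12, 10))   # 100
# The same commit removing 1000 files, at 50 per call
print(batched_calls(1000, 55, 50))   # 20
```

Feeding the same entries over standard input would need a single process per commit regardless of how many files it touches.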

-
To: Junio C Hamano <junkio@...>
Cc: Matthias Urlichs <smurf@...>, <git@...>
Date: Monday, May 22, 2006 - 7:15 pm

Oh, yep, that'd be a good addition. I think we can also cut down on
the number of fork+exec calls (as Linus points out they are killing
us) by caching some data that we should already have but keep
requesting from git-rev-parse.

Other TODOs from my reading of the code last night...

 - Switch from line-oriented reads to block reads when fetching files
from CVS. This gentoo repo has some large binary blobs in it and
we end up slurping them into memory.

 - Stop abusing globals in commit() -- pass the commit data as parameters.

 - Further profiling? Whatever we are doing, we aren't doing it fast :(

Will be trying to do those things in the next few days, don't mind if
someone jumps in as well.



martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 2:52 am

This patch is relatively simple, and I'll post it in a moment.

I also made a few other cleanups to commit() which apply on top of that;

Some of the globals actually get modified in commit() (e.g., @old and
@new get cleared).  So we need to either pass them in as references or
remember to do that cleanup each time it is called (which is really only

I can look at the line/block CVS file slurping, but not tonight.

-Peff
-
To: Martin Langhoff <martin.langhoff@...>, Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 3:00 am

This change attempts to clean up the commit function to make it a bit
easier to read (or at least the first half of it). It also improves
robustness and performance. Specifically:
  - report get_headref errors on opening ref unless the error is ENOENT
  - use regex to check for sha1 instead of length
  - use lexically scoped filehandles which get cleaned up automagically
  - check for error on both 'print' and 'close' (since output is buffered)
  - avoid "fork, do some perl, then exec" in commit(). It's not necessary,
    and we probably end up COW'ing parts of the perl process. Plus the code
    is much smaller because we can use open2()
  - avoid calling strftime over and over (mainly a readability cleanup)

---

I know this patch is quite large. I can try to split it if you want, but
I suspect it's not worth the effort (either you like refactoring or you
don't :) ).

9dc9f05ab5e1cbd8765238e7b1da0addd6f4296a
 git-cvsimport.perl |  150 ++++++++++++++++++++++------------------------------
 1 files changed, 64 insertions(+), 86 deletions(-)

9dc9f05ab5e1cbd8765238e7b1da0addd6f4296a
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index 4efb0a5..f8feb52 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -23,7 +23,7 @@ use File::Basename qw(basename dirname);
 use Time::Local;
 use IO::Socket;
 use IO::Pipe;
-use POSIX qw(strftime dup2);
+use POSIX qw(strftime dup2 :errno_h);
 use IPC::Open2;
 
 $SIG{'PIPE'}="IGNORE";
@@ -429,22 +429,25 @@ sub getwd() {
 	return $pwd;
 }
 
+sub is_sha1 {
+	my $s = shift;
+	return $s =~ /^[a-zA-Z0-9]{40}$/;
+}
 
-sub get_headref($$) {
+sub get_headref ($$) {
     my $name    = shift;
     my $git_dir = shift; 
-    my $sha;
     
-    if (open(C,"$git_dir/refs/heads/$name")) {
-	chomp($sha = <C>);
-	close(C);
-	length($sha) == 40
-	    or die "Cannot get head id for $name ($sha): $!\n";
+    my $f = "$git_dir/refs/heads/$name";
+    if(open(my $fh, $f)) {
+      	    chomp(my $r = <$fh>);
+	    is_sha...
To: Martin Langhoff <martin.langhoff@...>, Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 1:47 pm

Why run "env" and not just muck with %ENV?

-
To: Morten Welinder <mwelinder@...>
Cc: Martin Langhoff <martin.langhoff@...>, Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 4:59 pm

Oops, that's an obvious fork optimization that I should have caught.
Patch is below. Note that this will now affect the environment of all
sub-processes, but it shouldn't matter since we reset it right before
commit. However, if anyone is worried, we can stash the old %ENV in
another hash temporarily.
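
The "stash the old environment" idea looks roughly like this when sketched in Python (the helper name is hypothetical, and this is not the cvsimport code): set the variables for the duration of one commit, then put back whatever was there before.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**overrides):
    """Set environment variables for a block, then restore the old values."""
    # Remember the previous value of each key (None means "was not set")
    saved = {key: os.environ.get(key) for key in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old
```

Any child process started inside the `with temporary_env(...)` block inherits the overrides; processes started afterwards see the original environment. In Python specifically, passing `env=` to `subprocess.run` would avoid touching the parent environment at all.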

-Peff

PS What is the preferred format for throwing patches into replies like
this? Putting the patch at the end (as here) or throwing the reply
comments in the ignored section near the diffstat?

---
cvsimport: set up commit environment in perl instead of using env

---

44c4a9f67322302ca49146a7c143c07ea67da366
 git-cvsimport.perl |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

44c4a9f67322302ca49146a7c143c07ea67da366
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index 41ee9a6..83d7d3c 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -618,14 +618,13 @@ sub commit {
 	}
 
 	my $commit_date = strftime("+0000 %Y-%m-%d %H:%M:%S",gmtime($date));
+	$ENV{GIT_AUTHOR_NAME} = $author_name;
+	$ENV{GIT_AUTHOR_EMAIL} = $author_email;
+	$ENV{GIT_AUTHOR_DATE} = $commit_date;
+	$ENV{GIT_COMMITTER_NAME} = $author_name;
+	$ENV{GIT_COMMITTER_EMAIL} = $author_email;
+	$ENV{GIT_COMMITTER_DATE} = $commit_date;
 	my $pid = open2(my $commit_read, my $commit_write,
-		'env',
-		"GIT_AUTHOR_NAME=$author_name",
-		"GIT_AUTHOR_EMAIL=$author_email",
-		"GIT_AUTHOR_DATE=$commit_date",
-		"GIT_COMMITTER_NAME=$author_name",
-		"GIT_COMMITTER_EMAIL=$author_email",
-		"GIT_COMMITTER_DATE=$commit_date",
 		'git-commit-tree', $tree, @commit_args);
 
 	# compatibility with git2cvs
-- 
-
To: Jeff King <peff@...>
Cc: Morten Welinder <mwelinder@...>, Martin Langhoff <martin.langhoff@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 7:41 pm

Are you two saying that running git-commit-tree via env costs two
fork-execs instead of just one?  Does that make a measurable
difference?

Not that I have anything against the updated code, but I do not

You could do it either way.  Although I personally find the
former easier to read (meshes well with "do not top post"
mantra), it appears many other people find that the cover letter
material should come after the first '---' separator.

If you append the patch to your message, btw, you would need to
realize that the receiving end needs to edit your message to
remove the top part before running "git am" to apply.


-
To: Junio C Hamano <junkio@...>
Cc: Morten Welinder <mwelinder@...>, Martin Langhoff <martin.langhoff@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Wednesday, May 24, 2006 - 5:52 am

Yes, that's what I was talking about. No, probably not a huge
difference. I did some performance measurements of all of the recent
cvsimport changes on a small-ish personal repo (I don't have the gentoo
repo). The results were not significant (<= 1% improvement for each
change).  I would expect some of the changes (index-info, fetchfile) to
have an impact on a repo with different characteristics (like the gentoo
one).

-Peff
-
To: Martin Langhoff <martin.langhoff@...>, Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 4:13 am

Jeff,

good stuff -- aiming at exactly the things that had been nagging me.



Given that we have that -- should we remember it and avoid re-reading
the headref from disk? A %seenheads cache would save us 99.9% of the
hassle.
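
A %seenheads-style cache could be sketched as follows. This is Python for illustration only; the class and method names are assumptions, and `read_ref` stands in for whatever actually reads the ref from disk.

```python
class HeadCache:
    """Remember branch heads so each ref is read from disk at most once."""

    def __init__(self, read_ref):
        self._read_ref = read_ref   # callable: branch name -> sha1 or None
        self._seen = {}

    def get(self, name):
        # Only fall back to the (slow) reader on the first request
        if name not in self._seen:
            self._seen[name] = self._read_ref(name)
        return self._seen[name]

    def set(self, name, sha1):
        # Call after each commit so the cached head never goes stale
        self._seen[name] = sha1
```

The cache is only safe if the importer is the sole writer of those refs during the run, which is assumed here.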

In related news, I've dealt with file reads from the socket being
memory-bound. Should merge ok.

cheers,


martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 12:50 pm

Hmm. Is it just me, or does the current "git cvsimport" have new problems:

	[torvalds@merom git]$ git cvsimport -d ~/CVS gentoo-x86

causes

	Committing initial tree 34bd3dcd4bfd79bad35ce3fb08b2e21108195db8
	Server has gone away while fetching BUGS-TODO 1.1, retrying...
	Retry failed at /home/torvalds/bin/git-cvsimport line 366, <GEN2656> line 9.

and that's it for the import.

I don't see what would have caused it in the changes, but it definitely 
worked earlier..

		Linus
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 3:36 pm

Martin, that problem seems to go away when I initialize $res to 0 in 
_fetchfile. 

I don't know perl, and maybe local variables are pre-initialized to empty. 

It's entirely possible that the fact that it now seems to work for me is 
purely timing-related, since I also ended up using "-P cvsps-output" to 
avoid having a huge cvsps binary in memory at the same time.

		Linus "perl illiterate" Torvalds
-
To: Linus Torvalds <torvalds@...>
Cc: Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 4:29 pm

Strange! Cannot repro here with v5.8.8 (debian/etch 5.8.8-4) but
initialising it doesn't hurt, so let's do it:

diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index ace7087..abbfd0b 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -371,7 +371,7 @@ sub file {
 }
 sub _fetchfile {
        my ($self, $fh, $cnt) = @_;
-       my $res;
+       my $res = 0;
        my $bufsize = 1024 * 1024;
        while($cnt) {
            if ($bufsize > $cnt) {

cheers,


martin
-
To: Martin Langhoff <martin.langhoff@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 5:10 pm

I can reproduce with debian perl 5.8.8-4. The bug is only triggered by
0-length files, so presumably your test repo doesn't have any.

-Peff
-
To: Martin Langhoff <martin.langhoff@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <junkio@...>, Matthias Urlichs <smurf@...>, <git@...>
Date: Tuesday, May 23, 2006 - 5:13 pm

Given that we are all working off the gentoo repo here, it means that
my machine is slower than Linus' unreleased Intel box. And that I am
too impatient...

In any case, the fix is correct as Junio points out.

cheers,


martin
-
To: Linus Torvalds <torvalds@...>
Cc: <git@...>
Date: Tuesday, May 23, 2006 - 4:25 pm

When a new file that is empty is created, sub _line would call
sub _fetchfile with $cnt == 0, and it can return $res which
is initialized to 'undef'.  That explains why sub file says
$self->_line() returned an undef and I think what you did is the
right fix.
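
The failure mode is easy to reproduce in any language where an uninitialized result differs from zero. Below is a Python sketch of a block-wise copy in the spirit of _fetchfile (not a translation of it; the function name and signature are made up) where the counter starts at 0, so a zero-length file returns 0 rather than an undefined value.

```python
def fetch_file(src, dst, cnt, bufsize=1024 * 1024):
    """Copy exactly cnt bytes from src to dst in blocks; return bytes copied."""
    copied = 0  # starts at 0, so an empty file yields 0 rather than None
    while cnt > 0:
        # Never ask for more than what is left of this file
        chunk = src.read(min(bufsize, cnt))
        if not chunk:
            raise EOFError("stream ended with %d bytes still expected" % cnt)
        dst.write(chunk)
        copied += len(chunk)
        cnt -= len(chunk)
    return copied
```

With cnt == 0 the loop body never runs, which is exactly the case where the initial value of the result matters. Reading in bounded blocks also keeps memory use independent of file size.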


-
To: <git@...>
Date: Tuesday, May 23, 2006 - 4:24 am

Meaning...?  Perl5 can pass only one flat array, so the above is

Merged OK, and I think your last suggestion makes sense.  I'll
go to bed after pushing out Jeff's two patches and yours.

-
To: Junio C Hamano <junkio@...>
Cc: <git@...>
Date: Tuesday, May 23, 2006 - 4:32 pm

Of course. I had actually missed the closing quotes, and thought the
error msg wanted to talk about POSIX. 'twas late in the day, seems


I'll look into caching headrefs tonight if no one beats me to it.




martin
-
To: <git@...>
Cc: <martin@...>, <junkio@...>
Date: Tuesday, May 23, 2006 - 3:27 am

This should reduce the number of git-update-index forks required per
commit. We now do adds/removes in one call, and we are no longer forced to
deal with argv limitations.

---

This is a repost using -z/NUL instead of line feeds.
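
For readers unfamiliar with the record format, here is a small Python sketch of the stream this patch writes to `git-update-index -z --index-info`: one NUL-terminated "mode SP sha1 TAB path" record per file, with mode 0 and an all-zero sha1 for removals. The function name and argument shapes are assumptions for illustration, as is encoding paths as UTF-8.

```python
NULL_SHA1 = "0" * 40


def index_info_stream(removed, added):
    """Build input for `git update-index -z --index-info`.

    removed: iterable of paths to drop from the index.
    added:   iterable of (perm, sha1, path), where perm is the permission
             bits as an int, e.g. 0o644.
    """
    # Removals: mode 0 plus the all-zero object id
    records = ["0 %s\t%s\0" % (NULL_SHA1, path) for path in removed]
    # Adds/updates: regular-file mode is "100" followed by octal permissions
    records += ["100%o %s\t%s\0" % (perm, sha1, path)
                for perm, sha1, path in added]
    return "".join(records).encode("utf-8")
```

The resulting bytes could be handed to a single process, for example with `subprocess.run(["git", "update-index", "-z", "--index-info"], input=data, check=True)`. Because records end in NUL rather than a line feed, paths containing spaces or newlines need no quoting.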

d82d215430ae5e79210f73a31f5f8a053f36c27f
 git-cvsimport.perl |   36 +++++++++++++-----------------------
 1 files changed, 13 insertions(+), 23 deletions(-)

d82d215430ae5e79210f73a31f5f8a053f36c27f
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index d257e66..a65bea6 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -565,29 +565,19 @@ my($patchset,$date,$author_name,$author_
 my(@old,@new,@skipped);
 sub commit {
 	my $pid;
-	while(@old) {
-		my @o2;
-		if(@old > 55) {
-			@o2 = splice(@old,0,50);
-		} else {
-			@o2 = @old;
-			@old = ();
-		}
-		system("git-update-index","--force-remove","--",@o2);
-		die "Cannot remove files: $?\n" if $?;
-	}
-	while(@new) {
-		my @n2;
-		if(@new > 12) {
-			@n2 = splice(@new,0,10);
-		} else {
-			@n2 = @new;
-			@new = ();
-		}
-		system("git-update-index","--add",
-			(map { ('--cacheinfo', @$_) } @n2));
-		die "Cannot add files: $?\n" if $?;
-	}
+
+	open(my $fh, '|-', qw(git-update-index -z --index-info))
+		or die "unable to open git-update-index: $!";
+	print $fh 
+		(map { "0 0000000000000000000000000000000000000000\t$_\0" }
+			@old),
+		(map { '100' . sprintf('%o', $_->[0]) . " $_->[1]\t$_->[2]\0" }
+			@new)
+		or die "unable to write to git-update-index: $!";
+	close $fh
+		or die "unable to write to git-update-index: $!";
+	$? and die "git-update-index reported error: $?";
+	@old = @new = ();
 
 	$pid = open(C,"-|");
 	die "Cannot fork: $!" unless defined $pid;
-- 
1.3.3.g3408

-
To: <git@...>
Cc: <martin@...>, <junkio@...>
Date: Tuesday, May 23, 2006 - 3:27 am

This change attempts to clean up the commit function to make it a bit
easier to read (or at least the first half of it). It also improves
robustness and performance. Specifically:
  - report get_headref errors on opening ref unless the error is ENOENT
  - use regex to check for sha1 instead of length
  - use lexically scoped filehandles which get cleaned up automagically
  - check for error on both 'print' and 'close' (since output is buffered)
  - avoid "fork, do some perl, then exec" in commit(). It's not necessary,
    and we probably end up COW'ing parts of the perl process. Plus the code
    is much smaller because we can use open2()
  - avoid calling strftime over and over (mainly a readability cleanup)
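
The first two points can be illustrated with a short Python sketch (not the patch itself, which is Perl). Returning None for a missing ref is an assumption here, since the list above only says ENOENT is not reported as an error.

```python
import os
import re

_SHA1_RE = re.compile(r"[0-9a-f]{40}")


def is_sha1(s):
    # A length check alone would accept any 40 characters; this insists on hex
    return _SHA1_RE.fullmatch(s) is not None


def get_headref(name, git_dir):
    path = os.path.join(git_dir, "refs", "heads", name)
    try:
        with open(path) as fh:
            value = fh.readline().rstrip("\n")
    except FileNotFoundError:
        return None  # ENOENT: the branch simply does not exist yet
    # Any other OSError (permissions, I/O error) propagates to the caller
    if not is_sha1(value):
        raise ValueError("Cannot get head id for %s (%r)" % (name, value))
    return value
```

Separating "no such branch" from "could not read the ref" means a permissions problem or a corrupt ref file is reported instead of being mistaken for a new branch.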

---

This is a repost with some minor fixups from Junio (and based off of the
fixed 1/2 patch).

3408c8d8364f816a7c4a34a03045f466bf028540
 git-cvsimport.perl |  150 ++++++++++++++++++++++------------------------------
 1 files changed, 64 insertions(+), 86 deletions(-)

3408c8d8364f816a7c4a34a03045f466bf028540
diff --git a/git-cvsimport.perl b/git-cvsimport.perl
index a65bea6..219f6dc 100755
--- a/git-cvsimport.perl
+++ b/git-cvsimport.perl
@@ -23,7 +23,7 @@ use File::Basename qw(basename dirname);
 use Time::Local;
 use IO::Socket;
 use IO::Pipe;
-use POSIX qw(strftime dup2);
+use POSIX qw(strftime dup2 :errno_h);
 use IPC::Open2;
 
 $SIG{'PIPE'}="IGNORE";
@@ -429,22 +429,25 @@ sub getwd() {
 	return $pwd;
 }
 
+sub is_sha1 {
+	my $s = shift;
+	return $s =~ /^[a-f0-9]{40}$/;
+}
 
-sub get_headref($$) {
+sub get_headref ($$) {
     my $name    = shift;
     my $git_dir = shift; 
-    my $sha;
     
-    if (open(C,"$git_dir/refs/heads/$name")) {
-	chomp($sha = <C>);
-	close(C);
-	length($sha) == 40
-	    or die "Cannot get head id for $name ($sha): $!\n";
+    my $f = "$git_dir/refs/heads/$name";
+    if(open(my $fh, $f)) {
+      	    chomp(my $r = <$fh>);
+	    is_sha1($r) or die "Cannot get head id for $name ($r): $!";
+	    ret