I was very surprised to find that git-svn does not in fact default to
--repack. I firmly believe it should. Here's an example of why.

I used git-svn to import a repository with 33000 revisions and about
7500 files. It took about 18 hours to import. When it was done,
my .git folder had 242001 files that comprised 2.0GB. I ran
`git gc --aggressive --prune` and let that sit overnight (I wish it was
more verbose; it went for over an hour without printing anything), and
that managed to compress the repo down to 334 files and 64MB.

Now I have to figure out how to delete the .git folder from my regular
backups.

http://skitch.com/kballard/r7mn/results-of-git-gc-ono-macports-repo
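
The before/after figures above (file count and total size of the .git folder) can be reproduced with ordinary tools. The following is only a sketch; `dir_stats` is a hypothetical helper name, not part of git, and it simply counts regular files and reports disk usage in KiB for whatever directory it is given.

```shell
#!/bin/sh
# Hypothetical helper (not a git command): print "<file count> <size in KiB>"
# for a directory such as .git.
dir_stats() {
    # Count regular files below the directory; strip padding some wc versions add.
    files=$(find "$1" -type f | wc -l | tr -d ' ')
    # Total disk usage in KiB.
    kib=$(du -sk "$1" | awk '{print $1}')
    echo "$files $kib"
}
```

Running `dir_stats .git` before and after `git gc` shows the change. Git's own `git count-objects -v` reports similar numbers for loose objects and packs.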
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
I believe so too. And nowadays there's "git gc --auto", which was made
for occasions such as this, so it should be a breeze to implement. The
overhead might be low enough that it can be called after _every_
imported revision.

--
Karl Hasselström, kha@treskal.com
www.treskal.com/kalle
-
Careful. I made the same mistake and it had to be corrected
with e0cd252eb0ba6453acd64762625b004aa4cc162b.

"gc --auto" after every 1000 or so feels like a good default and
I would agree that would be a real fix to a real usability bug.

Patches?
-
I think 1000 might be too high; considering that (at least in my
experience) it takes on the order of 250-500 ms to import a commit,
the gc --auto overhead of maybe 10 ms isn't so bad.

A good compromise might be to run gc --auto after every 10-100
commits, _and_ when the import is done.

However, if gc --auto always takes a lot of time without accomplishing
anything in the presence of too many unreachable loose objects it
might not be a good idea to run it at all, since the use of git-svn

Just hot air and noise for now from my end. Sorry.
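
The compromise described above (run gc --auto every N commits, and once more when the import is done) can be sketched in shell as follows. This is an illustration under assumptions, not git-svn's code: `import_commits`, `GC_CMD` and `GC_PERIOD` are hypothetical names, the per-commit work is left as a comment, and 100 is just a guess at a period.

```shell
#!/bin/sh
# Sketch only: import_commits, GC_CMD and GC_PERIOD are hypothetical names.
GC_CMD=${GC_CMD:-"git gc --auto"}   # command run at each checkpoint
GC_PERIOD=${GC_PERIOD:-100}         # commits between checkpoints (a guess)

import_commits() {
    # $1 = number of commits to import
    total=$1
    gc_nr=$GC_PERIOD
    n=0
    while [ "$n" -lt "$total" ]; do
        n=$((n + 1))
        # ... the real work of importing commit $n would go here ...
        gc_nr=$((gc_nr - 1))
        if [ "$gc_nr" -eq 0 ]; then
            gc_nr=$GC_PERIOD
            $GC_CMD
        fi
    done
    # One more pass when the import is done, so short fetches are covered too.
    $GC_CMD
}
```

With a period of 100, importing 250 commits would run the command after commits 100 and 200 and once at the end. Since gc --auto is meant to return quickly when there is nothing to do, the final call should be cheap for small fetches.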
--
Karl Hasselström, kha@treskal.com
www.treskal.com/kalle
-
Note: CC list pruned as, once again, my Mail client decided to send
the original message as HTML and it got bounced from the list.
Original CC list: kha@treskal.com, gitster@pobox.com

I don't know much about how this works, so if git gc --auto might have
a problem, it seems the simplest fix for now would be to default git-

Same. I don't know Perl. Sorry.
-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Let "git svn" run "git gc --auto" every 100 imported commits, to
reduce the number of loose objects.

To handle the common use case of frequent imports, where each
invocation typically fetches less than 100 commits, randomly set the
counter to something in the range 1-100 on initialization. It's almost
as good as saving the counter, and much less of a hassle.

Oh, and 100 is just my best guess at a reasonable number. It could
conceivably need tweaking.

Signed-off-by: Karl Hasselström <kha@treskal.com>
---
OK, it didn't feel good saying that. So here's my attempt at being a
model citizen. (It's not hard with a change this small ...)

I'm not quite sure how this should interact with the --repack flag.
Right now they just coexist, except for never running right after one
another, but conceivably we should do something cleverer. Eric?

git-svn.perl | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/git-svn.perl b/git-svn.perl
index 9f2b587..89e1d61 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -1247,7 +1247,7 @@ use File::Path qw/mkpath/;
use File::Copy qw/copy/;
use IPC::Open3;

-my $_repack_nr;
+my ($_repack_nr, $_gc_nr, $_gc_period);
# properties that we do not log:
my %SKIP_PROP;
BEGIN {
@@ -1413,6 +1413,8 @@ sub init_vars {
$_repack_nr = $_repack;
$_repack_flags ||= '-d';
}
+ $_gc_period = 100;
+ $_gc_nr = int(rand($_gc_period)) + 1;
}

sub verify_remotes_sanity {
@@ -2157,6 +2159,9 @@ sub do_git_commit {
print "Running git repack $_repack_flags ...\n";
command_noisy('repack', split(/\s+/, $_repack_flags));
print "Done repacking\n";
+ } elsif (--$_gc_nr == 0) {
+ $_gc_nr = $_gc_period;
+ command_noisy('gc', '--auto');
}
return $commit;
}
-
I found 100 was a bit too low when doing some large repos, I've
been using 1000. I'd argue that --repack=1000 should be done by

How about git gc always gets run at the very end of a git svn fetch?
Just a thought.
Harvey
-
I've found 100 for repack too low in the past, too, which is why
repack defaults to 1000 if no number is specified. I think it

I consider --repack out-of-date now that we have gc --auto. I'm in

I'd much prefer that we run gc --auto at the end of every fetch instead
of doing so randomly for small fetches.

--
Eric Wong
-
OK, I'll change it. But remember, gc --auto doesn't do _anything_
unless it's deemed necessary, so it should behave much better than

Will do. What should I do with the repack command-line options? Keep

OK, will do. I'll just have to find a good spot to call it from. Hints
welcome.

--
Karl Hasselström, kha@treskal.com
www.treskal.com/kalle
-
Careful. I made the same mistake and it had to be corrected with
e0cd252eb0ba6453acd64762625b004aa4cc162b.

I think defaulting to --repack=1000 is a sane first step and you
guys already have most code for it so that is a very safe thing.

Switching to "gc --auto" can be done early post 1.5.4, right?
-
Sorry for the latency[1], ack on both of Karl's patches for post-1.5.4.
Here's a conservative change for 1.5.4 (not at all tested):
From dbccd8081c6422569a9ca1211e27f56a24fdf3f3 Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson@yhbt.net>
Date: Mon, 21 Jan 2008 14:37:41 -0800
Subject: [PATCH] git-svn: default to repacking every 1000 commits

This should reduce disk space usage when doing large imports.
We'll be switching to "gc --auto" post-1.5.4 to handle
repacking for us.

Signed-off-by: Eric Wong <normalperson@yhbt.net>
---
git-svn.perl | 8 +++-----
1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/git-svn.perl b/git-svn.perl
index 9f2b587..12745d5 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -1408,11 +1408,9 @@ sub read_all_remotes {
}

sub init_vars {
- if (defined $_repack) {
- $_repack = 1000 if ($_repack <= 0);
- $_repack_nr = $_repack;
- $_repack_flags ||= '-d';
- }
+ $_repack = 1000 unless (defined $_repack && $_repack > 0);
+ $_repack_nr = $_repack;
+ $_repack_flags ||= '-d';
}

sub verify_remotes_sanity {
--
Eric Wong

[1] - I've been busy with other things and will also be traveling
this week, too.
-
Thanks, but I think you need to do something about this part:
2154: if (defined $_repack && (--$_repack_nr == 0)) {
I'd say
if ($_repack && (--$_repack_nr == 0)) {
-
init_vars() is called unconditionally, and always defines $_repack.
It could actually just be:

if (--$_repack_nr == 0) {
--
Eric Wong
-
But that means predecremented --$_repack_nr will count -1, -2, ...
until it wraps around when the user said "--repack=0", meaning
"never repack". Instead you made it "do not repack for many,
many, many rounds".

Which would be perfectly fine in practice but somehow feels a
bit dirty to me.
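
The guarded form suggested earlier in the thread (test the period before decrementing) avoids counting into negative numbers. Here is a small shell sketch of that idea, assuming a period of 0 is read as "never"; `count_repacks` is a hypothetical name and the function only counts how often the periodic action would fire.

```shell
#!/bin/sh
# Hypothetical helper: print how many times a periodic action would fire
# over a number of commits, treating a period of 0 as "never".
count_repacks() {
    period=$1
    commits=$2
    nr=$period
    fired=0
    i=0
    while [ "$i" -lt "$commits" ]; do
        i=$((i + 1))
        # Check the period first, so a period of 0 never touches the counter.
        if [ "$period" -gt 0 ]; then
            nr=$((nr - 1))
            if [ "$nr" -eq 0 ]; then
                nr=$period
                fired=$((fired + 1))
            fi
        fi
    done
    echo "$fired"
}
```

For example, `count_repacks 1000 2500` prints 2, while `count_repacks 0 5000` prints 0 without the counter ever going below zero.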
-
I just glanced at git-filter-branch.sh (and I must say I was
incredibly surprised to find out it was a shell script) and it seems
it never runs git-gc or git-repack. Doesn't that end up with the same
problems as git-svn sans git-repack when filtering a large number of
commits? I was just thinking, if I were to git-filter-branch on my
massive repo (in fact, the same repo that started this thread, with
over 33000 commits in the upstream svn repo), even if I just do
something as simple as change the commit msg, won't I end up with
thousands of unreachable objects? I shudder to think how many
unreachable objects I would have if I pruned the entire dports
directory off of the tree.

Am I missing something, or does git-filter-branch really not do any
garbage collection? I tried reading the source, but complex bash
scripts are almost as bad as perl in terms of readability.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
Theoretically yes, and it largely depends on what you do, but
filter-branch goes over the objects that already exist in your
repository, and hopefully you won't be rewriting the majority of
them.

So the impact of not repacking is probably much less painful in
practice.

But again as I said, it largely depends on what you do in your
filter. If you are upcasing (or converting to NFD ;-)) the
contents of all of your blob objects, you would certainly want
to repack every once in a while.
-
Another thing I forgot to say in my previous message. The old
refs are kept in reflogs and also in refs/original/, so you will

Something like this, perhaps?

git-filter-branch.sh | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index ebf05ca..8e44001 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -299,6 +299,12 @@ while read commit parents; do
die "msg filter failed: $filter_msg"
sh -c "$filter_commit" "git commit-tree" \
$(git write-tree) $parentstr < ../message > ../map/$commit
+
+ if test $(( $i % 512 )) = 0
+ then
+ git gc --auto
+ fi
+
done <../revs

# In case of a subdirectory filter, it is possible that a specified head
-
Offhand that looks good, but we'd probably want to unilaterally do
another git-gc when we're done.

diff --git a/git-filter-branch.sh b/git-filter-branch.sh
index ebf05ca..32274a6 100755
--- a/git-filter-branch.sh
+++ b/git-filter-branch.sh
@@ -299,8 +299,16 @@ while read commit parents; do
die "msg filter failed: $filter_msg"
sh -c "$filter_commit" "git commit-tree" \
$(git write-tree) $parentstr < ../message > ../map/$commit
+
+ if test $(( $i % 512 )) = 0
+ then
+ git gc --auto
+ fi
+
done <../revs

+git gc --auto
+
# In case of a subdirectory filter, it is possible that a specified head
# is not in the set of rewritten commits, because it was pruned by the
# revision walker. Fix it by mapping these heads to the next rewritten

--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
I'm actually considering what the cost would be of switching macports
to git (not that it will ever happen - too many anonymous people pull
from svn trunk). Right now the svn trunk contains a subfolder for the
source code and another subfolder for all ~4400+ Portfiles. In such a
theoretical move, I'd want to split that up, probably into two
unrelated branches. Doing so would mean running git-filter-branch over
a linear commit history that's 31580 objects long, with a tree filter
to prune the dports directory away and a msg filter to remove the
svn-id stuff that git-svn left behind. This means that every single
commit object would be changed, as well as the root tree object for
every single commit. That would be about 63160 objects. I'd also have to
figure out some way to remove the commit objects entirely that only
reference the dports directory. Then I'd have to do it again with the
opposite tree filter (to prune everything but the dports directory and
move the contents of the dports directory up one level) and same msg
filter. Granted, if I do the first action in a branch, that leaves no
unreachable objects (since the originals are still referenced), but
the second operation definitely would leave unreachable objects, and
were I to clone the repository instead and do the operations in the
different repos (which is perfectly legitimate - otherwise I'd have to
clone it after everything else and then delete branches) then both
actions would leave thousands of objects unreachable.

I'd suggest a patch to run git gc --auto, but it looks like you just
did in a subsequent email. As for your comments about the reflogs,
can't I disable recording those, at least temporarily? I'd rather
clean up after myself as I work rather than balloon the repository and
collapse it in a single operation at the end.

-Kevin Ballard
--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
You could have used git-svn --no-metadata :)
Using a commit filter to implement the pruning will be much faster;
you'll need to make a temporary index, use git-read-tree, git-rm, then
git-commit. This way you avoid the expense of checking out the files

Honestly, the optimisation I mention above will save you much more time.

Note that you can run git-repack -d every half hour out of cron, it is
safe and will let it clean as you go.

Sam.
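
For the rewrite discussed above (drop the dports directory and strip the git-svn-id lines), one possible shape is an index filter combined with a msg filter. This is a sketch under assumptions rather than a tested recipe for this repository: `strip_svn_id` and `rewrite_without_dports` are hypothetical names, the directory name `dports` is taken from the thread, and the wrapper is only defined here, not run.

```shell
#!/bin/sh
# Drop the "git-svn-id:" lines that git-svn appends to commit messages.
strip_svn_id() {
    sed -e '/^git-svn-id:/d'
}

# Hypothetical wrapper (defined only, not invoked): rewrite the history
# reachable from HEAD without the dports directory and without svn-id lines.
rewrite_without_dports() {
    git filter-branch \
        --index-filter 'git rm -r -q --cached --ignore-unmatch dports' \
        --msg-filter 'sed -e "/^git-svn-id:/d"' \
        -- HEAD
}
```

The index filter works on the index only, so no files are checked out per commit. The originals stay reachable through refs/original/ until those refs are removed.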
-
Sure, except I imported the svn repo with the intention of continuing
to track it. I'm only floating the idea now of converting the upstream
repo to git, but as I said before we have enough anonymous checkouts
of people tracking trunk that we probably can't justify switching

I suspect an index filter would be simpler, and that's really what I

That's a reasonable suggestion. And I'm still just thinking about
this, so I have no idea if I'll ever actually have to run
git-filter-branch on this massive history.

--
Kevin Ballard
http://kevin.sb.org
kevin@sb.org
http://www.tildesoft.com
I wonder if it wouldn't be possible to have filter-branch use
fast-import, so that it would create a pack instead of a lot of loose
objects.

Mike
-
Hi,
Not really; the filters are very much tuned to the index-modification and
commit process.

And I doubt that the gc --auto would help much; git-filter-branch creates
gazillions of files, and that is likely to bring performance down. If,
that is, you choose _not_ to heed the comment in
Documentation/git-filter-branch.txt lines 44-46:

	Note that since this operation is extensively I/O expensive, it
	might be a good idea to redirect the temporary directory off-disk
	with the '-d' option, e.g. on tmpfs. Reportedly the speedup is
	very noticeable.

Ciao,
Dscho
-
I do not think it will help. The objects in packs fast-import
creates cannot be accessed from outside fast-import. Not even
the rest of the core routines running inside that fast-import
process can access them via the usual read_sha1_file()
interface, as described in detail in a recent thread [*1*]. The
only way to make it available while you are still feeding new
data to fast-import is to explicitly tell it to finalize the
current pack by issuing a 'checkpoint' command (and fast-import will
start writing to a new pack).

And filters need to be able to read the objects previous steps
produced to do their work.

Which means that instead of having to deal with many loose
objects, you will now face many little packs, each containing at
most one commit's worth of changed data. You would need to
"repack -a -d" to consolidate these little packs every once in a
while, and I suspect more often than you would need to repack
loose objects, as handling many packs is much more expensive
than handling many loose objects.

[Reference]
*1* http://thread.gmane.org/gmane.comp.version-control.git/70964/focus=71076
-
And afterwards, you'll probably want to check the rewritten history
to make sure it is acceptable before doing a git gc --prune.

Cheers,
Harvey
-
So here they are again. There was a trivial merge conflict with Eric's
fix, but otherwise they are unchanged.

---

Karl Hasselström (2):
Let "git svn" run "git gc --auto" occasionally
git-svn: Don't call git-repack anymore

git-svn.perl | 24 ++++++++++++++----------
1 files changed, 14 insertions(+), 10 deletions(-)

--
Karl Hasselström, kha@treskal.com
www.treskal.com/kalle
-
Let "git svn" run "git gc --auto" every 1000 imported commits to
reduce the number of loose objects.

To handle the common use case of frequent imports, where each
invocation typically fetches much less than 1000 commits, also run gc
unconditionally at the end of the import.

"1000" is the same number that was used by default when we called
git-repack. It isn't necessarily still the best choice.

Signed-off-by: Karl Hasselström <kha@treskal.com>
---
git-svn.perl | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/git-svn.perl b/git-svn.perl
index 074068c..6cc3157 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -1247,6 +1247,8 @@ use File::Path qw/mkpath/;
use File::Copy qw/copy/;
use IPC::Open3;

+my ($_gc_nr, $_gc_period);
+
# properties that we do not log:
my %SKIP_PROP;
BEGIN {
@@ -1407,6 +1409,7 @@ sub read_all_remotes {
}

sub init_vars {
+ $_gc_nr = $_gc_period = 1000;
if (defined $_repack || defined $_repack_flags) {
warn "Repack options are obsolete; they have no effect.\n";
}
@@ -2095,6 +2098,10 @@ sub restore_commit_header_env {
}
}

+sub gc {
+ command_noisy('gc', '--auto');
+};
+
sub do_git_commit {
my ($self, $log_entry) = @_;
my $lr = $self->last_rev;
@@ -2148,6 +2155,10 @@ sub do_git_commit {
0, $self->svm_uuid);
}
print " = $commit ($self->{ref_id})\n";
+ if (--$_gc_nr == 0) {
+ $_gc_nr = $_gc_period;
+ gc();
+ }
return $commit;
}

@@ -3975,6 +3986,7 @@ sub gs_fetch_loop_common {
$max += $inc;
$max = $head if ($max > $head);
}
+ Git::SVN::gc();
}

sub match_globs {
-
In a moment, we'll start calling git-gc --auto instead, since it is a
better fit to what we're trying to accomplish.

The command line options are still accepted, but don't have any
effect, and we warn the user about that.

Signed-off-by: Karl Hasselström <kha@treskal.com>
---
git-svn.perl | 14 +++-----------
1 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/git-svn.perl b/git-svn.perl
index 75e97cc..074068c 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -1247,7 +1247,6 @@ use File::Path qw/mkpath/;
use File::Copy qw/copy/;
use IPC::Open3;

-my $_repack_nr;
# properties that we do not log:
my %SKIP_PROP;
BEGIN {
@@ -1408,9 +1407,9 @@ sub read_all_remotes {
}

sub init_vars {
- $_repack = 1000 unless (defined $_repack && $_repack > 0);
- $_repack_nr = $_repack;
- $_repack_flags ||= '-d';
+ if (defined $_repack || defined $_repack_flags) {
+ warn "Repack options are obsolete; they have no effect.\n";
+ }
}

sub verify_remotes_sanity {
@@ -2149,13 +2148,6 @@ sub do_git_commit {
0, $self->svm_uuid);
}
print " = $commit ($self->{ref_id})\n";
- if ($_repack && (--$_repack_nr == 0)) {
- $_repack_nr = $_repack;
- # repack doesn't use any arguments with spaces in them, does it?
- print "Running git repack $_repack_flags ...\n";
- command_noisy('repack', split(/\s+/, $_repack_flags));
- print "Done repacking\n";
- }
return $commit;
}
-
Let "git svn" run "git gc --auto" every 1000 imported commits to
reduce the number of loose objects.

To handle the common use case of frequent imports, where each
invocation typically fetches much less than 1000 commits, also run gc
unconditionally at the end of the import.

"1000" is the same number that was used by default when we called
git-repack. It isn't necessarily still the best choice.

Signed-off-by: Karl Hasselström <kha@treskal.com>
---
git-svn.perl | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/git-svn.perl b/git-svn.perl
index 988d8f6..be4105c 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -1247,6 +1247,8 @@ use File::Path qw/mkpath/;
use File::Copy qw/copy/;
use IPC::Open3;

+my ($_gc_nr, $_gc_period);
+
# properties that we do not log:
my %SKIP_PROP;
BEGIN {
@@ -1407,6 +1409,7 @@ sub read_all_remotes {
}

sub init_vars {
+ $_gc_nr = $_gc_period = 1000;
if (defined $_repack || defined $_repack_flags) {
warn "Repack options are obsolete; they have no effect.\n";
}
@@ -2095,6 +2098,10 @@ sub restore_commit_header_env {
}
}

+sub gc {
+ command_noisy('gc', '--auto');
+};
+
sub do_git_commit {
my ($self, $log_entry) = @_;
my $lr = $self->last_rev;
@@ -2148,6 +2155,10 @@ sub do_git_commit {
0, $self->svm_uuid);
}
print " = $commit ($self->{ref_id})\n";
+ if (--$_gc_nr == 0) {
+ $_gc_nr = $_gc_period;
+ gc();
+ }
return $commit;
}

@@ -3975,6 +3986,7 @@ sub gs_fetch_loop_common {
$max += $inc;
$max = $head if ($max > $head);
}
+ Git::SVN::gc();
}

sub match_globs {
-
In a moment, we'll start calling git-gc --auto instead, since it is a
better fit to what we're trying to accomplish.

The command line options are still accepted, but don't have any
effect, and we warn the user about that.

Signed-off-by: Karl Hasselström <kha@treskal.com>
---
Is this close enough to what you intended?

git-svn.perl | 14 ++------------
1 files changed, 2 insertions(+), 12 deletions(-)

diff --git a/git-svn.perl b/git-svn.perl
index 9f2b587..988d8f6 100755
--- a/git-svn.perl
+++ b/git-svn.perl
@@ -1247,7 +1247,6 @@ use File::Path qw/mkpath/;
use File::Copy qw/copy/;
use IPC::Open3;

-my $_repack_nr;
# properties that we do not log:
my %SKIP_PROP;
BEGIN {
@@ -1408,10 +1407,8 @@ sub read_all_remotes {
}

sub init_vars {
- if (defined $_repack) {
- $_repack = 1000 if ($_repack <= 0);
- $_repack_nr = $_repack;
- $_repack_flags ||= '-d';
+ if (defined $_repack || defined $_repack_flags) {
+ warn "Repack options are obsolete; they have no effect.\n";
}
}

@@ -2151,13 +2148,6 @@ sub do_git_commit {
0, $self->svm_uuid);
}
print " = $commit ($self->{ref_id})\n";
- if (defined $_repack && (--$_repack_nr == 0)) {
- $_repack_nr = $_repack;
- # repack doesn't use any arguments with spaces in them, does it?
- print "Running git repack $_repack_flags ...\n";
- command_noisy('repack', split(/\s+/, $_repack_flags));
- print "Done repacking\n";
- }
return $commit;
}
-
