Re: Trying to use git-filter-branch to compress history by removing large, obsolete binary files

From: Elijah Newren
Date: Sunday, October 7, 2007 - 2:23 pm

Hi,

I'm using git-cvsimport to import some CVS repos, which unfortunately
included dozens of large regression test output files in their ancient
history...some of which measure hundreds of megabytes in size.  I'd
like to prune them out of the git history (I don't have access to
prune them out of the CVS history), but I'm running into problems.

The following set of instructions will duplicate my problem with a
smaller repo; why is the local git repository bigger after running
git-filter-branch rather than smaller as I'd expect?  I'm probably
missing something obvious, but I have no idea what it is.

The steps:

# Make a small repo
mkdir test
cd test
git init
echo hi > there
git add there
git commit -m 'Small repo'

# Add a random 10M binary file
dd if=/dev/urandom of=testme.txt count=10 bs=1M
git add testme.txt
git commit -m 'Add big binary file'

# Remove the 10M binary file
git rm testme.txt
git commit -m 'Remove big binary file'

# Compress the repo, see how big the repo is
git gc --aggressive --prune
du -ks .                       # 10548K
du -ks .git                    # 10532K

# Try to rewrite history to remove the binary file
git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
git reset --hard

# Try to recompress and clean up, then check the new size
git gc --aggressive --prune
du -ks .                       # 10580K !?!?!?
du -ks .git                    # 10564K


Thanks,
Elijah
-

From: Frank Lichtenheld
Date: Sunday, October 7, 2007 - 2:38 pm

The usual suspect would be the reflog.

Gruesse,
-- 
Frank Lichtenheld <frank@lichtenheld.de>
www: http://www.djpig.de/
-

From: Elijah Newren
Date: Sunday, October 7, 2007 - 3:00 pm

The git-filter-branch documentation mentions creating refs/original
under .git.  Unfortunately, it doesn't contain any links or
documentation on how I'd clean those out and I haven't been able to
figure it out.  I asked on #git how to clean these out and got some
answers that didn't work (git branch -d and something else I don't
remember).  So...how do I fix the reflog, and then repack to have a
pack under 11MB in size?

Thanks,
Elijah
-

From: Alex Riesen
Date: Sunday, October 7, 2007 - 3:19 pm

rm -rf .git/refs/original/refs/heads/<the branch where HEAD pointed to>
(assuming you haven't repacked yet)

or just edit .git/packed-refs and remove everything "refs/original"

git reflog expire --all (it is a bit too much. You can just edit
.git/logs/* in any text editor)
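
The backup refs can also be removed without touching files under .git by hand. The following is only a sketch: delete_backup_refs is a hypothetical helper name, not a git command. It relies on git for-each-ref and git update-ref -d, which treat loose and packed refs alike, so .git/packed-refs needs no manual edit.

```shell
# Hypothetical helper: delete every backup ref that git-filter-branch
# left under refs/original/, whether it is stored loose or packed.
delete_backup_refs() {
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done
}
```

Run it from inside the rewritten repository, and only after checking that the rewritten history is what you wanted.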

-

From: Elijah Newren
Date: Sunday, October 7, 2007 - 3:24 pm

On 10/7/07, Alex Riesen <raa.lkml@gmail.com> wrote:

So...

$ du -hs .
11M     .
$ rm -rf .git/refs/original/
$ vi .git/packed-refs
# Remove the line referring to refs/original...
$ git reflog expire --all
$ git gc --aggressive --prune
$ du -hs .
11M     .

It's still 11MB.

Any other ideas?

Elijah
-

From: Alex Riesen
Date: Sunday, October 7, 2007 - 4:40 pm

you missed something. Your example compresses to about 124k.

-

From: Elijah Newren
Date: Sunday, October 7, 2007 - 5:09 pm

What version of git are you running?  I reran all the steps to which
you responded (repeated below for clarity) with git-1.5.3.3 and still
get 11MB.  Also, you must have different filesystem extents than me
since an empty git repo takes 196k here[1], so I don't think any repo
is going to get down to 124k.

As I understand them, the steps you suggest would be:

# Make a small repo
mkdir test
cd test
git init
echo hi > there
git add there
git commit -m 'Small repo'

# Add a random 10M binary file
dd if=/dev/urandom of=testme.txt count=10 bs=1M
git add testme.txt
git commit -m 'Add big binary file'

# Remove the 10M binary file
git rm testme.txt
git commit -m 'Remove big binary file'

# Compress the repo, see how big the repo is
git gc --aggressive --prune
du -ks .                       # 10548K
du -ks .git                    # 10532K

# Try to rewrite history to remove the binary file
git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
git reset --hard

# Try to recompress and clean up, then check the new size
git gc --aggressive --prune
du -ks .                       # 10580K !?!?!?
du -ks .git                    # 10564K

# Do the stuff Alex suggests to trim the history
rm -rf .git/refs/original/
vi .git/packed-refs
# Use vi to remove the line referring to refs/original...
git reflog expire --all
git gc --aggressive --prune
du -ks .                      # Still 10564K


Thanks,
Elijah

[1] An empty git repo takes 196k for me, as can be seen by:
$ mkdir tmp
$ cd tmp
$ git init
$ du -hs .
196K    .
-

From: Alex Riesen
Date: Sunday, October 7, 2007 - 11:15 pm

it is ext3. I do not install the hooks (~8k apparent, ~32k fs blocks)

another part of the suggestion re reflogs was to look into the logs,
to check if expire actually removed anything. It seems to have been
the culprit.

-

From: Andreas Ericsson
Date: Monday, October 8, 2007 - 2:23 am

On my system, running git version 1.5.3.3.131.g34c6d,

	git reflog expire --all

does absolutely nothing.

	git reflog expire --expire=0 --all

truncates all the reflogs. I'm not sure if this is intended or not.
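
The difference can be reproduced in a throwaway repository. This is a sketch under assumptions: it uses --expire=now, which is assumed here to be the spelling a current Git accepts in place of the --expire=0 quoted above, and the identity variables and the count_entries helper are illustrative only.

```shell
set -eu
# Illustrative identity so the commits work on a machine with no git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q
git commit -q --allow-empty -m one
git commit -q --allow-empty -m two

# Hypothetical helper: number of entries in HEAD's reflog.
count_entries() { git reflog show HEAD 2>/dev/null | wc -l | tr -d ' '; }

initial=$(count_entries)
git reflog expire --all                  # default cutoffs: fresh entries survive
after_default=$(count_entries)
git reflog expire --expire=now --all     # explicit cutoff: entries are dropped
after_now=$(count_entries)

echo "initial=$initial after_default=$after_default after_now=$after_now"
```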

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-

From: Dmitry Potapov
Date: Sunday, October 7, 2007 - 4:43 pm

I believe this should work:

git reflog expire --all --expire-unreachable=0
git gc --prune

Warning: all unreachable references will be removed!

Dmitry
-

From: Elijah Newren
Date: Sunday, October 7, 2007 - 5:22 pm

Yes, this seems to work.  So the history-rewriting steps are

git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
git reset --hard
rm -rf .git/refs/original/
vi .git/packed-refs
# Use vi to remove the line referring to refs/original...
git reflog expire --all --expire-unreachable=0
git gc --prune
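
For anyone who wants to try the sequence without hand-editing packed-refs, here is a sketch of the same steps as one script. The assumptions are mine, not from the thread: a 1 MiB file instead of 10M to keep it quick, git update-ref -d instead of rm and vi, --expire=now (which drops all reflog entries, more than --expire-unreachable does), --prune=now as the modern spelling of an immediate prune, and FILTER_BRANCH_SQUELCH_WARNING, which recent Git versions honor to skip the filter-branch warning pause.

```shell
set -eu
# Illustrative identity and settings; none of these names come from the thread.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
echo hi > there
git add there
git commit -q -m 'Small repo'
head -c 1048576 /dev/urandom > testme.txt      # 1 MiB of incompressible data
git add testme.txt
git commit -q -m 'Add big binary file'
git rm -q testme.txt
git commit -q -m 'Remove big binary file'

# Hypothetical helper: size of all packs in KiB, as reported by git itself.
pack_kib() { git count-objects -v | awk '/^size-pack:/ { print $2 }'; }

git gc -q --prune=now
before_kib=$(pack_kib)

# Rewrite history, then drop everything that still points at the old commits.
git filter-branch --tree-filter 'rm -f testme.txt' HEAD >/dev/null 2>&1
git for-each-ref --format='%(refname)' refs/original/ |
while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc -q --prune=now
after_kib=$(pack_kib)

echo "pack size: ${before_kib} KiB before, ${after_kib} KiB after"
```

Once the script has run, nothing is left to recover the old commits from, so this only makes sense on a repository you can afford to lose.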


What other scenarios could lead to unreachable references?  I don't
know how to determine whether this is safe or not (except that these
were test repositories anyway, so I don't care what happens to them).

Thanks!
Elijah
-

From: Dmitry Potapov
Date: Sunday, October 7, 2007 - 6:06 pm

Actually, I would rather not, because you rarely need to remove anything
immediately, and a 30-day delay is a reasonable time to give you a chance
to recover what you removed accidentally. You can reduce it by setting an
appropriate value for gc.reflogExpireUnreachable in your configuration.
The only thing you need to do is to remove .git/refs/original/refs/heads/something


Git logs all your actions, so even rewriting history would not be
so disastrous if you suddenly realized that you did something wrong.
The history is stored for 30 days by default. Usually, you do not
need to mess with Git internals like you did above. Your useless
files will still disappear after being unreachable for 30 days.

OTOH, if you want to have a clean repository immediately, I believe
'git clone' is a better option. After you made a local clone using
it, 'git gc' should remove old garbage.
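
As an illustration of the gc.reflogExpireUnreachable setting mentioned above, the sketch below shortens the grace period in a single throwaway repository. The value '7 days' is an arbitrary example, not a recommendation.

```shell
# Throwaway repository, so no real configuration is touched.
repo=$(mktemp -d)
git init -q "$repo"

# Shorten the grace period for unreachable reflog entries in this repo only.
git -C "$repo" config gc.reflogExpireUnreachable '7 days'

# Read the value back to confirm what was stored.
git -C "$repo" config --get gc.reflogExpireUnreachable
```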

Dmitry
-

From: Andreas Ericsson
Date: Monday, October 8, 2007 - 2:27 am

A clone only fetches revs reachable from a ref, so pruning immediately
after a clone is completely pointless.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
-


StGit (and presumably guilt, and any other similar tool) are just
glorified rebase wrappers, so they'll generate tons of unreachable
objects too.

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle
-

From: Dmitry Potapov
Date: Monday, October 8, 2007 - 5:40 am

Not true. git-clone copies the whole pack, so it can contain unreachable
objects. Here is a simple script that demonstrates that without garbage
collection the size of the cloned repository will be the same as the
original one.

===========================================
# Make a small repo
mkdir test
cd test
git init
echo hi > there
git add there
git commit -m 'Small repo'

# Add a random 10M binary file
dd if=/dev/urandom of=testme.txt count=10 bs=1M
git add testme.txt
git commit -m 'Add big binary file'

# Remove the 10M binary file
git rm testme.txt
git commit -m 'Remove big binary file'

# Compress the repo, see how big the repo is
git gc --aggressive --prune
du -ks .                       # 10348
du -ks .git                    # 10344

git-whatchanged

# Try to rewrite history to remove the binary file
git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
git reset --hard

# Remove original refs
rm .git/refs/original/refs/heads/master

# Move back to the parent directory
cd ..

# Clone repository
git-clone -l test/.git test2

cd test2
du -ks .git # 10360

# Now run garbage collection
git gc
du -ks .git # 96

===========================================

Dmitry
-


Try without the -l option and with a file:// URL:

  git clone file:///path/to/test/.git test2

From the git-clone man page:

--local::
-l::
        When the repository to clone from is on a local machine, this
        flag bypasses normal "git aware" transport mechanism and
        clones the repository by making a copy of HEAD and everything
        under objects and refs directories. The files under
        `.git/objects/` directory are hardlinked to save space when
        possible. This is now the default when the source repository
        is specified with `/path/to/repo` syntax, so it essentially is
        a no-op option. To force copying instead of hardlinking (which
        may be desirable if you are trying to make a back-up of your
        repository), but still avoid the usual "git aware" transport
        mechanism, `--no-hardlinks` can be used.
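
The difference between the two clone styles can be checked with a deliberately unreachable object. This is a sketch with made-up names (src, by-path, by-url, has_object): it writes a blob that no commit references, clones the repository once by plain path and once by file:// URL, and asks each clone whether it has that blob.

```shell
set -eu
# Illustrative identity so the commit works on a machine with no git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

work=$(mktemp -d)
cd "$work"
git init -q src
git -C src commit -q --allow-empty -m 'only commit'

# Store a blob that nothing references: an unreachable object.
orphan=$(echo 'unreachable junk' | git -C src hash-object -w --stdin)

git clone -q src by-path                  # plain path: object files copied or hardlinked
git clone -q "file://$work/src" by-url    # file:// URL: goes through the normal transport

# Hypothetical helper: does the repository in $1 contain the orphan blob?
has_object() { git -C "$1" cat-file -e "$orphan" 2>/dev/null && echo yes || echo no; }

path_has=$(has_object by-path)
url_has=$(has_object by-url)
echo "by-path has orphan: $path_has, by-url has orphan: $url_has"
```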

-- 
Karl Hasselström, kha@treskal.com
      www.treskal.com/kalle
-

From: Johannes Schindelin
Date: Sunday, October 7, 2007 - 4:19 pm

Hi,


Just clone it.  The clone will be much smaller.

Ciao,
Dscho

-

From: Elijah Newren
Date: Sunday, October 7, 2007 - 4:24 pm

$ git clone test test2
<snip>
$ du -hs test
11M     test
$ du -hs test2
11M     test2

Any other ideas?
-

From: Johannes Schindelin
Date: Sunday, October 7, 2007 - 4:28 pm

Hi,


Yep.  Maybe it is necessary to run "git gc" in test2.

Ciao,
Dscho

-

From: Elijah Newren
Date: Sunday, October 7, 2007 - 4:38 pm

Hi,


Sweet, finally solved!  That brings test2 down to 340K.

However, the solution seems somewhat involved...it requires running
git-filter-branch, git reset, removing the .git/refs/original/
directory, editing .git/packed-refs in some editor, running git reflog
expire, cloning the resulting repository, and running git gc yet
again.  It seems like there has to be an easier way.  (Anyone have
one?)

Oh, and git-filter-branch could really use some explanatory note about
how to actually complete rewriting the history.

Thanks,
Elijah
-

From: Johannes Schindelin
Date: Sunday, October 7, 2007 - 5:34 pm

Hi,



It does what it should do.  It is _your_ task to look at refs/original/*
to check that everything went all right.  Then you just delete the checked refs.
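
One way to look at the backup refs before deleting them is sketched below. show_rewritten_refs is a hypothetical helper name, not a git command. For every ref under refs/original/ it prints the original commit id next to the id the ref points at now.

```shell
# Hypothetical helper: list each backed-up ref with its original and
# rewritten object ids, so the two histories can be compared.
show_rewritten_refs() {
    git for-each-ref --format='%(refname) %(objectname)' refs/original/ |
    while read -r backup old; do
        ref=${backup#refs/original/}     # e.g. refs/heads/master
        new=$(git rev-parse --verify -q "$ref" || echo missing)
        printf '%s\n  original:  %s\n  rewritten: %s\n' "$ref" "$old" "$new"
    done
}
```

Running git log or git diff --stat between the two ids printed for a ref then shows what the rewrite changed.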

What made your case so cumbersome was that you wanted the big objects out 
_now_, instead of having them in for a grace period.  BTW this grace 
period is in place to help _you_, not the program.  (In case you fscked up 
and need those objects back.)

Ciao,
Dscho

-

From: Elijah Newren
Date: Sunday, October 7, 2007 - 5:47 pm

Yes, a git filter-branch, git clone, AND git gc in the clone avoids
all those funny ref editing commands.  However, cloning a 5.6GB repo
(the size of one of the real repos I'm dealing with) will likely take
a long time (and may push me past the limits of disk space), so using

Sure, I think that's a sane default.  And I think it's fine that it
should be my task to look at the refs to check that everything worked
okay and delete them.  But it's nearly impossible to figure out how to
do that!  _That_ is my complaint.  I got multiple misleading or
incomplete answers (both on this list and in #git) before getting some
working solutions, so this task is obviously far from trivial.  I
really think that adding instructions about how to check and delete
the relevant refs would be a very useful addition to the
documentation.

Thanks everyone for the help!

Elijah
-

From: Sam Vilain
Date: Sunday, October 7, 2007 - 7:28 pm

You can just delete the logs and references that you don't want and run
git gc --prune.

However.

git gc creates a new pack before deleting the old one.  Garbage
collection usually does this; make a copy of everything to a new place
and then free all of the old space.  If *that* is a problem, i.e. you
don't have enough space for two copies of the repository and the junk,
you'll have to do a partial import, leave the junk you don't want
unpacked, cleanup and prune, then finish the import.  Which sounds like
a lot of hassle when you should really just find a place with more space
to work with!

Sam.
-

From: J. Bruce Fields
Date: Sunday, October 7, 2007 - 6:00 pm

It seems odd to me, by the way, that filter-branch has its own
home-grown backup mechanism.  Lots of other commands can "lose" commits,
but none of them keep an extra backup like this.

And I find it tedious for quicker jobs which it might otherwise be
useful for (e.g. rewrites of commits in my tree not yet in upstream),
unless I wrap it in a script that cleans up after itself.

--b.
-

From: Johannes Schindelin
Date: Sunday, October 7, 2007 - 6:06 pm

Hi,


The rationale was this: filter-branch recently learnt how to rewrite many 
branches, and it might be tedious to find out which ones.  But then, there 
is git log --no-walk --all, so maybe I really should get rid of 
refs/original/*?

I'd like to have some comments from the heavier filter-branch users on 
that...

Ciao,
Dscho

-

From: Johannes Sixt
Date: Sunday, October 7, 2007 - 11:22 pm

IMHO, a backup of the original refs is needed. However, it may be wise to 
store them in the refs/heads namespace so that 'git branch -d' can delete 
them and 'git branch -m' can move them back if something went wrong.

-- Hannes
-

From: J. Bruce Fields
Date: Monday, October 8, 2007 - 7:36 am

If people want backups like this it'd seem easier to turn this on
optionally with commandline switches, like patch's --backup, --prefix,
--suffix options.

Having it by default leave these backups around, even when everything
succeeds, makes for unnecessary cleanup work in the normal case, and is
inconsistent with the behavior of other git commands that destroy or
rewrite history.

--b.
-

From: Theodore Tso
Date: Monday, October 8, 2007 - 9:37 am

I think what makes git-filter-branch different is that you can change
a large amount of history with git-filter-branch, including large
numbers of tags, etc.  The reflog is quite sufficient to recover from
a screwed up "git commit --amend".  But I don't think the reflog is
going to be sufficient given the kinds of changes that
git-filter-branch can potentially do to your repository.  Maybe the
default of --backup vs --no-backup could be changed via a config
parameter, but I think the default of backing up refs is a good
thing....

Perhaps a solution would be to add a "git-filter-branch --cleanup"
that clears the reflog and wipes the backed up tags; perhaps first
asking interactively if the user is really sure he/she wants to do
this.

						- Ted
-

From: J. Bruce Fields
Date: Monday, October 8, 2007 - 12:05 pm

Yeah, it's clearly designed with rewriting a whole repo in mind.

It might also be handy, though, as a quick way to rewrite a single
branch.  (E.g., "add 'Acked-by: Joe' to everything in 'for-upstream' not
in 'origin'", or "rename foo to bar in every commit in 'topic' not in
'origin'".).

I find the current defaults awkward for that case.  Maybe it'd make

Maybe.

--b.
-

From: Johannes Schindelin
Date: Tuesday, October 9, 2007 - 3:37 am

Hi,


FWIW after reading Bruce's reasoning, I tend towards having no "backups" 
by default (I say "backups", since they are _only_ written when the 
respective branch has changed).

And I do not think that the config variable is a good approach; whether
you want backups or not is a per-case decision.  So your proposal would
only result in even more confusion.

My preference ATM is to write nothing per default, but only when 
--original <namespace> was given.

Ciao,
Dscho

-

From: Alex Riesen
Date: Sunday, October 7, 2007 - 3:08 pm

git-filter-branch makes a backup of your original references:

$ git filter-branch --help
...
       Always verify that the rewritten version is correct: The original refs,
       if different from the rewritten ones, will be stored in the namespace
       refs/original/.
...

These will keep your big files in repository.

-
