Hi,

I'm using git-cvsimport to import some CVS repos, which unfortunately included dozens of large regression test output files in their ancient history...some of which measure hundreds of megabytes in size. I'd like to prune them out of the git history (I don't have access to prune them out of the CVS history), but I'm running into problems. The following set of instructions will duplicate my problem with a smaller repo; why is the local git repository bigger after running git-filter-branch rather than smaller as I'd expect? I'm probably missing something obvious, but I have no idea what it is.

The steps:

  # Make a small repo
  mkdir test
  cd test
  git init
  echo hi > there
  git add there
  git commit -m 'Small repo'

  # Add a random 10M binary file
  dd if=/dev/urandom of=testme.txt count=10 bs=1M
  git add testme.txt
  git commit -m 'Add big binary file'

  # Remove the 10M binary file
  git rm testme.txt
  git commit -m 'Remove big binary file'

  # Compress the repo, see how big the repo is
  git gc --aggressive --prune
  du -ks .       # 10548K
  du -ks .git    # 10532K

  # Try to rewrite history to remove the binary file
  git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
  git reset --hard

  # Try to recompress and clean up, then check the new size
  git gc --aggressive --prune
  du -ks .       # 10580K !?!?!?
  du -ks .git    # 10564K

Thanks,
Elijah
-
The usual suspect would be the reflog. Gruesse, -- Frank Lichtenheld <frank@lichtenheld.de> www: http://www.djpig.de/ -
The git-filter-branch documentation mentions creating refs/original under .git. Unfortunately, it doesn't contain any links or documentation on how I'd clean those out and I haven't been able to figure it out. I asked on #git how to clean these out and got some answers that didn't work (git branch -d and something else I don't remember). So...how do I fix the reflog, and then repack to have a pack under 11MB in size? Thanks, Elijah -
  rm -rf .git/refs/original/refs/heads/<the branch where HEAD pointed to>

(assuming you haven't repacked yet) or just edit .git/packed-refs and remove every line that mentions "refs/original".

  git reflog expire --all

(it is a bit too much. You can just edit .git/logs/* in any text editor)
-
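As an alternative to removing files under .git or editing .git/packed-refs by hand, here is a minimal sketch that deletes the backup refs through git itself. It assumes a reasonably recent git (not necessarily the 1.5.3.x discussed in this thread); the throwaway repository, the placeholder identity and the hand-made backup ref are only there so that the example is self-contained.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"
echo hi > there
git add there
git commit -q -m 'Small repo'

# Stand-in for the backup ref that filter-branch leaves behind.
branch=$(git symbolic-ref HEAD)             # e.g. refs/heads/master
git update-ref "refs/original/$branch" HEAD
git pack-refs --all                         # make sure the ref is packed, not a loose file

# List every backup ref, then delete each one.
git for-each-ref --format='%(refname)' refs/original/
git for-each-ref --format='%(refname)' refs/original/ |
while read -r ref; do
    git update-ref -d "$ref"
done

remaining=$(git for-each-ref refs/original/ | wc -l | tr -d ' ')
echo "backup refs left: $remaining"
```

Because update-ref goes through git's own ref handling, it should not matter whether the ref is a loose file or an entry in packed-refs.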
On 10/7/07, Alex Riesen <raa.lkml@gmail.com> wrote:

So...

  $ du -hs .
  11M .
  $ rm -rf .git/refs/original/
  $ vi .git/packed-refs    # Remove the line referring to refs/original...
  $ git reflog expire --all
  $ git gc --aggressive --prune
  $ du -hs .
  11M .

It's still 11MB. Any other ideas?

Elijah
-
you missed something. Your example compresses to about 124k. -
What version of git are you running? I reran all the steps to which you responded (repeated below for clarity) with git-1.5.3.3 and still get 11MB. Also, you must have different filesystem extents than me since an empty git repo takes 196k here[1], so I don't think any repo is going to get down to 124k.

My understanding of the steps you suggest would work:

  # Make a small repo
  mkdir test
  cd test
  git init
  echo hi > there
  git add there
  git commit -m 'Small repo'

  # Add a random 10M binary file
  dd if=/dev/urandom of=testme.txt count=10 bs=1M
  git add testme.txt
  git commit -m 'Add big binary file'

  # Remove the 10M binary file
  git rm testme.txt
  git commit -m 'Remove big binary file'

  # Compress the repo, see how big the repo is
  git gc --aggressive --prune
  du -ks .       # 10548K
  du -ks .git    # 10532K

  # Try to rewrite history to remove the binary file
  git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
  git reset --hard

  # Try to recompress and clean up, then check the new size
  git gc --aggressive --prune
  du -ks .       # 10580K !?!?!?
  du -ks .git    # 10564K

  # Do the stuff Alex suggests to trim the history
  rm -rf .git/refs/original/
  vi .git/packed-refs    # Use vi to remove the line referring to refs/original...
  git reflog expire --all
  git gc --aggressive --prune
  du -ks .       # Still 10564K

Thanks,
Elijah

[1] An empty git repo takes 196k for me, as can be seen by:

  $ mkdir tmp
  $ cd tmp
  $ git init
  $ du -hs .
  196K .
-
it is ext3. I do not install the hooks (~8k apparent, ~32k fs blocks) another part of the suggestion re reflogs was to look into the logs, to check if expire actually removed anything. It seems to have been the culprit. -
On my system, running git version 1.5.3.3.131.g34c6d, git reflog expire --all does absolutely nothing. git reflog expire --expire=0 --all truncates all the reflogs. I'm not sure if this is intended or not. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 -
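A small sketch of the difference described here, assuming a newer git where the cutoff can be spelled --expire=now (the thread itself uses --expire=0 on 1.5.3.x). With the default cutoffs, entries that are only seconds old are far too young to expire, so a plain "git reflog expire --all" leaves them alone. The repository and the placeholder identity are made up for the demo.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"
echo one > file
git add file
git commit -q -m one
echo two > file
git commit -q -am two
git reset -q --hard HEAD~1                  # commit "two" is now only kept alive by the reflog

before=$(git reflog | wc -l | tr -d ' ')

# Default cutoffs: nothing here is old enough to be expired.
git reflog expire --all
after_default=$(git reflog | wc -l | tr -d ' ')

# Explicit cutoff: expire every entry regardless of age.
git reflog expire --expire=now --all
after_now=$(git reflog | wc -l | tr -d ' ')

echo "entries before: $before, after default expire: $after_default, after --expire=now: $after_now"
```

Only after the second expire does a following "git gc --prune" have a chance to drop the objects that the reflog was keeping alive.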
I believe this should work:

  git reflog expire --all --expire-unreachable=0
  git gc --prune

Warning: all unreachable references will be removed!

Dmitry
-
Yes, this seems to work. So the history-rewriting steps are

  git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
  git reset --hard
  rm -rf .git/refs/original/
  vi .git/packed-refs    # Use vi to remove the line referring to refs/original...
  git reflog expire --all --expire-unreachable=0
  git gc --prune

What other scenarios could lead to unreachable references? I don't know how to determine whether this is safe or not (except that these were test repositories anyway, so I don't care what happens to them).

Thanks!
Elijah
-
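The same sequence can be sketched end to end in a throwaway repository. This is only an illustration under a few assumptions: a newer git than 1.5.3.x (so --expire=now and --prune=now are available, and filter-branch is still shipped), a 1 MiB file instead of 10M to keep it quick, and a placeholder identity. It checks for the blob directly with cat-file instead of comparing du output.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1      # newer git prints a long warning otherwise
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo hi > there
git add there
git commit -q -m 'Small repo'

dd if=/dev/urandom of=testme.txt bs=1024 count=1024 2>/dev/null
git add testme.txt
git commit -q -m 'Add big binary file'
blob=$(git rev-parse HEAD:testme.txt)       # remember the object we want gone

git rm -q testme.txt
git commit -q -m 'Remove big binary file'

# Rewrite history so that no commit contains the file.
git filter-branch --tree-filter 'rm -f testme.txt' HEAD >/dev/null 2>&1

# 1. Drop the backup refs that filter-branch created under refs/original/.
git for-each-ref --format='%(refname)' refs/original/ |
while read -r ref; do
    git update-ref -d "$ref"
done

# 2. Expire the reflog entries that still point at the old commits.
git reflog expire --expire=now --all

# 3. Repack and prune unreachable objects right away.
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then
    blob_state=present
else
    blob_state=gone
fi
echo "big blob is $blob_state"
git count-objects -v
```

As warned above, step 2 and step 3 together throw away every unreachable object, not just the ones from the rewrite, so this is something to do only once the rewritten history has been checked.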
Actually, I would rather not, because you rarely need to remove anything immediately, and a 30-day delay is a reasonable time to give you a chance to recover what you removed accidentally. You can reduce it by setting an appropriate value for gc.reflogExpireUnreachable in your configuration.

The only thing you need to do is to remove .git/refs/original/refs/heads/something

Git logs all your actions, so even re-writing history would not be so disastrous if you suddenly realized that you did something wrong. The history is stored for 30 days by default.

Usually, you do not need to mess with Git internals like you did above. Your useless files will still disappear after being unreachable for 30 days. OTOH, if you want to have a clean repository immediately, I believe 'git clone' is a better option. After you have made a local clone using it, 'git gc' should remove old garbage.

Dmitry
-
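For reference, a minimal sketch of shortening the grace period through configuration rather than expiring reflogs by hand. The value "7 days" is an arbitrary example and not a recommendation; the setting is written to a throwaway repository's own config here, but it could equally go into the global configuration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Keep reflog entries for unreachable commits for 7 days instead of the default 30.
git config gc.reflogExpireUnreachable "7 days"

# Read the setting back to confirm what later gc runs will use.
value=$(git config --get gc.reflogExpireUnreachable)
echo "gc.reflogExpireUnreachable = $value"
```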
A clone only fetches revs reachable from a ref, so pruning immediately after a clone is completely pointless. -- Andreas Ericsson andreas.ericsson@op5.se OP5 AB www.op5.se Tel: +46 8-230225 Fax: +46 8-230231 -
StGit (and presumably guilt, and any other similar tool) are just glorified rebase wrappers, so they'll generate tons of unreachable objects too. -- Karl Hasselström, kha@treskal.com www.treskal.com/kalle -
Not true. git-clone copies the whole pack, so it can contain unreachable objects. Here is a simple script that demonstrates that without garbage collection the size of the cloned repository will be the same as the original one.

===========================================
# Make a small repo
mkdir test
cd test
git init
echo hi > there
git add there
git commit -m 'Small repo'

# Add a random 10M binary file
dd if=/dev/urandom of=testme.txt count=10 bs=1M
git add testme.txt
git commit -m 'Add big binary file'

# Remove the 10M binary file
git rm testme.txt
git commit -m 'Remove big binary file'

# Compress the repo, see how big the repo is
git gc --aggressive --prune
du -ks .       # 10348
du -ks .git    # 10344

git-whatchanged

# Try to rewrite history to remove the binary file
git-filter-branch --tree-filter 'rm -f testme.txt' HEAD
git reset --hard

# Remove original refs
rm .git/refs/original/refs/heads/master

# Move back up
cd ..

# Clone repository
git-clone -l test/.git test2
cd test2
du -ks .git    # 10360

# Now run garbage collection
git gc
du -ks .git    # 96
===========================================

Dmitry
-
Try without the -l option and with a file:// URL:

  git clone file:///path/to/test/.git test2

From the git-clone man page:

  --local::
  -l::
        When the repository to clone from is on a local machine, this flag bypasses normal "git aware" transport mechanism and clones the repository by making a copy of HEAD and everything under objects and refs directories. The files under `.git/objects/` directory are hardlinked to save space when possible. This is now the default when the source repository is specified with `/path/to/repo` syntax, so it essentially is a no-op option. To force copying instead of hardlinking (which may be desirable if you are trying to make a back-up of your repository), but still avoid the usual "git aware" transport mechanism, `--no-hardlinks` can be used.

-- Karl Hasselström, kha@treskal.com www.treskal.com/kalle
-
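To make the difference concrete, here is a small sketch that clones the same repository once by plain path and once through a file:// URL, then checks whether an unreachable blob came along. It assumes a reasonably recent git (for "git -C"); the directory names, the tiny stand-in file and the placeholder identity are made up for the demo.

```shell
set -e
work=$(mktemp -d)
cd "$work"
git init -q src
cd src
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"
echo hi > there
git add there
git commit -q -m 'Small repo'
echo 'pretend this is a huge binary' > testme.txt
git add testme.txt
git commit -q -m 'Add big binary file'
blob=$(git rev-parse HEAD:testme.txt)
git reset -q --hard HEAD~1                  # blob is now reachable only through the reflog
cd ..

# Plain path: the object files are hardlinked or copied wholesale.
git clone -q src local-copy
# file:// URL: goes through the normal transport, which only sends
# objects reachable from the refs.
git clone -q "file://$work/src" transport-copy

has_blob() {
    if git -C "$1" cat-file -e "$blob" 2>/dev/null; then echo yes; else echo no; fi
}
local_has=$(has_blob local-copy)
transport_has=$(has_blob transport-copy)
echo "local-copy has the unreachable blob: $local_has"
echo "transport-copy has the unreachable blob: $transport_has"
```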
Hi, Just clone it. The clone will be much smaller. Ciao, Dscho -
  $ git clone test test2
  <snip>
  $ du -hs test
  11M test
  $ du -hs test2
  11M test2

Any other ideas?
-
Hi, Yep. Maybe it is necessary to run "git gc" in test2. Ciao, Dscho -
Hi, Sweet, finally solved! That brings test2 down to 340K. However, the solution seems somewhat involved...it requires running git-filter-branch, git reset, removing the .git/refs/original/ directory, editing .git/packed-refs in some editor, running git reflog expire, cloning the resulting repository, and running git gc yet again. It seems like there has to be an easier way. (Anyone have one?) Oh, and git-filter-branch could really use some explanatory note about how to actually complete rewriting the history. Thanks, Elijah -
Hi, It does what it should do. It is _your_ task to look at refs/original/* if everything went alright. Then you just delete the checked refs. What made your case so cumbersome was that you wanted the big objects out _now_, instead of having them in for a grace period. BTW this grace period is in place to help _you_, not the program. (In case you fscked up and need those objects back.) Ciao, Dscho -
Yes, a git filter-branch, git clone, AND git gc in the clone avoids all those funny ref editing commands. However, cloning a 5.6GB repo (the size of one of the real repos I'm dealing with) will likely take a long time (and may push me past the limits of disk space), so using

Sure, I think that's a sane default. And I think it's fine that it should be my task to look at the refs to check that everything worked okay and delete them. But it's nearly impossible to figure out how to do that! _That_ is my complaint. I got multiple misleading or incomplete answers (both on this list and in #git) before getting some working solutions, so this task is obviously far from trivial. I really think that adding instructions about how to check and delete the relevant refs would be a very useful addition to the documentation.

Thanks everyone for the help!
Elijah
-
You can just delete the logs and references that you don't want and run git gc --prune.

However, git gc creates a new pack before deleting the old one. Garbage collection usually does this: make a copy of everything to a new place and then free all of the old space. If *that* is a problem, i.e. you don't have enough space for two copies of the repository and the junk, you'll have to do a partial import, leave the junk you don't want unpacked, clean up and prune, then finish the import.

Which sounds like a lot of hassle when you should really just find a place with more space to work with!

Sam.
-
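Since the repack needs room for a second pack while the old one still exists, it can help to look at the current pack size and the free space before starting. A rough sketch, assuming "git count-objects -v" reports size-pack in KiB and that "df -Pk" prints the available space in its fourth column; the small repository and the placeholder identity are only there to make the example runnable.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"
echo hi > there
git add there
git commit -q -m 'Small repo'
git gc -q                                   # pack what we have so there is a pack to measure

# Size of all packs, in KiB, as reported by git itself.
size_pack=$(git count-objects -v | sed -n 's/^size-pack: //p')

# Free space on the filesystem holding the repository, in KiB.
avail=$(df -Pk . | awk 'NR==2 {print $4}')

echo "packs: ${size_pack} KiB, free space: ${avail} KiB"
```

As a rule of thumb following from the above, the worst case for a repack is roughly one more copy of the existing packs.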
It seems odd to me, by the way, that filter-branch has its own home-grown backup mechanism. Lots of other commands can "lose" commits, but none of them keep an extra backup like this. And I find it tedious for quicker jobs which it might otherwise be useful for (e.g. rewrites of commits in my tree not yet in upstream), unless I wrap it in a script that cleans up after itself. --b. -
Hi, The rationale was this: filter-branch recently learnt how to rewrite many branches, and it might be tedious to find out which ones. But then, there is git log --no-walk --all, so maybe I really should get rid of refs/original/*? I'd like to have some comments from the heavier filter-branch users on that... Ciao, Dscho -
IMHO, a backup of the original refs is needed. However, it may be wise to store them in the refs/heads namespace so that 'git branch -d' can delete them and 'git branch -m' can move them back if something went wrong. -- Hannes -
If people want backups like this it'd seem easier to turn this on optionally with commandline switches, like patch's --backup, --prefix, --suffix options. Having it by default leave these backups around, even when everything succeeds, makes for unnecessary cleanup work in the normal case, and is inconsistent with the behavior of other git commands that destroy or rewrite history. --b. -
I think what makes git-filter-branch different is that you can change a large amount of history with git-filter-branch, including large numbers of tags, etc. The reflog is quite sufficient to recover from a screwed up "git commit --amend". But I don't think the reflog is going to be sufficient given the kinds of changes that git-filter-branch can potentially do to your repository.

Maybe the default of --backup vs --no-backup could be changed via a config parameter, but I think the default of backing up refs is a good thing....

Perhaps a solution would be to add "git-filter-branch --cleanup" that clears the reflog and wipes the backed up tags; perhaps first asking interactively if the user is really sure he/she wants to do this.

- Ted
-
Yeah, it's clearly designed with rewriting a whole repo in mind. It might also be handy, though, as a quick way to rewrite a single branch. (E.g., "add 'Acked-by: Joe' to everything in 'for-upstream' not in 'origin'", or "rename foo to bar in every commit in 'topic' not in 'origin'".). I find the current defaults awkward for that case. Maybe it'd make Maybe. --b. -
Hi, FWIW after reading Bruce's reasoning, I tend towards having no "backups" by default (I say "backups", since they are _only_ written when the respective branch has changed). And I do not think that the config variable is a good approach; if you want backups or not is a per-case decision. So your proposal would only result in even more confusion. My preference ATM is to write nothing per default, but only when --original <namespace> was given. Ciao, Dscho -
git-filter-branch makes a backup of your original references:
$ git filter-branch --help
...
Always verify that the rewritten version is correct: The original refs,
if different from the rewritten ones, will be stored in the namespace
refs/original/.
...
These backup refs keep your big files in the repository.
-
