Re: [PATCH 1/3] repack: modify behavior of -A option to leave unreferenced objects unpacked

Previous thread: git filter-branch --subdirectory-filter by James Sadler on Thursday, May 8, 2008 - 9:01 pm. (10 messages)

Next thread: Java Git (aka jgit) library switching license to BSD/EPL by Shawn O. Pearce on Thursday, May 8, 2008 - 10:11 pm. (7 messages)
To: <git@...>
Date: Thursday, May 8, 2008 - 9:41 pm

Here's what I was thinking (posted using gmane):

diff --git a/git-repack.sh b/git-repack.sh
index e18eb3f..064c331 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -30,7 +30,7 @@ do
        -n)     no_update_info=t ;;
        -a)     all_into_one=t ;;
        -A)     all_into_one=t
-               keep_unreachable=--keep-unreachable ;;
+               keep_unreachable=t ;;
        -d)     remove_redundant=t ;;
        -q)     quiet=-q ;;
        -f)     no_reuse=--no-reuse-object ;;
@@ -78,9 +78,6 @@ case ",$all_into_one," in
        if test -z "$args"
        then
                args='--unpacked --incremental'
-       elif test -n "$keep_unreachable"
-       then
-               args="$args $keep_unreachable"
        fi
        ;;
 esac
@@ -116,7 +113,15 @@ for name in $names ; do
                echo >&2 "old-pack-$name.{pack,idx} in $PACKDIR."
                exit 1
        }
-       rm -f "$PACKDIR/old-pack-$name.pack" "$PACKDIR/old-pack-$name.idx"
+       rm -f "$PACKDIR/old-pack-$name.idx"
+       test -z "$keep_unreachable" ||
+         ! test -f "$PACKDIR/old-pack-$name.pack" ||
+         git unpack-objects < "$PACKDIR/old-pack-$name.pack" || {
+               echo >&2 "Failed unpacking unreachable objects from old pack"
+               echo >&2 "saved as old-pack-$name.pack in $PACKDIR."
+               exit 1
+       }
+       rm -f "$PACKDIR/old-pack-$name.pack"
 done
 
 if test "$remove_redundant" = t
@@ -130,7 +135,18 @@ then
                  do
                        case " $fullbases " in
                        *" $e "*) ;;
-                       *)      rm -f "$e.pack" "$e.idx" "$e.keep" ;;
+                       *)
+                               rm -f "$e.idx" "$e.keep"
+                               if test -n "$keep_unreachable" &&
+                                  test -f "$e.pack"
+                               then
+                                       git unpack-objects < "$e.pack" || {
...
To: Brandon Casey <drafnel@...>
Cc: <git@...>
Date: Friday, May 9, 2008 - 12:19 am

I like it. It makes an easy rule to say "packed objects _never_ get
pruned, they only get demoted to loose objects." And then of course

Yeah, that's what it looks like to me (that the first unpack is
unnecessary, because we will just be putting the new pack into place
that has all the same objects). AIUI, two packs with identical hashes

I think the extra two weeks is fine.

-Peff
--
To: Jeff King <peff@...>
Cc: Brandon Casey <drafnel@...>, <git@...>
Date: Friday, May 9, 2008 - 11:00 am

Isn't there an issue with the "git gc" triggering because there
may be too many loose unreferenced objects?
Still, I do like the approach.

Maybe unreferenced objects and old refs should go to a .git/lost+found
directory and be expired from there. This has a couple of benefits:

   -  Easy to manually inspect or blow away any crud
   -  One git-gc run can make one pack in lost+found,
      avoiding huge numbers of loose objects (and massive disk use)
      when trying to do a large cleanup (to possibly reclaim disk space)
   -  Objects will not be accessible by ordinary git commands for a
      while, before they are really removed, avoiding surprises

Only some tools would look in the lost+found to restore stuff.

   -Geert
--
To: Geert Bosch <bosch@...>
Cc: Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 11:14 am

This would be an argument for going the extra mile and having the loose
objects adopt the timestamp of their pack file. In the normal case they

Unreferenced objects are sometimes used by other repositories which have
this repository listed as an alternate. So it may not be a good idea to
make the unreferenced objects inaccessible.

-brandon
--
To: Brandon Casey <casey@...>
Cc: Geert Bosch <bosch@...>, Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 12:12 pm

Well, not necessarily.  If you created a large branch yesterday and you
are deleting it today, then having repacked in between means that those
loose objects won't be more than one day old.  Yet there could be enough
of them to trigger auto gc.  But that auto gc won't pack those objects 
since they are unreferenced.  Hence auto gc will trigger all the time 

Nah.  If this is really the case then you shouldn't be running gc at all 
in the first place.


Nicolas
--
To: Nicolas Pitre <nico@...>
Cc: Brandon Casey <casey@...>, Geert Bosch <bosch@...>, Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 6:33 pm

True.

I think the true motivation behind --keep-unreachable is not about the
shared object store (aka "alternates") but about races between gc and
push (or fetch).  Before push (or fetch) finishes and updates refs, the
new objects they create would be dangling _and_ the objects these dangling
objects refer to may be packed but unreferenced.  Repacking unreferenced
packed objects was a way to avoid losing them.

--
To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 8:07 pm

I feel like the current approach of (not very well) keeping track of
which objects are still needed is very messy, not very well defined or
based on specific solid principles, and prone to errors and losing
objects.

Things like git clone --shared can only really be used in extremely
specialized setups, or if pruning of unreferenced objects is completely
disabled in the source repository, or if specialized scripts are used to
do the garbage collection that take into account the references of the
"child" repository.  It is my impression that even repo.or.cz, while it
has some safeguards, does not handle garbage collection completely
safely.  It would probably be very useful to have examples of such
scripts in contrib.

I think that ultimately, some general purpose and reliable solution
needs to be found to handle the cases of (1) a repository having its
objects referenced by another via info/alternates; (2) a repository with
multiple working directories (presumably this should warn/error out
unless given a force option/detach head and warn if you try to switch
HEAD for some working directory to the same branch as some other working
directory).  It seems, btw, that a third type of clone, one which merely
symlinks the objects directory, would also be useful, once there is a
solution to the robustness issue.  This would be a case (3) that needs
to be handled as well.

It seems clear that ultimately, to handle these three cases, every
repository needs to know about every other repository, probably via a
symlink to other repository's .git directory.  Git gc would then also
examine any refs in this directory, making sure to avoid circular
references that might result from following the symlinks.  It should
also probably error out if it finds a symlink that doesn't point to a
valid git repository, because such a symlink either refers to a
now-deleted repository for which the symlink needs to be cleaned up, or
it refers to a repository that was moved and therefore the symlink needs
...
To: Jeremy Maitin-Shepard <jbms@...>
Cc: Junio C Hamano <gitster@...>, Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 8:20 pm

I believe we partially considered that in the past and discarded it
as far too complex implementation-wise for the benefit it gives us.

The current approach of leaving unreachable loose objects around
for 2 weeks is good enough.  Any Git process that has been running
for 2 weeks while still not linking everything it needs into the
reachable refs of that repository is already braindamaged and
shouldn't be running anymore.
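
To make the two-week window concrete: the age that matters is the
modification time of each loose object file.  A minimal sketch in shell,
assuming the usual layout of loose objects under objects/xx/; the
function name old_loose_objects is made up for this example, and it
looks at age only, not at reachability (which git prune also checks):

```shell
# Usage: old_loose_objects <git-dir> <days>
# Prints loose object files last modified more than <days> days ago.
# Hypothetical helper for illustration; it never deletes anything.
old_loose_objects() {
	# Loose objects live in two-hex-digit fan-out directories, so
	# objects/pack and objects/info are not matched by this pattern.
	find "$1/objects" -type f \
		-path '*/objects/[0-9a-f][0-9a-f]/*' -mtime +"$2"
}
```

Calling old_loose_objects .git 14 lists the candidates a two-week
expiry would consider; find rounds ages down to whole days.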

If we are dealing with a pack file, those are protected by .keep
"lock files" between the time they are created on disk and the
time that the git-fetch or git-receive-pack process has finished
updating the refs to anchor the pack's contents as reachable.
Every once in a while a stale .keep file gets left behind when a
process gets killed by the OS, and it's damn annoying to clean up.

I'd hate to clean up logs from every little git-add or git-commit
that aborted in the middle uncleanly.

-- 
Shawn.
--
To: Shawn O. Pearce <spearce@...>
Cc: Jeremy Maitin-Shepard <jbms@...>, Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 9:21 pm

How would that solve the issue that you should not prune/gc the repository
"clone --shared" aka "alternates" borrows from?

By the way, I do not think your "git-commit stopped for two weeks due to a
long editing session of the commit message" should result in any object
lossage, as the new objects are all reachable from the index, and neither
the new tree nor the new commit has been built while you are typing
(rather, not typing) the log message.

Hmm, a partial commit that uses a temporary index file may lose, come to
think of it.  Perhaps we should teach reachable.c about the temporary
index file as well.  I dunno.
 
--
To: Junio C Hamano <gitster@...>
Cc: Shawn O. Pearce <spearce@...>, Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 9:51 pm

The log files are only for handling in-progress commands editing the
repository.  I also describe in the first part of the e-mail a possible
solution to that issue as well as the issues created by having multiple
working directories:

When you create a new working directory, you would also create in the
original repository a symlink named
e.g. orig_repo/.git/peers/<some-arbitrary-name-that-doesn't-matter> that
points to the .git directory of the newly created working directory.
git clone --shared would likewise create such a link in the original
repository.  There could be a separate simple command to "destroy" a
repository created via clone --shared or via new-work-dir that would
simply remove this "peer" symlink from any repositories it shares from,
and then rm -rf the target repository.  The list of repositories that a
given target repository shares from would be discovered using perhaps
several different methods, depending on whether it is a new work dir, an
actual separate repository, or the new type of "shared" repository I
suggested in my original e-mail, namely one that has its own refs but
completely shares the object store of the original repository, e.g. via
a symlink to the original repository's objects directory.  In any case, I
believe the information to go "upstream" is already available, and we
just need to add those "peer" symlinks in order to be able to go
"downstream".

There could also be a simple git command to move a repository that would
take care of updating all of the references that other repositories have
to it.  Currently it is not possible to write such a command, because
the "downstream" links are not stored, but with these added symlinks it
would be possible.
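
The bookkeeping for the "peer" symlinks could be sketched roughly as
follows.  This is only an illustration under stated assumptions: the
peers/ directory is the one proposed above, the function names
register_peer and broken_peers are made up, peer paths are assumed to
be absolute, and "looks like a repository" is reduced to a crude check
for HEAD and objects/.

```shell
# Usage: register_peer <source-git-dir> <peer-git-dir> <name>
# Records a "downstream" repository inside the source repository.
register_peer() {
	src=$1
	peer=$2
	name=$3
	mkdir -p "$src/peers" &&
	ln -s "$peer" "$src/peers/$name"
}

# Usage: broken_peers <source-git-dir>
# Prints the name of every peer link whose target no longer looks
# like a git directory (deleted or moved repository).
broken_peers() {
	src=$1
	for link in "$src"/peers/*
	do
		# Skip the unexpanded glob when peers/ is empty or missing.
		test -L "$link" || continue
		if ! test -f "$link/HEAD" || ! test -d "$link/objects"
		then
			basename "$link"
		fi
	done
}
```

A gc wrapper could refuse to prune whenever broken_peers prints
anything, leaving it to the user to say whether the peer was deleted
or moved.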

As I said in my previous e-mail, if git gc finds any broken symlinks
(i.e. symlinks that point to invalid repositories), it would error out,
because user attention is required to specify whether the symlinks
correspond to deleted repositories, or to repositories that have been

Well, providin...
To: Jeremy Maitin-Shepard <jbms@...>
Cc: Junio C Hamano <gitster@...>, Shawn O. Pearce <spearce@...>, Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, <git@...>
Date: Saturday, May 10, 2008 - 1:25 am

That assumes you _can_ write to the original repository. That may or may
not be the case, depending on your setup.

-Peff
--
To: Jeff King <peff@...>
Cc: Junio C Hamano <gitster@...>, Shawn O. Pearce <spearce@...>, Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, <git@...>
Date: Saturday, May 10, 2008 - 1:36 am

Well, I suppose in that case it could print a warning or maybe fail
without some "force" option.  If you can't write to the repository, then
I think it is safe to say that it will never know or care about you, so
you will fundamentally have a fragile setup.  I'd say that except in
very special circumstances, you are better off just not sharing it at
all.

Consider, for instance, that even if the repository that you are sharing
from never deletes branches and never does non-fast-forward updates of
references, it could very well happen to have, due to some temporary
operation, some unreferenced object that happens to be exactly the same
object that you want to add to your repository.  Because you've listed
it in info/alternates, you won't write that object to your own
repository, but then the source repository will very likely garbage
collect the object at some later point, corrupting your repository.

-- 
Jeremy Maitin-Shepard
--
To: Jeremy Maitin-Shepard <jbms@...>
Cc: Jeff King <peff@...>, Junio C Hamano <gitster@...>, Shawn O. Pearce <spearce@...>, Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, <git@...>
Date: Saturday, May 10, 2008 - 5:04 am

Hi,


FWIW this argument can be found in the mailing list.  It does not have to 

Counterexample kernel.org.  Counterexample repo.or.cz.

Hth,
Dscho

--
To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Jeff King <peff@...>, Junio C Hamano <gitster@...>, Shawn O. Pearce <spearce@...>, Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, <git@...>
Date: Saturday, May 10, 2008 - 12:24 pm

Maybe you can point me at the relevant thread.  Fundamentally, though,
I'd say objects/info/alternates _cannot_ work reliably without the
source repository knowing about the objects that the sharing
repositories need.  Otherwise, there is no way for it to know not to
prune them.  The only way for it to have that information in general is
to write it in the repository.  In a site-specific setting, it may
indeed be possible to rely on some site-specific database, but that is
not particularly relevant.

Currently repository sharing seems to be used in many cases in quite
unsafe ways.  It may seem unfortunate that doing things the "safe way"
is much more of a hassle and doesn't work in certain environments, but
I'd say that is just the way things have to be.

Perhaps you can point me to an existing thread that addresses this idea,

repo.or.cz is not a counterexample.  It is completely "managed", and
could quite easily implement the approach I described.  I don't know
exactly how kernel.org works, but I imagine likewise some setuid helper
script could be provided to write these symlinks.

There is the issue that these setuid helper scripts would mean at the
very least that if user A can "fork" user B's repository, then to some
extent user B can make user A use large amounts of disk space
(i.e. exceed his quota or something) by just referencing a bunch of
temporary objects that user A happens to have in his repository, and it
would take careful examination of the git repository to actually figure
out that it is user B's fault.  I don't think this would be a
significant problem in practice, though.

-- 
Jeremy Maitin-Shepard
--
To: Jeremy Maitin-Shepard <jbms@...>
Cc: Jeff King <peff@...>, Junio C Hamano <gitster@...>, Shawn O. Pearce <spearce@...>, Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, <git@...>
Date: Sunday, May 11, 2008 - 7:11 am

Hi,


Unfortunately, a quick search did not turn up anything useful.  Maybe you 

Half true... you said "if you can't write to the repository..." and on 

Well, I think that the setuid helper script would open a whole bunch of 
other issues.

I think that the shared repository problem is rather a semantic one, i.e. 
it is only solvable between the owners of the repository by good-ole 
talking, not something that can be solved by the tool (Git).

Ciao,
Dscho
--
To: Shawn O. Pearce <spearce@...>
Cc: Junio C Hamano <gitster@...>, Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 8:43 pm

It doesn't seem all that complex, and I'd say that fundamentally it is
the _correct_ way to do things.  Being sloppy is always easier in the
short run, but then either means the system is permanently broken or
results in a lot of "fixing up" work later.  I think almost all of the
work of handling these log files could be done without impacting a lot
of code that calls the relevant APIs that would actually use the log
files.  I think the biggest impact would be on non-C code, but even for
that code, an appropriate wrapper could be used to avoid having to make

This sort of reasoning just leads to an inherently unreliable system.
Sure, two weeks might seem good enough for nearly all cases, but why
_shouldn't_ I be able to leave my editor open for two weeks before
typing in my commit message and finishing the commit, or wait for two
weeks in the middle of a rebase (it seems that in the new
implementation, temporary refs are created basically to do the same
thing as the log file I described.)  I could easily be typing up my
commit message, then switch to something else, and happen not to come
back to it for two weeks.

Because such a "timeout" based solution isn't really the "correct
solution" but will work most of the time, potential problems won't be
noticed while testing.

Another significant issue is that this timeout means that unreferenced
junk has to stay around in the repository for two weeks for no (good)

First of all, merely exiting due to an error should not cause log files
to be left around.  The only thing that should cause log files to be
left around is kill -9 or a system crash.  Second, by storing the
process id and a timestamp of when the log file was created, it is
possible to reliably determine if a log file is stale.
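
A rough sketch of such a staleness check in shell.  The log format (one
line holding "<pid> <unix-timestamp>") and the name log_is_stale are
assumptions made for this example only.  The sketch tests just whether
the recorded process still exists; using the timestamp to rule out pid
reuse is left out, and kill -0 can also fail for a live process owned
by another user.

```shell
# Usage: log_is_stale <log-file>
# Exit status: 0 if the writer is gone (stale), 1 if it is still
# running, 2 if the log file cannot be read.
log_is_stale() {
	# "started" is parsed but not used in this sketch.
	read -r pid started <"$1" || return 2
	# kill -0 delivers no signal; it only checks that the pid exists
	# and may be signalled by the caller.
	if kill -0 "$pid" 2>/dev/null
	then
		return 1
	fi
	return 0
}
```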

-- 
Jeremy Maitin-Shepard
--
To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Brandon Casey <casey@...>, Geert Bosch <bosch@...>, Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 7:09 pm

This is what the log history seems to indicate:

	git log -p --grep=keep-unreach

So pack-objects --keep-unreachable was implemented in order to add repack -A,
which now doesn't need --keep-unreachable, and can become obsolete.

Which is just as well, since --keep-unreachable never made it to the
man pages. :-)

If I understand things correctly, there is no user-friendly way to add
loose, unreachable objects to a pack.  This whole architecture was just
to prevent a repack from silently deleting things.

If this is right, the patch below updates the docs.

- Chris


From 443b1201d54f0b7197d18779ce934823e9897b36 Mon Sep 17 00:00:00 2001
From: Chris Frey <cdfrey@foursquare.net>
Date: Fri, 9 May 2008 19:08:26 -0400
Subject: [PATCH] Updating documentation to match Brandon Casey's proposed git-repack patch.

This patch clarifies the git-prune man page, documenting that it only
prunes unpacked objects.  git-repack is documented according to
the new git-repack -A behaviour, which does not depend on
git-pack-objects --keep-unreachable anymore.

Signed-off-by: Chris Frey <cdfrey@foursquare.net>
---
 Documentation/git-prune.txt  |    5 ++++-
 Documentation/git-repack.txt |   14 +++++++++++++-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-prune.txt b/Documentation/git-prune.txt
index f92bb8c..3178bc4 100644
--- a/Documentation/git-prune.txt
+++ b/Documentation/git-prune.txt
@@ -18,12 +18,15 @@ git-prune. See the section "NOTES", below.
 
 This runs `git-fsck --unreachable` using all the refs
 available in `$GIT_DIR/refs`, optionally with additional set of
-objects specified on the command line, and prunes all
+objects specified on the command line, and prunes all unpacked
 objects unreachable from any of these head objects from the object database.
 In addition, it
 prunes the unpacked objects that are also found in packs by
 running `git prune-packed`.
 
+Note that unreachable, packed objects will remain.  If this is
+not d...
To: Nicolas Pitre <nico@...>
Cc: Geert Bosch <bosch@...>, Jeff King <peff@...>, <git@...>
Date: Friday, May 9, 2008 - 12:54 pm

That's true, but the intermediate repack is not the cause here. You'd be
in the same situation if a large branch was created yesterday and then
deleted today even if packing had never occurred.

I do see your point, but you should have said a large branch created a month
ago, deleted today, but repacked yesterday. :)

-brandon

--
To: Brandon Casey <casey@...>
Cc: Geert Bosch <bosch@...>, <git@...>
Date: Friday, May 9, 2008 - 11:53 am

But that is precisely what we're going to do, but in two weeks. Isn't it
better to have the dependent repo fail while the change is recoverable?

-Peff
--
To: Jeff King <peff@...>
Cc: Geert Bosch <bosch@...>, <git@...>
Date: Friday, May 9, 2008 - 11:56 am

good point.

-b

--
To: Brandon Casey <drafnel@...>
Cc: <git@...>
Date: Thursday, May 8, 2008 - 11:21 pm

Neat trick.  Unreachable objects that are only in this pack will get the
current timestamp and a new lease of life for two weeks, and then will
disappear.

--
To: <gitster@...>
Cc: <git@...>, Brandon Casey <drafnel@...>
Date: Saturday, May 10, 2008 - 12:01 am

From: Brandon Casey <drafnel@gmail.com>

The previous behavior of the -A option was to retain any previously
packed objects which had become unreferenced, and place them into the newly
created pack file.  Since git-gc, when run automatically with the --auto
option, calls repack with the -A option, this had the effect of retaining
unreferenced packed objects indefinitely. To avoid this scenario, the
user was required to run git-gc with the little known --prune option or
to manually run repack with the -a option.

This patch changes the behavior of the -A option so that unreferenced
objects that exist in any pack file being replaced will be unpacked into
the repository. The unreferenced loose objects can then be garbage collected
by git-gc (i.e. git-prune) based on the gc.pruneExpire setting.

Also add new tests for checking whether unreferenced objects which were
previously packed are properly left in the repository unpacked after
repacking.

Signed-off-by: Brandon Casey <drafnel@gmail.com>
---
 git-repack.sh                        |   18 +++++++++---
 t/t7701-repack-unpack-unreachable.sh |   47 ++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+), 5 deletions(-)
 create mode 100755 t/t7701-repack-unpack-unreachable.sh

diff --git a/git-repack.sh b/git-repack.sh
index e18eb3f..a0e06ed 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -30,7 +30,7 @@ do
 	-n)	no_update_info=t ;;
 	-a)	all_into_one=t ;;
 	-A)	all_into_one=t
-		keep_unreachable=--keep-unreachable ;;
+		keep_unreachable=t ;;
 	-d)	remove_redundant=t ;;
 	-q)	quiet=-q ;;
 	-f)	no_reuse=--no-reuse-object ;;
@@ -78,9 +78,6 @@ case ",$all_into_one," in
 	if test -z "$args"
 	then
 		args='--unpacked --incremental'
-	elif test -n "$keep_unreachable"
-	then
-		args="$args $keep_unreachable"
 	fi
 	;;
 esac
@@ -130,7 +127,18 @@ then
 		  do
 			case " $fullbases " in
 			*" $e "*) ;;
-			*)	rm -f "$e.pack" "$e.idx" "$e.keep" ;;
+			*)
+				rm -f "$e.idx" "$e.keep"
+				if te...
To: <drafnel@...>
Cc: <gitster@...>, <git@...>
Date: Saturday, May 10, 2008 - 2:03 am

Can we call this something else (like unpack_unreachable) since it now
has nothing to do with the --keep-unreachable flag?


I still like Geert's suggestion of unpacking them to a _different_
place. That helps to avoid spurious "gc --auto" invocations caused by
too many prunable objects. Though it certainly doesn't solve it, and
maybe that just needs to be fixed separately.

Possibly the "gc --auto" test should be:

  - count objects; if too few, exit
  - count unreachable loose objects; if too few, exit
  - run gc

That means having a lot of unreachable objects will still incur some
extra processing, but not as much as a full repack. And it won't bug the
user with a "you need to repack" message.
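
The test outlined above could be sketched as a shell function.  This is
an illustration, not a proposed implementation: the name should_auto_gc
is hypothetical, the counts are passed in rather than computed, and the
second step is read as "exit if too few loose objects remain once the
unreachable ones are discounted", which is an interpretation rather
than something stated above.  The default of 6700 is the
gc_auto_threshold value in builtin-gc.c.

```shell
# Usage: should_auto_gc <loose-count> <unreachable-loose-count> [threshold]
# Exit status 0 means "run gc", 1 means "skip".
should_auto_gc() {
	loose=$1
	unreachable=$2
	threshold=${3:-6700}
	# Step 1: cheap count of all loose objects.
	test "$loose" -gt "$threshold" || return 1
	# Step 2: discount loose objects a repack would not pick up.
	test $((loose - unreachable)) -gt "$threshold" || return 1
	# Step 3: enough packable objects remain, so gc is worthwhile.
	return 0
}
```

In a real implementation the second count would only be computed after
the first check passes, since it is the expensive one.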

-Peff
--
To: Jeff King <peff@...>
Cc: <gitster@...>, <git@...>
Date: Sunday, May 11, 2008 - 12:16 am

Actually I initially changed it to unpack_unreachable, and then
changed it back. The reason I did this is because I think
keep_unreachable still describes what is being accomplished, that
unreachables are being kept. When -A is supplied along with -d,
unreachables are kept by being unpacked. When -d is not supplied,
unreachables are kept in their original pack file. If Geert's proposal
or something else is implemented, keep_unreachable may still be


I've got a thought. How about limiting how often auto repack repacks
by looking at the timestamp of the most recent pack? Wouldn't the
packs already be prepared in most cases i.e. prepare_packed_git()

-brandon
--
To: Jeff King <peff@...>
Cc: <gitster@...>, <git@...>
Date: Sunday, May 11, 2008 - 12:51 am

completely untested and hopefully not mangled by google...

actually, this will do nothing for the case where there exist many
loose unreachable objects and no loose reachable objects since we
won't create a new pack with an updated timestamp to compare against.
So git-gc will continue to spin its wheels without getting anywhere.
Could we update the pack timestamp after running git-gc or use a
timestamp from someplace else?

-brandon

diff --git a/builtin-gc.c b/builtin-gc.c
index 48f7d95..16b1455 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -27,6 +27,7 @@ static int aggressive_window = -1;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static char *prune_expire = "2.weeks.ago";
+static time_t gc_auto_pack_frequency = 21600;  /* 6 hours */

 #define MAX_ADD 10
 static const char *argv_pack_refs[] = {"pack-refs", "--all", "--prune", NULL};
@@ -56,6 +57,10 @@ static int gc_config(const char *var, const char *value)
                gc_auto_pack_limit = git_config_int(var, value);
                return 0;
        }
+       if (!strcmp(var, "gc.autopackfrequency")) {
+               gc_auto_pack_frequency = git_config_ulong(var, value);
+               return 0;
+       }
        if (!strcmp(var, "gc.pruneexpire")) {
                if (!value)
                        return config_error_nonbool(var);
@@ -205,6 +210,14 @@ static int need_to_gc(void)
        else if (!too_many_loose_objects())
                return 0;

+       if (gc_auto_pack_frequency) {
+               prepare_packed_git();
+               if (packed_git &&
+                   packed_git->mtime >
+                   approxidate("now") - gc_auto_pack_frequency)
+                       return 0;
+       }
+
        if (run_hook())
                return 0;
        return 1;
--
To: Jeff King <peff@...>
Cc: <drafnel@...>, <gitster@...>, <git@...>
Date: Saturday, May 10, 2008 - 9:10 pm

Depends.  If it has no maintenance cost then we might as well keep it 

Having a separate location for objects seems clunky to me.

And the fundamental problem isn't solved indeed -- you may end up with 

Determining the number of unreachable objects is quite costly, packed or 
not.  So that isn't a good thing to do on every 'git gc --auto' 

The auto gc performs incremental packing most of the time.  And that is 
way faster than figuring out which objects are unreachable.

For example, running 'git prune' in my Linux repo takes 16 seconds, even 
when there is nothing to prune.  Running 'git repack' (with no options, so
as to perform an incremental repack) took less than 2 seconds to pack 541
reachable objects that happened to be loose.

I'm now starting to wonder if there is a reason for keeping unreachable 
objects that used to be packed.  Putting --keep-unreachable aside for 
now, the only way an unreachable object could have entered a pack is if 
it used to be reachable before through the commit history or reflog.  
So if they're not reachable anymore, that's most probably because their 
reflog expired.  So what's the point for keeping them even longer?  
What's the reasoning that led to the creation of --keep-unreachable in 
the first place?


Nicolas
--
To: Nicolas Pitre <nico@...>
Cc: Jeff King <peff@...>, <drafnel@...>, <git@...>
Date: Saturday, May 10, 2008 - 9:23 pm

I think the logic went like this.

(1) You may have rewound your head since you last repacked; blobs and
    trees in the rewound commit are already packed now.

(2) Now you may be fetching (or somebody else may be pushing) a commit
    that contains such blobs and/or trees, and the fetch or push is small
    enough that it unpacks, but the packed and unreachable ones are not
    unpacked.

(3) But before that fetch or push finishes updating the ref, you can race
    with a "repack -a -d".




--
To: <gitster@...>
Cc: <git@...>
Date: Saturday, May 10, 2008 - 12:01 am

Here is a formal patch. I removed the first invocation of
unpack-objects since I think it is unnecessary.

Followed up with mods to git-gc.

-brandon


--
To: <gitster@...>
Cc: <git@...>, Brandon Casey <drafnel@...>
Date: Saturday, May 10, 2008 - 12:01 am

From: Brandon Casey <drafnel@gmail.com>

---
 builtin-gc.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/builtin-gc.c b/builtin-gc.c
index 6db2f51..48f7d95 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -219,7 +219,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	char buf[80];
 
 	struct option builtin_gc_options[] = {
-		OPT_BOOLEAN(0, "prune", &prune, "prune unreferenced objects"),
+		OPT_BOOLEAN(0, "prune", &prune, "prune unreferenced objects (deprecated)"),
 		OPT_BOOLEAN(0, "aggressive", &aggressive, "be more thorough (increased runtime)"),
 		OPT_BOOLEAN(0, "auto", &auto_gc, "enable auto-gc mode"),
 		OPT_BOOLEAN('q', "quiet", &quiet, "suppress progress reports"),
@@ -249,7 +249,6 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 		/*
 		 * Auto-gc should be least intrusive as possible.
 		 */
-		prune = 0;
 		if (!need_to_gc())
 			return 0;
 		fprintf(stderr, "Auto packing your repository for optimum "
-- 
1.5.5.67.g9a49

--
To: <gitster@...>
Cc: <git@...>, Brandon Casey <drafnel@...>
Date: Saturday, May 10, 2008 - 12:01 am

From: Brandon Casey <drafnel@gmail.com>

Now that repack -A will leave unreferenced objects unpacked, there is
no reason to use the -a option to repack (which will discard unreferenced
objects). The unpacked unreferenced objects will not be repacked by a
subsequent repack, and will eventually be pruned by git-gc based on the
gc.pruneExpire config option.
---
 builtin-gc.c |   13 ++-----------
 1 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/builtin-gc.c b/builtin-gc.c
index f99ebc7..6db2f51 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -256,17 +256,8 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			"performance. You may also\n"
 			"run \"git gc\" manually. See "
 			"\"git help gc\" for more information.\n");
-	} else {
-		/*
-		 * Use safer (for shared repos) "-A" option to
-		 * repack when not pruning. Auto-gc makes its
-		 * own decision.
-		 */
-		if (prune)
-			append_option(argv_repack, "-a", MAX_ADD);
-		else
-			append_option(argv_repack, "-A", MAX_ADD);
-	}
+	} else
+		append_option(argv_repack, "-A", MAX_ADD);
 
 	if (pack_refs && run_command_v_opt(argv_pack_refs, RUN_GIT_CMD))
 		return error(FAILED_RUN, argv_pack_refs[0]);
-- 
1.5.5.67.g9a49

--