Re: [PATCH] git-merge-pack

To: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:09 am

So we had a git bof at linux.conf.eu yesterday, and I learnt something
new: even people who have been using git for a long time apparently don't
necessarily realize the importance of repacking.

James Bottomley (the Linux SCSI maintainer) is an old-time BK user, and
very comfy using git. But when he was demonstrating things on his poor old
laptop, simple things like "git branch" literally took a long time, and
James didn't seem to realize that the fact that he had apparently never
ever repacked his repository was a big deal.

The kernel archive is a 190MB pack for me fully repacked (I just checked -
I had actually thought that it was somewhat larger than that), but because
James hadn't repacked, his .git directory was over a gigabyte in size, and
his laptop wasn't able to cache anything at all effectively as a result.

Repacking it took over an hour, simply because everything was *so*
unpacked, and James' kernel repository had something like 92 thousand
loose objects, and several hundred packfiles. Simple operations that
really take much less than a second for me ("git branch" takes 0.022s on
my laptop, which has the same 512M that James had on his) took many many
seconds as a result, and James seemed to think that this was all normal.

And James didn't even want to repack, because it was so expensive (which
he knew - he claims to have never ever repacked at all, but maybe he had
started it and just control-C'd it when it was really slow at some point).

Now, it may be that James didn't realize how important the occasional
garbage collect is exactly *because* he is an old-timer and used BK long
before he used git, and just continued using git simply as a BK
replacement, but it did make me wonder whether maybe this lack of
repacking awareness is fairly common.

I've been against automatic repacking, but that was really based on what
appears to be potentially a very wrong assumption, namely that people
would do the manual repack on their own. If it turns out...
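
A quick way to see whether a repository is in the state described here
(tens of thousands of loose objects, hundreds of packfiles) is to count
both. The following is only a minimal sketch: it assumes the standard
layout of .git/objects, and the function names count_loose and
count_packs are made up for illustration. "git count-objects -v"
reports similar numbers without any scripting.

```shell
# Count loose objects under an objects directory ($1, e.g. .git/objects).
# Assumes loose objects live in two-hex-digit subdirectories.
count_loose() {
    n=0
    for f in "$1"/[0-9a-f][0-9a-f]/*; do
        if [ -f "$f" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# Count packfiles under the same objects directory.
# Assumes packs are stored as pack/*.pack.
count_packs() {
    n=0
    for f in "$1"/pack/*.pack; do
        if [ -f "$f" ]; then n=$((n + 1)); fi
    done
    echo "$n"
}
```

Run from the top of a working tree, "count_loose .git/objects" and
"count_packs .git/objects" each print a single number.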

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:07 pm

I do it from time to time. Seldom in working repositories, because
they usually come and go before they have a chance to accumulate
enough loose objects. I do a partial repack (git repack -d) after
every import from p4 repo, because every snapshot of it is an ugly
mess changing files all over the tree. Sometimes, after I merged a big
chunk with the p4 repo and sent it over (the process involves rebase).

It is usually a conscious decision when to do a repack or gc. The repack
time is seldom a problem: it is fast enough even on windows (and I do
have big repos and binary objects). The gc causes my machines to swap,
though. Some of them heavily, so there my repos stay longer partially
packed. I do use .keep packs for this reason (and because windows or
cygwin or both have more problems with big files than they have with
small).

I used to clone repos with "-s", but quickly stopped after a few
broken histories. This also taught me to think before running
"git gc" or "git repack -a -d".

On rare occasions I even use "git repack -a -d -l" and "git
pack-refs" separately.

This was all specific to my day-job. At home, on linux systems I just
run git-gc whenever I please, without even thinking why. It finishes
mostly in less than a minute (the kernel: ~40-50 sec on my P4 2.6GHz, 1Gb).
-

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 1:44 pm

Well, this may just prove I'm an idiot, but one of the reasons I rarely
run it is that I have trouble remembering exactly what it does; in
particular,

- does it prune anything that might be needed by a repo I
cloned with -s?
- is there anything that's unsafe to do while the git-gc is
running?
- what are the implications for http users if this is a public
repo?
- is git-gc enough on its own or should I be running something
more aggressive occasionally too?

No doubt they all have simple answers, which probably amount to "just
don't worry about it", and which I could have found in less time than
it'd take to write this email. But when I've got other work to do,
reading "man git-gc" is just enough effort for me to postpone the whole
thing to another day.

So, anyway, your message reminded me to run git-gc on my main working
repo. At which point one of my personal scripts immediately started
failing--it was assuming it could find any ref under .git/refs/, and I
hadn't realized (or maybe I had once, and I'd forgotten) that git-gc
packs refs by default now.

Bah. I don't know what the moral of that story is.
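
The failure mode described here can be illustrated with a small sketch.
It assumes the usual packed-refs layout (one "<object-name> <refname>"
pair per line, plus header lines starting with '#' and peeled-tag lines
starting with '^'); the function name resolve_ref is made up. It is not
meant as a replacement for asking git itself: "git rev-parse --verify
<ref>" or "git show-ref" look in both places already.

```shell
# Resolve a fully qualified ref name (e.g. refs/heads/master) inside the
# git directory $1, looking first for a loose ref file and then in
# packed-refs. Illustration only: symbolic refs are not followed.
resolve_ref() {
    gitdir=$1 ref=$2
    if [ -f "$gitdir/$ref" ]; then
        # Loose ref: the file holds the object name.
        cat "$gitdir/$ref"
    elif [ -f "$gitdir/packed-refs" ]; then
        # Packed ref: print the first field of the line whose second field
        # is the ref name; exit non-zero if no such line exists.
        awk -v r="$ref" '$2 == r { print $1; found = 1 } END { exit !found }' \
            "$gitdir/packed-refs"
    else
        return 1
    fi
}
```

A script that only does the first branch works until the refs are
packed, which is the situation the message above ran into.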

--b.
-

To: J. Bruce Fields <bfields@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 2:46 pm

YES! yikes.

This is about the best argument put forth so far for not automatically
running git-gc. Personally, I think git-gc should not remove unreferenced
objects without --prune (but I haven't done anything about it). But even
if git-gc was modified in this way, an occasional git-gc --prune would
still be necessary to remove all of the unreferenced and dangling objects
safely with a human thinking about the shared repo implications (unless
shared repo handling is modified).

-brandon

-

To: Brandon Casey <casey@...>
Cc: J. Bruce Fields <bfields@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:09 pm

Well, it could also mean that if git finds a dead symbolic link when
looking up an object, it should check the corresponding link target
directory for a pack file with the respective object... and if it
finds such a pack file, create a link to it and use it.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-

To: David Kastrup <dak@...>
Cc: J. Bruce Fields <bfields@...>, Linus Torvalds <torvalds@...>, Brandon Casey <casey@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:20 pm

The problem here is that the clone could be having refs on objects from
the origin that don't have refs left there. git-gc might, at some point,
prune these refs, and the clone would have dangling refs. That could
easily happen, for example, if you rebase a branch in the origin, but
still have a clone with the original branch.

Mike
-

To: David Kastrup <dak@...>
Cc: Linus Torvalds <torvalds@...>, Brandon Casey <casey@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:13 pm

One of the two of us is very confused about what "git-clone -s" does.
See the git-clone man page. I don't think symlinks are involved.

--b.
-

To: J. Bruce Fields <bfields@...>
Cc: Linus Torvalds <torvalds@...>, Brandon Casey <casey@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:43 pm

Guilty as charged.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 12:47 pm

Hi,

I am very new to git but I have thought about this a bit from a user's
perspective. I have several thoughts on the matter.

First, I would like to point out that the hg folks like to compare
themselves to git a lot and they list the need for manual gc as a
reason to choose hg over git. This may not be something that the git
community cares about but I thought I would point it out.

Second, it *is* a hassle. When trying to figure out what I could
convince my co-workers to use, having to gc was something that I did
not think they would be conscious of or care enough about to do. It
makes git more of a PITA than it could be. Similarly, I have no idea
when it is a good time to do a gc. After every commit? Before push?
What if I never push a repo? What if it is a remote repo only used to
sync up with my co-workers, do I have to go there and periodically gc?
This is one reason why I really think that gc should be *plumbing*
and *not* porcelain.

The user should never have to trigger a gc, they should even be
discouraged from doing so. That is how other gc systems are. Can you
imagine if you had a Java app that had a button on it to do a gc?
When should I push it? Should I wait till the system is getting slow
or just start spamming the button whenever I'm bored? I know that
Java/c#/py GC are different than git gc, but they fulfill the same
basic purpose as git gc. IE to clean up unused items and free up
resources. Git additionally may do some re-optimization, but that is
not relevant to a user.

I know this goes against the general mood here (which seems to be
against auto-gc) but I thought I would give my $.02 as a user of git.

Thanks,
Govind.

-

To: Govind Salinas <govindsalinas@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 1:35 pm

That's a good way to think of it IMO. It's a low-level operation (albeit
one that encapsulates other, lower-level ones) that tells git to
rearrange its internal data structures. It is not something that has any
user-visible effect. Every other porcelain-level git command *does
something* from the user's point of view. Running git-gc is basically a
no-op, which from the user's point of view makes it a waste of
keystrokes and an annoying distraction from focusing on the stuff

I'll play devil's advocate for a moment here, though, and say that, as
others have suggested in this thread, git could be made to tell you when
it's appropriate to run gc. So the "I don't know when to run it"
argument isn't a hard one to address.

With that in mind, here's what the message should look like IMO:

---
Your repository can be optimized for better performance and lower disk usage.
Please run "git gc" to optimize it now, or run "git config gc.auto true" to tell
git to automatically optimize it in the future (this will launch processes in the
background.) For more information, "man git-gc".
---

And that "gc.auto" config option (just an arbitrary name, call it
something else if that's no good) actually has four settings:

warn (the default) - prints the warning message, at most once every N
minutes (we can determine a good value for N)
true - launches git-gc in the background as needed
false - suppresses the warning and the check that triggers the warning
foreground - launches git-gc in the foreground as needed (to make it
easier to abort)
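
As a rough sketch of how the four proposed settings might be dispatched
(everything here is hypothetical: the function name, the action words it
prints, and the "yes"/"no" flag are made up, and the once-every-N-minutes
rate limit for the warning is not modeled):

```shell
# $1 = value of the proposed gc.auto setting (empty means the default, warn)
# $2 = "yes" if the repository currently looks like it needs a gc
# Prints one of: none, warn, background, foreground.
gc_auto_action() {
    setting=${1:-warn} needs_gc=$2
    # Nothing to do when the repository is in good shape. (With "false",
    # the proposal would skip even this check.)
    [ "$needs_gc" = yes ] || { echo none; return; }
    case "$setting" in
        warn)       echo warn ;;        # print the message shown above
        true)       echo background ;;  # launch git-gc in the background
        foreground) echo foreground ;;  # launch git-gc in the foreground
        false)      echo none ;;        # suppress everything
        *)          echo warn ;;        # unknown values fall back to the default
    esac
}
```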

I don't buy the "git gc takes too much memory to run in the background"
argument as a reason automatic git-gc is a bad idea. Many of us (me
included) work on machines with plenty of memory to launch a background
git-gc without hampering our development work, and/or on repositories
small enough that it doesn't eat that much memory in the first place.
And if you make it an option that the user has to enable, people on
low-memory mac...

To: Steven Grimm <koreth@...>
Cc: Govind Salinas <govindsalinas@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 2:28 pm

It certainly has a sysadmin-visible effect. Repack a couple of big git
repositories and that's a backup tape gone if you do incremental
backups: and you can't *not* back up the pack files, even though a lot
of the state in them is recoverable from elsewhere on the net: the stuff
which is not recoverable is tangled up with the stuff which is.

(of course the solution here was .keep files. I cheered when they were
introduced and started rolling git out everywhere I could. There's just
one last vast repository maintained by a horrible shell script layered
atop SCCS which I have to find some way to convert...)
-

To: Govind Salinas <govindsalinas@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 1:19 pm

I'll throw my opinion in here as well. I think git should
automatically do repacking by default, (once loose objects exceed some
threshold). There have been several posts in this thread from people who
don't want auto-gc, but these same people should be able to avoid it,
and likely without changing habits. That's because:

* They're already in the habit of manually repacking every once in a
while, (or, like Linus, much more often than strictly necessary).

* They've already got cron jobs set up to do the repacking.

And one could augment this with an option to disable the repacking of
course.

And if you're really concerned about people that don't want this
getting it anyway, just determine some useful threshold and then
double it or so before it triggers automatic repacking, (so the
automatic repacking hits only us idiots that completely neglect it).

[Pardon me for continuing to quote in the original top-posted order,
but I like the flow here.]

I know it was surprising to you, Linus, but I'm glad you noticed
it. I've seen the same thing from many users. And git actually
discourages users from learning about repacking. If the user starts
with a small (or new) project, then everything performs well, and
there's no performance problem whatsoever.

So then the problems creep up gradually, and the user has no idea that
he should be doing anything different than he's always done. Instead
the user is left to just conclude that git's performance isn't scaling
well as the project grows. That's a bad conclusion of course, and it's
bad that git sets things up so the user reaches that conclusion.

I don't think the warning message alone is a good fix. I think the
people who would understand the warning and appreciate that they could
then take care of repacking as convenient are the same people that
already understand the repacking concept, and are likely already
repacking occasionally, (so would likely never see the warning).

But the problematic case is the user who knows not...

To: Carl Worth <cworth@...>
Cc: Govind Salinas <govindsalinas@...>, Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 1:55 pm

(my 2 cents as another ordinary new git user)
Hmm, not necessarily. That a system knows what the best action is
doesn't mean that _right now_ is the best time to take that action.
One subtle difference I think between git's gc and Java/python/etc.'s
gc is that in the latter case it is, at least metaphorically, a life
and death situation - if gc isn't run, the application will run out of
memory, whereas in git, it's more of a performance degradation issue,
which, sort of, can wait.

On the issue of implementation awareness, a warning message saying
something along the lines of "your repository is getting slower. You
might want to consider running 'git gc', and remember to do that from
time to time." is not much different from "your file system is getting
slower. You might want to consider running <whatever-defrag-tool>, and
remember to do that from time to time."

Neither these messages nor the actions they propose _require_ users to
learn what "repacking", "loose object", or "file fragments" are about
before they can proceed.

Cheers.
--
Jing Xue

-

To: <git@...>
Date: Wednesday, September 5, 2007 - 4:16 am

Can it be that getting rid of unused objects is harder once they are
packed? If that is the case, an automatic pack while mucking about
with temporary branches and/or confidential files would be quite a
nuisance.

Automatic packing maybe would be acceptable if packing was really
transparent to what you do with your repo (including janitoring work).
And it would be nice if automatic packing could be done in an
incremental manner, not bogging down normal work.

--
David Kastrup

-

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:42 am

Well independently from the fact that one could suppose that users
should use gc on their own, the big nasty problem with repacking is that
it's really slow. And I just can't imagine git that I use to commit
blazingly fast, will then be unavailable for a very long time (repacks
on my projects -- that are not as big as the kernel but still -- usually

I do, when I'm bored and can't get things done. You know, it
has become one of my many twitches when I have an empty tty in front of
me and that I'm doing nothing useful. Though, when I'm in a hack-attack,
well I don't necessarily remember to repack. I'm in one of the (not so
many ?) very lucky companies (yay start-ups) where I could show that git
was very superior, and we now use it as our sole SCM. So when I'm in a
hack attack, it's usually that it's a busy week, and that new patches,
trees, objects (and sometimes with large binary things in it) flows like
hell. And the repository grows larger and larger. Well, the way we chose
to avoid the "I'm coding don't bother me with administrivia"-attitude is
that our users use a small cron that basically runs git gc each day, and
an aggressive repack (with a window of 50 or 100 I don't remember) each
weekend in a cron. Because the best criterion to repack a repository
is: when there is no-one on the computer.
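
A setup along these lines could look roughly like the sketch below. The
exact flags and the window of 50 are guesses (the message only says "50
or 100"), and the function names and the one-path-per-line input format
are made up for illustration.

```shell
# Print the maintenance command for a day of the week, given as printed by
# "date +%u" (1 = Monday .. 7 = Sunday). Weekends get a full repack that
# recomputes deltas with a larger window; weekdays get a plain "git gc".
maintenance_command() {
    case "$1" in
        6|7) echo "git repack -a -d -f --window=50" ;;
        *)   echo "git gc" ;;
    esac
}

# Run today's command in every repository listed on stdin, one path per line.
# Intended to be called from a nightly cron job.
run_maintenance() {
    cmd=$(maintenance_command "$(date +%u)")
    while IFS= read -r repo; do
        (cd "$repo" && $cmd)
    done
}
```

A crontab entry would then source this file and feed run_maintenance a
list of repository paths at a time when nobody is at the machine.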

It has proven quite good, as we have never seen a repository explode
in a day, even after some funny mistakes where people rebase some big
parts of the tree many times, generating a very large number of loose
objects.

I know I don't really answer the question, but the point I try to make
is that yeah, some kind of automated way to run the gc is great, but I'm
not sure that _git_ is the tool to automate that, because when *I* use
git, I expect it to be just plain fast, and I don't want it to
occasionally hang.

--
·O· Pierre Habouzit
··O madcoder@debian.org
OOO ...

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 1:51 pm

Indeed. I repack all our git trees in the middle of the night, and our
incremental backup script drops .keep files corresponding to every
existing pack before running the backup.

This is probably a good job for cron :)
-

To: Nix <nix@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 2:14 pm

If you are setting up cron jobs to repack multiple git trees, you are
not the kind of novice or casual git user who this proposal would
primarily be aimed at.

But in any event, since you are doing that, your repos will never
accumulate a high enough percentage of loose objects (whatever the
threshold is) to trigger the warning and/or automatic launch. So you can
continue to operate as before, no difference in behavior, while people
who don't know how / want to set up cron jobs will have their
repositories cleaned too.

git-gc can leave behind a "last completed" timestamp and we can suppress
the check for excess loose objects until some minimum amount of time has
passed since last git-gc. If that amount is greater than the interval
between your cron jobs, you won't even get any (measurable) overhead
from the detection to see if the warning is needed.
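
The timestamp idea can be sketched as follows. This is an illustration
under stated assumptions, not anything git does: the stamp file, its
format (epoch seconds as plain text), and the function name are all
hypothetical. The current time is passed in as an argument so that the
logic is easy to exercise.

```shell
# Succeed if the (potentially expensive) loose-object check should run:
# either no stamp exists yet, or at least $3 seconds have passed since
# the time recorded in it.
# $1 = stamp file holding epoch seconds of the last completed gc
# $2 = current time in epoch seconds
# $3 = minimum interval in seconds
should_check_for_gc() {
    stamp=$1 now=$2 min_interval=$3
    [ -f "$stamp" ] || return 0
    last=$(cat "$stamp")
    [ $((now - last)) -ge "$min_interval" ]
}
```

With a nightly cron job rewriting the stamp and a minimum interval of
more than a day, the check would never fire on such a machine.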

-Steve

-

To: Steven Grimm <koreth@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 2:22 pm

True enough: but the point is that it was only about three lines of code
(a locate and git-gc pipeline). We could just put that in the
documentation...

... which people then won't read. Oh well. Sorry for the mindless

I personally wonder if git-gc shouldn't use a proportional scheme, so
that only some packs get repacked, maybe the smallest ones (and when
they grow to the same size as the next largest one, the two get repacked
into one). This has the singular advantage that you won't have to
carefully drop .keep files everywhere or have to worry about your git-gc
of 50K of loose objects suddenly deciding to repack 100Mb of packfiles
and taking ages.
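
One possible reading of that proportional scheme, as a sketch: sort the
packs by size, start with the smallest, and keep absorbing the next pack
for as long as the running total has caught up with it. The selection
rule, the function name, and the "size name" input format are
assumptions made for illustration.

```shell
# Read "size name" lines on stdin (one per pack) and print the names of
# the packs that should be combined, one per line. Prints nothing when
# the smallest pack has not yet grown to the size of the next one.
select_packs_to_merge() {
    sort -n | awk '
        NR == 1     { total = $1; names = $2; count = 1; next }
        total >= $1 { total += $1; names = names "\n" $2; count++; next }
                    { exit }   # next pack is larger than the total: stop
        END         { if (count > 1) print names }
    '
}
```

Under this rule a large pack is only ever rewritten once the smaller
packs together have reached its size, so a small batch of new objects
does not trigger a rewrite of a 100Mb pack.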

It's probably not hard to implement, but I don't need it because I keep
everything packed anyway...
-

To: Nix <nix@...>
Cc: Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 2:54 pm

Not only that. Currently the "Counting objects" phase when running
git-gc on the Linux repo takes a significant amount of time, even if
there is little to repack.

If any kind of automatic repack is implemented, it should be an
incremental repacking only, not the full thing, i.e. git-repack without
-a, or git-pack-objects with --unpacked. The idea is to be the least
intrusive as possible. Also, object walking should be limited to
objects linked to a commit object which is itself unpacked in order to
cut on the time required to fully enumerate all objects.

This way a semi-packed state will always be preserved and should be good
enough. The full repacking should probably be left to manual execution
of git-gc.

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 4:01 pm

Ok, how about doing something like this?

-- >8 -- snipsnap -- >8 -- clipcrap -- >8 --
Implement git gc --auto

This implements a new option "git gc --auto". When gc.auto is
set to a positive value, and the object database has accumulated
roughly that many loose objects, this runs a
lightweight version of "git gc". The primary difference from
the full "git gc" is that it does not pass "-a" option to "git
repack", which means we do not try to repack _everything_, but
only repack incrementally. We still do "git prune-packed". The
default threshold is arbitrarily set by yours truly to:

- not trigger it for fully unpacked git v0.99 history;

- do trigger it for fully unpacked git v1.0.0 history;

- not trigger it for incremental update to git v1.0.0 starting
from fully packed git v0.99 history.

This patch does not add invocation of the "auto repacking". It
is left to key Porcelain commands that could produce tons of
loose objects to add a call to "git gc --auto" after they are
done their work. Obvious candidates are:

git add
git fetch
git merge
git rebase

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

builtin-gc.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 63 insertions(+), 1 deletions(-)

diff --git a/builtin-gc.c b/builtin-gc.c
index 9397482..093b3dd 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -20,6 +20,7 @@ static const char builtin_gc_usage[] = "git-gc [--prune] [--aggressive]";

static int pack_refs = 1;
static int aggressive_window = -1;
+static int gc_auto_threshold = 6700;

#define MAX_ADD 10
static const char *argv_pack_refs[] = {"pack-refs", "--all", "--prune", NULL};
@@ -28,6 +29,8 @@ static const char *argv_repack[MAX_ADD] = {"repack", "-a", "-d", "-l", NULL};
static const char *argv_prune[] = {"prune", NULL};
static const char *argv_rerere[] = {"rerere", "gc", NULL};

+static const char *argv_repack_auto[] = {"repack", "-d...

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 11:54 am

Hi,

Please don't do that.

When you share objects with another git directory, git-gc --auto can get
rid of the objects when some objects go away in the referenced repository.

So we need to _at least_ check gc.auto not being set in the repo when "git
clone --share"ing it (and fail otherwise).

My preferred way would be to set it in "git init" so that existing setups
are not affected, and put some big red message on top of the next release
notes that people might want to set gc.auto in their existing setups.

Ciao,
Dscho

-

To: Johannes Schindelin <Johannes.Schindelin@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 1:49 pm

I thought the whole point of "gc --auto" was to have something
that does not lose/prune any objects, even the ones that do not
seem to be referenced from anywhere. That is why invocations of
"git gc --auto" do not say --prune as you saw in the second patch,
and the repack command "gc --auto" runs is "repack -d -l"
instead of "repack -a -d -l", which means that it does run
git-prune-packed after repacking but not git-prune.
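
Put as a sketch, the decision described here amounts to the following.
The threshold default of 6700 is the value from the patch above; the
function name is made up, and the loose-object count is taken as an
argument rather than computed, so this only models which commands would
run.

```shell
# $1 = number of loose objects, $2 = gc.auto threshold (default 6700).
# Prints the commands an automatic gc would run, one per line, or
# nothing when the threshold is unset (0) or not yet reached.
gc_auto_plan() {
    loose=$1 threshold=${2:-6700}
    if [ "$threshold" -gt 0 ] && [ "$loose" -ge "$threshold" ]; then
        echo "git repack -d -l"   # incremental: no -a, existing packs are kept
        echo "git prune-packed"   # drop loose copies of objects now in a pack
    fi
}
```

Note that "git prune" never appears in the output, which is the point
being made about unreferenced objects.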

Maybe I am missing something...

-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Friday, September 7, 2007 - 6:12 am

Hi,

No, _I_ missed the fact that no pack is rewritten...

Sorry for the line noise,
Dscho

-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Friday, September 7, 2007 - 12:48 am

No, you aren't, Junio. `gc --auto` as you defined it is safe.
It won't delete objects from the database. So it won't impact shared
repositories, or readers that are actively running in parallel with
the gc. Both of which are important.

--
Shawn.
-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 2:15 pm

I think "repack -d -l" should be ok from a safety perspective, but I'd
also like to say that always running it incrementally is going to largely
suck after a time.

IOW, if you get lots of small incremental packs, after a while you really
*do* need to do "git gc" to get the real pack generated.

In the case I saw, James really had hundreds of pack-files. That makes all
our object lookups suck. Yes, not having loose objects at all is a big
deal too, and yes, we try to start from the last pack-file we found (for
the locality that we hope is there), but it's still pretty bad from a
cache usage standpoint, and when we create a new object, we'll first
search (in vain) in all the hundreds of pack-files.

So would "git gc --auto" have helped James? I'm sure it would have. But he
already had lots of pack-files from doing "git fetch/pull", and while
doing the "git gc --auto" will likely *delay* the point where you need to
do a full repack, it doesn't make it go away.

We still need to tell people to do a full git gc at some point, or do it
for them. And the longer you delay doing it, the more expensive it's going
to get to do and/or the worse the final packing is going to be (especially
if it ends up reusing non-optimal packing decisions from the smaller
packs).

So I think the --auto stuff is still worth it, but it's really just
pushing the pain somewhat further out.

(In the kernel community, if you fetch my tree daily, you really *are*
going to have hundreds and hundreds of packfiles just from doing that).

So I'd really like us to also remind people to do a *real* and full "git
gc", not just the incremental ones.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 7:12 pm

This is a beginning of "git-merge-pack" that combines smaller
packs into one. Currently it does not actually create a new
pack, but pretends that it is a (dumb) "git-rev-list --objects"
that lists the objects in the affected packs. You have to pipe
its output to "git-pack-objects".

The command reads names of pack-*.pack files from the standard
input, outputs the objects' names in the order they are stored
in the original packs (i.e. the offset order). This sorting is
done in order to emulate the traversal order the original
"git-rev-list --objects" that was used to create the existing
pack listed the objects.

While this approach would give the resulting packfile very
similar locality of access as the original, it does not give the
"name" component you would see in "git-rev-list --objects"
output. This information is used as the clustering cue while
computing delta, and the lack of it means you can get horrible
delta selection. You do _not_ want to run the downstream
"git-pack-objects" without the optimization/heuristics to reuse
delta. IOW, do not run it with --no-reuse-delta.

To consolidate all packs that are smaller than a megabyte into
one, you would use it in its current form like this:

$ old=$(find .git/objects/pack -type f -name '*.pack' -size 1M)
$ new=$(echo "$old" | git merge-pack | git pack-objects pack)
$ for p in $old; do rm -f $p ${p%.pack}.idx; done
$ for s in pack idx; do mv pack-$new.$s .git/objects/pack/; done

Obvious next steps that can be done in parallel by interested
parties would be:

(1) come up with a way to give "name" aka "clustering cue" (I
think this is very hard);

(2) run the above four-command sequence internally without
having to resort to shell wrapper (easy).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

Linus Torvalds <torvalds@linux-foundation.org> writes:

> IOW, if you get lots of small incrmental packs, after a while you really
> *do* need to do "git gc"...

To: <git@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Junio C Hamano <gitster@...>
Date: Friday, September 7, 2007 - 3:24 am

Can I suggest not calling it git-merge-pack? It makes it look like it's a new
merge strategy called "pack"...

git-merge-base
git-merge-file
git-merge-index
git-merge-octopus
git-merge-one-file
git-merge-ours
git-merge-recur
git-merge-recursive
git-merge-resolve
git-merge-stupid
git-merge-subtree
git-merge-tree

Andy

--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Friday, September 7, 2007 - 3:11 am

This gives a new meaning to the term "merge". IMHO, "git-combine-pack" would
be a better name.

-- Hannes

-

To: Johannes Sixt <j.sixt@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Friday, September 7, 2007 - 3:34 am

Yeah, that makes sense, but I think this can and should be done
as part of pack-objects itself as Nico suggested.

So consider that patch scrapped for now.
-

To: Junio C Hamano <gitster@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 8:51 pm

I wonder if this is the best way to go. In the context of a really fast
repack happening automatically after (or during) user interactive
operations, the above seems a bit heavyweight and slow to me.

I would have concatenated all packs provided on the command line into a
single one, simply by reading data from existing packs and writing it
back without any processing at all. The offset for OBJ_OFS_DELTA is
relative so a simple concatenation will just work.

Then the index for that pack can be created just as easily by reading
existing pack index files and storing the data into an array of struct
pack_idx_entry, adding the appropriate offset to object offsets, then
call write_idx_file().
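
The offset adjustment can be illustrated with a few lines of shell. This
is a sketch under assumptions: it takes the version 2 pack layout (a
12-byte header before the first object, a 20-byte checksum at the end)
and assumes the header and checksum of each appended pack are dropped,
so the combined pack would still need a fresh object count and checksum
of its own. The function name and the "offset object-name" input format
are made up.

```shell
# $1 = byte position in the combined pack at which the appended pack's
#      first object will be written.
# stdin:  "offset object-name" lines taken from the appended pack's index.
# stdout: the same entries with offsets valid in the combined pack.
rebase_offsets() {
    # Objects start at offset 12 in their original pack (after its header),
    # so subtract that header and add the new starting position.
    awk -v base="$1" '{ print $1 - 12 + base, $2 }'
}
```

Because OBJ_OFS_DELTA offsets are relative to the delta's own position,
only the index entries need this shift; the object data itself is
copied unchanged.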

All data is read once and written once making it no more costly than a
simple file copy. On the flip side it wouldn't get rid of duplicated
objects (I don't know if that matters i.e. if something might break with

It is, and IMHO not worth it. If you do it separately from the usual
pack-objects process you'll perform extra IO and decompression when
walking tree objects just to reconstruct those paths, becoming really
slow by the context definition I provided above.

If you really want to do it then the best way might simply be to reverse
your find result above, in order to use pack-objects as if the larger
packs, i.e. the ones that you don't want to merge, simply had an
associated .keep file.

In fact, since we want to _also_ perform a repack of loose objects in
the context of automatic repacking, I wonder why we wouldn't use that
--unpacked= argument to also repack smallish packs at the same time in
only one pack-objects pass. Or maybe I'm missing something?

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Friday, September 7, 2007 - 12:43 am

As I was planning to do this outside of pack-objects, I did not
want to write something that intimately knows the details of

I do not think duplicates create problems, as long as the pack
idx remains sane. But a bigger issue is for people who fetch
over dumb protocols, from a repository that repacks with "-a -d"

I think this is a much better idea. You obviously need some
twist to the pack-objects, and being lazy that was the reason I
did not want to do this that way.

When a new parameter, perhaps --lossless, is given, together
with the --unpacked= parameters, we can change pack-objects to
iterate over all objects in the --unpacked= packs, and add the
ones that are not marked for inclusion to the set of objects to
be packed, after doing the usual "objects to be packed"
discovery.

I am not sure --lossless is a good option name from marketing
point of view, though.
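
The selection rule proposed above can be modelled in a few lines of Python. This is only a sketch of the idea, not pack-objects code: `objects_to_pack`, `discovered` and `named_packs` are hypothetical names, and object names are plain strings here.

```python
from typing import Iterable, List, Set


def objects_to_pack(discovered: Iterable[str],
                    named_packs: Iterable[Iterable[str]]) -> List[str]:
    """Return the usual discovery result plus every object in the named packs.

    `discovered` is what the normal "objects to be packed" walk produced;
    `named_packs` holds the object names stored in each --unpacked=<pack>.
    """
    result: List[str] = []
    seen: Set[str] = set()
    for sha1 in discovered:
        if sha1 not in seen:
            seen.add(sha1)
            result.append(sha1)
    # Second phase: add objects the walk did not mark (e.g. unreachable ones),
    # so nothing is lost when the named packs are deleted afterwards.
    for pack in named_packs:
        for sha1 in pack:
            if sha1 not in seen:
                seen.add(sha1)
                result.append(sha1)
    return result
```

Doing the reachability walk first keeps the usual ordering for reachable objects, with the leftovers appended at the end.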
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Saturday, September 8, 2007 - 6:01 am

The usual command line that uses "--unpacked=<existing>" option
looks like this:

git pack-objects --non-empty --all --reflog \
--unpacked --unpacked=<existing> \
packname-prefix

This packs loose objects and objects in the named existing
packs that are reachable from any and all refs and reflog
entries. It is typically used by "git repack -a -d", which
then removes the named existing packs from the repository, and
has an effect of getting rid of unreachable objects these packs
hold.

This adds "--repack-unpacked" option to pack-objects to help
combining small packs into one, without losing unreferenced
objects that are in the packs. When this option is given in
addition to the above command line, we also make sure all the
objects in the named existing packs are included in the result.

This allows us to safely remove the packs that were named on the
command line after installing the resulting pack in the
repository.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

I am too tired to keep staring at this code now. Fixes,
improvements, replacements and enhancements, in the code,
documentation and tests, are very much welcomed.

builtin-pack-objects.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 93 insertions(+), 2 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 12509fa..9bc2faa 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -21,7 +21,7 @@ git-pack-objects [{ -q | --progress | --all-progress }] \n\
[--window=N] [--window-memory=N] [--depth=N] \n\
[--no-reuse-delta] [--no-reuse-object] [--delta-base-offset] \n\
[--non-empty] [--revs [--unpacked | --all]*] [--reflog] \n\
- [--stdout | base-name] [<ref-list | <object-list]";
+ [--stdout | base-name] [--repack-unpacked] [<ref-list | <object-list]";

struct object_entry {
struct pack_idx_entry idx;
@@ -57,7 +57,7 @@ static struct object_entry **written_list;
stat...

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Saturday, September 8, 2007 - 10:57 pm

I can't help but think it would be better to store a "struct object*"
here instead of the "const unsigned char *sha1". Both are of type
pointer but we only need the struct object* when we are working with

Per above just pass the "struct object*" into this function and
store that into the array, instead of the sha1. Of course you now

Really? It is not possible for two objects to be placed at the
same offset within the same packfile and yet have two different

If you save the "struct object*" into the in_pack_object (instead
of the SHA-1) as I described above you can avoid this second
lookup_unknown_object() call when we decide that we do need to pack
the object.

Otherwise I think it looks quite sane.

--
Shawn.
-

To: Shawn O. Pearce <spearce@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Sunday, September 9, 2007 - 1:04 am

This actually was meant to be used to sort object entries from
multiple packs together. The update to pack-objects you are
commenting on deals with one packfile at a time, but I think we
probably should collect from all packs and then sort (which was
how merge-pack used this function).

Other than that, I think your comments make sense.
-

To: Junio C Hamano <gitster@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Shawn O. Pearce <spearce@...>, Git Mailing List <git@...>
Date: Sunday, September 9, 2007 - 8:29 am

I'm not sure sorting objects from multiple packs together like that is
going to help deltification. It is unlikely that related objects (e.g.
objects having the same path) will be located at the same offset in
different packs.

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Sunday, September 9, 2007 - 1:49 pm

Yes. But when you are merging several packfiles together and you
don't supply `--no-reuse-delta` then we're really just going to
copy the data from the sources to the output. There is not a lot
of deltification to be performed; maybe only a handful of loose
objects will need to locate deltas. So helping deltification is
not really of concern here.

What Junio is trying to do here is at least preserve their order
within the packfile as that should help to preserve their locality
of access.

Only I'm not sure that's the best merging strategy available to us.

What about something like this:

1) Read all packfile indexes, sort by offset.

2) Locate first commit object within each packfile.
3) Get that commit's commit date; if no commit is in the
packfile at all use the modification date of the packfile.
4) Sort the packfiles by their chosen date descending (more
recent items are closer to the front of the list).

5) Add objects:
foreach type in commit tree blob tag
foreach packfile in sorted_packs_from_4
while current_object->type == $type
if (current_object->flags & ADDED) == 0
add current_object
current_object++

This way data is still organized by the original order that rev-list
gave us when we created the small packfiles, but we also try to place
data from more recent packfiles into the front of the new packfile.
It's a rough approximation of what rev-list would have given us for
object ordering when it performed a traversal. It's also a whole lot
cheaper than rev-list and lets us continue to include unreachable
objects, which was the point of this patch.
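
Steps 1 to 5 above can be sketched in Python as follows. This is a toy model under stated assumptions, not code from the patch: `PackObject`, `Pack`, `pack_date` and `merge_order` are hypothetical names, and, like the pseudocode, it assumes each small pack stores its objects grouped by type in the order commit, tree, blob, tag.

```python
from dataclasses import dataclass
from typing import List

TYPE_ORDER = ("commit", "tree", "blob", "tag")


@dataclass
class PackObject:
    sha1: str
    type: str
    offset: int
    commit_date: int = 0  # only meaningful for commits


@dataclass
class Pack:
    name: str
    mtime: int
    objects: List[PackObject]


def pack_date(pack: Pack) -> int:
    """Steps 2-3: date of the first commit by offset, else the pack mtime."""
    for obj in sorted(pack.objects, key=lambda o: o.offset):
        if obj.type == "commit":
            return obj.commit_date
    return pack.mtime


def merge_order(packs: List[Pack]) -> List[str]:
    # Step 1: each pack's objects sorted by offset.
    by_offset = {p.name: sorted(p.objects, key=lambda o: o.offset) for p in packs}
    # Step 4: most recent packs first.
    ordered = sorted(packs, key=pack_date, reverse=True)
    cursor = {p.name: 0 for p in packs}  # per-pack position, kept across passes
    added = set()
    result = []
    # Step 5: one pass per type, walking the packs in date order.
    for wanted in TYPE_ORDER:
        for pack in ordered:
            objs = by_offset[pack.name]
            i = cursor[pack.name]
            while i < len(objs) and objs[i].type == wanted:
                if objs[i].sha1 not in added:  # skip duplicates across packs
                    added.add(objs[i].sha1)
                    result.append(objs[i].sha1)
                i += 1
            cursor[pack.name] = i
    return result
```

Because the cursor only ever moves forward, a pack whose objects are not grouped in that type order would have some objects skipped; a real implementation would need to handle that case.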

--
Shawn.
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Saturday, September 8, 2007 - 5:50 am

Even though our convention is "zero return means good", it goes a
bit too far for matches_pack_name() to return 0 when it found
the pack is what the name refers to. This fixes that silly and
obvious interface bug.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

Junio C Hamano <gitster@pobox.com> writes:

> Nicolas Pitre <nico@cam.org> writes:
> ...
>> In fact, since we want to _also_ perform a repack of loose objects in
>> the context of automatic repacking, I wonder why we wouldn't use that
>> --unpacked= argument to also repack smallish packs at the same time in
>> only one pack-objects pass. Or maybe I'm missing something?
>
> I think this is a much better idea. You obviously need some
> twist to the pack-objects, and being lazy that was the reason I
> did not want to do this that way.

So what follows is a two-patch series, which still is a rough
sketch, as I am feeling a bit too tired to do tests and
documentation (help is always welcomed, hint hint).

This message contains the first one, which is more or less
independent, that exposes matches_pack_name() function from
sha1_file.c, while fixing a silly and obvious interface bug.

cache.h | 1 +
sha1_file.c | 14 +++++++-------
2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/cache.h b/cache.h
index 70abbd5..3fa5b8e 100644
--- a/cache.h
+++ b/cache.h
@@ -529,6 +529,7 @@ extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsign
extern unsigned long unpack_object_header_gently(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int matches_pack_name(struct packed_git *p, const char *name);

/* Dumb serve...

To: Junio C Hamano <gitster@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Saturday, September 8, 2007 - 9:56 pm

For something with a boolean return value I agree.

Otherwise, for interfaces where a non zero return value means error, it
is best to use a negative value (like -1) which has less of a positive
connotation, and that wasn't done here either.

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Friday, September 7, 2007 - 12:07 am

Yea, that's a really quick repack. :-) Plus it's actually something
that can be easily halted in the middle and resumed later. Just need
to save the list of packfiles you are concatenating so you can pick
up later when you get more time.

There shouldn't be a problem with having duplicates in the packfile.
You can do one of two things:

a) Omit the duplicates from the .idx when you merge the .idx tables
together to produce the new one. Just take the object with the
earliest offset.

b) Leave the duplicates in the final .idx. In this case the
binary search may pick any of them, but it wouldn't matter
which it finds.
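
Option (a) could look roughly like this in Python. It is a sketch only: `drop_duplicates` is a hypothetical helper, and entries are simple (name, offset) pairs rather than real index records.

```python
from typing import Dict, Iterable, List, Tuple


def drop_duplicates(entries: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep one index entry per object name, preferring the earliest offset."""
    earliest: Dict[str, int] = {}
    for sha1, offset in entries:
        if sha1 not in earliest or offset < earliest[sha1]:
            earliest[sha1] = offset
    # The resulting index is sorted by object name.
    return sorted(earliest.items())
```

The duplicate object data still sits in the pack file; it simply becomes unreferenced by the index.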

About the only process that might care about duplicates would be
index-pack. I don't think it makes sense to run index-pack on a
packfile you already have a .idx for. I don't think it would have
a problem with the duplicate SHA-1s either, but it wouldn't be hard

Not might, *must*. If you delete the old ones before the new
ones are ready then readers can run into problems trying to access
the objects. We've spent some effort trying to make these sorts
of operations safe. No sense in destroying that by getting the
order wrong here. :)

--
Shawn.
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 9:58 pm

Honestly, I do not believe in that mode of operation that much.

"While the user is waiting for the EDITOR"?

Because you do not know how much time you will be given before
you start, unless

(1) your process can be snapshotted and you can restart at the
next chance; or

(2) it is so cheap and you can afford to abort and start over
from scratch at the next chance; or

(3) it is so quick that you can simply have the user wait until
you are done without adding too much latency to be annoying,
when you cannot finish before the EDITOR comes back;

I think that is a false sense of "ok, we will be able to do
something else in the background meantime", which is not so

Well, I said "name" in quotes because you do _NOT_ have to give
the real name. I was not thinking about doing the actual tree
traversal at all. What you need to do is to come up with a
token that is the same for the objects in the same deltification
chain so that they cluster together, and that should be doable
by looking at the delta chain patterns inside a packfile.
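
One possible reading of that "token" idea, sketched in Python: use the root of each object's delta chain as the clustering key. This is an assumption about what such a token could be, not something taken from the patch; `cluster_tokens` and the `delta_base` mapping are hypothetical.

```python
from typing import Dict, Optional


def cluster_tokens(delta_base: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Map every object to the root of its delta chain.

    `delta_base` maps an object name to the object it is a delta against,
    or to None when the object is stored whole.
    """
    tokens = {}
    for obj in delta_base:
        root = obj
        # Follow the chain until we reach an object that is not a delta.
        while delta_base.get(root) is not None:
            root = delta_base[root]
        tokens[obj] = root
    return tokens
```

Sorting objects by this token would place members of the same chain next to each other without walking any trees.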
-

To: Junio C Hamano <gitster@...>
Cc: Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 10:32 pm

I think we have to aim for #3. "Automatic" certainly doesn't imply "can
be slow". It should be reasonably instantaneous, otherwise it'll become
annoying quickly enough. If it can't be (almost) instantaneous in 99%
of normal cases, then I think it simply should remain asynchronous
through a manual invocation of 'git gc' and we only need to teach/remind

Obviously! Sorry for being slow.

But I still think that a single repack pass should already be able to
pick loose objects and selected (small) packs, and produce a pack with
them all. No need for a separate merge-pack I'd say.

Nicolas
-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 7:35 pm

Ok, so I had to double-check that builtin-pack-objects then deals properly
with duplicate object names (which it does seem to do), so maybe it's
worth adding a comment to that effect.

But ACK, this seems to be the right thing to do to generate a single
bigger pack from many smaller ones.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Johannes Schindelin <Johannes.Schindelin@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 2:29 pm

I wonder if it makes sense to repack just the small incremental packs
into a large (but still incremental) pack, rather than repacking the
entire repository. Presumably that would be a lot faster than a full
"git gc", while still giving you reasonably good packing (at least, if
the threshold is set to a high enough number of small packs) and keeping
things fast. That could run as a second phase of "git gc --auto" -- it
should be quick enough to not be too terribly annoying since we're not
running it in the background.

Yeah, if you use the same repo for a long time, you'll accumulate a ton
of medium-sized packs this way, but (a) that's much better than the
situation we have today, and (b) it puts off the performance degradation
for long enough that it becomes more reasonable to expect people to find
out about running the full "git gc" in the meantime, or for git to
further evolve to not need it.
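
A minimal sketch of that second phase, in Python. The names `packs_to_combine`, `small_limit` and `min_count` are made up for this example, and the thresholds are placeholders rather than proposed defaults.

```python
from typing import Dict, List


def packs_to_combine(pack_sizes: Dict[str, int],
                     small_limit: int,
                     min_count: int) -> List[str]:
    """Pick the small packs to merge, or nothing if there are too few of them.

    `pack_sizes` maps a pack name to its size in bytes.
    """
    small = sorted(name for name, size in pack_sizes.items() if size < small_limit)
    # Only bother once enough small packs have piled up.
    return small if len(small) >= min_count else []
```

Packs at or above `small_limit` are left alone, which is how the medium-sized packs mentioned above accumulate over time.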

-Steve

-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 10:45 pm

...

Danger... If the user sets `gc.auto` to a low enough value and
they are also unlucky enough to have a few truly unreachable (thus
pruneable) objects in .git/objects/17/ then this is going to run
a bunch of gc work on every commit they make.

I'm actually running into this problem in git-gui. On Windows
it suggests a repack if there is one object in .git/objects/42/.
Some users have been unlucky enough to stage a file, have it
hash into that directory, then restage a different version of it.
The prior one is never considered reachable (it was never committed),
but will now *always* cause git-gui to suggest a repack on every
startup. For all time.

Yea, I need to fix that.

But this suffers from the same fate if the user sets gc.auto too
small and doesn't realize that the reason Git is always repacking
is because over the last 6 months they have been unlucky enough to
stage the magic number of unreachable blobs into the 17 directory
and they have *never* run `git gc --prune` because the auto thing
is working just fine for them and they don't realize they need to
prune every once in a blue moon.

--
Shawn.
-

To: Shawn O. Pearce <spearce@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 10:49 pm

Check the modification times on those files and don't count ones that
are older than the last git-gc run, maybe? That'd take care of the problem.

-Steve
-

To: Steven Grimm <koreth@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 10:56 pm

Eh, that could mean a bunch of stat calls that it would be nice
to avoid. The counter Junio (and git-gui) implements just does
a readdir(). Reasonably cheap.

Maybe just save a ".git/gc_last_auto" with the last object count
of .git/objects/17, after repacking. If the count is over the
gc.auto limit *and* is still over the limit after subtracting the
".git/gc_last_auto" value then consider that auto is required.

This way the file is only consulted if we are really thinking
about running a repack, and it's only written to if we actually do
the repack. So we only take the extra penalty if we are going to
be taking a *really* big extra penalty by repacking.
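
The ".git/gc_last_auto" check could be modelled like this in Python. It is a sketch under assumptions, not git code: the helper names are hypothetical, the file holds a single integer, and `limit` is taken to be the per-directory threshold for objects/17 (how gc.auto maps onto that is left open here).

```python
import os
from pathlib import Path


def count_bucket(git_dir: Path, bucket: str = "17") -> int:
    """Count entries in objects/<bucket> with a single directory read."""
    try:
        return len(os.listdir(git_dir / "objects" / bucket))
    except FileNotFoundError:
        return 0


def read_last_auto(git_dir: Path) -> int:
    """Object count recorded after the last automatic repack, 0 if unknown."""
    try:
        return int((git_dir / "gc_last_auto").read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def auto_gc_required(git_dir: Path, limit: int) -> bool:
    count = count_bucket(git_dir)
    if count <= limit:
        return False  # cheap path: the marker file is never opened
    # Ignore objects that already survived the previous repack.
    return count - read_last_auto(git_dir) > limit


def record_auto_gc(git_dir: Path) -> None:
    """Call after a repack: remember how many objects were left behind."""
    (git_dir / "gc_last_auto").write_text(f"{count_bucket(git_dir)}\n")
```

The marker file is read only on the slow path and written only after an actual repack, matching the cost argument above.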

--
Shawn.
-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:18 pm

or a non-POSIX snprintf returning "negative value" (Microsoft)

-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 4:37 pm

This makes the two commands call "git gc --auto" when they
are done.

I earlier said that obvious candidates also include merge and
rebase, but these are a lot less frequent operations compared to
add, and more importantly, in a normal workflow they would
almost always happen after "git fetch" is done.

In other words, if you are a downstream developer, the automatic
invocation in "git fetch" will take care of things for you, and
otherwise if you do not have an upstream, you would be doing
your own development, so "git add" to add your changes will take
care of the auto invocation for you.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
* This is obviously a follow-up to the previous one that allows
you to say "git gc --auto". I somewhat feel dirty about
calling cmd_gc() bypassing fork & exec from "git add",
though...

builtin-add.c | 2 ++
git-fetch.sh | 1 +
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/builtin-add.c b/builtin-add.c
index 105a9f0..8431c16 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -263,9 +263,11 @@ int cmd_add(int argc, const char **argv, const char *prefix)

finish:
if (active_cache_changed) {
+ const char *args[] = { "gc", "--auto", NULL };
if (write_cache(newfd, active_cache, active_nr) ||
close(newfd) || commit_locked_index(&lock_file))
die("Unable to write new index file");
+ cmd_gc(2, args, NULL);
}

return 0;
diff --git a/git-fetch.sh b/git-fetch.sh
index c3a2001..86050eb 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -375,3 +375,4 @@ case "$orig_head" in
fi
;;
esac
+git gc --auto
--
1.5.3.1.840.g0fedbc

-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 8:02 am

Hi,

Since all git-gc seems to do is to fork() and exec() other git programs,
this should be fine (have not looked at cmd_gc() in a while, though).

Ciao,
Dscho

-

To: Junio C Hamano <gitster@...>
Cc: Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 4:35 pm

A big part of the repack cost is the counting of objects. I don't know
if --unpacked to git-pack-objects skips walking trees of a packed commit
object. If not, then it probably should, to gain a significant speedup,
or maybe a separate option should be created to actually imply this

Nope! 'git add' creates loose objects which are not yet reachable from

I think that would be a much better idea to simply decrease the

and git commit. Which boils it down to commit-creating operations.

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:49 pm

Good point. I think that makes sense.

-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:59 pm

The point of auto gc is to pack new objects created in loose
format, so a good rule of thumb is where we do update-ref after
creating a new commit.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
Let's chuck the previous "git add/git fetch" one, and replace it
with this.

Also I realize I misread your earlier comment about "git add".
You are still among the only few people on the list that I
consider are always more right than I am ;-).

git-am.sh | 2 ++
git-commit.sh | 1 +
git-merge.sh | 1 +
git-rebase--interactive.sh | 2 ++
4 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/git-am.sh b/git-am.sh
index 6809aa0..4db4701 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -466,6 +466,8 @@ do
"$GIT_DIR"/hooks/post-applypatch
fi

+ git gc --auto
+
go_next
done

diff --git a/git-commit.sh b/git-commit.sh
index 1d04f1f..d22d35e 100755
--- a/git-commit.sh
+++ b/git-commit.sh
@@ -652,6 +652,7 @@ git rerere

if test "$ret" = 0
then
+ git gc --auto
if test -x "$GIT_DIR"/hooks/post-commit
then
"$GIT_DIR"/hooks/post-commit
diff --git a/git-merge.sh b/git-merge.sh
index 3a01db0..697bec2 100755
--- a/git-merge.sh
+++ b/git-merge.sh
@@ -82,6 +82,7 @@ finish () {
;;
*)
git update-ref -m "$rlogm" HEAD "$1" "$head" || exit 1
+ git gc --auto
;;
esac
;;
diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index abc2b1c..8258b7a 100755
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -307,6 +307,8 @@ do_next () {
rm -rf "$DOTEST" &&
warn "Successfully rebased and updated $HEADNAME."

+ git gc --auto
+
exit
}

-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 10:39 pm

Why bother with git-rebase--interactive.sh? It calls two tools,
git-cherry-pick (which calls git-commit) and git-commit to do its
per-commit dirty work. So on every step of `git rebase -i` we are
now running `git gc --auto`. No need to also run it at the end.

Note this is also true of `git rebase -m` as that uses the wonderful
feature of `git commit -C $oldid` per commit to make the new commit.

--
Shawn.
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:46 pm

Bzzt, I am relieved to see you are sometimes wrong ;-)

They are reachable from the index and are not subject to

One thing that I find lacking in that auto patch is actually
that we should sometimes consolidate multiple small packs into a
single larger one. Any behaviour change to encourage creation
of many tiny packs should be avoided until it materializes.

Probably we should introduce a built-in minimum value for a
positive gc.auto, somewhere around 1000 or so, for this reason.

-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Thursday, September 6, 2007 - 1:55 am

Hm. Isn't it possible to work with several index files at once? I
seem to remember that even git-add does this itself. So what is it
that protects objects in such a temporary index from being garbage
collected by a different git process running on the same repository?

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-

To: Junio C Hamano <gitster@...>
Cc: Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 7:04 pm

Why not just let the default value take care of it? If someone really
wants to set gc.auto to 50, why prevent it?

The more I think of it, the less I like automatic repack. There is
always a bad case for it somewhere.

Nicolas
-

To: Nicolas Pitre <nico@...>
Cc: Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 7:42 pm

I tend to agree, but at the same time, I think the long term
goal should be not to have bad cases.

Old timers like ourselves learned to run "repack -a -d" when not
doing real work (i.e. beginning of the day while fetching
coffee, before leaving for lunch break, end of the day before
leaving) and we have been _trained_ not to feel that a chore,
but I think that is wrong. "Sync freezes I/O and causes my
real-time databasy job undue latency --- I would want to disable
swapper/bdflush/whatever machine-wide and prefer typing 'sync'
from the command line when it is convenient for me" is fine for
an experienced user working on a single user machine, but it
still feels wrong (we do not have "multi-user" issues in git
repository, so this analogy is not quite right, though).
-

To: Junio C Hamano <gitster@...>
Cc: Nicolas Pitre <nico@...>, Nix <nix@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 8:27 pm

The best solution is to make "git gc" unnecessary.
In the long term, and without loss of efficiency.
-

To: Nicolas Pitre <nico@...>
Cc: Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:14 pm

I think `git fetch' works reasonably well as is: unless you're fetching
every five minutes you often find you get packs anyway. There's no point
packing incrementally *too* often, or you replace a lots-of-objects
problem with a lots-of-packs problem, after which you're worse off than
when you started.
-

To: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 4:50 am

What about kicking off a repack in the background at the ends of certain
commands? With an option to disable, of course. It could run at a low
priority and could even sleep a lot to avoid saturating the system's
disks -- since it'd be running asynchronously there should be no problem
if it takes longer to run.

Alternately, if it's possible to break the repack work up into chunks
that can be executed a bit at a time, you could do a small amount of
repacking very frequently (possibly still in the background) rather than
the whole thing at once. I suspect the nature of a repack, where you
presumably want everything loaded at once, would make that a challenge,
but it might not be impossible.

On the more general question...

IMO expecting end users to regularly perform what are essentially
database administration tasks (running git-gc is akin to rebuilding
indexes or packing tables on a DBMS) is naive. Heck, even database
administrators don't like to run database administration commands;
PostgreSQL added the "autovacuum" feature precisely because manual
periodic repacking (and the associated monitoring to figure out when to
do it) was too annoying for developers and DBAs. But you don't have to
look that far; anyone who has worked in IT can tell you horror stories
of users, including developers, whose computers have slowed to a crawl
because the users never bothered to defrag their hard disks. And that
affects *everything* the users do, not just version control operations!

It'll get worse as better UIs and tool integration become available and
git gains large numbers of users who are neither software developers nor
system administrators, and wouldn't know a packfile from a hole in the
ground. I'm talking web designers, graphic artists, mechanical
engineers, even managers and secretaries -- all of those people are in
git's ultimate target audience, even if it's not ready for them today.
None of them is going to be interested in doing random housekeeping
op...

To: <git@...>
Date: Wednesday, September 5, 2007 - 5:13 am

You'll potentially get accumulating unfinished files from
aborted/killed repack processes. If communication fails, you'll get a
new repack session for every command you start. If a repository is
used by multiple people...

And so on. The multiuser aspect makes it a bad idea to do any
janitorial tasks automatically. You don't really want every user to
start a repack at the same time.

--
David Kastrup

-

To: Steven Grimm <koreth@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:14 am

there is an issue with that: repack is memory and CPU intensive. Of
course renicing the process deals with the CPU issue, but not with the
memory one. I've often seen repacks eat more than 300 to 400 MB of memory
on not so big repositories: it seems (and experience tells me that, not
looking at the code) that if you have some big binary blobs (we have
.swf's and .fla's in our repository) it can consume quite a lot of RAM
to (presumably) compute efficient deltas.

Sadly there is no way to "renice" the ram usage of a process. Once a
repack is launched, it will make your system swap, and put the whole

Well that's what crons are for. When you install a DBMS in a
reasonable enough distro, it comes with the optimizing scripts in crons,
launched at a reasonable period of the day (localtime). So the
comparison doesn't hold. And that's exactly the problem: it's quite hard
to ship git with an optimizing cron task, because we can't know where
the user will keep his repositories, and when he works, so you have
somehow to do it yourself.

Or you can deal with that with a "rule". At work, we have our devel
trees under $HOME/dev/, so the cron we use is just a (roughly):

find "$HOME/dev/" -maxdepth 4 -type d -name .git | while read -r repo
do
    GIT_DIR="$repo" git gc
done

As we work on NFS, with a new developer, we can just set up the cron
for him at a time when he's not supposed to be at work, and that's it.
I'm not sure there is a good solution at all.

Or we could also provide a: git-coffee-break command that would tell
git: do whatever you want with this computer in the next 10 minutes,
there won't be anyone watching, but I assume tea-lovers will feel
excluded.

--
·O·  Pierre Habouzit
··O  madcoder@debian.org
OOO  http://www.madism.org

To: Steven Grimm <koreth@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:07 am

[alias]
begin = gc
leave = gc

That is, the user's manual says 'at the beginning of the day,
run "git begin" to start the day, and at the end of day, run
"git leave" to conclude your day', without saying why ;-)

-

To: Junio C Hamano <gitster@...>
Cc: Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:27 am

I actually like that one ;-)

Anyway - this is turning out to be a bit of a bikeshed-painting event.
You guys should google earlier discussions on this very same subject.
They have always ended in "automatic=bad", "warning=good", and
"careful or you might be called an idiot" before ;-)

cheers,

martin
-

To: Martin Langhoff <martin.langhoff@...>
Cc: Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 5:33 am

There's indeed a real idea behind that. The issue is that the alias
shouldn't be just "gc", but "find-all-repositories-and-do-gc-there".

Currently, AFAIK, that can only be done with a (trivial) script
external to git. I suppose this can easily be added to the core git
porcelain. Perhaps a "git gc --recursive" would do.

It doesn't solve the problem, but makes it easier to solve it (git gc
--recursive in cron for example).
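
A minimal sketch of what such a "trivial script" might look like, in POSIX sh. The helper names list_repos and gc_all are made up for illustration, and no "git gc --recursive" option is assumed to exist; the sketch only finds directories named ".git" and runs a plain "git gc" on each.

```shell
# list_repos ROOT [MAXDEPTH]: print every directory named ".git" below ROOT.
# Sketch only: bare repositories (no ".git" directory) are not detected.
list_repos() {
    find "$1" -maxdepth "${2:-4}" -type d -name .git -prune -print
}

# gc_all ROOT: run "git gc" once per repository found (hypothetical helper).
gc_all() {
    list_repos "$1" | while IFS= read -r git_dir; do
        git --git-dir="$git_dir" gc --quiet
    done
}
```

Something like `gc_all "$HOME"` could then go into a cron job, at a time when nobody is working on the machine.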

--
Matthieu
-

To: Matthieu Moy <Matthieu.Moy@...>
Cc: Martin Langhoff <martin.langhoff@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 10:17 am

I'm a git newb so I can be wrong here but ...

Why --recursive? Why not use the submodule-information ?

Johan
-

To: Johan De Messemaeker <johan.demessemaeker@...>
Cc: Martin Langhoff <martin.langhoff@...>, Linus Torvalds <torvalds@...>, Steven Grimm <koreth@...>, Junio C Hamano <gitster@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 1:31 pm

Not all projects are necessarily subprojects of each other.

I have ~/teaching/some-course/.git (well, almost) and ~/etc/.git which
are two unrelated projects, and to "git gc" both of them, I need
either a script, or two manual invocations.

(yes, I'm really talking about something trivial)

--
Matthieu
-

To: Matthieu Moy <Matthieu.Moy@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 7:56 pm

I tend to have a lot of small projects, so I have on the order of 80 git
repositories on each machine I use, most of which have a 'mothership'
origin on a central, backed-up machine.

When I sit down to work, I want to see which repositories
have changes that need to be pulled. And when I get up to leave, I want
to see which repositories have changes that need to be pushed. Not to
mention files that need to be committed, loose objects that need to be
packed, etc.

So I wrote the 'git-stale' script, included below. It's not especially
user-friendly, but you might find it useful, as it solves the exact
problem you are talking about (and much more).

It reads 'repository specifications' from ~/.gitstale, one per line,
which are either of the form:

/path/to/repo

which specifies a repo to check, or:

r:/path/to/many/repos

which specifies a hierarchy in which to recursively find repos.

My .gitstale looks something like this:

/home/peff/compile/git
/home/peff/compile/tig
r:/home/peff/work

and I get output something like this (edited for brevity):

Checking (1/77) /home/peff/compile/git...
Checking (2/77) /home/peff/compile/tig...
[...]
Checking (77/77) /home/peff/work/foo...
MERGE:next /home/peff/compile/git
COMMIT: /home/peff/work/foo
PACK: /home/peff/work/foo
PUSH:master /home/peff/work/bar

which translates to:
- the git repo has commits in 'origin/next' that are not in 'next'
(and you might want to merge them in)
- there are uncommitted files in 'foo'
- 'foo' needs packing
- in the 'bar' repo there are commits in master that are not in origin
(and you might want to push)
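
The spec-file format described above can be expanded into a flat list of repositories with a few lines of POSIX sh. This is only a sketch under stated assumptions, not the git-stale script itself (which is Perl and is truncated in this archive): the function name expand_specs is hypothetical, and skipping blank lines is an assumption.

```shell
# expand_specs FILE: print one repository path per line.
# Plain lines are passed through; "r:" lines are searched recursively
# for directories that contain a ".git" directory.
expand_specs() {
    while IFS= read -r spec; do
        case "$spec" in
            '')  ;;   # skip blank lines (assumption, not from the original)
            r:*) find "${spec#r:}" -type d -name .git -prune -print |
                     sed 's|/\.git$||' ;;
            *)   printf '%s\n' "$spec" ;;
        esac
    done < "$1"
}
```

Calling `expand_specs ~/.gitstale` would print the list of repositories that the checks are then run against.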

Hopefully it will be useful to you, though I think it is probably too
specific to my workflow to be part of git.

-Peff

-- >8 --
#!/usr/bin/perl

use strict;
use Getopt::Long;

my $CONFIG_FILE = "$ENV{HOME}/.gitstale";

my $nofetch = $ENV{GITSTALE_NOFETCH};
Getopt::Long::Configure(qw(bundling));
GetOptions('nofetch|n!' => \$nofetch) or exit 100;

my @proj...

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 4:16 am

Very well said ;-)
-

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:30 am

I am as much of an old-timer as you are, so I am not qualified to add much
variety to the discussion, but I agree that excessive cruft is
something we should warn the user about.

I personally was _extremely_ annoyed by git-cvsimport
occasionally deciding to repack whenever it finds more than
a certain number of loose objects, not because it is a big import,
but because I happened to start the command to start a very
small import after doing my own development for a while to
accumulate loose objects, and I really hate automatic repacking
for any operation (or tool that thinks it knows better than I do
in general).

Perhaps _exiting_ "git-commit" and "git-fetch" before doing
anything, when the repository has more than 5000 loose objects
with a LOUD bang that instructs an immediate repack would be
good?
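
As a rough illustration of such a check, loose objects can be counted by looking at the two-hex-digit fan-out directories under the object store. The sketch below is POSIX sh with hypothetical helper names (count_loose, check_loose); the 5000 default is the number suggested above, and "git count-objects" reports a similar count without any scripting.

```shell
# count_loose GIT_DIR: count files in GIT_DIR/objects/XX (XX = two hex digits).
# Pack files live in objects/pack and are therefore not counted.
count_loose() {
    find "$1"/objects/[0-9a-f][0-9a-f] -type f 2>/dev/null | wc -l | tr -d ' '
}

# check_loose GIT_DIR [LIMIT]: fail loudly when there are more than LIMIT
# loose objects (default 5000, the figure suggested in this message).
check_loose() {
    n=$(count_loose "$1")
    if [ "$n" -gt "${2:-5000}" ]; then
        echo "error: $n loose objects in $1; run 'git gc' first" >&2
        return 1
    fi
}
```

A wrapper could call `check_loose .git || exit 1` before doing any real work.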

I really do not like the idea of automatically running a repack
after first interrupting the original command and then resuming.
For one thing it would make a horribly difficult situation to
debug if anything goes wrong. You cannot reproduce such a
situation easily.
-

To: Junio C Hamano <gitster@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 4:51 am

On 5/9/2007, at 9:30, Junio C Hamano wrote:

To: <git@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>
Date: Wednesday, September 5, 2007 - 4:13 am

What about some sort of middle ground:

When git-fetch and git-commit have done their job and are about to exit, they
check the number of loose objects, and if it is too high tell the user something
like "There are too many loose objects in the repo, do you want me to repack?
(y/N)". If the user answers "n" or simply <Enter>, it exits immediately
without doing anything, but if the user answers "y", or if there is no
response, say, within a minute (i.e. the user went to lunch), the repack is
initiated. (Of course, the user should be told that a Ctrl-C will abort the
repack and not be harmful in any way.)

If the user answers "n" (or aborts the repack), the question will keep popping
up on the next git-{commit,fetch} to remind/annoy the user until a repack is
done.

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

To: Johan Herland <johan@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, <git@...>
Date: Wednesday, September 5, 2007 - 4:39 am

I don't like commands to be interactive if they don't _need_ to be so.
It kills scripting, it makes it hard for a front-end (git gui or so)
to use the command, ...

--
Matthieu
-

To: Matthieu Moy <Matthieu.Moy@...>
Cc: Johan Herland <johan@...>, Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, <git@...>
Date: Wednesday, September 5, 2007 - 4:51 am

There is absolutely no problem here, as it can be avoided if the
output is not a tty. It's not _that_ hard to guess if you're currently
running in a script or in an interactive shell after all.
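
In shell terms, the check alluded to here is the standard "test -t" on a file descriptor. A minimal sketch, with a hypothetical helper name (hint) and treating "stderr is a terminal" as a heuristic for "a human is watching":

```shell
# hint MESSAGE: print MESSAGE on stderr, but only when stderr is a terminal.
# When output is redirected or captured by a script, stay silent.
hint() {
    if [ -t 2 ]; then
        printf 'hint: %s\n' "$1" >&2
    fi
    return 0
}
```

For example, `hint "many loose objects; consider running git gc"` prints nothing when called from a cron job or a pipeline.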

Really, git commit/fetch/... whatever suggesting to repack/gc when it
believes it begins to be critical to performance is not a bad idea.
The risk is that the warning could be printed very often, but
that can be avoided trivially by writing to a state file in the
.git directory that the warning was printed not so long ago, and
that git should STFU for some more commits/time.
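
One way such a state file could work, sketched in POSIX sh. The function name should_warn, the state file location, and the idea of storing a Unix timestamp in it are all assumptions made for illustration; nothing like this is claimed to exist in git.

```shell
# should_warn STATE_FILE INTERVAL NOW: succeed (and record NOW) if the last
# warning is at least INTERVAL seconds old or was never recorded.
# NOW is passed in explicitly, e.g. "$(date +%s)", to keep the logic testable.
should_warn() {
    state_file=$1
    interval=$2
    now=$3
    last=0
    if [ -f "$state_file" ]; then
        last=$(cat "$state_file")
    fi
    if [ $((now - last)) -ge "$interval" ]; then
        echo "$now" > "$state_file"
        return 0
    fi
    return 1
}
```

A caller might write `should_warn .git/gc-warned 86400 "$(date +%s)" && echo "consider running git gc" >&2` to nag at most once a day (the file name gc-warned is made up).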

-- 
·O· Pierre Habouzit
··O madcoder@debian.org
OOO http://www.madism.org

To: Johan Herland <johan@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, <git@...>
Date: Wednesday, September 5, 2007 - 5:04 am

I do find it hard to guess _reliably_ if you're running interactively
or not. For example, I've been bitten recently by "git log" running
inside a pager while I was launching it non-interactively inside Emacs
(as part of DVC). I don't know whether this was git's or Emacs's
fault, and the fix was not too hard (GIT_PAGER=cat), but it took some
of my time to get it working.

Adding more interactive stuff means adding more opportunities for this
kind of problem. None will be a huge problem, but each problem will
take some time to be fixed (I'm pretty sure adding an interactive
prompt in git-commit will break DVC's commit functionality, and we'll

_Suggesting_ is a good idea, definitely. Something like

    if (number_of_unpacked > 1000 && number_of_unpacked < 10000) {
        printf("more than 1000 unpacked objects. Think of running git-gc\n");
    } else if (number_of_unpacked >= 10000) {
        printf("HEY, WHAT THE HELL ARE YOU DOING WITH >10000 UNPACKED OBJECTS???\n"
               "I TOLD YOU TO REPACK\n");
    }

would be fine with me. The proposal to run git-gc in the background,
with low priority seems to be a good idea too.

But please, don't put an interactive prompt where it's not needed.

--
Matthieu
-

To: Matthieu Moy <Matthieu.Moy@...>
Cc: Linus Torvalds <torvalds@...>, Junio C Hamano <gitster@...>, <git@...>
Date: Wednesday, September 5, 2007 - 4:41 am

Ok, so add an option or config variable to turn on/off this behaviour.

...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

To: Junio C Hamano <gitster@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:26 am

Hi!

This may break automation. I run git-gc monthly via cron, but that
doesn't guarantee I won't get 5000 loose objects before that. And I
agree that an automatic run is annoying. Perhaps a simple BIG FAT WARNING
is the best after all.

--
Tomash Brechko
-

To: Linus Torvalds <torvalds@...>
Cc: Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:21 am

(resent with CC to git@)

I never followed up on one of your suggestions back in the day -- that
we print an informational msg along the lines of "you have X loose
objects, it's about time to repack" after some operations (fetch,
merge, commit). These days it's all C, so I'll pass the buck to people
that actually know how to do printf() ;-)

Also -- early users got everything exploded during clone, James is
probably one of them. It is the worst case scenario, really. Users of
a modern git will start off with a large pack, and accumulate little
packs from pulls, so it's not as bad.

In fact, in James' case, it would have been way way way faster to
"steal" the packs from git.kernel.org via http (or your laptop) and
_then_ repack. He'd have been sorted in a minute.

cheers,

martin
-

To: Martin Langhoff <martin.langhoff@...>
Cc: Linus Torvalds <torvalds@...>, Git Mailing List <git@...>
Date: Wednesday, September 5, 2007 - 3:37 am

git-gui pops up a dialog that says precisely that, and gives you the
choice of repacking right then and there, or skip it.

As for truly automatic repacking after commands such as fetch, it
could probably be a config option (defaulting to "on"). It'd be
important to have "press any key to abort repacking (with no ill
effects)" type functionality, though.

--
Karl Hasselstr
