So we had a git BoF at linux.conf.eu yesterday, and I learnt something
new: even people who have been using git for a long time apparently don't
necessarily realize the importance of repacking.
James Bottomley (the Linux SCSI maintainer) is an old-time BK user, and
very comfy using git. But when he was demonstrating things on his poor old
laptop, simple things like "git branch" literally took a long time, and
James didn't seem to realize that the fact that he had apparently never
ever repacked his repository was a big deal.
The kernel archive is a 190MB pack for me fully repacked (I just checked -
I had actually thought that it was somewhat larger than that), but because
James hadn't repacked, his .git directory was over a gigabyte in size, and
his laptop wasn't able to cache anything at all effectively as a result.
Repacking it took over an hour, simply because everything was *so*
unpacked, and James' kernel repository had something like 92 thousand
loose objects, and several hundred packfiles. Simple operations that
really take much less than a second for me ("git branch" takes 0.022s on
my laptop, which has the same 512M that James had on his) took many many
seconds as a result, and James seemed to think that this was all normal.
And James didn't even want to repack, because it was so expensive (which
he knew - he claims to have never ever repacked at all, but maybe he had
started it and just control-C'd it when it was really slow at some point).
Now, it may be that James didn't realize how important the occasional
garbage collect is exactly *because* he is an old-timer and used BK long
before he used git, and just continued using git simply as a BK
replacement, but it did make me wonder whether maybe this lack of
repacking awareness is fairly common.
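
To check whether a repository is in the state described above, a minimal sketch like the following counts loose objects and packfiles by looking at the object directory directly. The function names are made up for illustration, and the sketch assumes the standard layout (loose objects under objects/xx/, packs under objects/pack/). "git count-objects -v" reports similar numbers.

```shell
# Count loose objects: regular files under objects/xx/, where xx is a
# two-hex-digit fan-out directory. $1 is the path to the .git directory.
count_loose_objects() {
    find "$1/objects" -type f -path '*/objects/[0-9a-f][0-9a-f]/*' 2>/dev/null |
        wc -l | tr -d ' '
}

# Count packfiles: pack-*.pack files under objects/pack/.
count_packs() {
    find "$1/objects/pack" -type f -name 'pack-*.pack' 2>/dev/null |
        wc -l | tr -d ' '
}
```

Running `count_loose_objects .git` and `count_packs .git` from the top of a work tree gives a quick idea of how unpacked the repository is.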
I've been against automatic repacking, but that was really based on what
appears to be potentially a very wrong assumption, namely that people
would do the manual repack on their own. If it turns out...

I do it from time to time. Seldom in working repositories, because they
usually come and go before they have a chance to accumulate enough loose
objects. I do a partial repack (git repack -d) after every import from the
p4 repo, because every snapshot of it is an ugly mess changing files all
over the tree. Sometimes I also repack after I have merged a big chunk with
the p4 repo and sent it over (the process involves a rebase). It is usually
a conscious decision when to do a repack or gc.

The repack time is seldom a problem: it is fast enough even on Windows (and
I do have big repos and binary objects). The gc causes my machines to swap,
though. Some of them heavily, so there my repos stay partially packed for
longer. I do use .keep packs for this reason (and because Windows or cygwin
or both have more problems with big files than they have with small ones).

I used to clone repos with "-s", but quickly stopped after a few broken
histories. This also taught me to think before running "git gc" or
"git repack -a -d". On rare occasions I even use "git repack -a -d -l" and
"git pack-refs" separately.

This was all specific to my day job. At home, on Linux systems, I just run
git-gc whenever I please, without even thinking why. It mostly finishes in
less than a minute (the kernel: ~40-50 sec on my P4 2.6GHz, 1Gb).
-
Well, this may just prove I'm an idiot, but one of the reasons I rarely run
it is that I have trouble remembering exactly what it does; in particular,

- does it prune anything that might be needed by a repo I cloned with -s?
- is there anything that's unsafe to do while the git-gc is running?
- what are the implications for http users if this is a public repo?
- is git-gc enough on its own or should I be running something more
  aggressive occasionally too?

No doubt they all have simple answers, which probably amount to "just don't
worry about it", and which I could have found in less time than it'd take to
write this email. But when I've got other work to do, reading "man git-gc"
is just enough effort for me to postpone the whole thing to another day.

So, anyway, your message reminded me to run git-gc on my main working repo.
At which point one of my personal scripts immediately started failing--it
was assuming it could find any ref under .git/refs/, and I hadn't realized
(or maybe I had once, and I'd forgotten) that git-gc packs refs by default
now. Bah.

I don't know what the moral of that story is.

--b.
-
YES! yikes. This is about the best argument put forth so far for not automatically running git-gc. Personally, I think git-gc should not remove unreferenced objects without --prune (but I haven't done anything about it). But even if git-gc was modified in this way, an occasional git-gc --prune would still be necessary to remove all of the unreferenced and dangling objects safely with a human thinking about the shared repo implications (unless shared repo handling is modified). -brandon -
Well, it could also mean that if git finds a dead symbolic link when looking up an object, it should check the corresponding link target directory for a pack file with the respective object... and if it finds such a pack file, create a link to it and use it. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
The problem here is that the clone could be having refs on objects from the origin that don't have refs left there. git-gc might, at some point, prune these refs, and the clone would have dangling refs. That could easily happen, for example, if you rebase a branch in the origin, but still have a clone with the original branch. Mike -
One of the two of us is very confused about what "git-clone -s" does. See the git-clone man page. I don't think symlinks are involved. --b. -
Guilty as charged. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
Hi, I am very new to git but I have thought about this a bit from a user's perspective. I have several thoughts on the matter. First, I would like to point out that the hg folks like to compare themselves to git a lot and they list the need for manual gc as a reason to choose hg over git. This may not be something that the git community cares about but I thought I would point it out. Second, it *is* a hassle. When trying to figure out what I could convince my co-workers to use, having to gc was something that I did not think they would be conscious of or care enough about to do. It makes git more of a PITA than it could be. Similarly, I have no idea when it is a good time to do a gc. After every commit? Before push? What if I never push a repo? What if it is a remote repo only used to sync up with my co-workers, do I have to go there and periodically gc? This is one reason why I really think that gc should be *plumbing* and *not* porcelain. The user should never have to trigger a gc, they should even be discouraged from doing so. That is how other gc systems are. Can you imagine if you had a Java app that had a button on it to do a gc? When should I push it? Should I wait till the system is getting slow or just start spamming the button whenever I'm bored? I know that Java/c#/py GC are different than git gc, but they fulfill the same basic purpose as git gc. IE to clean up unused items and free up resources. Git additionally may do some re-optimization, but that is not relevant to a user. I know this goes against the general mood here (which seems to be against auto-gc) but I thought I would give my $.02 as a user of git. Thanks, Govind. -
That's a good way to think of it IMO. It's a low-level operation (albeit
one that encapsulates other, lower-level ones) that tells git to rearrange
its internal data structures. It is not something that has any user-visible
effect. Every other porcelain-level git command *does something* from the
user's point of view. Running git-gc is basically a no-op, which from the
user's point of view makes it a waste of keystrokes and an annoying
distraction from focusing on the stuff...

I'll play devil's advocate for a moment here, though, and say that, as
others have suggested in this thread, git could be made to tell you when
it's appropriate to run gc. So the "I don't know when to run it" argument
isn't a hard one to address.

With that in mind, here's what the message should look like IMO:

---
Your repository can be optimized for better performance and lower disk
usage. Please run "git gc" to optimize it now, or run "git config gc.auto
true" to tell git to automatically optimize it in the future (this will
launch processes in the background.) For more information, "man git-gc".
---

And that "gc.auto" config option (just an arbitrary name, call it something
else if that's no good) actually has four settings:

  warn (the default) - prints the warning message, at most once every N
      minutes (we can determine a good value for N)
  true - launches git-gc in the background as needed
  false - suppresses the warning and the check that triggers the warning
  foreground - launches git-gc in the foreground as needed (to make it
      easier to abort)

I don't buy the "git gc takes too much memory to run in the background"
argument as a reason automatic git-gc is a bad idea. Many of us (me
included) work on machines with plenty of memory to launch a background
git-gc without hampering our development work, and/or on repositories small
enough that it doesn't eat that much memory in the first place. And if you
make it an option that the user has to enable, people on low-memory mac...
It certainly has a sysadmin-visible effect. Repack a couple of big git repositories and that's a backup tape gone if you do incremental backups: and you can't *not* back up the pack files, even though a lot of the state in them is recoverable from elsewhere on the net: the stuff which is not recoverable is tangled up with the stuff which is. (of course the solution here was .keep files. I cheered when they were introduced and started rolling git out everywhere I could. There's just one last vast repository maintained by a horrible shell script layered atop SCCS which I have to find some way to convert...) -
I'll throw my opinion in here as well. I think git should
automatically do repacking by default, (once loose objects exceed some
threshold). There have been several posts in this thread from people who
don't want auto-gc, but these same people should be able to avoid it,
and likely without changing habits. That's because:
* They're already in the habit of manually repacking every once in a
while, (or, like Linus, much more often than strictly necessary).
* They've already got cron jobs setup to do the repacking.
And one could augment this with an option to disable the repacking of
course.
And if you're really concerned about people that don't want this
getting it anyway, just determine some useful threshold and then
double it or so before it triggers automatic repacking, (so the
automatic repacking hits only us idiots that completely neglect it).
[Pardon me for continuing to quote in the original top-posted order,
but I like the flow here.]
I know it was surprising to you, Linus, but I'm glad you noticed
it. I've seen the same thing from many users. And git actually
discourages users from learning about repacking. If the user starts
with a small (or new) project, then everything performs well, and
there's no performance problem whatsoever.
So then the problems creep up gradually, and the user has no idea that
he should be doing anything different than he's always done. Instead
the user is left to just conclude that git's performance isn't scaling
well as the project grows. That's a bad conclusion of course, and it's
bad that git sets things up so the user reaches that conclusion.
I don't think the warning message alone is a good fix. I think the
people who would understand the warning and appreciate that they could
then take care of repacking as convenient are the same people that
already understand the repacking concept, and are likely already
repacking occasionally, (so would likely never see the warning).
But the problematic case is the user who knows not...

(my 2 cents as another ordinary new git user)

Hmm, not necessarily. That a system knows what the best action is doesn't
mean that _right now_ is the best time to take that action. One subtle
difference I think between git's gc and Java/python/etc.'s gc is that in
the latter case it is, at least metaphorically, a life and death situation
- if gc isn't run, the application will run out of memory, whereas in git,
it's more of a performance degradation issue, which, sort of, can wait.

On the issue of implementation awareness, a warning message saying
something along the lines of "your repository is getting slower. You might
want to consider running 'git gc', and remember to do that from time to
time." is not much different from "your file system is getting slower. You
might want to consider running <whatever-defrag-tool>, and remember to do
that from time to time." Neither these messages nor the actions they
propose _require_ users to learn what "repacking", "loose object", or "file
fragments" are about before they can proceed.

Cheers.
--
Jing Xue
-
Can it be that getting rid of unused objects is harder once they are packed? If that is the case, an automatic pack while mucking about with temporary branches and/or confidential files would be quite a nuisance. Automatic packing maybe would be acceptable if packing was really transparent to what you do with your repo (including janitoring work). And it would be nice if automatic packing could be done in an incremental manner, not bogging down normal work. -- David Kastrup -
Well, independently of the fact that one could suppose that users should
use gc on their own, the big nasty problem with repacking is that it's
really slow. And I just can't imagine that git, which I use to commit
blazingly fast, will then be unavailable for a very long time (repacks on
my projects -- that are not as big as the kernel but still -- usually...

I do, when I'm bored and I can't get things done. You know, it has become
one of my many twitches when I have an empty tty in front of me and I'm
doing nothing useful. Though, when I'm in a hack attack, well, I don't
necessarily remember to repack.

I'm in one of the (not so many?) very lucky companies (yay start-ups) where
I could show that git was very superior, and we now use it as our sole SCM.
So when I'm in a hack attack, it's usually a busy week, and new patches,
trees, objects (sometimes with large binary things in them) flow like hell.
And the repository grows larger and larger.

Well, the way we chose to avoid the "I'm coding don't bother me with
administrivia" attitude is that our users use a small cron job that
basically runs git gc each day, and an aggressive repack (with a window of
50 or 100, I don't remember) each weekend. Because the best criterion to
repack a repository is: when there is no one on the computer. It has proven
quite good, as we have never seen a repository explode in a day, even after
some funny mistakes where people rebased some big parts of the tree many
times, generating a very large number of loose objects.

I know I don't really answer the question, but the point I try to make is
that yeah, some kind of automated way to run the gc is great, but I'm not
sure that _git_ is the tool to automate that, because when *I* use git, I
expect it to be just plain fast, and I don't want it to occasionally hang.

--
·O·  Pierre Habouzit
··O  madcoder@debian.org
OOO ...
Indeed. I repack all our git trees in the middle of the night, and our incremental backup script drops .keep files corresponding to every existing pack before running the backup. This is probably a good job for cron :) -
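
A nightly job of the kind described above might look roughly like this. It is only a sketch: the function names are made up, it assumes non-bare repositories whose metadata lives in a directory named ".git" somewhere below a common root, and it runs a plain "git gc" rather than the aggressive weekend repack mentioned earlier.

```shell
# List every ".git" directory below the root directory given as $1.
# -prune stops find from descending into the repositories themselves.
find_git_dirs() {
    find "$1" -type d -name .git -prune -print | sort
}

# Run "git gc" in each repository found below the root, reporting
# failures on stderr but carrying on with the remaining repositories.
gc_all_repos() {
    find_git_dirs "$1" | while IFS= read -r git_dir; do
        git --git-dir="$git_dir" gc --quiet ||
            echo "gc failed: $git_dir" >&2
    done
}
```

Calling `gc_all_repos "$HOME/src"` from a script run by cron in the middle of the night would cover the "when there is no one on the computer" criterion; the root path here is just an example.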
If you are setting up cron jobs to repack multiple git trees, you are not the kind of novice or casual git user who this proposal would primarily be aimed at. But in any event, since you are doing that, your repos will never accumulate a high enough percentage of loose objects (whatever the threshold is) to trigger the warning and/or automatic launch. So you can continue to operate as before, no difference in behavior, while people who don't know how / want to set up cron jobs will have their repositories cleaned too. git-gc can leave behind a "last completed" timestamp and we can suppress the check for excess loose objects until some minimum amount of time has passed since last git-gc. If that amount is greater than the interval between your cron jobs, you won't even get any (measurable) overhead from the detection to see if the warning is needed. -Steve -
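
The "last completed" timestamp idea can be sketched in a few lines. This is an illustration only: the stamp file, its format (seconds since the epoch) and both function names are assumptions, not anything git-gc actually writes.

```shell
# Succeed (exit 0) when the loose-object check should run: either no
# stamp exists yet, or at least min_interval seconds have passed.
#   $1 = stamp file, $2 = minimum interval in seconds, $3 = current time
gc_check_due() {
    stamp_file=$1 min_interval=$2 now=$3
    [ -f "$stamp_file" ] || return 0
    last=$(cat "$stamp_file")
    [ $((now - last)) -ge "$min_interval" ]
}

# Record the completion time of a gc run.
#   $1 = stamp file, $2 = current time in seconds since the epoch
record_gc_done() {
    echo "$2" > "$1"
}
```

With a nightly cron job calling `record_gc_done` after each run and a minimum interval longer than a day, `gc_check_due` would never succeed, so the only overhead is reading one small file.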
True enough: but the point is that it was only about three lines of code (a
locate and git-gc pipeline). We could just put that in the documentation...

... which people then won't read. Oh well. Sorry for the mindless...

I personally wonder if git-gc shouldn't use a proportional scheme, so that
only some packs get repacked, maybe the smallest ones (and when they grow
to the same size as the next largest one, the two get repacked into one).
This has the singular advantage that you won't have to carefully drop .keep
files everywhere or have to worry about your git-gc of 50K of loose objects
suddenly deciding to repack 100Mb of packfiles and taking ages.

It's probably not hard to implement, but I don't need it because I keep
everything packed anyway...
-
Not only that. Currently the "Counting objects" phase when running git-gc on the Linux repo takes a significant amount of time, even if there is little to repack. If any kind of automatic repack is implemented, it should be an incremental repacking only, not the full thing, i.e. git-repack without -a, or git-pack-objects with --unpacked. The idea is to be the least intrusive as possible. Also, object walking should be limited to objects linked to a commit object which is itself unpacked in order to cut on the time required to fully enumerate all objects. This way a semi-packed state will always be preserved and should be good enough. The full repacking should probably be left to manual execution of git-gc. Nicolas -
Ok, how about doing something like this?
-- >8 -- snipsnap -- >8 -- clipcrap -- >8 --
Implement git gc --auto
This implements a new option "git gc --auto". When gc.auto is
set to a positive value, and the object database has accumulated
roughly that many number of loose objects, this runs a
lightweight version of "git gc". The primary difference from
the full "git gc" is that it does not pass "-a" option to "git
repack", which means we do not try to repack _everything_, but
only repack incrementally. We still do "git prune-packed". The
default threshold is arbitrarily set by yours truly to:
- not trigger it for fully unpacked git v0.99 history;
- do trigger it for fully unpacked git v1.0.0 history;
- not trigger it for incremental update to git v1.0.0 starting
from fully packed git v0.99 history.
This patch does not add invocation of the "auto repacking". It
is left to key Porcelain commands that could produce tons of
loose objects to add a call to "git gc --auto" after they are
done their work. Obvious candidates are:
git add
git fetch
git merge
git rebase
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
builtin-gc.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 63 insertions(+), 1 deletions(-)
diff --git a/builtin-gc.c b/builtin-gc.c
index 9397482..093b3dd 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -20,6 +20,7 @@ static const char builtin_gc_usage[] = "git-gc [--prune] [--aggressive]";
static int pack_refs = 1;
static int aggressive_window = -1;
+static int gc_auto_threshold = 6700;
#define MAX_ADD 10
static const char *argv_pack_refs[] = {"pack-refs", "--all", "--prune", NULL};
@@ -28,6 +29,8 @@ static const char *argv_repack[MAX_ADD] = {"repack", "-a", "-d", "-l", NULL};
static const char *argv_prune[] = {"prune", NULL};
static const char *argv_rerere[] = {"rerere", "gc", NULL};
+static const char *argv_repack_auto[] = {"repack", "-d...

Hi,

Please don't do that. When you share objects with another git directory,
git-gc --auto can get rid of the objects when some objects go away in the
referenced repository.

So we need _at least_ to check gc.auto not being set in the repo when "git
clone --share"ing it (and fail otherwise).

My preferred way would be to set it in "git init" so that existing setups
are not affected, and put some big red message on top of the next release
notes that people might want to set gc.auto in their existing setups.

Ciao,
Dscho
-
I thought the whole point of "gc --auto" was to have something that does not lose/prune any objects, even the ones that do not seem to be referenced from anywhere. That is why invocations of "git gc --auto" do not say --prune as you saw the second patch, and the repack command "gc --auto" runs is "repack -d -l" instead of "repack -a -d -l", which means that it does run git-prune-packed after repacking but not git-prune. Maybe I am missing something... -
Hi, No, _I_ missed the fact that no pack is rewritten... Sorry for the line noise, Dscho -
No, you aren't Junio. `gc --auto` as you defined it is safe. It won't delete objects from the database. So it won't impact shared repositories, or readers that are actively running in parallel with the gc. Both of which are important. -- Shawn. -
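
The threshold test behind "gc --auto" can be approximated cheaply. The sketch below is not taken from the patch, which is cut off above. It assumes that loose objects are spread evenly over the 256 fan-out directories, so counting one of them (objects/17 is an arbitrary pick) and multiplying by 256 gives a usable estimate. The function names are made up; 6700 is the default from the patch.

```shell
# Estimate the number of loose objects from a single fan-out directory.
# $1 is the path to the .git directory.
estimate_loose_objects() {
    sample=$(find "$1/objects/17" -type f 2>/dev/null | wc -l | tr -d ' ')
    echo $((sample * 256))
}

# Succeed when the estimate reaches the threshold ($2, default 6700).
# A threshold of zero or less disables the check.
needs_auto_gc() {
    threshold=${2:-6700}
    [ "$threshold" -gt 0 ] &&
        [ "$(estimate_loose_objects "$1")" -ge "$threshold" ]
}
```

A porcelain wrapper could then do something like `needs_auto_gc .git && git repack -d -l`, which matches the non-pruning incremental repack discussed above.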
I think "repack -d -l" should be ok from a safety perspective, but I'd also
like to say that always running it incrementally is going to largely suck
after a time.

IOW, if you get lots of small incremental packs, after a while you really
*do* need to do "git gc" to get the real pack generated. In the case I saw,
James really had hundreds of pack-files. That makes all our object lookups
suck.

Yes, not having loose objects at all is a big deal too, and yes, we try to
start from the last pack-file we found (for the locality that we hope is
there), but it's still pretty bad from a cache usage standpoint, and when
we create a new object, we'll first search (in vain) in all the hundreds of
pack-files.

So would "git gc --auto" have helped James? I'm sure it would have. But he
already had lots of pack-files from doing "git fetch/pull", and while doing
the "git gc --auto" will likely *delay* the point where you need to do a
full repack, it doesn't make it go away.

We still need to tell people to do a full git gc at some point, or do it
for them. And the longer you delay doing it, the more expensive it's going
to get to do and/or the worse the final packing is going to be (especially
if it ends up reusing non-optimal packing decisions from the smaller
packs).

So I think the --auto stuff is still worth it, but it's really just pushing
the pain somewhat further out. (In the kernel community, if you fetch my
tree daily, you really *are* going to have hundreds and hundreds of
packfiles just from doing that).

So I'd really like us to also remind people to do a *real* and full "git
gc", not just the incremental ones.

Linus
-
This is a beginning of "git-merge-pack" that combines smaller
packs into one. Currently it does not actually create a new
pack, but pretends that it is a (dumb) "git-rev-list --objects"
that lists the objects in the affected packs. You have to pipe
its output to "git-pack-objects".
The command reads names of pack-*.pack files from the standard
input, outputs the objects' names in the order they are stored
in the original packs (i.e. the offset order). This sorting is
done in order to emulate the traversal order the original
"git-rev-list --objects" that was used to create the existing
pack listed the objects.
While this approach would give the resulting packfile very
similar locality of access as the original, it does not give the
"name" component you would see in "git-rev-list --objects"
output. This information is used as the clustering cue while
computing delta, and the lack of it means you can get horrible
delta selection. You do _not_ want to run the downstream
"git-pack-objects" without the optimization/heuristics to reuse
delta. IOW, do not run it with --no-reuse-delta.
To consolidate all packs that are smaller than a megabyte into
one, you would use it in its current form like this:
$ old=$(find .git/objects/pack -type f -name '*.pack' -size 1M)
$ new=$(echo "$old" | git merge-pack | git pack-objects pack)
$ for p in $old; do rm -f $p ${p%.pack}.idx; done
$ for s in pack idx; do mv pack-$new.$s .git/objects/pack/; done
Obvious next steps that can be done in parallel by interested
parties would be:
(1) come up with a way to give "name" aka "clustering cue" (I
think this is very hard);
(2) run the above four command sequence internally without
having to resort to shell wrapper (easy).
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
Linus Torvalds <torvalds@linux-foundation.org> writes:
> IOW, if you get lots of small incremental packs, after a while you really
> *do* need to do "git gc"...

Can I suggest not calling it git-merge-pack? It makes it look like it's a
new merge strategy called "pack"...

  git-merge-base
  git-merge-file
  git-merge-index
  git-merge-octopus
  git-merge-one-file
  git-merge-ours
  git-merge-recur
  git-merge-recursive
  git-merge-resolve
  git-merge-stupid
  git-merge-subtree
  git-merge-tree

Andy
--
Dr Andy Parkins, M Eng (hons), MIET
andyparkins@gmail.com
-
This gives a new meaning to the term "merge". IMHO, "git-combine-pack" would be a better name. -- Hannes -
Yeah, that makes sense, but I think this can and should be done as part of pack-objects itself as Nico suggested. So consider that patch scrapped for now. -
I wonder if this is the best way to go. In the context of a really fast
repack happening automatically after (or during) user interactive
operations, the above seems a bit heavyweight and slow to me.

I would have concatenated all packs provided on the command line into a
single one, simply by reading data from existing packs and writing it back
without any processing at all. The offset for OBJ_OFS_DELTA is relative so
a simple concatenation will just work. Then the index for that pack can be
created just as easily by reading existing pack index files and storing the
data into an array of struct pack_idx_entry, adding the appropriate offset
to object offsets, then calling write_idx_file().

All data is read once and written once, making it no more costly than a
simple file copy. On the flip side it wouldn't get rid of duplicated
objects (I don't know if that matters, i.e. if something might break
with...

It is, and IMHO not worth it. If you do it separately from the usual
pack-objects process you'll perform extra IO and decompression when walking
tree objects just to reconstruct those paths, becoming really slow by the
context definition I provided above.

If you really want to do it then the best way might simply be to reverse
your find result above, in order to use pack-objects as if the larger
packs, i.e. the ones that you don't want to merge, simply had an associated
.keep file.

In fact, since we want to _also_ perform a repack of loose objects in the
context of automatic repacking, I wonder why we wouldn't use that
--unpacked= argument to also repack smallish packs at the same time in only
one pack-objects pass.

Or maybe I'm missing something?

Nicolas
-
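
The "larger packs behave as if they had a .keep file" idea can also be tried by literally creating the .keep files. The helper below is a hypothetical sketch, not part of any patch in this thread: it marks every pack of at least a given size so that a later "git repack -a -d" leaves those packs alone and only rewrites the smaller ones plus loose objects.

```shell
# Create a .keep file next to every pack of at least min_kib KiB and
# print the names of the packs that were marked.
#   $1 = pack directory (e.g. .git/objects/pack), $2 = minimum size in KiB
keep_large_packs() {
    pack_dir=$1 min_kib=$2
    for p in "$pack_dir"/pack-*.pack; do
        [ -f "$p" ] || continue    # glob matched nothing
        # Size in KiB, rounded up.
        size_kib=$(( ($(wc -c < "$p") + 1023) / 1024 ))
        if [ "$size_kib" -ge "$min_kib" ]; then
            touch "${p%.pack}.keep"
            echo "$p"
        fi
    done
}
```

For example, `keep_large_packs .git/objects/pack 1024` would protect everything of a megabyte or more. The .keep files stay in place until removed by hand, which is the bookkeeping burden mentioned earlier in the thread.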
As I was planning to do this outside of pack-objects, I did not want to write something that intimately knows the details of I do not think duplicates create problems, as long as the pack idx remains sane. But a bigger issue is for people who fetch over dumb protocols, from a repository that repacks with "-a -d" I think this is a much better idea. You obviously need some twist to the pack-objects, and being lazy that was the reason I did not want to do this that way. When a new parameter, perhaps --lossless, is given, together with the --unpacked= parameters, we can change pack-objects to iterate over all objects in the --unpacked= packs, and add the ones that are not marked for inclusion to the set of objects to be packed, after doing the usual "objects to be packed" discovery. I am not sure --lossless is a good option name from marketing point of view, though. -
The usual command line that uses "--unpacked=<existing>" option
looks like this:
git pack-objects --non-empty --all --reflog \
--unpacked --unpacked=<existing> \
packname-prefix
This packs loose objects and objects in the named existing
packs that are reachable from any and all refs and reflog
entries. It is typically used by "git repack -a -d", which
then removes the named existing packs from the repository, and
has an effect of getting rid of unreachable objects these packs
hold.
This adds "--repack-unpacked" option to pack-objects to help
combining small packs into one, without losing unreferenced
objects that are in the packs. When this option is given in
addition to the above command line, we also make sure all the
objects in the named existing packs are included in the result.
This allows us to safely remove the packs that were named on the
command line after installing the resulting pack in the
repository.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
I am too tired to keep staring at this code now. Fixes,
improvements, replacements and enhancements, in the code,
documentation and tests, are very much welcomed.
builtin-pack-objects.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 93 insertions(+), 2 deletions(-)
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 12509fa..9bc2faa 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -21,7 +21,7 @@ git-pack-objects [{ -q | --progress | --all-progress }] \n\
[--window=N] [--window-memory=N] [--depth=N] \n\
[--no-reuse-delta] [--no-reuse-object] [--delta-base-offset] \n\
[--non-empty] [--revs [--unpacked | --all]*] [--reflog] \n\
- [--stdout | base-name] [<ref-list | <object-list]";
+ [--stdout | base-name] [--repack-unpacked] [<ref-list | <object-list]";
struct object_entry {
struct pack_idx_entry idx;
@@ -57,7 +57,7 @@ static struct object_entry **written_list;
stat...

I can't help but think it would be better to store a "struct object*" here
instead of the "const unsigned char *sha1". Both are of type pointer but we
only need the struct object* when we are working with...

Per above just pass the "struct object*" into this function and store that
into the array, instead of the sha1. Of course you now...

Really? It is not possible for two objects to be placed at the same offset
within the same packfile and yet have two different...

If you save the "struct object*" into the in_pack_object (instead of the
SHA-1) as I described above you can avoid this second
lookup_unknown_object() call when we decide that we do need to pack the
object.

Otherwise I think it looks quite sane.

--
Shawn.
-
This actually was meant to be used to sort object entries from multiple packs together. The update to pack-objects you are commenting on deals with one packfile at a time, but I think we probably should collect from all packs and then sort (which was how merge-pack used this function). Other than that, I think your comments make sense. -
I'm not sure sorting objects from multiple packs together like that is going to help deltification. It is unlikely that related objects (e.g. objects having the same path) will be located at the same offset in different packs. Nicolas -
Yes. But when you are merging several packfiles together and you
don't supply `--no-reuse-delta` then we're really just going to
copy the data from the sources to the output. There is not a lot
of deltification to be performed; maybe only a handful of loose
objects will need to locate deltas. So helping deltification is
not really of concern here.
What Junio is trying to do here is at least preserve their order
within the packfile as that should help to preserve their locality
of access.
Only I'm not sure that's the best merging strategy available to us.
What about something like this:
  1) Read all packfile indexes, sort by offset.
  2) Locate first commit object within each packfile.
  3) Get that commit's commit date; if no commit is in the
     packfile at all use the modification date of the packfile.
  4) Sort the packfiles by their chosen date descending (more
     recent items are closer to the front of the list).
  5) Add objects:
       foreach type in commit tree blob tag
         foreach packfile in sorted_packs_from_4
           while current_object->type == $type
             if (current_object->flags & ADDED) == 0
               add current_object
             current_object++
This way data is still organized by the original order that rev-list
gave us when we created the small packfiles, but we also try to place
data from more recent packfiles into the front of the new packfile.
It's a rough approximation of what rev-list would have given us for
object ordering when it performed a traversal. It's also a whole lot
cheaper than rev-list and lets us continue to include unreachable
objects, which was the point of this patch.
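
The five steps above can be sketched in a few lines of Python. This is only an illustration of the proposed ordering, not code from git: the `Pack` and `PackObject` types, their fields, and the function names are all hypothetical. One deliberate difference from the pseudocode: instead of a per-pack cursor that assumes each pack is already grouped by type, the sketch filters each pack by type, which gives the same order when that assumption holds.

```python
from dataclasses import dataclass
from typing import List, Optional

# Order in which object types are emitted (step 5 of the proposal).
TYPE_ORDER = ("commit", "tree", "blob", "tag")


@dataclass
class PackObject:          # hypothetical stand-in for one .idx entry
    sha1: str
    type: str
    offset: int
    commit_date: Optional[int] = None   # only set for commits


@dataclass
class Pack:                # hypothetical stand-in for one packfile
    name: str
    mtime: int
    objects: List[PackObject]


def pack_date(pack: Pack) -> int:
    """Steps 2-3: date of the first commit by offset, else the pack's mtime."""
    for obj in sorted(pack.objects, key=lambda o: o.offset):
        if obj.type == "commit":
            return obj.commit_date
    return pack.mtime


def merge_order(packs: List[Pack]) -> List[str]:
    """Return object names in the order they would be written."""
    # Step 4: most recent packs first.
    sorted_packs = sorted(packs, key=pack_date, reverse=True)
    # Step 1: within each pack, keep the original offset order.
    per_pack = [sorted(p.objects, key=lambda o: o.offset) for p in sorted_packs]

    added = set()          # plays the role of the ADDED flag
    result = []
    for obj_type in TYPE_ORDER:
        for objs in per_pack:
            for obj in objs:
                if obj.type == obj_type and obj.sha1 not in added:
                    added.add(obj.sha1)
                    result.append(obj.sha1)
    return result
```

No rev-list traversal is involved, so unreachable objects that live in the source packs are carried over as well.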
--
Shawn.
-
Even though our convention is "zero return means good", it goes a bit too far for matches_pack_name() to return 0 when it found the pack is what the name refers to. This fixes that silly and obvious interface bug.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
Junio C Hamano <gitster@pobox.com> writes:

> Nicolas Pitre <nico@cam.org> writes:
> ...
>> In fact, since we want to _also_ perform a repack of loose objects in
>> the context of automatic repacking, I wonder why we wouldn't use that
>> --unpacked= argument to also repack smallish packs at the same time in
>> only one pack-objects pass. Or maybe I'm missing something?
>
> I think this is a much better idea. You obviously need some
> twist to the pack-objects, and being lazy that was the reason I
> did not want to do this that way.

So what follows is two-patch series, which still is a rough sketch, as I am feeling a bit too tired to do tests and documentation (help is always welcomed, hint hint).

This message contains the first one, which is more or less independent, that exposes matches_pack_name() function from sha1_file.c, while fixing a silly and obvious interface bug.

cache.h | 1 +
sha1_file.c | 14 +++++++-------
2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/cache.h b/cache.h
index 70abbd5..3fa5b8e 100644
--- a/cache.h
+++ b/cache.h
@@ -529,6 +529,7 @@ extern void *unpack_entry(struct packed_git *, off_t, enum object_type *, unsign
extern unsigned long unpack_object_header_gently(const unsigned char *buf, unsigned long len, enum object_type *type, unsigned long *sizep);
extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
extern const char *packed_object_info_detail(struct packed_git *, off_t, unsigned long *, unsigned long *, unsigned int *, unsigned char *);
+extern int matches_pack_name(struct packed_git *p, const char *name);
/* Dumb serve...
For something with a boolean return value I agree. Otherwise, for interfaces where a non zero return value means error, it is best to use a negative value (like -1) which has less of a positive connotation, and that wasn't done here either. Nicolas -
Yea, that's a really quick repack. :-) Plus it's actually something
that can be easily halted in the middle and resumed later. Just need
to save the list of packfiles you are concatenating so you can pick
up later when you get more time.
There shouldn't be a problem with having duplicates in the packfile.
You can do one of two things:
a) Omit the duplicates from the .idx when you merge the .idx tables
together to produce the new one. Just take the object with the
earliest offset.
b) Leave the duplicates in the final .idx. In this case the
binary search may pick any of them, but it wouldn't matter
which it finds.
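
Option (a) can be sketched as follows. This is a hypothetical illustration rather than git's .idx code: each source index is modelled as a list of `(sha1, offset)` pairs, and `base_offset` is an assumed value saying where that pack's data starts inside the concatenated file (pack headers and trailers are ignored here).

```python
def merge_idx_tables(tables):
    """Merge several index tables, dropping duplicate object names.

    tables: list of (base_offset, entries), where entries is a list of
            (sha1, offset_within_source_pack) pairs.
    Returns a list of (sha1, offset_in_new_pack) sorted by sha1, keeping
    the earliest offset whenever the same object appears more than once.
    """
    earliest = {}
    for base_offset, entries in tables:
        for sha1, offset in entries:
            new_offset = base_offset + offset
            # Option (a): remember only the copy with the earliest offset.
            if sha1 not in earliest or new_offset < earliest[sha1]:
                earliest[sha1] = new_offset
    # An index is sorted by object name so it can be binary searched.
    return sorted(earliest.items())
```

With option (b) the dictionary step would simply be skipped and every pair kept, since any of the duplicate copies is an equally valid answer to a lookup.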
About the only process that might care about duplicates would be
index-pack. I don't think it makes sense to run index-pack on a
packfile you already have a .idx for. I don't think it would have
a problem with the duplicate SHA-1s either, but it wouldn't be hard
Not might, *must*. If you delete the old ones before the new
ones are ready then readers can run into problems trying to access
the objects. We've spent some effort trying to make these sorts
of operations safe. No sense in destroying that by getting the
order wrong here. :)
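
The ordering constraint can be illustrated with a small sketch. Everything here is an assumption for illustration (the `tmp_` prefix, the function name, and the idea that readers discover a pack through its .idx file); it is not how git itself is implemented. The point is only the sequence: the new pack becomes fully visible before any old pack disappears.

```python
import os


def replace_packs(pack_dir, new_name, new_pack_bytes, new_idx_bytes, old_names):
    """Install a merged pack, then (and only then) remove the old ones."""
    # 1. Write the new files under temporary names and rename them into
    #    place. The .pack goes first so that a reader who sees the new
    #    .idx can always open the matching .pack (assumed reader behaviour).
    for ext, data in ((".pack", new_pack_bytes), (".idx", new_idx_bytes)):
        tmp_path = os.path.join(pack_dir, "tmp_" + new_name + ext)
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.rename(tmp_path, os.path.join(pack_dir, new_name + ext))

    # 2. Only now delete the packs that were merged. The .idx is removed
    #    first so no reader finds an index whose pack is already gone.
    for old in old_names:
        for ext in (".idx", ".pack"):
            path = os.path.join(pack_dir, old + ext)
            if os.path.exists(path):
                os.remove(path)
```

If the process is interrupted between the two steps, the worst case is some duplicated objects on disk, which the message above already describes as harmless.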
--
Shawn.
-
Honestly, I do not believe in that mode of operation that much.
"While the user is waiting for the EDITOR"?
Because you do not know how much time you will be given before
you start, unless
(1) your process can be snapshotted and you can restart at the
next chance; or
(2) it is so cheap and you can afford to abort and start over
from scratch at the next chance; or
(3) it is so quick that you can simply have the user wait until
you are done without adding too much latency to be annoying,
when you cannot finish before the EDITOR comes back;
I think that is a false sense of "ok, we will be able to do
something else in the background meantime", which is not so
Well, I said "name" in quotes because you do _NOT_ have to give
the real name. I was not thinking about doing the actual tree
traversal at all. What you need to do is to come up with a
token that is the same for the objects in the same deltification
chain so that they cluster together, and that should be doable
by looking at the delta chain patterns inside a packfile.
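
One way to picture such a token, as a sketch under stated assumptions: if the delta base of each object in a pack is known, the name of the object at the root of the chain can serve as the shared token. The `delta_base` mapping and both function names are hypothetical, and using the chain root is just one possible choice of token.

```python
def chain_token(sha1, delta_base):
    """Return the name of the non-delta object at the root of sha1's chain.

    delta_base maps an object name to the name of its delta base; objects
    stored whole are simply absent from the mapping.
    """
    seen = set()
    current = sha1
    while current in delta_base:
        if current in seen:
            raise ValueError("delta chain contains a cycle")
        seen.add(current)
        current = delta_base[current]
    return current


def cluster_by_chain(objects, delta_base):
    """Group object names by token, preserving their input order."""
    groups = {}
    for sha1 in objects:
        groups.setdefault(chain_token(sha1, delta_base), []).append(sha1)
    return groups
```

No tree traversal is needed: every object in the same deltification chain gets the same token purely from the delta structure of the pack.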
-
I think we have to aim for #3. "Automatic" certainly doesn't imply "can be slow". It should be reasonably instantaneous, otherwise it'll become annoying quickly enough. If it can't be (almost) instantaneous in 99% of normal cases, then I think it should simply remain asynchronous, through a manual invocation of 'git gc', and we only need to teach/remind

Obviously! Sorry for being slow. But I still think that a single repack pass should already be able to pick loose objects and selected (small) packs, and produce a pack with them all. No need for a separate merge-pack I'd say.

Nicolas
-
Ok, so I had to double-check that builtin-pack-objects then deals properly with duplicate object names (which it does seem to do), so maybe it's worth adding a comment to that effect. But ACK, this seems to be the right thing to do to generate a single bigger pack from many smaller ones. Linus -
I wonder if it makes sense to repack just the small incremental packs into a large (but still incremental) pack, rather than repacking the entire repository. Presumably that would be a lot faster than a full "git gc", while still giving you reasonably good packing (at least, if the threshold is set to a high enough number of small packs) and keeping things fast.

That could run as a second phase of "git gc --auto" -- it should be quick enough to not be too terribly annoying since we're not running it in the background.

Yeah, if you use the same repo for a long time, you'll accumulate a ton of medium-sized packs this way, but (a) that's much better than the situation we have today, and (b) it puts off the performance degradation for long enough that it becomes more reasonable to expect people to find out about running the full "git gc" in the meantime, or for git to further evolve to not need it.

-Steve
-
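
The selection step of that idea might look like the sketch below. The two thresholds (`small_limit` in bytes and `min_count`) are hypothetical knobs invented for illustration; the thread does not name any such settings.

```python
def packs_to_consolidate(pack_sizes, small_limit, min_count):
    """Pick the small packs to merge, or nothing if there are too few.

    pack_sizes:  mapping of pack name to size in bytes
    small_limit: packs strictly below this size count as "small"
    min_count:   only consolidate once this many small packs exist
    """
    small = sorted(name for name, size in pack_sizes.items()
                   if size < small_limit)
    # Below the threshold, do nothing: merging two or three tiny packs
    # on every run would cost more than it saves.
    return small if len(small) >= min_count else []
```

The large packs are never touched, which is what keeps this phase much cheaper than a full repack.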
... Danger...

If the user sets `gc.auto` to a low enough value and they are also unlucky enough to have a few truly unreachable (thus prunable) objects in .git/objects/17/ then this is going to run a bunch of gc work on every commit they make.

I'm actually running into this problem in git-gui. On Windows it suggests a repack if there is one object in .git/objects/42/. Some users have been unlucky enough to stage a file, have it hash into that directory, then restage a different version of it. The prior one is never considered reachable (it was never committed), but will now *always* cause git-gui to suggest a repack on every startup. For all time.

Yea, I need to fix that. But this suffers from the same fate if the user sets gc.auto too small and doesn't realize that the reason Git is always repacking is because over the last 6 months they have been unlucky enough to stage the magic number of unreachable blobs into the 17 directory and they have *never* run `git gc --prune` because the auto thing is working just fine for them and they don't realize they need to prune every once in a blue moon.

--
Shawn.
-
Check the modification times on those files and don't count ones that are older than the last git-gc run, maybe? That'd take care of the problem. -Steve -
Eh, that could mean a bunch of stat calls that it would be nice to avoid. The counter Junio (and git-gui) implements just does a readdir(). Reasonably cheap.

Maybe just save a ".git/gc_last_auto" with the last object count of .git/objects/17, after repacking. If the count is over the gc.auto limit *and* is still over the limit after subtracting the ".git/gc_last_auto" value then consider that auto is required.

This way the file is only consulted if we are really thinking about running a repack, and it's only written to if we actually do the repack. So we only take the extra penalty if we are going to be taking a *really* big extra penalty by repacking.

--
Shawn.
-
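
A minimal sketch of that check, assuming the marker file simply holds one integer. The function names are hypothetical, and `limit` is the threshold for the single sampled directory; how it would be derived from `gc.auto` is not spelled out in the thread, so it is passed in as-is.

```python
import os


def count_loose(objects_17_dir):
    """Count entries in the sampled directory with a single directory read."""
    try:
        return len(os.listdir(objects_17_dir))
    except FileNotFoundError:
        return 0


def read_last_auto(path):
    """Read the count saved after the previous auto repack (0 if absent)."""
    try:
        with open(path) as f:
            return int(f.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        return 0


def auto_gc_needed(objects_17_dir, last_auto_path, limit):
    """Decide whether an automatic repack looks worthwhile."""
    count = count_loose(objects_17_dir)
    if count <= limit:
        return False   # cheap path: the marker file is never opened
    # Objects that survived the last repack are presumably unreachable,
    # so they should not keep triggering a new repack on their own.
    return count - read_last_auto(last_auto_path) > limit
```

After a repack actually runs, the caller would write the remaining count back to the marker file, which is the only time the file is written.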
or a non-POSIX snprintf returning "negative value" (Microsoft) -
This makes the two commands call "git gc --auto" when they
are done.
I earlier said that obvious candidates also include merge and
rebase, but these are a lot less frequent operations compared to
add, and more importantly, in a normal workflow they would
almost always happen after "git fetch" is done.
In other words, if you are a downstream developer, the automatic
invocation in "git fetch" will take care of things for you, and
otherwise if you do not have an upstream, you would be doing
your own development, so "git add" to add your changes will take
care of the auto invocation for you.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
* This is obviously a follow-up to the previous one that allows
you to say "git gc --auto". I somewhat feel dirty about
calling cmd_gc() bypassing fork & exec from "git add",
though...
builtin-add.c | 2 ++
git-fetch.sh | 1 +
2 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/builtin-add.c b/builtin-add.c
index 105a9f0..8431c16 100644
--- a/builtin-add.c
+++ b/builtin-add.c
@@ -263,9 +263,11 @@ int cmd_add(int argc, const char **argv, const char *prefix)
finish:
if (active_cache_changed) {
+ const char *args[] = { "gc", "--auto", NULL };
if (write_cache(newfd, active_cache, active_nr) ||
close(newfd) || commit_locked_index(&lock_file))
die("Unable to write new index file");
+ cmd_gc(2, args, NULL);
}
return 0;
diff --git a/git-fetch.sh b/git-fetch.sh
index c3a2001..86050eb 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -375,3 +375,4 @@ case "$orig_head" in
fi
;;
esac
+git gc --auto
--
1.5.3.1.840.g0fedbc
-
Hi,

Since all git-gc seems to do is to fork() and exec() other git programs, this should be fine (have not looked at cmd_gc() in a while, though).

Ciao,
Dscho
-
A big part of the repack cost is the counting of objects. I don't know if --unpacked to git-pack-objects skips walking trees of a packed commit object. If not, then it probably should, to gain a significant speedup, or maybe a separate option should be created to actually imply this

Nope! 'git add' creates loose objects which are not yet reachable from

I think that would be a much better idea to simply decrease the

and git commit. Which reduces it to commit-creating operations.

Nicolas
-
Good point. I think that makes sense. -
The point of auto gc is to pack new objects created in loose
format, so a good rule of thumb is where we do update-ref after
creating a new commit.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
Let's chuck the previous "git add/git fetch" one, and replace it
with this.
Also I realize I misread your earlier comment about "git add".
You are still among the only few people on the list that I
consider are always more right than I am ;-).
git-am.sh | 2 ++
git-commit.sh | 1 +
git-merge.sh | 1 +
git-rebase--interactive.sh | 2 ++
4 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/git-am.sh b/git-am.sh
index 6809aa0..4db4701 100755
--- a/git-am.sh
+++ b/git-am.sh
@@ -466,6 +466,8 @@ do
"$GIT_DIR"/hooks/post-applypatch
fi
+ git gc --auto
+
go_next
done
diff --git a/git-commit.sh b/git-commit.sh
index 1d04f1f..d22d35e 100755
--- a/git-commit.sh
+++ b/git-commit.sh
@@ -652,6 +652,7 @@ git rerere
if test "$ret" = 0
then
+ git gc --auto
if test -x "$GIT_DIR"/hooks/post-commit
then
"$GIT_DIR"/hooks/post-commit
diff --git a/git-merge.sh b/git-merge.sh
index 3a01db0..697bec2 100755
--- a/git-merge.sh
+++ b/git-merge.sh
@@ -82,6 +82,7 @@ finish () {
;;
*)
git update-ref -m "$rlogm" HEAD "$1" "$head" || exit 1
+ git gc --auto
;;
esac
;;
diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index abc2b1c..8258b7a 100755
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -307,6 +307,8 @@ do_next () {
rm -rf "$DOTEST" &&
warn "Successfully rebased and updated $HEADNAME."
+ git gc --auto
+
exit
}
-Why bother with git-rebase--interactive.sh? It calls two tools, git-cherry-pick (which calls git-commit) and git-commit to do its per-commit dirty work. So on every step of `git rebase -i` we are now running `git gc --auto`. No need to also run it at the end. Note this is also true of `git rebase -m` as that uses the wonderful feature of `git commit -C $oldid` per commit to make the new commit. -- Shawn. -
Bzzt, I am relieved to see you are sometimes wrong ;-) They are reachable from the index and are not subject to

One thing that I find lacking in that auto patch is actually that we should sometimes consolidate multiple small packs into a single larger one. Any behaviour change to encourage creation of many tiny packs should be avoided until it materializes. Probably we should introduce a built-in minimum value for a positive gc.auto, somewhere around 1000 or so, for this reason.
-
Hm. Isn't it possible to work with several index files at once? I seem to remember that even git-add does this itself. So what is it that protects objects in such a temporary index from being garbage collected by a different git process running on the same repository? -- David Kastrup, Kriemhildstr. 15, 44793 Bochum -
Why not just let the default value take care of it? If someone really wants to set gc.auto to 50, why prevent it? The more I think of it, the less I like automatic repack. There is always a bad case for it somewhere. Nicolas -
I tend to agree, but at the same time, I think the long term goal should be not to have bad cases. Old timers like ourselves learned to run "repack -a -d" when not doing real work (i.e. beginning of the day while fetching coffee, before leaving for lunch break, end of the day before leaving) and we have been _trained_ not to feel that a chore, but I think that is wrong.

"Sync freezes I/O and causes my real-time databasy job undue latency --- I would want to disable swapper/bdflush/whatever machine-wide and prefer typing 'sync' from the command line when it is convenient for me" is fine for an experienced user working on a single user machine, but it still feels wrong (we do not have "multi-user" issues in git repository, so this analogy is not quite right, though).
-
The best solution is to make "git gc" unnecessary. In the long term, and without loss of efficiency. -
I think `git fetch' works reasonably well as is: unless you're fetching every five minutes you often find you get packs anyway. There's no point packing incrementally *too* often, or you replace a lots-of-objects problem with a lots-of-packs problem, after which you're worse off than when you started. -
What about kicking off a repack in the background at the ends of certain commands? With an option to disable, of course. It could run at a low priority and could even sleep a lot to avoid saturating the system's disks -- since it'd be running asynchronously there should be no problem if it takes longer to run.

Alternately, if it's possible to break the repack work up into chunks that can be executed a bit at a time, you could do a small amount of repacking very frequently (possibly still in the background) rather than the whole thing at once. I suspect the nature of a repack, where you presumably want everything loaded at once, would make that a challenge, but it might not be impossible.

On the more general question... IMO expecting end users to regularly perform what are essentially database administration tasks (running git-gc is akin to rebuilding indexes or packing tables on a DBMS) is naive. Heck, even database administrators don't like to run database administration commands; PostgreSQL added the "autovacuum" feature precisely because manual periodic repacking (and the associated monitoring to figure out when to do it) was too annoying for developers and DBAs. But you don't have to look that far; anyone who has worked in IT can tell you horror stories of users, including developers, whose computers have slowed to a crawl because the users never bothered to defrag their hard disks. And that affects *everything* the users do, not just version control operations!

It'll get worse as better UIs and tool integration become available and git gains large numbers of users who are neither software developers nor system administrators, and wouldn't know a packfile from a hole in the ground. I'm talking web designers, graphic artists, mechanical engineers, even managers and secretaries -- all of those people are in git's ultimate target audience, even if it's not ready for them today. None of them is going to be interested in doing random housekeeping op...
