Re: Extremely slow progress during 'git reflog expire --all'

Previous thread: [PATCH v2 0/3] Handling of gitattribute macro expansion by =?UTF-8?q?Henrik=20Grubbstr=C3=B6m=20=28Grubba=29?= on Friday, April 2, 2010 - 12:33 pm. (4 messages)

Next thread: Re: 'git gc --aggressive' effectively unusable by Frans Pop on Friday, April 2, 2010 - 3:12 pm. (1 message)
From: Frans Pop
Date: Friday, April 2, 2010 - 12:54 pm

I wanted to to a 'git gc' on my kernel repo, but that seemed to end in a 
loop: loads of CPU usage, no output. Using 'ps' I found it's not 'git gc' 
itself, but 'git reflog' that's causing the problem.

From the strace below it does seem like it still makes some progress, but 
I've never had it take anywhere near this long before. Normally it starts 
the count of objects almost immediately.

It's using hardly any memory at all but has one core going flat out.

I'm seeing this with both git 1.6.6.1 and 1.7.0.3 on the same repo.
Environment:
- Debian amd64/Lenny; Core Duo x86_64 2.6.34-rc3 -> 1.6.6.1
- Debian i386/Sid; chroot on the same machine -> 1.7.0.3
I've also tried with 2.6.33 to rule out a kernel issue.

Here's the tail end of an strace I ran. I broke it off after 9+ minutes, 
but I had let it go for longer than that earlier. You can clearly see 
where it starts to "stall" at 21:40:14.

Cheers,
FJP

$ strace -t git reflog expire --all
[...]
21:40:11 open(".git/logs/HEAD.lock", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
21:40:11 open(".git/objects/db/315842dd99ce7fd59697df36185df4b807000a", 
O_RDONLY|O_NOATIME) = 23
21:40:11 fstat(23, {st_mode=S_IFREG|0444, st_size=289, ...}) = 0
21:40:11 mmap(NULL, 289, PROT_READ, MAP_PRIVATE, 23, 0) = 0x7f9b135bb000
21:40:11 close(23)                      = 0
21:40:11 munmap(0x7f9b135bb000, 289)    = 0
21:40:11 open(".git/logs/HEAD", O_RDONLY) = 23
21:40:11 fstat(23, {st_mode=S_IFREG|0644, st_size=171397, ...}) = 0
21:40:11 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|
MAP_ANONYMOUS, -1, 0) = 0x7f9b135bb000
21:40:11 read(23, "bea4c899f2b5fad80099aea979780ef19"..., 4096) = 4096
21:40:11 fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
21:40:11 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|
MAP_ANONYMOUS, -1, 0) = 0x7f9b135ba000
21:40:11 read(23, "aca305e8a0209efda740b522024cfdefe"..., 4096) = 4096
21:40:11 read(23, "es\n134a0f063d0c48f08b8731118c4732"..., 4096) = 4096
21:40:12 read(23, "c82b7849a0fe ...
From: Jeff King
Date: Friday, April 2, 2010 - 2:28 pm

FWIW, I have seen this, too, and managed to get an strace snippet that
looked similar to what you saw (mostly memory allocation, and otherwise
chewing on the CPU). I'm guessing there is some O(n^2) loop in there
somewhere. Unfortunately, mine actually completed after a few minutes
and I wasn't able to replicate.

Can you reproduce the problem on your repo? If so, can you possibly tar
it up and make it available (probably just the .git directory would be
enough)?

-Peff
--

From: Frans Pop
Date: Friday, April 2, 2010 - 2:50 pm

Yes, I can reproduce. I've run it a few times and broke it off each time 
before it finished as I really had no idea how much longer it might take.

$ du -sh .git
1008M   .git

I can make that available, but it's going to take a while to upload and I 
don't want to leave it up too long as I'll be abusing a project service 
for that. So the people who want to look at it would have grab it fairly 
promptly (within 2 days or so).

Is that OK?

Cheers,
FJP
--

From: Jeff King
Date: Friday, April 2, 2010 - 3:41 pm

That would be fine. If even that is a problem, we can arrange off-list
to get it from you privately and I can stick it somewhere for a week or
two.

-Peff
--

From: Frans Pop
Date: Saturday, April 3, 2010 - 7:29 am

The tarball is up at:
http://alioth.debian.org/~fjp/.tmp/linux-2.6_reflog-issue.tar

Because of the Easter weekend I'll leave it up a bit longer. I plan to 
remove it sometime on Thursday.

Cheers,
FJP
--

From: Jeff King
Date: Saturday, April 3, 2010 - 1:33 pm

Thanks, I was able to get it and reproduce your problem. The slowness is
in the expire-unreachable code. You can work around it with:

  git config gc.reflogExpireUnreachable never

Obviously that's not really a fix, but it should let your "git gc" work.

It looks like we do two merge-base calculations for each reflog entry,
which is what takes so long. Perhaps if we know we are going to do a
large number of reachability checks, we can pre-mark all reachable
commits, and then each reflog entry would just need to check the commit
mark.

I don't have any more time now to look at it, but I am cc'ing Junio (who
wrote the original expire-unreachable code) and Shawn (the resident
reflog expert), who may have more input.

-Peff
--

From: Jeff King
Date: Saturday, April 3, 2010 - 1:35 pm

[sorry, resend, I managed to totally screw up Shawn's email in the last
one. Stupid mutt aliases.]


Thanks, I was able to get it and reproduce your problem. The slowness is
in the expire-unreachable code. You can work around it with:

  git config gc.reflogExpireUnreachable never

Obviously that's not really a fix, but it should let your "git gc" work.

It looks like we do two merge-base calculations for each reflog entry,
which is what takes so long. Perhaps if we know we are going to do a
large number of reachability checks, we can pre-mark all reachable
commits, and then each reflog entry would just need to check the commit
mark.

I don't have any more time now to look at it, but I am cc'ing Junio (who
wrote the original expire-unreachable code) and Shawn (the resident
reflog expert), who may have more input.

-Peff
--

From: Junio C Hamano
Date: Sunday, April 4, 2010 - 11:22 am

Thanks for the analysis, but expire_reflog() that is run for each ref
already does that, I think.  It first runs mark_reachable(), then walks
each reflog entry for the ref to call expire_reflog_ent(), which in turn
calls unreachable() that first checks if mark_reachable() has marked the
commit, and if so we don't run in_merge_bases().

But if the commit in question is not reachable, then we end up running
in_merge_bases() to double-check anyway, which is probably the symptom
that was observed.

So perhaps this is a workable compromise?

 builtin/reflog.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/builtin/reflog.c b/builtin/reflog.c
index 64e45bd..7e278b8 100644
--- a/builtin/reflog.c
+++ b/builtin/reflog.c
@@ -230,6 +230,13 @@ static int unreachable(struct expire_reflog_cb *cb, struct commit *commit, unsig
 	/* Reachable from the current ref?  Don't prune. */
 	if (commit->object.flags & REACHABLE)
 		return 0;
+	/*
+	 * Unless there was a clock skew, younger ones that are
+	 * reachable should have been marked by mark_reachable().
+	 */
+	if (cb->cmd->expire_total < commit->date)
+		return 1;
+
 	if (in_merge_bases(commit, &cb->ref_commit, 1))
 		return 0;
 



--

From: Jeff King
Date: Sunday, April 4, 2010 - 11:26 pm

Hmm. It looks like mark_reachable() stops traversing when it hits a
commit older than expire_total. I imagine that's to avoid going all the
way to the roots. But if we hit any unreachable entry, in_merge_bases()
is going to have to go all the way to the roots, anyway.

If we just marked everything, couldn't we then trust the REACHABLE
value and not have to do the in_merge_bases double-check? It has worse
best-case cost (since we always go to the root), but better worst-case
(since we can potentially go to the roots for each unreachable reflog
entry).


If we haven't done a "reflog expire" in a while, won't we have a bunch
of old commits that will still need double-checked, and produce the same
slow behavior? Or is that what your "clock skew" is meant to mean? That
we would have removed those old ones already assuming the commit date
and the reflog entry for them match up? That's not necessarily true if
you move to an older commit, which freshens its reflog entry compared to
the commit date.

I wonder if, in addition to your patch, we should remove the
double-check in_merge_bases and simply report those old ones as
reachable. We may be wrong, but we are erring on the side of keeping
entries, and they will eventually expire in the regular cycle (i.e., 90
days instead of 30).

All of that being said, your patch does drop Frans' case down to about
1s of CPU time, so perhaps it is not worth worrying about beyond that.

-Peff
--

From: Junio C Hamano
Date: Monday, April 5, 2010 - 11:54 am

Yeah, an alternative is to keep the list of commits where the initial
mark_reachable() run stopped, and instead of doing in_merge_bases(),
lazily restart the traversal all the way down to root, and then rely

What I meant was you commit A, a child B, rewind clock to commit its child
C at an incorrect and old time, and fix clock to commit its child D which
is at the tip of a ref.  mark_reachable() will stop at C saying that is

I think a reasonable solution would be along the lines you described, but
the patch you are responding to does err on the wrong side when a clock
skew is there.  Does it matter?  Probably not.

--

From: Jeff King
Date: Monday, April 5, 2010 - 11:02 pm

Ah, yeah, that is much more clever. It has the same worst case
performance as what I proposed, but is much more optimistic that we

True. With the technique you mentioned above, you would reverse your
test and do:

  if (flags & REACHABLE)
    return 0;
  if (expanded_reachable_to_root)
    return 1; /* we know it's not */
  expand_reachable_to_root();
  return !(flags & REACHABLE);

I don't think I care enough to do a patch, though. I don't have a
problem with you applying what you posted earlier.

-Peff
--

From: Junio C Hamano
Date: Wednesday, April 7, 2010 - 11:39 am

Actually I do; I think it breaks correctness a big way (the second
paragraph of the proposed log message of the following).

-- >8 --
Subject: [PATCH] reflog --expire-unreachable: avoid merge-base computation

The option tells the command to expire older reflog entries that refer to
commits that are no longer reachable from the tip of the ref the reflog is
associated with.  To avoid repeated merge_base() invocations, we used to
mark commits that are known to be reachable by walking the history from
the tip until we hit commits that are older than expire-total (which is
the timestamp before which all the reflog entries are expired).

However, it is a different matter if a commit is _not_ known to be
reachable and the commit is known to be unreachable.  Because you can
rewind a ref to an ancient commit and then reset it back to the original
tip, a recent reflog entry can point at a commit that older than the
expire-total timestamp and we shouldn't expire it.  For that reason, we
had to run merge-base computation when a commit is _not_ known to be
reachable.

This introduces a lazy/on-demand traversal of the history to mark
reachable commits in steps.  As before, we mark commits that are newer
than expire-total to optimize the normal case before walking reflog, but
we dig deeper from the commits the initial step left off when we encounter
a commit that is not known to be reachable.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * By the way, the following diff is extremely hard to read, as for some
   reason "diff" failed to notice that the only change is that we have
   added lines at the beginning of mark_reachable(), modified only the
   latter half of unreachable(), and moved mark_reachable() from after
   unreachable() to before.  Neither "git diff --patience" nor running GNU
   diff on to blobs helps.  Hmph...

 builtin-reflog.c |   96 +++++++++++++++++++++++++++++++----------------------
 1 files changed, 56 insertions(+), 40 deletions(-)

diff --git ...
From: Junio C Hamano
Date: Wednesday, April 7, 2010 - 11:43 am

Side note.

It may be an improvement to dig the history even more incrementally.

Inside unreachable(), we currently dig immediately down to root, but it
may give us a better performance in a long history with reflog entries
that wildly jump everywhere in that history if we dug down to the
timestamp of the commit we are looking at.  A patch to do so on top of the
previous one may look like this.

 builtin-reflog.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/builtin-reflog.c b/builtin-reflog.c
index 9792090..42d225f 100644
--- a/builtin-reflog.c
+++ b/builtin-reflog.c
@@ -274,16 +274,19 @@ static int unreachable(struct expire_reflog_cb *cb, struct commit *commit, unsig
 			return 0;
 	}
 
-	/* Reachable from the current ref?  Don't prune. */
-	if (commit->object.flags & REACHABLE)
-		return 0;
+	while (1) {
+		/* Reachable from the current ref?  Don't prune. */
+		if (commit->object.flags & REACHABLE)
+			return 0;
 
-	if (cb->mark_list && cb->mark_limit) {
-		cb->mark_limit = 0; /* dig down to the root */
+		/* Did we mark everything?  Then we know we cannot reach it. */
+		if (!cb->mark_list || !cb->mark_limit)
+			return 1;
+
+		/* Dig down to the timestamp of this commit, or down to root. */
+		cb->mark_limit = (cb->mark_limit < commit->date) ? 0 : commit->date;
 		mark_reachable(cb);
 	}
-
-	return !(commit->object.flags & REACHABLE);
 }
 
 static int expire_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
-- 
1.7.1.rc0.212.gbd88f

--

From: Jeff King
Date: Thursday, April 8, 2010 - 12:00 am

I doubt it matters much in practice. The important features of the
solution are:

  - for most refs, which are either all-reachable or which don't have
    clock skew, don't go to the roots at all. This is the fast case that
    we should do most of the time.

  - for others, don't go to the roots over and over for each entry. This
    is the slow case, but we just need to make sure it's a not the
    horrible slow case that Frans saw.

Your suggestion speeds up the slow case a little bit, but it is already
acceptably fast. Plus this is an optimistic optimization. There are
still cases where you might have to dig to the roots anyway (e.g.,

Why dig to the timestamp? You know you're looking for a particular
commit, so you can dig down to that commit. If it's reachable, stop
there and you have your answer. Any work you do beyond that might be
used for a further entry lookup, but it might not at all. If it's not
reachable, then you're going to end up digging down to the roots anyway.

With an unreachable ref and your scheme, you would just end up not
finding it, extending again, not finding it, extending again, etc. But
you can never be sure it's truly unreachable without going to the roots
or assuming no clock skew, so that is what we'll end up doing.

-Peff
--

From: Jeff King
Date: Wednesday, April 7, 2010 - 11:52 pm

Oh, right. Didn't I even mention that case earlier in the thread? I was
just being dumb. Or maybe I was pretending to be dumb, so that I could

Patch looked fine from my reading. I no longer have Frans' gigantic test
repo available, though, so I can't test whether it fixes the problem
(but I'm pretty sure it must from my earlier analysis).

-Peff
--

Previous thread: [PATCH v2 0/3] Handling of gitattribute macro expansion by =?UTF-8?q?Henrik=20Grubbstr=C3=B6m=20=28Grubba=29?= on Friday, April 2, 2010 - 12:33 pm. (4 messages)

Next thread: Re: 'git gc --aggressive' effectively unusable by Frans Pop on Friday, April 2, 2010 - 3:12 pm. (1 message)