Re: Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

Previous thread: [PATCH] coretemp: Recognize Nehalem CPUs by Darrick J. Wong on Wednesday, August 13, 2008 - 6:51 pm. (5 messages)

Next thread: [GIT]: Sparc by David Miller on Wednesday, August 13, 2008 - 7:45 pm. (2 messages)
To: linux-kernel@vger.kernel.org <linux-kernel@...>
Date: Wednesday, August 13, 2008 - 7:03 pm

I ran into this trying 2.6.27-rc3 today:

[6.461876] kernel BUG at mm/vmalloc.c:217!

(full boot log attached)

I bisected it to:

git-bisect bad 1c89ac55017f982355c7761e1c912c88c941483d

which is:

commit 1c89ac55017f982355c7761e1c912c88c941483d
Merge: 88fa08f... b1b135c...
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue Aug 12 08:40:19 2008 -0700

Merge
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus

*
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
fix spinlock recursion in hvc_console
stop_machine: remove unused variable
modules: extend initcall_debug functionality to the module loader
export virtio_rng.h
lguest: use get_user_pages_fast() instead of get_user_pages()
mm: Make generic weak get_user_pages_fast and EXPORT_GPL it
lguest: don't set MAC address for guest unless specified

Backing out to the previous commit boots just fine. I've attached the
boot.log, config file, lspci.txt and git-bisect-log. If there is
anything else I can do, I'd be happy to oblige, but I'm checking out
right now. I'll check in tomorrow morning (Eastern USA).

Alan

To: Alan D. Brunelle <Alan.Brunelle@...>
Cc: <linux-kernel@...>
Date: Tuesday, August 19, 2008 - 3:46 pm

On Wed, 13 Aug 2008 19:03:20 -0400

Please confirm that this happened at

static struct vm_struct *
__get_vm_area_node(unsigned long size, unsigned long flags, unsigned long start,
unsigned long end, int node, gfp_t gfp_mask, void *caller)
{
struct vm_struct **p, *tmp, *area;
unsigned long align = 1;
unsigned long addr;

Thing is, the code *isn't* in an interrupt. Something got screwed up.

What can happen is that some code has a lock imbalance or a
preempt_disable imbalance and it does preempt_disable() so many times
that the counter overflows eight bits and starts to increment the
softirq counter, then the hardirq counter, then in_interrupt() starts
incorrectly returning true then blam.

You could confirm/debug it with something along the lines of this:

--- a/mm/vmalloc.c~a
+++ a/mm/vmalloc.c
@@ -214,7 +214,9 @@ __get_vm_area_node(unsigned long size, u
unsigned long align = 1;
unsigned long addr;

- BUG_ON(in_interrupt());
+ if (preempt_count() > 10)
+ printk("%s: preempt_count()=%d\n", __func__, preempt_count());
+ WARN_ON(in_interrupt());
if (flags & VM_IOREMAP) {
int bit = fls(size);

_

But this bug could be in practically anywhere in the kernel and
bisection is by far the best way to find it. It's sad and odd that
bisection landed you on a merge commit. I'd suggest that you persist
with the bisection (please).
http://www.kernel.org/doc/local/git-quick.html#example might be useful.

--

To: Andrew Morton <akpm@...>
Cc: Alan D. Brunelle <Alan.Brunelle@...>, <linux-kernel@...>
Date: Monday, August 25, 2008 - 2:14 pm

On Tue, Aug 19, 2008 at 9:46 PM, Andrew Morton

Hm.

Config doesn't even have CONFIG_PREEMPT, so preempt_disable() should be a no-op.

Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--

To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>
Date: Friday, August 22, 2008 - 11:37 am

Sorry, was off on other things for the last couple of days:

I /did/ bisect it down to the aforementioned merge, the question is: How
to crack open that merge into it's composite pieces? (Where do I go to
bisect within that?)

I'll try the patch above, and see what that reports.

Alan
--

To: Alan D. Brunelle <Alan.Brunelle@...>
Cc: <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 12:25 pm

urgh, it's irritating when git-bisect directs you to a merge commit - it
hasn't done it for me for ages.

One (probably wrong) approach is to run

gitk 1c89ac55017f982355c7761e1c912c88c941483d

then peer at the output, work out which real commits were in that
merge.

It looks like the merge ended with
b1b135c8d619cb2c7045d6ee4e48375882518bb5 and started with
40c42076ebd362dc69210cccea101ac80b6d4bd4, so perhaps you can do

git bisect bad b1b135c8d619cb2c7045d6ee4e48375882518bb5
git bisect good 40c42076ebd362dc69210cccea101ac80b6d4bd4

That has an off-by-one error, in that it assumes that
40c42076ebd362dc69210cccea101ac80b6d4bd4 is good, which we don't know.
Perhaps one can pick the preceding commit in the `git log' output to
fix that.

Anyway, I've sent sufficient wrongness to the right list for us to work
out how to do this for real ;)

--

To: Andrew Morton <akpm@...>
Cc: Alan D. Brunelle <Alan.Brunelle@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 1:16 pm

Hmm, but doesn't that happen only when it's actually really the merge
commit that introduces the bug? Both parents of the merge commit were

...I don't quite get this - according to the bisection log,

# good: [b1b135c8d619cb2c7045d6ee4e48375882518bb5] fix spinlock recursion in hvc_console

and now you want to mark it as bad?

You could try to revert some of the merged commits at the point of the
merge, though.

--
Petr "Pasky" Baudis
The next generation of interesting software will be done
on the Macintosh, not the IBM PC. -- Bill Gates
--

To: Petr Baudis <pasky@...>
Cc: <Alan.Brunelle@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 1:51 pm

On Fri, 22 Aug 2008 19:16:51 +0200

A merge commit doesn't contain any kernel changes? It's the individual
commits (aka "patches") which were in that merge which broke stuff.
Confused.

We're trying to dive inside that merge commit to find out which of the

<what bisection log?>

I assume that Alan's bisection search ended up saying that the merge
commit (1c89ac55017f982355c7761e1c912c88c941483d) was the first bad
commit.

--

To: Andrew Morton <akpm@...>
Cc: Petr Baudis <pasky@...>, <Alan.Brunelle@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 5:05 pm

You may find neither parents were buggy, but the result of the merge is.

A trivial example is when one branch changes the semantics of an existing
function and converts all the call sites to the updated semantics, while
the other branch adds a new call site that still relies on the old
behaviour of that function. The merge most likely won't textually
conflict, and neither git merge nor quilt patch would report conflicts,
but the end result is that the new call site added by the latter branch
now gets an unexpected outcome from the function and can misbehave. You
cannot blame the breakage to either branch for such a breakage.
--

To: Junio C Hamano <gitster@...>
Cc: <pasky@...>, <Alan.Brunelle@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 5:16 pm

On Fri, 22 Aug 2008 14:05:21 -0700

Sure, but

a) whoever added a change like that without also causing the build to
break is slappable and

b) there are now two commits (one in each branch) either one of
which, when applied on top of the other branch will introduce the
regression.

That's useful infomation, but we don't know how to get
git-bisect to give it to us.

It's pretty simple. If git-bisect tells us that the regression was
introduced by a merge commit, we want to perform a bisection within
that merge's individual commits.

--

To: Andrew Morton <akpm@...>
Cc: Junio C Hamano <gitster@...>, <pasky@...>, <Alan.Brunelle@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 5:36 pm

bisect already did that. It asked for the left side and the right side,
both were good. What you can still do is creating a _new_ history where
the commits are not in parallel but linearized, like Jeff described. But
that's (in general) not a trivial task as you need to reapply individual
commit patches which can cause conflicts that were already solved in the
existing merges to show up again. Or, it can even produce new conflicts,
for example when a commit on one side was reverted before the merge
happened. In that case, you get into a state with changes that were
never visible before.

Björn
--

To: Andrew Morton <akpm@...>
Cc: Junio C Hamano <gitster@...>, <pasky@...>, <Alan.Brunelle@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 5:21 pm

I mentioned linearizing the two lines of development in another
message in this thread. But there are, of course, two linearizations,
neither of which might be possible to generate automatically, and one
might give you a better answer than the other.

-Peff
--

To: Andrew Morton <akpm@...>
Cc: Petr Baudis <pasky@...>, <Alan.Brunelle@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 3:37 pm

Alan provided his bisection log as an attachment to the original bug

Yep, and that's totally correct as far as bisect is concerned. The
parents of that merge commit are:
88fa08f67bee1a0c765237bdac106a32872f57d2
b1b135c8d619cb2c7045d6ee4e48375882518bb5

And Alan marked both of them as good.

So, unless Alan made a mistake during his bisection, each of the
branches is correct, but the merge did not lead to a correct result. So
while there were no textual conflicts, there were still incompatible
changes regarding the code semantics and compatibility was not restored
during the merge.

To get an overview over what got merged together you can can use
something like:
gitk --left-right 1c89ac55017^1...1c89ac55017^2

Which shows all commits that were on only one side of the merge, with
nice "arrows" that indicate from which side the commit is coming. The
conflict should be between one commit from the left and one commit from
the right side, obviously.

Björn
--

To: Björn <B.Steinbrink@...>
Cc: Andrew Morton <akpm@...>, Petr Baudis <pasky@...>, <Alan.Brunelle@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 3:48 pm

One thing that I have seen proposed (but never tried myself) is that you
can linearize the changes using "rebase -i" (or cherry-picking), and
then bisect that result. That is, given a history

A-B-C-D
\ /
E-F

where the merge "D" introduces the bug, you could try creating:

A-B-C-E'-F'

and bisecting that. And you should know that C is good from your
previous bisection, but that F' probably is not, since it should be
textually the same as D (unless, of course, you had textual conflicts
during the rebase that you fixed up differently).

So in essence you are testing each of E and F, but based on the other
work. So you should be able to find the one patch that causes the
conflict. And depending on the conflict, you may get more information by
doing it the other way. I.e.,:

A-E-F-B'-C'

-Peff
--

To: Björn Steinbrink <B.Steinbrink@...>
Cc: Andrew Morton <akpm@...>, Petr Baudis <pasky@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 3:47 pm

Bj

To: Alan D. Brunelle <Alan.Brunelle@...>
Cc: Björn Steinbrink <B.Steinbrink@...>, Andrew Morton <akpm@...>, Petr Baudis <pasky@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 4:17 pm

--

To: Andrew Morton <akpm@...>
Cc: Petr Baudis <pasky@...>, <linux-kernel@...>, <git@...>
Date: Friday, August 22, 2008 - 2:07 pm

You're right - this is where the bisection ended up (in fact, just
removing this merge commit allowed later kernels to boot just fine).

I've patched the kernel w/ your suggestion, and now I'm getting
something a bit different - and I'm looking into this...

Begin: Loading essential drivers... ...

[ 6.594525] fuse init (API version 7.9)

[ 6.618686] ACPI: SSDT CFFD0D0A, 08C4 (r1 HPQOEM CPU_TM2 1
MSFT 100000E)
[ 6.621481] BUG: unable to handle kernel NULL pointer dereference at
0000000000000858
[ 6.625017] IP: [<ffffffff8025e282>] debug_mutex_add_waiter+0x32/0x80

[ 6.625017] PGD 21a456067 PUD 21a45f067 PMD 0

[ 6.625017] Oops: 0002 [1] SMP

[ 6.625017] CPU 1

[ 6.625017] Modules linked in: processor(+) fan thermal_sys fuse

[ 6.625017] Pid: 1259, comm: modprobe Not tainted 2.6.27-rc3 #24

[ 6.625017] RIP: 0010:[<ffffffff8025e282>] [<ffffffff8025e282>]
debug_mutex_add_waiter+0x32/0x80

[ 6.625017] RSP: 0018:ffff88021a073998 EFLAGS: 00010002

[ 6.625017] RAX: 0000000000000000 RBX: ffff88021a0739d8 RCX:
0000000000000000

[ 6.625017] RDX: 0000000000000001 RSI: ffff88021a0739d8 RDI:
ffffffffa0091a60

[ 6.625017] RBP: ffff88021a0739b8 R08: ffffffff811deea0 R09:
ffff8800a6fdb000

[ 6.625017] R10: ffffffffa008f524 R11: 0000000000000000 R12:
ffffffffa0091a60

[ 6.625017] R13: ffff88021a072000 R14: ffff88021a9840a0 R15:
ffffffffa0091a98

[ 6.625017] FS: 00007f40ce2de6e0(0000) GS:ffff88022fc02a00(0000)
knlGS:0000000000000000

[ 6.625017] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b

[ 6.625017] CR2: 0000000000000858 CR3: 000000022d86b000 CR4:
00000000000006e0

[ 6.625017] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000

[ 6.625017] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400

[ 6.625017] Process modprobe (pid: 1259, threadinfo ffff88021a072000,
task ffff88021a9840a0)
[ 6.625017] Stack: 0000000000000000 ffffffffa0091a6...

To: Alan D. Brunelle <Alan.Brunelle@...>
Cc: linux-kernel@vger.kernel.org <linux-kernel@...>, Rusty Russell <rusty@...>, Andrew Morton <akpm@...>
Date: Thursday, August 14, 2008 - 12:14 pm

This is a merge commit that introduced several changes into the

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: linux-kernel@vger.kernel.org <linux-kernel@...>, Rusty Russell <rusty@...>, Andrew Morton <akpm@...>
Date: Thursday, August 14, 2008 - 3:21 pm

I came in this morning, and did a 'git bisect reset' and 'git pull'
built, and everything booted fine. I don't know if some fix came in for
something else, or if the problem got "hidden" (some sort of timing
thing, so the problem still lurks). If you feel it's important to track
this down, if you could give me instructions on how to break apart that
one merge, I could then try individual kernels from there...

Thanks,
Alan
--

To: Alan D. Brunelle <Alan.Brunelle@...>
Cc: linux-kernel@vger.kernel.org <linux-kernel@...>, Rusty Russell <rusty@...>, Andrew Morton <akpm@...>
Date: Thursday, August 14, 2008 - 4:29 pm

I have created a Bugzilla entry at
http://bugzilla.kernel.org/show_bug.cgi?id=11342
for this problem. I'll close it now, but please reopen if if the problem
reappears.

Thanks,
Rafael
--

To: Alan D. Brunelle <Alan.Brunelle@...>
Cc: linux-kernel@vger.kernel.org <linux-kernel@...>
Date: Wednesday, August 13, 2008 - 7:27 pm

Sorry, full boot log attached...

Previous thread: [PATCH] coretemp: Recognize Nehalem CPUs by Darrick J. Wong on Wednesday, August 13, 2008 - 6:51 pm. (5 messages)

Next thread: [GIT]: Sparc by David Miller on Wednesday, August 13, 2008 - 7:45 pm. (2 messages)