Re: v2.6.27-rc7: x86: #GP on panic?

From: Vegard Nossum
Date: Wednesday, September 24, 2008 - 12:09 pm

Hi,

With 2.6.27-rc7 on qemu-x86_64, it seems that panic will trigger a
General Protection Fault. I haven't seen it before.

[    4.499793] VFS: Cannot open root device "hda1" or unknown-block(2,0)
[    4.502747] Please append a correct "root=" boot option; here are the available partitions:
[    4.506641] 0800    2048000 sda driver: sd
[    4.508987]   0801    1895638 sda1
[    4.511088]   0802          1 sda2
[    4.512858] 0810       2048 sdb driver: sd
[    4.514915] 0b00    1048575 sr0 driver: sr
[    4.519074] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(2,0)
[    4.523477] general protection fault: fff2 [1] SMP
[    4.523641] CPU 0
[    4.523641] Modules linked in:
[    4.523641] Pid: 1, comm: swapper Tainted: G        W 2.6.27-rc7 #1
[    4.523641] RIP: 0010:[<ffffffff81019d27>]  [<ffffffff81019d27>] native_smp_send_stop+0x29/0x2d
[    4.523641] RSP: 0018:ffff880007867d70  EFLAGS: 00000286
[    4.523641] RAX: 00000000000000ff RBX: 0000000000000286 RCX: 0000000000000000
[    4.523641] RDX: 0000000000000005 RSI: ffffffff81019ce1 RDI: 0000000000000000
[    4.523641] RBP: ffff880007867d80 R08: 0000000000000000 R09: 0000000000002800
[    4.523641] R10: 0000000000002800 R11: ffff880001020a40 R12: ffff88000705b018
[    4.523641] R13: ffff88000705b000 R14: 0000000000008001 R15: ffffffff8159d550
[    4.523641] FS:  0000000000000000(0000) GS:ffffffff816fae00(0000) knlGS:0000000000000000
[    4.523641] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    4.523641] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006a0
[    4.523641] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.523641] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[    4.523641] Process swapper (pid: 1, threadinfo ffff880007866000, task ffff880007868000)
[    4.523641] Stack:  0000000000005131 ffffffff8159d52d ffff880007867e70 ffffffff810344a4
[    4.523641]  0000003000000010 ffff880007867e80 ...
From: Ingo Molnar
Date: Thursday, September 25, 2008 - 1:04 am

hm, 0x5a is a simple pop %rdx. A #GP there means the stack segment is 
bust?


so it's preceded by a popfq and on the next instruction we #GP.

but the stack and flags state looks good:

  [    4.523641] RSP: 0018:ffff880007867d70  EFLAGS: 00000286

weird.

	Ingo
--
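
A note on the EFLAGS value quoted above: decoding 0x286 by the architectural bit layout shows IF (bit 9) set, so interrupts were enabled at the point of the fault. Below is a minimal sketch of that decoding in C. The names eflags_irqs_enabled and eflags_format are made up for illustration and are not kernel APIs; only the status/control bits in the low 12 bits are covered.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Architectural EFLAGS bits (low 12 bits only). Bit 1 is reserved and
 * always reads as 1, so it is not listed here. */
struct eflags_bit {
    uint32_t mask;
    const char *name;
};

static const struct eflags_bit eflags_bits[] = {
    { 1u << 0,  "CF" }, { 1u << 2,  "PF" }, { 1u << 4,  "AF" },
    { 1u << 6,  "ZF" }, { 1u << 7,  "SF" }, { 1u << 8,  "TF" },
    { 1u << 9,  "IF" }, { 1u << 10, "DF" }, { 1u << 11, "OF" },
};

/* Hypothetical helper: nonzero if the interrupt flag (bit 9) is set. */
int eflags_irqs_enabled(uint32_t eflags)
{
    return (eflags & (1u << 9)) != 0;
}

/* Hypothetical helper: write the names of all set flags into buf,
 * separated by single spaces. Output is truncated if buf is too small. */
void eflags_format(uint32_t eflags, char *buf, size_t len)
{
    size_t used = 0;

    if (len == 0)
        return;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(eflags_bits) / sizeof(eflags_bits[0]); i++) {
        if (!(eflags & eflags_bits[i].mask))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", eflags_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* out of space */
        used += (size_t)n;
    }
}
```

For the value in the oops, eflags_format(0x286, ...) produces "PF SF IF", and eflags_irqs_enabled(0x286) is nonzero.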

From: H. Peter Anvin
Date: Thursday, September 25, 2008 - 1:53 am

No, that would be #SS (and segments don't really exist in 64-bit mode 
anyway.)  In 32-bit mode it could mean a code segment overrun.

*However*...

[    4.523477] general protection fault: fff2 [1] SMP

There is an error code attached to the #GP, which is supposed to mean 
that somehow a segment selector was involved. This doesn't look like a 

My guess is that the popfq enables interrupts, and we try to take an 
interrupt through an IDT entry which isn't set up correctly.

	-hpa

--
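
The error code discussed above can be split into its fields using the selector error code layout from the architecture manuals: bit 0 is EXT, bit 1 is IDT, bit 2 is TI, and bits 15:3 are the table index. The following is a rough sketch in C, not kernel code; struct sel_error_code and decode_sel_error_code are hypothetical names chosen for this example.

```c
#include <stdint.h>

/* Fields of an x86 selector-format exception error code. */
struct sel_error_code {
    unsigned ext;   /* bit 0: event was external to the program */
    unsigned idt;   /* bit 1: index refers to a gate descriptor in the IDT */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT (only meaningful if idt == 0) */
    unsigned index; /* bits 15:3: index into the selected table */
};

/* Hypothetical helper: split a 16-bit error code into its fields. */
struct sel_error_code decode_sel_error_code(uint16_t code)
{
    struct sel_error_code ec;

    ec.ext   = code & 1u;
    ec.idt   = (code >> 1) & 1u;
    ec.ti    = (code >> 2) & 1u;
    ec.index = (unsigned)code >> 3;
    return ec;
}
```

Applied to 0xfff2 from the oops, this gives ext = 0, idt = 1, ti = 0 and index = 0x1ffe. The IDT bit being set is at least consistent with the guess that the fault came from interrupt delivery, although 0x1ffe is far beyond the 256 vectors an IDT can hold, so the code itself still looks odd.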

From: Vegard Nossum
Date: Thursday, September 25, 2008 - 7:07 am

I'm sorry for the false alarm. I discovered that it did not happen on
a clean kernel. My kernel was using this patch.

diff --git a/arch/x86/kernel/cpu/common_64.c b/arch/x86/kernel/cpu/common_64.c
index a11f5d4..abf5bc8 100644
--- a/arch/x86/kernel/cpu/common_64.c
+++ b/arch/x86/kernel/cpu/common_64.c
@@ -261,6 +261,8 @@ void __init early_cpu_init(void)
                 cpu_devs[cvdev->vendor] = cvdev->cpu_dev;
        early_cpu_support_print();
        early_identify_cpu(&boot_cpu_data);
+
+       setup_clear_cpu_cap(X86_FEATURE_PSE);
 }



Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Vegard Nossum
Date: Thursday, September 25, 2008 - 8:20 am

No, I was wrong! It *does* happen for vanilla as well, but it doesn't
happen reliably.

[    4.043370] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(2,0)
[    4.048765] general protection fault: fff2 [1] SMP
[    4.048765] CPU 0
[    4.048765] Modules linked in:
[    4.048765] Pid: 1, comm: swapper Tainted: G        W 2.6.27-rc7 #8
[    4.048765] RIP: 0010:[<ffffffff81019d27>]  [<ffffffff81019d27>] native_smp_send_stop+0x29/0x2d
[    4.048765] RSP: 0018:ffff880007867d70  EFLAGS: 00000286
[    4.048765] RAX: 00000000000000ff RBX: 0000000000000286 RCX: 0000000000000000
[    4.048765] RDX: 0000000000000005 RSI: ffffffff81019ce1 RDI: 0000000000000000
[    4.048765] RBP: ffff880007867d80 R08: 0000000000000000 R09: ffff880087867bff
[    4.048765] R10: ffff880087867bff R11: 000000000000000a R12: ffff88000707b018
[    4.048765] R13: ffff88000707b000 R14: 0000000000008001 R15: ffffffff8159d550
[    4.048765] FS:  0000000000000000(0000) GS:ffffffff816fae00(0000) knlGS:0000000000000000
[    4.048765] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    4.048765] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006a0
[    4.048765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    4.048765] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[    4.048765] Process swapper (pid: 1, threadinfo ffff880007866000, task ffff880007868000)
[    4.048765] Stack:  000000000000506f ffffffff8159d52d ffff880007867e70 ffffffff81034454
[    4.048765]  0000003000000010 ffff880007867e80 ffff880007867db0 ffff880007867e80
[    4.048765]  ffff880007867dd0 ffff880007867e80 ffff880007899360 000000000000500e
[    4.048765] Call Trace:
[    4.048765]  [<ffffffff81034454>] panic+0xe8/0x193
[    4.048765]  [<ffffffff8118ef5f>] ? kobject_put+0x44/0x49
[    4.048765]  [<ffffffff8121778e>] ? put_device+0x15/0x17
[    4.048765]  [<ffffffff8121ad49>] ? class_for_each_device+0xfe/0x10e
[    4.048765]  [<ffffffff81715059>] ...
From: Vegard Nossum
Date: Thursday, September 25, 2008 - 1:46 pm

Keeping it going also found this bootup failure:

[    0.321423] Freeing SMP alternatives: 39k freed
[    0.323950] ACPI: Core revision 20080609
[    0.360390] divide error: 0000 [1] SMP
[    0.360944] CPU 0
[    0.360944] Modules linked in:
[    0.360944] Pid: 1, comm: swapper Tainted: G        W 2.6.27-rc7 #9
[    0.360944] RIP: 0010:[<ffffffff81039193>]  [<ffffffff81039193>] __do_softirq+0x49/0xc5
[    0.360944] RSP: 0018:ffffffff81792f00  EFLAGS: 00000206
[    0.360944] RAX: ffff880007867fd8 RBX: 0000000000000042 RCX: ffff880007867d90
[    0.360944] RDX: ffff880007867d90 RSI: 0000000000000086 RDI: ffffffff817ac208
[    0.360944] RBP: ffffffff81792f20 R08: ffff88000100d0b0 R09: ffff88000100d040
[    0.360944] R10: ffff88000100d040 R11: ffffffff81646b40 R12: ffffffff816ec080
[    0.360944] R13: 000000000000000a R14: 0000000000000000 R15: 0000000000000000
[    0.360944] FS:  0000000000000000(0000) GS:ffffffff816fae00(0000) knlGS:0000000000000000
[    0.360944] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    0.360944] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006a0
[    0.360944] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.360944] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
[    0.360944] Process swapper (pid: 1, threadinfo ffff880007866000, task ffff880007868000)
[    0.360944] Stack:  0000000000000046 0000000000000000 ffffffff817893e0 0000000000000030
[    0.360944]  ffffffff81792f38 ffffffff8100d24c ffffffff81792f38 ffffffff81792f58
[    0.360944]  ffffffff8100eb81 ffff880007867ce8 0000000000000000 ffffffff81792f68
[    0.360944] Call Trace:
[    0.360944]  <IRQ>  [<ffffffff8100d24c>] call_softirq+0x1c/0x28
[    0.360944]  [<ffffffff8100eb81>] do_softirq+0x32/0x89
[    0.360944]  [<ffffffff810392ad>] irq_exit+0x3f/0x82
[    0.360944]  [<ffffffff8100e9b3>] do_IRQ+0x147/0x166
[    0.360944]  [<ffffffff8100c5a1>] ret_from_intr+0x0/0xb
[    0.360944]  <EOI>  [<ffffffff8107013f>] ? noop+0x0/0x6
[   ...
From: H. Peter Anvin
Date: Thursday, September 25, 2008 - 1:49 pm

Yes, but there shouldn't be any external interrupts that could turn into
a divide error.  It really smells like a Qemu problem -- possibly even
a Qemu miscompile -- to me.

Does it reproduce in KVM?

	-hpa
--

From: Vegard Nossum
Date: Thursday, September 25, 2008 - 2:02 pm

I have no computer that can do KVM, sorry :-(

Stack trace contains IO_APIC functions, so it seems that maybe the
emulated IOAPIC is trying to (erroneously) deliver an int 0 (for some
reason)? But I don't know, that's just speculation which can be done
better by others, so I will stop now :-)


Vegard

--

From: H. Peter Anvin
Date: Thursday, September 25, 2008 - 2:53 pm

I suspect it's a problem in Qemu's IOAPIC model, but it's hard to know 
for sure.

	-hpa
--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 11:43 am

yes - it smells like it tries to deliver vector 0, after the panic code 
has deinitialized the lapic / ioapic.

	Ingo
--
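
To make the "vector 0" reasoning above concrete: on x86 the first 32 IDT vectors are reserved for CPU exceptions, and vector 0 is the divide error (#DE). An interrupt wrongly delivered on vector 0 would therefore run the divide-error handler and be reported as such. Here is a small lookup sketch in C, with only a handful of vectors filled in. The function name x86_exception_name is invented for this example and is not a kernel symbol.

```c
#include <stddef.h>

/* Hypothetical helper: map an IDT vector number to a short description.
 * Only the exceptions relevant to this thread (plus a few common ones)
 * are named individually. */
const char *x86_exception_name(unsigned vector)
{
    switch (vector) {
    case 0:  return "#DE divide error";
    case 6:  return "#UD invalid opcode";
    case 8:  return "#DF double fault";
    case 12: return "#SS stack-segment fault";
    case 13: return "#GP general protection fault";
    case 14: return "#PF page fault";
    default:
        /* Vectors 0-31 are reserved for exceptions; 32-255 are
         * available for external and software interrupts. */
        return vector < 32 ? "other CPU exception"
                           : "external or software interrupt";
    }
}
```

With this table, vector 0 maps to the divide error seen in the boot failure, vector 13 to the general protection fault seen after the panic, and vector 12 to the #SS that a real stack-segment problem would raise.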
