Re: acpi_button: random oops on boot

From: Tobias Karnat
Date: Saturday, December 4, 2010 - 8:49 am

Hi,

I have a problem with random oopses on boot:
roughly one boot in five, Linux freezes.

I was not able to obtain a call trace, however it happens around 10-15
seconds after boot. I can hear that the tg3 driver is initialised.

In this thread, they have the same problem:
https://bugzilla.novell.com/show_bug.cgi?id=647029

Applying the patch from that thread makes the problem occur less
often, and dmesg shows acpi-button loading for me on HIDs PNP0C0C and
LNXPWRBN.

Maybe commit e2fb9754d27513918a4936e8cbaad50ff56cfd3d
ACPI: button: remove unnecessary null pointer checks
has unmasked an underlying problem?

-Tobias

--- a/drivers/acpi/button.c	2010-11-22 20:03:49.000000000 +0100
+++ b/drivers/acpi/button.c	2010-12-04 14:11:51.000000000 +0100
@@ -353,9 +353,13 @@
 		goto err_free_button;
 	}
 
+	printk(KERN_INFO PREFIX "button loading\n");
 	hid = acpi_device_hid(device);
+	printk(KERN_INFO PREFIX "hid: <%s>\n", hid);
 	name = acpi_device_name(device);
+	printk(KERN_INFO PREFIX "name: <%s>\n", name);
 	class = acpi_device_class(device);
+	printk(KERN_INFO PREFIX "class: <%s>\n", class);
 
 	if (!strcmp(hid, ACPI_BUTTON_HID_POWER) ||
 	    !strcmp(hid, ACPI_BUTTON_HID_POWERF)) {
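
To make the null-pointer hypothesis concrete, here is a minimal user-space
sketch of what a guarded HID comparison could look like. It is not the code
from drivers/acpi/button.c: the helper name hid_is_power_button is
hypothetical, and the two HID strings are simply the ones dmesg reports
above.

```c
#include <assert.h>   /* assert() for callers that want to check results */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* HID strings as reported in the dmesg output mentioned above. */
#define ACPI_BUTTON_HID_POWER  "PNP0C0C"
#define ACPI_BUTTON_HID_POWERF "LNXPWRBN"

/*
 * Hypothetical helper (not part of the kernel driver): returns true when
 * hid names a power button, and treats a NULL hid as "no match" instead
 * of passing it to strcmp().
 */
static bool hid_is_power_button(const char *hid)
{
	if (hid == NULL)
		return false;

	return strcmp(hid, ACPI_BUTTON_HID_POWER) == 0 ||
	       strcmp(hid, ACPI_BUTTON_HID_POWERF) == 0;
}
```

With this helper, both "PNP0C0C" and "LNXPWRBN" match, and a NULL pointer
yields false rather than a fault. If the crash persisted with such a guard
in place, that would point away from a plain null dereference.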

--

From: Bjorn Helgaas
Date: Monday, December 6, 2010 - 9:28 am

The oopses from the bugzilla are not the sort I would expect from
a null pointer dereference, but since it's fairly reproducible for
you, it might be worth reverting e2fb9754d27 to see whether it makes
any difference.

Does Rich's script from https://bugzilla.novell.com/show_bug.cgi?id=647029#c30
help you reproduce the problem?

Bjorn
--

From: Tobias Karnat
Date: Monday, December 6, 2010 - 4:01 pm

I was not able to revert it.
But it would only mask the problem anyway...

I have now reverted bf04a77227db76f163bc2355ef4e176794987be2
ACPI: button: cache hid/name/class pointers and build acpi_button

No, it only crashes on boot (without the printk patch).
If it happens the machine is completely dead, SysRq does not work.

However it is definitely the acpi_button module, because removing it
also fixes this.

-Tobias

--

From: Bjorn Helgaas
Date: Monday, December 6, 2010 - 4:26 pm

Right, but now the granularity is "remove the acpi_button driver
completely."  If we can identify a specific statement inside

If it crashes on boot (not when loading an acpi_button module),
you must be building acpi_button into the static kernel.

The acpi_button driver has a fairly complicated add() method.
In the absence of a better idea, I might just comment out blocks
of it and try to isolate the problem.  For example, take out
all the input stuff, take out the wakeup GPE stuff, take out
the type/name setup, etc.
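
The isolation approach described above can be sketched in user space as a
table of stages with an enable mask, so that each test boot disables one
block. Everything here is hypothetical: the stage names only mirror the
blocks listed above, their order is illustrative, and run_stages is not a
kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stage of a driver add() path; returns 0 on success. */
typedef int (*stage_fn)(void);

struct stage {
	const char *name;
	stage_fn run;
};

/*
 * Run every stage whose bit is set in enable_mask, stopping at the first
 * failure. Returns a bitmask of the stages that completed, so a boot that
 * survives with a given mask narrows down which block is at fault.
 */
static unsigned int run_stages(const struct stage *stages, size_t n,
			       unsigned int enable_mask)
{
	unsigned int done = 0;
	size_t i;

	assert(n <= 8 * sizeof(enable_mask));

	for (i = 0; i < n; i++) {
		if (!(enable_mask & (1u << i)))
			continue;	/* this block is "commented out" */
		if (stages[i].run() != 0)
			break;
		done |= 1u << i;
	}
	return done;
}

/* Placeholder stages standing in for the real setup code. */
static int setup_type_name(void)  { return 0; }
static int setup_input(void)      { return 0; }
static int setup_wakeup_gpe(void) { return 0; }

static const struct stage button_stages[] = {
	{ "type/name",  setup_type_name },
	{ "input",      setup_input },
	{ "wakeup GPE", setup_wakeup_gpe },
};
```

Calling run_stages(button_stages, 3, 0x5) skips the input stage and runs the
other two. In the real driver the same effect comes from commenting out one
block per test build.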

Bjorn
--

From: Tobias Karnat
Date: Monday, December 6, 2010 - 4:54 pm

It crashes on boot whether acpi_button is built into the kernel or built
as a module. However, it does not crash if the module is loaded/unloaded
after the

Couldn't this be a compiler issue?
That adding a few printk's fixes it seems insane.

-Tobias

--

From: Tobias Karnat
Date: Monday, December 6, 2010 - 5:22 pm

Just in case, here is some more info:

-Tobias

description: Tower Computer
product: Precision WorkStation 390
vendor: Dell Inc.
serial: 6WTSG3J
width: 64 bits
capabilities: smbios-2.3 dmi-2.3 vsyscall64 vsyscall32
configuration: administrator_password=enabled boot=normal chassis=tower power-on_password=enabled uuid=44454C4C-5700-1054-8053-B6C04F47334A

*-core
description: Motherboard
product: 0DN075
vendor: Dell Inc.
physical id: 0
serial: ..CN708217BE90AO.

*-firmware
description: BIOS
vendor: Dell Inc.
physical id: 0
version: 2.6.0 (05/19/2008)
size: 64KiB
capacity: 960KiB
capabilities: pci pnp apm upgrade shadowing cdboot bootselect edd int13floppytoshiba int5printscreen int9keyboard int14serial int17printer acpi usb ls120boot biosbootspecification netboot

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.36.1 (root@Tobias-Karnat) (gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5) ) #1 SMP PREEMPT Sat Dec 4 15:09:10 CET 2010
[    0.000000] Command line: root=/dev/md1 ro splash vga=795
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000bfe0ac00 (usable)
[    0.000000]  BIOS-e820: 00000000bfe0ac00 - 00000000bfe0cc00 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000bfe0ec00 - 00000000bfe5cc00 (reserved)
[    0.000000]  BIOS-e820: 00000000bfe5cc00 - 00000000bfe5ec00 (ACPI data)
[    0.000000]  BIOS-e820: 00000000bfe5ec00 - 00000000c0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fed00400 (reserved)
[    0.000000]  BIOS-e820: 00000000fed20000 - 00000000feda0000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
[    0.000000]  BIOS-e820: 00000000ffb00000 - ...

--

From: Bjorn Helgaas
Date: Monday, December 6, 2010 - 10:15 pm

Agreed, adding printk's is absolutely not any kind of fix.
I think it's more likely to be some sort of memory corruption or
race than a compiler problem.  I assume there is some old kernel
that works fine, even when compiled with the same compiler.

In addition to the isolation ideas I suggested above, you might
boot with "maxcpus=1" and turn on all the Kconfig memory debug
switches.

Bjorn
--

From: Rich Coe
Date: Tuesday, December 7, 2010 - 7:44 am

I agree that it's a timing race condition.  I had an earlier version of
acpi-button with printf's that kept the issue from happening.

Rich

--

From: Tobias Karnat
Date: Tuesday, December 7, 2010 - 8:02 am

Booting with mminit_loglevel=4 maxcpus=1 causes this to show up:

[   17.364026] usb 1-6: device descriptor read/64, error -110
[   32.579024] usb 1-6: device descriptor read/64, error -110
[   32.794012] usb 1-6: new high speed USB device using ehci_hcd and address 6
[   47.907019] usb 1-6: device descriptor read/64, error -110
[   63.121019] usb 1-6: device descriptor read/64, error -110
[   63.335022] usb 1-6: new high speed USB device using ehci_hcd and address 7
[   73.748031] usb 1-6: device not accepting address 7, error -110
[   73.861018] usb 1-6: new high speed USB device using ehci_hcd and address 8
[   84.274018] usb 1-6: device not accepting address 8, error -110
[   84.285143] hub 1-0:1.0: unable to enumerate USB device on port 6
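
The "error -110" in these lines is a negated errno value. On Linux/x86, 110
is ETIMEDOUT, so the hub is timing out while talking to the port. A small
user-space sketch for decoding such codes follows; the helper name
describe_kernel_error is hypothetical, and errno numbering can differ on
other platforms.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a kernel-style negative error code
 * (for example -110) together with the C library's description of it.
 * Returns the value of snprintf().
 */
static int describe_kernel_error(int err, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);

	/* Kernel code returns -errno, so flip the sign before the lookup. */
	return snprintf(buf, len, "error %d: %s", err, strerror(-err));
}
```

Passing -ETIMEDOUT fills the buffer with the code followed by the library's
timeout message.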

I have nothing connected to 1-6:

Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 003: ID 046d:c20a Logitech, Inc. WingMan RumblePad
Bus 004 Device 002: ID 03eb:3301 Atmel Corp. at43301 4-Port Hub
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 003: ID 046d:c062 Logitech, Inc. 
Bus 002 Device 002: ID 413c:2105 Dell Computer Corp. Model L100 Keyboard
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 005: ID 0dda:2026 Integrated Circuit Solution, Inc. USB2.0 Card Reader
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

It shows up at the same time as the crash happens without maxcpus.
However SysRq does work in this case.

-Tobias

--

From: Tobias Karnat
Date: Tuesday, December 7, 2010 - 8:35 am

Oh well, it might be the Card Reader.

I also get this:

[  506.822024] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[  661.830024] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[  902.854026] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 1309.894029] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 1344.814018] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 1891.814028] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 2736.838021] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 2969.830024] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 3002.806023] usb 1-6: reset high speed USB device using ehci_hcd and address 5

I will try booting without it, maybe it fixes the boot issue.

But it is strange that the printk patch could fix this.
And the card reader did work with earlier kernels (~2.6.2x).

-Tobias

--

From: Tobias Karnat
Date: Tuesday, December 7, 2010 - 8:52 am

No, it wasn't the card reader.

I have no idea why it throws errors with maxcpus=1, but right now the
bigger problem is to get the oops on boot fixed.

-Tobias

Dmesg with mminit_loglevel=4:

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.36.1 (root@Tobias-Karnat) (gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5) ) #1 SMP PREEMPT Sat Dec 4 15:09:10 CET 2010
[    0.000000] Command line: root=/dev/md1 ro single vga=795 mminit_loglevel=4 maxcpus=1
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000bfe0ac00 (usable)
[    0.000000]  BIOS-e820: 00000000bfe0ac00 - 00000000bfe0cc00 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000bfe0ec00 - 00000000bfe5cc00 (reserved)
[    0.000000]  BIOS-e820: 00000000bfe5cc00 - 00000000bfe5ec00 (ACPI data)
[    0.000000]  BIOS-e820: 00000000bfe5ec00 - 00000000c0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fed00400 (reserved)
[    0.000000]  BIOS-e820: 00000000fed20000 - 00000000feda0000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
[    0.000000]  BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 00000001bc000000 (usable)
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] DMI 2.3 present.
[    0.000000] e820 update range: 0000000000000000 - 0000000000001000 (usable) ==> (reserved)
[    0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
[    0.000000] No AGP bridge found
[    0.000000] last_pfn = 0x1bc000 max_arch_pfn = 0x400000000
[    0.000000] MTRR default type: uncachable
[    0.000000] MTRR fixed ranges enabled:
[    0.000000]   00000-9FFFF ...