Hi,

I have a problem with random oopses on boot: roughly one boot in five, Linux freezes. I was not able to obtain a call trace, but it happens around 10-15 seconds after boot. I can hear that the tg3 driver is initialised.

The same problem is reported in this thread:
https://bugzilla.novell.com/show_bug.cgi?id=647029

Applying the patch from the thread makes the problem occur less often, and dmesg shows that acpi-button loads for me on hid PNP0C0C and LNXPWRBN.

Maybe commit e2fb9754d27513918a4936e8cbaad50ff56cfd3d
ACPI: button: remove unnecessary null pointer checks
has unmasked an underlying problem?

-Tobias

--- a/drivers/acpi/button.c 2010-11-22 20:03:49.000000000 +0100
+++ b/drivers/acpi/button.c 2010-12-04 14:11:51.000000000 +0100
@@ -353,9 +353,13 @@
 		goto err_free_button;
 	}
 
+	printk(KERN_INFO PREFIX "button loading\n");
 	hid = acpi_device_hid(device);
+	printk(KERN_INFO PREFIX "hid: <%s>\n", hid);
 	name = acpi_device_name(device);
+	printk(KERN_INFO PREFIX "name: <%s>\n", name);
 	class = acpi_device_class(device);
+	printk(KERN_INFO PREFIX "class: <%s>\n", class);
 
 	if (!strcmp(hid, ACPI_BUTTON_HID_POWER) ||
 	    !strcmp(hid, ACPI_BUTTON_HID_POWERF)) {
--
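For readers following the patch context: the strcmp() calls right after the added printk's match the device HID against the power-button IDs. The following is a minimal user-space sketch of that comparison, not kernel code. The macro values are assumed from the kernel sources of that era (they match the two HIDs reported above), and is_power_button_hid() is a hypothetical helper. Its NULL guard only illustrates the general kind of defensive check under discussion; it is not a reconstruction of what commit e2fb9754d27 removed.

```c
#include <stddef.h>
#include <string.h>

/* Assumed values; they match the HIDs Tobias sees in dmesg. */
#define ACPI_BUTTON_HID_POWER  "PNP0C0C"
#define ACPI_BUTTON_HID_POWERF "LNXPWRBN"

/*
 * Hypothetical helper (not a kernel function): returns 1 if hid names a
 * power button, 0 otherwise. The NULL guard is illustrative only.
 */
static int is_power_button_hid(const char *hid)
{
	if (hid == NULL)
		return 0;

	return !strcmp(hid, ACPI_BUTTON_HID_POWER) ||
	       !strcmp(hid, ACPI_BUTTON_HID_POWERF);
}
```

Note that a guard like this would only help against a NULL pointer; it would do nothing for a race or for memory corruption.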
The oopses from the bugzilla are not the sort I would expect from a null pointer dereference, but since it's fairly reproducible for you, it might be worth reverting e2fb9754d27 to see whether it makes any difference.

Does Rich's script from https://bugzilla.novell.com/show_bug.cgi?id=647029#c30 help you reproduce the problem?

Bjorn
--
I was not able to revert it. But it would only mask the problem anyway...

I have now reverted bf04a77227db76f163bc2355ef4e176794987be2 ACPI: button: cache hid/name/class pointers and build acpi_button

No, it only crashes on boot (without the printk patch). When it happens, the machine is completely dead; SysRq does not work. However, it is definitely the acpi_button module, because removing it also fixes this.

-Tobias
--
Right, but now the granularity is "remove the acpi_button driver completely." If we can identify a specific statement inside

If it crashes on boot (not when loading an acpi_button module), you must be building acpi_button into the static kernel.

The acpi_button driver has a fairly complicated add() method. In the absence of a better idea, I might just comment out blocks of it and try to isolate the problem. For example, take out all the input stuff, take out the wakeup GPE stuff, take out the type/name setup, etc.

Bjorn
--
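The isolation approach Bjorn describes (disable one block of add() at a time and see whether the crash goes away) can be sketched as a small user-space model. This is only an illustration under assumptions: the stage names, isolate_stage(), and the two stub callbacks are hypothetical and do not exist in drivers/acpi/button.c. In the real driver each "stage" would be a block commented out or wrapped in #if 0, and each callback result would come from rebuilding and rebooting by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the blocks of add() mentioned in the mail. */
enum add_stage {
	STAGE_TYPE_NAME = 1u << 0,	/* type/name setup */
	STAGE_INPUT     = 1u << 1,	/* input device setup */
	STAGE_WAKEUP    = 1u << 2,	/* wakeup GPE setup */
	STAGE_ALL       = STAGE_TYPE_NAME | STAGE_INPUT | STAGE_WAKEUP
};

/* Reports whether booting with the given stages enabled failed. */
typedef bool (*boot_fails_fn)(unsigned int enabled_stages);

/*
 * Disable one stage at a time. Returns the stage whose removal makes the
 * failure disappear, or 0 if the full build does not fail or no single
 * stage is responsible.
 */
static unsigned int isolate_stage(boot_fails_fn boot_fails)
{
	static const unsigned int stages[] = {
		STAGE_TYPE_NAME, STAGE_INPUT, STAGE_WAKEUP
	};
	size_t i;

	assert(boot_fails != NULL);

	if (!boot_fails(STAGE_ALL))
		return 0;

	for (i = 0; i < sizeof(stages) / sizeof(stages[0]); i++) {
		if (!boot_fails(STAGE_ALL & ~stages[i]))
			return stages[i];
	}
	return 0;
}

/* Stub for demonstration: pretends the wakeup block triggers the crash. */
static bool fails_when_wakeup_enabled(unsigned int enabled)
{
	return (enabled & STAGE_WAKEUP) != 0;
}

/* Stub for demonstration: pretends the machine always boots fine. */
static bool never_fails(unsigned int enabled)
{
	(void)enabled;
	return false;
}
```

Because the freeze in this report is intermittent (about one boot in five), a single clean boot proves little; each "does it fail" answer would need several reboots per configuration before a block can be ruled out.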
It crashes on boot whether acpi_button is built into the kernel or built as a module. However, it does not crash if the module is loaded/unloaded after the

Couldn't this be a compiler issue? Adding some printk's to fix it seems to be insane.

-Tobias
--
Just in case, here is some more info:

-Tobias

    description: Tower Computer
    product: Precision WorkStation 390
    vendor: Dell Inc.
    serial: 6WTSG3J
    width: 64 bits
    capabilities: smbios-2.3 dmi-2.3 vsyscall64 vsyscall32
    configuration: administrator_password=enabled boot=normal chassis=tower power-on_password=enabled uuid=44454C4C-5700-1054-8053-B6C04F47334A
  *-core
       description: Motherboard
       product: 0DN075
       vendor: Dell Inc.
       physical id: 0
       serial: ..CN708217BE90AO.
     *-firmware
          description: BIOS
          vendor: Dell Inc.
          physical id: 0
          version: 2.6.0 (05/19/2008)
          size: 64KiB
          capacity: 960KiB
          capabilities: pci pnp apm upgrade shadowing cdboot bootselect edd int13floppytoshiba int5printscreen int9keyboard int14serial int17printer acpi usb ls120boot biosbootspecification netboot

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 2.6.36.1 (root@Tobias-Karnat) (gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5) ) #1 SMP PREEMPT Sat Dec 4 15:09:10 CET 2010
[ 0.000000] Command line: root=/dev/md1 ro splash vga=795
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000bfe0ac00 (usable)
[ 0.000000] BIOS-e820: 00000000bfe0ac00 - 00000000bfe0cc00 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000bfe0ec00 - 00000000bfe5cc00 (reserved)
[ 0.000000] BIOS-e820: 00000000bfe5cc00 - 00000000bfe5ec00 (ACPI data)
[ 0.000000] BIOS-e820: 00000000bfe5ec00 - 00000000c0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fed00400 (reserved)
[ 0.000000] BIOS-e820: 00000000fed20000 - 00000000feda0000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
[ 0.000000] BIOS-e820: 00000000ffb00000 - ...
Agreed, adding printk's is absolutely not any kind of fix. I think it's more likely to be some sort of memory corruption or race than a compiler problem. I assume there is some old kernel that works fine, even when compiled with the same compiler.

In addition to the isolation ideas I suggested above, you might boot with "maxcpus=1" and turn on all the Kconfig memory debug switches.

Bjorn
--
I agree that it's a timing race condition. I had an earlier version of acpi-button with printf's that kept the issue from happening.

Rich

On Mon, 6 Dec 2010 22:15:21 -0700
--
Booting with mminit_loglevel=4 maxcpus=1 causes this to show up:

[ 17.364026] usb 1-6: device descriptor read/64, error -110
[ 32.579024] usb 1-6: device descriptor read/64, error -110
[ 32.794012] usb 1-6: new high speed USB device using ehci_hcd and address 6
[ 47.907019] usb 1-6: device descriptor read/64, error -110
[ 63.121019] usb 1-6: device descriptor read/64, error -110
[ 63.335022] usb 1-6: new high speed USB device using ehci_hcd and address 7
[ 73.748031] usb 1-6: device not accepting address 7, error -110
[ 73.861018] usb 1-6: new high speed USB device using ehci_hcd and address 8
[ 84.274018] usb 1-6: device not accepting address 8, error -110
[ 84.285143] hub 1-0:1.0: unable to enumerate USB device on port 6

I have nothing connected to 1-6:

Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 003: ID 046d:c20a Logitech, Inc. WingMan RumblePad
Bus 004 Device 002: ID 03eb:3301 Atmel Corp. at43301 4-Port Hub
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 003: ID 046d:c062 Logitech, Inc.
Bus 002 Device 002: ID 413c:2105 Dell Computer Corp. Model L100 Keyboard
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 005: ID 0dda:2026 Integrated Circuit Solution, Inc. USB2.0 Card Reader
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

It shows up at the same time as the crash happens without maxcpus. However SysRq does work in this case.

-Tobias
--
Oh well, it might be the card reader. I also get this:

[ 506.822024] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 661.830024] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 902.854026] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 1309.894029] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 1344.814018] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 1891.814028] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 2736.838021] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 2969.830024] usb 1-6: reset high speed USB device using ehci_hcd and address 5
[ 3002.806023] usb 1-6: reset high speed USB device using ehci_hcd and address 5

I will try booting without it; maybe that fixes the boot issue. But it is strange that the printk patch could fix this. And the card reader did work with earlier kernels (~2.6.2x).

-Tobias
--
No, it wasn't the card reader. I have no idea why it throws errors with maxcpus=1, but right now the bigger problem is to get the oops on boot fixed.

-Tobias

Dmesg with mminit_loglevel=4:

[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 2.6.36.1 (root@Tobias-Karnat) (gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5) ) #1 SMP PREEMPT Sat Dec 4 15:09:10 CET 2010
[ 0.000000] Command line: root=/dev/md1 ro single vga=795 mminit_loglevel=4 maxcpus=1
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000bfe0ac00 (usable)
[ 0.000000] BIOS-e820: 00000000bfe0ac00 - 00000000bfe0cc00 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000bfe0ec00 - 00000000bfe5cc00 (reserved)
[ 0.000000] BIOS-e820: 00000000bfe5cc00 - 00000000bfe5ec00 (ACPI data)
[ 0.000000] BIOS-e820: 00000000bfe5ec00 - 00000000c0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fed00400 (reserved)
[ 0.000000] BIOS-e820: 00000000fed20000 - 00000000feda0000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
[ 0.000000] BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 00000001bc000000 (usable)
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] DMI 2.3 present.
[ 0.000000] e820 update range: 0000000000000000 - 0000000000001000 (usable) ==> (reserved)
[ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
[ 0.000000] No AGP bridge found
[ 0.000000] last_pfn = 0x1bc000 max_arch_pfn = 0x400000000
[ 0.000000] MTRR default type: uncachable
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000] 00000-9FFFF ...
