I did it!
With a little debug patch, I got the oops to ttyS0, but with each
printk() duplicated (I removed some of them). This is from resume:BUG: unable to handle kernel BUG: unable to handle kernel paging
requestpaging request at 00200200
at 00200200
IP:IP: [<c038accc>] list_del+0xc/0x90
[<c038accc>] list_del+0xc/0x90
*pdpt = 00000000331a0001 *pde = 0000000000000000
Oops: 0000 [#1] Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
DEBUG_PAGEALLOCDEBUG_PAGEALLOC
Pid: 3473, comm: bash Not tainted (2.6.27-rc4-00003-ga798564-dirty #30)
EIP: 0060:[<c038accc>] EFLAGS: 00210082 CPU: 0
EIP is at list_del+0xc/0x90
EAX: 00200200 EBX: f77fcc00 ECX: 00000001 EDX: f31a6000
ESI: c088e814 EDI: c088e83c EBP: f31a7be8 ESP: f31a7bd0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bash (pid: 3473, ti=f31a6000 task=f30ea700 task.ti=f31a6000)
Stack: Stack: 00000000 00000000 00000002 00000002 00000000 00000000
c01b39be c01b39be f77cefa0 f77cefa0 f77fcbd0 f77fcbd0 f31a7bfc
f31a7bfc c01b39ef c01b39ef
f77cefa0 f77cefa0 00000000 00000000 c1133598 c1133598
f31a7c3c f31a7c3c c01b3ebc c01b3ebc c0b530c0 c0b530c0 f31a7c18
f31a7c18 c015b60e c015b60e
f31a7c18 f31a7c18 ffffffff ffffffff 00000020 00000020
c088e800 c088e800 00200046 00200046 00000000 00000000 00000000
00000000 00000000 00000000
Call Trace:
[<c01b39be>] [<c01b39be>] ? ? get_partial_node+0x3e/0xc0
[<c01b39ef>] [<c01b39ef>] ? ? get_partial_node+0x6f/0xc0
[<c01b3ebc>] [<c01b3ebc>] ? ? __slab_alloc+0x11c/0x4e0
[<c015b60e>] [<c015b60e>] ? ? get_lock_stats+0x1e/0x50
[<c01b5ffc>] [<c01b5ffc>] ? ? __kmalloc+0x11c/0x130
[<c03ceeb2>] [<c03ceeb2>] ? ? tty_buffer_request_room+0xe2/0x130
[<c03ceeb2>] [<c03ceeb2>] ? ? tty_buffer_request_room+0xe2/0x130
[<c03ceeb2>] [<c03ceeb2>] ? ? tty_buffer_request_room+0xe2/0x130
[<c03d131d>] [<c03d131d>] ? ? tty_insert_flip_string_flags+0x2d/0xa0
...
To me, it does, but the question is if that's not caused by the serial console
itself ...Thanks,
Rafael--
Actually, it was not my patch that did it. I appended acpi=off to the
command line. I get similar oopses now, but they make it to the ttyS0.I got a "proper" oops this time (brace yourself, it's long ;-)), and
it does not have serial on the stack:BUG: unable to handle kernel paging request at 00100104
IP: [<c038ad65>] __list_add+0x15/0x90
*pdpt = 00000000325bb001 *pde = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Pid: 3597, comm: bash Not tainted (2.6.27-rc4-00003-ga798564-dirty #30)
EIP: 0060:[<c038ad65>] EFLAGS: 00210082 CPU: 0
EIP is at __list_add+0x15/0x90
EAX: f7455764 EBX: 00100100 ECX: 00100100 EDX: f7455764
ESI: f7455764 EDI: f7455764 EBP: f25e7c18 ESP: f25e7bf4
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bash (pid: 3597, ti=f25e6000 task=f25aa700 task.ti=f25e6000)
Stack: f68120bc f6c00400 c0687133 00000000 00000002 00000000 f7455734 f7455764
f7455764 f25e7c40 c018b954 0000001f 00000000 f6c00400 f6c00444 00000001
f68120bc f68120b0 00200002 f25e7cb4 c018d317 f68120bc 00000000 f25aac0c
Call Trace:
[<c0687133>] ? _spin_lock+0x63/0x70
[<c018b954>] ? rmqueue_bulk+0x54/0x80
[<c018d317>] ? get_page_from_freelist+0x5a7/0x720
[<c01600ea>] ? __lock_acquire+0x27a/0xa00
[<c018dd50>] ? __alloc_pages_internal+0xa0/0x450
[<c01acd4b>] ? alloc_pages_current+0x7b/0xc0
[<c01b37fb>] ? new_slab+0x1bb/0x2d0
[<c0687877>] ? _spin_unlock+0x27/0x50
[<c01b40ca>] ? __slab_alloc+0x32a/0x4e0
[<c010b335>] ? native_sched_clock+0xb5/0x110
[<c01b4424>] ? kmem_cache_alloc+0xb4/0xe0
[<c018969e>] ? mempool_alloc_slab+0xe/0x10
[<c018969e>] ? mempool_alloc_slab+0xe/0x10
[<c018969e>] ? mempool_alloc_slab+0xe/0x10
[<c01897a1>] ? mempool_alloc+0x31/0xf0
[<c015e884>] ? trace_hardirqs_on_caller+0xd4/0x160
[<c015e91b>] ? trace_hardirqs_on+0xb/0x10
[<c0368c7e>] ? get_request+0xae/0x2c0
[<c036935c>] ? get_request_wait+0x1c/0x...
Ok, here we have the block layer passing a bad pointer to SLUB this
time. And it's also from the suspend code (although it's the resume
path this time). As we never see an oops from the block layer first,
it's possible that someone else corrupted everything and it just shows
--
IDE again?
--
If this makes it so, then yes:
calling piix_ide_init+0x0/0xb0
initcall piix_ide_init+0x0/0xb0 returned 0 after 0 msecs
calling ide_scan_pcibus+0x0/0xf0
piix 0000:00:1f.1: IDE controller (0x8086:0x27df rev 0x01)
piix 0000:00:1f.1: IDE port disabled
piix 0000:00:1f.1: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xffa0-0xffa7
Probing IDE interface ide0...
hda: WDC WD1600BB-00DAA3, ATA DISK drive
hda: host max PIO4 wanted PIO255(auto-tune) selected PIO4
hda: UDMA/100 mode selected
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14It is also interesting that you mention that; this is from an earlier
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
Hmm.
Would that be possible to switch temporarily to PATA/libata and see if the
problem goes away? Then, we'd get a strong indication that it really is
related to IDE.Thanks,
Rafael
--
Hi,
"again"?
generic_ide_resume() just tries to get new request from the block
IIRC besides conversion to blk_{get,put}_request() conversion from
15th July (commit 5b114715ed63f3a4fdf790f5df61364fc4adadf1) there
weren't any PM related IDE changes recently.Vegard, it would be worth to try if the kernels at the above commit
and at commit 9a2d43b7566caeeeb414aa628bc2759028897dbb (one commit
before the above one) are OK.If it doesn't give the definitive answer then doing git-bisect run
would be of great help in identifying and fixing the source of the
problem quickly.Thanks,
Bart
--
There was another trace in which IDE was also involved.
Sorry for the confusion.
Rafael
--
Do you use any unusual drivers? WHat kind of machine is that?
Can you try with minimal drivers?
Can you try with minimum config? Like smp disabled...? Does echo test > /sys/power/state
work?Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
This one is slightly different:
WARNING: at lib/list_debug.c:26 __list_add+0x61/0x90()
list_add corruption. next->prev should be prev (c088e34c), but was
00200200. (next=f7814774).
Pid: 3572, comm: bash Not tainted 2.6.27-rc4-00003-ga798564-dirty #30
[<c013a56e>] warn_slowpath+0x5e/0x80
[<c015f7bf>] ? validate_chain+0x7df/0xe90
[<c014c4e8>] ? __kernel_text_address+0x8/0x20
[<c01600ea>] ? __lock_acquire+0x27a/0xa00
[<c015b60e>] ? get_lock_stats+0x1e/0x50
[<c015b64d>] ? put_lock_stats+0xd/0x30
[<c038adb1>] __list_add+0x61/0x90
[<c01b3208>] deactivate_slab+0x158/0x1b0
[<c01b3ea5>] __slab_alloc+0x105/0x4e0
[<c015b60e>] ? get_lock_stats+0x1e/0x50
[<c01b4424>] kmem_cache_alloc+0xb4/0xe0
[<c052a51b>] ? serio_queue_event+0x6b/0xe0
[<c052a51b>] ? serio_queue_event+0x6b/0xe0
[<c052a51b>] serio_queue_event+0x6b/0xe0
[<c015e91b>] ? trace_hardirqs_on+0xb/0x10
[<c052a5b8>] serio_resume+0x28/0x30
[<c03f8bde>] device_resume+0x32e/0x380
[<c0168791>] hibernation_snapshot+0xa1/0x220
[<c013b55b>] ? printk+0x1b/0x20
[<c01689f0>] hibernate+0xe0/0x180
[<c01674a0>] ? state_store+0x0/0xd0
[<c016755f>] state_store+0xbf/0xd0
[<c01674a0>] ? state_store+0x0/0xd0
[<c0375ef4>] kobj_attr_store+0x24/0x30
[<c01fa432>] sysfs_write_file+0xa2/0x100
[<c01bbf06>] vfs_write+0x96/0x130
[<c01fa390>] ? sysfs_write_file+0x0/0x100
[<c01bc44d>] sys_write+0x3d/0x70
[<c0104f3b>] sysenter_do_call+0x12/0x3f
=======================
---[ end trace 0bc52e2837e4b8b8 ]---...followed by the "unable to handle kernel paging request" as in
previous e-mail. The numbers 00100104 and 00200200 seem to appear all
the time.At least it is 100% reproducible now and with nice oopses too :-)
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually d...
Hm, slab corruption, it seems.
Do you use SLAB or SLUB?
Rafael
--
SLUB, with all possible debugging:
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLABINFO=y
CONFIG_SLUB_DEBUG_ON=y
CONFIG_SLUB_STATS=y(Thanks for helping out.)
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
(You're welcome.)
Can you switch to SLAB and retest? I don't really think SLUB is the issue
here, but SLAB may give us additional information.Thanks,
Rafael
--
Got this with SLAB:
BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<c03897ac>] list_del+0xc/0x90
*pdpt = 0000000031480001 *pde = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Pid: 7, comm: events/0 Not tainted (2.6.27-rc4-00003-gef9b1bc #33)
EIP: 0060:[<c03897ac>] EFLAGS: 00010082 CPU: 0
EIP is at list_del+0xc/0x90
EAX: 00000000 EBX: f556afa0 ECX: f54a5380 EDX: f54e3128
ESI: f54f1ef0 EDI: f556afa0 EBP: f68b7ee0 ESP: f68b7ec8
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process events/0 (pid: 7, ti=f68b6000 task=f689c338 task.ti=f68b6000)
Stack: c015b62e f68b7ed4 c015b66d f68b7efc 00000046 f54f1f20 f68b7f1c c01b4054
00000000 f689c338 f54e315c 00000001 f54f1f20 f54a5380 00000000 f54e3128
f55b3348 00000000 f54f1f20 f54f1ef0 00000001 f68b7f3c c01b41ba 00000000
Call Trace:
[<c015b62e>] ? get_lock_stats+0x1e/0x50
[<c015b66d>] ? put_lock_stats+0xd/0x30
[<c01b4054>] ? free_block+0xa4/0x1b0
[<c01b41ba>] ? drain_array+0x5a/0xb0
[<c01b4f82>] ? cache_reap+0x72/0x230
[<c014ada7>] ? run_workqueue+0x107/0x200
[<c014ae0a>] ? run_workqueue+0x16a/0x200
[<c014ada7>] ? run_workqueue+0x107/0x200
[<c01b4f10>] ? cache_reap+0x0/0x230
[<c014b94d>] ? worker_thread+0x7d/0xe0
[<c014e4f0>] ? autoremove_wake_function+0x0/0x50
[<c014b8d0>] ? worker_thread+0x0/0xe0
[<c014e1e2>] ? kthread+0x42/0x70
[<c014e1a0>] ? kthread+0x0/0x70
[<c0105cf3>] ? kernel_thread_helper+0x7/0x14
=======================
Code: e8 01 89 44 24 04 e8 d4 1d db ff 8b 55 04 b8 48 2f 7a c0 e8 47 d8 dd ff e8
c2 dc d7 ff eb a9 55 89 e5 53 89 c3 83 ec 14 8b 40 04 <8b> 00 39 d8 75 24 8b 13
8b 42 04 39 d8 75 41 8b 43 04 89 42 04
EIP: [<c03897ac>] list_del+0xc/0x90 SS:ESP 0068:f68b7ec8
---[ end trace 958cea1a710a109a ]---
note: events/0[7] exited with preempt_count 1
uhci_hcd 0000:00:1d.2: setting latency timer to 64
uhci_hcd 0000:00:1d.3: setting latency t...
Hi Vegard,
Looks like slabp->list is corrupted when we do list_del() in
Here the block layer is passing a pointer to a non-slab page to
kmem_cache_free(). Assuming we can rely on the stack trace, ofAnd now we have the per-cpu page lists corrupted as well. Note that
even though we have tty showing up in the stack traces, the list has
already been corrupted by someone else. So I don't think tty is atI don't know the suspend code at all but the bogus pointer coming from
hibernation_snapshot() seems suspicious. Another possibility is that
the block layer is doing something strange here. Dunno.Pekka
--
I bet on the IDE driver.
Thanks,
Rafael
--
Should be easy enough to switch to libata to verify this, right?
Pavel--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Hi Rafael,
But like I said, the stack traces seem more unreliable than usual so I
could be totally wrong here.
--
Well, it looks like this happened while devices were suspended, but I can't say
--
On Thu, 21 Aug 2008 22:22:44 +0200
But slab hasn't changed in quite some time. I expect something else
has reached over and scribbled on it. Bisection time?--
| Hiten Pandya | Re: up? (emacs docbook xml ide) |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg Kroah-Hartman | [PATCH 005/196] Chinese: add translation of SubmittingDrivers |
| James Bottomley | Re: [Ksummit-2008-discuss] Fixing the Kernel Janitors project |
git: | |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | [GIT]: Networking |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
