Re: latest -git: suspend: unable to handle kernel paging request (was Re: no_console_suspend doesn't work?)

Previous thread: 2.6.26.3-rt2 by Steven Rostedt on Thursday, August 21, 2008 - 1:18 pm. (3 messages)

Next thread: [GIT PATCH] USB patches for 2.6.27-rc4 by Greg KH on Thursday, August 21, 2008 - 1:30 pm. (1 message)
To: Rafael J. Wysocki <rjw@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, August 21, 2008 - 1:28 pm

I did it!

With a little debug patch, I got the oops to ttyS0, but with each
printk() duplicated (I removed some of them). This is from resume:

BUG: unable to handle kernel BUG: unable to handle kernel paging
requestpaging request at 00200200
at 00200200
IP:IP: [<c038accc>] list_del+0xc/0x90
[<c038accc>] list_del+0xc/0x90
*pdpt = 00000000331a0001 *pde = 0000000000000000
Oops: 0000 [#1] Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
DEBUG_PAGEALLOCDEBUG_PAGEALLOC
Pid: 3473, comm: bash Not tainted (2.6.27-rc4-00003-ga798564-dirty #30)
EIP: 0060:[<c038accc>] EFLAGS: 00210082 CPU: 0
EIP is at list_del+0xc/0x90
EAX: 00200200 EBX: f77fcc00 ECX: 00000001 EDX: f31a6000
ESI: c088e814 EDI: c088e83c EBP: f31a7be8 ESP: f31a7bd0
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bash (pid: 3473, ti=f31a6000 task=f30ea700 task.ti=f31a6000)
Stack: Stack: 00000000 00000000 00000002 00000002 00000000 00000000
c01b39be c01b39be f77cefa0 f77cefa0 f77fcbd0 f77fcbd0 f31a7bfc
f31a7bfc c01b39ef c01b39ef
f77cefa0 f77cefa0 00000000 00000000 c1133598 c1133598
f31a7c3c f31a7c3c c01b3ebc c01b3ebc c0b530c0 c0b530c0 f31a7c18
f31a7c18 c015b60e c015b60e
f31a7c18 f31a7c18 ffffffff ffffffff 00000020 00000020
c088e800 c088e800 00200046 00200046 00000000 00000000 00000000
00000000 00000000 00000000
Call Trace:
[<c01b39be>] [<c01b39be>] ? ? get_partial_node+0x3e/0xc0
[<c01b39ef>] [<c01b39ef>] ? ? get_partial_node+0x6f/0xc0
[<c01b3ebc>] [<c01b3ebc>] ? ? __slab_alloc+0x11c/0x4e0
[<c015b60e>] [<c015b60e>] ? ? get_lock_stats+0x1e/0x50
[<c01b5ffc>] [<c01b5ffc>] ? ? __kmalloc+0x11c/0x130
[<c03ceeb2>] [<c03ceeb2>] ? ? tty_buffer_request_room+0xe2/0x130
[<c03ceeb2>] [<c03ceeb2>] ? ? tty_buffer_request_room+0xe2/0x130
[<c03ceeb2>] [<c03ceeb2>] ? ? tty_buffer_request_room+0xe2/0x130
[<c03d131d>] [<c03d131d>] ? ? tty_insert_flip_string_flags+0x2d/0xa0
...

To: Vegard Nossum <vegard.nossum@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, August 21, 2008 - 2:10 pm

To me, it does, but the question is if that's not caused by the serial console
itself ...

Thanks,
Rafael

--

To: Rafael J. Wysocki <rjw@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, August 21, 2008 - 2:21 pm

Actually, it was not my patch that did it. I appended acpi=off to the
command line. I get similar oopses now, but they make it to the ttyS0.

I got a "proper" oops this time (brace yourself, it's long ;-)), and
it does not have serial on the stack:

BUG: unable to handle kernel paging request at 00100104
IP: [<c038ad65>] __list_add+0x15/0x90
*pdpt = 00000000325bb001 *pde = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Pid: 3597, comm: bash Not tainted (2.6.27-rc4-00003-ga798564-dirty #30)
EIP: 0060:[<c038ad65>] EFLAGS: 00210082 CPU: 0
EIP is at __list_add+0x15/0x90
EAX: f7455764 EBX: 00100100 ECX: 00100100 EDX: f7455764
ESI: f7455764 EDI: f7455764 EBP: f25e7c18 ESP: f25e7bf4
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bash (pid: 3597, ti=f25e6000 task=f25aa700 task.ti=f25e6000)
Stack: f68120bc f6c00400 c0687133 00000000 00000002 00000000 f7455734 f7455764
f7455764 f25e7c40 c018b954 0000001f 00000000 f6c00400 f6c00444 00000001
f68120bc f68120b0 00200002 f25e7cb4 c018d317 f68120bc 00000000 f25aac0c
Call Trace:
[<c0687133>] ? _spin_lock+0x63/0x70
[<c018b954>] ? rmqueue_bulk+0x54/0x80
[<c018d317>] ? get_page_from_freelist+0x5a7/0x720
[<c01600ea>] ? __lock_acquire+0x27a/0xa00
[<c018dd50>] ? __alloc_pages_internal+0xa0/0x450
[<c01acd4b>] ? alloc_pages_current+0x7b/0xc0
[<c01b37fb>] ? new_slab+0x1bb/0x2d0
[<c0687877>] ? _spin_unlock+0x27/0x50
[<c01b40ca>] ? __slab_alloc+0x32a/0x4e0
[<c010b335>] ? native_sched_clock+0xb5/0x110
[<c01b4424>] ? kmem_cache_alloc+0xb4/0xe0
[<c018969e>] ? mempool_alloc_slab+0xe/0x10
[<c018969e>] ? mempool_alloc_slab+0xe/0x10
[<c018969e>] ? mempool_alloc_slab+0xe/0x10
[<c01897a1>] ? mempool_alloc+0x31/0xf0
[<c015e884>] ? trace_hardirqs_on_caller+0xd4/0x160
[<c015e91b>] ? trace_hardirqs_on+0xb/0x10
[<c0368c7e>] ? get_request+0xae/0x2c0
[<c036935c>] ? get_request_wait+0x1c/0x...

To: Vegard Nossum <vegard.nossum@...>
Cc: Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>
Date: Thursday, August 21, 2008 - 5:27 pm

Ok, here we have the block layer passing a bad pointer to SLUB this
time. And it's also from the suspend code (although it's the resume
path this time). As we never see an oops from the block layer first,
it's possible that someone else corrupted everything and it just shows
--

To: Pekka Enberg <penberg@...>, Vegard Nossum <vegard.nossum@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>, Bartlomiej Zolnierkiewicz <bzolnier@...>
Date: Thursday, August 21, 2008 - 6:16 pm

IDE again?

--

To: Rafael J. Wysocki <rjw@...>
Cc: Pekka Enberg <penberg@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>, Bartlomiej Zolnierkiewicz <bzolnier@...>
Date: Friday, August 22, 2008 - 12:45 am

If this makes it so, then yes:

calling piix_ide_init+0x0/0xb0
initcall piix_ide_init+0x0/0xb0 returned 0 after 0 msecs
calling ide_scan_pcibus+0x0/0xf0
piix 0000:00:1f.1: IDE controller (0x8086:0x27df rev 0x01)
piix 0000:00:1f.1: IDE port disabled
piix 0000:00:1f.1: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xffa0-0xffa7
Probing IDE interface ide0...
hda: WDC WD1600BB-00DAA3, ATA DISK drive
hda: host max PIO4 wanted PIO255(auto-tune) selected PIO4
hda: UDMA/100 mode selected
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14

It is also interesting that you mention that; this is from an earlier

Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--

To: Vegard Nossum <vegard.nossum@...>
Cc: Pekka Enberg <penberg@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>, Bartlomiej Zolnierkiewicz <bzolnier@...>
Date: Friday, August 22, 2008 - 5:34 am

Hmm.

Would that be possible to switch temporarily to PATA/libata and see if the
problem goes away? Then, we'd get a strong indication that it really is
related to IDE.

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Vegard Nossum <vegard.nossum@...>, Pekka Enberg <penberg@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>
Date: Friday, August 22, 2008 - 6:01 am

Hi,

"again"?

generic_ide_resume() just tries to get new request from the block

IIRC besides conversion to blk_{get,put}_request() conversion from
15th July (commit 5b114715ed63f3a4fdf790f5df61364fc4adadf1) there
weren't any PM related IDE changes recently.

Vegard, it would be worth to try if the kernels at the above commit
and at commit 9a2d43b7566caeeeb414aa628bc2759028897dbb (one commit
before the above one) are OK.

If it doesn't give the definitive answer then doing git-bisect run
would be of great help in identifying and fixing the source of the
problem quickly.

Thanks,
Bart
--

To: Bartlomiej Zolnierkiewicz <bzolnier@...>
Cc: Vegard Nossum <vegard.nossum@...>, Pekka Enberg <penberg@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>
Date: Friday, August 22, 2008 - 6:15 am

There was another trace in which IDE was also involved.

Sorry for the confusion.

Rafael
--

To: Vegard Nossum <vegard.nossum@...>
Cc: Rafael J. Wysocki <rjw@...>, Pekka Enberg <penberg@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>, Bartlomiej Zolnierkiewicz <bzolnier@...>
Date: Friday, August 22, 2008 - 5:28 am

Do you use any unusual drivers? WHat kind of machine is that?

Can you try with minimal drivers?

Can you try with minimum config? Like smp disabled...? Does echo test > /sys/power/state
work?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: Rafael J. Wysocki <rjw@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, August 21, 2008 - 2:45 pm

This one is slightly different:

WARNING: at lib/list_debug.c:26 __list_add+0x61/0x90()
list_add corruption. next->prev should be prev (c088e34c), but was
00200200. (next=f7814774).
Pid: 3572, comm: bash Not tainted 2.6.27-rc4-00003-ga798564-dirty #30
[<c013a56e>] warn_slowpath+0x5e/0x80
[<c015f7bf>] ? validate_chain+0x7df/0xe90
[<c014c4e8>] ? __kernel_text_address+0x8/0x20
[<c01600ea>] ? __lock_acquire+0x27a/0xa00
[<c015b60e>] ? get_lock_stats+0x1e/0x50
[<c015b64d>] ? put_lock_stats+0xd/0x30
[<c038adb1>] __list_add+0x61/0x90
[<c01b3208>] deactivate_slab+0x158/0x1b0
[<c01b3ea5>] __slab_alloc+0x105/0x4e0
[<c015b60e>] ? get_lock_stats+0x1e/0x50
[<c01b4424>] kmem_cache_alloc+0xb4/0xe0
[<c052a51b>] ? serio_queue_event+0x6b/0xe0
[<c052a51b>] ? serio_queue_event+0x6b/0xe0
[<c052a51b>] serio_queue_event+0x6b/0xe0
[<c015e91b>] ? trace_hardirqs_on+0xb/0x10
[<c052a5b8>] serio_resume+0x28/0x30
[<c03f8bde>] device_resume+0x32e/0x380
[<c0168791>] hibernation_snapshot+0xa1/0x220
[<c013b55b>] ? printk+0x1b/0x20
[<c01689f0>] hibernate+0xe0/0x180
[<c01674a0>] ? state_store+0x0/0xd0
[<c016755f>] state_store+0xbf/0xd0
[<c01674a0>] ? state_store+0x0/0xd0
[<c0375ef4>] kobj_attr_store+0x24/0x30
[<c01fa432>] sysfs_write_file+0xa2/0x100
[<c01bbf06>] vfs_write+0x96/0x130
[<c01fa390>] ? sysfs_write_file+0x0/0x100
[<c01bc44d>] sys_write+0x3d/0x70
[<c0104f3b>] sysenter_do_call+0x12/0x3f
=======================
---[ end trace 0bc52e2837e4b8b8 ]---

...followed by the "unable to handle kernel paging request" as in
previous e-mail. The numbers 00100104 and 00200200 seem to appear all
the time.

At least it is 100% reproducible now and with nice oopses too :-)

Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually d...

To: Vegard Nossum <vegard.nossum@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Thursday, August 21, 2008 - 3:13 pm

Hm, slab corruption, it seems.

Do you use SLAB or SLUB?

Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Thursday, August 21, 2008 - 3:16 pm

SLUB, with all possible debugging:

CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLABINFO=y
CONFIG_SLUB_DEBUG_ON=y
CONFIG_SLUB_STATS=y

(Thanks for helping out.)

Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--

To: Vegard Nossum <vegard.nossum@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Thursday, August 21, 2008 - 3:37 pm

(You're welcome.)

Can you switch to SLAB and retest? I don't really think SLUB is the issue
here, but SLAB may give us additional information.

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Thursday, August 21, 2008 - 4:08 pm

Got this with SLAB:

BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<c03897ac>] list_del+0xc/0x90
*pdpt = 0000000031480001 *pde = 0000000000000000
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Pid: 7, comm: events/0 Not tainted (2.6.27-rc4-00003-gef9b1bc #33)
EIP: 0060:[<c03897ac>] EFLAGS: 00010082 CPU: 0
EIP is at list_del+0xc/0x90
EAX: 00000000 EBX: f556afa0 ECX: f54a5380 EDX: f54e3128
ESI: f54f1ef0 EDI: f556afa0 EBP: f68b7ee0 ESP: f68b7ec8
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process events/0 (pid: 7, ti=f68b6000 task=f689c338 task.ti=f68b6000)
Stack: c015b62e f68b7ed4 c015b66d f68b7efc 00000046 f54f1f20 f68b7f1c c01b4054
00000000 f689c338 f54e315c 00000001 f54f1f20 f54a5380 00000000 f54e3128
f55b3348 00000000 f54f1f20 f54f1ef0 00000001 f68b7f3c c01b41ba 00000000
Call Trace:
[<c015b62e>] ? get_lock_stats+0x1e/0x50
[<c015b66d>] ? put_lock_stats+0xd/0x30
[<c01b4054>] ? free_block+0xa4/0x1b0
[<c01b41ba>] ? drain_array+0x5a/0xb0
[<c01b4f82>] ? cache_reap+0x72/0x230
[<c014ada7>] ? run_workqueue+0x107/0x200
[<c014ae0a>] ? run_workqueue+0x16a/0x200
[<c014ada7>] ? run_workqueue+0x107/0x200
[<c01b4f10>] ? cache_reap+0x0/0x230
[<c014b94d>] ? worker_thread+0x7d/0xe0
[<c014e4f0>] ? autoremove_wake_function+0x0/0x50
[<c014b8d0>] ? worker_thread+0x0/0xe0
[<c014e1e2>] ? kthread+0x42/0x70
[<c014e1a0>] ? kthread+0x0/0x70
[<c0105cf3>] ? kernel_thread_helper+0x7/0x14
=======================
Code: e8 01 89 44 24 04 e8 d4 1d db ff 8b 55 04 b8 48 2f 7a c0 e8 47 d8 dd ff e8
c2 dc d7 ff eb a9 55 89 e5 53 89 c3 83 ec 14 8b 40 04 <8b> 00 39 d8 75 24 8b 13
8b 42 04 39 d8 75 41 8b 43 04 89 42 04
EIP: [<c03897ac>] list_del+0xc/0x90 SS:ESP 0068:f68b7ec8
---[ end trace 958cea1a710a109a ]---
note: events/0[7] exited with preempt_count 1
uhci_hcd 0000:00:1d.2: setting latency timer to 64
uhci_hcd 0000:00:1d.3: setting latency t...

To: Vegard Nossum <vegard.nossum@...>
Cc: Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>
Date: Thursday, August 21, 2008 - 5:21 pm

Hi Vegard,

Looks like slabp->list is corrupted when we do list_del() in

Here the block layer is passing a pointer to a non-slab page to
kmem_cache_free(). Assuming we can rely on the stack trace, of

And now we have the per-cpu page lists corrupted as well. Note that
even though we have tty showing up in the stack traces, the list has
already been corrupted by someone else. So I don't think tty is at

I don't know the suspend code at all but the bogus pointer coming from
hibernation_snapshot() seems suspicious. Another possibility is that
the block layer is doing something strange here. Dunno.

Pekka
--

To: Pekka Enberg <penberg@...>
Cc: Vegard Nossum <vegard.nossum@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>
Date: Thursday, August 21, 2008 - 6:10 pm

I bet on the IDE driver.

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Pekka Enberg <penberg@...>, Vegard Nossum <vegard.nossum@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>
Date: Friday, August 22, 2008 - 5:38 am

Should be easy enough to switch to libata to verify this, right?
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: Rafael J. Wysocki <rjw@...>
Cc: Vegard Nossum <vegard.nossum@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Jens Axboe <jens.axboe@...>
Date: Friday, August 22, 2008 - 3:02 am

Hi Rafael,

But like I said, the stack traces seem more unreliable than usual so I
could be totally wrong here.
--

To: Vegard Nossum <vegard.nossum@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, August 21, 2008 - 4:22 pm

Well, it looks like this happened while devices were suspended, but I can't say

--

To: Rafael J. Wysocki <rjw@...>
Cc: <vegard.nossum@...>, <linux-kernel@...>, Pekka Enberg <penberg@...>
Date: Thursday, August 21, 2008 - 4:26 pm

On Thu, 21 Aug 2008 22:22:44 +0200

But slab hasn't changed in quite some time. I expect something else
has reached over and scribbled on it. Bisection time?

--

Previous thread: 2.6.26.3-rt2 by Steven Rostedt on Thursday, August 21, 2008 - 1:18 pm. (3 messages)

Next thread: [GIT PATCH] USB patches for 2.6.27-rc4 by Greg KH on Thursday, August 21, 2008 - 1:30 pm. (1 message)