Re: Current Git: BUG: unable to handle kernel paging request at 0000000001a40ca0

Previous thread: [PATCH] dm9000: Support MAC address setting through platform data. by Laurent Pinchart on Wednesday, July 23, 2008 - 8:41 am. (4 messages)

Next thread: [PATCH 12/16 v4] ibmveth: Automatically enable larger rx buffer pools for larger mtu by Robert Jennings on Wednesday, July 23, 2008 - 11:34 am. (1 message)
From: Dieter Ries
Date: Wednesday, July 23, 2008 - 8:39 am

Hi,

I just encountered a Bug in latest git:

As this is my first bugreport, I am not sure who to cc and which 
information to provide, so please advise me. Some information is below.



ifconfig triggered:

BUG: unable to handle kernel paging request at 0000000001a40ca0
IP: [<ffffffff80290632>] kmem_cache_alloc+0x50/0x81
PGD 79d33067 PUD 79cf7067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: radeon drm uinput snd_hda_intel iwl3945 snd_pcm 
snd_timer rfkill snd led_class snd_page_alloc
Pid: 3516, comm: ifconfig Not tainted 2.6.26-06077-gc010b2f #23
RIP: 0010:[<ffffffff80290632>]  [<ffffffff80290632>] 
kmem_cache_alloc+0x50/0x81
RSP: 0000:ffff880079d079e8  EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0000000000000296 RCX: ffffffff802704ae
RDX: ffff880001016700 RSI: 0000000001a40ca0 RDI: ffffffff808b5fa0
RBP: ffff880079d07a08 R08: 000000000000000c R09: 0000000000000001
R10: ffff880079d07a18 R11: ffff880079cce260 R12: 0000000000011280
R13: 0000000000000080 R14: 00000000000112d0 R15: 0000000001670328
FS:  00007fb32edc26f0(0000) GS:ffffffff808b6e80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000001a40ca0 CR3: 0000000079d43000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ifconfig (pid: 3516, threadinfo ffff880079d06000, task 
ffff88007f8ed6c0)
Stack:  ffff880079d07a18 00000000000000d0 ffff88007f99e480 0000000000000001
  ffff880079d07a18 ffffffff802704ae ffff880079d07a78 ffffffff802705c2
  ffff88007d01e400 ffffffff80865dc0 ffff880079cced10 0000000000000008
Call Trace:
  [<ffffffff802704ae>] mempool_alloc_slab+0x16/0x18
  [<ffffffff802705c2>] mempool_alloc+0x3e/0xfa
  [<ffffffff802b8db7>] bio_alloc_bioset+0x27/0x94
  [<ffffffff802b8e7e>] bio_alloc+0x15/0x24
  [<ffffffff802bc0ef>] mpage_alloc+0x2b/0x84
  [<ffffffff802bcbb5>] do_mpage_readpage+0x406/0x508
  [<ffffffff80279916>] ? __inc_zone_page_state+0x23/0x25
  ...
From: Vegard Nossum
Date: Wednesday, July 23, 2008 - 10:46 am

Hi,



The code decodes to:

    mov    0x14(%rdx),%eax
    mov    (%rsi,%rax,8),%rax <--- HERE!

which corresponds to this code in mm/slub.c:

                c->freelist = object[c->offset];

So the mov 0x14(%rdx) is the loading of c->offset, which means that
the pointer "c" is held in %rdx (= 0xffff880001016700), and the
variable c->offset is held in %eax (= 0).

It also means that the pointer "object" is held in %rsi (= 0x1a40ca0).
Now, clearly the object pointer is bogus. It was loaded on the line
above:

                object = c->freelist;

..and it may look like c->freelist has become corrupted. This one is
again loaded from the line:

        c = get_cpu_slab(s, smp_processor_id());

Everything seems normal, except the c->freelist pointer.

The rest of the messages are from the same function, but from


...this seems to suggest that none of the backtraces may actually give
a good clue as to who caused the corruption to begin with. (In other
words, I have no more clue than you on who to Cc this.)

Does the number 0x1a40ca0 look familiar to anybody?

Dieter: If this is reproducible, it would probably help quite a bit to
configure the kernel with CONFIG_SLUB_DEBUG and boot with
slub_debug=FZPUT (unless you already have CONFIG_SLUB_DEBUG_ON set, in
which case you are already running with the SLUB debugging at boot).
It might catch the corruption before it becomes fatal, or give us some
more clues anyway.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Dieter Ries
Date: Wednesday, July 23, 2008 - 11:01 am

I am bisecting the bug right now, as it is kind of reproducible. Its 
always network related, sometimes it happens when init brings up the 
nics, sometimes afterwards with ifconfig etc.

I will add CONFIG_SLUB_DEBUG_ON for the next runs, next time it happens 
cu
Dieter
--

From: Dieter Ries
Date: Wednesday, July 23, 2008 - 2:53 pm

I tried to bisect the bug, which failed because there were too many 
kernels not booting with other problems, I guess bisecting just fails in 
the merge window.

With CONFIG_SLUB_DEBUG_ON the output looks different, unfortunately 
netconsole stops before those are transmitted.

As there are always some lines about e1000 in the backtraces, I tried to 
boot without LAN cable connected, and it worked, and crashed afterwards 
when I plugged the cable in, with a bug in net/core/dev.c.

I will search on tomorrow, as it is getting quite late now.

Should I copy the messages with CONFIG_SLUB_DEBUG_ON by hand, or are 
just some parts important?

cu
Dieter
--

From: Vegard Nossum
Date: Wednesday, July 23, 2008 - 3:00 pm

There were some e1000 patches in flight on LKML recently; you might be
able to find them and see if it helps you. It also seems that some
changes were just committed to -git, so I guess you should try the
very latest from there.

You also Cced netdev from the start, so somebody from there should be
able to help you more from here than I. :-)


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Dieter Ries
Date: Wednesday, July 23, 2008 - 11:51 pm

I think I managed to catch one of those:


general protection fault: 0000 [1] SMP
CPU 0
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.26-06373-gcaf076e #49
RIP: 0010:[<ffffffff805e08f9>]  [<ffffffff805e08f9>] 
nf_nat_move_storage+0x21/0x7a
RSP: 0018:ffffffff8091ab80  EFLAGS: 00010206
RAX: ffffffff805e08d8 RBX: ffff88007d1fb948 RCX: 000000000000006b
RDX: ffff88007d175e10 RSI: ffff88007d175e7b RDI: ffff88007d1fb948
RBP: ffffffff8091aba0 R08: 0000000000000000 R09: ffff88007d175e90
R10: ffffe20000000008 R11: ffff88007d175e10 R12: 59d2c3ffff88007d
R13: ffff88007d175e7b R14: 00000000000000a0 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffffffff8089ee80(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff808b0000, task ffffffff80842340)
Stack:  0000000000000002 ffff88007d3d2000 ffff88007d1fb948 0000000000000070
  ffffffff8091abf0 ffffffff8059d3c4 ffffffff8091ac40 0000000100000001
  ffffffff809e3658 ffff88007d3d2000 0000000000000002 ffff88007f9f6500
Call Trace:
  <IRQ>  [<ffffffff8059d3c4>] __nf_ct_ext_add+0x15f/0x1f7
  [<ffffffff805e762c>] nf_nat_fn+0x84/0x152
  [<ffffffff805e77d8>] nf_nat_in+0x2f/0x71
  [<ffffffff805953d8>] nf_iterate+0x48/0x85
  [<ffffffff805b19c0>] ? ip_rcv_finish+0x0/0x35d
  [<ffffffff80595478>] nf_hook_slow+0x63/0xcb
  [<ffffffff805b19c0>] ? ip_rcv_finish+0x0/0x35d
  [<ffffffff8028fe7c>] ? __slab_alloc+0x413/0x4bd
  [<ffffffff805b21b8>] ip_rcv+0x257/0x297
  [<ffffffff80581461>] netif_receive_skb+0x1f1/0x263
  [<ffffffff80495b34>] e1000_receive_skb+0x46/0x5d
  [<ffffffff8049830b>] e1000_clean_rx_irq+0x20e/0x2a6
  [<ffffffff8024cce8>] ? getnstimeofday+0x3f/0xa0
  [<ffffffff804952ce>] e1000_clean+0x6d/0x218
  [<ffffffff8024ad39>] ? hrtimer_get_next_event+0xa8/0xb8
  ...
From: Pekka Enberg
Date: Thursday, July 24, 2008 - 6:49 am

Hi Dieter,


Dieter, can you please try this patch:

http://lkml.org/lkml/2008/7/24/186

Patrick, it looks like this could be the same bug Ingo is hitting?
--

From: Patrick McHardy
Date: Thursday, July 24, 2008 - 6:51 am

Yes, that most likely the same underlying problem.
--

From: Dieter Ries
Date: Thursday, July 24, 2008 - 10:53 am

it looks like it is fixed with the patch. But I (still/now) have the 
following Warning:


WARNING: at net/core/dev.c:1344 __netif_schedule+0x31/0x79()
Modules linked in:
Pid: 9, comm: events/0 Not tainted 2.6.26-06371-g338b9bb-dirty #51

Call Trace:
  [<ffffffff80234c3a>] warn_on_slowpath+0x5d/0x92
  [<ffffffff8023dc24>] ? lock_timer_base+0x2b/0x4f
  [<ffffffff8023dde1>] ? __mod_timer+0xc6/0xd8
  [<ffffffff8026820a>] ? ftrace_record_ip+0x211/0x262
  [<ffffffff8020b1e4>] ? mcount_call+0x5/0x31
  [<ffffffff805802bf>] __netif_schedule+0x31/0x79
  [<ffffffff80498890>] e1000_watchdog_task+0x47d/0x695
  [<ffffffff80498413>] ? e1000_watchdog_task+0x0/0x695
  [<ffffffff80244454>] run_workqueue+0x81/0x10a
  [<ffffffff80244e46>] worker_thread+0xdd/0xea
  [<ffffffff80247b3d>] ? autoremove_wake_function+0x0/0x3d
  [<ffffffff80244d69>] ? worker_thread+0x0/0xea
  [<ffffffff80247a0a>] kthread+0x4e/0x7c
  [<ffffffff8020c279>] child_rip+0xa/0x11
  [<ffffffff802479bc>] ? kthread+0x0/0x7c
  [<ffffffff8020c26f>] ? child_rip+0x0/0x11

---[ end trace a344035e20d497ce ]---

but this is work in progress I guess.

Thanks for support!

cu
Dieter
--

Previous thread: [PATCH] dm9000: Support MAC address setting through platform data. by Laurent Pinchart on Wednesday, July 23, 2008 - 8:41 am. (4 messages)

Next thread: [PATCH 12/16 v4] ibmveth: Automatically enable larger rx buffer pools for larger mtu by Robert Jennings on Wednesday, July 23, 2008 - 11:34 am. (1 message)