[bug, netconsole, SLUB] BUG skbuff_head_cache: Poison overwritten

Previous thread: Linux Power Management Mini-Summit -- Ottawa -- July 22, 2008 by Len Brown on Thursday, July 17, 2008 - 2:27 pm. (5 messages)

Next thread: [PATCH] mtd: diskonchip.c fix sparse endian warnings by Harvey Harrison on Thursday, July 17, 2008 - 2:57 pm. (1 message)
From: Ingo Molnar
Date: Thursday, July 17, 2008 - 2:42 pm

A regression to v2.6.26:

I started getting this skb-head corruption message today, on a T60 
laptop with e1000:

PM: Removing info for No Bus:vcs11
device: 'vcs11': device_create_release
=============================================================================
BUG skbuff_head_cache: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xf658ae9c-0xf658ae9c. First byte 0x6a instead of 0x6b
INFO: Allocated in __alloc_skb+0x2c/0x110 age=0 cpu=0 pid=5098
INFO: Freed in __kfree_skb+0x31/0x80 age=0 cpu=1 pid=4440
INFO: Slab 0xc16cc140 objects=16 used=1 fp=0xf658ae00 flags=0x400000c3
INFO: Object 0xf658ae00 @offset=3584 fp=0xf658af00

Bytes b4 0xf658adf0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
  Object 0xf658ae00:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf658ae10:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf658ae20:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf658ae30:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf658ae40:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf658ae50:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf658ae60:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf658ae70:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
 Redzone 0xf658aea0:  bb bb bb bb                                     ����            
 Padding 0xf658aec8:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
 Padding 0xf658aed8:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
 Padding 0xf658aee8:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
 Padding 0xf658aef8:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ        
Pid: 5098, comm: gdm-binary Not tainted 2.6.26-tip #3094
 [<c0186f99>] print_trailer+0xa9/0xf0
 ...
From: David Miller
Date: Thursday, July 17, 2008 - 2:45 pm

From: Ingo Molnar <mingo@elte.hu>

This is very unlikely to be added by us networking folks, no
networking merges have happened for the 2.6.27 merge window yet :-)
--

From: Ingo Molnar
Date: Thursday, July 17, 2008 - 3:06 pm

and Cc:-ed SLUB folks. Could be a sleeper cell of bugs gone active ;-) 

Or could be SLUB (-debugging) breakage. Netconsole is pretty reliable on 
this box. (and the bootup continued just fine after this report)

Just re-tried it, the bug is reliably repeatable. Will try a bisection 
run.

	Ingo
--

From: David Miller
Date: Thursday, July 17, 2008 - 3:09 pm

From: Ingo Molnar <mingo@elte.hu>

This bug would be a quite positive result then :)
--

From: Ingo Molnar
Date: Thursday, July 17, 2008 - 3:43 pm

hm, but it was not reproducible on the third and fourth attempt :-( I 
tried hard to provoke it by generating artificial parallel network and 
netconsole output - but it didnt want to trigger. Heisenbug ...

Maybe the debug output gives someone an idea about the nature of the 
bug?

	Ingo
--

From: Vegard Nossum
Date: Thursday, July 17, 2008 - 4:15 pm

1. Notice the range. It's just a single byte.
2. Notice the value. It's just a ++.

Probably a stray increment of a uint8_t somewhere on a freed object?

The offset from the beginning of the object is 0xf658ae9c - 0xf658ae00 = 0x9c.

How big is a struct sk_buff? Hm.. it is in fact quite big. Now what
member has offset 0x9c? Seems to depend on your config. Is there any
way you can figure it out, Ingo? I'll try it with your config too.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Vegard Nossum
Date: Thursday, July 17, 2008 - 4:35 pm

With your config:

(gdb) p ((struct sk_buff *) 0)->truesize
Cannot access memory at address 0x9c

Now just audit users of ->truesize... There are quite a few.

Which one would only += 1?


Vegard

PS: I might be on the completely wrong track. So far I only have bad
experiences with this sk_buff...

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Ingo Molnar
Date: Thursday, July 17, 2008 - 4:52 pm

hmm ... your analysis gave me a wonderful albeit admittedly remote idea:

 If only we had some kernel technology that could track and validate 
 memory accesses, and point out the cases where we access uninitialized 
 memory, just like Valgrind?

... something like kmemcheck? ;-)

So i booted that box with tip/master and kmemcheck enabled. (plus a few 
fixlets to make networking allocations be properly tracked by 
kmemcheck.)

It was a slow bootup and long wait, but it gave a few hits here:

kmemcheck: Caught 8-bit read from uninitialized memory (f653ad24)
iiiiiiiiiiiiiiiiuuuuuuuuuuuuuuuuuuuuuiuuuuuuuuuuuuuuuuuuuuuuuuuu
                                    ^

Pid: 2484, comm: arping Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c05e973c>] EFLAGS: 00010282 CPU: 0
EIP is at __copy_skb_header+0x7c/0x100
EAX: 00000000 EBX: f653acc0 ECX: f653ac00 EDX: f653ac00
ESI: f653ac50 EDI: f653ad10 EBP: c09b9e84 ESP: c09ddaa8
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: f71c2700 CR3: 36513000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
 [<c05e97e7>] __skb_clone+0x27/0xe0
 [<c05eb101>] skb_clone+0x41/0x60
 [<c065cbf1>] packet_rcv+0xc1/0x290
 [<c05f07ad>] netif_receive_skb+0x20d/0x400
 [<c03b2aa7>] e1000_receive_skb+0x47/0x180
 [<c03b3983>] e1000_clean_rx_irq+0x223/0x2e0
 [<c03b225b>] e1000_clean+0x5b/0x200
 [<c05f29db>] net_rx_action+0xfb/0x160
 [<c0129092>] __do_softirq+0x82/0xf0
 [<c0105b8a>] call_on_stack+0x1a/0x30

false positive? Find below the quick hacks i did to pre-initialize skb 
allocations that have RX DMA into them.

another one is:

kmemcheck: Caught 8-bit read from uninitialized memory (f653a902)
iiuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
  ^

Pid: 2575, comm: hcid Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c02b9926>] EFLAGS: 00010293 CPU: 0
EIP is at __copy_to_user_ll+0x46/0x70
EAX: 00000004 EBX: b7f3c478 ECX: 00000002 EDX: f653a900
ESI: f653a902 EDI: b7f3c47a ...
From: Ingo Molnar
Date: Thursday, July 17, 2008 - 5:01 pm

two other ones:

kmemcheck: Caught 32-bit read from uninitialized memory (f459ac00)
uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
^

Pid: 1614, comm: kjournald Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c05ea904>] EFLAGS: 00010216 CPU: 0
EIP is at skb_copy_bits+0x54/0x220
EAX: 00000110 EBX: 00000110 ECX: 00000044 EDX: f6c256c0
ESI: f459ac00 EDI: f459a000 EBP: c09b9f60 ESP: c09de048
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: f71c2700 CR3: 2e849000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
 [<c05ebe28>] skb_copy+0x78/0x90
 [<c05f9426>] neigh_timer_handler+0x156/0x2c0
 [<c012ba62>] run_timer_softirq+0x142/0x180
 [<c0129092>] __do_softirq+0x82/0xf0
 [<c0105b8a>] call_on_stack+0x1a/0x30
 [<ffffffff>] 0xffffffff


plus:

kmemcheck: Caught 32-bit read from uninitialized memory (f654b590)
iiiiiiiiiiiiiiiiuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
                ^

Pid: 3130, comm: git-read-tree Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c05eb396>] EFLAGS: 00010286 CPU: 0
EIP is at pskb_expand_head+0x86/0x150
EAX: 00000140 EBX: f6763124 ECX: 00000027 EDX: f6549400
ESI: f654b590 EDI: f6549650 EBP: c09b9a80 ESP: c09dfa28
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: f71c2700 CR3: 2e849000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
 [<c05eb67d>] __pskb_pull_tail+0x21d/0x300
 [<c03b45dd>] e1000_xmit_frame+0x1fd/0xa60
 [<c05f0fd4>] dev_hard_start_xmit+0x254/0x2e0
 [<c0609721>] __qdisc_run+0x61/0x1b0
 [<c05f383e>] dev_queue_xmit+0x25e/0x340
 [<c061f315>] ip_finish_output+0x105/0x260
 [<c061f739>] ip_output+0x49/0x50
 [<c061debd>] ip_local_out+0x1d/0x30
 [<c061fa85>] ip_queue_xmit+0x1a5/0x340
 [<c062ffd3>] tcp_transmit_skb+0x443/0x800
 [<c06316aa>] __tcp_push_pending_frames+0xfa/0x730
 [<c062ee58>] tcp_rcv_established+0x3b8/0x6f0
 [<c0634f50>] tcp_v4_do_rcv+0x2d0/0x510
 [<c063720e>] ...
From: Vegard Nossum
Date: Thursday, July 17, 2008 - 5:05 pm

Hm, if you think it was that slow, I am suspecting you were also using
SLUB debugging.

This can actually be negative, since now SLUB will access the objects
(+redzone +padding) and possibly trick kmemcheck into thinking they
were initialized in the first place.

But what we are really looking for is "read from freed memory"
messages. So I would actually recommend this: Disable kmemcheck's
reporting of uninitialized memory, simply to make it easier to spot
the "freed" messages more easily.

Maybe something like this (warning: whitespace-munged):

diff --git a/arch/x86/mm/kmemcheck/error.c b/arch/x86/mm/kmemcheck/error.c
index 56410c6..6944cb7 100644
--- a/arch/x86/mm/kmemcheck/error.c
+++ b/arch/x86/mm/kmemcheck/error.c
@@ -98,6 +98,9 @@ void kmemcheck_error_save(enum kmemcheck_shadow state,
                return;
        prev_ip = regs->ip;

+       if (state == KMEMCHECK_SHADOW_UNINITIALIZED)
+               return;
+
        e = error_next_wr();
        if (!e)
                return;


If this only happens during boot, it would also be a good idea to
simply reboot the machine a lot...


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Ingo Molnar
Date: Thursday, July 17, 2008 - 5:16 pm

nope:

 # CONFIG_SLUB_DEBUG is not set


yeah, i've got a script for that. Will try it overnight.

	Ingo
--

From: David Miller
Date: Thursday, July 17, 2008 - 7:13 pm

From: Ingo Molnar <mingo@elte.hu>

Maybe.  Every SKB object allocated is fully initialized
in __alloc_skb():

	/*
	 * Only clear those fields we need to clear, not those that we will
	 * actually initialise below. Hence, don't put any more fields after
	 * the tail pointer in struct sk_buff!
	 */
	memset(skb, 0, offsetof(struct sk_buff, tail));

That leaves the following trailing members of struct sk_buff:

	/* These elements must be at the end, see alloc_skb() for details.  */
	sk_buff_data_t		tail;
	sk_buff_data_t		end;
	unsigned char		*head,
				*data;
	unsigned int		truesize;
	atomic_t		users;

which are the explicitly initialized right after the quotes memset().

	skb->truesize = size + sizeof(struct sk_buff);
	atomic_set(&skb->users, 1);
	skb->head = data;
	skb->data = data;
	skb_reset_tail_pointer(skb);
	skb->end = skb->tail + size;

When we clone, there are probably some fields we don't copy over
explicitly.  And we usually do that because they don't matter or
if they do the caller will take care of it.
--

From: David Miller
Date: Thursday, July 17, 2008 - 7:03 pm

From: "Vegard Nossum" <vegard.nossum@gmail.com>

It's supposed to be 0x6b, this would be a "--"

Also it (more likely IMHO) could be clearing a flag with the value 0x01.

--

From: Vegard Nossum
Date: Friday, July 18, 2008 - 12:03 am

It could be. But like I said in a later e-mail, the thing is likely
sk_buff->truesize. Which is not a flags variable. It _is_ however, a
counter, which is frequently -= and atomic_sub()ed.

That field is also an int, not a byte like I suggested above. This is
fine, though. "--" on an int can of course legitimately update/change
just the lower byte of an int.

But.. it could also be some random corruption coming from elsewhere.
Maybe even bad RAM (it's just a single bit anyway). But that's less
likely.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: David Miller
Date: Friday, July 18, 2008 - 12:12 am

From: "Vegard Nossum" <vegard.nossum@gmail.com>

skb->truesize is ever incremented or decremented by only one.

Usually it is changed by the entire packet size, or at least one MSS's
worth.

On packet free, it will be decremented by at least sizeof(struct sk_buff)
--

From: Ingo Molnar
Date: Friday, July 18, 2008 - 2:05 am

ok, i looked at the logs once more and while i thought that it occured 
twice it only occured once:

  Jul 17 20:22:14 europe kernel: BUG skbuff_head_cache: Poison overwritten

... this would explain why my attempts to bisect and reproduce it 
failed. I got too excited about it being seemingly reproducible and 
possibly bisectable (memory corruption bugs rarely are). My overnight 
reboot-the-same-kernel tests didnt show anything either.

It's a known-reliable system with thousands of bootups:

 Jul 17 20:20:54 europe kernel: Linux version 2.6.26-tip (mingo@europe) 
 (gcc version 4.2.2) #3094 SMP Thu Jul 17 20:19:27 CEST 2008

... but a hw fluke is never out of question. (It wasnt a particularly 
hot day but the evening was unusually humid, maybe that made the 
difference.)

So lets close this for now, it's not a reproducible regression that we 
can act upon. I'll update this thread if anything new happens. 
Netconsole is reliable on this system in any case.

	Ingo
--

From: Ingo Molnar
Date: Friday, July 18, 2008 - 12:10 pm

Hm, just got another skb related weirdness and crash on that testbox:

[   13.495000] usb 1-0:1.0: uevent
[   19.392000] Attempt to release alive inet socket f6fac040
[   20.387000] EXT3 FS on sda1, internal journal
[   20.762500] eth0: no IPv6 routers present
[   22.431000] warning: process `kudzu' used the deprecated sysctl system call with 1.23.
[   22.906000] ------------[ cut here ]------------
[   22.906000] kernel BUG at mm/slab.c:2927!
[   22.906000] invalid opcode: 0000 [#1] PREEMPT SMP 
[   22.906000] Modules linked in:
[   22.906000] 
[   22.906000] Pid: 1831, comm: ip6tables-resto Tainted: G        W (2.6.26-tip-03923-g2c96bcb-dirty #3134)
[   22.906000] EIP: 0060:[<b01878d2>] EFLAGS: 00010046 CPU: 0
[   22.906000] EIP is at cache_alloc_refill+0xfa/0x431
[   22.906000] EAX: 00000004 EBX: f6ec1a00 ECX: f67826c0 EDX: f6fa3c40
[   22.906000] ESI: f6fac000 EDI: 0000000b EBP: f5c01ec4 ESP: f5c01e70
[   22.906000]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[   22.906000] Process ip6tables-resto (pid: 1831, ti=f5c01000 task=f5d2f060 task.ti=f5c01000)
[   22.906000] Stack: 00000004 00000004 000000d0 000000d0 f67826c8 f67826d0 f6fac01c 000080d0 
[   22.906000]        f6fa3c40 f6fa9800 f67826c0 00000000 7f431025 b014a3e9 b2bd6ab8 00000000 
[   22.906000]        00000000 00000000 00000246 f6fa3c40 f6fa3c40 f5c01ee4 b0187793 b07c11d6 
[   22.906000] Call Trace:
[   22.906000]  [<b014a3e9>] ? __wake_up_bit+0xc/0x37
[   22.906000]  [<b0187793>] ? kmem_cache_alloc+0x53/0x98
[   22.906000]  [<b05589b5>] ? sk_prot_alloc+0x1e/0xa7
[   22.906000]  [<b026cb86>] ? cap_capable+0xa/0x2b
[   22.906000]  [<b0559fa4>] ? sk_alloc+0x20/0xa3
[   22.906000]  [<b05c4364>] ? inet6_create+0x187/0x316
[   22.906000]  [<b0556e87>] ? __sock_create+0x15b/0x229
[   22.906000]  [<b0556f9c>] ? sock_create+0x29/0x2e
[   22.906000]  [<b0557112>] ? sys_socket+0x1e/0x42
[   22.906000]  [<b05577b6>] ? sys_socketcall+0x56/0x163
[   22.906000]  [<b011a418>] ? resume_userspace+0x8/0x30
[   22.906000]  ...
From: Ingo Molnar
Date: Friday, July 18, 2008 - 12:55 pm

Totally weird bug. Using the same bzImage i was able to crash the box 2 
times in a row. Then i tried another (but similarly configured) kernel 
and it wouldnt crash.

Then i tried the original bzImage again (bit for bit equivalent) and it 
wouldnt crash 5 times in a row.

I've attached the second crashlog below - in case it's interesting to 
anyone.

This cannot be random corruption or hw flakiness as the two crashes in a 
row are so similar. One (remote) possibility would be for some hw 
component to be in a weird state that doesnt get proper re-initialized 
after reboot. As long as that condition lingered, it would crash. Once 
that condition was gone, it wouldnt crash.

For now i've given up on this variant of the bug. I'll continue with 
randconfig searching - maybe another config gives us something more 
debuggable ...

And note that this time it was SLAB not SLUB - so any SLUB impact is 
highly unlikely. (i strongly suspect some hw initialization issue.)

	Ingo

-------------------->
[    0.000000] Linux version 2.6.26-tip-03923-g2c96bcb-dirty (mingo@europe) (gcc version 4.2.2) #3134 SMP PREEMPT Fri Jul 18 20:06:13 CEST 2008
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
[    0.000000]  BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
[    0.000000]  BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 000000007f6e0000 (usable)
[    0.000000]  BIOS-e820: 000000007f6e0000 - 000000007f6f3000 (ACPI data)
[    0.000000]  BIOS-e820: 000000007f6f3000 - 000000007f700000 (ACPI NVS)
[    0.000000]  BIOS-e820: 000000007f700000 - 0000000080000000 (reserved)
[    0.000000]  BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[    0.000000]  BIOS-e820: 00000000fed00000 - 00000000fed00400 ...
From: Vegard Nossum
Date: Thursday, July 17, 2008 - 4:27 pm

It doesn't actually work. The config says

    # head: 088fcf34

and I checked out this from the tip tree. But kernel-config still
complains about unknown config options... What went wrong?


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Ingo Molnar
Date: Thursday, July 17, 2008 - 4:56 pm

that's ok - i've got some local qa helpers that have config options. 

Things like making SMP bootups more likely in randconfig, adding various 
boot parameters to the bootup via .config methods (so that boot 
parameters can be randomized via make randconfig), etc.

these:

  CONFIG_BOOTPARAM_SUPPORT=y
  CONFIG_BOOTPARAM_NO_HZ_OFF=y
  CONFIG_BOOTPARAM_NMI_WATCHDOG_BIT_0=y
  CONFIG_BOOTPARAM_LAPIC=y
  CONFIG_BOOTPARAM_IDLE_MWAIT=y
  CONFIG_BOOTPARAM_NOPAT=y
  CONFIG_BOOTPARAM_NOTSC=y

are equivalent to adding this to the boot line:

   nohz=off nmi_watchdog=1 lapic idle=mwait nopat notsc

although i dont think they are normally material to netconsole workings.

	Ingo
--

From: Vegard Nossum
Date: Monday, July 21, 2008 - 4:41 am

Hm, your config doesn't contain SLUB_DEBUG either. Isn't that a
requirement for seeing the BUG you're getting?

$ grep SLUB config-Thu_Jul_17_20_24_45_CEST_2008.bad
# CONFIG_SLUB_DEBUG is not set
CONFIG_SLUB=y

Did you post the wrong config?


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Evgeniy Polyakov
Date: Thursday, July 17, 2008 - 10:46 pm

Hi Ingo.


Out of curiosity, why does it scream at allocation time?
Does SLUB have a debug check at freeing time? If so, how does it work
and why didn't it caught use after free there?

-- 
	Evgeniy Polyakov
--

From: Pekka Enberg
Date: Friday, July 18, 2008 - 2:02 am

Hi Evgeniy,


Because it's checking for use-after-free errors. The object is
poisoned with POISON_FREE when it's free'd and we verify the poison
values at allocation time.


You can't detect use after free before the object is actually free'd ;-)

                                      Pekka
--

From: Ingo Molnar
Date: Friday, July 18, 2008 - 2:09 am

yeah, we want to check use-after free at the next allocation point - 
i.e. as late as possible to gather all corruptions that happened 
meanwhile.

We could in theory have a SLUB debug mode where a SCHED_IDLE kernel 
thread would periodically check all free objects (of that CPU) in the 
background to ensure their integrity. That would catch corruptions 
sooner, with a possibly still meaningful context to print out. [right 
after the IRQ or process that corrupts them finishes running]

It could also be hooked into ftrace to print out the last few hundred 
kernel function calls executed prior any corruption. ftrace/slub-debug 
plugin perhaps?

	Ingo
--

From: Pekka Enberg
Date: Friday, July 18, 2008 - 2:15 am

Hi Ingo,


Well, there's this Norwegian guy, Vegard, who has written a small
piece of code that can detect use-after-free _as it happens_. I think
he calls the thing kmemcheck :-).
--

From: Evgeniy Polyakov
Date: Friday, July 18, 2008 - 3:16 am

Hi Pekka.


Does it also scream on double free event? Just to closer guilty
circles... 0x9c offset is somewhere at the very end of the skbuff
structure, likely skb->users.

Can you also check in some kind of this patch to catch freed skb freeing
for testing?

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3666216..dda96bf 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -419,6 +419,14 @@ void kfree_skb(struct sk_buff *skb)
 {
 	if (unlikely(!skb))
 		return;
+
+	{
+		u8 *ptr = (u8 *)(&skb->users);
+
+		if (*ptr == POISON_FREE || *ptr == POISON_INUSE || *ptr == POISON_END)
+			BUG();
+	}
+
 	if (likely(atomic_read(&skb->users) == 1))
 		smp_rmb();
 	else if (likely(!atomic_dec_and_test(&skb->users)))

-- 
	Evgeniy Polyakov
--

From: Pekka Enberg
Date: Friday, July 18, 2008 - 7:44 am

Hi Evgeniy,

On Fri, Jul 18, 2008 at 12:02:26PM +0300, Pekka Enberg


Yeah. See the free_debug_processing() function in mm/slub.c for
details (the on_freelist() part). However, if you look at slab_free()
you can see that in the SLUB fast-path we don't do any of these
debugging checks. So you can end up with slab corruption without a
nice error message.

                                  Pekka
--

From: Christoph Lameter
Date: Friday, July 18, 2008 - 7:48 am

The slub fastpath is not used when debugging is enabled.

Without debugging on double frees will typically corrupt the freepointer. So you get an invalid pointer reference in __slab_free.

--

From: Evgeniy Polyakov
Date: Friday, July 18, 2008 - 9:07 am

Hi Pekka.


Yup, I see.

Actually with the network there will be no duuble free as long as uses
use kfree_skb(), since first freeing (with skb->users being equal to 1,
so user owns this skb) will fill skb with 0x6b, so any subsequent
kfree_skb() will try to decrement particulary huge reference counter
(0x6b6b6b6b) and actually will never reach 1 again, so things will look
quite clear, but if the same slub area will be reused for the next skb
before it is freed second time, very hard-to-debug data corruption will
happen, so I belive this kind of checks should be done into kfree_skb()
(of course under special debug compile option). I posted that patch just
for the reference and that Ingo would be able to detect where skb is
used after already being freed.

-- 
	Evgeniy Polyakov
--

From: Pekka J Enberg
Date: Friday, July 18, 2008 - 2:00 am

Hi Ingo,


[snip]
 

Nope.



It's bit unfortunate that we don't see full dump of the corruption here 
because SLUB limits the output to 128 bytes. Ingo, you might want to try 
this patch so that we can see all of it:

diff --git a/mm/slub.c b/mm/slub.c
index 5f6e2c4..f69d181 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -492,7 +492,7 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
 	if (p > addr + 16)
 		print_section("Bytes b4", p - 16, 16);
 
-	print_section("Object", p, min(s->objsize, 128));
+	print_section("Object", p, s->objsize);
 
 	if (s->flags & SLAB_RED_ZONE)
 		print_section("Redzone", p + s->objsize,
--

From: Ingo Molnar
Date: Friday, July 18, 2008 - 2:11 am

ok, applied this as a debug special to tip/out-of-tree - future 
incidents should have the full object dump.

would makes sense for upstream too i think, or increase the limit to 4K 
or so. (which is still fair to be dumped into a syslog)

	Ingo
--

From: Pekka Enberg
Date: Friday, July 18, 2008 - 2:16 am

SLUB already limits object sizes to less than PAGE_SIZE so the patch
should be fine. Christoph, are you okay with this going upstream?
--

From: Christoph Lameter
Date: Friday, July 18, 2008 - 6:54 am

I am fine with the patch. Just be aware that we are going to be shoveling a lot of stuff into the log. The object size can be greater than 4k for non kmalloc slab caches. True in general object sizes are <4k.


--

From: Ingo Molnar
Date: Monday, July 21, 2008 - 2:41 am

update about this problem: just triggered another colorful crash, see 
below. This was with the 4K object dump patch already, maybe the dump 
gives a clue?

The upstream base of this test kernel was v2.6.26-5253-g14b395e - i.e. 
post the big networking pull, but this problem predates it. (It first 
triggered after v2.6.26)

All the crashes trigger in or close to networking code - not a single 
block IO DMA or other DMA crash happened so far, and no filesystem 
corruptions or anything like that which would signal hw trouble.

	Ingo

------------------>
initcall sctp_init+0x0/0x697 returned 0 after 9 msecs
calling  powernowk8_init+0x0/0x6e
initcall powernowk8_init+0x0/0x6e returned -19 after 0 msecs
calling  hpet_insert_resource+0x0/0x1e
initcall hpet_insert_resource+0x0/0x1e returned 0 after 0 msecs
calling  lapic_insert_resource+0x0/0x44
initcall lapic_insert_resource+0x0/0x44 returned 0 after 0 msecs
calling  init_lapic_nmi_sysfs+0x0/0x33
initcall init_lapic_nmi_sysfs+0x0/0x33 returned 0 after 0 msecs
=============================================================================
BUG skbuff_head_cache: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xf7ccc100-0xf7ccc103. First byte 0x0 instead of 0x6b
INFO: Allocated in __alloc_skb+0x30/0x10e age=1 cpu=1 pid=1
INFO: Freed in __kfree_skb+0x63/0x66 age=1 cpu=0 pid=0
INFO: Slab 0xc1c34ca0 objects=16 used=1 fp=0xf7ccc100 flags=0x400000c3
INFO: Object 0xf7ccc100 @offset=256 fp=0xf7ccc200

Bytes b4 0xf7ccc0f0:  5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
  Object 0xf7ccc100:  00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ....kkkkkkkkkkkk
  Object 0xf7ccc110:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf7ccc120:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf7ccc130:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xf7ccc140:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ...
From: Pekka Enberg
Date: Monday, July 21, 2008 - 2:52 am

SGkgSW5nbywKCk9uIE1vbiwgSnVsIDIxLCAyMDA4IGF0IDEyOjQxIFBNLCBJbmdvIE1vbG5hciA8
bWluZ29AZWx0ZS5odT4gd3JvdGU6Cj4gdXBkYXRlIGFib3V0IHRoaXMgcHJvYmxlbToganVzdCB0
cmlnZ2VyZWQgYW5vdGhlciBjb2xvcmZ1bCBjcmFzaCwgc2VlCj4gYmVsb3cuIFRoaXMgd2FzIHdp
dGggdGhlIDRLIG9iamVjdCBkdW1wIHBhdGNoIGFscmVhZHksIG1heWJlIHRoZSBkdW1wCj4gZ2l2
ZXMgYSBjbHVlPwoKLi4udG8gcG9pbnQgb3V0IHRoZSBvYnZpb3VzOgoKPiA9PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PQo+IEJVRyBza2J1ZmZfaGVhZF9jYWNoZTogUG9pc29uIG92ZXJ3cml0dGVuCj4gLS0t
LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t
LS0tLS0tLS0tLS0tLS0tLS0KPgo+IElORk86IDB4ZjdjY2MxMDAtMHhmN2NjYzEwMy4gRmlyc3Qg
Ynl0ZSAweDAgaW5zdGVhZCBvZiAweDZiCj4gSU5GTzogQWxsb2NhdGVkIGluIF9fYWxsb2Nfc2ti
KzB4MzAvMHgxMGUgYWdlPTEgY3B1PTEgcGlkPTEKPiBJTkZPOiBGcmVlZCBpbiBfX2tmcmVlX3Nr
YisweDYzLzB4NjYgYWdlPTEgY3B1PTAgcGlkPTAKPiBJTkZPOiBTbGFiIDB4YzFjMzRjYTAgb2Jq
ZWN0cz0xNiB1c2VkPTEgZnA9MHhmN2NjYzEwMCBmbGFncz0weDQwMDAwMGMzCj4gSU5GTzogT2Jq
ZWN0IDB4ZjdjY2MxMDAgQG9mZnNldD0yNTYgZnA9MHhmN2NjYzIwMAo+Cj4gQnl0ZXMgYjQgMHhm
N2NjYzBmMDogIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVh
IFpaWlpaWlpaWlpaWlpaWloKPiAgT2JqZWN0IDB4ZjdjY2MxMDA6ICAwMCAwMCAwMCAwMCA2YiA2
YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiAuLi4ua2tra2tra2tra2trCgpVc2UgYWZ0
ZXIgZnJlZSB3aGVyZSBmaXJzdCBmb3VyIGJ5dGVzIGFyZSB6ZXJvZWQuCgo+ICBPYmplY3QgMHhm
N2NjYzExMDogIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZi
IGtra2tra2tra2tra2tra2sKPiAgT2JqZWN0IDB4ZjdjY2MxMjA6ICA2YiA2YiA2YiA2YiA2YiA2
YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiBra2tra2tra2tra2tra2trCj4gIE9iamVj
dCAweGY3Y2NjMTMwOiAgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIg
NmIgNmIga2tra2tra2tra2tra2trawo+ICBPYmplY3QgMHhmN2NjYzE0MDogIDZiIDZiIDZiIDZi
IDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIGtra2tra2tra2tra2tra2sKPiAg
T2JqZWN0IDB4ZjdjY2MxNTA6ICA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2
YiA2YiA2YiA2YiBra2tra2tra2tra2tra2trCj4gIE9iamVjdCAweGY3Y2NjMTYwOiAgNmIgNmIg
NmIgNmIgNmIgNmIg ...
From: Evgeniy Polyakov
Date: Monday, July 21, 2008 - 3:06 am

Hi.


Not that obvious...
skb->next is cleared in lots of places, in xmit network helper
for example, but since rest of the packet was not modified, it
means given skb was not freed, so it will not help.

Ingo do you see other similar dumps with last byte modified?
That's the one which can help to determine the reason.

-- 
	Evgeniy Polyakov
--

From: Ingo Molnar
Date: Monday, July 21, 2008 - 3:50 am

the problem is, most of the crashes dont come with any usable dump. This 
is a laptop so netconsole is the only reliable route out - and if 
something in networking crashes chances are that it hoses netconsole 
before it can get anything out.

Another thing is that i'm activating netconsole on this box via a kernel 
boot line and from within a bzImage (to get it activated as early as 
possible) - maybe that's a tad too early for certain initialization 
sequences?

I could try run tests with netconsole deactivated, if you think that's a 
worthwile line of probing this problem. (although that would make me do 
blind tests in essence - having kernel log output is really essential.)

	Ingo
--

From: Vegard Nossum
Date: Monday, July 21, 2008 - 4:03 am

How about posting your custom patches for public review? :-D

(There was recently an odd case of kzalloc() succeeding even though it
was called before the slab caches were initialized -- and it only
generated a warning about irqs-on happening too early. I just mean to
say that it _can_ happen. On another machine, it crashed
spectacularly. I guess you're not enabling the netconsole before slab
allocator is up?)


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Ingo Molnar
Date: Monday, July 21, 2008 - 4:13 am

I changed nothing in that area - being able to enable the netconsole 
like that has been a years-old feature. Otherwise it's just tip/master, 

if i had any changes in that area i'd have declared them. Find below the 
full bootlog of the last crash - can you see any init order weirdness in 
it?

Btw., i disabled netconsole now and got 3 successful bootups in a row. 
(which is still not significant statistically - i had a series of 40 
bootups in a row before. Then again i had like 5 failures in a row 
earlier this morning, before disabling netconsole.)

Btw. #2, in the log below you can see this partial line:

 initcall pm_qos_power_init+0x0/0x54Linux version 2.6.26-tip (mingo@europe) (gcc version 4.2.2) #3261 SMP Mon Jul 21 11:17:45 CEST 2008

That was a previous bootup with another failure but which came without 
any log output.

Hey, if you find this report frustrating, dont worry, i almost never 
report easy bugs - i can fix them just fine ;-)

	Ingo

--------------->
initcall pm_qos_power_init+0x0/0x54Linux version 2.6.26-tip (mingo@europe) (gcc version 4.2.2) #3261 SMP Mon Jul 21 11:17:45 CEST 2008
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
 BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
 BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007f6e0000 (usable)
 BIOS-e820: 000000007f6e0000 - 000000007f6f3000 (ACPI data)
 BIOS-e820: 000000007f6f3000 - 000000007f700000 (ACPI NVS)
 BIOS-e820: 000000007f700000 - 0000000080000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
 BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
 BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: ...
From: Christoph Lameter
Date: Monday, July 21, 2008 - 9:19 am

Kzalloc for objects > 4k will succeed even during early boot since we 
will directly call into the page allocator.


--

From: Vegard Nossum
Date: Monday, July 21, 2008 - 1:23 pm

On Mon, Jul 21, 2008 at 6:19 PM, Christoph Lameter

It wasn't just that; it got into __slab_alloc() and out again. See the
backtrace at http://lkml.org/lkml/2008/7/19/23

It's scary, I know. It's a good thing we have irqtrace :-)


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Evgeniy Polyakov
Date: Monday, July 21, 2008 - 4:25 am

Let's try this way first. If system will continue to crash, we will add
some debug options in various pathes. Existing reports do not contain
enough information unfortunately, so we will not lose too much.

-- 
	Evgeniy Polyakov
--

From: Ingo Molnar
Date: Monday, July 21, 2008 - 4:55 am

ok. I've turned off netconsole - 8 successful bootups in a row so far. 
The box is a slow booter/builder with an 8 kernels/hour test throughput, 
so if everything goes fine we should have meaningful results in about 10 
hours.

( there are other, faster testboxes in -tip testing with 33 kernels/hour 
  build+boot throughput where we'd have to wait only 2 hours - but as 
  per Murphy's law they dont trigger this bug ;-)

	Ingo
--

From: Evgeniy Polyakov
Date: Monday, July 21, 2008 - 5:57 am

Since 2.6.25 there was only single change in netpoll.c:
f5184d267c1aedb9b7a8cc44e08ff6b8d382c3b5
Which looks innocent.

Is your driver e1000 or e1000e? Can you check different one?

-- 
	Evgeniy Polyakov
--

From: Ingo Molnar
Date: Monday, July 21, 2008 - 7:01 am

i cannot check e1000 anymore due to this upstream commit:

| d03157babed7424f5391af43200593768ce69c9a is first bad commit
| commit d03157babed7424f5391af43200593768ce69c9a
| Author: Auke Kok <auke-jan.h.kok@intel.com>
| Date:   Sun Jun 22 15:21:29 2008 -0700
|
|    e1000: remove PCI Express device IDs
|
|    We do not want to prolong the situation much longer that e1000
|    and e1000e support these devices at the same time. As a result,
|    take out the bandage that was added for the interim period
|    and remove all the PCI Express device IDs from e1000.

but yes, this box was using e1000 for a long time, and recently migrated 
to e1000e. I'm not sure there's any connection, do you think there is?

	Ingo
--

From: Ingo Molnar
Date: Monday, July 21, 2008 - 12:21 pm

70 bootups with netconsole turned off, and not a single crash so far.

So it's now a strong likelyhood that this crash is a combination of 
e1000e+netconsole.

	Ingo
--

From: Evgeniy Polyakov
Date: Monday, July 21, 2008 - 2:24 pm

e1000_clean_tx_irq() call looks particulary suspicious: it is called
without adapter->tx_queue_lock in poll controller (netconsole callback)
and with that lock in NAPI handler.

Can you check kind of this patch:

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 869544b..5f180d8 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4067,7 +4067,9 @@ static void e1000_netpoll(struct net_device *netdev)
 	disable_irq(adapter->pdev->irq);
 	e1000_intr(adapter->pdev->irq, netdev);
 
+	spin_lock(&adapter->tx_queue_lock);
 	e1000_clean_tx_irq(adapter);
+	spin_unlock(&adapter->tx_queue_lock);
 
 	enable_irq(adapter->pdev->irq);
 }


-- 
	Evgeniy Polyakov
--

From: David Miller
Date: Monday, July 21, 2008 - 4:33 pm

From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>

The call even seems pointless, since the caller will call
->poll() (which is e1000_clean) as the very next action,
and that will invoke e1000_clean_tx_irq() properly.

I would just delete this call from e1000_netpoll() entirely.
--

From: Ingo Molnar
Date: Tuesday, July 22, 2008 - 12:50 am

ok, i've added the patch below to tip/out-of-tree.

Overnight test had about 100 successful bootups on this testbox. (until 
it stopped on a drivers/net/hp.c build error - which is unrelated to 
this problem)

So testing with netconsole disabled is conclusive enough to implicate 
netconsole strongly. I've now re-enabled netconsole on the testbox and 
will continue the test with the fix below. Previously it would crash 
within 10-40 iterations.

	Ingo

----------------->
commit bf89280dea6d97671aa5f75f2591ae7e8e3e6699
Author: Ingo Molnar <mingo@elte.hu>
Date:   Tue Jul 22 09:44:32 2008 +0200

    e1000e: fix e1000_netpoll(), remove extraneous e1000_clean_tx_irq() call
    
    Evgeniy Polyakov noticed that drivers/net/e1000e/netdev.c:e1000_netpoll()
    was calling e1000_clean_tx_irq() without taking the TX lock.
    
    David Miller suggested to remove the call altogether: since in this
    callpah there's periodic calls to ->poll() anyway which will do
    e1000_clean_tx_irq() and will garbage-collect any finished TX ring
    descriptors.
    
    This might solve the e1000e+netconsole crashes i've been seeing:
    
    =============================================================================
    BUG skbuff_head_cache: Poison overwritten
    -----------------------------------------------------------------------------
    
    INFO: 0xf658ae9c-0xf658ae9c. First byte 0x6a instead of 0x6b
    INFO: Allocated in __alloc_skb+0x2c/0x110 age=0 cpu=0 pid=5098
    INFO: Freed in __kfree_skb+0x31/0x80 age=0 cpu=1 pid=4440
    INFO: Slab 0xc16cc140 objects=16 used=1 fp=0xf658ae00 flags=0x400000c3
    INFO: Object 0xf658ae00 @offset=3584 fp=0xf658af00
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 drivers/net/e1000e/netdev.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 869544b..9c0f56b 100644
--- a/drivers/net/e1000e/netdev.c
+++ ...
From: Ingo Molnar
Date: Tuesday, July 22, 2008 - 6:34 am

it's now past 50 successful iterations with e1000e+netconsole enabled 
again with the fix applied, and not a single crash, hang, memory 
corruption or other weirdness happened. Jeff, please apply the fix 
below.

Thanks,

	Ingo

---------------->
commit 1fd48807674d4f6defe82258c46619ce2c116943
Author: Ingo Molnar <mingo@elte.hu>
Date:   Tue Jul 22 09:44:32 2008 +0200

    e1000e: fix e1000_netpoll(), remove extraneous e1000_clean_tx_irq() call
    
    Evgeniy Polyakov noticed that drivers/net/e1000e/netdev.c:e1000_netpoll()
    was calling e1000_clean_tx_irq() without taking the TX lock.
    
    David Miller suggested to remove the call altogether: since in this
    callpah there's periodic calls to ->poll() anyway which will do
    e1000_clean_tx_irq() and will garbage-collect any finished TX ring
    descriptors.
    
    This fix solved the e1000e+netconsole crashes i've been seeing:
    
    =============================================================================
    BUG skbuff_head_cache: Poison overwritten
    -----------------------------------------------------------------------------
    
    INFO: 0xf658ae9c-0xf658ae9c. First byte 0x6a instead of 0x6b
    INFO: Allocated in __alloc_skb+0x2c/0x110 age=0 cpu=0 pid=5098
    INFO: Freed in __kfree_skb+0x31/0x80 age=0 cpu=1 pid=4440
    INFO: Slab 0xc16cc140 objects=16 used=1 fp=0xf658ae00 flags=0x400000c3
    INFO: Object 0xf658ae00 @offset=3584 fp=0xf658af00
    
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 drivers/net/e1000e/netdev.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 869544b..9c0f56b 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4067,8 +4067,6 @@ static void e1000_netpoll(struct net_device *netdev)
 	disable_irq(adapter->pdev->irq);
 	e1000_intr(adapter->pdev->irq, netdev);
 
-	e1000_clean_tx_irq(adapter);
-
 	enable_irq(adapter->pdev->irq);
 }
 ...
From: David Miller
Date: Wednesday, July 23, 2008 - 3:31 pm

From: Ingo Molnar <mingo@elte.hu>

I'll take care of integrating this, thanks Ingo.
--

From: Jeff Kirsher
Date: Wednesday, July 23, 2008 - 3:40 pm

Thanks Dave.  I have the patch in my queue of patches to push, just had
not pushed it to you and Jeff yet.

-- 
Cheers,
Jeff

--

From: Christoph Lameter
Date: Monday, July 21, 2008 - 9:22 am

The skb was definitely freed. 6b is written over an object when it is 
freed. Something else retained a pointer to the skb and was confident 
that the skb still exists.


--

From: Evgeniy Polyakov
Date: Monday, July 21, 2008 - 12:57 pm

Hi Christoph.


I meant it was not tried to be freed second time, since skb->users area
(the very end of the skb) was not changed from 6b to 6a, but its
skb->next pointer (first field in the skb) was set to NULL, so after
that skb was not used at all.

It could be an interesting kmemcheck extension to catch not only
non-initialized memory, but also writes into just freed one, but not yet
returned by allocator to the next user.

-- 
	Evgeniy Polyakov
--

From: Ingo Molnar
Date: Monday, July 21, 2008 - 1:05 pm

hm, i think kmemcheck does that mostly already. If not then I think 
kmemcheck should just turn off all buffering that might happen on the 
SLUB level (if it doesnt do it already). I.e. a kfree()d buffer should 
be marked as uninitialized immediately.

	Ingo
--

From: Vegard Nossum
Date: Monday, July 21, 2008 - 1:22 pm

We don't catch writes (yet, because kmemcheck gets confused by SLUB's
own writes so far), but yes, we mark objects "freed" when they are
freed (until the slab page is returned to the page allocator) and we
may detect reads from this area before the object is reallocated. In
that case, a "read from freed memory" message will appear.

Unfortunately, it doesn't work quite as well as I'd like it to yet. We
did catch one real bug in ACPI in this way (read-after-free). But as I
have discovered, not all machines support single-stepping over REP
STOS instructions properly. Which means that kmemcheck breaks silently
on some machines. There's also the recent DMA patch, which might
improve things. Linus is right, it's really fragile. But we are
working on it. It is going in the right direction anyway.

(Pekka and I were also working on an extension to SLUB which delays
the freeing of objects, i.e. so that objects will remain in the
"freed" state for a certain amount of time before it can be allocated
again. But we don't really have a _good_ patch for this yet, it was
more like an ugly hack at the time.)


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Christoph Lameter
Date: Friday, July 18, 2008 - 6:55 am

Typically this means that a refcount at the head of the object was decremented. But the object was already freed.
--

Previous thread: Linux Power Management Mini-Summit -- Ottawa -- July 22, 2008 by Len Brown on Thursday, July 17, 2008 - 2:27 pm. (5 messages)

Next thread: [PATCH] mtd: diskonchip.c fix sparse endian warnings by Harvey Harrison on Thursday, July 17, 2008 - 2:57 pm. (1 message)