A regression to v2.6.26: I started getting this skb-head corruption message today, on a T60 laptop with e1000: PM: Removing info for No Bus:vcs11 device: 'vcs11': device_create_release ============================================================================= BUG skbuff_head_cache: Poison overwritten ----------------------------------------------------------------------------- INFO: 0xf658ae9c-0xf658ae9c. First byte 0x6a instead of 0x6b INFO: Allocated in __alloc_skb+0x2c/0x110 age=0 cpu=0 pid=5098 INFO: Freed in __kfree_skb+0x31/0x80 age=0 cpu=1 pid=4440 INFO: Slab 0xc16cc140 objects=16 used=1 fp=0xf658ae00 flags=0x400000c3 INFO: Object 0xf658ae00 @offset=3584 fp=0xf658af00 Bytes b4 0xf658adf0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xf658ae00: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf658ae10: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf658ae20: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf658ae30: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf658ae40: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf658ae50: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf658ae60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf658ae70: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Redzone 0xf658aea0: bb bb bb bb ���� Padding 0xf658aec8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Padding 0xf658aed8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Padding 0xf658aee8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Padding 0xf658aef8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ Pid: 5098, comm: gdm-binary Not tainted 2.6.26-tip #3094 [<c0186f99>] print_trailer+0xa9/0xf0 ...
From: Ingo Molnar <mingo@elte.hu> This is very unlikely to be added by us networking folks, no networking merges have happened for the 2.6.27 merge window yet :-) --
and Cc:-ed SLUB folks. Could be a sleeper cell of bugs gone active ;-) Or could be SLUB (-debugging) breakage. Netconsole is pretty reliable on this box. (and the bootup continued just fine after this report) Just re-tried it, the bug is reliably repeatable. Will try a bisection run. Ingo --
From: Ingo Molnar <mingo@elte.hu> This bug would be a quite positive result then :) --
hm, but it was not reproducible on the third and fourth attempt :-( I tried hard to provoke it by generating artificial parallel network and netconsole output - but it didnt want to trigger. Heisenbug ... Maybe the debug output gives someone an idea about the nature of the bug? Ingo --
1. Notice the range. It's just a single byte. 2. Notice the value. It's just a ++. Probably a stray increment of a uint8_t somewhere on a freed object? The offset from the beginning of the object is 0xf658ae9c - 0xf658ae00 = 0x9c. How big is a struct sk_buff? Hm.. it is in fact quite big. Now what member has offset 0x9c? Seems to depend on your config. Is there any way you can figure it out, Ingo? I'll try it with your config too. Vegard -- "The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation." -- E. W. Dijkstra, EWD1036 --
With your config: (gdb) p ((struct sk_buff *) 0)->truesize Cannot access memory at address 0x9c Now just audit users of ->truesize... There are quite a few. Which one would only += 1? Vegard PS: I might be on the completely wrong track. So far I only have bad experiences with this sk_buff... -- "The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation." -- E. W. Dijkstra, EWD1036 --
hmm ... your analysis gave me a wonderful albeit admittedly remote idea:
If only we had some kernel technology that could track and validate
memory accesses, and point out the cases where we access uninitialized
memory, just like Valgrind?
... something like kmemcheck? ;-)
So i booted that box with tip/master and kmemcheck enabled. (plus a few
fixlets to make networking allocations be properly tracked by
kmemcheck.)
It was a slow bootup and long wait, but it gave a few hits here:
kmemcheck: Caught 8-bit read from uninitialized memory (f653ad24)
iiiiiiiiiiiiiiiiuuuuuuuuuuuuuuuuuuuuuiuuuuuuuuuuuuuuuuuuuuuuuuuu
^
Pid: 2484, comm: arping Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c05e973c>] EFLAGS: 00010282 CPU: 0
EIP is at __copy_skb_header+0x7c/0x100
EAX: 00000000 EBX: f653acc0 ECX: f653ac00 EDX: f653ac00
ESI: f653ac50 EDI: f653ad10 EBP: c09b9e84 ESP: c09ddaa8
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: f71c2700 CR3: 36513000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c05e97e7>] __skb_clone+0x27/0xe0
[<c05eb101>] skb_clone+0x41/0x60
[<c065cbf1>] packet_rcv+0xc1/0x290
[<c05f07ad>] netif_receive_skb+0x20d/0x400
[<c03b2aa7>] e1000_receive_skb+0x47/0x180
[<c03b3983>] e1000_clean_rx_irq+0x223/0x2e0
[<c03b225b>] e1000_clean+0x5b/0x200
[<c05f29db>] net_rx_action+0xfb/0x160
[<c0129092>] __do_softirq+0x82/0xf0
[<c0105b8a>] call_on_stack+0x1a/0x30
false positive? Find below the quick hacks i did to pre-initialize skb
allocations that have RX DMA into them.
another one is:
kmemcheck: Caught 8-bit read from uninitialized memory (f653a902)
iiuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
^
Pid: 2575, comm: hcid Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c02b9926>] EFLAGS: 00010293 CPU: 0
EIP is at __copy_to_user_ll+0x46/0x70
EAX: 00000004 EBX: b7f3c478 ECX: 00000002 EDX: f653a900
ESI: f653a902 EDI: b7f3c47a ...two other ones:
kmemcheck: Caught 32-bit read from uninitialized memory (f459ac00)
uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
^
Pid: 1614, comm: kjournald Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c05ea904>] EFLAGS: 00010216 CPU: 0
EIP is at skb_copy_bits+0x54/0x220
EAX: 00000110 EBX: 00000110 ECX: 00000044 EDX: f6c256c0
ESI: f459ac00 EDI: f459a000 EBP: c09b9f60 ESP: c09de048
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: f71c2700 CR3: 2e849000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c05ebe28>] skb_copy+0x78/0x90
[<c05f9426>] neigh_timer_handler+0x156/0x2c0
[<c012ba62>] run_timer_softirq+0x142/0x180
[<c0129092>] __do_softirq+0x82/0xf0
[<c0105b8a>] call_on_stack+0x1a/0x30
[<ffffffff>] 0xffffffff
plus:
kmemcheck: Caught 32-bit read from uninitialized memory (f654b590)
iiiiiiiiiiiiiiiiuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
^
Pid: 3130, comm: git-read-tree Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c05eb396>] EFLAGS: 00010286 CPU: 0
EIP is at pskb_expand_head+0x86/0x150
EAX: 00000140 EBX: f6763124 ECX: 00000027 EDX: f6549400
ESI: f654b590 EDI: f6549650 EBP: c09b9a80 ESP: c09dfa28
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: f71c2700 CR3: 2e849000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c05eb67d>] __pskb_pull_tail+0x21d/0x300
[<c03b45dd>] e1000_xmit_frame+0x1fd/0xa60
[<c05f0fd4>] dev_hard_start_xmit+0x254/0x2e0
[<c0609721>] __qdisc_run+0x61/0x1b0
[<c05f383e>] dev_queue_xmit+0x25e/0x340
[<c061f315>] ip_finish_output+0x105/0x260
[<c061f739>] ip_output+0x49/0x50
[<c061debd>] ip_local_out+0x1d/0x30
[<c061fa85>] ip_queue_xmit+0x1a5/0x340
[<c062ffd3>] tcp_transmit_skb+0x443/0x800
[<c06316aa>] __tcp_push_pending_frames+0xfa/0x730
[<c062ee58>] tcp_rcv_established+0x3b8/0x6f0
[<c0634f50>] tcp_v4_do_rcv+0x2d0/0x510
[<c063720e>] ...Hm, if you think it was that slow, I am suspecting you were also using
SLUB debugging.
This can actually be negative, since now SLUB will access the objects
(+redzone +padding) and possibly trick kmemcheck into thinking they
were initialized in the first place.
But what we are really looking for is "read from freed memory"
messages. So I would actually recommend this: Disable kmemcheck's
reporting of uninitialized memory, simply to make it easier to spot
the "freed" messages more easily.
Maybe something like this (warning: whitespace-munged):
diff --git a/arch/x86/mm/kmemcheck/error.c b/arch/x86/mm/kmemcheck/error.c
index 56410c6..6944cb7 100644
--- a/arch/x86/mm/kmemcheck/error.c
+++ b/arch/x86/mm/kmemcheck/error.c
@@ -98,6 +98,9 @@ void kmemcheck_error_save(enum kmemcheck_shadow state,
return;
prev_ip = regs->ip;
+ if (state == KMEMCHECK_SHADOW_UNINITIALIZED)
+ return;
+
e = error_next_wr();
if (!e)
return;
If this only happens during boot, it would also be a good idea to
simply reboot the machine a lot...
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
nope: # CONFIG_SLUB_DEBUG is not set yeah, i've got a script for that. Will try it overnight. Ingo --
From: Ingo Molnar <mingo@elte.hu> Maybe. Every SKB object allocated is fully initialized in __alloc_skb(): /* * Only clear those fields we need to clear, not those that we will * actually initialise below. Hence, don't put any more fields after * the tail pointer in struct sk_buff! */ memset(skb, 0, offsetof(struct sk_buff, tail)); That leaves the following trailing members of struct sk_buff: /* These elements must be at the end, see alloc_skb() for details. */ sk_buff_data_t tail; sk_buff_data_t end; unsigned char *head, *data; unsigned int truesize; atomic_t users; which are the explicitly initialized right after the quotes memset(). skb->truesize = size + sizeof(struct sk_buff); atomic_set(&skb->users, 1); skb->head = data; skb->data = data; skb_reset_tail_pointer(skb); skb->end = skb->tail + size; When we clone, there are probably some fields we don't copy over explicitly. And we usually do that because they don't matter or if they do the caller will take care of it. --
From: "Vegard Nossum" <vegard.nossum@gmail.com> It's supposed to be 0x6b, this would be a "--" Also it (more likely IMHO) could be clearing a flag with the value 0x01. --
It could be. But like I said in a later e-mail, the thing is likely sk_buff->truesize. Which is not a flags variable. It _is_ however, a counter, which is frequently -= and atomic_sub()ed. That field is also an int, not a byte like I suggested above. This is fine, though. "--" on an int can of course legitimately update/change just the lower byte of an int. But.. it could also be some random corruption coming from elsewhere. Maybe even bad RAM (it's just a single bit anyway). But that's less likely. Vegard -- "The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation." -- E. W. Dijkstra, EWD1036 --
From: "Vegard Nossum" <vegard.nossum@gmail.com> skb->truesize is ever incremented or decremented by only one. Usually it is changed by the entire packet size, or at least one MSS's worth. On packet free, it will be decremented by at least sizeof(struct sk_buff) --
ok, i looked at the logs once more and while i thought that it occured twice it only occured once: Jul 17 20:22:14 europe kernel: BUG skbuff_head_cache: Poison overwritten ... this would explain why my attempts to bisect and reproduce it failed. I got too excited about it being seemingly reproducible and possibly bisectable (memory corruption bugs rarely are). My overnight reboot-the-same-kernel tests didnt show anything either. It's a known-reliable system with thousands of bootups: Jul 17 20:20:54 europe kernel: Linux version 2.6.26-tip (mingo@europe) (gcc version 4.2.2) #3094 SMP Thu Jul 17 20:19:27 CEST 2008 ... but a hw fluke is never out of question. (It wasnt a particularly hot day but the evening was unusually humid, maybe that made the difference.) So lets close this for now, it's not a reproducible regression that we can act upon. I'll update this thread if anything new happens. Netconsole is reliable on this system in any case. Ingo --
Hm, just got another skb related weirdness and crash on that testbox: [ 13.495000] usb 1-0:1.0: uevent [ 19.392000] Attempt to release alive inet socket f6fac040 [ 20.387000] EXT3 FS on sda1, internal journal [ 20.762500] eth0: no IPv6 routers present [ 22.431000] warning: process `kudzu' used the deprecated sysctl system call with 1.23. [ 22.906000] ------------[ cut here ]------------ [ 22.906000] kernel BUG at mm/slab.c:2927! [ 22.906000] invalid opcode: 0000 [#1] PREEMPT SMP [ 22.906000] Modules linked in: [ 22.906000] [ 22.906000] Pid: 1831, comm: ip6tables-resto Tainted: G W (2.6.26-tip-03923-g2c96bcb-dirty #3134) [ 22.906000] EIP: 0060:[<b01878d2>] EFLAGS: 00010046 CPU: 0 [ 22.906000] EIP is at cache_alloc_refill+0xfa/0x431 [ 22.906000] EAX: 00000004 EBX: f6ec1a00 ECX: f67826c0 EDX: f6fa3c40 [ 22.906000] ESI: f6fac000 EDI: 0000000b EBP: f5c01ec4 ESP: f5c01e70 [ 22.906000] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 22.906000] Process ip6tables-resto (pid: 1831, ti=f5c01000 task=f5d2f060 task.ti=f5c01000) [ 22.906000] Stack: 00000004 00000004 000000d0 000000d0 f67826c8 f67826d0 f6fac01c 000080d0 [ 22.906000] f6fa3c40 f6fa9800 f67826c0 00000000 7f431025 b014a3e9 b2bd6ab8 00000000 [ 22.906000] 00000000 00000000 00000246 f6fa3c40 f6fa3c40 f5c01ee4 b0187793 b07c11d6 [ 22.906000] Call Trace: [ 22.906000] [<b014a3e9>] ? __wake_up_bit+0xc/0x37 [ 22.906000] [<b0187793>] ? kmem_cache_alloc+0x53/0x98 [ 22.906000] [<b05589b5>] ? sk_prot_alloc+0x1e/0xa7 [ 22.906000] [<b026cb86>] ? cap_capable+0xa/0x2b [ 22.906000] [<b0559fa4>] ? sk_alloc+0x20/0xa3 [ 22.906000] [<b05c4364>] ? inet6_create+0x187/0x316 [ 22.906000] [<b0556e87>] ? __sock_create+0x15b/0x229 [ 22.906000] [<b0556f9c>] ? sock_create+0x29/0x2e [ 22.906000] [<b0557112>] ? sys_socket+0x1e/0x42 [ 22.906000] [<b05577b6>] ? sys_socketcall+0x56/0x163 [ 22.906000] [<b011a418>] ? resume_userspace+0x8/0x30 [ 22.906000] ...
Totally weird bug. Using the same bzImage i was able to crash the box 2 times in a row. Then i tried another (but similarly configured) kernel and it wouldnt crash. Then i tried the original bzImage again (bit for bit equivalent) and it wouldnt crash 5 times in a row. I've attached the second crashlog below - in case it's interesting to anyone. This cannot be random corruption or hw flakiness as the two crashes in a row are so similar. One (remote) possibility would be for some hw component to be in a weird state that doesnt get proper re-initialized after reboot. As long as that condition lingered, it would crash. Once that condition was gone, it wouldnt crash. For now i've given up on this variant of the bug. I'll continue with randconfig searching - maybe another config gives us something more debuggable ... And note that this time it was SLAB not SLUB - so any SLUB impact is highly unlikely. (i strongly suspect some hw initialization issue.) Ingo --------------------> [ 0.000000] Linux version 2.6.26-tip-03923-g2c96bcb-dirty (mingo@europe) (gcc version 4.2.2) #3134 SMP PREEMPT Fri Jul 18 20:06:13 CEST 2008 [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f000 (usable) [ 0.000000] BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved) [ 0.000000] BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved) [ 0.000000] BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved) [ 0.000000] BIOS-e820: 0000000000100000 - 000000007f6e0000 (usable) [ 0.000000] BIOS-e820: 000000007f6e0000 - 000000007f6f3000 (ACPI data) [ 0.000000] BIOS-e820: 000000007f6f3000 - 000000007f700000 (ACPI NVS) [ 0.000000] BIOS-e820: 000000007f700000 - 0000000080000000 (reserved) [ 0.000000] BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved) [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) [ 0.000000] BIOS-e820: 00000000fed00000 - 00000000fed00400 ...
It doesn't actually work. The config says
# head: 088fcf34
and I checked out this from the tip tree. But kernel-config still
complains about unknown config options... What went wrong?
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
that's ok - i've got some local qa helpers that have config options. Things like making SMP bootups more likely in randconfig, adding various boot parameters to the bootup via .config methods (so that boot parameters can be randomized via make randconfig), etc. these: CONFIG_BOOTPARAM_SUPPORT=y CONFIG_BOOTPARAM_NO_HZ_OFF=y CONFIG_BOOTPARAM_NMI_WATCHDOG_BIT_0=y CONFIG_BOOTPARAM_LAPIC=y CONFIG_BOOTPARAM_IDLE_MWAIT=y CONFIG_BOOTPARAM_NOPAT=y CONFIG_BOOTPARAM_NOTSC=y are equivalent to adding this to the boot line: nohz=off nmi_watchdog=1 lapic idle=mwait nopat notsc although i dont think they are normally material to netconsole workings. Ingo --
Hm, your config doesn't contain SLUB_DEBUG either. Isn't that a requirement for seeing the BUG you're getting? $ grep SLUB config-Thu_Jul_17_20_24_45_CEST_2008.bad # CONFIG_SLUB_DEBUG is not set CONFIG_SLUB=y Did you post the wrong config? Vegard -- "The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation." -- E. W. Dijkstra, EWD1036 --
Hi Ingo. Out of curiosity, why does it scream at allocation time? Does SLUB have a debug check at freeing time? If so, how does it work and why didn't it caught use after free there? -- Evgeniy Polyakov --
Hi Evgeniy,
Because it's checking for use-after-free errors. The object is
poisoned with POISON_FREE when it's free'd and we verify the poison
values at allocation time.
You can't detect use after free before the object is actually free'd ;-)
Pekka
--
yeah, we want to check use-after free at the next allocation point - i.e. as late as possible to gather all corruptions that happened meanwhile. We could in theory have a SLUB debug mode where a SCHED_IDLE kernel thread would periodically check all free objects (of that CPU) in the background to ensure their integrity. That would catch corruptions sooner, with a possibly still meaningful context to print out. [right after the IRQ or process that corrupts them finishes running] It could also be hooked into ftrace to print out the last few hundred kernel function calls executed prior any corruption. ftrace/slub-debug plugin perhaps? Ingo --
Hi Ingo, Well, there's this Norwegian guy, Vegard, who has written a small piece of code that can detect use-after-free _as it happens_. I think he calls the thing kmemcheck :-). --
Hi Pekka.
Does it also scream on double free event? Just to closer guilty
circles... 0x9c offset is somewhere at the very end of the skbuff
structure, likely skb->users.
Can you also check in some kind of this patch to catch freed skb freeing
for testing?
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3666216..dda96bf 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -419,6 +419,14 @@ void kfree_skb(struct sk_buff *skb)
{
if (unlikely(!skb))
return;
+
+ {
+ u8 *ptr = (u8 *)(&skb->users);
+
+ if (*ptr == POISON_FREE || *ptr == POISON_INUSE || *ptr == POISON_END)
+ BUG();
+ }
+
if (likely(atomic_read(&skb->users) == 1))
smp_rmb();
else if (likely(!atomic_dec_and_test(&skb->users)))
--
Evgeniy Polyakov
--
Hi Evgeniy,
On Fri, Jul 18, 2008 at 12:02:26PM +0300, Pekka Enberg
Yeah. See the free_debug_processing() function in mm/slub.c for
details (the on_freelist() part). However, if you look at slab_free()
you can see that in the SLUB fast-path we don't do any of these
debugging checks. So you can end up with slab corruption without a
nice error message.
Pekka
--
The slub fastpath is not used when debugging is enabled. Without debugging on double frees will typically corrupt the freepointer. So you get an invalid pointer reference in __slab_free. --
Hi Pekka. Yup, I see. Actually with the network there will be no duuble free as long as uses use kfree_skb(), since first freeing (with skb->users being equal to 1, so user owns this skb) will fill skb with 0x6b, so any subsequent kfree_skb() will try to decrement particulary huge reference counter (0x6b6b6b6b) and actually will never reach 1 again, so things will look quite clear, but if the same slub area will be reused for the next skb before it is freed second time, very hard-to-debug data corruption will happen, so I belive this kind of checks should be done into kfree_skb() (of course under special debug compile option). I posted that patch just for the reference and that Ingo would be able to detect where skb is used after already being freed. -- Evgeniy Polyakov --
Hi Ingo, [snip] Nope. It's bit unfortunate that we don't see full dump of the corruption here because SLUB limits the output to 128 bytes. Ingo, you might want to try this patch so that we can see all of it: diff --git a/mm/slub.c b/mm/slub.c index 5f6e2c4..f69d181 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -492,7 +492,7 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p) if (p > addr + 16) print_section("Bytes b4", p - 16, 16); - print_section("Object", p, min(s->objsize, 128)); + print_section("Object", p, s->objsize); if (s->flags & SLAB_RED_ZONE) print_section("Redzone", p + s->objsize, --
ok, applied this as a debug special to tip/out-of-tree - future incidents should have the full object dump. would makes sense for upstream too i think, or increase the limit to 4K or so. (which is still fair to be dumped into a syslog) Ingo --
SLUB already limits object sizes to less than PAGE_SIZE so the patch should be fine. Christoph, are you okay with this going upstream? --
I am fine with the patch. Just be aware that we are going to be shoveling a lot of stuff into the log. The object size can be greater than 4k for non kmalloc slab caches. True in general object sizes are <4k. --
update about this problem: just triggered another colorful crash, see below. This was with the 4K object dump patch already, maybe the dump gives a clue? The upstream base of this test kernel was v2.6.26-5253-g14b395e - i.e. post the big networking pull, but this problem predates it. (It first triggered after v2.6.26) All the crashes trigger in or close to networking code - not a single block IO DMA or other DMA crash happened so far, and no filesystem corruptions or anything like that which would signal hw trouble. Ingo ------------------> initcall sctp_init+0x0/0x697 returned 0 after 9 msecs calling powernowk8_init+0x0/0x6e initcall powernowk8_init+0x0/0x6e returned -19 after 0 msecs calling hpet_insert_resource+0x0/0x1e initcall hpet_insert_resource+0x0/0x1e returned 0 after 0 msecs calling lapic_insert_resource+0x0/0x44 initcall lapic_insert_resource+0x0/0x44 returned 0 after 0 msecs calling init_lapic_nmi_sysfs+0x0/0x33 initcall init_lapic_nmi_sysfs+0x0/0x33 returned 0 after 0 msecs ============================================================================= BUG skbuff_head_cache: Poison overwritten ----------------------------------------------------------------------------- INFO: 0xf7ccc100-0xf7ccc103. First byte 0x0 instead of 0x6b INFO: Allocated in __alloc_skb+0x30/0x10e age=1 cpu=1 pid=1 INFO: Freed in __kfree_skb+0x63/0x66 age=1 cpu=0 pid=0 INFO: Slab 0xc1c34ca0 objects=16 used=1 fp=0xf7ccc100 flags=0x400000c3 INFO: Object 0xf7ccc100 @offset=256 fp=0xf7ccc200 Bytes b4 0xf7ccc0f0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ Object 0xf7ccc100: 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ....kkkkkkkkkkkk Object 0xf7ccc110: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf7ccc120: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf7ccc130: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object 0xf7ccc140: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ...
SGkgSW5nbywKCk9uIE1vbiwgSnVsIDIxLCAyMDA4IGF0IDEyOjQxIFBNLCBJbmdvIE1vbG5hciA8 bWluZ29AZWx0ZS5odT4gd3JvdGU6Cj4gdXBkYXRlIGFib3V0IHRoaXMgcHJvYmxlbToganVzdCB0 cmlnZ2VyZWQgYW5vdGhlciBjb2xvcmZ1bCBjcmFzaCwgc2VlCj4gYmVsb3cuIFRoaXMgd2FzIHdp dGggdGhlIDRLIG9iamVjdCBkdW1wIHBhdGNoIGFscmVhZHksIG1heWJlIHRoZSBkdW1wCj4gZ2l2 ZXMgYSBjbHVlPwoKLi4udG8gcG9pbnQgb3V0IHRoZSBvYnZpb3VzOgoKPiA9PT09PT09PT09PT09 PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09 PT09PT09PQo+IEJVRyBza2J1ZmZfaGVhZF9jYWNoZTogUG9pc29uIG92ZXJ3cml0dGVuCj4gLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0KPgo+IElORk86IDB4ZjdjY2MxMDAtMHhmN2NjYzEwMy4gRmlyc3Qg Ynl0ZSAweDAgaW5zdGVhZCBvZiAweDZiCj4gSU5GTzogQWxsb2NhdGVkIGluIF9fYWxsb2Nfc2ti KzB4MzAvMHgxMGUgYWdlPTEgY3B1PTEgcGlkPTEKPiBJTkZPOiBGcmVlZCBpbiBfX2tmcmVlX3Nr YisweDYzLzB4NjYgYWdlPTEgY3B1PTAgcGlkPTAKPiBJTkZPOiBTbGFiIDB4YzFjMzRjYTAgb2Jq ZWN0cz0xNiB1c2VkPTEgZnA9MHhmN2NjYzEwMCBmbGFncz0weDQwMDAwMGMzCj4gSU5GTzogT2Jq ZWN0IDB4ZjdjY2MxMDAgQG9mZnNldD0yNTYgZnA9MHhmN2NjYzIwMAo+Cj4gQnl0ZXMgYjQgMHhm N2NjYzBmMDogIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVh IFpaWlpaWlpaWlpaWlpaWloKPiAgT2JqZWN0IDB4ZjdjY2MxMDA6ICAwMCAwMCAwMCAwMCA2YiA2 YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiAuLi4ua2tra2tra2tra2trCgpVc2UgYWZ0 ZXIgZnJlZSB3aGVyZSBmaXJzdCBmb3VyIGJ5dGVzIGFyZSB6ZXJvZWQuCgo+ICBPYmplY3QgMHhm N2NjYzExMDogIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZi IGtra2tra2tra2tra2tra2sKPiAgT2JqZWN0IDB4ZjdjY2MxMjA6ICA2YiA2YiA2YiA2YiA2YiA2 YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiBra2tra2tra2tra2tra2trCj4gIE9iamVj dCAweGY3Y2NjMTMwOiAgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIg NmIgNmIga2tra2tra2tra2tra2trawo+ICBPYmplY3QgMHhmN2NjYzE0MDogIDZiIDZiIDZiIDZi IDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIGtra2tra2tra2tra2tra2sKPiAg T2JqZWN0IDB4ZjdjY2MxNTA6ICA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2 YiA2YiA2YiA2YiBra2tra2tra2tra2tra2trCj4gIE9iamVjdCAweGY3Y2NjMTYwOiAgNmIgNmIg NmIgNmIgNmIgNmIg ...
Hi. Not that obvious... skb->next is cleared in lots of places, in xmit network helper for example, but since rest of the packet was not modified, it means given skb was not freed, so it will not help. Ingo do you see other similar dumps with last byte modified? That's the one which can help to determine the reason. -- Evgeniy Polyakov --
the problem is, most of the crashes dont come with any usable dump. This is a laptop so netconsole is the only reliable route out - and if something in networking crashes chances are that it hoses netconsole before it can get anything out. Another thing is that i'm activating netconsole on this box via a kernel boot line and from within a bzImage (to get it activated as early as possible) - maybe that's a tad too early for certain initialization sequences? I could try run tests with netconsole deactivated, if you think that's a worthwile line of probing this problem. (although that would make me do blind tests in essence - having kernel log output is really essential.) Ingo --
How about posting your custom patches for public review? :-D (There was recently an odd case of kzalloc() succeeding even though it was called before the slab caches were initialized -- and it only generated a warning about irqs-on happening too early. I just mean to say that it _can_ happen. On another machine, it crashed spectacularly. I guess you're not enabling the netconsole before slab allocator is up?) Vegard -- "The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation." -- E. W. Dijkstra, EWD1036 --
I changed nothing in that area - being able to enable the netconsole like that has been a years-old feature. Otherwise it's just tip/master, if i had any changes in that area i'd have declared them. Find below the full bootlog of the last crash - can you see any init order weirdness in it? Btw., i disabled netconsole now and got 3 successful bootups in a row. (which is still not significant statistically - i had a series of 40 bootups in a row before. Then again i had like 5 failures in a row earlier this morning, before disabling netconsole.) Btw. #2, in the log below you can see this partial line: initcall pm_qos_power_init+0x0/0x54Linux version 2.6.26-tip (mingo@europe) (gcc version 4.2.2) #3261 SMP Mon Jul 21 11:17:45 CEST 2008 That was a previous bootup with another failure but which came without any log output. Hey, if you find this report frustrating, dont worry, i almost never report easy bugs - i can fix them just fine ;-) Ingo ---------------> initcall pm_qos_power_init+0x0/0x54Linux version 2.6.26-tip (mingo@europe) (gcc version 4.2.2) #3261 SMP Mon Jul 21 11:17:45 CEST 2008 BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009f000 (usable) BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved) BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved) BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 000000007f6e0000 (usable) BIOS-e820: 000000007f6e0000 - 000000007f6f3000 (ACPI data) BIOS-e820: 000000007f6f3000 - 000000007f700000 (ACPI NVS) BIOS-e820: 000000007f700000 - 0000000080000000 (reserved) BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved) BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved) BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved) BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: ...
Kzalloc for objects > 4k will succeed even during early boot since we will directly call into the page allocator. --
On Mon, Jul 21, 2008 at 6:19 PM, Christoph Lameter It wasn't just that; it got into __slab_alloc() and out again. See the backtrace at http://lkml.org/lkml/2008/7/19/23 It's scary, I know. It's a good thing we have irqtrace :-) Vegard -- "The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation." -- E. W. Dijkstra, EWD1036 --
Let's try this way first. If system will continue to crash, we will add some debug options in various pathes. Existing reports do not contain enough information unfortunately, so we will not lose too much. -- Evgeniy Polyakov --
ok. I've turned off netconsole - 8 successful bootups in a row so far. The box is a slow booter/builder with an 8 kernels/hour test throughput, so if everything goes fine we should have meaningful results in about 10 hours. ( there are other, faster testboxes in -tip testing with 33 kernels/hour build+boot throughput where we'd have to wait only 2 hours - but as per Murphy's law they dont trigger this bug ;-) Ingo --
Since 2.6.25 there was only single change in netpoll.c: f5184d267c1aedb9b7a8cc44e08ff6b8d382c3b5 Which looks innocent. Is your driver e1000 or e1000e? Can you check different one? -- Evgeniy Polyakov --
i cannot check e1000 anymore due to this upstream commit: | d03157babed7424f5391af43200593768ce69c9a is first bad commit | commit d03157babed7424f5391af43200593768ce69c9a | Author: Auke Kok <auke-jan.h.kok@intel.com> | Date: Sun Jun 22 15:21:29 2008 -0700 | | e1000: remove PCI Express device IDs | | We do not want to prolong the situation much longer that e1000 | and e1000e support these devices at the same time. As a result, | take out the bandage that was added for the interim period | and remove all the PCI Express device IDs from e1000. but yes, this box was using e1000 for a long time, and recently migrated to e1000e. I'm not sure there's any connection, do you think there is? Ingo --
70 bootups with netconsole turned off, and not a single crash so far. So it's now a strong likelyhood that this crash is a combination of e1000e+netconsole. Ingo --
e1000_clean_tx_irq() call looks particulary suspicious: it is called without adapter->tx_queue_lock in poll controller (netconsole callback) and with that lock in NAPI handler. Can you check kind of this patch: diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c index 869544b..5f180d8 100644 --- a/drivers/net/e1000e/netdev.c +++ b/drivers/net/e1000e/netdev.c @@ -4067,7 +4067,9 @@ static void e1000_netpoll(struct net_device *netdev) disable_irq(adapter->pdev->irq); e1000_intr(adapter->pdev->irq, netdev); + spin_lock(&adapter->tx_queue_lock); e1000_clean_tx_irq(adapter); + spin_unlock(&adapter->tx_queue_lock); enable_irq(adapter->pdev->irq); } -- Evgeniy Polyakov --
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru> The call even seems pointless, since the caller will call ->poll() (which is e1000_clean) as the very next action, and that will invoke e1000_clean_tx_irq() properly. I would just delete this call from e1000_netpoll() entirely. --
ok, i've added the patch below to tip/out-of-tree.
Overnight test had about 100 successful bootups on this testbox. (until
it stopped on a drivers/net/hp.c build error - which is unrelated to
this problem)
So testing with netconsole disabled is conclusive enough to implicate
netconsole strongly. I've now re-enabled netconsole on the testbox and
will continue the test with the fix below. Previously it would crash
within 10-40 iterations.
Ingo
----------------->
commit bf89280dea6d97671aa5f75f2591ae7e8e3e6699
Author: Ingo Molnar <mingo@elte.hu>
Date: Tue Jul 22 09:44:32 2008 +0200
e1000e: fix e1000_netpoll(), remove extraneous e1000_clean_tx_irq() call
Evgeniy Polyakov noticed that drivers/net/e1000e/netdev.c:e1000_netpoll()
was calling e1000_clean_tx_irq() without taking the TX lock.
David Miller suggested to remove the call altogether: since in this
callpah there's periodic calls to ->poll() anyway which will do
e1000_clean_tx_irq() and will garbage-collect any finished TX ring
descriptors.
This might solve the e1000e+netconsole crashes i've been seeing:
=============================================================================
BUG skbuff_head_cache: Poison overwritten
-----------------------------------------------------------------------------
INFO: 0xf658ae9c-0xf658ae9c. First byte 0x6a instead of 0x6b
INFO: Allocated in __alloc_skb+0x2c/0x110 age=0 cpu=0 pid=5098
INFO: Freed in __kfree_skb+0x31/0x80 age=0 cpu=1 pid=4440
INFO: Slab 0xc16cc140 objects=16 used=1 fp=0xf658ae00 flags=0x400000c3
INFO: Object 0xf658ae00 @offset=3584 fp=0xf658af00
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/net/e1000e/netdev.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 869544b..9c0f56b 100644
--- a/drivers/net/e1000e/netdev.c
+++ ...it's now past 50 successful iterations with e1000e+netconsole enabled
again with the fix applied, and not a single crash, hang, memory
corruption or other weirdness happened. Jeff, please apply the fix
below.
Thanks,
Ingo
---------------->
commit 1fd48807674d4f6defe82258c46619ce2c116943
Author: Ingo Molnar <mingo@elte.hu>
Date: Tue Jul 22 09:44:32 2008 +0200
e1000e: fix e1000_netpoll(), remove extraneous e1000_clean_tx_irq() call
Evgeniy Polyakov noticed that drivers/net/e1000e/netdev.c:e1000_netpoll()
was calling e1000_clean_tx_irq() without taking the TX lock.
David Miller suggested to remove the call altogether: since in this
callpah there's periodic calls to ->poll() anyway which will do
e1000_clean_tx_irq() and will garbage-collect any finished TX ring
descriptors.
This fix solved the e1000e+netconsole crashes i've been seeing:
=============================================================================
BUG skbuff_head_cache: Poison overwritten
-----------------------------------------------------------------------------
INFO: 0xf658ae9c-0xf658ae9c. First byte 0x6a instead of 0x6b
INFO: Allocated in __alloc_skb+0x2c/0x110 age=0 cpu=0 pid=5098
INFO: Freed in __kfree_skb+0x31/0x80 age=0 cpu=1 pid=4440
INFO: Slab 0xc16cc140 objects=16 used=1 fp=0xf658ae00 flags=0x400000c3
INFO: Object 0xf658ae00 @offset=3584 fp=0xf658af00
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/net/e1000e/netdev.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 869544b..9c0f56b 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4067,8 +4067,6 @@ static void e1000_netpoll(struct net_device *netdev)
disable_irq(adapter->pdev->irq);
e1000_intr(adapter->pdev->irq, netdev);
- e1000_clean_tx_irq(adapter);
-
enable_irq(adapter->pdev->irq);
}
...From: Ingo Molnar <mingo@elte.hu> I'll take care of integrating this, thanks Ingo. --
Thanks Dave. I have the patch in my queue of patches to push, just had not pushed it to you and Jeff yet. -- Cheers, Jeff --
The skb was definitely freed. 6b is written over an object when it is freed. Something else retained a pointer to the skb and was confident that the skb still exists. --
Hi Christoph. I meant it was not tried to be freed second time, since skb->users area (the very end of the skb) was not changed from 6b to 6a, but its skb->next pointer (first field in the skb) was set to NULL, so after that skb was not used at all. It could be an interesting kmemcheck extension to catch not only non-initialized memory, but also writes into just freed one, but not yet returned by allocator to the next user. -- Evgeniy Polyakov --
hm, i think kmemcheck does that mostly already. If not then I think kmemcheck should just turn off all buffering that might happen on the SLUB level (if it doesnt do it already). I.e. a kfree()d buffer should be marked as uninitialized immediately. Ingo --
We don't catch writes (yet, because kmemcheck gets confused by SLUB's own writes so far), but yes, we mark objects "freed" when they are freed (until the slab page is returned to the page allocator) and we may detect reads from this area before the object is reallocated. In that case, a "read from freed memory" message will appear. Unfortunately, it doesn't work quite as well as I'd like it to yet. We did catch one real bug in ACPI in this way (read-after-free). But as I have discovered, not all machines support single-stepping over REP STOS instructions properly. Which means that kmemcheck breaks silently on some machines. There's also the recent DMA patch, which might improve things. Linus is right, it's really fragile. But we are working on it. It is going in the right direction anyway. (Pekka and I were also working on an extension to SLUB which delays the freeing of objects, i.e. so that objects will remain in the "freed" state for a certain amount of time before it can be allocated again. But we don't really have a _good_ patch for this yet, it was more like an ugly hack at the time.) Vegard -- "The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation." -- E. W. Dijkstra, EWD1036 --
Typically this means that a refcount at the head of the object was decremented. But the object was already freed. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build wi |
