A regression to v2.6.26:
I started getting this skb-head corruption message today, on a T60
laptop with e1000:PM: Removing info for No Bus:vcs11
device: 'vcs11': device_create_release
=============================================================================
BUG skbuff_head_cache: Poison overwritten
-----------------------------------------------------------------------------INFO: 0xf658ae9c-0xf658ae9c. First byte 0x6a instead of 0x6b
INFO: Allocated in __alloc_skb+0x2c/0x110 age=0 cpu=0 pid=5098
INFO: Freed in __kfree_skb+0x31/0x80 age=0 cpu=1 pid=4440
INFO: Slab 0xc16cc140 objects=16 used=1 fp=0xf658ae00 flags=0x400000c3
INFO: Object 0xf658ae00 @offset=3584 fp=0xf658af00Bytes b4 0xf658adf0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Object 0xf658ae00: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf658ae10: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf658ae20: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf658ae30: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf658ae40: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf658ae50: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf658ae60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf658ae70: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Redzone 0xf658aea0: bb bb bb bb ����
Padding 0xf658aec8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Padding 0xf658aed8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Padding 0xf658aee8: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Padding 0xf658aef8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
Pid: 5098, comm: gdm-binary Not tainted 2.6.26-tip #3094
[<c0186f99>] print_trailer+0xa9/...
Typically this means that a refcount at the head of the object was decremented. But the object was already freed.
--
ok, applied this as a debug special to tip/out-of-tree - future
incidents should have the full object dump.would makes sense for upstream too i think, or increase the limit to 4K
or so. (which is still fair to be dumped into a syslog)Ingo
--
update about this problem: just triggered another colorful crash, see
below. This was with the 4K object dump patch already, maybe the dump
gives a clue?The upstream base of this test kernel was v2.6.26-5253-g14b395e - i.e.
post the big networking pull, but this problem predates it. (It first
triggered after v2.6.26)All the crashes trigger in or close to networking code - not a single
block IO DMA or other DMA crash happened so far, and no filesystem
corruptions or anything like that which would signal hw trouble.Ingo
------------------>
initcall sctp_init+0x0/0x697 returned 0 after 9 msecs
calling powernowk8_init+0x0/0x6e
initcall powernowk8_init+0x0/0x6e returned -19 after 0 msecs
calling hpet_insert_resource+0x0/0x1e
initcall hpet_insert_resource+0x0/0x1e returned 0 after 0 msecs
calling lapic_insert_resource+0x0/0x44
initcall lapic_insert_resource+0x0/0x44 returned 0 after 0 msecs
calling init_lapic_nmi_sysfs+0x0/0x33
initcall init_lapic_nmi_sysfs+0x0/0x33 returned 0 after 0 msecs
=============================================================================
BUG skbuff_head_cache: Poison overwritten
-----------------------------------------------------------------------------INFO: 0xf7ccc100-0xf7ccc103. First byte 0x0 instead of 0x6b
INFO: Allocated in __alloc_skb+0x30/0x10e age=1 cpu=1 pid=1
INFO: Freed in __kfree_skb+0x63/0x66 age=1 cpu=0 pid=0
INFO: Slab 0xc1c34ca0 objects=16 used=1 fp=0xf7ccc100 flags=0x400000c3
INFO: Object 0xf7ccc100 @offset=256 fp=0xf7ccc200Bytes b4 0xf7ccc0f0: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ
Object 0xf7ccc100: 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ....kkkkkkkkkkkk
Object 0xf7ccc110: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf7ccc120: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf7ccc130: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xf7ccc140: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ...
SGkgSW5nbywKCk9uIE1vbiwgSnVsIDIxLCAyMDA4IGF0IDEyOjQxIFBNLCBJbmdvIE1vbG5hciA8
bWluZ29AZWx0ZS5odT4gd3JvdGU6Cj4gdXBkYXRlIGFib3V0IHRoaXMgcHJvYmxlbToganVzdCB0
cmlnZ2VyZWQgYW5vdGhlciBjb2xvcmZ1bCBjcmFzaCwgc2VlCj4gYmVsb3cuIFRoaXMgd2FzIHdp
dGggdGhlIDRLIG9iamVjdCBkdW1wIHBhdGNoIGFscmVhZHksIG1heWJlIHRoZSBkdW1wCj4gZ2l2
ZXMgYSBjbHVlPwoKLi4udG8gcG9pbnQgb3V0IHRoZSBvYnZpb3VzOgoKPiA9PT09PT09PT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09
PT09PT09PQo+IEJVRyBza2J1ZmZfaGVhZF9jYWNoZTogUG9pc29uIG92ZXJ3cml0dGVuCj4gLS0t
LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t
LS0tLS0tLS0tLS0tLS0tLS0KPgo+IElORk86IDB4ZjdjY2MxMDAtMHhmN2NjYzEwMy4gRmlyc3Qg
Ynl0ZSAweDAgaW5zdGVhZCBvZiAweDZiCj4gSU5GTzogQWxsb2NhdGVkIGluIF9fYWxsb2Nfc2ti
KzB4MzAvMHgxMGUgYWdlPTEgY3B1PTEgcGlkPTEKPiBJTkZPOiBGcmVlZCBpbiBfX2tmcmVlX3Nr
YisweDYzLzB4NjYgYWdlPTEgY3B1PTAgcGlkPTAKPiBJTkZPOiBTbGFiIDB4YzFjMzRjYTAgb2Jq
ZWN0cz0xNiB1c2VkPTEgZnA9MHhmN2NjYzEwMCBmbGFncz0weDQwMDAwMGMzCj4gSU5GTzogT2Jq
ZWN0IDB4ZjdjY2MxMDAgQG9mZnNldD0yNTYgZnA9MHhmN2NjYzIwMAo+Cj4gQnl0ZXMgYjQgMHhm
N2NjYzBmMDogIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVhIDVh
IFpaWlpaWlpaWlpaWlpaWloKPiAgT2JqZWN0IDB4ZjdjY2MxMDA6ICAwMCAwMCAwMCAwMCA2YiA2
YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiAuLi4ua2tra2tra2tra2trCgpVc2UgYWZ0
ZXIgZnJlZSB3aGVyZSBmaXJzdCBmb3VyIGJ5dGVzIGFyZSB6ZXJvZWQuCgo+ICBPYmplY3QgMHhm
N2NjYzExMDogIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZi
IGtra2tra2tra2tra2tra2sKPiAgT2JqZWN0IDB4ZjdjY2MxMjA6ICA2YiA2YiA2YiA2YiA2YiA2
YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiBra2tra2tra2tra2tra2trCj4gIE9iamVj
dCAweGY3Y2NjMTMwOiAgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIgNmIg
NmIgNmIga2tra2tra2tra2tra2trawo+ICBPYmplY3QgMHhmN2NjYzE0MDogIDZiIDZiIDZiIDZi
IDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIDZiIGtra2tra2tra2tra2tra2sKPiAg
T2JqZWN0IDB4ZjdjY2MxNTA6ICA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2YiA2
YiA2YiA2YiA2YiBra2tra2tra2tra2tra2trCj4gIE9iamVjdCAweGY3Y2NjMTYwOiAgNmIgNmIg
NmIgNmIgNmIgNmIgNmIg...
Hi.
Not that obvious...
skb->next is cleared in lots of places, in xmit network helper
for example, but since rest of the packet was not modified, it
means given skb was not freed, so it will not help.Ingo do you see other similar dumps with last byte modified?
That's the one which can help to determine the reason.--
Evgeniy Polyakov
--
The skb was definitely freed. 6b is written over an object when it is
freed. Something else retained a pointer to the skb and was confident
that the skb still exists.--
Hi Christoph.
I meant it was not tried to be freed second time, since skb->users area
(the very end of the skb) was not changed from 6b to 6a, but its
skb->next pointer (first field in the skb) was set to NULL, so after
that skb was not used at all.It could be an interesting kmemcheck extension to catch not only
non-initialized memory, but also writes into just freed one, but not yet
returned by allocator to the next user.--
Evgeniy Polyakov
--
hm, i think kmemcheck does that mostly already. If not then I think
kmemcheck should just turn off all buffering that might happen on the
SLUB level (if it doesnt do it already). I.e. a kfree()d buffer should
be marked as uninitialized immediately.Ingo
--
We don't catch writes (yet, because kmemcheck gets confused by SLUB's
own writes so far), but yes, we mark objects "freed" when they are
freed (until the slab page is returned to the page allocator) and we
may detect reads from this area before the object is reallocated. In
that case, a "read from freed memory" message will appear.Unfortunately, it doesn't work quite as well as I'd like it to yet. We
did catch one real bug in ACPI in this way (read-after-free). But as I
have discovered, not all machines support single-stepping over REP
STOS instructions properly. Which means that kmemcheck breaks silently
on some machines. There's also the recent DMA patch, which might
improve things. Linus is right, it's really fragile. But we are
working on it. It is going in the right direction anyway.(Pekka and I were also working on an extension to SLUB which delays
the freeing of objects, i.e. so that objects will remain in the
"freed" state for a certain amount of time before it can be allocated
again. But we don't really have a _good_ patch for this yet, it was
more like an ugly hack at the time.)Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
the problem is, most of the crashes dont come with any usable dump. This
is a laptop so netconsole is the only reliable route out - and if
something in networking crashes chances are that it hoses netconsole
before it can get anything out.Another thing is that i'm activating netconsole on this box via a kernel
boot line and from within a bzImage (to get it activated as early as
possible) - maybe that's a tad too early for certain initialization
sequences?I could try run tests with netconsole deactivated, if you think that's a
worthwile line of probing this problem. (although that would make me do
blind tests in essence - having kernel log output is really essential.)Ingo
--
Let's try this way first. If system will continue to crash, we will add
some debug options in various pathes. Existing reports do not contain
enough information unfortunately, so we will not lose too much.--
Evgeniy Polyakov
--
ok. I've turned off netconsole - 8 successful bootups in a row so far.
The box is a slow booter/builder with an 8 kernels/hour test throughput,
so if everything goes fine we should have meaningful results in about 10
hours.( there are other, faster testboxes in -tip testing with 33 kernels/hour
build+boot throughput where we'd have to wait only 2 hours - but as
per Murphy's law they dont trigger this bug ;-)Ingo
--
70 bootups with netconsole turned off, and not a single crash so far.
So it's now a strong likelyhood that this crash is a combination of
e1000e+netconsole.Ingo
--
e1000_clean_tx_irq() call looks particulary suspicious: it is called
without adapter->tx_queue_lock in poll controller (netconsole callback)
and with that lock in NAPI handler.Can you check kind of this patch:
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 869544b..5f180d8 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4067,7 +4067,9 @@ static void e1000_netpoll(struct net_device *netdev)
disable_irq(adapter->pdev->irq);
e1000_intr(adapter->pdev->irq, netdev);+ spin_lock(&adapter->tx_queue_lock);
e1000_clean_tx_irq(adapter);
+ spin_unlock(&adapter->tx_queue_lock);enable_irq(adapter->pdev->irq);
}--
Evgeniy Polyakov
--
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
The call even seems pointless, since the caller will call
->poll() (which is e1000_clean) as the very next action,
and that will invoke e1000_clean_tx_irq() properly.I would just delete this call from e1000_netpoll() entirely.
--
ok, i've added the patch below to tip/out-of-tree.
Overnight test had about 100 successful bootups on this testbox. (until
it stopped on a drivers/net/hp.c build error - which is unrelated to
this problem)So testing with netconsole disabled is conclusive enough to implicate
netconsole strongly. I've now re-enabled netconsole on the testbox and
will continue the test with the fix below. Previously it would crash
within 10-40 iterations.Ingo
----------------->
commit bf89280dea6d97671aa5f75f2591ae7e8e3e6699
Author: Ingo Molnar <mingo@elte.hu>
Date: Tue Jul 22 09:44:32 2008 +0200e1000e: fix e1000_netpoll(), remove extraneous e1000_clean_tx_irq() call
Evgeniy Polyakov noticed that drivers/net/e1000e/netdev.c:e1000_netpoll()
was calling e1000_clean_tx_irq() without taking the TX lock.David Miller suggested to remove the call altogether: since in this
callpah there's periodic calls to ->poll() anyway which will do
e1000_clean_tx_irq() and will garbage-collect any finished TX ring
descriptors.This might solve the e1000e+netconsole crashes i've been seeing:
=============================================================================
BUG skbuff_head_cache: Poison overwritten
-----------------------------------------------------------------------------INFO: 0xf658ae9c-0xf658ae9c. First byte 0x6a instead of 0x6b
INFO: Allocated in __alloc_skb+0x2c/0x110 age=0 cpu=0 pid=5098
INFO: Freed in __kfree_skb+0x31/0x80 age=0 cpu=1 pid=4440
INFO: Slab 0xc16cc140 objects=16 used=1 fp=0xf658ae00 flags=0x400000c3
INFO: Object 0xf658ae00 @offset=3584 fp=0xf658af00Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/net/e1000e/netdev.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 869544b..9c0f56b 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e100...
it's now past 50 successful iterations with e1000e+netconsole enabled
again with the fix applied, and not a single crash, hang, memory
corruption or other weirdness happened. Jeff, please apply the fix
below.Thanks,
Ingo
---------------->
commit 1fd48807674d4f6defe82258c46619ce2c116943
Author: Ingo Molnar <mingo@elte.hu>
Date: Tue Jul 22 09:44:32 2008 +0200e1000e: fix e1000_netpoll(), remove extraneous e1000_clean_tx_irq() call
Evgeniy Polyakov noticed that drivers/net/e1000e/netdev.c:e1000_netpoll()
was calling e1000_clean_tx_irq() without taking the TX lock.David Miller suggested to remove the call altogether: since in this
callpah there's periodic calls to ->poll() anyway which will do
e1000_clean_tx_irq() and will garbage-collect any finished TX ring
descriptors.This fix solved the e1000e+netconsole crashes i've been seeing:
=============================================================================
BUG skbuff_head_cache: Poison overwritten
-----------------------------------------------------------------------------INFO: 0xf658ae9c-0xf658ae9c. First byte 0x6a instead of 0x6b
INFO: Allocated in __alloc_skb+0x2c/0x110 age=0 cpu=0 pid=5098
INFO: Freed in __kfree_skb+0x31/0x80 age=0 cpu=1 pid=4440
INFO: Slab 0xc16cc140 objects=16 used=1 fp=0xf658ae00 flags=0x400000c3
INFO: Object 0xf658ae00 @offset=3584 fp=0xf658af00Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/net/e1000e/netdev.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 869544b..9c0f56b 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -4067,8 +4067,6 @@ static void e1000_netpoll(struct net_device *netdev)
disable_irq(adapter->pdev->irq);
e1000_intr(adapter->pdev->irq, netdev);- e1000_clean_tx_irq(adapter);
-
enable_irq(ad...
From: Ingo Molnar <mingo@elte.hu>
I'll take care of integrating this, thanks Ingo.
--
Thanks Dave. I have the patch in my queue of patches to push, just had
not pushed it to you and Jeff yet.--
Cheers,
Jeff--
Since 2.6.25 there was only single change in netpoll.c:
f5184d267c1aedb9b7a8cc44e08ff6b8d382c3b5
Which looks innocent.Is your driver e1000 or e1000e? Can you check different one?
--
Evgeniy Polyakov
--
i cannot check e1000 anymore due to this upstream commit:
| d03157babed7424f5391af43200593768ce69c9a is first bad commit
| commit d03157babed7424f5391af43200593768ce69c9a
| Author: Auke Kok <auke-jan.h.kok@intel.com>
| Date: Sun Jun 22 15:21:29 2008 -0700
|
| e1000: remove PCI Express device IDs
|
| We do not want to prolong the situation much longer that e1000
| and e1000e support these devices at the same time. As a result,
| take out the bandage that was added for the interim period
| and remove all the PCI Express device IDs from e1000.but yes, this box was using e1000 for a long time, and recently migrated
to e1000e. I'm not sure there's any connection, do you think there is?Ingo
--
How about posting your custom patches for public review? :-D
(There was recently an odd case of kzalloc() succeeding even though it
was called before the slab caches were initialized -- and it only
generated a warning about irqs-on happening too early. I just mean to
say that it _can_ happen. On another machine, it crashed
spectacularly. I guess you're not enabling the netconsole before slab
allocator is up?)Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
Kzalloc for objects > 4k will succeed even during early boot since we
will directly call into the page allocator.--
On Mon, Jul 21, 2008 at 6:19 PM, Christoph Lameter
It wasn't just that; it got into __slab_alloc() and out again. See the
backtrace at http://lkml.org/lkml/2008/7/19/23It's scary, I know. It's a good thing we have irqtrace :-)
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
I changed nothing in that area - being able to enable the netconsole
like that has been a years-old feature. Otherwise it's just tip/master,if i had any changes in that area i'd have declared them. Find below the
full bootlog of the last crash - can you see any init order weirdness in
it?Btw., i disabled netconsole now and got 3 successful bootups in a row.
(which is still not significant statistically - i had a series of 40
bootups in a row before. Then again i had like 5 failures in a row
earlier this morning, before disabling netconsole.)Btw. #2, in the log below you can see this partial line:
initcall pm_qos_power_init+0x0/0x54Linux version 2.6.26-tip (mingo@europe) (gcc version 4.2.2) #3261 SMP Mon Jul 21 11:17:45 CEST 2008
That was a previous bootup with another failure but which came without
any log output.Hey, if you find this report frustrating, dont worry, i almost never
report easy bugs - i can fix them just fine ;-)Ingo
--------------->
initcall pm_qos_power_init+0x0/0x54Linux version 2.6.26-tip (mingo@europe) (gcc version 4.2.2) #3261 SMP Mon Jul 21 11:17:45 CEST 2008
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007f6e0000 (usable)
BIOS-e820: 000000007f6e0000 - 000000007f6f3000 (ACPI data)
BIOS-e820: 000000007f6f3000 - 000000007f700000 (ACPI NVS)
BIOS-e820: 000000007f700000 - 0000000080000000 (reserved)
BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
BIOS-e820: 00000000fed00000 - 00000000fed00400 (reserved)
BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
BIOS-e820: 00000000fed1c000 - 00000000fed90000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ff8...
SLUB already limits object sizes to less than PAGE_SIZE so the patch
should be fine. Christoph, are you okay with this going upstream?
--
I am fine with the patch. Just be aware that we are going to be shoveling a lot of stuff into the log. The object size can be greater than 4k for non kmalloc slab caches. True in general object sizes are <4k.
--
Hi Ingo.
Out of curiosity, why does it scream at allocation time?
Does SLUB have a debug check at freeing time? If so, how does it work
and why didn't it caught use after free there?--
Evgeniy Polyakov
--
Hi Pekka.
Does it also scream on double free event? Just to closer guilty
circles... 0x9c offset is somewhere at the very end of the skbuff
structure, likely skb->users.Can you also check in some kind of this patch to catch freed skb freeing
for testing?diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3666216..dda96bf 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -419,6 +419,14 @@ void kfree_skb(struct sk_buff *skb)
{
if (unlikely(!skb))
return;
+
+ {
+ u8 *ptr = (u8 *)(&skb->users);
+
+ if (*ptr == POISON_FREE || *ptr == POISON_INUSE || *ptr == POISON_END)
+ BUG();
+ }
+
if (likely(atomic_read(&skb->users) == 1))
smp_rmb();
else if (likely(!atomic_dec_and_test(&skb->users)))--
Evgeniy Polyakov
--
Hi Evgeniy,
On Fri, Jul 18, 2008 at 12:02:26PM +0300, Pekka Enberg
Yeah. See the free_debug_processing() function in mm/slub.c for
details (the on_freelist() part). However, if you look at slab_free()
you can see that in the SLUB fast-path we don't do any of these
debugging checks. So you can end up with slab corruption without a
nice error message.Pekka
--
The slub fastpath is not used when debugging is enabled.
Without debugging on double frees will typically corrupt the freepointer. So you get an invalid pointer reference in __slab_free.
--
yeah, we want to check use-after free at the next allocation point -
i.e. as late as possible to gather all corruptions that happened
meanwhile.We could in theory have a SLUB debug mode where a SCHED_IDLE kernel
thread would periodically check all free objects (of that CPU) in the
background to ensure their integrity. That would catch corruptions
sooner, with a possibly still meaningful context to print out. [right
after the IRQ or process that corrupts them finishes running]It could also be hooked into ftrace to print out the last few hundred
kernel function calls executed prior any corruption. ftrace/slub-debug
plugin perhaps?Ingo
--
Hi Ingo,
Well, there's this Norwegian guy, Vegard, who has written a small
piece of code that can detect use-after-free _as it happens_. I think
he calls the thing kmemcheck :-).
--
It doesn't actually work. The config says
# head: 088fcf34
and I checked out this from the tip tree. But kernel-config still
complains about unknown config options... What went wrong?Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
that's ok - i've got some local qa helpers that have config options.
Things like making SMP bootups more likely in randconfig, adding various
boot parameters to the bootup via .config methods (so that boot
parameters can be randomized via make randconfig), etc.these:
CONFIG_BOOTPARAM_SUPPORT=y
CONFIG_BOOTPARAM_NO_HZ_OFF=y
CONFIG_BOOTPARAM_NMI_WATCHDOG_BIT_0=y
CONFIG_BOOTPARAM_LAPIC=y
CONFIG_BOOTPARAM_IDLE_MWAIT=y
CONFIG_BOOTPARAM_NOPAT=y
CONFIG_BOOTPARAM_NOTSC=yare equivalent to adding this to the boot line:
nohz=off nmi_watchdog=1 lapic idle=mwait nopat notsc
although i dont think they are normally material to netconsole workings.
Ingo
--
Hm, your config doesn't contain SLUB_DEBUG either. Isn't that a
requirement for seeing the BUG you're getting?$ grep SLUB config-Thu_Jul_17_20_24_45_CEST_2008.bad
# CONFIG_SLUB_DEBUG is not set
CONFIG_SLUB=yDid you post the wrong config?
Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
1. Notice the range. It's just a single byte.
2. Notice the value. It's just a ++.Probably a stray increment of a uint8_t somewhere on a freed object?
The offset from the beginning of the object is 0xf658ae9c - 0xf658ae00 = 0x9c.
How big is a struct sk_buff? Hm.. it is in fact quite big. Now what
member has offset 0x9c? Seems to depend on your config. Is there any
way you can figure it out, Ingo? I'll try it with your config too.Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
From: "Vegard Nossum" <vegard.nossum@gmail.com>
It's supposed to be 0x6b, this would be a "--"
Also it (more likely IMHO) could be clearing a flag with the value 0x01.
--
It could be. But like I said in a later e-mail, the thing is likely
sk_buff->truesize. Which is not a flags variable. It _is_ however, a
counter, which is frequently -= and atomic_sub()ed.That field is also an int, not a byte like I suggested above. This is
fine, though. "--" on an int can of course legitimately update/change
just the lower byte of an int.But.. it could also be some random corruption coming from elsewhere.
Maybe even bad RAM (it's just a single bit anyway). But that's less
likely.Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
Hm, just got another skb related weirdness and crash on that testbox:
[ 13.495000] usb 1-0:1.0: uevent
[ 19.392000] Attempt to release alive inet socket f6fac040
[ 20.387000] EXT3 FS on sda1, internal journal
[ 20.762500] eth0: no IPv6 routers present
[ 22.431000] warning: process `kudzu' used the deprecated sysctl system call with 1.23.
[ 22.906000] ------------[ cut here ]------------
[ 22.906000] kernel BUG at mm/slab.c:2927!
[ 22.906000] invalid opcode: 0000 [#1] PREEMPT SMP
[ 22.906000] Modules linked in:
[ 22.906000]
[ 22.906000] Pid: 1831, comm: ip6tables-resto Tainted: G W (2.6.26-tip-03923-g2c96bcb-dirty #3134)
[ 22.906000] EIP: 0060:[<b01878d2>] EFLAGS: 00010046 CPU: 0
[ 22.906000] EIP is at cache_alloc_refill+0xfa/0x431
[ 22.906000] EAX: 00000004 EBX: f6ec1a00 ECX: f67826c0 EDX: f6fa3c40
[ 22.906000] ESI: f6fac000 EDI: 0000000b EBP: f5c01ec4 ESP: f5c01e70
[ 22.906000] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 22.906000] Process ip6tables-resto (pid: 1831, ti=f5c01000 task=f5d2f060 task.ti=f5c01000)
[ 22.906000] Stack: 00000004 00000004 000000d0 000000d0 f67826c8 f67826d0 f6fac01c 000080d0
[ 22.906000] f6fa3c40 f6fa9800 f67826c0 00000000 7f431025 b014a3e9 b2bd6ab8 00000000
[ 22.906000] 00000000 00000000 00000246 f6fa3c40 f6fa3c40 f5c01ee4 b0187793 b07c11d6
[ 22.906000] Call Trace:
[ 22.906000] [<b014a3e9>] ? __wake_up_bit+0xc/0x37
[ 22.906000] [<b0187793>] ? kmem_cache_alloc+0x53/0x98
[ 22.906000] [<b05589b5>] ? sk_prot_alloc+0x1e/0xa7
[ 22.906000] [<b026cb86>] ? cap_capable+0xa/0x2b
[ 22.906000] [<b0559fa4>] ? sk_alloc+0x20/0xa3
[ 22.906000] [<b05c4364>] ? inet6_create+0x187/0x316
[ 22.906000] [<b0556e87>] ? __sock_create+0x15b/0x229
[ 22.906000] [<b0556f9c>] ? sock_create+0x29/0x2e
[ 22.906000] [<b0557112>] ? sys_socket+0x1e/0x42
[ 22.906000] [<b05577b6>] ? sys_socketcall+0x56/0x163
[ 22.906000] ...
Totally weird bug. Using the same bzImage i was able to crash the box 2
times in a row. Then i tried another (but similarly configured) kernel
and it wouldnt crash.Then i tried the original bzImage again (bit for bit equivalent) and it
wouldnt crash 5 times in a row.I've attached the second crashlog below - in case it's interesting to
anyone.This cannot be random corruption or hw flakiness as the two crashes in a
row are so similar. One (remote) possibility would be for some hw
component to be in a weird state that doesnt get proper re-initialized
after reboot. As long as that condition lingered, it would crash. Once
that condition was gone, it wouldnt crash.For now i've given up on this variant of the bug. I'll continue with
randconfig searching - maybe another config gives us something more
debuggable ...And note that this time it was SLAB not SLUB - so any SLUB impact is
highly unlikely. (i strongly suspect some hw initialization issue.)Ingo
-------------------->
[ 0.000000] Linux version 2.6.26-tip-03923-g2c96bcb-dirty (mingo@europe) (gcc version 4.2.2) #3134 SMP PREEMPT Fri Jul 18 20:06:13 CEST 2008
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
[ 0.000000] BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000d2000 - 00000000000d4000 (reserved)
[ 0.000000] BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 000000007f6e0000 (usable)
[ 0.000000] BIOS-e820: 000000007f6e0000 - 000000007f6f3000 (ACPI data)
[ 0.000000] BIOS-e820: 000000007f6f3000 - 000000007f700000 (ACPI NVS)
[ 0.000000] BIOS-e820: 000000007f700000 - 0000000080000000 (reserved)
[ 0.000000] BIOS-e820: 00000000f0000000 - 00000000f4000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fed00000 - 00000000fed00400 (reser...
ok, i looked at the logs once more and while i thought that it occured
twice it only occured once:Jul 17 20:22:14 europe kernel: BUG skbuff_head_cache: Poison overwritten
... this would explain why my attempts to bisect and reproduce it
failed. I got too excited about it being seemingly reproducible and
possibly bisectable (memory corruption bugs rarely are). My overnight
reboot-the-same-kernel tests didnt show anything either.It's a known-reliable system with thousands of bootups:
Jul 17 20:20:54 europe kernel: Linux version 2.6.26-tip (mingo@europe)
(gcc version 4.2.2) #3094 SMP Thu Jul 17 20:19:27 CEST 2008... but a hw fluke is never out of question. (It wasnt a particularly
hot day but the evening was unusually humid, maybe that made the
difference.)So lets close this for now, it's not a reproducible regression that we
can act upon. I'll update this thread if anything new happens.
Netconsole is reliable on this system in any case.Ingo
--
From: "Vegard Nossum" <vegard.nossum@gmail.com>
skb->truesize is ever incremented or decremented by only one.
Usually it is changed by the entire packet size, or at least one MSS's
worth.On packet free, it will be decremented by at least sizeof(struct sk_buff)
--
hmm ... your analysis gave me a wonderful albeit admittedly remote idea:
If only we had some kernel technology that could track and validate
memory accesses, and point out the cases where we access uninitialized
memory, just like Valgrind?... something like kmemcheck? ;-)
So i booted that box with tip/master and kmemcheck enabled. (plus a few
fixlets to make networking allocations be properly tracked by
kmemcheck.)It was a slow bootup and long wait, but it gave a few hits here:
kmemcheck: Caught 8-bit read from uninitialized memory (f653ad24)
iiiiiiiiiiiiiiiiuuuuuuuuuuuuuuuuuuuuuiuuuuuuuuuuuuuuuuuuuuuuuuuu
^Pid: 2484, comm: arping Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c05e973c>] EFLAGS: 00010282 CPU: 0
EIP is at __copy_skb_header+0x7c/0x100
EAX: 00000000 EBX: f653acc0 ECX: f653ac00 EDX: f653ac00
ESI: f653ac50 EDI: f653ad10 EBP: c09b9e84 ESP: c09ddaa8
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: f71c2700 CR3: 36513000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c05e97e7>] __skb_clone+0x27/0xe0
[<c05eb101>] skb_clone+0x41/0x60
[<c065cbf1>] packet_rcv+0xc1/0x290
[<c05f07ad>] netif_receive_skb+0x20d/0x400
[<c03b2aa7>] e1000_receive_skb+0x47/0x180
[<c03b3983>] e1000_clean_rx_irq+0x223/0x2e0
[<c03b225b>] e1000_clean+0x5b/0x200
[<c05f29db>] net_rx_action+0xfb/0x160
[<c0129092>] __do_softirq+0x82/0xf0
[<c0105b8a>] call_on_stack+0x1a/0x30false positive? Find below the quick hacks i did to pre-initialize skb
allocations that have RX DMA into them.another one is:
kmemcheck: Caught 8-bit read from uninitialized memory (f653a902)
iiuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
^Pid: 2575, comm: hcid Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c02b9926>] EFLAGS: 00010293 CPU: 0
EIP is at __copy_to_user_ll+0x46/0x70
EAX: 00000004 EBX: b...
From: Ingo Molnar <mingo@elte.hu>
Maybe. Every SKB object allocated is fully initialized
in __alloc_skb():/*
* Only clear those fields we need to clear, not those that we will
* actually initialise below. Hence, don't put any more fields after
* the tail pointer in struct sk_buff!
*/
memset(skb, 0, offsetof(struct sk_buff, tail));That leaves the following trailing members of struct sk_buff:
/* These elements must be at the end, see alloc_skb() for details. */
sk_buff_data_t tail;
sk_buff_data_t end;
unsigned char *head,
*data;
unsigned int truesize;
atomic_t users;which are the explicitly initialized right after the quotes memset().
skb->truesize = size + sizeof(struct sk_buff);
atomic_set(&skb->users, 1);
skb->head = data;
skb->data = data;
skb_reset_tail_pointer(skb);
skb->end = skb->tail + size;When we clone, there are probably some fields we don't copy over
explicitly. And we usually do that because they don't matter or
if they do the caller will take care of it.
--
Hm, if you think it was that slow, I am suspecting you were also using
SLUB debugging.This can actually be negative, since now SLUB will access the objects
(+redzone +padding) and possibly trick kmemcheck into thinking they
were initialized in the first place.But what we are really looking for is "read from freed memory"
messages. So I would actually recommend this: Disable kmemcheck's
reporting of uninitialized memory, simply to make it easier to spot
the "freed" messages more easily.Maybe something like this (warning: whitespace-munged):
diff --git a/arch/x86/mm/kmemcheck/error.c b/arch/x86/mm/kmemcheck/error.c
index 56410c6..6944cb7 100644
--- a/arch/x86/mm/kmemcheck/error.c
+++ b/arch/x86/mm/kmemcheck/error.c
@@ -98,6 +98,9 @@ void kmemcheck_error_save(enum kmemcheck_shadow state,
return;
prev_ip = regs->ip;+ if (state == KMEMCHECK_SHADOW_UNINITIALIZED)
+ return;
+
e = error_next_wr();
if (!e)
return;If this only happens during boot, it would also be a good idea to
simply reboot the machine a lot...Vegard
--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
nope:
# CONFIG_SLUB_DEBUG is not set
yeah, i've got a script for that. Will try it overnight.
Ingo
--
two other ones:
kmemcheck: Caught 32-bit read from uninitialized memory (f459ac00)
uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
^Pid: 1614, comm: kjournald Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c05ea904>] EFLAGS: 00010216 CPU: 0
EIP is at skb_copy_bits+0x54/0x220
EAX: 00000110 EBX: 00000110 ECX: 00000044 EDX: f6c256c0
ESI: f459ac00 EDI: f459a000 EBP: c09b9f60 ESP: c09de048
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 8005003b CR2: f71c2700 CR3: 2e849000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c05ebe28>] skb_copy+0x78/0x90
[<c05f9426>] neigh_timer_handler+0x156/0x2c0
[<c012ba62>] run_timer_softirq+0x142/0x180
[<c0129092>] __do_softirq+0x82/0xf0
[<c0105b8a>] call_on_stack+0x1a/0x30
[<ffffffff>] 0xffffffffplus:
kmemcheck: Caught 32-bit read from uninitialized memory (f654b590)
iiiiiiiiiiiiiiiiuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
^Pid: 3130, comm: git-read-tree Not tainted (2.6.26-tip #20187)
EIP: 0060:[<c05eb396>] EFLAGS: 00010286 CPU: 0
EIP is at pskb_expand_head+0x86/0x150
EAX: 00000140 EBX: f6763124 ECX: 00000027 EDX: f6549400
ESI: f654b590 EDI: f6549650 EBP: c09b9a80 ESP: c09dfa28
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: f71c2700 CR3: 2e849000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c05eb67d>] __pskb_pull_tail+0x21d/0x300
[<c03b45dd>] e1000_xmit_frame+0x1fd/0xa60
[<c05f0fd4>] dev_hard_start_xmit+0x254/0x2e0
[<c0609721>] __qdisc_run+0x61/0x1b0
[<c05f383e>] dev_queue_xmit+0x25e/0x340
[<c061f315>] ip_finish_output+0x105/0x260
[<c061f739>] ip_output+0x49/0x50
[<c061debd>] ip_local_out+0x1d/0x30
[<c061fa85>] ip_queue_xmit+0x1a5/0x340
[<c062ffd3>] tcp_transmit_skb+0x443/0x800
[<c06316aa>] __tcp_push_pending_frames+0xfa/0x730
[<...
With your config:
(gdb) p ((struct sk_buff *) 0)->truesize
Cannot access memory at address 0x9cNow just audit users of ->truesize... There are quite a few.
Which one would only += 1?
Vegard
PS: I might be on the completely wrong track. So far I only have bad
experiences with this sk_buff...--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
From: Ingo Molnar <mingo@elte.hu>
This is very unlikely to be added by us networking folks, no
networking merges have happened for the 2.6.27 merge window yet :-)
--
and Cc:-ed SLUB folks. Could be a sleeper cell of bugs gone active ;-)
Or could be SLUB (-debugging) breakage. Netconsole is pretty reliable on
this box. (and the bootup continued just fine after this report)Just re-tried it, the bug is reliably repeatable. Will try a bisection
run.Ingo
--
hm, but it was not reproducible on the third and fourth attempt :-( I
tried hard to provoke it by generating artificial parallel network and
netconsole output - but it didnt want to trigger. Heisenbug ...Maybe the debug output gives someone an idea about the nature of the
bug?Ingo
--
From: Ingo Molnar <mingo@elte.hu>
This bug would be a quite positive result then :)
--
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Amit K. Arora | [RFC] Heads up on sys_fallocate() |
| Chuck Ebbert | Why do so many machines need "noapic"? |
git: | |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| David Miller | [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Natalie Protasevich | [BUG] New Kernel Bugs |
