I'm trying to track down a memory corruption bug in the atl1 network device driver that is exposed only when operating with 4GB or more memory. (NB: The driver uses a 32-bit DMA mask.) The bug is hit under certain conditions whenever the network interface is commanded down. The information provided here is against recent -git, but this problem also afflicts the atl1 driver in kernels 2.6.2[345].y. Dmesg and config attached. I can reproduce the bug at will by simply using scp to copy a few hundred megabytes from a remote host, then by executing 'ifconfig eth0 down'. Here is the relevant console output when the bug is hit. I note that the apparent corrupting data beginning at address 0xffff81010fcff402 is actually a received ethernet frame. (The value 00:17:31:4e:9d:41 is the MAC address of the local host, corresponding to the destination address of a received frame.) Can someone with more experience than me please take a look and give me some advice or explain what might be happening here? (What may be obvious to you is probably not obvious to me.) =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D BUG kmalloc-2048: Poison overwritten ---------------------------------------------------------------------------= -- INFO: 0xffff81010fcff402-0xffff81010fcff9f9. First byte 0x0 instead of 0x6b INFO: Allocated in dev_alloc_skb+0x16/0x2c age=3D9459 cpu=3D0 pid=3D2754 INFO: Freed in skb_release_data+0xa8/0xad age=3D9434 cpu=3D0 pid=3D2754 INFO: Slab 0xffffe20005d6f540 objects=3D15 used=3D0 fp=3D0xffff81010fcfb1b0= flags=3D0x8000000000002082 INFO: Object 0xffff81010fcff3f0 @offset=3D29680 fp=3D0xffff81010fcfd2d0 Bytes b4 0xffff81010fcff3e0: 3b bd 00 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a = 5a ;=EF=BF=BD......ZZZZZZZZ Object 0xffff81010fcff3f0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ...
Jay Cliburn wrote, On 05/03/2008 08:09 PM: I'm definitely with less experience, so I wonder why it can't be a simple race between atl1_clean_rx_ring() and something (maybe even pending atl1_intr_rx()) on the other cpu writing skb while kfreeing? Regards, Jarek P. --
Jarek Poplawski wrote, On 05/04/2008 04:24 PM: Hmm... atl1_intr_rx() looks impossible, so atl1_alloc_rx_buffers()? Jarek P. --
On Sun, 04 May 2008 16:55:08 +0200 I booted with nosmp and the bug is *much* harder to hit, but I still hit it once out of about 10 tries. Does the fact that I hit it once using nosmp disprove the race theory? --
Probably not: I don't know how about preemption model, but especially some maybe unkilled timers/watchdogs or workqueues could be considered. Of course this idea looks very unprobable (should happen with less than 4GB too), but should be quite easy to verify by adding some temporary spinlocks around these rx ring operations? Jarek P. --
Hmm#2... Of course for testing without smp disabling local irqs should be enough. BTW... since this check is done during __free_slab it seems such a corruption could take place before this closing, so maybe this atl1_intr_rx() should be considered in these races yet... Jarek P. --
Are you familiar with the kernel command-line option slub_debug=FZPU ? If this was not yet enabled, recompiling your kernel with CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y enabled and specifying slub_debug=FZPU on the kernel command line will probably provide more detailed information about the cause of the memory corruption. Bart. --
On Sun, 4 May 2008 17:02:59 +0200
Bart,
Using slub_debug=FZPU didn't seem to add more information.
CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y are (and were previously) set.
Thanks.
title Fedora (2.6.26-rc1)
root (hd0,5)
kernel /vmlinuz-2.6.26-rc1 ro root=LABEL=/1 console=ttyS0,38400 console=tty0 slub_debug=FZPU
initrd /initrd-2.6.26-rc1.img
=============================================================================
BUG kmalloc-2048: Poison overwritten
-----------------------------------------------------------------------------
INFO: 0xffff81010004297a-0xffff810100042f71. First byte 0x0 instead of 0x6b
INFO: Allocated in dev_alloc_skb+0x16/0x2c age=5813 cpu=0 pid=3029
INFO: Freed in skb_release_data+0xa8/0xad age=201 cpu=0 pid=0
INFO: Slab 0xffffe20005801600 objects=15 used=0 fp=0xffff810100045b18 flags=0x8000000000002082
INFO: Object 0xffff810100042968 @offset=10600 fp=0xffff8101000418d8
Bytes b4 0xffff810100042958: aa 91 fd ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a �.��....ZZZZZZZZ
Object 0xffff810100042968: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
Object 0xffff810100042978: 6b 6b 00 17 31 4e 9d 41 00 0f db bc af 14 08 00 kk..1N.A..ۼ�...
Object 0xffff810100042988: 45 00 00 4e 87 5e 00 00 40 11 6e 82 c0 a8 01 fe E..N.^..@.n.�������.�
Object 0xffff810100042998: c0 a8 01 70 00 89 00 89 00 3a 3b 67 00 09 00 00 ��.p.....:;g....
Object 0xffff8101000429a8: 00 01 00 00 00 00 00 00 20 43 4b 41 41 41 41 41 .........CKAAAAA
Object 0xffff8101000429b8: 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 AAAAAAAAAAAAAAAA
Object 0xffff8101000429c8: 41 41 41 41 41 41 41 41 41 00 00 21 00 01 f0 53 AAAAAAAAA..!..
Object 0xffff8101000429d8: 56 17 df 3e 3b 9f b7 1f 2d 29 f0 68 cf 4d 61 97 V.�>;.�.-)�h�Ma.
Redzone 0xffff810100043168: bb bb bb bb bb bb bb bb �