Need help debugging memory corruption

Previous thread: [PATCH v2] unify sys_pipe implementation by Ulrich Drepper on Saturday, May 3, 2008 - 11:01 am. (5 messages)

Next thread: nommu compile error by Adrian Bunk on Saturday, May 3, 2008 - 11:41 am. (1 message)
From: Jay Cliburn
Date: Saturday, May 3, 2008 - 11:09 am

I'm trying to track down a memory corruption bug in the atl1 network
device driver that is exposed only when operating with 4GB or more
memory.  (NB: The driver uses a 32-bit DMA mask.)  The bug is hit under
certain conditions whenever the network interface is commanded down.

The information provided here is against recent -git, but this problem
also afflicts the atl1 driver in kernels 2.6.2[345].y.  Dmesg and
config attached.

I can reproduce the bug at will by simply using scp to copy a few
hundred megabytes from a remote host, then by executing 'ifconfig eth0
down'.

Here is the relevant console output when the bug is hit.  I note that
the apparent corrupting data beginning at address 0xffff81010fcff402 is
actually a received ethernet frame.  (The value 00:17:31:4e:9d:41 is
the MAC address of the local host, corresponding to the destination
address of a received frame.)

Can someone with more experience than me please take a look and give me
some advice or explain what might be happening here?  (What may be
obvious to you is probably not obvious to me.)

=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D
BUG kmalloc-2048: Poison overwritten
---------------------------------------------------------------------------=
--

INFO: 0xffff81010fcff402-0xffff81010fcff9f9. First byte 0x0 instead of 0x6b
INFO: Allocated in dev_alloc_skb+0x16/0x2c age=3D9459 cpu=3D0 pid=3D2754
INFO: Freed in skb_release_data+0xa8/0xad age=3D9434 cpu=3D0 pid=3D2754
INFO: Slab 0xffffe20005d6f540 objects=3D15 used=3D0 fp=3D0xffff81010fcfb1b0=
 flags=3D0x8000000000002082
INFO: Object 0xffff81010fcff3f0 @offset=3D29680 fp=3D0xffff81010fcfd2d0

Bytes b4 0xffff81010fcff3e0:  3b bd 00 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a =
5a ;=EF=BF=BD......ZZZZZZZZ
  Object 0xffff81010fcff3f0:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ...
From: Jarek Poplawski
Date: Sunday, May 4, 2008 - 7:24 am

Jay Cliburn wrote, On 05/03/2008 08:09 PM:

I'm definitely with less experience, so I wonder why it can't be
a simple race between atl1_clean_rx_ring() and something (maybe even
pending atl1_intr_rx()) on the other cpu writing skb while kfreeing?

Regards,
Jarek P.

--

From: Jarek Poplawski
Date: Sunday, May 4, 2008 - 7:55 am

Jarek Poplawski wrote, On 05/04/2008 04:24 PM:


Hmm... atl1_intr_rx() looks impossible, so atl1_alloc_rx_buffers()?

Jarek P. 


--

From: Jay Cliburn
Date: Sunday, May 4, 2008 - 12:55 pm

On Sun, 04 May 2008 16:55:08 +0200

I booted with nosmp and the bug is *much* harder to hit, but I still
hit it once out of about 10 tries.  Does the fact that I hit it once
using nosmp disprove the race theory?
--

From: Jarek Poplawski
Date: Monday, May 5, 2008 - 12:27 am

Probably not: I don't know how about preemption model, but especially
some maybe unkilled timers/watchdogs or workqueues could be considered.
Of course this idea looks very unprobable (should happen with less
than 4GB too), but should be quite easy to verify by adding some
temporary spinlocks around these rx ring operations?

Jarek P.
--

From: Jarek Poplawski
Date: Monday, May 5, 2008 - 12:50 am

Hmm#2... Of course for testing without smp disabling local irqs should
be enough. BTW... since this check is done during __free_slab it seems
such a corruption could take place before this closing, so maybe this
atl1_intr_rx() should be considered in these races yet...

Jarek P.
--

From: Bart Van Assche
Date: Sunday, May 4, 2008 - 8:02 am

Are you familiar with the kernel command-line option slub_debug=FZPU ?
If this was not yet enabled, recompiling your kernel with
CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y enabled and specifying
slub_debug=FZPU on the kernel command line will probably provide more
detailed information about the cause of the memory corruption.

Bart.
--

From: Jay Cliburn
Date: Sunday, May 4, 2008 - 12:52 pm

On Sun, 4 May 2008 17:02:59 +0200

Bart,

Using slub_debug=FZPU didn't seem to add more information.
CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y are (and were previously) set.

Thanks.

title Fedora (2.6.26-rc1)
        root (hd0,5)
        kernel /vmlinuz-2.6.26-rc1 ro root=LABEL=/1 console=ttyS0,38400 console=tty0 slub_debug=FZPU
        initrd /initrd-2.6.26-rc1.img


=============================================================================
BUG kmalloc-2048: Poison overwritten
-----------------------------------------------------------------------------

INFO: 0xffff81010004297a-0xffff810100042f71. First byte 0x0 instead of 0x6b
INFO: Allocated in dev_alloc_skb+0x16/0x2c age=5813 cpu=0 pid=3029
INFO: Freed in skb_release_data+0xa8/0xad age=201 cpu=0 pid=0
INFO: Slab 0xffffe20005801600 objects=15 used=0 fp=0xffff810100045b18 flags=0x8000000000002082
INFO: Object 0xffff810100042968 @offset=10600 fp=0xffff8101000418d8

Bytes b4 0xffff810100042958:  aa 91 fd ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a �.��....ZZZZZZZZ
  Object 0xffff810100042968:  6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
  Object 0xffff810100042978:  6b 6b 00 17 31 4e 9d 41 00 0f db bc af 14 08 00 kk..1N.A..ۼ�...
  Object 0xffff810100042988:  45 00 00 4e 87 5e 00 00 40 11 6e 82 c0 a8 01 fe E..N.^..@.n.�������.�
  Object 0xffff810100042998:  c0 a8 01 70 00 89 00 89 00 3a 3b 67 00 09 00 00 ��.p.....:;g....
  Object 0xffff8101000429a8:  00 01 00 00 00 00 00 00 20 43 4b 41 41 41 41 41 .........CKAAAAA
  Object 0xffff8101000429b8:  41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 41 AAAAAAAAAAAAAAAA
  Object 0xffff8101000429c8:  41 41 41 41 41 41 41 41 41 00 00 21 00 01 f0 53 AAAAAAAAA..!..
  Object 0xffff8101000429d8:  56 17 df 3e 3b 9f b7 1f 2d 29 f0 68 cf 4d 61 97 V.�>;.�.-)�h�Ma.
 Redzone 0xffff810100043168:  bb bb bb bb bb bb bb bb                         �
Previous thread: [PATCH v2] unify sys_pipe implementation by Ulrich Drepper on Saturday, May 3, 2008 - 11:01 am. (5 messages)

Next thread: nommu compile error by Adrian Bunk on Saturday, May 3, 2008 - 11:41 am. (1 message)