Re: Potential re(4) / netbsd-4 / i386 problem?

Previous thread: Re: automatic vnd'ing on kernel boot by Toru Nishimura on Tuesday, March 2, 2010 - 2:12 pm. (1 message)

Next thread: removing aiboost(4) as redundant by Constantine Aleksandrovich Murenin on Friday, March 5, 2010 - 11:47 pm. (9 messages)
From: Brad du Plessis
Date: Tuesday, March 2, 2010 - 11:23 pm

Hi all,

I've been seeing panics on a netbsd-4/i386 machine which appear to be 
related to the reception of oversized frames:

re0: discarding oversize frame (len=8813)
re0: discarding oversize frame (len=2191)
re0: discarding oversize frame (len=10478)
uvm_fault(0xc0a44aa0, 0xe4ff7000, 1) -> 0xe
kernel: supervisor trap page fault, code=0
Stopped in pid 592.1 (cat_nw) at        netbsd:m_tag_delete_chain+0x20: movl    0(%ebx),%eax
db{1}> bt
m_tag_delete_chain(c2686d00,0,881cad0,5,f2946af0) at netbsd:m_tag_delete_chain+0x20
sbdrop(c1cb84c4,d1c,d108aba4,0,0) at netbsd:sbdrop+0x25e
sbflush(c1cb84c4,c1cb84f4,c1cb84c4,d1c,c22dd94c) at netbsd:sbflush+0x2f
tcp_disconnect(c22dd94c,0,0,c1cc0800,0) at netbsd:tcp_disconnect+0x43
tcp_usrreq(c1cb8444,6,0,0,0) at netbsd:tcp_usrreq+0x285
sodisconnect(c1cb8444,1,0,0,d1c) at netbsd:sodisconnect+0xb0
soclose(c1cb8444,0,d108ac2c,c043fd1c,d1210d04) at netbsd:soclose+0x1e0
soo_close(d1210d04,d0f7adec,d108ac04,d0deb480,62) at netbsd:soo_close+0x1b
closef(d1210d04,d0f7adec,d108ac68,d0f789fc,c1cc0800) at netbsd:closef+0x14c
syscall_plain() at netbsd:syscall_plain+0xa4
--- syscall (number 6) ---
0xbaff932b:
db{1}>


I can get the panic quite regularly (will crash after a few hours) when 
loading the network. I've seen a few different back traces from the 
panics but the "discarding oversize frame" message always leads up to 
the panic (and I only see these messages before a panic, not when it's 
running fine).

I removed the "ppsratecheck" in if_ethersubr.c so I could see every 
oversized frame instance, and I took it upon myself to put a few 
printouts in rtl8169.c to figure out what was going on. What I found was 
that the panic always follows the arrival of 2 oversized frames in the 
handling of a read interrupt. (i.e. I put a printout before and after 
the loop in rtl8169.c:re_rxeof and found the crash happening after 2 
"discarding oversize frame" instances in 1 loop)

The netbsd-4 source is from about 2 days ago and I haven't ...
From: Izumi Tsutsui
Date: Friday, March 5, 2010 - 3:18 am

- options DIAGNOSTIC might help debug
- does it happen on UP kernel (GENERIC, not GENERIC.MP), or netbsd-5?

---
Izumi Tsutsui
From: Brad du Plessis
Date: Friday, March 5, 2010 - 3:39 am

I'm busy trying to reproduce with options DIAGNOSTIC right now. So far 
Will try after this after the next panic.

I strongly suspect that the problems were introduced after pullup-4 
#1339. I've got 13 netbsd-4 machines that were updated to recent 
netbsd-4 source containing pullup-4 #1339 and about 5 of these have 
suddenly started panicking in the 3 days since the update (these were 
all running fine for months before this update). All of these machines 
experience heavy network loading 24/7.

Thanks,
Brad
From: Brad du Plessis
Date: Wednesday, July 21, 2010 - 5:06 am

I've managed to reproduce this now in netbsd-5 too (source is about 3 
months old, not sure if there have been any changes since):

re0: discarding oversize frame (len=9041)
re0: discarding oversize frame (len=16158)
panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file "../../../../kern_subr_pool.c", line 2580


As I think I've said before, the actual crash point is different
every time but the panic is always preceded by the discarding
oversize frame. Sometimes the len in the oversize frame message
is len=-1.

Any advice?

Thanks,
  Brad
From: Brad du Plessis
Date: Wednesday, July 21, 2010 - 5:55 am

Sorry, typo, should be:
panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file "../../../../kern/subr_pool.c", line 2580
From: Manuel Bouyer
Date: Friday, July 23, 2010 - 3:12 am

Is it possible that the re device is writing past its buffer (via DMA) and
overwriting random memory?

-- 
Manuel Bouyer <bouyer@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--
From: der Mouse
Date: Friday, July 23, 2010 - 3:54 am

Isn't that one thing the iommu is for?  Oh, wait....

Well, use machines whose designers cut corners on hardware design and
guess what happens.

Actually, my main reason for writing is to mention that I have a
laptop, running 4.0.1, with an re onboard, and have never seen such
random crashes.  I can give more details if they matter.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
From: Brad du Plessis
Date: Sunday, July 25, 2010 - 11:22 pm

I've got 3 motherboards with re onboard that I've tested; 2 of the 3
have the problem. I checked the re hwrev and the one that works fine is
0x28000000. The 2 boards that don't work have hwrev 0x38000000 and
0x3C400000. The board that's fine is a commercial Intel DG41MJ while the
other 2 are both DFI industrial boards (LT600-DR, LT330-B).
From: der Mouse
Date: Sunday, July 25, 2010 - 11:50 pm

My laptop is a Sony Vaio (PCG-5G3L).  The re is

re0: interrupting at ioapic0 pin 18 (irq 7)
re0: Ethernet address 00:13:a9:f2:6f:af
re0: using 256 tx descriptors
rlphy0 at re0 phy 7: RTL8201L 10/100 media interface, rev. 1
rlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

I don't see anything there that looks like the rev numbers you're
talking about.  While now is not a good time, I'll have a look at the
code and see if I can find the hwrev value you're talking about and
print out its value for my hardware.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
From: Brad du Plessis
Date: Monday, July 26, 2010 - 12:34 am

I manually printed it out in re_attach in rtl8169.c.

Thanks,
  Brad
From: der Mouse
Date: Monday, July 26, 2010 - 8:29 pm

0x34000000.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse@rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
From: Brad du Plessis
Date: Friday, July 23, 2010 - 4:14 am

I've tried this and I'm still able to reproduce the problem.



Is it possible that I can increase this buffer and check it for overrun 
when data arrives? If so, where can I find this buffer?

Brad
From: David Young
Date: Friday, July 23, 2010 - 2:19 pm

Check for buffer overruns by reserving a "guard region" on each side of
a DMA buffer.  Write guard bytes (0xdeadbeefdeadbeef or something) in
each guard region.  Check whether any bytes in the guard region were
modified before you reclaim a DMA buffer.
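
The guard-region check described above could look roughly like the following. This is a minimal userland sketch, not driver code: it uses malloc() instead of bus_dma(9), and the names guarded_buf, guarded_alloc, guarded_check, GUARD_LEN and GUARD_BYTE are all hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  64    /* guard bytes on each side (arbitrary, hypothetical) */
#define GUARD_BYTE 0xA5  /* fill pattern (arbitrary, hypothetical) */

struct guarded_buf {
	uint8_t *raw;     /* start of the allocation, i.e. the leading guard */
	uint8_t *payload; /* region the device would be told to write into */
	size_t   len;     /* payload length in bytes */
};

/* Allocate len payload bytes with a filled guard region on each side. */
static int
guarded_alloc(struct guarded_buf *gb, size_t len)
{
	assert(gb != NULL && len > 0);
	gb->raw = malloc(len + 2 * GUARD_LEN);
	if (gb->raw == NULL)
		return -1;
	memset(gb->raw, GUARD_BYTE, GUARD_LEN);
	memset(gb->raw + GUARD_LEN + len, GUARD_BYTE, GUARD_LEN);
	gb->payload = gb->raw + GUARD_LEN;
	gb->len = len;
	return 0;
}

/*
 * Check both guards before reclaiming the buffer.
 * Returns 0 if intact, -1 if the leading guard was modified,
 * 1 if the trailing guard was modified.
 */
static int
guarded_check(const struct guarded_buf *gb)
{
	size_t i;

	for (i = 0; i < GUARD_LEN; i++)
		if (gb->raw[i] != GUARD_BYTE)
			return -1;
	for (i = 0; i < GUARD_LEN; i++)
		if (gb->payload[gb->len + i] != GUARD_BYTE)
			return 1;
	return 0;
}

static void
guarded_free(struct guarded_buf *gb)
{
	free(gb->raw);
	gb->raw = gb->payload = NULL;
	gb->len = 0;
}
```

In a real driver the check would presumably run after the DMA sync for the buffer and before the buffer is handed back or reused; a nonzero result says which side was overrun.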

If that doesn't detect any problems, consider reclaiming DMA buffers
lazily: don't reclaim a buffer immediately, but put it on a queue.  When
the queue grows N buffers deep, reclaim the first buffer you put on
it.  Maybe an errant DMA lands on the buffer while it "rests" on the
queue?  Check a buffer's guard regions before you reclaim it.  Consider
comparing the whole buffer against a copy you make when you put it on
the queue.
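
The lazy-reclaim idea above can be sketched as a small fixed-depth FIFO. Again this is only an illustration under assumptions: rest_queue, rest_init, rest_park and REST_DEPTH are hypothetical names, and the code says nothing about how re(4) actually manages its receive buffers.

```c
#include <assert.h>
#include <stddef.h>

#define REST_DEPTH 4  /* "N buffers deep"; the value is an arbitrary choice */

struct rest_queue {
	void  *slot[REST_DEPTH];
	size_t head;  /* index of the oldest parked buffer */
	size_t count; /* number of buffers currently parked */
};

static void
rest_init(struct rest_queue *q)
{
	assert(q != NULL);
	q->head = 0;
	q->count = 0;
}

/*
 * Park buf on the queue instead of reclaiming it right away.
 * Returns the oldest parked buffer once the queue is already full
 * (that one is now due for checking and reclaiming), else NULL.
 */
static void *
rest_park(struct rest_queue *q, void *buf)
{
	void *oldest = NULL;

	assert(q != NULL && buf != NULL);
	if (q->count == REST_DEPTH) {
		/* Full: the oldest slot is also where the new tail goes. */
		oldest = q->slot[q->head];
		q->slot[q->head] = buf;
		q->head = (q->head + 1) % REST_DEPTH;
	} else {
		q->slot[(q->head + q->count) % REST_DEPTH] = buf;
		q->count++;
	}
	return oldest;
}
```

Whatever rest_park() returns is the buffer to inspect (guard regions, or a byte-for-byte comparison against a saved copy) before it is really reclaimed.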

Dave

-- 
David Young             OJC Technologies
dyoung@ojctech.com      Urbana, IL * (217) 278-3933
From: Brad du Plessis
Date: Thursday, October 28, 2010 - 12:42 am

I've narrowed down the problem here to a specific change.
Basically with netbsd-4 branch I see the failure, but if I revert
only the file:

./src/sys/dev/mii/rgephy.c

to netbsd-4-0-1-RELEASE the problem goes away. Looking at the difference
between the 2 revisions I would guess the most likely cause is the
difference in register writes in rgephy_reset?

Unfortunately for my purposes one of the two motherboard types I have
exhibiting the problem has an RTL8111C which (without the netbsd-4 changes)
fails to detect the media automatically (forcing it to 1000baseT has it
sync at 100baseTX for some reason).


Are there any changes I could make to the netbsd-4 rgephy.c to find a
fix for this? (netbsd-5 has the same problem, by the way.)

Thanks,
  Brad


# cd /usr/src/sys/dev/mii
# cvs diff -u -r netbsd-4-0-1-RELEASE -r netbsd-4 rgephy.c

Index: rgephy.c
===================================================================
RCS file: /cvsroot/src/sys/dev/mii/rgephy.c,v
retrieving revision 1.15
retrieving revision 1.15.2.1
diff -u -r1.15 -r1.15.2.1
--- rgephy.c    29 Nov 2006 13:57:59 -0000    1.15
+++ rgephy.c    18 Aug 2009 09:46:50 -0000    1.15.2.1
@@ -1,4 +1,4 @@
-/*    $NetBSD: rgephy.c,v 1.15 2006/11/29 13:57:59 tsutsui Exp $    */
+/*    $NetBSD: rgephy.c,v 1.15.2.1 2009/08/18 09:46:50 bouyer Exp $    */

  /*
   * Copyright (c) 2003
@@ -33,7 +33,7 @@
   */

  #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: rgephy.c,v 1.15 2006/11/29 13:57:59 tsutsui Exp $");
+__KERNEL_RCSID(0, "$NetBSD: rgephy.c,v 1.15.2.1 2009/08/18 09:46:50 bouyer Exp $");


  /*
@@ -61,7 +61,12 @@
  static int    rgephy_match(struct device *, struct cfdata *, void *);
  static void    rgephy_attach(struct device *, struct device *, void *);

-CFATTACH_DECL(rgephy, sizeof(struct mii_softc),
+struct rgephy_softc {
+    struct mii_softc mii_sc;
+    int mii_revision;
+};
+
+CFATTACH_DECL(rgephy, sizeof(struct rgephy_softc),
      rgephy_match, rgephy_attach, mii_phy_detach, ...
From: Brad du Plessis
Date: Monday, January 3, 2011 - 7:24 am

A quick update.

My oversize frame problems are greatly reduced by
reverting rgephy.c as above, I have however still seen
1 instance of an oversize frame on a system that was at
the time experiencing very high disk I/O load. The
network device stopped working on this system until the
system was rebooted.

So as I see it there appear to be 2 problems which may or
may not be related:

1. Something in the netbsd-4 branch version of rgephy.c
    causes the system to experience a high number of what
    it thinks are oversize frames (on my hardware, given the
    nature of the network traffic in my test). Reverting this file to
    netbsd-4-0-1-RELEASE cures this.

2. Once the driver does handle an oversize frame, the
    kernel will either panic with what appears to be memory
    corruption or the LAN will stop working.


  Brad
From: Brad du Plessis
Date: Monday, March 8, 2010 - 2:35 am

Had a kernel running with options DIAGNOSTIC and while I didn't see any
printout other than those I saw before, I was able to get a kgdb session
up and I have a back trace. Printed out a few bits and pieces, not sure
what would be of real interest:

Program received signal SIGSEGV, Segmentation fault.
0xc05dba53 in ether_input (ifp=0xc21f503c, m=0xc24a4800) at ../../../../net/if_ethersubr.c:648
648             etype = ntohs(eh->ether_type);
(gdb) bt
#0  0xc05dba53 in ether_input (ifp=0xc21f503c, m=0xc24a4800) at ../../../../net/if_ethersubr.c:648
#1  0xc0382d93 in re_rxeof (sc=0xc21f5000) at ../../../../dev/ic/rtl8169.c:1374
#2  0xc03832a2 in re_intr (arg=0xc21f5000) at ../../../../dev/ic/rtl8169.c:1565
#3  0xc062a5d5 in intr_biglock_wrapper (vp=0xc21aef80) at ../../../../arch/x86/x86/intr.c:544
#4  0xc0108198 in Xintr_ioapic_level11 ()
#5  0xc21aef80 in ?? ()
#6  0x00000000 in ?? ()
(gdb) p eh
$1 = (struct ether_header *) 0x58a07f87
(gdb) p m
$2 = (struct mbuf *) 0xc24a4800
(gdb) p *eh
Cannot access memory at address 0x58a07f87
(gdb) p *m
$3 = {m_hdr = {mh_next = 0x388a43eb, mh_nextpkt = 0x3bd2a781,
     mh_data = 0x58a07f87 <Address 0x58a07f87 out of bounds>, mh_owner = 0x8e878f3c, mh_len = 1082,
     mh_flags = -187037215, mh_paddr = 515969279, mh_type = 27862}, M_dat = {MH = {MH_pkthdr = {
         rcvif = 0xc21f503c, tags = {slh_first = 0xaf76}, len = 1082, csum_flags = 197,
         csum_data = 2687232, segsz = 419436127}, MH_dat = {MH_ext = {
           ext_buf = 0x1d010ad0 <Address 0x1d010ad0 out of bounds>, ext_free = 0x450008,
           ext_arg = 0x84912800, ext_size = 104857600, ext_type = 0xa8c03063, ext_nextref = 0xa8c00102,
           ext_prevref = 0x657dca02, ext_un = {extun_paddr = 1729288689, extun_pgs = {0x6712d9f1,
               0x9d4a3f62, 0x1050e731, 0xeae401b, 0x0, 0x0, 0xd0dcbbd0, 0x5b2500f, 0x0, 0x290100,
               0x1900165f, 0x80010ad0, 0x450008, 0x7dc2a005, 0x6400000, 0xa8c0b12c, 0xa8c00f02}},
           ...
From: David Young
Date: Friday, March 5, 2010 - 4:34 pm

A few questions come to mind as I read re_rxeof():

Does the hardware always respect the buffer length that NetBSD writes to
the Rx descriptors?  Buffer-overrun bugs do happen, even in hardware.

How many fragments do the oversize frames come in?

The driver seems to expect that !RE_RDESC_STAT_EOF is mutually exclusive
with RE_RDESC_STAT_RXERRSUM.  Maybe that is not really so?

Dave

-- 
David Young             OJC Technologies
dyoung@ojctech.com      Urbana, IL * (217) 278-3933
From: Brad du Plessis
Date: Thursday, March 11, 2010 - 1:39 am

I put a printout in re_rxeof() when sc->re_head != NULL and I never see
this (even before it panics), so I assume I'm never actually getting to
process multi-fragment packets. I also enabled the RE_DEBUG printouts
(in re_rxeof) and I saw a few of those printouts moments before the panic:

re0: discarding oversize frame (len=-4)
re0: RX error (rxstat = 0x1f97e350), frame alignment error, out of buffer space, CRC error
re0: RX error (rxstat = 0x1970a3b0), frame alignment error, FIFO overrun, giant packet
re0: discarding oversize frame (len=10703)
re0: RX error (rxstat = 0x1d52f996), frame alignment error, FIFO overrun
re0: RX error (rxstat = 0x18736652), frame alignment error, FIFO overrun, giant packet
re0: RX error (rxstat = 0x18706240), frame alignment error, FIFO overrun, giant packet
uvm_fault(0xc0c90dc0, 0x264b9000, 1) -> 0xe

Not sure if this helps at all.

Regards,
  Brad