Hi all,
I've been seeing panics on a netbsd-4/i386 machine which appear to be
related to the reception of oversized frames:
re0: discarding oversize frame (len=8813)
re0: discarding oversize frame (len=2191)
re0: discarding oversize frame (len=10478)
uvm_fault(0xc0a44aa0, 0xe4ff7000, 1) -> 0xe
kernel: supervisor trap page fault, code=0
Stopped in pid 592.1 (cat_nw) at netbsd:m_tag_delete_chain+0x20:
movl 0(%ebx),%eax
db{1}> bt
m_tag_delete_chain(c2686d00,0,881cad0,5,f2946af0) at netbsd:m_tag_delete_chain+0x20
sbdrop(c1cb84c4,d1c,d108aba4,0,0) at netbsd:sbdrop+0x25e
sbflush(c1cb84c4,c1cb84f4,c1cb84c4,d1c,c22dd94c) at netbsd:sbflush+0x2f
tcp_disconnect(c22dd94c,0,0,c1cc0800,0) at netbsd:tcp_disconnect+0x43
tcp_usrreq(c1cb8444,6,0,0,0) at netbsd:tcp_usrreq+0x285
sodisconnect(c1cb8444,1,0,0,d1c) at netbsd:sodisconnect+0xb0
soclose(c1cb8444,0,d108ac2c,c043fd1c,d1210d04) at netbsd:soclose+0x1e0
soo_close(d1210d04,d0f7adec,d108ac04,d0deb480,62) at netbsd:soo_close+0x1b
closef(d1210d04,d0f7adec,d108ac68,d0f789fc,c1cc0800) at netbsd:closef+0x14c
syscall_plain() at netbsd:syscall_plain+0xa4
--- syscall (number 6) ---
0xbaff932b:
db{1}>
I can get the panic quite regularly (will crash after a few hours) when
loading the network. I've seen a few different back traces from the
panics but the "discarding oversize frame" message always leads up to
the panic (and I only see these messages before a panic, not when it's
running fine).
I removed the "ppsratecheck" in if_ethersubr.c so I could see every
oversized frame instance, and I took it upon myself to put a few
printouts in rtl8169.c to figure out what was going on. What I found was
that the panic always follows the arrival of 2 oversized frames in the
handling of a read interrupt. (i.e. I put a printout before and after
the loop in rtl8169.c:re_rxeof and found the crash happening after 2
"discarding oversize frame" instances in 1 loop)
The netbsd-4 source is from about 2 days ago and I haven't ...

- options DIAGNOSTIC might help debug
- does it happen on UP kernel (GENERIC, not GENERIC.MP), or netbsd-5?
---
Izumi Tsutsui
I'm busy trying to reproduce with options DIAGNOSTIC right now. So far

Will try this after the next panic.

I strongly suspect that the problems were introduced after pullup-4 #1339. I've got 13 netbsd-4 machines that were updated to recent netbsd-4 source containing pullup-4 #1339, and about 5 of these have suddenly started panicking in the 3 days since the update (these were all running fine for months before this update). All of these machines experience heavy network loading 24/7.
Thanks,
Brad
I've managed to reproduce this now in netbsd-5 too (source is about 3 months old, not sure if there have been any changes since):
re0: discarding oversize frame (len=9041)
re0: discarding oversize frame (len=16158)
panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file "../../../../kern_subr_pool.c", line 2580
As I think I've said before, the actual crash point is different every time but the panic is always preceded by the discarding oversize frame. Sometimes the len in the oversize frame message is len=-1.
Any advice?
Thanks,
Brad
Sorry, typo, should be:
panic: kernel diagnostic assertion "pcg->pcg_avail == 0" failed: file "../../../../kern/subr_pool.c", line 2580
Is it possible that the re device is writing past its buffer (via DMA) and
overwriting random memory?
--
Manuel Bouyer <bouyer@antioche.eu.org>
NetBSD: 26 years of experience will always make the difference
--
Isn't that one thing the iommu is for? Oh, wait....

Well, use machines whose designers cut corners on hardware design and guess what happens.

Actually, my main reason for writing is to mention that I have a laptop, running 4.0.1, with an re onboard, and have never seen such random crashes. I can give more details if they matter.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
I've got 3 motherboards with re onboard that I've tested; 2 of the 3 have the problem. I checked the re hwrev and the one that works fine is 0x28000000. The 2 boards that don't work have hwrev 0x38000000 and 0x3C400000. The board that's fine is a commercial Intel DG41MJ while the other 2 are both DFI industrial boards (LT600-DR, LT330-B).
My laptop is a Sony Vaio (PCG-5G3L). The re is

re0: interrupting at ioapic0 pin 18 (irq 7)
re0: Ethernet address 00:13:a9:f2:6f:af
re0: using 256 tx descriptors
rlphy0 at re0 phy 7: RTL8201L 10/100 media interface, rev. 1
rlphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto

I don't see anything there that looks like the rev numbers you're talking about. While now is not a good time, I'll have a look at the code and see if I can find the hwrev value you're talking about and print out its value for my hardware.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
I manually printed it out in re_attach in rtl8169.c.
Thanks,
Brad
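For anyone wanting to compare revisions the same way, here is a minimal sketch of how such a value could be derived and formatted. The mask value and both function names are assumptions for illustration (the mask is recalled from Realtek driver headers, not quoted from this thread); the register read itself is left out because the macro and register names used in rtl8169.c are not shown here.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * ASSUMPTION: bits of the transmit configuration register that carry
 * the hardware revision. Check the driver's register header before
 * relying on this value.
 */
#define HWREV_MASK 0x7CC00000U

/* Hypothetical helper: isolate the revision bits from a raw register value. */
static uint32_t
hwrev_from_txcfg(uint32_t txcfg)
{
	return txcfg & HWREV_MASK;
}

/*
 * Hypothetical helper: format the revision the way a debugging printout
 * might show it. Returns the snprintf() result.
 */
static int
format_hwrev(char *out, size_t outlen, const char *ifname, uint32_t txcfg)
{
	assert(out != NULL && ifname != NULL);
	return snprintf(out, outlen, "%s: hwrev 0x%08" PRIx32,
	    ifname, hwrev_from_txcfg(txcfg));
}
```

All of the hwrev values reported in this thread fit inside the assumed mask, so masking leaves them unchanged.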
0x34000000.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
I've tried this and I'm still able to reproduce the problem. Is it possible that I can increase this buffer and check it for overrun when data arrives? If so, where can I find this buffer?
Brad
Check for buffer overruns by reserving a "guard region" on each side of a DMA buffer. Write guard bytes (0xdeadbeefdeadbeef or something) in each guard region. Check whether any bytes in the guard region were modified before you reclaim a DMA buffer.

If that doesn't detect any problems, consider reclaiming DMA buffers lazily: don't reclaim a buffer immediately, but put it on a queue. When the queue grows N buffers deep, reclaim the first buffer you put on it. Maybe an errant DMA lands on the buffer while it "rests" on the queue? Check a buffer's guard regions before you reclaim it. Consider comparing the whole buffer against a copy you make when you put it on the queue.

Dave

--
David Young OJC Technologies
dyoung@ojctech.com Urbana, IL * (217) 278-3933
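The guard-region check described above can be sketched in plain user-space C. This is only an illustration of the technique: the struct, the function names, the guard length and the fill byte are all hypothetical, and it uses malloc() rather than whatever the driver actually uses to allocate and map its receive buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  64     /* bytes reserved on each side (arbitrary choice) */
#define GUARD_BYTE 0xA5   /* fill pattern; any value unlikely in real data */

/* Hypothetical wrapper around one receive buffer and its two guard regions. */
struct guarded_buf {
	uint8_t *base;  /* start of the whole allocation (front guard) */
	uint8_t *data;  /* region that would be handed to the device */
	size_t   len;   /* usable length of data */
};

/* Allocate len usable bytes with a filled guard region on each side. */
static int
guarded_alloc(struct guarded_buf *gb, size_t len)
{
	assert(gb != NULL);
	gb->base = malloc(len + 2 * GUARD_LEN);
	if (gb->base == NULL)
		return -1;
	gb->data = gb->base + GUARD_LEN;
	gb->len = len;
	memset(gb->base, GUARD_BYTE, GUARD_LEN);
	memset(gb->data + len, GUARD_BYTE, GUARD_LEN);
	return 0;
}

/* Count guard bytes that no longer hold the fill pattern; 0 means intact. */
static size_t
guarded_damage(const struct guarded_buf *gb)
{
	const uint8_t *rear = gb->data + gb->len;
	size_t i, bad = 0;

	for (i = 0; i < GUARD_LEN; i++) {
		if (gb->base[i] != GUARD_BYTE)
			bad++;
		if (rear[i] != GUARD_BYTE)
			bad++;
	}
	return bad;
}

/* Check the guards, then release the buffer; returns the damage count. */
static size_t
guarded_free(struct guarded_buf *gb)
{
	size_t bad = guarded_damage(gb);

	free(gb->base);
	gb->base = gb->data = NULL;
	gb->len = 0;
	return bad;
}
```

A caller would log or panic when guarded_damage() returns nonzero just before a buffer is reclaimed. A stray write that happens to store the fill byte goes undetected, and the check says nothing about writes that land beyond the guard regions, which is where the lazy-reclaim idea comes in.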
I've narrowed down the problem here to a specific change.
Basically with netbsd-4 branch I see the failure, but if I revert
only the file:
./src/sys/dev/mii/rgephy.c
to netbsd-4-0-1-RELEASE the problem goes away. Looking at the difference
between the 2 revisions I would guess the most likely cause is the
difference in register writes in rgephy_reset?
Unfortunately for my purposes one of the two motherboard types I have
exhibiting the problem has an RTL8111C which (without the netbsd-4 changes)
fails to detect the media automatically (forcing it to 1000baseT has it
sync at 100baseTX for some reason).
Are there any changes I could make to the netbsd-4 rgephy.c to find a
fix for this? (netbsd-5 has the same problem by the way)
Thanks,
Brad
# cd /usr/src/sys/dev/mii
# cvs diff -u -r netbsd-4-0-1-RELEASE -r netbsd-4 rgephy.c
Index: rgephy.c
===================================================================
RCS file: /cvsroot/src/sys/dev/mii/rgephy.c,v
retrieving revision 1.15
retrieving revision 1.15.2.1
diff -u -r1.15 -r1.15.2.1
--- rgephy.c 29 Nov 2006 13:57:59 -0000 1.15
+++ rgephy.c 18 Aug 2009 09:46:50 -0000 1.15.2.1
@@ -1,4 +1,4 @@
-/* $NetBSD: rgephy.c,v 1.15 2006/11/29 13:57:59 tsutsui Exp $ */
+/* $NetBSD: rgephy.c,v 1.15.2.1 2009/08/18 09:46:50 bouyer Exp $ */
/*
* Copyright (c) 2003
@@ -33,7 +33,7 @@
*/
#include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: rgephy.c,v 1.15 2006/11/29 13:57:59 tsutsui Exp $");
+__KERNEL_RCSID(0, "$NetBSD: rgephy.c,v 1.15.2.1 2009/08/18 09:46:50 bouyer Exp $");
/*
@@ -61,7 +61,12 @@
static int rgephy_match(struct device *, struct cfdata *, void *);
static void rgephy_attach(struct device *, struct device *, void *);
-CFATTACH_DECL(rgephy, sizeof(struct mii_softc),
+struct rgephy_softc {
+ struct mii_softc mii_sc;
+ int mii_revision;
+};
+
+CFATTACH_DECL(rgephy, sizeof(struct rgephy_softc),
rgephy_match, rgephy_attach, mii_phy_detach, ...

A quick update.
My oversize frame problems are greatly reduced by
reverting rgephy.c as above, I have however still seen
1 instance of an oversize frame on a system that was at
the time experiencing very high disk I/O load. The
network device stopped working on this system until the
system was rebooted.
So as I see it there appear to be 2 problems which may or
may not be related:
1. Something in the netbsd-4 branch version of rgephy.c
causes the system to experience a high number of what
it thinks are oversize frames (on my hardware, given the
nature of the network traffic in my test). Reverting this file to
netbsd-4-0-1-RELEASE cures this.
2. Once the driver does handle an oversize frame, the
kernel will either panic with what appears to be memory
corruption or the LAN will stop working.
Brad
Had a kernel running with options DIAGNOSTIC and while I didn't see any
printout other than those I saw before, I was able to get a kgdb session
up and I have a back trace. Printed out a few bits and pieces, not sure
what would be of real interest:
Program received signal SIGSEGV, Segmentation fault.
0xc05dba53 in ether_input (ifp=0xc21f503c, m=0xc24a4800) at ../../../../net/if_ethersubr.c:648
648 etype = ntohs(eh->ether_type);
(gdb) bt
#0 0xc05dba53 in ether_input (ifp=0xc21f503c, m=0xc24a4800) at ../../../../net/if_ethersubr.c:648
#1 0xc0382d93 in re_rxeof (sc=0xc21f5000) at ../../../../dev/ic/rtl8169.c:1374
#2 0xc03832a2 in re_intr (arg=0xc21f5000) at ../../../../dev/ic/rtl8169.c:1565
#3 0xc062a5d5 in intr_biglock_wrapper (vp=0xc21aef80) at ../../../../arch/x86/x86/intr.c:544
#4 0xc0108198 in Xintr_ioapic_level11 ()
#5 0xc21aef80 in ?? ()
#6 0x00000000 in ?? ()
(gdb) p eh
$1 = (struct ether_header *) 0x58a07f87
(gdb) p m
$2 = (struct mbuf *) 0xc24a4800
(gdb) p *eh
Cannot access memory at address 0x58a07f87
(gdb) p *m
$3 = {m_hdr = {mh_next = 0x388a43eb, mh_nextpkt = 0x3bd2a781,
mh_data = 0x58a07f87 <Address 0x58a07f87 out of bounds>, mh_owner =
0x8e878f3c, mh_len = 1082,
mh_flags = -187037215, mh_paddr = 515969279, mh_type = 27862},
M_dat = {MH = {MH_pkthdr = {
rcvif = 0xc21f503c, tags = {slh_first = 0xaf76}, len = 1082,
csum_flags = 197,
csum_data = 2687232, segsz = 419436127}, MH_dat = {MH_ext = {
ext_buf = 0x1d010ad0 <Address 0x1d010ad0 out of bounds>,
ext_free = 0x450008,
ext_arg = 0x84912800, ext_size = 104857600, ext_type =
0xa8c03063, ext_nextref = 0xa8c00102,
ext_prevref = 0x657dca02, ext_un = {extun_paddr = 1729288689,
extun_pgs = {0x6712d9f1,
0x9d4a3f62, 0x1050e731, 0xeae401b, 0x0, 0x0, 0xd0dcbbd0,
0x5b2500f, 0x0, 0x290100,
0x1900165f, 0x80010ad0, 0x450008, 0x7dc2a005, 0x6400000,
0xa8c0b12c, 0xa8c00f02}},
...

A few questions come to mind as I read re_rxeof():

Does the hardware always respect the buffer length that NetBSD writes to the Rx descriptors? Buffer-overrun bugs do happen, even in hardware.

How many fragments do the oversize frames come in? The driver seems to expect that !RE_RDESC_STAT_EOF is mutually exclusive with RE_RDESC_STAT_RXERRSUM. Maybe that is not really so?

Dave

--
David Young OJC Technologies
dyoung@ojctech.com Urbana, IL * (217) 278-3933
I put a printout in re_rxeof() when sc->re_head!=NULL and I never see this (even before it panics), so I assume I'm never actually getting to process multi-fragment packets.
I also enabled the RE_DEBUG printouts (in re_rxeof) and I saw a few of those printouts moments before the panic:
re0: discarding oversize frame (len=-4)
re0: RX error (rxstat = 0x1f97e350), frame alignment error, out of buffer space, CRC error
re0: RX error (rxstat = 0x1970a3b0), frame alignment error, FIFO overrun, giant packet
re0: discarding oversize frame (len=10703)
re0: RX error (rxstat = 0x1d52f996), frame alignment error, FIFO overrun
re0: RX error (rxstat = 0x18736652), frame alignment error, FIFO overrun, giant packet
re0: RX error (rxstat = 0x18706240), frame alignment error, FIFO overrun, giant packet
uvm_fault(0xc0c90dc0, 0x264b9000, 1) -> 0xe
Not sure if this helps at all.
Regards,
Brad
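To make the rxstat values above easier to read against the question about end-of-frame and error-summary bits, here is a small stand-alone decoder. It is a sketch under assumptions: the RXSTAT_* names are hypothetical, and the bit values are recalled from the Realtek receive descriptor layout rather than taken from this thread, so they should be checked against the driver's register header. The "runt packet" label in particular does not appear in any output quoted here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* ASSUMED bit positions in the receive descriptor status word. */
#define RXSTAT_SOF       0x20000000U  /* first fragment of a frame */
#define RXSTAT_EOF       0x10000000U  /* last fragment of a frame */
#define RXSTAT_FRALIGN   0x08000000U
#define RXSTAT_BUFOFLOW  0x00800000U
#define RXSTAT_FIFOOFLOW 0x00400000U
#define RXSTAT_GIANT     0x00200000U
#define RXSTAT_RXERRSUM  0x00100000U  /* summary: some error was flagged */
#define RXSTAT_RUNT      0x00080000U
#define RXSTAT_CRCERR    0x00040000U

/*
 * Write a comma-separated list of the error bits set in rxstat into out.
 * Returns the number of error bits found.
 */
static int
rxstat_describe(uint32_t rxstat, char *out, size_t outlen)
{
	static const struct { uint32_t bit; const char *name; } errs[] = {
		{ RXSTAT_FRALIGN,   "frame alignment error" },
		{ RXSTAT_BUFOFLOW,  "out of buffer space" },
		{ RXSTAT_FIFOOFLOW, "FIFO overrun" },
		{ RXSTAT_GIANT,     "giant packet" },
		{ RXSTAT_RUNT,      "runt packet" },  /* label is a guess */
		{ RXSTAT_CRCERR,    "CRC error" },
	};
	size_t i, used;
	int count = 0;

	assert(out != NULL && outlen > 0);
	out[0] = '\0';
	for (i = 0; i < sizeof(errs) / sizeof(errs[0]); i++) {
		if ((rxstat & errs[i].bit) == 0)
			continue;
		used = strlen(out);
		snprintf(out + used, outlen - used, "%s%s",
		    used ? ", " : "", errs[i].name);
		count++;
	}
	return count;
}

/* Nonzero when the last-fragment bit and the error summary bit are both set. */
static int
rxstat_eof_with_error(uint32_t rxstat)
{
	return (rxstat & RXSTAT_EOF) != 0 && (rxstat & RXSTAT_RXERRSUM) != 0;
}
```

With these assumed masks, the five rxstat values quoted above decode to the same error lists the driver printed, which is some evidence that the masks are right. They would also all show the end-of-frame bit set together with the error summary bit, and the start-of-frame bit clear, but that reading depends entirely on the assumed bit positions.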
