intermittent petabyte usage reported with broadcom nic

Previous thread: drm + 4GB RAM + swiotlb = drm craps out by Dave Airlie on Sunday, April 1, 2007 - 4:44 pm. (11 messages)

Next thread: none
From: CaT
Date: Sunday, April 1, 2007 - 6:43 pm

I take minute by minute snapshots of network traffic by sampling
/proc/net/dev and most of the time everything works fine. Occasionally
though I get petabyte-scale byte traffic and correspondingly large packet traffic.

This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
nics. The issue happens with both nics but at different times. The same
sampling code runs on p4 boxes with ht on and e1000 nics without issues
so I don't believe it's an issue with my code (famous last words :)
which just does an re to extract the data on a per-line basis and prints
it out. Still, I'll be adding code to log any big readings and hopefully
it'll happen again sooner rather than later.

There is no preemption involved and the kernel is a monolithic build of
2.6.19.[12] (there are two servers).
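
A minimal sketch of the sampling-and-logging approach described above, assuming
the usual /proc/net/dev layout of 16 counters per interface. The function names
and the limit (a gigabit link running flat out for one 60-second interval) are
illustrative assumptions, not the poster's actual code.

```python
import re

# One interface line of /proc/net/dev looks like "  eth0:<16 counters>".
LINE_RE = re.compile(r"^\s*([^:\s]+):\s*(.*)$")

# Assumed upper bound per sample: 1 Gbit/s (125,000,000 bytes/s) for 60 s.
LIMIT = 125_000_000 * 60


def parse_net_dev(text):
    """Map interface name -> list of 16 integer counters."""
    stats = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # the two header lines have no "name:" prefix
        fields = m.group(2).split()
        if len(fields) == 16 and all(f.isdigit() for f in fields):
            stats[m.group(1)] = [int(f) for f in fields]
    return stats


def big_readings(prev, cur, limit=LIMIT):
    """Return (iface, column, old, new) for every counter that went
    backwards or grew by more than `limit` between two samples."""
    hits = []
    for iface, new in cur.items():
        old = prev.get(iface)
        if old is None:
            continue  # interface not present in the previous sample
        for col, (a, b) in enumerate(zip(old, new)):
            if b < a or b - a > limit:
                hits.append((iface, col, a, b))
    return hits
```

In use, the sampler would keep the previous parse around, call
big_readings(prev, cur) once a minute, and dump the raw lines whenever the
returned list is non-empty.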

-- 
    "To the extent that we overreact, we proffer the terrorists the
    greatest tribute."
    	- High Court Judge Michael Kirby
-

From: Andrew Morton
Date: Monday, April 2, 2007 - 12:13 am

How frequently?

Are you able to provide some actual numbers (expected and actual values),


We do perform racy 64-bit updates of some of the stats counters.  But
that'll only affect 32-bit kernels and I'm assuming you're running a 64-bit
kernel on that AMD64 box (are you?)

Plus it's odd that both the byte-counters and the packet-counters go wonky
at the same time.

-

From: CaT
Date: Monday, April 2, 2007 - 12:41 am

I have them in an rrd file. I think though that the numbers will be
'adjusted' to fit in with the timekeeping. The logging code I've added
should provide exact numbers as it'll just dump what it reads from /proc


Correct. The environment is 64bit clean, though the kernel is compiled

If you want I can toss you the rrd graphs that result from the data. The
values do not appear to be static. For example, the 2 most recent hits
(within 10 minutes of each other) gave almost 3 petabytes and just over 4
petabytes. Interestingly, the incoming data is driven up to
petabytes whilst the outgoing data hits megabytes at that point. This is
consistent and the server is generally quiet.

-- 
    "To the extent that we overreact, we proffer the terrorists the
    greatest tribute."
    	- High Court Judge Michael Kirby
-

From: Jean-Daniel Pauget
Date: Monday, April 2, 2007 - 3:31 am

I don't know if a me-too may help you, but I have exactly the same 
    trouble on a whole set of Dell servers, all with bnx2 drivers (suse 10.1
Linux toronto 2.6.16.27-0.6-smp #1 SMP Wed Dec 13 09:34:50 UTC 2006 x86_64 
x86_64 x86_64 GNU/Linux

.../...
<6>Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.4.31 (January 19, 2006)
<6>ACPI: PCI Interrupt 0000:09:00.0[A] -> GSI 16 (level, low) -> IRQ 169
<6>usbcore: registered new driver hub
<6>eth0: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f4000000, IRQ 169, node addr 0015c5f18146
<6>ACPI: PCI Interrupt 0000:05:00.0[A] -> GSI 16 (level, low) -> IRQ 169
<6>eth1: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f8000000, IRQ 169, node addr 0015c5f18144
.../...





    I can patch my app in order to give you those exact numbers (I'm afraid 
    I'm not enough of an rrd expert to extract the real past values reported).
    On the other hand, I cannot really test new drivers on those machines just 

    exactly the same, just xeons instead of AMD.

-- 
    Jean-Daniel Pauget

    Tél: +33 (0)2 33 17 20 16
    2, rue André PELCA
    50580 Denneville-Plage
    France

-

From: Michael Chan
Date: Saturday, April 14, 2007 - 5:20 pm

I did a quick test on a 64-bit kernel and did not see any problem with
the counters.  I'll ask the lab to set up a longer term test and monitor
the counters for bogus values.

I also like Andi's idea of using change_page_attr() to isolate the
problem.  I'll try to send you a debug patch in the next few days to try
that out.  Thanks.

-

From: Michael Chan
Date: Monday, April 16, 2007 - 12:10 pm

Here's the debug patch for x86 only that will change the statistics
memory block to read-only.  If the kernel is corrupting it, you should
get a page fault that will crash the system.  If you continue to see
bogus counters, it is definitely a firmware or hardware problem.  Please
try it and let me know.  Thanks.

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 0b7aded..b7d491b 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -47,6 +47,7 @@
 #include <linux/prefetch.h>
 #include <linux/cache.h>
 #include <linux/zlib.h>
+#include <asm/cacheflush.h>
 
 #include "bnx2.h"
 #include "bnx2_fw.h"
@@ -436,6 +437,8 @@ bnx2_free_mem(struct bnx2 *bp)
 		}
 	}
 	if (bp->status_blk) {
+		change_page_attr(virt_to_page(bp->status_blk), 1, PAGE_KERNEL);
+		global_flush_tlb();
 		pci_free_consistent(bp->pdev, bp->status_stats_size,
 				    bp->status_blk, bp->status_blk_mapping);
 		bp->status_blk = NULL;
@@ -501,6 +504,7 @@ bnx2_alloc_mem(struct bnx2 *bp)
 	bp->status_stats_size = status_blk_size +
 				sizeof(struct statistics_block);
 
+	bp->status_stats_size = PAGE_SIZE;
 	bp->status_blk = pci_alloc_consistent(bp->pdev, bp->status_stats_size,
 					      &bp->status_blk_mapping);
 	if (bp->status_blk == NULL)
@@ -508,6 +512,10 @@ bnx2_alloc_mem(struct bnx2 *bp)
 
 	memset(bp->status_blk, 0, bp->status_stats_size);
 
+	/* x86 debug code to see if the kernel is corrupting the statistics */
+	change_page_attr(virt_to_page(bp->status_blk), 1, PAGE_KERNEL_RO);
+	global_flush_tlb();
+
 	bp->stats_blk = (void *) ((unsigned long) bp->status_blk +
 				  status_blk_size);
 
@@ -4307,7 +4315,9 @@ bnx2_timer(unsigned long data)
 	msg = (u32) ++bp->fw_drv_pulse_wr_seq;
 	REG_WR_IND(bp, bp->shmem_base + BNX2_DRV_PULSE_MB, msg);
 
+#if 0
 	bp->stats_blk->stat_FwRxDrop = REG_RD_IND(bp, BNX2_FW_RX_DROP_COUNT);
+#endif
 
 	if (bp->phy_flags & PHY_SERDES_FLAG) {
 		if (CHIP_NUM(bp) == CHIP_NUM_5706)


-

From: CaT
Date: Monday, April 16, 2007 - 4:43 pm

Ahh. Would truly love to but the moment you said 'crash the system' I
had to bail. These boxes are in production and as such a crash would be,
shall we say, unwelcome. I might be able to finagle something but I
very much doubt it.

Perhaps Jean-Daniel, who is also experiencing this problem and seemingly
more frequently than I, has a box that he could run your patch on. I
think we both run pretty much the same hardware (Dell [12]950s). I've
CCed him.

-- 
    "To the extent that we overreact, we proffer the terrorists the
    greatest tribute."
    	- High Court Judge Michael Kirby
-

From: Jean-Daniel Pauget
Date: Tuesday, April 17, 2007 - 5:01 am

Dell 1950/2950 indeed...

    If there is any way to catch that write without crashing the system 
    (even at the price of some slowness) I can test it. If not, I can't, 
    because all my available targets are remotely administered and involved 
    with production processes.
    If by luck one of them gets free, I'll try to apply the latest patch you 
    provide me. I may also try it one day when I'm close to those machines, so 
    keep me in the list for up-to-date patches.

-- 
    Jean-Daniel Pauget

    Tél: +33 (0)2 33 17 20 16
    2, rue André PELCA
    50580 Denneville-Plage
    France

-

From: Roland Dreier
Date: Tuesday, April 17, 2007 - 8:58 am

I actually have a couple of Dell 1950 systems with bnx2 NICs too,
which I use for kernel development (ie one more crash is fine :)

If someone can give me an idea for what kind of load to use, I can try
this patch out to see if it triggers.

 - R.
-

From: Michael Chan
Date: Monday, May 21, 2007 - 6:15 pm

We were able to reproduce the problem and confirmed that it was a DMA
problem of the statistics block.  About once an hour on average, wrong
counter values will be DMA'ed to host memory.  Luckily, the DMA write
stays within the intended address range so it will not corrupt other
parts of memory.  Other types of DMA including traffic and buffer
descriptors are not affected.

If you happen to be reading /proc/net/dev within a second after the DMA
corruption, you'll see bogus counters.  One second later and until the
next bad DMA, the counters will be normal again.

We are considering ways to work around the problem.  Thanks.
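
Since the bad values are reported to last about a second, a monitoring script
could, as a stopgap on its own side, re-read a sample that looks implausible.
The sketch below is only an illustration of that idea, not the driver-side
workaround being considered here; the function names, the retry count, and the
two-second delay are all assumptions.

```python
import time


def plausible(prev, cur, max_delta):
    """True if no counter went backwards or jumped by more than max_delta."""
    return all(0 <= b - a <= max_delta for a, b in zip(prev, cur))


def sample_with_retry(read_counters, prev, max_delta,
                      retries=2, delay=2.0, sleep=time.sleep):
    """Read the counters via read_counters(); if the sample is implausible
    relative to prev, wait `delay` seconds (longer than the roughly
    one-second bad window) and read again, up to `retries` times.

    Returns (counters, ok), where ok is False if the final reading
    is still implausible.
    """
    cur = read_counters()
    for _ in range(retries):
        if plausible(prev, cur, max_delta):
            break
        sleep(delay)
        cur = read_counters()
    return cur, plausible(prev, cur, max_delta)
```

Here read_counters is any callable returning a list of integers for one
interface, for example the rx and tx byte and packet columns parsed from
/proc/net/dev. max_delta should be chosen generously enough to cover the extra
seconds a retry adds to the sampling interval.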

-

From: CaT
Date: Thursday, April 12, 2007 - 3:52 pm

I have some now. These are raw lines from /proc/net/dev. In this case it's
eth0 at 22:14 that chucked a wee wibbly.

Apr 11 22:13:02 '  eth0:17227166357 81379716    0    0    0     0          0         0 33090495625 86656584    0    0    0     0       0          0 '
Apr 11 22:13:02 '  eth1:30708022097 91219466    0    0    0     0          0         0 122989582024 125073786    0    0    0     0       0          0 '
Apr 11 22:14:02 '  eth0:220898233988841368 66750274    0    0    0     0          0  86458738 52386430545 101089219 199313    0    0     0  199313          0 '
Apr 11 22:14:02 '  eth1:30708307787 91220183    0    0    0     0          0         0 122989665004 125074344    0    0    0     0       0          0 '
Apr 11 22:15:02 '  eth0:17227454818 81381144    0    0    0     0          0         0 33091307388 86658381    0    0    0     0       0          0 '
Apr 11 22:15:02 '  eth1:30708569308 91220742    0    0    0     0          0         0 122989732601 125074712    0    0    0     0       0          0 '

On another server (same hardware except for 2ru case, more ram and more hds):

Apr  9 06:18:05 '  eth0:1556640056941 3598105481    0    0    0     0          0         0 2281147324747 3318270401    0    0    0     0       0          0 '
Apr  9 06:18:05 '  eth1:912389249044 1190286687    0    0    0     0          0         0 642943095469 991257887    0    0    0     0       0          0 '
Apr  9 06:19:04 '  eth0:14250798570591813804 2284720007938 18638    0    0 18638          0  27375938 1556640980159 3345714490    0    0    0     0       0          0 '
Apr  9 06:19:04 '  eth1:912389281939 1190287072    0    0    0     0          0         0 642943219035 991258183    0    0    0     0       0          0 '
Apr  9 06:20:05 '  eth0:1556643514710 3598121584    0    0    0     0          0         0 2281154391794 3318284878    0    0    0     0       0          0 '

To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
them all as amd64s). Network card driver ...
-

From: Andrew Morton
Date: Thursday, April 12, 2007 - 4:13 pm

On Fri, 13 Apr 2007 08:52:49 +1000

0x310_c9c6_006a_7f98


0xc5c5_01cb_c5c5_00ac and 0x213_f3ec_ab02

The first one looks like trashed memory: it got overwritten by kernel
addresses.  Except they're x86-32 kernel addresses, and you're running
x86_64 64-bit kernel.  hm.


OK.  I was earlier assuming that you were seeing transient funny numbers. 
But in fact I think you're saying that the numbers go bad, and then stay
bad.

-

From: Roland Dreier
Date: Thursday, April 12, 2007 - 4:18 pm

> > Apr 11 22:14:02 '  eth0:220898233988841368 66750274    0    0    0     0          0  86458738 52386430545 101089219 199313    0    0     0  199313          0 '

 > > Apr 11 22:15:02 '  eth0:17227454818 81381144    0    0    0     0          0         0 33091307388 86658381    0    0    0     0       0          0 '

 > But in fact I think you're saying that the numbers go bad, and then stay bad.

Doesn't look like it -- one minute after the first hiccup the eth0 #s
look reasonable again.

 - R.
-

From: CaT
Date: Thursday, April 12, 2007 - 4:25 pm

Yeah. Sorry for not making it clear. I included good values on either
side of the bad one.

-- 
    "To the extent that we overreact, we proffer the terrorists the
    greatest tribute."
    	- High Court Judge Michael Kirby
-

From: Roland Dreier
Date: Thursday, April 12, 2007 - 4:15 pm

> Apr  9 06:19:04 '  eth0:14250798570591813804 2284720007938 18638    0    0 18638          0  27375938 1556640980159 3345714490    0    0    0     0       0          0 '

One odd thing is that crazy number 14250798570591813804 is
c5c501cbc5c500ac in hex.  I dunno what the significance of the 0xc5 bit
pattern is though...

The other line has 220898233988841368, which is 0x310c9c6006a7f98, not
nearly so regular a pattern.

I don't think I'm helping much...
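
The decimal-to-hex conversions above are easy to reproduce. A small sketch,
using the two bad rx-byte values from the quoted /proc/net/dev lines; the
helper name grouped_hex is made up for illustration.

```python
# Raw rx-byte counters from the two bad samples quoted in this thread.
BAD_VALUES = [14250798570591813804, 220898233988841368]


def grouped_hex(value):
    """Format a 64-bit counter as 16 hex digits in groups of four."""
    digits = format(value, "016x")
    return "_".join(digits[i:i + 4] for i in range(0, 16, 4))


for value in BAD_VALUES:
    print(value, "->", "0x" + grouped_hex(value))
```

Grouping the digits in fours makes the repeated c5c5 halves of the first value
stand out, while the second value shows no such repetition.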
-

From: Roland Dreier
Date: Thursday, April 12, 2007 - 4:28 pm

[Adding Michael Chan, who seems to look after bnx2, to the cc list]

 > To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
 > them all as amd64s). Network card driver in use is the one defined by
(firmware?) a block of memory to DMA stats into, and just reads from
that memory in its get_stats method.  So if you're seeing wonky stats
from the NIC intermittently, my best guess would be that firmware is
occasionally writing junk into the stats block.

 - R.
-

From: Andi Kleen
Date: Thursday, April 12, 2007 - 6:15 pm

If only the firmware is writing to that area, it could be put
into its own page and then write-protected with change_page_attr().
That would catch any corruption coming from the rest of the kernel.

-Andi
-
