I am experiencing network data corruption with a 3Com 3C996B-T NIC
(Broadcom NetXtreme BCM5701; driver tg3.ko). I have identified the
following patch as the trigger:
commit fb93134dfc2a6e6fbedc7c270a31da03fce88db9
Author: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed Nov 14 15:45:21 2007 -0800
[TCP]: Fix size calculation in sk_stream_alloc_pskb
We round up the header size in sk_stream_alloc_pskb so that
TSO packets get zero tail room. Unfortunately this rounding
up is not coordinated with the select_size() function used by
TCP to calculate the second parameter of sk_stream_alloc_pskb.
As a result, we may allocate more than a page of data in the
non-TSO case when exactly one page is desired.
In fact, rounding up the head room is detrimental in the non-TSO
case because it makes memory that would otherwise be available to
the payload head room. TSO doesn't need this either, all it wants
is the guarantee that there is no tail room.
So this patch fixes this by adjusting the skb_reserve call so that
exactly the requested amount (which all callers have calculated in
a precise way) is made available as tail room.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch was included in 2.6.24 and 2.6.23.4 -stable. I am
experiencing data corruption with kernels 2.6.23.4 - 2.6.23.16, 2.6.24 -
2.6.24.2, and 2.6.25-rc2-git1. I have verified that reverting the above
patch (by hand) makes the data corruption go away on all affected
kernels (note that in 2.6.25 the function is sk_stream_alloc_skb() in
net/ipv4/tcp.c rather than sk_stream_alloc_pskb() in include/net/sock.h).
(Also note that when testing 2.6.23 - 2.6.23.4, I had to apply the
individual patch "TG3: Fix performance regression on 5705." from 2.6.23.5.)
I do not get data corruption when substituting a SysKonnect 9D21 NIC
(which also uses the tg3.ko ...
--
Assuming this problem is unique to the 5701, I'm not sure how it is exposed by Herbert's patch. One thing unique on the 5701 is that it double-copies all RX packets so that the data starts at offset 2, but

What Broadcom chip is on the Syskonnect card?
--
From: "Michael Chan" <mchan@broadcom.com> One consequence of Herbert's change is that the chip will see a different datastream. The initial skb->data linear area will be smaller, and the transition to the fragmented area of pages will be quicker. --
I see. Perhaps when we get to the end of the data-stream, there is a tiny frag that the chip cannot handle. That's the only thing I can think of.

Please try this patch to see if the problem goes away. This will disable SG on 5701 so we always get linear SKBs.

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index db606b6..bb37e76 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -12717,6 +12717,9 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
 } else
 tp->tg3_flags &= ~TG3_FLAG_RX_CHECKSUMS;
 
+ if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701)
+ dev->features &= ~(NETIF_F_IP_CSUM | NETIF_F_SG);
+
 /* flow control autonegotiation is default behavior */
 tp->tg3_flags |= TG3_FLAG_PAUSE_AUTONEG;
 tp->link_config.flowctrl = TG3_FLOW_CTRL_TX | TG3_FLOW_CTRL_RX;
--
This patch does appear to fix the data corruption (tested with 2.6.24.2). However, it results in performance problems with the iSCSI application that I am trying to run on this machine.

The test program that I described in the previous message still gets good performance in both directions. "iperf -r" gets good performance in both directions (940 Mbits/s or 117 MB/s). However, my target-mode iSCSI application (which obviously generates rx/tx traffic patterns more complicated than the synthetic tests) gets very poor performance in one direction but good performance in the other direction. iSCSI performance drops to 6 - 15 MB/s when the 3Com NIC is doing heavy rx with light tx, but remains at a decent 115 MB/s when the 3Com NIC is doing heavy tx with light rx.

When I revert Herbert's patch instead of applying the patch above, I get 115 MB/s in both cases. (With a stock unpatched kernel, the test fails almost immediately because the iSCSI control PDUs are corrupted, causing the TCP connection to be dropped.)

The SysKonnect NIC that does not exhibit this problem has a chip that says "BCM5411KQM" "TT0128 P2Q" and "56975E".

Tony
--
That's strange. The patch should only affect TX performance slightly since we are just turning off SG for TX. Please take an ethereal trace

I think this is the 5700, but please send me the tg3 output that identifies the chip and the revision. Something like this:

eth2: Tigon3 [partno(BCM95705) rev 3003 PHY(5705)] (PCI:66MHz:32-bit) 10/100/1000Base-T Ethernet 00:10:18:04:57:0d
eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[0] TSOcap[1]
--
Update: when I revert Herbert's patch in addition to applying your patch, the iSCSI performance goes back up to 115 MB/s again in both directions. So it looks like turning off SG for TX didn't itself cause the performance drop, but rather that the performance drop is just another manifestation of whatever bug is causing the data corruption. I do not regularly use wireshark or look at network packet dumps, so I am not really sure what to look for. Given the above information, do you still believe that there is value in examining the packet dump? Tony --
Interesting. So the workload that regressed is mostly RX with a little TX traffic? Can you try to reproduce this with something like netperf to eliminate other variables? This is all very puzzling since the patch in question shouldn't change an RX load at all. Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt --
We have established that the slowdown was caused by TCP checksum errors and retransmits. I assume that the slowdown in my test was due to the light TX rather than the heavy RX. I am no TCP protocol expert, but perhaps heavy TX (such as iperf) might not be affected as much because the wire stays busy while waiting for the retransmit, whereas with my light TX iSCSI load, the wire goes idle while waiting for the retransmit because the iSCSI state machine is stalled. Tony --
Hi Tony. Sorry for the radio silence.
Michael and I have discussed this problem a bit. Another possibility is
that the chip may be having difficulty with non-dword aligned TX buffers.
Since we already know the RX side has the same problem, it isn't so
far-fetched to think that perhaps it can affect the TX side too. Can
you give the following patch a try and see if the corruption still
happens?
diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 96043c5..810c711 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -4135,11 +4135,20 @@ static int tigon3_dma_hwbug_workaround(struct tg3 *tp, struct sk_buff *skb,
u32 last_plus_one, u32 *start,
u32 base_flags, u32 mss)
{
- struct sk_buff *new_skb = skb_copy(skb, GFP_ATOMIC);
+ struct sk_buff *new_skb;
dma_addr_t new_addr = 0;
u32 entry = *start;
int i, ret = 0;
+ if (GET_ASIC_REV(tp->pci_chip_rev_id) != ASIC_REV_5701)
+ new_skb = skb_copy(skb, GFP_ATOMIC);
+ else {
+ int more_headroom = 4 - (skb->mac_header & 3);
+
+ new_skb = skb_copy_expand(skb, skb_headroom(skb) + more_headroom,
+ skb_tailroom(skb), GFP_ATOMIC);
+ }
+
if (!new_skb) {
ret = -1;
} else {
@@ -4465,6 +4474,10 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
if (tg3_4g_overflow_test(mapping, len))
would_hit_hwbug = 1;
+ /* Force the 5701 into the double copy path. */
+ if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701)
+ would_hit_hwbug = 1;
+
tg3_set_txd(tp, entry, mapping, len, base_flags,
(skb_shinfo(skb)->nr_frags == 0) | (mss << 1));
--
Thanks, your patch fixes the problem (tested on 2.6.24.4). However, I had to change "(skb->mac_header & 3)" in your patch to "((long) skb->mac_header & 3)" since mac_header is a pointer rather than an int on 32-bit systems.

Tested-by: Tony Battersby <tonyb@cybernetics.com>
--
From: Tony Battersby <tonyb@cybernetics.com> Thanks for testing. Matt, skb->mac_header is either a pointer or an integer offset depending upon whether we are building 32-bit or 64-bit. Testing skb->mac_header is therefore wrong, because it's an offset from a pointer in the 64-bit case and therefore its alignment does not correctly indicate the actual final alignment of skb->head + skb->mac_header. Therefore you should test skb_mac_header(skb) and cast it with (unsigned long). Please respin this fix with that correction so I can apply it and get this bug fixed, thanks! --
Isn't it better to test for skb->data? That's where we tell We think that this problem is unique in Tony's environment because of the PCIE-to-PCI bridge that he is using. We therefore want to test for that bridge and apply the workaround only when it's present. We've never seen this problem in the last 6 or 7 years during the lifetime of the 5701. We'll try to get this done ASAP. Thanks. --
From: "Michael Chan" <mchan@broadcom.com> That's correct. --
Tony,
Below is a patch that attempts to limit the workaround to the bridge you
have on your system. Can you test it and verify that the workaround is
still enabled?
diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 96043c5..52a44c6 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -4135,11 +4135,21 @@ static int tigon3_dma_hwbug_workaround(struct tg3 *tp, struct sk_buff *skb,
u32 last_plus_one, u32 *start,
u32 base_flags, u32 mss)
{
- struct sk_buff *new_skb = skb_copy(skb, GFP_ATOMIC);
+ struct sk_buff *new_skb;
dma_addr_t new_addr = 0;
u32 entry = *start;
int i, ret = 0;
+ if (GET_ASIC_REV(tp->pci_chip_rev_id) != ASIC_REV_5701)
+ new_skb = skb_copy(skb, GFP_ATOMIC);
+ else {
+ int more_headroom = 4 - ((unsigned long)skb->data & 3);
+
+ new_skb = skb_copy_expand(skb,
+ skb_headroom(skb) + more_headroom,
+ skb_tailroom(skb), GFP_ATOMIC);
+ }
+
if (!new_skb) {
ret = -1;
} else {
@@ -4462,7 +4472,9 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
would_hit_hwbug = 0;
- if (tg3_4g_overflow_test(mapping, len))
+ if (tp->tg3_flags3 & TG3_FLG3_5701_DMA_BUG)
+ would_hit_hwbug = 1;
+ else if (tg3_4g_overflow_test(mapping, len))
would_hit_hwbug = 1;
tg3_set_txd(tp, entry, mapping, len, base_flags,
@@ -11339,6 +11351,41 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
}
}
+ if ((GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701)) {
+ static struct tg3_dev_id {
+ u32 vendor;
+ u32 device;
+ } bridge_chipsets[] = {
+ { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PXH_0 },
+ { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_PXH_1 },
+ { },
+ };
+ struct tg3_dev_id *pci_id = &bridge_chipsets[0];
+ struct pci_dev *bridge = NULL;
+
+ while (pci_id->vendor != 0 &&
+ !(tp->tg3_flags3 & TG3_FLG3_5701_DMA_BUG)) {
+ while (1) {
+ bridge = pci_get_device(pci_id->vendor,
...
This new patch also passes the test. Thumbs up!

Tested-by: Tony Battersby <tonyb@cybernetics.com>

Tony
--
Hi Tony. Can you give us the output of:

sudo lspci -vvv -xxxx -s 03:01.0

(assuming that is still the correct address of the 3Com NIC.)

Also, after some digging, I found that the 5701 can run into trouble if a 64-bit DMA read terminates early and then completes as a 32-bit transfer. The problem is reportedly very rare, but the failure mode looks like a match. Can you apply the following patch and see if it helps your performance / corruption problems?

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index db606b6..7ad08ce 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -11409,6 +11409,8 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 tp->tg3_flags |= TG3_FLAG_PCI_HIGH_SPEED;
 if ((pci_state_reg & PCISTATE_BUS_32BIT) != 0)
 tp->tg3_flags |= TG3_FLAG_PCI_32BIT;
+ else if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701)
+ tp->grc_mode |= GRC_MODE_FORCE_PCI32BIT;
 
 /* Chip-specific fixup from Broadcom driver */
 if ((tp->pci_chip_rev_id == CHIPREV_ID_5704_A0) &&
--
The following patch fixes the problem for me. Do we want to accept this
patch and call it a day or continue investigating the source of the problem?
Patch applies to 2.6.24.2, but doesn't apply to 2.6.25-rc. If everyone
agrees that this is the right solution, I will resubmit with a proper
subject line and description.
Tony
--- linux-2.6.24.2/include/net/sock.h.orig 2008-02-20 17:19:20.000000000 -0500
+++ linux-2.6.24.2/include/net/sock.h 2008-02-20 17:25:55.000000000 -0500
@@ -1236,8 +1236,10 @@ static inline struct sk_buff *sk_stream_
{
struct sk_buff *skb;
- /* The TCP header must be at least 32-bit aligned. */
- size = ALIGN(size, 4);
+ /* The TCP header must be at least 32-bit aligned, but some chipsets
+ * such as Broadcom BCM5701 require at least 16-byte alignment.
+ */
+ size = ALIGN(size, 16);
skb = alloc_skb_fclone(size + sk->sk_prot->max_header, gfp);
if (skb) {
--
From: Tony Battersby <tonyb@cybernetics.com>

A chipset bug, if it even exists, should be worked around in the driver for that hardware. We shouldn't make every other piece of hardware in the world suffer too.
--
Yes, we should work around this in the TG3 driver once we understand what the problem is and how to work around it. We are still looking through the errata list to sort this out. It looks like it is the starting DMA address of the TX buffer that is causing the problem. --
Update:

Herbert's patch alters the arguments to alloc_skb_fclone() and skb_reserve() from within sk_stream_alloc_pskb(). This changes the skb_headroom() and skb_tailroom() of the returned skb. I decided to see if I could detect the precise point at which data corruption started to happen. The result is this table:

(sk_stream_alloc_pskb() called with size == 1448; sk->sk_prot->max_header == 160)

skb_headroom  skb_tailroom  test result  note
216           1448          fail         [1]
344           1448          fail
340           1452          pass
336           1456          pass
332           1460          pass
328           1464          fail
324           1468          pass
320           1472          pass
316           1476          pass
312           1480          fail
308           1484          pass
304           1488          pass
300           1492          pass
296           1496          fail
292           1500          pass
288           1504          pass
284           1508          pass
280           1512          fail
276           1516          pass
272           1520          pass
268           1524          pass
264           1528          fail
260           1532          pass
256           1536          pass         [2]

Notes:
[1] Kernels 2.6.23.4 - 2.6.23.16 and 2.6.24 - current with Herbert's patch
[2] Kernels 2.6.23.3 and before without Herbert's patch

Note that the first row has skb_headroom + skb_tailroom == 1664; the remaining rows have skb_headroom + skb_tailroom == 1792.

From these results, it looks like a data alignment issue. Herbert's patch unfortunately just happened to change the alignment in a way that made it break.

Tony
--
03:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5701 Gigabit Ethernet (rev 15)
    Subsystem: Compaq Computer Corporation NC7770 Gigabit Server Adapter (PCI-X, 10/100/1000-T)
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 64 (16000ns min), Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 17
    Region 0: Memory at df7f0000 (64-bit, non-prefetchable) [size=64K]
    [virtual] Expansion ROM at dfc00000 [disabled] [size=64K]
    Capabilities: [40] PCI-X non-bridge device
        Command: DPERE- ERO- RBC=512 OST=1
        Status: Dev=03:01.1 64bit+ 133MHz+ SCD- USC- DC=simple DMMRBC=512 DMOST=1 DMCRS=8 RSCEM- 266MHz- 533MHz-
    Capabilities: [48] Power Management version 2
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
        Status: D0 PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [50] Vital Product Data <?>
    Capabilities: [58] Message Signalled Interrupts: Mask- 64bit+ Queue=0/3 Enable-
        Address: 063000119b608000 Data: 0423
00: e4 14 45 16 06 00 b0 02 15 00 00 02 10 40 00 00
10: 04 00 7f df 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 11 0e 7c 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 0b 01 40 00
40: 07 48 00 00 09 03 03 00 01 50 02 c0 00 20 00 64
50: 03 58 00 00 08 10 21 08 05 00 86 00 00 80 60 9b
60: 11 00 30 06 23 04 00 00 98 02 05 01 0f 00 db 76
70: 8a 10 00 00 c7 00 00 80 50 00 00 00 00 00 00 00
80: 03 58 00 00 00 00 00 00 34 80 13 04 82 10 00 00
90: 09 06 00 01 00 00 00 00 00 00 00 00 c6 01 00 00
a0: 00 00 00 00 fe 02 00 00 00 00 00 00 af 01 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Sorry, this didn't help. I still get data corruption with hardware checksumming o...
Can you confirm whether you're getting TCP checksum errors on the other side that is receiving packets from the 5701? You can just check statistics using netstat -s. I suspect that after we turn off SG, checksum is no longer offloaded and we are getting lots of TCP checksum errors instead that are slowing the performance. --
Confirmed. With a 100 MB read/write test, netstat -s shows 75 bad segments received, and performance in the one direction is about 5 MB/s. When I switch to the SysKonnect NIC, netstat -s shows 0 bad segments received, and performance is 115 MB/s. So that solves that mystery - there is still data corruption, but the software-computed TCP checksum causes the bad packets to be retransmitted rather than being passed on to the application. Tony --
Here is the dmesg output for the SysKonnect NIC: eth0: Tigon3 [partno(SK-9D21) rev 7104 PHY(5411)] (PCI:66MHz:64-bit) 10/100/1000Base-T Ethernet 00:00:5a:9d:0c:4a eth0: RXcsums[1] LinkChgREG[1] MIirq[1] ASF[0] WireSpeed[0] TSOcap[0] eth0: dma_rwctrl[76ff000f] dma_mask[64-bit] Tony --
