[1.] One line summary of the problem:
hw csum failure appears in syslog
[2.] Full description of the problem/report:
"hw csum failure" appears in syslog, and sometimes, under heavy network load with the NFS daemon running, the network
device fails completely. No network access is possible after that. A reboot is not required, but I must restart the
network with the following commands: "ifdown eth0" and "rmmod sky2", then "insmod sky2" and "ifup eth0".
[3.] Keywords (i.e., modules, networking, kernel):
sky2 (Marvell Yukon Onboard Ethernet), networking, checksum
[4.] Kernel version (from /proc/version):
Linux version 2.6.18.8 (root@endeavour) (gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)) #6 SMP Sun Aug 5 15:09:57 CEST 2007
[5.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/oops-tracing.txt)
endeavour:~ # dmesg -c
<unknown>: hw csum failure.
Call Trace:
[<ffffffff80248cbc>] __skb_checksum_complete+0x4a/0x62
[<ffffffff803bc9c4>] udp_poll+0x67/0xf3
[<ffffffff8020f3f4>] do_select+0x285/0x46d
[<ffffffff8021c589>] __pollwait+0x0/0xe0
[<ffffffff8027ee8f>] default_wake_function+0x0/0xe
[<ffffffff8025e0a0>] _spin_lock_bh+0x9/0x14
[<ffffffff8022ec61>] release_sock+0x13/0xaa
[<ffffffff8024da32>] udp_sendmsg+0x480/0x563
[<ffffffff80250300>] sock_sendmsg+0xf3/0x110
[<ffffffff8022e430>] sock_recvmsg+0x101/0x120
[<ffffffff80292267>] autoremove_wake_function+0x0/0x2e
[<ffffffff8024362b>] sock_aio_write+0x51/0x60
[<ffffffff80242efe>] try_to_wake_up+0x3e2/0x3f3
[<ffffffff80214323>] sys_select+0x26f/0x3d6
[<ffffffff80388aa0>] sys_sendto+0x119/0x14c
[<ffffffff80257b16>] system_call+0x7e/0x83
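
For readers unfamiliar with what is being checked here: the trace passes through __skb_checksum_complete, which verifies the packet checksum in software. The Python sketch below only illustrates the 16-bit one's-complement checksum used by UDP and IP (in the style of RFC 1071); it is not the kernel's code, and the function names are made up for this illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def verifies(data_with_checksum: bytes) -> bool:
    """Data that embeds a correct checksum sums to zero."""
    return internet_checksum(data_with_checksum) == 0


if __name__ == "__main__":
    payload = b"NFS over UDP"  # hypothetical even-length payload
    packet = payload + internet_checksum(payload).to_bytes(2, "big")
    print(verifies(packet))
    corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]
    print(verifies(corrupted))
```

A receiver that gets a different answer from the NIC than from a software pass like this one has reason to distrust the hardware checksum, which is roughly the situation the log message reports.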
[6.] A small shell script or example program which triggers the
problem (if possible)
-
[7.] Environment
Homebrew Server (Asus P5WDG2-WS Motherboard).
[7.1.] Software (add the output of the ver_linux script here)
Linux endeavour 2.6.18.8 #6 SMP Sun Aug 5 15:09:57 CEST 2007 x86_64 x86_64 x86_6 ...

i'm experiencing this problem myself. i have 2 servers, one using X86_64 kernel version 2.6.23-rc5 on a 100Mbit network and one with i386 kernel version 2.6.23-rc6 on a 1Gbit network. They both have this issue with the sky2 network device driver whereby the device would stop working and need to be brought down and back up. On the X86_64 kernel on a 100Mbit network, this is a very occasional thing, while on the i386 kernel on a 1Gbit network the device only works for a few minutes at a time.

If i set the MTU to 7200 then the device seems to stay functional, but then i see long delays when it's talking to 100Mbit devices with standard 1500 MTU that are outside of its LAN segment. This last might be an artifact caused by the firewall, i dunno. b -
Yes, I have found that I get far fewer problems in this area by leaving the MTU at 1500, then putting a larger MTU (usually 9000) into the routing table for segments, or even just machines, where I know there is direct connectivity. I use 9000 MTU with my directly connected file server, 1500 elsewhere. I can go to 9000 for nbd servers as well, assuming the connection doesn't pass a firewall. I have some hints that while the switches I use will speak 10/100/1000 between machines with different speeds, and will handle jumbo packets between machines at the same speed, if I'm going from Gbit/jumbo to 1500/slower, performance seems to suffer more than when talking smaller packets. That may be because window size needs to be even larger or something. I have some legacy machines talking 10Mbit/half on 10base2 cable, so I may be seeing more of this than the average site. That's legacy as in "attached to something expensive to replace." -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot -
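
To make the per-destination MTU idea above concrete, here is a minimal Python sketch of the lookup it implies: the most specific matching route decides the MTU, and everything else falls back to 1500. The addresses and the `effective_mtu` helper are hypothetical; on a real system this would be expressed as an `mtu` attribute on routes rather than in Python.

```python
import ipaddress


def effective_mtu(dest, route_mtus, default_mtu=1500):
    """Return the MTU of the most specific route matching dest.

    route_mtus maps a prefix string (e.g. "192.168.1.10/32") to an MTU.
    Falls back to default_mtu when no route matches.
    """
    addr = ipaddress.ip_address(dest)
    best_net, best_mtu = None, default_mtu
    for prefix, mtu in route_mtus.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_mtu = net, mtu
    return best_mtu


if __name__ == "__main__":
    # Hypothetical layout: one directly connected file server gets jumbo frames.
    routes = {"192.168.1.10/32": 9000, "192.168.1.0/24": 1500}
    print(effective_mtu("192.168.1.10", routes))  # file server
    print(effective_mtu("192.168.1.77", routes))  # other LAN host
    print(effective_mtu("10.0.0.5", routes))      # anything else
```

The design choice is the same as in the message: jumbo frames are opt-in per destination, so an unknown or firewalled path never sees a packet larger than 1500 bytes.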
On Tue, 18 Sep 2007 10:28:43 -0400 If you want to use Jumbo frames, you need to have routers and firewalls that correctly handle ICMP and do path MTU discovery. If you have bridges or firewalls that aren't Jumbo aware on both interfaces, then there will be long timeouts and retries for each connection. If you have busted routers and firewalls that swallow ICMP then PMTU won't work well either. -- Stephen Hemminger <shemminger@linux-foundation.org> -
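
The failure mode described above can be shown with a toy simulation. This is a simplified Python model, not real networking code: `link_mtus`, `icmp_blocked_at` and `discover_path_mtu` are names invented for the sketch, and real path MTU discovery involves timers and caching that are left out.

```python
def discover_path_mtu(link_mtus, icmp_blocked_at=frozenset(),
                      start_mtu=9000, max_tries=10):
    """Simulate a sender probing a path with "don't fragment" set.

    link_mtus: MTU of each hop along the path, in order.
    icmp_blocked_at: indices of hops whose "fragmentation needed"
        replies never reach the sender (e.g. a firewall eats ICMP).
    Returns the MTU that fits the whole path, or None when the sender
    never learns one and the connection just times out.
    """
    mtu = start_mtu
    for _ in range(max_tries):
        for hop, hop_mtu in enumerate(link_mtus):
            if mtu > hop_mtu:
                if hop in icmp_blocked_at:
                    return None  # dropped silently: a PMTU black hole
                mtu = hop_mtu    # ICMP reported the smaller MTU; retry
                break
        else:
            return mtu           # the packet fit every hop
    return None


if __name__ == "__main__":
    print(discover_path_mtu([9000, 1500, 9000]))
    print(discover_path_mtu([9000, 1500], icmp_blocked_at={1}))
```

With working ICMP the sender settles on the smallest MTU along the path; with ICMP swallowed at the narrow hop it never gets an answer, which matches the long timeouts and retries mentioned above.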
>> ben soo wrote: [...] i turned off the motherboard Marvell Gbit devices and installed a Realtek 8169 card, all the while keeping to kernel version 2.6.23-rc6, and saw all the network problems go away. Must mean it was a sky2 driver bug. Am currently running 2.6.23-rc7 on the affected server and the rc7 sky2 driver is holding up fine so far. thank you! b -
i spoke too soon. The Gbit interface still dies. Lasted around 19hrs. or so. i can't tell if there are hardware issues: yesterday a Gbit NIC on the firewall died. Different chip (Realtek), different driver, different machine, same segment. Segment is a mix of 100Mbit and 1Gbit machines. Symptoms of the failure are it just stops functioning with no error messages. ifconfig says there are packets being TX'd and none being RX'd. Interface can't be brought up again. i use this box as the secondary DNS for our Internet domains (unblocked by firewall), as well as the database server for our developmental web CMS and the LAN ntp server (both services invisible outside the LAN). board is DFI LANPARTY UT CFX3200-DR/G running an Opteron 165. i've brought up the other Gbit network interface on the motherboard and am using that. -
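
The symptom above (TX counters moving, RX counters frozen) can be checked mechanically. A minimal Python sketch, assuming the usual /proc/net/dev layout of 8 receive fields followed by 8 transmit fields per interface; the helper names are invented here and the sample counters are not from the affected machine.

```python
def parse_packet_counters(proc_net_dev_text):
    """Return {iface: (rx_packets, tx_packets)} from /proc/net/dev text."""
    counters = {}
    for line in proc_net_dev_text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:
            continue  # not the layout this sketch assumes
        # Field 1 is RX packets, field 9 is TX packets.
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def rx_stalled(before, after, iface):
    """True if the interface sent packets between samples but received none."""
    rx0, tx0 = before[iface]
    rx1, tx1 = after[iface]
    return tx1 > tx0 and rx1 == rx0


if __name__ == "__main__":
    import time
    with open("/proc/net/dev") as f:
        first = parse_packet_counters(f.read())
    time.sleep(5)
    with open("/proc/net/dev") as f:
        second = parse_packet_counters(f.read())
    for iface in first:
        if iface in second and rx_stalled(first, second, iface):
            print(iface, "is transmitting but not receiving")
```

A quiet link can look stalled over a short interval, so a real check would want a longer sampling window or several consecutive samples before acting.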
On Sat, 22 Sep 2007 21:56:04 -0400 Does your network switch support flow control? The Yukon-EC may have a hardware problem that happens if the FIFO gets full. If flow control is enabled, that shouldn't happen. The newest version of the driver has a watchdog that tries to detect this and do an automatic reset, but it still means there would be about 2 seconds of downtime. -
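
One way to check the local side of the flow control question is to look at the pause parameters reported by `ethtool -a`. The Python sketch below parses output of that general shape; the sample text is illustrative rather than captured from the affected machine, the helper names are invented, and it says nothing about what the switch itself supports.

```python
def parse_pause_params(ethtool_output):
    """Parse "key: on/off" lines from `ethtool -a` style output into a dict."""
    params = {}
    for line in ethtool_output.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value in ("on", "off"):
            params[key.strip().lower()] = (value == "on")
    return params


def flow_control_enabled(params):
    """True only if pause frames are enabled in both directions."""
    return params.get("rx", False) and params.get("tx", False)


if __name__ == "__main__":
    # Illustrative sample; real output may carry extra lines.
    sample = (
        "Pause parameters for eth0:\n"
        "Autonegotiate:\ton\n"
        "RX:\t\ton\n"
        "TX:\t\toff\n"
    )
    params = parse_pause_params(sample)
    print(params)
    print(flow_control_enabled(params))
```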
If you search through my exchanges with Adrian Bunk WRT sk98lin removal, I mentioned a very similar problem. When I wrote that I had some notes in front of me, which are now archived and not quickly available. Unloading and reloading the modules didn't fix it, IIRC. I will be doing an update on the machine in question by the end of the year, and at that time I will try sky2 again, since I'll be able to do better testing. -- Bill Davidsen <davidsen@tmr.com> -
