Re: PROBLEM: Network sky2 Module, kernel version 2.6.23-rc7

From: Werner Meurer
Date: Saturday, September 8, 2007 - 8:14 am

[1.] One line summary of the problem:    

hw csum failure appears in syslog

[2.] Full description of the problem/report:

hw csum failure appears in syslog and sometimes, under heavy network utilization with the NFS daemon, the network device
fails completely. No network access is possible after that. A reboot is not required, but I must restart the network with
the following commands: "ifdown eth0" and "rmmod sky2", then "insmod sky2" and "ifup eth0".
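
The recovery sequence above could be scripted roughly as follows. This is only a sketch: the function names and the dry-run default are assumptions made for illustration, and the commands are taken verbatim from the report. Note that insmod normally expects a path to the module file, so on many systems "modprobe sky2" may be what is actually needed.

```python
import subprocess


def sky2_restart_commands(iface="eth0", module="sky2"):
    """Return the restart sequence from the report, in order.

    The helper name and its defaults are hypothetical; the four
    commands themselves are the ones quoted in the report.
    """
    return [
        ["ifdown", iface],
        ["rmmod", module],
        ["insmod", module],  # as reported; modprobe may be needed instead
        ["ifup", iface],
    ]


def restart(iface="eth0", module="sky2", dry_run=True):
    """Print the commands (dry run) or execute them as root."""
    commands = sky2_restart_commands(iface, module)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the sequence at the first failing command
            subprocess.run(cmd, check=True)
    return commands
```

Calling restart() with the defaults only prints the four commands; pass dry_run=False to actually run them.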

[3.] Keywords (i.e., modules, networking, kernel):

sky2 (Marvell Yukon Onboard Ethernet), networking, checksum

[4.] Kernel version (from /proc/version):

Linux version 2.6.18.8 (root@endeavour) (gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)) #6 SMP Sun Aug 5 15:09:57 CEST 2007

[5.] Output of Oops.. message (if applicable) with symbolic information 
     resolved (see Documentation/oops-tracing.txt)

endeavour:~ # dmesg -c
<unknown>: hw csum failure.

Call Trace:
 [<ffffffff80248cbc>] __skb_checksum_complete+0x4a/0x62
 [<ffffffff803bc9c4>] udp_poll+0x67/0xf3
 [<ffffffff8020f3f4>] do_select+0x285/0x46d
 [<ffffffff8021c589>] __pollwait+0x0/0xe0
 [<ffffffff8027ee8f>] default_wake_function+0x0/0xe
 [<ffffffff8025e0a0>] _spin_lock_bh+0x9/0x14
 [<ffffffff8022ec61>] release_sock+0x13/0xaa
 [<ffffffff8024da32>] udp_sendmsg+0x480/0x563
 [<ffffffff80250300>] sock_sendmsg+0xf3/0x110
 [<ffffffff8022e430>] sock_recvmsg+0x101/0x120
 [<ffffffff80292267>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8024362b>] sock_aio_write+0x51/0x60
 [<ffffffff80242efe>] try_to_wake_up+0x3e2/0x3f3
 [<ffffffff80214323>] sys_select+0x26f/0x3d6
 [<ffffffff80388aa0>] sys_sendto+0x119/0x14c
 [<ffffffff80257b16>] system_call+0x7e/0x83
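
For context on the message itself: as far as I understand it, "hw csum failure" is logged when the checksum value handed up by the NIC disagrees with what the stack computes in software. The checksum in question is the 16-bit one's-complement Internet checksum (RFC 1071). A minimal sketch of that arithmetic follows; it leaves out the UDP pseudo-header and is not the kernel's implementation.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # fold any carries back into the low 16 bits
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def checksum_ok(data_with_checksum: bytes) -> bool:
    """A block that already contains its checksum must sum to zero."""
    return internet_checksum(data_with_checksum) == 0
```

A receiver that gets a wrong value from hardware for an intact packet would see checksum_ok() succeed in software while the hardware result says otherwise, which is the kind of mismatch the log line points at.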


[6.] A small shell script or example program which triggers the
     problem (if possible)

-

[7.] Environment

Homebrew Server (Asus P5WDG2-WS Motherboard).

[7.1.] Software (add the output of the ver_linux script here)

Linux endeavour 2.6.18.8 #6 SMP Sun Aug 5 15:09:57 CEST 2007 x86_64 x86_64 x86_6             ...
From: ben soo
Date: Monday, September 17, 2007 - 9:02 pm

i'm experiencing this problem myself.  i have 2 servers, one using 
X86_64 kernel version 2.6.23-rc5 on a 100Mbit network and one with 
i386 kernel version 2.6.23-rc6 on a 1Gbit network.

They both have this issue with the sky2 network device driver 
whereby the device would stop working and need to be brought down 
and back up.

On the X86_64 kernel on a 100Mbit network, this is a very 
occasional thing, while on the i386 kernel on a 1Gbit network the 
device only works for a few minutes at a time.  If i set the MTU to 
7200 then the device seems to stay functional, but then i see long 
delays when it's talking to 100Mbit devices with standard 1500 MTU 
that are outside of its LAN segment.

This last might be an artifact caused by the firewall, i dunno.

b

-

From: Bill Davidsen
Date: Tuesday, September 18, 2007 - 7:28 am

Yes, I have found that I get far fewer problems in this area by leaving
the MTU at 1500, then putting a larger MTU (usually 9000) into the routing
table for segments, or even just machines, where I know there is direct
connectivity. I use 9000 MTU with my directly connected file server,
1500 elsewhere. I can go to 9000 for nbd servers as well, assuming the
connection doesn't pass a firewall.
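
A per-route MTU of this kind can be set with iproute2. The sketch below only builds the command string; the helper name, the interface name and the addresses are made up for illustration and are not from the post.

```python
import ipaddress


def jumbo_route_command(destination, dev="eth0", mtu=9000):
    """Build an 'ip route add' command that sets an MTU for one route.

    destination may be a single host ("192.168.1.10") or a network
    ("192.168.1.0/24"); both are normalised to prefix notation.
    """
    net = ipaddress.ip_network(destination, strict=False)
    return f"ip route add {net} dev {dev} mtu {mtu}"


# Hypothetical file server reachable directly on eth0:
print(jumbo_route_command("192.168.1.10"))
```

The interface itself has to be configured for the larger frame size as well, otherwise the route MTU cannot take effect.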

I have some hints that while the switches I use will speak 10/100/1000
between machines with different speeds, and will handle jumbo packets
between machines at the same speed, performance going from Gbit/jumbo
to 1500/slower seems to suffer more than when talking smaller
packets. That may be because window size needs to be even larger or
something.

I have some legacy machines talking 10Mbit/half on 10base2 cable, so I
may be seeing more of this than the average site. That's legacy as in
"attached to something expensive to replace."

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
-

From: Stephen Hemminger
Date: Tuesday, September 18, 2007 - 8:05 am

On Tue, 18 Sep 2007 10:28:43 -0400

If you want to use Jumbo frames, you need to have routers and firewalls
that correctly handle ICMP and do path MTU discovery. If you have bridges
or firewalls that aren't Jumbo aware on both interfaces, then there will
be long timeouts and retries for each connection. If you have busted routers
and firewalls that swallow ICMP then PMTU won't work well either.
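
A toy model of why swallowed ICMP hurts, under simplifying assumptions: each hop is reduced to a single MTU value, and a hop that cannot forward a packet reports its MTU back through an ICMP "fragmentation needed" message. The function and parameter names are invented for this sketch.

```python
def discover_path_mtu(hop_mtus, icmp_dropped=False, first_try=9000):
    """Simulate path MTU discovery over a list of per-hop MTUs.

    Returns the packet size that finally fits the whole path, or
    None when the ICMP replies are dropped and the sender never
    learns that its packets are too big (a PMTU black hole).
    """
    size = first_try
    while True:
        # first hop along the path that cannot carry the packet
        bottleneck = next((m for m in hop_mtus if m < size), None)
        if bottleneck is None:
            return size  # packet fits every hop
        if icmp_dropped:
            return None  # sender only sees timeouts and retries
        size = bottleneck  # lower the size as the ICMP reply says


print(discover_path_mtu([9000, 1500, 9000]))
print(discover_path_mtu([9000, 1500, 9000], icmp_dropped=True))
```

In the second call the sender keeps retransmitting jumbo packets that can never get through, which corresponds to the long timeouts described above.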


-- 
Stephen Hemminger <shemminger@linux-foundation.org>

-

From: ben soo
Date: Wednesday, September 19, 2007 - 11:43 pm

>> ben soo wrote:
[...]


i turned off the motherboard Marvell Gbit devices and installed a 
Realtek 8169 card, all the while keeping to kernel version 
2.6.23-rc6, and saw all the network problems go away.  Must mean it 
was a sky2 driver bug.

Am currently running 2.6.23-rc7 on the affected server and the rc7 
sky2 driver is holding up fine so far.

thank you!

b
-

From: ben soo
Date: Saturday, September 22, 2007 - 6:56 pm

i spoke too soon.  The Gbit interface still dies.  Lasted around 
19hrs. or so.  i can't tell if there are hardware issues: yesterday 
a Gbit NIC on the firewall died.  Different chip (Realtek), 
different driver, different machine, same segment.  Segment is a 
mix of 100Mbit and 1Gbit machines.

The symptom of the failure is that it just stops functioning with no 
error messages.  ifconfig says there are packets being TX'd and none 
being RX'd.  Interface can't be brought up again.
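
The "TX counting up, RX stuck" symptom can be checked from the counters in /proc/net/dev by comparing two samples taken some seconds apart. A rough sketch, with helper names that are assumptions of this example:

```python
def parse_net_dev(text):
    """Extract RX/TX packet counters from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        counters[name.strip()] = {
            "rx_packets": int(fields[1]),  # 2nd receive column
            "tx_packets": int(fields[9]),  # 2nd transmit column
        }
    return counters


def rx_stalled(before, after, iface):
    """True when packets were sent but none received between samples."""
    rx_delta = after[iface]["rx_packets"] - before[iface]["rx_packets"]
    tx_delta = after[iface]["tx_packets"] - before[iface]["tx_packets"]
    return tx_delta > 0 and rx_delta == 0
```

On a quiet link a single stalled interval proves little, so several consecutive samples would be needed before treating the interface as dead.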

i use this box as the secondary DNS for our Internet domains 
(unblocked by firewall), as well as the database server for our 
developmental web CMS and the LAN ntp server (both services 
invisible outside the LAN).

board is DFI LANPARTY UT CFX3200-DR/G running an Opteron 165.

i've brought up the other Gbit network interface on the motherboard 
and am using that.

-

From: Stephen Hemminger
Date: Sunday, September 23, 2007 - 11:06 pm

On Sat, 22 Sep 2007 21:56:04 -0400

Does your network switch support flow control?  The Yukon-EC may have
a hardware problem that happens if the FIFO gets full. If flow control is
enabled, that shouldn't happen.

The newest version of the driver has a watchdog that tries to detect the
hang and do an automatic reset, but it still means there would be about
2 seconds of downtime.
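
Whether pause frames are active on the NIC side can be read with "ethtool -a eth0". The sketch below parses that kind of output; the sample layout it expects ("RX: on", "TX: on") is an assumption about typical ethtool output, and the helper names are invented here.

```python
def parse_pause_params(text):
    """Turn 'Name: on/off' lines from 'ethtool -a' into a dict."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # the title line ends in ':' with no value and is skipped
        if sep and value in ("on", "off"):
            params[key.strip().lower()] = value == "on"
    return params


def flow_control_enabled(text):
    """True only when both RX and TX pause are reported as on."""
    params = parse_pause_params(text)
    return bool(params.get("rx")) and bool(params.get("tx"))
```

This only reflects the local setting; the switch port has to honour pause frames too for flow control to help.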

-

From: Bill Davidsen
Date: Friday, September 28, 2007 - 3:48 pm

If you search through my exchanges with Adrian Bunk WRT sk98lin removal, 
I mentioned a very similar problem. When I wrote that I had some notes 
in front of me, which are now archived and not quickly available. 
Unloading and reloading the modules didn't fix it, IIRC.

I will be doing an update on the machine in question by the end of the 
year, and at that time I will try sky2 again, since I'll be able to do 
better testing.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
-
