Re: PROBLEM: 2.6.23-rc "NETDEV WATCHDOG: eth0: transmit timed out"

To: <linux-kernel@...>
Date: Monday, August 13, 2007 - 7:35 am

Hi,

I am having trouble with the 2.6.23 kernel. With all versions since
2.6.23-rc1 I have trouble with my network connection. When using the
network over a certain level (just browsing the web seems not to be
enough) e.g. when installing packages over the NFSv4 share, all
network stuff freezes for some time and syslog tells me:
Aug 13 13:16:09 frege NETDEV WATCHDOG: eth0: transmit timed out
Aug 13 13:16:39 frege NETDEV WATCHDOG: eth0: transmit timed out
Aug 13 13:17:09 frege NETDEV WATCHDOG: eth0: transmit timed out
Aug 13 13:17:57 frege NETDEV WATCHDOG: eth0: transmit timed out
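
To put a number on how often the watchdog fires, the matching syslog lines can be counted. This is only a minimal sketch: the log path (/var/log/messages) and the function name count_tx_timeouts are assumptions, not something taken from the report.

```sh
# Count "transmit timed out" watchdog messages for one interface.
# $1: log file (assumed location, e.g. /var/log/messages)
# $2: interface name (defaults to eth0, as in the report)
count_tx_timeouts() {
    grep -c "NETDEV WATCHDOG: ${2:-eth0}: transmit timed out" "$1"
}
```

Calling `count_tx_timeouts /var/log/messages eth0` before and after a stress run shows whether new timeouts were logged in between.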

Some info about my system:

/usr/src/linux-2.6.23-rc3 $ sh scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux frege linux-2.6.23-rc3 #1 SMP PREEMPT Sat Aug 11 16:24:26 CEST
2007 i686 Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz GenuineIntel
GNU/Linux

Gnu C 4.2.0
Gnu make 3.81
binutils 2.17
util-linux 2.12r
mount 2.12r
module-init-tools 3.2.2
e2fsprogs 1.39
Linux C Library 2.5
Dynamic linker (ldd) 2.5
Procps 3.2.7
Net-tools 1.60
Kbd 1.12
Sh-utils 6.9
udev 114
Modules Loaded nvidia

lspci -vvv
00:00.0 Host bridge: Intel Corporation Mobile 945GM/PM/GMS/940GML and
945GT Express Memory Controller Hub (rev 03)
Subsystem: Intel Corporation Mobile 945GM/PM/GMS/940GML and 945GT
Express Memory Controller Hub
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort+ >SERR- <PERR-
Latency: 0
Capabilities: [e0] Vendor Specific Information

00:01.0 PCI bridge: Intel Corporation Mobile 945GM/PM/GMS/940GML and
945GT Express PCI Express Root Port (rev 03) (prog-if 00 [Normal
...

To: Karl Meyer <adhocrocker@...>
Cc: <linux-kernel@...>, <netdev@...>
Date: Monday, August 13, 2007 - 12:58 pm

(netdev Cced)

Karl Meyer <adhocrocker@gmail.com> :

Can you:
- send a complete dmesg + /proc/interrupts + .config
- use git bisect to find a suspect changeset
I do not expect any change of behavior between 2.6.22 and
25805dcf9d83098cf5492117ad2669cd14cc9b24 if it can help you narrow
things down (assuming it is a r8169 regression).
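
A bisection restricted to the driver could be started roughly as below. This is a sketch under assumptions: the wrapper name bisect_r8169 is made up for illustration, and the default endpoints (tag v2.6.22 as good, tag v2.6.23-rc1 as bad) are inferred from the report rather than confirmed.

```sh
# Start a bisection that only considers commits touching the r8169 driver.
# $1: known-good revision (assumed default: v2.6.22)
# $2: known-bad revision  (assumed default: v2.6.23-rc1)
bisect_r8169() {
    good=${1:-v2.6.22}
    bad=${2:-v2.6.23-rc1}
    git bisect start "$bad" "$good" -- drivers/net/r8169.c
}

# After each step: build, boot, stress the network, then mark the result with
#   git bisect good    (no timeout seen)
#   git bisect bad     (timeout seen)
# and finish with
#   git bisect reset
```

Limiting the bisection to drivers/net/r8169.c keeps the number of kernels to build small, but it will miss the culprit if the regression is outside the driver.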

--
Ueimor
-

To: Francois Romieu <romieu@...>
Cc: <linux-kernel@...>
Date: Monday, August 13, 2007 - 1:27 pm

Hi,

dmesg, interrupts and .config are attached. I will have a look at git bisect.

To: Karl Meyer <adhocrocker@...>
Cc: <linux-kernel@...>
Date: Tuesday, August 14, 2007 - 3:09 am

Karl Meyer <adhocrocker@gmail.com> :

Can you reproduce the problem when nvidia binary-only stuff is not loaded
after boot ?

--
Ueimor
-

To: Francois Romieu <romieu@...>
Cc: <linux-kernel@...>
Date: Tuesday, August 14, 2007 - 5:25 pm

I did some additional testing, the results are:
[0e4851502f846b13b29b7f88f1250c980d57e944] r8169: merge with version
8.001.00 of Realtek's r8168 driver
does not work; after some traffic the transmit timeout occurs.
[6dccd16b7c2703e8bbf8bca62b5cf248332afbe2] r8169: merge with version
6.001.00 of Realtek's r8169 driver
Seems to be the last version to work. I did some stress testing (much
more than the level that was enough to make
[0e4851502f846b13b29b7f88f1250c980d57e944] break) and am currently
using this version and no problems so far.

-

To: Karl Meyer <adhocrocker@...>
Cc: <linux-kernel@...>
Date: Tuesday, August 14, 2007 - 6:46 pm

Thanks for the quick feedback.

Can you try the patch below on top of 2.6.23-rc3 ?

If it does not work I'll dissect 0e4851502f846b13b29b7f88f1250c980d57e944
tomorrow.

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index b85ab4a..cdb8a08 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -2749,6 +2749,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 		if (!(status & tp->intr_event))
 			break;

+#if 0
 		/* Work around for rx fifo overflow */
 		if (unlikely(status & RxFIFOOver) &&
 		    (tp->mac_version == RTL_GIGA_MAC_VER_11)) {
@@ -2756,6 +2757,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
 			rtl8169_tx_timeout(dev);
 			break;
 		}
+#endif

 		if (unlikely(status & SYSErr)) {
 			rtl8169_pcierr_interrupt(dev);
--
Ueimor
-

To: Karl Meyer <adhocrocker@...>
Cc: <linux-kernel@...>, <netdev@...>
Date: Thursday, August 16, 2007 - 10:11 am

(please do not remove the netdev Cc:)

Francois Romieu <romieu@fr.zoreil.com> :

You will find a tgz archive in attachment which contains a series of patches
(0001-... to 0005-...) to walk from 6dccd16b7c2703e8bbf8bca62b5cf248332afbe2
to 0e4851502f846b13b29b7f88f1250c980d57e944 in smaller steps.

Please apply 0001 on top of 6dccd16b7c2703e8bbf8bca62b5cf248332afbe2. If it
still works, apply 0002 on top of 0001, etc.
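
The stepwise procedure above might be scripted along these lines. It is a sketch only: the function name apply_series_up_to is hypothetical, the real patch file names are elided in this thread (hence the glob), and `git am` assumes the patches are in git format-patch form; if they are plain diffs, `patch -p1` would be needed instead.

```sh
# Check out the last known-good commit and apply patches 0001..N on top of it.
# $1: N, the highest patch number to apply (1 to 5)
# $2: directory holding the unpacked patches (assumed layout)
apply_series_up_to() {
    n=$1
    dir=$2
    git checkout 6dccd16b7c2703e8bbf8bca62b5cf248332afbe2 || return 1
    i=1
    while [ "$i" -le "$n" ]; do
        # File names after the "000N-" prefix are unknown, so match by prefix.
        git am "$dir"/000"$i"-* || return 1
        i=$((i + 1))
    done
}
```

For example, `apply_series_up_to 2 ~/r8169-patches` would leave the tree at the good commit plus 0001 and 0002, ready to build and test.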

--
Ueimor

To: Francois Romieu <romieu@...>
Cc: <linux-kernel@...>, <netdev@...>
Date: Saturday, September 1, 2007 - 3:24 pm

This is what happened today:

Sep 1 21:08:01 frege NETDEV WATCHDOG: eth0: transmit timed out
frege ~ # uname -r
2.6.22.5-cfs-v20.5

-

To: Karl Meyer <adhocrocker@...>
Cc: Francois Romieu <romieu@...>, <linux-kernel@...>, <netdev@...>
Date: Sunday, September 2, 2007 - 6:15 pm

Hi,

Can you reproduce this on 2.6.22 (not 2.6.22.x - it might be a -stable
regression)?

Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/
-

To: Michal Piotrowski <michal.k.k.piotrowski@...>
Cc: Francois Romieu <romieu@...>, <linux-kernel@...>, <netdev@...>
Date: Sunday, September 2, 2007 - 6:21 pm

Hi,

I have been looking into this issue for some time now, but there were no
errors in 2.6.22-r2 (gentoo speak, I guess this is 2.6.22.2
officially). I also ran git-bisect (for more information see the older
messages in this thread).

-

To: Karl Meyer <adhocrocker@...>
Cc: Michal Piotrowski <michal.k.k.piotrowski@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, September 12, 2007 - 4:50 pm

Karl Meyer <adhocrocker@gmail.com> :

2.6.22-r2 in gentoo is based on 2.6.22.1. It is way before
0e4851502f846b13b29b7f88f1250c980d57e944 that you reported to work.
Thus it is not surprising that it works.

Any update regarding the patchkit that I sent on 2007/08/16 ?

It would help to narrow the culprit.

--
Ueimor
-

To: Francois Romieu <romieu@...>
Cc: Michal Piotrowski <michal.k.k.piotrowski@...>, <linux-kernel@...>, <netdev@...>
Date: Monday, October 1, 2007 - 7:42 am

Hi,

After reading about issues with the NICs on Kontron boards I did a
BIOS upgrade, but this did not change anything.
However, yesterday the onboard NIC I was using died: no link at all.
After switching to the next onboard NIC I got a NETDEV transmit
timeout with that one on kernel 2.6.22-r2.
It seems the whole thing is a hardware issue. I will try to sort it
out with Kontron.

Sorry :(

Karl

-

To: Francois Romieu <romieu@...>
Cc: Michal Piotrowski <michal.k.k.piotrowski@...>, <linux-kernel@...>, <netdev@...>
Date: Wednesday, September 26, 2007 - 5:07 am

Hi Francois,

this is what I found and sent:

The error exists from patch 2 on. I did some network testing with
patch 1 and currently use it and have no errors so far.
From my experiences up to now patch 1 should be error free.

Do you need additional info?

-

To: Francois Romieu <romieu@...>
Cc: <linux-kernel@...>, <netdev@...>
Date: Tuesday, August 21, 2007 - 6:56 am

fyi:
I do not know whether it is related to the problem, but since using
the version you told me there are these entries in my log:
frege Hangcheck: hangcheck value past margin!
frege Hangcheck: hangcheck value past margin!
frege Hangcheck: hangcheck value past margin!

-

To: Karl Meyer <adhocrocker@...>
Cc: Francois Romieu <romieu@...>, <linux-kernel@...>, <netdev@...>
Date: Monday, August 27, 2007 - 8:50 am

...

BTW, I don't know whether it's related too, but I think you should try

Regards,
Jarek P.
-

To: Francois Romieu <romieu@...>
Cc: <linux-kernel@...>, <netdev@...>
Date: Monday, August 20, 2007 - 5:25 am

The error exists from patch 2 on. I did some network testing with

-

To: Francois Romieu <romieu@...>
Cc: <linux-kernel@...>, <netdev@...>
Date: Thursday, August 16, 2007 - 5:03 pm

I did some testing today and found that the error occurs after
applying some of the patches. However I did not figure out the exact
patch in which the error "starts" since it sometimes occurs immediately
when moving some data over the net and sometimes it takes 30 min till
I get the transmit timeout. I will be away till Sunday and do some
more testing then.
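
Because the timeout can take 30 minutes to show up, a patch should only be called good after a long enough soak under load. One possible way to automate the verdict is sketched below; the function name soak_verdict, the log path and the suggested duration are assumptions, and the network load itself has to be generated separately while this runs.

```sh
# Print "bad" if new watchdog timeouts were logged during the soak period,
# "good" otherwise.
# $1: log file (assumed location, e.g. /var/log/messages)
# $2: soak time in seconds (well above 1800 given the 30 min seen here)
soak_verdict() {
    pattern='NETDEV WATCHDOG: .*: transmit timed out'
    before=$(grep -c "$pattern" "$1")
    sleep "$2"
    after=$(grep -c "$pattern" "$1")
    if [ "$after" -gt "$before" ]; then
        echo bad
    else
        echo good
    fi
}
```

Something like `soak_verdict /var/log/messages 3600` gives one verdict per tested patch; a "good" result is still only as strong as the soak time and the load applied.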

-

To: Francois Romieu <romieu@...>
Cc: <linux-kernel@...>
Date: Tuesday, August 14, 2007 - 12:47 pm

Sorry, I was wrong, still testing....

-

To: Francois Romieu <romieu@...>
Cc: <linux-kernel@...>
Date: Tuesday, August 14, 2007 - 11:47 am

Hi,

I successfully ran git bisect:
0127215c17414322b350c3c6fbd1a7d8dd13856f is first bad commit
commit 0127215c17414322b350c3c6fbd1a7d8dd13856f
Author: Francois Romieu <romieu@fr.zoreil.com>
Date: Tue Feb 20 22:58:51 2007 +0100

r8169: small 8101 comment

Extracted from version 1.001.00 of Realtek's r8101.

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
d41a52a215fb1b38ba652dda90faf6ed951bccd1 M drivers

I verified it by doing "git revert
0127215c17414322b350c3c6fbd1a7d8dd13856f" on my git clone, now I am
happily running 2.6.23-rc3-ge60a without the NETDEV WATCHDOG message.

-
