I don't understand this code, but at first glance drivers/net/r8169.c is obviously buggy. The work->func, rtl8169_reset_task(), calls rtl8169_rx_interrupt() -> netif_receive_skb(), and the last one may only be called from softirq. Oleg. --
Yes, this is wrong. In this context (process context, not softirq), we
should use netif_rx() or just drop frames if we are in reset phase.
Also, this driver schedules a reset in case a fifo error is reported in
rtl8169_rx_interrupt()
if (status & RxFOVF) {
rtl8169_schedule_work(dev, rtl8169_reset_task);
dev->stats.rx_fifo_errors++;
}
This seems very strange too : In case of a RX spike, we reset card ?
--
From: Eric Dumazet <eric.dumazet@gmail.com> It's possible that this has been found to hang the card. If so, it should be documented because otherwise yes we should not be doing this. --
Sergey,
Here is a compiled but untested patch (I dont have the hardware), could
you please test it ?
Thanks
[PATCH] r8169: Fix rtl8169_rx_interrupt()
In case a reset is performed, rtl8169_rx_interrupt() is called from
process context instead of softirq context. Special care must be taken
to call appropriate network core services (netif_rx() instead of
netif_receive_skb()). VLAN handling also corrected.
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Diagnosed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 9d3ebf3..d873639 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -1038,14 +1038,14 @@ static void rtl8169_vlan_rx_register(struct net_device *dev,
}
static int rtl8169_rx_vlan_skb(struct rtl8169_private *tp, struct RxDesc *desc,
- struct sk_buff *skb)
+ struct sk_buff *skb, int polling)
{
u32 opts2 = le32_to_cpu(desc->opts2);
struct vlan_group *vlgrp = tp->vlgrp;
int ret;
if (vlgrp && (opts2 & RxVlanTag)) {
- vlan_hwaccel_receive_skb(skb, vlgrp, swab16(opts2 & 0xffff));
+ __vlan_hwaccel_rx(skb, vlgrp, swab16(opts2 & 0xffff), polling);
ret = 0;
} else
ret = -1;
@@ -1062,7 +1062,7 @@ static inline u32 rtl8169_tx_vlan_tag(struct rtl8169_private *tp,
}
static int rtl8169_rx_vlan_skb(struct rtl8169_private *tp, struct RxDesc *desc,
- struct sk_buff *skb)
+ struct sk_buff *skb, int polling)
{
return -1;
}
@@ -4429,12 +4429,20 @@ out:
return done;
}
+/*
+ * Warning : rtl8169_rx_interrupt() might be called :
+ * 1) from NAPI (softirq) context
+ * (polling = 1 : we should call netif_receive_skb())
+ * 2) from process context (rtl8169_reset_task())
+ * (polling = 0 : we must call netif_rx() instead)
+ */
static int rtl8169_rx_interrupt(struct net_device *dev,
struct rtl8169_private *tp,
void __iomem *ioaddr, u32 budget)
{
...Hello, Sure. Give me a couple of days. By the way, should I test against .34 or .33? Sergey
It's an old bug anyway, so you can pick up whats is the best for you. My personal pref would be current kernel. To trigger the original condition (fifo overflow), you might flood your machine from another one with small packets, for example using pktgen. Thanks --
Hello,
Eric, I did pktgen testing (3 times).
sudo cat /proc/net/pktgen/eth0
Params: count 1000000000 min_pkt_size: 42 max_pkt_size: 42
frags: 0 delay: 0 clone_skb: 4000000 ifname: eth0
flows: 0 flowlen: 0
queue_map_min: 0 queue_map_max: 0
dst_min: 10.6.22.59 dst_max:
src_min: src_max:
src_mac: 00:1a:92:c9:a0:68 dst_mac: 00:1a:92:c9:a0:68
udp_src_min: 9 udp_src_max: 9 udp_dst_min: 9 udp_dst_max: 9
src_mac_count: 0 dst_mac_count: 0
Flags:
Current:
pkts-sofar: 1000000000 errors: 0
started: 9957912410us stopped: 16682620361us idle: 1231us
seq_num: 1000000001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
cur_saddr: 0x3b16060a cur_daddr: 0x3b16060a
cur_udp_dst: 9 cur_udp_src: 9
cur_queue_map: 0
flows: 0
Result: OK: 6724707951(c6724706719+d1231) nsec, 1000000000 (42byte,0frags)
148705pps 49Mb/sec (49964880bps) errors: 0
Without any errors from the r8169 side. Can we consider testing successful?
If so, fell free to add Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Sergey
Nope! Here it is: [17250.998293] ------------[ cut here ]------------ [17250.998305] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0xc1/0x125() [17250.998308] Hardware name: F3JC [17250.998312] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out [17250.998315] Modules linked in: pktgen ppp_async crc_ccitt ipv6 ppp_generic slhc snd_hwdep snd_hda_codec_si3054 snd_hda_codec_realtek sdhci_pci sdhci asus_laptop sparse_keymap mmc_core led_class snd_hda_intel snd_hda_codec snd_pcm snd_timer snd_page_alloc rng_core sg evdev i2c_i801 snd soundcore psmouse r8169 serio_raw mii uhci_hcd ehci_hcd sr_mod usbcore cdrom sd_mod ata_piix [17250.998371] Pid: 3985, comm: kpktgend_0 Tainted: G W 2.6.34-rc1-dbg-git6-r8169 #46 [17250.998375] Call Trace: [17250.998383] [<c102e353>] warn_slowpath_common+0x65/0x7c [17250.998388] [<c126b654>] ? dev_watchdog+0xc1/0x125 [17250.998393] [<c102e39e>] warn_slowpath_fmt+0x24/0x27 [17250.998398] [<c126b654>] dev_watchdog+0xc1/0x125 [17250.998405] [<c1036bbb>] ? run_timer_softirq+0x120/0x1eb [17250.998411] [<c1036c11>] run_timer_softirq+0x176/0x1eb [17250.998416] [<c1036bbb>] ? run_timer_softirq+0x120/0x1eb [17250.998421] [<c126b593>] ? dev_watchdog+0x0/0x125 [17250.998426] [<c1032df9>] __do_softirq+0x8d/0x117 [17250.998431] [<c1032eae>] do_softirq+0x2b/0x43 [17250.998436] [<c1032fd3>] irq_exit+0x38/0x75 [17250.998442] [<c1014e75>] smp_apic_timer_interrupt+0x66/0x74 [17250.998448] [<c12c812a>] apic_timer_interrupt+0x36/0x3c [17250.998457] [<c1185d18>] ? do_raw_spin_trylock+0x28/0x37 [17250.998464] [<c12c7101>] _raw_spin_lock+0x2f/0x58 [17250.998472] [<f80855ca>] ? spin_lock+0x8/0xa [pktgen] [17250.998478] [<f80855ca>] spin_lock+0x8/0xa [pktgen] [17250.998484] [<f80873b6>] pktgen_thread_worker+0x9b/0x631 [pktgen] [17250.998491] [<c103f9f1>] ? autoremove_wake_function+0x0/0x2f [17250.998497] [<c103f9f1>] ? autoremove_wake_function+0x0/0x2f [17250.998503] [<f808731b>] ? pktgen_thread_worker+0x0/0x631 ...
But this stack trace is on pktgen side (the sender), and my patch was about the receiver ? I wonder if you dont hit another problem :) --
This is tx side which seems blocked for more than 6 seconds. To really test the patch, you need following setup : machine A with r8169 NIC, patch applied, receiver of pktgen flood. machine B with any NIC, preferably a not buggy one, doing the pktgen flood to machine A If machine A survives, my patch is tested and ok. If machine B crashes, we have another problem to investigate --
Hello, Got it right now. System completely froze. Even SysRq didn't work. /*spin_lock deadlock?*/ NOTE: I'm losing network constantly with pktgen tests [24208.010980] r8169 0000:02:00.0: eth0: link up [24220.010980] r8169 0000:02:00.0: eth0: link up [24232.011030] r8169 0000:02:00.0: eth0: link up [24340.010980] r8169 0000:02:00.0: eth0: link up [24352.010966] r8169 0000:02:00.0: eth0: link up [24364.010966] r8169 0000:02:00.0: eth0: link up [24376.010964] r8169 0000:02:00.0: eth0: link up [24388.010961] r8169 0000:02:00.0: eth0: link up [24400.010959] r8169 0000:02:00.0: eth0: link up [24412.010963] r8169 0000:02:00.0: eth0: link up Traces: [24600.625078] INFO: task events/0:9 blocked for more than 120 seconds. [24600.625083] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [24600.625087] events/0 D 00001636 0 9 2 0x00000000 [24600.625096] f7085ebc 00000046 a87e9490 00001636 c1617cc0 c1617cc0 c1617cc0 c1617cc0 [24600.625109] f707c250 c1617cc0 c1617cc0 00000000 00000000 00000000 00000000 f707bfc0 [24600.625122] c14745c8 c14745c8 f707bfc0 00000202 f7085efc c12c6449 00000000 00085edc [24600.625135] Call Trace: [24600.625148] [<c12c6449>] __mutex_lock_common+0x233/0x3af [24600.625155] [<c12c65ff>] mutex_lock_nested+0x12/0x15 [24600.625163] [<c1265e0f>] ? rtnl_lock+0xf/0x11 [24600.625168] [<c1265e0f>] rtnl_lock+0xf/0x11 [24600.625183] [<f9174acd>] rtl8169_reset_task+0x16/0xee [r8169] [24600.625191] [<c103c887>] worker_thread+0x161/0x233 [24600.625196] [<c103c845>] ? worker_thread+0x11f/0x233 [24600.625205] [<f9174ab7>] ? rtl8169_reset_task+0x0/0xee [r8169] [24600.625214] [<c103f9f1>] ? autoremove_wake_function+0x0/0x2f [24600.625220] [<c103c726>] ? worker_thread+0x0/0x233 [24600.625225] [<c103f6ce>] kthread+0x6a/0x6f [24600.625232] [<c103f664>] ? kthread+0x0/0x6f [24600.625238] [<c1002e42>] kernel_thread_helper+0x6/0x1a [24600.625242] INFO: lockdep is turned off. [24600.625259] INFO: task X:3176 blocked for ...
OK thanks for the report, this rtl8169_reset_task() seems pretty buggy,
or multiple invocation...
Did you tried removing the rtl8169_schedule_work() call from
rtl8169_rx_interrupt() ?
Maybe the reset is not necessary at all in case of fifo overflow..
Cumulative patch :
diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 9d3ebf3..d6ef4dd 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -1038,14 +1038,14 @@ static void rtl8169_vlan_rx_register(struct net_device *dev,
}
static int rtl8169_rx_vlan_skb(struct rtl8169_private *tp, struct RxDesc *desc,
- struct sk_buff *skb)
+ struct sk_buff *skb, int polling)
{
u32 opts2 = le32_to_cpu(desc->opts2);
struct vlan_group *vlgrp = tp->vlgrp;
int ret;
if (vlgrp && (opts2 & RxVlanTag)) {
- vlan_hwaccel_receive_skb(skb, vlgrp, swab16(opts2 & 0xffff));
+ __vlan_hwaccel_rx(skb, vlgrp, swab16(opts2 & 0xffff), polling);
ret = 0;
} else
ret = -1;
@@ -1062,7 +1062,7 @@ static inline u32 rtl8169_tx_vlan_tag(struct rtl8169_private *tp,
}
static int rtl8169_rx_vlan_skb(struct rtl8169_private *tp, struct RxDesc *desc,
- struct sk_buff *skb)
+ struct sk_buff *skb, int polling)
{
return -1;
}
@@ -4429,12 +4429,20 @@ out:
return done;
}
+/*
+ * Warning : rtl8169_rx_interrupt() might be called :
+ * 1) from NAPI (softirq) context
+ * (polling = 1 : we should call netif_receive_skb())
+ * 2) from process context (rtl8169_reset_task())
+ * (polling = 0 : we must call netif_rx() instead)
+ */
static int rtl8169_rx_interrupt(struct net_device *dev,
struct rtl8169_private *tp,
void __iomem *ioaddr, u32 budget)
{
unsigned int cur_rx, rx_left;
unsigned int delta, count;
+ int polling = (budget != ~(u32)0) ? 1 : 0;
cur_rx = tp->cur_rx;
rx_left = NUM_RX_DESC + tp->dirty_rx - cur_rx;
@@ -4459,7 +4467,6 @@ static int rtl8169_rx_interrupt(struct net_device *dev,
if (status & RxCRC)
...Not yet. I'm reading rtl8169_rx_interrupt now and rtl8169_schedule_work in case of RxFOVF Will try it.
Hello, cumulative patch: [ 155.337373] ------------[ cut here ]------------ [ 155.337386] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0xc1/0x125() [ 155.337390] Hardware name: F3JC [ 155.337394] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out [ 155.337397] Modules linked in: pktgen ppp_async crc_ccitt ipv6 ppp_generic slhc snd_hwdep snd_hda_codec_si3054 snd_hda_codec_realtek sdhci_pci sdhci snd_hda_intel snd_hda_codec asus_laptop sparse_keymap mmc_core led_class snd_pcm snd_timer snd_page_alloc psmouse rng_core snd soundcore sg i2c_i801 evdev serio_raw r8169 mii usbhid hid uhci_hcd ehci_hcd sr_mod cdrom sd_mod usbcore ata_piix [ 155.337468] Pid: 7, comm: ksoftirqd/1 Tainted: G W 2.6.34-rc1-dbg-git6-r8169 #47 [ 155.337472] Call Trace: [ 155.337481] [<c102e293>] warn_slowpath_common+0x65/0x7c [ 155.337506] [<c126ac34>] ? dev_watchdog+0xc1/0x125 [ 155.337512] [<c102e2de>] warn_slowpath_fmt+0x24/0x27 [ 155.337517] [<c126ac34>] dev_watchdog+0xc1/0x125 [ 155.337525] [<c1036afb>] ? run_timer_softirq+0x120/0x1eb [ 155.337530] [<c1036b51>] run_timer_softirq+0x176/0x1eb [ 155.337536] [<c1036afb>] ? run_timer_softirq+0x120/0x1eb [ 155.337566] [<c126ab73>] ? dev_watchdog+0x0/0x125 [ 155.337576] [<c1032d39>] __do_softirq+0x8d/0x117 [ 155.337667] [<c1032dee>] do_softirq+0x2b/0x43 [ 155.337729] [<c1032fc1>] run_ksoftirqd+0x71/0x140 [ 155.337745] [<c1032f50>] ? run_ksoftirqd+0x0/0x140 [ 155.337810] [<c103f60e>] kthread+0x6a/0x6f [ 155.337832] [<c103f5a4>] ? kthread+0x0/0x6f [ 155.337903] [<c1002e42>] kernel_thread_helper+0x6/0x1a [ 155.337907] ---[ end trace a22d306b065d4a68 ]--- [ 155.350902] r8169 0000:02:00.0: eth0: link up [ 167.350892] r8169 0000:02:00.0: eth0: link up Sergey
trimming some cc On receiver ? I suspect lot of work is needed on this driver to make it working, but I dont have a machine with said adapter. Are you in 100 Mb full duplex mode ? --
Localhost pollution (both sender and receiver are localhost). In 2 hours I will ethtool eth0 Settings for eth0: Supported ports: [ TP MII ] Supported link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full Supports auto-negotiation: Yes Advertised link modes: 10baseT/Half 10baseT/Full 100baseT/Half 100baseT/Full 1000baseT/Half 1000baseT/Full Advertised auto-negotiation: Yes Speed: 100Mb/s Duplex: Full Port: MII PHYAD: 0 Transceiver: internal Auto-negotiation: on Supports Wake-on: pumbg Wake-on: g Current message level: 0x00000033 (51) Link detected: yes Sergey
Hello,
We did pktgen over LAN testing for several hours:
iftop
19.1Mb 38.1Mb 57.2Mb 76.3Mb 95.4Mb
└──────────────────────┴──────────────────────┴───────────────────────┴──────────────────────┴───────────────────────
xxxxxxxx0007t2 => xxxxxxxxxxxxxx.xxxxx.xxxx.xxx 0b 3.60Kb 4.28Kb
<= 92.7Mb 92.7Mb 92.5Mb
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
TX: cumm: 2.50MB peak: 57.7Kb rates: 7.71Kb 5.97Kb 10.7Kb
RX: 28.0GB 92.9Mb 92.7Mb 92.8Mb 92.6Mb
TOTAL: 28.0GB 92.9Mb 92.7Mb 92.8Mb 92.6M
Without any problems.
As soon as I switched back to localhost "pollution":
[ 7343.999279] ------------[ cut here ]------------
[ 7343.999292] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0xc1/0x125()
[ 7343.999295] Hardware name: F3JC
[ 7343.999298] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[ 7343.999301] Modules linked in: pktgen ppp_async crc_ccitt ipv6 ppp_generic slhc snd_hwdep snd_hda_codec_si3054 snd_hda_codec_realtek sdhci_pci sdhci snd_hda_intel mmc_core
asus_laptop sparse_keymap snd_hda_codec led_class snd_pcm rng_core sg psmouse snd_timer snd_page_alloc i2c_i801 evdev snd soundcore serio_raw r8169 mii usbhid hid uhci_hcd
ehci_hcd sr_mod cdrom sd_mod usbcore ata_piix
[ 7343.999361] Pid: 4801, comm: kpktgend_0 Tainted: G W 2.6.34-rc1-dbg-git6-r8169 #47
[ 7343.999364] Call Trace:
[ 7343.999372] [<c102e293>] warn_slowpath_common+0x65/0x7c
[ 7343.999378] [<c126ac34>] ? dev_watchdog+0xc1/0x125
[ 7343.999383] [<c102e2de>] ...What do you exactly mean by 'localhost pollution' ? --
Hello, Sorry. I mean pktgen started on localhost with src set to localhost (not pktgen over LAN from another machine). Sergey
On Wed, Mar 17, 2010 at 08:37:17AM +0100, Eric Dumazet wrote:
This one should help too if Sergey owns a (MSI) 8168.
diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index dfc3573..721e7f3 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -4532,21 +4532,39 @@ static int rtl8169_rx_interrupt(struct net_device *dev,
return count;
}
+static void rtl_napi_cond_schedule(struct rtl8169_private *tp, u16 status)
+{
+ if (status & tp->napi_event) {
+ void __iomem *ioaddr = tp->mmio_addr;
+
+ RTL_W16(IntrMask, tp->intr_event & ~tp->napi_event);
+ mmiowb();
+ napi_schedule(&tp->napi);
+ }
+}
+
static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
{
struct net_device *dev = dev_instance;
struct rtl8169_private *tp = netdev_priv(dev);
void __iomem *ioaddr = tp->mmio_addr;
int handled = 0;
- int status;
+ u16 status;
/* loop handling interrupts until we have no new ones or
* we hit a invalid/hotplug case.
*/
status = RTL_R16(IntrStatus);
while (status && status != 0xffff) {
+ u16 acked;
+
handled = 1;
+ acked = (status & RxFIFOOver) ? (status | RxOverflow) : status;
+ acked &= ~tp->napi_event;
+
+ RTL_W16(IntrStatus, acked);
+
/* Handle all of the error cases first. These will reset
* the chip, so just exit the loop.
*/
@@ -4557,7 +4575,7 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
/* Work around for rx fifo overflow */
if (unlikely(status & RxFIFOOver) &&
- (tp->mac_version == RTL_GIGA_MAC_VER_11)) {
+ (tp->mac_version == RTL_GIGA_MAC_VER_11)) {
netif_stop_queue(dev);
rtl8169_tx_timeout(dev);
break;
@@ -4571,30 +4589,9 @@ static irqreturn_t rtl8169_interrupt(int irq, void *dev_instance)
if (status & LinkChg)
rtl8169_check_link_status(dev, tp, ioaddr);
- /* We need to see the lastest version of tp->intr_mask to
- * avoid ignoring an MSI interrupt and having to wait for
- * another event which may never come.
- ...Hello,
This hunk is rejected. I suppose I should apply patch against unpatched version (Eric's patch).
Correct? /* Eric's patch does make sense to my mind. */
@@ -4604,22 +4601,19 @@
void __iomem *ioaddr = tp->mmio_addr;
int work_done;
+
+ RTL_W16(IntrStatus, tp->napi_event);
+
work_done = rtl8169_rx_interrupt(dev, tp, ioaddr, (u32) budget);
rtl8169_tx_interrupt(dev, tp, ioaddr);
if (work_done < budget) {
napi_complete(napi);
- /* We need for force the visibility of tp->intr_mask
- * for other CPUs, as we can loose an MSI interrupt
- * and potentially wait for a retransmit timeout if we don't.
- * The posted write to IntrMask is safe, as it will
- * eventually make it to the chip and we won't loose anything
- * until it does.
- */
- tp->intr_mask = 0xffff;
- smp_wmb();
RTL_W16(IntrMask, tp->intr_event);
+ mmiowb();
+
+ rtl_napi_cond_schedule(tp, RTL_R16(IntrStatus));
}
Sergey
Hello, Patched r8169 (both Eric's and Francois' patches are applied). /*pktgen localhost2localhost*/ [ 498.818640] pktgen 2.72: Packet Generator for packet performance testing. [ 568.999957] ------------[ cut here ]------------ [ 568.999969] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0xc1/0x125() [ 568.999973] Hardware name: F3JC [ 568.999976] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out [ 568.999979] Modules linked in: pktgen snd_hwdep snd_hda_codec_si3054 snd_hda_codec_realtek asus_laptop sparse_keymap sdhci_pci sdhci snd_hda_intel mmc_core led_class snd_hda_codec snd_pcm snd_timer psmouse snd soundcore snd_page_alloc sg evdev i2c_i801 rng_core serio_raw r8169 mii uhci_hcd sr_mod ehci_hcd cdrom sd_mod usbcore ata_piix [ 569.000029] Pid: 3350, comm: kpktgend_0 Tainted: G W 2.6.34-rc1-dbg-git6-r8169 #48 [ 569.000033] Call Trace: [ 569.000041] [<c102e293>] warn_slowpath_common+0x65/0x7c [ 569.000046] [<c126c024>] ? dev_watchdog+0xc1/0x125 [ 569.000051] [<c102e2de>] warn_slowpath_fmt+0x24/0x27 [ 569.000056] [<c126c024>] dev_watchdog+0xc1/0x125 [ 569.000063] [<c1036afb>] ? run_timer_softirq+0x120/0x1eb [ 569.000069] [<c1036b51>] run_timer_softirq+0x176/0x1eb [ 569.000074] [<c1036afb>] ? run_timer_softirq+0x120/0x1eb [ 569.000079] [<c126bf63>] ? dev_watchdog+0x0/0x125 [ 569.000084] [<c1032d39>] __do_softirq+0x8d/0x117 [ 569.000089] [<c1032dee>] do_softirq+0x2b/0x43 [ 569.000097] [<f809510d>] ? pktgen_xmit+0xdb7/0xe8e [pktgen] [ 569.000102] [<c1032e9c>] _local_bh_enable_ip+0x88/0xb0 [ 569.000107] [<c1032ecc>] local_bh_enable_ip+0x8/0xa [ 569.000114] [<c12c83a0>] _raw_spin_unlock_bh+0x2f/0x32 [ 569.000120] [<f809510d>] pktgen_xmit+0xdb7/0xe8e [pktgen] [ 569.000129] [<f91674a9>] ? rtl8169_start_xmit+0x0/0x304 [r8169] [ 569.000136] [<c1183940>] ? trace_hardirqs_on_thunk+0xc/0x10 [ 569.000143] [<c105012d>] ? print_lock_contention_bug+0x11/0xb2 [ 569.000150] [<f80935ca>] ? spin_lock+0x8/0xa ...
Hello, On (03/16/10 19:48), Eric Dumazet wrote: [..] Eric, I think you can put Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> on the cumulative patch as I don't see any errors with pktgen over LAN testing. Thanks. Sergey
Hi Thanks for the report. Problem is I am not confident enough on this driver. I believe François Romieu is the one that can possibly push a patch ? Thanks --
Thanks. Hm. François' patch introduced new 'problems'/'stack traces' with pktgen testing... François, any comments? Sergey
Eric Dumazet <eric.dumazet@gmail.com> : Feel free to push. -- Ueimor --
OK thanks, here is the patch then, but I prefer let the
rtl8169_schedule_work() from rtl8169_rx_interrupt() for now, its removal
might be the subject of another patch.
Lets be very conservative for now.
[PATCH net-next-2.6] r8169: Fix rtl8169_rx_interrupt()
In case a reset is performed, rtl8169_rx_interrupt() is called from
process context instead of softirq context. Special care must be taken
to call appropriate network core services (netif_rx() instead of
netif_receive_skb()). VLAN handling also corrected.
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Diagnosed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
drivers/net/r8169.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 964305c..1dffe29 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -1054,14 +1054,14 @@ static void rtl8169_vlan_rx_register(struct net_device *dev,
}
static int rtl8169_rx_vlan_skb(struct rtl8169_private *tp, struct RxDesc *desc,
- struct sk_buff *skb)
+ struct sk_buff *skb, int polling)
{
u32 opts2 = le32_to_cpu(desc->opts2);
struct vlan_group *vlgrp = tp->vlgrp;
int ret;
if (vlgrp && (opts2 & RxVlanTag)) {
- vlan_hwaccel_receive_skb(skb, vlgrp, swab16(opts2 & 0xffff));
+ __vlan_hwaccel_rx(skb, vlgrp, swab16(opts2 & 0xffff), polling);
ret = 0;
} else
ret = -1;
@@ -1078,7 +1078,7 @@ static inline u32 rtl8169_tx_vlan_tag(struct rtl8169_private *tp,
}
static int rtl8169_rx_vlan_skb(struct rtl8169_private *tp, struct RxDesc *desc,
- struct sk_buff *skb)
+ struct sk_buff *skb, int polling)
{
return -1;
}
@@ -4467,12 +4467,20 @@ out:
return done;
}
+/*
+ * Warning : rtl8169_rx_interrupt() might be called :
+ * 1) from NAPI (softirq) context
+ * (polling = 1 : we should call ...From: Eric Dumazet <eric.dumazet@gmail.com>
^^^^^^^^
Trailing whitespace I had to delete.
--
