Re: 2.6.29 forcedeth hang W/O NAPI enabled

Previous thread: [net-next PATCH v3] igbvf: add new driver to support 82576 virtual functions by Jeff Kirsher on Wednesday, March 25, 2009 - 2:52 pm. (17 messages)

Next thread: 2.6.29 forcedeth hang W/O NAPI enabled by Adam Richter on Wednesday, March 25, 2009 - 5:06 pm. (2 messages)
From: Adam Richter
Date: Wednesday, March 25, 2009 - 4:24 pm

I am experiencing what is probably the same forcedeth ethernet
hang with FORCEDETH_NAPI disabled as reported by Berkley Shands.  I
want to add the following additional data (items 2-7 basically just
confirm what one would expect):

	1) I can narrow where the problem was introduced.  The problem
	   does not occur for me in 2.6.29-rc8-git6, the last git snapshot
	   before 2.6.29.  There are no changes to forcedeth.c between
	   these versions.

	2) The amount of time it takes to reproduce the problem seems
	   to depend on networking utilization.  I can reproduce the
	   problem in about 30 seconds by doing "ping -f" to a
	   computer on my local ethernet for about one minute.
	   Otherwise, my computer, which normally does not do much
	   network communication takes about an hour to exhibit the
	   problem.

	3) I can recover by doing "rmmod forcedeth ; modprobe forcedeth"
	   even without recompiling with NAPI enabled, but the
	   problem seems to recur more quickly, until reloading the
	   forcedeth module no longer seems to work.  (I infer from
	   Berkley Shands' message that reloading the module
	   recompiled with NAPI enabled will cause the problem not
	   to recur.)

	4) Given that this looks like a NAPI problem, it should come
	   as no surprise that ethernet transmit still works when the
	   problem is occuring.  I know this because I can run ping
	   from the effected machine to a target machine running
	   tcpdump, and the target machine sees the ping packets.

	5) When the problem occurs, "ifconfig eth0" reports a gradually
	   increasing count of "RX packets" (I assume from random
	   broadcast packets originating elsewhere on the local
	   ethernet), and no obvious signs of trouble:
 
          RX packets:2092 errors:0 dropped:0 overruns:0 frame:0
          TX packets:34 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:177338 (173.1 KB)  TX bytes:6732 (6.5 KB)

	6) No complaints on the kernel console ...
From: David Miller
Date: Wednesday, March 25, 2009 - 5:05 pm

From: Adam Richter <adam_richter2004@yahoo.com>

We're pretty sure we know exactly what commit causes this.

Can you try playing with a patch Jarek P. just posted in
the thread where this bug is being discussed?  (Subject:
Revert "gro: Fix legacy path napi_complete crash"):

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..cf53c24 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2589,7 +2589,11 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
 			local_irq_enable();
-			napi_complete(napi);
+			napi_gro_flush(napi);
+			local_irq_disable();
+			if (skb_queue_empty(&queue->input_pkt_queue))
+				__napi_complete(napi);
+			local_irq_enable();
 			goto out;
 		}
 		local_irq_enable();

--

From: Adam Richter
Date: Wednesday, March 25, 2009 - 6:20 pm

Hello David,

The patch you forwarded to me seems to work.  Thank you for bringing it
to my attention.

In particular linux-2.6.29 with your patch applied with CONFIG_FORCEDETH_NAPI disabled has been doing "ping -f" on the same
local computer I was tested with before for more than 5 minutes with
no problem.  I am able to surf the web and compose this email in the
meantime.  In the past, "ping -f" with NAPI disabled would produce the problem within about 30 seconds (with NAPI enabled, it did not seem to reproduce the problem, but I am pretty sure that the problem did happen once with with forcedeth module compiled with NAPI, although my quick
efforts to reproduce the problem that way did not immediately succeed).

I have appended the patch you sent, just to be clear, since there are a few patches being discussed in the thread you referred to.  If I have anything else to add, I will follow-up in that thread.

Thank you for your help!

Adam




      
--

From: David Miller
Date: Wednesday, March 25, 2009 - 8:14 pm

From: Adam Richter <adam_richter2004@yahoo.com>

Thanks for testing.
--

From: Herbert Xu
Date: Wednesday, March 25, 2009 - 8:36 pm

Hi Adam:

Any chance you can test this patch instead of the previous one?

net: Fix netpoll lockup in legacy receive path

When I fixed the GRO crash in the legacy receive path I used
napi_complete to replace __napi_complete.  Unfortunately they're
not the same when NETPOLL is enabled, which may result in us
not calling __napi_complete at all.

What's more, we really do need to keep the __napi_complete call
within the IRQ-off section since in theory an IRQ can occur in
between and fill up the backlog to the maximum, causing us to
lock up.

This patch fixes this by essentially open-coding __napi_complete.

Note we no longer need the memory barrier because this function
is per-cpu.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..2a7f6b3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2588,9 +2588,10 @@ static int process_backlog(struct napi_struct *napi, int quota)
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
+			list_del(&napi->poll_list);
+			clear_bit(NAPI_STATE_SCHED, &napi->state);
 			local_irq_enable();
-			napi_complete(napi);
-			goto out;
+			break;
 		}
 		local_irq_enable();
 
@@ -2599,7 +2600,6 @@ static int process_backlog(struct napi_struct *napi, int quota)
 
 	napi_gro_flush(napi);
 
-out:
 	return work;
 }

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Adam Richter
Date: Wednesday, March 25, 2009 - 10:24 pm

OK.  I tried your patch.  It seems fine.  I was able to do "ping -f" for five minutes with no problems, and surf the web during that time.

I preserving the rest of your message below, just to be clear about which patch I tested.



      
--

From: Herbert Xu
Date: Wednesday, March 25, 2009 - 11:58 pm

Thanks for testing! Since you also have a forcedeth card, could
you try rebuilding your kernel with FORCEDETH_NAPI enabled and

Yep it's the right one.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--

From: Adam Richter
Date: Thursday, March 26, 2009 - 4:29 pm

It works so far, doing "ping -f" for more than five minutes while I
browse the web, but bear in mind that I was not able to reproduce the
problem earlier with FORCEDETH_NAPI on demand.  I only saw an ethernet lock up with NAPI enabled once (with the original 2.6.29 code).  I
will let you know if the lockup occurs with your code. 

Just to be clear which version I am now testing, I have attached a
copy of process_backlog() from my net/core/dev.c.

Adam



static int process_backlog(struct napi_struct *napi, int quota)
{
	int work = 0;
	struct softnet_data *queue = &__get_cpu_var(softnet_data);
	unsigned long start_time = jiffies;

	napi->weight = weight_p;
	do {
		struct sk_buff *skb;

		local_irq_disable();
		skb = __skb_dequeue(&queue->input_pkt_queue);
		if (!skb) {
			list_del(&napi->poll_list);
			clear_bit(NAPI_STATE_SCHED, &napi->state);
			local_irq_enable();
			break;
		}
		local_irq_enable();

		napi_gro_receive(napi, skb);
	} while (++work < quota && jiffies == start_time);

	napi_gro_flush(napi);

	return work;
}



      
--

Previous thread: [net-next PATCH v3] igbvf: add new driver to support 82576 virtual functions by Jeff Kirsher on Wednesday, March 25, 2009 - 2:52 pm. (17 messages)

Next thread: 2.6.29 forcedeth hang W/O NAPI enabled by Adam Richter on Wednesday, March 25, 2009 - 5:06 pm. (2 messages)