Re: 2.6.20->2.6.21 - networking dies after random time

From: Ingo Molnar
Date: Monday, August 6, 2007 - 12:03 am

* Marcin Ślusarz <marcin.slusarz@gmail.com> wrote:


please try Jarek's second patch too - there was a missing unmask.

	Ingo

-------------->
Subject: genirq: fix simple and fasteoi irq handlers
From: Jarek Poplawski <jarkao2@o2.pl>

After the "genirq: do not mask interrupts by default" patch, a disabled
interrupt is no longer masked immediately on request, but only once it
actually fires. However, handle_simple_irq() and handle_fasteoi_irq()
can skip this masking one or more times if the irq is currently being
serviced (IRQ_INPROGRESS), possibly disrupting a driver's work.

Marcin Slusarz diagnosed the main cause of the problems here, identified
the broken patch, and wrote the first patch that can fix this.
Additional test patches from Thomas Gleixner and Ingo Molnar, tested by
Marcin Slusarz, helped to narrow down the possible causes even further.
Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by the above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, the two most probable scenarios here are:

1. After a hw resend there could be a conflict between the retriggered
edge type irq and the next level type one: e.g. if this level type
irq (the io_apic is enabled then) is triggered while the retriggered
irq is being serviced (IRQ_INPROGRESS), there is a goto out with eoi,
and the next such level irqs are probably triggered in a loop, so
there is probably a kind of flood in the io_apic until servicing of
the retriggered edge irq has ended.
2. There is something wrong with ioapic_retrigger_irq (less probable,
because this should probably be seen with 'normal' edge retriggers,
but on the other hand, they could be less common).

So, if it is #1, this fixed patch should work.

But, since level types don't need these retriggers much, I think
this "don't mask interrupts by default" idea should be rethought:
is there enough gain to risk such hard-to-diagnose errors?

So, IMHO, there should be at least a possibility to turn this off for
level types in the config (it should be a visible option, so people can
find & try it before writing for help or changing a network card).


Signed-off-by: Jarek Poplawski <jarkao2@o2.pl>

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.000000000 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;
 
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out_unlock:
 	spin_unlock(&desc->lock);
 }
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out:
 	desc->chip->eoi(irq);
 
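For readers who want to follow the control-flow change without a kernel
tree at hand, here is a minimal user-space sketch of the entry and exit
checks that the handle_fasteoi_irq() hunks above touch. It is not kernel
code: struct sim_desc, the SIM_* bits and all sim_* functions are
hypothetical stand-ins invented for this illustration, the bit values
are arbitrary, and setting the in-progress bit on entry is an assumption
about code the hunks do not show.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for IRQ_INPROGRESS, IRQ_DISABLED and IRQ_PENDING.
 * The bit values are arbitrary and are not the kernel's. */
enum {
	SIM_INPROGRESS = 1 << 0,
	SIM_DISABLED   = 1 << 1,
	SIM_PENDING    = 1 << 2,
};

/* Hypothetical, heavily reduced model of an irq descriptor. */
struct sim_desc {
	unsigned int status;
	bool has_action;	/* models desc->action != NULL */
	bool masked;		/* models the effect of chip->mask()/unmask() */
};

/* Pre-patch entry check: an irq that is already being serviced leaves
 * early without masking the line or recording a pending interrupt. */
static bool sim_entry_old(struct sim_desc *d)
{
	if (d->status & SIM_INPROGRESS)
		return false;		/* "goto out": no mask, no PENDING */
	if (!d->has_action || (d->status & SIM_DISABLED)) {
		d->status |= SIM_PENDING;
		d->masked = true;
		return false;
	}
	d->status |= SIM_INPROGRESS;	/* assumed; not visible in the hunks */
	return true;
}

/* Patched entry check: "in progress" is handled like "disabled",
 * so the line is masked and the interrupt is marked pending. */
static bool sim_entry_new(struct sim_desc *d)
{
	if (!d->has_action ||
	    (d->status & (SIM_INPROGRESS | SIM_DISABLED))) {
		d->status |= SIM_PENDING;
		d->masked = true;
		return false;
	}
	d->status |= SIM_INPROGRESS;	/* assumed; not visible in the hunks */
	return true;
}

/* Patched exit path: clear the in-progress bit and undo the mask
 * unless the irq has been disabled in the meantime. */
static void sim_exit_new(struct sim_desc *d)
{
	d->status &= ~SIM_INPROGRESS;
	if (!(d->status & SIM_DISABLED))
		d->masked = false;
}

/* Scenario: a second interrupt arrives while the first one is still
 * being serviced. Returns whether the line ended up masked. */
static bool sim_second_irq_masked(bool patched)
{
	struct sim_desc d = { 0, true, false };

	if (patched) {
		sim_entry_new(&d);	/* first interrupt: starts servicing */
		sim_entry_new(&d);	/* second interrupt: masked + pending */
	} else {
		sim_entry_old(&d);	/* first interrupt: starts servicing */
		sim_entry_old(&d);	/* second interrupt: silently skipped */
	}
	return d.masked;
}
```

Calling sim_second_irq_masked(false) and sim_second_irq_masked(true)
from a small main() contrasts the two behaviours: in this model the
old check leaves the line unmasked for the nested interrupt, while the
patched check masks it and sim_exit_new() unmasks it again once
servicing ends. The model ignores locking, eoi and the real handler
call, so it says nothing about whether scenario #1 is the actual cause.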