Re: PATCH: 2.6.26-rc8: Fix IRQF_DISABLED for shared interrupts

Previous thread: oops in Audiowerk2 ALSA driver by Marcin Slusarz on Sunday, June 22, 2008 - 9:47 am. (2 messages)

Next thread: [PATCH] Fix open/close race in saa7134 by Arjan van de Ven on Sunday, June 22, 2008 - 10:05 am. (10 messages)
From: Stefan Becker
Date: Sunday, June 22, 2008 - 9:55 am

Hi,

[I'm not subscribed to this list, so please CC: me when you answer]

I get random machine lockups when accessing my USB harddrive with 
kernels 2.6.24/25. They don't occur with kernel 2.6.23. During testing I 
figured out that it has something to do with the USB Bluetooth adaptor. 
If I remove it before the testing I don't get any lockups.

git bisect resulted in the following bad commit:

e9df41c5c5899259541dc928872cad4d07b82076 is first bad commit
commit e9df41c5c5899259541dc928872cad4d07b82076
Author: Alan Stern <stern@rowland.harvard.edu>
Date:   Wed Aug 8 11:48:02 2007 -0400

     USB: make HCDs responsible for managing endpoint queues

     This patch (as954) implements a suggestion of David Brownell's.  Now
     the host controller drivers are responsible for linking and unlinking
     URBs to/from their endpoint queues.  This eliminates the possiblity of
     strange situations where usbcore thinks an URB is linked but the HCD
     thinks it isn't.  It also means HCDs no longer have to check for URBs
     being dequeued before they were fully enqueued.
[TRUNCATED...]


I've attached the kernel configuration.

Test procedure:

- boot test kernel in single user mode
- "service rsyslog start"
- "service messagebus start"
- "service bluetooth start"
- connect USB drive
- "mount /dev/sdb1 /mnt"
- "dd if=/dev/zero of=/mnt/test bs=10M"
   BAD: lockup, GOOD: writes until partition is full



My USB hardware:

Bus 001 Device 006: ID 04b4:6830 Cypress Semiconductor Corp. USB-2.0 IDE 
Adapter
Bus 001 Device 001: ID 1d6b:0002
Bus 003 Device 002: ID 046d:c001 Logitech, Inc. N48/M-BB48 [FirstMouse Plus]
Bus 003 Device 001: ID 1d6b:0001
Bus 002 Device 003: ID 050d:0081 Belkin Components FBT001v2 Bluetooth
Bus 002 Device 001: ID 1d6b:0001


My USB controllers:

$ /sbin/lspci | fgrep USB
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 
Controller (rev 80)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 ...
From: Rene Herman
Date: Sunday, June 22, 2008 - 10:42 am

On 22-06-08 18:55, Stefan Becker wrote:

--

From: Alan Stern
Date: Sunday, June 22, 2008 - 12:31 pm

Does the same problem still occur in 2.6.26-rc7?  Does it occur if you 
rmmod ehci-hcd?

Machine lockups are awfully hard to debug.  Can you get any information
at all (like Alt-SysRq-T) when this happens?  Can you add debugging
printk statements to the USB bluetooth driver to try and localize where

Knowing this doesn't help much without more information.

Do you have any idea why nobody else has reported this sort of problem?  
Is it reproducible on other machines?

Alan Stern

--

From: Stefan Becker
Date: Monday, June 23, 2008 - 8:52 am

Hi,

[I'm not subscribed to this list, so please CC: me when you answer]



Yes, i.e. it also happens when the external hardrive runs as USB 1.1 

SysRq does not work when the machine locks up. I forgot to mention that 
the test machine is a single CPU machine and that the CPU fan starts to 
run full speed when the lockup occurs.

Guessing from the commit returned by git bisect there is a locking 
error, i.e. the CPU runs into a spinlock that is already locked and 


Too bad. Each bisect cycle took 2-3 hours and the whole process took me 
3 days :-( :-(

That commit has spinlock changes so I hoped that it would be a good 

I attached both USB devices to another, newer dual core laptop. I 
couldn't reproduce the problem there, even when I simulated a single CPU 
machine with maxcpus=1.

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
--

From: Alan Stern
Date: Monday, June 23, 2008 - 11:10 am

That is certainly possible.  But an error like that should affect lots 

Around every place where the driver calls into the core.  You might 
also want to debug the places where uhci-hcd acquires and releases 

I didn't mean that your efforts were wasted.  They just don't help much 

Only what I suggested: Print something in the log whenever a lock is 

Alan Stern

--

From: Stefan Becker
Date: Tuesday, June 24, 2008 - 11:41 am

Hi,


I sprinkled some printk's into hci_usb.c. But I see only messages at the 

I played around with printk a little bit more. With the following change 
to uhci_giveback_urb():

--- a/drivers/usb/host/uhci-q.c
+++ b/drivers/usb/host/uhci-q.c
@@ -1526,11 +1530,17 @@ __acquires(uhci->lock)
         }

         uhci_free_urb_priv(uhci, urbp);
+       printk(KERN_CRIT "UHCI UNLINK ENTER %08x %08x\n",
+              (unsigned int) uhci, (unsigned int) urb);
         usb_hcd_unlink_urb_from_ep(uhci_to_hcd(uhci), urb);
+       printk(KERN_CRIT "UHCI UNLINK LEAVE %08x %08x\n",
+              (unsigned int) uhci, (unsigned int) urb);

         spin_unlock(&uhci->lock);

I see

  UHCI UNLINK ENTER xxxxxxxx yyyyyyyy
  UHCI UNLINK LEAVE xxxxxxxx yyyyyyyy
  UHCI UNLINK ENTER xxxxxxxx yyyyyyyy

on the console at the time of the lockup. Then I added the following change:

--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1107,7 +1107,13 @@ EXPORT_SYMBOL_GPL(usb_hcd_check_unlink_urb);
  void usb_hcd_unlink_urb_from_ep(struct usb_hcd *hcd, struct urb *urb)
  {
         /* clear all state linking urb to this dev (and hcd) */
+#if 0
         spin_lock(&hcd_urb_list_lock);
+#else
+       if (!spin_trylock(&hcd_urb_list_lock)) {
+         printk(KERN_CRIT "HCD URB LIST ALREADY LOCKED!\n");
+       }
+#endif
         list_del_init(&urb->urb_list);
         spin_unlock(&hcd_urb_list_lock);
  }

and get this at the time of the lockup:

  UHCI UNLINK ENTER xxxxxxxx yyyyyyyy
  UHCI UNLINK LEAVE xxxxxxxx yyyyyyyy
  UHCI UNLINK ENTER xxxxxxxx yyyyyyyy
  HCD URB LIST ALREADY LOCKED!
             <---- here the original code would lockup
  UHCI UNLINK LEAVE xxxxxxxx yyyyyyyy
  HCD URB LIST ALREADY LOCKED!
  HCD URB LIST ALREADY LOCKED!
  HCD URB LIST ALREADY LOCKED!

So the lockup is caused by an already locked hcd_urb_list_lock. Is there 
a way to see the lock holder? Or any other suggestions how to proceed?

Regards,

	Stefan

---
Stefan Becker
E-Mail: ...
From: Alan Stern
Date: Tuesday, June 24, 2008 - 2:15 pm

Good work!

There isn't any way to see the lock holder except by adding more 
printk's.  Fortunately that lock is private to hcd.c, so you know it's 
not used anywhere else.  And it's not used in very many places in that 
file, so you can add printk's around each one.  While you're at it, you 
ought to enable CONFIG_USB_DEBUG so that more of the context will be 
visible in the log.

The usage in usb_hcd_link_urb_to_ep() appears benign; the code doesn't 
do anything that might hang while holding the lock.  All it does is 
manipulate a linked list.

You have already instrumented the usage in 
usb_hcd_unlink_urb_from_ep().

That leaves only usb_hcd_flush_endpoint().  And once again, the code 
isn't doing anything while holding the lock except following 
linked-list pointers.  

Taken together, these facts suggest that somehow the linked list has
been corrupted.  So you might also want to add some printk's around the
linked-list activities in flush_endpoint, just to see if the pointer
values look reasonable.

Alan Stern

--

From: Stefan Becker
Date: Wednesday, June 25, 2008 - 8:52 am

Hi,


Well, I guess I'm just lucky it didn't turn into a heisenbug with all 

Unfortunately I could only run a small test today. I added some simple 
debugging code for the spinlock usage in hcd.c (see attached diff) and I 
get the following message at lockup (I tried it twice just to be sure):

HCD URB list locked by usb_hcd_link_urb_to_ep!

As far as I understand the matter this only can happen if 
usb_hcd_link_urb_to_ep() gets interrupted while holding the spinlock. 
But according to the contract at the header of the function it should be 
called with interrupts disabled!

I guess the obvious way forward from here is:

  - replace the spin_lock() in the function with the irqsave version

  - if that fixes the problem add debugging code to the function and 
panic with a stack trace when the interrupts aren't disabled one entry 
(don't know how to detect that yet, any suggestions?) That hopefully 
identifies the culprit that calls the function with interrupts enabled.

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
From: Alan Stern
Date: Wednesday, June 25, 2008 - 11:38 am

So it should.  Do you know how to test whether interrupts are enabled?  
There's a routine called raw_irqs_disabled() defined in 
include/asm/irqflags.h.  Stick it inside usb_hcd_link_urb_to_ep, and 

That should do the trick.  Although it would be quicker just to make a
small change to your existing code: The first time the spin_trylock
fails, do a dump_stack().  You did say this is a UP system, right?  
(Because obviously on SMP systems, one expects spin_trylock to fail
from time to time.)

The only callers of usb_hcd_link_urb_to_ep are the host controller
drivers plus a couple of routines in hcd.c itself.  But all the
callers are in the scope of spin_lock_irqsave or spin_lock_irq!  So
maybe there's more going on here than meets the eye.

Alan Stern

--

From: Stefan Becker
Date: Wednesday, June 25, 2008 - 11:31 pm

Hi,


Tried both and got nothing. So either our reasoning is wrong or 
something really weird is going on. I'll try to improve my debugging 
code and use dump_stack() at the place where it locks up.

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
--

From: Alan Stern
Date: Thursday, June 26, 2008 - 7:25 am

Typo: I should have said "the first time it returns zero..."


Well, it would be pretty surprising if spin_lock_irq or 
spin_lock_irqsave were misbehaving.

Something else you should try is clearing your "owner" string just
before the spinlock is released.  You could also add a check after the
release; if the spinlock can't be locked again immediately then
something is wrong.

Alan Stern

--

From: Stefan Becker
Date: Thursday, June 26, 2008 - 3:07 pm

Hi,



Yes, the initial try was misleading. I tinkered around a little bit more 
and finally figured out that it is usb_hcd_unlink_urb_from_ep() itself 
that is called with interrupts enabled!


So with this code in place the error disappears:

void usb_hcd_unlink_urb_from_ep(struct usb_hcd *hcd, struct urb *urb)
{
	/* clear all state linking urb to this dev (and hcd) */
	unsigned int flags;
	spin_lock_irqsave(&hcd_urb_list_lock, flags);
	list_del_init(&urb->urb_list);
	spin_unlock_irqrestore(&hcd_urb_list_lock, flags);
}

This seems to impact USB performance though. In 2.6.23 (without the 
problem) I get 21MB/s with dd, but with the above "fix" only 14MB/s. But 
  I'll recheck once we have a real error fix in place.


After that I added the following code

         if (!raw_irqs_disabled()) {
	  printk(KERN_CRIT "usb_hcd_unlink_urb_from_ep called with interrupts 
enabled!\n");
	  dump_stack();
	}

and collected the attached kernel messages. I checked the messages 
briefly and it seems that the following code paths have the interrupts 
enabled when calling usb_hcd_unlink_urb_from_ep():

   [<c0574d9d>] usb_hcd_unlink_urb_from_ep+0x25/0x6b
   [<de850559>] uhci_giveback_urb+0xcd/0x1e3 [uhci_hcd]
   [<de850e02>] uhci_scan_schedule+0x511/0x720 [uhci_hcd]
...
   [<de8529c3>] uhci_irq+0x131/0x142 [uhci_hcd]
   [<c05750cb>] usb_hcd_irq+0x23/0x51

and

   [<c0574d9d>] usb_hcd_unlink_urb_from_ep+0x25/0x6b
   [<de839d55>] ehci_urb_done+0x73/0x92 [ehci_hcd]
   [<de83a92f>] qh_completions+0x373/0x3eb [ehci_hcd]
   [<de83aa43>] ehci_work+0x9c/0x6a9 [ehci_hcd]
...
   [<de83ec3c>] ehci_irq+0x241/0x265 [ehci_hcd]
...
   [<c05750cb>] usb_hcd_irq+0x23/0x51


Is that enough information to fix the problem?

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
From: David Brownell
Date: Friday, June 27, 2008 - 9:07 am

No, but it suggests a few ways to get closer to the root cause.

That looks fishy.   The IRQ handler does not re-enable IRQs,
so it looks like something re-enabled IRQs in the middle of
an IRQ handler!  Which will obviously cause trouble.

(I'll assume that this isn't a case of a misleading stack dump,
where the IRQ frames are dead ones that were wrongly dumped...)

If 442258e2ff69276ff767f3703b30ce6a31fdd181 isn't in the
kernel with those bugs, try applying it... else, I suggest
you try putting something like your

	if (!raw_irqs_disabled()) {
		printk(KERN_CRIT "interrupts enabled!\n");
		dump_stack();
        }

logic at the beginning and end of usb_hcd_giveback_urb ... and
maybe have it dump the address of the completion handler, when
you can finger such a handler as re-enabling IRQs.

- Dave
--

From: Stefan Becker
Date: Saturday, June 28, 2008 - 7:31 am

Hi David,


That commit is about 4 months later than the commit I tracked down with 

See my reply to Alan's mail...

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
--

From: Alan Stern
Date: Friday, June 27, 2008 - 9:10 am

I don't know, but it's a good start.  The IRQs for uhci-hcd and
ehci-hcd are registered using the IRQF_DISABLED flag, which means that
the handler routines uhci_irq() and ehci_irq() should always be called
with interrupts disabled.

So that's the next thing to test.  Put a raw_irqs_disabled() test at
the start of those two routines, just to make sure that interrupts
don't somehow get enabled by mistake while the routine is running.  If 
interrupts are already enabled when the routines are called then the 
bug is somewhere else in the kernel.

(To make things simpler, you could concentrate on uhci_irq() and
unload ehci-hcd before running the test.)

Alan Stern

--

From: Stefan Becker
Date: Saturday, June 28, 2008 - 7:36 am

Hi Alan,


OK, I've now compiled 2.6.26-rc8 with the attached changes. The output I 
collected shows that interrupts are (always) enabled when (e|u)hci_irq() 
is entered.

I aborted the test run, i.e. I didn't run into the bug. But the kernel 
was so busy with dumping stacks that it probably wouldn't have been 
triggered at all.

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
From: Stefan Becker
Date: Saturday, June 28, 2008 - 8:39 am

Hi Alan,


Second try, slightly different tack: no dump_stack() in the _irq() 
functions, but disabled/enabled indication. Plus debugging code added in 
usb_hcd_add() to see the interrupt registration. Again too much data was 
printed for the bug to appear.

Looking at the collected data it looks like

  - ehci_irq() is always entered with interrupts enabled (WRONG).

  - uhci_irq() is entered sometimes with either interrupts enabled 
(WRONG) or disabled (OK).

But both interrupts are registered with IRQF_DISABLED...

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
From: Alan Stern
Date: Saturday, June 28, 2008 - 9:53 am

It certainly looks like you've found a flaw in the core kernel.  Or 
else a subtle hardware flaw in your CPU...

Just to be certain, take out all your new code and instead add a simple 
test at the start of usb_hcd_irq() in hcd.c.  That routine is the 
registered handler for USB interrupts.  If interrupts are ever enabled 
at that spot, then something is badly wrong.

You might also keep track of the total number of each type of call, so
whenever you find interrupts are enabled, you can print out something
like this:

	usb_hcd_irq(): interrupts disabled %d, enabled %d

where the two values are the numbers of times the routine has been 
called with interrupts off or on, respectively.  You don't have to run 
this for a long time; since the test code should _never_ trigger, any 
output at all will indicate a real problem.

When you get that, change the Subject of your email to make it more 
noticeable.  A line like:

	BUG in 2.6.26-rc8 interrupt handling!

should draw people's attention.  :-)  Include a description of the 
problem and a copy of /proc/interrupts.

Alan Stern

--

From: Becker Stefan (Nokia-D/Salo)
Date: Saturday, June 28, 2008 - 12:34 pm

Hi,

[I'm not subscribed to this list, so please CC: me]

[Subject has been changed, please read the rest of the thread for all 
the details]

On my AthlonXP single CPU laptop I experience random lockups when I 
access the external USB harddrive with kernels >2.6.23. The problem 
doesn't appear with 2.6.23.

Debugging revealed that a routine was called twice that takes a 
spinlock. But this actually shouldn't happen, because the function 
should only be invoked with interrupts disabled. A little more debugging 
revealed that apparently sometimes interrupts are enabled when the USB 
HCD interrupt handlers are called, although they are registered with 
IRQF_DISABLED.

With the attached patch ontop of 2.6.26-rc8 I get these messages:

kernel: USB_HCD_IRQ interrupts disabled 14206 enabled 6289
kernel: USB_HCD_IRQ interrupts disabled 14761 enabled 6290
kernel: USB_HCD_IRQ interrupts disabled 14761 enabled 6291

I also attached the contents from /proc/interrupts from the machine.

Any ideas what could be wrong or any suggestions in which direction I 
should continue debugging?

Regards,

	Stefan

---
Stefan Becker   Nokia-TP/Salo            Senior Specialist, EE SW
E-Mail: Stefan.Becker@nokia.com          SMS: 4871554@???.??
Office: +358-7180-71554                  FAX: +358-7180-44751
GSM   : +358-50-4871554
From: David Brownell
Date: Saturday, June 28, 2008 - 12:51 pm

Do things improve if you "rmmod yenta" or, better yet, never
load that module?

--

From: Stefan Becker
Date: Sunday, June 29, 2008 - 7:57 am

Hi David,

[I'm not subscribed to the list. So please CC: me in your reply]


I was starting to wonder about that too after I had sent the message and 
it led me straight to the root cause. This also explains why I was 
unable to reproduce the problem with defconfig: yenta_socket is not 
included in the default configuration.


The problem is caused by this code in handle_IRQ_event():

	if (!(action->flags & IRQF_DISABLED))
		local_irq_enable_in_hardirq();

	do {
		ret = action->handler(irq, action->dev_id);
         ...
		action = action->next;
	} while (action);

For shared interrupts IRQF_DISABLED will only take effect if the first 
registered handler sets it.


The attached changes fix the problem for me. Feel free to update them if 
they are not up to usual Linux kernel quality standards. I also attached 
the new config I used for testing: defconfig + the minimum additional 
stuff to enable the problem.

There are of course two other options to fix the problem:

  - add IRQF_DISBALED to yenta_socket. But this will fix it only for 
this case...

  - enable/disable interrupts depending on actions IRQF_DISABLED in the 
loop in handle_IRQ_event(). But that would mean more CPU cycles are 
spent in that function...

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com

From: David Brownell
Date: Sunday, June 29, 2008 - 8:09 pm

I was suspecting something rude like that ...

By the way, did you notice the oddness of IRQF_SAMPLE_RANDOM there?
For a shared IRQ, I would rather think that if any IRQ was flagged
as "too predictable for use as IRQ randomness" (by not having that
flag set) then the IRQ should never be sampled ... there's some odd

And it looks plausible to me.  Seems like this patch (or a variant)
should be merged for 2.6.26-final, yes?  Disregarding IRQF_DISABLED
has -- as you noted -- significant potential for oopsage.

I suggest you provide a fully cleaned-up version of this patch with
your signed-off-by line and a proper patch description ... do this
ASAP, since RC8 is getting a bit late for patches to merge!



I don't think you should need that flag; and if you did, it should be
declared in <linux/irq.h> to prevent anyone else from using that bit
for some other purpose.

Instead, I think you can set IRQF_DISABLED in irq_desc[irq].status
to achieve the same effect.

- Dave

--

From: Stefan Becker
Date: Sunday, June 29, 2008 - 10:22 pm

Hi David,


Good question. What happens when you mix a random and a not-so-random 
source: does the result have an as good random quality as the original 

My fear was that if it is a public flag in linux/interrupt.h then 

I actually had the same idea but missed the "irq" in the 
handle_IRQ_event() signature :-(


I'll try to send an updated patch today...

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
--

From: Henrique de Moraes Holschuh
Date: Monday, June 30, 2008 - 7:28 am

It becomes unusable.  In fact, I find it likely that shared IRQs are not
safe for random data gathering at all in general. IRQs are not in fact
completely random, they just (sometimes) have a very small ammount of
randomness in them.  And the combined result you see in a shared IRQ line
could be correlated or "get somewhat more correlated" because of the sharing
(they go over the same BUS -> one can delay the other, etc)... and that
correlation might be dangerous.

IMO, the safe thing to do is to block shared IRQs from being used as random
sources.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
--

From: Alan Cox
Date: Monday, June 30, 2008 - 7:26 am

All IRQs on most modern boxes go over the same bus, shared or otherwise.
If anything that should add to the noise.

Alan
--

From: Stefan Becker
Date: Monday, June 30, 2008 - 2:34 am

Hi,


It seems IRQF_SHARED | IRQF_DISABLED has already been discussed several 
times on LKML:

<http://thread.gmane.org/gmane.linux.kernel/638251>
<http://thread.gmane.org/gmane.linux.kernel/561205>

Especially this message

<http://thread.gmane.org/gmane.linux.kernel/561205/focus=561222>

makes me think that Linus will reject my proposed patch for the lockup 
problem as it implements choice (c) on his list.


Alan: should I try to implement his choice (f) insteadn, i.e. drop 
IRQF_DISABLED from USB HCD interrupts and disable interrupts in 
usb_hcd_irq() instead?

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
--

From: David Brownell
Date: Monday, June 30, 2008 - 4:15 am

Given that, I'm surprised that nobody has added a warning that
prints when those two flags are both passed to request_irq().

The problems have been around for a long time, "tested" and known
to be broken.  They'll never go away without nag messages nudging

That's the best solution for USB, given that IRQF_DISABLED isn't
going to do what it claims to do.

- Dave


--

From: Alan Stern
Date: Monday, June 30, 2008 - 7:37 am

I agree.  Note that IRQF_DISABLED is used in hcd_pci.c as well as in
quite a few files in drivers/usb/host.

Alan Stern

--

From: Stefan Becker
Date: Monday, June 30, 2008 - 11:53 am

Hi,

here is the proposed patch to fix the problem. I hope I used git 
correctly to generate it...

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
From: Alan Stern
Date: Monday, June 30, 2008 - 12:35 pm

Don't bother with this extra stuff.  All USB host controller drivers
want to have interrupts disabled when their IRQ handlers run.

Alan Stern

--

From: David Brownell
Date: Monday, June 30, 2008 - 1:31 pm

How about this one instead?  I think it's probably almost midnight
where Stefan is, so I'd not expect Stefan to have an updated
patch very soon ... :)

- dave


====== CUT HERE
Add a workaround for the way the IRQ framework can disregard
IRQF_DISABLED when IRQF_SHARED is set and lockdep isn't in use.

This leaves IRQF_DISABLED active for unshared IRQs, but clears
that flag otherwise (since the genirq code should eventually
start warning folk about that bad combination of flags).

(Thanks to Stefan Becker <Stefan.Becker@nokia.com> for tracking
down the source of hard lockups on his hardware; on some systems
this problem could lead to oopsing.)

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
---
UNTESTED but "obviously right" yes?  Candidate for 2.6.26-rc8+ ...

 drivers/usb/core/hcd.c |   22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

--- a/drivers/usb/core/hcd.c	2008-06-30 13:09:00.000000000 -0700
+++ b/drivers/usb/core/hcd.c	2008-06-30 13:28:04.000000000 -0700
@@ -1687,19 +1687,31 @@ EXPORT_SYMBOL_GPL(usb_bus_start_enum);
 irqreturn_t usb_hcd_irq (int irq, void *__hcd)
 {
 	struct usb_hcd		*hcd = __hcd;
-	int			start = hcd->state;
+	int			start;
+	unsigned long		flags;
+	irqreturn_t		value = IRQ_NONE;
+
+	/* IRQF_DISABLED doesn't work correctly with shared IRQs
+	 * when the first handler doesn't use it.  So let's just
+	 * assume it's never used.
+	 */
+	local_irq_save(flags);
 
+	start = hcd->state;
 	if (unlikely(start == HC_STATE_HALT ||
 	    !test_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags)))
-		return IRQ_NONE;
+		goto done;
 	if (hcd->driver->irq (hcd) == IRQ_NONE)
-		return IRQ_NONE;
+		goto done;
 
+	value = IRQ_HANDLED;
 	set_bit(HCD_FLAG_SAW_IRQ, &hcd->flags);
 
 	if (unlikely(hcd->state == HC_STATE_HALT))
 		usb_hc_died (hcd);
-	return IRQ_HANDLED;
+done:
+	local_irq_restore(flags);
+	return value;
 }
 
 /*-------------------------------------------------------------------------*/
@@ ...
From: Alan Stern
Date: Monday, June 30, 2008 - 2:29 pm

It certainly looks right.  We'll let Stefan test it.

Alan Stern

--

From: David Brownell
Date: Monday, June 30, 2008 - 2:48 pm

Why don't we go with the version Stefan just sent, courtesy of
the rain which was keeping him awake ... he tested it, it works
vs just "obviously correct".

- Dave


--

From: Stefan Becker
Date: Monday, June 30, 2008 - 2:26 pm

Hi,


It is, put the rain pouring down on my house is keeping me awake :-)

Looks like you posted the same changes though...

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
From: Alan Stern
Date: Tuesday, July 1, 2008 - 7:11 am

A few minor problems...  You should post a revised patch with CC: to
Greg KH.  It makes things a little easier if you can put the patch in



The space after "died" should be removed.

Alan Stern

--

From: Leonardo Chiquitto
Date: Tuesday, July 1, 2008 - 7:19 am

It would be nice if this could be included in 2.6.26, as it also fixes a
problem reported in [1] and discussed in [2]

[1] https://bugzilla.novell.com/show_bug.cgi?id=378509
[2] http://marc.info/?l=linux-usb&m=121137189719281&w=4

Thanks,
Leonardo
--

From: Stefan Becker
Date: Tuesday, July 1, 2008 - 9:19 am

Hi Alan & Greg,

third time is the charm (I hope)...


*Sigh* there are certain things you just shouldn't do when burning the midnight oil. I changed the patch according to your suggestions and also copied the comment for usb_hcd_irq() from David's changes.

I hope this can still make it into 2.6.26, because it seems to solve a lot of problems for the users out there...

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com

================================ CUT HERE =============================================
commit 89fa7659cb90f621513d7193b06cf19386451c33
Author: Stefan Becker <stefan.becker@nokia.com>
Date:   Mon Jun 30 21:18:29 2008 +0300

    USB: fix interrupt disabling for HCDs with shared interrupt handlers
    
    As has been discussed several times on LKML, IRQF_SHARED | IRQF_DISABLED
    doesn't work reliably, i.e. a shared interrupt handler CAN'T be certain to
    be called with interrupts disabled. Most USB HCD handlers use IRQF_DISABLED
    and therefore havoc can break out if they share their interrupt with a
    handler that doesn't use it.
    
    On my test machine the yenta_socket interrupt handler (no IRQF_DISABLED)
    was registered before ehci_hcd and one uhci_hcd instance. Therefore all
    usb_hcd_irq() invocations for ehci_hcd and for one uhci_hcd instance
    happened with interrupts enabled. That led to random lockups as USB core
    HCD functions that acquire the same spinlock could be called twice
    from interrupt handlers.
    
    This patch updates usb_hcd_irq() to always disable/restore interrupts.
    usb_add_hcd() will silently remove any IRQF_DISABLED requested from HCD code.
    
    Signed-off-by: Stefan Becker <stefan.becker@nokia.com>
---
 drivers/usb/core/hcd.c |   38 ++++++++++++++++++++++++++++----------
 1 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 09a53e7..c13479c 100644
--- a/drivers/usb/core/hcd.c
+++ ...
From: Greg KH
Date: Tuesday, July 1, 2008 - 11:25 am

Hm, isn't this also an issue in 2.6.25 as well?  I've had a few very
strange reports of USB just locking up in 2.6.25, could this also be the
cause there?

thanks,


Alan and Dave, do you both ack this thing?  I know you've been involved
in creating it, just want to get the formal approval :)

thanks,

greg k-h
--

From: Alan Stern
Date: Tuesday, July 1, 2008 - 11:59 am

It could go back a lot farther than that.  I don't think our interrupt 
setup has been changed significantly for a long time, with one 
exception: We added the IRQF_DISABLED flag fairly recently.  But of 
course, we now know that it doesn't do what we expected.

It's surprising that more people haven't experienced this problem and 

Acked-by: Alan Stern <stern@rowland.harvard.edu>

There still are a few places where spaces were used instead of tabs for 
indenting.  Aside from that it looks fine.

Alan Stern

--

From: Greg KH
Date: Tuesday, July 1, 2008 - 12:13 pm

Thanks, I'll clean that up by hand, not a big deal :)

greg k-h
--

From: David Brownell
Date: Tuesday, July 1, 2008 - 12:21 pm

Maybe ... I'd update the comment to point out that enabling

Let me reboot a couple machines into their new kernels first...

- Dave
 


--

From: Stefan Becker
Date: Tuesday, July 1, 2008 - 12:15 pm

Hi,


As I wrote in the first mail of this thread the problem was uncovered on my machine by the following commit from 2.6.24-development:

 commit e9df41c5c5899259541dc928872cad4d07b82076
 USB: make HCDs responsible for managing endpoint queues

The interrupt problem exists also in 2.6.23 (and previous) but the USB code was not affected, as the old versions of the functions were using spin_lock_irqsave().

I just applied my patch to 2.6.24 & 2.6.25: it applies fine and compiles without problem. So it could be submitted to the 2.6.24.X and 2.6.25.X branches too.

Regards,

	Stefan

---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
--

From: Greg KH
Date: Tuesday, July 1, 2008 - 12:51 pm

[Empty message]
From: David Brownell
Date: Tuesday, July 1, 2008 - 9:22 am

... and use just tabs for those internal indents, not spaces!
--

From: David Brownell
Date: Monday, June 30, 2008 - 12:57 pm

Well, here's a fix for that little problem.

Notice the rude interaction with LOCKDEP too.  If you used that,
you'd never have seen the behavior you saw.  And if you did use
that, with non-IRQF_DISABLED interrupt handlers, you'd wrongly
believe some IRQ code paths always ran with IRQs disabled...

- Dave


===== CUT HERE
We periodically have problems that get tracked down to the IRQ
framework not respecting IRQF_DISABLED for some shared IRQ cases.
Linus views this as "will not fix", but we're still left with
the bugs caused by this misbehavior.

This patch adds a nag message in request_irq(), so that drivers
can fix their IRQ handlers to avoid this problem.

Note that developers will never see the relevant bugs when they
run with LOCKDEP, so it's no wonder these bugs are hard to find.
(That also means LOCKDEP will be missing some IRQ-related bugs
involving IRQ handlers that don't set IRQF_DISABLED.)

Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
---
 kernel/irq/manage.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

--- a/kernel/irq/manage.c	2008-06-30 12:28:58.000000000 -0700
+++ b/kernel/irq/manage.c	2008-06-30 12:46:54.000000000 -0700
@@ -539,6 +539,18 @@ int request_irq(unsigned int irq, irq_ha
 	struct irqaction *action;
 	int retval;
 
+	/*
+	 * handle_IRQ_event() always ignores IRQF_DISABLED except for
+	 * the _first_ irqaction (sigh).  That can cause oopsing, but
+	 * the behavior is classified as "will not fix" so we need to
+	 * start nudging drivers away from using that idiom.
+	 */
+	if ((irqflags & (IRQF_SHARED|IRQF_DISABLED))
+			== (IRQF_SHARED|IRQF_DISABLED))
+		pr_warning("IRQ %d/%s: IRQF_DISABLED is not "
+				"guaranteed on shared IRQs\n",
+				irq, devname);
+
 #ifdef CONFIG_LOCKDEP
 	/*
 	 * Lockdep wants atomic interrupt handlers:

--

Previous thread: oops in Audiowerk2 ALSA driver by Marcin Slusarz on Sunday, June 22, 2008 - 9:47 am. (2 messages)

Next thread: [PATCH] Fix open/close race in saa7134 by Arjan van de Ven on Sunday, June 22, 2008 - 10:05 am. (10 messages)