Hi,
[I'm not subscribed to this list, so please CC: me when you answer]
I get random machine lockups when accessing my USB harddrive with
kernels 2.6.24/25. They don't occur with kernel 2.6.23. During testing I
figured out that it has something to do with the USB Bluetooth adaptor.
If I remove it before the testing I don't get any lockups.
git bisect resulted in the following bad commit:
e9df41c5c5899259541dc928872cad4d07b82076 is first bad commit
commit e9df41c5c5899259541dc928872cad4d07b82076
Author: Alan Stern <stern@rowland.harvard.edu>
Date: Wed Aug 8 11:48:02 2007 -0400
USB: make HCDs responsible for managing endpoint queues
This patch (as954) implements a suggestion of David Brownell's. Now
the host controller drivers are responsible for linking and unlinking
URBs to/from their endpoint queues. This eliminates the possiblity of
strange situations where usbcore thinks an URB is linked but the HCD
thinks it isn't. It also means HCDs no longer have to check for URBs
being dequeued before they were fully enqueued.
[TRUNCATED...]
I've attached the kernel configuration.
Test procedure:
- boot test kernel in single user mode
- "service rsyslog start"
- "service messagebus start"
- "service bluetooth start"
- connect USB drive
- "mount /dev/sdb1 /mnt"
- "dd if=/dev/zero of=/mnt/test bs=10M"
BAD: lockup, GOOD: writes until partition is full
My USB hardware:
Bus 001 Device 006: ID 04b4:6830 Cypress Semiconductor Corp. USB-2.0 IDE
Adapter
Bus 001 Device 001: ID 1d6b:0002
Bus 003 Device 002: ID 046d:c001 Logitech, Inc. N48/M-BB48 [FirstMouse Plus]
Bus 003 Device 001: ID 1d6b:0001
Bus 002 Device 003: ID 050d:0081 Belkin Components FBT001v2 Bluetooth
Bus 002 Device 001: ID 1d6b:0001
My USB controllers:
$ /sbin/lspci | fgrep USB
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1
Controller (rev 80)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 ...Does the same problem still occur in 2.6.26-rc7? Does it occur if you rmmod ehci-hcd? Machine lockups are awfully hard to debug. Can you get any information at all (like Alt-SysRq-T) when this happens? Can you add debugging printk statements to the USB bluetooth driver to try and localize where Knowing this doesn't help much without more information. Do you have any idea why nobody else has reported this sort of problem? Is it reproducible on other machines? Alan Stern --
Hi, [I'm not subscribed to this list, so please CC: me when you answer] Yes, i.e. it also happens when the external hardrive runs as USB 1.1 SysRq does not work when the machine locks up. I forgot to mention that the test machine is a single CPU machine and that the CPU fan starts to run full speed when the lockup occurs. Guessing from the commit returned by git bisect there is a locking error, i.e. the CPU runs into a spinlock that is already locked and Too bad. Each bisect cycle took 2-3 hours and the whole process took me 3 days :-( :-( That commit has spinlock changes so I hoped that it would be a good I attached both USB devices to another, newer dual core laptop. I couldn't reproduce the problem there, even when I simulated a single CPU machine with maxcpus=1. Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com --
That is certainly possible. But an error like that should affect lots Around every place where the driver calls into the core. You might also want to debug the places where uhci-hcd acquires and releases I didn't mean that your efforts were wasted. They just don't help much Only what I suggested: Print something in the log whenever a lock is Alan Stern --
Hi,
I sprinkled some printk's into hci_usb.c. But I see only messages at the
I played around with printk a little bit more. With the following change
to uhci_giveback_urb():
--- a/drivers/usb/host/uhci-q.c
+++ b/drivers/usb/host/uhci-q.c
@@ -1526,11 +1530,17 @@ __acquires(uhci->lock)
}
uhci_free_urb_priv(uhci, urbp);
+ printk(KERN_CRIT "UHCI UNLINK ENTER %08x %08x\n",
+ (unsigned int) uhci, (unsigned int) urb);
usb_hcd_unlink_urb_from_ep(uhci_to_hcd(uhci), urb);
+ printk(KERN_CRIT "UHCI UNLINK LEAVE %08x %08x\n",
+ (unsigned int) uhci, (unsigned int) urb);
spin_unlock(&uhci->lock);
I see
UHCI UNLINK ENTER xxxxxxxx yyyyyyyy
UHCI UNLINK LEAVE xxxxxxxx yyyyyyyy
UHCI UNLINK ENTER xxxxxxxx yyyyyyyy
on the console at the time of the lockup. Then I added the following change:
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1107,7 +1107,13 @@ EXPORT_SYMBOL_GPL(usb_hcd_check_unlink_urb);
void usb_hcd_unlink_urb_from_ep(struct usb_hcd *hcd, struct urb *urb)
{
/* clear all state linking urb to this dev (and hcd) */
+#if 0
spin_lock(&hcd_urb_list_lock);
+#else
+ if (!spin_trylock(&hcd_urb_list_lock)) {
+ printk(KERN_CRIT "HCD URB LIST ALREADY LOCKED!\n");
+ }
+#endif
list_del_init(&urb->urb_list);
spin_unlock(&hcd_urb_list_lock);
}
and get this at the time of the lockup:
UHCI UNLINK ENTER xxxxxxxx yyyyyyyy
UHCI UNLINK LEAVE xxxxxxxx yyyyyyyy
UHCI UNLINK ENTER xxxxxxxx yyyyyyyy
HCD URB LIST ALREADY LOCKED!
<---- here the original code would lockup
UHCI UNLINK LEAVE xxxxxxxx yyyyyyyy
HCD URB LIST ALREADY LOCKED!
HCD URB LIST ALREADY LOCKED!
HCD URB LIST ALREADY LOCKED!
So the lockup is caused by an already locked hcd_urb_list_lock. Is there
a way to see the lock holder? Or any other suggestions how to proceed?
Regards,
Stefan
---
Stefan Becker
E-Mail: ...Good work! There isn't any way to see the lock holder except by adding more printk's. Fortunately that lock is private to hcd.c, so you know it's not used anywhere else. And it's not used in very many places in that file, so you can add printk's around each one. While you're at it, you ought to enable CONFIG_USB_DEBUG so that more of the context will be visible in the log. The usage in usb_hcd_link_urb_to_ep() appears benign; the code doesn't do anything that might hang while holding the lock. All it does is manipulate a linked list. You have already instrumented the usage in usb_hcd_unlink_urb_from_ep(). That leaves only usb_hcd_flush_endpoint(). And once again, the code isn't doing anything while holding the lock except following linked-list pointers. Taken together, these facts suggest that somehow the linked list has been corrupted. So you might also want to add some printk's around the linked-list activities in flush_endpoint, just to see if the pointer values look reasonable. Alan Stern --
Hi, Well, I guess I'm just lucky it didn't turn into a heisenbug with all Unfortunately I could only run a small test today. I added some simple debugging code for the spinlock usage in hcd.c (see attached diff) and I get the following message at lockup (I tried it twice just to be sure): HCD URB list locked by usb_hcd_link_urb_to_ep! As far as I understand the matter this only can happen if usb_hcd_link_urb_to_ep() gets interrupted while holding the spinlock. But according to the contract at the header of the function it should be called with interrupts disabled! I guess the obvious way forward from here is: - replace the spin_lock() in the function with the irqsave version - if that fixes the problem add debugging code to the function and panic with a stack trace when the interrupts aren't disabled one entry (don't know how to detect that yet, any suggestions?) That hopefully identifies the culprit that calls the function with interrupts enabled. Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com
So it should. Do you know how to test whether interrupts are enabled? There's a routine called raw_irqs_disabled() defined in include/asm/irqflags.h. Stick it inside usb_hcd_link_urb_to_ep, and That should do the trick. Although it would be quicker just to make a small change to your existing code: The first time the spin_trylock fails, do a dump_stack(). You did say this is a UP system, right? (Because obviously on SMP systems, one expects spin_trylock to fail from time to time.) The only callers of usb_hcd_link_urb_to_ep are the host controller drivers plus a couple of routines in hcd.c itself. But all the callers are in the scope of spin_lock_irqsave or spin_lock_irq! So maybe there's more going on here than meets the eye. Alan Stern --
Hi, Tried both and got nothing. So either our reasoning is wrong or something really weird is going on. I'll try to improve my debugging code and use dump_stack() at the place where it locks up. Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com --
Typo: I should have said "the first time it returns zero..." Well, it would be pretty surprising if spin_lock_irq or spin_lock_irqsave were misbehaving. Something else you should try is clearing your "owner" string just before the spinlock is released. You could also add a check after the release; if the spinlock can't be locked again immediately then something is wrong. Alan Stern --
Hi,
Yes, the initial try was misleading. I tinkered around a little bit more
and finally figured out that it is usb_hcd_unlink_urb_from_ep() itself
that is called with interrupts enabled!
So with this code in place the error disappears:
void usb_hcd_unlink_urb_from_ep(struct usb_hcd *hcd, struct urb *urb)
{
/* clear all state linking urb to this dev (and hcd) */
unsigned int flags;
spin_lock_irqsave(&hcd_urb_list_lock, flags);
list_del_init(&urb->urb_list);
spin_unlock_irqrestore(&hcd_urb_list_lock, flags);
}
This seems to impact USB performance though. In 2.6.23 (without the
problem) I get 21MB/s with dd, but with the above "fix" only 14MB/s. But
I'll recheck once we have a real error fix in place.
After that I added the following code
if (!raw_irqs_disabled()) {
printk(KERN_CRIT "usb_hcd_unlink_urb_from_ep called with interrupts
enabled!\n");
dump_stack();
}
and collected the attached kernel messages. I checked the messages
briefly and it seems that the following code paths have the interrupts
enabled when calling usb_hcd_unlink_urb_from_ep():
[<c0574d9d>] usb_hcd_unlink_urb_from_ep+0x25/0x6b
[<de850559>] uhci_giveback_urb+0xcd/0x1e3 [uhci_hcd]
[<de850e02>] uhci_scan_schedule+0x511/0x720 [uhci_hcd]
...
[<de8529c3>] uhci_irq+0x131/0x142 [uhci_hcd]
[<c05750cb>] usb_hcd_irq+0x23/0x51
and
[<c0574d9d>] usb_hcd_unlink_urb_from_ep+0x25/0x6b
[<de839d55>] ehci_urb_done+0x73/0x92 [ehci_hcd]
[<de83a92f>] qh_completions+0x373/0x3eb [ehci_hcd]
[<de83aa43>] ehci_work+0x9c/0x6a9 [ehci_hcd]
...
[<de83ec3c>] ehci_irq+0x241/0x265 [ehci_hcd]
...
[<c05750cb>] usb_hcd_irq+0x23/0x51
Is that enough information to fix the problem?
Regards,
Stefan
---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
No, but it suggests a few ways to get closer to the root cause.
That looks fishy. The IRQ handler does not re-enable IRQs,
so it looks like something re-enabled IRQs in the middle of
an IRQ handler! Which will obviously cause trouble.
(I'll assume that this isn't a case of a misleading stack dump,
where the IRQ frames are dead ones that were wrongly dumped...)
If 442258e2ff69276ff767f3703b30ce6a31fdd181 isn't in the
kernel with those bugs, try applying it... else, I suggest
you try putting something like your
if (!raw_irqs_disabled()) {
printk(KERN_CRIT "interrupts enabled!\n");
dump_stack();
}
logic at the beginning and end of usb_hcd_giveback_urb ... and
maybe have it dump the address of the completion handler, when
you can finger such a handler as re-enabling IRQs.
- Dave
--
Hi David, That commit is about 4 months later than the commit I tracked down with See my reply to Alan's mail... Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com --
I don't know, but it's a good start. The IRQs for uhci-hcd and ehci-hcd are registered using the IRQF_DISABLED flag, which means that the handler routines uhci_irq() and ehci_irq() should always be called with interrupts disabled. So that's the next thing to test. Put a raw_irqs_disabled() test at the start of those two routines, just to make sure that interrupts don't somehow get enabled by mistake while the routine is running. If interrupts are already enabled when the routines are called then the bug is somewhere else in the kernel. (To make things simpler, you could concentrate on uhci_irq() and unload ehci-hcd before running the test.) Alan Stern --
Hi Alan, OK, I've now compiled 2.6.26-rc8 with the attached changes. The output I collected shows that interrupts are (always) enabled when (e|u)hci_irq() is entered. I aborted the test run, i.e. I didn't run into the bug. But the kernel was so busy with dumping stacks that it probably wouldn't have been triggered at all. Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com
Hi Alan, Second try, slightly different tack: no dump_stack() in the _irq() functions, but disabled/enabled indication. Plus debugging code added in usb_hcd_add() to see the interrupt registration. Again too much data was printed for the bug to appear. Looking at the collected data it looks like - ehci_irq() is always entered with interrupts enabled (WRONG). - uhci_irq() is entered sometimes with either interrupts enabled (WRONG) or disabled (OK). But both interrupts are registered with IRQF_DISABLED... Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com
It certainly looks like you've found a flaw in the core kernel. Or else a subtle hardware flaw in your CPU... Just to be certain, take out all your new code and instead add a simple test at the start of usb_hcd_irq() in hcd.c. That routine is the registered handler for USB interrupts. If interrupts are ever enabled at that spot, then something is badly wrong. You might also keep track of the total number of each type of call, so whenever you find interrupts are enabled, you can print out something like this: usb_hcd_irq(): interrupts disabled %d, enabled %d where the two values are the numbers of times the routine has been called with interrupts off or on, respectively. You don't have to run this for a long time; since the test code should _never_ trigger, any output at all will indicate a real problem. When you get that, change the Subject of your email to make it more noticeable. A line like: BUG in 2.6.26-rc8 interrupt handling! should draw people's attention. :-) Include a description of the problem and a copy of /proc/interrupts. Alan Stern --
Hi, [I'm not subscribed to this list, so please CC: me] [Subject has been changed, please read the rest of the thread for all the details] On my AthlonXP single CPU laptop I experience random lockups when I access the external USB harddrive with kernels >2.6.23. The problem doesn't appear with 2.6.23. Debugging revealed that a routine was called twice that takes a spinlock. But this actually shouldn't happen, because the function should only be invoked with interrupts disabled. A little more debugging revealed that apparently sometimes interrupts are enabled when the USB HCD interrupt handlers are called, although they are registered with IRQF_DISABLED. With the attached patch ontop of 2.6.26-rc8 I get these messages: kernel: USB_HCD_IRQ interrupts disabled 14206 enabled 6289 kernel: USB_HCD_IRQ interrupts disabled 14761 enabled 6290 kernel: USB_HCD_IRQ interrupts disabled 14761 enabled 6291 I also attached the contents from /proc/interrupts from the machine. Any ideas what could be wrong or any suggestions in which direction I should continue debugging? Regards, Stefan --- Stefan Becker Nokia-TP/Salo Senior Specialist, EE SW E-Mail: Stefan.Becker@nokia.com SMS: 4871554@???.?? Office: +358-7180-71554 FAX: +358-7180-44751 GSM : +358-50-4871554
Do things improve if you "rmmod yenta" or, better yet, never load that module? --
Hi David,
[I'm not subscribed to the list. So please CC: me in your reply]
I was starting to wonder about that too after I had sent the message and
it led me straight to the root cause. This also explains why I was
unable to reproduce the problem with defconfig: yenta_socket is not
included in the default configuration.
The problem is caused by this code in handle_IRQ_event():
if (!(action->flags & IRQF_DISABLED))
local_irq_enable_in_hardirq();
do {
ret = action->handler(irq, action->dev_id);
...
action = action->next;
} while (action);
For shared interrupts IRQF_DISABLED will only take effect if the first
registered handler sets it.
The attached changes fix the problem for me. Feel free to update them if
they are not up to usual Linux kernel quality standards. I also attached
the new config I used for testing: defconfig + the minimum additional
stuff to enable the problem.
There are of course two other options to fix the problem:
- add IRQF_DISBALED to yenta_socket. But this will fix it only for
this case...
- enable/disable interrupts depending on actions IRQF_DISABLED in the
loop in handle_IRQ_event(). But that would mean more CPU cycles are
spent in that function...
Regards,
Stefan
---
Stefan Becker
E-Mail: Stefan.Becker@nokia.com
I was suspecting something rude like that ... By the way, did you notice the oddness of IRQF_SAMPLE_RANDOM there? For a shared IRQ, I would rather think that if any IRQ was flagged as "too predictable for use as IRQ randomness" (by not having that flag set) then the IRQ should never be sampled ... there's some odd And it looks plausible to me. Seems like this patch (or a variant) should be merged for 2.6.26-final, yes? Disregarding IRQF_DISABLED has -- as you noted -- significant potential for oopsage. I suggest you provide a fully cleaned-up version of this patch with your signed-off-by line and a proper patch description ... do this ASAP, since RC8 is getting a bit late for patches to merge! I don't think you should need that flag; and if you did, it should be declared in <linux/irq.h> to prevent anyone else from using that bit for some other purpose. Instead, I think you can set IRQF_DISABLED in irq_desc[irq].status to achieve the same effect. - Dave --
Hi David, Good question. What happens when you mix a random and a not-so-random source: does the result have an as good random quality as the original My fear was that if it is a public flag in linux/interrupt.h then I actually had the same idea but missed the "irq" in the handle_IRQ_event() signature :-( I'll try to send an updated patch today... Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com --
It becomes unusable. In fact, I find it likely that shared IRQs are not safe for random data gathering at all in general. IRQs are not in fact completely random, they just (sometimes) have a very small ammount of randomness in them. And the combined result you see in a shared IRQ line could be correlated or "get somewhat more correlated" because of the sharing (they go over the same BUS -> one can delay the other, etc)... and that correlation might be dangerous. IMO, the safe thing to do is to block shared IRQs from being used as random sources. -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh --
All IRQs on most modern boxes go over the same bus, shared or otherwise. If anything that should add to the noise. Alan --
Hi, It seems IRQF_SHARED | IRQF_DISABLED has already been discussed several times on LKML: <http://thread.gmane.org/gmane.linux.kernel/638251> <http://thread.gmane.org/gmane.linux.kernel/561205> Especially this message <http://thread.gmane.org/gmane.linux.kernel/561205/focus=561222> makes me think that Linus will reject my proposed patch for the lockup problem as it implements choice (c) on his list. Alan: should I try to implement his choice (f) insteadn, i.e. drop IRQF_DISABLED from USB HCD interrupts and disable interrupts in usb_hcd_irq() instead? Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com --
Given that, I'm surprised that nobody has added a warning that prints when those two flags are both passed to request_irq(). The problems have been around for a long time, "tested" and known to be broken. They'll never go away without nag messages nudging That's the best solution for USB, given that IRQF_DISABLED isn't going to do what it claims to do. - Dave --
I agree. Note that IRQF_DISABLED is used in hcd_pci.c as well as in quite a few files in drivers/usb/host. Alan Stern --
Hi, here is the proposed patch to fix the problem. I hope I used git correctly to generate it... Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com
Don't bother with this extra stuff. All USB host controller drivers want to have interrupts disabled when their IRQ handlers run. Alan Stern --
How about this one instead? I think it's probably almost midnight
where Stefan is, so I'd not expect Stefan to have an updated
patch very soon ... :)
- dave
====== CUT HERE
Add a workaround for the way the IRQ framework can disregard
IRQF_DISABLED when IRQF_SHARED is set and lockdep isn't in use.
This leaves IRQF_DISABLED active for unshared IRQs, but clears
that flag otherwise (since the genirq code should eventually
start warning folk about that bad combination of flags).
(Thanks to Stefan Becker <Stefan.Becker@nokia.com> for tracking
down the source of hard lockups on his hardware; on some systems
this problem could lead to oopsing.)
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
---
UNTESTED but "obviously right" yes? Candidate for 2.6.26-rc8+ ...
drivers/usb/core/hcd.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
--- a/drivers/usb/core/hcd.c 2008-06-30 13:09:00.000000000 -0700
+++ b/drivers/usb/core/hcd.c 2008-06-30 13:28:04.000000000 -0700
@@ -1687,19 +1687,31 @@ EXPORT_SYMBOL_GPL(usb_bus_start_enum);
irqreturn_t usb_hcd_irq (int irq, void *__hcd)
{
struct usb_hcd *hcd = __hcd;
- int start = hcd->state;
+ int start;
+ unsigned long flags;
+ irqreturn_t value = IRQ_NONE;
+
+ /* IRQF_DISABLED doesn't work correctly with shared IRQs
+ * when the first handler doesn't use it. So let's just
+ * assume it's never used.
+ */
+ local_irq_save(flags);
+ start = hcd->state;
if (unlikely(start == HC_STATE_HALT ||
!test_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags)))
- return IRQ_NONE;
+ goto done;
if (hcd->driver->irq (hcd) == IRQ_NONE)
- return IRQ_NONE;
+ goto done;
+ value = IRQ_HANDLED;
set_bit(HCD_FLAG_SAW_IRQ, &hcd->flags);
if (unlikely(hcd->state == HC_STATE_HALT))
usb_hc_died (hcd);
- return IRQ_HANDLED;
+done:
+ local_irq_restore(flags);
+ return value;
}
/*-------------------------------------------------------------------------*/
@@ ...It certainly looks right. We'll let Stefan test it. Alan Stern --
Why don't we go with the version Stefan just sent, courtesy of the rain which was keeping him awake ... he tested it, it works vs just "obviously correct". - Dave --
Hi, It is, put the rain pouring down on my house is keeping me awake :-) Looks like you posted the same changes though... Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com
A few minor problems... You should post a revised patch with CC: to Greg KH. It makes things a little easier if you can put the patch in The space after "died" should be removed. Alan Stern --
It would be nice if this could be included in 2.6.26, as it also fixes a problem reported in [1] and discussed in [2] [1] https://bugzilla.novell.com/show_bug.cgi?id=378509 [2] http://marc.info/?l=linux-usb&m=121137189719281&w=4 Thanks, Leonardo --
Hi Alan & Greg, third time is the charm (I hope)... *Sigh* there are certain things you just shouldn't do when burning the midnight oil. I changed the patch according to your suggestions and also copied the comment for usb_hcd_irq() from David's changes. I hope this can still make it into 2.6.26, because it seems to solve a lot of problems for the users out there... Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com ================================ CUT HERE ============================================= commit 89fa7659cb90f621513d7193b06cf19386451c33 Author: Stefan Becker <stefan.becker@nokia.com> Date: Mon Jun 30 21:18:29 2008 +0300 USB: fix interrupt disabling for HCDs with shared interrupt handlers As has been discussed several times on LKML, IRQF_SHARED | IRQF_DISABLED doesn't work reliably, i.e. a shared interrupt handler CAN'T be certain to be called with interrupts disabled. Most USB HCD handlers use IRQF_DISABLED and therefore havoc can break out if they share their interrupt with a handler that doesn't use it. On my test machine the yenta_socket interrupt handler (no IRQF_DISABLED) was registered before ehci_hcd and one uhci_hcd instance. Therefore all usb_hcd_irq() invocations for ehci_hcd and for one uhci_hcd instance happened with interrupts enabled. That led to random lockups as USB core HCD functions that acquire the same spinlock could be called twice from interrupt handlers. This patch updates usb_hcd_irq() to always disable/restore interrupts. usb_add_hcd() will silently remove any IRQF_DISABLED requested from HCD code. Signed-off-by: Stefan Becker <stefan.becker@nokia.com> --- drivers/usb/core/hcd.c | 38 ++++++++++++++++++++++++++++---------- 1 files changed, 28 insertions(+), 10 deletions(-) diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c index 09a53e7..c13479c 100644 --- a/drivers/usb/core/hcd.c +++ ...
Hm, isn't this also an issue in 2.6.25 as well? I've had a few very strange reports of USB just locking up in 2.6.25, could this also be the cause there? thanks, Alan and Dave, do you both ack this thing? I know you've been involved in creating it, just want to get the formal approval :) thanks, greg k-h --
It could go back a lot farther than that. I don't think our interrupt setup has been changed significantly for a long time, with one exception: We added the IRQF_DISABLED flag fairly recently. But of course, we now know that it doesn't do what we expected. It's surprising that more people haven't experienced this problem and Acked-by: Alan Stern <stern@rowland.harvard.edu> There still are a few places where spaces were used instead of tabs for indenting. Aside from that it looks fine. Alan Stern --
Thanks, I'll clean that up by hand, not a big deal :) greg k-h --
Maybe ... I'd update the comment to point out that enabling Let me reboot a couple machines into their new kernels first... - Dave --
Hi, As I wrote in the first mail of this thread the problem was uncovered on my machine by the following commit from 2.6.24-development: commit e9df41c5c5899259541dc928872cad4d07b82076 USB: make HCDs responsible for managing endpoint queues The interrupt problem exists also in 2.6.23 (and previous) but the USB code was not affected, as the old versions of the functions were using spin_lock_irqsave(). I just applied my patch to 2.6.24 & 2.6.25: it applies fine and compiles without problem. So it could be submitted to the 2.6.24.X and 2.6.25.X branches too. Regards, Stefan --- Stefan Becker E-Mail: Stefan.Becker@nokia.com --
... and use just tabs for those internal indents, not spaces! --
Well, here's a fix for that little problem. Notice the rude interaction with LOCKDEP too. If you used that, you'd never have seen the behavior you saw. And if you did use that, with non-IRQF_DISABLED interrupt handlers, you'd wrongly believe some IRQ code paths always ran with IRQs disabled... - Dave ===== CUT HERE We periodically have problems that get tracked down to the IRQ framework not respecting IRQF_DISABLED for some shared IRQ cases. Linus views this as "will not fix", but we're still left with the bugs caused by this misbehavior. This patch adds a nag message in request_irq(), so that drivers can fix their IRQ handlers to avoid this problem. Note that developers will never see the relevant bugs when they run with LOCKDEP, so it's no wonder these bugs are hard to find. (That also means LOCKDEP will be missing some IRQ-related bugs involving IRQ handlers that don't set IRQF_DISABLED.) Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> --- kernel/irq/manage.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) --- a/kernel/irq/manage.c 2008-06-30 12:28:58.000000000 -0700 +++ b/kernel/irq/manage.c 2008-06-30 12:46:54.000000000 -0700 @@ -539,6 +539,18 @@ int request_irq(unsigned int irq, irq_ha struct irqaction *action; int retval; + /* + * handle_IRQ_event() always ignores IRQF_DISABLED except for + * the _first_ irqaction (sigh). That can cause oopsing, but + * the behavior is classified as "will not fix" so we need to + * start nudging drivers away from using that idiom. + */ + if ((irqflags & (IRQF_SHARED|IRQF_DISABLED)) + == (IRQF_SHARED|IRQF_DISABLED)) + pr_warning("IRQ %d/%s: IRQF_DISABLED is not " + "guaranteed on shared IRQs\n", + irq, devname); + #ifdef CONFIG_LOCKDEP /* * Lockdep wants atomic interrupt handlers: --
