Re: [PATCH] USB: fix deadlock in HCD code

Previous thread: Sony laptop battery has odd behavior with 2.6.26-rc3ish build. by Robin Holt on Wednesday, May 21, 2008 - 4:57 am. (5 messages)

Next thread: BUG: unable to handle kernel NULL pointer dereference by Zdenek Kabelac on Wednesday, May 21, 2008 - 5:56 am. (1 message)
From: Jiri Kosina
Date: Wednesday, May 21, 2008 - 5:09 am

hcd_urb_list_lock is used for synchronization between IRQ and non-IRQ 
contexts, so the non-IRQ context has to disable IRQs in order to prevent 
deadlocking with IRQ context.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>

 drivers/usb/core/hcd.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index bf10e9c..19279ed 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -1000,8 +1000,9 @@ EXPORT_SYMBOL_GPL(usb_calc_bus_time);
 int usb_hcd_link_urb_to_ep(struct usb_hcd *hcd, struct urb *urb)
 {
 	int		rc = 0;
+	unsigned long flags;
 
-	spin_lock(&hcd_urb_list_lock);
+	spin_lock_irqsave(&hcd_urb_list_lock, flags);
 
 	/* Check that the URB isn't being killed */
 	if (unlikely(urb->reject)) {
@@ -1034,7 +1035,7 @@ int usb_hcd_link_urb_to_ep(struct usb_hcd *hcd, struct urb *urb)
 		goto done;
 	}
  done:
-	spin_unlock(&hcd_urb_list_lock);
+	spin_unlock_irqrestore(&hcd_urb_list_lock, flags);
 	return rc;
 }
 EXPORT_SYMBOL_GPL(usb_hcd_link_urb_to_ep);
@@ -1106,10 +1107,11 @@ EXPORT_SYMBOL_GPL(usb_hcd_check_unlink_urb);
  */
 void usb_hcd_unlink_urb_from_ep(struct usb_hcd *hcd, struct urb *urb)
 {
+	unsigned long flags;
 	/* clear all state linking urb to this dev (and hcd) */
-	spin_lock(&hcd_urb_list_lock);
+	spin_lock_irqsave(&hcd_urb_list_lock, flags);
 	list_del_init(&urb->urb_list);
-	spin_unlock(&hcd_urb_list_lock);
+	spin_unlock_irqrestore(&hcd_urb_list_lock, flags);
 }
 EXPORT_SYMBOL_GPL(usb_hcd_unlink_urb_from_ep);

--

From: Oliver Neukum
Date: Wednesday, May 21, 2008 - 6:21 am

Which non-irq context is that?

	Regards
		Oliver
--

From: Jiri Kosina
Date: Wednesday, May 21, 2008 - 6:27 am

One example -- assume usb_submit_urb() called from non-IRQ context. Then

usb_hcd_submit_urb() -> rh_urb_enqueue() -> rh_queue_status() -> 
usb_hcd_link_urb_to_ep().

-- 
Jiri Kosina
SUSE Labs
--

From: Oliver Neukum
Date: Wednesday, May 21, 2008 - 6:32 am

This turns out not to be the case. Interrupts are disabled.

static int rh_queue_status (struct usb_hcd *hcd, struct urb *urb)
{
	int		retval;
	unsigned long	flags;
	int		len = 1 + (urb->dev->maxchild / 8);

	spin_lock_irqsave (&hcd_root_hub_lock, flags);
	if (hcd->status_urb || urb->transfer_buffer_length < len) {
		dev_dbg (hcd->self.controller, "not queuing rh status urb\n");
		retval = -EINVAL;
		goto done;
	}

	retval = usb_hcd_link_urb_to_ep(hcd, urb);

I'll investigate.

	Regards
		Oliver
--

From: Jiri Kosina
Date: Wednesday, May 21, 2008 - 6:36 am

You're right, this callchain can't cause the deadlock indeed. I'll go 
through the other possibilities.

To give some background to others reading this thread -- Leonardo (in CC) 
reported [1] that this patch fixes hard kernel hangs he has been seeing 
when using USB CDMA modem.

Still, we don't currently seem to see the place where the interrupts get 
enabled that makes the deadlock to trigger.

[1] https://bugzilla.novell.com/show_bug.cgi?id=378509

-- 
Jiri Kosina
SUSE Labs
--

From: Oliver Neukum
Date: Wednesday, May 21, 2008 - 6:47 am

usb_hcd_flush_endpoint() is an obvious case. Used in the suspend code.
Your patch is indeed correct, but I fear there might be a second bug caused
by wrong calling conditions.

	Regards
		Oliver
--

From: Alan Stern
Date: Wednesday, May 21, 2008 - 7:13 am

The functions you are worried about (usb_hcd_link_urb_to_ep and 
usb_hcd_unlink_urb_from_ep) are documented as requiring that interrupts 
be disabled by their callers.  This patch isn't needed.

(Of course, other patches may be needed to insure that the callers do 

It's not obvious to me.  Where does usb_hcd_flush_endpoint() acquire 

The problem in the Novell bugzilla entry was caused by the fact that 
the OHCI irq routine was invoked with interrupts enabled, owing to a 
missing IRQF_DISABLED flag.  That bug has already been fixed in 2.6.25.

Alan Stern

--

From: Oliver Neukum
Date: Wednesday, May 21, 2008 - 7:31 am

Sorry, yes looking again, it does complicated stuff with dropping the
lock but leaving interrupts disabled.

	Sorry
		Oliver

--

From: David Vrabel
Date: Wednesday, May 21, 2008 - 7:29 am

This requirement is the only reason the whci-hcd driver disables
interrupts.  Removing this requirement would reduce the time that
interrupts are disabled in the whci-hcd driver.

David
-- 
David Vrabel, Senior Software Engineer, Drivers
CSR, Churchill House, Cambridge Business Park,  Tel: +44 (0)1223 692562
Cowley Road, Cambridge, CB4 0WZ                 http://www.csr.com/
--

From: Alan Stern
Date: Wednesday, May 21, 2008 - 7:42 am

That doesn't sound like a valid approach.  If you don't disable
interrupts then you aren't protected against an interrupt handler
submitting an URB and accessing your data structures while whci-hcd is
in the middle of updating them.

Alan Stern

--

From: David Vrabel
Date: Wednesday, May 21, 2008 - 7:53 am

I can't see how urbs can be submitted to whci-hcd from an interrupt
handler (urb callbacks are always called from a workqueue thread) but
they could be submitted from a timer, so you are correct.

David
-- 
David Vrabel, Senior Software Engineer, Drivers
CSR, Churchill House, Cambridge Business Park,  Tel: +44 (0)1223 692562
Cowley Road, Cambridge, CB4 0WZ                 http://www.csr.com/
--

From: Alan Stern
Date: Wednesday, May 21, 2008 - 7:58 am

_Your_ callbacks are always called from a workqueue thread.  Another 
driver's callbacks might not be, and that other driver might submit an 

Alan Stern

--

From: Jiri Kosina
Date: Wednesday, May 21, 2008 - 7:22 am

That indeed is 2.6.25 kernel. I guess you are talking about commit 
442258e2ff69 here. If so, the reporter is definitely using the kernel 
containing this commit, and the lockups still trigger.

Seems that my patch is papering over the real bug (someone enabling 
interrupts somewhere) indeed, but I can't seem to be able to find such 
codepath.

Thanks,

-- 
Jiri Kosina
SUSE Labs
--

From: Alan Stern
Date: Wednesday, May 21, 2008 - 7:46 am

You could try testing the interrupt-enable flag at various places 
in ohci-hcd (start with finish_urb) and printing an error message if 
interrupts are enabled.

One possibility is that in an earlier call to finish_urb,
usb_hcd_giveback_urb was called with interrupts disabled and returned 
with interrupts enabled.  In other words, some driver's callback 
routine may have enabled interrupts incorrectly.

Alan Stern

--

From: Leonardo Chiquitto
Date: Wednesday, May 21, 2008 - 12:32 pm

Alan,

  I am trying your suggestion right now. Lets see if it finds something.

Thanks,
Leonardo

--

From: Oliver Neukum
Date: Wednesday, May 21, 2008 - 1:03 pm

Are we sure the interrupt line is not shared with another driver that calls
spin_unlock_irq() in its interrupt handler?

	Regards
		Oliver

--

From: Oliver Neukum
Date: Wednesday, May 21, 2008 - 6:40 am

Sorry, yes other code paths take it with enabled interrupts.
The patch is good and necessary for 2.6.26.

	Regards
		Oliver

--

From: Peter Zijlstra
Date: Wednesday, May 21, 2008 - 3:26 pm

Just wondering,... doesn't lockdep say anything about the situation?

--

Previous thread: Sony laptop battery has odd behavior with 2.6.26-rc3ish build. by Robin Holt on Wednesday, May 21, 2008 - 4:57 am. (5 messages)

Next thread: BUG: unable to handle kernel NULL pointer dereference by Zdenek Kabelac on Wednesday, May 21, 2008 - 5:56 am. (1 message)