Re: OHCI root_port_reset() deadly loop...

Previous thread: lockdep: how to tell it multiple pte locks is OK? by Jeremy Fitzhardinge on Saturday, October 6, 2007 - 11:31 pm. (7 messages)

Next thread: libertas and endianness by Geert Uytterhoeven on Sunday, October 7, 2007 - 3:15 am. (1 message)
From: David Miller
Date: Saturday, October 6, 2007 - 11:53 pm

When root_port_reset() in ohci-hub.c polls for the end of the reset,
it puts no limit on the loop and will only exit the loop when either
the RH_PS_PRS bit clears or the register returns all 1's (the latter
condition is a recent addition).

If for some reason the bit never clears, we sit here forever and never
exit the loop.

I actually hit this on one of my machines, and I'm trying to track
down what's happening.

Regardless of why my machine is doing this, there absolutely should be
some upper bound put on the number of times we will run through this
loop, perhaps enough such that up to 5 seconds elapses waiting for the
reset bit to clear.  And if it times out we should print a loud
message onto the console, but still try to continue.

Thanks!
-

From: David Brownell
Date: Sunday, October 7, 2007 - 12:31 am

The all-ones typically indicating something like CardBus eject

Sounds like a hardware issue -- unless something else is trashing
controller state.  We've not observed that specific hardware failure
before, for what it's worth.

Is this SPARC, or is ACPI potentially in play?  PCI, or non-PCI?
Are the other ports still behaving?  Is EHCI maybe trying to switch
ownership of that port?  Is maybe the (newish) autosuspend stuff
kicking in?

The OHCI spec requires the controller to stop the reset itself.

Patches accepted.  :)

Since the PRS bit is specified as "write one", with writing zero
as no-effect (since the rest is hardware-timed), the only recovery
procedure might involve resetting the whole controller.  Messy,
and not something the usb core has historically handled very well.

- Dave

-

From: David Miller
Date: Sunday, October 7, 2007 - 12:51 am

From: David Brownell <david-b@pacbell.net>



I'm 700 patches deep with an additionally large backlog of
patches to apply for the networking and sparc64 tree.

I don't have the time, which is why I reported the problem
to the OHCI maintainer instead of letting it slip through

At a minimum you should exit the loop and print out a warning
messages and try to continue even without trying to reset the
whole controller if that will take some developer effort.

Anything is better than just hanging there forever.  Not every
user knows to hit ALT-SYSRQ then 'P' to get a register dump to
figure out why their computer stopped booting.

Every register polling loops, without exception, should have a limit
and exit with an error indication when that limit is reached.  It
will never hang someones machine, because although not this time, in
many cases these kinds of hangs are absolutely impossible to debug
(interrupts disabled, critical lock held, etc.)
-

From: David Miller
Date: Monday, October 8, 2007 - 4:54 pm

From: David Miller <davem@davemloft.net>

To add some more information here, I think the EHCI idea might
hold some water.

What I have here are two NEC OHCI USB interfaces and one NEC EHCI
USB interface on PCI.  Aparently they all go through a shared
USB hub, mapped like this:

HUB Port 1: OHCI #1, EHCI
HUB Port 2: OHCI #2, EHCI
HUB Port 3: OHCI #1, EHCI
HUB Port 4: OHCI #2, EHCI
HUB Port 5: OHCI #1, EHCI

The OHCI ports go out to external USB connectors on the back panel of
the machine, whereas the EHCI is connected up to an internal USB
storage CDROM device and what appears to be another USB hub.

The problem seems to be very strongly tied to timing.  For example
simply adding "ignore_loglevel" to the kernel boot command line can
make the problem go away.

This got me thinking about your EHCI comment.

If these controllers are going through the same HUB, things might go
south if OHCI initialized first, then khubd et al. are asynchronously
accessing the segments behind OHCI at the same time that the EHCI
driver is initializing.  Perhaps, this is the kind of sequence of
events which makes one of the root ports reset in such a way that the
the reset bit never clears.

Given that this machine has 64 cpus, the likelyhood for such parallel
accesses is very likely :-)

Does this make any sense?

Regardless, here is a patch that hardens the OHCI reset handling
loops so that they break out instead of hanging the entire system
should this condition occur.  It's at least better than what the
code does to a user right now which is hang the box completely:

[USB] ohci: Do not hang the system if port reset does not complete.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
index bb9cc59..77ae5b4 100644
--- a/drivers/usb/host/ohci-hub.c
+++ b/drivers/usb/host/ohci-hub.c
@@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 	u32	temp;
 	u16	now = ...
From: Greg KH
Date: Monday, October 8, 2007 - 8:10 pm

Yes it does, I'm seeing reports from some hardware companies of the very
same thing.  If you serialize and load the ehci driver first, and then
the ohci driver, that should fix the problem.

Does that also work for you?  Or are these drivers built into the
kernel?

thanks,

greg k-h
-

From: David Miller
Date: Monday, October 8, 2007 - 8:16 pm

From: Greg KH <greg@kroah.com>

As coicidence would have it I finally found a recipe for triggering
the issue, and it ties into what you're talking about here.

It happens only if I make sure OHCI gets loaded first and then EHCI
right afterwards.

It seems that indeed it is important for EHCI to get loaded first,
and in-kernel this is ensured by the link ordering.

However, when both OHCI and EHCI are built as modules (or, similarly
I guess, OHCI is built-in and EHCI is modular) there appears to be
nothing in userspace which makes sure EHCI gets loaded first.

When this triggers, in OHCI's root_port_reset(), the port status
register reads 0x111 in that inner-loop and the value never changes.
It stays like this forever.
-

From: David Brownell
Date: Monday, October 8, 2007 - 8:34 pm

The old /etc/hotplug/usb.rc script made sure to load those modules
in the correct order:  EHCI first.

-

From: David Miller
Date: Monday, October 8, 2007 - 8:42 pm

From: David Brownell <david-b@pacbell.net>

I expected to find something cute attempting to handle this under
/etc/udev, I have failed so far :-)
-

From: Greg KH
Date: Monday, October 8, 2007 - 9:39 pm

No, nothing cute in udev itself, but it seems that all distros that I
know of have a "load these modules now" type setting in their init
scripts that can be used here.

I can't think of a way to enforce this load order on the modules
themselves due to the fact that OHCI might not even be needed for EHCI
devices on UHCI (Intel) based chipsets :(

Can anyone else?

thanks,

greg k-h
-

From: David Miller
Date: Monday, October 8, 2007 - 9:47 pm

From: Greg KH <greg@kroah.com>

The three modules perhaps should be a bundle of whatever ones you have
enabled, and internally we can dispatch the initialization to occur in
the correct order from a top-level module_init().

If the devices need to be initialized in a certain order in a
situation like this, it really seems like it is the kernel's job to
enforce it.
-

From: Benjamin Herrenschmidt
Date: Monday, October 8, 2007 - 10:11 pm

Is the problem strictly an ordering problem or just a race ? In the
later case, maybe some better arbitration by the USB core to make
sure things are quiescent or in a known state when letting a new HCD
register might help ?

Ben.


-

From: Greg KH
Date: Monday, October 8, 2007 - 11:06 pm

I agree.

Here's some information from Intel about where they have seen this
happen for UHCI controllers, so it's not just an OHCI issue :(

thanks,

greg k-h

----------------


We had a logic analyzer attached to the bus going to the ESB (ICH) which
has the USB controller in it. In the passing case we would see no
accesses to UHCI IO registers while EHCI initialized and sets its config
flag. The EHCI Port Status & Control registers were then read and then
we see a write to the EHCI Port Status & Control registers port owner
bit for the low speed devices (keyboard & mouse). This turns control
back over to the companion UHCI controller.=20

In our most prevalent failing case (#1 below) we never saw the write to
port owner bit on the ports with the low speed devices. In the passing
case we see the write to the port owner bit.

I do not see how this would have anything to do with flakey hardware
especially since we can reproduce this on all of our systems and the
same device (USB controller) is used on multiple products.=20

I really believe that this has to do with the UHCI and EHCI drivers
running on top of each other. This seems to be happening fairly often on
our systems. If the EHCI driver runs first then we do not see the
failure. If they are running at the same time then we see different
failure symptoms.=20

1) We see that the ports with low speed devices are still in EHCI mode
(port owner bit not written to in EHCI driver). In our analyzer captures
we see the reads from the Port Status & Control register and it is
indicating that there are low speed devices on the ports. Can you tell
us why the driver would not be doing the write to the port owner bit
when it sees that low speed devices are attached to that port? Is there
something specific that it looks for and decides not to do the write?

2) In other cases we see that the ports with the low speed devices are
back in UHCI mode but the ports are disabled. In this case we see from
the analyzer traces that ...
From: David Brownell
Date: Tuesday, October 9, 2007 - 12:22 pm

It's hard to say without more info about what's going on.  It
might be on a non-enumeraton code path -- where nothing expects
to set the OWNER bit -- or the CONNECT bit might not be set.
See host/ehci-hub.c for details about exactly what the EHCI parts
are doing, and core/hub.c for how it's driven.

(It can be tricky to make sense of all that, easier if debugging
messages are turned on.   But that changes timings.  Better yet
would be being non-intrusive tracing of the actual code paths

If the UHCI driver got disconnect and reconnect notifications,
I expect it would do so.  At least OHCI seems *not* to have gotten
such notifications.  I'm speculating that the "just a switch"
behavior for the CF (and OWNER) bits may not allow notifications
like that to happen when the companion controllers are resetting.

- Dave

-

From: Alan Stern
Date: Wednesday, October 10, 2007 - 8:32 am

Right.  In general the driver _does_ write to the port owner bit when 
it sees a low-speed device is attached to the port.  If that didn't 
happen in this case, maybe it's because the driver didn't see it.  

It certainly would help us if the tests were made on a kernel with

This implies that the reset sequence finished successfully.  Did that 

But above they said that it _had_ completed!  Did they mean that the 
reset was complete but the driver hadn't yet detected that it was 

That seems likely.  There's no way (as far as I can tell) for a host to 
see a disconnect while a reset is in progress.  Of course, as soon as 
the reset is over then the host should see the disconnect, and it 
should set the corresponding Connect-Status-Changed bit.

Amusing scenario: Suppose that ehci-hcd initializes the EHCI controller 
(taking control of the port), sees that the device isn't high speed, 
and switches port ownership back to the companion controller, all 
during the span of the companion's reset.  The companion would never 
know anything had happened!  But then everything should just work -- 
the port would be enabled and communication should proceed normally.

Alan Stern

-

From: David Brownell
Date: Monday, October 8, 2007 - 10:00 pm

Assuming PCI is present, /sys/bus/pci/devices/*/class can tell
if EHCI is present (0x0c0320) ... if so, load that driver.
Then repeat for OHCI (0x0c0310) and UHCI (0x0c0300).
-

From: David Miller
Date: Monday, October 8, 2007 - 10:23 pm

From: David Brownell <david-b@pacbell.net>

These are facts all of us know very well, but implementing this in
userspace in a failsafe manner isn't practical.  That's what we're
discussing.

There are things that autoload USB drivers way before udev or similar
even get started.

For example, the first thing some distributions do is try to load the
correct keyboard maps.  Guess what that can do?  It triggers a load of
all of the modular USB host controller drivers in case we have a USB
keyboard.

The only real solution is in the kernel, because it is the only
clean place to trap all of the potential module load events.
-

From: Benjamin Herrenschmidt
Date: Monday, October 8, 2007 - 11:43 pm

That will not work for all of the non-PCI implementations though.

Ben.


-

From: David Brownell
Date: Tuesday, October 9, 2007 - 11:48 am

Oddly enough, that's why I said "assuming PCI is present".  ;)

Anything using the ARC/TDI RTL doesn't have this issue ... it
includes full/low speed support directly, without a companion
controller.  That covers the Freescale silicon and some others.
For a long time I think that was the only generally available
non-PCI implementation.

But you're right that there are (now) a few other cases.
In-tree right now are also the Au1200, ps3, and 440epx.

- Dave

-

From: Alan Stern
Date: Tuesday, October 9, 2007 - 9:01 am

I agree with Ben; the best way to solve this is in the kernel.  Even if 
you did manage to make sure that modules were loaded in the right order 
initially, that wouldn't help if somebody starts unloading and loading 
modules by hand.

The underlying cause of the problem appears to be related to the fact
that it is impossible in principle to detect a disconnect while a
full/low-speed reset is in progress.  However the hardware shouldn't
rely on this, and it should fail gracefully when an unexpected
disconnect does occur.

For example, what would the controller do if the user unplugged the USB 
cable right in the middle of a reset?

Anyway, it certainly would be easy enough to make port resets mutually
exclusive with EHCI initialization.  That would work on any platform,
PCI or not.

Alan Stern

-

From: Greg KH
Date: Tuesday, October 9, 2007 - 10:39 am

That's a good idea, any thoughts as to how to do this?  I think I can
find some Intel and Dell people who would be glad to test this out if
you have a proposed patch.

thanks,

greg k-h
-

From: Alan Stern
Date: Tuesday, October 9, 2007 - 11:42 am

Here is a proposed patch.  I haven't tried running it, but it compiles 
okay.

Alan Stern



Index: usb-2.6/drivers/usb/core/hcd.h
===================================================================
--- usb-2.6.orig/drivers/usb/core/hcd.h
+++ usb-2.6/drivers/usb/core/hcd.h
@@ -19,6 +19,8 @@
 
 #ifdef __KERNEL__
 
+#include <linux/rwsem.h>
+
 /* This file contains declarations of usbcore internals that are mostly
  * used or exposed by Host Controller Drivers.
  */
@@ -470,5 +472,9 @@ static inline void usbmon_urb_complete(s
 		: (in_interrupt () ? "in_interrupt" : "can sleep"))
 
 
-#endif /* __KERNEL__ */
+/* Mutual exclusion for EHCI CF initialization.  This interferes with
+ * port reset on some companion controllers.
+ */
+extern struct rw_semaphore ehci_cf_port_reset_rwsem;
 
+#endif /* __KERNEL__ */
Index: usb-2.6/drivers/usb/core/hub.c
===================================================================
--- usb-2.6.orig/drivers/usb/core/hub.c
+++ usb-2.6/drivers/usb/core/hub.c
@@ -125,6 +125,12 @@ MODULE_PARM_DESC(use_both_schemes,
 		"try the other device initialization scheme if the "
 		"first one fails");
 
+/* Mutual exclusion for EHCI CF initialization.  This interferes with
+ * port reset on some companion controllers.
+ */
+DECLARE_RWSEM(ehci_cf_port_reset_rwsem);
+EXPORT_SYMBOL_GPL(ehci_cf_port_reset_rwsem);
+
 
 static inline char *portspeed(int portstatus)
 {
@@ -1579,6 +1585,8 @@ static int hub_port_reset(struct usb_hub
 {
 	int i, status;
 
+	down_read(&ehci_cf_port_reset_rwsem);
+
 	/* Reset the port */
 	for (i = 0; i < PORT_RESET_TRIES; i++) {
 		status = set_port_feature(hub->hdev,
@@ -1610,7 +1618,7 @@ static int hub_port_reset(struct usb_hub
 			usb_set_device_state(udev, status
 					? USB_STATE_NOTATTACHED
 					: USB_STATE_DEFAULT);
-			return status;
+			goto done;
 		}
 
 		dev_dbg (hub->intfdev,
@@ -1623,6 +1631,8 @@ static int hub_port_reset(struct usb_hub
 		"Cannot enable port %i.  Maybe the USB cable is ...
From: David Brownell
Date: Tuesday, October 9, 2007 - 11:59 am

You have two copies of that comment; I'd drop this one, and add a
better one around CF initialization.

What I'd have here is a comment that this symbol is exported ONLY

That deserves a comment about how CF and companion port resets
don't mix well.

... assuming this does check out as the problem, which I expect
it will.

-

From: David Miller
Date: Tuesday, October 9, 2007 - 2:27 pm

From: Alan Stern <stern@rowland.harvard.edu>

Are we sure that this is the only event that causes the conflicts between
the EHCI and it's companion virtual OHCI/UHCI host(s)?

Is there a chip reset done to the EHCI controller early during device
probe that could cause problems too?

I know it's grotty, but I definitely feel more comfortable with the
fix involving making sure EHCI loads and fully probes first.
-

From: David Brownell
Date: Tuesday, October 9, 2007 - 2:43 pm

It's the event which can cause trouble when EHCI starts up
after OHCI/UHCI (rather than before it).

The other event is setting the OWNER bit, which clearly
has not caused trouble.

Does it resolve the problem you observed?

- Dave

-

From: David Miller
Date: Tuesday, October 9, 2007 - 3:00 pm

From: David Brownell <david-b@pacbell.net>

I will definitely give the patch a try and report back.
-

From: David Miller
Date: Tuesday, October 9, 2007 - 9:35 pm

From: David Miller <davem@davemloft.net>

My tests look really good with Alan Stern's patch.

With a kernel is hacked to load OHCI before EHCI, and
that always hangs, the patch makes the hang go away
reliably.
-

From: David Miller
Date: Monday, October 15, 2007 - 3:01 pm

From: David Miller <davem@davemloft.net>

Bad news, even with the rwsem after a lot more testing I can still
trigger the hang in ohci_hub_control() :-(

I think we need to go back to considering the total serialization
approach to this problem.
-

From: David Brownell
Date: Monday, October 15, 2007 - 4:39 pm

We shouldn't need that.  What happens if you add an msleep(5)
before ehci-hcd::ehci_run() drops ehci_cf_port_reset_rwsem?

The theory there being that the switch triggered by setting CF
doesn't take effect instantaneously, contrary to the effective
assumption of that code.  A delay of 5 msec seems like it should
be more than enough, but that's kind of a guess ... it's good to
keep that low, since unfortunately that's in the critical path
for OLPC "resume from idle".

- Dave

-

From: David Miller
Date: Monday, October 15, 2007 - 4:58 pm

From: David Brownell <david-b@pacbell.net>


I want to help with this, but if I even breath on the kernel the bug
goes away.  The race just gets harder to trigger, and if we just keep
adding things it'll make the problem go away but for the absolutely
wrong reasons.

The only way we will provably fix this is to make sure EHCI initialize
fully, first, regardless of kernel config or what userland does.

Also, David, you haven't done anything with the feedback I gave to the
most recent revision of the OHCI hub reset anti-wedge patch.  You
removed the debug logging when the outer-loop timeout expires, and I
asked that you put that back so that if it happens there is some
chance to know that this is what happened.  If it's not supposed to
happen, there is no harm in putting the debugging log message there
so that if the impossible does happen we find out about it.

I really don't think it's appropriate for that bug fix to sit yet
another week.
-

From: Alan Stern
Date: Tuesday, October 16, 2007 - 8:23 am

Unfortunately that simply isn't possible.  No matter what you do, the 
user can always unload ehci-hcd and then load it back in again.

Do you have any idea _where_ in ohci_hub_control the hang still occurs?  
Is it the same unbounded reset loop?

Does the patch below satisfy both Davids?

Alan Stern



Index: usb-2.6/drivers/usb/host/ohci-hub.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-hub.c
+++ usb-2.6/drivers/usb/host/ohci-hub.c
@@ -560,10 +560,10 @@ static void start_hnp(struct ohci_hcd *o
 /* called from some task, normally khubd */
 static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 {
-	__hc32 __iomem *portstat = &ohci->regs->roothub.portstatus [port];
-	u32	temp;
-	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	u16	reset_done = now + PORT_RESET_MSEC;
+	__hc32 __iomem	*portstat = &ohci->regs->roothub.portstatus[port];
+	u32		temp;
+	unsigned long	reset_done = jiffies +
+				msecs_to_jiffies(PORT_RESET_MSEC);
 
 	/* build a "continuous enough" reset signal, with up to
 	 * 3msec gap between pulses.  scheduler HZ==100 must work;
@@ -578,6 +578,8 @@ static inline int root_port_reset (struc
 				return -ESHUTDOWN;
 			if (!(temp & RH_PS_PRS))
 				break;
+			if (time_after(jiffies, reset_done))
+				break;
 			udelay (500);
 		}
 
@@ -589,8 +591,7 @@ static inline int root_port_reset (struc
 		/* start the next reset, sleep till it's probably done */
 		ohci_writel (ohci, RH_PS_PRS, portstat);
 		msleep(PORT_RESET_HW_MSEC);
-		now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	} while (tick_before(now, reset_done));
+	} while (time_before_eq(jiffies, reset_done));
 	/* caller synchronizes using PRSC */
 
 	return 0;

-

From: David Miller
Date: Tuesday, October 16, 2007 - 3:06 pm

From: Alan Stern <stern@rowland.harvard.edu>

Yes we can, by making OHCI and EHCI one module with a top-level
dispatch.  If you enable both OHCI and EHCI, the top-level
module will dispatch the host initializations in the correct order.

This is what I've suggested from the beginning.
-

From: Greg KH
Date: Tuesday, October 16, 2007 - 3:20 pm

Wait, you can have hardware with both EHCI and UHCI too.  Does that mean
we should merge all three together?  I don't think so :)

But perhaps we can order the hardware init stuff from all three together
like this into a separate module they all depend on.  In a way, that's
what the lock tried to do, right?  Are we just not catching all places
we could have hardware being talked to by two modules at the same time?

thanks,

greg k-h
-

From: Alan Stern
Date: Wednesday, October 17, 2007 - 8:56 am

No, we do catch the one place where it happens.  The problem seems to
be that the hardware update takes some time.  That is, on one side we
take the write lock, talk to the EHCI hardware, and drop the write
lock.  On the other side we take the read lock, talk to the OHCI 
hardware, and drop the read lock.  Nevertheless, the interference 
occurs.  David B.'s interpretation is that the hardware's change of 
state takes more time than the CPU uses in manipulating locks and 
switching tasks.  Hence his suggestion for adding a delay.

Alan Stern

-

From: David Miller
Date: Tuesday, October 16, 2007 - 3:08 pm

From: Alan Stern <stern@rowland.harvard.edu>


No, there are no warning messages that trigger

My last patch was just fine, it timed out appropriately and warned the
user telling them exactly what event timed out.

I'm reposting it here for reference, this is getting silly:

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
index bb9cc59..9149593 100644
--- a/drivers/usb/host/ohci-hub.c
+++ b/drivers/usb/host/ohci-hub.c
@@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 	u32	temp;
 	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
 	u16	reset_done = now + PORT_RESET_MSEC;
+	int	limit_1;
 
 	/* build a "continuous enough" reset signal, with up to
 	 * 3msec gap between pulses.  scheduler HZ==100 must work;
 	 * this might need to be deadline-scheduled.
 	 */
-	do {
+	limit_1 = 100;
+	while (--limit_1 >= 0) {
+		int limit_2;
+
 		/* spin until any current reset finishes */
-		for (;;) {
+		limit_2 = PORT_RESET_HW_MSEC * 2;
+		while (--limit_2 >= 0) {
 			temp = ohci_readl (ohci, portstat);
 			/* handle e.g. CardBus eject */
 			if (temp == ~(u32)0)
@@ -579,6 +584,10 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 				break;
 			udelay (500);
 		}
+		if (limit_2 < 0) {
+			ohci_warn(ohci, "Root port inner-loop reset timeout, "
+				  "portstat[%08x]\n", temp);
+		}
 
 		if (!(temp & RH_PS_CCS))
 			break;
@@ -589,7 +598,14 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 		ohci_writel (ohci, RH_PS_PRS, portstat);
 		msleep(PORT_RESET_HW_MSEC);
 		now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	} while (tick_before(now, reset_done));
+		if (!tick_before(now, reset_done))
+			break;
+	}
+	if (limit_1 < 0) {
+		ohci_warn(ohci, "Root port outer-loop reset timeout, "
+			  "now[%04x] reset_done[%04x]\n",
+			  now, reset_done);
+	}
 	/* caller synchronizes using PRSC */
 
 ...
From: Alan Stern
Date: Wednesday, October 17, 2007 - 8:51 am

Where did 100 come from?  Why not rely on the reset_done criterion,

This also is a somewhat arbitrary limit.  There's no obvious reason why
PORT_RESET_HW_MSEC should be both the limit for this loop and the

I realize that you are copying existing code here, but this makes the 
same kind of mistake you are trying to fix.  If a broken controller 
fails to increment its framenumber register, this termination condition 

What reason is there for having two warning messages?  One ought to be 
enough.

Alan Stern



Index: usb-2.6/drivers/usb/host/ohci-hub.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-hub.c
+++ usb-2.6/drivers/usb/host/ohci-hub.c
@@ -560,10 +560,10 @@ static void start_hnp(struct ohci_hcd *o
 /* called from some task, normally khubd */
 static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 {
-	__hc32 __iomem *portstat = &ohci->regs->roothub.portstatus [port];
-	u32	temp;
-	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	u16	reset_done = now + PORT_RESET_MSEC;
+	__hc32 __iomem	*portstat = &ohci->regs->roothub.portstatus[port];
+	u32		temp;
+	unsigned long	reset_done = jiffies +
+				msecs_to_jiffies(PORT_RESET_MSEC);
 
 	/* build a "continuous enough" reset signal, with up to
 	 * 3msec gap between pulses.  scheduler HZ==100 must work;
@@ -578,6 +578,12 @@ static inline int root_port_reset (struc
 				return -ESHUTDOWN;
 			if (!(temp & RH_PS_PRS))
 				break;
+			if (time_after(jiffies, reset_done)) {
+				ohci_warn(ohci, "Port %d reset timeout, "
+						"status %x\n",
+						port + 1, temp);
+				break;
+			}
 			udelay (500);
 		}
 
@@ -589,8 +595,7 @@ static inline int root_port_reset (struc
 		/* start the next reset, sleep till it's probably done */
 		ohci_writel (ohci, RH_PS_PRS, portstat);
 		msleep(PORT_RESET_HW_MSEC);
-		now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	} while (tick_before(now, reset_done));
+	} while (time_before_eq(jiffies, ...
From: David Miller
Date: Wednesday, October 17, 2007 - 4:03 pm

From: Alan Stern <stern@rowland.harvard.edu>

In my patch it was possible for the inner loop one to succeed, but the
outer one to not do so.

In your's this is not the case so I guess it's OK.

I wonder if it's so wise trying to do two things at once.  Here we are
adding the loop timeouts, and also changing to using jiffies based
timeouts rather than a chip timer register based one.

I preferred my patches because it solved one single problem, the lack
of loop limits.  The timeout mechanism could have been changed in
another followon patch.
-

From: Alan Stern
Date: Thursday, October 18, 2007 - 7:28 am

At this point I'd be perfectly happy to sit back and let David B.
choose, among all the possibilities that have been posted, the one he
thinks is best.

Alan Stern

-

From: David Brownell
Date: Tuesday, October 16, 2007 - 11:26 am

Not if what I suggested is happening is really what's happening.
(Quoted next.)

It's got to be just a *simple* hardware race, and the msleep would
reliably prevent it since the switch takes a finite amount of time
to do its job.  I've had to struggle with real heisenbugs, and this
doesn't have enough conflicting behaviors inside the silicon (or

So, you're unwilling to explore whether that suggestion addresses

As Alan noted: no can do, in general.  That's why I've not griped
harder at the distro vendors who are ignoring the fairly simple
recommendation that's been around for six years now:  load EHCI
before other USB controller drivers.  

Admittedly, until you turned up this glitch there was no downside


It will exit by the inner loop (with diagnostic) before it exits
from the outer one.  Then the hub logic and other code will give

The version I sent should just merge.

- Dave

-

From: David Brownell
Date: Monday, October 8, 2007 - 9:09 pm

There's actually no such thing as an "EHCI port" or an "OHCI port".
Instead, there's a set of ports, each of which can be switched so
the USB differential data signals go up to either controller.

When EHCI starts, that switch points to EHCI so that devices can try
enumerating with high speed signaling.  When a device doesn't respond
to that "chirp", the EHCI root hub driver switches the port to the

Yes, that's why I asked about EHCI.  My speculation would be that
OHCI starts the reset, and EHCI claims the port before it completes;
or contrariwise OHCI starts the reset right after EHCI claims it.

And there's some point in that process where a hardware race makes
the trouble you've observed.  I believe there are plenty of other
places where it's perfectly fine if EHCI grabs the port, or this
little race would have shown up many times before.

- Dave


-

From: Benjamin Herrenschmidt
Date: Monday, October 8, 2007 - 10:13 pm

Since we can't know which O/UHCI is paired with which EHCI, we can't
really have generic code to deal with that race, but maybe we can be
smart and basically mutex khubd activity such as port reset vs.
registration of any new HCD ?

I'm not even sure module load order is 100% fault proof here since khubd
spawns as a thread...
 
Ben.


-

From: David Miller
Date: Monday, October 8, 2007 - 10:26 pm

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

I'm concerned about that as well, thanks for bringing it up.

My understanding, however, is that the critical thing is that the EHCI
device reset being done by the EHCI driver probe occurs and completes
first.  If that is true, then just making sure EHCI loads initially is
a sufficient constraint to fix this problem.
-

From: Benjamin Herrenschmidt
Date: Monday, October 8, 2007 - 11:37 pm

Yup, that would be, though I hate that sort of load order
dependencies...

Ben.


-

From: David Brownell
Date: Monday, October 8, 2007 - 9:36 pm

Don't need this "limit_1" timeout; "reset_done" handles all
the timeout needed there.  The regs->fmnumber is essentially

This is the loop that didn't terminate for you, right?
PORT_RESET_HW_MSEC is the ceiling you should use here,

What values do you see for "portstat"?

I suspect there will be some flag set which would allow a more
immediate exit from that loop.  RH_PS_CCS might clear, for example.

And in any case, if that fails I don't see any reason not to just
-

From: David Miller
Date: Monday, October 8, 2007 - 9:44 pm

From: David Brownell <david-b@pacbell.net>

If the hardware hangs and the register stops incrementing,
the entire kernel will hang.  That is unacceptable.




Absolutely nothing clears in the register from it's initial value.

Here is the patch with the limit_2 initial value fixed.

I kept loop_1 in there, it is necessary.  No kernel code should
hang in an endless loop because of malfunctioning hardware.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
index bb9cc59..9149593 100644
--- a/drivers/usb/host/ohci-hub.c
+++ b/drivers/usb/host/ohci-hub.c
@@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 	u32	temp;
 	u16	now = ohci_readl(ohci, &ohci->regs->fmnumber);
 	u16	reset_done = now + PORT_RESET_MSEC;
+	int	limit_1;
 
 	/* build a "continuous enough" reset signal, with up to
 	 * 3msec gap between pulses.  scheduler HZ==100 must work;
 	 * this might need to be deadline-scheduled.
 	 */
-	do {
+	limit_1 = 100;
+	while (--limit_1 >= 0) {
+		int limit_2;
+
 		/* spin until any current reset finishes */
-		for (;;) {
+		limit_2 = PORT_RESET_HW_MSEC * 2;
+		while (--limit_2 >= 0) {
 			temp = ohci_readl (ohci, portstat);
 			/* handle e.g. CardBus eject */
 			if (temp == ~(u32)0)
@@ -579,6 +584,10 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 				break;
 			udelay (500);
 		}
+		if (limit_2 < 0) {
+			ohci_warn(ohci, "Root port inner-loop reset timeout, "
+				  "portstat[%08x]\n", temp);
+		}
 
 		if (!(temp & RH_PS_CCS))
 			break;
@@ -589,7 +598,14 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
 		ohci_writel (ohci, RH_PS_PRS, portstat);
 		msleep(PORT_RESET_HW_MSEC);
 		now = ohci_readl(ohci, &ohci->regs->fmnumber);
-	} while (tick_before(now, reset_done));
+		if (!tick_before(now, reset_done))
+			break;
+	}
+	if (limit_1 < 0) {
+		ohci_warn(ohci, "Root port ...
From: David Brownell
Date: Tuesday, October 9, 2007 - 9:38 am

That's unfortunately useless ... PPS, PRS, CCS (meaning powered,
resetting, connected).  In short, there is *no* indication from
the OHCI hardware that a disconnect happened.

-

From: David Miller
Date: Tuesday, October 9, 2007 - 1:41 pm

From: David Brownell <david-b@pacbell.net>

But it's an excellent example of why my patch is mandatory.

When the device gets into a stuck state, this code does the wrong
thing.

Instead of trying to deal with bad situations, it hangs the system
instead.  That sucks for anyone trying to debug something like this.
This has caused weeks of debugging and grief for people, and it
could have all been eliminated if this looping code were more
robust to hardware failures.

If you can't see why this is bad programming practice in a hardware
driver, I will try to get my patch to someone who does.

Even though I'm severly overloaded, you asked me for a patch, I gave
you one.  And now you want to push it back at me because it isn't
clear to you yet that the code there right now is a huge problem.

And all of this is independant of making sure EHCI loads first,
we need to fix that too.
-

From: Greg KH
Date: Tuesday, October 9, 2007 - 1:46 pm

I agree.  David Brownell, unless you strongly object, I'll take David
Miller's patch and add it to my tree.

And if you do object, can you respond to the objection with a version of
the patch that you would accept?  :)

thanks,

greg k-h
-

From: David Brownell
Date: Tuesday, October 9, 2007 - 2:05 pm

I'll send some version with a signoff.

- Dave

-

From: David Brownell
Date: Tuesday, October 9, 2007 - 2:09 pm

And I got some questions as I reviewed it.  That's far
from un-heard-of.  One particular question related to
one potential simplification ... which unfortunately

You're reading things into what I wrote which were not there.
Calm down.

- Dave

-

Previous thread: lockdep: how to tell it multiple pte locks is OK? by Jeremy Fitzhardinge on Saturday, October 6, 2007 - 11:31 pm. (7 messages)

Next thread: libertas and endianness by Geert Uytterhoeven on Sunday, October 7, 2007 - 3:15 am. (1 message)