When root_port_reset() in ohci-hub.c polls for the end of the reset, it puts no limit on the loop and will only exit the loop when either the RH_PS_PRS bit clears or the register returns all 1's (the latter condition is a recent addition). If for some reason the bit never clears, we sit here forever and never exit the loop. I actually hit this on one of my machines, and I'm trying to track down what's happening. Regardless of why my machine is doing this, there absolutely should be some upper bound put on the number of times we will run through this loop, perhaps enough such that up to 5 seconds elapses waiting for the reset bit to clear. And if it times out we should print a loud message onto the console, but still try to continue. Thanks! -
The all-ones typically indicating something like CardBus eject Sounds like a hardware issue -- unless something else is trashing controller state. We've not observed that specific hardware failure before, for what it's worth. Is this SPARC, or is ACPI potentially in play? PCI, or non-PCI? Are the other ports still behaving? Is EHCI maybe trying to switch ownership of that port? Is maybe the (newish) autosuspend stuff kicking in? The OHCI spec requires the controller to stop the reset itself. Patches accepted. :) Since the PRS bit is specified as "write one", with writing zero as no-effect (since the rest is hardware-timed), the only recovery procedure might involve resetting the whole controller. Messy, and not something the usb core has historically handled very well. - Dave -
From: David Brownell <david-b@pacbell.net> I'm 700 patches deep with an additionally large backlog of patches to apply for the networking and sparc64 tree. I don't have the time, which is why I reported the problem to the OHCI maintainer instead of letting it slip through At a minimum you should exit the loop and print out a warning messages and try to continue even without trying to reset the whole controller if that will take some developer effort. Anything is better than just hanging there forever. Not every user knows to hit ALT-SYSRQ then 'P' to get a register dump to figure out why their computer stopped booting. Every register polling loops, without exception, should have a limit and exit with an error indication when that limit is reached. It will never hang someones machine, because although not this time, in many cases these kinds of hangs are absolutely impossible to debug (interrupts disabled, critical lock held, etc.) -
From: David Miller <davem@davemloft.net> To add some more information here, I think the EHCI idea might hold some water. What I have here are two NEC OHCI USB interfaces and one NEC EHCI USB interface on PCI. Aparently they all go through a shared USB hub, mapped like this: HUB Port 1: OHCI #1, EHCI HUB Port 2: OHCI #2, EHCI HUB Port 3: OHCI #1, EHCI HUB Port 4: OHCI #2, EHCI HUB Port 5: OHCI #1, EHCI The OHCI ports go out to external USB connectors on the back panel of the machine, whereas the EHCI is connected up to an internal USB storage CDROM device and what appears to be another USB hub. The problem seems to be very strongly tied to timing. For example simply adding "ignore_loglevel" to the kernel boot command line can make the problem go away. This got me thinking about your EHCI comment. If these controllers are going through the same HUB, things might go south if OHCI initialized first, then khubd et al. are asynchronously accessing the segments behind OHCI at the same time that the EHCI driver is initializing. Perhaps, this is the kind of sequence of events which makes one of the root ports reset in such a way that the the reset bit never clears. Given that this machine has 64 cpus, the likelyhood for such parallel accesses is very likely :-) Does this make any sense? Regardless, here is a patch that hardens the OHCI reset handling loops so that they break out instead of hanging the entire system should this condition occur. It's at least better than what the code does to a user right now which is hang the box completely: [USB] ohci: Do not hang the system if port reset does not complete. Signed-off-by: David S. Miller <davem@davemloft.net> diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c index bb9cc59..77ae5b4 100644 --- a/drivers/usb/host/ohci-hub.c +++ b/drivers/usb/host/ohci-hub.c @@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port) u32 temp; u16 now = ...
Yes it does, I'm seeing reports from some hardware companies of the very same thing. If you serialize and load the ehci driver first, and then the ohci driver, that should fix the problem. Does that also work for you? Or are these drivers built into the kernel? thanks, greg k-h -
From: Greg KH <greg@kroah.com> As coicidence would have it I finally found a recipe for triggering the issue, and it ties into what you're talking about here. It happens only if I make sure OHCI gets loaded first and then EHCI right afterwards. It seems that indeed it is important for EHCI to get loaded first, and in-kernel this is ensured by the link ordering. However, when both OHCI and EHCI are built as modules (or, similarly I guess, OHCI is built-in and EHCI is modular) there appears to be nothing in userspace which makes sure EHCI gets loaded first. When this triggers, in OHCI's root_port_reset(), the port status register reads 0x111 in that inner-loop and the value never changes. It stays like this forever. -
The old /etc/hotplug/usb.rc script made sure to load those modules in the correct order: EHCI first. -
From: David Brownell <david-b@pacbell.net> I expected to find something cute attempting to handle this under /etc/udev, I have failed so far :-) -
No, nothing cute in udev itself, but it seems that all distros that I know of have a "load these modules now" type setting in their init scripts that can be used here. I can't think of a way to enforce this load order on the modules themselves due to the fact that OHCI might not even be needed for EHCI devices on UHCI (Intel) based chipsets :( Can anyone else? thanks, greg k-h -
From: Greg KH <greg@kroah.com> The three modules perhaps should be a bundle of whatever ones you have enabled, and internally we can dispatch the initialization to occur in the correct order from a top-level module_init(). If the devices need to be initialized in a certain order in a situation like this, it really seems like it is the kernel's job to enforce it. -
Is the problem strictly an ordering problem or just a race ? In the later case, maybe some better arbitration by the USB core to make sure things are quiescent or in a known state when letting a new HCD register might help ? Ben. -
I agree. Here's some information from Intel about where they have seen this happen for UHCI controllers, so it's not just an OHCI issue :( thanks, greg k-h ---------------- We had a logic analyzer attached to the bus going to the ESB (ICH) which has the USB controller in it. In the passing case we would see no accesses to UHCI IO registers while EHCI initialized and sets its config flag. The EHCI Port Status & Control registers were then read and then we see a write to the EHCI Port Status & Control registers port owner bit for the low speed devices (keyboard & mouse). This turns control back over to the companion UHCI controller.=20 In our most prevalent failing case (#1 below) we never saw the write to port owner bit on the ports with the low speed devices. In the passing case we see the write to the port owner bit. I do not see how this would have anything to do with flakey hardware especially since we can reproduce this on all of our systems and the same device (USB controller) is used on multiple products.=20 I really believe that this has to do with the UHCI and EHCI drivers running on top of each other. This seems to be happening fairly often on our systems. If the EHCI driver runs first then we do not see the failure. If they are running at the same time then we see different failure symptoms.=20 1) We see that the ports with low speed devices are still in EHCI mode (port owner bit not written to in EHCI driver). In our analyzer captures we see the reads from the Port Status & Control register and it is indicating that there are low speed devices on the ports. Can you tell us why the driver would not be doing the write to the port owner bit when it sees that low speed devices are attached to that port? Is there something specific that it looks for and decides not to do the write? 2) In other cases we see that the ports with the low speed devices are back in UHCI mode but the ports are disabled. In this case we see from the analyzer traces that ...
It's hard to say without more info about what's going on. It might be on a non-enumeraton code path -- where nothing expects to set the OWNER bit -- or the CONNECT bit might not be set. See host/ehci-hub.c for details about exactly what the EHCI parts are doing, and core/hub.c for how it's driven. (It can be tricky to make sense of all that, easier if debugging messages are turned on. But that changes timings. Better yet would be being non-intrusive tracing of the actual code paths If the UHCI driver got disconnect and reconnect notifications, I expect it would do so. At least OHCI seems *not* to have gotten such notifications. I'm speculating that the "just a switch" behavior for the CF (and OWNER) bits may not allow notifications like that to happen when the companion controllers are resetting. - Dave -
Right. In general the driver _does_ write to the port owner bit when it sees a low-speed device is attached to the port. If that didn't happen in this case, maybe it's because the driver didn't see it. It certainly would help us if the tests were made on a kernel with This implies that the reset sequence finished successfully. Did that But above they said that it _had_ completed! Did they mean that the reset was complete but the driver hadn't yet detected that it was That seems likely. There's no way (as far as I can tell) for a host to see a disconnect while a reset is in progress. Of course, as soon as the reset is over then the host should see the disconnect, and it should set the corresponding Connect-Status-Changed bit. Amusing scenario: Suppose that ehci-hcd initializes the EHCI controller (taking control of the port), sees that the device isn't high speed, and switches port ownership back to the companion controller, all during the span of the companion's reset. The companion would never know anything had happened! But then everything should just work -- the port would be enabled and communication should proceed normally. Alan Stern -
Assuming PCI is present, /sys/bus/pci/devices/*/class can tell if EHCI is present (0x0c0320) ... if so, load that driver. Then repeat for OHCI (0x0c0310) and UHCI (0x0c0300). -
From: David Brownell <david-b@pacbell.net> These are facts all of us know very well, but implementing this in userspace in a failsafe manner isn't practical. That's what we're discussing. There are things that autoload USB drivers way before udev or similar even get started. For example, the first thing some distributions do is try to load the correct keyboard maps. Guess what that can do? It triggers a load of all of the modular USB host controller drivers in case we have a USB keyboard. The only real solution is in the kernel, because it is the only clean place to trap all of the potential module load events. -
That will not work for all of the non-PCI implementations though. Ben. -
Oddly enough, that's why I said "assuming PCI is present". ;) Anything using the ARC/TDI RTL doesn't have this issue ... it includes full/low speed support directly, without a companion controller. That covers the Freescale silicon and some others. For a long time I think that was the only generally available non-PCI implementation. But you're right that there are (now) a few other cases. In-tree right now are also the Au1200, ps3, and 440epx. - Dave -
I agree with Ben; the best way to solve this is in the kernel. Even if you did manage to make sure that modules were loaded in the right order initially, that wouldn't help if somebody starts unloading and loading modules by hand. The underlying cause of the problem appears to be related to the fact that it is impossible in principle to detect a disconnect while a full/low-speed reset is in progress. However the hardware shouldn't rely on this, and it should fail gracefully when an unexpected disconnect does occur. For example, what would the controller do if the user unplugged the USB cable right in the middle of a reset? Anyway, it certainly would be easy enough to make port resets mutually exclusive with EHCI initialization. That would work on any platform, PCI or not. Alan Stern -
That's a good idea, any thoughts as to how to do this? I think I can find some Intel and Dell people who would be glad to test this out if you have a proposed patch. thanks, greg k-h -
Here is a proposed patch. I haven't tried running it, but it compiles
okay.
Alan Stern
Index: usb-2.6/drivers/usb/core/hcd.h
===================================================================
--- usb-2.6.orig/drivers/usb/core/hcd.h
+++ usb-2.6/drivers/usb/core/hcd.h
@@ -19,6 +19,8 @@
#ifdef __KERNEL__
+#include <linux/rwsem.h>
+
/* This file contains declarations of usbcore internals that are mostly
* used or exposed by Host Controller Drivers.
*/
@@ -470,5 +472,9 @@ static inline void usbmon_urb_complete(s
: (in_interrupt () ? "in_interrupt" : "can sleep"))
-#endif /* __KERNEL__ */
+/* Mutual exclusion for EHCI CF initialization. This interferes with
+ * port reset on some companion controllers.
+ */
+extern struct rw_semaphore ehci_cf_port_reset_rwsem;
+#endif /* __KERNEL__ */
Index: usb-2.6/drivers/usb/core/hub.c
===================================================================
--- usb-2.6.orig/drivers/usb/core/hub.c
+++ usb-2.6/drivers/usb/core/hub.c
@@ -125,6 +125,12 @@ MODULE_PARM_DESC(use_both_schemes,
"try the other device initialization scheme if the "
"first one fails");
+/* Mutual exclusion for EHCI CF initialization. This interferes with
+ * port reset on some companion controllers.
+ */
+DECLARE_RWSEM(ehci_cf_port_reset_rwsem);
+EXPORT_SYMBOL_GPL(ehci_cf_port_reset_rwsem);
+
static inline char *portspeed(int portstatus)
{
@@ -1579,6 +1585,8 @@ static int hub_port_reset(struct usb_hub
{
int i, status;
+ down_read(&ehci_cf_port_reset_rwsem);
+
/* Reset the port */
for (i = 0; i < PORT_RESET_TRIES; i++) {
status = set_port_feature(hub->hdev,
@@ -1610,7 +1618,7 @@ static int hub_port_reset(struct usb_hub
usb_set_device_state(udev, status
? USB_STATE_NOTATTACHED
: USB_STATE_DEFAULT);
- return status;
+ goto done;
}
dev_dbg (hub->intfdev,
@@ -1623,6 +1631,8 @@ static int hub_port_reset(struct usb_hub
"Cannot enable port %i. Maybe the USB cable is ...You have two copies of that comment; I'd drop this one, and add a better one around CF initialization. What I'd have here is a comment that this symbol is exported ONLY That deserves a comment about how CF and companion port resets don't mix well. ... assuming this does check out as the problem, which I expect it will. -
From: Alan Stern <stern@rowland.harvard.edu> Are we sure that this is the only event that causes the conflicts between the EHCI and it's companion virtual OHCI/UHCI host(s)? Is there a chip reset done to the EHCI controller early during device probe that could cause problems too? I know it's grotty, but I definitely feel more comfortable with the fix involving making sure EHCI loads and fully probes first. -
It's the event which can cause trouble when EHCI starts up after OHCI/UHCI (rather than before it). The other event is setting the OWNER bit, which clearly has not caused trouble. Does it resolve the problem you observed? - Dave -
From: David Brownell <david-b@pacbell.net> I will definitely give the patch a try and report back. -
From: David Miller <davem@davemloft.net> My tests look really good with Alan Stern's patch. With a kernel is hacked to load OHCI before EHCI, and that always hangs, the patch makes the hang go away reliably. -
From: David Miller <davem@davemloft.net> Bad news, even with the rwsem after a lot more testing I can still trigger the hang in ohci_hub_control() :-( I think we need to go back to considering the total serialization approach to this problem. -
We shouldn't need that. What happens if you add an msleep(5) before ehci-hcd::ehci_run() drops ehci_cf_port_reset_rwsem? The theory there being that the switch triggered by setting CF doesn't take effect instantaneously, contrary to the effective assumption of that code. A delay of 5 msec seems like it should be more than enough, but that's kind of a guess ... it's good to keep that low, since unfortunately that's in the critical path for OLPC "resume from idle". - Dave -
From: David Brownell <david-b@pacbell.net> I want to help with this, but if I even breath on the kernel the bug goes away. The race just gets harder to trigger, and if we just keep adding things it'll make the problem go away but for the absolutely wrong reasons. The only way we will provably fix this is to make sure EHCI initialize fully, first, regardless of kernel config or what userland does. Also, David, you haven't done anything with the feedback I gave to the most recent revision of the OHCI hub reset anti-wedge patch. You removed the debug logging when the outer-loop timeout expires, and I asked that you put that back so that if it happens there is some chance to know that this is what happened. If it's not supposed to happen, there is no harm in putting the debugging log message there so that if the impossible does happen we find out about it. I really don't think it's appropriate for that bug fix to sit yet another week. -
Unfortunately that simply isn't possible. No matter what you do, the
user can always unload ehci-hcd and then load it back in again.
Do you have any idea _where_ in ohci_hub_control the hang still occurs?
Is it the same unbounded reset loop?
Does the patch below satisfy both Davids?
Alan Stern
Index: usb-2.6/drivers/usb/host/ohci-hub.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-hub.c
+++ usb-2.6/drivers/usb/host/ohci-hub.c
@@ -560,10 +560,10 @@ static void start_hnp(struct ohci_hcd *o
/* called from some task, normally khubd */
static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
{
- __hc32 __iomem *portstat = &ohci->regs->roothub.portstatus [port];
- u32 temp;
- u16 now = ohci_readl(ohci, &ohci->regs->fmnumber);
- u16 reset_done = now + PORT_RESET_MSEC;
+ __hc32 __iomem *portstat = &ohci->regs->roothub.portstatus[port];
+ u32 temp;
+ unsigned long reset_done = jiffies +
+ msecs_to_jiffies(PORT_RESET_MSEC);
/* build a "continuous enough" reset signal, with up to
* 3msec gap between pulses. scheduler HZ==100 must work;
@@ -578,6 +578,8 @@ static inline int root_port_reset (struc
return -ESHUTDOWN;
if (!(temp & RH_PS_PRS))
break;
+ if (time_after(jiffies, reset_done))
+ break;
udelay (500);
}
@@ -589,8 +591,7 @@ static inline int root_port_reset (struc
/* start the next reset, sleep till it's probably done */
ohci_writel (ohci, RH_PS_PRS, portstat);
msleep(PORT_RESET_HW_MSEC);
- now = ohci_readl(ohci, &ohci->regs->fmnumber);
- } while (tick_before(now, reset_done));
+ } while (time_before_eq(jiffies, reset_done));
/* caller synchronizes using PRSC */
return 0;
-
From: Alan Stern <stern@rowland.harvard.edu> Yes we can, by making OHCI and EHCI one module with a top-level dispatch. If you enable both OHCI and EHCI, the top-level module will dispatch the host initializations in the correct order. This is what I've suggested from the beginning. -
Wait, you can have hardware with both EHCI and UHCI too. Does that mean we should merge all three together? I don't think so :) But perhaps we can order the hardware init stuff from all three together like this into a separate module they all depend on. In a way, that's what the lock tried to do, right? Are we just not catching all places we could have hardware being talked to by two modules at the same time? thanks, greg k-h -
No, we do catch the one place where it happens. The problem seems to be that the hardware update takes some time. That is, on one side we take the write lock, talk to the EHCI hardware, and drop the write lock. On the other side we take the read lock, talk to the OHCI hardware, and drop the read lock. Nevertheless, the interference occurs. David B.'s interpretation is that the hardware's change of state takes more time than the CPU uses in manipulating locks and switching tasks. Hence his suggestion for adding a delay. Alan Stern -
From: Alan Stern <stern@rowland.harvard.edu>
No, there are no warning messages that trigger
My last patch was just fine, it timed out appropriately and warned the
user telling them exactly what event timed out.
I'm reposting it here for reference, this is getting silly:
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
index bb9cc59..9149593 100644
--- a/drivers/usb/host/ohci-hub.c
+++ b/drivers/usb/host/ohci-hub.c
@@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
u32 temp;
u16 now = ohci_readl(ohci, &ohci->regs->fmnumber);
u16 reset_done = now + PORT_RESET_MSEC;
+ int limit_1;
/* build a "continuous enough" reset signal, with up to
* 3msec gap between pulses. scheduler HZ==100 must work;
* this might need to be deadline-scheduled.
*/
- do {
+ limit_1 = 100;
+ while (--limit_1 >= 0) {
+ int limit_2;
+
/* spin until any current reset finishes */
- for (;;) {
+ limit_2 = PORT_RESET_HW_MSEC * 2;
+ while (--limit_2 >= 0) {
temp = ohci_readl (ohci, portstat);
/* handle e.g. CardBus eject */
if (temp == ~(u32)0)
@@ -579,6 +584,10 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
break;
udelay (500);
}
+ if (limit_2 < 0) {
+ ohci_warn(ohci, "Root port inner-loop reset timeout, "
+ "portstat[%08x]\n", temp);
+ }
if (!(temp & RH_PS_CCS))
break;
@@ -589,7 +598,14 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
ohci_writel (ohci, RH_PS_PRS, portstat);
msleep(PORT_RESET_HW_MSEC);
now = ohci_readl(ohci, &ohci->regs->fmnumber);
- } while (tick_before(now, reset_done));
+ if (!tick_before(now, reset_done))
+ break;
+ }
+ if (limit_1 < 0) {
+ ohci_warn(ohci, "Root port outer-loop reset timeout, "
+ "now[%04x] reset_done[%04x]\n",
+ now, reset_done);
+ }
/* caller synchronizes using PRSC */
...Where did 100 come from? Why not rely on the reset_done criterion,
This also is a somewhat arbitrary limit. There's no obvious reason why
PORT_RESET_HW_MSEC should be both the limit for this loop and the
I realize that you are copying existing code here, but this makes the
same kind of mistake you are trying to fix. If a broken controller
fails to increment its framenumber register, this termination condition
What reason is there for having two warning messages? One ought to be
enough.
Alan Stern
Index: usb-2.6/drivers/usb/host/ohci-hub.c
===================================================================
--- usb-2.6.orig/drivers/usb/host/ohci-hub.c
+++ usb-2.6/drivers/usb/host/ohci-hub.c
@@ -560,10 +560,10 @@ static void start_hnp(struct ohci_hcd *o
/* called from some task, normally khubd */
static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
{
- __hc32 __iomem *portstat = &ohci->regs->roothub.portstatus [port];
- u32 temp;
- u16 now = ohci_readl(ohci, &ohci->regs->fmnumber);
- u16 reset_done = now + PORT_RESET_MSEC;
+ __hc32 __iomem *portstat = &ohci->regs->roothub.portstatus[port];
+ u32 temp;
+ unsigned long reset_done = jiffies +
+ msecs_to_jiffies(PORT_RESET_MSEC);
/* build a "continuous enough" reset signal, with up to
* 3msec gap between pulses. scheduler HZ==100 must work;
@@ -578,6 +578,12 @@ static inline int root_port_reset (struc
return -ESHUTDOWN;
if (!(temp & RH_PS_PRS))
break;
+ if (time_after(jiffies, reset_done)) {
+ ohci_warn(ohci, "Port %d reset timeout, "
+ "status %x\n",
+ port + 1, temp);
+ break;
+ }
udelay (500);
}
@@ -589,8 +595,7 @@ static inline int root_port_reset (struc
/* start the next reset, sleep till it's probably done */
ohci_writel (ohci, RH_PS_PRS, portstat);
msleep(PORT_RESET_HW_MSEC);
- now = ohci_readl(ohci, &ohci->regs->fmnumber);
- } while (tick_before(now, reset_done));
+ } while (time_before_eq(jiffies, ...From: Alan Stern <stern@rowland.harvard.edu> In my patch it was possible for the inner loop one to succeed, but the outer one to not do so. In your's this is not the case so I guess it's OK. I wonder if it's so wise trying to do two things at once. Here we are adding the loop timeouts, and also changing to using jiffies based timeouts rather than a chip timer register based one. I preferred my patches because it solved one single problem, the lack of loop limits. The timeout mechanism could have been changed in another followon patch. -
At this point I'd be perfectly happy to sit back and let David B. choose, among all the possibilities that have been posted, the one he thinks is best. Alan Stern -
Not if what I suggested is happening is really what's happening. (Quoted next.) It's got to be just a *simple* hardware race, and the msleep would reliably prevent it since the switch takes a finite amount of time to do its job. I've had to struggle with real heisenbugs, and this doesn't have enough conflicting behaviors inside the silicon (or So, you're unwilling to explore whether that suggestion addresses As Alan noted: no can do, in general. That's why I've not griped harder at the distro vendors who are ignoring the fairly simple recommendation that's been around for six years now: load EHCI before other USB controller drivers. Admittedly, until you turned up this glitch there was no downside It will exit by the inner loop (with diagnostic) before it exits from the outer one. Then the hub logic and other code will give The version I sent should just merge. - Dave -
There's actually no such thing as an "EHCI port" or an "OHCI port". Instead, there's a set of ports, each of which can be switched so the USB differential data signals go up to either controller. When EHCI starts, that switch points to EHCI so that devices can try enumerating with high speed signaling. When a device doesn't respond to that "chirp", the EHCI root hub driver switches the port to the Yes, that's why I asked about EHCI. My speculation would be that OHCI starts the reset, and EHCI claims the port before it completes; or contrariwise OHCI starts the reset right after EHCI claims it. And there's some point in that process where a hardware race makes the trouble you've observed. I believe there are plenty of other places where it's perfectly fine if EHCI grabs the port, or this little race would have shown up many times before. - Dave -
Since we can't know which O/UHCI is paired with which EHCI, we can't really have generic code to deal with that race, but maybe we can be smart and basically mutex khubd activity such as port reset vs. registration of any new HCD ? I'm not even sure module load order is 100% fault proof here since khubd spawns as a thread... Ben. -
From: Benjamin Herrenschmidt <benh@kernel.crashing.org> I'm concerned about that as well, thanks for bringing it up. My understanding, however, is that the critical thing is that the EHCI device reset being done by the EHCI driver probe occurs and completes first. If that is true, then just making sure EHCI loads initially is a sufficient constraint to fix this problem. -
Yup, that would be, though I hate that sort of load order dependencies... Ben. -
Don't need this "limit_1" timeout; "reset_done" handles all the timeout needed there. The regs->fmnumber is essentially This is the loop that didn't terminate for you, right? PORT_RESET_HW_MSEC is the ceiling you should use here, What values do you see for "portstat"? I suspect there will be some flag set which would allow a more immediate exit from that loop. RH_PS_CCS might clear, for example. And in any case, if that fails I don't see any reason not to just -
From: David Brownell <david-b@pacbell.net>
If the hardware hangs and the register stops incrementing,
the entire kernel will hang. That is unacceptable.
Absolutely nothing clears in the register from it's initial value.
Here is the patch with the limit_2 initial value fixed.
I kept loop_1 in there, it is necessary. No kernel code should
hang in an endless loop because of malfunctioning hardware.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/drivers/usb/host/ohci-hub.c b/drivers/usb/host/ohci-hub.c
index bb9cc59..9149593 100644
--- a/drivers/usb/host/ohci-hub.c
+++ b/drivers/usb/host/ohci-hub.c
@@ -563,14 +563,19 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
u32 temp;
u16 now = ohci_readl(ohci, &ohci->regs->fmnumber);
u16 reset_done = now + PORT_RESET_MSEC;
+ int limit_1;
/* build a "continuous enough" reset signal, with up to
* 3msec gap between pulses. scheduler HZ==100 must work;
* this might need to be deadline-scheduled.
*/
- do {
+ limit_1 = 100;
+ while (--limit_1 >= 0) {
+ int limit_2;
+
/* spin until any current reset finishes */
- for (;;) {
+ limit_2 = PORT_RESET_HW_MSEC * 2;
+ while (--limit_2 >= 0) {
temp = ohci_readl (ohci, portstat);
/* handle e.g. CardBus eject */
if (temp == ~(u32)0)
@@ -579,6 +584,10 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
break;
udelay (500);
}
+ if (limit_2 < 0) {
+ ohci_warn(ohci, "Root port inner-loop reset timeout, "
+ "portstat[%08x]\n", temp);
+ }
if (!(temp & RH_PS_CCS))
break;
@@ -589,7 +598,14 @@ static inline int root_port_reset (struct ohci_hcd *ohci, unsigned port)
ohci_writel (ohci, RH_PS_PRS, portstat);
msleep(PORT_RESET_HW_MSEC);
now = ohci_readl(ohci, &ohci->regs->fmnumber);
- } while (tick_before(now, reset_done));
+ if (!tick_before(now, reset_done))
+ break;
+ }
+ if (limit_1 < 0) {
+ ohci_warn(ohci, "Root port ...That's unfortunately useless ... PPS, PRS, CCS (meaning powered, resetting, connected). In short, there is *no* indication from the OHCI hardware that a disconnect happened. -
From: David Brownell <david-b@pacbell.net> But it's an excellent example of why my patch is mandatory. When the device gets into a stuck state, this code does the wrong thing. Instead of trying to deal with bad situations, it hangs the system instead. That sucks for anyone trying to debug something like this. This has caused weeks of debugging and grief for people, and it could have all been eliminated if this looping code were more robust to hardware failures. If you can't see why this is bad programming practice in a hardware driver, I will try to get my patch to someone who does. Even though I'm severly overloaded, you asked me for a patch, I gave you one. And now you want to push it back at me because it isn't clear to you yet that the code there right now is a huge problem. And all of this is independant of making sure EHCI loads first, we need to fix that too. -
I agree. David Brownell, unless you strongly object, I'll take David Miller's patch and add it to my tree. And if you do object, can you respond to the objection with a version of the patch that you would accept? :) thanks, greg k-h -
I'll send some version with a signoff. - Dave -
And I got some questions as I reviewed it. That's far from un-heard-of. One particular question related to one potential simplification ... which unfortunately You're reading things into what I wrote which were not there. Calm down. - Dave -
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build with CONFIG_SSB_PCIHOST=n |
| Linux Kernel Mailing List | 9p: block-based virtio client |
| Michael Breuer | Re: [PATCH] af_packet: Don't use skb after dev_queue_xmit() |
