The fragile part IMO is that the kernel currently allows the loading of
ehci to interrupt the initialization of uhci/ohci and *that* is what is
causing the errors.
I have run some tests loading ehci and uhci manually and when they are
done separately (i.e. with a little delay between the two) there are no
errors at all!
If uhci is loaded first, you only get a nice, clean "USB disconnect"
message (for devices already detected by uhci) when ehci is loaded.
If ehci is loaded first the low-speed devices are only detected after uhci
is loaded as well.
The *only* time you get the "device not accepting address" and "unable to
enumerate" errors is when you allow the ehci initialization to interrupt
the uhci initialization. IMO that cannot be classified as anything other
than a bug.
See also the attachments with dmesg output:
- modprobe_uhci-ehci: uhci first; ehci later
- modprobe_ehci-uhci: ehci first; uhci later
- modprobe_uhci+ehci: both simultaneously (uhci slightly earlier)
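
For anyone who wants to repeat the three runs listed above, here is a
rough Python sketch of how they could be scripted. It is only an
illustration, not necessarily how the runs above were performed: the
module names uhci_hcd and ehci_hcd, the 5 second delay (the text only
says "a little delay") and the helper names are all assumptions. It
defaults to a dry run that just lists the commands; the real run needs
root.

```python
import subprocess
import time

# Assumed module names; the mail itself only says "uhci" and "ehci".
UHCI = "uhci_hcd"
EHCI = "ehci_hcd"


def load_order(first, second, delay_s):
    """Return the steps for one test run as (action, argument) tuples."""
    steps = [("modprobe", first)]
    if delay_s > 0:
        steps.append(("sleep", delay_s))
    steps.append(("modprobe", second))
    return steps


# The 5.0 second delay is an arbitrary illustrative value.
SCENARIOS = {
    "modprobe_uhci-ehci": load_order(UHCI, EHCI, 5.0),
    "modprobe_ehci-uhci": load_order(EHCI, UHCI, 5.0),
    "modprobe_uhci+ehci": load_order(UHCI, EHCI, 0.0),
}


def run(steps, dry_run=True):
    """Run the steps; with dry_run only report what would be executed."""
    executed = []
    for action, arg in steps:
        if action == "sleep":
            if not dry_run:
                time.sleep(arg)
            executed.append(f"sleep {arg}")
        else:
            cmd = ["modprobe", arg]
            if not dry_run:
                subprocess.run(cmd, check=True)  # needs root
            executed.append(" ".join(cmd))
    return executed


if __name__ == "__main__":
    for name, steps in SCENARIOS.items():
        print(name, "->", run(steps))
```

Note that the third scenario here simply issues the two modprobe calls
back to back, which only approximates "simultaneously". Capturing dmesg
after each run is left out on purpose.
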
Two problems:
- CONFIG_USB_DEBUG causes such a huge load of output that it is totally
unacceptable to have that enabled permanently for a running system
- I cannot reproduce this issue on demand, even though I've tried with
various delays between loading uhci and ehci
With the new patches from Greg KH [1] it might be possible to
disable USB debugging automatically when system boot is completed, but
I'd have to build a kernel with those and wait for the problem to happen
again.
What I can see in the logs I do have is that in the error case for some
reason a "reset low speed USB device" is triggered instead of either an
"enumeration failure" or a "USB disconnect", which are what I normally
see. As mentioned before, this seems to indicate to me a subtle timing
difference between the boots and IMO confirms the danger of allowing the
initialization of ehci to interrupt an ongoing initialization of uhci.
My guess is that this "reset" is insufficient to cause the bus to be
properly rescanned when ehci hands it back to uhci. I also guess that a
"reset" can occur if the interruption by the ehci loading happens
somewhere between the times that would otherwise cause an "enumeration
failure" and a clean "USB disconnect".
I have now. See results and comments above.
No, unfortunately I cannot reproduce it on demand. Probably because the
timing is too subtle and the "window" in which the problem occurs is
quite small.
You made the comment that this issue isn't worse than yanking out
cables/devices at random times. AFAIK it is still very much discouraged
to do that for e.g. storage devices, especially when data has recently
been written to them, without at least syncing and preferably unmounting
the device first. For a lot of devices (like keyboards) it doesn't really
matter of course.
There is one huge difference though: if a user yanks out a (storage)
device while it is in use he's just being dumb and IMO deserves what he
gets. It's basically the same as pulling a SATA cable or the power cable
of a desktop system.
But when the _kernel_ does the same, it is IMO being irresponsible.
I don't think it is reasonable to go so far as to completely prohibit
ehci from loading after uhci, especially not during system boot. But
maybe it should be made to first check with the low speed drivers what
their state is _before_ just barging in and rudely interrupting things on
the hardware level.
And maybe the kernel should (eventually) even go so far as to check
whether a low speed USB driver is in use by a mounted storage device and,
if so, block the loading of ehci. Just as 'modprobe -r' for an ATA module
is blocked if the driver is still in use.
My tests show that it is quite easy to avoid errors by just making sure
that ehci does not interrupt *the initialization process* of uhci.
Wouldn't it be possible to let ehci first check the state of the
uhci/ohci drivers and to have it *delay* its own initialization if those
are still busy initializing themselves?
Conversely uhci/ohci should probably not respond to new devices being
plugged in when they have been notified by ehci that it wants to (or has
started to) initialize itself.
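
To make the idea a bit more concrete, here is a toy model in Python of
the handshake described above. It is a userspace sketch only and has
nothing to do with the real driver code: the classes, states and method
names are all invented for illustration. ehci announces itself, backs
off while a companion is still initializing, and the companion ignores
new devices until ehci is done.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    INITIALIZING = "initializing"
    READY = "ready"


class Companion:
    """Hypothetical stand-in for a uhci/ohci controller driver."""

    def __init__(self, name):
        self.name = name
        self.state = State.IDLE
        self.ehci_pending = False  # set once ehci has announced itself

    def start_init(self):
        self.state = State.INITIALIZING

    def finish_init(self):
        self.state = State.READY

    def accepts_new_devices(self):
        # Ignore newly plugged devices while ehci wants to (or has
        # started to) initialize.
        return not self.ehci_pending


class Ehci:
    """Hypothetical stand-in for the ehci driver."""

    def __init__(self, companions):
        self.companions = companions
        self.state = State.IDLE

    def try_init(self):
        """Return True if ehci initialized, False if it must retry later."""
        # Announce the intent first so companions stop taking new devices.
        for c in self.companions:
            c.ehci_pending = True
        # Back off while any companion is still initializing.
        if any(c.state is State.INITIALIZING for c in self.companions):
            return False
        self.state = State.READY
        for c in self.companions:
            c.ehci_pending = False
        return True


if __name__ == "__main__":
    uhci = Companion("uhci")
    ehci = Ehci([uhci])
    uhci.start_init()
    print(ehci.try_init())  # deferred while uhci is busy
    uhci.finish_init()
    print(ehci.try_init())  # proceeds once uhci is done
```

In this model a deferred ehci simply has to call try_init() again later;
how a real driver would be woken up is deliberately left open.
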
Another option (probably on top of the above suggestion) would be to
slightly delay ohci/uhci initialization during system boot. This would
allow the general hardware discovery process to reach the later ehci PCI
device and start the ehci initialization.
ohci/uhci initialization could then start after ehci initialization has
completed; if no ehci device is present, ohci/uhci initialization would
still just start after the delay times out.
My boot logs show that the devices are generally detected within the same
second, so such a delay could be quite short.
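
The timing rule for this second option can be written down as a small
Python function. Again this is just a sketch of my reading of the
proposal, with invented names and example numbers: uhci/ohci waits for a
short grace period after it is discovered; if ehci initialization has
started by then, it waits for ehci to complete, otherwise it starts when
the grace period times out.

```python
def companion_start_time(found_at, delay_s, ehci_start=None, ehci_done=None):
    """Return the time (seconds since boot) at which uhci/ohci init starts.

    found_at   -- when the uhci/ohci controller was discovered
    delay_s    -- grace period to wait for ehci to show up
    ehci_start -- when ehci init started, or None if there is no ehci
    ehci_done  -- when ehci init completed (needed if ehci_start is set)
    """
    deadline = found_at + delay_s
    if ehci_start is None or ehci_start > deadline:
        # No ehci seen within the grace period: start when it times out.
        return deadline
    if ehci_done is None:
        raise ValueError("ehci_done is required when ehci_start is given")
    # ehci was seen in time: start once it has completed.
    return max(found_at, ehci_done)


if __name__ == "__main__":
    # Example numbers only, chosen to show the three cases.
    print(companion_start_time(0.0, 0.5))
    print(companion_start_time(0.0, 0.5, ehci_start=0.2, ehci_done=0.8))
    print(companion_start_time(0.0, 0.5, ehci_start=0.7, ehci_done=0.9))
```

The third case (ehci turning up after the grace period) is exactly where
the first suggestion would still be needed on top of this one.
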
Does this sound at all logical and feasible?
Cheers,
FJP
[1] http://lkml.org/lkml/2008/8/8/529