Re: All 2.6.26-rcX hang immediately after loading ohci_hcd

Previous thread: Re: [bug?] tg3: Failed to load firmware "tigon/tg3_tso.bin" by Jaswinder Singh on Saturday, July 5, 2008 - 2:37 am. (1 message)

Next thread: [PATCH] Move _RET_IP_ and _THIS_IP_ to include/linux/kernel.h by Eduard - Gabriel Munteanu on Saturday, July 5, 2008 - 5:14 am. (2 messages)
To: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, <dbrownell@...>
Cc: <linux-kernel@...>, <linux-usb@...>
Date: Saturday, July 5, 2008 - 3:08 am


I am sorry for late report but I did not have much time recently (actually
I am leaving for next week tomorrow again without access to my Linux system).

All 2.6.26 versions hang on boot immediately after loading OHCI driver. 2.6.25
is fine; 2.6.26-rc as shipped by Mandriva cooker seems to work as well so the
problem may be in my config (no I did not try different config due to lack of
time).

When system hangs SysRq is possible but rebooting (SysRq-B) does not work.
Pressing power button after that emits long stack and finally pressing and
holding power button switches it off.

Below is netconsole capture. Configs of 2.6.25 and 2.6.26 attached.

[ 0.000000] Linux version 2.6.26-rc8-1avb (bor@cooker) (gcc version 4.3.1 20080626 (prerelease) (GCC) ) #6 Sat Jul 5 09:13:03 MSD 2008
[ 0.000000] PAT disabled. Not yet verified on this CPU type.
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[ 0.000000] BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e0000 - 00000000000eee00 (reserved)
[ 0.000000] BIOS-e820: 00000000000eee00 - 00000000000ef000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000000ef000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 000000001ef60000 (usable)
[ 0.000000] BIOS-e820: 000000001ef60000 - 000000001ef70000 (ACPI data)
[ 0.000000] BIOS-e820: 000000001ef70000 - 0000000020000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
[ 0.000000] 495MB LOWMEM available.
[ 0.000000] Zone PFN ranges:
[ 0.000000] DMA 0 -> 4096
[ 0.000000] Normal 4096 -> 126816
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[1] active PFN ranges...

To: Andrey Borzenkov <arvidjaar@...>
Cc: Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, <dbrownell@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Alan Stern <stern@...>, Greg Kroah-Hartman <gregkh@...>
Date: Saturday, July 5, 2008 - 4:51 pm

The problem seems to be:

...

ie it looks like modprobe is stuck in some endless loop thanks to some
OHCI probe thing.

There's a lot of other processes then in 'D' state. Some are waiting for
some IO to complete, others look like they are waiting for some semaphore.
But those issues look like they may be a secondary result of the primary
issue (for example, softirqs aren't completing due to the lockup looking
like it may be in an irq/softirq handler, and the semaphore they are
waiting for seems to be the device layer semaphore that is held by the
probing routine already)

There are in fact several runnable tasks, but only the above one is the

It's also a bit sad that the core device infrastructure uses the old-style
semaphores rather than mutexes, because if it used mutexes the "locks
held" debugging would show those locks too. As it is, it is silent about
it, and only points out some relatively uninteresting stuff. But that

Anyway, the "show registers" one is the smoking gun, since it shows the

It really looks like it's some endless loop - possibly due to endless
interrupts happening while in ohci_hub_status_data().

And I don't think this is due to the recently fixed IRQF_DISABLED bug.
Admittedly, that bug would likely never show up on UP unless you have
spinlock debugging enabled, which you obviously do have. That might
explain why the Mandriva cooker kernel binary works for you. But if it's
the IRQF_DISABLED thing, any lockup would probably show up as spinning
recursively on a spinlock, which is not the case for you.

If it _is_ the IRQF_DISABLED bug, then it's fixed in commit
de85422b94ddb23c021126815ea49414047c13dc, which isn't in any released -rc
yet (I'm doing -rc9 today which will have it), but has been in the last
few daily snapshots and obviously is in the current -git tree.

That said, it really looks like it's stuck in some endless loop in
__do_softirq(). Not that that should be possible (there's an explicit
loop limit there). So ...
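
As a rough illustration of the "explicit loop limit" mentioned above, here is a
small userspace sketch of a bounded restart loop. It is not the kernel's
__do_softirq(); the names pending, noisy_handler, run_softirqs and MAX_RESTART
are all hypothetical, and the limit of 10 is an assumption made for the
example. The point is only that a handler which keeps re-raising itself cannot
keep such a loop spinning forever: after the limit is hit, the remaining work
is left pending for someone else (in the kernel, a thread) to pick up.

```c
#include <stdbool.h>

#define MAX_RESTART 10  /* assumed limit, chosen for illustration */

static unsigned int pending;   /* bitmask of raised (hypothetical) softirqs */
static int handler_runs;       /* how often the handler has been invoked */

/* A misbehaving handler that re-raises itself every time it runs. */
static void noisy_handler(void)
{
    handler_runs++;
    pending |= 1u;
}

/*
 * Process pending work, restarting while new work keeps arriving,
 * but never more than MAX_RESTART passes in total.
 * Returns true if work was still pending when the loop gave up.
 */
static bool run_softirqs(void)
{
    int max_restart = MAX_RESTART;

    do {
        unsigned int todo = pending;

        pending = 0;            /* snapshot and clear before running handlers */
        if (todo & 1u)
            noisy_handler();
    } while (pending && --max_restart);

    return pending != 0;
}
```

Setting pending to 1 and calling run_softirqs() makes the handler run exactly
MAX_RESTART times and then return with work still pending, which is why a true
endless loop inside such a function would be surprising.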

To: Andrey Borzenkov <arvidjaar@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Alan Stern <stern@...>, Greg Kroah-Hartman <gregkh@...>
Date: Saturday, July 5, 2008 - 8:04 pm

I seem to recall some oddness with RHSC on some platforms,
way back when I had lots of OHCI hardware and did comparative
testing. RHSC and RD did not act quite like the docs said ...

Simple experiments: add "distrust_firmware=y" to the ohci
module options. Or: try a kernel without CONFIG_PM enabled.

More complex experiments: try building with CONFIG_USB_DEBUG and
after editing drivers/usb/host/ohci-hcd.c to #define OHCI_VERBOSE_DEBUG
(just change the #undef to a #define) ... then restart things
(passing debug=8 on the kernel command line may help) and see what
it says on the system console.

If this is really an IRQ storm that may turn up more data.
Alternatively, it might not show info until you tweak ohci_irq()
(in that same file) to dump "ints" after it checks whether the
IRQ was "for some other device". And if even that doesn't show
a storm of messages ... then the problem may not be directly
caused by OHCI. Lots of stuff seems to be sharing that same
IRQ, after all...

- Dave
--
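
For readers unfamiliar with the "for some other device" check on a shared IRQ
line, the following is a minimal userspace sketch of the idea, not the actual
ohci_irq() code. The function classify_irq and the enum names are
hypothetical. It assumes the usual pattern of masking the raw status register
with the enable register, and it assumes that an all-ones read means the
controller is no longer responding. Dumping "ints" only in the IRQ_OURS case
is what keeps the log from filling up with interrupts raised by other devices
on the same line.

```c
#include <stdint.h>

/* Hypothetical verdicts for one invocation of a shared-IRQ handler. */
enum irq_verdict {
    IRQ_NOT_OURS,         /* some other device on the line raised it */
    IRQ_CONTROLLER_GONE,  /* register read returned all ones (assumed dead) */
    IRQ_OURS              /* at least one enabled interrupt bit is set */
};

/*
 * status:  raw value read from the interrupt status register
 * enabled: value of the interrupt enable register
 * ints:    receives the bits that are both raised and enabled
 */
static enum irq_verdict classify_irq(uint32_t status, uint32_t enabled,
                                     uint32_t *ints)
{
    if (status == UINT32_MAX) {
        *ints = 0;
        return IRQ_CONTROLLER_GONE;
    }

    *ints = status & enabled;
    return *ints ? IRQ_OURS : IRQ_NOT_OURS;
}
```

If a storm of IRQ_OURS results shows the same bit set over and over, the
controller is the likely source; a storm of IRQ_NOT_OURS results points at
another device sharing the line.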

To: David Brownell <david-b@...>
Cc: Andrey Borzenkov <arvidjaar@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Alan Stern <stern@...>, Greg Kroah-Hartman <gregkh@...>
Date: Saturday, July 5, 2008 - 8:28 pm

I think Andrey said he was leaving for a week, so I don't think he'll be
testing now.

But Andrey - if you are on-line and have access to the machine, one thing
to try would be to just do a

git revert e872154921a6b5

in case it really was that particular commit. It seems to revert cleanly
(although I didn't actually check if the result then compiled/worked).

And a "git bisect" would obviously be wonderful in case it wasn't.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: David Brownell <david-b@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Alan Stern <stern@...>, Greg Kroah-Hartman <gregkh@...>
Date: Sunday, July 6, 2008 - 12:59 am

Looks much better; I am writing this now under 2.6.26-rc8 with above
commit reverted.

To: Andrey Borzenkov <arvidjaar@...>
Cc: David Brownell <david-b@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Alan Stern <stern@...>, Greg Kroah-Hartman <gregkh@...>
Date: Sunday, July 6, 2008 - 1:29 am

Ok, that's certainly pretty conclusive. Thanks for the great console logs
making it so straightforward to narrow it down to this (even if it was
just a wild guess - with your extensive logs it was still fairly
informed).

Greg, Alan, David - at this point I think the commit should just be
reverted. We're past -rc9, and unless any of you can see some obvious
alternate fix (eg some bug in the commit that explains Andrey's problems
that can just be fixed), I'm not seeing any good alternatives.

I don't know what the hardware details are, but based on the bootup
messages it seems to be a Toshiba motherboard with an ALI 1535 chipset - I
think it's a Toshiba Portege 4000 (which would mean that the OHCI
controller is the ALI M5237).

For all I know, that may not be the best possible chipset out there, but
it's not something extremely odd either. The machine may be a bit long in
the tooth by now (I think it's a 750MHz PIII in there), but that may also
explain why most developers wouldn't have seen this issue.

Hmm?

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Andrey Borzenkov <arvidjaar@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Alan Stern <stern@...>, Greg Kroah-Hartman <gregkh@...>
Date: Sunday, July 6, 2008 - 1:25 pm

Right. I tried RC9 on some OHCI hardware I have locally,
and none of it seems to have this problem. If I had more
PCI slots, I could test more OHCI silicon ... but those
are going the way of the serial port (at least on PCs).

The only regression I uncovered is that my OMAP1 OSK stopped

I don't recall anything unusual about ALI and OHCI. They
worked fine for me in some K6 and K7 hardware I had back
then ... then they got renamed to ULI, bought by NVidia,
and killed.

- Dave

--

To: Linus Torvalds <torvalds@...>
Cc: Andrey Borzenkov <arvidjaar@...>, David Brownell <david-b@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Greg Kroah-Hartman <gregkh@...>
Date: Sunday, July 6, 2008 - 12:40 pm

There doesn't seem to be much choice.

As I recall, the consequences of leaving the old code were that
sometimes a suspend or a resume wouldn't work as expected. That's
better than hanging the entire system -- but on the other hand, it
would affect more people.

Andrey, when you have time I'd like to do some more debugging to find
out exactly what's going wrong.

Alan Stern

--

To: Alan Stern <stern@...>
Cc: Linus Torvalds <torvalds@...>, David Brownell <david-b@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Greg Kroah-Hartman <gregkh@...>
Date: Friday, July 11, 2008 - 1:03 am

Let me know what you need. I am still running rc8 but may update if needed.

To: Andrey Borzenkov <arvidjaar@...>
Cc: Linus Torvalds <torvalds@...>, David Brownell <david-b@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Greg Kroah-Hartman <gregkh@...>
Date: Friday, July 11, 2008 - 10:16 am

[If nobody minds, after this message I will remove everything except
linux-usb from the CC: list.]

That's good; it'll help initially to test a system with the bug
present.

To start with, add a printk near the start of ohci_irq() in ohci-hcd.c.
Have it display in hex the value of "ints", immediately after the
ohci_readl() call.

Alan Stern

--
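
Once "ints" is printed in hex, it helps to translate the value back into bit
names such as the RHSC and RD bits mentioned earlier in the thread. The sketch
below is a standalone userspace helper, not kernel code; decode_ohci_ints and
the table name are hypothetical. The bit positions follow the
HcInterruptStatus layout in the OHCI specification as I understand it, so
treat them as an assumption and check them against the spec or ohci.h before
relying on them.

```c
#include <stdint.h>
#include <string.h>

struct intr_bit {
    uint32_t mask;
    const char *name;
};

/* Assumed HcInterruptStatus bit layout (per the OHCI specification). */
static const struct intr_bit ohci_intr_bits[] = {
    { 1u << 0,  "SO"   },  /* scheduling overrun */
    { 1u << 1,  "WDH"  },  /* writeback of done head */
    { 1u << 2,  "SF"   },  /* start of frame */
    { 1u << 3,  "RD"   },  /* resume detected */
    { 1u << 4,  "UE"   },  /* unrecoverable error */
    { 1u << 5,  "FNO"  },  /* frame number overflow */
    { 1u << 6,  "RHSC" },  /* root hub status change */
    { 1u << 30, "OC"   },  /* ownership change */
};

/* Write the space-separated names of all bits set in ints into buf. */
static void decode_ohci_ints(uint32_t ints, char *buf, size_t len)
{
    size_t i;

    if (len == 0)
        return;
    buf[0] = '\0';

    for (i = 0; i < sizeof(ohci_intr_bits) / sizeof(ohci_intr_bits[0]); i++) {
        if (!(ints & ohci_intr_bits[i].mask))
            continue;
        if (buf[0] != '\0')
            strncat(buf, " ", len - strlen(buf) - 1);
        strncat(buf, ohci_intr_bits[i].name, len - strlen(buf) - 1);
    }
}
```

With a 64-byte buffer, a logged value of 0x48 decodes to "RD RHSC". If that
value repeated endlessly in the log, it would suggest the root hub status
change or resume-detect interrupt is never being acknowledged.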

To: Linus Torvalds <torvalds@...>
Cc: David Brownell <david-b@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-usb@...>, Alan Stern <stern@...>, Greg Kroah-Hartman <gregkh@...>
Date: Sunday, July 6, 2008 - 1:35 am

yes

{pts/0}% lspci -nn
00:00.0 Host bridge [0600]: ALi Corporation M1644/M1644T Northbridge+Trident [10b9:1644] (rev 01)
00:01.0 PCI bridge [0604]: ALi Corporation PCI to AGP Controller [10b9:5247]
00:02.0 USB Controller [0c03]: ALi Corporation USB 1.1 Controller [10b9:5237] (rev 03)
00:04.0 IDE interface [0101]: ALi Corporation M5229 IDE [10b9:5229] (rev c3)
00:06.0 Multimedia audio controller [0401]: ALi Corporation M5451 PCI AC-Link Controller Audio Device [10b9:5451] (rev 01)
00:07.0 ISA bridge [0601]: ALi Corporation M1533/M1535 PCI to ISA Bridge [Aladdin IV/V/V+] [10b9:1533]
00:08.0 Bridge [0680]: ALi Corporation M7101 Power Management Controller [PMU] [10b9:7101]
00:0a.0 Ethernet controller [0200]: Intel Corporation 82557/8/9/0/1 Ethernet Pro 100 [8086:1229] (rev 08)
00:10.0 CardBus bridge [0607]: Texas Instruments PCI1410 PC card Cardbus Controller [104c:ac50] (rev 01)
00:11.0 CardBus bridge [0607]: Toshiba America Info Systems ToPIC100 PCI to Cardbus Bridge with ZV Support [1179:0617] (rev 32)
00:11.1 CardBus bridge [0607]: Toshiba America Info Systems ToPIC100 PCI to Cardbus Bridge with ZV Support [1179:0617] (rev 32)
00:12.0 System peripheral [0880]: Toshiba America Info Systems SD TypA Controller [1179:0805] (rev 03)
01:00.0 VGA compatible controller [0300]: Trident Microsystems CyberBlade XPAi1 [1023:8820] (rev 82)

Handle 0x0001, DMI type 1, 25 bytes
System Information
Manufacturer: TOSHIBA
Product Name: PORTEGE 4000
Version: PP400E-0CJPX-0V
Serial Number: Z1109698G
UUID: 440FA000-001B-11D6-8000-C3D0A1109698
