Re: [2.6.34-rc1 REGRESSION] ahci 0000:00:1f.2: controller reset failed (0xffffffff)

From: Andy Isaacson
Date: Tuesday, April 6, 2010 - 3:54 pm

This Dell Precision WorkStation T3400 doesn't boot 2.6.34-rc1 (tried
522dba71).  2.6.33 was fine, and it's been running various stable
kernels for the last 18 months.  Unfortunately I can't reasonably bisect
as I need this machine to be usable, but I can test specific patches or
options.  (three or four reboots is fine, 15 is not.)

Full dmesg from the failing boot and from a successful boot are at
http://web.hexapodia.org/~adi/tmp/20100406-pci-ahci-reset-fail/

I suspect it's due to:

[    3.094038] pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]
[    3.103001] pci 0000:00:1f.2: can't reserve [mem 0xff970000-0xff9707ff]

so I've CCed a few recent committers to setup-res.c.

dmesg up to point of failure:

[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.34-rc1-00005-g522dba7 (andy@farthing) (gcc version 4.3.3 (Debian 4.3.3-5) ) #4 SMP Tue Apr 6 12:20:02 PDT 2010
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-2.6.34-rc1-00005-g522dba7 root=UUID=a2359eda-9295-451c-924f-c181c6f49d0d ro console=tty1 console=ttyS0,115200
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009ec00 (usable)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000bfe01c00 (usable)
[    0.000000]  BIOS-e820: 00000000bfe01c00 - 00000000bfe53c00 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000bfe53c00 - 00000000bfe55c00 (ACPI data)
[    0.000000]  BIOS-e820: 00000000bfe55c00 - 00000000c0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 00000000fed00400 (reserved)
[    0.000000]  BIOS-e820: 00000000fed20000 - 00000000feda0000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fef00000 (reserved)
[    0.000000]  BIOS-e820: 00000000ffb00000 - 0000000100000000 (reserved)
[    0.000000]  ...
From: Yinghai
Date: Tuesday, April 6, 2010 - 6:08 pm

Can you try booting with pci=nocrs?

Also, please check with -rc4.

YH

--

From: Andy Isaacson
Date: Tuesday, April 6, 2010 - 6:28 pm

All I see in git is -rc3 + 299 commits ending with 0fdf867...

That still fails with the same "controller reset failed" message from
ahci.

-andy
--

From: Yinghai Lu
Date: Tuesday, April 6, 2010 - 7:19 pm

Please file a bug and assign it to Bjorn.

YH
--

From: Bjorn Helgaas
Date: Tuesday, April 6, 2010 - 8:59 pm

Thanks a lot for reporting this!

No need to bisect it.  I'm pretty sure 2.6.34-rc1 will boot fine if you
use "pci=use_crs" (obviously that's only a temporary workaround until we
fix the real problem).

The BIOS apparently reported this window:

  pci_root PNP0A03:00: host bridge window [mem 0xff97c000-0xff97ffff]

which doesn't enclose the [mem 0xff970000-0xff9707ff] region where the
BIOS put the AHCI device, so we moved the AHCI device.  Unfortunately,
we put it at [mem 0x000a0000-0x000a07ff], which wasn't a very good
choice because that's probably already used by a VGA device.
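
The enclosure test described above can be sketched in a few lines of C.
This is only an illustration, not the kernel's code: struct mem_range and
range_encloses() are hypothetical names, and the two ranges are copied
from the log lines quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, as printed in the log: [mem start-end]. */
struct mem_range {
    uint64_t start;
    uint64_t end;
};

/* Hypothetical helper: true if 'inner' lies entirely inside 'outer'. */
bool range_encloses(struct mem_range outer, struct mem_range inner)
{
    assert(outer.start <= outer.end && inner.start <= inner.end);
    return outer.start <= inner.start && inner.end <= outer.end;
}

/* Host bridge window reported by the BIOS (from the log). */
const struct mem_range host_window = { 0xff97c000, 0xff97ffff };

/* Where the BIOS left the AHCI BAR (from the log). */
const struct mem_range ahci_bios = { 0xff970000, 0xff9707ff };
```

With these values range_encloses(host_window, ahci_bios) is false, because
0xff970000 lies below the window start 0xff97c000; that is the condition
behind the "no compatible bridge window" message.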

If you happen to have Windows on this box, I'd love to know whether *it*
moves the AHCI device, too, or whether Windows interprets the BIOS
information differently than we do.  If you have Windows and can collect
screenshots of the Device Manager resources for the PCI bus and the AHCI
controller, that would be a good start.

Would you mind trying the patch below and the patch and kernel args
here:
  https://bugzilla.kernel.org/show_bug.cgi?id=15533#c5

This will (1) reserve the VGA area, so we should put the AHCI device
elsewhere, and (2) collect a few more details about exactly what the
BIOS is reporting.

Bjorn


commit 46b6e80aae2ec1d073767c92bba1d98896bce700
Author: Bjorn Helgaas <bjorn.helgaas@hp.com>
Date:   Tue Apr 6 21:44:12 2010 -0600

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 86b1506..f4c0fe4 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -44,7 +44,6 @@ static inline void visws_early_detect(void) { }
 extern unsigned long saved_video_mode;
 
 extern void reserve_standard_io_resources(void);
-extern void i386_reserve_resources(void);
 extern void setup_default_timer_irq(void);
 
 #ifdef CONFIG_X86_MRST
diff --git a/arch/x86/kernel/head32.c b/arch/x86/kernel/head32.c
index b2e2460..966b37f 100644
--- a/arch/x86/kernel/head32.c
+++ b/arch/x86/kernel/head32.c
@@ -22,7 +22,6 @@ static void __init ...
From: Andy Isaacson
Date: Tuesday, April 6, 2010 - 9:13 pm

pci=nocrs worked on 2.6.34-rc3-00299-g0fdf867.  I won't be back in front

The machine has one VGA controller exposed currently; there may be
another integrated Intel video controller on the motherboard and
disabled by the BIOS.

01:00.0 VGA compatible controller: nVidia Corporation Quadro NVS 290 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: nVidia Corporation Device 0492
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at fc000000 (32-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
        I/O ports at dc80 [size=128]
        Expansion ROM at fde00000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: nouveau


The machine only has Linux installed, but I may have access to another

I'll try that on Thursday.

-andy
--

From: Bjorn Helgaas
Date: Tuesday, April 6, 2010 - 9:21 pm

Oops, sorry, I meant it would probably work with "pci=nocrs", as you
already confirmed.  Don't bother trying "pci=use_crs"; that's turned on


Great, thanks!  Oh, and I forgot to ask: what BIOS version are you
running?  Google found several reports of USB issues in Windows on this
box, e.g., http://tim.cexx.org/?p=529 .

I think we still have a Linux bug in that we should be reserving the
legacy VGA area, but if the BIOS is reporting an incorrect host bridge
window, that will cause us to move the AHCI controller and tickle this
bug when we wouldn't otherwise.

Bjorn


--

From: Andy Isaacson
Date: Wednesday, April 7, 2010 - 10:16 am

On another T3400 with BIOS A03, Win7's Device Manager -> IDE ATA/ATAPI
controllers -> Standard AHCI 1.0 -> Resources -> Memory Range setting is
ff97f800-ff97ffff.  (If that's not the info you needed, let me know

BIOS Information
        Vendor: Dell Inc.
        Version: A04

I'll try the debug patch tomorrow morning.

-andy
--

From: Bjorn Helgaas
Date: Wednesday, April 7, 2010 - 11:08 am

Assuming this is the same AHCI controller (probably is, because I only
see one mentioned in your logs), I think Win7 moved it from where BIOS
left it.  It probably started at 0xff970000, and Win7 moved it into one
of the host bridge windows (but not the legacy VGA one):

  pci_root PNP0A03:00: host bridge window [mem 0xff980800-0xff980bff]
  pci_root PNP0A03:00: host bridge window [mem 0xff97c000-0xff97ffff]
  pci 0000:00:1f.2: no compatible bridge window for [mem 0xff970000-0xff9707ff]

Bjorn
--

From: Andy Isaacson
Date: Wednesday, April 7, 2010 - 11:42 am

Yes, there's only one AHCI controller mentioned on either machine.

-andy
--

From: Bjorn Helgaas
Date: Monday, April 12, 2010 - 10:56 am

We established that the patch in the message above wasn't enough
(the patch reserved 0xa0000-0xbffff, and Linux moved the AHCI
controller to 0xc0000 instead of 0xa0000).

But I'd still like to see the details of what ACPI is telling us,
so if you wouldn't mind trying that patch from bugzilla:
  https://bugzilla.kernel.org/show_bug.cgi?id=15533#c5
and collecting an acpidump, and attaching both to the bug report:
  https://bugzilla.kernel.org/show_bug.cgi?id=15744
that would be great.

Linux thinks the windows are:
  pci_root PNP0A03:00: host bridge window [mem 0x000a0000-0x000bffff]
  pci_root PNP0A03:00: host bridge window [mem 0x000c0000-0x000effff]
  pci_root PNP0A03:00: host bridge window [mem 0x000f0000-0x000fffff]

The 0xa0000-0xbffff one makes good sense.  That's normally MMIO that's
routed via PCI to the VGA device frame buffer, and we should be able
to figure out how to avoid that area, e.g., by using BIOS info, PCI
class codes, etc.

Now we need to figure out how to avoid the 0xc0000-0xeffff and
0xf0000-0xfffff windows.  Maybe there's something special about how ACPI
describes them.

Or maybe we're just unlucky because these are the first windows in the
_CRS list, and Linux tries them in order, while Windows uses a different
strategy.
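
A rough sketch of the "try the windows in order" behaviour, to make the
two observed placements concrete.  It is not the kernel's allocator:
first_fit() and its single-reserved-range handling are simplifications
made up for this example, and the 0x800 size is taken from the AHCI
range [mem 0xff970000-0xff9707ff].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mem_range { uint64_t start, end; };   /* inclusive */

/* Host bridge windows in the order Linux logged them (see above). */
const struct mem_range windows[] = {
    { 0x000a0000, 0x000bffff },
    { 0x000c0000, 0x000effff },
    { 0x000f0000, 0x000fffff },
};

/* The legacy VGA area reserved by the test patch. */
const struct mem_range vga = { 0x000a0000, 0x000bffff };

/*
 * Hypothetical first-fit: walk the windows in list order and return the
 * lowest size-aligned base that fits, stepping past one optional
 * reserved range.  Returns 0 if nothing fits.  'size' must be a power
 * of two, as PCI BAR sizes are.
 */
uint64_t first_fit(const struct mem_range *win, size_t nwin,
                   const struct mem_range *reserved, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    for (size_t i = 0; i < nwin; i++) {
        uint64_t base = (win[i].start + size - 1) & ~(size - 1);

        /* If the candidate overlaps the reserved range, move above it. */
        if (reserved && base <= reserved->end &&
            base + size - 1 >= reserved->start)
            base = (reserved->end + size) & ~(size - 1);

        if (base >= win[i].start && base + size - 1 <= win[i].end)
            return base;
    }
    return 0;
}
```

Under these assumptions, first_fit(windows, 3, NULL, 0x800) yields
0xa0000, and first_fit(windows, 3, &vga, 0x800) yields 0xc0000, which
matches the two placements reported in this thread.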

Bjorn
--

From: Andy Isaacson
Date: Monday, April 12, 2010 - 12:33 pm

That's confusing.  I think I figured it out, but "try this patch" linking
to a message that refers to another patch, some command-line options, and
some config options, without saying what the goal is, is a lot for me to
parse, since I don't actually understand what's going on here.

I think I got it all:
https://bugzilla.kernel.org/attachment.cgi?id=25969
https://bugzilla.kernel.org/attachment.cgi?id=25970

Let me know (using small words if necessary) if I screwed something up.

Thanks,
-andy
--

From: Matthew Wilcox
Date: Monday, April 12, 2010 - 2:56 pm

Perhaps it's sufficient to try them in reverse order?
--

From: H. Peter Anvin
Date: Monday, April 12, 2010 - 7:14 pm

Why bother?  The first megabyte is really special in x86... it is
historically used for legacy devices, it has specific functions for PCI
firmware, and it has separate MTRRs.

Simply put, "here there be dragons".  There is no sane reason to
allocate unassigned devices there (preassigned devices are another matter).
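
One way to picture that policy is a first-fit search with a 1 MB floor.
This is a sketch under stated assumptions, not a proposed patch:
first_fit_floor() is a hypothetical helper, and merging the low windows
and the two high windows from the earlier log excerpts into one list, in
this order, is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mem_range { uint64_t start, end; };   /* inclusive */

/* Windows quoted in this thread; the combined order is assumed. */
const struct mem_range windows[] = {
    { 0x000a0000, 0x000bffff },
    { 0x000c0000, 0x000effff },
    { 0x000f0000, 0x000fffff },
    { 0xff980800, 0xff980bff },
    { 0xff97c000, 0xff97ffff },
};

/*
 * Hypothetical first-fit that never places a device below 'floor'.
 * Returns the lowest size-aligned base that fits, or 0 if none does.
 * 'size' must be a power of two.
 */
uint64_t first_fit_floor(const struct mem_range *win, size_t nwin,
                         uint64_t floor, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    for (size_t i = 0; i < nwin; i++) {
        /* Clamp the window so nothing below the floor is considered. */
        uint64_t start = win[i].start > floor ? win[i].start : floor;
        if (start > win[i].end)
            continue;

        uint64_t base = (start + size - 1) & ~(size - 1);
        if (base + size - 1 <= win[i].end)
            return base;
    }
    return 0;
}
```

With a floor of 0x100000 and a size of 0x800, the three low windows are
skipped, the 0x400-byte window at 0xff980800 is too small, and the
search lands at 0xff97c000.  That is the same window that holds the
ff97f800-ff97ffff range Win7 reported, although Win7 evidently chose the
top of the window rather than the bottom.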

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: H. Peter Anvin
Date: Monday, April 12, 2010 - 6:51 pm

I strongly suspect that Windows knows that < 1 MB is special, and only
ever assigns it upon explicit allocation.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: Maciej Rutecki
Date: Friday, April 9, 2010 - 12:10 pm

I created a Bugzilla entry at 
https://bugzilla.kernel.org/show_bug.cgi?id=15744
for your bug report, please add your address to the CC list in there, thanks!

-- 
Maciej Rutecki
http://www.maciek.unixy.pl
--
