Re: intel-iommu: CONFIG_DMAR*=y kills my box

Previous thread: mainline boot failures II: getname -> kmem_cache_alloc oops by Andi Kleen on Saturday, April 26, 2008 - 10:17 am. (1 message)

Next thread: [git pull] x86 updates by Ingo Molnar on Saturday, April 26, 2008 - 10:21 am. (1 message)
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: Ashok Raj <ashok.raj@...>, Shaohua Li <shaohua.li@...>, Anil S Keshavamurthy <anil.s.keshavamurthy@...>
Date: Saturday, April 26, 2008 - 10:21 am

Hi all,

On a kernel (tested 2.6.25, linux-next and latest -git) with DMAR enabled, my box won't boot unless I disable it with intel_iommu=off.

It hangs very early during boot and not even the reset button works anymore. All I get is a black screen with no messages at all.
Enabling or disabling DMAR_GFX_WA does not make any difference either.

I would like to debug the problem but I have no idea how to go about it.

This box has an ASUS P5E-VM DO motherboard, latest BIOS, a Q9300 quad-core CPU and 4G RAM.

lspci :

00:00.0 Host bridge: Intel Corporation DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation MEI Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation USB UHCI Controller #5 (rev 02)
00:1a.2 USB Controller: Intel Corporation USB UHCI Controller #6 (rev 02)
00:1a.7 USB Controller: Intel Corporation USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation HD Audio Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation SMBus Controller (rev 02)
00:1f.6 Signal processing controller: Intel Corporation Thermal Subsystem (rev 02)


dmesg:

[    0.000000] Linux version 2.6.25-05096-gb1721d0-dirty (crazy@thor) (gcc version 4.3.0 (Frugalware Linux) ) #783...
To: Gabriel C <nix.or.die@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Ashok Raj <ashok.raj@...>, Shaohua Li <shaohua.li@...>, Anil S Keshavamurthy <anil.s.keshavamurthy@...>
Date: Monday, April 28, 2008 - 8:11 pm

The quickest way is to replace the time_after call in the macro with a
counter up to 3,000,000,000 (3 billion), and don't forget to increment it
each time through the IOMMU_WAIT_OP loop, to see where it's hanging.

You need an ITP and some insight into the workings of the BIOS.  (I wouldn't
bother digging any deeper than seeing which DMAR setup step failed in the
early startup of the intel-iommu code.)  See IOMMU_WAIT_OP.  It should
use the TSC instead of jiffies.  Your system is stuck in that macro, most
likely because your BIOS is buggering up its side of the IOMMU DMAR
equation.

Basically what is happening is that the register set and poll protocol
for talking to the iommu is failing on the poll part of the protocol.
As this is happening in early boot before timers are running you get
stuck there.
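
As a rough user-space illustration of the set-and-poll idea with an iteration cap in place of a time-based check, here is a minimal sketch. It is not the kernel's IOMMU_WAIT_OP macro: `poll_reg_bounded`, `read_reg_fn`, `fake_dev` and `fake_read` are hypothetical names invented for this example, and the "register" is simulated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical register-read callback; stands in for a real MMIO read. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/*
 * Poll until (reg & mask) == expected, or until max_iters reads were made.
 * Returns true on success. If iters_used is non-NULL it receives the
 * number of reads performed, which shows how long the poll took.
 */
static bool poll_reg_bounded(read_reg_fn read_reg, void *ctx,
                             uint32_t mask, uint32_t expected,
                             uint64_t max_iters, uint64_t *iters_used)
{
    uint64_t i;

    assert(read_reg != NULL);
    for (i = 0; i < max_iters; i++) {
        if ((read_reg(ctx) & mask) == expected) {
            if (iters_used)
                *iters_used = i + 1;
            return true;
        }
    }
    if (iters_used)
        *iters_used = i;
    return false; /* the poll never completed within the cap */
}

/* Simulated device: status bit 0 becomes set after `ready_after` reads. */
struct fake_dev {
    uint64_t reads;
    uint64_t ready_after;
};

static uint32_t fake_read(void *ctx)
{
    struct fake_dev *d = ctx;

    d->reads++;
    return d->reads >= d->ready_after ? 1u : 0u;
}
```

Because the cap counts loop iterations rather than elapsed time, it still terminates when no timer is advancing, which is the situation described above. Reporting which step hit the cap is left to the caller.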

I've had this happen on a number of systems so far, and so far it's
always been a BIOS thing.  I've put some effort into adding code to detect
such things, like using the TSC instead of jiffies to detect the failure
case, and then trying to unwrap the DMAR setup code.  I got the TSC thing
to work, which would replace the black-screen startup hang with a call to
panic, but unwrapping the DMAR setup safely has been harder for me to
get working.
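
The TSC-versus-jiffies point can be sketched in user space as a wait loop whose timeout is measured against a free-running cycle counter supplied by the caller. This is only an illustration under assumptions, not the code described above: `wait_ready_cycles`, `cycles_fn`, `ready_fn` and the fake clock are hypothetical, and the counter here is simulated rather than a real TSC read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical callbacks: a free-running counter and a readiness check. */
typedef uint64_t (*cycles_fn)(void *clk);
typedef bool (*ready_fn)(void *dev);

enum wait_result { WAIT_OK = 0, WAIT_TIMED_OUT = -1 };

/*
 * Spin until ready(dev) is true or more than budget_cycles have elapsed
 * on the supplied counter. The counter must advance on its own; a tick
 * count that depends on timer interrupts would never expire if those
 * interrupts are not running yet.
 */
static enum wait_result wait_ready_cycles(ready_fn ready, void *dev,
                                          cycles_fn now, void *clk,
                                          uint64_t budget_cycles)
{
    uint64_t start;

    assert(ready != NULL && now != NULL);
    start = now(clk);
    for (;;) {
        if (ready(dev))
            return WAIT_OK;
        /* Unsigned subtraction stays correct across counter wraparound. */
        if (now(clk) - start > budget_cycles)
            return WAIT_TIMED_OUT; /* caller can report or panic here */
    }
}

/* Test doubles: a clock that advances by `step` on every read. */
struct fake_clock {
    uint64_t t;
    uint64_t step;
};

static uint64_t fake_now(void *clk)
{
    struct fake_clock *c = clk;

    c->t += c->step;
    return c->t;
}

static bool never_ready(void *dev)  { (void)dev; return false; }
static bool always_ready(void *dev) { (void)dev; return true; }
```

With a device that never becomes ready, the loop returns `WAIT_TIMED_OUT` instead of spinning forever, which is the behaviour change described above (a reported failure in place of a silent hang).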

--
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: Ashok Raj <ashok.raj@...>, Shaohua Li <shaohua.li@...>, Anil S Keshavamurthy <anil.s.keshavamurthy@...>, Andrew Morton <akpm@...>
Date: Saturday, April 26, 2008 - 1:27 pm

Got more info on that.

After poking around a lot I found out the boot problem occurs when I have *any* USB devices plugged in at boot (it really does not matter which).

Removing these at boot (mouse, keyboard and a 1G stick) made it boot, but I got an Oops on loading the parport_pc module and the box died :/

After blacklisting parport_pc the box was up. The strange thing is that now I can plug in my USB devices and everything works.

The Oops is reproducible with modprobe too.

New dmesg output:
...


[    0.000000] Linux version 2.6.25-05096-gb1721d0-dirty (crazy@thor) (gcc version 4.3.0 (Frugalware Linux) ) #793 SMP PREEMPT Sat Apr 26 18:07:59 CEST 2008
[    0.000000] Command line: root=/dev/sdb1 ro debug vga=792 3
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009cc00 (usable)
[    0.000000]  BIOS-e820: 000000000009cc00 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000cf550000 (usable)
[    0.000000]  BIOS-e820: 00000000cf550000 - 00000000cf55e000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000cf55e000 - 00000000cf5e0000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000cf5e0000 - 00000000cf600000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 000000012c000000 (usable)
[    0.000000] Entering add_active_range(0, 0, 156) 0 entries of 256 used
[    0.000000] Entering add_active_range(0, 256, 849232) 1 entries of 256 used
[    0.000000] Entering add_active_range(0, 1048576, 1228800) 2 entries of 256 used
[    0.000000] max_pfn_mapped = 1228800
[    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[    0.000000] init_memory_mapping
[    0.000000] DMI 2.4 present.
[    0.000000] ACPI: RSDP 000F9A80, 0024 (r2 ACPIAM)
[    0.000000] ACPI: XSDT CF...
To: Gabriel C <nix.or.die@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Ashok Raj <ashok.raj@...>, Shaohua Li <shaohua.li@...>, Anil S Keshavamurthy <anil.s.keshavamurthy@...>, Andrew Morton <akpm@...>, Greg KH <greg@...>
Date: Thursday, May 8, 2008 - 1:54 pm

Gabriel,
I don't have any idea how the parport_pc probe is triggering a PCIE bus
walk, seeing that it's a legacy device.  (Isn't it?  I don't see a PCI
parallel-port card in the lspci dump you sent.)

Can you send (off list) the .config file you used to recreate this Oops?

Also, the following is a short patch to help guard against walking off a
NULL pointer in drivers/pci/search.c (which is what I see happening when
you load the parport_pc module).


Signed-Off-By: mark gross <mgross@linux.intel.com>

------


Index: linux-next/drivers/pci/search.c
===================================================================
--- linux-next.orig/drivers/pci/search.c	2008-05-08 09:53:30.000000000 -0700
+++ linux-next/drivers/pci/search.c	2008-05-08 10:03:35.000000000 -0700
@@ -29,6 +29,11 @@
 	if (pdev->is_pcie)
 		return NULL;
 	while (1) {
+		if (!pdev || !pdev->bus) {
+			WARN_ON(1); /* this shouldn't happen */
+			return NULL;
+		}
+
 		if (!pdev->bus->self)
 			break;
 		pdev = pdev->bus->self;
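
The guard in this hunk depends on evaluation order: a logical OR skips the second test when the pointer is NULL, whereas a bitwise OR evaluates both operands and would still dereference it. A minimal user-space sketch of the same upstream walk follows. The `toy_dev` and `toy_bus` types and `toy_find_top` are hypothetical stand-ins, not the kernel's `struct pci_dev`, `struct pci_bus` or the function being patched.

```c
#include <assert.h>
#include <stddef.h>

struct toy_bus;

/* A device sits on a bus. */
struct toy_dev {
    struct toy_bus *bus;
};

/* A bus is reached through a bridge device; NULL for the root bus. */
struct toy_bus {
    struct toy_dev *self;
};

/* Return the topmost bridge above dev, or NULL if the chain is broken. */
static struct toy_dev *toy_find_top(struct toy_dev *dev)
{
    while (1) {
        /* `||` short-circuits: dev->bus is read only if dev is non-NULL. */
        if (!dev || !dev->bus)
            return NULL;
        if (!dev->bus->self)
            break; /* reached the root bus */
        dev = dev->bus->self;
    }
    return dev;
}
```

For a chain leaf -> bridge -> root bus, `toy_find_top(&leaf)` returns the bridge; a NULL device or a device with no bus returns NULL instead of faulting.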
--
To: Gabriel C <nix.or.die@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Ashok Raj <ashok.raj@...>, Shaohua Li <shaohua.li@...>, Anil S Keshavamurthy <anil.s.keshavamurthy@...>, Andrew Morton <akpm@...>
Date: Monday, April 28, 2008 - 8:29 pm

wha?  Do you have any dmesg traces from when this event happens?  Is it
just another black screen at early boot or do you get somewhere before


I'm on the stack :(

It feels like something outside of the IOMMU code is messing up here.
If pci_find_upstream_pcie_bridge is getting a bad pdev, I think
something is odd, because the intel-iommu code doesn't do anything but
pass through the values in the dev that's set up by the other code.

--
To: <mgross@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Ashok Raj <ashok.raj@...>, Shaohua Li <shaohua.li@...>, Anil S Keshavamurthy <anil.s.keshavamurthy@...>, Andrew Morton <akpm@...>
Date: Monday, April 28, 2008 - 9:20 pm

I don't have any log so far; it happens really early in boot somehow.

Also, as I said, after removing all external USB devices (mouse, keyboard etc.) the box boots up fine.

--
To: Gabriel C <nix.or.die@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Ashok Raj <ashok.raj@...>, Shaohua Li <shaohua.li@...>, Anil S Keshavamurthy <anil.s.keshavamurthy@...>, Andrew Morton <akpm@...>
Date: Tuesday, April 29, 2008 - 6:53 pm

I'm thinking SMIs from the USB emulation are interacting badly.  There
is no kernel code path that intersects USB before the DMARs are set up.



I hate to be one of those "does not reproduce" guys, but
sadly I am unable to reproduce this.  modprobe parport_pc doesn't seem
to produce any complaints on my desktop.

lspci:

00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82G33/G31 Express Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express MEI Controller (rev 02)
....

I do find it quite odd that a DMA code path specific to PCIE is somehow
in the loop for a parallel port device.  Should this be possible?

I could easily be wrong, so feel free to correct me, but
I think your BIOS is goofy / unprepared to support IOMMU / VT-d and is
doing strange things, such as enumerating a parallel port on a PCIE bus,
when VT-d is turned on...

--
To: <mgross@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Ashok Raj <ashok.raj@...>, Shaohua Li <shaohua.li@...>, Anil S Keshavamurthy <anil.s.keshavamurthy@...>, Andrew Morton <akpm@...>, <jbarnes@...>, <linux-pci@...>
Date: Tuesday, April 29, 2008 - 7:04 pm

No worries :) 

I'm sure it is some problem with that box, maybe the BIOS or some piece of HW I have and you don't.

I just updated the pciids here; new lspci output:

lspci
00:00.0 Host bridge: Intel Corporation 82Q35 Express DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82Q35 Express Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82Q35 Express MEI Controller (rev 02)
00:03.2 IDE interface: Intel Corporation 82Q35 Express PT IDER Controller (rev 02)
00:03.3 Serial controller: Intel Corporation 82Q35 Express Serial KT Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IO (ICH9DO) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)


--
To: Gabriel C <nix.or.die@...>
Cc: <mgross@...>, <linux-kernel@...>, <ashok.raj@...>, <shaohua.li@...>, <anil.s.keshavamurthy@...>, <jbarnes@...>, <linux-pci@...>
Date: Tuesday, May 6, 2008 - 4:15 pm

On Wed, 30 Apr 2008 01:04:45 +0200

Guys, it's really painful having to scroll through thousand-line emails to

So..  what happened here?  It seems like a pretty fatal problem, and
personally I don't think that contacting vendors about BIOS upgrades is a
suitable general solution.  It would be much better to find a kernel-based
fix or workaround?


--
To: Andrew Morton <akpm@...>
Cc: Gabriel C <nix.or.die@...>, <linux-kernel@...>, <ashok.raj@...>, <shaohua.li@...>, <anil.s.keshavamurthy@...>, <jbarnes@...>, <linux-pci@...>
Date: Thursday, May 8, 2008 - 10:43 am

I just got some expert advice on this issue.  I will try again and look
for a problem with the iommu code getting executed before the USB / PCI
bus is fully initialized.

Also, I just realized that I was attempting to reproduce the failure on
the MM tree, and re-reading the report I see it's on the linux-next tree.

I'm retesting it this am.

--mgross

 
--
To: Andrew Morton <akpm@...>
Cc: Gabriel C <nix.or.die@...>, <linux-kernel@...>, <ashok.raj@...>, <shaohua.li@...>, <anil.s.keshavamurthy@...>, <jbarnes@...>, <linux-pci@...>
Date: Thursday, May 8, 2008 - 12:45 pm

linux-next works fine on my system :(  with intel-iommu on.

root@mtgsdv1:~# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82G33/G31 Express Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express MEI Controller (rev 02)
00:03.2 IDE interface: Intel Corporation 82G33/G31/P35/P31 Express PT IDER Controller (rev 02)
00:03.3 Serial controller: Intel Corporation 82G33/G31/P35/P31 Express Serial KT Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation Unknown device 2910 (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
00:1f.5 IDE interface: Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller (rev 02)

Gabriel,

The idea that perhaps something changed in the boot order, resulti...
To: <mgross@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <ashok.raj@...>, <shaohua.li@...>, <anil.s.keshavamurthy@...>, <jbarnes@...>, <linux-pci@...>
Date: Thursday, May 8, 2008 - 1:56 pm

Mark, does the box you test on have any extra PCI/e/X cards (like an extra graphics card or an extra network card)?
On my box all extra PCI/e/X slots are empty; I just use the onboard stuff at the moment.


If it helps I can send my config over to you.


Gabriel 
--
To: Gabriel C <nix.or.die@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <ashok.raj@...>, <shaohua.li@...>, <anil.s.keshavamurthy@...>, <jbarnes@...>, <linux-pci@...>
Date: Thursday, May 8, 2008 - 2:55 pm

This is what I'm running :(

--mgross

--
To: Andrew Morton <akpm@...>
Cc: Gabriel C <nix.or.die@...>, <linux-kernel@...>, <ashok.raj@...>, <shaohua.li@...>, <anil.s.keshavamurthy@...>, <jbarnes@...>, <linux-pci@...>
Date: Wednesday, May 7, 2008 - 4:58 pm

We don't have a stable workaround yet.  Unwrapping the IOMMU startup
code is proving tricky.  The only thing I can think of is to change the
polarity of the intel-iommu command line option to default to off, and if
your BIOS doesn't suck you can enable it if you want to use it.

I don't like this option much, since all the issues have been BIOS
based.  But I don't know what else to do at this time.

--mgross
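
To make the "default to off" idea concrete, here is a small user-space sketch of parsing such an option with the proposed polarity. It is an illustration only and not the kernel's option handling: `iommu_enabled_from_cmdline` is a hypothetical helper, and the `on` value is assumed for the example (only `intel_iommu=off` is mentioned in this thread).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Scan a space-separated command line for "intel_iommu=on" or
 * "intel_iommu=off". With the proposed polarity the feature stays off
 * unless it is explicitly enabled; the last occurrence wins.
 */
static bool iommu_enabled_from_cmdline(const char *cmdline)
{
    static const char key[] = "intel_iommu=";
    const size_t key_len = sizeof(key) - 1;
    const char *p = cmdline;
    bool enabled = false; /* proposed default: off */

    assert(cmdline != NULL);
    while ((p = strstr(p, key)) != NULL) {
        /* Only accept the key at the start of a word. */
        bool at_word_start = (p == cmdline) || (p[-1] == ' ');
        const char *val = p + key_len;

        if (at_word_start) {
            if (strncmp(val, "on", 2) == 0 &&
                (val[2] == ' ' || val[2] == '\0'))
                enabled = true;
            else if (strncmp(val, "off", 3) == 0 &&
                     (val[3] == ' ' || val[3] == '\0'))
                enabled = false;
        }
        p = val; /* continue scanning after this key */
    }
    return enabled;
}
```

Under this scheme a command line with no IOMMU option leaves the feature off, so a machine with a problematic BIOS would boot without needing intel_iommu=off.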

--