Hi all,

On a kernel (tested 2.6.25, linux-next and latest -git) with DMAR enabled, my box won't boot unless I disable it with intel_iommu=off. It hangs very early in boot and not even the reset button works anymore; all I get is a black screen, no message or the like, and the box hangs. Enabling or disabling DMAR_GFX_WA makes no difference.

I would like to debug the problem but I have no idea how to do it.

This box has an ASUS P5E-VM DO motherboard, latest BIOS, Q9300 Quad CPU and 4G RAM.

lspci:
00:00.0 Host bridge: Intel Corporation DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation MEI Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation USB UHCI Controller #5 (rev 02)
00:1a.2 USB Controller: Intel Corporation USB UHCI Controller #6 (rev 02)
00:1a.7 USB Controller: Intel Corporation USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation HD Audio Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation SMBus Controller (rev 02)
00:1f.6 Signal processing controller: Intel Corporation Thermal Subsystem (rev 02)

dmesg:
[ 0.000000] Linux version 2.6.25-05096-gb1721d0-dirty (crazy@thor) (gcc version 4.3.0 (Frugalware Linux) )
#783...
The quickest way to see where it's hanging is to replace the time_after call in the IOMMU_WAIT_OP macro with a counter up to 3_000_000_000 (3 billion), and don't forget to increment it each time through the IOMMU_WAIT_OP loop. You need an ITP and some insight into the workings of the BIOS. (I wouldn't bother digging any deeper than seeing which DMAR setup step failed in the early startup of the intel-iommu code.)

See IOMMU_WAIT_OP. It should use the TSC instead of jiffies. Your system is stuck in that macro, most likely because your BIOS is buggering up its side of the IOMMU DMAR equation. Basically, the register set-and-poll protocol for talking to the IOMMU is failing on the poll part of the protocol. As this is happening in early boot, before timers are running, you get stuck there.

I've had this happen on a number of systems so far, and so far it's always a BIOS thing. I've put some effort into adding code to detect such things, like using the TSC instead of jiffies to detect the failure case, and then trying to unwrap the DMAR setup code. I got the TSC thing to work, which would replace the black-screen startup hang with a call to panic, but unwrapping the DMAR setup safely has been harder for me to get working. --
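To illustrate the counter idea above, here is a minimal userspace sketch of a bounded poll loop. It is not the kernel's IOMMU_WAIT_OP macro; the names poll_bounded, reg_read_fn, fake_dev, fake_read and stuck_read are all hypothetical and stand in for the real register accessors. The point is only that an iteration counter bounds the wait without depending on jiffies, which do not advance this early in boot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register reader; stands in for an MMIO read of an IOMMU register. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/*
 * Poll until (value & mask) is non-zero or max_spins reads have been made.
 * Returns 0 on success and -1 on timeout, so the caller can report which
 * setup step failed instead of spinning forever.
 */
static int poll_bounded(reg_read_fn read_reg, void *ctx, uint32_t mask,
                        uint64_t max_spins, uint32_t *out)
{
    uint64_t spins;

    assert(read_reg != NULL);

    for (spins = 0; spins < max_spins; spins++) {
        uint32_t v = read_reg(ctx);

        if (v & mask) {
            if (out)
                *out = v;
            return 0;
        }
    }
    return -1; /* the status bit never showed up */
}

/* Fake device whose status bit becomes set after ready_after reads. */
struct fake_dev {
    uint64_t reads;
    uint64_t ready_after;
};

static uint32_t fake_read(void *ctx)
{
    struct fake_dev *d = ctx;

    d->reads++;
    return d->reads > d->ready_after ? 0x1u : 0x0u;
}

/* Fake device that never completes, like the hang described above. */
static uint32_t stuck_read(void *ctx)
{
    (void)ctx;
    return 0;
}
```

With a device that never sets its status bit (stuck_read here), poll_bounded returns -1 after max_spins reads, which is the place where a real implementation could print the failing step or panic.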
Got more info on that. After poking around a lot I found out the boot problem occurs when I have *any* USB devices plugged in at boot (it really does not matter which). Removing these at boot (mouse, keyboard and a 1G stick) made it boot, but I got an Oops on loading the parport_pc module and the box died :/ After blacklisting parport_pc the box was up. The strange thing is that now I can plug in my USB devices and everything works. The Oops is reproducible with modprobe too.

New dmesg output:
...
[ 0.000000] Linux version 2.6.25-05096-gb1721d0-dirty (crazy@thor) (gcc version 4.3.0 (Frugalware Linux) ) #793 SMP PREEMPT Sat Apr 26 18:07:59 CEST 2008
[ 0.000000] Command line: root=/dev/sdb1 ro debug vga=792 3
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009cc00 (usable)
[ 0.000000] BIOS-e820: 000000000009cc00 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000cf550000 (usable)
[ 0.000000] BIOS-e820: 00000000cf550000 - 00000000cf55e000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000cf55e000 - 00000000cf5e0000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000cf5e0000 - 00000000cf600000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 000000012c000000 (usable)
[ 0.000000] Entering add_active_range(0, 0, 156) 0 entries of 256 used
[ 0.000000] Entering add_active_range(0, 256, 849232) 1 entries of 256 used
[ 0.000000] Entering add_active_range(0, 1048576, 1228800) 2 entries of 256 used
[ 0.000000] max_pfn_mapped = 1228800
[ 0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[ 0.000000] init_memory_mapping
[ 0.000000] DMI 2.4 present.
[ 0.000000] ACPI: RSDP 000F9A80, 0024 (r2 ACPIAM)
[ 0.000000] ACPI: XSDT CF...
Gabriel,
I don't have any idea how the parport_pc probe is triggering a PCIE bus
walk seeing that its a legacy device. (isn't it? I don't see a PCI
P-port card in the lspci dump you sent).
Can you send (off list) your .config file used to recreate this Ooops?
Also the following is a short patch to help guard against walking off a null
pointer in drivers/pci/search.c (which is what I see happening when you
load the parport_pc module).
Signed-off-by: mark gross <mgross@linux.intel.com>
------
Index: linux-next/drivers/pci/search.c
===================================================================
--- linux-next.orig/drivers/pci/search.c 2008-05-08 09:53:30.000000000 -0700
+++ linux-next/drivers/pci/search.c 2008-05-08 10:03:35.000000000 -0700
@@ -29,6 +29,11 @@
if (pdev->is_pcie)
return NULL;
while (1) {
+ if (!pdev || !pdev->bus) {
+ WARN_ON(1); /* this shouldn't happen */
+ return NULL;
+ }
+
if (!pdev->bus->self)
break;
pdev = pdev->bus->self;
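A simplified userspace sketch of the guarded upstream walk the patch above is aiming for. The toy_dev and toy_bus types and the toy_walk_upstream and toy_demo functions are hypothetical stand-ins, not the kernel's struct pci_dev, struct pci_bus or pci_find_upstream_pcie_bridge, and the PCIe type checks of the real function are left out. Note that the guard uses the short-circuiting `||`, so pdev->bus is never evaluated when pdev is NULL.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dev;

/* Hypothetical stand-in for a bus: self is the bridge that leads to it. */
struct toy_bus {
    struct toy_dev *self; /* NULL on the root bus */
};

/* Hypothetical stand-in for a device sitting on some bus. */
struct toy_dev {
    struct toy_bus *bus;
};

/*
 * Walk from pdev towards the root bus and return the top-most bridge
 * reached, or NULL if pdev sits directly on the root bus or if the
 * hierarchy is broken (NULL device or NULL bus pointer).
 */
static struct toy_dev *toy_walk_upstream(struct toy_dev *pdev)
{
    struct toy_dev *last_bridge = NULL;

    for (;;) {
        /* || short-circuits: pdev->bus is only read when pdev is non-NULL */
        if (!pdev || !pdev->bus)
            return NULL;
        if (!pdev->bus->self)
            break; /* reached the root bus */
        pdev = pdev->bus->self;
        last_bridge = pdev;
    }
    return last_bridge;
}

/* Small usage example: a leaf device behind one bridge on the root bus. */
static void toy_demo(void)
{
    struct toy_bus root = { NULL };
    struct toy_dev bridge = { &root };
    struct toy_bus child = { &bridge };
    struct toy_dev leaf = { &child };

    assert(toy_walk_upstream(&leaf) == &bridge);
    assert(toy_walk_upstream(NULL) == NULL);
}
```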
--wha? Do you have any dmesg traces from when this event happens? Is it just another black screen at early boot or do you get somewhere before I'm on the stack :( It feels like something outside of the IOMMU code is messing up here. If pci_find_upstream_pcie_bridge is getting a bad pdev, I think something is odd, because the intel-iommu code doesn't do anything but pass through the values in the dev that is set up by the other code. --
I don't have any log so far; it happens really early in boot somehow. Also, as said, with all external USB devices (mouse, keyboard, etc.) removed the box boots up fine. --
I'm thinking SMIs from the USB emulation are interacting badly. There is no kernel code path that intersects USB before the DMARs are set up.

I hate to be one of those "does not reproduce" guys, but sadly I am unable to reproduce this. modprobe parport_pc doesn't seem to make any complaints on my desktop.

lspci:
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82G33/G31 Express Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express MEI Controller (rev 02)
....

I do find it quite odd that a DMA code path specific to PCIE is somehow in the loop for a parallel port device. Should this be possible?

I could easily be wrong, so feel free to correct me, but I think your BIOS is goofy / unprepared to support IOMMU / VT-d and is doing strange things, such as enumerating a parallel port on a PCIE bus when VT-d is turned on... --
No worries :) I'm sure it is some problem on that box, maybe the BIOS or some piece of HW I have and you don't. I just updated the pciids here; new lspci output:

lspci
00:00.0 Host bridge: Intel Corporation 82Q35 Express DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82Q35 Express Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82Q35 Express MEI Controller (rev 02)
00:03.2 IDE interface: Intel Corporation 82Q35 Express PT IDER Controller (rev 02)
00:03.3 Serial controller: Intel Corporation 82Q35 Express Serial KT Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation 82801IO (ICH9DO) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
--
On Wed, 30 Apr 2008 01:04:45 +0200 Guys, it's really painful having to scroll through thousand-line emails to So.. what happened here? It seems like a pretty fatal problem, and personally I don't think that contacting vendors about BIOS upgrades is a suitable general solution. It would be much better to find a kernel-based fix or workaround? --
I just got some expert advice on this issue. I will try again and look for a problem with the iommu code getting executed before the USB / PCI bus is fully initialized. Also, I just realized that I was attempting to reproduce the failure on the MM tree, and re-reading the report I see it's on the linux-next tree. I'm retesting it this am. --mgross --
linux-next works fine on my system :( with intel-iommu on.

root@mtgsdv1:~# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82G33/G31 Express Integrated Graphics Controller (rev 02)
00:03.0 Communication controller: Intel Corporation 82G33/G31/P35/P31 Express MEI Controller (rev 02)
00:03.2 IDE interface: Intel Corporation 82G33/G31/P35/P31 Express PT IDER Controller (rev 02)
00:03.3 Serial controller: Intel Corporation 82G33/G31/P35/P31 Express Serial KT Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 02)
00:1a.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 92)
00:1f.0 ISA bridge: Intel Corporation Unknown device 2910 (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
00:1f.5 IDE interface: Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller (rev 02)

Gabriel, The idea that perhaps something changed in the
boot order, resulti...
Mark, does the box you test on have any extra PCI/e/X cards (like an extra graphics card or an extra network card)? On my box all extra PCI/e/X slots are empty; I just use the onboard stuff at the moment. If it helps I can send you my config. Gabriel --
This is what I'm running :( --mgross --
We don't have a stable workaround yet. Unwrapping the IOMMU startup code is proving tricky. The only thing I can think of is to change the polarity of the intel-iommu command line option so that it defaults to off, and if your BIOS doesn't suck you can enable it if you want to use it. I don't like this option much, since all the issues have been BIOS based. But I don't know what else to do at this time. --mgross --
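As a rough illustration of the "default to off" polarity suggested above, here is a userspace sketch of a command-line parser where the IOMMU stays disabled unless explicitly requested. This is not the kernel's option handling; the function toy_iommu_enabled is hypothetical, and the intel_iommu=on value is an assumption made for the sketch (the thread itself only mentions intel_iommu=off).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true only if the command line explicitly asks for the IOMMU.
 * Assumed syntax for this sketch: space-separated tokens of the form
 * "intel_iommu=on" or "intel_iommu=off"; the last occurrence wins.
 */
static bool toy_iommu_enabled(const char *cmdline)
{
    static const char key[] = "intel_iommu=";
    bool enabled = false; /* proposed polarity: off unless requested */
    const char *p = cmdline;

    while (p && (p = strstr(p, key)) != NULL) {
        /* Only accept the key at the start of a token. */
        bool at_token_start = (p == cmdline) || (p[-1] == ' ');
        size_t len;

        p += sizeof(key) - 1;
        if (!at_token_start)
            continue;

        len = strcspn(p, " ");
        if (len == 2 && strncmp(p, "on", 2) == 0)
            enabled = true;
        else if (len == 3 && strncmp(p, "off", 3) == 0)
            enabled = false;
    }
    return enabled;
}

/* Small usage example based on the command line quoted earlier in the thread. */
static void toy_cmdline_demo(void)
{
    assert(!toy_iommu_enabled("root=/dev/sdb1 ro debug vga=792 3"));
    assert(toy_iommu_enabled("root=/dev/sdb1 ro intel_iommu=on"));
}
```

With this polarity a box with a misbehaving BIOS boots normally by default, and only users who opt in are exposed to the DMAR setup path.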
