Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

Previous thread: Re: [MeeGo-Dev][PATCH] Topcliff: Update PCH_SPI driver to 2.6.35 by Greg KH on Tuesday, August 10, 2010 - 10:10 am. (2 messages)

Next thread: [PATCH] nfs: Add "lookupcache" to displayed mount options by Patrick J. LoPresti on Tuesday, August 10, 2010 - 11:36 am. (1 message)

Hello Joerg,

The requested info is attached.
So that would mean a BIOS problem? (Those are not on my wishlist :-p)

--
Sander






-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

Yeah, looks like a BIOS problem. But the driver should handle that
without crashing the system, so there is a bug in the driver too.

Problem is:

AMD-Vi:   DEV_ALIAS_RANGE         devid: 0a:01.0 flags: 00 devid_to: 0a:00.0
AMD-Vi:   DEV_RANGE_END           devid: 0a:1f.7

This means that PCI devices from 0a:01.0 to 0a:1f.7 may use their own
device-id or 0a:00.0. But a device with id 0a:00.0 is not present in
the system. From the lspci output it looks like your USB3 controller
should alias to 09:00.0. I will prepare a patch for you to fix the crash,
but I can't guarantee that your USB3 controller will work afterwards. If you
see IO-Page-Faults please report them to me.
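
To make the alias rule concrete, here is a minimal Python sketch (not the driver's code). It assumes the usual 16-bit PCI device ID layout (8-bit bus, 5-bit device, 3-bit function). The helper names are made up for illustration, and the set of present devices is a hypothetical subset built only from IDs mentioned in this thread.

```python
def parse_devid(bdf):
    """Convert 'bb:dd.f' (hex) to a 16-bit device ID: bus<<8 | dev<<3 | fn."""
    bus, rest = bdf.split(":")
    dev, fn = rest.split(".")
    return (int(bus, 16) << 8) | (int(dev, 16) << 3) | int(fn, 16)


def format_devid(devid):
    """Inverse of parse_devid."""
    return "%02x:%02x.%x" % (devid >> 8, (devid >> 3) & 0x1F, devid & 0x7)


def alias_for(devid, range_start, range_end, alias_to):
    """Return alias_to if devid lies in [range_start, range_end], else devid."""
    if range_start <= devid <= range_end:
        return alias_to
    return devid


# Values from the two AMD-Vi lines quoted above.
start = parse_devid("0a:01.0")
end = parse_devid("0a:1f.7")
alias = parse_devid("0a:00.0")

# Hypothetical subset of present devices (assumption, for illustration only).
present = {parse_devid(d) for d in ("09:00.0", "0a:01.0", "0a:01.1", "0a:01.2")}

target = alias_for(parse_devid("0a:01.2"), start, end, alias)
print(format_devid(target), "present" if target in present else "NOT present")
```

With these inputs the alias target comes out as 0a:00.0, which is not in the set of present devices. That is the situation the driver has to survive without dereferencing a missing device.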

	Joerg

--


Hello Joerg,

Could you also provide a more specific description of what is wrong with the BIOS that I could forward to MSI, in the hope it will reach the BIOS engineers someday? :-)

--
Sander





-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

--


Let's first prove that my theory is right before contacting MSI directly.
Can you try the attached patch? It should fix the boot crash. Once the
system has booted successfully, please try some USB device (make sure it uses
the separate USB controller; I guess the separate device is responsible
for USB 3, so try plugging a device into one of your USB 3 ports).
When you have finished, please tell me whether it worked and send me the
full dmesg output of the system.

	Joerg


Hello Joerg,

Errr, which separate USB controller? It actually has:
- 1 PCI-e USB 2.0 controller
- 2 PCI-e USB 3.0 controllers (one of which includes a SATA controller as well)

(apart from the onboard stuff)

--
Sander






-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

--


Hi Sander,


The devices should be attached to this controller:

0a:01.0 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43) (prog-if 10 [OHCI])
0a:01.1 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43) (prog-if 10 [OHCI])
0a:01.2 USB Controller [0c03]: NEC Corporation USB 2.0 [1033:00e0] (rev 04) (prog-if 20 [EHCI])

The PCI devices associated with that controller alias to 0a:00.0, which
does not exist in your system (hence the crash). And the fact that these
devices have an alias makes me believe that the BIOS detects them as
legacy PCI devices. PCI-e devices typically do not have aliases. Can you
send the lspci -t output so I can see which upstream bridge these devices
are connected to?

	Joerg

--


Hmmm, the fun part is that the USB devices on that USB2 controller seemed to work fine on Xen.
And I have some problems with Xen not being willing to pass through the USB3 controllers (supposedly due to the (extra) bridges);
those are the controllers at 04:00.0 and 08:00.0.

-[0000:00]-+-00.0
           +-00.2
           +-02.0-[0000:0d]--+-00.0
           |                 \-00.1
           +-05.0-[0000:0c]----00.0
           +-06.0-[0000:0b]----00.0
           +-0a.0-[0000:09-0a]----00.0-[0000:0a]--+-01.0
           |                                      +-01.1
           |                                      \-01.2
           +-0b.0-[0000:05-08]----00.0-[0000:06-08]--+-01.0-[0000:08]----00.0
           |                                         \-02.0-[0000:07]----00.0
           +-0d.0-[0000:04]----00.0
           +-11.0
           +-12.0
           +-12.2
           +-13.0
           +-13.2
           +-14.0
           +-14.3
           +-14.4-[0000:03]----06.0
           +-14.5
           +-15.0-[0000:02]--
           +-16.0
           +-16.2
           +-18.0
           +-18.1
           +-18.2
           +-18.3
           \-18.4

I had hoped things would become easier/better with my new motherboard, which includes an IOMMU :-)
Doesn't seem that way yet. Previously I had 2 USB 2.0 controllers (1x PCI, 1x PCI-e) and 1 USB 3.0 controller (PCI-e) passed through (with xen-swiotlb and no hardware IOMMU), and that worked fine grabbing video 24/7 for several weeks.


But let's hope for the best :-)

--
Sander








-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

--


Hmm, that's weird. In this case these devices probably do not alias at

Yeah, device 09:00.0 is a PCIe-to-PCI bridge and the additional USB
controllers are behind that bridge as legacy PCI devices. That's why the
BIOS sets up the alias entry. It should set up 09:00.0 instead of
0a:00.0 to make things work correctly.
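
As an illustration of that reasoning, the following Python sketch walks from a device up to the root bus using bridge bus numbers transcribed by hand from the lspci -t tree earlier in the thread. The table layout and function names are assumptions made for this example, and only the branch leading to bus 0a is included.

```python
# Bridge BDF -> (secondary bus, subordinate bus), transcribed from the
# lspci -t output in this thread (only the branch leading to bus 0a).
BRIDGES = {
    "00:0a.0": (0x09, 0x0A),
    "09:00.0": (0x0A, 0x0A),
}


def upstream_bridge(bdf, bridges=BRIDGES):
    """Return the bridge whose secondary bus is the bus of bdf, or None."""
    bus = int(bdf.split(":")[0], 16)
    for bridge, (secondary, _subordinate) in bridges.items():
        if secondary == bus:
            return bridge
    return None


def path_to_root(bdf, bridges=BRIDGES):
    """List the bridges between bdf and the root bus, nearest first."""
    path = []
    current = upstream_bridge(bdf, bridges)
    while current is not None:
        path.append(current)
        current = upstream_bridge(current, bridges)
    return path


print(path_to_root("0a:01.0"))
```

For 0a:01.0 the nearest bridge is 09:00.0, which is the device ID the message above expects the alias entry to name.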

	Joerg

--


Hi Joerg,

OK, it boots fine now, but plugging a USB device into the 2.0 controller (0a:01.*) results in a flood of error messages about the USB controller not functioning.
Running the same kernel with amd_iommu=off results in the device at least registering properly as a USB device (although trying to use it then resulted in an entirely new oops, probably in the driver of the video grabber).

--
Sander





-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

--


It boots now, dmesg attached.






-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

Ok,


AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x0000 address=0x0000000000001080 flags=0x0070]

So it indeed uses 0a:00.0 as the device id. That's weird, but it means that
the BIOS is actually OK. I need to fix that in the driver.
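
For anyone who wants to pull the faulting device ID out of a dmesg capture, here is a small Python sketch. The regular expression is an assumption modelled only on the single event line quoted above; other kernels or event types may print different fields.

```python
import re

# Pattern modelled on the one event line quoted in this thread (assumption).
EVENT_RE = re.compile(
    r"AMD-Vi: Event logged \[(?P<type>\w+) "
    r"device=(?P<device>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"domain=0x(?P<domain>[0-9a-f]+) "
    r"address=0x(?P<address>[0-9a-f]+) "
    r"flags=0x(?P<flags>[0-9a-f]+)\]"
)


def parse_event(line):
    """Return the fields of an AMD-Vi event line as a dict, or None."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return {
        "type": m.group("type"),
        "device": m.group("device"),
        "domain": int(m.group("domain"), 16),
        "address": int(m.group("address"), 16),
        "flags": int(m.group("flags"), 16),
    }


line = ("AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x0000 "
        "address=0x0000000000001080 flags=0x0070]")
event = parse_event(line)
print(event["type"], event["device"], hex(event["address"]))
```

The device field is the ID the IOMMU actually saw on the request, which is why this one line is enough to show that the hardware really uses 0a:00.0.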

Thanks,

	Joerg

--


OK, here is a quick and dirty patch which should make your system boot
again. It introduces other issues which will show up when you try to
assign the devices to a virtual machine. But at least the devices should
work again on bare-metal.

	Joerg


Hello Joerg,

Had to apply the patch by hand, and found 2 typos:

arch/x86/kernel/amd_iommu.c: In function ‘do_attach’:
arch/x86/kernel/amd_iommu.c:1456: error: implicit declaration of function ‘set_dte_enry’
arch/x86/kernel/amd_iommu.c: In function ‘do_detach’:
arch/x86/kernel/amd_iommu.c:1486: error: implicit declaration of function ‘clear_dte_enry’
make[2]: *** [arch/x86/kernel/amd_iommu.o] Error 1



Should be "entry" of course.

--

Sander




-- 
Best regards,
 Sander                            mailto:linux@eikelenboom.it

--
