bad DMAR interaction with iwlagn and SATA

From: Andres Freund
Date: Thursday, September 25, 2008 - 6:11 am

Hi,

By accident, while creating the .config and compiling the kernel for my
new laptop (ThinkPad T500) before some desperately needed sleep, I
activated DMAR...

I don't know if this is relevant, but I thought I'd better report it.


This was on fb478da5ba69ecf40729ae8ab37ca406b1e5be48 - sometime after 2.6.27-rc7

I stumbled over two buglets:
First:
[ 4184.617392] DMAR:[DMA Read] Request device [03:00.0] fault addr fa946000
[ 4184.617393] DMAR:[fault reason 06] PTE Read access is not set
[ 4184.644081] iwlagn: Microcode HW error detected.  Restarting.
[ 4186.646000] psmouse.c: TouchPad at isa0060/serio1/input0 lost synchronization, throwing 1 bytes away.
[ 4186.683034] Registered led device: iwl-phy0:radio
[ 4186.683478] Registered led device: iwl-phy0:assoc
[ 4186.683793] Registered led device: iwl-phy0:RX
[ 4186.684094] Registered led device: iwl-phy0:TX
[ 4186.689749] wlan0: authenticate with AP 00:1d:7e:42:fe:42
[ 4186.691691] wlan0: authenticated
[ 4186.691705] wlan0: associate with AP 00:1d:7e:42:fe:42
[ 4186.696380] wlan0: RX ReassocResp from 00:1d:7e:42:fe:42 (capab=0x411 status=0 aid=2)
[ 4186.696392] wlan0: associated

Most of the time when this happened, the machine wasn't reacting for 1-3
seconds and had audio buffer underruns, but I also had a hard lockup which
I couldn't diagnose so far.

Second:
[ 2937.484251] DMAR:[DMA Read] Request device [00:1f.2] fault addr fffbf000
[ 2937.484255] DMAR:[fault reason 06] PTE Read access is not set
[ 2937.484297] ata1.00: exception Emask 0x60 SAct 0x1 SErr 0x800 action 0x6 frozen
[ 2937.484303] ata1.00: irq_stat 0x20000000, host bus error
[ 2937.484309] ata1: SError: { HostInt }
[ 2937.484319] ata1.00: cmd 61/08:00:c0:1d:6b/00:00:07:00:00/40 tag 0 ncq 4096 out
[ 2937.484321]          res ...
From: Jeff Garzik
Date: Thursday, September 25, 2008 - 7:11 pm

Ouch, a host bus error is serious nastiness...

http://ata.wiki.kernel.org/index.php/Libata_error_messages#Error_classes

That's the ATA controller falling over after some serious machine hiccups.

	Jeff



--

From: Andres Freund
Date: Thursday, September 25, 2008 - 7:18 pm

Hi Jeff,

On Friday 26 September 2008, you wrote in "Re: bad DMAR interaction with
I only hit that with DMAR activated (hit it twice, on different boots), so it
seems to be related to that. Is there anything I can do to help debug that?

Andres
From: Jeff Garzik
Date: Thursday, September 25, 2008 - 7:41 pm

No idea about DMAR.  On the ATA side, it pretty much diagnoses itself, as
you see here.  Unfortunately, the ATA controller is behaving exactly as it
should when a major system error is thrown its way.

	Jeff



--

From: Muli Ben-Yehuda
Date: Friday, September 26, 2008 - 7:47 am

The way to debug this is to figure out why device 00:1f.2 is trying to
read from DMA address fffbf000 and does not have permission to do
so. This could be indicative of a driver bug where it is programming
the device to read from some buffer that has not been allocated
through the DMA API and thus does not have a valid IOMMU mapping, or a
hardware quirk where the device tries to read from memory without host
involvement. The former is much more likely.
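
As a first step toward that, it can help to collect every DMAR fault from the kernel log and group the faulting addresses by device. The following is only a minimal sketch in Python: the function name `collect_dmar_faults` is made up for illustration, and the regular expression assumes the fault lines look exactly like the ones quoted in this thread (other kernel versions may print them differently).

```python
import re
from collections import defaultdict

# Matches the first line of a DMAR fault report in the format seen in this
# thread. This format is an assumption; other kernels may word it differently.
FAULT_RE = re.compile(
    r"DMAR:\[DMA (?P<kind>Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)


def collect_dmar_faults(log_text):
    """Group (access kind, faulting address) pairs by PCI device (bus:dev.fn)."""
    faults = defaultdict(list)
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults[match.group("dev")].append(
                (match.group("kind"), int(match.group("addr"), 16))
            )
    return dict(faults)


# Fault lines copied from the reports earlier in this thread.
sample = """\
[ 4184.617392] DMAR:[DMA Read] Request device [03:00.0] fault addr fa946000
[ 4184.617393] DMAR:[fault reason 06] PTE Read access is not set
[ 2937.484251] DMAR:[DMA Read] Request device [00:1f.2] fault addr fffbf000
"""

for dev, entries in collect_dmar_faults(sample).items():
    for kind, addr in entries:
        print(f"{dev}: DMA {kind} fault at {addr:#x}")
```

Feeding it the full dmesg output should show whether a device keeps faulting on the same address or on different ones, which may hint at which buffer in the driver to look at.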

Cheers,
Muli
-- 
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
                      xxx
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/
--

From: Johannes Berg
Date: Friday, September 26, 2008 - 8:12 am

and indeed matches experience from myself and Marcel that DMA bugs seem
to lurk.

johannes
From: Tomas Winkler
Date: Friday, September 26, 2008 - 4:30 pm

On Fri, Sep 26, 2008 at 6:12 PM, Johannes Berg

Meanwhile, all reported bugs in this case point to 64-bit
installations. I'll give it more testing.
Thanks.
Tomas
--

From: Andres Freund
Date: Monday, September 29, 2008 - 1:27 am

Hi,

On Saturday 27 September 2008, Tomas Winkler wrote in "Re: bad DMAR
Would it help to test on 32bit? I have some disk with a 32bit system installed
lying around somewhere...

Any other patches to try?

Andres
From: Tomas Winkler
Date: Monday, September 29, 2008 - 1:40 am

I've posted a few patches lately to address some RX buffer issues; you
may want to try those. Not sure it will help, though.
http://marc.info/?l=linux-wireless&m=122241327108723&w=2
http://marc.info/?l=linux-wireless&m=122241327208729&w=2

Thanks
Tomas
--

From: Johannes Berg
Date: Monday, October 6, 2008 - 5:26 am

Andres, can you post your config?

johannes
From: Andres Freund
Date: Monday, October 6, 2008 - 7:32 am

Hi,

On Monday 06 October 2008, Johannes Berg wrote in "Re: bad DMAR interaction
Btw, will do so later this evening having access to the older harddisk (with
Sure, my current running one is attached.
The config I had the error with was exactly the same, just with CONFIG_DMAR and
e1000e enabled (but it is overwritten now)...

It's no problem to try another branch, more debugging options, or so if needed.

Andres


Attachment: config-2.6.27-rc7

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.27-rc7
# Tue Sep 30 15:47:39 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
# CONFIG_GENERIC_LOCKBREAK is not set
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
# CONFIG_GENERIC_GPIO is not set
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not ...
From: Johannes Berg
Date: Tuesday, October 7, 2008 - 1:37 am

No, I was just wondering whether x86-64 had something like powerpc's
CONFIG_64K_PAGES, but it doesn't seem to. 2M-page support seems to
always be used depending on the CPU, but I have no idea how you can tell
whether or not your CPU supports that.
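
One way to check, as far as I know, is the "flags" line in /proc/cpuinfo, where x86 CPUs list "pse" and "pae" when they have those features. A minimal sketch in Python follows; the helper name `cpu_flags` and the sample text are made up for illustration and are not taken from the machine in this thread.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Hypothetical excerpt in /proc/cpuinfo layout, shortened for the example.
sample = "processor\t: 0\nflags\t\t: fpu vme de pse tsc msr pae mce cx8 apic\n"

flags = cpu_flags(sample)
print("pse:", "pse" in flags, "pae:", "pae" in flags)
```

On a real system the same function could be given the contents of /proc/cpuinfo, e.g. `cpu_flags(open("/proc/cpuinfo").read())`.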

johannes
From: Kyle McMartin
Date: Tuesday, October 7, 2008 - 10:04 am

2MB pages (and 4MB pages) are dependent on PSE/PAE, there's no configurable
page size on x86 like there is on other platforms.

PSE gives you 4MB pages, PAE reduces your 4MB pages to 2MB pages (for
extra flag and address bits.)
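
The 4MB and 2MB figures fall out of the page table layout: a large page covers what one full page table of 4KB pages would otherwise map, and PAE widens each entry from 4 to 8 bytes, halving the number of entries per table. A small Python sketch of that arithmetic (the name `large_page_size` is just for illustration):

```python
PAGE_SIZE = 4096  # bytes in a base page, and in one page-table page


def large_page_size(entry_bytes):
    """Bytes covered by one page-directory entry used as a large page."""
    entries_per_table = PAGE_SIZE // entry_bytes
    return entries_per_table * PAGE_SIZE


MIB = 2**20
print(large_page_size(4) // MIB, "MB with 4-byte entries (PSE, no PAE)")
print(large_page_size(8) // MIB, "MB with 8-byte entries (PAE)")
```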

About the only useful places for these are large mappings like ioremap
and whatnot.

regards, Kyle
--

From: Johannes Berg
Date: Tuesday, October 7, 2008 - 10:08 am

Thanks for the explanation. Can you explain too why iwlwifi crashes when
I enable 64k pages? ;)

johannes
From: Johannes Berg
Date: Friday, September 26, 2008 - 8:10 am

Thanks. I've also been chasing a DMA corruption issue with iwlagn (on

I suspect the hard lockup was due to a BUG_ON in the iwlagn driver. If
you can reproduce this, either try applying the patch here [1] or go
to a VC to see if it crashes there. It's a BUG_ON in iwl-tx.c.

johannes

[1] http://article.gmane.org/gmane.linux.kernel.wireless.general/21226

From: Andres Freund
Date: Monday, September 29, 2008 - 1:26 am

Hi,

On Friday 26 September 2008, you wrote in "Re: bad DMAR interaction with
Could not reproduce so far - it is rather hard working on the machine with
DMAR enabled because I get 1-5s lockups all the time, as described above...

Andres
