Re: Request for MCE decode (AMD Barcelona, fam 10h)

From: Tony Vroon
Date: Saturday, September 6, 2008 - 7:32 pm

On a Tyan-based system with intermittent but persistent instability, I
have finally received a message that something might actually be wrong
in hardware. Could you decode:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 4 MISC c000000001000000
STATUS fa00002000020c0f MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 4 BANK 4 MISC c000000001000000
STATUS fa00000000070f0f MCGSTATUS 0

This appeared while the 3Ware 9550SXU-8LP RAID controller reported a
disk corruption:
3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
Machine check events logged
3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair
completed:port=2, LBA=0x74907F9.

This is on:
Linux prometheus 2.6.27-rc5-00283-g70bb089 #1 SMP Sat Sep 6 13:52:51 BST
2008 x86_64 Quad-Core AMD Opteron(tm) Processor 2354 AuthenticAMD
GNU/Linux

This is a 2x Opteron 2354 (so 8 core) system, on a Tyan S2915-E
mainboard with the v2.07 BIOS. The system is equipped with 16GB RAM,
populated as 8x Kingston KVR667D2D4P5/2G.

Configuration for the RAID controller, in case it is relevant:
/c0 Driver Version = 2.26.02.011
/c0 Model = 9550SXU-8LP
/c0 Available Memory = 112MB
/c0 Firmware Version = FE9X 3.08.00.029
/c0 Bios Version = BE9X 3.10.00.003
/c0 Boot Loader Version = BL9X 3.02.00.001
/c0 Serial Number = [scrubbed]
/c0 PCB Version = Rev 032
/c0 PCHIP Version = 1.60
/c0 ACHIP Version = 1.90
/c0 Number of Ports = 8
/c0 Number of Drives = 6
/c0 Number of Units = 1
/c0 Total Optimal Units = 1
/c0 Not Optimal Units = 0
/c0 JBOD Export Policy = off
/c0 Disk Spinup Policy = 2
/c0 Spinup Stagger Time Policy (sec) = 1
/c0 Auto-Carving Policy = off
/c0 Auto-Carving Size = 2048 GB
/c0 Auto-Rebuild Policy = on
/c0 Controller Bus Type = PCIX
/c0 Controller Bus Width = 64 bits
/c0 Controller Bus Speed = 133 Mhz

Unit  UnitType  Status  ...
From: Jeroen van Rijn
Date: Saturday, September 6, 2008 - 8:16 pm

Hi Tony,

Not easily, and it's too late to parse
arch/x86/kernel/cpu/mcheck/mce_64.c and find out what it means before
I nod off. Still, before I sign off, have you tried running "mcelog
--ascii"? It needs to be run on the machine the check occurred on. It
might give you something to go on before the cavalry arrives.

Best regards,
  Jeroen.
--

From: Tony Vroon
Date: Saturday, September 6, 2008 - 9:22 pm

That worked, thank you. Had to feed it back in as /dev/mcelog was empty,
but it made a bit more sense of it:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache        bit57 = processor context corrupt
       bit59 = misc error valid
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node observed, request didn't time out
      generic error mem transaction
      generic access, level generic'
STATUS fa00002000020c0f MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 4 0 data cache        bit57 = processor context corrupt
       bit59 = misc error valid
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'generic participation, request timed out
      generic error mem transaction
      generic access, level generic'

Regards,
Tony V.
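
The bit names mcelog prints above correspond to fixed positions in the
STATUS value. As a minimal sketch, assuming the architectural x86
MCi_STATUS layout (flag bits 57 to 63, and the "bus error" code format
0000 1PPT RRRR IILL in the low 16 bits), the two STATUS values from this
thread can be decoded by hand. The function name decode_status and the
string tables are illustrative and are not taken from mcelog; the
AMD-specific extended fields are deliberately left undecoded.

```python
# Flag bits in the upper part of an x86 MCi_STATUS register
# (assumed architectural layout; model-specific fields are ignored).
STATUS_FLAGS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "error uncorrected",
    60: "error reporting enabled",
    59: "misc error valid",
    58: "error address valid",
    57: "processor context corrupt",
}

# Field tables for the "bus error" code format: 0000 1PPT RRRR IILL.
PARTICIPATION = (
    "local node originated",
    "local node responded",
    "local node observed",
    "generic participation",
)
MEM_OR_IO = ("memory", "reserved", "I/O", "generic")
LEVEL = ("level 0", "level 1", "level 2", "level generic")


def decode_status(status):
    """Return the set flag bits and, for bus errors, the error-code fields."""
    flags = {bit: name for bit, name in STATUS_FLAGS.items()
             if (status >> bit) & 1}
    code = status & 0xFFFF  # low 16 bits hold the error code
    decoded = {"flags": flags, "error_code": code, "bus": None}
    # A bus error code has the top five bits of the low word equal to 00001.
    if (code & 0xF800) == 0x0800:
        decoded["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],  # PP
            "timeout": bool((code >> 8) & 0x1),                 # T
            "request": (code >> 4) & 0xF,                       # RRRR (0 = generic)
            "access": MEM_OR_IO[(code >> 2) & 0x3],             # II
            "level": LEVEL[code & 0x3],                         # LL
        }
    return decoded


if __name__ == "__main__":
    # The two STATUS values quoted in this thread.
    for raw in ("fa00002000020c0f", "fa00000000070f0f"):
        info = decode_status(int(raw, 16))
        print(raw)
        for bit in sorted(info["flags"]):
            print("  bit%d = %s" % (bit, info["flags"][bit]))
        print("  bus:", info["bus"])
```

For the first value this reports "local node observed" with no timeout,
and for the second "generic participation" with a timeout, which agrees
with the mcelog text above. It also lists bits 60 and 63, which the
mcelog output quoted here does not print.
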
From: Andi Kleen
Date: Monday, September 8, 2008 - 3:55 am

It unfortunately turns out that mcelog logging is a tricky
psychological problem. What should the warning above have
looked like so that you would not have required "peer insight"
and would actually have just contacted your hardware vendor?

Thank you.

-Andi (who wonders if <blink> tags in syslog would be useful 
to solve this)
--

From: Jeroen van Rijn
Date: Monday, September 8, 2008 - 4:13 am

I suppose mcelog might be extended to point at possible tools to get a
second opinion, in case the admin would like to be entirely
certain. I can understand that position: their vendor may well
ask them whether it's really the hardware and what tests they've run
to rule out software.

Think for example a machine check that might point to faulty RAM, it
might direct the admin to run memcheck if mcelog alone isn't

Yikes, ixnay to the <blinkay>. Next people will ask for flash support
to get all-singing and -dancing error messages.

-- Jeroen.
--

From: Tony Vroon
Date: Monday, September 8, 2008 - 5:22 am

An indication of the faulty part, so I know whether to contact AMD or Tyan.
Without a clear idea of which, it could quickly turn into an infinite
redirect loop between the two.

Regards,
Tony V.
From: Andi Kleen
Date: Monday, September 8, 2008 - 7:04 am

Ok so you wanted linux-kernel to diagnose your hardware for you?

For DIMMs you can get that with --dmi if you run the latest mcelog
and if it's a memory problem. 

Unfortunately the BIOS vendors in their wisdom often deliver incorrect
DMI tables, so the information is not always very useful.
 
-Andi
-- 
ak@linux.intel.com

--

From: Tony Vroon
Date: Monday, September 8, 2008 - 8:52 am

I was hoping for some help in narrowing it down, yes. Jeroen's reply was
very helpful, and more along the lines of what I was expecting. I have
contacted all vendors involved now, and it looks like the system RAM is
not fully compatible with the mainboard.

With regard to the message, I would suggest an alternative wording such
as:

A hardware component in your system is failing.
Please contact your hardware vendor(s).
If unsure, contact your CPU vendor first.

Regards,
Tony V.
From: Jeroen van Rijn
Date: Monday, September 8, 2008 - 9:25 am

Or this:
"A hardware component in your system is failing.
[insert specific bit if MCE is certain enough about what part]
If you can, try to narrow it down by placing it in another mainboard
(assuming you have one available), or run [memcheck, another tool].
Then contact the hardware vendor(s) in question, if uncertain, try

Jeroen.
--

From: Pavel Machek
Date: Monday, September 8, 2008 - 6:55 am

Ugh, actually this is not right. AFAIK MCEs can be triggered by stuff
like PCI aborts, which in turn can be caused by software.

If you really want me to contact hw vendor, you need to be a lot more
specific.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Andi Kleen
Date: Monday, September 8, 2008 - 7:00 am

PCI aborts don't normally cause machine checks, no.

-Andi
-- 
ak@linux.intel.com
--
