On a Tyan-based system with intermittent but persistent instability, I have finally received a message that something might actually be wrong in hardware. Could you decode:

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 4 MISC c000000001000000
STATUS fa00002000020c0f MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 4 BANK 4 MISC c000000001000000
STATUS fa00000000070f0f MCGSTATUS 0

This appeared while the 3Ware 9550SXU-8LP RAID controller reported a disk corruption:

3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
Machine check events logged
3w-9xxx: scsi0: AEN: WARNING (0x04:0x0023): Sector repair completed:port=2, LBA=0x74907F9.

This is on:

Linux prometheus 2.6.27-rc5-00283-g70bb089 #1 SMP Sat Sep 6 13:52:51 BST 2008 x86_64 Quad-Core AMD Opteron(tm) Processor 2354 AuthenticAMD GNU/Linux

This is a 2x Opteron 2354 (so 8 core) system, on a Tyan S2915-E mainboard with the v2.07 BIOS. The system is equipped with 16GB RAM, populated as 8x Kingston KVR667D2D4P5/2G.

Configuration for the RAID controller, in case it is relevant:

/c0 Driver Version = 2.26.02.011
/c0 Model = 9550SXU-8LP
/c0 Available Memory = 112MB
/c0 Firmware Version = FE9X 3.08.00.029
/c0 Bios Version = BE9X 3.10.00.003
/c0 Boot Loader Version = BL9X 3.02.00.001
/c0 Serial Number = [scrubbed]
/c0 PCB Version = Rev 032
/c0 PCHIP Version = 1.60
/c0 ACHIP Version = 1.90
/c0 Number of Ports = 8
/c0 Number of Drives = 6
/c0 Number of Units = 1
/c0 Total Optimal Units = 1
/c0 Not Optimal Units = 0
/c0 JBOD Export Policy = off
/c0 Disk Spinup Policy = 2
/c0 Spinup Stagger Time Policy (sec) = 1
/c0 Auto-Carving Policy = off
/c0 Auto-Carving Size = 2048 GB
/c0 Auto-Rebuild Policy = on
/c0 Controller Bus Type = PCIX
/c0 Controller Bus Width = 64 bits
/c0 Controller Bus Speed = 133 Mhz

Unit UnitType Status ...
Hi Tony, Not easily, and it's too late to parse arch/x86/kernel/cpu/mcheck/mce_64.c and find out what it means before I nod off. Still, before I sign off, have you tried running "mcelog --ascii"? It needs to be run on the machine the check occurred on. It might give you something to go on before the cavalry arrives. Best regards, Jeroen. --
That worked, thank you. Had to feed it back in as /dev/mcelog was empty,
but it made a bit more sense of it:
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache bit57 = processor context corrupt
bit59 = misc error valid
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
bus error 'local node observed, request didn't time out
generic error mem transaction
generic access, level generic'
STATUS fa00002000020c0f MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 4 0 data cache bit57 = processor context corrupt
bit59 = misc error valid
bit61 = error uncorrected
bit62 = error overflow (multiple errors)
bus error 'generic participation, request timed out
generic error mem transaction
generic access, level generic'
Regards,
Tony V.
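
For readers following along, the bit decoding that mcelog printed above can be reproduced by hand. The sketch below is not mcelog's code. It assumes the architectural x86 MCi_STATUS flag positions (bits 57 to 63) and the AMD-style bus error layout in the low 16 bits (bit 11 set, participation in bits 10:9, timeout in bit 8). The function name decode_status and the label strings that do not appear in the output above are my own placeholders.

```python
# Minimal sketch of decoding the STATUS values quoted in this thread.
# Assumption: architectural MCi_STATUS flag bits, as also named by mcelog above.
STATUS_FLAGS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "error uncorrected",
    60: "error reporting enabled",
    59: "misc error valid",
    58: "address valid",
    57: "processor context corrupt",
}

# Assumption: bus error code layout 0000 1PPT RRRR IILL in bits 15:0.
# Entries 2 and 3 match the mcelog output above; entries 0 and 1 are
# descriptive placeholders.
PARTICIPATION = [
    "local node origin",
    "local node response",
    "local node observed",
    "generic participation",
]
TIMEOUT = ["request didn't time out", "request timed out"]


def decode_status(status):
    """Return the set flags and, for bus errors, the participation/timeout fields."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items())
             if (status >> bit) & 1]
    errcode = status & 0xFFFF
    result = {"flags": flags, "errcode": errcode, "kind": "other"}
    # Bus error: bits 15:12 clear and bit 11 set.
    if (errcode >> 11) == 1:
        result["kind"] = "bus error"
        result["participation"] = PARTICIPATION[(errcode >> 9) & 0x3]
        result["timeout"] = TIMEOUT[(errcode >> 8) & 0x1]
    return result


if __name__ == "__main__":
    for raw in ("fa00002000020c0f", "fa00000000070f0f"):
        decoded = decode_status(int(raw, 16))
        print(raw, decoded["kind"], decoded.get("participation"),
              decoded.get("timeout"), decoded["flags"])
```

Run against the two STATUS values above, the participation and timeout fields should agree with the "bus error" lines mcelog printed. The flag list will also include bits 60 and 63, which the mcelog output above does not print.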
It unfortunately turns out that mcelog logging is a tricky psychological problem. What should the warning above have looked like so that you would not have required "peer insight" and actually just contacted your hardware vendor? Thank you. -Andi (who wonders if <blink> tags in syslog would be useful to solve this) --
I suppose mcelog might be extended to point at possible tools to get a second opinion, in case the admin would like to be entirely certain. In their position I can understand them when their vendor asks them if it's the hardware and what tests they've run to rule out software. Think for example a machine check that might point to faulty RAM, it might direct the admin to run memcheck if mcelog alone isn't

Yikes, ixnay to the <blinkay>. Next people will ask for flash support to get all-singing and -dancing error messages. -- Jeroen. --
Indication of the faulty part, so I know whether to contact AMD or Tyan. Without a clear idea of which it could quickly turn into an infinite redirect loop between the two. Regards, Tony V.
Ok so you wanted linux-kernel to diagnose your hardware for you? For DIMMs you can get that with --dmi if you run the latest mcelog and if it's a memory problem. Unfortunately the BIOS vendors in their wisdom often deliver incorrect DMI tables, so the information is not always very useful. -Andi -- ak@linux.intel.com --
I was hoping for some help in narrowing it down, yes. Jeroen's reply was very helpful, and more along the lines of what I was expecting. I have contacted all vendors involved now, and it looks like the system RAM is not fully compatible with the mainboard. With regard to the message, I would suggest an alternate wording such as: A hardware component in your system is failing. Please contact your hardware vendor(s). If unsure, contact your CPU vendor first. Regards, Tony V.
Or this: "A hardware component in your system is failing. [insert specific bit if MCE is certain enough about what part] If you can, try to narrow it down by placing it in another mainboard (assuming you have one available), or run [memcheck, another tool]. Then contact the hardware vendor(s) in question, if uncertain, try Jeroen. --
Ugh, actually this is not right. AFAIK MCEs can be triggered by stuff like PCI aborts, which in turn can be caused by software. If you really want me to contact hw vendor, you need to be a lot more specific. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
PCI aborts don't normally cause machine checks, no. -Andi -- ak@linux.intel.com --
