Re: Hardware Error Kernel Mini-Summit

From: Borislav Petkov
Date: Tuesday, May 18, 2010 - 11:46 pm

From: "Eric W. Biederman" <ebiederm@xmission.com>
Date: Tue, May 18, 2010 at 09:14:09PM -0400


This is exactly the reason why we need better error logging and
reporting than a log. How do you want to discover trends and count CECCs
per DIMM if you have to scan the logs all the time and grep for the DRAM
page the error happened in, the CS row it is located in, and whether
this is in the same DIMM as the 115th error back in the log? This gets
especially tricky if you're using one of the gazillion memory
interleaving schemes.

Ok, and what about other errors like L3 cache errors, for example? You
want to count those too and, upon reaching a threshold, disable a cache
index _before_ a correctable ECC error turns into an uncorrectable one
that brings the whole system down with a critical MCE.
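
The per-component counting and threshold idea from the two paragraphs
above could look roughly like the following. This is only a userspace
sketch in Python, not kernel code. It assumes each error has already
been decoded to a component key (a DIMM label or an L3 cache index);
the ErrorTracker class, its methods, the key format and the threshold
value are all hypothetical names made up for illustration.

```python
from collections import Counter


class ErrorTracker:
    """Count correctable errors per component and flag those over a threshold.

    Hypothetical sketch: component keys such as ("dimm", "DIMM_A1") or
    ("l3_index", 42) are assumed to come from an already-decoded error record.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, component):
        """Record one correctable error.

        Returns True once the component has reached the threshold, which is
        the point where a caller would take action (e.g. disable a cache index).
        """
        self.counts[component] += 1
        return self.counts[component] >= self.threshold

    def worst(self, n=3):
        """Return the n components with the most correctable errors."""
        return self.counts.most_common(n)
```

With counters keyed by component, "which DIMM is degrading" becomes a
lookup instead of a grep over the whole log.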

How about error injection? You want to test the hardware/software by
injecting real hardware errors rather than simulating it all in software.

And also you want to be able to schedule different maintenance actions
depending on the severity of the error and in certain cases get away
with a clean shutdown even in the face of an uncorrectable error.
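
As a rough illustration of severity-dependent maintenance actions, here
is a small Python sketch. The severity classes, the action names and
the policy itself are assumptions made up for this example; they are
not the kernel's actual MCE severity grading.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical severity classes, for illustration only.
    CORRECTED = 1
    UNCORRECTED_RECOVERABLE = 2
    UNCORRECTED_FATAL = 3


def choose_action(severity, over_threshold=False):
    """Pick a maintenance action for an error; the policy is illustrative."""
    if severity is Severity.CORRECTED:
        # Corrected errors only matter once they start piling up.
        return "schedule_maintenance" if over_threshold else "log_only"
    if severity is Severity.UNCORRECTED_RECOVERABLE:
        # Data is lost but the system can still be brought down cleanly.
        return "clean_shutdown"
    # Nothing safe left to do.
    return "panic"
```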

So, the whole idea entails much more than reporting errors in the
syslog: it is about making the system intelligent enough to prolong its
own life and to warn the user that something bad is about to happen.

And we don't have that right now - right now we say that some machine
checks have been logged and with uncorrectable MCEs we freeze cowardly
and hope to be able to make a warm reset so that the MCA MSRs still
contain some valid data which we can decode painstakingly by hand.

I hope this makes our intentions a bit clearer.

-- 
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.
