Re: Hardware Error Kernel Mini-Summit

From: Eric W. Biederman
Date: Tuesday, May 18, 2010 - 6:14 pm

Andi Kleen <andi@firstfloor.org> writes:


This suggests that to get things reported in dmesg I should
set up a cron job that pulls the latest kernel, checks to see
if things are reported into syslog, and sends you an email
if things are wrong.

I'm not ready to believe the average person who is running Linux
is too stupid to understand the difference between a hardware
error and a software error.


The error rate should not be fixed per bit but should be roughly fixed
per DIMM.  If the error rate over time is fixed per bit we are in deep
trouble.


Not at all, and I don't have a clue where you started thinking
predictive page offlining makes the least bit of sense.  Broken
or even weak bits are rarely the common reason for ECC errors.


A log is a fine format for realizing you have a problem.  A
log doesn't need to be the only place errors are reported
but a log should be the default place ECC errors are reported.
We do that with hard drive errors and other kinds of hardware
errors and we have done it for years without problems.

My experience is that correctable ECC errors come in two kinds of
frequencies.

- The expected single bit correctable error range, which is somewhere
  between once a month and once a year per DIMM.

  The most unreasonable configuration I ever worked with was 4TB of RAM
  in 1GB sticks up at Los Alamos, at 7000ft, in an environment known
  to trigger errors.  There I saw roughly one correctable ECC error an hour.
  Huge but just barely within the expected range.

  I can live with a log message once a month on a mundane system.

- Errors that occur frequently. That is broken hardware of one kind or
  another.  I want to know about that so I can schedule downtime to replace
  my memory before I get an uncorrected ECC error.  Errors of this kind
  are likely happening frequently enough to impact performance.
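
As a rough check of the arithmetic behind the two categories above, here is
a minimal sketch. It assumes 4TB in 1GB sticks means 4096 DIMMs (binary
units) and treats "once a month per DIMM" as the upper edge of the expected
range. The function names and the cut-off for "frequent" are illustrative
assumptions, not something the kernel or any tool defines.

```python
HOURS_PER_YEAR = 24 * 365

# Upper edge of the expected range described above: about once a month
# per DIMM. Using it as the cut-off for "frequent" is an assumption.
EXPECTED_MAX_PER_DIMM_PER_YEAR = 12.0


def per_dimm_errors_per_year(system_errors_per_hour, dimm_count):
    """Convert a system-wide correctable error rate to a per-DIMM yearly rate."""
    return system_errors_per_hour * HOURS_PER_YEAR / dimm_count


def classify(rate_per_dimm_per_year):
    """Label a per-DIMM rate as 'expected' background noise or 'frequent'."""
    if rate_per_dimm_per_year <= EXPECTED_MAX_PER_DIMM_PER_YEAR:
        return "expected"
    return "frequent"


# The large configuration: 4TB of RAM in 1GB sticks, one error an hour.
dimms = 4 * 1024  # assumes 1TB = 1024GB
rate = per_dimm_errors_per_year(1.0, dimms)
print(f"{dimms} DIMMs: {rate:.2f} correctable errors per DIMM per year")
print(classify(rate))
```

Under those assumptions one error an hour across the whole machine works
out to a little over two errors per DIMM per year, which sits between once
a year and once a month.  A single DIMM logging errors daily would land in
the second category.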

Eric
