Re: Hardware Error Kernel Mini-Summit

From: Mauro Carvalho Chehab
Date: Monday, May 17, 2010 - 11:23 am

During the last LF Collaboration Summit, we held a mini-summit [1]
intended to improve hardware error detection in the kernel, currently
provided by the MCE and EDAC subsystems.

The idea for this mini-summit came up after Thomas Gleixner and Ingo
Molnar suggested that EDAC and MCE should converge into an error
subsystem.

I'm enclosing the minutes of the meeting so that they can be
reviewed by other kernel hackers who are interested in the topic but
unfortunately couldn't come to the meeting.

Btw, during the meeting it was decided that the EDAC ML would work
better if moved to vger, so I'm copying both the old and the new EDAC
mailing lists here.

[1] http://events.linuxfoundation.org/lfcs2010/edac

---


	I Hardware Error Kernel Mini-Summit
	===================================
				April 15 - San Francisco, CA, US
				2010 Linux Foundation Collaboration Summit

Attendees:
	Ben Woodard   - Red Hat
	Brent Young   - Intel
	Doug Thompson - LLNL
	Mark Grondona - LLNL
	Matt Domsch   - Dell
	Mauro Chehab  - Red Hat
	Tony Luck     - Intel

After some initial description of the current state of error
handling in Linux, we moved to work on requirements and high
level design of the system going forward.

Requirements
============

First and foremost is that end-users (presumably system administrators)
need notification of the hardware component that is the source of each
error.  Ideally this should include "silk screen" markings so that the
user can identify which component is at fault.

Other requirements may vary amongst different types of end users, but
include:
+ Minimal disruption to system performance when logging corrected errors.
+ Assurance that h/w error detection mechanisms are correctly configured
  and enabled.
+ System topology information

With respect to system topology, it was pointed out that LLNL is concerned
about being sure that ECC is enabled, as some BIOSes lied about that in the
past. Also, memory topology information is needed to ...

From: Andi Kleen
Date: Monday, May 17, 2010 - 3:41 pm

Just for the innocent readers who might be misled by this:

Nehalem-EP DIMM error accounting already works fine today using
mcelog for most cases, including RHEL5.5 (with some limits) 
and RHEL6beta with no additional changes needed.

In RHEL6 the daemon does the accounting and the client reports the errors
separated for each DIMM and separated in uc and ce.  In RHEL5
the information is in a log file and can be gotten from there.

In addition the daemon supports various advanced RAS features including

Already done too, see
http://permalink.gmane.org/gmane.linux.acpi.devel/45743

However the interface won't give you the topology you're asking
for, just the errors.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Mauro Carvalho Chehab
Date: Tuesday, May 18, 2010 - 9:50 am

Ok. It should be clear that the main target of the mini-summit is to define
how the several subsystems will integrate into a hardware-abstracted way
to report errors from the kernel. So, we're looking at the next steps to
improve what we currently have, and to avoid having more than one subsystem
trying to get the same info, possibly using the same registers, but
providing different interfaces to userspace.

-- 

Cheers,
Mauro
--

From: Andi Kleen
Date: Tuesday, May 18, 2010 - 11:10 am

Well there are different use cases.

mcelog mainly deals in thresholds (including fancy ones like
per page and per object thresholds), events, and actions on thresholds
(= more events), whereas all your proposals currently deal with object counts.

It does per object counting too, but only incidentally.

I suspect there are use cases for both, although I personally suspect
for most people events, thresholds and their actions are the most useful
thing to handle by default. But one size doesn't fit all.

Anyway, it boils down to needing different interfaces for different things.

For example, there will always be events versus accounting.

You can synthesize accounting from events (that is what mcelog
does today). The other way round does not work so well unfortunately,
or at least would be rather inefficient.
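
As a rough illustration of synthesizing accounting from events, the
sketch below folds a stream of error events into per-DIMM counters split
into corrected (ce) and uncorrected (uc). The event fields and DIMM labels
are hypothetical and are not mcelog's actual record format.

```python
from collections import Counter, namedtuple

# Hypothetical event record; real machine check records carry far more detail.
ErrorEvent = namedtuple("ErrorEvent", ["dimm", "kind"])  # kind: "ce" or "uc"


def account(events):
    """Fold a stream of events into per-(DIMM, kind) counters."""
    counts = Counter()
    for ev in events:
        counts[(ev.dimm, ev.kind)] += 1
    return counts


events = [
    ErrorEvent("DIMM_A1", "ce"),
    ErrorEvent("DIMM_A1", "ce"),
    ErrorEvent("DIMM_B2", "uc"),
]
counts = account(events)
print(counts[("DIMM_A1", "ce")], counts[("DIMM_B2", "uc")])
```

One plausible reason the reverse direction is harder: the counters no
longer carry the ordering or timing of the individual events.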

Also, large parts of the actions can only be usefully done in user space, so
you need a user space component.

I am somewhat biased of course, but I think mcelog is doing a reasonably
good job today at being this user space component. It definitely
has areas that could be improved too, but a lot of the basics
are there and doing ok.

In principle mcelog could feed from another API too, but it would
definitely prefer not to have to poll it or parse
printks.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Hidetoshi Seto
Date: Monday, May 17, 2010 - 11:52 pm

Thank you very much for providing this report.

I agree that we should have a well organized error subsystem that
covers all error sources in the system and that provides a sufficiently
simple and powerful API for users. As one of the interested absentees,
I think I could be of some help to you (e.g. x86 low level).

It might be off-topic here, but I'd like to point out that you missed
the presence of the PCIe AER subsystem that handles hardware errors on
PCIe devices nowadays (it works well on ppc, x86 and so on).
Given that APEI also covers PCIe errors and that some systems can
map MC registers to PCI configuration space, I think there is no
way for the new error subsystem to ignore I/O device errors while
it cares about errors on CPU/memory and cooperates with APEI.


Thanks,
H.Seto

--

From: Mauro Carvalho Chehab
Date: Tuesday, May 18, 2010 - 9:44 am

Yes, it makes sense to also integrate the PCIe AER subsystem. IMO, the first
step is to provide an error core integrated with perf, and then start
integrating the several error systems around it.

-- 

Cheers,
Mauro
--

From: Joe Perches
Date: Tuesday, May 18, 2010 - 10:42 am

Why integrated to perf?

--

From: Mauro Carvalho Chehab
Date: Tuesday, May 18, 2010 - 10:59 am

That's the original plan. It was suggested by Ingo and Thomas on LKML. Borislav
also sent a more technical proposal about it.

It actually makes sense, since some sorts of errors may affect performance
(those non-fatal errors that are auto-recovered). Also, using debugfs and the
same kind of logic used by perf to filter errors seems pertinent for hardware
errors. As the actual patches have not been written yet, the details of how
those things will integrate will depend on further analysis.

-- 

Cheers,
Mauro
--

From: Andi Kleen
Date: Tuesday, May 18, 2010 - 11:45 am

For a different perspective on this see also

http://permalink.gmane.org/gmane.linux.kernel/952061

AFAIK all the issues mentioned there are still open.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Joe Perches
Date: Tuesday, May 18, 2010 - 11:57 am

Yup, that's why I asked for explanation.

What was offered lacks useful goal and design detail.

perf is a cool tool, but probably not necessary
for a HW error reporting system.


--

From: Ingo Molnar
Date: Tuesday, May 18, 2010 - 11:53 am

It makes sense to use the kernel's performance events 
logging framework when we are logging events about how the 
system performs.

Furthermore it's NMI safe, offers structured logging, has
various streaming, multiplexing and filtering capabilities
that come in handy for RAS purposes and more.

The other option would be to use an ad-hoc logging
implementation, only used for EDAC/RAS, which couldn't be
mixed with other system events. That approach has various
obvious disadvantages so we are aiming for a unified
approach.

Thanks,

	Ingo
--

From: Luck, Tony
Date: Tuesday, May 18, 2010 - 12:08 pm

Perhaps it makes more sense to say that the Linux "performance
events logging framework" has become more generic and is really

Those of us present at the mini-summit were not familiar with
all the features available. One area of concern was how to be
sure that something is in fact listening to and logging the
error events.  My understanding is that if there is no process
attached to an event, the kernel will just drop it.  This is
of particular concern because the kernel's first scan of the
machine check banks occurs before there are any processes.
So errors found early in boot (which might be saved fatal
errors from before the boot) might be lost.

-Tony
--

From: Borislav Petkov
Date: Tuesday, May 18, 2010 - 12:18 pm

From: "Luck, Tony" <tony.luck@intel.com>


Well, we have a trace_mce_record tracepoint in the mcheck code which
calls all the necessary callbacks when an mcheck occurs. For the time
being, the idea is to use the mce.c ring buffer for early mchecks and
copy them to the regular ftrace per-cpu buffer after the latter has been
initialized. Later, we could switch to another early bootmem buffer if
there's a need to.
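
The early-buffer idea can be pictured with a small user-space sketch:
records are held in a bounded buffer until a consumer attaches, then
replayed in order. This is only an illustration of the pattern; the class
and method names are made up and it is not the mce.c or ftrace code.

```python
from collections import deque


class EarlyLog:
    """Holds records in a bounded buffer until a consumer attaches."""

    def __init__(self, capacity=32):
        # Once capacity is exceeded, the oldest records are dropped.
        self.early = deque(maxlen=capacity)
        self.consumer = None

    def record(self, rec):
        if self.consumer is None:
            self.early.append(rec)
        else:
            self.consumer(rec)

    def attach(self, consumer):
        """Replay buffered records in order, then deliver directly."""
        self.consumer = consumer
        while self.early:
            consumer(self.early.popleft())


log = EarlyLog(capacity=2)
log.record("mce-boot-1")
log.record("mce-boot-2")
received = []
log.attach(received.append)
log.record("mce-runtime-1")
print(received)
```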

Also, we want to have a userspace daemon that reads out the mces from
the trace buffer and does further processing like thresholding etc in
userspace.

Concerning critical errors, there we bypass the perf subsystem and
execute the smallest amount of code possible while trying to shut down
gracefully if the error type allows that.

These are the rough ideas at least...

-- 
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.
--

From: Ingo Molnar
Date: Tuesday, May 18, 2010 - 12:34 pm

The end result would be even simpler by one more step:
with persistent events we just use them and don't need the
mce.c ringbuffer at all. (getting rid of that complication
is one of the code cleanliness benefits i see in this move
as an x86 maintainer - beyond the obvious generalization

Yeah. Each perf_event can have arbitrary callbacks with 
add-on (or critical) functionality. We would activate the 
event(s) during bootup and it would do its thing from that 
point on: critical functionality gets a direct path via 
the callback, and every other event that survives goes via 
the regular perf output channels, to one (or more) 
consumers/subscribers of these events.
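
A toy model of that dispatch, assuming a callback can consume an event by
returning True: critical handling runs first and directly, and only events
that survive are handed to the subscribers. The Event class here is
hypothetical and is not the kernel's perf_event structure.

```python
class Event:
    """Toy event source with direct callbacks and regular subscribers."""

    def __init__(self, name):
        self.name = name
        self.callbacks = []    # run directly, before any regular delivery
        self.subscribers = []  # stand-ins for the regular output channels

    def fire(self, payload):
        # Assumption: a callback returning True consumes the event.
        for cb in self.callbacks:
            if cb(payload):
                return
        for sub in self.subscribers:
            sub(payload)


critical_seen = []
logged = []


def on_critical(payload):
    """Direct path: consume fatal errors, let the rest through."""
    if payload["fatal"]:
        critical_seen.append(payload)
        return True
    return False


mce = Event("mce")
mce.callbacks.append(on_critical)
mce.subscribers.append(logged.append)

mce.fire({"bank": 4, "fatal": False})
mce.fire({"bank": 1, "fatal": True})
print(len(logged), len(critical_seen))
```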

	Ingo
--

From: Eric W. Biederman
Date: Tuesday, May 18, 2010 - 3:14 pm

Can someone please tell me why everyone is eager to squirrel
correctable error reports away and not report them in dmesg? aka
syslog.

On several occasions I have had a machine with memory errors where
mcelog or the BIOS was eating the error reports and not putting them
anywhere a normal human being would look.

If your system isn't broken, correctable errors are rare.  People look
at syslog.  People look in /var/log/messages and dmesg when something
goes weird.

I have no problem with additional interfaces to provide additional
functionality but please can we put errors where people can find them.

Eric
--

From: Andi Kleen
Date: Tuesday, May 18, 2010 - 3:28 pm

The original motivation to put them somewhere else
was that I was sick of people reporting them as kernel bugs.


Actually the more memory you have the more common they are.
And the trend is to more and more memory.

Really, to do anything useful with them you need trends
and automatic actions (like predictive page offlining).

A log isn't really a good format for that.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Eric W. Biederman
Date: Tuesday, May 18, 2010 - 6:14 pm

This suggests that to get things reported in dmesg I should
set up a cron job that pulls the latest kernel, checks to see
if things are reported into syslog and sends you an email
if things are wrong.

I'm not ready to believe the average person that is running linux
is too stupid to understand the difference between a hardware

The error rate should not be fixed per bit but should be roughly fixed
per DIMM.  If the error rate over time is fixed per bit we are in deep

Not at all, and I don't have a clue where you start thinking
predictive page offlining makes the least bit of sense.  Broken

A log is a fine format for realizing you have a problem.  A
log doesn't need to be the only place errors are reported
but a log should be the default place ECC errors are reported.
We do that with hard drive errors and other kinds of hardware
errors and we have done it for years without problems.

My experience is that correctable ECC errors come in two kinds of
frequencies.

- The expected single bit correctable error range.  Which is somewhere
  between once a month and once a year per dimm.

  On the most unreasonable configuration I ever worked with, 4TB of ram
  in 1GB sticks up at Los Alamos, at 7000ft in an environment known
  to trigger errors, I saw roughly one correctable ECC error an hour.
  Huge but just barely within the expected range.

  I can live with a log message once a month on a mundane system.

- Errors that occur frequently. That is broken hardware of one type or
  another.  I want to know about that so I can schedule down time to replace
  my memory before I get an uncorrected ECC error.  Errors of this kind
  are likely happening frequently enough as to impact performance.
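
To put the numbers above side by side, here is a small worked calculation.
It assumes 4TB in 1GB sticks means 4096 DIMMs, uses a 30-day month, and
treats "less often than once a month per DIMM" as the boundary of the
expected band; the cutoff and function names are illustrative only.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month


def per_dimm_interval_months(total_errors_per_hour, dimm_count):
    """Average months between correctable errors on a single DIMM."""
    per_dimm_per_month = total_errors_per_hour * HOURS_PER_MONTH / dimm_count
    return 1.0 / per_dimm_per_month


def classify(interval_months):
    """Bucket a per-DIMM interval; the one-month cutoff is an assumption."""
    return "expected" if interval_months >= 1.0 else "frequent"


# 4 TB in 1 GB sticks, taken here as 4096 DIMMs, at one error per hour.
interval = per_dimm_interval_months(1.0, 4096)
print(round(interval, 1), classify(interval))
```

Under those assumptions one error an hour across the whole machine works
out to roughly one error per DIMM every five to six months.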

Eric
--

From: Borislav Petkov
Date: Tuesday, May 18, 2010 - 11:46 pm

From: "Eric W. Biederman" <ebiederm@xmission.com>

This is exactly the reason why we need better error logging and
reporting than a log. How do you want to discover trends and count CECCs
per DIMM if you have to scan the logs all the time and grep for the DRAM
page it happened in, the CS row it is located in, and whether this is
located in the same DIMM as the 115th error back in the log? This gets
especially tricky if you're using one of the gazillion memory interleaving
schemes.

Ok, and what about other errors like L3 cache errors, for example? You
want to count those too and upon reaching a threshold disable a cache
index _before_ it turns a correctable ECC into an uncorrectable error
bringing the whole system down with a critical MCE.

How about error injection? You want to test the hardware/software by
injecting real hardware errors and not simulating it all in software.

And also you want to be able to schedule different maintenance actions
depending on the severity of the error and in certain cases get away
with a clean shutdown even in the face of an uncorrectable error.

So, the whole idea entails much more than reporting errors in the syslog
but rather making the system intelligent enough to prolong its own life
and be able to warn the user that something bad is about to happen.

And we don't have that right now - right now we say that some machine
checks have been logged and with uncorrectable MCEs we freeze cowardly
and hope to be able to make a warm reset so that the MCA MSRs still
contain some valid data which we can decode painstakingly by hand.

I hope this makes our intentions a bit clearer.

-- 
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.
--

From: Ingo Molnar
Date: Wednesday, May 19, 2010 - 12:09 am

Basically the idea behind the generic structured logging
framework (the perf events kernel subsystem) is to have
both ASCII output (where desired: critical errors), but to
also have a well-specified event format parsable by
user-space tools.

Plus there's the need for a fast, lightweight, flexible
event passing mechanism - which is given by the perf
events transport which enables arbitrary size in-memory
ring-buffers, poll() and epoll support, etc.

perf events supports all these different usecases and 
comes with a (constantly growing) set of events already 
defined upstream. We've got more than a dozen different 
upstream subsystems that have defined events and we have 
over a hundred individual events. There's a rapidly 
growing tool space that makes case by case use of these 
event sources to measure/observe various aspects of the 
system.

Regarding dmesg, there's a WIP patch on lkml that 
integrates printks into this framework as well - makes 
each printk also available as a special string event.

That way a tool can have both programmatic access to 
printk output (without having to interact with the syslog 
buffer itself) - together with all the other structured 
log sources, while humans can also see what is happening.

Thanks,

	Ingo
--

From: Mauro Carvalho Chehab
Date: Wednesday, May 19, 2010 - 4:54 am

Some system admins prefer to have everything on dmesg, as they
can enable a serial console, and catch the logs remotely, even
when the machine crashes for example due to a hardware failure.

So, IMHO, one feature that the perf event needs is the capability
to report errors via a serial console also, or a mechanism where
some events are sent via dmesg.

-- 

Cheers,
Mauro
--

From: Ingo Molnar
Date: Thursday, May 20, 2010 - 5:37 am

Yeah. That can be an aspect of the callback - or might 
even be integrated into the core code.

Thanks,

	Ingo
--

From: Nils Carlson
Date: Monday, June 14, 2010 - 3:03 am

Just left the above for reference. How would this affect other
aspects of EDAC such as the error injection, the sysfs
entries that (in most cases) reflect the layout of DIMMs, and
allow the setting of scrub rate? If we're just talking about
replacing all instances of printk (when logging single bit
errors) with perf events, I don't really see that as a problem.
But EDAC is much more than that today...

Thoughts, comments?

/Nils Carlson
--

From: Andi Kleen
Date: Monday, June 14, 2010 - 4:49 am

Some of this can probably be retained, but the way EDAC
represents layout, for example, is quite unsuitable too. It includes
a lot of internal implementation details that in some cases
you can't even get anymore on modern designs. Something
with a proper abstract interface is better.  EDAC never had this.

Also the biggest problem is still that EDAC doesn't
give you any silk screen labels, so unless you
have motherboard schematics the layout it presents
is fairly useless -- you still don't know which DIMM
to exchange. So in theory EDAC looks great, but in practice ...

On a lot of modern systems I checked, DMI
seems reasonably accurate in terms of layout, so I suspect they can
be handled with this. Others probably
still need some special driver, but one
with a proper interface.

For error injection: some modern systems support this
through ACPI EINJ, which has a separate non-EDAC
interface. For others I've simply been using some scripts
that twiddle the bits from user space. You can do that
with a shell script. If it were staying in the kernel
it could probably be moved into a proper error injection
framework that is not arbitrarily tied to memory.
Lots of different devices have error injection
support and exposing some of that in a general
framework would likely make sense.

Anyways the old EDAC drivers for this are not going
away, you can still use them. The interesting
question though is how to properly define the interface

I never quite saw the point of that one, but yes
there's no replacement for this anywhere else.

Normally scrub rate can be simply set in the BIOS,
is that not good enough? Is there a use case for
changing it dynamically? 

Note that modern hardware typically has demand scrubbing
anyways, that is when there is an error it automatically

I don't think perf is the right tool for this, the semantics
are mostly unsuitable (it hasn't been designed as an error reporting
tool, but as a performance tool and performance events are quite
different ...

From: Nils Carlson
Date: Monday, June 14, 2010 - 12:47 pm

I do have motherboard schematics, or rather, we build our own
boards. But the point is valid, a lot of people don't make their own
hardware. On the other hand, the people who do use this part of

This is true, and this is the way things are going on
our end as well. I guess that would mean
one driver that hooks into all frameworks though?
So you wouldn't go to the EDAC sysfs directory
to find everything to do with the same piece of hardware
anymore, but would have to go the n different
directories looking for all the pieces? I don't really

But all new hardware will look the way the hardware
designers want it to, so our interface will be a moving
target? Maybe it's time to let hardware makers provide
a board specification with device tree and memory

There is a use-case. A lot has to do with how different patrol
scrub rates work: some just go through memory at a constant
speed (MB/s), others vary according to load. The thing is,
different applications want their memory scrubbed within
different time frames, and as the amount of memory on boards
varies and the BIOS doesn't, this implies the need for setting
scrub rate from userspace.
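
For a scrubber that runs at constant speed, the arithmetic is simply
memory size divided by the target time frame. A minimal sketch, with
made-up board sizes and a 24-hour target chosen purely for illustration:

```python
def required_scrub_rate_mib_s(memory_gib, window_hours):
    """Constant patrol scrub rate needed to cover all memory in the window."""
    memory_mib = memory_gib * 1024
    return memory_mib / (window_hours * 3600)


# Same 24-hour target on boards with different amounts of memory.
for gib in (8, 64):
    print(gib, "GiB ->", round(required_scrub_rate_mib_s(gib, 24), 3), "MiB/s")
```

With the time frame held fixed, the required rate scales linearly with the
amount of memory, which is why a single fixed setting cannot suit every
board.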

Patrol scrubbing is normally used because it discovers errors
faster in seldom accessed memory allowing a DIMM with
too many errors to be replaced faster. Some applications
like to use demand scrubbing as well, and some consider
it to increase memory latency too much.


Oh, a hodge podge is much more than just single bit
correctable error reporting... :-) You never know what
you'll find in the sysfs directory for a given memory
controller.

/Nils Carlson
--

From: Andi Kleen
Date: Monday, June 14, 2010 - 1:21 pm

Most users do not build their own boards and do not have
schematics. And that's not home computer users.

Anyways I think important is that by default you get something
useful (including silk screen labels) without doing 
any special configuration steps.

Right now DMI is the only sane option for this that I can see.
EDAC doesn't do it because it has no silk screen labels.

And yes if someone is a power user they could still override

Let me try to understand that.

You want to inject errors on a random computer you don't
know anything about? Do you do that frequently? Why
are you doing this? 

Obviously there needs to be a way to identify to what

You can define relatively abstract interfaces.

It's just that EDAC is not it. They may not be perfectly
future proof (after all who knows what memories of quantum
computers or whatever will look like), but hopefully
at least reasonably forward looking.

e.g. for memory layout imho a reasonable way
is to just define it as

DIMM  (if you need below that look at a log) 
 \-------- silk screen label (most important attribute!)
 |
abstract path. This can be an arbitrary string. e.g. MC0/Ch1/DIMM0
 |             Or MC0/BOB0/Ch1/DIMM3
 |             Parsers don't need to know any details about it.
 |
socket

You can even represent that as a flat data structure,
no need to really map the abstract path to directories
(that just makes parsers difficult to write -- most sysfs
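
A flat representation along those lines might look like the sketch below.
The field names and the silk screen labels are invented; the two paths are
the examples from the diagram, and the path is deliberately never parsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimmRecord:
    label: str   # silk screen label, the most important attribute
    path: str    # abstract path, treated as an opaque string
    socket: int


dimms = [
    DimmRecord(label="DIMM_A1", path="MC0/Ch1/DIMM0", socket=0),
    DimmRecord(label="DIMM_C3", path="MC0/BOB0/Ch1/DIMM3", socket=1),
]


def label_for_path(records, path):
    """Look up the silk screen label without parsing the path."""
    for rec in records:
        if rec.path == path:
            return rec.label
    return None


print(label_for_path(dimms, "MC0/Ch1/DIMM0"))
```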

That's DMI on x86! 


What's the theory behind varying scrub rates? 

Yes, but why do you want to vary the rate?
Normally it should just depend on memory size and expected

That sounds odd -- if you have so many errors that you worry
about that you have other problems definitely? 
Is this based on some benchmarking?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Eric W. Biederman
Date: Monday, June 14, 2010 - 1:06 pm

It sounds like you can't be bothered to understand the EDAC code,
or the fact that some users actually like to know when their hardware

- In practice it works even without silk screen labels.
- The current EDAC code displays which DIMMs you have plugged
  in so you can tell if you unplug one, if it was the DIMM

DMI is great on the days it works, there are a lot of variations
between BIOSes.  Also if the information is decent it can be
used to inform the current EDAC code as well as anything else.

You mean an interface that doesn't report the error so people
won't complain to you about a near useless kernel error

Setting the scrub rate isn't half so interesting as displaying
it.

Having basic hardware information displayed in sysfs seems to be the
design of the rest of linux.  I don't see abandoning that part of the
EDAC design as wise.

Displaying the fact that ECC is turned on in the hardware is one
of the more interesting bits.  That at least allows you to verify

I will agree with that.  The argument that errors that should only
happen rarely need a high performance handler seems to indicate

If the basic errors could be posted in some kind of NMI/machine check
safe data structure it would not be hard to get EDAC drivers to
consume them.

Eric
--

From: Luck, Tony
Date: Monday, June 14, 2010 - 1:21 pm

Potentially some unnecessary reboot cycles needed:
 - power off, pull a DIMM, power on, check with EDAC
 - repeat until you get the right DIMM

This also assumes that you can unplug just one DIMM. Some
motherboards require pairs of DIMMs to be added/removed together.

-Tony
--

From: Andi Kleen
Date: Monday, June 14, 2010 - 1:36 pm

On Mon, Jun 14, 2010 at 01:06:59PM -0700, Eric W. Biederman wrote:


Binary search for bad DIMMs. The way to handle memory errors in
the 21st century.

Obviously that does not really work, especially not on large

No, DMI layout is unfortunately difficult to map to EDAC layout.
That's mostly EDAC's fault actually.


DMI[1] does not report the errors, the errors are in machine checks
(or possibly other non architectural registers) 
DMI just gives you enumeration. It doesn't give everything,
but it's reasonably complete at least.



I still would like to understand the idea behind this varying 

There are hundreds to thousands of BIOS level hardware knobs for memory
configuration (and if you count all BIOS knobs for everything far more) 

Why do you want to check a single bit only? (which is actually not
a single bit but also a lot of different ways to set this)

I can see there's a need to check that BIOS are doing the right
thing, but you'll never get that from a few sysfs fields.
You need a proper tool that is written for the system in question.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Tony Luck
Date: Monday, June 14, 2010 - 2:34 pm

There was a case mentioned at the collaboration summit
meeting where a BIOS bug mis-reported whether ECC was
enabled - claiming it was on, when in fact it was off.

Error injection could be used to check for another instance
of a lying BIOS (inject an error - make sure it gets counted).
Not as direct as seeing that the right bits are enabled in the
memory controller configuration registers, but still effective.
Perhaps more so as this technique validates different pieces
of the chipset specific code against each other. An EDAC
driver that tells you that ECC is enabled might be lying too,
if it is looking at the wrong bit or the wrong register.

-Tony
--

From: Andi Kleen
Date: Monday, June 14, 2010 - 11:44 pm

Yes I heard about that, but since it's not a single bit setting
there are lots of different ways it could be broken in theory.

To check it you really need to have a tool that knows about
all the registers and checks them all.

It's a bit like checking if someone speaks a foreign language

Yep.

It's asking a question with a one word answer where you don't
know the correct answer.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Andi Kleen
Date: Monday, June 14, 2010 - 11:56 pm

On Mon, Jun 14, 2010 at 04:46:40PM -0700, Doug Thompson wrote:


The way I envision it working is that an abstracted DIMM interface
(or edac2 or whatever you want to call it) can be fed from any reasonable
DIMM layout driver. This could be either DMI on x86 or some other
driver. There would be nothing really x86 specific about that.

That said, I think overall the focus for memory error handling
should be on smart event handling, not dumb accounting.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Nils Carlson
Date: Tuesday, June 15, 2010 - 1:06 am

Could you maybe provide some references on how DIMM layout
could be read from DMI? I can't find anything nearly this specific,
or is it something we're expecting to happen in future BIOSes?

Also, there would probably need to be some standard describing
different DIMM layouts in general, though maybe such a thing exists.

In other words, there would have to be some way of ascertaining
that the info you read from DMI is sufficient to decode MCEs so that
a faulting DIMM can be identified. In an ideal world, this could
be tested by some simple tool that could be run by the BIOS writers
to test that they're providing the OS with sufficient info.

/Nils
--

From: Borislav Petkov
Date: Tuesday, June 15, 2010 - 3:01 am

From: Nils Carlson <nils.carlson@ludd.ltu.se>

You cannot decode an ECC error to a DIMM using only DMI info - at least on
AMD you cannot. The MCE contains the physical address where the ECC error
happened and you need EDAC to convert this to a chip select row.
Additionally, you need the error syndrome, depending on the DRAM
controller's addressing mode.

Now, after you have the chip select row, you need to map this to a DIMM
rank and in order to do that, you need the DIMM info which is in the
SPD ROM (one of the data in the SPD is the DIMM rank which is needed
to unambiguously pinpoint which DIMM is generating those errors). Then
you can use the DMI info - assuming it contains the correct silk screen
labels on the motherboard - to map to a DIMM.
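
The chain described above (physical address, then chip select row, then
DIMM via the rank count from SPD, then silk screen label via DMI) can be
sketched as three table lookups. Every table here is invented, and the
sketch assumes one chip select row per rank with no interleaving, which
real memory controllers generally do not guarantee.

```python
# All tables below are invented for illustration; real decoding depends on
# the memory controller's addressing mode and interleaving.
CSROW_RANGES = [  # (start, end_exclusive, csrow)
    (0x0000_0000, 0x4000_0000, 0),
    (0x4000_0000, 0x8000_0000, 1),
    (0x8000_0000, 0xC000_0000, 2),
]
RANKS_PER_DIMM = [2, 1]              # as would be read from each DIMM's SPD
DMI_LABELS = ["DIMM_A1", "DIMM_B1"]  # silk screen labels as DMI would give


def csrow_for_address(addr):
    """Step 1: physical address to chip select row."""
    for start, end, csrow in CSROW_RANGES:
        if start <= addr < end:
            return csrow
    raise ValueError("address not covered by any chip select row")


def dimm_for_csrow(csrow):
    """Step 2: walk the DIMMs, consuming one chip select row per rank."""
    first = 0
    for dimm, ranks in enumerate(RANKS_PER_DIMM):
        if first <= csrow < first + ranks:
            return dimm
        first += ranks
    raise ValueError("chip select row beyond populated ranks")


def label_for_address(addr):
    """Step 3: DIMM index to silk screen label."""
    return DMI_LABELS[dimm_for_csrow(csrow_for_address(addr))]


print(label_for_address(0x9000_0000))
```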

What currently EDAC does is decode the ECC to a chip select - what we
need is some I2C/SMBus code which can read the SPD ROM. I haven't had
the time to look into it yet, though.

-- 
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.
--

From: Andi Kleen
Date: Tuesday, June 15, 2010 - 4:41 am

On Tue, Jun 15, 2010 at 10:06:33AM +0200, Nils Carlson wrote:


The hardware (or BIOS) tells you the DIMM. You read the DIMMs
from DMI and map them using the locators. The locator strings
are not standardized, but there are not too many different
formats around, so they can be implemented.

Again this does not give you full layout, but it gives
you a "path to a DIMM" and a DIMM locator. 
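
As an illustration of working with locators, the sketch below parses text
in the style that dmidecode prints for SMBIOS Memory Device entries and
maps each "Bank Locator" string to its "Locator" string. The sample text is
made up, the strings vary between BIOSes, and this is not how mcelog itself
does the mapping.

```python
# Invented sample in the style of dmidecode's Memory Device output.
SAMPLE = """\
Memory Device
\tSize: 4096 MB
\tLocator: DIMM_A1
\tBank Locator: NODE 0 CHANNEL 0 DIMM 0

Memory Device
\tSize: No Module Installed
\tLocator: DIMM_A2
\tBank Locator: NODE 0 CHANNEL 0 DIMM 1
"""


def parse_memory_devices(text):
    """Return one dict of 'Key: value' fields per 'Memory Device' block."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None and ": " in stripped:
            key, value = stripped.split(": ", 1)
            current[key] = value
    return devices


def locator_map(devices):
    """Map the bank locator (a 'path') to the slot locator (the label)."""
    return {d["Bank Locator"]: d["Locator"] for d in devices}


mapping = locator_map(parse_memory_devices(SAMPLE))
print(mapping["NODE 0 CHANNEL 0 DIMM 1"])
```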

An alternative is also to use the ACPI based reporting
mechanism which is needed on some systems. In this case
the CPER gives you a reference to the DMI object of the DIMM.

In principle DMI has more information (arrays, ranges etc.)
but in my experience that is not strong enough to really find
the DIMM on modern systems. You need hardware or BIOS help for this.


I don't think the goal is to have full DIMM layout. This will
never replace your schematics.

The goal is to find which DIMM has a problem. So have a path
and a locator. The path may tell you some additional information

That's difficult in a general way, you will probably always 
need some system specific test plan.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Nils Carlson
Date: Tuesday, June 15, 2010 - 5:21 am

Hi Andi,


Hmm.. From having a quick look at our boards I can conclude
that the information our BIOS puts in there is useless.
Will discuss further with our BIOS writers. They do
their own error detection during the boot in which they
decode to DIMMs, so obviously the information is in there

So what are we left with? Non-standardised locator strings
that may or may not be present at the mercy of the bios-writer?
I'm already feeling depressed. Re-writing EDAC to try to
make sense of this information seems overly risky.

I think in general that this is one of the wonderfull things
about linux, you're not so much at the mercy of BIOS-writers.
As soon as we start relying on the BIOS for functionality we're
encouraging the BIOS people to put more functionality in there,
and BIOS functionality is great, as long as there are no bugs!

But there are bugs. And correcting them is so prohibitively
expensive that I don't even want to think about it. And when
the BIOS messes up, it's the device driver writers who have to
magically workaround the problems.

Could we come up with some plan that doesn't involve
trusting to the goodwill (and competence) of BIOS writers?

I personally really like the device tree compiler for PowerPC.
It allows you to be explicit about what you have. Not for everyone,
but maybe there could be some way to apply the same principle? Maybe
some way of loading modules with parameters or configuring your setup
from sysfs?

/Nils
--

From: Luck, Tony
Date: Tuesday, June 15, 2010 - 11:15 am

That would be nice - but there already exists a platform
(Xeon-7500 series a.k.a. Nehalem-EX) where the hardware
chipset registers that you would need to do your own
memory topology reverse engineering in Linux are only
accessible to SMM level code.  I've finally come to the
conclusion that an EDAC style driver just isn't possible

Even when the chip set registers are accessible, it can be very
complex to do this for the general case (think of boards that
support arbitrary mixing of different size/speed DIMMs - the
BIOS may have done some interesting somersaults while computing
which interleaving modes to use).

Even more complex on high end systems when BIOS may handle row
sparing transparently to the OS. Memory mirroring is also
becoming fashionable - how can EDAC represent this (when
the h/w view of the memory doesn't match the OS view)?

-Tony


--

From: Nils Carlson
Date: Tuesday, June 15, 2010 - 11:38 am

Yes, I'm dreading the day they come to me telling me that
they've got one of those. On the one end you have hardware
people who love to put functionality there, and then you
have applications that have real-time requirements to
whom you have to explain that the latest and greatest
processor is broken for their purposes.

One day I'll use this as an excuse to migrate everyone to
PPC where people know that a bootloader is a bootloader.

But grudges against BIOSes aside, I don't know what to do
about Nehalem-EX systems. I guess at that point we really
Difficult questions. But at some point I wonder who will be buying
systems where finding out which DIMM is broken is so complex
that it requires a master's degree.

/Nils
--

From: Andi Kleen
Date: Tuesday, June 15, 2010 - 12:37 pm

... and the numbers that come out of this may have no relation
to your motherboard labels at all. What do you do then? 
Read schematics again?  Or do binary search on the DIMM again
like Eric suggested? 

For all of this a system specific mapping table is needed
and the only place to get this as a default option without 
explicit configuration for each motherboard is the BIOS.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Andi Kleen
Date: Tuesday, June 15, 2010 - 12:35 pm

In this case you would need the equivalent information
of a system specific DMI table in some device driver.

Do you see how this does not fly? How should a device
driver know more about the system than the BIOS?

And if you can load some specific table into the device
driver why can't you simply update the BIOS too?

Well you can supply your own if you're a power user
anyways, but most users are not power users. So it's no 
option as a default.

Or could you imagine a standard server getting installed
and asking with a desktop window "please enter the DIMM mappings

the problem is that the information is nowhere else.
If the BIOS doesn't know it Linux certainly doesn't know it either.

On the other hand if Linux uses this information there is certainly
an angle to get at least server vendors to fix their stuff
(and non servers do not matter for memory errors because they
run in non ECC mode anyways)

It's certainly in the server vendors own interest to supply correct
information here anyways. If they don't it will cost them in
unnecessary memory replacement costs.

BTW on the systems I have access to DMI seems to be largely 
correct these days. I guess your system is an unlucky exception.

Maybe your BIOS people will do something useful next generation.

Having a DMI override is no problem at all. ACPI uses this all the time
for example.

No need at all to speak a foreign language for this, even if it's your
mother tongue.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Nils Carlson
Date: Tuesday, June 15, 2010 - 1:48 pm

No, something is wrong with the BIOS. ;-)

<long snip>

Could you maybe give me an example from the board of your choosing
of a DMI table print out, explain the format and then show how to use  
it?
I'd like to show it to our BIOS writers. Ideally, maybe somebody could
post a good suggestion for a standardized format? (Though that's very
optimistic, but maybe cpu vendors can make suggestions to board makers?)

/Nils
--

From: Andi Kleen
Date: Wednesday, June 16, 2010 - 2:40 am

The only requirement the current mcelog parser has is
(that is what it actually uses, it parses more things but I abandoned
them):

- List of DIMMs (type 17) 
- It's useful if they have the correct size for display to the user.
- Correct serial/part numbers/manufacturer are also useful (for display), but
not strictly required.
- Locator should match the silk screen label of the DIMM on the board
- Bank Locator is in the format prefix_Node%u_Channel%u_Dimm%u
prefix can be arbitrary, but should not contain '_'
Node matching SOCKETID coming from CPU, Channel matching Channel, Dimm
matching Dimm number from CPU.
This requirement is the only extension over the standard.
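
As a rough illustration of the Bank Locator convention above, the
following Python sketch parses such strings and builds a lookup from
(node, channel, dimm) to the silk screen Locator. The sample type 17
entries and the helper names are made up for illustration; this is not
mcelog's actual code.

```python
import re

# Bank Locator convention described above: prefix_Node%u_Channel%u_Dimm%u,
# where the prefix is arbitrary but contains no '_'.
BANK_LOCATOR_RE = re.compile(r"^[^_]*_Node(\d+)_Channel(\d+)_Dimm(\d+)$")


def parse_bank_locator(bank_locator):
    """Return (node, channel, dimm), or None if the string does not match."""
    m = BANK_LOCATOR_RE.match(bank_locator.strip())
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


def build_dimm_map(entries):
    """Map (node, channel, dimm) -> Locator for the entries that parse."""
    dimm_map = {}
    for entry in entries:
        key = parse_bank_locator(entry["bank_locator"])
        if key is not None:
            dimm_map[key] = entry["locator"]
    return dimm_map


# Hypothetical type 17 entries, as a BIOS following the convention might
# report them. The last one does not follow the convention and is skipped.
SAMPLE_ENTRIES = [
    {"locator": "DIMM_A1", "bank_locator": "BANK_Node0_Channel0_Dimm0"},
    {"locator": "DIMM_B1", "bank_locator": "BANK_Node0_Channel1_Dimm0"},
    {"locator": "DIMM_C1", "bank_locator": "BANK 3"},
]

if __name__ == "__main__":
    dimm_map = build_dimm_map(SAMPLE_ENTRIES)
    # Socket 0, channel 1, DIMM 0 as reported by the CPU:
    print(dimm_map.get((0, 1, 0), "unknown DIMM"))
```

Entries whose Bank Locator does not match are simply left out of the
map, so a lookup for them falls back to "unknown DIMM" rather than
pointing at the wrong part.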

-Andi
--

From: Tony Luck
Date: Tuesday, June 15, 2010 - 3:33 pm

You could go one stage further and make DIMMs just one example of
a field replaceable unit.  So the "error analysis subsystem" would keep track
of errors reported by any component (cpu, DIMM, I/O card, fan, power
supply, disk, ...).  Each category could have different "X errors per Y
interval" parameter that made sense for it.

-Tony
--

From: Andi Kleen
Date: Wednesday, May 19, 2010 - 2:03 am

Experience disagrees with you (that is, I'm not sure about the average,
but at least there's a significant portion)


Error rates of good DIMMs scale roughly with the number of transistors.


A low steady rate of corrected errors on a large system
is expected.  In fact if you look at the memory error log
of a large system (towards TBs) it nearly always has some
memory related events.

In this case a log is not really useful. What you need

Same issue here: if something is truly broken it floods
you with errors.

First this costs a lot of time to process and it does not 
actually tell you anything useful because most errors in a flood
are similar.

Basically you don't care if you have 100 or 1000 errors, 
and you definitely don't want all of the errors filling up
your disk and using up your CPU.

Again a threshold with an action is much more useful here.
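
To make the flood argument concrete, here is a minimal sketch of
collapsing similar errors into counters and firing an action once a
threshold is crossed, instead of logging every event. The choice of
(socket, channel, dimm) as the "similar" key, the threshold of 100 and
the action callback are all assumptions for illustration.

```python
from collections import Counter


class FloodAggregator:
    """Count similar errors and fire an action once per key at a threshold."""

    def __init__(self, threshold, action):
        self.threshold = threshold
        self.action = action          # called as action(key, count)
        self.counts = Counter()
        self.triggered = set()        # keys whose action has already fired

    def record(self, socket, channel, dimm):
        # Errors from the same DIMM are treated as "similar" here.
        key = (socket, channel, dimm)
        self.counts[key] += 1
        if self.counts[key] >= self.threshold and key not in self.triggered:
            self.triggered.add(key)
            self.action(key, self.counts[key])


if __name__ == "__main__":
    fired = []
    agg = FloodAggregator(threshold=100,
                          action=lambda key, n: fired.append((key, n)))
    for _ in range(1000):             # a flood from one broken DIMM
        agg.record(0, 1, 0)
    agg.record(1, 0, 0)               # a single stray error elsewhere
    print(fired)                      # one action, not 1000 log lines
    print(dict(agg.counts))
```

The cost per event is one counter update, and the disk sees one record
per failing component rather than one per error.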

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Russ Anderson
Date: Monday, May 24, 2010 - 9:21 am

I agree with Andi.  While there are a wide range of users, the
vast majority know little about the hardware they are running
on.  Even in commercial settings, where users/admins are better
educated, there is little time to do detailed error analysis.

The more errors are detected/analyzed/corrected/recovered, the

Having the infrastructure to automatically off-line pages
is a good thing.  The details of where to set the predictive
threshold likely will be hardware specific (different DIMM



Yes, good points.

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com
--

From: Andi Kleen
Date: Monday, May 24, 2010 - 11:26 am

It's already there with a modern mcelog in daemon mode 

The current default in mcelog is 10 corrected errors per 24h
per 4k page or 1 uncorrected error on the page (if your CPU
supports recovering from that).  It is on by default. 

You can configure it to be different if you want.
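
A sketch of what such a per-page policy might look like, using the
defaults quoted above (10 corrected errors per 24h per 4k page, or 1
uncorrected error). This is not mcelog's implementation; the class name
and the sliding-window bookkeeping are assumptions.

```python
from collections import defaultdict, deque

PAGE_SHIFT = 12                 # 4k pages
WINDOW_SECONDS = 24 * 60 * 60   # 24h
CORRECTED_LIMIT = 10


class PageErrorPolicy:
    """Decide when a page should be off-lined, based on its error history."""

    def __init__(self, limit=CORRECTED_LIMIT, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self.history = defaultdict(deque)   # page frame -> error timestamps

    def should_offline(self, addr, now, uncorrected=False):
        if uncorrected:
            return True                     # a single uncorrected error is enough
        stamps = self.history[addr >> PAGE_SHIFT]
        stamps.append(now)
        # Drop corrected errors that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        return len(stamps) >= self.limit


if __name__ == "__main__":
    policy = PageErrorPolicy()
    # Ten corrected errors in the same 4k page, one per second.
    for second in range(10):
        verdict = policy.should_offline(0x1234000 + second, now=second)
    print(verdict)
```

Keying the history on the page frame rather than the full address means
errors at different offsets within a page count against the same limit.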

-Andi
--

From: Tony Luck
Date: Wednesday, May 19, 2010 - 10:30 am

On Tue, May 18, 2010 at 6:14 PM, Eric W. Biederman

Hard drives aren't really a similar situation ... we don't see any
of the low level errors from a modern hard drive because its
f/w handles the retries and block re-mapping transparently.
By the time something serious enough happens that it gets
reported to the OS, we pretty much already know that there
is a real problem.

We are still in the dark ages for memory errors where the OS
is expected to look at all the errors and figure out whether they
represent any kind of meaningful pattern that requires some
action to replace h/w components.

-Tony
--

From: Russ Anderson
Date: Monday, May 24, 2010 - 8:55 am

ia64 is good at detecting & recovering from memory uncorrectable
errors.  x86 is significantly behind, due to historically not
being able to recover from uncorrectable memory errors.  

ia64 had the Intel defined MCA Spec which defined the interaction
between SAL and the kernel.  x86 does not have a similar well
defined way of how errors should be handled.  It would be 

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com
--

From: Tony Luck
Date: Monday, May 24, 2010 - 10:35 am

X86 has machine check registers defined by the SDM. It also
has some f/w <-> OS interactions defined by the APEI sections
in the latest ACPI spec (chapter 17 of the 4.0a spec released
last month - see http://acpi.info). Some parts look cleaner than
the ia64 SAL spec. E.g. errors logged from before the current
OS booted are presented in the Boot Error Record Table instead
of just appearing among the stream of errors that SAL_GET_ERROR
provides to the OS without any way to distinguish current errors
from old ones.

-Tony
--

From: Andi Kleen
Date: Monday, May 24, 2010 - 11:31 am

I should add the Intel Software Developer's manual has quite
precise guidelines on what to do (and the Linux MCE code implements
nearly all of that faithfully).

The ACPI spec isn't quite as precise unfortunately.

-Andi
--

From: Ingo Molnar
Date: Tuesday, May 18, 2010 - 3:29 pm

That's possible too - the TRACE_EVENT() of MCE events, 
beyond the record format, also includes a human-readable 
ASCII output format string:

 # tail -1 /debug/tracing/events/mce/mce_record/format

 print fmt: "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, 
 ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, 
 PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", 
 REC->cpu, REC->mcgcap, REC->mcgstatus, REC->bank, 
 REC->status, REC->addr, REC->misc, REC->cs, REC->ip, 
 REC->tsc, REC->cpuvendor, REC->cpuid, REC->walltime, 
 REC->socketid, REC->apicid

Which could be used to printk events.
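
For illustration, a user-space tool could also turn a line rendered
with that format string back into fields. The sketch below assumes the
record is printed on a single line exactly as the format describes; the
sample line is synthetic, not taken from a real machine.

```python
import re

HEX = r"[0-9a-fA-F]+"

# One named group per REC-> field, in the order of the format string.
MCE_RE = re.compile(
    rf"CPU: (?P<cpu>-?\d+), "
    rf"MCGc/s: (?P<mcgcap>{HEX})/(?P<mcgstatus>{HEX}), "
    rf"MC(?P<bank>-?\d+): (?P<status>{HEX}), "
    rf"ADDR/MISC: (?P<addr>{HEX})/(?P<misc>{HEX}), "
    rf"RIP: (?P<cs>{HEX}):<(?P<ip>{HEX})>, "
    rf"TSC: (?P<tsc>{HEX}), "
    rf"PROCESSOR: (?P<cpuvendor>\d+):(?P<cpuid>{HEX}), "
    rf"TIME: (?P<walltime>\d+), "
    rf"SOCKET: (?P<socketid>\d+), "
    rf"APIC: (?P<apicid>{HEX})"
)

# Fields printed with %x-style conversions; the rest are decimal.
HEX_FIELDS = {"mcgcap", "mcgstatus", "status", "addr", "misc",
              "cs", "ip", "tsc", "cpuid", "apicid"}


def parse_mce_record(line):
    """Return a dict of integer fields, or None if the line does not match."""
    m = MCE_RE.search(line)
    if m is None:
        return None
    return {name: int(value, 16 if name in HEX_FIELDS else 10)
            for name, value in m.groupdict().items()}


# Synthetic example line, made up for this sketch.
SAMPLE = ("CPU: 2, MCGc/s: 1c09/0, MC8: 8c00004000010090, "
          "ADDR/MISC: 000000012345f000/0000000000000000, "
          "RIP: 00:<0000000000000000>, TSC: 1a2b3c4d, "
          "PROCESSOR: 0:106a5, TIME: 1274221740, SOCKET: 0, APIC: 4")

if __name__ == "__main__":
    rec = parse_mce_record(SAMPLE)
    print(rec["cpu"], rec["bank"], hex(rec["addr"]), rec["socketid"])
```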

Cheers,

	Ingo
--

From: Ingo Molnar
Date: Tuesday, May 18, 2010 - 12:30 pm

I proposed a (fairly straightforward) extension to which 
Boris agreed: we can introduce 'persistent events', which 
have task-less buffers attached to them, which will hold 
(a configurable amount of) events.

Those can then be picked up by a task later on and no 
event is lost.
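
As a toy user-space model of that idea (not the perf implementation):
events are recorded into a buffer that belongs to no task, and whichever
consumer shows up later drains them in order. The class name, the
overflow counter and the choice to reject new events once the
configured capacity is reached are assumptions of this sketch.

```python
from collections import deque


class PersistentEventBuffer:
    """Holds events until a consumer drains them; no task needs to exist yet."""

    def __init__(self, capacity):
        self.capacity = capacity      # the "configurable amount" of events
        self.events = deque()
        self.overflowed = 0           # events that arrived while the buffer was full

    def record(self, event):
        if len(self.events) >= self.capacity:
            self.overflowed += 1      # only happens if the buffer was sized too small
            return False
        self.events.append(event)
        return True

    def drain(self):
        """Hand all buffered events to a late-arriving consumer, oldest first."""
        drained = list(self.events)
        self.events.clear()
        return drained


if __name__ == "__main__":
    buf = PersistentEventBuffer(capacity=64)
    # Early boot: errors occur before any user-space daemon is running.
    buf.record({"type": "corrected", "addr": 0x1000})
    buf.record({"type": "corrected", "addr": 0x2000})
    # Later: the daemon starts and picks up everything that was buffered.
    for event in buf.drain():
        print(event)
```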

Would such a feature address your concern?

It would be useful not just for reliable error event 
collection, it could also be used for things like the boot 
tracer (which too deals with events that occur before 
there are any user-space tasks to pick up events).

I.e. it fits into the whole scheme in a pretty natural, 
multi-purpose way.

Thanks,

	Ingo
--

From: Ingo Molnar
Date: Tuesday, May 18, 2010 - 1:42 pm

Tony, should we accelerate the development of this 
persistent events sub-feature?

Boris posted initial patches of the new perf events based 
EDAC/MCE/RAS design direction to lkml and indicated that 
it works for him. He also indicated that he can do the 
initial work of unifying EDAC and MCE without the 
persistent events feature for now. (this all is obviously 
v2.6.36-ish material)

But if it's important, if you'd like to move ahead with 
the unification swiftly then we can certainly increase its 
priority.

Also, a few notes:

1) the new RAS tool itself might or might not be part of 
tools/perf/ - for the prototype it certainly makes sense 
to be there but otherwise feel free to start tools/ras/ 
and share code with tools/perf/ but otherwise keep a 
separate RAS tool-space.

2) There's a new perf feature (that went upstream today) 
that is of EDAC/RAS interest: the ability to do live 
tracing. This is basically a daemon-alike, 
event->policy-action based flow that RAS eventing is 
about.

3) Another new perf feature of interest is 'perf inject' 
(this too went upstream today): to inject artificial 
events into the stream of events. This mechanism could be 
used to simulate rare error conditions and to test out 
policy reactions systematically - an important part of 
system error recovery testing.

4) We are working on enumerating events via sysfs, not via 
debugfs. This would make the events provided by EDAC/MCE 
more generally available. See Lin Ming's patches on lkml:

  Subject: [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs

Please chime in on that thread to make sure the event_source 
class is suitable to describe EDAC/MCE event sources as 
well. Any event_source that is made available by drivers 
can then be used by tools for event transport.

This gives us a broad platform to add various RAS events 
as well, beyond raw hardware events: we could for example 
add events for various system anomalies such as lockup 
messages, kernel ...
From: Tony Luck
Date: Tuesday, May 18, 2010 - 2:37 pm

The persistent event feature sounds like it will solve

We've missed the deadlines for inclusion in certain
popular distributions ... so it may be OK to take a
relatively leisurely path to getting this done right

Simulated errors are handy for testing the very
top level of the s/w stack. But real errors are
better. There's some APEI code in Len's tree
that can inject real errors (on systems with the

This looks like sticky ground.  I can see the event mechanism
passing data to a user daemon working well for all kinds of
corrected and minor errors. But when you start talking about
lockups and fatal errors things get a lot trickier. Often the
main concern at this point is error containment. Making sure
that the flaky data doesn't become visible (saved to storage,
transmitted to the network, etc.). Getting from a machine check
handler through some context switches (and page
faults etc.) to a user level daemon before the error

In a cluster/cloud/datacenter that daemon will need to be
networked and hooked to the system management tools
that are controlling the bigger environment. But I agree
that this looks like a worthy end goal.

-Tony
--

From: Ingo Molnar
Date: Tuesday, May 18, 2010 - 3:00 pm

I was pointing beyond the narrow hardware (memory) error 
point of view, towards a more generic 'system health' 
thinking.

In the broader view it may make sense to, for example, 
define policy over excessive number of segfaults on a 
server system (where excessive segfaults are an anomaly), 
or a suspiciously large number of soft IO errors, etc.

But yes, of course, when it comes to hard memory errors, 
those take precedence, and handling them (and 
saving/propagating information about them while we still 

As Boris mentioned it too, critical policy action can and 
will be done straight in the kernel.

	Ingo
--

From: Russ Anderson
Date: Monday, May 24, 2010 - 10:13 am

That is how it is done in ia64.  The MCA interrupt 
handler does the low level handling.  It makes sure
all the cpus have rendezvoused, looks at the MCA record
to determine what happened and does whatever recovery 
steps are needed, such as kill the application.


-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com
--

From: Ingo Molnar
Date: Tuesday, May 18, 2010 - 11:39 pm

Agreed, hardware assisted error injection is by far the 
best and most complete solution.

Thanks,

	Ingo
--

From: Borislav Petkov
Date: Tuesday, May 18, 2010 - 6:06 am

From: Mauro Carvalho Chehab <mchehab@redhat.com>

Have the old list subscribers been moved to the new list or do they
need to re-subscribe?

Thanks.

-- 
Regards/Gruss,
Boris.

Operating Systems Research Center
Advanced Micro Devices, Inc.
--

From: Mauro Carvalho Chehab
Date: Tuesday, May 18, 2010 - 9:52 am

I suspect that you'll need to subscribe on the new ML.

-- 

Cheers,
Mauro
--

From: Mauro Carvalho Chehab
Date: Tuesday, May 18, 2010 - 10:06 am

The current i7core_edac driver is ready for merge upstream, using the current
edac API. It supports the following processor families:
	i7core, i5core, Lynnfield, Nehalem, Nehalem-EP and Westmere-EP

The tree is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core.git linux_next

This driver doesn't support Nehalem-EX. From the discussions we had during
the mini-summit, the MCU of -EX family is very different, so, a separate
driver will be required for it.

Please review. My plan is to submit this driver for upstream merge this week.

-- 

Cheers,
Mauro
--
