During the last LF Collaboration Summit, we held a mini-summit [1] intended to improve hardware error detection in the kernel, currently provided by the MCE and EDAC subsystems. The idea for this mini-summit came up after Thomas Gleixner and Ingo Molnar suggested that EDAC and MCE should converge into a single error subsystem.

I'm enclosing the minutes of the meeting so that they can be reviewed by other kernel hackers who are interested in the topic but unfortunately couldn't attend.

Btw, during the meeting it was decided that the EDAC ML would work better if moved to vger, so I'm copying both the old and the new EDAC mailing lists here.

[1] http://events.linuxfoundation.org/lfcs2010/edac

---

I Hardware Error Kernel Mini-Summit
===================================

April, 15 - San Francisco, CA, US
2010 Linux Foundation Collaboration Summit

Attendees:
    Ben Woodard - Red Hat
    Brent Young - Intel
    Doug Thompson - LLNL
    Mark Grondona - LLNL
    Matt Domsch - Dell
    Mauro Chehab - Red Hat
    Tony Luck - Intel

After some initial description of the current state of error handling in Linux, we moved on to the requirements and high level design of the system going forward.

Requirements
============

First and foremost, end-users (presumably system administrators) need notification of the hardware component that is the source of each error. Ideally this should include "silk screen" markings so that the user can identify which component is at fault.

Other requirements may vary amongst different types of end users, but include:

+ Minimal disruption to system performance when logging corrected errors.
+ Assurance that h/w error detection mechanisms are correctly configured and enabled.
+ System topology information

With respect to system topology, it was pointed out that LLNL is concerned about being sure that ECC is enabled, as some BIOSes lied about that in the past. Also, memory topology information is needed to ...
Just for the innocent readers who might be misled by this: Nehalem-EP DIMM error accounting already works fine today using mcelog for most cases, including RHEL5.5 (with some limits) and RHEL6beta, with no additional changes needed. In RHEL6 the daemon does the accounting and the client reports the errors separately for each DIMM, split into uc and ce. In RHEL5 the information is in a log file and can be retrieved from there. In addition the daemon supports various advanced RAS features including

Already done too, see http://permalink.gmane.org/gmane.linux.acpi.devel/45743 However the interface won't give you the topology you're asking for, just the errors.

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
Ok. It should be clear that the main target of the mini-summit is to define how the several subsystems will integrate into a hardware-abstracted way of reporting errors from the kernel. So we're looking at the next steps to improve what we currently have, and to avoid having more than one subsystem trying to get the same info, possibly using the same registers, but providing different interfaces to userspace. -- Cheers, Mauro --
Well there are different use cases. mcelog mainly deals in thresholds (including fancy ones like per page and per object thresholds) and events and actions on thresholds (= more events); all your proposals are currently dealing with object counts. It does per object counting too, but only incidentally. I suspect there are use cases for both, although I personally suspect that for most people events, thresholds and their actions are the most useful thing to handle by default. But one size doesn't fit all. Anyway, it boils down to needing different interfaces for different things. For example there will always be events versus accounting. You can synthesize accounting from events (that is what mcelog does today). The other way round does not work so well unfortunately, or at least would be rather inefficient. Also large parts of the actions can only usefully be done in user space, so you need a user space component. I am somewhat biased of course, but I think mcelog is doing a reasonably good job today at being this user space component. It definitely has areas that could be improved too, but a lot of the basics are there and doing ok. In principle mcelog could feed from another API too, but it would definitely prefer not to have to poll it or parse printks. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
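To illustrate the point that accounting can be synthesized from events, here is a minimal sketch in Python. It is not mcelog's code: the `ErrorEvent` record, its field names and the `account` helper are assumptions made up for this example. It only shows that a stream of events folds naturally into per-DIMM ce/uc counters, while the reverse (recovering individual events from counters) is not possible without polling and diffing.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class ErrorEvent(NamedTuple):
    """Hypothetical event record; the field names are illustrative only."""
    dimm: str        # e.g. a silk screen label
    corrected: bool  # True for a corrected error (ce), False for uncorrected (uc)


def account(events: Iterable[ErrorEvent]) -> Dict[str, Counter]:
    """Fold an event stream into per-DIMM ce/uc counters."""
    totals: Dict[str, Counter] = {}
    for ev in events:
        kind = "ce" if ev.corrected else "uc"
        totals.setdefault(ev.dimm, Counter())[kind] += 1
    return totals
```

A consumer that only wants counts can call `account()` on whatever event source it reads; a consumer that wants thresholds and actions can work on the same events directly.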
Thank you very much for providing this report. I agree that we should have a well organized error subsystem that covers all error sources in the system and that provides a sufficiently simple and powerful API for users. As one of the interested absentees, I think I could be of some help to you (e.g. with x86 low level code). It might be off-topic here, but I'd like to point out that you missed the PCIe AER subsystem, which handles hardware errors on PCIe devices nowadays (it works well on ppc, x86 and so on). Given that APEI also covers PCIe errors and that some systems can map MC registers to PCI configuration space, I think there is no way for the new error subsystem to ignore I/O device errors while it cares about errors on CPU/memory and cooperates with APEI. Thanks, H.Seto --
Yes, it makes sense to also integrate the PCIe AER subsystem. IMO, the first step is to provide an error core integrated into perf, and then start integrating the several error systems around it. -- Cheers, Mauro --
That's the original plan. It was suggested by Ingo and Thomas on LKML. Borislav also sent a more technical proposal about it. It actually makes sense, since some sorts of errors may affect performance (those non-fatal errors that are auto-recovered). Also, using debugfs and the same kind of logic used by perf to filter errors seems pertinent for hardware errors. As the actual patches have not been written yet, the details of how those things will integrate will depend on further analysis. -- Cheers, Mauro --
For a different perspective on this see also http://permalink.gmane.org/gmane.linux.kernel/952061 AFAIK all the issues mentioned there are still open. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Yup, that's why I asked for an explanation. What was offered lacks a useful goal and design detail. perf is a cool tool, but probably not necessary for a HW error reporting system. --
It makes sense to use the kernel's performance events logging framework when we are logging events about how the system performs. Furthermore it's NMI safe, offers structured logging, and has various streaming, multiplexing and filtering capabilities that come in handy for RAS purposes and more. The other option would be to use an ad-hoc logging implementation, only used for EDAC/RAS, which couldn't be mixed with other system events. That approach has various obvious disadvantages, so we are aiming for a unified approach. Thanks, Ingo --
Perhaps it makes more sense to say that the Linux "performance events logging framework" has become more generic and is really Those of us present at the mini-summit were not familiar with all the features available. One area of concern was how to be sure that something is in fact listening to and logging the error events. My understanding is that if there is no process attached to an event, the kernel will just drop it. This is of particular concern because the kernel's first scan of the machine check banks occurs before there are any processes. So errors found early in boot (which might be saved fatal errors from before the boot) might be lost. -Tony --
From: "Luck, Tony" <tony.luck@intel.com> Well, we have a trace_mce_record tracepoint in the mcheck code which calls all the necessary callbacks when an mcheck occurs. For the time being, the idea is to use the mce.c ring buffer for early mchecks and copy them to the regular ftrace per-cpu buffer after the latter has been initialized. Later, we could switch to another early bootmem buffer if there's a need to. Also, we want to have a userspace daemon that reads out the mces from the trace buffer and does further processing, like thresholding etc., in userspace. Concerning critical errors, there we bypass the perf subsystem and execute the smallest amount of code possible while trying to shut down gracefully if the error type allows that. These are the rough ideas at least... -- Regards/Gruss, Boris. Operating Systems Research Center Advanced Micro Devices, Inc. --
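The "buffer early, replay once the regular buffer exists" idea can be sketched in a few lines of userspace Python. This is only a model of the control flow, not kernel code: `EarlyEventBuffer`, `record` and `attach` are hypothetical names, and dropping the oldest entry on overflow is a choice made for this sketch, not a statement about how the mce.c ring buffer behaves.

```python
from collections import deque
from typing import Callable, Deque, Optional


class EarlyEventBuffer:
    """Holds events until a consumer exists, then passes them straight through."""

    def __init__(self, capacity: int) -> None:
        # Bounded, like a fixed-size boot-time buffer. Assumption of this
        # sketch: when the buffer is full, the oldest entry is dropped.
        self._early: Deque[str] = deque(maxlen=capacity)
        self._consumer: Optional[Callable[[str], None]] = None

    def record(self, event: str) -> None:
        """Store the event if nobody is listening yet, else deliver it."""
        if self._consumer is None:
            self._early.append(event)
        else:
            self._consumer(event)

    def attach(self, consumer: Callable[[str], None]) -> None:
        """Replay everything seen so far, in order, then go live."""
        while self._early:
            consumer(self._early.popleft())
        self._consumer = consumer
```

With this shape, errors found before any listener exists are not lost as long as they fit in the early buffer, which is the concern raised about the first scan of the machine check banks.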
The end result would be even simpler by one more step: with persistent events we just use them and don't need the mce.c ringbuffer at all. (getting rid of that complication is one of the code cleanliness benefits i see in this move as an x86 maintainer - beyond the obvious generalization Yeah. Each perf_event can have arbitrary callbacks with add-on (or critical) functionality. We would activate the event(s) during bootup and it would do its thing from that point on: critical functionality gets a direct path via the callback, and every other event that survives goes via the regular perf output channels, to one (or more) consumers/subscribers of these events. Ingo --
Can someone please tell me why everyone is eager to squirrel correctable error reports away and not report them in dmesg? aka syslog. I have had on several occasions a machine with memory errors that mcelog or the BIOS was eating the error reports and not putting them anywhere a normal human being would look. If your system isn't broken correctable errors are rare. People look at syslog. People look in /var/log/messages and dmesg when something goes weird. I have no problem with additional interfaces to provide additional functionality but please can we put errors where people can find them. Eric --
The original motivation for putting them somewhere else was that I was sick of people reporting them as kernel bugs. Actually the more memory you have the more common they are. And the trend is towards more and more memory. Really, to do anything useful with them you need trends and automatic actions (like predictive page offlining). A log isn't really a good format for that. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
This suggests that to get things reported in dmesg I should set up a cron job that pulls the latest kernel checks to see if things are reported into syslog and sends you an email if things are wrong.

I'm not ready to believe the average person that is running linux is too stupid to understand the difference between a hardware

The error rate should not be fixed per bit but should be roughly fixed per DIMM. If the error rate over time is fixed per bit we are in deep

Not at all, and I don't have a clue where you start thinking predictive page offlining makes the least bit of sense. Broken

A log is a fine format for realizing you have a problem. A log doesn't need to be the only place errors are reported, but a log should be the default place ECC errors are reported. We do that with hard drive errors and other kinds of hardware errors and we have done it for years without problems.

My experience is that correctable ECC errors come in two kinds of frequencies.

- The expected single bit correctable error range, which is somewhere between once a month and once a year per DIMM. On the most unreasonable configuration I ever worked with, 4TB of RAM in 1GB sticks up at Los Alamos, at 7000ft in an environment known to trigger errors, I saw roughly one correctable ECC error an hour. Huge, but just barely within the expected range. I can live with a log message once a month on a mundane system.

- Errors that occur frequently. That is broken hardware of one type or another. I want to know about that so I can schedule down time to replace my memory before I get an uncorrected ECC error. Errors of this kind are likely happening frequently enough as to impact performance.

Eric --
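The Los Alamos figure can be converted to a per-DIMM rate with a few lines of arithmetic. The sketch below assumes 4TB means 4096 1GB sticks and that the one-error-per-hour rate held steady over a full 365-day year; both are simplifications introduced here, and the function name is made up for the example.

```python
HOURS_PER_YEAR = 24 * 365


def errors_per_dimm_per_year(total_ram_gb: float, dimm_size_gb: float,
                             errors_per_hour: float) -> float:
    """Spread a system-wide error rate evenly over all DIMMs."""
    dimms = total_ram_gb / dimm_size_gb
    return errors_per_hour * HOURS_PER_YEAR / dimms


# 4TB in 1GB sticks, roughly one correctable error an hour system-wide.
rate = errors_per_dimm_per_year(4 * 1024, 1, 1.0)
print(f"{rate:.2f} errors per DIMM per year")
```

Under those assumptions the result is 8760 / 4096, a little over two errors per DIMM per year, which falls between the "once a year" and "once a month" per-DIMM bounds quoted in the message.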
From: "Eric W. Biederman" <ebiederm@xmission.com> This is exactly the reason why we need a better error logging and reporting than a log. How do you want to discover trends and count CECCs per DIMM if you scan the logs all the time and grep for the DRAM page it happened, the CS row it is located in and whether this is located in the same DIMM as the 115th error back in the log? This gets especially tricky if you're using one of the gazillion memory interleaving schemes. Ok, and what about other errors like L3 cache errors, for example? You want to count those too and upon reaching a threshold disable a cache index _before_ it turns a correctable ECC into an uncorrectable error bringing the whole system down with a critical MCE. How about error injection, you want to test the hardware/software with injecting real hardware errors and not simulating it all in software. And also you want to be able to schedule different maintenance actions depending on the severity of the error and in certain cases get away with a clean shutdown even in the face of an uncorrectable error. So, the whole idea entails much more than reporting errors in the syslog but rather making the system intelligent enough to prolong its own life and be able to warn the user that something bad is about to happen. And we don't have that right now - right now we say that some machine checks have been logged and with uncorrectable MCEs we freeze cowardly and hope to be able to make a warm reset so that the MCA MSRs still contain some valid data which we can decode painstakingly by hand. I hope this makes our intentions a bit clearer. -- Regards/Gruss, Boris. Operating Systems Research Center Advanced Micro Devices, Inc. --
Basically the idea behind the generic structured logging framework (the perf events kernel subsystem) is to have both ASCII output (where desired: critical errors), but to also have well-specified event format parsable to user-space tools. Plus there's the need for fast, lightweight, flexible event passing mechanism - which is given by the perf events transport which enables arbitrary size in-memory ring-buffers, poll() and epoll support, etc. perf events supports all these different usecases and comes with a (constantly growing) set of events already defined upstream. We've got more than a dozen different upstream subsystems that have defined events and we have over a hundred individual events. There's a rapidly growing tool space that makes case by case use of these event sources to measure/observe various aspects of the system. Regarding dmesg, there's a WIP patch on lkml that integrates printks into this framework as well - makes each printk also available as a special string event. That way a tool can have both programmatic access to printk output (without having to interact with the syslog buffer itself) - together with all the other structured log sources, while humans can also see what is happening. Thanks, Ingo --
Some system admins prefer to have everything on dmesg, as they can enable a serial console, and catch the logs remotely, even when the machine crashes for example due to a hardware failure. So, IMHO, one feature that the perf event needs is the capability to report errors via a serial console also, or a mechanism where some events are sent via dmesg. -- Cheers, Mauro --
Yeah. That can be an aspect of the callback - or might even be integrated into the core code. Thanks, Ingo --
Just left the above for reference. How would this affect other aspects of EDAC such as the error injection, the sysfs entries that (in most cases) reflect the layout of DIMMs, and the ability to set the scrub rate? If we're just talking about replacing all instances of printk (when logging single bit errors) with perf events, I don't really see that as a problem. But EDAC is much more than that today... Thoughts, comments? /Nils Carlson --
Some of this can probably be retained, but the way EDAC e.g. represents layout is quite unsuitable too. It includes a lot of internal implementation details that in some cases you can't even get anymore on modern designs. Something with a proper abstract interface is better. EDAC never had this. Also the biggest problem is still that EDAC doesn't give you any silk screen labels, so unless you have motherboard schematics the layout it presents is fairly useless -- you still don't know which DIMM to exchange. So in theory EDAC looks great, but in practice ... On a lot of modern systems I checked DMI seems reasonably accurate in terms of layout, so I suspect they can be handled with this. Others probably still need some special driver, but one with a proper interface.

For error injection: some modern systems support this through ACPI EINJ, which has a separate non-EDAC interface. For others I've been simply using some scripts that twiddle the bits from user space. You can do that with a shell script. If it was staying in the kernel it could probably be moved into a proper error injection framework that is not arbitrarily tied to memory. Lots of different devices have error injection support and exposing some of that in a general framework would likely make sense. Anyway, the old EDAC drivers for this are not going away, you can still use them. The interesting question though is how to properly define the interface

I never quite saw the point of that one, but yes, there's no replacement for this anywhere else. Normally scrub rate can be simply set in the BIOS, is that not good enough? Is there a use case for changing it dynamically? Note that modern hardware typically has demand scrubbing anyway, that is when there is an error it automatically

I don't think perf is the right tool for this, the semantics are mostly unsuitable (it hasn't been designed as an error reporting tool, but as a performance tool and performance events are quite different ...
I do have motherboard schematics, or rather, we build our own boards. But the point is valid, a lot of people don't make their own hardware. On the other hand, the people who do use this part of

This is true, and this is the way things are going on our end as well. I guess that would mean one driver that hooks into all frameworks though? So you wouldn't go to the EDAC sysfs directory to find everything to do with the same piece of hardware anymore, but would have to go to n different directories looking for all the pieces? I don't really

But all new hardware will look the way the hardware designers want it to, so our interface will be a moving target? Maybe it's time to let hardware makers provide a board specification with device tree and memory

There is a use-case. A lot has to do with how different patrol scrub rates work: some just go through memory at a constant speed (MB/s), others vary according to load. The thing is, different applications want their memory scrubbed within different time frames, and as the amount of memory on boards varies and the BIOS doesn't vary, this implies the need for setting the scrub rate from userspace. Patrol scrubbing is normally used because it discovers errors faster in seldom accessed memory, allowing a DIMM with too many errors to be replaced faster. Some applications like to use demand scrubbing as well, and some consider it to increase memory latency too much.

Oh, a hodge podge is much more than just single bit correctable error reporting... :-) You never know what you'll find in the sysfs directory for a given memory controller. /Nils Carlson --
Most users do not build their own boards and do not have schematics. And that's not home computer users. Anyway, I think the important thing is that by default you get something useful (including silk screen labels) without doing any special configuration steps. Right now DMI is the only sane option for this that I can see. EDAC doesn't do it because it has no silk screen labels. And yes if someone is a power user they could still override

Let me try to understand that. You want to inject errors on a random computer you don't know anything about? Do you do that frequently? Why are you doing this? Obviously there needs to be a way to identify to what

You can define relatively abstract interfaces. It's just that EDAC is not it. They may not be perfectly future proof (after all who knows what memories of quantum computers or whatever will look like), but hopefully at least reasonably forward looking. e.g. for memory layout imho a reasonable way is to just define it as

DIMM (if you need below that look at a log)
\-------- silk screen label (most important attribute!)
| abstract path. This can be an arbitrary string. e.g. MC0/Ch1/DIMM0
| Or MC0/BOB0/Ch1/DIMM3
| Parsers don't need to know any details about it.
| socket

You can even represent that as a flat data structure, no need to really map the abstract path to directories (that just makes parsers difficult to write -- most sysfs

That's DMI on x86!

What's the theory behind varying scrub rates?

Yes, but why do you want to vary the rate? Normally it should just depend on memory size and expected

That sounds odd -- if you have so many errors that you worry about that you have other problems definitely? Is this based on some benchmarking?

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
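A flat representation along the lines proposed above could look like the following Python sketch. The `DimmRecord` type, its field names, the `find_by_path` helper and the label values used in the usage note are all hypothetical; the only things taken from the message are the three attributes (silk screen label, abstract path, socket) and the rule that consumers treat the path as an opaque string.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DimmRecord:
    """One flat record per DIMM; no directory hierarchy is implied."""
    label: str   # silk screen label, the attribute users act on
    path: str    # abstract path, e.g. "MC0/Ch1/DIMM0"; opaque to parsers
    socket: int


def find_by_path(records: Iterable[DimmRecord], path: str) -> Optional[DimmRecord]:
    """Look up a DIMM by comparing the path as a whole string.

    The path is never split or interpreted, so a new topology level
    (such as the BOB0 component in "MC0/BOB0/Ch1/DIMM3") needs no parser change.
    """
    for rec in records:
        if rec.path == path:
            return rec
    return None
```

For instance, given `DimmRecord("DIMM_A1", "MC0/Ch1/DIMM0", 0)` and `DimmRecord("DIMM_C3", "MC0/BOB0/Ch1/DIMM3", 1)`, a lookup of either path returns the record whose `label` tells the administrator which stick to pull.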
It sounds like you can't be bothered to understand the EDAC code, or the fact that some users actually like to know when their hardware - In practice it works even without silk screen labels. - The current EDAC code displays which DIMMS you have plugged in so you can tell if you unplug one, if it was the DIMM DMI is great on the days it works, there is a lot of variations between BIOS's. Also if the information is decent it can be used to inform the current EDAC code as well as anything else. You mean an interface that doesn't report the error so people won't complain to you about a near useless kernel error Setting the scrub rate isn't half so interesting as displaying it. Having basic hardware information displayed in sysfs seems to be the design of the rest of linux. I don't see abandoning that part of the EDAC design as wise. Displaying the fact that ECC is turned on in the hardware is one of the more interesting bits. That at least allows you to verify I will agree with that. The argument that errors that should only happen rarely need a high performance handler seems to indicate If the basic errors could be posted in some kind of NMI/machine check safe data structure it would not be hard to get EDAC drivers to consume them. Eric --
Potentially some unnecessary reboot cycles needed: - power off, pull a DIMM, power on, check with EDAC - repeat until you get the right DIMM This also assumes that you can unplug just one DIMM. Some motherboards require pairs of DIMMs to be added/removed together. -Tony --
On Mon, Jun 14, 2010 at 01:06:59PM -0700, Eric W. Biederman wrote:

Binary search for bad DIMMs. The way to handle memory errors in the 21st century. Obviously that does not really work, especially not on large

No, DMI layout is unfortunately difficult to map to EDAC layout. That's mostly EDAC's fault actually.

DMI[1] does not report the errors, the errors are in machine checks (or possibly other non architectural registers). DMI just gives you enumeration. It doesn't give everything, but it's reasonably complete at least.

I still would like to understand the idea behind this varying

There are hundreds to thousands of BIOS level hardware knobs for memory configuration (and if you count all BIOS knobs for everything, far more). Why do you want to check a single bit only? (which is actually not a single bit but also a lot of different ways to set this) I can see there's a need to check that BIOSes are doing the right thing, but you'll never get that from a few sysfs fields. You need a proper tool that is written for the system in question.

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
There was a case mentioned at the collaboration summit meeting where a BIOS bug mis-reported whether ECC was enabled - claiming it was on, when in fact it was off. Error injection could be used to check for another instance of a lying BIOS (inject an error - make sure it gets counted). Not as direct as seeing that the right bits are enabled in the memory controller configuration registers, but still effective. Perhaps more so as this technique validates different pieces of the chipset specific code against each other. An EDAC driver that tells you that ECC is enabled might be lying too, if it is looking at the wrong bit or the wrong register. -Tony --
Yes I heard about that, but since it's not a single bit setting there are lots of different ways it could be broken in theory. To check it you really need to have a tool that knows about all the registers and checks them all. It's a bit like checking if someone speaks a foreign language Yep. It's asking a question with a one word answer where you don't know the correct answer. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
On Mon, Jun 14, 2010 at 04:46:40PM -0700, Doug Thompson wrote: The way I envision it working is that an abstracted DIMM interface (or edac2 or whatever you want to call it) can be fed from any reasonable DIMM layout driver. This could be either DMI on x86 or some other driver. There would be nothing really x86 specific about that. That said, I think that overall memory error handling should focus on smart event handling, not dumb accounting. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Could you maybe provide some references on how DIMM layout could be read from DMI? I can't find anything nearly this specific, or is it something we're expecting to happen in future BIOSes? Also, there would probably need to be some standard describing different DIMM layouts in general, though maybe such a thing exists. In other words, there would have to be some way of ascertaining that the info you read from DMI is sufficient to decode MCEs so that a faulting DIMM can be identified. In an ideal world, this could be tested by some simple tool that could be run by the BIOS writers to test that they're providing the OS with sufficient info. /Nils --
From: Nils Carlson <nils.carlson@ludd.ltu.se> You cannot decode an ECC to a DIMM only using DMI info - at least on AMD you cannot. The MCE contains the physical address where the ECC happened and you need EDAC to convert this to a chip select row. Additionally, you need the error syndrome depending on the dram controllers addressing mode used. Now, after you have the chip select row, you need to map this to a DIMM rank and in order to do that, you need the DIMM info which is in the SPD ROM (one of the data in the SPD is the DIMM rank which is needed to unambiguously pinpoint which DIMM is generating those errors). Then you can use the DMI info - assuming it contains the correct silk screen labels on the motherboard - to map to a DIMM. What currently EDAC does is decode the ECC to a chip select - what we need is some I2C/SMBus code which can read the SPD ROM. I haven't had the time to look into it yet, though. -- Regards/Gruss, Boris. Operating Systems Research Center Advanced Micro Devices, Inc. --
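The last two steps of the chain described above (chip select row to DIMM, then DIMM to silk screen label) can be modelled as a pair of table lookups. The Python sketch below is a toy model under a loud assumption: chip selects are handed out consecutively, one per rank, in slot order. Real controllers are free to do otherwise, and the function names and the example label strings are invented for illustration. What the sketch does show is why the per-DIMM rank count from the SPD is needed: without it a chip select row cannot be attributed to a slot unambiguously.

```python
from typing import Sequence


def csrow_to_slot(csrow: int, ranks_per_slot: Sequence[int]) -> int:
    """Map a chip select row to a slot index.

    ASSUMPTION (illustrative only): chip selects are assigned consecutively,
    one per rank, walking the slots in order. ranks_per_slot stands in for
    the rank counts that would be read from each DIMM's SPD ROM.
    """
    first = 0
    for slot, ranks in enumerate(ranks_per_slot):
        if first <= csrow < first + ranks:
            return slot
        first += ranks
    raise ValueError(f"chip select row {csrow} is not populated")


def csrow_to_label(csrow: int, ranks_per_slot: Sequence[int],
                   labels: Sequence[str]) -> str:
    """Final step: slot index to silk screen label (stand-in for DMI data)."""
    return labels[csrow_to_slot(csrow, ranks_per_slot)]
```

With rank counts `[2, 1, 2]`, chip select row 2 lands on the second slot; had the first DIMM been single-rank, the same row would point at the third slot instead.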
On Tue, Jun 15, 2010 at 10:06:33AM +0200, Nils Carlson wrote:

The hardware (or BIOS) tells you the DIMM. You read the DIMMs from DMI and map them using the locators. The locator strings are not standardized, but there are not too many different formats around, so they can be implemented. Again this does not give you the full layout, but it gives you a "path to a DIMM" and a DIMM locator. An alternative is to use the ACPI based reporting mechanism, which is needed on some systems. In this case the CPER gives you a reference to the DMI object of the DIMM. In principle DMI has more information (arrays, ranges etc.) but in my experience that is not strong enough to really find the DIMM on modern systems. You need hardware or BIOS help for this.

I don't think the goal is to have the full DIMM layout. This will never replace your schematics. The goal is to find which DIMM has a problem. So have a path and a locator. The path may tell you some additional information

That's difficult in a general way, you will probably always need some system specific test plan.

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
Hi Andi,

Hmm.. From having a quick look at our boards I can conclude that the information our BIOS puts in there is useless. Will discuss further with our BIOS writers. They do their own error detection during boot, in which they decode to DIMMs, so obviously the information is in there

So what are we left with? Non-standardised locator strings that may or may not be present, at the mercy of the BIOS writer? I'm already feeling depressed. Re-writing EDAC to try to make sense of this information seems overly risky. I think in general that this is one of the wonderful things about linux, you're not so much at the mercy of BIOS writers. As soon as we start relying on the BIOS for functionality we're encouraging the BIOS people to put more functionality in there, and BIOS functionality is great, as long as there are no bugs! But there are bugs. And correcting them is so prohibitively expensive that I don't even want to think about it. And when the BIOS messes up, it's the device driver writers who have to magically work around the problems.

Could we come up with some plan that doesn't involve trusting to the goodwill (and competence) of BIOS writers? I personally really like the device tree compiler for PowerPC. It allows you to be explicit about what you have. Not for everyone, but maybe there could be some way to apply the same principle? Maybe some way of loading modules with parameters or configuring your setup from sysfs?

/Nils --
That would be nice - but there already exists a platform (Xeon-7500 series a.k.a. Nehalem-EX) where the hardware chipset registers that you would need to do your own memory topology reverse engineering in Linux are only accessible to SMM level code. I've finally come to the conclusion that an EDAC style driver just isn't possible Even when the chip set registers are accessible, it can be very complex to do this for the general case (think of boards that support arbitrary mixing of different size/speed DIMMs - the BIOS may have done some interesting somersaults while computing which interleaving modes to use). Even more complex on high end systems when BIOS may handle row sparing transparently to the OS. Memory mirroring is also becoming fashionable - how can EDAC represent this (when the h/w view of the memory doesn't match the OS view)? -Tony --
Yes, I'm dreading the day they come to me telling me that they've got one of those. On the one end you have hardware people who love to put functionality there, and then you have applications that have real-time requirements to whom you have to explain that the latest and greatest processor is broken for their purposes. One day I'll use this as an excuse to migrate everyone to PPC where people know that a bootloader is a bootloader. But grudges against BIOS's aside, I don't know what to do about Nehalem-EX systems. I guess at that point we really Difficult questions. But at some point I wonder who will be buying systems where finding out which DIMM is broken is so complex that it requires a masters degree. /Nils --
... and the numbers that come out of this may have no relation to your motherboard labels at all. What do you do then? Read schematics again? Or do binary search on the DIMMs again like Eric suggested? For all of this a system specific mapping table is needed, and the only place to get this as a default option without explicit configuration for each motherboard is the BIOS. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
In this case you would need the equivalent information of a system specific DMI table in some device driver. Do you see how this does not fly? How should a device driver know more about the system than the BIOS? And if you can load some specific table into the device driver why can't you simply update the BIOS too?

Well you can supply your own if you're a power user anyway, but most users are not power users. So it's not an option as a default. Or could you imagine a standard server getting installed and asking with a desktop window "please enter the DIMM mappings

the problem is that the information is nowhere else. If the BIOS doesn't know it Linux certainly doesn't know it either. On the other hand if Linux uses this information there is certainly an angle to get at least server vendors to fix their stuff (and non servers do not matter for memory errors because they run in non ECC mode anyway). It's certainly in the server vendors' own interest to supply correct information here anyway. If they don't it will cost them in unnecessary memory replacement costs.

BTW on the systems I have access to DMI seems to be largely correct these days. I guess your system is an unlucky exception. Maybe your BIOS people will do something useful next generation.

Having a DMI override is no problem at all. ACPI uses this all the time for example. No need at all to speak a foreign language for this, even if it's your mother tongue.

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
No, something is wrong with the BIOS. ;-) <long snip> Could you maybe give me an example from the board of your choosing of a DMI table print out, explain the format and then show how to use it? I'd like to show it to our BIOS writers. Ideally, maybe somebody could post a good suggestion for a standardized format? (Though that's very optimistic, but maybe cpu vendors can make suggestions to board makers?) /Nils --
The only requirements the current mcelog parser has are (that is what it actually uses, it parses more things but I abandoned them):
- List of DIMMs (type 17)
- It's useful if they have the correct size for display to the user.
- Correct serial/part numbers/manufacturer are also useful (for display), but not strictly required.
- Locator should match the silk screen label of the DIMM on the board
- Bank Locator is in the format prefix_Node%u_Channel%u_Dimm%u
  prefix can be arbitrary, but should not contain '_'
  Node matching SOCKETID coming from CPU, Channel matching Channel, Dimm matching Dimm number from CPU.
This requirement is the only extension over the standard. -Andi --
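The Bank Locator convention above can be illustrated with a short parsing sketch. This is not mcelog's actual parser; the names `parse_bank_locator` and `build_dimm_map` are hypothetical, and the input is assumed to be (Locator, Bank Locator) string pairs already extracted from DMI type 17 records by some other means.

```python
import re

# "<prefix>_Node<n>_Channel<n>_Dimm<n>"; the prefix must not contain '_'.
BANK_LOCATOR_RE = re.compile(r"^([^_]*)_Node(\d+)_Channel(\d+)_Dimm(\d+)$")


def parse_bank_locator(text):
    """Return (node, channel, dimm), or None if the convention is not followed."""
    match = BANK_LOCATOR_RE.match(text.strip())
    if match is None:
        return None
    return int(match.group(2)), int(match.group(3)), int(match.group(4))


def build_dimm_map(entries):
    """Map (node, channel, dimm) to the silk screen label.

    entries is an iterable of (locator, bank_locator) pairs; entries whose
    Bank Locator does not follow the convention are skipped (an assumption
    of this sketch, not documented behaviour).
    """
    mapping = {}
    for locator, bank_locator in entries:
        key = parse_bank_locator(bank_locator)
        if key is not None:
            mapping[key] = locator
    return mapping
```

With such a map, an error reported by the CPU as socket 0, channel 1, DIMM 2 can be looked up directly and shown to the user under the label printed on the board.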
You could go one stage further and make DIMMs just one example of a field replaceable unit. So the "error analysis subsystem" would keep track of errors reported by any component (cpu, DIMM, I/O card, fan, power supply, disk, ...). Each category could have different "X errors per Y interval" parameter that made sense for it. -Tony --
Experience disagrees with you (that is not sure about average, but at least there's a significant portion) Error rates of good DIMMs scale roughly with the number of transistors. A low steady rate of corrected errors on a large system is expected. In fact if you look at the memory error log of a large system (towards TBs) it nearly always has some memory related events. In this case a log is not really useful. What you need Same issue here: if something is truly broken it floods you with errors. First this costs a lot of time to process and it does not actually tell you anything useful because most errors in a flood are similar. Basically you don't care if you have 100 or 1000 errors, and you definitely don't want all of the errors filling up your disk and using up your CPU. Again a threshold with an action is much more useful here. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
I agree with Andi. While there are a wide range of users, the vast majority know little about the hardware they are running on. Even in commercial settings, where users/admins are better educated, there is little time to do detailed error analysis. The more errors are detected/analyzed/corrected/recovered, the Having the infrastructure to automatically off-line pages is a good thing. The details of where to set the predictive threshold likely will be hardware specific (different DIMM Yes, good points. -- Russ Anderson, OS RAS/Partitioning Project Lead SGI - Silicon Graphics Inc rja@sgi.com --
It's already there with a modern mcelog in daemon mode The current default in mcelog is 10 corrected errors per 24h per 4k page or 1 uncorrected error on the page (if your CPU supports recovering from that). It is on by default. You can configure it to be different if you want. -Andi --
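To make the "N corrected errors per interval per page" idea concrete, here is a minimal sliding-window sketch. It is not mcelog's implementation; the class name `PageErrorThreshold`, the choice to trigger when the count reaches the limit, and the use of caller-supplied timestamps in seconds are all assumptions made for illustration. Only the default numbers (10 errors, 24h, 4k pages) come from the message above.

```python
from collections import defaultdict, deque

PAGE_SHIFT = 12  # 4 KiB pages


class PageErrorThreshold:
    """Count corrected errors per page within a sliding time window."""

    def __init__(self, max_errors=10, window_seconds=24 * 3600):
        self.max_errors = max_errors
        self.window = window_seconds
        self._events = defaultdict(deque)  # page number -> error timestamps

    def record_corrected(self, addr, now):
        """Record one corrected error at physical address addr.

        Returns True when the page has reached the threshold and should be
        acted on (for example, offlined by the caller).
        """
        page = addr >> PAGE_SHIFT
        stamps = self._events[page]
        stamps.append(now)
        # Forget errors that have aged out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        return len(stamps) >= self.max_errors
```

A flood of identical errors then costs one small timestamp per event rather than a full log record, and the only thing that reaches the administrator is the threshold crossing.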
On Tue, May 18, 2010 at 6:14 PM, Eric W. Biederman Hard drives aren't really a similar situation ... we don't see any of the low level errors from a modern hard drive because its f/w handles the retries and block re-mapping transparently. By the time something serious enough happens that it gets reported to the OS, we pretty much already know that there is a real problem. We are still in the dark ages for memory errors where the OS is expected to look at all the errors and figure out whether they represent any kind of meaningful pattern that requires some action to replace h/w components. -Tony --
ia64 is good at detecting & recovering from memory uncorrectable errors. x86 is significantly behind, due to historically not being able to recover from uncorrectable memory errors. ia64 had the Intel defined MCA Spec which defined the interaction between SAL and the kernel. x86 does not have a similar well defined way of how errors should be handled. It would be -- Russ Anderson, OS RAS/Partitioning Project Lead SGI - Silicon Graphics Inc rja@sgi.com --
X86 has machine check registers defined by the SDM. It also has some f/w <-> OS interactions defined by the APEI sections in the latest ACPI spec (chapter 17 of the 4.0a spec released last month - see http://acpi.info). Some parts look cleaner than the ia64 SAL spec. E.g. errors logged from before the current OS booted are presented in the Boot Error Record Table instead of just appearing among the stream of errors that SAL_GET_ERROR provides to the OS without any way to distinguish current errors from old ones. -Tony --
I should add the Intel Software Developer's manual has quite precise guidelines on what to do (and the Linux MCE code implements near all that faithfully) The ACPI spec isn't quite as precise unfortunately. -Andi --
That's possible too - the TRACE_EVENT() of MCE events, beyond the record format, also includes a human-readable ASCII output format string:

  # tail -1 /debug/tracing/events/mce/mce_record/format
  print fmt: "CPU: %d, MCGc/s: %llx/%llx, MC%d: %016Lx, ADDR/MISC: %016Lx/%016Lx, RIP: %02x:<%016Lx>, TSC: %llx, PROCESSOR: %u:%x, TIME: %llu, SOCKET: %u, APIC: %x", REC->cpu, REC->mcgcap, REC->mcgstatus, REC->bank, REC->status, REC->addr, REC->misc, REC->cs, REC->ip, REC->tsc, REC->cpuvendor, REC->cpuid, REC->walltime, REC->socketid, REC->apicid

Which could be used to printk events. Cheers, Ingo --
I proposed a (fairly straightforward) extension to which Boris agreed: we can introduce 'persistent events', which have task-less buffers attached to them, which will hold (a configurable amount of) events. Those can then be picked up by a task later on and no event is lost. Would such a feature address your concern? It would be useful not just for reliable error event collection, it could also be used for things like the boot tracer (which too deals with events that occur before there are any user-space tasks to pick up events). I.e. it fits into the whole scheme in a pretty natural, multi-purpose way. Thanks, Ingo --
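The idea of a task-less buffer that holds events until a consumer shows up can be modelled in a few lines. This is only a user-space toy, not the proposed kernel feature: the class name `PersistentEventBuffer` is hypothetical, and the overflow policy (drop the oldest event and count the loss) is an assumption, since the message does not say what happens once the configured capacity is exceeded.

```python
from collections import deque


class PersistentEventBuffer:
    """Hold events with no consumer attached until someone drains them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lost = 0  # events overwritten before anyone picked them up
        self._events = deque()

    def emit(self, event):
        """Producer side: store an event, evicting the oldest when full."""
        if len(self._events) >= self.capacity:
            self._events.popleft()
            self.lost += 1
        self._events.append(event)

    def drain(self):
        """Consumer side: return all pending events in order and clear them."""
        events = list(self._events)
        self._events.clear()
        return events
```

As long as the consumer attaches before the capacity is exceeded, nothing is lost, which is the property that matters for errors logged during early boot.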
Tony, should we accelerate the development of this persistent events sub-feature? Boris posted initial patches of the new perf events based EDAC/MCE/RAS design direction to lkml and indicated that it works for him. He also indicated that he can do the initial work of unifying EDAC and MCE without the persistent events feature for now. (this all is obviously v2.6.36-ish material) But if it's important, if you'd like to move ahead with the unification swiftly then we can certainly increase its priority. Also, a few notes:

1) the new RAS tool itself might or might not be part of tools/perf/ - for the prototype it certainly makes sense to be there but otherwise feel free to start tools/ras/ and share code with tools/perf/ but otherwise keep a separate RAS tool-space.

2) There's a new perf feature (that went upstream today) that is of EDAC/RAS interest: the ability to do live tracing. This is basically a daemon-alike, event->policy-action based flow that RAS eventing is about.

3) Another new perf feature of interest is 'perf inject' (this too went upstream today): to inject artificial events into the stream of events. This mechanism could be used to simulate rare error conditions and to test out policy reactions systematically - an important part of system error recovery testing.

4) We are working on enumerating events via sysfs, not via debugfs. This would make the events provided by EDAC/MCE more generally available. See Lin Ming's patches on lkml: Subject: [RFC][PATCH v2 06/11] perf: core, export pmus via sysfs Please chime in on that thread to make sure the event_source class is suitable to describe EDAC/MCE event sources as well. Any event_source that is made available by drivers can then be used by tools for event transport.

This gives us a broad platform to add various RAS events as well, beyond raw hardware events: we could for example add events for various system anomalies such as lockup messages, kernel ...
The persistent event feature sounds like it will solve We've missed the deadlines for inclusion in certain popular distributions ... so it may be OK to take a relatively leisurely path to getting this done right Simulated errors are handy for testing the very top level of the s/w stack. But real errors are better. There's some APEI code in Len's tree that can inject real errors (on systems with the This looks like sticky ground. I can see the event mechanism passing data to a user daemon working well for all kinds of corrected and minor errors. But when you start talking about lockups and fatal errors things get a lot trickier. Often the main concern at this point is error containment. Making sure that the flaky data doesn't become visible (saved to storage, transmitted to the network, etc.). Getting from a machine check handler through some context switches (and page faults etc.) to a user level daemon before the error In a cluster/cloud/datacenter that daemon will need to be networked and hooked to the system management tools that are controlling the bigger environment. But I agree that this looks like a worthy end goal. -Tony --
I was pointing beyond the narrow hardware (memory) error point of view, towards a more generic 'system health' thinking. In the broader view it may make sense to, for example, define policy over excessive number of segfaults on a server system (where excessive segfaults are an anomaly), or a suspiciously large number of soft IO errors, etc. But yes, of course, when it comes to hard memory errors, those take precedence, and handling them (and saving/propagating information about them while we still As Boris mentioned it too, critical policy action can and will be done straight in the kernel. Ingo --
That is how it is done in ia64. The MCA interrupt handler does the low level handling. It makes sure all the cpus have rendezvoused, looks at the MCA record to determine what happened and does whatever recovery steps are needed, such as killing the application. -- Russ Anderson, OS RAS/Partitioning Project Lead SGI - Silicon Graphics Inc rja@sgi.com --
Agreed, hardware assisted error injection is by far the best and most complete solution. Thanks, Ingo --
From: Mauro Carvalho Chehab <mchehab@redhat.com> Have the old list subscribers been moved to the new list or do they need to re-subscribe? Thanks. -- Regards/Gruss, Boris. Operating Systems Research Center Advanced Micro Devices, Inc. --
I suspect that you'll need to subscribe on the new ML. -- Cheers, Mauro --
The current i7core_edac driver is ready for merge upstream, using the current edac API. It supports the following processor families: i7core, i5core, Lynnfield, Nehalem, Nehalem-EP and Westmere-EP The tree is available at: git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/i7core.git linux_next This driver doesn't support Nehalem-EX. From the discussions we had during the mini-summit, the MCU of -EX family is very different, so, a separate driver will be required for it. Please review. My plan is to submit this driver for upstream merge this week. -- Cheers, Mauro --
