Alan Cox submitted a pair of patches to add error detection and correction (EDAC) logic to the 2.6 kernel. He noted, "I don't think its yet merge ready but getting there so I'd appreciate other folks comments and views on what else needs fixing before generating a submission for Andrew." The patches are a subset of the bluesmoke kernel module, which "is mainly concerned with reporting ECC, PCI, machine check, cache, hypertransport, thermal throttling and related events." This version of the patch targets only the 2.6 kernel, and the code was renamed from bluesmoke to EDAC.
Memory error checking used to be accomplished with a parity bit attached to each byte of memory. The parity bit was calculated when each byte was written, and then verified when that byte was read. If the stored parity bit didn't match the parity bit calculated on a read, that byte of memory was known to have changed. Parity checking was a reasonably effective method for detecting a one-bit change in a byte of memory, but it could not say which bit had changed. ECC expanded upon this idea by using an error-correcting code that calculates a set of check bits over multiple bytes of memory. These check bits can be used to detect when one or more bits have changed. On single-bit errors, they can also be used to restore the memory to its intended state, actually correcting the error.
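The per-byte parity check described above can be sketched in a few lines of Python. This is only an illustration, assuming even parity; the function names `parity_bit` and `read_ok` are hypothetical and not taken from the EDAC patches.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 if the byte holds an odd number of 1 bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2


def read_ok(byte: int, stored_parity: int) -> bool:
    """Recompute parity on read and compare it with the stored bit."""
    return parity_bit(byte) == stored_parity


original = 0b10110010
stored = parity_bit(original)          # computed when the byte is written

one_flip = original ^ 0b00000100       # a single bit changed in memory
two_flips = original ^ 0b00000110      # two bits changed in memory

print(read_ok(original, stored))       # True: byte is unchanged
print(read_ok(one_flip, stored))       # False: single-bit error detected
print(read_ok(two_flips, stored))      # True: double-bit error goes unnoticed
```

The last case shows the limit of plain parity: an even number of flipped bits leaves the parity unchanged, and even a detected error cannot be located or repaired.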
From: Alan Cox [email blocked]
To: linux-kernel
Subject: [RFC PATCH] 1/2 EDAC (Core Code)
Date: Tue, 11 Oct 2005 16:13:43 +0100

This is a pair of patches that add the basic ECC and related chipset handling to the kernel. Various "interesting" patches have kicked around since Dan Hollis original work some years ago. Since then many people have picked up the code and improved on it. The current code (bluesmoke.sf.net) has some very fancy NMI handling features and other complex open issues with NMI during a PCI transaction and the like. This code is a subset and the maintainer agrees this is the best approach to merging.

From the original repository

- 2.4 and other compatibility code removed
- Some commenting added
- A couple of 32bit isms cleaned up
- Move several drivers from pci_find to pci_get
- Remove all the NMI layer handling so that the code requires no core changes
- Rename from bluesmoke to EDAC (Correct name for the whole field of ECC and friends) to avoid confusion

I don't think its yet merge ready but getting there so I'd appreciate other folks comments and views on what else needs fixing before generating a submission for Andrew.
| Attachment | Size |
|---|---|
| edac_core.patch | 50.79 KB |
| edac_drivers.patch | 67.2 KB |
Hamming codes
Probably the most popular sort of ECC comes in the form of Hamming Codes. (Google them if you're truly curious.) Their sudden popularity comes in part from the nature of the coding. You see, you need roughly log2(n)+2 bits to protect a word of length n against single-bit errors and detect 2 bit errors. (That's not the exact formula, but it's close.) To protect a 64-bit word against single bit errors, you'd need 8 additional bits.
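For the curious, the exact count behind that rule of thumb can be worked out with a short Python sketch. It assumes the textbook SECDED construction (a Hamming code plus one overall parity bit); the function name `secded_check_bits` is made up for this illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed to correct 1-bit and detect 2-bit errors.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1,
    so that every bit position (plus "no error") gets its own syndrome.
    One extra overall parity bit adds double-error detection.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


for n in (8, 16, 32, 64, 128):
    print(n, secded_check_bits(n))
# 8 5
# 16 6
# 32 7
# 64 8
# 128 9
```

For these power-of-two word sizes the result agrees with log2(n)+2, and a 64-bit word comes out at 8 check bits.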
Turns out that 72-bit "parity memory" (64-bit memory w/ 8 additional parity bits) has just enough parity bits that it could instead serve as ECC memory with no additional storage requirement. Once 64-bit memory busses became common, ECC followed shortly thereafter. Chipset vendors could add ECC to their chipsets without any other changes to the system.
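To see how a handful of extra bits can locate and repair a flipped bit, here is a toy Hamming(7,4) code in Python. It is a scaled-down sketch, not what any chipset actually implements: real ECC memory protects 64-bit words and adds an overall parity bit for double-error detection, and the helper names `encode` and `decode` are invented for this example.

```python
PARITY_POSITIONS = (1, 2, 4)   # power-of-two positions hold check bits
DATA_POSITIONS = (3, 5, 6, 7)  # remaining positions hold the 4 data bits


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    code = [0] * 8  # index 0 is unused so indices match bit positions
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in PARITY_POSITIONS:
        # Each check bit makes the parity of every position containing p even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    return code[1:]


def decode(bits: list[int]) -> tuple[int, int]:
    """Return (corrected nibble, syndrome); a syndrome of 0 means no error."""
    code = [0] + list(bits)
    syndrome = 0
    for p in PARITY_POSITIONS:
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    if syndrome:
        code[syndrome] ^= 1  # the syndrome is the position of the bad bit
    nibble = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, syndrome


word = encode(0b1011)
word[4] ^= 1                      # flip the bit at position 5
value, where = decode(word)
print(bin(value), where)          # 0b1011 5
```

The failing parity checks spell out the position of the flipped bit in binary, which is why a single-bit error can be corrected rather than merely detected.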
Naming
I must say Bluesmoke was a much better name than EDAC. The latter is just dull.
Ah, but when you're talking about high-reliability stuff, dull is what you want.
That's why I refuse to use "Firewire", but opt for the far superior "IEEE-1394 Bus Interface". Much more reliable and speedy.
Not HPIB...err...GPIB?
What, isn't IEEE-488 fast enough for you? ;-) The cables are certainly more, uhm, durable.
Seriously though, EDAC is a much more descriptive name than Bluesmoke, if nothing else. Bluesmoke is something of an inside joke, and suitable while developing the feature. A name that describes what it does, and which does so with something of a tone of gravity is rather useful in this space.
As for Firewire vs. 1394... get over yourself. :-) I don't hear you calling JPEGs "JFIF encapsulated ISO/IEC 10918-1 compliant bitstreams containing continuous tone images." JPEG's the name of the committee that made the standard... not the standard itself. But it's a useful shorthand. :-) Or worse, the inanely named Bluetooth... aka IEEE 802.15.1. I'll still call it Bluetooth thankyouverymuch, despite the stupid sounding name. It's a different world in the world of consumer gadgets... the world of marketing people and focus groups and branding. It's a little different when you're trying to describe a feature that handles system failure events to the IT staff.
Eh... I guess I'm feeling a bit snarky this morning, eh? Nothing personal.
I call JPEGs "jeffifs isos"... And bluetooth "tootie fifty one", and Firewire -> fireonwire. :D
That will send a message to all those "marcus the mouse"-marketeers.. (www.hackles.org)