x86: mce: Xeon75xx specific interface to get corrected memory error information v2
[This version addresses the previous comments. It does not change
any interface to the outside and does not attempt to encode DIMMs
or anything like that, but only passes out the physical address of
a corrected error in the standard ADDR register field.
So from the outside it looks exactly the same as if the CPU supported this
natively, with no other special interfaces.
I hope this addresses previous concerns. I guess the DIMM error reporting
can be revisited once there's a new reporting interface. There are still
some traces of DIMM parsing in there, but it's only used for debug
purposes now.]
---
Xeon 75xx doesn't log physical addresses on corrected machine check
events in the standard architectural MSRs. Instead the address has to
be retrieved in a model specific way. Without that address it is
impossible to do predictive failure analysis.
Implement cpu model specific code to do this in mce-xeon75xx.c using a new hook
that is called from the generic poll code. The code retrieves
the physical address of the last corrected error from the platform
and makes the address look like a standard architectural MCA address for
further processing.
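For illustration only, here is a minimal user-space sketch of the hook idea
described above. It is not the code in this patch: struct mce_record,
cpu_specific_poll_hook, model_specific_fixup, generic_poll_one and
fake_platform_pa are hypothetical names, and the platform lookup is replaced
by a constant. Only the MCI_STATUS_* bit positions follow the architectural
MCi_STATUS layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Architectural MCi_STATUS bits. */
#define MCI_STATUS_VAL   (1ULL << 63)  /* record is valid */
#define MCI_STATUS_UC    (1ULL << 61)  /* error was uncorrected */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* ADDR register is valid */

/* Hypothetical, cut-down stand-in for the kernel's struct mce. */
struct mce_record {
	uint64_t status;
	uint64_t addr;
};

/* Hypothetical hook: NULL unless a model specific driver registered. */
typedef void (*poll_hook_fn)(struct mce_record *m);
static poll_hook_fn cpu_specific_poll_hook;

/* Assumption: stands in for the slow, model specific platform query. */
static uint64_t fake_platform_pa = 0x12345000ULL;

/* Fill in the address for corrected errors that arrived without one. */
static void model_specific_fixup(struct mce_record *m)
{
	if (!(m->status & MCI_STATUS_VAL))
		return;		/* nothing logged */
	if (m->status & MCI_STATUS_UC)
		return;		/* only corrected errors are handled */
	if (m->status & MCI_STATUS_ADDRV)
		return;		/* hardware already supplied an address */

	m->addr = fake_platform_pa;
	m->status |= MCI_STATUS_ADDRV;	/* now looks like a native record */
}

/* Stand-in for the generic poll path: call the hook if one is set. */
static void generic_poll_one(struct mce_record *m)
{
	if (cpu_specific_poll_hook)
		cpu_specific_poll_hook(m);
}
```

After the hook has run, consumers of the record only see a valid ADDR field
and need no knowledge of where the address came from.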
There's no code to print this information on a panic because this only
works for corrected errors, and corrected errors do not usually result in
panics.
The act of retrieving the PA information can take some time, so this
code has a rate limit to avoid taking too much CPU time on an error flood.
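As a rough sketch of what such a rate limit can look like, the following
fixed-window limiter allows at most "burst" lookups per "interval" time
units. This is an assumption for illustration, not necessarily the scheme
the patch uses; struct lookup_ratelimit and lookup_allowed are hypothetical
names, and the caller passes in the current time so the logic is easy to
test outside the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed-window limiter state. */
struct lookup_ratelimit {
	uint64_t interval;	/* window length, in caller-defined ticks */
	unsigned int burst;	/* lookups allowed per window */
	uint64_t window_start;	/* tick at which the current window began */
	unsigned int used;	/* lookups already spent in this window */
};

/* Return true if an expensive address lookup may be done at time "now". */
static bool lookup_allowed(struct lookup_ratelimit *rl, uint64_t now)
{
	/* Start a fresh window once the old one has expired. */
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->used = 0;
	}

	if (rl->used >= rl->burst)
		return false;	/* budget spent: skip the lookup */

	rl->used++;
	return true;
}
```

When the limiter says no, the event can still be logged, just without the
address, so an error flood costs no more than it did before.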
The whole thing can be loaded as a module and has suitable
PCI-IDs so that it can be auto-loaded by a distribution.
The code also checks explicitly for the expected CPU model
number to make sure this code doesn't run anywhere else.
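A model check of this kind can be sketched as follows. The decoding of the
CPUID leaf 1 EAX signature follows the usual family/model rules; the target
values (family 6, model 0x2e) are an assumption not stated in this
description, and cpu_family, cpu_model and is_expected_model are
hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed target: family 6, model 0x2e. Not taken from the text above. */
#define EXPECTED_FAMILY 6u
#define EXPECTED_MODEL  0x2eu

/* Decode the display family from a CPUID leaf 1 EAX signature. */
static unsigned int cpu_family(uint32_t sig)
{
	unsigned int family = (sig >> 8) & 0xf;

	if (family == 0xf)
		family += (sig >> 20) & 0xff;	/* add extended family */
	return family;
}

/* Decode the display model from a CPUID leaf 1 EAX signature. */
static unsigned int cpu_model(uint32_t sig)
{
	unsigned int base_family = (sig >> 8) & 0xf;
	unsigned int model = (sig >> 4) & 0xf;

	/* The extended model field only applies to families 6 and 15. */
	if (base_family == 0x6 || base_family == 0xf)
		model |= ((sig >> 16) & 0xf) << 4;
	return model;
}

/* Refuse to run on anything but the expected family and model. */
static bool is_expected_model(uint32_t sig)
{
	return cpu_family(sig) == EXPECTED_FAMILY &&
	       cpu_model(sig) == EXPECTED_MODEL;
}
```

A driver would return an error from its init function when this check
fails, so a stray load on other hardware does nothing.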
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
arch/x86/Kconfig | 8
arch/x86/kernel/cpu/mcheck/Makefile | 1
arch/x86/kernel/cpu/mcheck/mce-internal.h | 1
...