Hi,
so I finally had some spare time to stare at perf/ftrace code and ponder
on how to use those facilities for MCE collecting and reporting. Btw, I
have to say, it took me quite a while to understand what goes where - my
suggestion to anyone who tries to understand how perf/ftrace works is
to do make <file.i> where there is at least one trace_XXX emit record
function call and start untangling code paths from there.
So anyway, here are some questions I had, I just as well may've missed
something so please correct me if I'm wrong:
1. Since machine checks can happen at any time, we need to have the
MCE tracepoint (trace_mce_record() in <include/trace/events/mce.h>)
always enabled. This, in turn, means that we need the ftrace/perf
infrastructure always compiled in (lockless ring buffer, perf_event.c
stuff) on any x86 system so that MCEs can be handled at anytime. Is this
going to be ok to be enabled on _all_ machines, hmmm... I dunno, maybe
only a subset of those facilites at least.
2. Tangential to 1., we need that "thin" software layer prepared for
decoding and reporting them as early as possible. event_trace_init() is
an fs_initcall and executed too late, IMHO. The ->perf_event_enable in
the ftrace_event_call is enabled even later on the perf init path over
the sys_perf_even_open which is at userspace time. In our case, this is
going be executed by the error logging and decoding daemon I guess.
3. Since we want to listen for MCEs all the time, the concept of
enabling and disabling those events does not apply in the sense of
performance profiling. IOW, MCEs need to be able to be logged to the
ring buffer at any time. I guess this is easily done - we simply enable
MCE events at the earliest moment possible and disable them on shutdown;
done.
So yeah, some food for thought but what do you guys think?
Thanks.
--
Regards/Gruss,
Boris.
--
I'm not exactly sure what you goal is, but if you need to do something directly, you can bypass ftrace and perf. All trace events can be connected by anything even when ftrace and perf are not enabled. That is, you need to connect to the tracepoint and write you own callback. This can be done pretty much at anytime during boot up. To see how to connect to a trace point, you can look at register_trace_sched_switch() in kernel/trace/ftrace.c. This registers a This looks like a good reason to have your own handler. More than one callback may be registered to a tracepoint, so you do not need to worry about having other handlers affect your code. --
From: Steven Rostedt <rostedt@goodmis.org>
Date: Mon, May 03, 2010 at 10:41:12AM -0400
Right, so the idea is to use the perf/ftrace infrastructure to detect
failing hardware which is signalled through machine checks, among
others. I'm thinking a lockless ring buffer would be cool so we can
execute in any context... wait a minute, right, we have that already.
However, if I use the perf/ftrace facilities, I have to have them
enabled on every system since machine checks are core processor
functionality and the software support for those has to be always
enabled. And the code has to be small and execute fast since after a
critical mcheck happens all bets are off.
So I'm thinking maybe a core perf/ftrace stuff which is thin and is
always enabled... but I'm not sure for I haven't stared at the code
The base tracepoint functionality should suffice for now but this is
Yep, good ideas, thanks. /me goes back to the drawing board.
--
Regards/Gruss,
Boris.
--
A good beginning of any such investigations would be to describe what exact problems you're trying to solve here. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
We could certainly move bits of this initialization earlier. Also we could add the notion of 'persistent' events that dont have a user-space process attached to them - and which would buffer to a certain degree. Such persistent events could be initialized during bootup and the daemon could pick up the events later on. In-kernel actions/policy could work off the callback mechanism. (See for example how the new NMI watchdog code in tip:perf/nmi makes use of it - or how the hw-breakpoints code utilizes it.) These would work even if there's no user-space daemon attached (or if the daemon has been stopped). So i dont see a significant design problem here - it's all natural extensions of existing Note that it doesnt _have to_ go to the ftrace ring-buffer. If the daemon (or whatever facility picking up the events) keeps a global (per cpu) MCE perf event enabled all the time then it might be doing that regardless of ftrace. Some decoupling from ftrace could be done here easily. I'd suggest to not worry about it - once we have the MCE event code we can certainly reshape the underlying support code to be more readily available/configurable. (or even built-in) This is not really a significant design issue. To start with this, a quick initial prototype could use the 'perf trace' live mode tracing script. (See latest -tip, 'perf trace --script <script-name>' and 'perf record -o -' to activate live mode.) Note that there's also 'perf inject' now, which can be used to simulate rare events and test the daemon's reactions to it. (Right now perf-inject is only used to inject certain special build-id events, but it can be used for the injection of MCE events as well.) Thanks, Ingo --
From: Ingo Molnar <mingo@elte.hu> Date: Tue, May 04, 2010 at 01:32:27PM +0200 so I did some experimenting with this and have a pretty rough prototype which conveys decoded MCEs to userspace where they're read with perf. More specifically, I did perf record -e mce:mce_record -a after tweaking the mce_record tracepoint to include the decoded error string. And then doing perf trace -g python perf trace -s perf-trace.py got me: in trace_begin mce__mce_record 6 00600.700632283 0 init mcgcap=262, mcgstatus=0, bank=4, status=15888347641659525651, addr=26682366720, misc=13837309867997528064, ip=0, cs=0, tsc=0, walltime=1273928155, cpu=6, cpuid=1052561, apicid=6, socketid=0, cpuvendor=2, decoded_err= Northbridge Error, node 1ECC/ChipKill ECC error. CE err addr: 0x636649b00 CE page 0x636649, offset 0xb00, grain 0, syndrome 0x1fd, row 3, channel 0 Transaction type: generic read(mem access), no t in trace_end which shows the signature of an ECC which I injected earlier over the EDAC sysfs interface. And yes, the decoded_err appears truncated so I'll have to think of a slicker way to collect that info. Although they're pretty rough yet, I've attached the relevant patches so that one could get an impression of where we're moving here. 0001-amd64_edac-Remove-polling-mechanism.patch removes the EDAC polling mechanism in favor of hooking into the machine_check_poll polling function using the atomic notifier which we already use for uncorrectable errors. The other two 0002-mce-trace-Add-decoded-string-to-mce_record-s-format.patch 0003-edac-mce-Prepare-error-decoded-info.patch add that decoded_err string. I'm open for better ideas here though. Concerning the early MCE logging and reporting, I'm thinking of using the mce.c ring buffer temporarily until the ftrace buffer has been initialized and then copying all records into the last. We might do a more elegant solution in the future after all that bootmem churn has quieted down and ...
Agreed. There's overlap here with the boot tracer as well, which we want to convert over to perf events as well. That could be achieved via the concept of 'persistent events', which are basically task-less events brought active and attached to a buffer space to dump the events to. I agree that the critical functionality itself should be implemented in the kernel - and not all routed through a user-space component. But please generate the callbacks via perf events, like the new watchdog code does in -tip (tip:perf/nmi): cafcd80: lockup_detector: Cross arch compile fixes 23637d4: lockup_detector: Introduce CONFIG_HARDLOCKUP_DETECTOR c01d432: lockup_detector: Adapt CONFIG_PERF_EVENT_NMI to other archs e16bb1d: lockup_detector: Update some config 5e85391: x86, watchdog: Fix build error in hw_nmi.c 0167c78: watchdog: Export touch_softlockup_watchdog 19cc36c: lockup_detector: Fix forgotten config conversion 89d7ce2: lockup_detector: Make BOOTPARAM_SOFTLOCKUP_PANIC depend on LOCKUP_DETECTOR d7c5473: lockup_detector: Separate touch_nmi_watchdog code path from touch_watchdog 10f9014: x86: Cleanup hw_nmi.c cruft 7cbb7e7: x86: Move trigger_all_cpu_backtrace to its own die_notifier f69bcf6: lockup_detector: Remove nmi_watchdog.c file 2508ce1: lockup_detector: Remove old softlockup code 332fbdb: lockup_detector: Touch_softlockup cleanups and softlockup_tick removal 58687ac: lockup_detector: Combine nmi_watchdog and softlockup detector 5671a10: nmi_watchdog: Tell the world we're active c99c30f: nmi_watchdog: Turn it off by default 47195d5: nmi_watchdog: Clean up various small details 2cc4452: nmi_watchdog: Fix undefined 'apic' build bug 96ca402: nmi_watchdog: Properly configure for software events 6081b6c: nmi_watchdog: support for oprofile cf454ae: nmi_watchdog: Fallback to software events when no hardware pmu detected 504d7cf: nmi_watchdog: Compile and portability fixes c3128fb: nmi_watchdog: Use a boolean config flag for compiling 8e7672c: nmi_watchdog: ...
From: Ingo Molnar <mingo@elte.hu>
Date: Sun, May 16, 2010 at 01:26:41PM +0200
The decoded string handling is still clumsy since it is of variable
length and I have to allocate a string of the maximal possible length of
any error for it to not get truncated.
We could avoid this by spitting error codes of fixed length, instead,
which the RAS daemon would map to strings in userspace but the coding
scheme would need some thinking so that it works adequately for all
possible error types.
[snip]
My only concern here is that going over the perf events and callbacks
adds unnecessary additional code when we're in emergency mode handling
an uncorrectable error. However, I don't know how much that code would
Correct me if I'm wrong but since the trace_mce_record() is a
tracepoint, we don't want to register callbacks over the perf_event*
interface but rather use the ftrace path, like Steven's example with
register_trace_sched_switch().
If we do it this way, there's no overhead at all and we add all
the needed functionality like tty switching and emergency shutdown
preparations to the proper path in the proper order - right after having
decoded the MCE and still in the NMI context of do_machine_check, i.e.
at the earliest possible moment and without wasting time.
And with tracepoints we still have the event unification/enumeration you
Well, since MCEs are per CPU, the proper way to map those would be
/sys/devices/system/cpu/cpuX/events/mce/
With this, we could have the additional functionality to disable some
MCEs per CPU if it makes sense for certain cases...
Well, both mce and edac are a subset of RAS so calling it perf/ras/ is
the most sensible way to go IMHO. Also, this is where the perf inject
can be reused since we can inject true hardware errors and not only
ditto, 'perf ras' is what I'm thinking.
Thanks.
--
Regards/Gruss,
Boris.
--
