The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file, kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period - probably implement some missing procfs stuff Signed-off-by: Don Zickus <dzickus@redhat.com> --- arch/x86/kernel/apic/hw_nmi.c | 2 +- include/linux/sched.h | 6 + init/Kconfig | 1 + kernel/Makefile | 4 +- kernel/sysctl.c | 9 + kernel/watchdog.c | 570 +++++++++++++++++++++++++++++++++++++++++ lib/Kconfig.debug | 21 +- 7 files changed, 605 insertions(+), 8 deletions(-) create mode 100644 kernel/watchdog.c diff --git a/arch/x86/kernel/apic/hw_nmi.c ...
Thanks for the comments, will try to get them added in the next spin. Cheers, Don --
In fact we also have the sample_freq thing that let you run against a frequency (events per sec) rather than a sample period. We do some calculations that recompute the actual sample period on top of this frequency in a regular base as the events come. It's rather unfortunate we can't have 0.xxxx frequencies but may be we can work that out by adding a freq_unscale field, which would basically produce the frequency like this: freq = event->attr.sample_freq * (10 ^ -event->attr.freq_unscale) This is not trivial though, so let's rather focus on the That looks too general. It's more about the fact the arch supports I'm confused, do we have two versions of the softlockup Please use __get_cpu_var() instead so that we keep the preempt disabled Why do you keep the software clock, I don't see how it can be useful to detect hardlockups. It triggers in a regular irq Please avoid such ifdefs in the middle of the code. It's better to gather hardlockup matters in a single ifdef block: #ifdef CONFIG_PERF_EVENTS_NMI define your functions here #else define (if needed) your off case functions here (static inline stubs) You probably want a backtrace cpu mask here as well Please warn here instead, nothing seriously dangerous is going to Especially such kind of idef in a function plus goto in the middle, we really want to avoid that. You want a watchdog_nmi_enable() call instead that does nothing I'd suggest to drop the NMI prefix, this is not anymore about detecting hard lockups only. May be config WATCHDOG or config LOCKUP_DETECTOR if it's taken already. Also you should half-drop the DETECT_SOFTLOCKUP thing: keep it's definition but drop the ability to choose it from the prompt: config DETECT_SOFTLOCKUP bool depends on DEBUG_KERNEL && !S390 default y This way we keep it for compatibility with def_configs, it will enable the WATCHDOG by default if it is ...
I was trying to figure out a way to add the PERF_EVENTS dependency as I didn't want to impose it on the CONFIG_NMI_WATCHDOG if that config Originally Ingo talked about a migration path, so I was going to support the older one in case the new one was having issues, sort of like what he suggested about moving the nmi code from arch/x86/kernel/apic/nmi.c to kernel/watchdog.c. But I can probably drop the softlockup case as the Yeah, sorry. I dropped part of the code per your suggestion and forget to I was just thinking of the case where dispite the WARN above, the cpu actually recovered and then failed again separately. But I probably won't I understand the general idea but not quite the implementation idea. I will work on it and see what I come up with. Thanks for the review! Cheers, Don --
This is really just a corner case, I guess you don't need to
bother with that. It is actually racy against other cpus and adding
a spinlock here (in the everything is fine path) would be an overkill.
In fact, having two per cpu vars named hardlockup_warned and
softlockup_warned would be better than cpumasks. I'm sorry I
suggested you the cpumask, but such per cpu vars will avoid
you dealing with these synchonization issues. And one of the primary
We current have:
config DETECT_SOFTLOCKUP
bool "Blah"
depends on DEBUG_KERNEL && !S390
default y
help
.......
The idea is to remove the "Blah" so that the user can't select it
anymore from make menuconfig, and to remove the help too as it's useless
too.
So that config WATCHDOG can be default y if DETECT_SOFTLOCKUP.
Then if someone comes with a config that has DETECT_SOFTLOCKUP,
it's new implementation (WATCHDOG) will enabled by default.
--
Yeah, I guess per cpu is better. I agree that locks in NMI are frowned upon but I wasn't sure of it was dealt with. I'll try to implement this. Any objections if I combined hardlockup and softlockup with per cpu watchdog_warn and have bit masks for HARDLOCKUP Ah, I missed the bool part. I got it. Thanks for the clarification. Cheers, --
They work in fact. They are just not checked by lockdep. And mostly they are very dangerous: if something else can take it (from interrupt, from context) then this is a deadlock. And even though we ensure this is only taken from NMI, we tend Hmm, a hardlockup can come in after a softlockup. Don't worry too much about memory: usually the more you have cpu, the more you have memory :) Plus this is debugging code, not something supposed to be enabled in production. --
Let me re-explain what I meant. It was meant to do double duty. The softlockup code only checks the SOFTLOCKUP bit and the hardlockup only ever checks the HARDLOCKUP bit. Well depends on your POV. In RHEL we enable both NMI_WATCHDOG and SOFTLOCKUP on production systems (and we have customers that are thankful for that :-) ). Cheers, Don --
On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote: Hmm, this is NMI handler path so from what we protect this per-cpu data? Do I miss something? /me confused -- Cyrill --
The cpu mask is not per cpu here, this is a shared bitmap, so you can race against other cpus NMIs. That said, as I suggested, having a per cpu var that we set when we warned would be much better than a spinlock here. --
Every time you think NMI and spinlock think FAIL. --
In fact I was first inspired by the x86 nmi watchdog handler that does this spinlock to protect cpumask, but realize just after my FAIL ;-) --
On Fri, Apr 16, 2010 at 04:46:17PM +0200, Frederic Weisbecker wrote: yeah, saw DECLARE_BITMAP but read it as DEFINE_PER_CPU for some reason. having any spinlock in irq handler is really under suspicious. -- Cyrill --
It's not per-cpu which is the problem I believe. If you have multiple cpus dealing with hardlockups than they can overwrite each other's data. Cheers, Don --
Hello all, After making a bunch of cleanups, I am stuck debating whether to continue updating this patch on the stale branch perf/nmi on Ingo's tree or just repost the whole patch again (which isn't much bigger just adds the arch/x86/kernel/apic/hw_nmi.c piece). Part of the new patch series includes removing kernel/nmi_watchdog.c, which seemed kinda silly because it was only an intermediate file until things got shifted to kernel/watchdog.c Thoughts? Cheers, Don --
I'd prefer relative patches as the current perf/nmi bits are tested quite well. Intermediate stages are not a problem: 90% of the code in the kernel's Git history is 'intermediate' as well, in hindsight. What matters is that the workflow that resulted was clean and that the patches were (and are) clean. Ingo --
Ok, I'll continue that then. Thanks. Cheers, Don --
