This patch series covers mostly the changes necessary for combining the nmi_watchdog and softlockup code. Also added are the cleanups associated with the change, like removing the old files. The changelogs in each patch are more specific to what the changes are. Don Zickus (5): [watchdog] combine nmi_watchdog and softlockup [watchdog] convert touch_softlockup_watchdog to touch_watchdog [watchdog] remove old softlockup code [watchdog] remove nmi_watchdog.c file [x86] watchdog: move trigger_all_cpu_backtrace to its own die_notifier [x86] watchdog: cleanup hw_nmi.c cruft Documentation/kernel-parameters.txt | 2 + arch/ia64/kernel/time.c | 2 +- arch/ia64/kernel/uncached.c | 2 +- arch/ia64/xen/time.c | 2 +- arch/mn10300/kernel/gdb-stub.c | 2 +- arch/powerpc/kernel/swsusp_64.c | 2 +- arch/sparc/kernel/nmi.c | 2 +- arch/x86/include/asm/nmi.h | 2 +- arch/x86/kernel/apic/Makefile | 4 +- arch/x86/kernel/apic/hw_nmi.c | 113 +++----- arch/x86/kernel/apic/nmi.c | 2 +- arch/x86/kernel/traps.c | 4 +- arch/x86/xen/smp.c | 2 +- drivers/ide/ide-iops.c | 2 +- drivers/ide/ide-taskfile.c | 2 +- drivers/mtd/nand/nand_base.c | 4 +- drivers/video/nvidia/nv_accel.c | 2 +- include/linux/nmi.h | 13 +- include/linux/sched.h | 20 +- init/Kconfig | 5 +- kernel/Makefile | 3 +- kernel/kgdb.c | 6 +- kernel/nmi_watchdog.c | 259 ---------------- kernel/panic.c | 2 +- kernel/power/hibernate.c | 2 +- kernel/sched.c | 2 +- kernel/sched_clock.c | 2 +- kernel/softlockup.c | 278 ------------------ kernel/sysctl.c | 52 ++-- ...
The new nmi_watchdog (which uses the perf event subsystem) is very similar in structure to the softlockup detector. Using Ingo's suggestion, I combined the two functionalities into one file, kernel/watchdog.c. Now both the nmi_watchdog (or hardlockup detector) and softlockup detector sit on top of the perf event subsystem, which is run every 60 seconds or so to see if there are any lockups. To detect hardlockups, cpus not responding to interrupts, I implemented an hrtimer that runs 5 times for every perf event overflow event. If that stops counting on a cpu, then the cpu is most likely in trouble. To detect softlockups, tasks not yielding to the scheduler, I used the previous kthread idea that now gets kicked every time the hrtimer fires. If the kthread isn't being scheduled neither is anyone else and the warning is printed to the console. I tested this on x86_64 and both the softlockup and hardlockup paths work. V2: - cleaned up the Kconfig and softlockup combination - surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI - seperated out the softlockup case from perf event subsystem - re-arranged the enabling/disabling nmi watchdog from proc space - added cpumasks for hardlockup failure cases - removed fallback to soft events if no PMU exists for hard events V3: - comment cleanups - drop support for older softlockup code - per_cpu cleanups - completely remove software clock base hardlockup detector - use per_cpu masking on hard/soft lockup detection - #ifdef cleanups - rename config option NMI_WATCHDOG to LOCKUP_DETECTOR - documentation additions TODO: - figure out how to make an arch-agnostic clock2cycles call (if possible) to feed into perf events as a sample period Signed-off-by: Don Zickus <dzickus@redhat.com> --- Documentation/kernel-parameters.txt | 2 + arch/x86/include/asm/nmi.h | 2 +- arch/x86/kernel/apic/Makefile | 4 +- arch/x86/kernel/apic/hw_nmi.c | 2 +- arch/x86/kernel/traps.c | ...
Hi Don, I realize that you didn't introduce Arch in this help text, but can we eliminate him, please? E.g.: Platform can generate an NMI using the perf event subsystem. or --- ~Randy --
Some minor things: __get_cpu_var --
well, I already have this_cpu and need it later, I figured I would just use it with per_cpu and save _get_cpu_var the work of re-running again same as above. Thanks for the review. Cheers, Don --
This is more about code clarity in fact. per_cpu() suggests we are fetching something from another cpu. This is a very minor issue though. --
I have no problem changing it. Cheers, Don --
This file migrated to kernel/watchdog.c and then combined with
kernel/softlockup.c. As a result kernel/nmi_watchdog.c is no longer
needed. Just remove it.
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
kernel/nmi_watchdog.c | 259 -------------------------------------------------
1 files changed, 0 insertions(+), 259 deletions(-)
delete mode 100644 kernel/nmi_watchdog.c
diff --git a/kernel/nmi_watchdog.c b/kernel/nmi_watchdog.c
deleted file mode 100644
index a79d211..0000000
--- a/kernel/nmi_watchdog.c
+++ /dev/null
@@ -1,259 +0,0 @@
-/*
- * Detect Hard Lockups using the NMI
- *
- * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
- *
- * this code detects hard lockups: incidents in where on a CPU
- * the kernel does not respond to anything except NMI.
- *
- * Note: Most of this code is borrowed heavily from softlockup.c,
- * so thanks to Ingo for the initial implementation.
- * Some chunks also taken from arch/x86/kernel/apic/nmi.c, thanks
- * to those contributors as well.
- */
-
-#include <linux/mm.h>
-#include <linux/cpu.h>
-#include <linux/nmi.h>
-#include <linux/init.h>
-#include <linux/delay.h>
-#include <linux/freezer.h>
-#include <linux/lockdep.h>
-#include <linux/notifier.h>
-#include <linux/module.h>
-#include <linux/sysctl.h>
-
-#include <asm/irq_regs.h>
-#include <linux/perf_event.h>
-
-static DEFINE_PER_CPU(struct perf_event *, nmi_watchdog_ev);
-static DEFINE_PER_CPU(int, nmi_watchdog_touch);
-static DEFINE_PER_CPU(long, alert_counter);
-
-static int panic_on_timeout;
-
-void touch_nmi_watchdog(void)
-{
- __raw_get_cpu_var(nmi_watchdog_touch) = 1;
- touch_softlockup_watchdog();
-}
-EXPORT_SYMBOL(touch_nmi_watchdog);
-
-void touch_all_nmi_watchdog(void)
-{
- int cpu;
-
- for_each_online_cpu(cpu)
- per_cpu(nmi_watchdog_touch, cpu) = 1;
- touch_softlockup_watchdog();
-}
-
-static int __init setup_nmi_watchdog(char *str)
-{
- if (!strncmp(str, "panic", 5)) {
- panic_on_timeout = 1;
- str = strchr(str, ...My changes with the softlockup code uses an older version of softlockup.c.
A couple of commits have been committed that were not on the branch I am
using. This patch resolves those conflicts.
Commit 8c2eb4 softlockup: Stop spurious softlockup messages due to overflow
d6ad3e softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume
Conflicts:
include/linux/sched.h
kernel/kgdb.c
kernel/softlockup.c
Signed-off-by: Don Zickus <dzickus@redhat.com>
--
Hi Ingo,
If there is a better way to deal with these conflicts please let me know.
---
include/linux/sched.h | 4 ++++
kernel/kgdb.c | 6 +++---
kernel/watchdog.c | 17 ++++++++++++++++-
3 files changed, 23 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0c128ad..e039e22 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -310,6 +310,7 @@ extern void sched_show_task(struct task_struct *p);
#ifdef CONFIG_LOCKUP_DETECTOR
extern void touch_watchdog(void);
extern void touch_all_watchdogs(void);
+extern void touch_watchdog_sync(void);
extern unsigned int softlockup_panic;
extern int softlockup_thresh;
extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
@@ -322,6 +323,9 @@ static inline void touch_watchdog(void)
static inline void touch_all_watchdogs(void)
{
}
+static inline void touch_watchdog_sync(void)
+{
+}
#endif
#ifdef CONFIG_DETECT_HUNG_TASK
diff --git a/kernel/kgdb.c b/kernel/kgdb.c
index 15df969..da98119 100644
--- a/kernel/kgdb.c
+++ b/kernel/kgdb.c
@@ -596,7 +596,7 @@ static void kgdb_wait(struct pt_regs *regs)
/* Signal the primary CPU that we are done: */
atomic_set(&cpu_in_kgdb[cpu], 0);
- touch_watchdog();
+ touch_watchdog_sync();
clocksource_touch_watchdog();
local_irq_restore(flags);
}
@@ -1450,7 +1450,7 @@ acquirelock:
(kgdb_info[cpu].task &&
kgdb_info[cpu].task->pid != kgdb_sstep_pid) && --sstep_tries) {
...