Re: [PATCH 1/6] [watchdog] combine nmi_watchdog and softlockup

Previous thread: [PATCH 2/6] [watchdog] convert touch_softlockup_watchdog to touch_watchdog by Don Zickus on Tuesday, April 20, 2010 - 8:23 am. (6 messages)

Next thread: Your Reply by Mr. David Smith on Tuesday, April 20, 2010 - 8:20 am. (1 message)
From: Don Zickus
Date: Tuesday, April 20, 2010 - 8:23 am

This patch series covers mostly the changes necessary for combining the
nmi_watchdog and softlockup code.  Also added are the cleanups associated
with the change, like removing the old files.

The changelogs in each patch are more specific to what the changes are.

Don Zickus (5):
  [watchdog] combine nmi_watchdog and softlockup
  [watchdog] convert touch_softlockup_watchdog to touch_watchdog
  [watchdog] remove old softlockup code
  [watchdog] remove nmi_watchdog.c file
  [x86] watchdog: move trigger_all_cpu_backtrace to its own die_notifier
  [x86] watchdog: cleanup hw_nmi.c cruft

 Documentation/kernel-parameters.txt |    2 +
 arch/ia64/kernel/time.c             |    2 +-
 arch/ia64/kernel/uncached.c         |    2 +-
 arch/ia64/xen/time.c                |    2 +-
 arch/mn10300/kernel/gdb-stub.c      |    2 +-
 arch/powerpc/kernel/swsusp_64.c     |    2 +-
 arch/sparc/kernel/nmi.c             |    2 +-
 arch/x86/include/asm/nmi.h          |    2 +-
 arch/x86/kernel/apic/Makefile       |    4 +-
 arch/x86/kernel/apic/hw_nmi.c       |  113 +++-----
 arch/x86/kernel/apic/nmi.c          |    2 +-
 arch/x86/kernel/traps.c             |    4 +-
 arch/x86/xen/smp.c                  |    2 +-
 drivers/ide/ide-iops.c              |    2 +-
 drivers/ide/ide-taskfile.c          |    2 +-
 drivers/mtd/nand/nand_base.c        |    4 +-
 drivers/video/nvidia/nv_accel.c     |    2 +-
 include/linux/nmi.h                 |   13 +-
 include/linux/sched.h               |   20 +-
 init/Kconfig                        |    5 +-
 kernel/Makefile                     |    3 +-
 kernel/kgdb.c                       |    6 +-
 kernel/nmi_watchdog.c               |  259 ----------------
 kernel/panic.c                      |    2 +-
 kernel/power/hibernate.c            |    2 +-
 kernel/sched.c                      |    2 +-
 kernel/sched_clock.c                |    2 +-
 kernel/softlockup.c                 |  278 ------------------
 kernel/sysctl.c                     |   52 ++--
 ...
From: Don Zickus
Date: Tuesday, April 20, 2010 - 8:23 am

The new nmi_watchdog (which uses the perf event subsystem) is very
similar in structure to the softlockup detector.  Using Ingo's suggestion,
I combined the two functionalities into one file, kernel/watchdog.c.

Now both the nmi_watchdog (or hardlockup detector) and softlockup detector
sit on top of the perf event subsystem, which is run every 60 seconds or so
to see if there are any lockups.

To detect hardlockups, cpus not responding to interrupts, I implemented an
hrtimer that runs 5 times for every perf event overflow event.  If that stops
counting on a cpu, then the cpu is most likely in trouble.

To detect softlockups, tasks not yielding to the scheduler, I used the
previous kthread idea that now gets kicked every time the hrtimer fires.
If the kthread isn't being scheduled neither is anyone else and the
warning is printed to the console.

I tested this on x86_64 and both the softlockup and hardlockup paths work.

V2:
- cleaned up the Kconfig and softlockup combination
- surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI
- seperated out the softlockup case from perf event subsystem
- re-arranged the enabling/disabling nmi watchdog from proc space
- added cpumasks for hardlockup failure cases
- removed fallback to soft events if no PMU exists for hard events

V3:
- comment cleanups
- drop support for older softlockup code
- per_cpu cleanups
- completely remove software clock base hardlockup detector
- use per_cpu masking on hard/soft lockup detection
- #ifdef cleanups
- rename config option NMI_WATCHDOG to LOCKUP_DETECTOR
- documentation additions

TODO:
- figure out how to make an arch-agnostic clock2cycles call (if possible)
  to feed into perf events as a sample period

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 Documentation/kernel-parameters.txt |    2 +
 arch/x86/include/asm/nmi.h          |    2 +-
 arch/x86/kernel/apic/Makefile       |    4 +-
 arch/x86/kernel/apic/hw_nmi.c       |    2 +-
 arch/x86/kernel/traps.c             |   ...
From: Randy Dunlap
Date: Tuesday, April 20, 2010 - 8:53 am

Hi Don,

I realize that you didn't introduce Arch in this help text,
but can we eliminate him, please?  E.g.:

	  Platform can generate an NMI using the perf event subsystem.
or




---
~Randy
--

From: Don Zickus
Date: Tuesday, April 20, 2010 - 9:11 am

yup.

Thanks,
Don
--

From: Frederic Weisbecker
Date: Wednesday, April 21, 2010 - 10:27 am

Some minor things:

















__get_cpu_var

--

From: Don Zickus
Date: Wednesday, April 21, 2010 - 10:50 am

well, I already have this_cpu and need it later, I figured I would just
use it with per_cpu and save _get_cpu_var the work of re-running



again same as above.

Thanks for the review.

Cheers,
Don
--

From: Frederic Weisbecker
Date: Wednesday, April 21, 2010 - 1:24 pm

This is more about code clarity in fact. per_cpu() suggests we are
fetching something from another cpu.

This is a very minor issue though.

--

From: Don Zickus
Date: Wednesday, April 21, 2010 - 1:49 pm

I have no problem changing it.

Cheers,
Don
--

From: Don Zickus
Date: Tuesday, April 20, 2010 - 8:24 am

This file migrated to kernel/watchdog.c and then combined with
kernel/softlockup.c.  As a result kernel/nmi_watchdog.c is no longer
needed.  Just remove it.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 kernel/nmi_watchdog.c |  259 -------------------------------------------------
 1 files changed, 0 insertions(+), 259 deletions(-)
 delete mode 100644 kernel/nmi_watchdog.c

diff --git a/kernel/nmi_watchdog.c b/kernel/nmi_watchdog.c
deleted file mode 100644
index a79d211..0000000
--- a/kernel/nmi_watchdog.c
+++ /dev/null
@@ -1,259 +0,0 @@
-/*
- * Detect Hard Lockups using the NMI
- *
- * started by Don Zickus, Copyright (C) 2010 Red Hat, Inc.
- *
- * this code detects hard lockups: incidents in where on a CPU
- * the kernel does not respond to anything except NMI.
- *
- * Note: Most of this code is borrowed heavily from softlockup.c,
- * so thanks to Ingo for the initial implementation.
- * Some chunks also taken from arch/x86/kernel/apic/nmi.c, thanks
- * to those contributors as well.
- */
-
-#include <linux/mm.h>
-#include <linux/cpu.h>
-#include <linux/nmi.h>
-#include <linux/init.h>
-#include <linux/delay.h>
-#include <linux/freezer.h>
-#include <linux/lockdep.h>
-#include <linux/notifier.h>
-#include <linux/module.h>
-#include <linux/sysctl.h>
-
-#include <asm/irq_regs.h>
-#include <linux/perf_event.h>
-
-static DEFINE_PER_CPU(struct perf_event *, nmi_watchdog_ev);
-static DEFINE_PER_CPU(int, nmi_watchdog_touch);
-static DEFINE_PER_CPU(long, alert_counter);
-
-static int panic_on_timeout;
-
-void touch_nmi_watchdog(void)
-{
-	__raw_get_cpu_var(nmi_watchdog_touch) = 1;
-	touch_softlockup_watchdog();
-}
-EXPORT_SYMBOL(touch_nmi_watchdog);
-
-void touch_all_nmi_watchdog(void)
-{
-	int cpu;
-
-	for_each_online_cpu(cpu)
-		per_cpu(nmi_watchdog_touch, cpu) = 1;
-	touch_softlockup_watchdog();
-}
-
-static int __init setup_nmi_watchdog(char *str)
-{
-	if (!strncmp(str, "panic", 5)) {
-		panic_on_timeout = 1;
-		str = strchr(str, ...
From: Don Zickus
Date: Tuesday, April 20, 2010 - 9:16 am

My changes with the softlockup code uses an older version of softlockup.c.
A couple of commits have been committed that were not on the branch I am
using.  This patch resolves those conflicts.

Commit 8c2eb4 softlockup: Stop spurious softlockup messages due to overflow
       d6ad3e softlockup: Add sched_clock_tick() to avoid kernel warning on kgdb resume

Conflicts:

	include/linux/sched.h
	kernel/kgdb.c
	kernel/softlockup.c

Signed-off-by: Don Zickus <dzickus@redhat.com>

--

Hi Ingo,

If there is a better way to deal with these conflicts please let me know.

---
 include/linux/sched.h |    4 ++++
 kernel/kgdb.c         |    6 +++---
 kernel/watchdog.c     |   17 ++++++++++++++++-
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0c128ad..e039e22 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -310,6 +310,7 @@ extern void sched_show_task(struct task_struct *p);
 #ifdef CONFIG_LOCKUP_DETECTOR
 extern void touch_watchdog(void);
 extern void touch_all_watchdogs(void);
+extern void touch_watchdog_sync(void);
 extern unsigned int  softlockup_panic;
 extern int softlockup_thresh;
 extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
@@ -322,6 +323,9 @@ static inline void touch_watchdog(void)
 static inline void touch_all_watchdogs(void)
 {
 }
+static inline void touch_watchdog_sync(void)
+{
+}
 #endif
 
 #ifdef CONFIG_DETECT_HUNG_TASK
diff --git a/kernel/kgdb.c b/kernel/kgdb.c
index 15df969..da98119 100644
--- a/kernel/kgdb.c
+++ b/kernel/kgdb.c
@@ -596,7 +596,7 @@ static void kgdb_wait(struct pt_regs *regs)
 
 	/* Signal the primary CPU that we are done: */
 	atomic_set(&cpu_in_kgdb[cpu], 0);
-	touch_watchdog();
+	touch_watchdog_sync();
 	clocksource_touch_watchdog();
 	local_irq_restore(flags);
 }
@@ -1450,7 +1450,7 @@ acquirelock:
 	    (kgdb_info[cpu].task &&
 	     kgdb_info[cpu].task->pid != kgdb_sstep_pid) && --sstep_tries) {
 ...
Previous thread: [PATCH 2/6] [watchdog] convert touch_softlockup_watchdog to touch_watchdog by Don Zickus on Tuesday, April 20, 2010 - 8:23 am. (6 messages)

Next thread: Your Reply by Mr. David Smith on Tuesday, April 20, 2010 - 8:20 am. (1 message)