Re: [PATCH v2] [watchdog] combine nmi_watchdog and softlockup

Previous thread: [PATCH 1/2 v2] lockdep: Fix redundant_hardirqs_on incremented with irqs enabled by Frederic Weisbecker on Thursday, April 15, 2010 - 2:10 pm. (2 messages)

Next thread: Re: [RFC] perf_events: support for uncore a.k.a. nest units by Gary.Mohr on Thursday, April 15, 2010 - 2:16 pm. (16 messages)
From: Don Zickus
Date: Thursday, April 15, 2010 - 2:25 pm

The new nmi_watchdog (which uses the perf event subsystem) is very
similar in structure to the softlockup detector.  Using Ingo's suggestion,
I combined the two functionalities into one file, kernel/watchdog.c.

Now both the nmi_watchdog (or hardlockup detector) and softlockup detector
sit on top of the perf event subsystem, which is run every 60 seconds or so
to see if there are any lockups.

To detect hardlockups, cpus not responding to interrupts, I implemented an
hrtimer that runs 5 times for every perf event overflow event.  If that stops
counting on a cpu, then the cpu is most likely in trouble.

To detect softlockups, tasks not yielding to the scheduler, I used the
previous kthread idea that now gets kicked every time the hrtimer fires.
If the kthread isn't being scheduled neither is anyone else and the
warning is printed to the console.

I tested this on x86_64 and both the softlockup and hardlockup paths work.

V2:
- cleaned up the Kconfig and softlockup combination
- surrounded hardlockup cases with #ifdef CONFIG_PERF_EVENTS_NMI
- seperated out the softlockup case from perf event subsystem
- re-arranged the enabling/disabling nmi watchdog from proc space
- added cpumasks for hardlockup failure cases
- removed fallback to soft events if no PMU exists for hard events

TODO:
- figure out how to make an arch-agnostic clock2cycles call (if possible)
  to feed into perf events as a sample period
- probably implement some missing procfs stuff

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/kernel/apic/hw_nmi.c |    2 +-
 include/linux/sched.h         |    6 +
 init/Kconfig                  |    1 +
 kernel/Makefile               |    4 +-
 kernel/sysctl.c               |    9 +
 kernel/watchdog.c             |  570 +++++++++++++++++++++++++++++++++++++++++
 lib/Kconfig.debug             |   21 +-
 7 files changed, 605 insertions(+), 8 deletions(-)
 create mode 100644 kernel/watchdog.c

diff --git a/arch/x86/kernel/apic/hw_nmi.c ...
From: Randy Dunlap
Date: Thursday, April 15, 2010 - 3:32 pm

---
~Randy
--

From: Don Zickus
Date: Friday, April 16, 2010 - 7:12 am

Thanks for the comments, will try to get them added in the next spin.

Cheers,
Don
--

From: Frederic Weisbecker
Date: Thursday, April 15, 2010 - 6:47 pm

In fact we also have the sample_freq thing that let you run
against a frequency (events per sec) rather than a sample
period.

We do some calculations that recompute the actual sample
period on top of this frequency in a regular base as the
events come.

It's rather unfortunate we can't have 0.xxxx frequencies
but may be we can work that out by adding a freq_unscale
field, which would basically produce the frequency like
this:

	freq = event->attr.sample_freq * (10 ^ -event->attr.freq_unscale)

This is not trivial though, so let's rather focus on the



That looks too general. It's more about the fact the arch supports



I'm confused, do we have two versions of the softlockup






Please use __get_cpu_var() instead so that we keep the preempt disabled









Why do you keep the software clock, I don't see how it can be
useful to detect hardlockups. It triggers in a regular irq









Please avoid such ifdefs in the middle of the code.

It's better to gather hardlockup matters in a single ifdef block:

#ifdef CONFIG_PERF_EVENTS_NMI
define your functions here
#else
define (if needed) your off case functions here (static inline stubs)



You probably want a backtrace cpu mask here as well



Please warn here instead, nothing seriously dangerous is going to



Especially such kind of idef in a function plus goto in the middle, we really
want to avoid that.

You want a watchdog_nmi_enable() call instead that does nothing












I'd suggest to drop the NMI prefix, this is not anymore
about detecting hard lockups only.

May be config WATCHDOG or config LOCKUP_DETECTOR if it's taken
already.


Also you should half-drop the DETECT_SOFTLOCKUP thing:
keep it's definition but drop the ability to choose it from
the prompt:

config DETECT_SOFTLOCKUP
	bool
	depends on DEBUG_KERNEL && !S390
	default y

This way we keep it for compatibility with def_configs, it will
enable the WATCHDOG by default if it is ...
From: Don Zickus
Date: Friday, April 16, 2010 - 7:12 am

I was trying to figure out a way to add the PERF_EVENTS dependency as I
didn't want to impose it on the CONFIG_NMI_WATCHDOG if that config

Originally Ingo talked about a migration path, so I was going to support
the older one in case the new one was having issues, sort of like what he
suggested about moving the nmi code from arch/x86/kernel/apic/nmi.c to
kernel/watchdog.c.  But I can probably drop the softlockup case as the

Yeah, sorry.  I dropped part of the code per your suggestion and forget to





I was just thinking of the case where dispite the WARN above, the cpu
actually recovered and then failed again separately.  But I probably won't






I understand the general idea but not quite the implementation idea.  I will work
on it and see what I come up with.

Thanks for the review!

Cheers,
Don
--

From: Frederic Weisbecker
Date: Friday, April 16, 2010 - 7:43 am

This is really just a corner case, I guess you don't need to
bother with that. It is actually racy against other cpus and adding
a spinlock here (in the everything is fine path) would be an overkill.

In fact, having two per cpu vars named hardlockup_warned and
softlockup_warned would be better than cpumasks. I'm sorry I
suggested you the cpumask, but such per cpu vars will avoid
you dealing with these synchonization issues. And one of the primary




We current have:

config DETECT_SOFTLOCKUP
     bool "Blah"
     depends on DEBUG_KERNEL && !S390
     default y
     help
       .......

The idea is to remove the "Blah" so that the user can't select it
anymore from make menuconfig, and to remove the help too as it's useless
too.

So that config WATCHDOG can be default y if DETECT_SOFTLOCKUP.
Then if someone comes with a config that has DETECT_SOFTLOCKUP,
it's new implementation (WATCHDOG) will enabled by default.

--

From: Don Zickus
Date: Friday, April 16, 2010 - 8:04 am

Yeah, I guess per cpu is better.  I agree that locks in NMI are frowned
upon but I wasn't sure of it was dealt with.

I'll try to implement this.  Any objections if I combined hardlockup and
softlockup with per cpu watchdog_warn and have bit masks for HARDLOCKUP

Ah, I missed the bool part.  I got it.  Thanks for the clarification.

Cheers,
--

From: Frederic Weisbecker
Date: Friday, April 16, 2010 - 8:32 am

They work in fact. They are just not checked by lockdep.
And mostly they are very dangerous: if something else can
take it (from interrupt, from context) then this is a deadlock.
And even though we ensure this is only taken from NMI, we tend



Hmm, a hardlockup can come in after a softlockup.
Don't worry too much about memory: usually the more you have cpu,
the more you have memory :)
Plus this is debugging code, not something supposed to be enabled
in production.

--

From: Don Zickus
Date: Friday, April 16, 2010 - 9:14 am

Let me re-explain what I meant.  It was meant to do double duty.  The
softlockup code only checks the SOFTLOCKUP bit and the hardlockup only
ever checks the HARDLOCKUP bit.


Well depends on your POV.  In RHEL we enable both NMI_WATCHDOG and
SOFTLOCKUP on production systems (and we have customers that are
thankful for that :-) ).

Cheers,
Don
--

From: Frederic Weisbecker
Date: Friday, April 16, 2010 - 9:24 am

Ok :)

--

From: Cyrill Gorcunov
Date: Friday, April 16, 2010 - 7:32 am

On Fri, Apr 16, 2010 at 03:47:14AM +0200, Frederic Weisbecker wrote:

Hmm, this is NMI handler path so from what we protect this per-cpu data?
Do I miss something? /me confused

	-- Cyrill
--

From: Frederic Weisbecker
Date: Friday, April 16, 2010 - 7:46 am

The cpu mask is not per cpu here, this is a shared bitmap, so you
can race against other cpus NMIs.

That said, as I suggested, having a per cpu var that we set when we
warned would be much better than a spinlock here.

--

From: Peter Zijlstra
Date: Friday, April 16, 2010 - 7:53 am

Every time you think NMI and spinlock think FAIL.
--

From: Frederic Weisbecker
Date: Friday, April 16, 2010 - 7:59 am

In fact I was first inspired by the x86 nmi watchdog handler
that does this spinlock to protect cpumask, but realize just
after my FAIL ;-)

--

From: Cyrill Gorcunov
Date: Friday, April 16, 2010 - 7:54 am

On Fri, Apr 16, 2010 at 04:46:17PM +0200, Frederic Weisbecker wrote:

yeah, saw DECLARE_BITMAP but read it as DEFINE_PER_CPU for some reason.
having any spinlock in irq handler is really under suspicious.

	-- Cyrill
--

From: Don Zickus
Date: Friday, April 16, 2010 - 7:46 am

It's not per-cpu which is the problem I believe.  If you have multiple
cpus dealing with hardlockups than they can overwrite each other's data.

Cheers,
Don
--

From: Don Zickus
Date: Monday, April 19, 2010 - 2:21 pm

Hello all,

After making a bunch of cleanups, I am stuck debating whether to continue
updating this patch on the stale branch perf/nmi on Ingo's tree or just
repost the whole patch again (which isn't much bigger just adds the
arch/x86/kernel/apic/hw_nmi.c piece).

Part of the new patch series includes removing kernel/nmi_watchdog.c,
which seemed kinda silly because it was only an intermediate file until
things got shifted to kernel/watchdog.c

Thoughts?

Cheers,
Don
--

From: Ingo Molnar
Date: Monday, April 19, 2010 - 2:35 pm

I'd prefer relative patches as the current perf/nmi bits are tested quite 
well.

Intermediate stages are not a problem: 90% of the code in the kernel's Git 
history is 'intermediate' as well, in hindsight. What matters is that the 
workflow that resulted was clean and that the patches were (and are) clean.

	Ingo
--

From: Don Zickus
Date: Monday, April 19, 2010 - 2:51 pm

Ok, I'll continue that then.  Thanks.

Cheers,
Don
--

Previous thread: [PATCH 1/2 v2] lockdep: Fix redundant_hardirqs_on incremented with irqs enabled by Frederic Weisbecker on Thursday, April 15, 2010 - 2:10 pm. (2 messages)

Next thread: Re: [RFC] perf_events: support for uncore a.k.a. nest units by Gary.Mohr on Thursday, April 15, 2010 - 2:16 pm. (16 messages)