This patch set tries to solve the local_irq_disable() vs NMI problem that SPARC has by providing new arch hooks and instrumenting the existing interface to WARN on conflicting usage. --
Provide local_irq_{save,restore}_nmi() which will allow us to help
architectures that implement NMIs using IRQ priorities like SPARC64
does.
Sparc uses IRQ priority 15 for NMIs and implements local_irq_disable() as
"disable priorities <= 14". However, if you do that while inside an NMI,
you lower the mask and re-enable the NMI priority, causing all kinds of fun.
A more solid implementation would first check the current disable level
and never lower it; however, that is more costly and would slow down the
rest of the kernel for no particular reason.
Therefore, introduce local_irq_save_nmi(), which can implement this
slower but more solid scheme, and disallow local_irq_save() from NMI
context.
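The following is a minimal user-space model of the problem described above, not the SPARC64 code. A single variable `pil` stands in for the CPU's interrupt level (priorities at or below it are treated as masked), and the `sim_*` functions and the `PIL_NORMAL_MAX`/`PIL_NMI` constants are hypothetical names chosen to mirror the text. It contrasts the fast "always write 14" save with a slower save that reads the level first and never lowers it.

```c
#include <assert.h>

/* Hypothetical constants mirroring the text: NMIs use priority 15,
 * local_irq_disable() masks priorities <= 14. */
#define PIL_NORMAL_MAX	14
#define PIL_NMI		15

/* Stand-in for one CPU's interrupt level: priorities <= pil are masked. */
static unsigned long pil;

/* Fast scheme: unconditionally write 14. Inside an NMI (pil == 15)
 * this lowers the level and lets NMIs in again. */
static unsigned long sim_irq_save(void)
{
	unsigned long flags = pil;

	pil = PIL_NORMAL_MAX;
	return flags;
}

static void sim_irq_restore(unsigned long flags)
{
	pil = flags;
}

/* Slower scheme: read the current level first and never lower it. */
static unsigned long sim_irq_save_nmi(void)
{
	unsigned long flags = pil;

	if (pil < PIL_NORMAL_MAX)
		pil = PIL_NORMAL_MAX;
	assert(pil >= flags);	/* the level only ever goes up */
	return flags;
}

static void sim_irq_restore_nmi(unsigned long flags)
{
	pil = flags;
}

/* Nonzero if an NMI could be delivered at the current level. */
static int sim_nmi_deliverable(void)
{
	return pil < PIL_NMI;
}
```

Called from an NMI handler (`pil == 15`), `sim_irq_save()` leaves NMIs deliverable until the matching restore, while `sim_irq_save_nmi()` keeps the level at 15. The extra read-and-compare is the cost the changelog does not want to impose on every other caller.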
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/irqflags.h | 51 ++++++++++++++++++++++++++++++++++++++++---
kernel/lockdep.c | 7 +++++
kernel/trace/trace_irqsoff.c | 8 ++++++
3 files changed, 63 insertions(+), 3 deletions(-)
Index: linux-2.6/include/linux/irqflags.h
===================================================================
--- linux-2.6.orig/include/linux/irqflags.h
+++ linux-2.6/include/linux/irqflags.h
@@ -18,6 +18,7 @@
extern void trace_softirqs_off(unsigned long ip);
extern void trace_hardirqs_on(void);
extern void trace_hardirqs_off(void);
+ extern void trace_hardirqs_off_no_nmi(void);
# define trace_hardirq_context(p) ((p)->hardirq_context)
# define trace_softirq_context(p) ((p)->softirq_context)
# define trace_hardirqs_enabled(p) ((p)->hardirqs_enabled)
@@ -30,6 +31,7 @@
#else
# define trace_hardirqs_on() do { } while (0)
# define trace_hardirqs_off() do { } while (0)
+# define trace_hardirqs_off_no_nmi() do { } while (0)
# define trace_softirqs_on(ip) do { } while (0)
# define trace_softirqs_off(ip) do { } while (0)
# define trace_hardirq_context(p) 0
@@ -59,15 +61,15 @@
#define local_irq_enable() \
do { trace_hardirqs_on(); raw_local_irq_enable(); } while (0)
#define ...
Should we do this for all archs? I can imagine a lot of warning reports coming in the near future. And they will be passing it towards me. --
From: Steven Rostedt <rostedt@goodmis.org>
That's the whole point: so that the problem is more easily noticed and
gets fixed long before I end up accidentally testing the code on my
machines :-)
To be honest, the fix is so trivial, you just need to add '_nmi' to
the local_irq_{save,restore}() calls that warn like this.
I'm even willing to have you forward all of those reports to me and
I'll be responsible for fixing them.
How's that? :-)
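The lockdep.c and trace_irqsoff.c hunks that add the warning are truncated in this thread, so the following only sketches the idea under stated assumptions: `nmi_depth` stands in for the kernel's in_nmi(), a counter plus a message stands in for the WARN, and every `sim_*` name is hypothetical. The plain save path complains when called from NMI context, while the `_nmi` path is the sanctioned one and stays silent.

```c
#include <assert.h>
#include <stdio.h>

static int nmi_depth;		/* hypothetical stand-in for in_nmi() */
static int warn_count;		/* counts what a WARN would report */
static unsigned long irqs_off;	/* 0 = IRQs enabled, 1 = disabled */

static void warn_if_nmi(const char *caller)
{
	if (nmi_depth > 0) {
		warn_count++;
		fprintf(stderr, "WARNING: %s() from NMI context, use the _nmi variant\n",
			caller);
	}
}

/* Plain variant: still works, but is flagged when used from an NMI. */
static unsigned long sim_local_irq_save(void)
{
	unsigned long flags = irqs_off;

	warn_if_nmi("local_irq_save");
	irqs_off = 1;
	return flags;
}

static void sim_local_irq_restore(unsigned long flags)
{
	assert(irqs_off == 1);	/* a restore must pair with a save */
	irqs_off = flags;
}

/* _nmi variant: explicitly allowed from NMI context, so no check. */
static unsigned long sim_local_irq_save_nmi(void)
{
	unsigned long flags = irqs_off;

	irqs_off = 1;
	return flags;
}

static void sim_local_irq_restore_nmi(unsigned long flags)
{
	irqs_off = flags;
}
```

Under this model, fixing a reported site is exactly the rename described above: swap the save/restore pair for the `_nmi` pair and the warning goes away.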
--
Sure. /me sets up his procmailrc to search for the WARN_ON line in trace_irqsoff.c and have it forward to davem. -- Steve --
Since we can call cpu_clock() from NMI context, fix up the IRQ disabling to conform to the new rules.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
kernel/sched_clock.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6/kernel/sched_clock.c
===================================================================
--- linux-2.6.orig/kernel/sched_clock.c
+++ linux-2.6/kernel/sched_clock.c
@@ -241,9 +241,9 @@ unsigned long long cpu_clock(int cpu)
unsigned long long clock;
unsigned long flags;
- local_irq_save(flags);
+ local_irq_save_nmi(flags);
clock = sched_clock_cpu(cpu);
- local_irq_restore(flags);
+ local_irq_restore_nmi(flags);
return clock;
}
--
That seems to add a small overhead in various places. Do we want to make local_irq_save_nmi == local_irq_save for archs that have native NMIs? Or cpu_clock_nmi()? --
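One way the first suggestion could look, sketched in user space: a generic header maps the `_nmi` variants to the plain ones unless an architecture has already defined its own. This illustrates the common `#ifndef` override pattern and is not code from this thread; `irq_state`, the stand-in `local_irq_save()`/`local_irq_restore()` macros, and `sample_with_irqs_off()` are assumptions that replace the real arch implementations.

```c
#include <assert.h>

/* Hypothetical stand-in for the CPU's IRQ-enable state: 1 = disabled. */
static unsigned long irq_state;

/* Stand-ins for the plain arch primitives. */
#define local_irq_save(flags)	 do { (flags) = irq_state; irq_state = 1; } while (0)
#define local_irq_restore(flags) do { irq_state = (flags); } while (0)

/*
 * Generic fallback: an arch whose NMIs cannot be unmasked by plain IRQ
 * disabling gets the _nmi variants for free, with no extra overhead.
 * An arch like SPARC64 would define local_irq_save_nmi() itself before
 * this point, and this block would be skipped.
 */
#ifndef local_irq_save_nmi
#define local_irq_save_nmi(flags)	local_irq_save(flags)
#define local_irq_restore_nmi(flags)	local_irq_restore(flags)
#endif

/* Example caller in the style of cpu_clock(): unchanged on either kind of arch. */
static unsigned long sample_with_irqs_off(void)
{
	unsigned long flags, seen;

	local_irq_save_nmi(flags);
	seen = irq_state;
	local_irq_restore_nmi(flags);
	assert(irq_state == flags);	/* state is back to what it was */
	return seen;
}
```

With the fallback in place, callers such as cpu_clock() can use the `_nmi` names unconditionally, and only architectures that need the slower check pay for it.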
Patch 8bb39f9 (perf: Fix 'perf sched record' deadlock) introduced a
local_irq_save() in NMI context; convert that to local_irq_save_nmi()
and move the IRQ disable into perf_output_lock/unlock().
The former is needed because we now disallow local_irq_disable() from
NMI context due to some arch limitations.
The latter is needed because this is really about IRQ lock inversion with
that funny output lock, and perf_event_task_output() is only one of the
sites that could trigger it.
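A simplified user-space model of the second change: the saved IRQ state lives in the output handle (mirroring the new `flags` member of `struct perf_output_handle`), is saved when the output lock is taken, and is restored when it is released, so every user of the lock pair is covered rather than only perf_event_task_output(). The `sim_*` functions, `irq_state`, and the `held` field are hypothetical, and the real lock-spinning and wakeup logic is omitted.

```c
#include <assert.h>

static unsigned long irq_state;	/* hypothetical: 0 = IRQs on, 1 = off */

static void sim_irq_save_nmi(unsigned long *flags)
{
	*flags = irq_state;
	irq_state = 1;
}

static void sim_irq_restore_nmi(unsigned long flags)
{
	irq_state = flags;
}

struct sim_output_handle {
	unsigned long flags;	/* mirrors the new perf_output_handle::flags */
	int held;		/* hypothetical: output lock currently held */
};

static void sim_output_lock(struct sim_output_handle *handle)
{
	/* Since this is a lock we need to be IRQ-safe. */
	sim_irq_save_nmi(&handle->flags);
	handle->held = 1;
}

static void sim_output_unlock(struct sim_output_handle *handle)
{
	assert(handle->held);
	handle->held = 0;
	sim_irq_restore_nmi(handle->flags);
}

/* Caller in the style of perf_event_task_output(): it no longer saves or
 * restores IRQ state itself. Returns whether IRQs were off while the
 * lock was held. */
static int sim_task_output(struct sim_output_handle *handle)
{
	int irqs_off_inside;

	sim_output_lock(handle);
	irqs_off_inside = (irq_state == 1);
	sim_output_unlock(handle);
	return irqs_off_inside;
}
```

Keeping the flags in the handle rather than on the caller's stack is what lets the lock and unlock functions, which are separate calls, pair up the save and restore.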
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
---
include/linux/perf_event.h | 1 +
kernel/perf_event.c | 17 ++++++-----------
2 files changed, 7 insertions(+), 11 deletions(-)
Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -758,6 +758,7 @@ struct perf_output_handle {
struct perf_mmap_data *data;
unsigned long head;
unsigned long offset;
+ unsigned long flags;
int nmi;
int sample;
int locked;
Index: linux-2.6/kernel/perf_event.c
===================================================================
--- linux-2.6.orig/kernel/perf_event.c
+++ linux-2.6/kernel/perf_event.c
@@ -2848,6 +2848,10 @@ static void perf_output_lock(struct perf
struct perf_mmap_data *data = handle->data;
int cur, cpu = get_cpu();
+ /*
+ * Since this is a lock we need to be IRQ-safe
+ */
+ local_irq_save_nmi(handle->flags);
handle->locked = 0;
for (;;) {
@@ -2906,6 +2910,7 @@ again:
if (atomic_xchg(&data->wakeup, 0))
perf_output_wakeup(handle);
out:
+ local_irq_restore_nmi(handle->flags);
put_cpu();
}
@@ -3385,19 +3390,10 @@ static void perf_event_task_output(struc
unsigned long flags;
int size, ret;
- /*
- * If this CPU attempts to acquire an rq lock held by a CPU spinning
- * in perf_output_lock() from interrupt context, it's game over.
- ...
From: Peter Zijlstra <a.p.zijlstra@chello.nl> Thanks Peter, looks good at first sight. I'll toss together the sparc bits and give this a go. --
From: Peter Zijlstra <a.p.zijlstra@chello.nl> Ok, two patches coming. One which adds the sparc64 irqflags.h methods. And one which annotates the ftrace functions, as needed. With this I have the function tracer working. Thanks! --