Hi, The series is not yet mergeable because it would break PowerPc (hot regs snapshot API has been changed, and I don't know how to update PowerPc for that). But if you're fine with the ideas, I can integrate the necessary changes to fix this, and also separate fixes and updates. Thanks. Frederic Weisbecker (7): perf: Drop the frame reliablity check perf: Fetch hot regs from the template caller x86: Unify dumpstack.h and stacktrace.h perf: Move perf_arch_fetch_caller_regs into a macro perf: Make perf_fetch_caller_regs rewind to the first caller only perf: Use hot regs with software sched/migrate events perf: Correctly align perf event tracing buffer arch/x86/include/asm/perf_event.h | 10 ++++++- arch/x86/include/asm/stacktrace.h | 45 ++++++++++++++++++++++++++++++++ arch/x86/kernel/cpu/perf_event.c | 18 +------------ arch/x86/kernel/dumpstack.c | 1 - arch/x86/kernel/dumpstack.h | 47 ---------------------------------- arch/x86/kernel/dumpstack_32.c | 2 - arch/x86/kernel/dumpstack_64.c | 2 - arch/x86/kernel/stacktrace.c | 7 +++-- include/linux/perf_event.h | 51 ++++++++++++++---------------------- include/trace/ftrace.h | 23 ++++++++-------- kernel/perf_event.c | 10 +------ kernel/trace/trace_event_perf.c | 13 ++++++--- 12 files changed, 101 insertions(+), 128 deletions(-) delete mode 100644 arch/x86/kernel/dumpstack.h --
It is useless now that we have a pure stack frame
walker, as given addr are always reliable.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
---
arch/x86/kernel/cpu/perf_event.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index f571f51..9bc6550 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1597,8 +1597,7 @@ static void backtrace_address(void *data, unsigned long addr, int reliable)
{
struct perf_callchain_entry *entry = data;
- if (reliable)
- callchain_store(entry, addr);
+ callchain_store(entry, addr);
}
static const struct stacktrace_ops backtrace_ops = {
--
1.6.2.3
--
Trace events can be defined from a template using
DECLARE_EVENT_CLASS/DEFINE_EVENT or directly with TRACE_EVENT.
In both cases we have a template tracepoint handler, used to
record the trace, to which we pass our ftrace event instance.
In the function level, if the class is named "foo" and the event
is named "blah", we have the following chain of calls:
perf_trace_blah() -> perf_trace_templ_foo()
In the case we have several events sharing the class "blah",
we'll have multiple users of perf_trace_templ_foo(), and it
won't be inlined by the compiler. This is usually what happens
with the DECLARE_EVENT_CLASS/DEFINE_EVENT based definition.
But if perf_trace_blah() is the only caller of perf_trace_templ_foo()
there are fair chances that it will be inlined.
The problem is that we fetch the regs from perf_trace_templ_foo()
after we rewinded the frame pointer to the second caller, we want
to reach the caller of perf_trace_blah() to get the right source
of the event. And we do this by always assuming that
perf_trace_templ_foo() is not inlined. But as shown above this
is not always true. And if it is inlined we miss the first caller,
losing the most important level of precision.
We get:
61.31% ls [kernel.kallsyms] [k] do_softirq
|
--- do_softirq
irq_exit
do_IRQ
common_interrupt
|
|--25.00%-- tty_buffer_request_room
Instead of:
61.31% ls [kernel.kallsyms] [k] __do_softirq
|
--- __do_softirq
do_softirq
irq_exit
do_IRQ
common_interrupt
|
|--25.00%-- tty_buffer_request_room
To fix this, we fetch the regs from perf_trace_blah() rather ...arch/x86/include/asm/stacktrace.h and arch/x86/kernel/dumpstack.h
declare headers of objects about the same topic.
Actually most of the files that include stacktrace.h also include
dumpstack.h
Although dumpstack.h seems more reserved for internals of stack
traces, those are quite often needed to define specialized stack
trace operations. And perf event arch headers are going to need
access to such low level operations anyway. So don't continue to
bother with dumpstack.h as it's not anymore about isolated deep
internals.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
arch/x86/include/asm/stacktrace.h | 46 ++++++++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/perf_event.c | 2 -
arch/x86/kernel/dumpstack.c | 1 -
arch/x86/kernel/dumpstack.h | 47 -------------------------------------
arch/x86/kernel/dumpstack_32.c | 2 -
arch/x86/kernel/dumpstack_64.c | 2 -
arch/x86/kernel/stacktrace.c | 7 +++--
7 files changed, 50 insertions(+), 57 deletions(-)
delete mode 100644 arch/x86/kernel/dumpstack.h
diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 4dab78e..8fb70b7 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -1,3 +1,8 @@
+/*
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ * Copyright (C) 2000, 2001, 2002 Andi Kleen, SuSE Labs
+ */
+
#ifndef _ASM_X86_STACKTRACE_H
#define _ASM_X86_STACKTRACE_H
@@ -38,8 +43,49 @@ struct stacktrace_ops {
walk_stack_t walk_stack;
};
+struct pt_regs;
+struct task_struct;
+
void dump_trace(struct task_struct *tsk, struct pt_regs *regs,
unsigned long *stack, unsigned long bp,
const struct stacktrace_ops *ops, void *data);
+#ifdef CONFIG_X86_32
+#define STACKSLOTS_PER_LINE 8
+#define get_bp(bp) asm("movl %%ebp, %0" : "=r" (bp) :)
+#else
+#define STACKSLOTS_PER_LINE ...Having perf_arch_fetch_caller_regs() as a weak function overridable
by macros makes it hard to export its symbol for modules, as we
need to export the symbol in another file than the generic weak
definition to deal with compiler bugs.
Currently it's not a problem because we can define the symbol in
trace_event_perf.c as only trace events use it. But we are going
to make some generic software events to use this new facility,
so the symbol must be available even if trace events are not built.
This patch fixes the problem by making perf_arch_fetch_caller_regs
a macro that archs can define in their <asm/perf_event.h>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Archs <linux-arch@vger.kernel.org>
---
arch/x86/include/asm/perf_event.h | 10 +++++++++-
arch/x86/kernel/cpu/perf_event.c | 13 -------------
include/linux/perf_event.h | 8 +++++---
kernel/perf_event.c | 7 -------
kernel/trace/trace_event_perf.c | 2 --
5 files changed, 14 insertions(+), 26 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 124dddd..4bf3d37 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -155,7 +155,15 @@ extern void perf_events_lapic_init(void);
#define perf_instruction_pointer(regs) ((regs)->ip)
-#else
+#include <asm/stacktrace.h>
+
+#define perf_arch_fetch_caller_regs(regs, ip, skip) \
+ (regs)->ip = ip; \
+ (regs)->bp = rewind_frame_pointer(skip); \
+ (regs)->cs = __KERNEL_CS; \
+ local_save_flags((regs)->flags);
+
+#else /* !CONFIG_PERF_EVENTS */
static inline void init_hw_perf_events(void) { }
static inline void perf_events_lapic_init(void) { }
#endif
diff --git a/arch/x86/kernel/cpu/perf_event.c ...Trace events now only need to rewind the regs to the immediate
caller, so we don't need anymore to implement the support for
further stack walking.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Archs <linux-arch@vger.kernel.org>
---
arch/x86/include/asm/perf_event.h | 6 +++---
arch/x86/include/asm/stacktrace.h | 5 ++---
include/linux/perf_event.h | 30 +++++-------------------------
include/trace/ftrace.h | 2 +-
4 files changed, 11 insertions(+), 32 deletions(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 4bf3d37..1df2317 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -157,9 +157,9 @@ extern void perf_events_lapic_init(void);
#include <asm/stacktrace.h>
-#define perf_arch_fetch_caller_regs(regs, ip, skip) \
- (regs)->ip = ip; \
- (regs)->bp = rewind_frame_pointer(skip); \
+#define perf_arch_fetch_caller_regs(regs, __ip) \
+ (regs)->ip = __ip; \
+ (regs)->bp = caller_frame_pointer(); \
(regs)->cs = __KERNEL_CS; \
local_save_flags((regs)->flags);
diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 8fb70b7..62c9897 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -74,15 +74,14 @@ struct stack_frame {
unsigned long return_address;
};
-static inline unsigned long rewind_frame_pointer(int n)
+static inline unsigned long caller_frame_pointer(void)
{
struct stack_frame *frame;
get_bp(frame);
#ifdef CONFIG_FRAME_POINTER
- while (n--)
- frame = frame->next_frame;
+ frame = frame->next_frame;
#endif
return (unsigned long)frame;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ...Hello
Current linux-2.6 tree panics on my dev machine
64 bit kernel, 32bit user land
CONFIG_FRAME_POINTER=y
perf timechart record &
Instant crash
Call Trace:
perf_trace_sched_switch+0xd5/0x120
schedule+0x6b5/0x860
retint_careful+0xd/0x21
RIP ffffffff81010955 perf_arch_fetch_caller_regs+0x15/0x40
CR2: 00000000d21f1422
rewind_frame_pointer() is probably wrong.
No test performed to check frame is in current stack, or
that (!user_mode_vm(regs))
static inline unsigned long rewind_frame_pointer(int n)
{
struct stack_frame *frame;
get_bp(frame);
#ifdef CONFIG_FRAME_POINTER
while (n--)
frame = frame->next_frame;
#endif
return (unsigned long)frame;
}
--
user_mode_vm() can not work here as we are actually filling regs from scratch. But we indeed need to have a safe dereference to avoid such crashes. A simple probe_kernel_address() should do the trick. This API is going to change for the next cycle as it won't need to rewind further than the first caller. So I'm going to do a rough probe_kernel_address() fix for the current version. The next --
Can you please test this fix?
Thanks.
---
From 60d5c4e8498efc4a01abceef54ad3bc91993bf41 Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker <fweisbec@gmail.com>
Date: Thu, 8 Apr 2010 14:05:50 +0200
Subject: [PATCH] perf: Fix unsafe frame rewinding with hot regs fetching
When we fetch the hot regs and rewind to the nth caller, it
might happen that we dereference a frame pointer outside the
kernel stack boundaries, like in this example:
perf_trace_sched_switch+0xd5/0x120
schedule+0x6b5/0x860
retint_careful+0xd/0x21
Since we directly dereference a userspace frame pointer here while
rewinding behind retint_careful, this may end up in a crash.
Fix this by simply using probe_kernel_address() when we rewind the
frame pointer.
This issue will have a much more proper fix in the next version of the
perf_arch_fetch_caller_regs() API that will only need to rewind to the
first caller.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Miller <davem@davemloft.net>
Cc: Archs <linux-arch@vger.kernel.org>
---
arch/x86/kernel/dumpstack.h | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/dumpstack.h b/arch/x86/kernel/dumpstack.h
index e39e771..e1a93be 100644
--- a/arch/x86/kernel/dumpstack.h
+++ b/arch/x86/kernel/dumpstack.h
@@ -14,6 +14,8 @@
#define get_bp(bp) asm("movq %%rbp, %0" : "=r" (bp) :)
#endif
+#include <linux/uaccess.h>
+
extern void
show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
unsigned long *stack, unsigned long bp, char *log_lvl);
@@ -42,8 +44,10 @@ static inline unsigned long rewind_frame_pointer(int n)
get_bp(frame);
#ifdef CONFIG_FRAME_POINTER
- while (n--)
- frame = frame->next_frame;
+ while (n--) {
+ if (probe_kernel_address(&frame->next_frame, ...Thanks, no more crash :) Tested-by: Eric Dumazet <eric.dumazet@gmail.com> --
Ingo,
Please pull the perf/urgent branch that can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
perf/urgent
Thanks,
Frederic
---
Frederic Weisbecker (1):
perf: Fix unsafe frame rewinding with hot regs fetching
arch/x86/kernel/dumpstack.h | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)
---
commit ab285f2b5290d92b7ec1a6f9aad54308dadf6157
Author: Frederic Weisbecker <fweisbec@gmail.com>
Date: Thu Apr 8 14:05:50 2010 +0200
perf: Fix unsafe frame rewinding with hot regs fetching
When we fetch the hot regs and rewind to the nth caller, it
might happen that we dereference a frame pointer outside the
kernel stack boundaries, like in this example:
perf_trace_sched_switch+0xd5/0x120
schedule+0x6b5/0x860
retint_careful+0xd/0x21
Since we directly dereference a userspace frame pointer here while
rewinding behind retint_careful, this may end up in a crash.
Fix this by simply using probe_kernel_address() when we rewind the
frame pointer.
This issue will have a much more proper fix in the next version of the
perf_arch_fetch_caller_regs() API that will only need to rewind to the
first caller.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Miller <davem@davemloft.net>
Cc: Archs <linux-arch@vger.kernel.org>
diff --git a/arch/x86/kernel/dumpstack.h b/arch/x86/kernel/dumpstack.h
index e39e771..e1a93be 100644
--- a/arch/x86/kernel/dumpstack.h
+++ b/arch/x86/kernel/dumpstack.h
@@ -14,6 +14,8 @@
#define get_bp(bp) asm("movq %%rbp, %0" : "=r" (bp) :)
#endif
+#include <linux/uaccess.h>
+
extern void
...The trace event buffer used by perf to record raw sample events is typed as an array of char and may then not be aligned to 8 by alloc_percpu(). But we need it to be aligned to 8 in sparc64 because we cast this buffer into a random structure type built by the TRACE_EVENT() macro to store the traces. So if a random 64 bits field is accessed inside, it may be not under an expected good alignment. Use an array of long instead to force the appropriate alignment, and perform a compile time check to ensure the size in byte of the buffer is a multiple of sizeof(long) so that its actual size doesn't get shrinked under us. This fixes unaligned accesses reported while using perf lock in sparc 64. Suggested-by: David Miller <davem@davemloft.net> Suggested-by: Tejun Heo <htejun@gmail.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Tejun Heo <htejun@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> --- kernel/trace/trace_event_perf.c | 11 +++++++++-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 8e9edcd..30314f5 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -15,7 +15,12 @@ EXPORT_PER_CPU_SYMBOL_GPL(perf_trace_regs); static char *perf_trace_buf; static char *perf_trace_buf_nmi; -typedef typeof(char [PERF_MAX_TRACE_SIZE]) perf_trace_t ; +/* + * Force it to be aligned to unsigned long to avoid misaligned accesses + * suprises + */ +typedef typeof(unsigned long [PERF_MAX_TRACE_SIZE / sizeof(unsigned long)]) + perf_trace_t; /* Count the events in use (per event id, not per instance) */ static int total_ref_count; @@ -128,6 +133,8 @@ __kprobes void *perf_trace_buf_prepare(int size, unsigned short type, char *trace_buf, ...
Scheduler's task migration events don't work because they always
pass NULL regs perf_sw_event(). The event hence gets filtered
in perf_swevent_add().
Scheduler's context switches events use task_pt_regs() to get
the context when the event occured which is a wrong thing to
do as this won't give us the place in the kernel where we went
to sleep but the place where we left userspace. The result is
even more wrong if we switch from a kernel thread.
Use the hot regs snapshot for both events as they belong to the
non-interrupt/exception based events family. Unlike page faults
or so that provide the regs matching the exact origin of the event,
we need to save the current context.
This makes the task migration event working and fix the context
switch callchains and origin ip.
Example: perf record -a -e cs
Before:
10.91% ksoftirqd/0 0 [k] 0000000000000000
|
--- (nil)
perf_callchain
perf_prepare_sample
__perf_event_overflow
perf_swevent_overflow
perf_swevent_add
perf_swevent_ctx_event
do_perf_sw_event
__perf_sw_event
perf_event_task_sched_out
schedule
run_ksoftirqd
kthread
kernel_thread_helper
After:
23.77% hald-addon-stor [kernel.kallsyms] [k] schedule
|
--- schedule
|
|--60.00%-- schedule_timeout
| wait_for_common
| wait_for_completion
| blk_execute_rq
| scsi_execute
| scsi_execute_req
| sr_test_unit_ready
| |
| |--66.67%-- sr_media_change
| | media_changed
...The patch below adds the necessary stuff for powerpc. You could fold
it into your "perf: Move perf_arch_fetch_caller_regs into a macro"
patch, or keep it as a separate patch in the series (though that would
make preserving bisectability more difficult).
There is a little difficulty in that you first create a 3-argument
form of perf_arch_fetch_caller_regs and then remove one argument in a
later patch, whereas my patch below just creates the 2-argument form.
I'm sure you can resolve that one way or another.
The patch to add the old-style perf_arch_fetch_caller_regs for powerpc
is sitting in the tip/perf/urgent branch but hasn't gone anywhere.
I suppose we need to get Ingo to pull that branch into tip/perf/core
and then include a revert patch in the series that switches to the new
interface. Alternatively, if the urgent branch isn't append-only then
we could disappear it (the urgent branch would need to be rewound by
one commit).
Paul.
----------
[PATCH] perf/powerpc: Implement perf_arch_fetch_caller_regs for powerpc
This adds the ability to get a hot register snapshot for powerpc,
enabling us to get meaningful call chains for tracepoints and context
switch events.
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/include/asm/perf_event.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/powerpc/include/asm/perf_event.h b/arch/powerpc/include/asm/perf_event.h
index e6d4ce6..5c16b89 100644
--- a/arch/powerpc/include/asm/perf_event.h
+++ b/arch/powerpc/include/asm/perf_event.h
@@ -21,3 +21,15 @@
#ifdef CONFIG_FSL_EMB_PERF_EVENT
#include <asm/perf_event_fsl_emb.h>
#endif
+
+#ifdef CONFIG_PERF_EVENTS
+#include <asm/ptrace.h>
+#include <asm/reg.h>
+
+#define perf_arch_fetch_caller_regs(regs, __ip) \
+ do { \
+ (regs)->nip = __ip; \
+ (regs)->gpr[1] = *(unsigned long *)__get_SP(); \
+ asm volatile("mfmsr %0" : "=r" ((regs)->msr)); \
+ } while (0)
+#endif
--
Since the series needs a resend anyway folding back ought to be fine i think. I'm wondering whether this should get into tip:perf/urgent - or in tip:perf/core for 2.6.35. It fixes sw event call trace ugliness, but is that a 2.6.34 regression? Is there any other aspect of the series that points towards accelerating this into .34? Thanks, Ingo --
Let's have a look: perf: Correctly align perf event tracing buffer Should probably go into urgent. The change is not invasive at all. It doesn't fix a regression but it's still an important fix. The rest It depends. The whole bunch is rather invasive. The callchains of context switches never worked correctly I think. I couldn't tell if the cpu migration has ever worked. If it ever did, then it's a regression fix but in the middle of too much hot regs improvements. So I can cook a specific fix for the cpu migration event to work, and keep the rest for perf/core. --
Thanks a lot, that's axactly what I needed. --
