There is a warn_on_once() check for PERF_SAMPLE_RAW which trips when using PEBS on both Core and Nehalem. Core PEBS sample size is 144 bytes and 176 bytes for Nehalem. Both are multiples of 8, but the size field is encoded as int, thus the total is never a multiple of 8 which trips the check. I think the size should have been u64, but now it is too late to change given it is ABI. Signed-off-by: Stephane Eranian <eranian@google.com> diff --git a/kernel/perf_event.c b/kernel/perf_event.c index 8143e77..fffeb95 100644 --- a/kernel/perf_event.c +++ b/kernel/perf_event.c @@ -3311,7 +3311,6 @@ void perf_prepare_sample(struct perf_event_header *header, else size += sizeof(u32); - WARN_ON_ONCE(size & (sizeof(u64)-1)); header->size += size; } --
PEBS hasn't seen -linus yet, so we can fix that. There's various things that do indeed rely on the perf buffer to always be u64 aligned, so this warning isn't bogus at all. --
On the subject of PEBS, we need to change the ABI before it does hit
-linus, I've got something like the below,. but I'm not quite sure of
it.
We could keep the PERF_RECORD_MISC_EXACT bit and allow non-fixed up IPs
in PERF_SAMPLE_EVENT_IP fields, that way we can avoid having to add both
IP and EVENT_IP to samples.
---
Subject: perf: Change the PEBS interface
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Mar 30 13:35:58 CEST 2010
This patch changes the PEBS interface from perf_event_attr::precise
and PERF_RECORD_MISC_EXACT to perf_event_attr::precise and
PERF_SAMPLE_EVENT_IP.
The rationale is that .precise=1 !PERF_SAMPLE_EVENT_IP could be used
to implement buffered PEBS.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
arch/x86/include/asm/perf_event.h | 9 --
arch/x86/kernel/cpu/perf_event_intel_ds.c | 133 +++++++++++++-----------------
include/linux/perf_event.h | 7 +
kernel/perf_event.c | 6 +
tools/perf/builtin-top.c | 9 +-
5 files changed, 79 insertions(+), 85 deletions(-)
Index: linux-2.6/arch/x86/include/asm/perf_event.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/perf_event.h
+++ linux-2.6/arch/x86/include/asm/perf_event.h
@@ -128,21 +128,12 @@ extern void perf_events_lapic_init(void)
#define PERF_EVENT_INDEX_OFFSET 0
-/*
- * Abuse bit 3 of the cpu eflags register to indicate proper PEBS IP fixups.
- * This flag is otherwise unused and ABI specified to be 0, so nobody should
- * care what we do with it.
- */
-#define PERF_EFLAGS_EXACT (1UL << 3)
-
#define perf_misc_flags(regs) \
({ int misc = 0; \
if (user_mode(regs)) \
misc |= PERF_RECORD_MISC_USER; \
else \
misc |= PERF_RECORD_MISC_KERNEL; \
- if (regs->flags & PERF_EFLAGS_EXACT) \
- misc |= PERF_RECORD_MISC_EXACT; \
misc; })
#define ...You always have the original IP in the PEBS sample, so why add yet another way I am not sure I understand what you mean by buffered PEBS. Are you talking about using PEBS buffer bigger than one entry? If so, how can you do: - the LBR based fixups for multiple samples (on PMU interrupt)
Are you suggesting you add some padding the PEBS raw sample you return as PERF_SAMPLE_RAW? Then you need to define what RAW actually means? Seems here, it would mean more than what the I assume this has to do with the wrap-around detection. --
Well, RAW doesn't mean anything much at all, its really a fugly pass some crap around thing. So yeah, adding padding seems just fine. --
I would rather see size as u64. Who's using raw today anyway? The PEBS record itself is always a multiple of u64. The size header is the one --
The trace events. Hence the size of the size shouldn't be touched, it is an ABI now. --
Given your alignment constraints, it seems like it was a bad choice to pick u32 for size to begin with. --
Indeed, I'm not exactly sure how this is dealt since this is indeed u32 and the buffer requires to align to u64... --
Ah I remember now how we did that. We align the trace events raw sample size such that sizeof(raw_sample) + sizeof(size) is aligned to u64. Well, indeed that complexifies a bit the raw_sample size handling but at least it makes the traces a bit more compact. --
Nah, mostly dealing with architectures that don't do well with unaligned accesses. --
The best fix here is to simply remove the PERF_SAMPLE_RAW support and implement PERF_SAMPLE_REGS. Its just that we need to come up with a way to deal with compat pt_regs muck. --
But isn't RAW already part of the ABI? It is true that you need to also provide the interrupted state (or part of it) to the user, i.e., beyond just the IP. But the interrupt state and PEBS state are distinct and should not be swapped. In fact, both may be useful at the same time to evaluate the skid. So yes, you need PERF_SAMPLE_REGS. But just like your proposal with the two IPs. I think you need to export two versions of REGS: IREGS and PREGS for instance. The issue is that PEBS does not record the whole state but only a very small subset. Then, on Nehalem, there is PEBS more where you collect more that just machine state, you collect cache miss information. That's not regs anymore. --
Why? The only thing I would do is maybe use the interrupt state to fill out the PEBS reg data, its mostly things like segment regs that go missing, That doesn't seem like something a regular person would be interested in, and for those who do they can easily hack the kernel. Yes, the whole load-latency train-wreck is something we need to come up with a sensible interface for, that's definitely not something I would do through RAW. --
