Re: [PATCH] perf_events: fix bogus warn_on(_once) in perf_prepare_sample()

Previous thread: Re: PATCH: ni_labpc_cs.c 2 by Greg KH on Thursday, April 8, 2010 - 12:16 pm. (1 message)

Next thread: [PATCH #3] reiserfs: Fix permissions on .reiserfs_priv by Jeff Mahoney on Thursday, April 8, 2010 - 1:55 pm. (4 messages)
From: Stephane Eranian
Date: Thursday, April 8, 2010 - 1:45 pm

There is a warn_on_once() check for PERF_SAMPLE_RAW which trips
	when using PEBS on both Core and Nehalem. Core PEBS sample size is 144
	bytes and 176 bytes for Nehalem. Both are multiples of 8, but the size
	field is encoded as int, thus the total is never a multiple of 8 which
	trips the check. I think the size should have been u64, but now it is
	too late to change given it is ABI.

	Signed-off-by: Stephane Eranian <eranian@google.com>

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 8143e77..fffeb95 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -3311,7 +3311,6 @@ void perf_prepare_sample(struct perf_event_header *header,
 		else
 			size += sizeof(u32);
 
-		WARN_ON_ONCE(size & (sizeof(u64)-1));
 		header->size += size;
 	}
 
--

From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 1:55 pm

PEBS hasn't seen -linus yet, so we can fix that.

There's various things that do indeed rely on the perf buffer to always
be u64 aligned, so this warning isn't bogus at all.

--

From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 2:03 pm

On the subject of PEBS, we need to change the ABI before it does hit
-linus, I've got something like the below,. but I'm not quite sure of
it.

We could keep the PERF_RECORD_MISC_EXACT bit and allow non-fixed up IPs
in PERF_SAMPLE_EVENT_IP fields, that way we can avoid having to add both
IP and EVENT_IP to samples.

---
Subject: perf: Change the PEBS interface
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Mar 30 13:35:58 CEST 2010

This patch changes the PEBS interface from perf_event_attr::precise
and PERF_RECORD_MISC_EXACT to perf_event_attr::precise and
PERF_SAMPLE_EVENT_IP.

The rationale is that .precise=1 !PERF_SAMPLE_EVENT_IP could be used
to implement buffered PEBS.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
---
 arch/x86/include/asm/perf_event.h         |    9 --
 arch/x86/kernel/cpu/perf_event_intel_ds.c |  133 +++++++++++++-----------------
 include/linux/perf_event.h                |    7 +
 kernel/perf_event.c                       |    6 +
 tools/perf/builtin-top.c                  |    9 +-
 5 files changed, 79 insertions(+), 85 deletions(-)

Index: linux-2.6/arch/x86/include/asm/perf_event.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/perf_event.h
+++ linux-2.6/arch/x86/include/asm/perf_event.h
@@ -128,21 +128,12 @@ extern void perf_events_lapic_init(void)
 
 #define PERF_EVENT_INDEX_OFFSET			0
 
-/*
- * Abuse bit 3 of the cpu eflags register to indicate proper PEBS IP fixups.
- * This flag is otherwise unused and ABI specified to be 0, so nobody should
- * care what we do with it.
- */
-#define PERF_EFLAGS_EXACT	(1UL << 3)
-
 #define perf_misc_flags(regs)				\
 ({	int misc = 0;					\
 	if (user_mode(regs))				\
 		misc |= PERF_RECORD_MISC_USER;		\
 	else						\
 		misc |= PERF_RECORD_MISC_KERNEL;	\
-	if (regs->flags & PERF_EFLAGS_EXACT)		\
-		misc |= PERF_RECORD_MISC_EXACT;		\
 	misc; })
 
 #define ...
From: Stephane Eranian
Date: Thursday, April 8, 2010 - 2:18 pm

You always have the original IP in the PEBS sample, so why add yet another way
I am not sure I understand what you mean by buffered PEBS. Are you talking
about using PEBS buffer bigger than one entry?
If so, how can you do:
  - the LBR based fixups for multiple samples (on PMU interrupt)
From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 2:24 pm

flush on context switches.

--

From: Stephane Eranian
Date: Thursday, April 8, 2010 - 2:33 pm

cost?
--

From: Stephane Eranian
Date: Thursday, April 8, 2010 - 2:08 pm

Are you suggesting you add some padding the PEBS raw sample you
return as PERF_SAMPLE_RAW? Then you need to define what RAW
actually means? Seems here, it would mean more than what the
I assume this has to do with the wrap-around detection.
--

From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 2:11 pm

Well, RAW doesn't mean anything much at all, its really a fugly pass
some crap around thing.

So yeah, adding padding seems just fine.

--

From: Stephane Eranian
Date: Thursday, April 8, 2010 - 2:14 pm

I would rather see size as u64. Who's using raw today anyway?

The PEBS record itself is always a multiple of u64. The size header is the one
--

From: Frederic Weisbecker
Date: Thursday, April 8, 2010 - 2:17 pm

The trace events. Hence the size of the size shouldn't be touched, it
is an ABI now.

--

From: Stephane Eranian
Date: Thursday, April 8, 2010 - 2:22 pm

Given your alignment constraints, it seems like it was a bad choice to pick
u32 for size to begin with.
--

From: Frederic Weisbecker
Date: Thursday, April 8, 2010 - 2:42 pm

Indeed, I'm not exactly sure how this is dealt since this is indeed u32
and the buffer requires to align to u64...

--

From: Frederic Weisbecker
Date: Thursday, April 8, 2010 - 2:46 pm

Ah I remember now how we did that. We align the trace events raw sample size
such that sizeof(raw_sample) + sizeof(size) is aligned to u64.

Well, indeed that complexifies a bit the raw_sample size handling but at least
it makes the traces a bit more compact.

--

From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 2:23 pm

the tracepoint stuff does

--

From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 2:12 pm

Nah, mostly dealing with architectures that don't do well with unaligned
accesses.

--

From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 2:15 pm

The best fix here is to simply remove the PERF_SAMPLE_RAW support and
implement PERF_SAMPLE_REGS. Its just that we need to come up with a way
to deal with compat pt_regs muck.



--

From: Stephane Eranian
Date: Thursday, April 8, 2010 - 2:29 pm

But isn't RAW already part of the ABI?

It is true that you need to also provide the interrupted state (or
part of it) to the user, i.e.,
beyond just the IP.

But the interrupt state and PEBS state are distinct and should not be
swapped. In fact,
both may be useful at the same time to evaluate the skid.

So yes, you need PERF_SAMPLE_REGS. But just like your proposal with the two IPs.
I think you need to export two versions of REGS: IREGS and PREGS for
instance. The
issue is that PEBS does not record the whole state but only a very small subset.

Then, on Nehalem, there is PEBS more where you collect more that just
machine state,
you collect cache miss information. That's not regs anymore.
--

From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 2:42 pm

Why?

The only thing I would do is maybe use the interrupt state to fill out
the PEBS reg data, its mostly things like segment regs that go missing,

That doesn't seem like something a regular person would be interested
in, and for those who do they can easily hack the kernel.


Yes, the whole load-latency train-wreck is something we need to come up
with a sensible interface for, that's definitely not something I would
do through RAW.



--

Previous thread: Re: PATCH: ni_labpc_cs.c 2 by Greg KH on Thursday, April 8, 2010 - 12:16 pm. (1 message)

Next thread: [PATCH #3] reiserfs: Fix permissions on .reiserfs_priv by Jeff Mahoney on Thursday, April 8, 2010 - 1:55 pm. (4 messages)