Re: [PATCH 2/3] perf: Add support for extra parameters for raw events

Previous thread: [PATCH] x86/mrst: set vRTC's IRQ to level trigger type by Alan Cox on Thursday, November 11, 2010 - 8:50 am. (1 message)

Next thread: [PATCH] staging: ft1000: Copy from user into correct data by Steven Rostedt on Thursday, November 11, 2010 - 9:29 am. (3 messages)
From: Andi Kleen
Date: Thursday, November 11, 2010 - 9:15 am

From: Andi Kleen <ak@linux.intel.com>

Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event
that can be used to monitor any offcore accesses from a core.
This is a very useful event for various tunings, and it's
also needed to implement the generic LLC-* events correctly.

Unfortunately this event requires programming a mask in a separate
register. And worse this separate register is per core, not per
CPU thread.

This patch adds:
- Teaches perf_events that OFFCORE_RESPONSE need extra parameters.
- Adds a new field to the user interface to pass the extra mask.
This reuses one of the unused fields for perf events. The change
is ABI neutral because noone is likely to have used OFFCORE_RESPONSE
before (with zero mask it wouldn't count anything)
- Add support to the Intel perf_event core to schedule the per
core resource. I tried to add generic infrastructure for this
that could be also used for other core resources.
The basic code has is patterned after the similar AMD northbridge
constraints code.

Thanks to Stephane Eranian who pointed out some problems
in the original version and suggested improvements.

Cc: eranian@google.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.c       |   56 +++++++++++++++
 arch/x86/kernel/cpu/perf_event_intel.c |  120 ++++++++++++++++++++++++++++++++
 include/linux/perf_event.h             |    7 ++-
 3 files changed, 182 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index ed63101..97133ec 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -93,6 +93,8 @@ struct amd_nb {
 	struct event_constraint event_constraints[X86_PMC_IDX_MAX];
 };
 
+struct intel_percore;
+
 #define MAX_LBR_ENTRIES		16
 
 struct cpu_hw_events {
@@ -126,6 +128,8 @@ struct cpu_hw_events {
 	void				*lbr_context;
 	struct perf_branch_stack	lbr_stack;
 	struct perf_branch_entry	lbr_entries[MAX_LBR_ENTRIES];
+	int ...
From: Andi Kleen
Date: Thursday, November 11, 2010 - 9:15 am

From: Andi Kleen <ak@linux.intel.com>

Add support to the perf tool to specify extra values for raw events
and pass them to the kernel.

The new format is -e rXXXX[,YYYY]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 tools/perf/Documentation/perf-list.txt |    6 ++++++
 tools/perf/builtin-report.c            |    7 ++++---
 tools/perf/util/parse-events.c         |   26 ++++++++++++++++++++------
 tools/perf/util/parse-events.h         |    2 +-
 tools/perf/util/ui/browsers/hists.c    |    3 ++-
 5 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt
index 399751b..c398390 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -61,6 +61,12 @@ raw encoding of 0x1A8 can be used:
 You should refer to the processor specific documentation for getting these
 details. Some of them are referenced in the SEE ALSO section below.
 
+Some special events like the Intel OFFCORE_RESPONSE events require
+extra parameters.  These can be specified by appending a command and
+the extra parameter as a hex number. Since the bitmasks can be
+complicated to compute it is recommended to use a suitable mapping file
+with --event-map.
+
 OPTIONS
 -------
 None
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 5de405d..7f6bcfb 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -187,7 +187,8 @@ static int process_read_event(event_t *event, struct perf_session *session __use
 	attr = perf_header__find_attr(event->read.id, &session->header);
 
 	if (show_threads) {
-		const char *name = attr ? __event_name(attr->type, attr->config)
+		const char *name = attr ? __event_name(attr->type, attr->config,
+						       0)
 				   : "unknown";
 		perf_read_values_add_value(&show_threads_values,
 					   event->read.pid, event->read.tid,
@@ -197,7 +198,7 @@ static int process_read_event(event_t *event, struct ...
From: Corey Ashford
Date: Thursday, November 11, 2010 - 10:54 am

It's not clear to me that a comma is the best character to use in this 
instance, because multiple events can be listed after the -e separated 
by commas as well, for example:  -e rXXXX,cycles.
-e rXXXX,YYYY,cycles would be confusing to read.

How about -e rXXXX[+YYYY]

- Corey
--

From: Andi Kleen
Date: Thursday, November 11, 2010 - 11:39 am

That's a good point. The comma seems to work in my testing though for 
some reason, but I guess
I broke the multiple events case then. I'll switch over to +

-Andi

--

From: Corey Ashford
Date: Thursday, November 11, 2010 - 12:38 pm

I think it would have parsed OK, but it is a confusing syntax to look 
at.  I should point out that there are also the u,k,h modifiers which 
can be specified, even on raw events, so the full syntax looks something 
like:

-e rXXXX[+YYYY][:u|:k|:h]

- Corey
--

From: Andi Kleen
Date: Thursday, November 11, 2010 - 12:49 pm

Ok I can fix the help text for that.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Peter Zijlstra
Date: Thursday, November 11, 2010 - 12:59 pm

If we're frobbing it in the existing config qword, do we still need
special syntax?

Also, I think we can use the same mechanism to program the
PEBS-load-latency MSR, right?
--

From: Andi Kleen
Date: Thursday, November 11, 2010 - 1:12 pm

I haven't looked at that, but if it's a per core resource
the infrastructure can be reused at least.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Peter Zijlstra
Date: Thursday, November 11, 2010 - 1:37 pm

Can't remember if the load-latency msr is per-core, but its an extra reg
that needs to be set.

The nhm/wsm LBR config msr is per-core though, so that could use your
infrastructure too.
--

From: Andi Kleen
Date: Friday, November 12, 2010 - 3:27 am

Ok. I guess it shouldn't be too hard to add.


Stephane mentioned that too. Not right now, but perhaps later I'll
look at using this.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Stephane Eranian
Date: Friday, November 12, 2010 - 3:49 am

The difficulty with PEBS-LL (load latency) is not so much the encoding of the
latency. It is more how to expose the data back to user. You get a full PEBS
record for each miss. So you get the PEBS machine state + data addr, miss
latency, and data source (where did the line come from). We have already
discussed how to expose machine state in general. I'll work on a patch for this.
So we can get the general PEBS machine state out. However, the question is
how do we expose data addr, latency, data source? We can reuse the
SAMPLE_ADDR for the data address. Sample IP would point to the load
instruction (with help from LBR to correct the off by one issue). For
the latency
and data source, I proposed using pseudo regs and leveraging the sampled machine
state mechanism. An alternative may be to define a new record type that would b
generic enough to be reusable, for instance { instr_addr, data_addr,
latency, data_src, flags; }.

But let's fix OFFCORE_RESPONSE_* first.
--

From: Peter Zijlstra
Date: Friday, November 12, 2010 - 4:36 am

Frederic was working on this PERF_SAMPLE_REGS stuff as well for his copy


I'm not sure I like the idea of pseudo regs.. I'm afraid it'll get messy
quite quickly. Load-latency is a bit like IBS that way, terribly messy.


--

From: Stephane Eranian
Date: Friday, November 12, 2010 - 6:00 am

I don't understand what aspect you think is messy. When you are sampling
cache misses, you expect to get the tuple (instr addr, data addr, latency,
data source). That is what you get with AMD IBS, Nehalem PEBS-LL and
also Itanium D-EAR. I am sure IBM Power has something similar as well.
To collect this, you can either store the info in registers (AMD, Itanium)
or in a buffer (PEBS). But regardless of that you will always have to expose
the tuple. We have a solution for two out of 4 fields that reuses the existing
infrastructure. We need something else for the other two.

We should expect that in the future PMUs will collect more than code addresses.
--

From: Peter Zijlstra
Date: Friday, November 12, 2010 - 6:21 am

Its the data source thing I have most trouble with -- see below. The
latency isn't immediately clear either, I mean the larger the bubble the
more hits the instruction will get, so there should be a correlation

Well, if Intel PEBS, IA64 and PPC64 all have a data source thing we can
simply add PERF_SAMPLE_SOURCE or somesuch and use that. 

Do IA64/PPC64 have latency fields as well? PERF_SAMPLE_LATENCY would
seem to be the thing to use in that case.

BTW, what's the status of perf on IA64? And do we really still care

Sure, but I hate stuff that counts multiple events on a single base like
IBS does, and LL is similar to that, its a fetch retire counter and then
you report where fetch was satisfied from. So in effect you're measuring
l1/l2/l3/dram hit/miss all at the same time but on a fetch basis.

Note that we need proper userspace for such crap as well, and libpfm
doesn't count, we need a full analysis tool in perf itself.
--

From: Stephane Eranian
Date: Friday, November 12, 2010 - 7:03 am

Peter,


The latency is the miss latency, i.e., time to bring the cache line back
at the time the miss is detected. That's not because you see a latency
of 20 cycles that you can assume the line came from the LLC cache, it
may have been in-flight by the time the load was issued. In other words,
the latency may not be enough to figure out where the line actually came.

As for the correlation to cycles sampling, they don't point to the same
location. With cycles, you point to the stalled instructions, i.e., where
you wait for the data to arrive. With PEBS-LL (and variations on the
other archs), you point to the missing load instructions. Sometimes
those can be far apart, it depends on the code flow, instruction scheduling
by the compiler and so on. Backtracing from the stall instruction to the
missing load is tricky business especially with branches, interrupts and
such. Some people have tried that in the past.

What you are really after here is identifying load misses which do incur serious
stalls in your program. No single HW feature provides that. But by combining
cache miss and cycle profiles, I think you can get a good handle on this.

Although the latency is a good hint for potential stalls, there is no guarantee.
A miss latency could be completely overlapped with executions. PEBS-LL
(or variations on the other arch) won't report the overlap. You have
to correlate
this with a cycle profiling. However, it you get latencies of > 40 cycles
or more it is highly unlikely the compiler was able to hide that, thus those are
good candidates for prefetching of some sort (assuming you get lots of samples

Itanium definitively does have data source, so does IBS. Don't know about

It is not dead, there is one more CPU in the making if I recall.
I did touch base with Tony Luck last week on this. I think adding
support for the basic counting stuff should be possible. You have
4 counters, with event constraints. Getting the constraints right
for some events is a bit tricky and the ...
From: Frederic Weisbecker
Date: Friday, November 12, 2010 - 6:30 am

Would you like me to respin this isolated chunk of my dwarf cfi unwinding patchset?
So that we can discuss the ABI.

Thanks.

--

From: Stephane Eranian
Date: Friday, November 12, 2010 - 8:00 am

From: Stephane Eranian
Date: Friday, November 12, 2010 - 3:41 am

Load latency is luckily per thread. Nothing special to do to handle this one.
Exactly.
--

From: Peter Zijlstra
Date: Friday, November 12, 2010 - 3:48 am

Right, but in effect Andi adds two pieces of infrastructure, 

 - extra_reg, which determines if a particular cntr value needs extra
   data, this is useful for both the offcore msr and the pebs-ll msr.

 - per-core constraints, which is useful for shared msrs like the
   offcore and lbr config things.

So while his current patch set only uses either piece once -- to provide
offcore support -- both pieces have in fact at least one more potential
user (for future patches).


--

From: Stephane Eranian
Date: Friday, November 12, 2010 - 3:39 am

I don't think you need special syntax. You can simply come up with
the 64-bit raw hex value. corey recently added a small utility to do
this via libpfm4: perf stat -e `evt2raw unhalted_core_cycles:u` ....

With libpfm4, I'll expose the event as a normal one. Therefore
you would do:

perf stat -e OFFCORE_RESPONSE_0:PF_DATA_RD:REMOTE_DRAM ...

This will encode to:
Yes, we could hardcode the latency the same way.
--

From: Andi Kleen
Date: Friday, November 12, 2010 - 3:48 am

At least right now it would still need a constraint, because only
one counter can use it. 

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Stephane Eranian
Date: Friday, November 12, 2010 - 3:52 am

True. But there are already contraint tables for PEBS. For instance,
if I use INST_RETIRED (0x3c) with PEBS, then it cannot use a fixed
counter. So you can do a constraint for MEM_LOAD_RETIRED:LAT_ABOVE_THRES
the same way (include the umask).
--

From: Stephane Eranian
Date: Friday, November 12, 2010 - 3:56 am

Second thought on this. There is no contraint for the event
MEM_LOAD_RETIRED:LAT_ABOVE_THRES. It can be measured in any
of the 4 generic counters. It's just that it can only be measured once given
the extra MSR would then be shared. One way to implement this constraint
is indeed to mark the event has being able to run only in one counter. That
may be the simplest way to implement this. In that case I would use a
counter which does not have many other constraints, e.g. 3 or 4.
--

From: Andi Kleen
Date: Thursday, November 11, 2010 - 9:15 am

From: Andi Kleen <ak@linux.intel.com>

The generic perf LLC-* events do count the L2 caches, not the real
L3 LLC on Intel Nehalem and Westmere. This lead to quite some confusion.

Fixing this properly requires use of the special OFFCORE_RESPONSE
events which need a separate mask register. This has been implemented
in a earlier patch.

Now use this infrastructure to set correct events for the LLC-*
on Nehalem and Westmere

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/perf_event.c       |    7 +++++-
 arch/x86/kernel/cpu/perf_event_intel.c |   37 ++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 97133ec..3e429a1 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -299,6 +299,10 @@ static u64 __read_mostly hw_cache_event_ids
 				[PERF_COUNT_HW_CACHE_MAX]
 				[PERF_COUNT_HW_CACHE_OP_MAX]
 				[PERF_COUNT_HW_CACHE_RESULT_MAX];
+static u64 __read_mostly hw_cache_extra_regs
+				[PERF_COUNT_HW_CACHE_MAX]
+				[PERF_COUNT_HW_CACHE_OP_MAX]
+				[PERF_COUNT_HW_CACHE_RESULT_MAX];
 
 /*
  * Propagate event elapsed time into the generic event.
@@ -455,7 +459,8 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event_attr *attr)
 		return -EINVAL;
 
 	hwc->config |= val;
-
+	hwc->extra_config =
+		hw_cache_extra_regs[cache_type][cache_op][cache_result];
 	return 0;
 }
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index bbe7fba..a06dae3 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -215,6 +215,39 @@ static __initconst const u64 westmere_hw_cache_event_ids
  },
 };
 
+/*
+ * OFFCORE_RESPONSE MSR bits (subset), See IA32 SDM Vol 3 30.6.1.3
+ */
+
+#define DMND_DATA_RD     (1 << 0)
+#define DMND_RFO         (1 << 1)
+#define DMND_WB          (1 << 3)
+#define PF_DATA_RD       (1 ...
From: Andi Kleen
Date: Thursday, November 11, 2010 - 10:35 am

Small bug fix for the earlier offcore_response patch:
need to put the events for intel_pmu too, not only core_pmu.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a06dae3..9a71ffd 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1031,6 +1031,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	 */
 	.max_period		= (1ULL << 31) - 1,
 	.get_event_constraints	= intel_get_event_constraints,
+	.put_event_constraints	= intel_put_event_constraints,
 
 	.cpu_starting		= intel_pmu_cpu_starting,
 	.cpu_dying		= intel_pmu_cpu_dying,

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Stephane Eranian
Date: Thursday, November 11, 2010 - 10:54 am

You don't need it for core pmu, this is for Core Duo/Solo
--

From: Andi Kleen
Date: Thursday, November 11, 2010 - 11:44 am

It shouldn't matter because Core Duo doesn't have any per CPU resources, 
but
it seems more symetric to call it in any case.

It may make a difference if other users of the per CPU resource code get 
added.

-Andi

--

From: Stephane Eranian
Date: Thursday, November 11, 2010 - 2:29 pm

There is no HT on these processors. I don't know of any valuable HW
features that is
not support for these.
--

From: Stephane Eranian
Date: Thursday, November 11, 2010 - 11:06 am

Andi,

Thanks for creating this patch. It was on my TODO list for a while.
OFFCORE_RESPONSE is indeed a very useful event.

One thing I noticed in your patch is that you don't special
case the configuration where HT is off. In that case, the
sharing problem goes away. I think you could override
either way during init.

Some more tidbits:
- OFFCORE_RESPONSE_0 is 0x01b7
- OFFCORE_RESPONSE_1 is 0x01bb

The umask is not zero but 1. Dont' know if you get
something meaningful is you pass a umask of zero.
But that's the user's responsibility to set this right.

An alternative approach could have been to stash the
extra MSR value in the upper 32-bit value of the config.
It's 16-bit wide today. OFFCORE_RESPONSE is a
model specific event. There is no guarantee it will be
there in future CPU, so it would be safe to do that as well.

--

From: Andi Kleen
Date: Thursday, November 11, 2010 - 11:42 am

On 11/11/2010 7:06 PM, Stephane Eranian wrote:


The allocator should handle that transparently. With HT off the resource is

Hmm I seem to get events that look meaningful with 0. But you're right 1
is better. I used that in manual tests, but it wasn't in the cache number
mappings. I'll fix that for the next version.

-Andi

--

From: Peter Zijlstra
Date: Thursday, November 11, 2010 - 12:09 pm

Either per_core != NULL implies percore_used or it should be state

s/unsigned/unsigned int/ (also later in the patch) and then align the



Please dynamically allocate these when needed, just like the AMD

I think I like Stephane's suggestion better, frob them into the existing
u64 word, since its model specific and we still have 33 empty bits in
the control register there's plenty space.



--

From: Andi Kleen
Date: Thursday, November 11, 2010 - 12:25 pm

On Thu, Nov 11, 2010 at 08:09:14PM +0100, Peter Zijlstra wrote:




Fully dynamic is difficult because the topology discovery does not 
really handle that nicely.

I can allocate at boot, but it will not save a lot of memory
(just one entry per core)

To be honest I would prefer not to do that change, are you sure

Ok. I'll see how many changes that needs.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Peter Zijlstra
Date: Thursday, November 11, 2010 - 12:36 pm

Can we make it so, that is, move that int inside the intel_percore

Hrm,.. but we only need to do this on nhm/wsm init, we already have code
detecting the cpu model. And I'm quite sure you know where to poke to
see if HT is enabled.
--

From: Andi Kleen
Date: Thursday, November 11, 2010 - 12:45 pm

It's needed per cpu. Basically the flag says "bother to look into the 
percore structure"
This way most of this is kept out of the fast paths. But one CPU in a core
may need it and the other not.

In principle it could be merged into some other cpuc flag word if you 

Ok I guess I can do a dynamic per cpu allocation for this case.

-Andi
--

From: Peter Zijlstra
Date: Thursday, November 11, 2010 - 12:49 pm

Nah, I missed that detail, having it as is seems fine.


--

Previous thread: [PATCH] x86/mrst: set vRTC's IRQ to level trigger type by Alan Cox on Thursday, November 11, 2010 - 8:50 am. (1 message)

Next thread: [PATCH] staging: ft1000: Copy from user into correct data by Steven Rostedt on Thursday, November 11, 2010 - 9:29 am. (3 messages)