From: Andi Kleen <ak@linux.intel.com> Intel Nehalem/Westmere have a special OFFCORE_RESPONSE event that can be used to monitor any offcore accesses from a core. This is a very useful event for various tunings, and it's also needed to implement the generic LLC-* events correctly. Unfortunately this event requires programming a mask in a separate register. And worse this separate register is per core, not per CPU thread. This patch adds: - Teaches perf_events that OFFCORE_RESPONSE need extra parameters. - Adds a new field to the user interface to pass the extra mask. This reuses one of the unused fields for perf events. The change is ABI neutral because noone is likely to have used OFFCORE_RESPONSE before (with zero mask it wouldn't count anything) - Add support to the Intel perf_event core to schedule the per core resource. I tried to add generic infrastructure for this that could be also used for other core resources. The basic code has is patterned after the similar AMD northbridge constraints code. Thanks to Stephane Eranian who pointed out some problems in the original version and suggested improvements. Cc: eranian@google.com Signed-off-by: Andi Kleen <ak@linux.intel.com> --- arch/x86/kernel/cpu/perf_event.c | 56 +++++++++++++++ arch/x86/kernel/cpu/perf_event_intel.c | 120 ++++++++++++++++++++++++++++++++ include/linux/perf_event.h | 7 ++- 3 files changed, 182 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index ed63101..97133ec 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -93,6 +93,8 @@ struct amd_nb { struct event_constraint event_constraints[X86_PMC_IDX_MAX]; }; +struct intel_percore; + #define MAX_LBR_ENTRIES 16 struct cpu_hw_events { @@ -126,6 +128,8 @@ struct cpu_hw_events { void *lbr_context; struct perf_branch_stack lbr_stack; struct perf_branch_entry lbr_entries[MAX_LBR_ENTRIES]; + int ...
From: Andi Kleen <ak@linux.intel.com>
Add support to the perf tool to specify extra values for raw events
and pass them to the kernel.
The new format is -e rXXXX[,YYYY]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
tools/perf/Documentation/perf-list.txt | 6 ++++++
tools/perf/builtin-report.c | 7 ++++---
tools/perf/util/parse-events.c | 26 ++++++++++++++++++++------
tools/perf/util/parse-events.h | 2 +-
tools/perf/util/ui/browsers/hists.c | 3 ++-
5 files changed, 33 insertions(+), 11 deletions(-)
diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt
index 399751b..c398390 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -61,6 +61,12 @@ raw encoding of 0x1A8 can be used:
You should refer to the processor specific documentation for getting these
details. Some of them are referenced in the SEE ALSO section below.
+Some special events like the Intel OFFCORE_RESPONSE events require
+extra parameters. These can be specified by appending a command and
+the extra parameter as a hex number. Since the bitmasks can be
+complicated to compute it is recommended to use a suitable mapping file
+with --event-map.
+
OPTIONS
-------
None
diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 5de405d..7f6bcfb 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -187,7 +187,8 @@ static int process_read_event(event_t *event, struct perf_session *session __use
attr = perf_header__find_attr(event->read.id, &session->header);
if (show_threads) {
- const char *name = attr ? __event_name(attr->type, attr->config)
+ const char *name = attr ? __event_name(attr->type, attr->config,
+ 0)
: "unknown";
perf_read_values_add_value(&show_threads_values,
event->read.pid, event->read.tid,
@@ -197,7 +198,7 @@ static int process_read_event(event_t *event, struct ...It's not clear to me that a comma is the best character to use in this instance, because multiple events can be listed after the -e separated by commas as well, for example: -e rXXXX,cycles. -e rXXXX,YYYY,cycles would be confusing to read. How about -e rXXXX[+YYYY] - Corey --
That's a good point. The comma seems to work in my testing though for some reason, but I guess I broke the multiple events case then. I'll switch over to + -Andi --
I think it would have parsed OK, but it is a confusing syntax to look at. I should point out that there are also the u,k,h modifiers which can be specified, even on raw events, so the full syntax looks something like: -e rXXXX[+YYYY][:u|:k|:h] - Corey --
Ok I can fix the help text for that. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
If we're frobbing it in the existing config qword, do we still need special syntax? Also, I think we can use the same mechanism to program the PEBS-load-latency MSR, right? --
I haven't looked at that, but if it's a per core resource the infrastructure can be reused at least. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Can't remember if the load-latency msr is per-core, but its an extra reg that needs to be set. The nhm/wsm LBR config msr is per-core though, so that could use your infrastructure too. --
Ok. I guess it shouldn't be too hard to add. Stephane mentioned that too. Not right now, but perhaps later I'll look at using this. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
The difficulty with PEBS-LL (load latency) is not so much the encoding of the
latency. It is more how to expose the data back to user. You get a full PEBS
record for each miss. So you get the PEBS machine state + data addr, miss
latency, and data source (where did the line come from). We have already
discussed how to expose machine state in general. I'll work on a patch for this.
So we can get the general PEBS machine state out. However, the question is
how do we expose data addr, latency, data source? We can reuse the
SAMPLE_ADDR for the data address. Sample IP would point to the load
instruction (with help from LBR to correct the off by one issue). For
the latency
and data source, I proposed using pseudo regs and leveraging the sampled machine
state mechanism. An alternative may be to define a new record type that would b
generic enough to be reusable, for instance { instr_addr, data_addr,
latency, data_src, flags; }.
But let's fix OFFCORE_RESPONSE_* first.
--
Frederic was working on this PERF_SAMPLE_REGS stuff as well for his copy I'm not sure I like the idea of pseudo regs.. I'm afraid it'll get messy quite quickly. Load-latency is a bit like IBS that way, terribly messy. --
I don't understand what aspect you think is messy. When you are sampling cache misses, you expect to get the tuple (instr addr, data addr, latency, data source). That is what you get with AMD IBS, Nehalem PEBS-LL and also Itanium D-EAR. I am sure IBM Power has something similar as well. To collect this, you can either store the info in registers (AMD, Itanium) or in a buffer (PEBS). But regardless of that you will always have to expose the tuple. We have a solution for two out of 4 fields that reuses the existing infrastructure. We need something else for the other two. We should expect that in the future PMUs will collect more than code addresses. --
Its the data source thing I have most trouble with -- see below. The latency isn't immediately clear either, I mean the larger the bubble the more hits the instruction will get, so there should be a correlation Well, if Intel PEBS, IA64 and PPC64 all have a data source thing we can simply add PERF_SAMPLE_SOURCE or somesuch and use that. Do IA64/PPC64 have latency fields as well? PERF_SAMPLE_LATENCY would seem to be the thing to use in that case. BTW, what's the status of perf on IA64? And do we really still care Sure, but I hate stuff that counts multiple events on a single base like IBS does, and LL is similar to that, its a fetch retire counter and then you report where fetch was satisfied from. So in effect you're measuring l1/l2/l3/dram hit/miss all at the same time but on a fetch basis. Note that we need proper userspace for such crap as well, and libpfm doesn't count, we need a full analysis tool in perf itself. --
Peter, The latency is the miss latency, i.e., time to bring the cache line back at the time the miss is detected. That's not because you see a latency of 20 cycles that you can assume the line came from the LLC cache, it may have been in-flight by the time the load was issued. In other words, the latency may not be enough to figure out where the line actually came. As for the correlation to cycles sampling, they don't point to the same location. With cycles, you point to the stalled instructions, i.e., where you wait for the data to arrive. With PEBS-LL (and variations on the other archs), you point to the missing load instructions. Sometimes those can be far apart, it depends on the code flow, instruction scheduling by the compiler and so on. Backtracing from the stall instruction to the missing load is tricky business especially with branches, interrupts and such. Some people have tried that in the past. What you are really after here is identifying load misses which do incur serious stalls in your program. No single HW feature provides that. But by combining cache miss and cycle profiles, I think you can get a good handle on this. Although the latency is a good hint for potential stalls, there is no guarantee. A miss latency could be completely overlapped with executions. PEBS-LL (or variations on the other arch) won't report the overlap. You have to correlate this with a cycle profiling. However, it you get latencies of > 40 cycles or more it is highly unlikely the compiler was able to hide that, thus those are good candidates for prefetching of some sort (assuming you get lots of samples Itanium definitively does have data source, so does IBS. Don't know about It is not dead, there is one more CPU in the making if I recall. I did touch base with Tony Luck last week on this. I think adding support for the basic counting stuff should be possible. You have 4 counters, with event constraints. Getting the constraints right for some events is a bit tricky and the ...
Would you like me to respin this isolated chunk of my dwarf cfi unwinding patchset? So that we can discuss the ABI. Thanks. --
Load latency is luckily per thread. Nothing special to do to handle this one. Exactly. --
Right, but in effect Andi adds two pieces of infrastructure, - extra_reg, which determines if a particular cntr value needs extra data, this is useful for both the offcore msr and the pebs-ll msr. - per-core constraints, which is useful for shared msrs like the offcore and lbr config things. So while his current patch set only uses either piece once -- to provide offcore support -- both pieces have in fact at least one more potential user (for future patches). --
I don't think you need special syntax. You can simply come up with the 64-bit raw hex value. corey recently added a small utility to do this via libpfm4: perf stat -e `evt2raw unhalted_core_cycles:u` .... With libpfm4, I'll expose the event as a normal one. Therefore you would do: perf stat -e OFFCORE_RESPONSE_0:PF_DATA_RD:REMOTE_DRAM ... This will encode to: Yes, we could hardcode the latency the same way. --
At least right now it would still need a constraint, because only one counter can use it. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
True. But there are already contraint tables for PEBS. For instance, if I use INST_RETIRED (0x3c) with PEBS, then it cannot use a fixed counter. So you can do a constraint for MEM_LOAD_RETIRED:LAT_ABOVE_THRES the same way (include the umask). --
Second thought on this. There is no contraint for the event MEM_LOAD_RETIRED:LAT_ABOVE_THRES. It can be measured in any of the 4 generic counters. It's just that it can only be measured once given the extra MSR would then be shared. One way to implement this constraint is indeed to mark the event has being able to run only in one counter. That may be the simplest way to implement this. In that case I would use a counter which does not have many other constraints, e.g. 3 or 4. --
From: Andi Kleen <ak@linux.intel.com> The generic perf LLC-* events do count the L2 caches, not the real L3 LLC on Intel Nehalem and Westmere. This lead to quite some confusion. Fixing this properly requires use of the special OFFCORE_RESPONSE events which need a separate mask register. This has been implemented in a earlier patch. Now use this infrastructure to set correct events for the LLC-* on Nehalem and Westmere Signed-off-by: Andi Kleen <ak@linux.intel.com> --- arch/x86/kernel/cpu/perf_event.c | 7 +++++- arch/x86/kernel/cpu/perf_event_intel.c | 37 ++++++++++++++++++++++++++++++++ 2 files changed, 43 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c index 97133ec..3e429a1 100644 --- a/arch/x86/kernel/cpu/perf_event.c +++ b/arch/x86/kernel/cpu/perf_event.c @@ -299,6 +299,10 @@ static u64 __read_mostly hw_cache_event_ids [PERF_COUNT_HW_CACHE_MAX] [PERF_COUNT_HW_CACHE_OP_MAX] [PERF_COUNT_HW_CACHE_RESULT_MAX]; +static u64 __read_mostly hw_cache_extra_regs + [PERF_COUNT_HW_CACHE_MAX] + [PERF_COUNT_HW_CACHE_OP_MAX] + [PERF_COUNT_HW_CACHE_RESULT_MAX]; /* * Propagate event elapsed time into the generic event. @@ -455,7 +459,8 @@ set_ext_hw_attr(struct hw_perf_event *hwc, struct perf_event_attr *attr) return -EINVAL; hwc->config |= val; - + hwc->extra_config = + hw_cache_extra_regs[cache_type][cache_op][cache_result]; return 0; } diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c index bbe7fba..a06dae3 100644 --- a/arch/x86/kernel/cpu/perf_event_intel.c +++ b/arch/x86/kernel/cpu/perf_event_intel.c @@ -215,6 +215,39 @@ static __initconst const u64 westmere_hw_cache_event_ids }, }; +/* + * OFFCORE_RESPONSE MSR bits (subset), See IA32 SDM Vol 3 30.6.1.3 + */ + +#define DMND_DATA_RD (1 << 0) +#define DMND_RFO (1 << 1) +#define DMND_WB (1 << 3) +#define PF_DATA_RD (1 ...
Small bug fix for the earlier offcore_response patch:
need to put the events for intel_pmu too, not only core_pmu.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a06dae3..9a71ffd 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1031,6 +1031,7 @@ static __initconst const struct x86_pmu intel_pmu = {
*/
.max_period = (1ULL << 31) - 1,
.get_event_constraints = intel_get_event_constraints,
+ .put_event_constraints = intel_put_event_constraints,
.cpu_starting = intel_pmu_cpu_starting,
.cpu_dying = intel_pmu_cpu_dying,
--
ak@linux.intel.com -- Speaking for myself only.
--
You don't need it for core pmu, this is for Core Duo/Solo --
It shouldn't matter because Core Duo doesn't have any per CPU resources, but it seems more symetric to call it in any case. It may make a difference if other users of the per CPU resource code get added. -Andi --
There is no HT on these processors. I don't know of any valuable HW features that is not support for these. --
Andi, Thanks for creating this patch. It was on my TODO list for a while. OFFCORE_RESPONSE is indeed a very useful event. One thing I noticed in your patch is that you don't special case the configuration where HT is off. In that case, the sharing problem goes away. I think you could override either way during init. Some more tidbits: - OFFCORE_RESPONSE_0 is 0x01b7 - OFFCORE_RESPONSE_1 is 0x01bb The umask is not zero but 1. Dont' know if you get something meaningful is you pass a umask of zero. But that's the user's responsibility to set this right. An alternative approach could have been to stash the extra MSR value in the upper 32-bit value of the config. It's 16-bit wide today. OFFCORE_RESPONSE is a model specific event. There is no guarantee it will be there in future CPU, so it would be safe to do that as well. --
On 11/11/2010 7:06 PM, Stephane Eranian wrote: The allocator should handle that transparently. With HT off the resource is Hmm I seem to get events that look meaningful with 0. But you're right 1 is better. I used that in manual tests, but it wasn't in the cache number mappings. I'll fix that for the next version. -Andi --
Either per_core != NULL implies percore_used or it should be state s/unsigned/unsigned int/ (also later in the patch) and then align the Please dynamically allocate these when needed, just like the AMD I think I like Stephane's suggestion better, frob them into the existing u64 word, since its model specific and we still have 33 empty bits in the control register there's plenty space. --
On Thu, Nov 11, 2010 at 08:09:14PM +0100, Peter Zijlstra wrote: Fully dynamic is difficult because the topology discovery does not really handle that nicely. I can allocate at boot, but it will not save a lot of memory (just one entry per core) To be honest I would prefer not to do that change, are you sure Ok. I'll see how many changes that needs. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Can we make it so, that is, move that int inside the intel_percore Hrm,.. but we only need to do this on nhm/wsm init, we already have code detecting the cpu model. And I'm quite sure you know where to poke to see if HT is enabled. --
It's needed per cpu. Basically the flag says "bother to look into the percore structure" This way most of this is kept out of the fast paths. But one CPU in a core may need it and the other not. In principle it could be merged into some other cpuc flag word if you Ok I guess I can do a dynamic per cpu allocation for this case. -Andi --
Nah, I missed that detail, having it as is seems fine. --
