Re: [osrc-patches] [PATCH 00/12] perf: introduce model specific events and AMD IBS

Previous thread: [PATCH 11/12] perf, x86: implement AMD IBS event configuration by Robert Richter on Tuesday, April 13, 2010 - 1:23 pm. (12 messages)

Next thread: Re: [PATCH 2/3] cxgb4i: main driver files by Mike Christie on Tuesday, April 13, 2010 - 1:41 pm. (2 messages)
From: Robert Richter
Date: Tuesday, April 13, 2010 - 1:23 pm

This patch series introduces model specific events and impments AMD
IBS (Instruction Based Sampling) for perf_events.

IBS is documented here:

 BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors
 http://support.amd.com/us/Processor_TechDocs/31116.pdf 

 AMD64 Architecture Programmer’s Manual Volume 2: System Programming
 http://support.amd.com/us/Processor_TechDocs/24593.pdf

The general approach is to introduce a flag to mark an event as model
specific. With that flag set a model specific ibs (raw) config value
can be passed to the pmu for setup. When there are ibs samples
available, it is sent back as a raw data sample to the userland. So we
have a raw config value and raw sampling data. This requires the
userland to setup ibs and further extract and process sampling data.

Patches 1-8 rework and refactor the code to prepare the ibs
implementation. This is done in patches 9-12.

I will add ibs example code to libpfm4.

-Robert


--

From: Robert Richter
Date: Tuesday, April 13, 2010 - 1:23 pm

This patch introduces a flag to mark events as model specific. The
flag can be used to setup hardware events other than generic
performance counters by passing special config data to the pmu. The
data can be interpreted different from generic events and can be used
for other purposes.

The concept of PMU model-specific arguments was practiced already in
Perfmon2. Perfmon2 provides an interface to pass model specific config
values to the pmu and receive event specific samples back. The
implementation of the model specific flag extends the perf_event i/f
by this feature too.

The userland program must be aware of the cpu model to create
model specific events. This could be done for example by checking the
cpu and family.

The model_spec flag can be arbitrarily reused on other models or
architectures. For backward compatibility all architectures must
return an error, if model_spec events are not supported and the bit is
set.

E.g., this flag is used to setup IBS on an AMD cpu. IBS is not common
to pmu features from other vendors or architectures. The pmu must be
setup with a special config value. Sample data is returned in a
certain format back to the userland. An IBS event is passed in the
config field of the event attributes and with model_spec and raw event
flag enabled. The samples are then passed back in a raw sample.

Signed-off-by: Robert Richter <robert.richter@amd.com>
---
 arch/powerpc/kernel/perf_event.c |    3 +++
 arch/sh/kernel/perf_event.c      |    3 +++
 arch/sparc/kernel/perf_event.c   |    3 +++
 arch/x86/kernel/cpu/perf_event.c |    3 +++
 include/linux/perf_event.h       |    3 ++-
 5 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kernel/perf_event.c b/arch/powerpc/kernel/perf_event.c
index 08460a2..8e68e15 100644
--- a/arch/powerpc/kernel/perf_event.c
+++ b/arch/powerpc/kernel/perf_event.c
@@ -1037,6 +1037,9 @@ const struct pmu *hw_perf_event_init(struct perf_event *event)
 	event->hw.config_base = ev;
 	event->hw.idx = ...
From: Robert Richter
Date: Tuesday, April 13, 2010 - 1:23 pm

To reuse this function for events with different enable bit masks,
this mask is part of the function's argument list now.

The function will be used later to control ibs events too.

Signed-off-by: Robert Richter <robert.richter@amd.com>
---
 arch/x86/kernel/cpu/perf_event.c       |    9 +++++----
 arch/x86/kernel/cpu/perf_event_intel.c |    5 +++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 9eeffad..f66f52a 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -847,10 +847,10 @@ void hw_perf_enable(void)
 	x86_pmu.enable_all(added);
 }
 
-static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc)
+static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
+					  u64 enable_mask)
 {
-	wrmsrl(hwc->config_base + hwc->idx,
-			      hwc->config | ARCH_PERFMON_EVENTSEL_ENABLE);
+	wrmsrl(hwc->config_base + hwc->idx, hwc->config | enable_mask);
 }
 
 static inline void x86_pmu_disable_event(struct perf_event *event)
@@ -922,7 +922,8 @@ static void x86_pmu_enable_event(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	if (cpuc->enabled)
-		__x86_pmu_enable_event(&event->hw);
+		__x86_pmu_enable_event(&event->hw,
+				       ARCH_PERFMON_EVENTSEL_ENABLE);
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a099df9..a4b56ac 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -513,7 +513,8 @@ static void intel_pmu_nhm_enable_all(int added)
 			if (!event)
 				continue;
 
-			__x86_pmu_enable_event(&event->hw);
+			__x86_pmu_enable_event(&event->hw,
+					       ARCH_PERFMON_EVENTSEL_ENABLE);
 		}
 	}
 	intel_pmu_enable_all(added);
@@ -617,7 +618,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
 	if (unlikely(event->attr.precise))
 ...
From: tip-bot for Robert Richter
Date: Friday, May 7, 2010 - 11:43 am

Commit-ID:  31fa58af57c41d2912debf62d47d5811062411f1
Gitweb:     http://git.kernel.org/tip/31fa58af57c41d2912debf62d47d5811062411f1
Author:     Robert Richter <robert.richter@amd.com>
AuthorDate: Tue, 13 Apr 2010 22:23:14 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 7 May 2010 11:31:00 +0200

perf, x86: Pass enable bit mask to __x86_pmu_enable_event()

To reuse this function for events with different enable bit masks,
this mask is part of the function's argument list now.

The function will be used later to control ibs events too.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1271190201-25705-6-git-send-email-robert.richter@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/cpu/perf_event.c       |    9 +++++----
 arch/x86/kernel/cpu/perf_event_intel.c |    5 +++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index c2c1e10..4e218d7 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -844,10 +844,10 @@ void hw_perf_enable(void)
 	x86_pmu.enable_all(added);
 }
 
-static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc)
+static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
+					  u64 enable_mask)
 {
-	wrmsrl(hwc->config_base + hwc->idx,
-			      hwc->config | ARCH_PERFMON_EVENTSEL_ENABLE);
+	wrmsrl(hwc->config_base + hwc->idx, hwc->config | enable_mask);
 }
 
 static inline void x86_pmu_disable_event(struct perf_event *event)
@@ -919,7 +919,8 @@ static void x86_pmu_enable_event(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
 	if (cpuc->enabled)
-		__x86_pmu_enable_event(&event->hw);
+		__x86_pmu_enable_event(&event->hw,
+				       ARCH_PERFMON_EVENTSEL_ENABLE);
 }
 
 /*
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c ...
From: Robert Richter
Date: Tuesday, April 13, 2010 - 1:45 pm

The patch set triggers this warning in perf_prepare_sample():

 WARN_ON_ONCE(size & (sizeof(u64)-1))

Should a raw data sample padded to 64 bit?

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

--

From: Robert Richter
Date: Wednesday, April 14, 2010 - 5:31 am

Just found Stephane's patch on the mailing list that removes the warning.

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

--

From: Peter Zijlstra
Date: Thursday, April 15, 2010 - 12:41 am

Yes it should.

--

From: Peter Zijlstra
Date: Thursday, April 15, 2010 - 12:44 am

I would much rather it uses the ->precise thing PEBS also uses,
otherwise we keep growing funny arch extensions and end up with a

Please add a valid usecase to tools/perf/ instead.

--

From: Robert Richter
Date: Thursday, April 15, 2010 - 8:16 am

I agree that an exiting flag could be reused. But the naming 'precise'
could be misleading. Maybe we rename it to 'model_spec' or something

I will also provide an example for tools/perf but reusing libpfm4 was
a first quick solution for me.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

--

From: Peter Zijlstra
Date: Wednesday, April 21, 2010 - 5:11 am

Right, so I really hate PERF_SAMPLE_RAW, and I'm considering simply
removing that for PEBS as well, its just too ugly. If we want the
register set we need to work on getting PERF_SAMPLE_REGS in a sensible
shape.

As to the meaning for ->precise, its meant to convey the counters are
not affected by skid and the like, I thought IBS provided exact IPs as
well (/me should re-read the IBS docs).

The thing with something like ->model_spec and PERF_SAMPLE_RAW is that
it doesn't provide a clear model, the user doesn't know what to expect
of it, it could be anything.

We want the ABI to express clear concepts, and things like lets bypass
everything and just dump stuff out to userspace really are to be avoided
at all costs.

Sadly IBS seems to be an utter trainwreck in the concept department (I'm
still struggling how to make a sensible interpretation of the data it
gathers).

The thing I absolutely want to avoid is the ABI becoming a fragmented
trainwreck like oprofile is.

Also not using sample_period for the sample period is of course utterly
unacceptable.
--

From: Stephane Eranian
Date: Wednesday, April 21, 2010 - 6:21 am

I wonder why SAMPLE_RAW went in in the first place, then. What was the
justification for it, traces?

Okay, so you're suggesting everything is exposed via PERF_SAMPLE_REGS.
PEBS does capture machine state which is easily mapped onto PERF_SAMPLE_REGS.
Well, that's until you look at PEB-LL on Nehalem where is captures
latencies and data
addresses.

IBS does not capture machine state in the sense of general purpose registers.
IBS captures micro-architectural info about an instruction and stores
this into a
handful of IBS registers. Those could be funneled through PERF_SAMPLE_REGS
as well, I believe. But that means, PERF_SAMPLE_REGS would need some
configuration
bitmask to name the registers of interest, e.g. EAX, EDX, IBSOP_DATA,
IBSOP_PHYSAD,
and so on.

As you pointed out a long time ago, IBS returns to much information to
be abstracted
easily unless you're willing to drop part of the information it
returns, e.g., concentrate
on cache miss info only. But this goes back to a point I made early
on: there are
many usage models, different metrics need different data. You don't
want to prevent
By construction it is given that it tracks an instruction. Thus, the
IP is always
that of the monitored instruction: no skid. The difference, though, it
that is not
associated with an actual counter+event. IBS keeps its own counter. You can
Concept is simple: track an instruction (uop) as it traverses the pipeline and
gather all sort of micro-architectural info. When the instruction retires,
interrupt the CPU and deliver the info via registers. The current implementation
is not without problems, but you can already gather useful info such as instrs
latencies, data cache misses.

Speaking of data cache misses, I believe there may be a way to abstract
sampling on cache misses, i.e, capture where they occur, that would work
with both IBS and PEBS-LL. That may be a good way to start exposing
That is fixed now.
--

From: Stephane Eranian
Date: Wednesday, April 21, 2010 - 11:40 am

You may want to collect some of the general purpose registers
(interrupted state) in each
sample. This is what you get today in a signal handler (ucontext) for
instance. That may
also be a way to export PEBS samples.
--

From: Robert Richter
Date: Wednesday, April 21, 2010 - 11:26 am

What is the idea of PERF_SAMPLE_REGS? A git grep PERF_SAMPLE_REGS only
returns a single line in -tip. I know nothing about it.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

--

From: Robert Richter
Date: Wednesday, April 21, 2010 - 9:28 am

The ABI should provide hw access to all pmu features. As it is not
possible to abstract these features for all models and architectures
in the same way and a new feature may work completely different, the
only solution I see is to provide a way to write to pmu registers and
receive sampling data unfiltered back. For IBS for example the data in
a sample does not fit into existing generic events. Making IBS generic
also does not help much, since it will not be available on different
models or architectures. Introducing event types that will never
reused do not need to be generalized, it is better to generalize the
way how to handle this kind of events. This is the reason I like the

Yes, the real rip is in the sample, but there is much more data than

Of course, it could be anything. But this is not the intention. If
configs and buffers or close or exactly as the hw i/f, then it is
spec'ed and well defined. The user only have to know how to configure
a certain hw feature of a special model and how to get data back. This
is how IBS is implemented. Maybe this is the same that you have in
mind with PERF_SAMPLE_REG? This interface can then be reused for a
very different feature and this looks to me like a clear solution. I
do not see alternatives. And even if we process the samples in the

That's the point, there is no generalization of this type of data, but
still it is useful.


-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

--

From: Robert Richter
Date: Wednesday, April 28, 2010 - 9:16 am

Peter,

I suggest to apply patches 1, 2, 3, 5, 6 to tip/perf/core if you don't
have objections. I will then resubmit a new version of all remaining
patches.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

--

From: Peter Zijlstra
Date: Tuesday, May 4, 2010 - 7:18 am

Ok, I'm not a big fan of 5, but I guess there's no way around that
without adding lots more code.


--

Previous thread: [PATCH 11/12] perf, x86: implement AMD IBS event configuration by Robert Richter on Tuesday, April 13, 2010 - 1:23 pm. (12 messages)

Next thread: Re: [PATCH 2/3] cxgb4i: main driver files by Mike Christie on Tuesday, April 13, 2010 - 1:41 pm. (2 messages)