Re: [RFC] perf_events: support for uncore a.k.a. nest units

From: Corey Ashford
Date: Tuesday, January 19, 2010 - 12:41 pm

-----
Intro
-----
One subject that hasn't been addressed since the introduction of perf_events in 
the Linux kernel is that of support for "uncore" or "nest" unit events.  Uncore 
is the term used by Intel engineers for units that are off-core but still
on the same die as the cores, and "nest" means exactly the same thing for IBM
Power processor engineers.  I will use the term uncore for brevity and because
it's in common parlance, but the issues and design possibilities below are
relevant to both.  I will also broaden the term so that uncore also refers to
PMUs that are off the processor chip altogether.

Contents
--------
1. Why support PMUs in uncore units?  Is there anything interesting to look at?
2. How do uncore events differ from core events?
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
4. How do you encode uncore events?
5. How do you address a particular uncore PMU?
6. Event rotation issues with uncore PMUs
7. Other issues?
8. Feedback?

----
1. Why support PMUs in uncore units?  Is there anything interesting to look at?
----

Today, many x86 chips contain uncore units, and we think that it's likely that 
the trend will continue, as more devices - I/O, memory interfaces, shared 
caches, accelerators, etc. - are integrated onto multi-core chips.  As these 
devices become more sophisticated and more workload is diverted off-core, 
engineers and performance analysts are going to want to look at what's happening 
in these units so that they can find bottlenecks.

In addition, we think that even off-chip I/O and interconnect devices are likely 
to gain PMUs because engineers will want to find bottlenecks in their massively 
parallel systems.

----
2. How do uncore events differ from core events?
----

The main difference is that uncore events are most likely not going to be tied
to a particular Linux task, or even a CPU context.  Uncore units are resources 
that are in some sense ...
From: Andi Kleen
Date: Tuesday, January 19, 2010 - 5:44 pm

What the user needs to know is which CPUs are affected by that uncore
event. For example the integrated memory controller counters that count local

I don't think a raw hex number will scale anywhere. You'll need a human
readable event list / sub event masks with help texts.

Often uncore events have specific restrictions, and that needs
to be enforced somewhere too.

Doing that all in a clean way that is also usable

Such a compressed addressing scheme doesn't seem very future proof.
e.g. 4 bits for the core is already obsolete (see the "80 core chip" that

That's a more workable scheme, but you still need to find a clean
way to describe topology (see above). The existing examples in sysfs
are unfortunately all clumsy imho.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Corey Ashford
Date: Tuesday, January 19, 2010 - 6:49 pm

I left out one critical detail here: I had in mind that we'd be using a library 
like libpfm for handling the issue of event names + attributes to raw code 
translation.  In fact, we are using libpfm today for this purpose in the 

Agreed.  If the designer is very generous with the size of each field, it could 
hold up for quite a while, but still there's a problem with relating these

Yes, I agree.  Also it's easy to construct a system design that doesn't have a 
hierarchical topology.  A simple example would be a cluster of 32 nodes, each of 
which is connected to its 31 neighbors.  Perhaps for the purposes of just 
enumerating PMUs, a tree might be sufficient, but it's not clear to me that it 
is mathematically sufficient for all topologies, not to mention if it's 
intuitive enough to use.  For example, highly-interconnected components might 
require that PMU leaf nodes be duplicated in multiple branches, i.e. PMU paths 
might not be unique in some topologies.

I'm certainly open to better alternatives!


Thanks for your thoughts,

- Corey

--

From: Andi Kleen
Date: Wednesday, January 20, 2010 - 2:35 am

I doubt it's needed or useful to describe all details of an interconnect.

If detailed distance information is needed a simple table like
the SLIT table exported by ACPI would seem easier to handle.

But at least some degree of locality (e.g. "local memory controller")

We already have cyclical graphs in sysfs using symlinks. I'm not 
sure they are all that easy to parse/handle, but at least they
can be described.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Corey Ashford
Date: Wednesday, January 20, 2010 - 12:28 pm

Thanks for the pointer.  I didn't know about the ACPI SLIT and SRAT tables until 
your post.  Having had a quick look at them, I don't think they'd be that 

I think locality could be determined by looking at the device tree.  For 
example, a memory controller for a particular processor chip would be a 

Good point.

-- 
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
cjashfor@us.ibm.com

--

From: Peter Zijlstra
Date: Wednesday, January 20, 2010 - 6:34 am

Well, to some extent the user will have to participate. For example
which uncore pmu will be selected depends on the cpu you're attaching
the event to according to the cpu to node map.

Furthermore the intel uncore thing has curious interrupt routing



Well, if you give these things a cpumask and put them all onto the

The better solution is to generalize the whole rr on tick scheme (which
has already been discussed).


--

From: Peter Zijlstra
Date: Wednesday, January 20, 2010 - 2:33 pm

OK, so I read most of the intel uncore stuff, and it seems to suggest
you need a regular pmu event to receive uncore events (chained setup),
this seems rather retarded since it wastes a perfectly good pmu event
and makes configuring all this more intricate...

Ah well, nothing to be done about that I guess..



--

From: Corey Ashford
Date: Wednesday, January 20, 2010 - 4:23 pm

Yes, we have a similar situation where in addition to events that are counted on 
core PMU counters, we also have counters that are off-core; in some cases the 
counters are in off-core units which take their actual events from other 
off-core units, in addition to their own events.  So you can see that this can 
be almost arbitrarily complex.

As for the PERF_TYPE_(CORE,NODE,SOCKET) idea, that could still work, even 
though, for example, a socket event may be counted on a core PMU.  Using more 
encodings for the type field, as you've suggested, would allow us to reuse the 
64-bit config space multiple times.  Were you thinking that with the type field 
we'd still re-use the "cpu" argument for the actual pmu address within the 
PERF_TYPE_* space?  If so, that's an interesting idea, but I think it still 
leaves open the problem of how to actually relate those addresses to the real
hardware, especially in the case of using a hypervisor which has provided you a 
small subset of the physical hardware in the system.

I really think we need some sort of data structure which is passed from the 
kernel to user space to represent the topology of the system, and give useful 
information to be able to identify each PMU node.  Whether this is done with a 
sysfs-style tree, a table in a file, XML, etc... it doesn't really matter much, 
but it needs to be something that can be parsed relatively easily and *contains 
just enough information* for the user to be able to correctly choose PMUs, and 
for the kernel to be able to relate that back to actual PMU hardware.

In our case, we are looking at /proc/device-tree, and it actually does appear to 
contain enough information for us.  However, since /proc/device-tree is not 
available anywhere but Power arch (/proc/device-tree originates from a data 
structure passed into the OS from the Open Firmware) we'd like to have a more 
general approach that can be used on x86 and other arches.

- Corey

--

From: Ingo Molnar
Date: Thursday, January 21, 2010 - 12:21 am

The right way would be to extend the current event description under 
/debug/tracing/events with hardware descriptors and (maybe) to formalise this 
into a separate /proc/events/ or into a separate filesystem.

The advantage of this is that in the grand scheme of things we _really_ don't
want to limit performance events to 'hardware' hierarchies, or to 
devices/sysfs, some existing /proc scheme, or any other arbitrary (and 
fundamentally limiting) object enumeration.

We want a unified, logical enumeration of all events and objects that we care 
about from a performance monitoring and analysis point of view, shaped for the 
purpose of and parsed by perf user-space. And since the current event 
descriptors are already rather rich as they enumerate all sorts of things: 

 - tracepoints
 - hw-breakpoints
 - dynamic probes
 
etc., and are well used by tooling, we should expand those with real hardware
structure.

Thanks,

	Ingo
--

From: Corey Ashford
Date: Thursday, January 21, 2010 - 12:13 pm

This is an intriguing idea; I like the idea of generalizing all of this info 
into one structure.

So you think that this structure should contain event info as well?  If these 
structures are created by the kernel, I think that would necessitate placing 
large event tables into the kernel, which is something I think we'd prefer to 
avoid because of the amount of memory it would take.  Keep in mind that we need 
not only event names, but event descriptions, encodings, attributes (e.g. unit 
masks), attribute descriptions, etc.  I suppose the kernel could read a file 
from the file system, and then add this info to the tree, but that just seems 
bad.  Are there existing places in the kernel where it reads a user space file 
to create a user space pseudo filesystem?

I think keeping event naming in user space, and PMU naming in kernel space might 
be a better idea: the kernel exposes the available PMUs to user space via some 
structure, and a user space library tries to recognize the exposed PMUs and 
provide event lists and other needed info.  The perf tool would use this library 
to be able to list available events to users.

-- 
Regards,

- Corey

--

From: Corey Ashford
Date: Thursday, January 21, 2010 - 12:28 pm

Perhaps another way of handling this would be to have the kernel dynamically load
a specific "PMU kernel module" once it has detected that it has a particular PMU
in the hardware.  The module would consist only of a data structure, and a
simple API to access the event data.  This way, only the PMUs that actually
exist in the hardware would need to be loaded into memory, and perhaps then only
temporarily (just long enough to create the pseudo fs nodes).

Still, though, since it's a pseudo fs, all of that event data would be taking up 
kernel memory.

Another model, perhaps, would be to actually write this data out to a real file 
system upon every boot up, so that it wouldn't need to be held in memory.  That 
seems rather ugly and time consuming, though.

-- 
Regards,

- Corey

--

From: Ingo Molnar
Date: Wednesday, January 27, 2010 - 3:28 am

I don't think memory consumption is a problem at all. The structure of the
monitored hardware/software state is information we _want_ the kernel to 
provide, mainly because there's no unified repository for user-space to get 
this info from.

If someone doesn't want it on some ultra-embedded box then sure a .config
switch can be provided to allow it to be turned off.

	Ingo
--

From: Corey Ashford
Date: Wednesday, January 27, 2010 - 12:50 pm

Ok, just so that we quantify things a bit, let's say I have 20 different types 
of PMUs totalling 2000 different events, each of which has a name and text 
description, averaging 300 characters.  Along with that, there are, let's say, four
64-bit words of metadata per event describing encoding, which attributes apply
to the event, and any other needed info. I don't know how much memory each 
pseudo fs node takes up.  Let me guess and say 128 bytes for each event node 
(the amount taken for the PMU nodes would be negligible compared with the event 
nodes).

So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.
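
As a sanity check on that arithmetic, here is a minimal sketch in Python.  The
constants are the figures quoted above; the 128-byte node size is only the guess
made in this message, not a measured value, and the function name is made up
for illustration.

```python
# Back-of-the-envelope estimate using the figures quoted in this message.
NUM_EVENTS = 2000
TEXT_BYTES = 300            # average name + description per event
METADATA_BYTES = 4 * 8      # four 64-bit words per event
NODE_BYTES = 128            # guessed pseudo-fs node overhead (an assumption)


def event_table_bytes(num_events=NUM_EVENTS):
    """Total bytes needed to keep the event tables in kernel memory."""
    return num_events * (TEXT_BYTES + METADATA_BYTES + NODE_BYTES)


total = event_table_bytes()
print(f"{total} bytes = {total / 1000:.0f} KB ({total / 1024:.0f} KiB)")
```

With these inputs the total is 920000 bytes, i.e. about 920 KB (roughly 898 KiB).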

Let's assume that the correct event module can be loaded dynamically, so that we 
don't need to have all of the possible event sets for a particular arch kernel 
build.

Any opinions on whether allocating this amount of kernel memory would be 
acceptable?  It seems like a lot of kernel memory to me, but I come from an 
embedded systems background.  Granted, most systems are going to use a fraction 
of that amount of memory (<100KB) due to having far fewer PMUs and therefore 
fewer distinct event types.

There's at least one more dimension to this.  Let's say I have 16 uncore PMUs 
all of the same type, each of which has, for example 8 events.  As a very crude 
pseudo fs, let's say we have a structure like this:


/sys/devices/pmus/
     uncore_pmu0/
         event0/ (path name to here is the name of the pmu and event)
             description (file)
             applicable_attributes (file)
         event1/
             description
             applicable_attributes
         event2/
             ...
         event7/
             ...
     uncore_pmu1/
         event0/
             description
             applicable_attributes
         ...
     ...
     uncore_pmu15/
         ...
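
To put a number on the layout sketched above, the following Python sketch builds
the same shape in memory and counts its nodes.  The directory and file names are
the hypothetical ones from the tree; nothing here reads a real /sys, and the
helper names are made up for illustration.

```python
def build_pmu_tree(num_pmus=16, events_per_pmu=8):
    """Return the crude layout as a nested dict: pmu -> event -> list of files."""
    files = ["description", "applicable_attributes"]
    return {
        f"uncore_pmu{p}": {f"event{e}": list(files) for e in range(events_per_pmu)}
        for p in range(num_pmus)
    }


def count_nodes(tree):
    """Count the directories and files below the pmus/ directory."""
    pmu_dirs = len(tree)
    event_dirs = sum(len(events) for events in tree.values())
    leaf_files = sum(len(f) for events in tree.values() for f in events.values())
    return pmu_dirs + event_dirs + leaf_files


tree = build_pmu_tree()
print(count_nodes(tree))  # 16 PMU dirs + 128 event dirs + 256 files
```

For 16 PMUs with 8 events each this comes to 400 nodes, while only 16 of the
256 files (two per distinct event) hold content that differs.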

Now, you can see that there's a lot of replication here, because the event 
descriptions and attributes are the same for each uncore pmu.  We can use 
symlinks to link them to ...
From: Peter Zijlstra
Date: Thursday, January 28, 2010 - 3:57 am

I really don't like this. The cpu->uncore map is fixed by the
topology of the machine, which is already available in /sys some place.

Let's simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or
something like that. We can start with 2 generic events for that type,
local/remote memory accesses and take it from there.
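
As an illustration of the cpu->node mapping mentioned here: Linux exports node
membership as CPU list strings under /sys/devices/system/node/node<N>/cpulist.
The Python sketch below parses strings of that form and inverts them into a
cpu->node map.  The two-node sample data and the helper names are made up; how
a node event type would consume the map is not shown.

```python
def parse_cpulist(text):
    """Expand a sysfs-style CPU list such as "0-3,8-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpu_to_node(node_cpulists):
    """Invert {node: cpulist string} into {cpu: node}."""
    return {cpu: node
            for node, text in node_cpulists.items()
            for cpu in parse_cpulist(text)}


# Made-up two-node machine; on a real system the strings would be read from
# /sys/devices/system/node/node<N>/cpulist.
sample = {0: "0-3,8-11", 1: "4-7,12-15"}
mapping = cpu_to_node(sample)
print(mapping[9], mapping[5])  # node of CPU 9, node of CPU 5
```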



--

From: Corey Ashford
Date: Thursday, January 28, 2010 - 11:00 am

I don't quite get what you're saying here.  Perhaps you are thinking 
that all uncore units are associated with a particular cpu node, or a 
set of cpu nodes?  And that there's only one uncore unit per cpu (or set 
of cpus) that needs to be addressed, i.e. no ambiguity?

That is not going to be the case for all systems.  We can have uncore 
units that are associated with the entire system, for example PMUs in an 
I/O device.  And we can have multiple uncore units of a particular
type, for example multiple vector coprocessors, each with its own PMU
and each associated with a single cpu or a set of cpus.

perf_events needs an addressing scheme that covers these cases.

- Corey
--

From: Peter Zijlstra
Date: Thursday, January 28, 2010 - 12:06 pm

Well, I was initially thinking of the intel uncore thing which is memory
controller, so node, level.



You could possibly add a u64 pmu_id field to perf_event_attr and use
that together with things like:

  PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding
  PERF_TYPE_SPU, attr.pmu_id = spu-id
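
One possible domain:bus:device:function packing is sketched below in Python.
The field widths follow the usual PCI address layout (16-bit domain, 8-bit bus,
5-bit device, 3-bit function), but the bit positions inside pmu_id are purely
an assumption for illustration; pmu_id itself is only a proposal in this thread,
not an existing perf_event_attr field.

```python
def encode_pci_pmu_id(domain, bus, device, function):
    """Pack domain:bus:device.function into one integer that fits in a u64.

    Assumed layout (not a kernel ABI): domain in bits 16-31, bus in bits 8-15,
    device in bits 3-7, function in bits 0-2.
    """
    if not (0 <= domain <= 0xFFFF and 0 <= bus <= 0xFF
            and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("PCI address component out of range")
    return (domain << 16) | (bus << 8) | (device << 3) | function


def decode_pci_pmu_id(pmu_id):
    """Inverse of encode_pci_pmu_id(): returns (domain, bus, device, function)."""
    return ((pmu_id >> 16) & 0xFFFF, (pmu_id >> 8) & 0xFF,
            (pmu_id >> 3) & 0x1F, pmu_id & 0x7)


pmu_id = encode_pci_pmu_id(0x0000, 0x3F, 0x02, 0x1)
print(hex(pmu_id), decode_pci_pmu_id(pmu_id))
```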

But before we go there the perf core needs to be extended to deal with
multiple hardware pmus, something which isn't too hard but we need to be
careful not to bloat the normal code paths for these somewhat esoteric
use cases.



--

From: Corey Ashford
Date: Thursday, January 28, 2010 - 12:44 pm

Thank you for that clarification.

One of Ingo's comments was that he wants perf to be able to expose all of the 
available PMUs via the perf tool.  That perf should be able to parse some data 
structure (somewhere) that would contain all of the info the user would need to 
choose a particular PMU.  Do you have some ideas about how that could be 
accomplished using the above encoding scheme?  I can see how it would be fairly 
easy to come up with a PERF_TYPE_* encoding per-topology, and then interpret all 
of those bits correctly within the kernel (which is savvy to that topology), but
I don't see how there would be a straightforward way to expose that structure
to perf.  How would perf know which of those encodings apply to the current 
system, how many PMUs there are of each type, etc.

That's why I'm leaning toward a /sys/devices-style pseudo fs at the moment.  If 

Is this something you are looking into?

- Corey

--

From: Corey Ashford
Date: Thursday, January 28, 2010 - 3:08 pm

Thank you for the clarification.

One of Ingo's comments in this thread was that he wants perf to be able to 
display to the user the available PMUs along with their respective events.  That
perf would parse some machine-independent data structure (somewhere) to get this 
info.   This same info would provide the user a method of specifying which PMU 
he wants to address.  He'd also like all of the event info data to reside in the 
same place.  I hope I am paraphrasing him correctly.

I can see that with the scheme you have proposed above, it would be 
straightforward to encode PMU ids for a particular new PERF_TYPE_* system
topology, but I don't see a clear way of providing perf with enough information
to tell it which particular topology is being used, how many units of each PMU 
type exist, and so on.  Do you have any ideas as to how to accomplish this goal 
with the method you are suggesting?

This is one of the reasons why I am leaning toward a /sys/devices-style data 
structure; the kernel could easily build it based on the pmus that it discovers 
(through whatever means), and the user can fairly easily choose a pmu from this 
structure to open, and it's unambiguous to the kernel as to which pmu the user 
really wants.

I am not convinced that this is the right place to put the event info for each PMU.

 > But before we go there the perf core needs to be extended to deal with
 > multiple hardware pmus, something which isn't too hard but we need to be
 > careful not to bloat the normal code paths for these somewhat esoteric
 > use cases.
 >

Is this something you've looked into?  If so, what sort of issues have you 
discovered?

Thanks,

- Corey

--

From: Peter Zijlstra
Date: Friday, January 29, 2010 - 2:52 am

Well, the dumb way is simply probing all of them and seeing who responds.
Another might be adding a pmu attribute (showing the pmu-id) to the
existing sysfs topology layouts (system topology, pci, spu, are all

Right, I'm not at all sure the kernel wants to know about any events
beyond those needed for pmu scheduling constraints and possible generic
event maps.

Clearly it needs to know about all software events, but I don't think we

I've poked at it a little yes, while simply abstracting the current hw
interface and making it a list of pmu's isn't hard at all, it does add
overhead to a few key locations.

Another aspect is event scheduling, you'd want to separate the event
lists for the various pmus so that the RR thing works as expected, this
again adds overhead because you now need to abstract out the event lists
as well.

The main fast path affected by both these things is the task switch
event scheduling where you have to iterate all active events and their
pmus.

So while the abstraction itself isn't too hard, doing it so as to
minimize the bloat on the key paths does make it interesting.



--

From: Corey Ashford
Date: Friday, January 29, 2010 - 4:05 pm

That can work, but it's still fuzzy to me how a user would relate a PMU address 
that he's encoded to some actual device in the system he's using.  How would he 
know that he's addressing the correct device (besides that the PMU type 

So you'd read the id from the sysfs topology tree, and then pass that id to the 
interface?  That's an interesting approach that eliminates the need to pass a 
string pmu path to the kernel.

I like this idea, but I need to read more deeply about the topology entries to 

Interesting.

Thanks for your comments.

- Corey

--

From: Peter Zijlstra
Date: Saturday, January 30, 2010 - 1:42 am

No, the attr.pmu_id would reflect the location in the tree (pci
location, or spu number), the pmu id reported would identify the kind of
pmu driver used for that particular device.

I realized this confusion after sending but didn't clarify, we should
come up with a good alternative name for either (or both) uses.



--

From: Corey Ashford
Date: Monday, February 1, 2010 - 12:39 pm

Ok, just so I'm clear here, is attr.pmu_id a (char *) or some sort of encoded 
bit field?

- Corey

--

From: Peter Zijlstra
Date: Monday, February 1, 2010 - 12:54 pm

Right, currently on x86 we have x86_pmu.name which basically tells us
what kind of pmu it is, but we really don't export that since it's
trivial to see that from /proc/cpuinfo, but the ARM people expressed
interest in this because decoding the cpuid equivalent on arm is like
nasty business.

But once we go add other funny pmu's, it becomes interesting to know
what kind of pmu it is and tie possible event sets to them.

--

From: Peter Zijlstra
Date: Thursday, January 21, 2010 - 1:36 am

Well, I'm tempted to say that's a problem for the virt guys :-)

One way one could solve that is by having the topology information
include the virt<->phys map, so that you can find the physical node from
the virtual cpu number.

Going to be interesting though, but then, virt seems to be about
creating problems where there were none before.

--

From: stephane eranian
Date: Thursday, January 21, 2010 - 1:47 am

I don't think that is correct. You can be using the uncore PMU on Nehalem
without any core PMU event. The only thing to realize is that uncore PMU
shares the same interrupt vector as core PMU. You need to configure which
core the uncore is going to interrupt on. This is done via a bitmask, so you
--

From: Peter Zijlstra
Date: Thursday, January 21, 2010 - 1:59 am

Ah, sharing the IRQ line is no problem. But from reading I got the
impression you need to configure an Offcore counter. See 30.6.2.1:

• EN_PMI_COREn (bit n, n = 0, 3 if four cores are present): When set, processor
core n is programmed to receive an interrupt signal from any interrupt enabled
uncore counter. PMI delivery due to an uncore counter overflow is enabled by
setting IA32_DEBUG_CTL.Offcore_PMI_EN to 1.

Which seems to indicate a link with the off-core response thing.

However I would be very glad to be wrong :-)

--

From: stephane eranian
Date: Thursday, January 21, 2010 - 2:16 am

The offcore_response register is a CORE PMU register.
I know the name is confusing.

OFFCORE_RESPONSE_0 is not a counter but rather an
extension of the filtering capabilities of regular counters.
To use offcore_response_0, you will need to program
a generic counter (1 config, 1 data) + offcore_response_0.
This is a very useful feature to understand memory traffic.

There is an associated difficulty though. The
offcore_response_0 MSR is shared between HT threads.
Thus the kernel needs to arbitrate. That should be taken
care of by the event scheduling code which I am working
on.

In the context of perf_event, you need to find some encoding
of the event to pass as RAW. The raw event code + umask
is 0x01B7. Then, you need to encode the value for the
offcore_response_0 MSR, which is 16 bits on NHM/WSM.
You could either stash this into the config field or use an extended
field. The kernel would detect 0x01B7 and use the value in the
extra config field.
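
A rough Python sketch of the "stash it into the config field" option follows.
The event code 0x01B7 (event select 0xB7, umask 0x01) and the 16-bit width come
from the description above; placing the MSR value in bits 32-47 of the 64-bit
config is an assumption made only for this illustration, as are the function
names.

```python
OFFCORE_RESPONSE_0_EVENT = 0x01B7   # umask 0x01, event select 0xB7
EXTRA_SHIFT = 32                    # assumed position of the stashed MSR value
EXTRA_MASK = 0xFFFF                 # offcore_response_0 is 16 bits on NHM/WSM


def pack_offcore_config(offcore_value):
    """Build a raw config carrying both the event code and the MSR value."""
    if offcore_value & ~EXTRA_MASK:
        raise ValueError("offcore_response_0 value must fit in 16 bits")
    return OFFCORE_RESPONSE_0_EVENT | (offcore_value << EXTRA_SHIFT)


def unpack_config(config):
    """Split a config word the way a kernel-side handler might."""
    raw = config & 0xFFFF
    extra = None
    if raw == OFFCORE_RESPONSE_0_EVENT:
        # Only this event consumes the extra bits.
        extra = (config >> EXTRA_SHIFT) & EXTRA_MASK
    return {"event": raw & 0xFF, "umask": (raw >> 8) & 0xFF, "offcore": extra}


cfg = pack_offcore_config(0x4001)
print(hex(cfg), unpack_config(cfg))
```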

As described in vol3b, Intel Westmere adds a second
offcore_response register. It behaves the same way with
the same restrictions.

The debugctl bit controls uncore activation not offcore.

--

From: stephane eranian
Date: Thursday, January 21, 2010 - 2:43 am

Given the PMU sharing model of perf_events, it seems you may have
multiple consumers of uncore PMU at the same time. That means you
will need to direct the interrupt to all the CPUs for which you currently
have a user. You may have multiple users per CPU, thus you need some
reference count to track all of that. The alternative is to systematically
broadcast the uncore PMU interrupt. Each core then checks whether or
not it has uncore users.

Note that all of this is independent of the type of event, i.e., per-thread
or system-wide.
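
The reference-counting option could look roughly like the Python sketch below.
It only models the bookkeeping (users per CPU and the resulting target
bitmask); the class and method names are hypothetical, and programming the
actual interrupt-routing register is left out.

```python
from collections import Counter


class UncoreIrqRouting:
    """Track uncore users per CPU and derive the interrupt target bitmask."""

    def __init__(self):
        self.users = Counter()   # cpu -> number of active uncore events

    def add_user(self, cpu):
        self.users[cpu] += 1

    def remove_user(self, cpu):
        if self.users[cpu] == 0:
            raise ValueError(f"no uncore user on CPU {cpu}")
        self.users[cpu] -= 1
        if self.users[cpu] == 0:
            del self.users[cpu]

    def irq_mask(self):
        """Bit n is set while CPU n still has at least one uncore user."""
        mask = 0
        for cpu in self.users:
            mask |= 1 << cpu
        return mask


routing = UncoreIrqRouting()
routing.add_user(0)
routing.add_user(0)
routing.add_user(2)
print(bin(routing.irq_mask()))   # CPUs 0 and 2 are targets
routing.remove_user(0)
print(bin(routing.irq_mask()))   # CPU 0 still has one user left
```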
--

From: Lin Ming
Date: Tuesday, March 30, 2010 - 12:42 am

Hi, Corey

How is this going now? Are you still working on this?
I'd like to help to add support for uncore, test, write code or anything
else.

Thanks,

--

From: Corey Ashford
Date: Tuesday, March 30, 2010 - 9:49 am

I haven't been actively working on adding infrastructure for nest PMUs 
yet.  At the moment we are working on supporting nest events for IBM's 
Wire-Speed processor, using the current infrastructure, because of the 
time limitations.  Using the existing infrastructure is definitely not 
ideal, but for this processor, it's workable.

There are still a lot of issues to solve for adding this infrastructure:

1) Does perf_events need a new context type (in addition to per-task and 
per-cpu)?  This is partly because we don't want to be mixing the 
rotation of CPU-events with nest events.  Each PMU really ought to have 
its own event list.

2) How do we deal with accessing PMUs which require slow access methods
(e.g. internal serial bus)?  The accesses may need to be placed on 
worker threads so that they don't affect the performance of context 
switches and system ticks.

3) How exactly do we represent the PMUs in the pseudofs (/sys or
/proc)?  And how exactly does the user specify the PMU to perf_events?
Peter Zijlstra and Stephane Eranian both recommended opening the PMU 
with open() and then passing the resulting fd in through the 
perf_event_attr struct.

4) How do we choose a CPU to do the housekeeping work for a particular
nest PMU?  Peter thought that user space should still specify it via the
cpu parameter of perf_event_open(), but there's also an argument to be made
for the kernel choosing the best CPU to handle the job, or at least making
it optional for the user to choose the CPU.

I'm sure there are other issues as well.  If you'd like to start working 
on some (or all!) of these, you are more than welcome to.  I think we 
need to toss around some more ideas before committing much to code at 
this point.

- Corey
--

From: Peter Zijlstra
Date: Tuesday, March 30, 2010 - 10:15 am

Right, I've got some definite ideas on how to go here, just need some
time to implement them.

The first thing that needs to be done is get rid of all the __weak
functions (with exception of perf_callchain*, since that really is arch
specific).

For hw_perf_event_init() we need to create a pmu registration facility
and lookup a pmu_id, either passed as an actual id found in sysfs or an
open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards
compat).
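
As a rough model of such a registration facility, here is a small Python
sketch.  It only illustrates the id bookkeeping described above (the cpu pmu
pinned at id 0, other PMUs handed the next free id); the class and method names
are hypothetical and do not correspond to any kernel interface.

```python
class PmuRegistry:
    """Toy registry mapping small integer ids to PMU driver objects."""

    CPU_PMU_ID = 0   # the core PMU keeps id 0 for backwards compatibility

    def __init__(self, cpu_pmu):
        self._pmus = {self.CPU_PMU_ID: cpu_pmu}
        self._next_id = 1

    def register(self, pmu):
        """Add a PMU and return the id user space would later pass in."""
        pmu_id = self._next_id
        self._pmus[pmu_id] = pmu
        self._next_id += 1
        return pmu_id

    def lookup(self, pmu_id=CPU_PMU_ID):
        """Resolve an id; callers that pass nothing get the cpu pmu."""
        try:
            return self._pmus[pmu_id]
        except KeyError:
            raise LookupError(f"unknown pmu_id {pmu_id}") from None


registry = PmuRegistry(cpu_pmu="cpu")
uncore_id = registry.register("uncore")
print(uncore_id, registry.lookup(), registry.lookup(uncore_id))
```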

hw_perf_disable/enable() would become struct pmu functions and
perf_disable/enable need to become per-pmu. Most functions operate on a
specific event; for those we know the pmu and hence can call the per-pmu
version. (XXX find those sites where this is not true).

Then we can move to context, yes I think we want new context for new
PMUs, otherwise we get very funny RR interleaving problems. My idea was
to move find_get_context() into struct pmu as well, this allows you to
have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts
because then things like perf_event_task_sched_out() would get rather
complex.

For RR we can move away from perf_event_task_tick and let the pmu
install a (hr)timer for this on their own.
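
To illustrate why separate per-pmu lists avoid the interleaving problem, here
is a minimal Python sketch in which each PMU owns its event list and rotates it
independently.  The names are hypothetical, the timer is left out (rotate()
stands in for whatever the pmu's own timer callback would do), and moving the
head event to the tail is just one simple rotation policy.

```python
from collections import deque


class PmuContext:
    """Per-PMU event list with its own round-robin rotation."""

    def __init__(self, name, num_counters, events):
        self.name = name
        self.num_counters = num_counters
        self.events = deque(events)

    def scheduled(self):
        """Events currently on hardware: the first num_counters in the list."""
        return list(self.events)[: self.num_counters]

    def rotate(self):
        """Move the head event to the tail if the PMU is over-committed."""
        if len(self.events) > self.num_counters:
            self.events.rotate(-1)


core = PmuContext("cpu", 2, ["cycles", "instructions", "branches"])
uncore = PmuContext("uncore", 1, ["u0", "u1"])

uncore.rotate()              # only the uncore list moves
print(core.scheduled())      # core schedule is untouched
print(uncore.scheduled())
```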

I've been planning to implement this for more than a week now, it's just
that other stuff keeps getting in the way.

--

From: Corey Ashford
Date: Tuesday, March 30, 2010 - 3:12 pm

On 3/30/2010 10:15 AM, Peter Zijlstra wrote:

This sounds like a good idea.  Right now for the Wire-Speed processor, we have a 
loop that goes through all of the nest PMUs and calls their respective per-pmu

Yes, I think it makes a lot of sense, so that there's not some sort of fixed 

Definitely.  I don't think it makes sense to have per-task context on 

This is necessary I think, because of the access time for some of the PMUs.  I
wonder though if it should, perhaps optionally, be off-loaded to a high-priority 
task to do the switching so that access latency to the PMU can be controlled.

As I mentioned when we met, some of the Wire-Speed processor nest PMU control 
registers are accessed via SCOM, which is an internal, 200 MHz serial bus.  We 
are being quoted ~525 SCOM bus ticks to do a PMU control register access, which 
comes out to about 2.5 microseconds.  If you figure 5 accesses to rotate the 

Well, it's not as if this is a trivial task either :)

- Corey

--

From: Peter Zijlstra
Date: Wednesday, March 31, 2010 - 7:01 am

For uncore no, but there is also the hw-breakpoint stuff that is being
presented as a pmu, for those it would make sense to have a separate
per-task context.

But doing multiple per-task contexts is something for a next step

Yeah, you mentioned that.. for those things we need some changes anyway,
since currently we install per-cpu counters using IPIs and expect the
pmu::enable() method to be synchronous (it has a return value). It would
be totally unacceptable to do 2.5ms pokes with IRQs disabled.

The RR thing would be the easiest to solve, just let the timer wake up a
thread instead of doing the work itself, that's fully isolated to how
the pmu chooses to implement that. The above mentioned issue however
would be much more challenging to fix nicely.



--

From: stephane eranian
Date: Wednesday, March 31, 2010 - 7:13 am

Also some of perf_enable()/perf_disable() would have to be per PMU and
not global like they are today.
--

From: Maynard Johnson
Date: Wednesday, March 31, 2010 - 8:49 am

hw-breakpoint presented as a pmu?  hmmmm.  IMHO, this is an example where
shoehorning something into the perf_events subsystem that logically doesn't
[snip]

--

From: Corey Ashford
Date: Wednesday, March 31, 2010 - 10:50 am

Just to be clear, it's 2.5us, not 2.5ms, but I think it's still bad... 
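
For reference, the figure follows from the numbers quoted earlier in the thread
(about 525 ticks of a 200 MHz serial bus per control register access).  A small
Python sketch of that arithmetic, with made-up names:

```python
SCOM_CLOCK_MHZ = 200        # internal serial bus clock quoted in the thread
TICKS_PER_ACCESS = 525      # quoted cost of one PMU control register access


def access_time_ns(accesses=1):
    """Time spent on the SCOM bus in nanoseconds (one tick is 1000/200 = 5 ns)."""
    tick_ns = 1000 // SCOM_CLOCK_MHZ
    return accesses * TICKS_PER_ACCESS * tick_ns


one = access_time_ns()
print(f"{one} ns per access, i.e. {one / 1000} us")
```

That is 2625 ns per access, so microseconds rather than milliseconds.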

It seems like it might need to be done in two phases.  IPI request is 
sent, and then a thread is woken up on the other CPU, it does some work, 
and then sets a status variable and somehow notifies the caller that the 
operation has completed.  I don't know the kernel's communication 
mechanisms well enough to know which one is most appropriate - maybe rwsem?
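
The kernel's completion API (complete() / wait_for_completion()) is one
existing primitive for this kind of "tell me when you're done" notification.
The Python sketch below only mimics the two-phase pattern in user space with
threading.Event; the class name and the stand-in work function are
hypothetical, and it says nothing about how the IPI side would be wired up.

```python
import threading


class DeferredPmuOp:
    """Two-phase request: kick a worker, then wait for it to report completion."""

    def __init__(self, work):
        self._work = work
        self._done = threading.Event()
        self.status = None

    def _run(self):
        self.status = self._work()   # slow PMU access happens off the caller's path
        self._done.set()             # notify the caller that the operation finished

    def start(self):
        """Phase one: hand the work to another thread and return immediately."""
        threading.Thread(target=self._run, daemon=True).start()

    def wait(self, timeout=None):
        """Phase two: block until the worker has set the status."""
        if not self._done.wait(timeout):
            raise TimeoutError("PMU operation did not complete in time")
        return self.status


op = DeferredPmuOp(lambda: "ok")   # stand-in for a slow register access
op.start()
print(op.wait(timeout=5.0))
```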

- Corey
--

From: stephane eranian
Date: Tuesday, March 30, 2010 - 2:28 pm

On Tue, Mar 30, 2010 at 6:49 PM, Corey Ashford
I concur that you don't want to mix events from different PMUs in the
same rotation list. There are some side effects of not doing this with
AMD64 Northbridge events when you have multiple concurrent sessions. But
this is a special case where the "uncore" (nest) PMU is actually controlled
via the core PMU. I think that is okay for now, but you certainly don't want
that for Nehalem uncore, for

One of the housekeeping tasks is to handle uncore PMU interrupts, for instance.
That is not a trivial task given that events are managed independently and
that you could be monitoring per-thread or system-wide. It may be that
some uncore PMU can only interrupt one core. Intel Nehalem can interrupt
many at once.
--

From: Corey Ashford
Date: Tuesday, March 30, 2010 - 4:11 pm

That's a good point, and I think it's unreasonable to expect that the user knows 
exactly how the interrupts are connected from the uncore/nest PMU to which CPU(s).

Perhaps one way around this would be to return an error if the chosen CPU wasn't 
fully capable of performing the housekeeping functions for the requested PMU. 
But this certainly isn't ideal, because relying on this mechanism would require 
that the user (or user tool) figure out which CPU is fully capable by 
trial-and-error.

- Corey

--

From: stephane eranian
Date: Wednesday, March 31, 2010 - 6:43 am

On Wed, Mar 31, 2010 at 1:11 AM, Corey Ashford

I think users should not have to worry about all of this. It is also fine to
restrict any uncore monitoring to system-wide mode, i.e., not per-thread. But I
have also seen people requesting just that, hoping they could draw correlations
between core and uncore events in one run. I think this is less critical though.

In the specific case of CPU-uncore (e.g., Nehalem uncore), you could simply
identify the uncore PMU with the CPU. For anything else, you would definitely
need the file descriptor. The more uniform approach would have been to use
the file descriptor all along, except for per-thread. But that's fine, I think.
--
