----- Intro -----

One subject that hasn't been addressed since the introduction of perf_events in the Linux kernel is support for "uncore" or "nest" unit events. Uncore is the term Intel engineers use for units that are off-core but still on the same die as the cores, and "nest" means exactly the same thing to IBM Power processor engineers. I will use the term uncore for brevity and because it's in common parlance, but the issues and design possibilities below are relevant to both. I will also broaden the term so that uncore also refers to PMUs that are off the processor chip altogether.

Contents
--------
1. Why support PMUs in uncore units? Is there anything interesting to look at?
2. How do uncore events differ from core events?
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
4. How do you encode uncore events?
5. How do you address a particular uncore PMU?
6. Event rotation issues with uncore PMUs
7. Other issues?
8. Feedback?

---- 1. Why support PMUs in uncore units? Is there anything interesting to look at? ----

Today, many x86 chips contain uncore units, and we think that it's likely that the trend will continue, as more devices - I/O, memory interfaces, shared caches, accelerators, etc. - are integrated onto multi-core chips. As these devices become more sophisticated and more workload is diverted off-core, engineers and performance analysts are going to want to look at what's happening in these units so that they can find bottlenecks. In addition, we think that even off-chip I/O and interconnect devices are likely to gain PMUs because engineers will want to find bottlenecks in their massively parallel systems.

---- 2. How do uncore events differ from core events? ----

The main difference is that uncore events are most likely not going to be tied to a particular Linux task, or even a CPU context. Uncore units are resources that are in some sense ...
What the user needs to know is which CPUs are affected by that uncore event. For example the integrated memory controller counters that count local

I don't think a raw hex number will scale anywhere. You'll need a human readable event list / sub event masks with help texts. Often uncore events have specific restrictions, and that needs to be enforced somewhere too. Doing that all in a clean way that is also usable

Such a compressed addressing scheme doesn't seem very future proof. E.g. 4 bits for the core is already obsolete (see the "80 core chip" that

That's a more workable scheme, but you still need to find a clean way to describe topology (see above). The existing examples in sysfs are unfortunately all clumsy imho.

-Andi -- ak@linux.intel.com -- Speaking for myself only. --
I left out one critical detail here: I had in mind that we'd be using a library like libpfm for handling the issue of event names + attributes to raw code translation. In fact, we are using libpfm today for this purpose in the

Agreed. If the designer is very generous with the size of each field, it could hold up for quite a while, but still there's a problem with relating these

Yes, I agree. Also it's easy to construct a system design that doesn't have a hierarchical topology. A simple example would be a cluster of 32 nodes, each of which is connected to its 31 neighbors. Perhaps for the purposes of just enumerating PMUs, a tree might be sufficient, but it's not clear to me that it is mathematically sufficient for all topologies, not to mention whether it's intuitive enough to use. For example, highly-interconnected components might require that PMU leaf nodes be duplicated in multiple branches, i.e. PMU paths might not be unique in some topologies.

I'm certainly open to better alternatives!

Thanks for your thoughts,

- Corey --
I doubt it's needed or useful to describe all details of an interconnect. If detailed distance information is needed a simple table like the SLIT table exported by ACPI would seem easier to handle. But at least some degree of locality (e.g. "local memory controller") We already have cyclical graphs in sysfs using symlinks. I'm not sure they are all that easy to parse/handle, but at least they can be described. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
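
The SLIT suggestion above can be sketched as a plain distance matrix. This is only an illustration: the table values are made up, and the helper names (`is_local`, `nodes_by_distance`) are hypothetical. The one fact taken from ACPI is that a node's distance to itself is normalized to 10.

```python
# A made-up SLIT-style distance table for a 4-node system.
# Entry [i][j] is the relative distance from node i to node j.
LOCAL_DISTANCE = 10  # ACPI normalizes a node's distance to itself to 10

SLIT = [
    [10, 16, 22, 22],
    [16, 10, 22, 22],
    [22, 22, 10, 16],
    [22, 22, 16, 10],
]


def is_local(slit, cpu_node, pmu_node):
    """True if a PMU on pmu_node is local to a CPU on cpu_node."""
    return slit[cpu_node][pmu_node] == LOCAL_DISTANCE


def nodes_by_distance(slit, cpu_node):
    """Node ids ordered from nearest to farthest as seen from cpu_node."""
    return sorted(range(len(slit)), key=lambda n: slit[cpu_node][n])


print(is_local(SLIT, 2, 2))
print(nodes_by_distance(SLIT, 2))
```

A table like this answers "which memory controller is local to this CPU" without describing the interconnect itself, which is the degree of locality asked for above.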
Thanks for the pointer. I didn't know about the ACPI SLIT and SRAT tables until your post. Having had a quick look at them, I don't think they'd be that I think locality could be determined by looking at the device tree. For example, a memory controller for a particular processor chip would be a Good point. -- Regards, - Corey Corey Ashford Software Engineer IBM Linux Technology Center, Linux Toolchain Beaverton, OR 503-578-3507 cjashfor@us.ibm.com --
Well, to some extent the user will have to participate. For example, which uncore pmu will be selected depends on the cpu you're attaching the event to, according to the cpu to node map. Furthermore the intel uncore thing has curious interrupt routing

Well, if you give these things a cpumask and put them all onto the

The better solution is to generalize the whole rr on tick scheme (which has already been discussed). --
OK, so I read most of the intel uncore stuff, and it seems to suggest you need a regular pmu event to receive uncore events (chained setup). This seems rather retarded since it wastes a perfectly good pmu event and makes configuring all this more intricate... Ah well, nothing to be done about that I guess.. --
Yes, we have a similar situation where in addition to events that are counted on core PMU counters, we also have counters that are off-core; in some cases the counters are in off-core units which take their actual events from other off-core units, in addition to their own events. So you can see that this can be almost arbitrarily complex. As for the PERF_TYPE_(CORE,NODE,SOCKET) idea, that could still work, even though, for example, a socket event may be counted on a core PMU. Using more encodings for the type field, as you've suggested, would allow us to reuse the 64-bit config space multiple times. Were you thinking that with the type field we'd still re-use the "cpu" argument for the actual pmu address within the PERF_TYPE_* space? If so, that's an interesting idea, but I think it still leaves open the problem of how to actually relate those address to the real hardware, especially in the case of using a hypervisor which has provided you a small subset of the physical hardware in the system. I really think we need some sort of data structure which is passed from the kernel to user space to represent the topology of the system, and give useful information to be able to identify each PMU node. Whether this is done with a sysfs-style tree, a table in a file, XML, etc... it doesn't really matter much, but it needs to be something that can be parsed relatively easily and *contains just enough information* for the user to be able to correctly choose PMUs, and for the kernel to be able to relate that back to actual PMU hardware. In our case, we are looking at /proc/device-tree, and it actually does appear to contain enough information for us. However, since /proc/device-tree is not available anywhere but Power arch (/proc/device-tree originates from a data structure passed into the OS from the Open Firmware) we'd like to have a more general approach that can be used on x86 and other arches. - Corey --
The right way would be to extend the current event description under /debug/tracing/events with hardware descriptors and (maybe) to formalise this into a separate /proc/events/ or into a separate filesystem. The advantage of this is that in the grand scheme of things we _really_ dont want to limit performance events to 'hardware' hierarchies, or to devices/sysfs, some existing /proc scheme, or any other arbitrary (and fundamentally limiting) object enumeration. We want a unified, logical enumeration of all events and objects that we care about from a performance monitoring and analysis point of view, shaped for the purpose of and parsed by perf user-space. And since the current event descriptors are already rather rich as they enumerate all sorts of things: - tracepoints - hw-breakpoints - dynamic probes etc., and are well used by tooling we should expand those with real hardware structure. Thanks, Ingo --
This is an intriguing idea; I like the idea of generalizing all of this info into one structure. So you think that this structure should contain event info as well? If these structures are created by the kernel, I think that would necessitate placing large event tables into the kernel, which is something I think we'd prefer to avoid because of the amount of memory it would take. Keep in mind that we need not only event names, but event descriptions, encodings, attributes (e.g. unit masks), attribute descriptions, etc. I suppose the kernel could read a file from the file system, and then add this info to the tree, but that just seems bad. Are there existing places in the kernel where it reads a user space file to create a user space pseudo filesystem? I think keeping event naming in user space, and PMU naming in kernel space might be a better idea: the kernel exposes the available PMUs to user space via some structure, and a user space library tries to recognize the exposed PMUs and provide event lists and other needed info. The perf tool would use this library to be able to list available events to users. -- Regards, - Corey Corey Ashford Software Engineer IBM Linux Technology Center, Linux Toolchain Beaverton, OR 503-578-3507 cjashfor@us.ibm.com --
Perhaps another way of handling this would be to have the kernel dynamically load a specific "PMU kernel module" once it has detected that it has a particular PMU in the hardware. The module would consist only of a data structure, and a simple API to access the event data. This way, only the PMUs that actually exist in the hardware would need to be loaded into memory, and perhaps then only temporarily (just long enough to create the pseudo fs nodes). Still, though, since it's a pseudo fs, all of that event data would be taking up kernel memory. Another model, perhaps, would be to actually write this data out to a real file system upon every boot up, so that it wouldn't need to be held in memory. That seems rather ugly and time consuming, though. -- Regards, - Corey Corey Ashford Software Engineer IBM Linux Technology Center, Linux Toolchain Beaverton, OR 503-578-3507 cjashfor@us.ibm.com --
I dont think memory consumption is a problem at all. The structure of the monitored hardware/software state is information we _want_ the kernel to provide, mainly because there's no unified repository for user-space to get this info from. If someone doesnt want it on some ultra-embedded box then sure a .config switch can be provided to allow it to be turned off. Ingo --
Ok, just so that we quantify things a bit, let's say I have 20 different types
of PMUs totalling 2000 different events, each of which has a name and text
description, averaging 300 characters. Along with that, there's let's say 4
64-bit words of metadata per event describing encoding, which attributes apply
to the event, and any other needed info. I don't know how much memory each
pseudo fs node takes up. Let me guess and say 128 bytes for each event node
(the amount taken for the PMU nodes would be negligible compared with the event
nodes).
So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.
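
The estimate above is easy to re-run with different guesses. The sketch below uses only the figures stated in the message; the constant and function names are made up for illustration.

```python
# Back-of-the-envelope estimate of kernel memory needed for event data.
# Every figure here is one of the guesses stated in the message above.
NUM_EVENTS = 2000
DESCRIPTION_BYTES = 300   # average name + text description per event
METADATA_BYTES = 4 * 8    # four 64-bit words of metadata per event
FS_NODE_BYTES = 128       # guessed pseudo-fs node overhead per event


def estimate_bytes(num_events=NUM_EVENTS):
    """Total bytes for num_events events under the assumptions above."""
    per_event = DESCRIPTION_BYTES + METADATA_BYTES + FS_NODE_BYTES
    return num_events * per_event


total = estimate_bytes()
print(f"{total} bytes = {total / 1000:.0f} KB")
```

Changing `NUM_EVENTS` to a few hundred gives the "fraction of that amount" case mentioned for systems with fewer PMUs.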
Let's assume that the correct event module can be loaded dynamically, so that we
don't need to have all of the possible event sets for a particular arch kernel
build.
Any opinions on whether allocating this amount of kernel memory would be
acceptable? It seems like a lot of kernel memory to me, but I come from an
embedded systems background. Granted, most systems are going to use a fraction
of that amount of memory (<100KB) due to having far fewer PMUs and therefore
fewer distinct event types.
There's at least one more dimension to this. Let's say I have 16 uncore PMUs
all of the same type, each of which has, for example 8 events. As a very crude
pseudo fs, let's say we have a structure like this:
/sys/devices/pmus/
    uncore_pmu0/
        event0/    (path name to here is the name of the pmu and event)
            description    (file)
            applicable_attributes    (file)
        event1/
            description
            applicable_attributes
        event2/
            ...
        event7/
            ...
    uncore_pmu1/
        event0/
            description
            applicable_attributes
        ...
    ...
    uncore_pmu15/
        ...
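
To put a number on how much of this layout is repeated, here is a small sketch that enumerates the paths for the 16-PMU, 8-event example. The path names follow the crude layout above; `build_layout` is a hypothetical helper, not an existing interface.

```python
def build_layout(num_pmus=16, events_per_pmu=8, root="/sys/devices/pmus"):
    """List every file path in the crude pseudo-fs layout shown above."""
    paths = []
    for pmu in range(num_pmus):
        for event in range(events_per_pmu):
            base = f"{root}/uncore_pmu{pmu}/event{event}"
            paths.append(f"{base}/description")
            paths.append(f"{base}/applicable_attributes")
    return paths


paths = build_layout()

# Strip the per-PMU prefix: what remains ("event3/description", ...) is the
# part that is identical across all PMUs of the same type.
unique = {"/".join(p.rsplit("/", 2)[1:]) for p in paths}

print(len(paths), "files in total,", len(unique), "distinct per-type entries")
```
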
Now, you can see that there's a lot of replication here, because the event
descriptions and attributes are the same for each uncore pmu. We can use
symlinks to link them to ...

I really don't like this. The cpu->uncore map is fixed by the
topology of the machine, which is already available in /sys some place.
Let's simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or
something like that. We can start with 2 generic events for that type,
local/remote memory accesses and take it from there.
--
I don't quite get what you're saying here. Perhaps you are thinking that all uncore units are associated with a particular cpu node, or a set of cpu nodes? And that there's only one uncore unit per cpu (or set of cpus) that needs to be addressed, i.e. no ambiguity? That is not going to be the case for all systems. We can have uncore units that are associated with the entire system, for example PMUs in an I/O device. And we can have multiple uncore units of a particular type, for example multiple vector coprocessors, each with its own PMU, and are associated with a single cpu or a set of cpus. perf_events needs an addressing scheme that covers these cases. - Corey --
Well, I was initially thinking of the intel uncore thing which is memory controller, so node, level. You could possibly add a u64 pmu_id field to perf_event_attr and use that together with things like:

PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding
PERF_TYPE_SPU, attr.pmu_id = spu-id

But before we go there the perf core needs to be extended to deal with multiple hardware pmus, something which isn't too hard but we need to be careful not to bloat the normal code paths for these somewhat esoteric use cases. --
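
One way the domain:bus:device:function encoding could fit into a u64 is sketched below. The thread does not specify a bit layout, so the field widths here (16-bit domain, 8-bit bus, 5-bit device, 3-bit function, the usual PCI address widths) are an assumption, as are the function names. PERF_TYPE_PCI and attr.pmu_id are proposals from the message above, not existing kernel definitions.

```python
# Hypothetical packing of a PCI address into a 64-bit pmu_id.
# Assumed layout (low to high): function[2:0], device[7:3], bus[15:8],
# domain[31:16]. The upper 32 bits are left unused.
FIELDS = (("domain", 16, 16), ("bus", 8, 8), ("device", 5, 3), ("function", 3, 0))


def pack_pci_pmu_id(domain, bus, device, function):
    """Pack domain:bus:device:function into an integer pmu_id."""
    values = {"domain": domain, "bus": bus, "device": device, "function": function}
    pmu_id = 0
    for name, bits, shift in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        pmu_id |= value << shift
    return pmu_id


def unpack_pci_pmu_id(pmu_id):
    """Inverse of pack_pci_pmu_id: returns (domain, bus, device, function)."""
    return tuple((pmu_id >> shift) & ((1 << bits) - 1) for _, bits, shift in FIELDS)


pmu_id = pack_pci_pmu_id(0x0000, 0x3F, 0x02, 0x1)
print(hex(pmu_id), unpack_pci_pmu_id(pmu_id))
```
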
Thank you for that clarification. One of Ingo's comments was that he wants perf to be able to expose all of the available PMUs via the perf tool. That perf should be able to parse some data structure (somewhere) that would contain all of the info the user would need to choose a particular PMU. Do you have some ideas about how that could be accomplished using the above encoding scheme? I can see how it would be fairly easy to come up with a PERF_TYPE_* encoding per-topology, and then interpret all of those bits correctly within the kernel (which is savvy to that topology), but I don't see how there would be a straightforward way to expose that structure to perf. How would perf know which of those encodings apply to the current system, how many PMUs there are of each type, etc.? That's why I'm leaning toward a /sys/devices-style pseudo fs at the moment. If

Is this something you are looking into?

- Corey --
Thank you for the clarification. One of Ingo's comments in this thread was that he wants perf to be able to display to the user the available PMUs along with their respective events. That perf would parse some machine-independent data structure (somewhere) to get this info. This same info would provide the user a method of specifying which PMU he wants to address. He'd also like all of the event info data to reside in the same place. I hope I am paraphrasing him correctly.

I can see that with the scheme you have proposed above, it would be straightforward to encode PMU ids for a particular new PERF_TYPE_* system topology, but I don't see a clear way of providing perf with enough information to tell it which particular topology is being used, how many units of each PMU type exist, and so on. Do you have any ideas as to how to accomplish this goal with the method you are suggesting?

This is one of the reasons why I am leaning toward a /sys/devices-style data structure; the kernel could easily build it based on the pmus that it discovers (through whatever means), and the user can fairly easily choose a pmu from this structure to open, and it's unambiguous to the kernel as to which pmu the user really wants. I am not convinced that this is the right place to put the event info for each PMU.

> But before we go there the perf core needs to be extended to deal with
> multiple hardware pmus, something which isn't too hard but we need to be
> careful not to bloat the normal code paths for these somewhat esoteric
> use cases.
>

Is this something you've looked into? If so, what sort of issues have you discovered?

Thanks,

- Corey --
Well, the dumb way is simply probing all of them and see who responds. Another might be adding a pmu attribute (showing the pmu-id) to the existing sysfs topology layouts (system topology, pci, spu, are all Right, I'm not at all sure the kernel wants to know about any events beyond those needed for pmu scheduling constraints and possible generic event maps. Clearly it needs to know about all software events, but I don't think we I've poked at it a little yes, while simply abstracting the current hw interface and making it a list of pmu's isn't hard at all, it does add overhead to a few key locations. Another aspect is event scheduling, you'd want to separate the event lists for the various pmus so that the RR thing works as expected, this again adds overhead because you now need to abstract out the event lists as well. The main fast path affected by both these things is the task switch event scheduling where you have to iterate all active events and their pmus. So while the abstraction itself isn't too hard, doing it so as to minimize the bloat on the key paths does make it interesting. --
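
The point about separate event lists can be illustrated with a toy model. This is not kernel code: `PmuContext`, `tick`, and the event names are invented for the sketch, and "scheduled" simply means the first few events in a list get the hardware counters.

```python
from collections import deque


class PmuContext:
    """Toy per-PMU event list with its own round-robin rotation."""

    def __init__(self, name, num_counters):
        self.name = name
        self.num_counters = num_counters
        self.events = deque()

    def add(self, event):
        self.events.append(event)

    def scheduled(self):
        # The first num_counters events in the list hold hardware counters.
        return list(self.events)[: self.num_counters]

    def rotate(self):
        # Rotation is only needed when the PMU is over-committed.
        if len(self.events) > self.num_counters:
            self.events.rotate(-1)


def tick(contexts, pmu_name):
    """Rotate one PMU's list without touching any other PMU."""
    contexts[pmu_name].rotate()


contexts = {"cpu": PmuContext("cpu", 2), "uncore": PmuContext("uncore", 1)}
for ev in ("cycles", "instructions", "cache-misses"):
    contexts["cpu"].add(ev)
for ev in ("uncore_rd", "uncore_wr"):
    contexts["uncore"].add(ev)

tick(contexts, "cpu")
print(contexts["cpu"].scheduled(), contexts["uncore"].scheduled())
```

With a single shared list, rotating for the CPU PMU would also shift the uncore events, which is the interleaving problem the separate lists avoid. The cost mentioned above shows up as the extra loop over contexts on paths such as task switch.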
That can work, but it's still fuzzy to me how a user would relate a PMU address that he's encoded to some actual device in the system he's using. How would he know that he's addressing the correct device (besides that the PMU type So you'd read the id from the sysfs topology tree, and then pass that id to the interface? That's an interesting approach that eliminates the need to pass a string pmu path to the kernel. I like this idea, but I need to read more deeply about the topology entries to Interesting. Thanks for your comments. - Corey --
No, the attr.pmu_id would reflect the location in the tree (pci location, or spu number), the pmu id reported would identify the kind of pmu driver used for that particular device. I realized this confusion after sending but didn't clarify, we should come up with a good alternative name for either (or both) uses. --
Ok, just so I'm clear here, is attr.pmu_id a (char *) or some sort of encoded bit field? - Corey --
Right, currently on x86 we have x86_pmu.name which basically tells us what kind of pmu it is, but we really don't export that since it's trivial to see that from /proc/cpuinfo, but the ARM people expressed interest in this because decoding the cpuid equivalent on arm is like nasty business. But once we go add other funny pmus, it becomes interesting to know what kind of pmu it is and tie possible event sets to them. --
Well, I'm tempted to say that's a problem for the virt guys :-) One way one could solve that is by having the topology information include the virt<->phys map, so that you can find the physical node from the virtual cpu number. Going to be interesting though, but then, virt seems to be about creating problems where there were none before. --
I don't think that is correct. You can be using the uncore PMU on Nehalem without any core PMU event. The only thing to realize is that uncore PMU shares the same interrupt vector as core PMU. You need to configure which core the uncore is going to interrupt on. This is done via a bitmask, so you --
Ah, sharing the IRQ line is no problem. But from reading I got the impression you need to configure an Offcore counter. See 30.6.2.1: • EN_PMI_COREn (bit n, n = 0, 3 if four cores are present): When set, processor core n is programmed to receive an interrupt signal from any interrupt enabled uncore counter. PMI delivery due to an uncore counter overflow is enabled by setting IA32_DEBUG_CTL.Offcore_PMI_EN to 1. Which seems to indicate a link with the off-core response thing. However I would be very glad to be wrong :-) --
The offcore_response register is a CORE PMU register. I know the name is confusing. OFFCORE_RESPONSE_0 is not a counter but rather an extension of the filtering capabilities of regular counters. To use offcore_response_0, you will need to program a generic counter (1 config, 1 data) + offcore_response_0. This is a very useful feature to understand memory traffic. There is an associated difficulty though. The offcore_response_0 MSR is shared between HT threads. Thus the kernel needs to arbitrate. That should be taken care of by the event scheduling code which I am working on. In the context of perf_event, you need to find some encoding of the event to pass as RAW. The raw event code + umask is 0x01B7. Then, you need to encode the value for the offcore_response_0 MSR, which is 16 bits on NHM/WSM. You could either stash this into the config field or use an extended field. The kernel would detect 0x01B7 and use the value in the extra config field. As described in vol3b, Intel Westmere adds a second offcore_response counter. It behaves the same way with the same restrictions. The debugctl bit controls uncore activation not offcore. --
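
A sketch of the "stash it into the config field" option follows. The raw code 0x01B7 and the 16-bit MSR width come from the message above; placing the MSR value at bit 32 of the 64-bit config is purely an assumption for illustration, as are the function names.

```python
OFFCORE_RAW_EVENT = 0x01B7   # raw event code + umask, from the message above
OFFCORE_VALUE_BITS = 16      # offcore_response_0 MSR width on NHM/WSM
OFFCORE_VALUE_SHIFT = 32     # hypothetical position inside the 64-bit config


def encode_offcore_config(offcore_value):
    """Combine the raw event code with the offcore_response_0 MSR value."""
    if not 0 <= offcore_value < (1 << OFFCORE_VALUE_BITS):
        raise ValueError("offcore_response_0 value must fit in 16 bits")
    return (offcore_value << OFFCORE_VALUE_SHIFT) | OFFCORE_RAW_EVENT


def decode_offcore_config(config):
    """Return (raw_event, offcore_value); offcore_value is None if unused."""
    raw_event = config & 0xFFFF
    if raw_event != OFFCORE_RAW_EVENT:
        return raw_event, None
    mask = (1 << OFFCORE_VALUE_BITS) - 1
    return raw_event, (config >> OFFCORE_VALUE_SHIFT) & mask


config = encode_offcore_config(0xFF01)
print(hex(config), decode_offcore_config(config))
```

The decode step mirrors what the kernel side would do in this scheme: detect 0x01B7 and only then look at the extra bits.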
Given the PMU sharing model of perf_events, it seems you may have multiple consumers of uncore PMU at the same time. That means you will need to direct the interrupt onto all the CPU for which you currently have a user. You may have multiple users per CPU, thus you need some reference count to track all of that. The alternative is to systematically broadcast the uncore PMU interrupt. Each core then checks whether or not it has uncore users. Note that all of this is independent of the type of event, i.e., per-thread or system-wide. --
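
The reference-counting alternative can be modelled in a few lines. This is a toy sketch, not driver code: the class and method names are invented, and the only hardware detail assumed is the one quoted earlier in the thread, that bit n of the enable mask routes the uncore interrupt to core n.

```python
class UncoreIrqRouting:
    """Toy model: per-core user counts mapped to an EN_PMI_COREn-style mask."""

    def __init__(self, num_cores=4):
        self.users = [0] * num_cores

    def get(self, core):
        """A new uncore user appears on this core; returns the new mask."""
        self.users[core] += 1
        return self.mask()

    def put(self, core):
        """An uncore user on this core goes away; returns the new mask."""
        if self.users[core] == 0:
            raise RuntimeError(f"core {core} has no uncore users")
        self.users[core] -= 1
        return self.mask()

    def mask(self):
        # Bit n is set while core n still has at least one user.
        return sum(1 << n for n, count in enumerate(self.users) if count > 0)


routing = UncoreIrqRouting()
routing.get(0)
routing.get(0)
routing.get(2)
print(bin(routing.mask()))
```

The broadcast alternative would skip the counting entirely and always program the mask for every core, leaving each core to check whether it has users.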
Hi, Corey How is this going now? Are you still working on this? I'd like to help to add support for uncore, test, write code or anything else. Thanks, --
I haven't been actively working on adding infrastructure for nest PMUs yet. At the moment we are working on supporting nest events for IBM's Wire-Speed processor, using the current infrastructure, because of the time limitations. Using the existing infrastructure is definitely not ideal, but for this processor, it's workable.

There are still a lot of issues to solve for adding this infrastructure:

1) Does perf_events need a new context type (in addition to per-task and per-cpu)? This is partly because we don't want to be mixing the rotation of CPU-events with nest events. Each PMU really ought to have its own event list.

2) How do we deal with accessing PMUs which require slow access methods (e.g. internal serial bus)? The accesses may need to be placed on worker threads so that they don't affect the performance of context switches and system ticks.

3) How exactly do we represent the PMUs in the pseudofs (/sys or /proc)? And how exactly does the user specify the PMU to perf_events? Peter Zijlstra and Stephane Eranian both recommended opening the PMU with open() and then passing the resulting fd in through the perf_event_attr struct.

4) How do we choose a CPU to do the housekeeping work for a particular nest PMU? Peter thought that user space should still specify it via the perf_event_open() cpu parameter, but there's also an argument to be made for the kernel choosing the best CPU to handle the job, or at least making it optional for the user to choose the CPU.

I'm sure there are other issues as well. If you'd like to start working on some (or all!) of these, you are more than welcome to. I think we need to toss around some more ideas before committing much to code at this point.

- Corey --
Right, I've got some definite ideas on how to go here, just need some time to implement them.

The first thing that needs to be done is get rid of all the __weak functions (with exception of perf_callchain*, since that really is arch specific). For hw_perf_event_init() we need to create a pmu registration facility and look up a pmu_id, either passed as an actual id found in sysfs or an open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards compat). hw_perf_disable/enable() would become struct pmu functions and perf_disable/enable need to become per-pmu; most functions operate on a specific event, and for those we know the pmu and hence can call the per-pmu version. (XXX find those sites where this is not true).

Then we can move to context. Yes, I think we want new context for new PMUs, otherwise we get very funny RR interleaving problems. My idea was to move find_get_context() into struct pmu as well, which allows you to have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts because then things like perf_event_task_sched_out() would get rather complex.

For RR we can move away from perf_event_task_tick and let the pmu install a (hr)timer for this on their own.

I've been planning to implement this for more than a week now, it's just that other stuff keeps getting in the way. --
On 3/30/2010 10:15 AM, Peter Zijlstra wrote: This sounds like a good idea. Right now for the Wire-Speed processor, we have a loop that goes through all of the nest PMU's and calls their respective per-pmu Yes, I think it makes a lot of sense, so that there's not some sort of fixed Definitely. I don't think it makes sense to have per-task context on This is necessary I think, because of the access time for some of the PMU's. I wonder though if it should, perhaps optionally, be off-loaded to a high-priority task to do the switching so that access latency to the PMU can be controlled. As I mentioned when we met, some of the Wire-Speed processor nest PMU control registers are accessed via SCOM, which is an internal, 200 MHz serial bus. We are being quoted ~525 SCOM bus ticks to do a PMU control register access, which comes out to about 2.5 microseconds. If you figure 5 accesses to rotate the Well, it's not as if this is a trivial task either :) - Corey --
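
The SCOM latency quoted above can be checked with a few lines of arithmetic. The bus clock and tick count are the figures from the message; the five-access case reuses the number mentioned there and is otherwise an assumption, since the original sentence is cut off.

```python
SCOM_BUS_HZ = 200_000_000   # 200 MHz internal serial bus, from the message
TICKS_PER_ACCESS = 525      # quoted cost of one PMU control register access


def access_time_us(accesses=1):
    """Time in microseconds for the given number of register accesses."""
    return accesses * TICKS_PER_ACCESS * 1_000_000 / SCOM_BUS_HZ


print(f"one access:    {access_time_us(1):.3f} us")
print(f"five accesses: {access_time_us(5):.3f} us")
```

A single access works out to slightly more than the rounded "about 2.5 microseconds", and a handful of accesses already reaches the low tens of microseconds, which is the latency concern raised about doing this work with interrupts disabled.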
For uncore no, but there is also the hw-breakpoint stuff that is being presented as a pmu, for those it would make sense to have a separate per-task context. But doing multiple per-task contexts is something for a next step Yeah, you mentioned that.. for those things we need some changes anyway, since currently we install per-cpu counters using IPIs and expect the pmu::enable() method to be synchronous (it has a return value). It would be totally unacceptable to do 2.5ms pokes with IRQs disabled. The RR thing would be the easiest to solve, just let the timer wake up a thread instead of doing the work itself, that's fully isolated to how the pmu chooses to implement that. The above mentioned issue however would be much more challenging to fix nicely. --
Also some of perf_enable()/perf_disable() would have to be per PMU and not global like they are today. --
hw-breakpoint presented as a pmu? hmmmm. IMHO, this is an example where shoehorning something into the perf_events subsystem that logically doesn't [snip] --
Just to be clear, it's 2.5us, not 2.5ms, but I think it's still bad... It seems like it might need to be done in two phases. IPI request is sent, and then a thread is woken up on the other CPU, it does some work, and then sets a status variable and somehow notifies the caller that the operation has completed. I don't know the kernel's communication mechanisms well enough to know which one is most appropriate - maybe rwsem? - Corey --
On Tue, Mar 30, 2010 at 6:49 PM, Corey Ashford

I concur with the fact that you don't want to mix events from different PMUs in the same rotation list. There are some side effects of not doing this with AMD64 Northbridge events when you have multiple concurrent sessions. But this is a special case where the "uncore" (nest) PMU is actually controlled via the core PMU. I think that is okay for now, but you certainly don't want that for Nehalem uncore, for

One of the housekeeping tasks is to handle uncore PMU interrupts, for instance. That is not a trivial task given that events are managed independently and that you could be monitoring per-thread or system-wide. It may be that some uncore PMUs can only interrupt one core. Intel Nehalem can interrupt many at once. --
That's a good point, and I think it's unreasonable to expect that the user knows exactly how the interrupts are connected from the uncore/nest PMU to which CPU(s). Perhaps one way around this would be to return an error if the chosen CPU wasn't fully capable of performing the housekeeping functions for the requested PMU. But this certainly isn't ideal, because relying on this mechanism would require that the user (or user tool) figure out which CPU is fully capable by trial-and-error. - Corey --
On Wed, Mar 31, 2010 at 1:11 AM, Corey Ashford

I think users should not have to worry about all of this. It is also fine to restrict any uncore monitoring to system-wide mode, i.e., not per-thread. But I have also seen people requesting just that, hoping they could draw correlations between core and uncore events in one run. I think this is less critical though. In the specific case of CPU-uncore (e.g., Nehalem uncore), you could simply identify the uncore PMU with the CPU. For anything else, you would definitely need the file descriptor. The more uniform approach would have been to use the file descriptor all along, except for per-thread. But that's fine, I think. --
