Here's a patch against my current tree that gets the perfmon code
building and hopefully working.
Note, it needs the kobject_create_and_register() patch which is in my
tree, but I do not think it made it to -mm yet. The next -mm cycle
should have it.
Also, the sysfs usage in the perfmon code is quite strange and not
documented at all. Yes, there is a little bit in the documentation
about what a few of the files do, but there are _way_ more files and
even directories being created under /sys/kernel/perfmon/ that are not
documented at all here.
If you document this stuff, I think I can clean up your sysfs code a
lot, making things simpler, easier to extend, and easier to understand.
But as it is, I don't want to break anything as it's totally unknown how
this stuff is supposed to work...
Hint, use the Documentation/ABI directory to document your sysfs
interfaces, that is what it is there for...
thanks,
greg k-h
---------------
From: Greg Kroah-Hartman <gregkh@suse.de>
Subject: perfmon: fix up some static kobject usages
This gets the perfmon code to build properly on the latest -mm tree, as
well as removing some static kobjects.
A lot of future kobject cleanups can be done on this code, but the
documentation for the perfmon sysfs interface is very limited and does
not describe all of the different files and subdirectories at all.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
perfmon/perfmon_sysfs.c | 37 +++++++++++++++----------------------
1 file changed, 15 insertions(+), 22 deletions(-)
--- a/perfmon/perfmon_sysfs.c
+++ b/perfmon/perfmon_sysfs.c
@@ -76,7 +76,8 @@ EXPORT_SYMBOL(pfm_controls);
DECLARE_PER_CPU(struct pfm_stats, pfm_stats);
-static struct kobject pfm_kernel_kobj, pfm_kernel_fmt_kobj;
+static struct kobject *pfm_kernel_kobj;
+static struct kobject *pfm_kernel_fmt_kobj;
static void pfm_reset_stats(int cpu)
{
@@ -402,31 +403,23 @@ static struct attribute_group pfm_kernel
int __init pfm_init_sysfs(void)
...Greg, I will move the description from perfmon2.txt to its own file in ABI/testing. -- -Stephane -
That is what I was referring to, that file does not describe all of the That would be great to have, thanks. greg k-h -
Greg, The perfmon sysfs documentation has been updated following your advice. You can check out the following commit in my perfmon tree: e83278f879e52ecee025effe9ad509fd51e4a516 Thanks. -- -Stephane -
Where is this git tree located? On git.kernel.org somewhere? thanks, greg k-h -
I get mine from git+ssh://master.kernel.org/pub/scm/linux/kernel/git/eranian/linux-2.6.git -
Thanks, that worked, let me go read the new documentation... -
Thanks, that looks a lot better. Do you want me to send you patches based on this tree to help clean up the sysfs usage now that it's documented? Also, a lot of your per-cpu sysfs files should probably move to debugfs as they are for debugging only, right? No need to clutter up sysfs with them when only the very few perfmon developers would be needing access to them. thanks, greg k-h -
Greg, Yes, send me the patches. But from what you were saying earlier it seems I would need an extra sysfs patch to make this compile. Is that particular Yes, this is mostly debugging. If debugfs is meant for this, then I'll be happy to move this stuff over there. Is there some good example of how I could do that based on my current sysfs code? Thanks. -- -Stephane -
No, it's in my tree, and will be in the next -mm. You will need a few There is documentation for debugfs in the kernel api document :) And, there are many in-kernel users of debugfs, a grep for "debugfs_create_" should show you some examples of how to use this. If you have any questions, please let me know. thanks, greg k-h -
Greg, Could you send them to me? If they are not too intrusive I could add them to my tree. Yet I don't want something too distant from Linus's tree which Ok, I'll look at that next. Thanks, -- -Stephane -
Greg, I have now removed all the perfmon2 statistics from sysfs and moved them to debugfs. I must admit, I like it better this way. Debugfs is also so much easier to program. Patch has been pushed into my tree. Let me know if you think I can improve the sysfs code some more. Thanks. -- -Stephane -
On Tue, 6 Nov 2007 16:34:54 -0800 Unfortunately I still haven't merged perfmon due to recently-occurring minor conflicts with Tony's ia64 tree and more major recently-occurring conflicts with the x86 tree. There's not really a lot which Stephane can practically do about this - normally I'll just get down and fix stuff like this up. But the impression I get from various people is that the perfmon tree in its present form would not be a popular merge. The impression which people have (and I admit to sharing it) is that there's just too much stuff in there and it might not all be justifiable. But I suspect that people have largely forgotten what is in there, and why it is in there. We really need to get this ball rolling, and that will require a sustained effort from more people than just Stephane. I suppose as a starting point we could yet again review the existing patches, please. People will mainly concentrate upon the changelogging to understand which features are being proposed and why, so that submission should describe these things pretty carefully: what are the features and why do we need each of them. tia. -
Is there some way to rebase these patches/git tree to be a bit more easy to review? Right now there are over 75 patches in the tree and many (if not most) can be removed by merging them with previous patches. If someone could break this stuff down into reviewable pieces, it would go a very long way toward making it acceptable. Is there any way to just provide a basic framework that everyone can agree on and then add on more stuff as time goes on? Do we have to have every different processor/arch with support to start with? thanks, greg k-h -
Greg KH <greg-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> writes: [dropped perfmon list because gmane messed it up and it's apparently I think the real problem is not the architectures (the processor adaptation layer is usually relatively straightforward IIRC), but the excessive functionality implemented by the user interface. It would be really good to extract a core perfmon and start with that and then add stuff as it makes sense. E.g. core perfmon could be something simple like just support to context switch state and initialize counters in a basic way and perhaps get counter numbers for RDPMC in ring3 on x86[1] Next step could be basic event on overflow/underflow support. Then more features as they make sense, with clear rationale what they're good for and proper step by step patches. -Andi [1] On x86 we urgently need a replacement to RDTSC for counting cycles. -
Perhaps a core could provide also as much functionality so that Perfmon can be used with an *unpatched* kernel using loadable modules? One drawback with today's Perfmon is that it can not be used with a vanilla kernel. But maybe such a core is by far too complex for a first merge. -Robert -- Advanced Micro Devices, Inc. Operating System Research Center email: robert.richter@amd.com -
Hello, Note that I am not against a gradual approach such as:
- system-wide only counting
- per-thread counting
- user-level sampling support
- in-kernel sampling buffer support
- in-kernel customizable sampling buffer formats via modules
- event set multiplexing
- PMU description modules
It would obviously cause a lot of trouble for existing perfmon libraries and applications (e.g. PAPI). It would also be fairly tricky to do because you'd have to make sure that in the beginning, you leave enough flexibility such that you can add the rest while maintaining total backward compatibility. But given that we already have the full solution, it could just be a matter of dropping features without disrupting the user level API. Of course there would be a bigger burden on the maintainer because he would have two trees to maintain but I think that is already commonplace in many of the kernel-related projects.
Let's take a simple example. The set of syscalls necessary to control a system-wide monitoring session is exactly the same as for a per-thread session. The difference is just a flag when the session is created. Thus, we could keep the same set of syscalls, but only accept system-wide sessions. Later on, when we add per-thread, we would just have to expose the per-thread session flag.
Having said that, this does not mean that this is necessarily what we will do. I am just trying to present my understanding of the comments from Andrew, Andi and others.
I think that going with a kernel module will not address the 'complexity/bloat' perception that some people have. There is a logic to that, I did not just wake up one day saying 'wouldn't it be cool to add set multiplexing?'. There was a true need expressed by users or developers and it was justified by what the hardware offered then. This unfortunately still stands today. I admit that justification is not necessarily spelled out clearly in the code. So I understand most of those worries and I am trying to figure out how we could ...
There no way we'll keep this completely idiotic userland API. If people start to use out of tree APIs they can pretty much expect that they're not going to stay around. And in this case they most certainly won't. -
(jumping in late in the game)
Linux Trace Toolkit Next Generation would _happily_ use global PMC
counters, but I would prefer to interact with an internal kernel API
rather than being required to start/stop counters from user-space. There
is a big precision loss involved in having to start things from
userspace.
Ideally, this API would manage access to available PMCs and even use the
same counters for both system-wide tracing/profiling done at the same
time as user-space profiling. This would however involve having a
wrapper around both user-space and kernel-space performance counter
reads, which is fine with me. I would suggest that user-space still go
through a system call for this, since this is available at early boot,
before the filesystem is mounted.
This API could offer an in-kernel, architecture-_independent_ PMC control
interface to:
- list available PMCs
- That would involve mapping the common PMCs to some generic
identifier
- attach to these PMCs, with a certain priority
We could call a single connexion to a PMC a "virtual PMC". All PMC
accesses should then be done through this internally managed structure
(giving callbacks to be called after a certain count, reads, stop...).
We could have virtual PMCs that are : system wide, or per thread.
As a starting point, we could limit one virtual PMC attached to a
physical PMC at a given time. Later, we could add support for multiple
virtual PMCs connected to a single physical PMC. The priorities could be
used to kick out the PMC users with lower priorities (that involves that
a PMC read could fail!).
Then, to get interrupts or signals upon PMC overflow, we could manage
each physical PMC like a timer, using the lowest requested value for the
next time we are to be awakened. Some logic would have to be added to
the pmc read operation to get the "real" expected value, but this is
nothing difficult.
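The "virtual PMC" idea above can be sketched in user-space C. This is only an illustration under stated assumptions, not code from perfmon2 or LTTng: the names (`struct vpmc`, `struct ppmc`, `vpmc_attach`, `vpmc_read`, `ppmc_next_deadline`) are hypothetical, the physical counter is modelled as a free-running 64-bit value, and the slot limit is arbitrary. It shows the three points made in the text: a read reports the "real" value relative to the attach point, a higher-priority user can kick out a lower-priority one (so a later read can fail), and the next wakeup is the smallest remaining count among all users, as with a timer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VPMC 2              /* arbitrary slot limit for this sketch */

/* One user's view of a hardware counter (hypothetical layout). */
struct vpmc {
    uint32_t id;                /* 0 means the slot is free */
    int      priority;          /* higher value wins when slots run out */
    uint64_t base;              /* physical count when this user attached */
    uint64_t period;            /* notify every 'period' events, 0 = never */
};

/* One physical counter, modelled as a free-running 64-bit count. */
struct ppmc {
    uint64_t    count;
    uint32_t    next_id;
    struct vpmc users[MAX_VPMC];
};

/* Attach a user. Returns a non-zero id, or 0 if every slot is held by a
 * user of equal or higher priority. A weaker user is evicted if needed. */
uint32_t vpmc_attach(struct ppmc *p, int priority, uint64_t period)
{
    struct vpmc *slot = NULL;

    for (size_t i = 0; i < MAX_VPMC; i++) {
        struct vpmc *v = &p->users[i];
        if (v->id == 0) {       /* free slot: take it */
            slot = v;
            break;
        }
        if (slot == NULL || v->priority < slot->priority)
            slot = v;           /* weakest current user so far */
    }
    assert(slot != NULL);
    if (slot->id != 0 && slot->priority >= priority)
        return 0;
    slot->id = ++p->next_id;
    slot->priority = priority;
    slot->base = p->count;
    slot->period = period;
    return slot->id;
}

/* Read the count seen by one user. Fails with -1 if it was evicted. */
int vpmc_read(const struct ppmc *p, uint32_t id, uint64_t *value)
{
    for (size_t i = 0; id != 0 && i < MAX_VPMC; i++) {
        if (p->users[i].id == id) {
            *value = p->count - p->users[i].base;
            return 0;
        }
    }
    return -1;
}

/* Events until the earliest pending notification, like arming a timer
 * for the nearest expiry. Returns -1 if no user asked for one. */
int ppmc_next_deadline(const struct ppmc *p, uint64_t *events_left)
{
    int found = -1;

    for (size_t i = 0; i < MAX_VPMC; i++) {
        const struct vpmc *v = &p->users[i];
        if (v->id == 0 || v->period == 0)
            continue;
        uint64_t left = v->period - ((p->count - v->base) % v->period);
        if (found != 0 || left < *events_left) {
            *events_left = left;
            found = 0;
        }
    }
    return found;
}
```

A real implementation would also need locking and a way to tell an evicted user why its read failed; both are left out here to keep the sketch short.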
Those were the ideas I had last OLS after hearing the talk about
perfmon2. I hope they can be useful. If ...
Hi Robert, In the past I suggested that it might be useful to have a version of perfmon2 that only set up the perfmon on a global basis. That would allow the patches for context switches to be added as a separate step, splitting up the patch into a smaller set of patches. Perfmon2 uses a set of system calls to control the performance monitoring hardware. This would make it difficult to use an unpatched kernel unless perfmon changed the mechanism used to control the performance monitoring hardware. -Will -
Hello, Yes, that would be a possibility but as you pointed out there are some problems:
- perfmon2 uses system calls. So unless you can dynamically patch the syscall table we would have to go back to the ioctl() and driver model. I was under the impression that people did not quite like multiplexing syscalls such as ioctl(). I also do prefer the multi syscall approach.
- perfmon2 needs to install a PMU interrupt handler. On X86, this is not just an external device interrupt. There needs to be some APIC and interrupt gate setup. There may be other constraints on other architectures as well. Not sure if all functions/structures necessary for this are available to modules.
- we could not support per-thread mode with the kernel module approach due to the link to the context switch code. I do believe per-thread is a key value-add for performance monitoring.
-- -Stephane -
The oprofile module can set up a handler for PMU interrupts. This is done in arch/x86/oprofile/nmi_int.c:nmi_cpu_setup(). Other modules could do the same. However, it bumps whatever was using the nmi/pmu off, then restores nmi/pmu when oprofile is shut down. Maybe the pmu/nmi resource reservation mechanism The per-thread monitoring is useful to a number of people and many people want it. The thought was how to break the large perfmon patch into a set of smaller incremental patches. So it isn't whether to have per-thread pmu virtualization, but rather when/how to get it in. -Will -
Will, Oprofile does not set up the PMU interrupt. It builds on top of the NMI watchdog setup. It uses the register_die() mechanism, if I recall. The low level APIC and gate are set up elsewhere. Perfmon does not use NMI, unless forced to because I think we all agree on this. -- -Stephane -
Oprofile works without the NMI watchdog too, but it just happens to be another It could handle it in the same way as oprofile if it wanted. But given NMIs make everything more complicated and it might not be worth it. -Andi -
Andi.
I meant the register_die_notifier() mechanism which allow you to
chain a handler on NMI interrupts. At least that's my understanding
reading the code:
static int nmi_setup(void)
{
	int err=0;
	int cpu;
	if (!allocate_msrs())
		return -ENOMEM;
	if ((err = register_die_notifier(&profile_exceptions_nb))){
		free_msrs();
		pfm_release_allcpus();
		return err;
	}
Yes, horribly more complicated because of locking issues within perfmon.
As soon as you expose a file descriptor, you need some locking to prevent
multiple user threads (malicious or not) from competing for access to the PMU state.
I think the value add of NMI can be achieved as well with advanced PMU features
such as Intel Core 2 PEBS.
--
-Stephane
-
Why do you need the file descriptor? One of the main problems with perfmon is the complicated user interface. True probably, although only on CPUs that support PEBS. Dropping features for old CPUs is unfortunately quite difficult in Linux, and in this case probably not an option because there are so many of them (e.g. all of AMD not Fam10h) -Andi -
Andi, To identify your monitoring session be it system-wide (i.e., per-cpu) or per-thread. A file descriptor allows you to use close, read, select, poll and you leverage the existing file descriptor sharing/inheritance semantics. At the kernel level, a descriptor provides all the callbacks necessary to make sure you clean up the perfmon Yes, I know that. Also note that unfortunately, AMD Fam10h IBS feature does not allow you to capture more than one sample in critical sections. It is still interrupt based sampling with one entry-deep buffer: one interrupt = one sample. Perfmon does support NMI though it is much more expensive to use. -- -Stephane -
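The point about reusing read/poll/close on a descriptor can be illustrated with a small user-space sketch. It is not perfmon2 code: `struct demo_msg` and `demo_wait_msg()` are hypothetical names, the message layout is invented for the example, and an ordinary pipe stands in for the monitoring session descriptor. Only the standard `poll()`, `read()` and `close()` calls are real.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical notification record; NOT the perfmon2 message layout. */
struct demo_msg {
    uint32_t type;      /* e.g. 1 = counter overflow (made up) */
    uint32_t counter;   /* which counter triggered it */
};

/* Wait up to timeout_ms for one notification on fd.
 * Returns 1 if a message was read, 0 on timeout, -1 on error or hangup. */
int demo_wait_msg(int fd, int timeout_ms, struct demo_msg *out)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready;
    ssize_t n;

    assert(out != NULL);
    ready = poll(&pfd, 1, timeout_ms);
    if (ready <= 0)
        return ready;               /* 0 = timeout, -1 = poll() failed */
    if (!(pfd.revents & POLLIN))
        return -1;                  /* hangup or error, nothing to read */
    n = read(fd, out, sizeof(*out));
    return n == (ssize_t)sizeof(*out) ? 1 : -1;
}
```

With a pipe created by `pipe(fds)`, writing one `struct demo_msg` to `fds[1]` makes `demo_wait_msg(fds[0], 100, &m)` return 1, and `close()` on both ends releases everything without a dedicated "destroy" call, which is the property the message argues for.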
Surely that could be done with a flag for each call too? Keeping file descriptors Didn't you already have a thread destructor for it? -Andi -
Andi,
I don't understand this.
Let's take the simplest possible example (self-monitoring per-thread)
counting one event in one data register.
int
main(int argc, char **argv)
{
	int ctx_fd;
	pfarg_pmd_t pd[1];
	pfarg_pmc_t pc[1];
	pfarg_ctx_t ctx;
	pfarg_load_t load_args;

	memset(&ctx, 0, sizeof(ctx));
	memset(pc, 0, sizeof(pc));
	memset(pd, 0, sizeof(pd));
	memset(&load_args, 0, sizeof(load_args));

	/* create session (context) and get file descriptor back (identifier) */
	ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);

	/* setup one config register (PMC0) */
	pc[0].reg_num = 0;
	pc[0].reg_value = 0x1234;

	/* setup one data register (PMD0) */
	pd[0].reg_num = 0;
	pd[0].reg_value = 0;

	/* program the registers */
	pfm_write_pmcs(ctx_fd, pc, 1);
	pfm_write_pmds(ctx_fd, pd, 1);

	/* attach the context to self */
	load_args.load_pid = getpid();
	pfm_load_context(ctx_fd, &load_args);

	/* activate monitoring */
	pfm_start(ctx_fd, NULL);

	/*
	 * run code to measure
	 */

	/* stop monitoring */
	pfm_stop(ctx_fd);

	/* read data register */
	pfm_read_pmds(ctx_fd, pd, 1);
	printf("PMD0 %llu\n", (unsigned long long)pd[0].reg_value);

	/* destroy session */
	close(ctx_fd);
	return 0;
}
--
-Stephane
-
[dropped all these bouncing email lists. Adding closed lists to public Why do you need to set the data register? Wouldn't it make My replacement would be to just add a flags argument to write_pmcs Why can't that be done by the call setting up the register? Or if someone needs to do it for a specific region they can read On x86 I think it would be much simpler to just let the set/alloc register call return a number and then use RDPMC directly. That would be actually faster and be much simpler too. I suppose most architectures have similar facilities, if not a call could be added for them but it's not really essential. The call might be also needed for event multiplexing, but frankly I would just leave that out for now. E.g. here is one use case I would personally see as useful. We need a replacement for simple cycle counting since RDTSC doesn't do that anymore on modern x86 CPUs. It could be something like:
	/* 0 is the initial value */
	/* could be either library or syscall */
	event = get_event(COUNTER_CYCLES);
	if (event < 0)
		/* CPU has no cycle counter */
	reg = setup_perfctr(event, 0 /* value */, LOCAL_EVENT); /* syscall */
	rdpmc(reg, start);
	.... some code to run ...
	rdpmc(reg, end);
	free_perfctr(reg); /* syscall */
On other architectures rdpmc would be different of course, but the rest could be probably similar. -Andi -
Andi, Partially true. The file descriptor becomes really useful when you sample. You leverage the file descriptor to receive notifications of counter overflows and full sampling buffer. You extract notification messages via read() and you can use SIGIO, select/poll. The example shows how you can leverage existing mechanisms to destroy the session, i.e., free the associated kernel resources. For that, you use close() instead of adding yet another syscall. It also provides a resource limitation mechanism to control consumption Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)? That would be quite expensive when you have lots of registers to set up: one syscall per register. The perfmon syscalls to read/write registers accept a vector of arguments to amortize the cost of the syscall over multiple registers (similar to poll(2)). With many tools, registers are not just set up once. During certain measurements, data registers may be read multiple times. When you sample or multiplex at the user level, you do need to reprogram the PMU state and that is on the critical path. You do not want a call that programs the entire PMU state all at once either. Many times, you only want to modify a small subset. Having the full state does also cause some portability It depends on what you are doing. Here, this was not really necessary. It was meant to show how you can program the data registers as well. Perfmon2 provides default values for all data registers. For counters, the value is guaranteed to be zero. But it is important to note that not all data registers are counters. That is the case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as well, and some may need to be initialized to non zero value, i.e., the IBS sampling period. With event-based sampling, the period is expressed as the number of occurrences of an event. For instance, you can say: " take a sample every 2000 L2 cache misses". The way you express this with perfmon2 is ...
Hmm, ok for the event notification we would need a nice interface. Still I think you optimize the wrong thing here. There are basically two cases I see: - Global measurement of lots of things: Things are slow anyways with large context switch overheads. The overheads are large anyways. Doing one or more system calls probably does not matter much. Most important is a clean interface. - Exact measurement of the current process. For that you need very low latencies. Any system call is too slow. That is why CPUs have instructions like RDPMC that allow to read those registers with minimal latency in user space. Interface should support those. Also for this case programming time does not matter too much. You just program once and then do RDPMC before code to measure and then afterwards and take the difference. The actual counter setup is out Setting period should be a separate call. Mixing the two together into one I didn't object to providing the initial value -- my example had that. Just having a separate concept of data registers seems too complicated to me. You should just pass event types and values and the kernel gives you And? You didn't say what the advantage of that is? All the approaches add context switch latencies. It is not clear that the separate Well the system call layer can manage that transparently with a little software state I disagree. Using RDPMC is essential for at least some of the things I would like to do with perfmon2. If the interface does not provide it it is useless to me at least. System calls are far too slow for cycle measurements. And when RDPMC is already supported it should be as widely used as possible. Regarding the portable code problem: of course you would have some header in user space I think only supporting global and self monitoring as first step is totally fine. Sure at some point a system call for the more complex cases (also like multiplexing) would be needed. But I don't think we need it as ...
There are a number of processors that have 32-bit counters such as the IBM power processors. On many x86 processors the upper bits of the counter are sign extended from the lower 32 bits. Thus, one can only assume the lower 32-bit are available. Roll over of values is quite possible (<2 seconds of cycle count), so What range of cycles are you interested in measuring? 100's of cycles? A couple thousand? Are you just looking at cycle counts or other events? -Will -
Exactly, on Intel's only the bottom 32-bit actually are useable, the rest is sign-extension. That's why it is okay for measuring small sections of code, but that's it. On AMD, I think it is better. On Itanium you get the 47-bit worth. Don't know about Power or Cell. -- -Stephane -
On x86 they are sign-extended only on write, on read they are 40 bits wide for Intel, 48 bits for AMD. BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch to disable it, dunno if it has been applied. -- Phe -
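The counter widths discussed above are why a "virtualized 64-bit count" has to be maintained in software. As a rough worked figure, a 32-bit cycle counter wraps after 2^32 = 4,294,967,296 cycles, which at an assumed 2.4 GHz clock is about 1.8 seconds. The sketch below shows the usual accumulation technique under stated assumptions: the names (`struct soft_counter`, `soft_counter_init`, `soft_counter_update`) are hypothetical and not taken from perfmon2, the hardware value is passed in as a plain integer, and the counter must be sampled at least once per wrap period for the result to be correct.

```c
#include <assert.h>
#include <stdint.h>

/* Software 64-bit view of a narrower hardware counter (hypothetical). */
struct soft_counter {
    unsigned width;     /* usable hardware bits, 1..63 */
    uint64_t last_raw;  /* last hardware value seen, masked to 'width' */
    uint64_t total;     /* accumulated 64-bit count */
};

static uint64_t width_mask(unsigned width)
{
    return (UINT64_C(1) << width) - 1;
}

void soft_counter_init(struct soft_counter *c, unsigned width, uint64_t raw)
{
    assert(width >= 1 && width <= 63);
    c->width = width;
    c->last_raw = raw & width_mask(width);
    c->total = 0;
}

/* Fold a new hardware reading into the 64-bit total. The masked
 * subtraction yields the right delta across a single wrap, and masking
 * the input also discards any sign-extended upper bits. */
uint64_t soft_counter_update(struct soft_counter *c, uint64_t raw)
{
    uint64_t mask = width_mask(c->width);

    raw &= mask;
    c->total += (raw - c->last_raw) & mask;
    c->last_raw = raw;
    return c->total;
}
```

In a kernel this update would typically run from the overflow interrupt and at context switch, which is the cost the thread is weighing against a bare RDPMC from user space.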
Obviously -- without a system call to set up performance counters it would be fairly useless. But of course once such system calls are in they should be able to trigger the bit for each process. -Andi -
Andi,
Why do you think the existing interfaces are not a good fit for this?
Is this just because of your problem with file descriptors?
From my experience read(), select(), and SIGIO are fine. I know many tools use that.
As for the file descriptor, you would need to replace that with another identifier of
some sort. As I pointed out in another message on this thread, you don't want to use
a pid-based identifier. This is not usable when you monitor other threads and you
If people do not like vector arguments, then I think I can live with N system calls
to program N registers. Now you have two choices for passing the arguments:
- a pointer to a struct
struct pfarg_pmc {
uint64_t reg_value;
uint16_t reg_num;
} pmc0;
pmc0.reg_num = 0; pmc0.reg_value = 0x1234;
pfm_write_pmcs(fd, &pmc0);
- explicitly passing every field:
pfm_write_pmcs(fd, 0x0, 0x1234);
Given that event set and multiplexing would not be in initially, we would want
to allow for them to be added later without having to create yet another
system call, right?
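The trade-off between the vector form and one call per register can be made concrete with a mock. Nothing below talks to a kernel or uses the perfmon2 API: `struct demo_pmu`, `demo_write_pmcs()` and `demo_write_pmc()` are hypothetical, and the `entries` field simply counts how many times the simulated "system call" boundary is crossed. The register structure mirrors the two fields shown in the message above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_REGS 8         /* arbitrary register count for the mock */

struct demo_pmc {
    uint64_t reg_value;
    uint16_t reg_num;
};

struct demo_pmu {
    uint64_t      regs[DEMO_NUM_REGS];
    unsigned long entries;      /* simulated kernel entries so far */
};

/* Vector form: one simulated kernel entry programs n registers.
 * All requests are validated first so a bad one changes nothing. */
int demo_write_pmcs(struct demo_pmu *pmu, const struct demo_pmc *pc, size_t n)
{
    assert(pmu != NULL && (pc != NULL || n == 0));
    pmu->entries++;
    for (size_t i = 0; i < n; i++)
        if (pc[i].reg_num >= DEMO_NUM_REGS)
            return -1;
    for (size_t i = 0; i < n; i++)
        pmu->regs[pc[i].reg_num] = pc[i].reg_value;
    return 0;
}

/* Scalar form with explicit fields: one entry per register. */
int demo_write_pmc(struct demo_pmu *pmu, uint16_t reg_num, uint64_t reg_value)
{
    struct demo_pmc one = { .reg_value = reg_value, .reg_num = reg_num };

    return demo_write_pmcs(pmu, &one, 1);
}
```

Programming three registers costs one entry through the vector call and three through the scalar wrapper. Whether that difference matters in practice is exactly what the two sides of this thread disagree on.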
I am not sure I understand what you mean by 'lots of things'?
I don't have a problem with that. And in fact, I already support that
at least on Itanium. I had that in there for X86 but I dropped it after
you said that you would enable cr4.pce globally. I don't have a problem
Periods are set up per data register. Given that there is already a call to program
the data register why add another one? You don't need to treat the sampling period
differently from the register value. This is just a value that will cause the register
Should you support a kernel level sampling buffer (like Oprofile) you'd also want
to specify the reset value on overflow. And you would not necessarily want it to
be identical to the initial value (period). So you'd have to have a way to specify that
I am not against providing a flat namespace. But I think it is nice to separate config
Absolutely not, you don't want the kernel to know about events. This has ...
Hi folks, Well, I can say the mood here at supercomputing'07 is pretty somber in regards to the latest exchange of messages regarding the perfmon patches. Our community has been the largest user of both the PerfCtr and the Perfmon patches, the former being regularly installed by vendors and integrators on clusters at install time, and the latter now being adopted into vendor kernels by IBM, Cray, AMD, SiCortex and others. Of course, adoption by a vendor, does not a good kernel patch make. However, it should be viewed as a strong data point on demand for such functionality. We are a community focused on performance and we have long had a need for these tools. A solution that does not provide 64 bit virtualized per-thread counts is not a solution at all. That would need to be ripped out by all of us using this functionality so we could get something that actually does what the community needs, not what you folks think we need. Device level access and/or root access to the counters is not acceptable for machines in production. If that was fine, oprofile would have satisfied everyone and we wouldn't be sucking up your bandwidth. Please understand that people outside of your community are desperate for adoption of any form of 'per-thread' PMU functionality into the kernel. For those of you who are (still) not convinced of this, I can arrange your inbox to be spammed by 1000's of HPC geeks, managers, vendors, etc. My point is, let's start somewhere that the community finds useful. Otherwise we run the risk of developing an interface that everyone isn't comfortable with and no-one uses. Hardly a productive exercise.
So please, do consider a set of core functionality that provides for (at least) the following:
- per-CPU and per-thread 64 bit virtualized counts
- third person operation (attach/ptrace)
- dispatch of signal upon interrupt on overflow if requested
- 'buffered' interrupts into a buffer that can be mmap'd into ...
"somber"? Why? We (a number of the kernel developers) want to see the perfmon code make it into the kernel tree, unfortunately, in the current state it is in, that's not going to happen. Andi specified a way that this can happen, just refactor your patches into smaller bits that can be reviewed and applied. If you, or anyone else has any questions about this, please let us know. So far, I have not seen any response to his message, so I'm guessing that the perfmon developers either are off working on this, or don't care. And if they don't care, then yes, I agree with your "somber" feeling... thanks, greg k-h -
Well... Philip is (I assume) a numerical-computing guy and not a kernel-developing guy (probably a wise choice). He speaks for quite a few people - they have serious need for this feature but they've had to scruff around with out-of-tree patches for years to get it, and still there are problems. I was hoping that after the round of release-and-review which Stephane, Andi and I did about twelve months ago that we were on track to merge the perfmon codebase as-offered. But now it turns out that the sentiment is that the code simply has too many bells-and-whistles to be acceptable. My problem with that sentiment is that it is quite likely the case that those bells-n-whistles are actually useful and needed features. Perfmon has been out there for quite a few years and the code which is in there _should_ be in response to real-world in-the-field experience. Such requirements never go away. So. If what I am saying is correct then the best course of action would be for Stephane to help us all to understand what these features are and why we need them. The ideal way in which to do this is
	[patch] perfmon: core
	[patch] perfmon: whizzy feature #1
	[patch] perfmon: whizzy feature #2
	[patch] perfmon: whizzy feature #3
	etc.
Where the changelog in each whizzy-feature-n explains what it does, why it does it and why our users need it. Whatever happens, perfmon is so big and so old and has been out-of-tree for so long that it's going to take a pile of work from lots of people to get any of it landed. -
I agree. Right now their git tree has over 80 patches in it, without descriptions like this to help those of us who want to review and help out, it is quite difficult. thanks, greg k-h -
> He speaks for quite a few people - they have serious need for this feature Most likely they have serious need for a very small subset of perfmon2. The point of my proposal was to get this very small subset in quickly. Phil, how many of the command line options of pfmon do you actually use? How many do the people at your conference use? Or what functions, what performance counters etc. in PAPI or whatever library you use? Help us understand the use cases better, that would already help a lot in merging by concentrating on what people actually really need. -Andi -
Hi Andi, pfmon is a single tool and fairly low level, the HPC folks don't use it so much because it isn't parallel aware and is meant for power users. It is not representative of the tools used in HPC at all. Our community uses tools built on the infrastructure provided by libpfm and PAPI for the most part. I know you don't want to hear this, but we actually use all of the features of perfmon, because a) we wanted to use the best methods available and b) user-level solutions, where they could be made (like multiplexing), introduced too much noise and overhead to be of use. For years we relied on PerfCtr which did 'just enough' for us. But when Perfmon2 became available, we adopted technology where it meant a significant increase in accuracy for the resulting measurements; for us that meant kernel multiplexing and sample buffers. Note that PAPI is just middleware. The tools built upon it are what people use...some of those are commercial tools like Vampir but most are Open Source. These tools are cross platform, as such they run on nearly everything...although intel/amd/ppc systems dominate the HPC market. The usage cases are always the same and can be broken down into simple counting and sampling:
- providing virtualized 64-bit counters per-thread
- providing notification (buffered or non) on interrupt/overflow of the above.
If you'd like to outline further what you'd like to hear from the community, I can arrange that. I seem to remember going through this once before, but I'd be happy to do it again. For reference, here's a quick list from memory of some of the tools in active use and built on this infrastructure. These are used heavily around the globe. You'll see that each basically follows one of the 2 usage models above.
- HPCToolkit (Rice)
- PerfSuite (NCSA)
- Vampir (Dresden)
- Kojak (Juelich)
- TAU (UOregon)
- PAPIEX (me)
- GPTL (NCAR)
- HPM-Linux (IBM)
- Paraver (Barcelona)
Time to go give ...
That is hard to believe. But let's go for it temporarily for the argument. Can you instead prioritize features. What is most essential, what is Ok that makes sense and should be possible with a reasonable simple Please list concrete features, throwing around random names is not useful. -Andi -
Just getting back to this now that SC07 is finally over... You are welcome to download the code and some of the tools and verify Yes, although this has been done before. You've got the list below in the previous emails which should be considered the absolute minimum.
- A feature which was dropped earlier by Stephane (only to satiate LKML), we consider very important: allowing a mapping of the kernel's view of the PMDs, which gives user-space access to full 64-bit counts if the architecture supports a user-level read instruction. Getting the counts in a couple of dozen cycles is ALWAYS a win for us. This is because the HPC community is mainly interested in self-monitoring, not third-party, because the former can be easily associated with context in the app through instrumentation in various forms.
- Kernel multiplexing is very nice to have, saves you tremendous overhead at user level. PAPI has an implementation in user-space for the platforms that don't support this. The flexibility of the current implementation is not exploited, here I'm referring to the concept of eventsets. Having multiplexing is important. Being able to allocate/reallocate eventsets and the threshold of individual eventsets is just nice to have.
- Custom sample formats would be considered not often used in our community, largely because the tools run on all HPC/Linux architectures. PAPI uses the default sample format which has been sufficient for our needs. However, the lack of custom sample formats precludes the development of the specialized tools that access the sampling hardware as found on the IA64, PPC64, the Barcelona and the SiCortex node chip.
Well that's good news. The above is what we have used via the PerfCtr set of patches for a long time. It wasn't quite enough, but it got the job This is the kind of comment that makes the Linux/HPC folks 'somber'. What isn't useful, is being dismissive of an entire community that moves a heck of a lot of ...
I didn't see a clear list. My impression so far is that you're not quite sure what you want, You mean returning the register number for RDPMC or equivalent and a way to enable it for ring 3 access? I'm considering that an essential feature too. I wasn't aware Yes it is for everybody. I've been rather questioning if the slow ways (complicated syscalls) to get the counter information are really What do you mean with custom sample formats exactly? What information do you want in there? And why? e.g. PEBS and so on pretty much fix the in-memory sample format in hardware, so the only way to get a custom format would be to use a separate buffer. I can think of one reason why the kernel should add more information in a separate buffer (log the instruction bytes so that it can be disassembled and an address histogram be generated using the PEBS register values), but it is a relatively obscure one and definitely not an essential feature. Unfortunately it is also hard to implement completely Sorry, but these kinds of non-technical BS arguments will just make you be ignored in mainline Linux lands. They might work if you pay a lot of money to specific Linux companies (do you?), but here on linux-kernel you have to convince with purely technical arguments. -Andi -
Andi, No, he is talking about something similar to what was in perfctr. The kernel emulates 64-bit counters in software and that is what you get back when you read the counters. If you read via RDPMC, you get 40 bits. To reconstruct the full 64-bit value from user land you need the upper bits. One approach is for the kernel to allow you to remap a page that has the 64-bit (software) counters. With Perfmon2 allows you to have an in-kernel sampling buffer. The idea is not new, Oprofile has this as well. The problem here is that if the buffer is in the kernel, the format of the samples is fixed, and it should not have to be. Tools may want to record samples in different formats and as you said some may need extra information gathered in the kernel. Some may want to aggregate samples in the kernel (Oprofile used to do that), some may want to use a double-buffer approach to minimize blind spots, others may simply use the counter overflow mechanism to record something that is non-PMU related, e.g., kernel call stack. I have built such a module and it was quite interesting to collect the call stack when you hit a last-level cache miss. The idea behind customizable sampling format is simple: extract the format from the perfmon core and put this into a kernel module. The core provides a simple registration mechanism and the two communicate via a set of callbacks. Perfmon2 comes with a basic default format which works on all platforms. But it is possible to develop others without having to patch the kernel, recompile, or reboot. At its core, each format provides a handler routine which is called on counter overflow. The handler routine controls what is recorded, how it is recorded, how it is exported to userland, and whether overflow notifications need to be sent.
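As an editorial illustration of the registration-plus-callback split described above, here is a minimal userland sketch in C. Every name in it (`sample_ctx`, `fmt_register`, `fmt_find`, `core_on_overflow`, `ip_only_handler`) is hypothetical and is not the perfmon2 API; it only shows a core that knows nothing about the sample layout and a format that supplies the overflow handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is not the perfmon2 API. */
struct sample_ctx {
    uint64_t ip;        /* interrupted instruction pointer */
    unsigned ovfl_pmd;  /* counter that overflowed */
    uint64_t *buf;      /* format-private sample buffer */
    size_t buf_len;     /* capacity, in entries */
    size_t buf_used;    /* entries recorded so far */
};

/* A format's handler returns nonzero when userland should be notified. */
typedef int (*fmt_handler_t)(struct sample_ctx *ctx);

struct sample_fmt {
    const char *name;
    fmt_handler_t handler;
};

#define MAX_FMTS 8
static struct sample_fmt fmt_table[MAX_FMTS];
static size_t fmt_count;

/* Registration: the "core" only stores a name and a callback. */
int fmt_register(const char *name, fmt_handler_t handler)
{
    assert(name != NULL && handler != NULL);
    if (fmt_count == MAX_FMTS)
        return -1;
    fmt_table[fmt_count].name = name;
    fmt_table[fmt_count].handler = handler;
    fmt_count++;
    return 0;
}

const struct sample_fmt *fmt_find(const char *name)
{
    for (size_t i = 0; i < fmt_count; i++)
        if (strcmp(fmt_table[i].name, name) == 0)
            return &fmt_table[i];
    return NULL;
}

/* Example format: record only the IP, notify once the buffer is full. */
int ip_only_handler(struct sample_ctx *ctx)
{
    if (ctx->buf_used < ctx->buf_len)
        ctx->buf[ctx->buf_used++] = ctx->ip;
    return ctx->buf_used == ctx->buf_len;
}

/* What the core would do on counter overflow: defer to the format. */
int core_on_overflow(const struct sample_fmt *fmt, struct sample_ctx *ctx)
{
    return fmt->handler(ctx);
}
```

With a two-entry buffer, the first overflow records a sample without notifying and the second one fills the buffer and requests a notification, which is the amortization the in-kernel buffer is meant to provide.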
Using this mechanism, for instance, we were able to connect the Oprofile kernel code to perfmon2 on Itanium with 100 lines of This is also how we support PEBS because, as you said, the format of the samples is not under your ...
You mean the page contains the upper [40;63] bits? Sounds reasonable, although I don't remember seeing that when I looked ... you also didn't say *why* that is needed. Can you give a concrete use case for something that cannot be done The existing oprofile code works already fine on x86, no real Exactly that makes the support for random custom buffers questionable. e.g. as I can see the main advantage of perfmon over existing setups is that it support PEBS etc., but with your custom buffer formats which are by definition incompatible with PEBS you would negate that advantage again. Why this insistence against changing anything? -Andi -
Upper 32 bits ([32:63]). On many implementations only the lower 32 bits are available in the register. Bits 32:40 in several x86 processor implementations cannot be set to anything other than the sign extension of bit 32. On OProfile is very useful in many cases, but it only performs sampling. If one wants to take a look at the number of events a specific section of code causes, one can't really do that with oprofile. The counters are running systemwide, not per thread. For some experiments developers really like to have per-thread counters. The rewrite of oprofile to use the perfmon code was to consolidate code using the performance monitoring hardware. Use one interface for accessing the performance monitoring hardware rather than have one for sampling and another So the alternative approach is to write a new device driver for each of the new performance monitoring mechanisms, e.g. one for PEBS and another for IBS? One of the reasons for the custom sample buffers was to avoid having an expensive user-space signal for a process to record some simple pieces of data each time the data becomes available. For the oprofile port to the perfmon2 custom buffer mechanism the instruction pointer and the counter that overflowed are recorded. The buffer can be processed in one large chunk by userspace, reducing overhead. In essence the current implementation of OProfile in the mainline kernels has a custom buffer mechanism. -Will -
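To make the "software upper bits plus hardware lower bits" idea concrete, here is a small editorial sketch in C. The layout of the mapped page (`struct mapped_pmd`), the odd/even sequence convention, and the `demo_hw_read` stand-in for a user-level counter read such as RDPMC are all assumptions made for illustration; the thread does not specify them. A real implementation would also need memory barriers around the sequence check.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Merge the software-maintained value with the bits the hardware
 * counter actually implements. hw_bits is the hardware width
 * (for example 32 or 40, depending on the processor).
 */
uint64_t combine_counter(uint64_t soft, uint64_t hw_raw, unsigned hw_bits)
{
    assert(hw_bits > 0 && hw_bits < 64);
    uint64_t mask = (UINT64_C(1) << hw_bits) - 1;
    return (soft & ~mask) | (hw_raw & mask);
}

/* Hypothetical layout of one counter in the page remapped to userland. */
struct mapped_pmd {
    volatile uint32_t seq;   /* assumed: odd while the kernel is updating */
    volatile uint64_t soft;  /* software 64-bit value (upper bits valid) */
};

/* Stand-in for a user-level hardware read; returns a fixed value here. */
static uint64_t demo_hw_read(void)
{
    return UINT64_C(0x12345678);
}

/*
 * Retry until the software part did not change underneath us.
 * Barriers are omitted for brevity; this is a sketch only.
 */
uint64_t read_counter(const struct mapped_pmd *m,
                      uint64_t (*read_hw)(void), unsigned hw_bits)
{
    uint32_t s;
    uint64_t soft, hw;

    do {
        s = m->seq;
        soft = m->soft;
        hw = read_hw();
    } while ((s & 1) || s != m->seq);

    return combine_counter(soft, hw, hw_bits);
}
```

The retry loop matters because an overflow interrupt between the two reads would bump the software part while the hardware part wraps, and mixing the old upper bits with the new lower bits would produce a count that jumps backwards.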
Andi, Do you question why Oprofile has one ;-> But I am happy to explain. With sampling, you want to record information about the execution of a thread at some interval. The interval could be expressed as time or number of occurrences of a PMU event. Typically you get a notification. Then you need to collect certain information about the execution. Typically you record the instruction pointer (e.g. Oprofile), but you may want to record the value of other counters, PMU registers or other HW/SW resources. While you're doing this, monitoring is typically stopped so you get a consistent view. After you're done recording you need to re-arm the sampling period. If you use event-based sampling, you need to reprogram the counter(s). Then you resume monitoring. You have to repeat this process for each sample regardless of whether you are self-monitoring, monitoring another thread, or monitoring a CPU. Such a sequence of operations is quite expensive, especially in the case where you are monitoring another thread, because it incurs at least a couple of context switches per sample in addition to the various register manipulations and syscalls. The idea with the kernel sampling buffer is that you amortize the cost of notification to userland over LOTS of samples. On counter overflow, the kernel records the samples on your behalf. There is no context switch, samples are always recorded in the context of the monitored thread. Now, you need a bit more information for this to work correctly because the kernel records on *your behalf*, thus you need to express:
- what you want to see recorded
- the value to reload into the overflowed counter(s) so the kernel can re-arm the next period.
Because you have multiple counters, you may use them for sampling periods, i.e., overlap sampling measurements. That is something done very frequently. For instance, the q-syscollect tool that D. Mosberger wrote, is overlapping elapsed cycles and branch trace buffer (BTB) sampling to ...
- cross-platform extensible API for configuring perf counters
- support for multiplexed counters
- support for virtualized 64-bit counters
- support for PC and call graph sampling at specific intervals
- support for reading counters not necessarily with sampling
- taskswitch support for counters
- API available from userland
- ability to self-monitor: need select/poll/etc interface
- support for PEBS, IBS and whatever other new perf monitoring infrastructure the vendors throw at us in the future
- low overhead: must minimize the "probe effect" of monitoring
- low noise in measurements: cannot achieve this in userland
perfmon2 has all of this and more i've probably neglected... -dean -
From: dean gaudet <dean@arctic.org> I want to state that even though I've been a stickler on the system call stuff, in general I want to see perfmon2 go into tree and I agree with how most of the infrastructure is implemented and the features it provides. -
Now if we only had a series of patches that we could actually review and apply to the -mm tree so that people can try them out... :) thanks, greg k-h -
I suppose by complicated here, you're referring to the gather semantics of the pfm_read/write_pmds/pmcs calls. Many processors may have 100's of registers (IA64, BG/P, SiCortex), some of which have different access times. So a naive syscall of 'give me all the registers you've got' isn't going to cut it. However, any additional simplicity (performance) we can squeeze out of this particular primitive is a huge win as it sits in the critical path of the user Performance and noise. See the earlier message about our user-land implementation versus kernel mode implementations. At any useful granularity, you begin to seriously affect the counts with noise as well as dilate the run-time. But let's punt on this one until after By custom here, I mean the ability to have the kernel take samples containing more than just the IP, the PID and a bitmask of which registers overflowed at this point. Myself and others have worked hard to get effective address sampling into the hardware (there are registers that contain EA's of misses as well as branch mispredict data on the PPC, IA64, Barcelona and SiCortex) that are handled through the use of a format that gathers up that information at interrupt time for deposit into the sample buffer. We are not wedded to Perfmon2's implementation of these formats, we are however, wedded to having this information collected at interrupt time as the data may change by the time you get back to user-mode. This hardware is not obscure any more, it's the norm, as we've learned at thus simple aggregate counters, even those with precise I love it when kernel folks refer to their own revenue streams (and yes, we do, ask your VP of sales) and the needs of a user community as "BS non-technical arguments". But let's get back to basics here. We can sort that out over a beer sometime. At this point, let's try and agree on the minimum set of functionality acceptable for a first round of patches. - per-CPU ...
From: Andi Kleen <andi@firstfloor.org> I would like to add sparc64 support to perfmon2 as well and therefore I've been considering this angle of the API issues as well. The counters on sparc64 can be configured to be readable by userspace, so for the self-monitoring cases I really would like to make sure the perfmon2 library interface could use direct reads for sampling instead of system calls or specialized traps. If I get some spare time I'll look at the current perfmon2 patches and see if I can toss together sparc64 support to get a feel for how things stand currently. -
Whose sentiment? I've had a bit of a look at it today together with David Gibson. Our impression is that the latest version is a lot cleaner and simpler than it used to be. I'm also reading Stephane's technical report which describes the interface, and whilst I'm only part-way through it, I haven't seen anything yet which strikes me as unnecessary or overly complicated. Paul. -
Yes, that's quite possible. I don't know how up-to-date people's knowledge is. I know I haven't looked seriously at the code in around twelve months. Let's get it on the wires as outlined and take a look at it all. -
Mine for example. The whole userspace interface is just on crack, and the code is full of complexities as well. -
Could you give some _technical_ details of what you don't like? Paul. -
I've done this a gazillion times before, so maybe instead of being a lazy bastard you could look up the mailing list archive. It's not like this is the first discussion of perfmon. But to get started, look at the system calls, many of them are beasts like: int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n) This is basically a read(2) (or for other syscalls a write) on something other than the file descriptor provided to the system call. The right thing to do is obviously have a pmds and pmcs file in procfs for the thread being monitored instead of these special-case files, with another set for global tracing. Similarly I'm pretty sure we can get a much better interface if we introduce matching files in procfs for the other calls. -
From: Christoph Hellwig <hch@infradead.org> This is my impression too, all of the things being done with a slew of system calls would be better served by real special files and appropriate fops. Whether the thing is some kind of misc device or procfs is less important than simply getting away from these system calls. -
Ok, I just got 4 freakin' bounces from all of these subscriber only perfmon etc. mailing lists. Please remove those lists from the CC: as it's pointless for those of us not on the lists to participate if those lists can't even see the feedback we are giving. -
Special files and fops really only work well if you can coerce the interface into one where data flows predominantly one way. I don't think they work so well for something that is more like an RPC across the user/kernel barrier. For that a system call is better. For instance, if you have something that kind-of looks like read_pmds(int n, int *pmd_numbers, u64 *pmd_values); where the caller supplies an array of PMD numbers and the function returns their values (and you want that reading to be done atomically Why? What's inherently offensive about system calls? Paul. -
From: Nick Piggin <nickpiggin@yahoo.com.au> Sure, why not? Just cook up an iovec. pmd_numbers goes to offset X and pmd_values goes to offset Y, with some helpers like what we have in the networking already for recvmsg. But why would you want readv() for this? The syscall thing Paul asked me to translate into a read() doesn't provide iovec-like behavior so I don't see why readv() is necessary at all. -
Ah sorry, that's what I get for typing before I think: of course readv doesn't vectorise the right part of the equation. What I really mean is a readv-like syscall, but one that also vectorises the file offset. Maybe this is useful enough as a generic syscall that also helps Paul's example... Of course, I guess this all depends on whether the atomicity is an important requirement. If not, you can obviously just do it with multiple read syscalls... -
I've sometimes thought it would be useful to have a "transaction" system call that is like a write + read combined into one: int transaction(int fd, char *req, size_t req_nb, char *reply, size_t reply_nb); as a way to provide a general request/reply interface for special That would take N system calls instead of one, which could have a performance impact if you need to read the counters frequently (which I believe you do in some performance monitoring situations). Paul. -
Maybe not a bad idea, though I'm not the one to ask about taste ;) In this case, it is enough for your requests to be a set of scalars (eg. file offsets), so it _could_ be handled with vectorised offsets... But in general, for special files, I guess the response is usually some structured data (that is not visible at the syscall layer). So I don't see a big problem to have a similarly arbitrarily That's true too. -
In the same way a read of structured data from a special file "is an" ioctl, yeah. You could implement either with an ioctl. The main difference is they have more explicitly typed interfaces Whether that's enough argument (and if Paul's proposal is widely usable enough) is another question. Which I won't try to answer. -
From: Paul Mackerras <paulus@samba.org>
The same way we handle some of the multicast "getsockopt()"
calls. The parameters passed in are both inputs and outputs.
For the above example:
	struct pmd_info {
		int *pmd_numbers;
		u64 *pmd_values;
		int n;
	} *p;

	buffer_size = N;
	p = malloc(buffer_size);
	p->pmd_numbers = p + foo;
	p->pmd_values = p + bar;
	p->n = whatever(N);
	err = read(fd, p, N);
It's definitely doable, use your imagination.
You can encode all kinds of operation types into the
header as well.
Another alternative is to use generic netlink.
-
You're suggesting that the behaviour of a read() should depend on what was in the buffer before the read? Gack! Surely you have better taste than that? Or are you saying that a read (or write) has a side-effect of altering some other area of memory besides the buffer you give to read()? That Then you end up with two system calls to get the data rather than one (one to send the request and another to read the reply). For something that needs to be quick that is a suboptimal interface. Paul. -
From: Paul Mackerras <paulus@samba.org> Absolutely that's what I mean, it's atomic and gives you exactly what you need. I see nothing wrong or gross with these semantics. Nothing in the "book of UNIX" specifies that for a device or special file the passed Not necessarily, consider the possibility of using recvmsg() control message data. With that it could be done in one go. This also suggests that it could be implemented as its own protocol family. -
True, but is it now any so different to an ioctl? -
Ohhhhh.... kayyyyy.... *shudders* It really violates the abstract model of "read" pretty badly. "Read" is "fill in the buffer with data from the device", not "do some arbitrary stuff with this area of memory". I'd prefer to have a transaction() system call like I suggested to There's all sorts of possible ways that it could be implemented. On the one hand we have an actual proposed implementation, and on the other we have various people saying "oh but it could be implemented this other way" without providing any actual code. Now if those people can show that their way of doing it is significantly simpler and better than the existing implementation, then that's useful. I really don't think that doing a whole new net protocol family is a simpler and better way of doing a performance monitor interface, though. Paul. -
From: Paul Mackerras <paulus@samba.org> So much for getting rid of the extra system calls... -
*I* never had a problem with a few extra system calls. I don't understand why you (apparently) do. Paul. -
From: Paul Mackerras <paulus@samba.org> We're stuck with them forever, they are hard to version and extend cleanly. Those are my main objections. -
The first is valid (for suitable values of "forever") but applies to any user/kernel interface, not just system calls. As for the second (hard to version) I don't see why it applies to syscalls specifically more than to other interfaces. It's just a matter of designing it correctly in the first place. For example, the sys_swapcontext system call we have on powerpc takes an argument which is the size of the ucontext_t that userland is using, which allows us to extend it in future if necessary. (Note that I'm not saying that the current perfmon2 interfaces are well-designed in this respect.) The third (hard to extend cleanly) is a good point, and is a valid criticism of the current set of perfmon2 system calls, I think. However, the goal of being able to extend the interface tends to be in opposition to the goal of having strong typing of the interface. Things like a multiplexed syscall or an ioctl are much easier to extend but that is at the expense of losing strong typing. Something like my transaction() (or your weird kind of read() :) also provides extensibility but loses type safety to some degree. Also, as Andi says, this is core CPU state that we are dealing with, not some I/O device, so treating the whole of perfmon2 (or any performance monitoring infrastructure) as a driver doesn't fit very well, and in fact system calls are appropriate. Just like we don't try to make access to debugging facilities fit into a driver, we shouldn't make performance monitoring fit into a driver either. Paul. -
From: Paul Mackerras <paulus@samba.org> I disagree. With netlink we can just add new attributes when a new need arises for a particular interface. The attribute code describes the type precisely, so there is no loss of strong typing at all. -
Well you must mean something different by "strong typing" from the rest of us. Strong typing means that the compiler can check that you have passed in the correct types of arguments, but the compiler doesn't have any visibility into what structures are valid in netlink messages. In any case, I think that adding a structure size argument to the current perfmon2 system calls where appropriate would mean that we could extend them cleanly later on if necessary. It would mean that we could add fields at the end, and that the kernel could know what version of the structures that userspace was using. Paul. -
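The structure-size idea can be sketched as follows. This is an editorial illustration only: `struct pmd_arg_v2`, `PMD_ARG_V1_SIZE` and `copy_versioned_arg` are hypothetical names, the "v1" and "v2" layouts are invented, and `memcpy` stands in for whatever user-copy primitive the kernel would really use. The point is that a size argument lets the receiver accept older, shorter structures and reject newer ones only when they carry data it does not understand.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument: v1 ended after 'value'; v2 appended 'flags'. */
struct pmd_arg_v2 {
    uint32_t reg_num;
    uint32_t reserved;
    uint64_t value;
    uint64_t flags;     /* added in v2, at the end */
};
#define PMD_ARG_V1_SIZE offsetof(struct pmd_arg_v2, flags)

/*
 * Copy a caller-supplied structure of user_size bytes into the
 * current layout. Older (shorter) callers get the new fields zeroed;
 * newer (longer) callers are accepted only if the unknown tail is zero.
 */
int copy_versioned_arg(struct pmd_arg_v2 *dst, const void *src, size_t user_size)
{
    assert(dst != NULL && src != NULL);

    if (user_size < PMD_ARG_V1_SIZE)
        return -EINVAL;

    if (user_size > sizeof(*dst)) {
        const unsigned char *tail = (const unsigned char *)src + sizeof(*dst);
        for (size_t i = 0; i < user_size - sizeof(*dst); i++)
            if (tail[i] != 0)
                return -E2BIG;
        user_size = sizeof(*dst);
    }

    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, user_size);
    return 0;
}
```

This only works if fields are added at the end and never reordered, which is the constraint mentioned in the thread.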
That's strong static typing. Netlink is 90% strong static typing plus 10% strong dynamic typing. That is, it'll tell you at run-time if you give it the wrong netlink attribute. The types within each netlink attribute is checked at compile time. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -
Well it tells you EINVAL no matter what is wrong. That's roughly similar to a compiler whose only error message is 'WRONG'. Or the ed school of error reporting. That makes any checking it does barely useful. -Andi -
Instead of blabbering further about this topic, I decided to put my code where my mouth is and spent the weekend porting the perfmon2 kernel bits, and the user bits (libpfm and pfmon) to sparc64. As a result I've found that perfmon2 is quite nice and allows incredibly useful and powerful tools to be written. The syscalls aren't that bad and really I see no reason to block its inclusion. I rescind all of my earlier objections, let's merge this soon :-) -
David, I appreciate your effort. I am glad to see that the interface and implementation survived yet another architecture. I think at this point ARM is the only major architecture missing. In any case, I would As I said earlier, I am not opposed to changing the syscalls. I have proposed a few schemes to address the issue of versioning. If vector arguments are problematic, we can go with single register/call. I think there are other areas where perfmon2 could benefit from the Thanks. -- -Stephane -
From: Stephane Eranian <eranian@hpl.hp.com> I sent these to Philip Mucci late last night, but in the meantime I finished implementing breakpoint support as well for pfmon. Let me clean up my diffs and I'll send it all out to you in a few hours. -
Strongly agree. However, I think we need to add structure size arguments to most of the syscalls so we can extend them later. Also, something I've been meaning to mention to Stephane is that the use of the cast_ulp() macro in perfmon is bogus and won't work on 32-bit big-endian platforms such as ppc32 and sparc32. On such platforms you can't take a pointer to an array of u64, cast it to unsigned long * and expect the kernel bitmap operations to work correctly on it. At the least you also need to XOR the bit numbers with 32 on those platforms. Another alternative is to define the bitmaps as arrays of bytes instead, which eliminates all byte ordering and wordsize problems (but makes it more tricky to use the kernel bitmap functions directly). Paul. -
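The bit-numbering problem can be shown in userland with a short editorial sketch in C. The function names are hypothetical. `test_bit_u64` defines the intended meaning of "bit n" in an array of u64, while `set_bit_via_u32` mimics what 32-bit `unsigned long` bit operations would do to the same memory, applying the XOR-with-32 fix-up only on a big-endian host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Detect host byte order at run time. */
static int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 0;
}

/* Reference semantics: bit n of a bitmap stored as an array of u64. */
int test_bit_u64(const uint64_t *map, unsigned n)
{
    return (int)((map[n / 64] >> (n % 64)) & 1);
}

/*
 * Set bit n by addressing the same memory as 32-bit words, the way
 * bitops on a 32-bit 'unsigned long' would. On big-endian hosts the
 * two halves of each u64 are swapped in memory, hence the XOR.
 */
void set_bit_via_u32(uint64_t *map, unsigned n)
{
    unsigned char *bytes = (unsigned char *)map;
    uint32_t word;

    if (host_is_big_endian())
        n ^= 32;

    memcpy(&word, bytes + (n / 32) * sizeof(word), sizeof(word));
    word |= UINT32_C(1) << (n % 32);
    memcpy(bytes + (n / 32) * sizeof(word), &word, sizeof(word));
}
```

Without the XOR, a big-endian 32-bit host would set bit 40 when asked for bit 8 and vice versa, which is the failure mode being pointed out. `memcpy` is used instead of a pointer cast to stay clear of strict-aliasing problems in the sketch.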
Paul, Yes, that is one way. It works well if you only extend structures at the end. Given that you need to obtain the file descriptor first via a pfm_create_context call, an alternative could be that you pass a version number to that call to I don't like those cast_ulp() macros. They were put there to avoid compiler warnings on some architectures. Clearly with the big-endian issue, we need to find something else. The bitmap*() macros take unsigned long *. The interface uses fixed-size types to ensure ABI compatibility between 32 and 64 bit modes. This way there is no need to marshal syscall arguments for a 32-bit app running on a 64-bit host. Looks like we will have to use bytes (u8) instead. This may have some performance impact as well. Several bitmaps are used in the context/interrupt routines. Even with u8, there is still a problem with the bitmap*() macros. Now, only a small subset of the bitmap() macros are used, so it may be okay to duplicate them for u8. -- -Stephane -
From: Stephane Eranian <eranian@hpl.hp.com> I think it would be fine to just create a set of bitop interfaces that operate on u32 objects instead of "unsigned long". Currently perfmon2 does not need the atomic variants at all, and those could thus be provided entirely under include/asm-generic/bitops/ -
Hello,
A few weeks back, I mentioned that I would post some
interesting problems that I have encountered while
implementing perfmon and for which I am still looking
for better solutions.
Here is one that I would like to solve right now and
for which I am interested in your comments.
One of the perfmon syscalls (pfm_restart()) is used to
resume monitoring after a user level notification. When
operating in per-thread non self-monitoring mode, the
syscall needs to operate on the machine state of the
monitored thread. So you get into this situation:
Thread T0 Thread T1
| |
pfm_restart() |
| |
spin_lock_irqsave() |
| |
<modify T1's machine state>--------------->|
| |
spin_unlock_irqrestore() |
| |
v v
Thread T1 may be running at the time T0 needs to modify its state.
The current solution is to set a TIF flag in T1. That TIF flag will
cause T1 (on kernel exit) to go into a perfmon function that will
then modify the state, i.e., state is self-modified. That works okay
but there are a few race conditions. For self-monitoring sessions
(e.g., system-wide or per-thread), it is easy because we operate in
the correct thread.
But there is a big difference between self-monitoring and non
self-monitoring. The pfm_restart() syscall does not provide the
same guarantee.
In self-monitoring modes, the interface guarantees that by the time you
return from the call, the effects of the call are visible. Whereas when
monitoring another thread, the call currently does not provide such
guarantee, i.e., it does not wait until T1 has seen the TIF flag and
completed the state modification before returning. We could add a ...
The utrace code supports this style of thread manipulation better than ptrace. - FChE --
Charles, Are you saying that utrace provides a utrace_thread_stop(tid) call that returns only when the thread tid is off the CPU, and that there is a matching utrace_thread_resume(tid) call? If that's the case then that is what I need. How are we with regards to utrace integration? Thanks. -- -Stephane --
While I see no single call, it can be synthesized from a sequence of them: utrace_attach, utrace_set_flags (... UTRACE_ACTION_QUESCE ...), Roland McGrath is working on breaking the patches down. - FChE --
Yes, the read call could be simplified to the level proposed above by Paul. -- -Stephane -
No it's not basically a read(). It's more like a request/reply interface, which a read()/write() interface doesn't handle very well. The request in this case is "tell me about this particular collection of PMDs" and the reply is the values. It seems to me that an important part of this is to be able to collect values from several PMDs at a single point in time, or at least an approximation to a single point in time. So that means that you don't want a file per PMD either. Basically we don't have a good abstraction for a request/reply (or command/response) type of interface, and this is a case where we need one. Having a syscall that takes a struct containing the request and reply is as good a way as any, particularly for something that needs to be quick. Paul. -
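A request/reply call of this shape can be sketched in a few lines of C. This is an editorial illustration, not the perfmon2 implementation: `struct pmd_req` and `read_pmds_sketch` are hypothetical, and a plain array stands in for the PMU registers. The caller fills in the register numbers, and one call returns all the values or fails without touching any of them.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request/reply element: reg_num is input, value is output. */
struct pmd_req {
    uint32_t reg_num;
    uint32_t reserved;
    uint64_t value;
};

/*
 * Read several registers in one call. All requests are validated
 * before any value is written, so a bad request leaves the caller's
 * array untouched.
 */
int read_pmds_sketch(const uint64_t *pmu_regs, size_t nregs,
                     struct pmd_req *reqs, size_t n)
{
    assert(pmu_regs != NULL && reqs != NULL);

    for (size_t i = 0; i < n; i++)
        if (reqs[i].reg_num >= nregs)
            return -EINVAL;

    for (size_t i = 0; i < n; i++)
        reqs[i].value = pmu_regs[reqs[i].reg_num];

    return 0;
}
```

Doing the whole read inside one call is what allows the values to be taken at (approximately) a single point in time, which a file-per-PMD layout would not give you.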
From: Paul Mackerras <paulus@samba.org> Yes it can, see my other reply. -
Hello, Exactly. This is not a brute force read()! On input you pass the list of registers you want to read. Upon return, you get the list of values. Now, I think the current call could be optimized even more by making the structure smaller. Today, the structure passed read/write PMD registers is the same. On write, we pass other information such as the reset values (sampling periods), randomization parameters and some Yes, we want to be able to read one or many registers in one call. The number of PMU counters is not going to shrink, so having a file -- -Stephane -
At least for x86 and I suspect some other architectures we don't initially need a syscall at all for this. There is an instruction, RDPMC, that can read a performance counter just fine. It is also much faster and generally preferable for the case where a process measures events about itself. In fact it is essential for one of the use cases I would like to see perfmon used for (replacement of RDTSC for cycle counting) Later a syscall might be needed with event multiplexing, but that seems I don't like read/write for this too much. I think it's better to have individual syscalls. After all that is CPU state and having syscalls for that does seem reasonable. -Andi -
Andi, On a machine with only two generic counters such as MIPS or Intel Core 2 Duo, multiplexing offers some advantages. If NMI watchdog is enabled, then you drop As I said earlier, we do use read(), not for reading counters but to extract overflow notification messages when we are sampling. It makes more sense for this usage because this is where you want to leverage some key mechanisms such as: - asynchronous notification via SIGIO. this is how you can implement self-sampling for instance. - select/poll to allow monitoring tools to wait for notification coming from multiple sessions in one call. This is useful when monitoring across fork or pthread_create. -- -Stephane -
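The select/poll usage can be illustrated with standard POSIX calls. In this editorial sketch the message layout (`struct ovfl_msg`) and the helper name `wait_any_notification` are hypothetical, and any readable file descriptor (a pipe, for instance) can stand in for a perfmon session descriptor. The helper waits on several sessions at once and reads one notification from the first one that is ready.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical overflow notification message. */
struct ovfl_msg {
    uint32_t session_id;
    uint32_t ovfl_pmd;
};

#define MAX_SESSIONS 16

/*
 * Wait for a notification on any of nfds descriptors.
 * Returns the index of the descriptor that delivered a message,
 * or -1 on timeout, error, or a short read.
 */
int wait_any_notification(const int *fds, size_t nfds, int timeout_ms,
                          struct ovfl_msg *out)
{
    struct pollfd pfds[MAX_SESSIONS];

    assert(fds != NULL && out != NULL);
    if (nfds == 0 || nfds > MAX_SESSIONS)
        return -1;

    for (size_t i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
        return -1;

    for (size_t i = 0; i < nfds; i++) {
        if (pfds[i].revents & POLLIN) {
            ssize_t got = read(fds[i], out, sizeof(*out));
            return got == (ssize_t)sizeof(*out) ? (int)i : -1;
        }
    }
    return -1;
}
```

A tool monitoring across fork or pthread_create would keep one descriptor per session in `fds` and call this in a loop, instead of blocking on each session separately.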
NMI watchdog is off by default now. Yes longer term we might need multiplexing, but definitely not as first step. -Andi -
How would you provide access to the counters of another process? Through an extension to ptrace perhaps? Paul. -
From: Andi Kleen <andi@firstfloor.org> I wouldn't even want to use a syscall for something like that on Sparc, I'd rather give this a dedicated software trap so that I can code it completely in assembler. -
actually multiplexing is the main feature i am in need of. there are an insufficient number of counters (even on k8 with 4 counters) to do complete stall accounting or to get a general overview of L1d/L1i/L2 cache hit rates, average miss latency, time spent in various stalls, and the memory system utilization (or HT bus utilization). this runs out to something like 30 events which are interesting... and re-running a benchmark over and over just to get around the lack of multiplexing is a royal pain in the ass. it's not a "far away non-essential feature" to me. it's something i would use daily if i had all the pieces together now (and i'm constrained because i cannot add an out-of-tree patch which adds unofficial syscalls to the kernel i use). -dean -
So by "multiplexing" do you mean the ability to have multiple event sets associated with a context and have the kernel switch between them automatically? Paul. -
Hello, Multiplexing in the context of perfmon2 means that you can measure more events than there are counters. To make this work, we create the notion of an event set or more precisely a register set. Each set encapsulates the full PMU state. Then the kernel multiplexes the sets onto the actual PMU hardware. Why do we need this? As Dean pointed out, there are many important metrics which require more events than there are counters. Making multiple runs can be difficult with some workloads. But there are also other, lesser-known reasons why you'd want to do this. Having lots of counters does not necessarily mean that you can measure lots of related events simultaneously. Take the Pentium 4, for instance: it has 18 counters, but for most interesting metrics, you cannot measure all the events at once. Why? Because there are important hardware constraints which translate into event combination constraints. It is not uncommon to have constraints such as: - event A and B cannot be measured together - event A can only be measured by counter X - if event A is measured, then only events B, C, D can be measured This is not just on Itanium. Power has limitations, Intel Core 2 has limitations, AMD Opterons also have limitations. When you combine a limited number of counters with strong constraints, it can quickly become difficult to make measurements in one run. Multiplexing is, of course, not as good as measuring all events continuously, but if you run for long enough and with a reasonable switching period, the *estimates* you get by scaling the obtained counts can be very close to what they would have been had you measured all events all the time. You have to balance precision with overhead. Why do this in the kernel? One might argue that there is nothing preventing tools from multiplexing at the user level. That's true and we do support this as well. You have to: - stop monitoring - read out current counter - reprogram config and data registers - ...
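The scaling of multiplexed counts mentioned above can be illustrated with a small C helper. This is a sketch of the usual linear extrapolation, assuming the event rate is roughly uniform over the run: `scale_count()` and its parameter names are hypothetical, not perfmon2 functions. The raw count is multiplied by the ratio of total run time to the time the event's set was actually loaded on the PMU.

```c
#include <stdint.h>

/* Estimate the count an event would have reached had its set been active
 * for the whole run, by linear extrapolation:
 *
 *     estimate = raw_count * total_ns / active_ns
 *
 * raw_count : count observed while the event's set was on the hardware
 * active_ns : time the set was actually loaded
 * total_ns  : total duration of the measurement
 *
 * Returns 0 if the set never ran, since no estimate is possible. */
static uint64_t scale_count(uint64_t raw_count, uint64_t active_ns,
                            uint64_t total_ns)
{
    if (active_ns == 0)
        return 0;

    /* Use floating point so raw_count * total_ns cannot overflow 64 bits;
     * the result is an estimate anyway. Round to the nearest integer. */
    return (uint64_t)((double)raw_count * (double)total_ns /
                      (double)active_ns + 0.5);
}
```

For example, a set that was loaded for a quarter of the run and counted 1000 events yields an estimate of 4000. The estimate degrades when the workload has phases shorter than the switching period, which is the precision versus overhead balance described above.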
We've provided multiplexing in PAPI at the user level for years. That forced it to the user level, which wasn't pretty. Or very statistically accurate. We've been eagerly anticipating the improvements provided by in-kernel multiplexing in perfmon2. We and our user base don't consider this a "far away non-essential feature", but a deficiency that's needed addressing for a long time. -
Greg, I am the core developer of this and I am not as pessimistic as Phil. Yet I admit I think I understand your concerns. I will work on this. I think it is possible to refactor. It will certainly be painful (for me), but I think it can be done within some reasonable delay. Of course, it would help if you could better qualify what I do care a lot actually. Believe me, I do spend a lot of effort and energy on this project every day, like many others around the world, and I intend for it to succeed. We have reached a point in the development of processor hardware where this kind of feature is crucial and it is not just for HPC folks anymore. -- -Stephane -
I think Andrew already spelled this out. If after reading his message, you still have questions, please let me know and I'll be glad to work with you to address them. thanks, greg k-h -
<stupid bullshitting snipped> What about investing some effort to do a proper performance counter infrastructure or turning the mess perfmon is into one instead of this useless rant? Code is not getting any better by your complaining and cc'ing gazillions of useless lists. -
Context switch is imho the main differentiating feature of perfmon over oprofile. Not sure it makes sense to take that one out. I don't think the complexity of the patches comes from the context switch anyways, it comes from the lots of other things it does. -Andi -
