Re: [PATCH] fix up perfmon to build on -mm

From: Greg KH
Date: Tuesday, November 6, 2007 - 5:34 pm

Here's a patch against my current tree that gets the perfmon code
building and hopefully working.

Note, it needs the kobject_create_and_register() patch which is in my
tree, but I do not think it made it to -mm yet.  The next -mm cycle
should have it.

Also, the sysfs usage in the perfmon code is quite strange and not
documented at all.  Yes, there is a little bit in the documentation
about what a few of the files do, but there are _way_ more files and
even directories being created under /sys/kernel/perfmon/ that are not
documented at all here.

If you document this stuff, I think I can clean up your sysfs code a
lot, making things simpler, easier to extend, and easier to understand.
But as it is, I don't want to break anything as it's totally unknown how
this stuff is supposed to work...

Hint, use the Documentation/ABI directory to document your sysfs
interfaces, that is what it is there for...

thanks,

greg k-h

---------------
From: Greg Kroah-Hartman <gregkh@suse.de>
Subject: perfmon: fix up some static kobject usages

This gets the perfmon code to build properly on the latest -mm tree, as
well as removing some static kobjects.

A lot of future kobject cleanups can be done on this code, but the
documentation for the perfmon sysfs interface is very limited and does
not describe all of the different files and subdirectories at all.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 perfmon/perfmon_sysfs.c |   37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

--- a/perfmon/perfmon_sysfs.c
+++ b/perfmon/perfmon_sysfs.c
@@ -76,7 +76,8 @@ EXPORT_SYMBOL(pfm_controls);
 
 DECLARE_PER_CPU(struct pfm_stats, pfm_stats);
 
-static struct kobject pfm_kernel_kobj, pfm_kernel_fmt_kobj;
+static struct kobject *pfm_kernel_kobj;
+static struct kobject *pfm_kernel_fmt_kobj;
 
 static void pfm_reset_stats(int cpu)
 {
@@ -402,31 +403,23 @@ static struct attribute_group pfm_kernel
 
 int __init pfm_init_sysfs(void)
 ...
From: Stephane Eranian
Date: Wednesday, November 7, 2007 - 3:34 am

Greg,

I will move the description from perfmon2.txt to its own file in
ABI/testing.

--
-Stephane
-

From: Greg KH
Date: Wednesday, November 7, 2007 - 10:07 am

That is what I was referring to, that file does not describe all of the

That would be great to have, thanks.

greg k-h
-

From: Stephane Eranian
Date: Wednesday, November 7, 2007 - 6:42 am

Greg,

The perfmon sysfs documentation has been updated following your advice.
You can check out the following commit in my perfmon tree:

	e83278f879e52ecee025effe9ad509fd51e4a516

Thanks.


-- 

-Stephane
-

From: Greg KH
Date: Wednesday, November 7, 2007 - 10:08 am

Where is this git tree located?  On git.kernel.org somewhere?

thanks,

greg k-h
-

From: Andrew Morton
Date: Wednesday, November 7, 2007 - 10:33 am

I get mine from git+ssh://master.kernel.org/pub/scm/linux/kernel/git/eranian/linux-2.6.git
-

From: Greg KH
Date: Wednesday, November 7, 2007 - 10:41 am

Thanks, that worked, let me go read the new documentation...
-

From: Stephane Eranian
Date: Wednesday, November 7, 2007 - 10:50 am

From: Greg KH
Date: Wednesday, November 7, 2007 - 10:47 am

Thanks, that looks a lot better.

Do you want me to send you patches based on this tree to help clean up
the sysfs usage now that it's documented?

Also, a lot of your per-cpu sysfs files should probably move to debugfs
as they are for debugging only, right?  No need to clutter up sysfs with
them when only the very few perfmon developers would be needing access
to them.

thanks,

greg k-h
-

From: Stephane Eranian
Date: Wednesday, November 7, 2007 - 10:57 am

Greg,

Yes, send me the patches. But from what you were saying earlier it seems
I would need some extra sysfs patches to make this compile. Is that particular
Yes, this is mostly debugging. If debugfs is meant for this, then I'll
be happy to move this stuff over there. Is there some good example of how
I could do that based on my current sysfs code?

Thanks.

-- 
-Stephane
-

From: Greg KH
Date: Wednesday, November 7, 2007 - 12:53 pm

No, it's in my tree, and will be in the next -mm.  You will need a few

There is documentation for debugfs in the kernel api document :)

And, there are many in-kernel users of debugfs, a grep for
"debugfs_create_" should show you some examples of how to use this.  If
you have any questions, please let me know.

thanks,

greg k-h
-

From: Stephane Eranian
Date: Wednesday, November 7, 2007 - 1:39 pm

Greg,

Could you send them to me? If they are not too intrusive I could add them
to my tree. Yet I don't want something too distant from Linus's tree which
Ok, I'll look at that next.

Thanks,

-- 

-Stephane
-

From: Stephane Eranian
Date: Thursday, November 8, 2007 - 8:27 am

Greg,


I have now removed all the perfmon2 statistics from sysfs and moved them
to debugfs. I must admit, I like it better this way. Debugfs is also so
much easier to program.

Patch has been pushed into my tree. Let me know if you think I can improve
the sysfs code some more.

Thanks.

-- 

-Stephane
-

From: Andrew Morton
Date: Friday, November 9, 2007 - 1:06 pm

On Tue, 6 Nov 2007 16:34:54 -0800

Unfortunately I still haven't merged perfmon due to recently-occurring
minor conflicts with Tony's ia64 tree and more major recently-occurring
conflicts with the x86 tree.

There's not really a lot which Stephane can practically do about this -
normally I'll just get down and fix stuff like this up.  But the impression
I get from various people is that the perfmon tree in its present form
would not be a popular merge.

The impression which people have (and I admit to sharing it) is that
there's just too much stuff in there and it might not all be justifiable. 
But I suspect that people have largely forgotten what is in there, and why
it is in there.

We really need to get this ball rolling, and that will require a sustained
effort from more people than just Stephane.  I suppose as a starting point
we could yet again review the existing patches, please.  People will mainly
concentrate upon the changelogging to understand which features are being
proposed and why, so that submission should describe these things pretty
carefully: what are the features and why do we need each of them.

tia.
-

From: Greg KH
Date: Friday, November 9, 2007 - 2:38 pm

Is there some way to rebase these patches/git tree to be a bit more easy
to review?  Right now there are over 75 patches in the tree and many (if
not most) can be removed by merging them with previous patches.

If someone could break this stuff down into reviewable pieces, it would
go a very long way toward making it acceptable.

Is there any way to just provide a basic framework that everyone can
agree on and then add on more stuff as time goes on?  Do we have to have
every different processor/arch with support to start with?

thanks,

greg k-h
-

From: Andi Kleen
Date: Saturday, November 10, 2007 - 1:32 pm

Greg KH <greg-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> writes:

[dropped perfmon list because gmane messed it up and it's apparently

I think the real problem is not the architectures (the processor
adaptation layer is usually relatively straightforward IIRC), but the
excessive functionality implemented by the user interface.

It would be really good to extract a core perfmon and start with
that and then add stuff as it makes sense.

e.g. core perfmon could be something simple like just support
to context switch state and initialize counters in a basic way 
and perhaps get counter numbers for RDPMC in ring3 on x86[1]

Next step could be basic event on overflow/underflow support.

Then more features as they make sense, with clear rationale
what they're good for and proper step by step patches. 

-Andi

[1] On x86 we urgently need a replacement to RDTSC for counting
cycles.
-

From: Robert Richter
Date: Tuesday, November 13, 2007 - 8:17 am

Perhaps a core could provide also as much functionality so that
Perfmon can be used with an *unpatched* kernel using loadable modules?
One drawback with today's Perfmon is that it can not be used with a
vanilla kernel. But maybe such a core is by far too complex for a
first merge.

-Robert

-- 
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com


-

From: Stephane Eranian
Date: Tuesday, November 13, 2007 - 11:32 am

Hello,

Note that I am not against the gradual approach such as:
	- system-wide only counting
	- per-thread counting
	- user-level sampling support
	- in-kernel sampling buffer support
	- in-kernel customizable sampling buffer formats via modules
	- event set multiplexing
	- PMU description modules

It would obviously cause a lot of trouble for existing perfmon libraries and
applications (e.g. PAPI). It would also be fairly tricky to do because you'd
have to make sure that in the beginning, you leave enough flexibility such that
you can add the rest while maintaining total backward compatibility. But given
that we already have the full solution, it could just be a matter of dropping
features without disrupting the user level API. Of course there would be a bigger
burden on the maintainer because he would have two trees to maintain but I think
that is already commonplace in many of the kernel-related projects.

Let's take a simple example. The set of syscalls necessary to control a system-wide
monitoring session is exactly the same as for a per-thread session. The difference is
just a flag when the session is created. Thus, we could keep the same set of syscalls,
but only accept system-wide sessions. Later on, when we add per-thread, we would just
have to expose the per-thread session flag.
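
To make the flag-gating idea above concrete, here is a minimal sketch. It is not perfmon2 code: the flag names, the two "supported" masks, and session_create() are all hypothetical and only illustrate how one entry point could keep its signature while exposing session types step by step.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical session flags; these names are not taken from perfmon2. */
#define SESSION_FL_SYSTEM_WIDE 0x1u
#define SESSION_FL_PER_THREAD  0x2u

/* What a first merge might accept, and what a later one might accept. */
#define SUPPORTED_V1 (SESSION_FL_SYSTEM_WIDE)
#define SUPPORTED_V2 (SESSION_FL_SYSTEM_WIDE | SESSION_FL_PER_THREAD)

/*
 * Validate a session-creation request against the set of session types
 * this (imaginary) kernel exposes. Returns 0 on success, -EINVAL otherwise.
 */
static int session_create(uint32_t flags, uint32_t supported)
{
	/* exactly one session type must be requested */
	if (flags != SESSION_FL_SYSTEM_WIDE && flags != SESSION_FL_PER_THREAD)
		return -EINVAL;

	/* reject session types that are not exposed yet */
	if (flags & ~supported)
		return -EINVAL;

	return 0;
}
```

With SUPPORTED_V1 a per-thread request fails with -EINVAL; switching the mask to SUPPORTED_V2 makes the same call succeed without changing its signature.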

Having said that, this does not mean that this is necessarily what we will do. I am just
trying to present my understanding of the comments from Andrew, Andi and others.

I think that going with a kernel module will not address the 'complexity/bloat' perception
that some people have. There is a logic to that, I did not just wake up one day saying
'wouldn't it be cool to add set multiplexing?'. There was a true need expressed by users or
developers and it was justified by what the hardware offered then. This unfortunately still
stands today. I admit that justification is not necessarily spelled out clearly in the code. So
I understand most of those worries and I am trying to figure out how we could ...
From: Christoph Hellwig
Date: Tuesday, November 13, 2007 - 3:29 pm

There is no way we'll keep this completely idiotic userland API.  If people start
to use out of tree APIs they can pretty much expect that they're not going
to stay around.  And in this case they most certainly won't.

-

From: Mathieu Desnoyers
Date: Friday, November 16, 2007 - 11:25 am

(jumping in late in the game)

Linux Trace Toolkit Next Generation would _happily_ use global PMC
counters, but I would prefer to interact with an internal kernel API
rather than being required to start/stop counters from user-space. There
is a big precision loss involved in having to start things from
userspace.

Ideally, this API would manage access to available PMCs and even use the
same counters for both system-wide tracing/profiling done at the same
time as user-space profiling. This would however involve having a
wrapper around both user-space and kernel-space performance counter
reads, which is fine with me. I would suggest that user-space still go
through a system call for this, since this is available at early boot,
before the filesystem is mounted.

This API could offer an in-kernel, architecture-_independent_ PMC control
interface to:
- list available PMCs
  - That would involve mapping the common PMCs to some generic
    identifier
- attach to these PMCs, with a certain priority

We could call a single connection to a PMC a "virtual PMC". All PMC
accesses should then be done through this internally managed structure
(giving callbacks to be called after a certain count, reads, stop...).
We could have virtual PMCs that are : system wide, or per thread.

As a starting point, we could limit one virtual PMC attached to a
physical PMC at a given time. Later, we could add support for multiple
virtual PMCs connected to a single physical PMC. The priorities could be
used to kick out the PMC users with lower priorities (that involves that
a PMC read could fail!).
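
A minimal sketch of the one-owner-per-physical-PMC scheme with priorities described above. No such kernel API exists; every name here (struct virt_pmc, struct phys_pmc, vpmc_attach(), vpmc_read()) is hypothetical, and a plain integer stands in for the hardware counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; this is not an existing kernel API. */
struct phys_pmc;

struct virt_pmc {
	struct phys_pmc *phys;  /* NULL when not attached or kicked out */
	int priority;           /* larger value wins */
};

struct phys_pmc {
	uint64_t count;         /* stand-in for the hardware counter */
	struct virt_pmc *owner; /* at most one virtual PMC at a time */
};

/*
 * Attach v to p. Returns 0 on success, or -1 if p is already held by a
 * virtual PMC of equal or higher priority.
 */
static int vpmc_attach(struct virt_pmc *v, struct phys_pmc *p, int priority)
{
	if (p->owner && p->owner->priority >= priority)
		return -1;

	if (p->owner)
		p->owner->phys = NULL;  /* kick out the lower-priority user */

	v->phys = p;
	v->priority = priority;
	p->owner = v;
	return 0;
}

/* Read through the virtual PMC. Fails with -1 if v has been kicked out. */
static int vpmc_read(const struct virt_pmc *v, uint64_t *value)
{
	if (!v->phys)
		return -1;

	*value = v->phys->count;
	return 0;
}
```

The failing read is the visible cost of the priority scheme: a user that has been kicked out has to notice the error and either re-attach or give up.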

Then, to get interrupts or signals upon PMC overflow, we could manage
each physical PMC like a timer, using the lowest requested value for the
next time we are to be awakened. Some logic would have to be added to
the pmc read operation to get the "real" expected value, but this is
nothing difficult.
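
The timer-like handling of overflow requests can be sketched as follows. This is only an illustration under assumptions: events_until_next_wakeup() is a hypothetical helper, and each target is taken to be an absolute event count at which one virtual PMC wants to be notified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given the absolute event counts at which each
 * virtual PMC wants a notification, return how many more events the
 * physical PMC should count before it next interrupts.
 * Returns UINT64_MAX when no request is pending.
 */
static uint64_t events_until_next_wakeup(const uint64_t *targets, size_t n,
					 uint64_t now)
{
	uint64_t best = UINT64_MAX;

	for (size_t i = 0; i < n; i++) {
		if (targets[i] <= now)
			continue;       /* already due or delivered */
		if (targets[i] - now < best)
			best = targets[i] - now;
	}
	return best;
}
```

The physical counter would then be programmed to interrupt after the returned number of events, and each virtual read would add back the events already accounted for.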

Those were the ideas I had last OLS after hearing the talk about
perfmon2. I hope they can be useful. If ...
From: William Cohen
Date: Tuesday, November 13, 2007 - 8:35 am

Hi Robert,

In the past I suggested that it might be useful to have a version of perfmon2 
that only set up the perfmon on a global basis. That would allow the patches for 
context switches to be added as a separate step, splitting up the patch into 
a set of smaller patches.

Perfmon2 uses a set of system calls to control the performance monitoring 
hardware. This would make it difficult to use an unpatched kernel unless perfmon 
changed the mechanism used to control the performance monitoring hardware.

-Will
-

From: Stephane Eranian
Date: Tuesday, November 13, 2007 - 10:55 am

Hello,

Yes, that would be a possibility but as you pointed out there are some problems:

	- perfmon2 uses system calls. So unless you can dynamically patch the
	  syscall table we would have to go back to the ioctl() and driver model.
	  I was under the impression that people did not quite like multiplexing
	  syscalls such as ioctl(). I also do prefer the multi syscall approach.

	- perfmon2 needs to install a PMU interrupt handler. On X86, this is not just
	  an external device interrupt. There needs to be some APIC and interrupt
	  gate setup. There may be other constraints on other architectures as well.
	  Not sure if all functions/structures necessary for this are available to
	  modules.

	- we could not support per-thread mode with the kernel module approach due to
	  the link to the context switch code. I do believe per-thread is a key value-add
	  for performance monitoring.

-- 
-Stephane
-

From: William Cohen
Date: Tuesday, November 13, 2007 - 11:33 am

The oprofile module can set up a handler for PMU interrupts. This is done in 
arch/x86/oprofile/nmi_int.c:nmi_cpu_setup().  Other modules could do the same. 
However, it bumps whatever was using the nmi/pmu off, then restores nmi/pmu 
when oprofile is shut down. Maybe the pmu/nmi resource reservation mechanism 

The per-thread monitoring is useful to a number of people and many people want 
it. The thought was how to break the large perfmon patch into a set of smaller 
incremental patches. So it isn't whether to have per-thread pmu virtualization, 
but rather when/how to get it in.

-Will
-

From: Stephane Eranian
Date: Tuesday, November 13, 2007 - 2:13 pm

Will,


Oprofile does not set up the PMU interrupt. It builds on top of the NMI watchdog
setup. It uses the register_die() mechanism, if I recall. The low level APIC
and gate are set up elsewhere. Perfmon does not use NMI, unless forced to because

I think we all agree on this.

-- 

-Stephane
-

From: Andi Kleen
Date: Tuesday, November 13, 2007 - 2:29 pm

Oprofile works without the NMI watchdog too, but it just happens to be another


It could handle it in the same way as oprofile if it wanted. But
NMIs make everything more complicated and it might not be worth it.

-Andi
-

From: Stephane Eranian
Date: Tuesday, November 13, 2007 - 2:46 pm

Andi.

I meant the register_die_notifier() mechanism which allows you to
chain a handler on NMI interrupts. At least that's my understanding
reading the code:

static int nmi_setup(void)
{
        int err=0;
        int cpu;

        if (!allocate_msrs())
                return -ENOMEM;

        if ((err = register_die_notifier(&profile_exceptions_nb))){
                free_msrs();
                pfm_release_allcpus();
                return err;
        }

Yes, horribly more complicated because of locking issues within perfmon.
As soon as you expose a file descriptor, you need some locking to prevent
multiple user threads (malicious or not) from competing for access to the PMU state.
I think the value add of NMI can be achieved as well with advanced PMU features
such as Intel Core 2 PEBS.

-- 
-Stephane
-

From: Andi Kleen
Date: Tuesday, November 13, 2007 - 2:50 pm

Why do you need the file descriptor? 

One of the main problems with perfmon is the complicated user interface.


True probably, although only on CPUs that support PEBS. Dropping features
for old CPUs is unfortunately quite difficult in Linux, and in this case
probably not an option because there are so many of them (e.g. all of AMD
not Fam10h) 

-Andi
-

From: Stephane Eranian
Date: Tuesday, November 13, 2007 - 3:22 pm

Andi,

To identify your monitoring session, be it system-wide (i.e., per-cpu) or per-thread.
A file descriptor allows you to use close, read, select, poll and you leverage the
existing file descriptor sharing/inheritance semantics. At the kernel level, a 
descriptor provides all the callbacks necessary to make sure you clean up the perfmon

Yes, I know that. Also note that unfortunately, AMD Fam10h IBS feature does not
allow you to capture more than one sample in critical sections. It is still
interrupt based sampling with one entry-deep buffer: one interrupt = one sample.
Perfmon does support NMI though it is much more expensive to use.

-- 
-Stephane
-

From: Andi Kleen
Date: Tuesday, November 13, 2007 - 3:25 pm

Surely that could be done with a flag for each call too? Keeping file descriptors

Didn't you already have a thread destructor for it?

-Andi
-

From: Stephane Eranian
Date: Tuesday, November 13, 2007 - 3:58 pm

Andi,


I don't understand this.

Let's take the simplest possible example (self-monitoring per-thread)
counting one event in one data register.

int
main(int argc, char **argv)
{
	int ctx_fd;
	pfarg_pmd_t pd[1];
	pfarg_pmc_t pc[1];
	pfarg_ctx_t ctx;
	pfarg_load_t load_args;

	memset(&ctx, 0, sizeof(ctx));
	memset(pc, 0, sizeof(pc));
	memset(pd, 0, sizeof(pd));
	memset(&load_args, 0, sizeof(load_args));

	/* create session (context) and get file descriptor back (identifier) */
	ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);

	/* setup one config register (PMC0) */
	pc[0].reg_num   = 0;
	pc[0].reg_value = 0x1234;

	/* setup one data register (PMD0) */
	pd[0].reg_num = 0;
	pd[0].reg_value = 0;

	/* program the registers */
	pfm_write_pmcs(ctx_fd, pc, 1);
	pfm_write_pmds(ctx_fd, pd, 1);

	/* attach the context to self */
	load_args.load_pid = getpid();
	pfm_load_context(ctx_fd, &load_args);

	/* activate monitoring */
	pfm_start(ctx_fd, NULL);

	/*
	 * run code to measure
	 */

	/* stop monitoring */
	pfm_stop(ctx_fd);

	/* read data register */
	pfm_read_pmds(ctx_fd, pd, 1);

	printf("PMD0 %llu\n", pd[0].reg_value);

	/* destroy session */
	close(ctx_fd);

	return 0;
}

-- 

-Stephane
-

From: Andi Kleen
Date: Tuesday, November 13, 2007 - 7:07 pm

[dropped all these bouncing email lists. Adding closed lists to public



Why do you need to set the data register? Wouldn't it make

My replacement would be to just add a flags argument to write_pmcs 

Why can't that be done by the call setting up the register?

Or if someone needs to do it for a specific region they can read

On x86 i think it would be much simpler to just let the set/alloc
register call return a number and then use RDPMC directly. That would
be actually faster and be much simpler too.

I suppose most architectures have similar facilities, if not a call could be 
added for them but it's not really essential. The call might be also needed
for event multiplexing, but frankly I would just leave that out for now.

e.g. here is one use case I would personally see as useful. We need
a replacement for simple cycle counting since RDTSC doesn't do that anymore
on modern x86 CPUs.  It could be something like:

	/* 0 is the initial value */

	/* could be either library or syscall */
	event = get_event(COUNTER_CYCLES); 
	if (event < 0) 
		/* CPU has no cycle counter */

	reg = setup_perfctr(event, 0 /* value */, LOCAL_EVENT); /* syscall */

	rdpmc(reg, start);
	.... some code to run ...
	rdpmc(reg, end);

	free_perfctr(reg);	/* syscall */

On other architectures rdpmc would be different of course, but 
the rest could be probably similar.

-Andi

-

From: Stephane Eranian
Date: Wednesday, November 14, 2007 - 6:09 am

Andi,



Partially true. The file descriptor becomes really useful when you sample.
You leverage the file descriptor to receive notifications of counter overflows
and full sampling buffer. You extract notification messages via read() and you can
use SIGIO, select/poll.
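
As an illustration of the notification path described above, here is a minimal sketch that waits on a descriptor with poll() and then extracts one message with read(). The message layout (struct overflow_msg) and wait_for_msg() are hypothetical; the real perfmon2 message format is not shown in this thread, and any readable descriptor (a pipe, for instance) can stand in for the context descriptor when trying it out.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical notification record, not the perfmon2 layout. */
struct overflow_msg {
	uint32_t type;     /* e.g. counter overflow, buffer full */
	uint32_t reg_num;  /* register that triggered the message */
};

/*
 * Wait up to timeout_ms for one message on fd.
 * Returns 1 if a message was read, 0 on timeout, -1 on error.
 */
static int wait_for_msg(int fd, struct overflow_msg *msg, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;               /* nothing arrived in time */
	if (!(pfd.revents & POLLIN))
		return -1;              /* hang-up or error on fd */

	/* a short read is treated as an error in this sketch */
	if (read(fd, msg, sizeof(*msg)) != (ssize_t)sizeof(*msg))
		return -1;
	return 1;
}
```

Because the session is an ordinary descriptor, the same loop would work with select() or SIGIO, and close() on the descriptor tears the session down.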

The example shows how you can leverage existing mechanisms to destroy the session, i.e.,
free the associated kernel resources. For that, you use close() instead of adding yet
another syscall. It also provides a resource limitation mechanism to control consumption
Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

That would be quite expensive when you have lots of registers to set up: one
syscall per register. The perfmon syscalls to read/write registers accept a vector
of arguments to amortize the cost of the syscall over multiple registers
(similar to poll(2)).
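
A small user-space model of the vectored call shape discussed above. This is not the perfmon2 implementation: struct reg_arg, struct pmu_state, NUM_REGS and write_regs() are hypothetical, and the "calls" field merely counts how often the simulated kernel entry point is crossed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define NUM_REGS 8              /* hypothetical number of config registers */

struct reg_arg {                /* hypothetical argument layout */
	uint16_t reg_num;
	uint64_t reg_value;
};

struct pmu_state {
	uint64_t regs[NUM_REGS];
	unsigned long calls;    /* simulated kernel entries */
};

/*
 * One simulated system call that programs n registers.
 * The whole vector is validated first, so a bad entry changes nothing.
 */
static int write_regs(struct pmu_state *s, const struct reg_arg *v, int n)
{
	s->calls++;

	for (int i = 0; i < n; i++) {
		if (v[i].reg_num >= NUM_REGS)
			return -EINVAL;
	}
	for (int i = 0; i < n; i++)
		s->regs[v[i].reg_num] = v[i].reg_value;

	return 0;
}
```

Programming four registers costs one entry with a four-element vector, and four entries when the same function is called once per register; the register contents end up identical either way.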

With many tools, registers are not just set up once. During certain measurements,
data registers may be read multiple times. When you sample or multiplex at
the user level, you do need to reprogram the PMU state and that is on the critical
path.

You do not want a call that programs the entire PMU state all at once either. Many times,
you only want to modify a small subset. Having the full state does also cause some portability
It depends on what you are doing. Here, this was not really necessary. It was
meant to show how you can program the data registers as well. Perfmon2 provides
default values for all data registers. For counters, the value is guaranteed to
be zero.

But it is important to note that not all data registers are counters. That is the
case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as
well, and some may need to be initialized to non zero value, i.e., the IBS sampling
period.

With event-based sampling,  the period is expressed as the number of occurrences
of an event. For instance, you can say: " take a sample every 2000 L2 cache misses".
The way you express this with perfmon2 is ...
From: Andi Kleen
Date: Wednesday, November 14, 2007 - 7:24 am

Hmm, ok for the event notification we would need a nice interface. Still




I think you optimize the wrong thing here.

There are basically two cases I see:

-  Global measurement of lots of things:
Things are slow anyways with large context switch overheads. The 
overheads are large anyways. Doing one or more system calls probably
does not matter much. Most important is a clean interface.

- Exact measurement of the current process. For that you need very
low latencies. Any system call is too slow. That is why CPUs have
instructions like RDPMC that allow reading those registers with
minimal latency in user space. Interface should support those.

Also for this case programming time does not matter too much. You
just program once and then do RDPMC before code to measure and then
afterwards and take the difference. The actual counter setup is out 

Setting period should be a separate call. Mixing the two together into one

I didn't object to providing the initial value -- my example had that.
Just having a separate concept of data registers seems too complicated to me.
You should just pass event types and values and the kernel gives you

And?  You didn't say what the advantage of that is? 

All the approaches add context switch latencies. It is not clear that the separate



Well the system call layer can manage that transparently with a little software state


I disagree. Using RDPMC is essential for at least some of the things I would like
to do with perfmon2. If the interface does not provide it it is useless to me at least.
System calls are far too slow for cycle measurements. 

And when RDPMC is already supported it should be as widely used as possible.

Regarding the portable code problem: of course you would have some header in user space

I think only supporting global and self monitoring as first step is totally fine.

Sure at some point a system call for the more complex cases (also like multiplexing) would
be needed. But I don't think we need it as ...
From: William Cohen
Date: Wednesday, November 14, 2007 - 8:44 am

There are a number of processors that have 32-bit counters such as the IBM power 
processors. On many x86 processors the upper bits of the counter are sign 
extended from the lower 32 bits. Thus, one can only assume the lower 32 bits are 
available. Roll over of values is quite possible (<2 seconds of cycle count), so 

What range of cycles are you interested in measuring? 100's of cycles? A couple 
thousand? Are you just looking at cycle counts or other events?

-Will
-

From: Stephane Eranian
Date: Wednesday, November 14, 2007 - 9:13 am

Exactly, on Intel only the bottom 32 bits are actually usable, the rest is
sign-extension. That's why it is okay for measuring small sections of code,
but that's it. On AMD, I think it is better. On Itanium you get 47 bits' worth.
Don't know about Power or Cell.
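
For reference, a 32-bit counter counting cycles at 3 GHz wraps after roughly 1.4 seconds, so any tool that takes differences of raw reads has to cope with wraparound. The sketch below is not from perfmon2 or any library; counter_delta(), struct virt_count and virt_count_update() are hypothetical names. It shows a wrap-safe difference for a counter of a given usable width, and how such differences can be folded into a 64-bit software total, assuming the counter wraps at most once between two reads.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Difference between two raw reads of a counter with `width` usable
 * bits (1..64), assuming at most one wraparound in between.
 */
static uint64_t counter_delta(uint64_t start, uint64_t end, unsigned width)
{
	uint64_t mask = (width >= 64) ? UINT64_MAX
				      : ((UINT64_C(1) << width) - 1);

	/* unsigned subtraction wraps modulo 2^64; the mask reduces it
	 * to modulo 2^width */
	return (end - start) & mask;
}

/* 64-bit software view of a narrower hardware counter. */
struct virt_count {
	uint64_t total;  /* accumulated 64-bit value */
	uint64_t last;   /* previous raw hardware read */
};

/* Fold a new raw read into the 64-bit total. */
static void virt_count_update(struct virt_count *vc, uint64_t raw,
			      unsigned width)
{
	vc->total += counter_delta(vc->last, raw, width);
	vc->last = raw;
}
```

The update function has to be called at least once per wrap period (for instance from an overflow interrupt or at context switch), otherwise a full wrap goes unnoticed.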

-- 
-Stephane
-

From: Philippe Elie
Date: Wednesday, November 14, 2007 - 11:53 am

On x86 they are sign-extended only on write, on read they are 40 bits wide
for intel, 48 bits for AMD.

BTW, isn't rdpmc only enabled for ring 0 on Linux? I remember a patch
to disable it, dunno if it has been applied.

-- 
Phe

-

From: Andi Kleen
Date: Wednesday, November 14, 2007 - 12:15 pm

Obviously -- without a system call to set up performance counters it
would be fairly useless. But of course once such system calls are in
they should be able to trigger the bit for each process.

-Andi
-

From: Stephane Eranian
Date: Wednesday, November 14, 2007 - 5:07 pm

Andi,


Why do you think the existing interfaces are not a good fit for this?
Is this just because of your problem with file descriptors?

From my experience read(), select(), and SIGIO are fine. I know many tools use that.

As for the file descriptor, you would need to replace that with another identifier of
some sort. As I pointed out in another message on this thread, you don't want to use
a pid-based identifier. This is not usable when you monitor other threads and you
If people do not like vector arguments, then I think I can live with N system calls
to program N registers. Now you have two choices for passing the arguments:

	- a pointer to a struct
		struct pfarg_pmc {
			uint64_t reg_value;
			uint16_t reg_num;
		} pmc0;
		pmc0.reg_num = 0; pmc0.reg_value = 0x1234;
		pfm_write_pmcs(fd, &pmc0);

	- explicitly passing every field:
		pfm_write_pmcs(fd, 0x0, 0x1234);

Given that event set and multiplexing would not be in initially, we would want
to allow for them to be added later without having to create yet another
system call, right?


I am not sure I understand what you mean by 'lots of things'?

I don't have a problem with that. And in fact, I already support that
at least on Itanium. I had that in there for X86 but I dropped it after
you said that you would enable cr4.pce globally. I don't have a problem
Periods are set up via the data register. Given that there is already a call to program
the data register why add another one? You don't need to treat the sampling period
differently from the register value. This is just a value that will cause the register

Should you support a kernel level sampling buffer (like Oprofile) you'd also want
to specify the reset value on overflow. And you would not necessarily want it to
be identical to the initial value (period). So you'd have to have a way to specify that

I am not against providing a flat namespace. But I think it is nice to separate config

Absolutely not, you don't want the kernel to know about events. This has ...
From: Philip Mucci
Date: Tuesday, November 13, 2007 - 11:47 am

Hi folks,

Well, I can say the mood here at supercomputing'07 is pretty somber  
in regards to the latest exchange of messages regarding the perfmon  
patches. Our community has been the largest user of both the PerfCtr  
and the Perfmon patches, the former being regularly installed by  
vendors and integrators on clusters at install time, and the latter  
now being adopted into vendor kernels by IBM, Cray, AMD, SiCortex and  
others. Of course, adoption by a vendor, does not a good kernel patch  
make. However, it should be viewed as a strong data point on demand  
for such functionality. We are a community focused on performance and  
we have long had a need for these tools.

A solution that does not provide 64 bit virtualized per-thread counts  
is not a solution at all. That would need to be ripped out by all of  
us using this functionality so we could get something that actually  
does what the community needs, not what you folks think we need.
Device level access and/or root access to the counters is not
acceptable for machines in production. If that was fine, oprofile
would have satisfied everyone and we wouldn't be sucking up your
bandwidth. Please understand that people outside of your
community are desperate for adoption of any form of 'per-thread' PMU  
functionality into the kernel. For those of you who are (still) not  
convinced of this, I can arrange your inbox to be spammed by 1000's  
of HPC geeks, managers, vendors, etc. My point is, let's start  
somewhere that the community finds useful. Otherwise we run the risk  
of developing an interface that everyone isn't comfortable with and  
no-one uses. Hardly a productive exercise.

So please, do consider a set of core functionality that provides for  
(at least) the following:

- per-CPU and per-thread 64 bit virtualized counts
- third person operation (attach/ptrace)
- dispatch of signal upon interrupt on overflow if requested
- 'buffered' interrupts into a buffer that can be mmap'd into ...
From: Greg KH
Date: Tuesday, November 13, 2007 - 11:59 am

"somber"?

Why?

We (a number of the kernel developers) want to see the perfmon code make
it into the kernel tree; unfortunately, in the current state it is in,
that's not going to happen.

Andi specified a way that this can happen, just refactor your patches
into smaller bits that can be reviewed and applied.

If you, or anyone else has any questions about this, please let us know.
So far, I have not seen any response to his message, so I'm guessing
that the perfmon developers either are off working on this, or don't
care.

And if they don't care, then yes, I agree with your "somber" feeling...

thanks,

greg k-h
-

From: Andrew Morton
Date: Tuesday, November 13, 2007 - 1:07 pm

Well...  Philip is (I assume) a numerical-computing guy and not a
kernel-developing guy (probably a wise choice).

He speaks for quite a few people - they have serious need for this feature
but they've had to scruff around with out-of-tree patches for years to get
it, and still there are problems.

I was hoping that after the round of release-and-review which Stephane,
Andi and I did about twelve months ago that we were on track to merge the
perfmon codebase as-offered.  But now it turns out that the sentiment is
that the code simply has too many bells-and-whistles to be acceptable.

My problem with that sentiment is that it is quite likely the case that
those bells-n-whistles are actually useful and needed features.  Perfmon
has been out there for quite a few years and the code which is in there
_should_ be in response to real-world in-the-field experience.  Such
requirements never go away.


So.  If what I am saying is correct then the best course of action would be
for Stephane to help us all to understand what these features are and why
we need them.  The ideal way in which to do this is

[patch] perfmon: core
[patch] perfmon: whizzy feature #1
[patch] perfmon: whizzy feature #2
[patch] perfmon: whizzy feature #3

etc.  Where the changelog in each whizzy-feature-n explains what it does,
why it does it and why our users need it.

Whatever happens, perfmon is so big and so old and has been out-of-tree for
so long that it's going to take a pile of work from lots of people to get
any of it landed.

-

From: Greg KH
Date: Tuesday, November 13, 2007 - 1:14 pm

I agree.  Right now their git tree has over 80 patches in it, without
descriptions like this to help those of us who want to review and help
out, it is quite difficult.

thanks,

greg k-h
-

From: Andi Kleen
Date: Tuesday, November 13, 2007 - 1:36 pm

> He speaks for quite a few people - they have serious need for this feature

Most likely they have serious need for a very small subset of perfmon2.
The point of my proposal was to get this very small subset in quickly.

Phil, how many of the command line options of pfmon do you
actually use? How many do the people at your conference use? Or what
functions, what performance counters etc. in PAPI or whatever 
library you use? 

Make us understand the use cases better, that would already help a lot
in merging by concentrating on what people actually really need.

-Andi

-

From: Philip Mucci
Date: Tuesday, November 13, 2007 - 5:28 pm

Hi Andi,

pfmon is a single tool and fairly low level, the HPC folks don't use  
it so much because it isn't parallel aware and is meant for power- 
users. It is not representative of the tools used in HPC at all. Our  
community uses tools built on the infrastructure provided by libpfm  
and PAPI for the most part.

I know you don't want to hear this, but we actually use all of the  
features of perfmon, because a) we wanted to use the best methods  
available and b) areas where user level solutions could be made (like  
multiplexing) introduced too much noise and overhead to be of use.  
For years we relied on PerfCtr which did 'just enough' for us. But  
when Perfmon2 became available, we adopted technology where it meant  
a significant increase in accuracy for the resulting measurements,  
specifically for us that meant, kernel multiplexing and sample buffers.
Note that PAPI is just middleware. The tools built upon it are what  
people use...some of those are commercial tools like Vampir but most  
are Open Source. These tools are cross platform, as such they run on  
nearly everything...although intel/amd/ppc systems dominate the HPC  
market.

The usage cases are always the same and can be broken down into  
simple counting and sampling:

	- providing virtualized 64-bit counters per-thread
	- providing notification (buffered or non) on interrupt/overflow of  
the above.

If you'd like to outline further what you'd like to hear from the  
community, I can arrange that. I seem to remember going through this  
once before, but I'd be happy to do it again. For reference, here's a  
quick list from memory of some of the tools in active use and built  
on this infrastructure. These are used heavily around the globe.  
You'll see that each basically follows one of the 2 usage models above.

- HPCToolkit (Rice)
- PerfSuite (NCSA)
- Vampir (Dresden)
- Kojak (Juelich)
- TAU (UOregon)
- PAPIEX (me)
- GPTL (NCAR)
- HPM-Linux (IBM)
- Paraver (Barcelona)

Time to go give ...
From: Andi Kleen
Date: Tuesday, November 13, 2007 - 6:52 pm

That is hard to believe.

But let's go for it temporarily for the argument. 

Can you instead prioritize features.  What is most essential, what is 


Ok that makes sense and should be possible with a reasonably simple

Please list concrete features, throwing around random names is not useful.

-Andi

-

From: Philip Mucci
Date: Friday, November 16, 2007 - 2:18 am

Just getting back to this now that SC07 is finally over...


You are welcome to download the code and some of the tools and verify  

Yes, although this has been done before. You've got the list below in
the previous emails which should be considered the absolute minimum.

- A feature which was dropped earlier by Stephane (only to satiate
LKML), we consider very important. Allowing one to map the kernel's
view of the PMDs, allowing user-space access to full 64-bit counts, if
the architecture supports a user-level read instruction. Getting the
counts in a couple of dozen cycles is ALWAYS a win for us. This is
because the HPC community is mainly interested in self-monitoring, not
third-party, because the former can be easily associated with context
in the app through instrumentation in various forms.

- Kernel multiplexing is very nice to have, saves you tremendous
overhead at user level. PAPI has an implementation in user-space for
the platforms that don't support this. The flexibility of the current
implementation is not exploited, here I'm referring to the concept of
eventsets. Having multiplexing is important. Being able to
allocate/reallocate eventsets and the threshold of individual
eventsets is just nice to have.

- Custom sample formats would be considered not often used in our
community, largely because the tools run on all HPC/Linux
architectures. PAPI uses the default sample format which has been
sufficient for our needs. However, the lack of custom sample formats
precludes the development of the specialized tools that access the
sampling hardware as found on the IA64, PPC64, the Barcelona and the
SiCortex node chip.

Well that's good news. The above is what we have used via the PerfCtr
set of patches for a long time. It wasn't quite enough, but it got the job

This is the kind of comment that makes the Linux/HPC folks 'somber'. What  
isn't useful, is being dismissive of an entire community that moves a  
heck of a lot of ...
From: Andi Kleen
Date: Friday, November 16, 2007 - 8:15 am

I didn't see a clear list. 

My impression so far is that you're not quite sure what you want,

You mean returning the register number for RDPMC or equivalent
and a way to enable it for ring 3 access? 

I'm considering that an essential feature too. I wasn't aware

Yes it is for everybody. I've been rather questioning if the slow
ways (complicated syscalls) to get the counter information are really 


What do you mean with custom sample formats exactly?  What information
do you want in there? And why?

e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
so the only way to get a custom format would be to use a separate buffer.

I can think of one reason why the kernel should add more information
in a separate buffer (log the instruction bytes so that it can
be disassembled and an address histogram be generated using the PEBS
register values), but it is a relatively obscure one and definitely
not an essential feature. Unfortunately it is also hard to implement completely

Sorry, but these kind of non technical BS arguments will just make
you be ignored in mainline Linux lands. They might work if you pay
a lot of money to specific Linux companies (do you?), but here
on linux-kernel you have to convince with purely technical arguments.

-Andi
-

From: Stephane Eranian
Date: Friday, November 16, 2007 - 9:00 am

Andi,

No, he is talking about something similar to what was in perfctr.
The kernel emulates 64-bit counters in software and that is what you
get back when you read the counters. If you read via RDPMC, you
get 40 bits. To reconstruct the full 64-bit value from user land
you need the upper bits. One approach is for the kernel to allow
you to remap a page that has the 64-bit (software) counters. With
Perfmon2 allows you to have an in-kernel sampling buffer. The idea is
not new, Oprofile has this as well. The problem here is that if the
buffer is in the kernel the format of the samples is fixed and it
should not have to be. Tools may want to record samples in different formats
and as you said some may need extra information gathered in the kernel.
Some may want to aggregate samples in the kernel (Oprofile used to
do that), some may want to use a double-buffer approach to minimize
blind spots, others may simply use the counter overflow mechanism to
record something that is non-PMU related, e.g, kernel call stack.
I have built such a module and it was quite interesting to collect
the call stack when you hit a last cache level miss.
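
A minimal sketch of the 64-bit reconstruction described above, assuming
the kernel publishes its software-maintained count in a page mapped into
the process and the user-level read instruction supplies only the low
bits. The function name, the layout of the published value and the width
parameter are hypothetical; none of them come from perfctr or perfmon2.
A real implementation would also have to cope with the hardware counter
wrapping between the two reads, for example by re-checking a sequence
number.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the kernel's software count with a raw hardware read.
 *
 * soft:    value read from the (hypothetical) shared page; only the bits
 *          above the hardware width are used.
 * hw_raw:  value returned by the user-level read instruction.
 * hw_bits: number of valid hardware bits (40 in the message above,
 *          32 on other implementations).
 */
static uint64_t pmd_full_count(uint64_t soft, uint64_t hw_raw, unsigned hw_bits)
{
	assert(hw_bits > 0 && hw_bits < 64);

	uint64_t mask = (UINT64_C(1) << hw_bits) - 1;

	/* Upper bits from software, lower bits from hardware. */
	return (soft & ~mask) | (hw_raw & mask);
}
```

The caller never enters the kernel on this path, which is the point of
the feature: the cost is one memory load plus one counter read.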

The idea behind customizable sampling format is simple: extract the
format from the perfmon core and put this into a kernel module. The
core provides a simple registration mechanism and the two communicate
via a set of callbacks.

Perfmon2 comes with a basic default format which works on all
platforms. But it is possible to develop others without having to
patch the kernel nor recompile nor reboot. At its core, each format provides
a handler routine which is called on counter overflow. The handler routine
controls what is recorded, how it is recorded, how it is exported to
userland, and whether overflow notifications need to be sent.
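
The registration-plus-callback arrangement described above could be
sketched roughly as follows. This is an illustration only: every name
here (smpl_fmt, fmt_register, core_overflow and so on) is hypothetical
and is not the perfmon2 interface. The sketch shows a core that knows
nothing about the sample layout and a format that decides what is
recorded and when userland is notified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* What the core hands to a format on each counter overflow. */
struct smpl_event {
	uint64_t ip;        /* interrupted instruction pointer */
	unsigned ovfl_pmd;  /* index of the counter that overflowed */
};

/* A sampling format: a name plus the overflow callback. */
struct smpl_fmt {
	const char *name;
	/* Returns nonzero when an overflow notification should be sent. */
	int (*handler)(void *state, const struct smpl_event *ev);
};

#define MAX_FMTS 8
static const struct smpl_fmt *fmt_table[MAX_FMTS];

/* Register a format; fails on a duplicate name or a full table. */
static int fmt_register(const struct smpl_fmt *fmt)
{
	assert(fmt && fmt->name && fmt->handler);
	for (size_t i = 0; i < MAX_FMTS; i++)
		if (fmt_table[i] && strcmp(fmt_table[i]->name, fmt->name) == 0)
			return -1;
	for (size_t i = 0; i < MAX_FMTS; i++) {
		if (!fmt_table[i]) {
			fmt_table[i] = fmt;
			return 0;
		}
	}
	return -1;
}

/* Look a format up by name, as a context-creation call might. */
static const struct smpl_fmt *fmt_find(const char *name)
{
	for (size_t i = 0; i < MAX_FMTS; i++)
		if (fmt_table[i] && strcmp(fmt_table[i]->name, name) == 0)
			return fmt_table[i];
	return NULL;
}

/* Core side: the overflow path only forwards to the format. */
static int core_overflow(const struct smpl_fmt *fmt, void *state,
			 const struct smpl_event *ev)
{
	return fmt->handler(state, ev);
}

/* Example format: remember the last IP, notify on every 4th sample. */
struct count_state {
	unsigned n;
	uint64_t last_ip;
};

static int count_handler(void *state, const struct smpl_event *ev)
{
	struct count_state *s = state;

	s->last_ip = ev->ip;
	s->n++;
	return (s->n % 4) == 0;
}

static const struct smpl_fmt count_fmt = { "count4", count_handler };
```

A second format with a different record layout would register the same
way, without any change to core_overflow().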

Using this mechanism, for instance, we were able to connect the
Oprofile kernel code to perfmon2 on Itanium with a 100 lines of

This is also how we support PEBS because, as you said, the format of the
samples is not under your ...
From: Andi Kleen
Date: Friday, November 16, 2007 - 9:28 am

You mean the page contains the upper [40;63] bits? 

Sounds reasonable, although I don't remember seeing that when I looked


... you also didn't say *why* that is needed.

Can you give a concrete use case for something that cannot be done

The existing oprofile code works already fine on x86, no real

Exactly that makes the support for random custom buffers questionable.

e.g. as I can see the main advantage of perfmon over existing setups
is that it supports PEBS etc., but with your custom buffer formats which
are by definition incompatible with PEBS you would negate that advantage
again.


Why this insistence against changing anything?

-Andi
-

From: William Cohen
Date: Friday, November 16, 2007 - 10:13 am

Upper 32 bits ([32:63]). On many implementations only the lower 32 bits are
available in the register. Bits 32:40 in several x86 processor implementations
cannot be set to anything other than the sign extension of bit 32. On

OProfile is very useful in many cases, but it only performs sampling. If one wants
to take a look at the number of events a specific section of code causes, one can't
really do that with oprofile. The counters are running systemwide, not per
thread. For some experiments developers really like to have per thread counters.

The rewrite of oprofile to use the perfmon code was to consolidate code using 
the performance monitoring hardware. Use one interface for accessing the 
performance monitoring hardware rather than have one for sampling and another 

So the alternative approach is to write a new device driver for each of the new 
performance monitoring mechanisms, e.g. one for PEBS and another for IBS?

One of the reasons for the custom sample buffers was to avoid having an expensive
user-space signal for a process to record some simple pieces of data each time
the data becomes available. For the oprofile port to the perfmon2 custom buffer
mechanism the instruction pointer and the counter that overflowed are
recorded. The buffer can be processed in one large chunk by userspace, reducing
overhead. In essence the current implementation of OProfile in the mainline
kernels has a custom buffer mechanism.

-Will

-

From: Stephane Eranian
Date: Friday, November 16, 2007 - 10:36 am

Andi,

Do you question why Oprofile has one ;->

But I am happy to explain.

With sampling, you want to record information about the execution of a
thread at some interval. The interval could be expressed as time or
number of occurrences of a PMU event.

Typically you get a notification. Then you need to collect certain 
information about the execution. Typically you record the instruction
pointer (e.g. Oprofile), but you may want to record the value of other
counters, PMU registers or other HW/SW resources. While you're doing
this monitoring is typically stopped so you get a consistent view. After
you're done recording you need to re-arm the sampling period. If you
use event-based sampling, you need to reprogram the counter(s). Then
you resume monitoring. You have to repeat this process for each sample
regardless of whether you are self-monitoring, monitoring another thread,
or monitoring a CPU.

Such a sequence of operations is quite expensive, especially in the case
where you are monitoring another thread, because it incurs at least
a couple of context switches per sample in addition to the various
register manipulations and syscalls.

The idea with the kernel sampling buffer is that you amortize the
cost of notification to userland over LOTS of samples. On counter
overflow, the kernel records the samples on your behalf. There is
no context switch, samples are always recorded in the context of
the monitored thread.

Now, you need a bit more information for this to work correctly
because the kernel records on *your behalf*,  thus
you need to express:
	- what you want to see recorded

	- the value to reload into the overflowed counter(s)
	  so the kernel can re-arm the next period.
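
A rough sketch of that amortization, under stated assumptions: the
structures and function names are hypothetical, the counter is modelled
as a plain 64-bit value that overflows when it wraps to zero, and the
"notification" is just the return value. Each overflow records one
sample and re-arms the counter; userland is only involved when the
buffer fills.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sample {
	uint64_t ip;         /* where the monitored thread was */
	uint64_t pmd_value;  /* one extra register the user asked for */
};

struct smpl_buf {
	struct sample *entries;
	size_t capacity;
	size_t count;
};

struct counter {
	uint64_t value;   /* current counter contents */
	uint64_t reload;  /* value written back after each overflow */
};

/* A period of N events means starting N short of the wrap to zero. */
static uint64_t period_to_reload(uint64_t period)
{
	return (uint64_t)0 - period;
}

/*
 * Overflow path: record, re-arm, and report whether the buffer is now
 * full (the only point at which userland would need to be notified).
 */
static int on_overflow(struct smpl_buf *buf, struct counter *c,
		       uint64_t ip, uint64_t pmd_value)
{
	assert(buf->count < buf->capacity);

	buf->entries[buf->count].ip = ip;
	buf->entries[buf->count].pmd_value = pmd_value;
	buf->count++;

	c->value = c->reload;  /* re-arm the next sampling period */

	return buf->count == buf->capacity;
}

/* Userland side: consume everything at once and hand the buffer back. */
static size_t smpl_buf_drain(struct smpl_buf *buf)
{
	size_t n = buf->count;

	buf->count = 0;
	return n;
}
```

With a buffer of a few thousand entries, the notification cost is paid
once per buffer rather than once per sample.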

Because you have multiple counters, you may use them for sampling
periods, i.e., overlap sampling measurements. That is something
done very frequently.

For instance, the q-syscollect tool that D. Mosberger wrote, is
overlapping elapsed cycles and branch trace buffer (BTB) sampling
to ...
From: dean gaudet
Date: Friday, November 16, 2007 - 10:51 am

- cross platform extensible API for configuring perf counters
- support for multiplexed counters
- support for virtualized 64-bit counters
- support for PC and call graph sampling at specific intervals
- support for reading counters not necessarily with sampling
- taskswitch support for counters
- API available from userland
- ability to self-monitor: need select/poll/etc interface
- support for PEBS, IBS and whatever other new perf monitoring 
  infrastructure the vendors throw at us in the future
- low overhead:  must minimize the "probe effect" of monitoring
- low noise in measurements:  cannot achieve this in userland

perfmon2 has all of this and more i've probably neglected...

-dean
-

From: David Miller
Date: Friday, November 16, 2007 - 5:29 pm

From: dean gaudet <dean@arctic.org>

I want to state that even though I've been a stickler on the system
call stuff, in general I want to see perfmon2 go into tree and I agree
with how most of the infrastructure is implemented and the features it
provides.
-

From: Greg KH
Date: Friday, November 16, 2007 - 6:07 pm

Now if we only had a series of patches that we could actually review and
apply to the -mm tree so that people can try them out... :)

thanks,

greg k-h
-

From: Philip Mucci
Date: Friday, November 16, 2007 - 1:16 pm

I suppose by complicated here, you're referring to the gather semantics
of the pfm_read/write_pmds/pmcs calls. Many processors may have 100's
of registers (IA64, BG/P, SiCortex), some of which have different
access times. So a naive syscall of 'give me all the registers you've
got' isn't going to cut it. However, any additional simplicity
(performance) we can squeeze out of this particular primitive is a huge
win as it sits in the critical path of the user

Performance and noise. See the earlier message about our user-land  
implementation versus kernel mode implementations. At any useful  
granularity, you begin to seriously affect the counts with noise as  
well as dilate the run-time. But let's punt on this one until after  

By custom here, I mean the ability to have the kernel take samples
containing more than just the IP, the PID and a bitmask of which
registers overflowed at this point. Myself and others have worked hard
to get effective address sampling into the hardware (there are
registers that contain EA's of misses as well as branch mispredict data
on the PPC, IA64, Barcelona and SiCortex) that are handled through the
use of a format that gathers up that information at interrupt time for
deposit into the sample buffer. We are not wedded to Perfmon2's
implementation of these formats, we are however, wedded to having this
information collected at interrupt time as the data may change by the
time you get back to user-mode. This hardware is not obscure any more,
it's the norm, as we've learned at thus simple aggregate counters, even
those with precise

I love it when kernel folks refer to their own revenue streams
(and yes, we do, ask your VP of sales) and the needs of a user
community as "BS non-technical arguments".

But let's get back to basics here. We can sort that out over a beer
sometime. At this point, let's try and agree on the minimum set of
functionality acceptable for a first round of patches.

- per-CPU ...
From: David Miller
Date: Friday, November 16, 2007 - 5:15 pm

From: Andi Kleen <andi@firstfloor.org>

I would like to add sparc64 support to perfmon2 as well
and therefore I've been considering this angle of the
API issues as well.

The counters on sparc64 can be configured to be readable by userspace,
so for the self-monitoring cases I really would like to make sure the
perfmon2 library interface could use direct reads for sampling instead
of system calls or specialized traps.

If I get some spare time I'll look at the current perfmon2 patches
and see if I can toss together sparc64 support to get a feel for
how things stand currently.
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 12:24 am

Whose sentiment?

I've had a bit of a look at it today together with David Gibson.  Our
impression is that the latest version is a lot cleaner and simpler
than it used to be.  I'm also reading Stephane's technical report
which describes the interface, and whilst I'm only part-way through
it, I haven't seen anything yet which strikes me as unnecessary or
overly complicated.

Paul.
-

From: Andrew Morton
Date: Wednesday, November 14, 2007 - 12:40 am

Yes, that's quite possible.  I don't know how up-to-date people's
knowledge is.  I know I haven't looked seriously at the code in around
twelve months.

Let's get it on the wires as outlined and take a look at it all.
-

From: Christoph Hellwig
Date: Wednesday, November 14, 2007 - 3:38 am

Mine for example.  The whole userspace interface is just on crack,
and the code is full of complexities as well.

-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 3:43 am

Could you give some _technical_ details of what you don't like?

Paul.
-

From: Christoph Hellwig
Date: Wednesday, November 14, 2007 - 4:00 am

I've done this a gazillion times before, so maybe instead of being a lazy
bastard you could look up the mailing list archive.  It's not like this is the
first discussion of perfmon.  But to get started, look at the system calls,
many of them are beasts like:

  int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)

This is basically a read(2) (or for other syscalls a write) on something
else than the file descriptor provided to the system call.   The right thing
to do is obviously have a pmds and pmcs file in procfs for the thread being
monitored instead of these special-case files, with another set for global
tracing.  Similarly I'm pretty sure we can get a much better interface
if we introduce matching files in procfs for the other calls.

-

From: David Miller
Date: Wednesday, November 14, 2007 - 4:12 am

From: Christoph Hellwig <hch@infradead.org>

This is my impression too, all of the things being done with
a slew of system calls would be better served by real special
files and appropriate fops.  Whether the thing is some kind
of misc device or procfs is less important than simply getting
away from these system calls.
-

From: David Miller
Date: Wednesday, November 14, 2007 - 4:14 am

Ok, I just got 4 freakin' bounces from all of these subscriber only
perfmon etc. mailing lists.

Please remove those lists from the CC: as it's pointless for those of
us not on the lists to participate if those lists can't even see the
feedback we are giving.

-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 4:44 am

Special files and fops really only work well if you can coerce the
interface into one where data flows predominantly one way.  I don't
think they work so well for something that is more like an RPC across
the user/kernel barrier.  For that a system call is better.

For instance, if you have something that kind-of looks like

	read_pmds(int n, int *pmd_numbers, u64 *pmd_values);

where the caller supplies an array of PMD numbers and the function
returns their values (and you want that reading to be done atomically

Why?  What's inherently offensive about system calls?

Paul.
-

From: Nick Piggin
Date: Tuesday, November 13, 2007 - 4:49 pm

Could you implement it with readv()?
-

From: David Miller
Date: Wednesday, November 14, 2007 - 4:58 am

From: Nick Piggin <nickpiggin@yahoo.com.au>

Sure, why not?  Just cook up an iovec.  pmd_numbers goes to offset
X and pmd_values goes to offset Y, with some helpers like what
we have in the networking already for recvmsg.

But why would you want readv() for this?  The syscall thing
Paul asked me to translate into a read() doesn't provide
iovec-like behavior so I don't see why readv() is necessary
at all.
-

From: Nick Piggin
Date: Tuesday, November 13, 2007 - 5:25 pm

Ah sorry, that's what I get for typing before I think: of course
readv doesn't vectorise the right part of the equation.

What I really mean is a readv-like syscall, but one that also
vectorises the file offset. Maybe this is useful enough as a generic
syscall that also helps Paul's example...

Of course, I guess this all depends on whether the atomicity is an
important requirement. If not, you can obviously just do it with
multiple read syscalls...
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 2:30 pm

I've sometimes thought it would be useful to have a "transaction"
system call that is like a write + read combined into one:

	int transaction(int fd, char *req, size_t req_nb,
			char *reply, size_t reply_nb);

as a way to provide a general request/reply interface for special

That would take N system calls instead of one, which could have a
performance impact if you need to read the counters frequently (which
I believe you do in some performance monitoring situations).

Paul.
-

From: Nick Piggin
Date: Wednesday, November 14, 2007 - 3:17 am

Maybe not a bad idea, though I'm not the one to ask about taste ;)
In this case, it is enough for your requests to be a set of scalars
(eg. file offsets), so it _could_ be handled with vectorised offsets...

But in general, for special files, I guess the response is usually
some structured data (that is not visible at the syscall layer).
So I don't see a big problem to have a similarly arbitrarily

That's true too.
-

From: Chuck Ebbert
Date: Wednesday, November 14, 2007 - 3:56 pm

IOW, an ioctl.
-

From: Nick Piggin
Date: Wednesday, November 14, 2007 - 4:03 am

In the same way a read of structured data from a special file
"is an" ioctl, yeah. You could implement either with an ioctl.

The main difference is they have more explicitly typed interfaces.
Whether that's enough argument (and if Paul's proposal is widely
usable enough) is another question. Which I won't try to answer.
-

From: David Miller
Date: Wednesday, November 14, 2007 - 4:52 am

From: Paul Mackerras <paulus@samba.org>

The same way we handle some of the multicast "getsockopt()"
calls.  The parameters passed in are both inputs and outputs.

For the above example:

	struct pmd_info {
		int *pmd_numbers;
		u64 *pmd_values;
		int n;
	} *p;

	buffer_size = N;
	p = malloc(buffer_size);
	p->pmd_numbers = p + foo;
	p->pmd_values = p + bar;
	p->n = whatever(N);
	err = read(fd, p, N);

It's definitely doable, use your imagination.

You can encode all kinds of operation types into the
header as well.
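
One way to make the in/out buffer above concrete, as a sketch only. It
assumes a header that carries byte offsets rather than pointers, so the
whole request lives in a single allocation; the struct layout and every
function name are hypothetical, and fake_pmd_read() merely stands in for
the kernel side of read(), filling in made-up values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header at the start of the buffer handed to read(). */
struct pmd_req {
	uint32_t n;            /* in: how many registers are requested */
	uint32_t numbers_off;  /* in: byte offset of uint32_t numbers[n] */
	uint32_t values_off;   /* in: byte offset of uint64_t values[n] */
	uint32_t reserved;
};

/* values[] follows numbers[], rounded up to 8-byte alignment. */
static size_t values_offset(uint32_t n)
{
	size_t end = sizeof(struct pmd_req) + (size_t)n * sizeof(uint32_t);

	return (end + 7u) & ~(size_t)7u;
}

static size_t pmd_req_size(uint32_t n)
{
	return values_offset(n) + (size_t)n * sizeof(uint64_t);
}

/* Build a request in one allocation; the caller frees it. */
static void *pmd_req_alloc(uint32_t n, const uint32_t *numbers)
{
	unsigned char *buf = calloc(1, pmd_req_size(n));
	struct pmd_req *req = (struct pmd_req *)buf;

	if (!buf)
		return NULL;
	req->n = n;
	req->numbers_off = (uint32_t)sizeof(*req);
	req->values_off = (uint32_t)values_offset(n);

	uint32_t *dst = (uint32_t *)(buf + req->numbers_off);
	for (uint32_t i = 0; i < n; i++)
		dst[i] = numbers[i];
	return buf;
}

/*
 * Stand-in for the device's read(): validate the header, then fill
 * values[] in place. Returns the bytes "read", or -1 on a bad request.
 */
static long fake_pmd_read(void *buf, size_t len)
{
	unsigned char *base = buf;
	struct pmd_req *req = buf;

	if (len < sizeof(*req))
		return -1;
	if (req->numbers_off != sizeof(*req) ||
	    req->values_off != values_offset(req->n) ||
	    len < pmd_req_size(req->n))
		return -1;

	const uint32_t *nums = (const uint32_t *)(base + req->numbers_off);
	uint64_t *vals = (uint64_t *)(base + req->values_off);

	for (uint32_t i = 0; i < req->n; i++)
		vals[i] = (uint64_t)nums[i] * 1000u;  /* placeholder count */
	return (long)pmd_req_size(req->n);
}

/* Fetch result i after a successful read. */
static uint64_t pmd_req_value(const void *buf, uint32_t i)
{
	const unsigned char *base = buf;
	const struct pmd_req *req = buf;

	assert(i < req->n);
	return ((const uint64_t *)(base + req->values_off))[i];
}
```

Offsets are used here instead of the embedded pointers in the message
because they keep the buffer position-independent and the same size for
32-bit and 64-bit callers.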

Another alternative is to use generic netlink.
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 5:03 am

You're suggesting that the behaviour of a read() should depend on what
was in the buffer before the read?  Gack!  Surely you have better
taste than that?

Or are you saying that a read (or write) has a side-effect of altering
some other area of memory besides the buffer you give to read()?  That

Then you end up with two system calls to get the data rather than one
(one to send the request and another to read the reply).  For
something that needs to be quick that is a suboptimal interface.

Paul.
-

From: David Miller
Date: Wednesday, November 14, 2007 - 5:07 am

From: Paul Mackerras <paulus@samba.org>

Absolutely that's what I mean, it's atomic and gives you exactly what
you need.

I see nothing wrong or gross with these semantics.  Nothing in the
"book of UNIX" specifies that for a device or special file the passed

Not necessarily, consider the possibility of using recvmsg() control
message data.  With that it could be done in one go.

This also suggests that it could be implemented as its own protocol
family.
-

From: Nick Piggin
Date: Tuesday, November 13, 2007 - 5:28 pm

True, but is it now any different from an ioctl?
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 2:50 pm

Ohhhhh.... kayyyyy.... *shudders*

It really violates the abstract model of "read" pretty badly.  "Read"
is "fill in the buffer with data from the device", not "do some
arbitrary stuff with this area of memory".

I'd prefer to have a transaction() system call like I suggested to

There's all sorts of possible ways that it could be implemented.  On
the one hand we have an actual proposed implementation, and on the
other we have various people saying "oh but it could be implemented
this other way" without providing any actual code.

Now if those people can show that their way of doing it is
significantly simpler and better than the existing implementation,
then that's useful.  I really don't think that doing a whole new
net protocol family is a simpler and better way of doing a performance
monitor interface, though.

Paul.
-

From: David Miller
Date: Wednesday, November 14, 2007 - 4:03 pm

From: Paul Mackerras <paulus@samba.org>

So much for getting rid of the extra system calls...
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 4:12 pm

*I* never had a problem with a few extra system calls.  I don't
 understand why you (apparently) do.

Paul.
-

From: David Miller
Date: Wednesday, November 14, 2007 - 4:21 pm

From: Paul Mackerras <paulus@samba.org>

We're stuck with them forever, they are hard to version and extend
cleanly.

Those are my main objections.
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 6:11 pm

The first is valid (for suitable values of "forever") but applies to
any user/kernel interface, not just system calls.

As for the second (hard to version) I don't see why it applies to
syscalls specifically more than to other interfaces.  It's just a
matter of designing it correctly in the first place.  For example, the
sys_swapcontext system call we have on powerpc takes an argument which
is the size of the ucontext_t that userland is using, which allows us
to extend it in future if necessary.  (Note that I'm not saying that
the current perfmon2 interfaces are well-designed in this respect.)

The third (hard to extend cleanly) is a good point, and is a valid
criticism of the current set of perfmon2 system calls, I think.
However, the goal of being able to extend the interface tends to be in
opposition to the goal of having strong typing of the interface.
Things like a multiplexed syscall or an ioctl are much easier to
extend but that is at the expense of losing strong typing.  Something
like my transaction() (or your weird kind of read() :) also provides
extensibility but loses type safety to some degree.

Also, as Andi says, this is core CPU state that we are dealing with,
not some I/O device, so treating the whole of perfmon2 (or any
performance monitoring infrastructure) as a driver doesn't fit very
well, and in fact system calls are appropriate.  Just like we don't
try to make access to debugging facilities fit into a driver, we
shouldn't make performance monitoring fit into a driver either.

Paul.
-

From: David Miller
Date: Wednesday, November 14, 2007 - 6:27 pm

From: Paul Mackerras <paulus@samba.org>

I disagree.

With netlink we can just add new attributes when a new need arises for
a particular interface.  The attribute code describes the type
precisely, so there is no loss of strong typing at all.
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 7:34 pm

Well you must mean something different by "strong typing" from the
rest of us.  Strong typing means that the compiler can check that you
have passed in the correct types of arguments, but the compiler
doesn't have any visibility into what structures are valid in netlink
messages.

In any case, I think that adding a structure size argument to the
current perfmon2 system calls where appropriate would mean that we
could extend them cleanly later on if necessary.  It would mean that
we could add fields at the end, and that the kernel could know what
version of the structures that userspace was using.
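
A sketch of how a structure-size argument lets an interface grow at the
end, under stated assumptions: the two structure versions and
copy_arg_in() are hypothetical, not perfmon2 definitions, and a plain
memcpy stands in for the copy from user space. Older callers pass a
smaller size and get zeroes for the fields they do not know about; newer
callers are accepted only if the fields this side does not know about
are zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* First published layout. */
struct pfarg_v1 {
	uint32_t reg_num;
	uint32_t flags;
	uint64_t value;
};

/* Later layout: one field appended, nothing reordered. */
struct pfarg_v2 {
	uint32_t reg_num;
	uint32_t flags;
	uint64_t value;
	uint64_t reset_value;
};

/*
 * Copy a caller's argument of caller-declared size into the newest
 * layout. Returns 0 on success, -1 if the size cannot be handled.
 */
static int copy_arg_in(struct pfarg_v2 *dst, const void *src, size_t user_size)
{
	assert(dst && src);

	if (user_size < sizeof(struct pfarg_v1))
		return -1;  /* smaller than any version ever published */

	if (user_size > sizeof(*dst)) {
		/* Newer caller: unknown trailing fields must be unused. */
		const unsigned char *p = src;

		for (size_t i = sizeof(*dst); i < user_size; i++)
			if (p[i] != 0)
				return -1;
		user_size = sizeof(*dst);
	}

	memset(dst, 0, sizeof(*dst));  /* fields the caller lacks read as 0 */
	memcpy(dst, src, user_size);
	return 0;
}
```

The size doubles as the version indicator mentioned above, as long as
fields are only ever added at the end.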

Paul.
-

From: Herbert Xu
Date: Thursday, November 15, 2007 - 12:48 am

That's strong static typing.  Netlink is 90% strong static
typing plus 10% strong dynamic typing.  That is, it'll tell
you at run-time if you give it the wrong netlink attribute.

The types within each netlink attribute are checked at compile
time.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-

From: Andi Kleen
Date: Thursday, November 15, 2007 - 1:19 am

Well it tells you EINVAL no matter what is wrong.

That's roughly similar to a compiler whose only error message
is 'WRONG'. Or the ed school of error reporting.

That makes any checking it does barely useful.

-Andi
-

From: David Miller
Date: Monday, November 19, 2007 - 6:08 am

Instead of blabbering further about this topic, I decided to put my
code where my mouth is and spent the weekend porting the perfmon2
kernel bits, and the user bits (libpfm and pfmon) to sparc64.

As a result I've found that perfmon2 is quite nice and allows
incredibly useful and powerful tools to be written.  The syscalls
aren't that bad and really I see no reason to block its inclusion.

I rescind all of my earlier objections, let's merge this soon :-)
-

From: Stephane Eranian
Date: Monday, November 19, 2007 - 1:53 pm

David,


I appreciate your effort. I am glad to see that the interface
and implementation survived yet another architecture. I think at this
point ARM is the only major architecture missing. In any case, I would

As I said earlier, I am not opposed to changing the syscalls. I have
proposed a few schemes to address the issue of versioning. If vector
arguments are problematic, we can go with single register/call.

I think there are other areas where perfmon2 could benefit from the

Thanks.

-- 
-Stephane
-

From: David Miller
Date: Monday, November 19, 2007 - 5:55 pm

From: Stephane Eranian <eranian@hpl.hp.com>

I sent these to Philip Mucci late last night, but in the meantime
I finished implementing breakpoint support as well for pfmon.

Let me clean up my diffs and I'll send it all out to you in a
few hours.
-

From: Paul Mackerras
Date: Monday, November 19, 2007 - 2:43 pm

Strongly agree.  However, I think we need to add structure size
arguments to most of the syscalls so we can extend them later.
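
A minimal sketch of the structure-size idea, written as plain userspace C.
The struct layouts and the helper name copy_sized_arg() are hypothetical
and are not part of perfmon2. The assumption is that structures only grow
at the end: an older caller's missing fields read as zero, and a newer
caller is rejected only if it sets fields the kernel does not know about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument layouts: v2 appends 'flags' at the end of v1. */
struct demo_arg_v1 { uint64_t reg_num; uint64_t value; };
struct demo_arg_v2 { uint64_t reg_num; uint64_t value; uint64_t flags; };

/*
 * Copy a caller structure of 'usize' bytes into the callee's view of
 * 'ksize' bytes. Returns 0 on success, -1 if the caller filled in
 * trailing fields that this callee does not understand.
 */
static int copy_sized_arg(void *dst, size_t ksize,
                          const void *src, size_t usize)
{
    const unsigned char *p = src;
    size_t n = usize < ksize ? usize : ksize;
    size_t i;

    assert(dst != NULL && src != NULL);

    /* Newer caller, older callee: the unknown tail must be all zero. */
    for (i = ksize; i < usize; i++)
        if (p[i] != 0)
            return -1;

    /* Older caller, newer callee: missing fields default to zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, n);
    return 0;
}
```

With this shape, a v1 caller keeps working against a v2 callee, and a v2
caller works against a v1 callee as long as it leaves 'flags' at zero.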

Also, something I've been meaning to mention to Stephane is that the
use of the cast_ulp() macro in perfmon is bogus and won't work on
32-bit big-endian platforms such as ppc32 and sparc32.  On such
platforms you can't take a pointer to an array of u64, cast it to
unsigned long * and expect the kernel bitmap operations to work
correctly on it.  At the least you also need to XOR the bit numbers
with 32 on those platforms.  Another alternative is to define the
bitmaps as arrays of bytes instead, which eliminates all byte ordering
and wordsize problems (but makes it more tricky to use the kernel
bitmap functions directly).

Paul.
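
To make the bit-numbering problem concrete, here is a small sketch that
simulates a 32-bit big-endian machine in portable C. All function names
are made up for the illustration. It lays out an array of u64 in
big-endian byte order, then tests bits the way bitmap code working on
32-bit unsigned long words would after the cast.

```c
#include <assert.h>
#include <stdint.h>

#define NWORDS64 2

/* Store a u64 array into memory the way a big-endian CPU would. */
static void store_be64(unsigned char *mem, const uint64_t *v, int n)
{
    int i, b;

    for (i = 0; i < n; i++)
        for (b = 0; b < 8; b++)
            mem[i * 8 + b] = (unsigned char)(v[i] >> (56 - 8 * b));
}

/* test_bit() as seen through a cast to 32-bit big-endian unsigned long. */
static int test_bit_be32(const unsigned char *mem, unsigned int nr)
{
    const unsigned char *w = mem + (nr / 32) * 4;
    uint32_t word = ((uint32_t)w[0] << 24) | ((uint32_t)w[1] << 16) |
                    ((uint32_t)w[2] << 8) | (uint32_t)w[3];

    return (int)((word >> (nr % 32)) & 1);
}

/*
 * Set bit 'nr' in a u64 bitmap, then report whether the 32-bit view
 * finds a set bit at position 'probe'.
 */
static int bit_visible_at(unsigned int nr, unsigned int probe)
{
    uint64_t map[NWORDS64] = { 0, 0 };
    unsigned char mem[NWORDS64 * 8];

    assert(nr < NWORDS64 * 64 && probe < NWORDS64 * 64);
    map[nr / 64] |= (uint64_t)1 << (nr % 64);
    store_be64(mem, map, NWORDS64);
    return test_bit_be32(mem, probe);
}
```

In this simulation bit_visible_at(5, 5) is 0 while bit_visible_at(5, 5 ^ 32)
is 1, which is the XOR-with-32 correction mentioned above.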

-

From: Stephane Eranian
Date: Monday, November 19, 2007 - 3:48 pm

Paul,

Yes, that is one way. It works well if you only extend structures at the end.
Given that you need to obtain the file descriptor first via a pfm_create_context
call, an alternative could be that you pass a version number to that call to

I don't like those cast_ulp() macros. They were put there to avoid compiler
warnings on some architectures. Clearly with the big-endian issue, we need
to find something else. The bitmap*() macros take unsigned long *.

The interface uses fixed-size types to ensure ABI compatibility between
32 and 64 bit modes. This way there is no need to marshal syscall arguments
for a 32-bit app running on a 64-bit host.

Looks like we will have to use bytes (u8) instead.  This may have some
performance impact as well. Several bitmaps are used in the context/interrupt
routines. Even with u8, there is still a problem with the bitmap*() macros.
Now, only a small subset of the bitmap() macros are used, so it may be okay
to duplicate them for u8.


-- 

-Stephane
-

From: David Miller
Date: Monday, November 19, 2007 - 5:53 pm

From: Stephane Eranian <eranian@hpl.hp.com>

I think it would be fine to just create a set of bitop interfaces that
operate on u32 objects instead of "unsigned long".

Currently perfmon2 does not need the atomic variants at all, and those
could thus be provided entirely under include/asm-generic/bitops/
-

From: Stephane Eranian
Date: Thursday, December 13, 2007 - 9:00 am

Hello,

A few weeks back, I mentioned that I would post some
interesting problems that I have encountered while
implementing perfmon and for which I am still looking
for better solutions.

Here is one that I would like to solve right now and
for which I am interested in your comments.

One of the perfmon syscalls (pfm_restart()) is used to
resume monitoring after a user level notification. When
operating in per-thread non self-monitoring mode, the
syscall needs to operate on the machine state of the
monitored thread. So you get into this situation:


        Thread T0                        Thread T1
            |                                |
       pfm_restart()                         |
            |                                |
    spin_lock_irqsave()                      |
            |                                |
  <modify T1's machine state>--------------->|
            |                                |
    spin_unlock_irqrestore()                 |
            |                                |
            v                                v

Thread T1 may be running at the time T0 needs to modify its state.
The current solution is to set a TIF flag in T1. That TIF flag will
cause T1 (on kernel exit) to go into a perfmon function that will
then modify the state, i.e., state is self-modified. That works okay
but there are a few race conditions. For self-monitoring sessions
(e.g., system-wide or per-thread), it is easy because we operate in
the correct thread.

But there is a big difference between self-monitoring and non
self-monitoring. The pfm_restart() syscall does not provide the
same guarantee.

In self-monitoring modes, the interface guarantees that by the time you
return from the call, the effects of the call are visible. Whereas when
monitoring another thread, the call currently does not provide such a
guarantee, i.e., it does not wait until T1 has seen the TIF flag and
completed the state modification before returning. We could add a ...
From: Frank Ch. Eigler
Date: Friday, December 14, 2007 - 12:12 pm

The utrace code supports this style of thread manipulation better
than ptrace.

- FChE
--

From: Stephane Eranian
Date: Friday, December 14, 2007 - 2:07 pm

Charles,


Are you saying that utrace provides a utrace_thread_stop(tid) call
that returns only when the thread tid is off the CPU, and that there
is a utrace_thread_resume(tid) call? If that's the case then that is
what I need.

Where do we stand with regard to utrace integration?

Thanks.

-- 
-Stephane
--

From: Frank Ch. Eigler
Date: Saturday, December 15, 2007 - 8:54 am

While I see no single call, it can be synthesized from a sequence of
them: utrace_attach, utrace_set_flags (... UTRACE_ACTION_QUESCE ...),

Roland McGrath is working on breaking the patches down.

- FChE
--

From: Stephane Eranian
Date: Wednesday, November 14, 2007 - 6:51 am

Yes, the read call could be simplified to the level proposed above by Paul.

-- 
-Stephane
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 4:39 am

No it's not basically a read().  It's more like a request/reply
interface, which a read()/write() interface doesn't handle very well.
The request in this case is "tell me about this particular collection
of PMDs" and the reply is the values.

It seems to me that an important part of this is to be able to collect
values from several PMDs at a single point in time, or at least an
approximation to a single point in time.  So that means that you don't
want a file per PMD either.

Basically we don't have a good abstraction for a request/reply (or
command/response) type of interface, and this is a case where we need
one.  Having a syscall that takes a struct containing the request and
reply is as good a way as any, particularly for something that needs
to be quick.

Paul.
-

From: David Miller
Date: Wednesday, November 14, 2007 - 4:52 am

From: Paul Mackerras <paulus@samba.org>

Yes it can, see my other reply.
-

From: Stephane Eranian
Date: Wednesday, November 14, 2007 - 6:47 am

Hello,


Exactly. This is not a brute force read()! On input you pass the list
of registers you want to read. Upon return, you get the list of values.
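
As an illustration of that request/reply shape, here is a small sketch in
userspace C. The struct, the register file, and the function name are
hypothetical stand-ins and are not the real perfmon2 definitions. The
caller fills in the register numbers; the callee fills in the values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_NUM_PMDS 8

/* One request/reply slot: reg_num is input, value is output. */
struct sim_pmd_req {
    uint16_t reg_num;
    uint64_t value;
};

/* Simulated PMD register file. */
static uint64_t sim_pmd[SIM_NUM_PMDS] = { 10, 11, 12, 13, 14, 15, 16, 17 };

/* Read 'count' registers in one call. Returns 0 on success, -1 on error. */
static int sim_read_pmds(struct sim_pmd_req *req, size_t count)
{
    size_t i;

    assert(req != NULL || count == 0);

    /* Validate everything first so a failure leaves the vector untouched. */
    for (i = 0; i < count; i++)
        if (req[i].reg_num >= SIM_NUM_PMDS)
            return -1;

    for (i = 0; i < count; i++)
        req[i].value = sim_pmd[req[i].reg_num];
    return 0;
}
```

A caller asking for registers 2 and 5 passes a two-element vector and
gets both values back from the same call.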

Now, I think the current call could be optimized even more by making
the structure smaller. Today, the structure passed to read and write
PMD registers is the same. On write, we pass other information such as
the reset values (sampling periods), randomization parameters and some

Yes, we want to be able to read one or many registers in one call.
The number of PMU counters is not going to shrink, so having a file

-- 
-Stephane
-

From: Andi Kleen
Date: Wednesday, November 14, 2007 - 5:38 am

At least for x86, and I suspect some other architectures, we don't
initially need a syscall at all for this. There is an instruction,
RDPMC, which can read a performance counter just fine. It is also much
faster and generally preferable for the case where a process measures
events about itself. In fact it is essential for one of the use cases
I would like to see perfmon used for (replacement of RDTSC for cycle
counting).
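
A hedged sketch of a self-monitoring read with RDPMC, using GCC-style
inline assembly. The wrapper and helper names are made up here. Executing
RDPMC from user mode only works if the kernel has enabled it (CR4.PCE);
otherwise the instruction faults. The counter width passed to pmc_delta()
is an assumption the caller must supply for the CPU at hand.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* Read performance counter 'counter': index in ECX, result in EDX:EAX. */
static inline uint64_t rdpmc_read(uint32_t counter)
{
    uint32_t lo, hi;

    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/*
 * Events elapsed between two raw reads of a counter that is 'width' bits
 * wide, allowing for one wraparound between the reads.
 */
static uint64_t pmc_delta(uint64_t prev, uint64_t now, unsigned int width)
{
    uint64_t mask;

    assert(width >= 1 && width <= 64);
    mask = (width == 64) ? ~(uint64_t)0 : (((uint64_t)1 << width) - 1);
    return (now - prev) & mask;
}
```

A cycle-counting use would read the counter before and after the region
of interest and pass both values to pmc_delta().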

Later a syscall might be needed with event multiplexing, but that seems

I don't like read/write for this too much. I think it's better to
have individual syscalls.  After all that is CPU state and having
syscalls for that does seem reasonable.

-Andi
-

From: Stephane Eranian
Date: Wednesday, November 14, 2007 - 7:13 am

Andi,


On a machine with only two generic counters such as MIPS or Intel Core 2 Duo,
multiplexing offers some advantages. If NMI watchdog is enabled, then you drop

As I said earlier, we do use read(), not for reading counters but to extract overflow
notification messages when we are sampling. It makes more sense for this usage because
this is where you want to leverage some key mechanisms such as:

	 - asynchronous notification via SIGIO. This is how you can implement self-sampling
	   for instance.

	 - select/poll to allow monitoring tools to wait for notification coming from
	   multiple sessions in one call. This is useful when monitoring across fork or
	   pthread_create.
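
The select/poll usage can be sketched with standard POSIX calls. The
message layout and the function name below are hypothetical; any readable
file descriptor (a pipe, for example) can stand in for a perfmon session
descriptor when trying this out.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

#define MAX_SESSIONS 16

/* Hypothetical overflow notification message. */
struct sim_ovfl_msg {
    uint32_t session_id;
    uint32_t pmd;
};

/*
 * Wait up to 'timeout_ms' for a notification on any of 'nfds' descriptors
 * and read one message from the first ready one. Returns the index of
 * that descriptor, or -1 on timeout, error, or short read.
 */
static int wait_for_notification(const int *fds, int nfds, int timeout_ms,
                                 struct sim_ovfl_msg *msg)
{
    struct pollfd pfd[MAX_SESSIONS];
    int i;

    assert(fds != NULL && msg != NULL);
    assert(nfds > 0 && nfds <= MAX_SESSIONS);

    for (i = 0; i < nfds; i++) {
        pfd[i].fd = fds[i];
        pfd[i].events = POLLIN;
        pfd[i].revents = 0;
    }

    if (poll(pfd, (nfds_t)nfds, timeout_ms) <= 0)
        return -1;

    for (i = 0; i < nfds; i++) {
        if (pfd[i].revents & POLLIN) {
            ssize_t n = read(pfd[i].fd, msg, sizeof(*msg));
            return n == (ssize_t)sizeof(*msg) ? i : -1;
        }
    }
    return -1;
}
```

One monitoring tool can thus block in a single call while following
several sessions, such as the children created by fork.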

-- 
-Stephane
-

From: Andi Kleen
Date: Wednesday, November 14, 2007 - 7:26 am

NMI watchdog is off by default now.

Yes longer term we might need multiplexing, but definitely not as first step.

-Andi
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 5:23 pm

How would you provide access to the counters of another process?
Through an extension to ptrace perhaps?

Paul.
-

From: David Miller
Date: Wednesday, November 14, 2007 - 12:48 pm

From: Andi Kleen <andi@firstfloor.org>

I wouldn't even want to use a syscall for something like
that on Sparc, I'd rather give this a dedicated software
trap so that I can code it completely in assembler.
-

From: dean gaudet
Date: Wednesday, November 14, 2007 - 9:20 pm

actually multiplexing is the main feature i am in need of. there are an 
insufficient number of counters (even on k8 with 4 counters) to do 
complete stall accounting or to get a general overview of L1d/L1i/L2 cache 
hit rates, average miss latency, time spent in various stalls, and the 
memory system utilization (or HT bus utilization).  this runs out to 
something like 30 events which are interesting... and re-running a 
benchmark over and over just to get around the lack of multiplexing is a 
royal pain in the ass.

it's not a "far away non-essential feature" to me.  it's something i would 
use daily if i had all the pieces together now (and i'm constrained 
because i cannot add an out-of-tree patch which adds unofficial syscalls 
to the kernel i use).

-dean
-

From: Paul Mackerras
Date: Wednesday, November 14, 2007 - 9:47 pm

So by "multiplexing" do you mean the ability to have multiple event
sets associated with a context and have the kernel switch between them
automatically?

Paul.
-

From: dean gaudet
Date: Wednesday, November 14, 2007 - 10:14 pm

yep.

-dean
-

From: Stephane Eranian
Date: Thursday, November 15, 2007 - 1:53 am

Hello,


Multiplexing in the context of perfmon2 means that you can measure more events
than there are counters. To make this work, we create the notion of an event set
or more precisely a register set. Each set encapsulates the full PMU state. Then
the kernel multiplexes the sets onto the actual PMU hardware.

Why do we need this?

As Dean pointed out, there are many important metrics which do require more events
than there are counters. Making multiple runs can be difficult with some workloads.

But there are also other, less well-known reasons why you'd want to do this. Having
lots of counters does not necessarily mean you can measure lots of related events
simultaneously. Take the Pentium 4, for instance: it has 18 counters, but
for most interesting metrics, you cannot measure all the events at once. Why? Because
there are important hardware constraints which translate into event combination 
constraints. It is not uncommon to have constraints such as:
	- event A and B cannot be measured together
	- event A can only be measured by counter X
	- if event A is measured, then only events B, C, D can be measured

This is not just on Itanium. Power has limitations, Intel Core 2 has limitations,
AMD Opterons also have limitations.

When you combine limited number of counters with strong constraints, it can quickly
become difficult to make measurements in one run.

Multiplexing is, of course, not as good as measuring all events continuously but
if you run for long enough and with reasonable switching periods, the *estimates*
you get by scaling the obtained counts can be very close to what they would have
been had you measured all events all the time. You have to balance precision with
overhead.
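
The scaling step can be written down as a one-line estimate. This sketch
assumes plain linear scaling by the fraction of time the set was actually
on the PMU; the function name and the time units are arbitrary choices
for the example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the count an event would have reached had it been measured
 * for the whole run, given the raw count observed while its set was
 * active. Both times must be in the same unit (e.g. switch periods).
 */
static double scale_count(uint64_t raw_count, uint64_t time_active,
                          uint64_t time_total)
{
    assert(time_active <= time_total);
    if (time_active == 0)
        return 0.0;   /* the set never ran, so there is nothing to scale */
    return (double)raw_count * (double)time_total / (double)time_active;
}
```

For example, a set that was active for 25 out of 100 periods and counted
1000 events yields an estimate of 4000 events for the whole run.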

Why do this in the kernel?

One might argue that there is nothing preventing tools from multiplexing at the user
level. That's true and we do support this as well. You have to:
		- stop monitoring
		- read out current counter
		- reprogram config and data registers
		- ...
From: Dan Terpstra
Date: Thursday, November 15, 2007 - 10:01 am

We've provided multiplexing in PAPI at the user level for years. That forced
it to the user level, which wasn't pretty. Or very statistically accurate.
We've been eagerly anticipating the improvements provided by in-kernel
multiplexing in perfmon2. We and our user base don't consider this a "far
away non-essential feature", but a deficiency that's needed addressing for a
long time.

-

From: Stephane Eranian
Date: Tuesday, November 13, 2007 - 2:33 pm

Greg,


I am the core developer of this and I am not as pessimistic as Phil. Yet I admit

I think I understand your concerns. I will work on this. I think it is possible to
refactor. It will certainly be painful (for me), but I think it can be done within
some reasonable delay. Of course, it would help if you could better qualify what


I do care a lot actually. Believe me, I do spend a lot of effort and energy
on this project everyday, like many others around the world, and I intend for
it to succeed. We have reached a point in the development of processor hardware
where this kind of feature is crucial and it is not just for HPC folks anymore.

-- 
-Stephane
-

From: Greg KH
Date: Tuesday, November 13, 2007 - 2:45 pm

I think Andrew already spelled this out.  If after reading his message,
you still have questions, please let me know and I'll be glad to work
with you to address them.

thanks,

greg k-h
-

From: Christoph Hellwig
Date: Tuesday, November 13, 2007 - 3:27 pm

<stupid bullshitting snipped>

What about investing some effort to do a proper performance counter
infrastructure or turning the mess perfmon is into one instead of this
useless rant?  Code is not getting any better by your complaining and ccing
gazillions of useless lists.
-

From: Andi Kleen
Date: Tuesday, November 13, 2007 - 1:42 pm

Context switch is imho the main differentiating feature of perfmon 
over oprofile.  Not sure it makes sense to take that one out.

I don't think the complexity of the patches comes from the context
switch anyways, it comes from the many other things it does.

-Andi
-
