Re: [perfmon] Re: [perfmon2] perfmon2 merge news

To: Andrew Morton <akpm@...>, Stephane Eranian <eranian@...>
Cc: <perfmon@...>, <linux-kernel@...>
Date: Tuesday, November 6, 2007 - 8:34 pm

Here's a patch against my current tree that gets the perfmon code
building and hopefully working.

Note, it needs the kobject_create_and_register() patch which is in my
tree, but I do not think it made it to -mm yet. The next -mm cycle
should have it.

Also, the sysfs usage in the perfmon code is quite strange and not
documented at all. Yes, there is a little bit in the documentation
about what a few of the files do, but there are _way_ more files and
even directories being created under /sys/kernel/perfmon/ that are not
documented at all here.

If you document this stuff, I think I can clean up your sysfs code a
lot, making things simpler, easier to extend, and easier to understand.
But as it is, I don't want to break anything as it's totally unknown how
this stuff is supposed to work...

Hint, use the Documentation/ABI directory to document your sysfs
interfaces, that is what it is there for...

thanks,

greg k-h

---------------
From: Greg Kroah-Hartman <gregkh@suse.de>
Subject: perfmon: fix up some static kobject usages

This gets the perfmon code to build properly on the latest -mm tree, as
well as removing some static kobjects.

A lot of future kobject cleanups can be done on this code, but the
documentation for the perfmon sysfs interface is very limited and does
not describe all of the different files and subdirectories at all.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
perfmon/perfmon_sysfs.c | 37 +++++++++++++++----------------------
1 file changed, 15 insertions(+), 22 deletions(-)

--- a/perfmon/perfmon_sysfs.c
+++ b/perfmon/perfmon_sysfs.c
@@ -76,7 +76,8 @@ EXPORT_SYMBOL(pfm_controls);

DECLARE_PER_CPU(struct pfm_stats, pfm_stats);

-static struct kobject pfm_kernel_kobj, pfm_kernel_fmt_kobj;
+static struct kobject *pfm_kernel_kobj;
+static struct kobject *pfm_kernel_fmt_kobj;

static void pfm_reset_stats(int cpu)
{
@@ -402,31 +403,23 @@ static struct attribute_group pfm_kernel

int __init pfm_init_sysfs(v...

To: Greg KH <greg@...>
Cc: <eranian@...>, <perfmon@...>, <linux-kernel@...>
Date: Friday, November 9, 2007 - 4:06 pm

On Tue, 6 Nov 2007 16:34:54 -0800

Unfortunately I still haven't merged perfmon due to recently-occurring
minor conflicts with Tony's ia64 tree and more major recently-occurring
conflicts with the x86 tree.

There's not really a lot which Stephane can practically do about this -
normally I'll just get down and fix stuff like this up. But the impression
I get from various people is that the perfmon tree in its present form
would not be a popular merge.

The impression which people have (and I admit to sharing it) is that
there's just too much stuff in there and it might not all be justifiable.
But I suspect that people have largely forgotten what is in there, and why
it is in there.

We really need to get this ball rolling, and that will require a sustained
effort from more people than just Stephane. I suppose as a starting point
we could yet again review the existing patches, please. People will mainly
concentrate upon the changelogging to understand which features are being
proposed and why, so that submission should describe these things pretty
carefully: what are the features and why do we need each of them.

tia.
-

To: Andrew Morton <akpm@...>
Cc: <eranian@...>, <perfmon@...>, <linux-kernel@...>
Date: Friday, November 9, 2007 - 5:38 pm

Is there some way to rebase these patches/git tree to be a bit easier
to review? Right now there are over 75 patches in the tree and many (if
not most) can be removed by merging them with previous patches.

If someone could break this stuff down into reviewable pieces, it would
go a very long way toward making it acceptable.

Is there any way to just provide a basic framework that everyone can
agree on and then add on more stuff as time goes on? Do we have to have
every different processor/arch with support to start with?

thanks,

greg k-h
-

To: <gregkh@...>
Cc: <akpm@...>, <eranian@...>, <linux-kernel@...>
Date: Saturday, November 10, 2007 - 4:32 pm

Greg KH <greg-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org> writes:

[dropped perfmon list because gmane messed it up and it's apparently

I think the real problem is not the architectures (the processor
adaptation layer is usually relatively straightforward IIRC), but the
excessive functionality implemented by the user interface.

It would be really good to extract a core perfmon and start with
that and then add stuff as it makes sense.

e.g. core perfmon could be something simple: just support for
context-switching the state, initializing counters in a basic way,
and perhaps getting counter numbers for RDPMC in ring 3 on x86[1]

Next step could be basic event on overflow/underflow support.

Then more features as they make sense, with clear rationale
what they're good for and proper step by step patches.

-Andi

[1] On x86 we urgently need a replacement to RDTSC for counting
cycles.
-

To: Andi Kleen <andi@...>
Cc: <gregkh@...>, <akpm@...>, <eranian@...>, <linux-kernel@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 11:17 am

Perhaps a core could also provide enough functionality that
Perfmon can be used with an *unpatched* kernel using loadable modules?
One drawback with today's Perfmon is that it can not be used with a
vanilla kernel. But maybe such a core is by far too complex for a
first merge.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: robert.richter@amd.com

-

To: Robert Richter <robert.richter@...>
Cc: Andi Kleen <andi@...>, <akpm@...>, <gregkh@...>, <linux-kernel@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 11:35 am

Hi Robert,

In the past I suggested that it might be useful to have a version of perfmon2
that only set up the perfmon on a global basis. That would allow the patches for
context switches to be added as a separate step, splitting up the patch into
a smaller set of patches.

Perfmon2 uses a set of system calls to control the performance monitoring
hardware. This would make it difficult to use an unpatched kernel unless perfmon
changed the mechanism used to control the performance monitoring hardware.

-Will
-

To: William Cohen <wcohen@...>
Cc: Robert Richter <robert.richter@...>, Andi Kleen <andi@...>, <akpm@...>, <gregkh@...>, <linux-kernel@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 4:42 pm

Context switch is imho the main differentiating feature of perfmon
over oprofile. Not sure it makes sense to take that one out.

I don't think the complexity of the patches comes from the context
switch anyway; it comes from the many other things it does.

-Andi
-

To: William Cohen <wcohen@...>
Cc: Robert Richter <robert.richter@...>, <akpm@...>, Andi Kleen <andi@...>, <gregkh@...>, <perfmon2-devel@...>, <linux-kernel@...>, <perfmon@...>
Date: Tuesday, November 13, 2007 - 1:55 pm

Hello,

Yes, that would be a possibility but as you pointed out there are some problems:

- perfmon2 uses system calls. So unless you can dynamically patch the
syscall table we would have to go back to the ioctl() and driver model.
I was under the impression that people did not quite like multiplexing
syscalls such as ioctl(). I also do prefer the multi syscall approach.

- perfmon2 needs to install a PMU interrupt handler. On X86, this is not just
an external device interrupt. There needs to be some APIC and interrupt
gate setup. There may be other constraints on other architectures as well.
Not sure if all functions/structures necessary for this are available to
modules.

- we could not support per-thread mode with the kernel module approach due to
the link to the context switch code. I do believe per-thread is a key value-add
for performance monitoring.

--
-Stephane
-

To: <eranian@...>
Cc: William Cohen <wcohen@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 2:47 pm

Hi folks,

Well, I can say the mood here at supercomputing'07 is pretty somber
in regards to the latest exchange of messages regarding the perfmon
patches. Our community has been the largest user of both the PerfCtr
and the Perfmon patches, the former being regularly installed by
vendors and integrators on clusters at install time, and the latter
now being adopted into vendor kernels by IBM, Cray, AMD, SiCortex and
others. Of course, adoption by a vendor, does not a good kernel patch
make. However, it should be viewed as a strong data point on demand
for such functionality. We are a community focused on performance and
we have long had a need for these tools.

A solution that does not provide 64 bit virtualized per-thread counts
is not a solution at all. That would need to be ripped out by all of
us using this functionality so we could get something that actually
does what the community needs, not what you folks think we need.
Device level access and/or root access to the counters is not
acceptable for machines in production. If that was fine, oprofile
would have satisfied everyone and we wouldn't be sucking up your
bandwidth. Please understand that people outside of your
community are desperate for adoption of any form of 'per-thread' PMU
functionality into the kernel. For those of you who are (still) not
convinced of this, I can arrange your inbox to be spammed by 1000's
of HPC geeks, managers, vendors, etc. My point is, let's start
somewhere that the community finds useful. Otherwise we run the risk
of developing an interface that everyone isn't comfortable with and
no-one uses. Hardly a productive exercise.

So please, do consider a set of core functionality that provides for
(at least) the following:

- per-CPU and per-thread 64 bit virtualized counts
- third person operation (attach/ptrace)
- dispatch of signal upon interrupt on overflow if requested
- 'buffered' interrupts into a buffer that can be mmap'd into user...

To: Philip Mucci <mucci@...>
Cc: <eranian@...>, William Cohen <wcohen@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 6:27 pm

<stupid bullshitting snipped>

What about investing some effort to do a proper performance counter
infrastructure or turning the mess perfmon is into one instead of this
useless rant? Code is not getting any better by your complaining and ccing
gazillions of useless lists.
-

To: Philip Mucci <mucci@...>
Cc: <eranian@...>, William Cohen <wcohen@...>, <akpm@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 2:59 pm

"somber"?

Why?

We (a number of the kernel developers) want to see the perfmon code make
it into the kernel tree. Unfortunately, in the current state it is in,
that's not going to happen.

Andi specified a way that this can happen, just refactor your patches
into smaller bits that can be reviewed and applied.

If you, or anyone else has any questions about this, please let us know.
So far, I have not seen any response to his message, so I'm guessing
that the perfmon developers either are off working on this, or don't
care.

And if they don't care, then yes, I agree with your "somber" feeling...

thanks,

greg k-h
-

To: Greg KH <gregkh@...>
Cc: Philip Mucci <mucci@...>, William Cohen <wcohen@...>, <akpm@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 5:33 pm

Greg,

I am the core developer of this and I am not as pessimistic as Phil. Yet I admit

I think I understand your concerns. I will work on this. I think it is possible to
refactor. It will certainly be painful (for me), but I think it can be done within
some reasonable delay. Of course, it would help if you could better qualify what

I do care a lot actually. Believe me, I do spend a lot of effort and energy
on this project everyday, like many others around the world, and I intend for
it to succeed. We have reached a point in the development of processor hardware
where this kind of feature is crucial and it is not just for HPC folks anymore.

--
-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: Philip Mucci <mucci@...>, William Cohen <wcohen@...>, <akpm@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 5:45 pm

I think Andrew already spelled this out. If after reading his message,
you still have questions, please let me know and I'll be glad to work
with you to address them.

thanks,

greg k-h
-

To: Greg KH <gregkh@...>
Cc: Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 4:07 pm

Well... Philip is (I assume) a numerical-computing guy and not a
kernel-developing guy (probably a wise choice).

He speaks for quite a few people - they have serious need for this feature
but they've had to scruff around with out-of-tree patches for years to get
it, and still there are problems.

I was hoping that after the round of release-and-review which Stephane,
Andi and I did about twelve months ago that we were on track to merge the
perfmon codebase as-offered. But now it turns out that the sentiment is
that the code simply has too many bells-and-whistles to be acceptable.

My problem with that sentiment is that it is quite likely the case that
those bells-n-whistles are actually useful and needed features. Perfmon
has been out there for quite a few years and the code which is in there
_should_ be in response to real-world in-the-field experience. Such
requirements never go away.

So. If what I am saying is correct then the best course of action would be
for Stephane to help us all to understand what these features are and why
we need them. The ideal way in which to do this is

[patch] perfmon: core
[patch] perfmon: whizzy feature #1
[patch] perfmon: whizzy feature #2
[patch] perfmon: whizzy feature #3

etc. Where the changelog in each whizzy-feature-n explains what it does,
why it does it and why our users need it.

Whatever happens, perfmon is so big and so old and has been out-of-tree for
so long that it's going to take a pile of work from lots of people to get
any of it landed.

-

To: Andrew Morton <akpm@...>
Cc: Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 3:24 am

Whose sentiment?

I've had a bit of a look at it today together with David Gibson. Our
impression is that the latest version is a lot cleaner and simpler
than it used to be. I'm also reading Stephane's technical report
which describes the interface, and whilst I'm only part-way through
it, I haven't seen anything yet which strikes me as unnecessary or
overly complicated.

Paul.
-

To: Paul Mackerras <paulus@...>
Cc: Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 6:38 am

Mine for example. The whole userspace interface is just on crack,
and the code is full of complexities as well.

-

To: Christoph Hellwig <hch@...>
Cc: Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 6:43 am

Could you give some _technical_ details of what you don't like?

Paul.
-

To: Paul Mackerras <paulus@...>
Cc: Christoph Hellwig <hch@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 7:00 am

I've done this a gazillion times before, so maybe instead of being a lazy
bastard you could look up the mailing list archive. It's not like this is the
first discussion of perfmon. But to get started, look at the system calls;
many of them are beasts like:

int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)

This is basically a read(2) (or for other syscalls a write) on something
other than the file descriptor provided to the system call. The right thing
to do is obviously have a pmds and pmcs file in procfs for the thread being
monitored instead of these special-case files, with another set for global
tracing. Similarly I'm pretty sure we can get a much better interface
if we introduce matching files in procfs for the other calls.

-

To: Christoph Hellwig <hch@...>
Cc: Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 8:38 am

At least for x86 and I suspect some other architectures we don't
initially need a syscall at all for this. There is an instruction
RDPMC which can read a performance counter just fine. It is also much
faster and generally preferable for the case where a process measures
events about itself. In fact it is essential for one of the use cases
I would like to see perfmon used for (replacement of RDTSC for cycle
counting).

Later a syscall might be needed with event multiplexing, but that seems

I don't like read/write for this too much. I think it's better to
have individual syscalls. After all that is CPU state and having
syscalls for that does seem reasonable.

-Andi
-

To: Andi Kleen <andi@...>
Cc: Christoph Hellwig <hch@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Thursday, November 15, 2007 - 12:20 am

actually multiplexing is the main feature i am in need of. there are an
insufficient number of counters (even on k8 with 4 counters) to do
complete stall accounting or to get a general overview of L1d/L1i/L2 cache
hit rates, average miss latency, time spent in various stalls, and the
memory system utilization (or HT bus utilization). this runs out to
something like 30 events which are interesting... and re-running a
benchmark over and over just to get around the lack of multiplexing is a
royal pain in the ass.

it's not a "far away non-essential feature" to me. it's something i would
use daily if i had all the pieces together now (and i'm constrained
because i cannot add an out-of-tree patch which adds unofficial syscalls
to the kernel i use).

-dean
-

To: 'dean gaudet' <dean@...>, 'Andi Kleen' <andi@...>
Cc: 'papi list' <ptools-perfapi@...>, 'OSPAT devel' <ospat-devel@...>, 'Greg KH' <gregkh@...>, 'Perfmon' <perfmon@...>, <linux-kernel@...>, 'Christoph Hellwig' <hch@...>, 'Paul Mackerras' <paulus@...>, 'Andrew Morton' <akpm@...>, <perfmon2-devel@...>, 'Philip Mucci' <mucci@...>
Date: Thursday, November 15, 2007 - 1:01 pm

We've provided multiplexing in PAPI at the user level for years. That forced
it to the user level, which wasn't pretty. Or very statistically accurate.
We've been eagerly anticipating the improvements provided by in-kernel
multiplexing in perfmon2. We and our user base don't consider this a "far
away non-essential feature", but a deficiency that's needed addressing for a
long time.

-

To: dean gaudet <dean@...>
Cc: Andi Kleen <andi@...>, Christoph Hellwig <hch@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Stephane Eranian <eranian@...>
Date: Thursday, November 15, 2007 - 4:53 am

Hello,

Multiplexing in the context of perfmon2 means that you can measure more events
than there are counters. To make this work, we create the notion of an event set
or more precisely a register set. Each set encapsulates the full PMU state. Then
the kernel multiplexes the sets onto the actual PMU hardware.

Why do we need this?

As Dean pointed out, there are many important metrics which do require more events
than there are counters. Making multiple runs can be difficult with some workloads.

But there are also other, less known, reasons why you'd want to do this. Having
lots of counters does not necessarily mean that you can measure lots of
related events simultaneously. Take Pentium 4 for instance: it has 18 counters, but
for most interesting metrics, you cannot measure all the events at once. Why? Because
there are important hardware constraints which translate into event combination
constraints. It is not uncommon to have constraints such as:
- event A and B cannot be measured together
- event A can only be measured by counter X
- if event A is measured, then only events B, C, D can be measured

This is not just on Itanium. Power has limitations, Intel Core 2 has limitations,
AMD Opterons also have limitations.

When you combine a limited number of counters with strong constraints, it can quickly
become difficult to make measurements in one run.

Multiplexing is, of course, not as good as measuring all events continuously but
if you run for long enough and with a reasonable switching period, the *estimates*
you get by scaling the obtained counts can be very close to what they would have
been had you measured all events all the time. You have to balance precision with
overhead.
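
As a rough illustration of the scaling described above, here is a minimal sketch. It is not perfmon2 code: the function name scale_count and its parameters are assumptions made for this example. It extrapolates a raw count by the ratio of the total run time to the time the event set was actually loaded on the PMU.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of perfmon2: estimate what a counter
 * would have read had its event set been active for the whole run.
 *
 *   raw_count   - count accumulated while the set was on the PMU
 *   active_time - time the set was actually loaded (any unit)
 *   total_time  - total duration of the measurement (same unit)
 */
uint64_t scale_count(uint64_t raw_count, uint64_t active_time,
                     uint64_t total_time)
{
    if (active_time == 0)
        return 0;   /* the set never ran: nothing to extrapolate from */

    /* Compute in long double to reduce the risk of 64-bit overflow in
     * raw_count * total_time, then round to the nearest integer. */
    return (uint64_t)((long double)raw_count * total_time / active_time + 0.5L);
}
```

For example, a set that counted 1000 events while loaded for 250 out of 1000 time units would be scaled to an estimate of 4000. The estimate is only as good as the assumption that the event rate was similar while the set was switched out.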

Why do this in the kernel?

One might argue that there is nothing preventing tools from multiplexing at the user
level. That's true and we do support this as well. You have to:
- stop monitoring
- read out current counter
- reprogram config and data registers
- restart mo...

To: dean gaudet <dean@...>
Cc: Andi Kleen <andi@...>, Christoph Hellwig <hch@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>
Date: Thursday, November 15, 2007 - 12:47 am

So by "multiplexing" do you mean the ability to have multiple event
sets associated with a context and have the kernel switch between them
automatically?

Paul.
-

To: Paul Mackerras <paulus@...>
Cc: Andi Kleen <andi@...>, Christoph Hellwig <hch@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>
Date: Thursday, November 15, 2007 - 1:14 am

yep.

-dean
-

To: <andi@...>
Cc: <hch@...>, <paulus@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <perfmon@...>, <perfmon2-devel@...>, <ospat-devel@...>, <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 3:48 pm

From: Andi Kleen <andi@firstfloor.org>

I wouldn't even want to use a syscall for something like
that on Sparc, I'd rather give this a dedicated software
trap so that I can code it completely in assembler.
-

To: Andi Kleen <andi@...>
Cc: Christoph Hellwig <hch@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 10:13 am

Andi,

On a machine with only two generic counters such as MIPS or Intel Core 2 Duo,
multiplexing offers some advantages. If NMI watchdog is enabled, then you drop

As I said earlier, we do use read(), not for reading counters but to extract overflow
notification messages when we are sampling. It makes more sense for this usage because
this is where you want to leverage some key mechanisms such as:

- asynchronous notification via SIGIO. This is how you can implement self-sampling
for instance.

- select/poll to allow monitoring tools to wait for notification coming from
multiple sessions in one call. This is useful when monitoring across fork or
pthread_create.
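
The select/poll point can be sketched with plain POSIX poll(). This is only an illustration under stated assumptions: the function name wait_for_notification and the MAX_SESSIONS limit are made up for this example, the descriptors are treated as generic readable file descriptors, and the format of the overflow messages is not shown because it does not appear in this thread.

```c
#include <poll.h>
#include <stddef.h>

#define MAX_SESSIONS 64   /* arbitrary limit chosen for this sketch */

/*
 * Wait until at least one of the nfds session descriptors has a
 * notification ready to read. Returns the index of the first ready
 * descriptor, or -1 on timeout, error, or too many descriptors.
 */
int wait_for_notification(const int *fds, size_t nfds, int timeout_ms)
{
    struct pollfd pfds[MAX_SESSIONS];
    size_t i;

    if (nfds > MAX_SESSIONS)
        return -1;

    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;   /* interested in readable data only */
        pfds[i].revents = 0;
    }

    /* One call covers every monitored session. */
    if (poll(pfds, nfds, timeout_ms) <= 0)
        return -1;

    for (i = 0; i < nfds; i++)
        if (pfds[i].revents & POLLIN)
            return (int)i;

    return -1;
}
```

A monitoring tool would call this in a loop and then read() the message from the descriptor at the returned index.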

--
-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: Andi Kleen <andi@...>, Christoph Hellwig <hch@...>, Paul Mackerras <paulus@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 10:26 am

NMI watchdog is off by default now.

Yes, longer term we might need multiplexing, but definitely not as a first step.

-Andi
-

To: Andi Kleen <andi@...>
Cc: Stephane Eranian <eranian@...>, Christoph Hellwig <hch@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 8:23 pm

How would you provide access to the counters of another process?
Through an extension to ptrace perhaps?

Paul.
-

To: Christoph Hellwig <hch@...>
Cc: Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Andi Kleen <andi@...>
Date: Wednesday, November 14, 2007 - 7:39 am

No it's not basically a read(). It's more like a request/reply
interface, which a read()/write() interface doesn't handle very well.
The request in this case is "tell me about this particular collection
of PMDs" and the reply is the values.

It seems to me that an important part of this is to be able to collect
values from several PMDs at a single point in time, or at least an
approximation to a single point in time. So that means that you don't
want a file per PMD either.

Basically we don't have a good abstraction for a request/reply (or
command/response) type of interface, and this is a case where we need
one. Having a syscall that takes a struct containing the request and
reply is as good a way as any, particularly for something that needs
to be quick.
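
To make the request/reply shape concrete, here is a small user-space mock. Everything in it is an assumption for illustration: struct pmd_req is not perfmon2's pfarg_pmd_t, read_pmds_mock is not a real system call, and the fake_pmds array merely stands in for the hardware registers.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_PMDS 8   /* hypothetical number of data registers */

/* One request/reply slot: the caller fills reg_num, the callee fills value. */
struct pmd_req {
    unsigned int reg_num;   /* in:  which PMD to read */
    uint64_t     value;     /* out: its current value */
};

/* Stand-in for the hardware counters, for illustration only. */
static uint64_t fake_pmds[NUM_PMDS] = { 10, 20, 30, 40, 50, 60, 70, 80 };

/*
 * Mock of a single call that carries both the request and the reply.
 * All register numbers are validated before any value is written, so
 * a failed call leaves the caller's array untouched.
 * Returns 0 on success, -1 if any register number is out of range.
 */
int read_pmds_mock(struct pmd_req *reqs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (reqs[i].reg_num >= NUM_PMDS)
            return -1;

    for (i = 0; i < n; i++)
        reqs[i].value = fake_pmds[reqs[i].reg_num];

    return 0;
}
```

The caller names the registers it wants and gets all the values back from one call, which is the property being argued for here.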

Paul.
-

To: Paul Mackerras <paulus@...>
Cc: Christoph Hellwig <hch@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Philip Mucci <mucci@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Andi Kleen <andi@...>
Date: Wednesday, November 14, 2007 - 9:47 am

Hello,

Exactly. This is not a brute force read()! On input you pass the list
of registers you want to read. Upon return, you get the list of values.

Now, I think the current call could be optimized even more by making
the structure smaller. Today, the structure passed to read and write
PMD registers is the same. On write, we pass other information such as
the reset values (sampling periods), randomization parameters and some

Yes, we want to be able to read one or many registers in one call.
The number of PMU counters is not going to shrink, so having a file

--
-Stephane
-

To: <paulus@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 7:52 am

From: Paul Mackerras <paulus@samba.org>

Yes it can, see my other reply.
-

To: <hch@...>
Cc: <paulus@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <perfmon@...>, <andi@...>, <perfmon2-devel@...>, <ospat-devel@...>, <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 7:12 am

From: Christoph Hellwig <hch@infradead.org>

This is my impression too, all of the things being done with
a slew of system calls would be better served by real special
files and appropriate fops. Whether the thing is some kind
of misc device or procfs is less important than simply getting
away from these system calls.
-

To: David Miller <davem@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 7:44 am

Special files and fops really only work well if you can coerce the
interface into one where data flows predominantly one way. I don't
think they work so well for something that is more like an RPC across
the user/kernel barrier. For that a system call is better.

For instance, if you have something that kind-of looks like

read_pmds(int n, int *pmd_numbers, u64 *pmd_values);

where the caller supplies an array of PMD numbers and the function
returns their values (and you want that reading to be done atomically

Why? What's inherently offensive about system calls?

Paul.
-

To: Paul Mackerras <paulus@...>
Cc: David Miller <davem@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 9:51 am

Yes, the read call could be simplified to the level proposed above by Paul.

--
-Stephane
-

To: <paulus@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 7:52 am

From: Paul Mackerras <paulus@samba.org>

The same way we handle some of the multicast "getsockopt()"
calls. The parameters passed in are both inputs and outputs.

For the above example:

struct pmd_info {
int *pmd_numbers;
u64 *pmd_values;
int n;
} *p;

buffer_size = N;
p = malloc(buffer_size);
p->pmd_numbers = p + foo;
p->pmd_values = p + bar;
p->n = whatever(N);
err = read(fd, p, N);

It's definitely doable, use your imagination.

You can encode all kinds of operation types into the
header as well.
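
The buffer layout in the example above might be fleshed out as follows. This is only a sketch: the helper name pmd_info_alloc is an assumption, the "foo" and "bar" offsets are replaced by explicit ones, and no perfmon file descriptor is involved. It only shows one way to place both arrays in the same allocation as the header.

```c
#include <stdint.h>
#include <stdlib.h>

struct pmd_info {
    int      *pmd_numbers;  /* in:  n register numbers */
    uint64_t *pmd_values;   /* out: n register values */
    int       n;
};

/*
 * Hypothetical helper: allocate the header and both arrays as one
 * zeroed block. The u64 array comes first, at an 8-byte-aligned
 * offset, followed by the int array. Free the result with free().
 */
struct pmd_info *pmd_info_alloc(int n)
{
    /* Round the header size up to a multiple of 8 for u64 alignment. */
    size_t hdr = (sizeof(struct pmd_info) + 7) & ~(size_t)7;
    size_t total;
    char *buf;
    struct pmd_info *p;

    if (n <= 0)
        return NULL;

    total = hdr + (size_t)n * (sizeof(uint64_t) + sizeof(int));
    buf = (char *)calloc(1, total);
    if (buf == NULL)
        return NULL;

    p = (struct pmd_info *)buf;
    p->pmd_values  = (uint64_t *)(buf + hdr);
    p->pmd_numbers = (int *)(buf + hdr + (size_t)n * sizeof(uint64_t));
    p->n = n;
    return p;
}
```

The caller would fill pmd_numbers, hand the whole block to the read-style call, and find the results in pmd_values afterwards.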

Another alternative is to use generic netlink.
-

To: David Miller <davem@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 8:03 am

You're suggesting that the behaviour of a read() should depend on what
was in the buffer before the read? Gack! Surely you have better
taste than that?

Or are you saying that a read (or write) has a side-effect of altering
some other area of memory besides the buffer you give to read()? That

Then you end up with two system calls to get the data rather than one
(one to send the request and another to read the reply). For
something that needs to be quick that is a suboptimal interface.

Paul.
-

To: <paulus@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 8:07 am

From: Paul Mackerras <paulus@samba.org>

Absolutely that's what I mean, it's atomic and gives you exactly what
you need.

I see nothing wrong or gross with these semantics. Nothing in the
"book of UNIX" specifies that for a device or special file the passed

Not necessarily, consider the possibility of using recvmsg() control
message data. With that it could be done in one go.

This also suggests that it could be implemented as its own protocol
family.
-

To: David Miller <davem@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 5:50 pm

Ohhhhh.... kayyyyy.... *shudders*

It really violates the abstract model of "read" pretty badly. "Read"
is "fill in the buffer with data from the device", not "do some
arbitrary stuff with this area of memory".

I'd prefer to have a transaction() system call like I suggested to

There's all sorts of possible ways that it could be implemented. On
the one hand we have an actual proposed implementation, and on the
other we have various people saying "oh but it could be implemented
this other way" without providing any actual code.

Now if those people can show that their way of doing it is
significantly simpler and better than the existing implementation,
then that's useful. I really don't think that doing a whole new
net protocol family is a simpler and better way of doing a performance
monitor interface, though.

Paul.
-

To: <paulus@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 7:03 pm

From: Paul Mackerras <paulus@samba.org>

So much for getting rid of the extra system calls...
-

To: David Miller <davem@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 7:12 pm

*I* never had a problem with a few extra system calls. I don't
understand why you (apparently) do.

Paul.
-

To: <paulus@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 7:21 pm

From: Paul Mackerras <paulus@samba.org>

We're stuck with them forever, they are hard to version and extend
cleanly.

Those are my main objections.
-

To: David Miller <davem@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 9:11 pm

The first is valid (for suitable values of "forever") but applies to
any user/kernel interface, not just system calls.

As for the second (hard to version) I don't see why it applies to
syscalls specifically more than to other interfaces. It's just a
matter of designing it correctly in the first place. For example, the
sys_swapcontext system call we have on powerpc takes an argument which
is the size of the ucontext_t that userland is using, which allows us
to extend it in future if necessary. (Note that I'm not saying that
the current perfmon2 interfaces are well-designed in this respect.)
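
The size-argument approach can be sketched roughly as follows. This is a user-space illustration only: `struct pmc_arg`, its fields, and `copy_pmc_arg()` are hypothetical names, and the policy of rejecting structures larger than the kernel knows about is one possible choice, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument structure. "v1" ended after reg_value;
 * "v2" appended the flags field. */
struct pmc_arg {
    uint32_t reg_num;
    uint32_t reserved;
    uint64_t reg_value;
    uint64_t flags;        /* added in the hypothetical v2 */
};

#define PMC_ARG_V1_SIZE offsetof(struct pmc_arg, flags)

/* Stand-in for the kernel side of a syscall that takes (ptr, size).
 * Old callers pass the smaller v1 size; missing fields read as zero. */
static int copy_pmc_arg(struct pmc_arg *dst, const void *user, size_t user_size)
{
    assert(dst != NULL && user != NULL);
    if (user_size < PMC_ARG_V1_SIZE)
        return -1;             /* smaller than any known version */
    if (user_size > sizeof(*dst))
        return -1;             /* newer than this "kernel" understands */
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, user, user_size);
    return 0;
}
```

Because new fields are only ever appended, the size doubles as a version number, which is the property the ucontext_t example relies on.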

The third (hard to extend cleanly) is a good point, and is a valid
criticism of the current set of perfmon2 system calls, I think.
However, the goal of being able to extend the interface tends to be in
opposition to the goal of having strong typing of the interface.
Things like a multiplexed syscall or an ioctl are much easier to
extend but that is at the expense of losing strong typing. Something
like my transaction() (or your weird kind of read() :) also provides
extensibility but loses type safety to some degree.

Also, as Andi says, this is core CPU state that we are dealing with,
not some I/O device, so treating the whole of perfmon2 (or any
performance monitoring infrastructure) as a driver doesn't fit very
well, and in fact system calls are appropriate. Just like we don't
try to make access to debugging facilities fit into a driver, we
shouldn't make performance monitoring fit into a driver either.

Paul.
-

To: Paul Mackerras <paulus@...>
Cc: <mucci@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>, Stephane Eranian <eranian@...>
Date: Thursday, November 15, 2007 - 4:29 am

Hi,

In the initial design there was only one perfmon syscall perfmonctl()
and it was a multiplexing call. People objected to it and thus I split it
up into multiple system calls. I like the strong typing but I agree that
it is harder to extend without creating new syscalls. In the current
state, all perfmon syscalls take a pointer to structs which have reserved
fields for future extensions. If you specify that reserved fields must be
zeroed, then it leaves you *some* flexibility for extending the structs.

Another alternative, similar to your ucontext, would be to pass the size
of the structure. If we assume we drop the vector arguments, we could do:

pfm_write_pmcs(fd, &pmc, sizeof(pmc));
instead of
pfm_write_pmcs(fd, &pmc);

Should the sizeof(pmc) need to change we could demultiplex inside the
kernel. Another, probably cleaner, possibility is to version structures
that are passed:
union pfarg_pmc {
	int version;
	struct {
		int version;
		int reg_num;
		u64 reg_value;
	} v1;	/* member name is illustrative */
};

But that seems overkill. I think versioning could be passed when the session
is created instead of at every call:

Agreed 100%. This is especially true because we support per-thread
monitoring.

--
-Stephane
-

To: <paulus@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 9:27 pm

From: Paul Mackerras <paulus@samba.org>

I disagree.

With netlink we can just add new attributes when a new need arises for
a particular interface. The attribute code describes the type
precisely, so there is no loss of strong typing at all.
-

To: <paulus@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Monday, November 19, 2007 - 9:08 am

Instead of blabbering further about this topic, I decided to put my
code where my mouth is and spent the weekend porting the perfmon2
kernel bits, and the user bits (libpfm and pfmon) to sparc64.

As a result I've found that perfmon2 is quite nice and allows
incredibly useful and powerful tools to be written. The syscalls
aren't that bad and really I see no reason to block its inclusion.

I rescind all of my earlier objections, let's merge this soon :-)
-

To: David Miller <davem@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Monday, November 19, 2007 - 5:43 pm

Strongly agree. However, I think we need to add structure size
arguments to most of the syscalls so we can extend them later.

Also, something I've been meaning to mention to Stephane is that the
use of the cast_ulp() macro in perfmon is bogus and won't work on
32-bit big-endian platforms such as ppc32 and sparc32. On such
platforms you can't take a pointer to an array of u64, cast it to
unsigned long * and expect the kernel bitmap operations to work
correctly on it. At the least you also need to XOR the bit numbers
with 32 on those platforms. Another alternative is to define the
bitmaps as arrays of bytes instead, which eliminates all byte ordering
and wordsize problems (but makes it more tricky to use the kernel
bitmap functions directly).
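
The bit-numbering problem can be demonstrated portably by simulating the big-endian memory layout with a byte array. This is only an illustration of the effect described above; the helper names are made up and the real kernel bitmap code is not used.

```c
#include <assert.h>
#include <stdint.h>

/* Store v into mem[0..7] the way a big-endian CPU would. */
static void store_be64(uint8_t mem[8], uint64_t v)
{
    for (int i = 0; i < 8; i++)
        mem[i] = (uint8_t)(v >> (56 - 8 * i));
}

/* Load 32-bit word number w as a 32-bit big-endian CPU would see it
 * when the same memory is treated as an array of unsigned long. */
static uint32_t load_be32(const uint8_t *mem, unsigned w)
{
    const uint8_t *p = mem + 4 * w;

    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* test_bit() semantics for 32-bit longs: word nr/32, bit nr%32. */
static int test_bit32(const uint8_t *mem, unsigned nr)
{
    return (int)((load_be32(mem, nr / 32) >> (nr % 32)) & 1u);
}

/* Returns 1 if bit nr of the u64 shows up at position nr ^ 32 (and not
 * at position nr) when accessed through 32-bit bitmap operations. */
static int bit_lands_at_xor32(unsigned nr)
{
    uint8_t mem[8];

    assert(nr < 64);
    store_be64(mem, (uint64_t)1 << nr);
    return test_bit32(mem, nr ^ 32) && !test_bit32(mem, nr);
}
```

On a little-endian or 64-bit layout the two numberings coincide, which is why the cast appears to work on most platforms.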

Paul.

-

To: Paul Mackerras <paulus@...>
Cc: David Miller <davem@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>, Stephane Eranian <eranian@...>
Date: Monday, November 19, 2007 - 6:48 pm

Paul,

Yes, that is one way. It works well if you only extend structures at the end.
Given that you need to obtain the file descriptor first via a pfm_create_context
call, an alternative could be that you pass a version number to that call to

I don't like those cast_ulp() macros. They were put there to avoid compiler
warnings on some architectures. Clearly with the big-endian issue, we need
to find something else. The bitmap*() macros take unsigned long *.

The interface uses fixed size type to ensure ABI compatibility between
32 and 64 bit modes. This way there is no need to marshal syscall arguments
for a 32-bit app running on a 64-bit host.

Looks like we will have to use bytes (u8) instead. This may have some
performance impact as well. Several bitmaps are used in the context/interrupt
routines. Even with u8, there is still a problem with the bitmap*() macros.
Now, only a small subset of the bitmap() macros are used, so it may be okay
to duplicate them for u8.

--

-Stephane
-

To: <eranian@...>
Cc: <paulus@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Monday, November 19, 2007 - 8:53 pm

From: Stephane Eranian <eranian@hpl.hp.com>

I think it would be fine to just create a set of bitop interfaces that
operate on u32 objects instead of "unsigned long".

Currently perfmon2 does not need the atomic variants at all, and those
could thus be provided entirely under include/asm-generic/bitops/
-

To: <linux-kernel@...>
Cc: <davem@...>, <paulus@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <wcohen@...>, <robert.richter@...>, <andi@...>, <eranian@...>, Stephane Eranian <eranian@...>
Date: Thursday, December 13, 2007 - 12:00 pm

Hello,

A few weeks back, I mentioned that I would post some
interesting problems that I have encountered while
implementing perfmon and for which I am still looking
for better solutions.

Here is one that I would like to solve right now and
for which I am interested in your comments.

One of the perfmon syscalls (pfm_restart()) is used to
resume monitoring after a user level notification. When
operating in per-thread non self-monitoring mode, the
syscall needs to operate on the machine state of the
monitored thread. So you get into this situation:

Thread T0                                   Thread T1
    |                                           |
pfm_restart()                                   |
    |                                           |
spin_lock_irqsave()                             |
    |                                           |
<modify T1's machine state>-------------------->|
    |                                           |
spin_unlock_irqrestore()                        |
    |                                           |
    v                                           v

Thread T1 may be running at the time T0 needs to modify its state.
The current solution is to set a TIF flag in T1. That TIF flag will
cause T1 (on kernel exit) to go into a perfmon function that will
then modify the state, i.e., state is self-modified. That works okay
but there are a few race conditions. For self-monitoring sessions
(e.g., system-wide or per-thread), it is easy because we operate in
the correct thread.

But there is a big difference between self-monitoring and non
self-monitoring. The pfm_restart() syscall does not provide the
same guarantee.

In self-monitoring modes, the interface guarantees that by the time you
return from the call, the effects of the call are visible. Whereas when
monitoring another thread, the call currently does not provide such
guarantee, i.e., it does not wait until T1 has seen the TIF flag and
completed the state modification before returning. We could add a ...

To: <eranian@...>
Cc: <linux-kernel@...>, <davem@...>, <paulus@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <wcohen@...>, <robert.richter@...>, <andi@...>, <eranian@...>, <roland@...>
Date: Friday, December 14, 2007 - 3:12 pm

The utrace code supports this style of thread manipulation better
than ptrace.

- FChE
--

To: Frank Ch. Eigler <fche@...>
Cc: <linux-kernel@...>, <davem@...>, <paulus@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <wcohen@...>, <robert.richter@...>, <andi@...>, <eranian@...>, <roland@...>
Date: Friday, December 14, 2007 - 5:07 pm

Charles,

Are you saying that utrace provides a utrace_thread_stop(tid) call
that returns only when the thread tid is off the CPU, and that there
is a matching utrace_thread_resume(tid) call? If that's the case then
that is what I need.

How are we with regards to utrace integration?

Thanks.

--
-Stephane
--

To: <eranian@...>
Cc: <linux-kernel@...>, <davem@...>, <paulus@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <wcohen@...>, <robert.richter@...>, <andi@...>, <eranian@...>, <roland@...>
Date: Saturday, December 15, 2007 - 11:54 am

While I see no single call, it can be synthesized from a sequence of
them: utrace_attach, utrace_set_flags (... UTRACE_ACTION_QUESCE ...),

Roland McGrath is working on breaking the patches down.

- FChE
--

To: David Miller <davem@...>
Cc: <paulus@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>, Stephane Eranian <eranian@...>
Date: Monday, November 19, 2007 - 4:53 pm

David,

I appreciate your effort. I am glad to see that the interface
and implementation survived yet another architecture. I think at this
point ARM is the only major architecture missing. In any case, I would

As I said earlier, I am not opposed to changing the syscalls. I have
proposed a few schemes to address the issue of versioning. If vector
arguments are problematic, we can go with single register/call.

I think there are other areas where perfmon2 could benefit from the

Thanks.

--
-Stephane
-

To: <eranian@...>
Cc: <paulus@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Monday, November 19, 2007 - 8:55 pm

From: Stephane Eranian <eranian@hpl.hp.com>

I sent these to Philip Mucci late last night, but in the meantime
I finished implementing breakpoint support as well for pfmon.

Let me clean up my diffs and I'll send it all out to you in a
few hours.
-

To: David Miller <davem@...>
Cc: <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 10:34 pm

Well you must mean something different by "strong typing" from the
rest of us. Strong typing means that the compiler can check that you
have passed in the correct types of arguments, but the compiler
doesn't have any visibility into what structures are valid in netlink
messages.

In any case, I think that adding a structure size argument to the
current perfmon2 system calls where appropriate would mean that we
could extend them cleanly later on if necessary. It would mean that
we could add fields at the end, and that the kernel could know what
version of the structures that userspace was using.

Paul.
-

To: Paul Mackerras <paulus@...>
Cc: <davem@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Thursday, November 15, 2007 - 3:48 am

That's strong static typing. Netlink is 90% strong static
typing plus 10% strong dynamic typing. That is, it'll tell
you at run-time if you give it the wrong netlink attribute.

The types within each netlink attribute are checked at compile
time.
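
The run-time half of that claim can be illustrated with a generic type/length check against a policy table. This is not the netlink API: `struct attr`, `attr_policy`, and `attr_validate()` are invented for the sketch and only mimic the general idea of validating an attribute by its type code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute types for a register-write message. */
enum attr_type {
    ATTR_REG_NUM = 1,      /* payload: uint32_t */
    ATTR_REG_VALUE = 2,    /* payload: uint64_t */
    ATTR_MAX = 2
};

/* A decoded type-length-value attribute. */
struct attr {
    uint16_t type;
    uint16_t len;          /* payload length in bytes */
    const void *payload;
};

/* Expected payload size per attribute type; index 0 is unused. */
static const uint16_t attr_policy[ATTR_MAX + 1] = {
    0, sizeof(uint32_t), sizeof(uint64_t)
};

/* Run-time check: unknown types and wrong payload sizes are rejected.
 * Returns 0 if the attribute is acceptable, -1 otherwise. */
static int attr_validate(const struct attr *a)
{
    assert(a != NULL);
    if (a->type == 0 || a->type > ATTR_MAX)
        return -1;
    if (a->len != attr_policy[a->type])
        return -1;
    return 0;
}
```

Adding a new attribute means adding an enum value and a policy entry; existing senders are unaffected.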

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-

To: Herbert Xu <herbert@...>
Cc: Paul Mackerras <paulus@...>, <davem@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Thursday, November 15, 2007 - 4:19 am

Well it tells you EINVAL no matter what is wrong.

That's roughly similar to a compiler whose only error message
is 'WRONG'. Or the ed school of error reporting.

That makes any checking it does barely useful.

-Andi
-

To: David Miller <davem@...>
Cc: <paulus@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Tuesday, November 13, 2007 - 8:28 pm

True, but is it now so different from an ioctl?
-

To: Paul Mackerras <paulus@...>
Cc: David Miller <davem@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Tuesday, November 13, 2007 - 7:49 pm

Could you implement it with readv()?
-

To: <nickpiggin@...>
Cc: <paulus@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 7:58 am

From: Nick Piggin <nickpiggin@yahoo.com.au>

Sure, why not? Just cook up an iovec. pmd_numbers goes to offset
X and pmd_values goes to offset Y, with some helpers like what
we have in the networking already for recvmsg.

But why would you want readv() for this? The syscall thing
Paul asked me to translate into a read() doesn't provide
iovec-like behavior so I don't see why readv() is necessary
at all.
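
For reference, this is what a scatter read with readv() does: one contiguous stream of bytes is split across several output buffers. The sketch uses a pipe as a stand-in data source; the register numbers and values are made up.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Push 2 register numbers followed by 2 values through a pipe, then
 * use a single readv() to scatter them into two separate arrays.
 * Returns 0 on success, -1 on failure. */
static int readv_demo(int32_t numbers[2], uint64_t values[2])
{
    struct {
        int32_t numbers[2];
        uint64_t values[2];
    } src = { { 3, 7 }, { 1003, 1007 } };
    struct iovec iov[2];
    int fds[2];
    ssize_t got;

    assert(numbers != NULL && values != NULL);
    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], &src, sizeof(src)) != (ssize_t)sizeof(src)) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    iov[0].iov_base = numbers;
    iov[0].iov_len = 2 * sizeof(int32_t);
    iov[1].iov_base = values;
    iov[1].iov_len = 2 * sizeof(uint64_t);
    got = readv(fds[0], iov, 2);

    close(fds[0]);
    close(fds[1]);
    return got == (ssize_t)sizeof(src) ? 0 : -1;
}
```

Both iovec entries are pure outputs here. readv() offers no way to hand the kernel a list of which registers to read, so it only helps with where the reply is placed.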
-

To: David Miller <davem@...>
Cc: <paulus@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Tuesday, November 13, 2007 - 8:25 pm

Ah sorry, that's what I get for typing before I think: of course
readv doesn't vectorise the right part of the equation.

What I really mean is a readv-like syscall, but one that also
vectorises the file offset. Maybe this is useful enough as a generic
syscall that also helps Paul's example...

Of course, I guess this all depends on whether the atomicity is an
important requirement. If not, you can obviously just do it with
multiple read syscalls...
-

To: Nick Piggin <nickpiggin@...>
Cc: David Miller <davem@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 5:30 pm

I've sometimes thought it would be useful to have a "transaction"
system call that is like a write + read combined into one:

int transaction(int fd, char *req, size_t req_nb,
char *reply, size_t reply_nb);

as a way to provide a general request/reply interface for special

That would take N system calls instead of one, which could have a
performance impact if you need to read the counters frequently (which
I believe you do in some performance monitoring situations).

Paul.
-

To: Paul Mackerras <paulus@...>
Cc: David Miller <davem@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 6:17 am

Maybe not a bad idea, though I'm not the one to ask about taste ;)
In this case, it is enough for your requests to be a set of scalars
(eg. file offsets), so it _could_ be handled with vectorised offsets...

But in general, for special files, I guess the response is usually
some structured data (that is not visible at the syscall layer).
So I don't see a big problem to have a similarly arbitrarily

That's true too.
-

To: Nick Piggin <nickpiggin@...>
Cc: Paul Mackerras <paulus@...>, David Miller <davem@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 6:56 pm

IOW, an ioctl.
-

To: Chuck Ebbert <cebbert@...>
Cc: Paul Mackerras <paulus@...>, David Miller <davem@...>, <hch@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <andi@...>
Date: Wednesday, November 14, 2007 - 7:03 am

In the same way a read of structured data from a special file
"is an" ioctl, yeah. You could implement either with an ioctl.

The main difference is they have more explicitly typed interfaces.
Whether that's enough argument (and if Paul's proposal is widely
usable enough) is another question. Which I won't try to answer.
-

To: <hch@...>
Cc: <paulus@...>, <akpm@...>, <gregkh@...>, <mucci@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <perfmon@...>, <andi@...>, <perfmon2-devel@...>, <ospat-devel@...>, <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 7:14 am

Ok, I just got 4 freakin' bounces from all of these subscriber only
perfmon etc. mailing lists.

Please remove those lists from the CC: as it's pointless for those of
us not on the lists to participate if those lists can't even see the
feedback we are giving.

-

To: Paul Mackerras <paulus@...>
Cc: Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Wednesday, November 14, 2007 - 3:40 am

Yes, that's quite possible. I don't know how up-to-date people's
knowledge is. I know I haven't looked seriously at the code in around
twelve months.

Let's get it on the wires as outlined and take a look at it all.
-

To: Andrew Morton <akpm@...>
Cc: Greg KH <gregkh@...>, Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 4:36 pm

> He speaks for quite a few people - they have serious need for this feature

Most likely they have serious need for a very small subset of perfmon2.
The point of my proposal was to get this very small subset in quickly.

Phil, how many of the command line options of pfmon do you
actually use? How many do the people at your conference use? Or what
functions, what performance counters etc. in PAPI or whatever
library you use?

Make us understand the use cases better, that would already help a lot
in merging by concentrating on what people actually really need.

-Andi

-

To: Andi Kleen <andi@...>
Cc: Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Stephane Eranian <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, <perfmon2-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 8:28 pm

Hi Andi,

pfmon is a single tool and fairly low level, the HPC folks don't use
it so much because it isn't parallel aware and is meant for power-
users. It is not representative of the tools used in HPC at all. Our
community uses tools built on the infrastructure provided by libpfm
and PAPI for the most part.

I know you don't want to hear this, but we actually use all of the
features of perfmon, because a) we wanted to use the best methods
available and b) areas where user level solutions could be made (like
multiplexing) introduced too much noise and overhead to be of use.
For years we relied on PerfCtr which did 'just enough' for us. But
when Perfmon2 became available, we adopted technology where it meant
a significant increase in accuracy for the resulting measurements,
specifically for us that meant, kernel multiplexing and sample buffers.
Note that PAPI is just middleware. The tools built upon it are what
people use...some of those are commercial tools like Vampir but most
are Open Source. These tools are cross platform, as such they run on
nearly everything...although intel/amd/ppc systems dominate the HPC
market.

The usage cases are always the same and can be broken down into
simple counting and sampling:

- providing virtualized 64-bit counters per-thread
- providing notification (buffered or non) on interrupt/overflow of
the above.

If you'd like to outline further what you'd like to hear from the
community, I can arrange that. I seem to remember going through this
once before, but I'd be happy to do it again. For reference, here's a
quick list from memory of some of the tools in active use and built
on this infrastructure. These are used heavily around the globe.
You'll see that each basically follows one of the 2 usage models above.

- HPCToolkit (Rice)
- PerfSuite (NCSA)
- Vampir (Dresden)
- Kojak (Juelich)
- TAU (UOregon)
- PAPIEX (me)
- GPTL (NCAR)
- HPM-Linux (IBM)
- Paraver (Barcelona)

Time to go give a t...

To: Philip Mucci <mucci@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Stephane Eranian <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, <perfmon2-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 9:52 pm

That is hard to believe.

But let's go for it temporarily for the argument.

Can you instead prioritize features. What is most essential, what is

Ok that makes sense and should be possible with a reasonable simple

Please list concrete features, throwing around random names is not useful.

-Andi

-

To: Andi Kleen <andi@...>
Cc: Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Stephane Eranian <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, papi list <ptools-perfapi@...>
Date: Friday, November 16, 2007 - 5:18 am

Just getting back to this now that SC07 is finally over...

You are welcome to download the code and some of the tools and verify

Yes, although this has been done before. You've got the list below in
the previous emails which should be considered the absolute minimum.

- A feature which was dropped earlier by Stephane (only to satiate
LKML), we consider very important: allowing mapping of the kernel's
view of the PMDs, giving user-space access to full 64-bit counts if
the architecture supports a user-level read instruction. Getting the
counts in a couple of dozen cycles is ALWAYS a win for us. This is
because the HPC community is mainly interested in self-monitoring, not
third-party, because the former can be easily associated with context
in the app through instrumentation in various forms.

- Kernel multiplexing is very nice to have; it saves you tremendous
overhead at user level. PAPI has an implementation in user-space for
the platforms that don't support this. The flexibility of the current
implementation is not exploited; here I'm referring to the concept of
eventsets. Having multiplexing is important. Being able to
allocate/reallocate eventsets and the threshold of individual
eventsets is just nice to have.

- Custom sample formats would be considered not often used in our
community, largely because the tools run on all HPC/Linux
architectures. PAPI uses the default sample format which has been
sufficient for our needs. However, the lack of custom sample formats
precludes the development of the specialized tools that access the
sampling hardware as found on the IA64, PPC64, the Barcelona and the
SiCortex node chip.
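
For the multiplexing item in the list above, the usual way to turn a time-shared count into an estimate for the whole run is to scale by the fraction of time the event was actually on the hardware. This is a generic sketch of that scaling, not code from perfmon2 or PAPI; the parameter names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the full-run count of a multiplexed event.
 *   raw_count    - events counted while the event was scheduled
 *   time_enabled - total time the measurement was active
 *   time_running - portion of that time the event was on a counter
 * Uses double arithmetic, so very large counts lose precision. */
static uint64_t scale_multiplexed(uint64_t raw_count,
                                  uint64_t time_enabled,
                                  uint64_t time_running)
{
    assert(time_running <= time_enabled);
    if (time_running == 0)
        return 0;          /* event never ran: nothing to extrapolate */
    return (uint64_t)((double)raw_count * (double)time_enabled /
                      (double)time_running);
}
```

The estimate assumes the event rate is similar while the event is switched out, which is why finer-grained switching tends to give better results.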

Well that's good news. The above is what we have used via the PerfCtr
set of patches for a long time. It wasn't quite enough, but it got the job

This is the kind of comment that makes the Linux/HPC folks 'somber'. What
isn't useful is being dismissive of an entire community that moves a
heck of a lot of Lin...

To: Philip Mucci <mucci@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Stephane Eranian <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, papi list <ptools-perfapi@...>
Date: Friday, November 16, 2007 - 11:15 am

I didn't see a clear list.

My impression so far is that you're not quite sure what you want,

You mean returning the register number for RDPMC or equivalent
and a way to enable it for ring 3 access?

I'm considering that an essential feature too. I wasn't aware

Yes it is for everybody. I've been rather questioning if the slow
ways (complicated syscalls) to get the counter information are really

What do you mean with custom sample formats exactly? What information
do you want in there? And why?

e.g. PEBS and so on pretty much fix the in memory sample format in hardware,
so the only way to get a custom format would be to use a separate buffer.

I can think of one reason why the kernel should add more information
in a separate buffer (log the instruction bytes so that it can
be disassembled and an address histogram be generated using the PEBS
register values), but it is a relatively obscure one and definitely
not an essential feature. Unfortunately it is also hard to implement completely

Sorry, but these kinds of non-technical BS arguments will just make
you be ignored in mainline Linux land. They might work if you pay
a lot of money to specific Linux companies (do you?), but here
on linux-kernel you have to convince with purely technical arguments.

-Andi
-

To: <andi@...>
Cc: <mucci@...>, <akpm@...>, <gregkh@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <ptools-perfapi@...>
Date: Friday, November 16, 2007 - 8:15 pm

From: Andi Kleen <andi@firstfloor.org>

I would like to add sparc64 support to perfmon2 as well
and therefore I've been considering this angle of the
API issues as well.

The counters on sparc64 can be configured to be readable by userspace,
so for the self-monitoring cases I really would like to make sure the
perfmon2 library interface could use direct reads for sampling instead
of system calls or specialized traps.

If I get some spare time I'll look at the current perfmon2 patches
and see if I can toss together sparc64 support to get a feel for
how things stand currently.
-

To: Andi Kleen <andi@...>
Cc: Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Stephane Eranian <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, papi list <ptools-perfapi@...>
Date: Friday, November 16, 2007 - 4:16 pm

I suppose by complicated here, you're referring to the gather semantics
of the pfm_read/write_pmds/pmcs calls. Many processors may have 100's of
registers (IA64, BG/P, SiCortex), some of which have different access
times. So a naive syscall of 'give me all the registers you've got'
isn't going to cut it. However, any additional simplicity (performance)
we can squeeze out of this particular primitive is a huge win as it
sits in the critical path of the user

Performance and noise. See the earlier message about our user-land
implementation versus kernel mode implementations. At any useful
granularity, you begin to seriously affect the counts with noise as
well as dilate the run-time. But let's punt on this one until after

By custom here, I mean the ability to have the kernel take samples
containing more than just the IP, the PID and a bitmask of which
registers overflowed at this point. Myself and others have worked hard
to get effective address sampling into the hardware (there are
registers that contain EA's of misses as well as branch mispredict
data on the PPC, IA64, Barcelona and SiCortex) that are handled
through the use of a format that gathers up that information at
interrupt time for deposit into the sample buffer. We are not wedded
to Perfmon2's implementation of these formats, we are however, wedded
to having this information collected at interrupt time as the data
may change by the time you get back to user-mode. This hardware is
not obscure any more, it's the norm, as we've learned at thus simple
aggregate counters, even those with precise
I love it when kernel folks refer to their own revenue streams
(and yes, we do, ask your VP of sales) and the needs of a user
community as "BS non-technical arguments".

But let's get back to basics here. We can sort that out over a beer
sometime.
At this point, let's try and agree on the minimum set of
functionality acceptable for a first round of patches.

- per-CPU (system...

To: Andi Kleen <andi@...>
Cc: Philip Mucci <mucci@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Stephane Eranian <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, papi list <ptools-perfapi@...>
Date: Friday, November 16, 2007 - 1:51 pm

- cross platform extensible API for configuring perf counters
- support for multiplexed counters
- support for virtualized 64-bit counters
- support for PC and call graph sampling at specific intervals
- support for reading counters not necessarily with sampling
- taskswitch support for counters
- API available from userland
- ability to self-monitor: need select/poll/etc interface
- support for PEBS, IBS and whatever other new perf monitoring
infrastructure the vendors throw at us in the future
- low overhead: must minimize the "probe effect" of monitoring
- low noise in measurements: cannot achieve this in userland

perfmon2 has all of this and more i've probably neglected...

-dean
-

To: <dean@...>
Cc: <andi@...>, <mucci@...>, <akpm@...>, <gregkh@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <ptools-perfapi@...>
Date: Friday, November 16, 2007 - 8:29 pm

From: dean gaudet <dean@arctic.org>

I want to state that even though I've been a stickler on the system
call stuff, in general I want to see perfmon2 go into tree and I agree
with how most of the infrastructure is implemented and the features it
provides.
-

To: David Miller <davem@...>
Cc: <dean@...>, <andi@...>, <mucci@...>, <akpm@...>, <eranian@...>, <wcohen@...>, <robert.richter@...>, <linux-kernel@...>, <ptools-perfapi@...>
Date: Friday, November 16, 2007 - 9:07 pm

Now if we only had a series of patches that we could actually review and
apply to the -mm tree so that people can try them out... :)

thanks,

greg k-h
-

To: Andi Kleen <andi@...>
Cc: Philip Mucci <mucci@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Stephane Eranian <eranian@...>
Date: Friday, November 16, 2007 - 12:00 pm

Andi,

No, he is talking about something similar to what was in perfctr.
The kernel emulates 64-bit counters in software and that is what you
get back when you read the counters. If you read via RDPMC, you
get 40 bits. To reconstruct the full 64-bit value from user land
you need the upper bits. One approach is for the kernel to allow
you to remap a page that has the 64-bit (software) counters. With
Perfmon2 allows you to have an in-kernel sampling buffer. The idea is
not new, Oprofile has this as well. The problem here is that if the
buffer is in the kernel the format of the samples is fixed and it
should not have to be. Tools may want to record samples in different formats
and as you said some may need extra information gathered in the kernel.
Some may want to aggregate samples in the kernel (Oprofile used to
do that), some may want to use a double-buffer approach to minimize
blind spots, others may simply use the counter overflow mechanism to
record something that is non-PMU related, e.g, kernel call stack.
I have built such a module and it was quite interesting to collect
the call stack when you hit a last cache level miss.

The idea behind customizable sampling format is simple: extract the
format from the perfmon core and put this into a kernel module. The
core provides a simple registration mechanism and the two communicate
via a set of callbacks.

Perfmon2 comes with a basic default format which works on all
platforms. But it is possible to develop others without having to
patch the kernel nor recompile nor reboot. At its core, each format provides
a handler routine which is called on counter overflow. The handler routine
controls what is recorded, how it is recorded, how it is exported to
userland, and whether overflow notifications need to be sent.
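
The split described above (a core, pluggable formats, a registration call, and an overflow callback) can be sketched in plain user-space C. This is only an illustration of the idea: `smpl_fmt`, `fmt_register`, `core_overflow`, and the "ip-only" format are hypothetical names chosen for the sketch, not the actual perfmon2 kernel interface.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FMTS 4

/* Hypothetical description of one sampling format. */
struct smpl_fmt {
	const char *name;
	/* Called on counter overflow; returns nonzero if userland must be notified. */
	int (*handler)(void *buf, unsigned int pmd, unsigned long ip);
};

static const struct smpl_fmt *fmt_table[MAX_FMTS];

/* Register a format with the core: 0 on success, -1 if duplicate or table full. */
static int fmt_register(const struct smpl_fmt *fmt)
{
	size_t i;

	for (i = 0; i < MAX_FMTS; i++)
		if (fmt_table[i] && strcmp(fmt_table[i]->name, fmt->name) == 0)
			return -1;
	for (i = 0; i < MAX_FMTS; i++) {
		if (!fmt_table[i]) {
			fmt_table[i] = fmt;
			return 0;
		}
	}
	return -1;
}

/* Core side: find the format by name and hand the overflow to its handler. */
static int core_overflow(const char *name, void *buf,
			 unsigned int pmd, unsigned long ip)
{
	size_t i;

	for (i = 0; i < MAX_FMTS; i++)
		if (fmt_table[i] && strcmp(fmt_table[i]->name, name) == 0)
			return fmt_table[i]->handler(buf, pmd, ip);
	return -1;
}

/* Example format: records only the instruction pointer and never notifies. */
static int ip_only_handler(void *buf, unsigned int pmd, unsigned long ip)
{
	(void)pmd;
	*(unsigned long *)buf = ip;
	return 0;
}

static const struct smpl_fmt ip_only_fmt = { "ip-only", ip_only_handler };
```

The point of the design is that the core never needs to know what a sample looks like; everything about the layout lives behind the handler.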

Using this mechanism, for instance, we were able to connect the
Oprofile kernel code to perfmon2 on Itanium with a 100 lines of

This is also how we support PEBS because, as you said, the format of the
samples is not under your control....

To: Stephane Eranian <eranian@...>
Cc: Andi Kleen <andi@...>, Philip Mucci <mucci@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>
Date: Friday, November 16, 2007 - 12:28 pm

You mean the page contains the upper [40;63] bits?

Sounds reasonable, although I don't remember seeing that when I looked

... you also didn't say *why* that is needed.

Can you give a concrete use case for something that cannot be done

The existing oprofile code works already fine on x86, no real

Exactly that makes the support for random custom buffers questionable.

e.g. as I can see the main advantage of perfmon over existing setups
is that it support PEBS etc., but with your custom buffer formats which
are by definition incompatible with PEBS you would negate that advantage
again.

Why this insistence against changing anything?

-Andi
-

To: Andi Kleen <andi@...>
Cc: Philip Mucci <mucci@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Stephane Eranian <eranian@...>
Date: Friday, November 16, 2007 - 1:36 pm

Andi,

Do you question why Oprofile has one ;->

But I am happy to explain.

With sampling, you want to record information about the execution of a
thread at some interval. The interval could be expressed as time or
number of occurrences of a PMU event.

Typically you get a notification. Then you need to collect certain
information about the execution. Typically you record the instruction
pointer (e.g. Oprofile), but you may want to record the value of other
counters, PMU registers or other HW/SW resources. While you're doing
this monitoring is typically stopped so you get a consistent view. After
you're done recording you need to re-arm the sampling period. If you
use event-based sampling, you need to reprogram the counter(s). Then
you resume monitoring. You have to repeat this process for each sample
regardless of whether you are self-monitoring, monitoring another thread,
or monitoring a CPU.

Such sequence of operations is quite expensive, especially in the case
where you are monitoring another thread, because it incurs at least
a couple of context switches per sample in addition to the various
register manipulations and syscalls.

The idea with the kernel sampling buffer is that you amortize the
cost of notification to userland over LOTS of samples. On counter
overflow, the kernel records the samples on your behalf. There is
no context switch, samples are always recorded in the context of
the monitored thread.

Now, you need a bit more information for this to work correctly
because the kernel records on *your behalf*, thus
you need to express:
- what you want to see recorded

- the value to reload into the overflowed counter(s)
so the kernel can re-arm the next period.
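
A rough user-space model of that amortization, assuming a simulated 64-bit counter and a fixed-size buffer: the overflow handler records a sample, reloads the counter so it wraps again after `period` events, and asks for a notification only when the buffer is full. All names here (`smpl_ctx`, `on_overflow`, `reload_value`, `drain`) are made up for the sketch and are not perfmon2 identifiers.

```c
#include <stddef.h>
#include <stdint.h>

#define SMPL_BUF_ENTRIES 4

struct sample {
	uint64_t ip;	/* instruction pointer at overflow */
	uint32_t pmd;	/* which data register overflowed */
};

struct smpl_ctx {
	struct sample buf[SMPL_BUF_ENTRIES];
	size_t count;		/* samples currently in the buffer */
	uint64_t period;	/* events between two samples */
	uint64_t counter;	/* simulated counter value */
};

/* Value to load so that the counter wraps after `period` more events. */
static uint64_t reload_value(uint64_t period)
{
	return (uint64_t)0 - period;
}

/*
 * Overflow handler: record one sample, re-arm the counter, and return 1
 * only when the buffer is full and userland has to be notified.
 */
static int on_overflow(struct smpl_ctx *ctx, uint64_t ip, uint32_t pmd)
{
	if (ctx->count == SMPL_BUF_ENTRIES)
		return 1;	/* still full: nothing recorded */
	ctx->buf[ctx->count].ip = ip;
	ctx->buf[ctx->count].pmd = pmd;
	ctx->count++;
	ctx->counter = reload_value(ctx->period);
	return ctx->count == SMPL_BUF_ENTRIES;
}

/* Userland side, after the notification: consume every sample in one go. */
static size_t drain(struct smpl_ctx *ctx)
{
	size_t n = ctx->count;

	ctx->count = 0;
	return n;
}
```

With four entries, three out of four overflows complete without any notification; a real buffer would hold far more samples per notification.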

Because you have multiple counters, you may use them for sampling
periods, i.e., overlap sampling measurements. That is something
done very frequently.

For instance, the q-syscollect tool that D. Mosberger wrote, is
overlapping elapsed cycles and branch trace buffer (BTB) sampling
to col...

To: Andi Kleen <andi@...>
Cc: Stephane Eranian <eranian@...>, Philip Mucci <mucci@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>
Date: Friday, November 16, 2007 - 1:13 pm

Upper 32 bits ([32:63]). On many implementations only the lower 32 bits are
available in the register. On several x86 processor implementations, bits 32:40
cannot be set to anything other than the sign extension of bit 32. On

OProfile is very useful in many cases, but it only performs sampling. If one wants
to take a look at the number of events a specific section of code causes, one can't
really do that with oprofile. The counters are running systemwide, not per
thread. For some experiments developers really like to have per thread counters.

The rewrite of oprofile to use the perfmon code was to consolidate code using
the performance monitoring hardware. Use one interface for accessing the
performance monitoring hardware rather than have one for sampling and another

So the alternative approach is to write a new device driver for each of the new
performance monitoring mechanisms, e.g. one for PEBS and another for IBS?

One of the reasons for the custom sample buffers was to avoid having an expensive
user-space signal for a process to record some simple pieces of data each time
the data becomes available. For the oprofile port to the perfmon2 custom buffer
mechanism the instruction pointer and the counter that overflowed are
recorded. The buffer can be processed in one large chunk by userspace, reducing
overhead. In essence the current implementation of OProfile in the mainline
kernels has a custom buffer mechanism.

-Will

-

To: William Cohen <wcohen@...>
Cc: Andi Kleen <andi@...>, Philip Mucci <mucci@...>, Andrew Morton <akpm@...>, Greg KH <gregkh@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>
Date: Friday, November 16, 2007 - 5:56 pm

Will,

That is quite true on Intel's. Perfmon2 only considers the bottom 31 bits as
true counter bits, the rest is forced to 1. This is true even on Intel Core 2.
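
As a sketch of how a 64-bit software counter can sit on top of such a narrow hardware counter: software keeps the part above bit 30 and bumps it on every hardware overflow, and a read adds the masked hardware value. The 31-bit width follows the statement above; the names `virt_counter`, `virt_overflow`, and `virt_read` are hypothetical, and the race between an overflow and a concurrent read is deliberately ignored.

```c
#include <stdint.h>

#define HW_BITS 31
#define HW_MASK ((UINT64_C(1) << HW_BITS) - 1)

struct virt_counter {
	uint64_t soft;	/* software part, always a multiple of 2^HW_BITS */
};

/* Called once per hardware overflow interrupt. */
static void virt_overflow(struct virt_counter *vc)
{
	vc->soft += UINT64_C(1) << HW_BITS;
}

/* Full 64-bit value: software part plus the true counter bits of the raw read. */
static uint64_t virt_read(const struct virt_counter *vc, uint64_t hw_raw)
{
	return vc->soft + (hw_raw & HW_MASK);
}
```

A user-level tool doing the same thing with RDPMC would need the software part exported to it, for instance through a remapped page as discussed elsewhere in this thread.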

--
-Stephane
-

To: Andrew Morton <akpm@...>
Cc: Philip Mucci <mucci@...>, <eranian@...>, William Cohen <wcohen@...>, Robert Richter <robert.richter@...>, <linux-kernel@...>, Perfmon <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>, OSPAT devel <ospat-devel@...>, papi list <ptools-perfapi@...>
Date: Tuesday, November 13, 2007 - 4:14 pm

I agree. Right now their git tree has over 80 patches in it, without
descriptions like this to help those of us who want to review and help
out, it is quite difficult.

thanks,

greg k-h
-

To: <eranian@...>
Cc: <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 2:33 pm

The oprofile module can set up a handler for PMU interrupts. This is done in
arch/x86/oprofile/nmi_int.c:nmi_cpu_setup(). Other modules could do the same.
However, it bumps whatever was using the nmi/pmu off, then restores nmi/pmu
when oprofile is shut down. Maybe the pmu/nmi resource reservation mechanism

The per-thread monitoring is useful to a number of people and many people want
it. The thought was how to break the large perfmon patch into a set of smaller
incremental patches. So it isn't whether to have per-thread pmu virtualization,
but rather when/how to get it in.

-Will
-

To: William Cohen <wcohen@...>
Cc: <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 5:13 pm

Will,

Oprofile does not set up the PMU interrupt. It builds on top of the NMI watchdog
setup. It uses the register_die() mechanism, if I recall. The low level APIC
and gate is setup elsewhere. Perfmon does not use NMI, unless forced to because

I think we all agree on this.

--

-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: William Cohen <wcohen@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, <perfmon@...>, Andi Kleen <andi@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 5:29 pm

Oprofile works without the NMI watchdog too, but it just happens to be another

It could handle it in the same way as oprofile if it wanted. But given
NMIs make everything more complicated and it might not be worth it.

-Andi
-

To: Andi Kleen <andi@...>
Cc: <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, <perfmon@...>, William Cohen <wcohen@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 5:46 pm

Andi.

I meant the register_die_notifier() mechanism which allows you to
chain a handler on NMI interrupts. At least that's my understanding
reading the code:

static int nmi_setup(void)
{
	int err=0;
	int cpu;

	if (!allocate_msrs())
		return -ENOMEM;

	if ((err = register_die_notifier(&profile_exceptions_nb))){
		free_msrs();
		pfm_release_allcpus();
		return err;
	}

Yes, horribly more complicated because of locking issues within perfmon.
As soon as you expose a file descriptor, you need some locking to prevent
multiple user threads (malicious or not) from competing for access to the PMU state.
I think the value add of NMI can be as well achieved with advanced PMU features
such as Intel Core 2 PEBS.

--
-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: Andi Kleen <andi@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, <perfmon@...>, William Cohen <wcohen@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 5:50 pm

Why do you need the file descriptor?

One of the main problems with perfmon is the complicated user interface.

True probably, although only on CPUs that support PEBS. Dropping features
for old CPUs is unfortunately quite difficult in Linux, and in this case
probably not an option because there are so many of them (e.g. all of AMD
not Fam10h)

-Andi
-

To: Andi Kleen <andi@...>
Cc: <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, <perfmon@...>, William Cohen <wcohen@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 6:22 pm

Andi,

To identify your monitoring session, be it system-wide (i.e., per-cpu) or per-thread.
A file descriptor allows you to use close, read, select, poll and you leverage the
existing file descriptor sharing/inheritance semantics. At the kernel level, a
descriptor provides all the callbacks necessary to make sure you clean up the perfmon

Yes, I know that. Also note that unfortunately, AMD Fam10h IBS feature does not
allow you to capture more than one sample in critical sections. It is still
interrupt based sampling with one entry-deep buffer: one interrupt = one sample.
Perfmon does support NMI though it is much more expensive to use.

--
-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: Andi Kleen <andi@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, <perfmon@...>, William Cohen <wcohen@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 6:25 pm

Surely that could be done with a flag for each call too? Keeping file descriptors

Didn't you already have a thread destructor for it?

-Andi
-

To: Andi Kleen <andi@...>
Cc: <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, <perfmon@...>, William Cohen <wcohen@...>, <perfmon2-devel@...>
Date: Tuesday, November 13, 2007 - 6:58 pm

Andi,

I don't understand this.

Let's take the simplest possible example (self-monitoring per-thread)
counting one event in one data register.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
/* perfmon2 declarations; header location assumed to be libpfm's */
#include <perfmon/perfmon.h>

int
main(int argc, char **argv)
{
	int ctx_fd;
	pfarg_pmd_t pd[1];
	pfarg_pmc_t pc[1];
	pfarg_ctx_t ctx;
	pfarg_load_t load_args;

	memset(&ctx, 0, sizeof(ctx));
	memset(pc, 0, sizeof(pc));
	memset(pd, 0, sizeof(pd));
	memset(&load_args, 0, sizeof(load_args));

	/* create session (context) and get file descriptor back (identifier) */
	ctx_fd = pfm_create_context(&ctx, NULL, NULL, 0);

	/* setup one config register (PMC0) */
	pc[0].reg_num = 0;
	pc[0].reg_value = 0x1234;

	/* setup one data register (PMD0) */
	pd[0].reg_num = 0;
	pd[0].reg_value = 0;

	/* program the registers */
	pfm_write_pmcs(ctx_fd, pc, 1);
	pfm_write_pmds(ctx_fd, pd, 1);

	/* attach the context to self */
	load_args.load_pid = getpid();
	pfm_load_context(ctx_fd, &load_args);

	/* activate monitoring */
	pfm_start(ctx_fd, NULL);

	/*
	 * run code to measure
	 */

	/* stop monitoring */
	pfm_stop(ctx_fd);

	/* read data register */
	pfm_read_pmds(ctx_fd, pd, 1);

	printf("PMD0 %llu\n", (unsigned long long)pd[0].reg_value);

	/* destroy session */
	close(ctx_fd);

	return 0;
}

--

-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: Andi Kleen <andi@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, William Cohen <wcohen@...>
Date: Tuesday, November 13, 2007 - 10:07 pm

[dropped all these bouncing email lists. Adding closed lists to public

Why do you need to set the data register? Wouldn't it make

My replacement would be to just add a flags argument to write_pmcs

Why can't that be done by the call setting up the register?

Or if someone needs to do it for a specific region they can read

On x86 i think it would be much simpler to just let the set/alloc
register call return a number and then use RDPMC directly. That would
be actually faster and be much simpler too.

I suppose most architectures have similar facilities, if not a call could be
added for them but it's not really essential. The call might be also needed
for event multiplexing, but frankly I would just leave that out for now.

e.g. here is one use case I would personally see as useful. We need
a replacement for simple cycle counting since RDTSC doesn't do that anymore
on modern x86 CPUs. It could be something like:

/* 0 is the initial value */

/* could be either library or syscall */
event = get_event(COUNTER_CYCLES);
if (event < 0)
/* CPU has no cycle counter */

reg = setup_perfctr(event, 0 /* value */, LOCAL_EVENT); /* syscall */

rdpmc(reg, start);
.... some code to run ...
rdpmc(reg, end);

free_perfctr(reg); /* syscall */

On other architectures rdpmc would be different of course, but
the rest could be probably similar.

-Andi

-

To: Andi Kleen <andi@...>
Cc: <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, William Cohen <wcohen@...>
Date: Wednesday, November 14, 2007 - 9:09 am

Andi,

Partially true. The file descriptor becomes really useful when you sample.
You leverage the file descriptor to receive notifications of counter overflows
and full sampling buffer. You extract notification messages via read() and you can
use SIGIO, select/poll.
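
The notification path described here is ordinary file-descriptor I/O, so a small sketch needs nothing beyond POSIX `poll()` and `read()`. The helper name `wait_for_msg` is an assumption, and the layout of a real perfmon2 notification message is not modeled; any readable descriptor (a pipe, for instance) can stand in for the context descriptor when trying it out.

```c
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for a notification on fd and read it into buf.
 * Returns the number of bytes read, 0 on timeout, or -1 on error.
 */
static ssize_t wait_for_msg(int fd, void *buf, size_t len, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ready = poll(&pfd, 1, timeout_ms);

	if (ready < 0)
		return -1;	/* poll() itself failed */
	if (ready == 0)
		return 0;	/* nothing arrived in time */
	if (!(pfd.revents & POLLIN))
		return -1;	/* hangup or error on the descriptor */
	return read(fd, buf, len);
}
```

Because the identifier is a descriptor, the same call works unchanged inside an existing select/poll event loop alongside sockets and timers.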

The example shows how you can leverage existing mechanisms to destroy the session, i.e.,
free the associated kernel resources. For that, you use close() instead of adding yet
another syscall. It also provides a resource limitation mechanism to control consumption
Are you suggesting something like: pfm_write_pmcs(fd, 0, 0x1234)?

That would be quite expensive when you have lots of registers to setup: one
syscall per register. The perfmon syscalls to read/write registers accept a vector
of arguments to amortize the cost of the syscall over multiple registers
(similar to poll(2)).
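
To make the amortization concrete, here is a toy model in which one function call stands for one kernel entry and programs any number of registers. `reg_arg`, `write_regs`, and `kernel_entries` are hypothetical names for the sketch; validating every entry before applying any of them is a choice made here, not a statement about how perfmon2 behaves.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_REGS 8

struct reg_arg {
	uint16_t reg_num;
	uint64_t reg_value;
};

static uint64_t regs[NUM_REGS];		/* simulated register file */
static unsigned int kernel_entries;	/* how many "syscalls" were made */

/*
 * Stand-in for a single vectored syscall: checks all entries first,
 * then writes them. Returns 0 on success, -1 if any register is invalid.
 */
static int write_regs(const struct reg_arg *v, size_t n)
{
	size_t i;

	kernel_entries++;
	for (i = 0; i < n; i++)
		if (v[i].reg_num >= NUM_REGS)
			return -1;
	for (i = 0; i < n; i++)
		regs[v[i].reg_num] = v[i].reg_value;
	return 0;
}
```

Programming three registers this way costs one kernel entry instead of three, which matters most when registers are reprogrammed on every sample.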

With many tools, registers are not just setup once. During certain measurements,
data registers may be read multiple times. When you sample or multiplex at
the user level, you do need to reprogram the PMU state and that is on the critical
path.

You do not want a call that programs the entire PMU state all at once either. Many times,
you only want to modify a small subset. Having the full state does also cause some portability
It depends on what you are doing. Here, this was not really necessary. It was
meant to show how you can program the data registers as well. Perfmon2 provides
default values for all data registers. For counters, the value is guaranteed to
be zero.

But it is important to note that not all data registers are counters. That is the
case of Itanium 2, some are just buffers. On AMD Barcelona IBS several are buffers as
well, and some may need to be initialized to non zero value, i.e., the IBS sampling
period.

With event-based sampling, the period is expressed as the number of occurrences
of an event. For instance, you can say: "take a sample every 2000 L2 cache misses".
The way you express this with perfmon2 is that y...

To: Stephane Eranian <eranian@...>
Cc: Andi Kleen <andi@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, William Cohen <wcohen@...>
Date: Wednesday, November 14, 2007 - 10:24 am

Hmm, ok for the event notification we would need a nice interface. Still

I think you optimize the wrong thing here.

There are basically two cases I see:

- Global measurement of lots of things:
Things are slow anyways with large context switch overheads. The
overheads are large anyways. Doing one or more system calls probably
does not matter much. Most important is a clean interface.

- Exact measurement of the current process. For that you need very
low latencies. Any system call is too slow. That is why CPUs have
instructions like RDPMC that allow to read those registers with
minimal latency in user space. Interface should support those.

Also for this case programming time does not matter too much. You
just program once and then do RDPMC before code to measure and then
afterwards and take the difference. The actual counter setup is out

Setting period should be a separate call. Mixing the two together into one

I didn't object to providing the initial value -- my example had that.
Just having a separate concept of data registers seems too complicated to me.
You should just pass event types and values and the kernel gives you

And? You didn't say what the advantage of that is?

All the approaches add context switch latencies. It is not clear that the separate

Well the system call layer can manage that transparently with a little software state

I disagree. Using RDPMC is essential for at least some of the things I would like
to do with perfmon2. If the interface does not provide it it is useless to me at least.
System calls are far too slow for cycle measurements.

And when RDPMC is already supported it should be as widely used as possible.

Regarding the portable code problem: of course you would have some header in user space

I think only supporting global and self monitoring as first step is totally fine.

Sure at some point a system call for the more complex cases (also like multiplexing) would
be needed. But I don't think we need it as firs...

To: Andi Kleen <andi@...>
Cc: <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>, William Cohen <wcohen@...>, <perfmon2-devel@...>
Date: Wednesday, November 14, 2007 - 8:07 pm

Andi,

Why do you think the existing interfaces are not a good fit for this?
Is this just because of your problem with file descriptors?

From my experience read(), select(), and SIGIO are fine. I know many tools use that.

As for the file descriptor, you would need to replace that with another identifier of
some sort. As I pointed out in another message on this thread, you don't want to use
a pid-based identifier. This is not usable when you monitor other threads and you
If people do not like vector arguments, then I think I can live with N system calls
to program N registers. Now you have two choices for passing the arguments:

- a pointer to a struct
struct pfarg_pmc {
uint64_t reg_value;
uint16_t reg_num;
} pmc0;
pmc0.reg_num = 0; pmc0.reg_value = 0x1234;
pfm_write_pmcs(fd, &pmc0);

- explicitly passing every field:
pfm_write_pmcs(fd, 0x0, 0x1234);

Given that event set and multiplexing would not be in initially, we would want
to allow for them to be added later without having to create yet another
system call, right?

I am not sure I understand what you mean by 'lots of things'?

I don't have a problem with that. And in fact, I already support that
at least on Itanium. I had that in there for X86 but I dropped it after
you said that you would enable cr4.pce globally. I don't have a problem
Periods are setup by data register. Given that there is already a call to program
the data register why add another one? You don't need to treat the sampling period
differently from the register value. This is just a value that will cause the register

Should you support a kernel level sampling buffer (like Oprofile) you'd also want
to specify the reset value on overflow. And you would not necessarily want it to
be identical to the initial value (period). So you'd have to have a way to specify that

I am not against providing a flat namespace. But I think it is nice to separate config

Absolutely not, you don't want the kernel to know about events. This has ...

To: Andi Kleen <andi@...>
Cc: Stephane Eranian <eranian@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>
Date: Wednesday, November 14, 2007 - 11:44 am

There are a number of processors that have 32-bit counters such as the IBM power
processors. On many x86 processors the upper bits of the counter are sign
extended from the lower 32 bits. Thus, one can only assume the lower 32-bit are
available. Roll over of values is quite possible (<2 seconds of cycle count), so

What range of cycles are you interested in measuring? 100's of cycles? A couple
thousand? Are you just looking at cycle counts or other events?

-Will
-

To: William Cohen <wcohen@...>
Cc: Andi Kleen <andi@...>, Stephane Eranian <eranian@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>
Date: Wednesday, November 14, 2007 - 2:53 pm

On x86 they are sign-extended only on write, on read they are 40 bits wide
for intel, 48 bits for AMD.
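
When measuring a short region with two raw reads, the counter width decides how the difference has to be computed. A minimal sketch, assuming the width is known (40 or 48 bits per the figures above) and that the counter wrapped at most once between the reads; `counter_delta` is a name invented for the example.

```c
#include <stdint.h>

/*
 * Events elapsed between two raw reads of a `width`-bit counter
 * (1 <= width <= 64). Correct as long as fewer than 2^width events
 * occurred between the reads.
 */
static uint64_t counter_delta(uint64_t start, uint64_t end, unsigned int width)
{
	uint64_t mask = (width >= 64) ? UINT64_MAX
				      : ((UINT64_C(1) << width) - 1);

	/* unsigned subtraction wraps modulo 2^64; the mask reduces it to 2^width */
	return (end - start) & mask;
}
```

For example, a 40-bit counter read at 2^40 - 10 and then at 5 has advanced by 15 events, which is what the masked subtraction yields.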

BTW, isn't rdpmc only enabled for ring 0 on linux ? I remember a patch
to disable it, dunno if it has been applied.

--
Phe

-

To: Philippe Elie <phil.el@...>
Cc: William Cohen <wcohen@...>, Andi Kleen <andi@...>, Stephane Eranian <eranian@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>
Date: Wednesday, November 14, 2007 - 3:15 pm

Obviously -- without a system call to set up performance counters it
would be fairly useless. But of course once such system calls are in
they should be able to trigger the bit for each process.

-Andi
-

To: William Cohen <wcohen@...>
Cc: Andi Kleen <andi@...>, <akpm@...>, Robert Richter <robert.richter@...>, <gregkh@...>, <linux-kernel@...>
Date: Wednesday, November 14, 2007 - 12:13 pm

Exactly, on Intel's only the bottom 32-bit actually are useable, the rest is
sign-extension. That's why it is okay for measuring small sections of code,
but that's it. On AMD, I think it is better. On Itanium you get the 47-bit worth.
Don't know about Power or Cell.

--
-Stephane
-

To: Robert Richter <robert.richter@...>
Cc: Andi Kleen <andi@...>, <gregkh@...>, <akpm@...>, <linux-kernel@...>, <perfmon2-devel@...>, <perfmon@...>
Date: Tuesday, November 13, 2007 - 2:32 pm

Hello,

Note that I am not against the gradual approach such as:
- system-wide only counting
- per-thread counting
- user-level sampling support
- in-kernel sampling buffer support
- in-kernel customizable sampling buffer formats via modules
- event set multiplexing
- PMU description modules

It would obviously cause a lot of trouble for existing perfmon libraries and
applications (e.g. PAPI). It would also be fairly tricky to do because you'd
have to make sure that in the beginning, you leave enough flexibility such that
you can add the rest while maintaining total backward compatibility. But given
that we already have the full solution, it could just be a matter of dropping
features without disrupting the user level API. Of course there would be a bigger
burden on the maintainer because he would have two trees to maintain but I think
that is already commonplace in many of the kernel-related projects.

Let's take a simple example. The set of syscalls necessary to control a system-wide
monitoring session is exactly the same as for a per-thread session. The difference is
just a flag when the session is created. Thus, we could keep the same set of syscalls,
but only accept system-wide sessions. Later on, when we add per-thread, we would just
have to expose the per-thread session flag.
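
One way to picture this staged approach is a creation call that takes a flags word and rejects anything not yet supported, so the call signature never changes as features are exposed. The flag names and `session_create` below are hypothetical and are not the perfmon2 definitions.

```c
#include <errno.h>

#define SESS_FL_SYSTEM_WIDE	0x1u
#define SESS_FL_PER_THREAD	0x2u	/* defined, but not accepted yet */

/* Flags the first round of patches would accept. */
#define SESS_SUPPORTED_FLAGS	SESS_FL_SYSTEM_WIDE

/* Returns 0 if the session type is supported, -EINVAL otherwise. */
static int session_create(unsigned int flags)
{
	if (flags & ~SESS_SUPPORTED_FLAGS)
		return -EINVAL;	/* unknown or not-yet-exposed flag */
	if (!(flags & SESS_FL_SYSTEM_WIDE))
		return -EINVAL;	/* no session type selected */
	return 0;
}
```

Exposing per-thread sessions later would then amount to adding `SESS_FL_PER_THREAD` to `SESS_SUPPORTED_FLAGS`, with no change visible to callers that only use system-wide sessions.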

Having said that, this does not mean that this is necessarily what we will do. I am just
trying to present my understanding of the comments from Andrew, Andi and others.

I think that going with a kernel module will not address the 'complexity/bloat' perception
that some people have. There is a logic to that, I did not just wake up one day saying
'wouldn't it be cool to add set multiplexing?'. There was a true need expressed by users or
developers and it was justified by what the hardware offered then. This unfortunately still
stands today. I admit that justification is not necessarily spelled out clearly in the code. So
I understand most of those worries and I am trying to figure out how we could best ...

To: Stephane Eranian <eranian@...>
Cc: Robert Richter <robert.richter@...>, Andi Kleen <andi@...>, <gregkh@...>, <akpm@...>, <linux-kernel@...>, <perfmon2-devel@...>, <perfmon@...>, Christoph Hellwig <hch@...>
Date: Friday, November 16, 2007 - 2:25 pm

(jumping in late in the game)

Linux Trace Toolkit Next Generation would _happily_ use global PMC
counters, but I would prefer to interact with an internal kernel API
rather than being required to start/stop counters from user-space. There
is a big precision loss involved in having to start things from
userspace.

Ideally, this API would manage access to available PMCs and even use the
same counters for both system-wide tracing/profiling done at the same
time as user-space profiling. This would however involve having a
wrapper around both user-space and kernel-space performance counter
reads, which is fine with me. I would suggest that user-space still go
through a system call for this, since this is available at early boot,
before the filesystem is mounted.

This API could offer an in-kernel, architecture _independent_ PMC control
interface to:
- list available PMCs
- That would involve mapping the common PMCs to some generic
identifier
- attach to these PMCs, with a certain priority

We could call a single connexion to a PMC a "virtual PMC". All PMC
accesses should then be done through this internally managed structure
(giving callbacks to be called after a certain count, reads, stop...).
We could have virtual PMCs that are : system wide, or per thread.

As a starting point, we could limit one virtual PMC attached to a
physical PMC at a given time. Later, we could add support for multiple
virtual PMCs connected to a single physical PMC. The priorities could be
used to kick out the PMC users with lower priorities (that involves that
a PMC read could fail!).

Then, to get interrupts or signals upon PMC overflow, we could manage
each physical PMC like a timer, using the lowest requested value for the
next time we are to be awakened. Some logic would have to be added to
the pmc read operation to get the "real" expected value, but this is
nothing difficult.
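
The timer-like management suggested above can be sketched as follows: each virtual PMC holds the number of events left until its user wants a callback, the physical PMC is armed with the smallest of those, and after it fires every user is charged the elapsed events. The structure and function names are hypothetical, and priorities and eviction are left out.

```c
#include <stddef.h>
#include <stdint.h>

struct virt_pmc {
	uint64_t remaining;	/* events left until this user's callback */
	int active;		/* nonzero while attached to the physical PMC */
};

/* Smallest remaining count among active users, or 0 if there are none. */
static uint64_t next_deadline(const struct virt_pmc *v, size_t n)
{
	uint64_t best = 0;
	int found = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!v[i].active)
			continue;
		if (!found || v[i].remaining < best) {
			best = v[i].remaining;
			found = 1;
		}
	}
	return best;
}

/*
 * Charge `elapsed` events to every active user. Users whose count is
 * used up are deactivated; the return value is how many expired.
 */
static size_t advance(struct virt_pmc *v, size_t n, uint64_t elapsed)
{
	size_t fired = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!v[i].active)
			continue;
		if (v[i].remaining <= elapsed) {
			v[i].remaining = 0;
			v[i].active = 0;
			fired++;
		} else {
			v[i].remaining -= elapsed;
		}
	}
	return fired;
}
```

With users waiting for 5000 and 2000 events, the hardware would be armed for 2000; after it fires, one user expires and the other has 3000 events left.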

Those were the ideas I had last OLS after hearing the talk about
perfmon2. I hope they can be useful. If thing...

To: Stephane Eranian <eranian@...>
Cc: Robert Richter <robert.richter@...>, Andi Kleen <andi@...>, <gregkh@...>, <akpm@...>, <linux-kernel@...>, <perfmon2-devel@...>, <perfmon@...>
Date: Tuesday, November 13, 2007 - 6:29 pm

There is no way we'll keep this completely idiotic userland API. If people start
to use out of tree APIs they can pretty much expect that they're not going
to stay around. And in this case they most certainly won't.

-

To: Greg KH <greg@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 9:42 am

Greg,

The perfmon sysfs documentation has been updated following your advice.
You can check out the following commit in my perfmon tree:

e83278f879e52ecee025effe9ad509fd51e4a516

Thanks.

--

-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 1:47 pm

Thanks, that looks a lot better.

Do you want me to send you patches based on this tree to help clean up
the sysfs usage now that it's documented?

Also, a lot of your per-cpu sysfs files should probably move to debugfs
as they are for debugging only, right? No need to clutter up sysfs with
them when only the very few perfmon developers would be needing access
to them.

thanks,

greg k-h
-

To: Greg KH <greg@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 1:57 pm

Greg,

Yes, send me the patches. But from what you were saying earlier, it seems
I would need an extra sysfs patch to make this compile. Is that particular

Yes, this is mostly debugging. If debugfs is meant for this, then I'll
be happy to move this stuff over there. Is there some good example of how
I could do that based on my current sysfs code?

Thanks.

--
-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 3:53 pm

No, it's in my tree, and will be in the next -mm. You will need a few

There is documentation for debugfs in the kernel api document :)

And, there are many in-kernel users of debugfs, a grep for
"debugfs_create_" should show you some examples of how to use this. If
you have any questions, please let me know.

thanks,

greg k-h
-

To: Greg KH <greg@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>, <perfmon2-devel@...>
Date: Thursday, November 8, 2007 - 11:27 am

Greg,

I have now removed all the perfmon2 statistics from sysfs and moved them
to debugfs. I must admit, I like it better this way. Debugfs is also so
much easier to program.

Patch has been pushed into my tree. Let me know if you think I can improve
the sysfs code some more.

Thanks.

--

-Stephane
-

To: Greg KH <greg@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 4:39 pm

Greg,

Could you send them to me? If they are not too intrusive, I could add them
to my tree. Yet I don't want something too distant from Linus's tree which

Ok, I'll look at that next.

Thanks,

--

-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 1:08 pm

Where is this git tree located? On git.kernel.org somewhere?

thanks,

greg k-h
-

To: Greg KH <greg@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 1:50 pm

To: Greg KH <greg@...>
Cc: <eranian@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 1:33 pm

I get mine from git+ssh://master.kernel.org/pub/scm/linux/kernel/git/eranian/linux-2.6.git
-

To: Andrew Morton <akpm@...>
Cc: <eranian@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 1:41 pm

Thanks, that worked, let me go read the new documentation...
-

To: Greg KH <greg@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 6:34 am

Greg,

I will move the description from perfmon2.txt to its own file in
ABI/testing.

--
-Stephane
-

To: Stephane Eranian <eranian@...>
Cc: Andrew Morton <akpm@...>, <perfmon@...>, <linux-kernel@...>
Date: Wednesday, November 7, 2007 - 1:07 pm

That is what I was referring to, that file does not describe all of the

That would be great to have, thanks.

greg k-h
-
