Andi,
On Fri, Nov 16, 2007 at 05:28:13PM +0100, Andi Kleen wrote:
I dropped that quite some time ago.
That's because you said you were going to enable it system-wide by default.
Do you question why Oprofile has one ;->
But I am happy to explain.
With sampling, you want to record information about the execution of a
thread at some interval. The interval could be expressed as time or
number of occurences of an PMU event.
Typically you get a notification. Then you need to collect certain
information about the execution. Typically you record the instruction
pointer (e.g. Oprofile), but you may want to record the value of other
counters, PMU registers or other HW/SW resources. While you're doing
this monitoring is typically stopped so you get a consitent view. After
you're done recording you need to re-arm the sampling period. If you
use event-based sampling, you need to reprogram the counter(s). Then
you resume monitoring. You have to repeat this process for each sample
regardless of whether you are self-monitoring, monitoring another thread,
or monitoring a CPU.
Such sequence of operations is quite expensive, especially in the case
where you are monitoring another thread, because it incurs at least
a couple of context switches per sample in addition to the various
register manipulations and syscalls.
The idea with the kernel sampling buffer is that you amortize the
cost of notification to userland over LOTS of samples. On counter
overflow, the kernel records the samples on your behalf. There is
no context switch, samples are always recorded in the context on
the monitored thread.
Now, you need a bit more information for this to work correctly
because the kernel records on *your behalf*, thus
you need to express:
- what you want to see recorded
- the value to reload into the overflowed counter(s)
so the kernel can re-arm the next period.
Because you have multiple counters, you may use them for sampling
periods, i.e., overlap sampling measurements. That is something
done very frequently.
For instance, the q-syscollect tool that D. Mosberger wrote, is
overlapping elapsed cycles and branch trace buffer (BTB) sampling
to collect, in *one* run, a flat profile and a statistical call graph.
Depending on which counter overflowed, you may one to record
different things. For instance, the flat profile requires
just the instruction pointer. But for the BTB, the buffer
is implemented by PMU registers, thus you need to record
them (16 total). You don't want to record all register possible
in each sample: reading PMU register is costly and you
want to maximize buffer space usage.
As you can see, you need to express per counter:
- what other resources to record when it overflows
- the value to reload into the counter after overflow
In perfmon2 this information is passed by the pfm_write_pmds()
call. You can say:
PMD2.value = -5000; /* initial period */
PMD2.reset = -2000; /* repeat period */
PMD2.smpl_pmds = 0xf0; /* to record PMD4-7 on overlow */
Now, it is important to note that this is not just on Itanium
that we need this kind of flexibility. Given that you mentioned
IBS, I will use it as a non-X86 example. IBS is implemented
using PMU registers, 10 to be precise. There is no need for a
custom sampling format to support that, the default format is
sufficient.
The default sampling format does record more than the instruction
pointer. Each sample has a fixed size header including the instruction
pointer but also PID/TID/CPU. But it also has a variable size body
where the kernel stores the other registers you want to record
in each sample based on which counter overflowed. So for IBS, it
would store the 10 data registers.
PEBS is one. You would have to special this. PEBS includes
the instruction pointer + values of all registers. You'd have
to devise a scheme to allocate the PEBS buffer and then on
PEBS interrupt you'd have to copy the data into the other
buffer. Not counting on the fact that PEBS between P4 and Intel
Core 2 different and that this is an Intel X86 only feature.
I think this is better isolated into X86 specific code and
into a kernel module because it does not work on all models.
Can you support advanced monitoring like I just described above?
Quite the contrary, without the custom buffers we would have horrible
hacks to support PEBS.
I think you are confused about the terms here. The custom sampling
format is a kernel-level interface to plug-in kernel modules
which implement custom sampling formats. PEBS requires a custom
format because you do not control what is recorded. Thus what
you do is you *create* a format whose sample format *maps* the PEBS
format exactly. And that format is *different* from the one used
by the default sampling format.
No, it does not. No sampling format, no extra tricks.
Because hardware is very diverse and is changing rapidly.
Changing the kernel is difficult and it takes a very long time
for new features to reach end-users. You are not without knowing
that most users do not download their production kernels from
kernel.org. Monitoring is not just reserved for core developers
and it is also very useful on production systems to diagnose
performance problems.
--
-Stephane
-