Re: [RFC] simple dprobe like markers for the kernel

From: Frank Ch. Eigler
Date: Wednesday, July 9, 2008 - 7:29 pm

Another disadvantage is one that came up earlier when markers were
initially thought up: something so invisible to the compiler (no
code being generated in the instruction stream) may, after
optimization, be impossible to locate: not just the statement but
also the putative parameters.

Long ago, someone proposed inserting asm("nop") mini-markers into
the instruction stream, which could then be used as an anchor to tie a
kprobe to; that would solve the statement-location problem.

But it doesn't help assure that the parameters will be available in
dwarf, so someone else proposed adding another asm that just asks the
parameters to be evaluated and placed *somewhere*.  Each asm input
constraint was to be the loosest possible, so as to not force the
compiler to put the values into registers (and evict their normal
tracing-ignorant tenants).
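
A minimal sketch of the two asm ideas above, assuming GCC or Clang
extended inline asm.  The macro MINI_MARK2 and the function sum_range
are hypothetical names used only for illustration; this is not code
from any posted patch.

```c
/* Hypothetical mini-marker (illustration only, not from a posted patch).
 * The first asm emits a single nop that a kprobe could be anchored to.
 * The second asm has an empty template; its "g" inputs (any register,
 * memory or immediate operand) ask the compiler to keep the two values
 * available somewhere without forcing them into registers. */
#define MINI_MARK2(a, b)                                  \
    do {                                                  \
        __asm__ __volatile__("nop");                      \
        __asm__ __volatile__("" : : "g"(a), "g"(b));      \
    } while (0)

/* Example function carrying one marker; the marker does not change
 * the computed result. */
long sum_range(long lo, long hi)
{
    long total = 0;
    long i;

    for (i = lo; i < hi; i++)
        total += i;

    MINI_MARK2(lo, total);
    return total;
}
```

Note that both operands are evaluated every time the marker is
reached, whether or not anything is listening.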

I believe this combination was never actually built/tested, partly
because people realized that then the compiler would always have to
evaluate parameters unconditionally, whether or not a kprobe is
present.  (To do it otherwise would IIRC require the asm code to
include control-flow-modification instructions, which would surprise
gcc.)

So that's roughly how we arrived at recent markers.  They expose to
the compiler the parameters, but arrange not to evaluate them unless
necessary.  The most recent markers code patches nops over most or all
the hot path instructions, so there is no tangible performance impact.
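
As a rough illustration of "expose the parameters but do not evaluate
them unless necessary", here is a sketch that assumes a plain
per-marker flag tested at run time.  All names (DEMO_MARK,
demo_marker_enabled, demo_probe, expensive_argument, run_demo) are
hypothetical, and the code patching mentioned above is not modelled.

```c
#include <stdbool.h>

/* Hypothetical per-marker enable flag (illustration only). */
static bool demo_marker_enabled;

/* Counts how often the marker argument was actually evaluated. */
static int expensive_calls;
static long last_seen;

static long expensive_argument(long x)
{
    expensive_calls++;
    return x * 2;
}

static void demo_probe(long value)
{
    last_seen = value;
}

/* The argument expression sits inside the branch, so it is only
 * evaluated when the marker is enabled.  __builtin_expect (GCC/Clang)
 * hints that the disabled case is the common one. */
#define DEMO_MARK(expr)                                   \
    do {                                                  \
        if (__builtin_expect(demo_marker_enabled, 0))     \
            demo_probe(expr);                             \
    } while (0)

/* Returns how many times the argument was evaluated for one pass. */
int run_demo(bool enabled)
{
    demo_marker_enabled = enabled;
    expensive_calls = 0;
    DEMO_MARK(expensive_argument(21));
    return expensive_calls;
}
```

With the flag clear, run_demo() reports zero evaluations; with it set,
one.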


- FChE
--

From: James Bottomley
Date: Thursday, July 10, 2008 - 6:49 am

Actually, I listed that one as an advantage.  But, in order to be
completely zero impact, the probe cannot interfere with optimisation,
and so you run the risk of having the probe point do strange things
(like it's in the middle of a loop that gets unrolled) or that the
variables you want to advertise get optimised away.

All of this is mitigated by correct selection of the probe points and


Actually, it does.  Assuming the probe is placed in the code by someone
who knows what they're doing and is using it, you can ensure that what
you're advertising actually exists.  If you look at the SCSI example I
gave, both the probe points and the variables actually exist, and will

Yes there are.  There are actually two performance impacts:

     1. The nops themselves take cycles to execute ... small, granted,
        but it adds up with lots of probe points
     2. The probes interfere with optimisation since to replace them
        with a function call, they must be barriers.

I didn't say use simple probes to replace markers ... I just said it's
an alternative for things like I/O subsystems that don't want the
perturbation.

James


--

From: Frank Ch. Eigler
Date: Thursday, July 10, 2008 - 7:22 am

Hi -


Well, you can test your theory: replace some "tracepoints" or markers
or printk's with this, and see if systemtap (or gdb) can get at the
same data.

When "correct selection" is a function of any particular compiler's
optimization algorithms, it will be difficult for a human programmer

That's *if* the line number ends up being resolvable back to a PC.  In
fact, since there is no code emitted for it, that particular line

You misunderstood - I am not talking about whether the variables exist
in the context of the source code.  The question is which of those
variables still exist, live & addressable, in the machine code and
execution state.  You may be surprised to what extent compiler

That's why I qualified it with "tangible".  Please confirm your
intuition about these costs.


- FChE
--

From: James Bottomley
Date: Thursday, July 10, 2008 - 7:43 am

Not necessarily.  A tracepoint by a barrier will always be pretty much
OK, as will variables that are either passed in or passed to functions
(since they have to be instantiated to pass as arguments).

Plus screw ups are easily detectable by a tool that parses the dwarf.

The essential point is that we need zero impact trace points and that
makes them difficult to place in this fashion.  However, the burden of
placing and verifying them rests with the people in the actual subsystem

Erm, no ... dwarf is designed to emit an entry for every line in the
file (whether it contains a statement or not).  The empty lines get
elided in the line number program (because you can attach them to the
first statement following) but a correct parser will recover them (by

No ... I'm used to optimisation strangeness.  Again, I'm not trying to
eliminate it because that would defeat the zero impact purpose.  I'm
trying to build a system that can be useful without any impact.  The
consequence is going to be that certain trace points can't be used
because of the optimiser, but that's the tradeoff.  As long as the
people placing the trace points are subject matter experts in the

1 is pretty obvious ... the nops have a defined cycle time in every
instruction architecture.  The optimisation costs are very difficult to
quantify since they vary so much from compiler to compiler and function
to function.

James


--

From: Theodore Tso
Date: Thursday, July 10, 2008 - 8:30 am

So as I understand things, your light-weight tracepoints are designed
for very performance-sensitive code paths where we don't want to
disturb the optimization in the deactivated state.  In
non-performance sensitive parts of the kernel, where cycle counting is
not so important, tracepoints can and probably should still be used.
So I don't think you were proposing eliminating the current kernel
markers in favor of this approach, yes?

When you said a tool could determine if the tracepoint had gotten
optimized away, or the variables were no longer present, I assume you
meant at compile time, right?  So with the right tool built into the
kbuild infrastructure, if we could simply print warnings when
tracepoints had gotten optimized away, that would make your simple
tracepoints quite safe for general use, I would think.

	    	       	   	   	  	- Ted

P.S.  When you said that the current kernel markers are "a bit
heavyweight", how bad are they in practice?  Hundreds of cycles?  More?
--

From: James Bottomley
Date: Thursday, July 10, 2008 - 8:57 am

That's right ... I started from the position that the current markers
were too heavy for an I/O subsystem, but I'm sure they have many other

Yes and no.  Yes because a tool will be able to detect the problems, but
no if you're thinking an actual kernel compile would do it (unless some
tool is designed for this and integrated into the build ... the obvious

Yes ... but someone has to come up with the tool.  I suppose rebuilding
the line number matrix and finding the variables at the location is easy
mechanical dwarf stuff ... but it will give the kernel build a lot of
external dependencies it didn't have before.

Plus, this level of checking can only be done if dwarf is generated
(i.e. CONFIG_DEBUG_INFO is y).

James


--

From: Frank Ch. Eigler
Date: Thursday, July 10, 2008 - 11:18 am

Hi -


It will be interesting to see how frequently such a warning appears
for a good suite of such mini markers, on a diversity of architectures

Good question.  The only performance measurements I have seen posted
indicate negligible effects.


- FChE
--

From: James Bottomley
Date: Saturday, July 12, 2008 - 11:22 am

This is just an incremental update based on feedback.  The most
significant was that making the marker a compiler barrier will free the
inserter from worrying about the mark sliding around changes to named
variables (and thus having to worry about this in placement) at
practically zero optimisation cost.  I also updated the code to drop an
asm section instead of using the static variable scheme.  I also added
documentation and made the module loader ignore them (since modules
don't go through the vmlinux.lds transformations).

I also added a simple versioning scheme (basically tack the version on
to the end of the section name).  It can be used simply and even
provides backwards compatibility (just emit the old and the new
sections).
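
To make the shape of this concrete, here is a minimal user-space
sketch, assuming GCC or Clang on an ELF target with a GNU-style linker
that synthesizes __start_<name>/__stop_<name> symbols for sections
named like C identifiers.  It uses the section attribute rather than
an asm directive, and the section name simple_marker_v1, the record
layout, the SIMPLE_MARK macro and the two example functions are all
hypothetical; this is not the section format of the patch.

```c
#include <stddef.h>

/* Hypothetical record layout (illustration only). */
struct simple_marker {
    const char *file;   /* source file containing the marker */
    int line;           /* source line of the marker */
    const char *vars;   /* description of the variables of interest */
};

/* Drops one pointer-sized entry into a section whose name carries a
 * version suffix, and acts as a compiler barrier.  No instructions
 * are added to the marked routine. */
#define SIMPLE_MARK(vars_)                                            \
    do {                                                              \
        static const struct simple_marker rec_ =                      \
            { __FILE__, __LINE__, vars_ };                            \
        static const struct simple_marker *ptr_                       \
            __attribute__((used, section("simple_marker_v1"))) =      \
            &rec_;                                                    \
        __asm__ __volatile__("" : : : "memory");                      \
    } while (0)

/* Section bounds provided by the linker (assumption: GNU ld, gold or
 * lld on ELF). */
extern const struct simple_marker *__start_simple_marker_v1[];
extern const struct simple_marker *__stop_simple_marker_v1[];

int read_block(int lba)
{
    int status = lba & 1;

    SIMPLE_MARK("lba status");
    return status;
}

int write_block(int lba)
{
    int status = 0;

    SIMPLE_MARK("lba");
    return status + lba;
}

/* Number of version-1 markers linked into this program. */
size_t count_markers(void)
{
    return (size_t)(__stop_simple_marker_v1 - __start_simple_marker_v1);
}
```

A reader that understands a newer format would look for a section
with a different suffix and could fall back to the old one if both are
emitted.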

If everyone's happy with this, I'll follow it up with the systemtap
changes to make use of them ... they've been incredibly helpful
debugging some of the CDROM problems for me so far.

James

From: James Bottomley <James.Bottomley@HansenPartnership.com>
Date: Wed, 9 Jul 2008 16:18:16 -0500
Subject: [PATCH] add simple marker trace point infrastructure

This patch adds incredibly simple markers which are designed to be used
via kprobes.  All it does is add an extra section to the kernel (and
modules) which annotates the location in source file/line of the marker
and a description of the variables of interest.  Tools like systemtap
can then use the kernel dwarf2 debugging information to transform this
to a precise probe point that gives access to the named variables.

The beauty of this scheme is that it has zero cost in the unactivated
case (the extra section is discardable if you're not interested in the
information, and nothing is actually added into the routine being
marked).  The disadvantage is that it's really unusable for rolling your
own marker probes because it relies on the dwarf2 information to locate
the probe point for kprobes and unravel the local variables of interest,
so you need an external tool like systemtap to help ...

--

From: James Bottomley
Date: Saturday, July 12, 2008 - 1:04 pm

This is the systemtap piece that allows you to use simple markers as
probe points for people who want to play around with the functionality.

James

From: James Bottomley <James.Bottomley@HansenPartnership.com>
Date: Fri, 11 Jul 2008 09:32:34 -0500
Subject: Add simple_marker statement

Now that the kernel drops simple markers in a __simple_marker section,
update systemtap to parse for them by introducing an extra

<module>.simple_mark(<marker str>)

statement.  It would be nice to reuse the existing mark() directive,
but unfortunately the parser can't cope with semantics-dependent
parsing (it won't allow the registration of two identical patterns),
so the easiest way to get this to work is to introduce an additional
statement type.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 tapsets.cxx |  124 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 121 insertions(+), 3 deletions(-)

diff --git a/tapsets.cxx b/tapsets.cxx
index adfe10e..ce59102 100644
--- a/tapsets.cxx
+++ b/tapsets.cxx
@@ -458,6 +458,7 @@ static string TOK_MAXACTIVE("maxactive");
 static string TOK_STATEMENT("statement");
 static string TOK_ABSOLUTE("absolute");
 static string TOK_PROCESS("process");
+static string TOK_SIMPLE_MARK("simple_mark");
 
 // Can we handle this query with just symbol-table info?
 enum dbinfo_reqt
@@ -571,7 +572,15 @@ module_cache
 };
 typedef struct module_cache module_cache_t;
 
+struct marker_map_data {
+  string file;
+  int line;
+
+  marker_map_data(void) : line(-1) { };
+};
+
 #ifdef HAVE_TR1_UNORDERED_MAP
+typedef tr1::unordered_map<string,struct marker_map_data> marker_map_t;
 typedef tr1::unordered_map<string,Dwarf_Die> cu_function_cache_t;
 typedef tr1::unordered_map<string,cu_function_cache_t*> mod_cu_function_cache_t; // module:cu -> function -> die
 #else
@@ -579,6 +588,7 @@ struct stringhash {
   size_t operator() (const string& s) const { hash<const char*> h; return h(s.c_str()); }
 };
 ...

--

From: Frank Ch. Eigler
Date: Saturday, July 12, 2008 - 4:06 pm

Clever.  We can include support for this as soon as the kernel-side
simple_mark widgets go upstream.

(For completeness, the code would need test cases, docs, and desirably
support for wildcarding as in probe kernel.simple_mark("*").)


- FChE
--

From: Masami Hiramatsu
Date: Monday, July 14, 2008 - 9:26 am

Hi James,


I'm very interested in your approach.

IMHO, as Aoki investigated, the overhead of markers is not so big
unless we put a lot of them into the kernel. And from the "active"
overhead point of view, a marker takes less than tens of nanoseconds,
while a kprobe takes hundreds of nanoseconds. Kprobes also have a
limitation on probeable points: they can't probe functions marked
"__kprobes". So the original markers still have advantages.

However, your approach is also useful, especially for embedding
thousands of markers in the kernel or drivers.

So I think it's better to use both of them as the situation demands.

I just have one comment on its name. Since it doesn't trace
anything, I'd prefer notation() or note_mark() to
trace_simple(). :-)


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: James Bottomley
Date: Monday, July 14, 2008 - 3:02 pm

That's the case which I started from.  The point is that if passive
markers have a cost, we have to be very careful about placing them to

Yes ... the zero impact markers are completely dependent on the kprobes
overhead for activation ... on the other hand, one of the vendor
complaints is cost of activation of kprobes, so it's nicely tied into

Certainly ... as I said to Ted, I'm not planning to replace the current

Well ... the current markers code uses trace_mark as its base ... I was
just trying to fit into that scheme.

Also, don't rely on anything in this code yet ... that's why it's an
RFC; I'm still playing around with the section formats and the
information.  After more discussions with people, I'm actually coming to
the conclusion that dropping the address of the simple marker might be
very useful (in place of file and line).  It makes the marker section
need relocation, but it would also mean they could be used simply from
within the kernel as well.

James


--
