login
Header Space

 
 

Re: [patch 0/2] Immediate Values - jump patching update

Previous thread: Build failure (with insane config) by Roland Dreier on Sunday, April 27, 2008 - 11:34 pm. (3 messages)

Next thread: [patch 2/2] Markers - use imv_cond jump liveliness by Mathieu Desnoyers on Sunday, April 27, 2008 - 11:34 pm. (1 message)
To: <akpm@...>, Ingo Molnar <mingo@...>, <linux-kernel@...>, H. Peter Anvin <hpa@...>
Date: Sunday, April 27, 2008 - 11:34 pm

Hi Ingo,

Here is the update to the jump patching optimization taking care of Peter's
comments about register liveliness and instruction re-use by gcc optimizations.
A good thing : it actually simplifies the code. Unfortunately, it adds 3 bytes
to the instructions in i-cache because I now have to use a 5-bytes mov
instruction so I can replace it with a 5-bytes jump. Therefore, 9 bytes are
added to rather small functions (5-bytes mov + 2-bytes test + 2 bytes
conditional branch) and 13 bytes are added to larger functions which needs a 6
bytes conditional branch at the branch site.

Instead of having to execute a sequence of nop, nop and jump, we now only have
to execute the near jump, which jumps either at the address following the
conditional branch or at the target address of the conditional branch, depending
on the immediate value variable state.

Thanks to Peter for the review.

Those two patches apply on top of sched-devel.git.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 1:21 pm

Just in case someone gets the wrong idea...

I still think this is the completely wrong approach.

I'd use stronger terms, but Al Viro would sue me for copyright infringement.

	-hpa
--
To: H. Peter Anvin <hpa@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 4:25 pm

hm, can it result in a broken kernel? If yes, how? Or are your 
objections more higher level?

i actually like that we end up with something rather NOP-ish, with some 
mild 'ambient' impact due to the extra constraints that state value 
visibility brings with it. The in-source impact of the markers is 
minimal, especially with Peter Zijstra's wrappers. The scheduler markers 
at least are also expected to stay pretty stable as well.

the syscall markers should be done less intrusively - and i think they 
can be done less intrusively.

but we need to get past this current impasse, the optimizations that 
Mathieu has done to markers are pretty impressive so far. The overhead 
is not zero, but it gets quite close to it and the SystemTap and kprobes 

it would clearly fall under the Fair Use Doctrine and perhaps also under 
the Doctrine of Severe Necessities, so dont be shy!

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 5:03 pm

My objections are higher level, I believe the current code is (a) 
painfully complex, and I'd rather not see it in the kernel, and (b) the 
wrong thing anyway.

Put a 5-byte nop in as the marker, and patch it with a call instruction, 
out of line, to a collector function.

	-hpa
--
To: H. Peter Anvin <hpa@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 6:11 pm

the counter argument was that by specific sched.o analysis, this results 
in slower code. The reason is that the "function call parameter 
preparation" halo around that 5-byte patch site is larger than that 
single conditional branch operation to an offline place of the current 
function is.

i.e. the current optimized marker approach does roughly this:

  [ .... fastpath head ....       ]
  [ immediate value instruction   ]  ---&gt;
  [ branch instruction            ]  ---&gt; these two get NOP-ed out
  [ .... fastpath tail ....       ]
  [ ............................. ]
  [ ... offline area ............ ]
  [ ... parameter preparation ... ]
  [ ... marker call ............. ]

your proposed 5-byte call NOP approach (which btw. was what i proposed 
multiple times in the past 2 years) would do this:

  [ .... fastpath head ......     ]
  [ ... parameter preparation ... ]
  [ ....   5-byte CALL .......... ]  ---&gt; NOP-ed out
  [ .... fastpath tail .......... ]
  [ ............................. ]

in the first case we have little "marker parameter/value preparation" 
cost: it all happens in the 'offline area' _by GCC_. I.e. the fastpath 
is relatively undisturbed.

in the latter case, all the 'parameter preparation' phase has to happen 
at around the 5-byte CALL site, in the fastpath. This, in the specific, 
assembly level analysis of sched.o, was shown by Matthieu to be a 
pessimisation. We are better off by inserting that conditional and 
letting gcc generate the call, than by forcing it in the middle of the 
fastpath - even if we end up NOP-ing out the call.

wrt. complexity i agree with you - if the current optimization cannot be 
made correctly we have to fall back to a simpler variant, even if it's 
slower.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 6:25 pm

No, that's not what I'm proposing at all (and would indeed be bogus.)

What I'm proposing is:

 &gt;   [ .... fastpath head ......     ]
 &gt;   [ ....   5-byte CALL .......... ]  ---&gt; NOP-ed out
 &gt;   [ .... fastpath tail .......... ]
 &gt;   [ ............................. ]

The call site is created with an asm() statement as opposed to a gcc 
function call; it is up to the logging function to take the state and 
mangle it into whatever format it wants to; the debugging information 
(e.g. DWARF) should tell it all it needs to know about how the 
register/memory state maps onto the C state.  This mapping can either be 
done online, with a small piece of dynamic code, or offline (although 
offline makes it tricky to know what memory tems to gather.)

For pieces of data which would be dead at the call site, one can add "g" 
annotations to the asm() statement to force it to be live without 
forcing it to be present in any particular location.

	-hpa
--
To: H. Peter Anvin <hpa@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>, Frank Ch. Eigler <fche@...>
Date: Monday, April 28, 2008 - 6:44 pm

that would be rather impractical as we'd force DEBUG_INFO builds on 
anyone (it's HUGE) just to do some trivial tracing. Look at the ftrace 
plugin usage model - it wants to be widely available and easy to use.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>, Frank Ch. Eigler <fche@...>
Date: Monday, April 28, 2008 - 7:06 pm

Otherwise you're forcing everyone to take the cost of additional cache 
footprint, plus optimizer interference, just because they might want to 
possibly do some trivial tracing.  DEBUG_INFO is The Right Thing for 
this, as it carries all the information you may want in a well-defined 
format.  You don't necessarily have to keep all this information around, 
of course; you can distill out the information for the trace sites at 
compile time and keep a tracer information file, after which you can 
strip the information.

There is actually yet another alternative, though, which is to build the 
tracer at compile time.  The tricky part of this is that it almost 
requires inserting a preprocessor before the assembler, and use really 
ugly asm() macros to extract the very information that the debug format 
is designed explicitly to extract!

Personally, I think requiring DEBUG_INFO is a helluva lot less ugly than 
these branch hacks.

	-hpa
--
To: H. Peter Anvin <hpa@...>
Cc: Ingo Molnar <mingo@...>, <akpm@...>, <linux-kernel@...>, Frank Ch. Eigler <fche@...>
Date: Monday, April 28, 2008 - 9:46 pm

Peter, do you have something like the following code in mind ?

I think the main differences between the code snippet down here and the
markers is that markers rely on the compiler to generate the stack
setup, and have this code a little bit closer to the function than what
I propose here, where I put the stack setup code in a "farfaraway"
section.  Moreover, markers are much simpler than what I show here.
And actually, markers can be deployed portably, with
architecture-specific optimizations refined later. This has to be
implemented all up front for any traced architecture. In addition,
dealing with weird types like unsigned long long can become a pain.
Also, due to fact that we are asking the compiler to put keep some
variables live in registers, I would be tempted to embed this in a block
controlled by an if() statement (conditional branch, like I use for the
markers) so we don't have to pay the penality of populating the
registers when not required if there are not live at the marker site.

Here is the toy program :

#include &lt;stdio.h&gt;

#define SAVE_REGS   \
	"pushl %%eax\n\t" \
	"pushl %%ebp\n\t" \
	"pushl %%edi\n\t" \
	"pushl %%esi\n\t" \
	"pushl %%edx\n\t" \
	"pushl %%ecx\n\t" \
	"pushl %%ebx\n\t"

#define RESTORE_REGS \
	"popl %%ebx\n\t"	\
	"popl %%ecx\n\t"	\
	"popl %%edx\n\t"	\
	"popl %%esi\n\t"	\
	"popl %%edi\n\t"	\
	"popl %%ebp\n\t"	\
	"popl %%eax\n\t"

#define asmlinkage __attribute__((regparm(0)))
#define _ASM_PTR ".long "

struct marker_info {
	const char *name;
	const char *fields;
	unsigned char nr_args;
	unsigned char type_sizes[];
};

#define trace_mark3(name, fields, arg1, arg2, arg3)			\
	do {								\
		static struct marker_info info = {			\
			#name,						\
			fields,						\
			3, { sizeof(arg1), sizeof(arg2), sizeof(arg3) } }; \
		asm (".section __marker_info, \"a\",@progbits\n\t"	\
			_ASM_PTR "%c0, 1f, 2f\n\t"			\
			".previous\n\t"					\
			".section __farfarawaycode,\"ax\",@progbits\n\t"\
			"1:\n\t"					\
			SAVE_REGS...
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Ingo Molnar <mingo@...>, <akpm@...>, <linux-kernel@...>, Frank Ch. Eigler <fche@...>
Date: Monday, April 28, 2008 - 10:07 pm

Basically, although I was suggesting using a per-site dynamic piece of 

We're requesting to keep them *alive*, but not necessarily in registers 
(that would be an "r" constraint.)

	-hpa
--
To: H. Peter Anvin <hpa@...>
Cc: Ingo Molnar <mingo@...>, <akpm@...>, <linux-kernel@...>, Frank Ch. Eigler <fche@...>
Date: Tuesday, April 29, 2008 - 8:18 am

Interesting. Actually, I use the "g" constraint in the code I showed
you, so it might be more acceptable to put in the fast path without
requiring registers to be populated. The only cases that would generate
additional code would probably be arguments like :

/* multiple pointer dereference */
trace_mark(evname, "argname", ptr-&gt;stuff[index]-&gt;...);

/*
 * having to do some work to prepare the variable (calling a macro or
 * inline function which does more than just a pointer deref.
 */
trace_mark(evname, "argname", get_real_valueof(variable));

/* constants */
trace_mark(evname, "argname", 10000);

Those cases won't add code to the critial path with my current markers,
but it would with the inline assembly "g" constraint approach. Looking
at the "mm" instrumentation, where page_to_pfn, swp_offset and
get_swap_info_struct are used makes me think it would not be such a rare
case.

I would also like to point out that maintaining a _separated_ piece of
code for each instrumentation site which would heavily depend on the
inner kernel data structures seems like a maintenance nightmare. This is
why I am trying to get the instrumented site to export the meaningful
data, self-described, in a standardized way. We can then simply hook on
all the instrumented sites to either perform some in-kernel analysis on
the data (ftrace) or to export that to a userspace trace analyzer
(LTTV analyzing LTTng traces).

I would be happy with a solution that doesn't depend on this gigantic
DWARF information and can be included in the kernel build process. See,
I think tracing is, primarily, a facility that the kernel should provide
to users so they can tune and find problems in their own applications.
From this POV, it would make sense to consider tracing as part of the
kernel code itself, not as a separated, kernel debugging oriented piece
of code. If you require per-site dynamic pieces of code, you are only
adding to the complexity of such a tracer. Actually, an active tracer
would trash the...
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Ingo Molnar <mingo@...>, <akpm@...>, <linux-kernel@...>, Frank Ch. Eigler <fche@...>
Date: Tuesday, April 29, 2008 - 11:35 am

That's funny, given that's exactly what you have now.

DWARF information is the way you get this stuff out of the compiler. 
That is what it's *there for*.  If you don't want to keep it around you 
can distill out the information you need and then remove it.  However, 
as I have said about six times now:

THE RIGHT WAY TO DO THIS IS WITH COMPILER SUPPORT.

All these problems is because you're trying to do something behind the 

This is somewhat nice in theory; I've found that gcc tends to interlace 
them pretty heavily and so the cache interference even of gcc "out of 
line" code is sizable.  Furthermore, modern CPUs often speculatively 
fetch *both* branches of a conditional.


Nice theory.  Doesn't work in practice as long as you rely on gcc unlikey().

	-hpa
--
To: H. Peter Anvin <hpa@...>
Cc: Ingo Molnar <mingo@...>, <akpm@...>, <linux-kernel@...>, Frank Ch. Eigler <fche@...>
Date: Sunday, May 4, 2008 - 10:54 am

The per-site pieces of code are only there to do the stack setup. I

About DWARF : I agree with Ingo that we might not want to depend on this
kind of information normally expected to be correct for debug uses in a
part of infrastructure that is not limited to debugging situation.
Continous performance monitoring is one of the use cases I have in mind.

Moreover, depending on DWARF info requires us to do
architecture-specific code from the beginning. The markers are designed
in such a way that any given new architecture can use the "architecture
agnostic" version of the markers, and then later implement the
optimizations. With about 27 architectures supported by the Linux
kernel, I think this approach makes sense. Looking at the number of
years it took to port something as "simple" as kprobes to 8 out of 27

We totally agree on this about the jump-patching optimization. If the
jump-patching approach I proposed is too far-fetched, and if reading a
variable from memory at each tracing site is too expensive, I would
propose to use the standard "immediate values" flavor until gcc gives us

Using the compiler for the markers (I am not talking about immediate
values, which is an optimization) is what gives us the ability to do an
architecture-agnostic version. The 19 architectures which still lacks

Following your own suggestion, why don't we fix gcc and make it

Agreed. I'd like to find some info about which microarchitectures you

Let's fix gcc ! ;)

Cheers,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Ingo Molnar <mingo@...>, <akpm@...>, <linux-kernel@...>, Frank Ch. Eigler <fche@...>
Date: Sunday, May 4, 2008 - 5:05 pm

Doing this with compiler support is definitely The Right Thing, so I 

Not sure about Core 2, although Core 2 definitely can track down the 

Sounds great :)

	-hpa

--
To: H. Peter Anvin <hpa@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 8:47 pm

Hi -



This would require either that DWARF processing be done far after
kernel build, and thus the kernel cannot be self-sufficient /
introspective without user-space assistance (like firmware); or that
the DWARF data extraction (and systemtap-like $context-variable code
generation) be part of the kernel build itself.  It *might* be
workable.

At least one complication though is that in the case of markers,
tracing parameter evaluation is itself conditional (and placed out of
the hot path due to -freorder-blocks).  With your suggested kind of
assembly ("g" constraints for all the expressions), those expressions
would be evaluated unconditionally, just to make them live somewhere.
That unconditional evaluation can easily entail memory reads and
dependent arithmetic, which could swamp the savings of eliminating the
marker-style conditional branch.

- FChE
--
To: Frank Ch. Eigler <fche@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>
Date: Monday, April 28, 2008 - 9:08 pm

DWARF processing done after the kernel build isn't really a problem, 
although it requires a large vmlinux(.gz) file to be kept around.

As part of the kernel build is probably better, at least in the case of 

Well, it depends a bit on what kind of expressions you put in there.
You don't really want to put *expressions* in there as much as you want 
to put *data* references in there, although, of course, if your have 
something like "foo-&gt;bar[baz]-&gt;quux" then it's easy to trip upon.

One advantage of using the debugging information in this manner is that 
you can write the actual markers long after the kernel has compiled, as 
long as the data happens to be available at the call site.

Once again, the *proper* way to do this is with compiler support.  That 
way we can do patched branches and out-of-line everything with impunity.

	-hpa

--
To: H. Peter Anvin <hpa@...>
Cc: Frank Ch. Eigler <fche@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>
Date: Tuesday, April 29, 2008 - 8:08 am

and that's exactly what was tripped upon in sched.o and analyzed.

Furthermore, the suggestion of doing this exclusively within the DWARF2 
space - besides the not particularly minor complication of it not being 
implemented yet - is:

 - quite substantially complex on its own

 - would make Linux instrumentation dependent on all sorts of DWARF2
   details which we had our 'fun' with before. (I proffer that that's
   more fragile than any code patching can ever be.)

 - if done self-sufficiently (i.e. if a kernel image can be used to
   trace things, which i believe any usable kernel tracer must offer),
   it would, with the current debug info format, enlargen the kernel RAM
   image with quite a substantial amount of unswappable kernel memory. 

But i would not mind such a scheme at all (it is in essence SystemTap 
integrated into the core kernel) - in fact if your scheme gets 
implemented then the current marker facilities could be ported to that 
scheme transparently.

So i dont see how it can be a loss to stick with the current markers for 
the time being. If the super-optimized runtime patching in this thread 
is deemed too fragile we simply wont do that and live with the (small) 
runtime overhead that current markers have.

This is a sane plan IMO, basically all the instrumentation folks agree 
with it (SystemTap, LTTNG, kprobes, ftrace), the scheduler folks agree 
with it as well and we'd like to move on.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: H. Peter Anvin <hpa@...>, Frank Ch. Eigler <fche@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, <akpm@...>, <linux-kernel@...>
Date: Wednesday, May 14, 2008 - 10:53 am

I am not sure self-sufficiency is  good goal here.

If tracing becomes part of kernel-user ABI, we are in big trouble...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: <akpm@...>, <linux-kernel@...>, H. Peter Anvin <hpa@...>
Date: Monday, April 28, 2008 - 8:48 am

thanks Mathieu, i've queued them up for more testing. Your previous 
queue already looked good here so i pushed it out into sched-devel.git 
as you probably noticed.

Sidenote, without trying to bikeshed paint this issue too much: are we 
absolutely sure that (on modern CPU architectures) such a short jump is 
better than just 2-3 NOPs in a sequence? It's a minor (sub-cycle) detail 
in any case.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: <akpm@...>, <linux-kernel@...>, H. Peter Anvin <hpa@...>
Date: Monday, April 28, 2008 - 10:35 am

Let's see :

Paraphrasing the
"Intel® 64 and IA-32 Architectures Optimization Reference Manual" :
http://www.intel.com/design/processor/manuals/248966.pdf

3.5.1.8 Using NOPs

The 1-byte nop, xchg %eax,%eax, is treated specially by the cpu. It
takes only one µop and does not have register dependencies. However,
what we would need here is likely a 5-bytes nop.

The generic 5-bytes nop found in asm-x86/nops.h is :

#define GENERIC_NOP1 ".byte 0x90\n"
#define GENERIC_NOP4 ".byte 0x8d,0x74,0x26,0x00\n"
#define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4

In Intel's guide, they propose a single instruction :
NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)

(0F 1F 44 00 00H)

Which would be required for correct code patching, since we cannot
safely turn a 1+4 bytes sequence of instruction into a single 5-bytes
call without suffering from possible preempted threads returning in the
middle of the instruction. Since the 5-bytes nop does not seem to be
available on architectures below P6, I think this would be a backward
compatibility problem.

K8 NOP 5 also uses a combination of K8_NOP3 K8_NOP2, so this is not only
backward compatibility, but also a concern for current AMD
compatibility. (same issue for K7 5-bytes nop).

However, if we forget about this issue and take for granted that we can
use a single nop instruction that would only take a single µop and have
no external effect (that would be the ideal P6 and + world), the
closest comparison with the jmp instruction I found is the jcc
(conditional branch) instruction in the same manual, Appendix C, where
they list jcc as having 0 latency when not taken and to have a bandwidth
of 0.5, compared to a latency of 1 and bandwidth of 0.33 to 0.5 for nop.

So, just there, a single jmp instruction would seem to have a lower
latency than the nop. (0 cycle vs 1 cycle)

Another interested reading in this document is about conditional
branches :

3.4.1 Optimizing the Front End

Some interesting guide lines :
- Eliminate bran...
Previous thread: Build failure (with insane config) by Roland Dreier on Sunday, April 27, 2008 - 11:34 pm. (3 messages)

Next thread: [patch 2/2] Markers - use imv_cond jump liveliness by Mathieu Desnoyers on Sunday, April 27, 2008 - 11:34 pm. (1 message)
speck-geostationary