Hi Ingo,
Here is the update to the jump patching optimization taking care of Peter's
comments about register liveliness and instruction re-use by gcc optimizations.
A good thing : it actually simplifies the code. Unfortunately, it adds 3 bytes
to the instructions in i-cache because I now have to use a 5-bytes mov
instruction so I can replace it with a 5-bytes jump. Therefore, 9 bytes are
added to rather small functions (5-bytes mov + 2-bytes test + 2 bytes
conditional branch) and 13 bytes are added to larger functions which needs a 6
bytes conditional branch at the branch site.Instead of having to execute a sequence of nop, nop and jump, we now only have
to execute the near jump, which jumps either at the address following the
conditional branch or at the target address of the conditional branch, depending
on the immediate value variable state.Thanks to Peter for the review.
Those two patches apply on top of sched-devel.git.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Just in case someone gets the wrong idea...
I still think this is the completely wrong approach.
I'd use stronger terms, but Al Viro would sue me for copyright infringement.
-hpa
--
hm, can it result in a broken kernel? If yes, how? Or are your
objections more higher level?i actually like that we end up with something rather NOP-ish, with some
mild 'ambient' impact due to the extra constraints that state value
visibility brings with it. The in-source impact of the markers is
minimal, especially with Peter Zijstra's wrappers. The scheduler markers
at least are also expected to stay pretty stable as well.the syscall markers should be done less intrusively - and i think they
can be done less intrusively.but we need to get past this current impasse, the optimizations that
Mathieu has done to markers are pretty impressive so far. The overhead
is not zero, but it gets quite close to it and the SystemTap and kprobesit would clearly fall under the Fair Use Doctrine and perhaps also under
the Doctrine of Severe Necessities, so dont be shy!Ingo
--
My objections are higher level, I believe the current code is (a)
painfully complex, and I'd rather not see it in the kernel, and (b) the
wrong thing anyway.Put a 5-byte nop in as the marker, and patch it with a call instruction,
out of line, to a collector function.-hpa
--
the counter argument was that by specific sched.o analysis, this results
in slower code. The reason is that the "function call parameter
preparation" halo around that 5-byte patch site is larger than that
single conditional branch operation to an offline place of the current
function is.i.e. the current optimized marker approach does roughly this:
[ .... fastpath head .... ]
[ immediate value instruction ] --->
[ branch instruction ] ---> these two get NOP-ed out
[ .... fastpath tail .... ]
[ ............................. ]
[ ... offline area ............ ]
[ ... parameter preparation ... ]
[ ... marker call ............. ]your proposed 5-byte call NOP approach (which btw. was what i proposed
multiple times in the past 2 years) would do this:[ .... fastpath head ...... ]
[ ... parameter preparation ... ]
[ .... 5-byte CALL .......... ] ---> NOP-ed out
[ .... fastpath tail .......... ]
[ ............................. ]in the first case we have little "marker parameter/value preparation"
cost: it all happens in the 'offline area' _by GCC_. I.e. the fastpath
is relatively undisturbed.in the latter case, all the 'parameter preparation' phase has to happen
at around the 5-byte CALL site, in the fastpath. This, in the specific,
assembly level analysis of sched.o, was shown by Matthieu to be a
pessimisation. We are better off by inserting that conditional and
letting gcc generate the call, than by forcing it in the middle of the
fastpath - even if we end up NOP-ing out the call.wrt. complexity i agree with you - if the current optimization cannot be
made correctly we have to fall back to a simpler variant, even if it's
slower.Ingo
--
No, that's not what I'm proposing at all (and would indeed be bogus.)
What I'm proposing is:
> [ .... fastpath head ...... ]
> [ .... 5-byte CALL .......... ] ---> NOP-ed out
> [ .... fastpath tail .......... ]
> [ ............................. ]The call site is created with an asm() statement as opposed to a gcc
function call; it is up to the logging function to take the state and
mangle it into whatever format it wants to; the debugging information
(e.g. DWARF) should tell it all it needs to know about how the
register/memory state maps onto the C state. This mapping can either be
done online, with a small piece of dynamic code, or offline (although
offline makes it tricky to know what memory tems to gather.)For pieces of data which would be dead at the call site, one can add "g"
annotations to the asm() statement to force it to be live without
forcing it to be present in any particular location.-hpa
--
that would be rather impractical as we'd force DEBUG_INFO builds on
anyone (it's HUGE) just to do some trivial tracing. Look at the ftrace
plugin usage model - it wants to be widely available and easy to use.Ingo
--
Otherwise you're forcing everyone to take the cost of additional cache
footprint, plus optimizer interference, just because they might want to
possibly do some trivial tracing. DEBUG_INFO is The Right Thing for
this, as it carries all the information you may want in a well-defined
format. You don't necessarily have to keep all this information around,
of course; you can distill out the information for the trace sites at
compile time and keep a tracer information file, after which you can
strip the information.There is actually yet another alternative, though, which is to build the
tracer at compile time. The tricky part of this is that it almost
requires inserting a preprocessor before the assembler, and use really
ugly asm() macros to extract the very information that the debug format
is designed explicitly to extract!Personally, I think requiring DEBUG_INFO is a helluva lot less ugly than
these branch hacks.-hpa
--
Peter, do you have something like the following code in mind ?
I think the main differences between the code snippet down here and the
markers is that markers rely on the compiler to generate the stack
setup, and have this code a little bit closer to the function than what
I propose here, where I put the stack setup code in a "farfaraway"
section. Moreover, markers are much simpler than what I show here.
And actually, markers can be deployed portably, with
architecture-specific optimizations refined later. This has to be
implemented all up front for any traced architecture. In addition,
dealing with weird types like unsigned long long can become a pain.
Also, due to fact that we are asking the compiler to put keep some
variables live in registers, I would be tempted to embed this in a block
controlled by an if() statement (conditional branch, like I use for the
markers) so we don't have to pay the penality of populating the
registers when not required if there are not live at the marker site.Here is the toy program :
#include <stdio.h>
#define SAVE_REGS \
"pushl %%eax\n\t" \
"pushl %%ebp\n\t" \
"pushl %%edi\n\t" \
"pushl %%esi\n\t" \
"pushl %%edx\n\t" \
"pushl %%ecx\n\t" \
"pushl %%ebx\n\t"#define RESTORE_REGS \
"popl %%ebx\n\t" \
"popl %%ecx\n\t" \
"popl %%edx\n\t" \
"popl %%esi\n\t" \
"popl %%edi\n\t" \
"popl %%ebp\n\t" \
"popl %%eax\n\t"#define asmlinkage __attribute__((regparm(0)))
#define _ASM_PTR ".long "struct marker_info {
const char *name;
const char *fields;
unsigned char nr_args;
unsigned char type_sizes[];
};#define trace_mark3(name, fields, arg1, arg2, arg3) \
do { \
static struct marker_info info = { \
#name, \
fields, \
3, { sizeof(arg1), sizeof(arg2), sizeof(arg3) } }; \
asm (".section __marker_info, \"a\",@progbits\n\t" \
_ASM_PTR "%c0, 1f, 2f\n\t" \
".previous\n\t" \
".section __farfarawaycode,\"ax\",@progbits\n\t"\
"1:\n\t" \
SAVE_REGS...
Basically, although I was suggesting using a per-site dynamic piece of
We're requesting to keep them *alive*, but not necessarily in registers
(that would be an "r" constraint.)-hpa
--
Interesting. Actually, I use the "g" constraint in the code I showed
you, so it might be more acceptable to put in the fast path without
requiring registers to be populated. The only cases that would generate
additional code would probably be arguments like :/* multiple pointer dereference */
trace_mark(evname, "argname", ptr->stuff[index]->...);/*
* having to do some work to prepare the variable (calling a macro or
* inline function which does more than just a pointer deref.
*/
trace_mark(evname, "argname", get_real_valueof(variable));/* constants */
trace_mark(evname, "argname", 10000);Those cases won't add code to the critial path with my current markers,
but it would with the inline assembly "g" constraint approach. Looking
at the "mm" instrumentation, where page_to_pfn, swp_offset and
get_swap_info_struct are used makes me think it would not be such a rare
case.I would also like to point out that maintaining a _separated_ piece of
code for each instrumentation site which would heavily depend on the
inner kernel data structures seems like a maintenance nightmare. This is
why I am trying to get the instrumented site to export the meaningful
data, self-described, in a standardized way. We can then simply hook on
all the instrumented sites to either perform some in-kernel analysis on
the data (ftrace) or to export that to a userspace trace analyzer
(LTTV analyzing LTTng traces).I would be happy with a solution that doesn't depend on this gigantic
DWARF information and can be included in the kernel build process. See,
I think tracing is, primarily, a facility that the kernel should provide
to users so they can tune and find problems in their own applications.
From this POV, it would make sense to consider tracing as part of the
kernel code itself, not as a separated, kernel debugging oriented piece
of code. If you require per-site dynamic pieces of code, you are only
adding to the complexity of such a tracer. Actually, an active tracer
would trash the...
That's funny, given that's exactly what you have now.
DWARF information is the way you get this stuff out of the compiler.
That is what it's *there for*. If you don't want to keep it around you
can distill out the information you need and then remove it. However,
as I have said about six times now:THE RIGHT WAY TO DO THIS IS WITH COMPILER SUPPORT.
All these problems is because you're trying to do something behind the
This is somewhat nice in theory; I've found that gcc tends to interlace
them pretty heavily and so the cache interference even of gcc "out of
line" code is sizable. Furthermore, modern CPUs often speculatively
fetch *both* branches of a conditional.Nice theory. Doesn't work in practice as long as you rely on gcc unlikey().
-hpa
--
The per-site pieces of code are only there to do the stack setup. I
About DWARF : I agree with Ingo that we might not want to depend on this
kind of information normally expected to be correct for debug uses in a
part of infrastructure that is not limited to debugging situation.
Continous performance monitoring is one of the use cases I have in mind.Moreover, depending on DWARF info requires us to do
architecture-specific code from the beginning. The markers are designed
in such a way that any given new architecture can use the "architecture
agnostic" version of the markers, and then later implement the
optimizations. With about 27 architectures supported by the Linux
kernel, I think this approach makes sense. Looking at the number of
years it took to port something as "simple" as kprobes to 8 out of 27We totally agree on this about the jump-patching optimization. If the
jump-patching approach I proposed is too far-fetched, and if reading a
variable from memory at each tracing site is too expensive, I would
propose to use the standard "immediate values" flavor until gcc gives usUsing the compiler for the markers (I am not talking about immediate
values, which is an optimization) is what gives us the ability to do an
architecture-agnostic version. The 19 architectures which still lacksFollowing your own suggestion, why don't we fix gcc and make it
Agreed. I'd like to find some info about which microarchitectures you
Let's fix gcc ! ;)
Cheers,
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Doing this with compiler support is definitely The Right Thing, so I
Not sure about Core 2, although Core 2 definitely can track down the
Sounds great :)
-hpa
--
Hi -
This would require either that DWARF processing be done far after
kernel build, and thus the kernel cannot be self-sufficient /
introspective without user-space assistance (like firmware); or that
the DWARF data extraction (and systemtap-like $context-variable code
generation) be part of the kernel build itself. It *might* be
workable.At least one complication though is that in the case of markers,
tracing parameter evaluation is itself conditional (and placed out of
the hot path due to -freorder-blocks). With your suggested kind of
assembly ("g" constraints for all the expressions), those expressions
would be evaluated unconditionally, just to make them live somewhere.
That unconditional evaluation can easily entail memory reads and
dependent arithmetic, which could swamp the savings of eliminating the
marker-style conditional branch.- FChE
--
DWARF processing done after the kernel build isn't really a problem,
although it requires a large vmlinux(.gz) file to be kept around.As part of the kernel build is probably better, at least in the case of
Well, it depends a bit on what kind of expressions you put in there.
You don't really want to put *expressions* in there as much as you want
to put *data* references in there, although, of course, if your have
something like "foo->bar[baz]->quux" then it's easy to trip upon.One advantage of using the debugging information in this manner is that
you can write the actual markers long after the kernel has compiled, as
long as the data happens to be available at the call site.Once again, the *proper* way to do this is with compiler support. That
way we can do patched branches and out-of-line everything with impunity.-hpa
--
and that's exactly what was tripped upon in sched.o and analyzed.
Furthermore, the suggestion of doing this exclusively within the DWARF2
space - besides the not particularly minor complication of it not being
implemented yet - is:- quite substantially complex on its own
- would make Linux instrumentation dependent on all sorts of DWARF2
details which we had our 'fun' with before. (I proffer that that's
more fragile than any code patching can ever be.)- if done self-sufficiently (i.e. if a kernel image can be used to
trace things, which i believe any usable kernel tracer must offer),
it would, with the current debug info format, enlargen the kernel RAM
image with quite a substantial amount of unswappable kernel memory.But i would not mind such a scheme at all (it is in essence SystemTap
integrated into the core kernel) - in fact if your scheme gets
implemented then the current marker facilities could be ported to that
scheme transparently.So i dont see how it can be a loss to stick with the current markers for
the time being. If the super-optimized runtime patching in this thread
is deemed too fragile we simply wont do that and live with the (small)
runtime overhead that current markers have.This is a sane plan IMO, basically all the instrumentation folks agree
with it (SystemTap, LTTNG, kprobes, ftrace), the scheduler folks agree
with it as well and we'd like to move on.Ingo
--
I am not sure self-sufficiency is good goal here.
If tracing becomes part of kernel-user ABI, we are in big trouble...
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
thanks Mathieu, i've queued them up for more testing. Your previous
queue already looked good here so i pushed it out into sched-devel.git
as you probably noticed.Sidenote, without trying to bikeshed paint this issue too much: are we
absolutely sure that (on modern CPU architectures) such a short jump is
better than just 2-3 NOPs in a sequence? It's a minor (sub-cycle) detail
in any case.Ingo
--
Let's see :
Paraphrasing the
"Intel® 64 and IA-32 Architectures Optimization Reference Manual" :
http://www.intel.com/design/processor/manuals/248966.pdf3.5.1.8 Using NOPs
The 1-byte nop, xchg %eax,%eax, is treated specially by the cpu. It
takes only one µop and does not have register dependencies. However,
what we would need here is likely a 5-bytes nop.The generic 5-bytes nop found in asm-x86/nops.h is :
#define GENERIC_NOP1 ".byte 0x90\n"
#define GENERIC_NOP4 ".byte 0x8d,0x74,0x26,0x00\n"
#define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4In Intel's guide, they propose a single instruction :
NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)(0F 1F 44 00 00H)
Which would be required for correct code patching, since we cannot
safely turn a 1+4 bytes sequence of instruction into a single 5-bytes
call without suffering from possible preempted threads returning in the
middle of the instruction. Since the 5-bytes nop does not seem to be
available on architectures below P6, I think this would be a backward
compatibility problem.K8 NOP 5 also uses a combination of K8_NOP3 K8_NOP2, so this is not only
backward compatibility, but also a concern for current AMD
compatibility. (same issue for K7 5-bytes nop).However, if we forget about this issue and take for granted that we can
use a single nop instruction that would only take a single µop and have
no external effect (that would be the ideal P6 and + world), the
closest comparison with the jmp instruction I found is the jcc
(conditional branch) instruction in the same manual, Appendix C, where
they list jcc as having 0 latency when not taken and to have a bandwidth
of 0.5, compared to a latency of 1 and bandwidth of 0.33 to 0.5 for nop.So, just there, a single jmp instruction would seem to have a lower
latency than the nop. (0 cycle vs 1 cycle)Another interested reading in this document is about conditional
branches :3.4.1 Optimizing the Front End
Some interesting guide lines :
- Eliminate bran...
| H. Peter Anvin | Re: [RFC 00/15] x86_64: Optimize percpu accesses |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Eric W. Biederman | Remaining straight forward kthread API conversions... |
| Greg Kroah-Hartman | [PATCH 001/196] Chinese: Add the known_regression URI to the HOWTO |
| David Miller | [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Frans Pop | svc: failed to register lockdv1 RPC service (errno 97). |
git: | |
