Hi All, The following set of patches a)Implant marker probes in futex.c and b)define a sample use-case for the markers in futex. The marker handler patches also make use of the enhancements proposed to the 'trace' infrastructure i.e. debugfs_printk() interface. http://lkml.org/lkml/2008/4/15/117 The patches are generated over 2.6.25-rc8-mm1 and have been tested on an i386 machine. Thanks, K.Prasad --
We place a probe at the function entry for each futex operation,
and also at each point where a futex operation fails.
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
---
kernel/futex.c | 110 +++++++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 88 insertions(+), 22 deletions(-)
Index: linux-2.6.25-rc8-mm2/kernel/futex.c
===================================================================
--- linux-2.6.25-rc8-mm2.orig/kernel/futex.c
+++ linux-2.6.25-rc8-mm2/kernel/futex.c
@@ -737,8 +737,14 @@ static int futex_wake(u32 __user *uaddr,
union futex_key key;
int ret;
- if (!bitset)
- return -EINVAL;
+ trace_mark(futex_wake_called, "uaddr:%p fshared:%p nr_wake:%d "
+ "bitset:%d",
+ uaddr, fshared, nr_wake, bitset);
+
+ if (unlikely(!bitset)) {
+ ret = -EINVAL;
+ goto out;
+ }
futex_lock_mm(fshared);
@@ -770,6 +776,8 @@ static int futex_wake(u32 __user *uaddr,
spin_unlock(&hb->lock);
out:
futex_unlock_mm(fshared);
+ if (unlikely(ret != 0))
+ trace_mark(futex_wake_failed, "uaddr:%p ret:%d", uaddr, ret);
return ret;
}
@@ -901,6 +909,14 @@ static int futex_requeue(u32 __user *uad
struct futex_q *this, *next;
int ret, drop_count = 0;
+ /*
+ * Call this probe futex_cmp_requeue_called, although this function is
+ * common to both FUTEX_REQUEUE and FUTEX_CMP_REQUEUE. This is because
+ * FUTEX_REQUEUE is deprecated.
+ */
+ trace_mark(futex_cmp_requeue_called, "uaddr1:%p fshared:%p uaddr2:%p "
+ "nr_wake:%d nr_requeue:%d cmpval:%p",
+ uaddr1, fshared, uaddr2, nr_wake, nr_requeue, cmpval);
retry:
futex_lock_mm(fshared);
@@ -908,8 +924,12 @@ static int futex_requeue(u32 __user *uad
if (unlikely(ret != 0))
goto out;
ret = get_futex_key(uaddr2, fshared, &key2);
- if (unlikely(ret != 0))
+ if (unlikely(ret != 0)) {
+ trace_mark(futex_cmp_requeue_uaddr2_failed, "uaddr1:%p "
+ "uaddr2:%p ret:%d",
+ uaddr1, uaddr2, ret);
goto out;
+ }
hb1 = hash_futex(&key1);
hb2 = ...These files define and implement marker handlers for the probes in futex.c. They may be built by choosing the CONFIG_FUTEX_DEBUG option (can be built as a kernel module also). They export data to the user using debugfs_printk and the output is available in futex_debug_trace/ directory under the debugfs mount. Separate directories under this are created for each marker probe. Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com> --- include/linux/futex_markers.h | 63 +++++++++++++ init/Kconfig | 13 ++ kernel/Makefile | 1 kernel/futex_markers.c | 201 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 278 insertions(+) Index: linux-2.6.25-rc8-mm1/init/Kconfig =================================================================== --- linux-2.6.25-rc8-mm1.orig/init/Kconfig +++ linux-2.6.25-rc8-mm1/init/Kconfig @@ -646,6 +646,19 @@ config FUTEX support for "fast userspace mutexes". The resulting kernel may not run glibc-based applications correctly. +config FUTEX_DEBUG + tristate "Enable debugging for Futex - currently stats in debugfs" + depends on FUTEX + depends on DEBUG_FS && MARKERS && TRACE + default m + help + This option provides debugging data for Futex system calls + in debugfs. + + Say Y/M here if you want to enable Futex debugging + in-kernel/module. + Say N if you are unsure. + config ANON_INODES bool Index: linux-2.6.25-rc8-mm1/kernel/futex_markers.c =================================================================== --- /dev/null +++ linux-2.6.25-rc8-mm1/kernel/futex_markers.c @@ -0,0 +1,201 @@ +/* + * futex_markers.c - Marker handler for probes in futex.c + * + * Copyright IBM Corp. 2008 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This ...
This is some seriuosly ugly looking gunk, why would we want stuff like that scattered across the code? What is wrong with a few simple hooks like: trace_futex_wait(uaddr, fshares, val, abs_time, bitset); and then deal with that. Also, you seem to expose way too much futex internals; do you really need that? People will go use this marker crap like ABI and further restrain us from changing the code. /me unhappy. --
I don't really see how it differs so much from printks, which kernel If any of your variable type changes, then you are exporting an unknown data structure to user-space. _That_ would break a userspace tracer whenever you change any of these kernel variables and you don't want that. Exporting the field names and variable types helps to identify the variables by their given names rather than their respective order. Having the field type insures binary compatibility. Clearly we can turn your trace_futex_wait(uaddr, fshares, val, abs_time, bitset); into a trace_mark() with a simple define, and I don't see any problem with that. I just want to make sure the event name, field names and field types are exported, and this is done by markers. However, I wonder why none of the kernel printk() are turned into specialized defines to make the code "cleaner".. maybe it's because it is useful to Because we extract the field names and types, we can create tracer plugins that would hook on field names rather than expect a specific number of fields and fixed field types. It makes it possible to tolerate missing fields pretty easily. But yes, tracer tools might have to be adapted to internal kernel changes, since they must follow kernel structure changes. However, staying as close as possible to a canonical representation of event fields, staying far from the specific implemetation, would help to lessen the inter-dependency. On the other hand, it would probably hurt trace compactness and efficiency. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
They aren't in every -E codepath, nor they are at the start and at the end of every important function (like system call). Such printks are usually inserted during debugging when you don't care about ugliness and these patches will eventually make kernel looks like being permanently debugged one. --
Wondering how useful would printks be when debugging large systems (say analysing futex contention on a 32-way CPU system)....also to mention the inability to do a binary dump of the interested data structures. --K.Prasad --
Which never last longer than the debug session and are never exposed to trace_futex_wait()'s signature would make the compiler issue a complaint I must be missing something here, printks don't need to look pretty because they never see the light of lkml. They get ripped out as soon as See, these tracer tools are my nightmare as member of an enterprise linux team. They'll make an already hard job even harder, no thanks! At least reduce the hooks to the very bare minimum; only log when the mutex changes state; not this: we entered futex_wait; we exited with state crap. Just log: futex: <uaddr> wait futex: <uaddr> wakeup And other fundamental events, the rest is just not needed. --
i'm clearly NAK-ing all futex trace hooks until the true impact of the whole marker facility is better understood. I've tried them for the scheduler and they were a clear failure: too bloated and too persistent. but more importantly, as things stand today i've yet to see a _any_ bugreport where these 'tracer' tools that are being referred to were actually used in the field to fix something. The latency tracers (and the other tracer variants in -rt) on the other hand have a documented track record of being useful in fixing bugs and instrumenting the kernel. Ingo --
I have not seen any counter argument for the in-depth analysis of the instruction cache impact of the optimized markers I've done. Arguing that the markers are "bloated" based only on "size kernel/sched.o" You will probably be interested in the following paper, which explains various situations in which using a tracer has solved real problems at Google, IBM, Autodesk, which are Linux users running large clusters or Linux systems with soft RT constraints. Linux Kernel Debugging on Google-sized clusters at Ottawa Linux Symposium 2007 http://ltt.polymtl.ca/papers/bligh-Reprint.pdf Now for some performance impact : Here are some results I have taken comparing the optimized markers approach with the dynamic ftrace approach. These runs with some ALU work in tight loops, using clflush() to flush the cache lines pointing to "global" data (pointer read : current->pid) used in the loop. I also have the numbers for running the loop without the ALU work, but I leave them out since they only make the tables harder to read : basically, the cached impact for running the empty loop with markers or ftrace instrumentation is about 0 to 3 cycles. It's the uncached impact which clearly makes the difference between both approaches. On AMD64, adding the markers or ftrace statement actually accelerates the runs when executed with an ALU work baseline. It adds 1 to 2 cycles with executed alone in the loop without any work. Frank Ch. Eigler is preparing some macrobenchmarks. I hope he will find time to post them soon. Results in cycles per loop baseline : Cycles for ALU loop 28.10013 (will be substracted for cached runs) Cycles for clflush() and ALU loop 230.11087 (will be substracted from non-cached runs) gcc version 4.1.3 20070812 (prerelease) (Debian 4.1.2-15), -O2 ------------------------------------------------------------------------------ |x86 Pentium 4, 3.0GHz, Linux 2.6.25-rc7 | cached | uncached ...
uhm, i'm not sure what you mean - how else would you quantify bloat than to look at the size of the affected subsystem? Ingo --
Hi - For example, by measuring how much of that "bloat" is actually experienced (e.g., in the form of presence in cache or additional time taken.) when that instrumentation is turned off vs. on. That should sound familiar, as this is the sort of data that you outlined (but I don't recall whether you actually provided numbers/recipes) when describing the "dyn-ftrace" nop-padding work. - FChE --
Data cache bloat inspection : If you use the "size" output, it will take into account all the data placed in special sections. At link time, these sections are put together far from the actual cache hot kernel data. Instruction cache bloat inspection : If a code region is placed with cache cold instructions (unlikely branches), it should not increase the cache impact, since although we might use one more cache line, it won't be often loaded in cache because all the code that shares this cache line is unlikely. TLB entries bloat : If code is added in unlikely branches, the instruction size increase could increase the number of TLB entries required to keep cache hot code. However, in our case, adding 10 (hot) + 50 (cold) bytes to the scheduler code per optimized marker would require 68 markers to occupy a whole 4kB TLB entry. Statistically, we could suppose that adding less than 34 markers to the scheduler should not use any supplementary TLB entry. Adding 3 markers is therefore very unlikely to increase the TLB impact. Given we have about 1024 TLB entries, adding 1/25th of a TLB entry to the cache hot kernel instructions should not matter much, especially since it might be absorbed by alignment. And since the kernel core code is placed in "Huge TLB pages" on many architectures nowadays, I really don't think the impact of a few bytes out of 4MB is significant. I therefore think that looking only at code size is misleading when considering the cache impact of markers, since they have been designed to put the bytes as far away as possible from cache-hot memory. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
absent from your figures is dyn-ftrace :-) dyn-ftrace is a young but rather nifty tool that extends on old-style static tracepoints rather forcefully, by adding 75,000+ tracepoints to the kernel here and now, with no in-source maintainence overhead at all. (It's a nice add-on to SystemTap because it adds space for a jprobe, but i digress ;-) anyway, my objections against markers would be reduced quite significantly, if we were able to improve the immediate values optimization to not actually patch the MOV's constant portion, but to just patch both the MOV and the branch instruction, into two NOPs. in fact, we can do even better: we should turn the immediate values optimization into a simple "patch out a jump to the slowpath" optimization. I.e. just a single instruction to patch in and out. (and that makes the NMI impact all that easier to handle as well) That would pretty much meet my "the typical trace point should have the cost of a single NOP" threshold for fastpath tracing overhead, which i've been repeating for 2 years now, and which would make the scheduler markers a lot more palatable ;-) hm? Ingo --
I think we could do something there. Let's have a look at a few
marker+immediate values fast paths on x86_32 :
4631: b0 00 mov $0x0,%al
4633: 84 c0 test %al,%al
4635: 0f 85 c6 00 00 00 jne 4701 <try_to_wake_up+0xea>
7059: b0 00 mov $0x0,%al
705b: 84 c0 test %al,%al
705d: 75 63 jne 70c2 <sched_exec+0xb6>
83ac: b0 00 mov $0x0,%al
83ae: 84 c0 test %al,%al
83b0: 75 29 jne 83db <wait_task_inactive+0x69>
If we want to support NMI context and have the ability to instrument
preemptable code without too much headache, we must insure that every
modification will leave the code in a "correct" state and that we do not
grow the size of any reachable instruction. Also, we must insure gcc
did not put code between these instructions. Modifying non-relocatable
instructions would also be a pain, since we would have to deal with
instruction pointer relocation in the breakpoint code when the code
modification is being done.
Luckily, gcc almost never place any code between the mov, test and jne
instructions. But since we cannot we sure, we could dynamically check
for this code pattern after the mov instruction. If we find it, then we
play with it as if it was a single asm block, but if we don't find what
we expect, then we use standard immediate values for that. I expect the
heavily optimised version will be usable almost all the time.
This heavily optimized version could consist of a simple jump to the
address following the "jne" instruction. To activate the immediate
value, we could simply put back a mov $0x1,%al. I don't think we care
_that_ much about the active tracing performance, since we take a
supplementary function call already in that case.
We could probably force the mov into %al to make sure we search ...sounds like a promising plan to me.
would you be interested in putting back the sched-tracer markers into
sched.c, using the redo-markers patch i posted some time ago (attached
below again) - and send me a series combined with immediate values
against sched-devel/latest, so that i can have another look at their
impact? [That would be the current optimized-markers approach initially,
not the proposed super-optimized-markers approach.]
i wont worry much about the slowpath impact, you and Frank are right in
that -freorder-blocks will probably hide the slowpath nicely and it will
have other benefits as well.
Maybe a functional -freorder-blocks patch for x86 would help in judging
that impact precisely - i think Arjan made one some time ago? Last time
i tried it (a few weeks ago) it crashed early during bootup - but it
would be an interesting feature to have.
Ingo
---------------------------------->
Subject: sched: reintroduce markers
From: Ingo Molnar <mingo@elte.hu>
---
include/linux/sched.h | 36 --------
kernel/sched.c | 11 +-
kernel/trace/trace.h | 12 --
kernel/trace/trace_functions.c | 2
kernel/trace/trace_sched_switch.c | 163 +++++++++++++++++++++++++++++++++++---
kernel/trace/trace_sched_wakeup.c | 146 +++++++++++++++++++++++++++++++---
6 files changed, 299 insertions(+), 71 deletions(-)
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -2031,42 +2031,6 @@ extern int sched_mc_power_savings, sched
extern void normalize_rt_tasks(void);
-#ifdef CONFIG_CONTEXT_SWITCH_TRACER
-extern void
-ftrace_ctx_switch(struct task_struct *prev, struct task_struct *next);
-#else
-static inline void
-ftrace_ctx_switch(struct task_struct *prev, struct task_struct *next)
-{
-}
-#endif
-
-#ifdef CONFIG_SCHED_TRACER
-extern void
-ftrace_wake_up_task(struct ...you also need to make sure no cpu is executing that code ever.. I expect gcc to start using the macro-fusion capable ones more and more over time at least, and for that the compare and jmp need to be consecutive. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
By "insure that every modification will leave the code in a "correct" state", I mean that at any given time before, during or after the code modification, if an NMI comes on any CPU and try to run the modified code, it should have a valid version of the code to execute. Does it Early reasults of the work I've done last night : I can detect about 96% of the ~120 markers I've put in my instrumented kernel. Not only does the compare and jmp need to be consecutive, but the movb $0x0,%al also does. I *could* try to detect specific code inserted in between, but I really have to make sure I don't get burned by the compiler inserting a jmp there. I'll post my work shortly. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
On Wed, 16 Apr 2008 10:00:09 -0400 I understand your words. My concern is that I don't quite understand how you guarantee that you'll not be executing the code you're modifying. Just saying "it's consistent before and after" sounds nice but probably isn't I wonder if just sticking in 2 barriers around your code make gcc stop moving stuff too much -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
hm, an extra optimization barrier might have worse effects than even an extra instruction. I think if the detection and patching can be made 100% safe, we dont care about the remaining 4% of markers that gcc somehow reorders. and in parallel gcc folks might want to start helping us achieve single-instruction branch points? Currently there's no way to get flags values out of inline assembly, except via a register intermediary which adds another instruction. For flags that are unaffected by gcc's input/output constraint generation code it would make sense to allow them to be exported out of inline assembly. Ingo --
Ah, I see. I insert a breakpoint and execute a bypass rather than the code being modified. I only put back the 1st instruction byte after the I'm not sure people would like the side-effect for the rest of optimizations, but it should be tried. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
On Wed, 16 Apr 2008 10:48:14 -0400 sorry but I'm not convinced that that is safe without a real exclusion mechanism. --
I'll post the patchset soon. Please have a look at the rather lengthy comment in arch/x86/kernel/immediate.c for details. I'll be glad to answer any question that remains. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Hi,
I am trying to figure out why the following code seems to work fine on
x86_32, but the update fails to happen on x86_64. Any idea why ? From
what I see the memcpy seems to fall in a memory area which is not the
same as the one I test at the end of the function. It only fails for
module addresses, which are vmalloc'd. Once I fix this, I'll be able to
send the patchset.
(from my own alternative.c) :
void *__kprobes text_poke(void *addr, const void *opcode, size_t len)
{
unsigned long flags;
char *vaddr;
int nr_pages = 2;
struct page *pages[2];
int i;
if (*((uint8_t *)addr - 1) != BREAKPOINT_INSTRUCTION) {
BUG_ON(len > sizeof(long));
BUG_ON((((long)addr + len - 1) & ~(sizeof(long) - 1))
- ((long)addr & ~(sizeof(long) - 1)));
}
if (is_vmalloc_addr(addr)) {
pages[0] = vmalloc_to_page(addr);
pages[1] = vmalloc_to_page(addr + PAGE_SIZE);
} else {
pages[0] = virt_to_page(addr);
pages[1] = virt_to_page(addr + PAGE_SIZE);
}
BUG_ON(!pages[0]);
if (!pages[1])
nr_pages = 1;
vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
BUG_ON(!vaddr);
local_irq_save(flags);
memcpy(&vaddr[(unsigned long)addr & ~PAGE_MASK], opcode, len);
local_irq_restore(flags);
vunmap(vaddr);
sync_core();
/* Could also do a CLFLUSH here to speed up CPU recovery; but
that causes hangs on some VIA CPUs. */
for (i = 0; i < len; i++)
BUG_ON(((char *)addr)[i] != ((char *)opcode)[i]);
return addr;
}
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Ah, found it. the is_vmalloc_addr() is somewhat misleading because it does not include the vmalloc'd memory for modules, which is in a different address space. I wonder if it has other unknown side-effects. I used !core_kernel_text() instead. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Yes, but then you would have to create new code for each event you want
to trace. In the end, it would increase the icache footprint
The thing is that the trace_marks really fills two purpose : they
extract information from the core kernel, which is meant to be shipped
on production systems so tracing tools can report what is happening on
the system and they also allow kernel hackers to add markers of their
own, so they can extract information about specific events they are
interested in along with the standard kernel instrumentation.
So, part of it is meant to be standard kernel information, part of it
can be used for debugging. And since the kernel code evolves through
time, it makes sense to have an infrastructure flexible enough to follow
these changes easily.
Your proposal is interesting though. We could keep the flexible markers
as the core infrastructure to declare static instrumentation and add
static inlines in C files to map function prototypes to markers, such
as:
static inline void trace_futex_wait(void *uaddr)
{
trace_mark(futex_wait, "uaddr %p", uaddr);
}
static inline void trace_futex_wakeup(void *uaddr)
{
trace_mark(futex_wakeup, "uaddr %p", uaddr);
}
But personnally I don't really see how these static inlines will look
I totally agree with you on that. This is the approach I've used in the
LTTng instrumentation.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Looking at patch 2/2 in this series that seems to be needed anyway. So The trouble I have with these free form thingies is that its too hard to keep them consistent. Why the need to duplicate it; that's utter madness. --
Hi - That, plus the new hand-written function (trace_futex_wait) would still need to manage the packaging of the arguments for consumption by separately compiled pieces. It is desirable not to require such hand-written functions to *also* be declared in headers for these The main thing is type checking by engaging gcc's printf format checking logic. In my original markers proposal, the types were encoded into the function name, sort of as in C++: trace_mark_nnnnn(futex_wake_called, uaddr, fshares, val, abs_time, bitset); where each "n" stands for some integral value, and could be chosen amongst a small number of other types (say -- "s": char* string, "p": void*, "l":64-bit long). Then, type checking could be done by the core compiler for both event producers and consumers. One downside was that the trace_mark_* permutations themselves would have to be generated by some shell/perl script [1], and some deemed this probably unacceptable. I'm still not sure... [1] some systemtap archaeology: This second instance is optional and is used as a consistency check for the event consumer to hook up exactly to the intended producer. The string could be empty. - FChE --
* Frank Ch. Eigler (fche@redhat.com) wrote: -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Atleast until 2.6.25-rc8-mm2, I would expect the marker registration to
fail if the format strings don't match.
I find an early return in set_marker() in marker.c if the format
strings don't match.
Mathieu,
Can you let me know if you would find the patch below - which
eliminates the need to pass format string to marker_probe_register()
function, acceptable?
I have done some elementary tests with the patch, but I know it would
fail if there are duplicate markers with different format strings.
However I presume that duplicate markers are meant to appear primarily
when markers are present in inlined functions (in which case their
format strings would be the same).
Thanks,
K.Prasad
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
---
include/linux/marker.h | 4 ++--
kernel/marker.c | 40 +++++++++++++++++-----------------------
samples/markers/probe-example.c | 3 ---
3 files changed, 19 insertions(+), 28 deletions(-)
Index: linux-register_marker_patch/kernel/marker.c
===================================================================
--- linux-register_marker_patch.orig/kernel/marker.c
+++ linux-register_marker_patch/kernel/marker.c
@@ -364,7 +364,7 @@ static struct marker_entry *get_marker(c
* Add the marker to the marker hash table. Must be called with markers_mutex
* held.
*/
-static struct marker_entry *add_marker(const char *name, const char *format)
+static struct marker_entry *add_marker(const char *name)
{
struct hlist_head *head;
struct hlist_node *node;
@@ -372,9 +372,15 @@ static struct marker_entry *add_marker(c
size_t name_len = strlen(name) + 1;
size_t format_len = 0;
u32 hash = jhash(name, name_len-1, 0);
+ struct marker *iter;
+
+ /* Search for the marker with 'name' in __markers section */
+ for (iter = __start___markers;
+ ((iter < __stop___markers) && (strcmp(name, iter->name)));
+ iter++);
- if (format)
- format_len = strlen(format) + 1;
+ if (iter->format)
+ format_len = ...Sorry, I will have to NACK this one. A probe which registers on a marker has two choices : either is passes a NULL format string, which tells the marker infrastructure that th probe will dynamically parse the format string to grab the arguments, like printk. However, if the probe is expecting a particular layout of arguments, It must tell so upon registration, so the marker infrastructure can check if the marker it's trying to connect to still ave the same arguments. That will make sure specialized probes won't break between kernel version changes. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
*blink* so all this is so you don't have to put a declarion in a header file? How about we put these premanent markers in a header - Mathieu says there are <200. Surely that's not too much trouble. Then you can keep this trace_mark() (perhaps trace_printf() is a better So instead of writing normal C code and placing a declarion in a header, you've come up with a scheme that needs to duplicate a text string to check integrity. Sounds like a real good way to confuse people. --
The idea is to export as much data as possible (considering that the cost involved in doing so is less) and use them only when required. If we were to log just the futex_ops, just as you had suggested, "Just log: futex: <uaddr> wait futex: <uaddr> wakeup" it would provide only cursory information about code-flow and not enough debug information to help the user solve the issue he's trying to debug. Say for e.g. in futex_wait you may want to know if abs_time had a non-zero value but the primitive debugging information would end up being a handicap from discovering that. If you can specifically point me to information you think would be absolutely unnecessary, I can get them out of the trace_mark(). --K.Prasad --
I'm thinking everything is superflous; you're basically logging what strace already gives you; except worse by encoding local variable names and exposing kernel pointers. Also get rid of that futex_markers.h thing - its utterly useless and makes the code less readable. --
But we don't want to run strace just for this stuff. As you probably know, strace involves invasive user-space context-switching between The pointers are probably excessive, the and the names don't really matter. What does matter is providing enough information for a problem diagnosis tool & person to reconstruct what the kernel must have been thinking when it did something noteworthy. - FChE --
Then what do we do when someone comes along and changes one of those names; do we go around changing the markers and then requiring all tools to change as well? (And no this isn't far fetched; I'm thinking of changing fshared in the near future). Sounds like people will complain and generate back pressure against such changes - something we should avoid. As soon as these markers place a Sure, but then just make a strace like tracer and be done with it - no need to pollute the futex code with that. --
Hi -
At least two answers apply. The markers being put in should be chosen
with the concurrence of the subsystem maintainer, who should help
identify those key quantities that are likely to be both long-lived
and good diagnostic value. So that's the discussion we're having
right now: which values should be passed. If they're long-lived, then
they can be given a long-lived name - and it doesn't have to be a
low-level C variable name. (There's no reason why we can't also have
a slew of short-lived pure-debugging sorts of markers compiled in. A
marker naming convention like "__futex..." can be adopted for such
purposes - where nothing is promised for version n+1, just hoping to
help diagnose problems in this version.)
The other answer is that we should ensure that tools do not assume
that the set of markers is fixed. Let's never set such an
expectation. (In systemtap, we have several abstraction and
version-adaptation facilities that aim to hide such changes.)
The kernel guy can choose at least two methods to help tools
without contraining himself too much. He can change
trace_mark(futex_foo, "p1 %d p2 %d", p1, p2);
to a pair:
trace_mark(futex_foo, "p1 %d p2 %d", 0, p2); // backward compat.
trace_mark(futex_foo2, "p2 %d p3 %d", p2, p3); // new marker
or even just drop the backward compatibility one altogether.
It will need judicious choices by the kernel and willingness by the
tools to keep up. Those that don't will simply notice fewer events
coming in, but nothing important *breaking*.
The current crop of tools (lttng, systemtap) are both from friendly
groups who recognize that they have more of an expendable diagnostic
rather than operational role, and thus are willing to carry that
burden. By the time new tools will show up, we will have more
experience with the degree and type of marker changes over time, and
they won't be in a position to place new constraints on the
Indeed. This is why it's important for the subsystem ...Hi Peter, Adding to what Frank and others have said before, it would be nice to have you point out which markers or parameters exported by them are superfluous and I'm willing to have them removed from my patch. I've just sent out another patch over marker infrastructure to Mathieu which will eliminate the need to re-send the format string during initialisation and if accepted, the patch should take care of any concerns about duplication of code. Indeed any markers in futex.c that you think would be absolutely unnecessary can be removed on a case-by-case basis, but do let us know them. Thanks, K.Prasad --
I'm thinking the two use-cases are confused here. So we have a) permanent markers b) ad-hoc debug markers I'm thinking that for the first class the compilation level coupling is no problem at all. And for the second class it doesn't matter how ugly they are as long as it works on the spot. So I'm arguing these two should be separated. --
Hi - OTOH I think this is a false dichotomy. Debugging is not only done by a subsystem maintainer during the merge/rc period. When something goes wrong on a deployed machine, problem diagnosis requires data, which ideally should be gathered as non-intrusively as possible - that means no recompiling / rebooting, and ideally very little slowdown. I bet that many of those markers you might consider "ad-hoc" would in fact have some post-release diagnostic value. Now, maybe in the case of futexes, the kernel makes no sophisticated decisions that may need subsequent review. Maybe all the internals are directly calculable from the results visible at the system call level (which threads block and which return). But this is certainly not so for many other parts of the kernel. With markers, you don't have to make this an agonizing decision. You just insert markers, regardless of whether it's high-level, universal, "permanent"; or whether it's low-level, diagnostic, temporary. The same consumer tools will work for them all. - FChE --
This again tries into the argument about not making markers depend on the code structure or implementation details. I'm really wanting to avoid ever having to be obstructed by a marker. So any marker that does not represent a solid high level event (to take Mathieu's example: a context switch is a context switch, and we'll always have one) I'm not comfortable with merging that upstream. So even though these ad-hoc markers might have some diagnostic value - I'll never support merging them. If a customer might have some issue I can hand him a custom kernel with these markers added in - I see absolutely no reason to burden upstream with these. So we _do_ have to make this decision; some shite should just not be merged but be kept separate. --
Hi - It is your prerogative as a subsystem maintainer to make a guess about this. Others may make their own decisions differently, considering Perhaps the kinds of bugs in your code, coupled with the kinds of customers who experience those bugs, make tolerable this means of diagnosis (requiring a reboot of their machines into a custom debugging kernel). The customers I have dealt with (and frankly, I too) need to diagnose problems on a live running system as much as possible. They don't run every kernel du jour, so they don't need the purely hypothetical tools that are dependent on a permanent set of markers. - FChE --
Sun solved this problem for DTrace by applying their stability attributes to all probe points -- i.e. the probes that represent an ABI are marked as such and the temporary debugging probes are marked as unstable. -- Nicholas Miell <nmiell@comcast.net> --
We should really think about what we are doing before we add a marker in the kernel code. The information extracted should be both useful and expected not to change too much between versions. Ideally, implementation details should not be exported. Exporting useless information "just because we can" would indeed put pressure on maintainers. That's where I expect them to be the best persons to tell what is an implementation detail likely to change, and what is a more "conceptually stable" information. e.g. a context switch is a context switch, this does not change with the underlying implementation. I think that whenever we can add a more "generic" marker which solves many special cases, we should do so. In this case, using the system call instrumentation found in my architecture specific instrumentation patchset would comprehend futex instrumentation. By adding extraction of all system call parameters, things such as futexes should be covered. However, we would still need to instrument read() or exec() to extract the file names. Otherwise, we would have to start doing architecture-specific code which would "know" what arguments are passed to each system call. I guess we could do that if it lessens instrumentation intrusiveness, but we would have to deal with a system call tracing infrastructure tied closely to system call parameters. System call audit code seems to already do that, so I guess we could go that way. Then, I think we should turn to inner-kernel instrumentation only when the information extracted from the stable kernel ABI (e.g. system calls) is not complete enough to understand how things work. That would be the case for block I/O tracing for instance. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Agreed - so this futex instrumentation will not go anywhere. Prasad could perhaps help out with your arch specific syscall tracer. --
My 2c: if we even begin to countenance merging stuff like that, it's time for all the trace code to be removed. <checks the email headers> Nope, it wasn't April 1st. Guys, get a grip. --
