In a series of ten patches, Mathieu Desnoyers posted an updated version of the Linux Kernel Markers. In the first patch he explains the need for markers:
"With the increasing complexity of today's user-space application and the wide deployment of SMP systems, the users need an increasing understanding of the behavior and performance of a system across multiple processes/different execution contexts/multiple CPUs. In applications such as large clusters (Google, IBM), video acquisition (Autodesk), embedded real-time systems (Wind River, Monta Vista, Sony) or sysadmin/programmer-type tasks (SystemTAP from Redhat), a tool that permits tracing of kernel-user space interaction becomes necessary."
Mathieu goes on to explain that in complex system finding bugs can be even more difficult due to the rarity of their occurance, "one can therefore only hope at having the best conditions to statistically reproduce the bug while extracting information from the system. Some bugs have been successfully found at Google using their ktrace tracer only because they could enable it on production machines and therefore recreate the same context where the bug happened." He then added, "therefore, it makes sense to offer an instrumentation set of the most relevant events occurring in the Linux that can have the smallest performance cost possible when not active while not requiring a reboot of a production system to activate. This is essentially what the markers are providing."
From: Mathieu Desnoyers [email blocked] Subject: [patch 00/10] Linux Kernel Markers for 2.6.21-mm2 Date: Wed, 09 May 2007 21:55:55 -0400 Hi Andrew, He is an updated, folded, version of the Linux Kernel Markers. It replaces the version found in 2.6.21-mm2 at the exact same spot in the series file. Main changes : - It renames the MARK() trace_mark(), as suggested by Christoph Hellwig. - It now defines the structures contained out of the marker section outside of the #ifdef __KERNEL__ so the SystemTAP tool can use the header to extract the markers from the kernel binary. - The i386 optimization has been tweaked a bit : the call is now in the else branch. It has something to do with the i386 "fast path" being the taken branch of a jump, to make loops faster. It seems that the if/else statements are also affected : the if () will be faster than the else. Therefore, I put the call in the else, to make sure the fastest path is when the marker is disabled. - I added the blktrace port to the markers infrastructure. It makes one in-tree markers user. Please remove : linux-kernel-markers-kconfig-menus.patch linux-kernel-markers-architecture-independant-code.patch linux-kernel-markers-powerpc-optimization.patch linux-kernel-markers-i386-optimization.patch markers-add-instrumentation-markers-menus-to-avr32.patch linux-kernel-markers-non-optimized-architectures.patch markers-alpha-and-avr32-supportadd-alpha-markerh-add-arm26-markerh.patch linux-kernel-markers-documentation.patch # markers-define-the-linker-macro-extra_rwdata.patch markers-use-extra_rwdata-in-architectures.patch And add, instead : linux-kernel-markers-kconfig-menus.patch linux-kernel-markers-architecture-independant-code.patch linux-kernel-markers-header-visible-from-userspace.h linux-kernel-markers-powerpc-optimization.patch linux-kernel-markers-i386-optimization.patch linux-kernel-markers-non-optimized-architectures.patch linux-kernel-markers-documentation.patch # linux-kernel-markers-define-the-linker-macro-extra_rwdata.patch linux-kernel-markers-use-extra_rwdata-in-architectures.patch # linux-kernel-markers-port-blktrace-to-markers.patch Thanks, Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Subject: [patch 01/10] Linux Kernel Markers - Add kconfig menus for the marker code Date: Wed, 09 May 2007 21:55:56 -0400 With the increasing complexity of today's user-space application and the wide deployment of SMP systems, the users need an increasing understanding of the behavior and performance of a system across multiple processes/different execution contexts/multiple CPUs. In applications such as large clusters (Google, IBM), video acquisition (Autodesk), embedded real-time systems (Wind River, Monta Vista, Sony) or sysadmin/programmer-type tasks (SystemTAP from Redhat), a tool that permits tracing of kernel-user space interaction becomes necessary. Usage of such tools have been made to successfully pinpoint problems such as: latency issues in a user-space video acquisition application, slowdown problems in large clusters due to a switch to a different filesystems with a different cache size, abnormal Linux scheduler latency (just to name a few that I have personally investigated). The currently existing solutions does not give a system-wide overview of what - and when - things are happening on the system. Ptracing a program works with few processes, but quickly becomes useless when it comes to keeping track of many processes. Bugs occuring because of bad interaction of such complex systems can be very hard to find due to the fact that they occur rarely (sometimes once a week on hundreds of machines). One can therefore only hope at having the best conditions to statistically reproduce the bug while extracting information from the system. Some bugs have been successfully found at Google using their ktrace tracer only because they could enable it on production machines and therefore recreate the same context where the bug happened. Therefore, it makes sense to offer an instrumentation set of the most relevant events occurring in the Linux that can have the smallest performance cost possible when not active while not requiring a reboot of a production system to activate. This is essentially what the markers are providing. Since we cannot limit the growth of the Linux kernel, nor can we pre-determine each and every "interesting" instrumentation within each subsystem and driver, it is sensible to let this task to the persons who knows the best their code. Adding instrumentation should therefore be as easy as adding and maintaining a "printk" in the kernel code from the developer's point of view. Towards a complete tracing mechanism in the Linux kernel, the markers are only one step forward. The following step is to connect probes to those markers that will record the tracing information in buffers exported to user-space, organized in timestamped "events". Probe callbacks are responsible for serializing the information passed as parameter to the markers (described by the format string) into the events. A control mechanism to activate/stop the tracing is required, as well as a daemon that maps the buffers to write them to disk or send them through the network. Keeping track of the events also requires a centralized infrastructure : the idea is to assign a unique ID to each event so they can be later recognized in the trace. Keeping in mind that recording the complete instrumentation site name string for each event would be more that inefficient, assigning a numeric unique identifier makes sense. Finally, support for gathering events coming from user-space, with a minimal performance impact, is very useful to see the interaction between the system's execution contexts. The last steps are currently implemented in Linux Trace Toolkit Next Generation (LTTng). The SystemTAP project could clearly benefit from such an infrastructure for tracing. In addition, they would be providing support for dynamic addition of kernel probes through breakpoints/jumps when possible, with the associated restrictions (accessing local variables, reentrancy, speed). This marker infrastructure is a hook-callback mechanism. It is meant to have an impact as low as possible on the system performances when no callback (probe) is connected so markers (hooks) can be compiled into a production kernel without noticeable slowdown. The rationale behind this mechanism the following : 1 - It makes sense to have instrumentation (for tracing, profiling) within the kernel source tree so that it can follow its evolution. Other options, such as kprobes, imply maintaining an external set of instrumentation that must be adapted to each kernel version. Although it may make sense for distributions, it is not well suited for kernel developers, since they rarely work on a major distribution image. 2 - kprobes, although being a very good attempt at providing a dynamic hooking mechanism that has no impact when disabled, suffers from important limitations : a - It cannot access local variables of a function at a particular point within its body that will be consistent thorough the kernel versions without involving a lot of recurrent hair-pulling. b - Kprobes is slow, since it involves going though a trap each time a probe site is executed. Even though the djprobes project made a good effort to make things faster, it cannot currently instrument fully-preemptible kernels and does not solve (1), (2a) and (2c). c - On the reentrancy side, going though a trap (thus playing with interrupt enable/disable) and taking spinlocks are not suited to some code paths, i.e. : kernel/lockdep.c, printk (within the lockdep_on()/lockdep_off()). It must be understood that some code paths interesting for instrumentation often present a particular reentrancy challenge. Some more details : The probe callback connection to its markers is done dynamically. A predicted branch is used to skip the hook stack setup and function call when the marker is "disabled" (no probe is connected). Further optimizations can be implemented for each architecture to make this branch faster. Instrumentation of a subsystem becomes therefore a straightforward task. One has to add instrumentation within the key locations of the kernel code in the following form : trace_mark(subsystem_event, "%d %p", myint, myptr); Why use the markers instead of kprobes? The rationale behind this mechanism the following : 1 - It makes sense to have instrumentation (for tracing, profiling) within the kernel source tree so that it can follow its evolution. Other options, such as kprobes, imply maintaining an external set of instrumentation that must be adapted to each kernel version. Although it may make sense for distributions, it is not well suited for kernel developers, since they rarely work on a major distribution image. 2 - kprobes, although being a very good attempt at providing a dynamic hooking mechanism that has no impact when disabled, suffers from important limitations : a - It cannot access local variables of a function at a particular point within its body that will be consistent thorough the kernel versions without involving a lot of recurrent hair-pulling. b - Kprobes is slow, since it involves going though a trap each time a probe site is executed. Even though the djprobes project made a good effort to make things faster, it cannot currently instrument fully-preemptible kernels and does not solve (1), (2a) and (2c). c - On the reentrancy side, going though a trap (thus playing with interrupt enable/disable) and taking spinlocks are not suited to some code paths, i.e. : kernel/lockdep.c, printk (within the lockdep_on()/lockdep_off()). It must be understood that some code paths interesting for instrumentation often present a particular reentrancy challenge. Jim Keniston [email blocked] adds: kprobes remains a vital foundation for SystemTap. But markers are attactive as an alternate source of trace/debug info. Here's why: 1. Markers will live in the kernel and presumably be kept up to date by the maintainers of the enclosing code. We have a growing set of tapsets (probe libraries), each of which "knows" the source code for a certain area of the kernel. Whenever the underlying kernel code changes (e.g., a function or one of its args disappears or is renamed), there's a chance that the tapset will become invalid until we bring it back in sync with the kernel. As you can imagine, maintaining tapsets separate from the kernel source is a maintenance headache. Markers could mitigate this. 2. Because the kernel code is highly optimized, the kernel's dwarf info doesn't always accurately reflect which variables have which values on which lines (sometimes even upon entry to a function). A marker is a way to ensure that values of interest are available to SystemTap at marked points. 3. Sometimes the overhead of a kprobe probepoint is too much (either in terms of time or locking) for the particular hotspot we want to probe. In OLS2006 proceedings, vol. 1 http://www.linuxsymposium.org/2006/linuxsymposium_procv1.pdf Frank C. Eigler, from SystemTAP, presents its "static probing markers" (pp. 261-268) in his paper "Problem Solving With Systemtap". He explains the advantages : "In exchange for this effort, systemtap marker-based probes are faster and more precise than kprobes. The better precision comes from not having to covet the compiler's favours. Such fickle favours include retaining clean boundaries in the instruction stream between interesting statements, and precisely describing positions of variables in the stack frame. Since markers don't rely on debugging information, neither favour is required, and the compiler can channel its charms into unabated optimization. The speed advantage comes from using direct call instructions rather than int 3 breakpoints to dispatch to the systemtap handlers. We will see below just how big a difference this makes." He does a comparison of his "simple" marker solution with kprobes (his simple solution looks like my generic markers, but with a major race condition). I also posted numbers about the markers performance impact a few months ago in the initial thread. I can dig into my emails to find them for you if you consider it important for the Changelog. He concludes with : "To the extent that is true, we propose that these groups consider using a shared pool of static markers as the basic kernel-side instrumentation mechanism. If they prove to have as low dormant cost and as high active performance as initial experience suggests, perhaps this could motivate the various tracing efforts and kernel subsystem developers to finally join forces. Let's designate standard trace/probe points once and for all. Tracing backends can attach to these markers the same way systemtap would. There would be no need for them to maintain kernel patches any more. Let's think about it." This patch: Add Kconfig menus for the marker code. [bunk@stusta.de: Never ever select MODULES] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Adrian Bunk [email blocked] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- arch/alpha/Kconfig | 6 ++++++ arch/arm/Kconfig | 6 ++++++ arch/arm26/Kconfig | 6 ++++++ arch/avr32/Kconfig.debug | 7 +++++++ arch/cris/Kconfig | 6 ++++++ arch/frv/Kconfig | 6 ++++++ arch/h8300/Kconfig | 6 ++++++ arch/i386/Kconfig | 3 +++ arch/ia64/Kconfig | 3 +++ arch/m32r/Kconfig | 6 ++++++ arch/m68k/Kconfig | 6 ++++++ arch/m68knommu/Kconfig | 6 ++++++ arch/mips/Kconfig | 6 ++++++ arch/parisc/Kconfig | 6 ++++++ arch/powerpc/Kconfig | 3 +++ arch/ppc/Kconfig | 6 ++++++ arch/s390/Kconfig | 2 ++ arch/sh/Kconfig | 6 ++++++ arch/sh64/Kconfig | 6 ++++++ arch/sparc/Kconfig | 2 ++ arch/sparc64/Kconfig | 3 +++ arch/um/Kconfig | 6 ++++++ arch/v850/Kconfig | 6 ++++++ arch/x86_64/Kconfig | 3 +++ arch/xtensa/Kconfig | 6 ++++++ kernel/Kconfig.marker | 20 ++++++++++++++++++++ kernel/module.c | 1 + 27 files changed, 149 insertions(+) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Subject: [patch 07/10] Linux Kernel Markers - Documentation Date: Wed, 09 May 2007 21:56:02 -0400 Here is some documentation explaining what is/how to use the Linux Kernel Markers. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- Documentation/marker.txt | 266 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 266 insertions(+) Index: linux-2.6-lttng/Documentation/marker.txt =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6-lttng/Documentation/marker.txt 2007-05-09 18:16:13.000000000 -0400 @@ -0,0 +1,266 @@ + Using the Linux Kernel Markers + + Mathieu Desnoyers + + + This document introduces to markers and discusses its purpose. It +shows some usage examples of the Linux Kernel Markers : how to insert markers +within the kernel and how to connect probes to a marker. Finally, it has some +probe module examples. This is what connects to a marker. + + +* Purpose of markers + +A marker placed in your code provides a hook to a function (probe) that +you can provide at runtime. A marker can be "on" (a probe is connected to it) +or "off" (no probe is attached). An "off" marker has no effect. When turned on, +the function you provide is called each time the marker is executed in the +execution context of the caller. When the function provided ends its execution, +it returns to the caller (marker site). + +You can put markers at important locations in the code. They act as +lightweight hooks that can pass an arbitrary number of parameters, +described in a printk-like format string, to a function whenever the marker +code is reached. + +They can be used for tracing (LTTng, LKET over SystemTAP), overall performance +accounting (SystemTAP). They could also be used to implement +efficient hooks for SELinux or any other subsystem that would have this +kind of need. + +Using the markers for system audit (SELinux) would require to pass a +variable by address that would be later checked by the marked routine. + + +* Usage + +In order to use the macro MARK, you should include linux/marker.h. + +#include <linux/marker.h> + +Add, in your code : + +trace_mark(subsystem_event, "%d %s", someint, somestring); +Where : +- subsystem_event is an identifier unique to your event + - subsystem is the name of your subsystem. + - event is the name of the event to mark. +- "%d %s" is the formatted string for the serializer. +- someint is an integer. +- somestring is a char pointer. + +Connecting a function (probe) to a marker is done by providing a probe +(function to call) for the specific marker through marker_set_probe(). It will +automatically connect the function and enable the marker site. Removing a probe +is done through marker_remove_probe(). Probe removal is preempt safe because +preemption is disabled around the probe call. See the "Probe example" section +below for a sample probe module. + +The marker mechanism supports multiple instances of the same marker. +Markers can be put in inline functions, inlined static functions and +unrolled loops. + +Note : It is safe to put markers within preempt-safe code : preempt_enable() +will not call the scheduler due to the tests in preempt_schedule(). + + +* Optimization for a given architecture + +You will find, in asm-*/marker.h, optimisations for given architectures +(currently i386 and powerpc). They use a load immediate instead of a data load, +which saves a data cache hit, but also requires cross CPU code modification. In +order to support embedded systems which use read-only memory for their code, the +optimization can be disabled through menu options. + +The MF_* flags can be used to control the type of marker. See the +include/marker.h header for the list of flags. They can be specified as the +first parameter of the _trace_mark() macro, such as the following example which +is safe with respect to lockdep.c (useful for marking lockdep.c and printk +functions). + +_trace_mark(MF_DEFAULT | ~MF_LOCKDEP, subsystem_eventb, MARK_NOARGS); + +Another example is to specify that a specific marker must never call printk : +_trace_mark(MF_DEFAULT | ~MF_PRINTK, subsystem_eventc, + "%d %s", someint, somestring,); + +Flag compatibility is checked before connecting the probe to the marker : the +right flags must be given to _marker_set_enable(). + + +* Probe example + +You can build the kernel modules, probe-example.ko and marker-example.ko, +using the following Makefile: +------------------------------ CUT ------------------------------------- +obj-m := probe-example.o marker-example.o +KDIR := /lib/modules/$(shell uname -r)/build +PWD := $(shell pwd) +default: + $(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules +clean: + rm -f *.mod.c *.ko *.o +------------------------------ CUT ------------------------------------- +/* probe-example.c + * + * Connects two functions to marker call sites. + * + * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> + * + * This file is released under the GPLv2. + * See the file COPYING for more details. + */ + +#include <linux/sched.h> +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/marker.h> +#include <asm/atomic.h> + +#define NUM_PROBES ARRAY_SIZE(probe_array) + +struct probe_data { + const char *name; + const char *format; + marker_probe_func *probe_func; +}; + +void probe_subsystem_event(const struct __mark_marker_c *mdata, + const char *format, ...) +{ + va_list ap; + /* Declare args */ + unsigned int value; + const char *mystr; + + /* Assign args */ + va_start(ap, format); + value = va_arg(ap, typeof(value)); + mystr = va_arg(ap, typeof(mystr)); + + /* Call printk */ + printk("Value %u, string %s\n", value, mystr); + + /* or count, check rights, serialize data in a buffer */ + + va_end(ap); +} + +atomic_t eventb_count = ATOMIC_INIT(0); + +void probe_subsystem_eventb(const struct __mark_marker_c *mdata, + const char *format, ...) +{ + /* Increment counter */ + atomic_inc(&eventb_count); +} + +static struct probe_data probe_array[] = +{ + { .name = "subsystem_event", + .format = "%d %s", + .probe_func = probe_subsystem_event }, + { .name = "subsystem_eventb", + .format = MARK_NOARGS, + .probe_func = probe_subsystem_eventb }, +}; + +static int __init probe_init(void) +{ + int result; + uint8_t eID; + + for (eID = 0; eID < NUM_PROBES; eID++) { + result = marker_set_probe(probe_array[eID].name, + probe_array[eID].format, + probe_array[eID].probe_func, &probe_array[eID]); + if (!result) + printk(KERN_INFO "Unable to register probe %s\n", + probe_array[eID].name); + } + return 0; +} + +static void __exit probe_fini(void) +{ + uint8_t eID; + + for (eID = 0; eID < NUM_PROBES; eID++) { + marker_remove_probe(probe_array[eID].name); + } + synchronize_sched(); /* Wait for probes to finish */ + printk("Number of event b : %u\n", atomic_read(&eventb_count)); +} + +module_init(probe_init); +module_exit(probe_fini); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Mathieu Desnoyers"); +MODULE_DESCRIPTION("SUBSYSTEM Probe"); +------------------------------ CUT ------------------------------------- +/* marker-example.c + * + * Executes a marker when /proc/marker-example is opened. + * + * (C) Copyright 2007 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> + * + * This file is released under the GPLv2. + * See the file COPYING for more details. + */ + +#include <linux/module.h> +#include <linux/marker.h> +#include <linux/sched.h> +#include <linux/proc_fs.h> + +struct proc_dir_entry *pentry_example = NULL; + +static int my_open(struct inode *inode, struct file *file) +{ + int i; + + trace_mark(subsystem_event, "%d %s", 123, "example string"); + for (i=0; i<10; i++) { + trace_mark(subsystem_eventb, MARK_NOARGS); + } + return -EPERM; +} + +static struct file_operations mark_ops = { + .open = my_open, +}; + +static int example_init(void) +{ + printk(KERN_ALERT "example init\n"); + pentry_example = create_proc_entry("marker-example", 0444, NULL); + if (pentry_example) + pentry_example->proc_fops = &mark_ops; + else + return -EPERM; + return 0; +} + +static void example_exit(void) +{ + printk(KERN_ALERT "example exit\n"); + remove_proc_entry("marker-example", NULL); +} + +module_init(example_init) +module_exit(example_exit) + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Mathieu Desnoyers"); +MODULE_DESCRIPTION("Linux Trace Toolkit example"); +------------------------------ CUT ------------------------------------- +Sequence of operations : (as root) +make +insmod marker-example.ko +insmod probe-example.ko + (it is important to load the probe after the marked code) +cat /proc/marker-example (returns an expected error) +rmmod marker-example probe-example +dmesg +------------------------------ CUT ------------------------------------- -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
If the performance impact of
If the performance impact of these markers can be kept low enough, I think the day will come when companies will want to enable them for all machines. Machine time is a commodity that gets cheaper every year; debugging time is not.
Hopefully the future will bring a fully transparent and debuggable Linux kernel. It's nice to know everything that is going on on your machine. That kind of transparency is something that Linux can offer that no proprietary operating system could.