[Version 3]
OK, this should be my final RFC release. If there are no complaints
about this one, I'll post it as an official git pull.
This also affects kprobes and perf.
At the Linux Collaboration Summit, I talked with Mathieu and others about
lowering the footprint of trace events. I spent all of last week
trying to get the size as small as I could.
Currently, each TRACE_EVENT() macro adds 1 - 5K per tracepoint. Adding a
TRACE_EVENT() gave me varying results from the compiler, depending on
config options that did not seem related. The new tracepoint I added
would add between 1 and 5K, but I did not investigate enough to
see what the true size was.
What was consistent was the DEFINE_EVENT(). Currently, it adds
a little over 700 bytes per DEFINE_EVENT().
This patch series does not seem to affect TRACE_EVENT() much (it showed
the same varying sizes), but it consistently brings DEFINE_EVENT()s
down from 700 bytes to 250 bytes per DEFINE_EVENT(). Since syscalls
use one "class" and are equivalent to DEFINE_EVENT(), this can
be a significant savings.
With events and syscalls (82 events and 616 syscalls), before this
patch series, the size of vmlinux was: 16161794, and afterward: 16058182.
That is 103,612 bytes in savings! (over 100K)
Without tracing syscalls (82 events), it brought the size of vmlinux
down from 15910465 to 15888394.
That is 22,071 bytes in savings.
This is just an RFC (for now), to get people's opinions on the changes.
It does a bit of rewriting of the CPP macros, just to warn you ;-)
Changes in v3:
o Ported to latest tip/tracing/core
o Fixed typo in change log that a comment in LWN noticed:
Wrote: 15999394 when it should have been 15888394.
(Note: these numbers are from the original posting. I need to
redo them again before posting officially).
o Added Mathieu Desnoyers's check_trace patch, which checks
the callback to make sure it matches what DECLARE_TRACE()
expects.
o Added the check_trace to the ftrace and ...
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
This check is meant to be used by tracepoint users that cast callbacks
directly to (void *) for direct registration, thus bypassing the
register_trace_##name and unregister_trace_##name checks.
It ensures that the callback type matches the function type at the
call site, without generating any code.
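A minimal userspace sketch of the idea, assuming a simplified, hypothetical DECLARE_CALLBACK_CHECK() macro (not the kernel's macro) that accepts a single-parameter prototype: an empty static inline function whose parameter has the exact callback type. Passing a probe with a different signature produces a compiler diagnostic, and since the body is empty, the call itself generates no code. my_probe, stored_probe and fire_my_event are illustrative names only.

```c
#include <assert.h>

/*
 * Hypothetical, simplified stand-in for the check generated per tracepoint.
 * The body is empty, so calls compile to nothing; only the parameter type
 * matters. This sketch supports a single-parameter proto.
 */
#define DECLARE_CALLBACK_CHECK(name, proto)                                \
    static inline void check_trace_callback_type_##name(void (*cb)(proto)) \
    {                                                                      \
        (void)cb; /* type check only */                                    \
    }

DECLARE_CALLBACK_CHECK(my_event, int value)

int last_value = -1;

/* Probe whose type matches void (*)(int). */
void my_probe(int value)
{
    last_value = value;
}

/* Stand-in for a registration path that does no type checking of its own. */
void (*stored_probe)(int);

void register_my_probe(void)
{
    /* A probe of any other type would trigger a compiler diagnostic here. */
    check_trace_callback_type_my_event(my_probe);
    stored_probe = my_probe;
}

void fire_my_event(int value)
{
    if (stored_probe)
        stored_probe(value);
}
```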
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100430165959.GA25605@Krystal>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Li Zefan <lizf@cn.fujitsu.com>
CC: Masami Hiramatsu <mhiramat@redhat.com>
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/tracepoint.h | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 1d85f9a..8d5e4f6 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -137,9 +137,11 @@ static inline void tracepoint_update_probe_range(struct tracepoint *begin,
static inline int unregister_trace_##name(void (*probe)(proto)) \
{ \
return tracepoint_probe_unregister(#name, (void *)probe);\
+ } \
+ static inline void check_trace_callback_type_##name(void (*cb)(proto)) \
+ { \
}
-
#define DEFINE_TRACE_FN(name, reg, unreg) \
static const char __tpstrtab_##name[] \
__attribute__((section("__tracepoints_strings"))) = #name; \
@@ -168,6 +170,9 @@ static inline void tracepoint_update_probe_range(struct tracepoint *begin,
static inline int unregister_trace_##name(void (*probe)(proto)) \
{ \
return -ENOSYS; \
+ } \
+ static ...
From: Steven Rostedt <srostedt@redhat.com>
This patch creates a ftrace_event_class struct that event structs point to.
This class struct will be made to hold information to modify the
events. Currently the class struct only holds the event's system name.
This patch slightly increases the text size while decreasing
the data size. The overall change is still a slight increase, but
this change lays the groundwork for other changes that make the footprint
of tracepoints smaller.
With 82 standard tracepoints, and 616 system call tracepoints:
text data bss dec hex filename
5788186 1337252 9351592 16477030 fb6b66 vmlinux.orig
5792282 1333796 9351592 16477670 fb6de6 vmlinux.class
This patch also cleans up some stale comments in ftrace.h.
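To illustrate the layout change, here is a simplified userspace model. The names event_class, event_call and the sample events are hypothetical stand-ins for ftrace_event_class and ftrace_event_call, and the class holds only the system name, as in this patch. Data that is identical for every event of a class is stored once and reached through a pointer.

```c
#include <assert.h>
#include <string.h>

/* Per-class data lives once in the class (only the system name for now). */
struct event_class {
    const char *system;
};

/* Each event keeps just a pointer to its shared class. */
struct event_call {
    const char *name;
    struct event_class *class;
};

struct event_class sched_class = { .system = "sched" };

struct event_call sched_wakeup = { .name = "sched_wakeup", .class = &sched_class };
struct event_call sched_wakeup_new = { .name = "sched_wakeup_new", .class = &sched_class };

/* Look up the system name through the class instead of the event. */
const char *event_system(const struct event_call *call)
{
    assert(call->class != NULL);
    return call->class->system;
}
```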
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/ftrace_event.h | 6 ++++-
include/linux/syscalls.h | 6 +++-
include/trace/ftrace.h | 40 +++++++++++++++--------------------
kernel/trace/trace_events.c | 20 +++++++++---------
kernel/trace/trace_events_filter.c | 6 ++--
kernel/trace/trace_export.c | 6 ++++-
kernel/trace/trace_kprobe.c | 12 +++++-----
kernel/trace/trace_syscalls.c | 4 +++
8 files changed, 54 insertions(+), 46 deletions(-)
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 39e71b0..496eea8 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -113,10 +113,14 @@ void tracing_record_cmdline(struct task_struct *tsk);
struct event_filter;
+struct ftrace_event_class {
+ char *system;
+};
+
struct ftrace_event_call {
struct list_head list;
+ struct ftrace_event_class *class;
char *name;
- char *system;
struct dentry *dir;
struct trace_event *event;
int enabled;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 057929b..ac5791d 100644
--- ...
From: Steven Rostedt <srostedt@redhat.com>
The raw_init function pointer in the event is used to initialize
various kinds of events. The type of initialization needed usually
depends on the kind of event.
Two events with the same class will always have the same initialization
function, so it makes sense to move this to the class structure.
Perhaps even making a special system structure would work since
the initialization is the same for all events within a system.
But since there's no system structure (yet), this will just move it
to the class.
text data bss dec hex filename
5788186 1337252 9351592 16477030 fb6b66 vmlinux.orig
5774567 1297492 9351592 16423651 fa9ae3 vmlinux.fields
5774510 1293204 9351592 16419306 fa89ea vmlinux.init
The text grew very slightly, but this is a constant growth caused by
changing the C files that call the init code.
The bigger savings are in the data, which grow the more events share
a class.
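A rough sketch of the dispatch this enables, assuming hypothetical simplified structs (event_class, event_call, generic_raw_init and init_event are illustrative names, not the patch's code): the init hook is stored once in the class, and every event reaches it through call->class->raw_init.

```c
#include <assert.h>
#include <stddef.h>

struct event_call;

struct event_class {
    const char *system;
    int (*raw_init)(struct event_call *call); /* shared by all events */
};

struct event_call {
    const char *name;
    struct event_class *class;
    int initialized;
};

/* Example init routine shared by every event in the class. */
static int generic_raw_init(struct event_call *call)
{
    call->initialized = 1;
    return 0;
}

struct event_class syscall_class = {
    .system = "syscalls",
    .raw_init = generic_raw_init,
};

struct event_call sys_enter_open = { .name = "sys_enter_open", .class = &syscall_class };
struct event_call sys_enter_close = { .name = "sys_enter_close", .class = &syscall_class };

/* Hypothetical registration path: the init hook comes from the class. */
int init_event(struct event_call *call)
{
    assert(call->class != NULL);
    if (call->class->raw_init)
        return call->class->raw_init(call);
    return 0;
}
```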
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/ftrace_event.h | 2 +-
include/linux/syscalls.h | 2 --
include/trace/ftrace.h | 9 ++++-----
kernel/trace/trace_events.c | 12 ++++++------
kernel/trace/trace_export.c | 2 +-
kernel/trace/trace_kprobe.c | 6 +++---
kernel/trace/trace_syscalls.c | 2 ++
7 files changed, 17 insertions(+), 18 deletions(-)
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 479c3c1..393a839 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -133,6 +133,7 @@ struct ftrace_event_class {
int (*define_fields)(struct ftrace_event_call *);
struct list_head *(*get_fields)(struct ftrace_event_call *);
struct list_head fields;
+ int (*raw_init)(struct ftrace_event_call *);
};
struct ftrace_event_call {
@@ -144,7 +145,6 @@ struct ftrace_event_call {
int enabled;
...
From: Steven Rostedt <srostedt@redhat.com>
Currently, every event has its own trace_event structure. This is
fine since the structure is needed anyway. But the print function
structure (trace_event_functions) is now separate. Since the output
of the trace event is done by the class (with the exception of events
defined by DEFINE_EVENT_PRINT), it makes sense to have the class
define the print functions that all events in the class can use.
This matters even more for the syscall events, since all syscall events
use the same class. The savings here is another 37K.
text data bss dec hex filename
5788186 1337252 9351592 16477030 fb6b66 vmlinux.orig
5774574 1293204 9351592 16419370 fa8a2a vmlinux.init
5761154 1268356 9351592 16381102 f9f4ae vmlinux.print
To accomplish this, and to let the class know what event is being
printed, the event structure is embedded in the ftrace_event_call
structure. This should not be an issue since the event structure
was created for each event anyway.
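One way a class-level print routine can get from the embedded event back to the event that contains it is container_of(). The sketch below is a userspace model under stated assumptions: container_of is a simplified local definition (the kernel's version also type-checks), and trace_event_model, event_call_model and event_name_from_event are hypothetical names, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace container_of() (the kernel version adds type checks). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct trace_event_model {
    int type;
};

struct event_call_model {
    const char *name;
    struct trace_event_model event; /* embedded, not a pointer */
};

/* A shared print routine receives only the embedded event... */
const char *event_name_from_event(struct trace_event_model *ev)
{
    /* ...and recovers the enclosing event call from it. */
    struct event_call_model *call =
        container_of(ev, struct event_call_model, event);
    return call->name;
}

struct event_call_model sys_enter_read = {
    .name = "sys_enter_read",
    .event = { .type = 10 },
};
```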
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/ftrace_event.h | 2 +-
include/linux/syscalls.h | 18 +++------------
include/trace/ftrace.h | 45 +++++++++++++++++-----------------------
kernel/trace/trace_events.c | 6 ++--
kernel/trace/trace_kprobe.c | 14 +++++-------
kernel/trace/trace_syscalls.c | 8 +++++++
6 files changed, 41 insertions(+), 52 deletions(-)
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 4f77932..b1a007d 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -148,7 +148,7 @@ struct ftrace_event_call {
struct ftrace_event_class *class;
char *name;
struct dentry *dir;
- struct trace_event *event;
+ struct trace_event event;
int enabled;
int id;
const char *print_fmt;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index ...
From: Steven Rostedt <srostedt@redhat.com>
Multiple events may use the same method to print their data.
Instead of having all events have a pointer to their print functions,
the trace_event structure now points to a trace_event_functions structure
that will hold the way to print out the event.
The event itself is now passed to the print function to let the print
function know what kind of event it should print.
This opens the door to consolidating the way several events print
their output.
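A simplified model of the shared function table, under stated assumptions: the real callbacks take a struct trace_iterator and return enum print_line_t, while this sketch writes into a caller-supplied buffer instead. The *_model types, print_trace, print_raw and the sample events are hypothetical. The point it shows is that one table serves many events, and passing the event in is what lets a shared function print the right one.

```c
#include <assert.h>
#include <stdio.h>

struct trace_event_model;

/* Print callbacks now also receive the event being printed. */
typedef int (*print_func)(char *buf, size_t len, int flags,
                          struct trace_event_model *event);

struct trace_event_functions_model {
    print_func trace;
    print_func raw;
};

struct trace_event_model {
    int type;
    const char *name;
    struct trace_event_functions_model *funcs; /* shared table */
};

static int print_trace(char *buf, size_t len, int flags,
                       struct trace_event_model *event)
{
    (void)flags;
    return snprintf(buf, len, "%s: type=%d", event->name, event->type);
}

static int print_raw(char *buf, size_t len, int flags,
                     struct trace_event_model *event)
{
    (void)flags;
    return snprintf(buf, len, "%d", event->type);
}

/* One table, shared by every event that prints the same way. */
struct trace_event_functions_model shared_funcs = {
    .trace = print_trace,
    .raw = print_raw,
};

struct trace_event_model ev_a = { .type = 1, .name = "ev_a", .funcs = &shared_funcs };
struct trace_event_model ev_b = { .type = 2, .name = "ev_b", .funcs = &shared_funcs };
```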
v2: Fix the new function graph tracer event calls to handle this change.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/ftrace_event.h | 17 +++-
include/linux/syscalls.h | 10 ++-
include/trace/ftrace.h | 13 ++-
include/trace/syscall.h | 6 +-
kernel/trace/blktrace.c | 13 ++-
kernel/trace/kmemtrace.c | 28 +++++--
kernel/trace/trace.c | 9 +-
kernel/trace/trace_functions_graph.c | 13 ++-
kernel/trace/trace_kprobe.c | 22 ++++--
kernel/trace/trace_output.c | 137 +++++++++++++++++++++++-----------
kernel/trace/trace_output.h | 2 +-
kernel/trace/trace_syscalls.c | 6 +-
12 files changed, 186 insertions(+), 90 deletions(-)
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 393a839..4f77932 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -70,18 +70,25 @@ struct trace_iterator {
};
+struct trace_event;
+
typedef enum print_line_t (*trace_print_func)(struct trace_iterator *iter,
- int flags);
-struct trace_event {
- struct hlist_node node;
- struct list_head list;
- int type;
+ int flags, struct trace_event *event);
+
+struct trace_event_functions {
trace_print_func trace;
trace_print_func raw;
trace_print_func hex;
trace_print_func binary;
...
From: Steven Rostedt <srostedt@redhat.com>
Now that the trace_event structure is embedded in the ftrace_event_call
structure, there is no need for the ftrace_event_call id field.
The id field is the same as the trace_event type field.
Removing the id and re-arranging the structure brings down the tracepoint
footprint by another 5K.
text data bss dec hex filename
5788186 1337252 9351592 16477030 fb6b66 vmlinux.orig
5761154 1268356 9351592 16381102 f9f4ae vmlinux.print
5761074 1262596 9351592 16375262 f9ddde vmlinux.id
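Part of the saving from re-arranging ftrace_event_call is likely alignment padding: in the diff, the old layout puts filter_active (an int) directly before the filter pointer, while the new one groups the ints at the end. A minimal sketch with hypothetical structs, assuming an LP64 target (8-byte pointers, 4-byte ints):

```c
#include <assert.h>

/* Interleaving ints with pointers leaves padding holes on LP64. */
struct layout_before {
    void *dir;
    int enabled;           /* followed by 4 bytes of padding on LP64 */
    const char *print_fmt;
    int filter_active;     /* followed by 4 bytes of padding on LP64 */
    void *filter;
};

/* Grouping the ints at the end lets them share one 8-byte slot. */
struct layout_after {
    void *dir;
    const char *print_fmt;
    void *filter;
    int enabled;
    int filter_active;
};

static_assert(sizeof(struct layout_after) <= sizeof(struct layout_before),
              "reordering should never grow the struct");
```

On a typical 64-bit build this gives 40 versus 32 bytes; on 32-bit targets both are the same size.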
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/ftrace_event.h | 5 ++---
include/trace/ftrace.h | 12 ++++++------
kernel/trace/trace_event_perf.c | 4 ++--
kernel/trace/trace_events.c | 7 +++----
kernel/trace/trace_events_filter.c | 2 +-
kernel/trace/trace_export.c | 4 ++--
kernel/trace/trace_kprobe.c | 18 ++++++++++--------
kernel/trace/trace_syscalls.c | 14 ++++++++------
8 files changed, 34 insertions(+), 32 deletions(-)
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index b1a007d..0be0285 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -149,14 +149,13 @@ struct ftrace_event_call {
char *name;
struct dentry *dir;
struct trace_event event;
- int enabled;
- int id;
const char *print_fmt;
- int filter_active;
struct event_filter *filter;
void *mod;
void *data;
+ int enabled;
+ int filter_active;
int perf_refcount;
};
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 839c9fe..d5a61a3 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -150,7 +150,7 @@
*
* entry = iter->ent;
*
- * if (entry->type != event_<call>.id) {
+ * if (entry->type != event_<call>->event.type) {
* WARN_ON_ONCE(1);
* return TRACE_TYPE_UNHANDLED;
* }
@@ -221,7 +221,7 @@ ftrace_raw_output_##call(struct ...
From: Steven Rostedt <srostedt@redhat.com>
The filter_active and enabled fields each use an int (4 bytes) to
hold a single flag. We can save 4 bytes per event by combining the
two into a single integer.
text data bss dec hex filename
5788186 1337252 9351592 16477030 fb6b66 vmlinux.orig
5761074 1262596 9351592 16375262 f9ddde vmlinux.id
5761007 1256916 9351592 16369515 f9c76b vmlinux.flags
This gives us another 5K in savings.
Both the enabled and filter fields are modified
under the event_mutex, so it is still safe to combine the two.
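A minimal sketch of how the two flags share one word, following the bit/mask enum pattern from the diff. The EVENT_FL_* names, event_flags_model and the helper functions are hypothetical, and no locking is modeled; in the patch, writers hold event_mutex.

```c
#include <assert.h>

enum {
    EVENT_FL_ENABLED_BIT,
    EVENT_FL_FILTERED_BIT,
};

enum {
    EVENT_FL_ENABLED  = (1 << EVENT_FL_ENABLED_BIT),
    EVENT_FL_FILTERED = (1 << EVENT_FL_FILTERED_BIT),
};

struct event_flags_model {
    unsigned int flags; /* replaces two separate int fields */
};

/* Set or clear one flag without touching the other bits. */
void event_set_flag(struct event_flags_model *ev, unsigned int flag, int on)
{
    if (on)
        ev->flags |= flag;
    else
        ev->flags &= ~flag;
}

int event_test_flag(const struct event_flags_model *ev, unsigned int flag)
{
    return (ev->flags & flag) != 0;
}
```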
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/ftrace_event.h | 21 +++++++++++++++++++--
kernel/trace/trace.h | 2 +-
kernel/trace/trace_events.c | 14 +++++++-------
kernel/trace/trace_events_filter.c | 10 +++++-----
kernel/trace/trace_kprobe.c | 2 +-
5 files changed, 33 insertions(+), 16 deletions(-)
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 0be0285..5ac97a4 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -143,6 +143,16 @@ struct ftrace_event_class {
int (*raw_init)(struct ftrace_event_call *);
};
+enum {
+ TRACE_EVENT_FL_ENABLED_BIT,
+ TRACE_EVENT_FL_FILTERED_BIT,
+};
+
+enum {
+ TRACE_EVENT_FL_ENABLED = (1 << TRACE_EVENT_FL_ENABLED_BIT),
+ TRACE_EVENT_FL_FILTERED = (1 << TRACE_EVENT_FL_FILTERED_BIT),
+};
+
struct ftrace_event_call {
struct list_head list;
struct ftrace_event_class *class;
@@ -154,8 +164,15 @@ struct ftrace_event_call {
void *mod;
void *data;
- int enabled;
- int filter_active;
+ /*
+ * 32 bit flags:
+ * bit 1: enabled
+ * bit 2: filter_active
+ *
+ * Must hold event_mutex to change.
+ */
+ unsigned int flags;
+
int perf_refcount;
};
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index c88c563..6356259 100644
--- a/kernel/trace/trace.h
+++ ...
I would also comment about flags read-side:
 * Flags are read concurrently without locking.
Besides that minor nit, the whole patchset has my
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Thanks!
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
That can go in outside this patch set. As we discussed before, the
filter_active, which does the same today as flags does here, also has
the issue you are concerned with. IOW, this issue has nothing to do with
this patch set, because the issue existed before the patch set and has
not changed after the patch set.
The comment should also be in Documentation, not here, since it would
Thanks! I'll get an official release ready.
-- Steve
Typically, locking-related information belongs in comments close to the
definition. I'm not sure why you say it affects users at all.
Thanks,
Mathieu
I thought you were concerned about changes to the filter not taking
effect immediately. But, yeah, if you are worried that developers should
know that the read value may change, then sure, comment at the code.
-- Steve
From: Steven Rostedt <srostedt@redhat.com>
Move the defined fields from the event to the class structure.
Since the fields of the event are defined by the class they belong
to, it makes sense to have the class hold the information instead
of the individual events. The events of the same class would just
hold duplicate information.
After this change the size of the kernel dropped another 8K:
text data bss dec hex filename
5788186 1337252 9351592 16477030 fb6b66 vmlinux.orig
5774316 1306580 9351592 16432488 fabd68 vmlinux.reg
5774503 1297492 9351592 16423587 fa9aa3 vmlinux.fields
Although the text increased, this was mainly due to the C files
having to adapt to the change. This is a constant increase, where
new tracepoints will not increase the text. But the big drop is
in the data size (as well as needed allocations to hold the fields).
This will give even more savings as more tracepoints are created.
Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS()
with several DEFINE_EVENT()s, then the savings will be lost. But
we are pushing developers to consolidate events with DEFINE_EVENT()
so this should not be an issue.
Kprobes define a unique class for every new event, but since they are
dynamic, this should not be an issue.
The syscalls however have a single class but the fields for the individual
events are different. The syscalls use metadata to define the
fields. I moved the fields list from the event to the metadata and
added a "get_fields()" function to the class. This function is used
to find the fields. For normal events and kprobes, get_fields() just
returns a pointer to the fields list_head in the class. For syscall
events, it returns the fields list_head in the metadata for the event.
v2: Fixed the syscall fields. The syscall metadata needs a list
of fields for both enter and exit.
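A rough sketch of the get_fields() indirection, under stated assumptions: fields are modeled as a small fixed array instead of a list_head, only the enter fields of the syscall metadata are shown, and a NULL get_fields() falling back to the class's own list is a choice made for this sketch. All names (event_fields, syscall_get_fields, syscall_meta, the sample events) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified field list: the kernel uses a list_head of field descriptors. */
struct field_list {
    const char *names[4];
    int count;
};

struct event_call;

struct event_class {
    struct field_list fields;                              /* shared by the class */
    struct field_list *(*get_fields)(struct event_call *); /* optional override */
};

/* Hypothetical per-syscall metadata holding that syscall's own fields. */
struct syscall_meta {
    struct field_list enter_fields;
};

struct event_call {
    const char *name;
    struct event_class *class;
    void *data; /* points to struct syscall_meta for syscall events */
};

/* Normal events use the class list; syscalls ask the class hook. */
struct field_list *event_fields(struct event_call *call)
{
    assert(call->class != NULL);
    if (call->class->get_fields)
        return call->class->get_fields(call);
    return &call->class->fields;
}

static struct field_list *syscall_get_fields(struct event_call *call)
{
    struct syscall_meta *meta = call->data;
    return &meta->enter_fields;
}

struct event_class sched_class = {
    .fields = { .names = { "pid", "prio" }, .count = 2 },
};
struct event_class syscall_enter_class = { .get_fields = syscall_get_fields };
struct syscall_meta open_meta = {
    .enter_fields = { .names = { "filename", "flags", "mode" }, .count = 3 },
};
struct event_call sched_wakeup = { "sched_wakeup", &sched_class, NULL };
struct event_call sys_enter_open = { "sys_enter_open", &syscall_enter_class, &open_meta };
```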
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu ...
From: Steven Rostedt <srostedt@redhat.com>
This patch adds data to be passed to tracepoint callbacks.
The created functions from DECLARE_TRACE() now need a mandatory data
parameter. For example:
DECLARE_TRACE(mytracepoint, TP_PROTO(int value), TP_ARGS(value))
Will create the register function:
int register_trace_mytracepoint(void (*probe)(void *data, int value),
void *data);
As the first argument, all callbacks (probes) must take a (void *data)
parameter. So a callback for the above tracepoint will look like:
void myprobe(void *data, int value)
{
}
The callback may choose to ignore the data parameter.
This change allows callbacks to register a private data pointer along
with the function probe.
void mycallback(void *data, int value);
register_trace_mytracepoint(mycallback, mydata);
Then mycallback() will receive "mydata" as its first parameter,
before the args.
A more detailed example:
DECLARE_TRACE(mytracepoint, TP_PROTO(int status), TP_ARGS(status));
/* In the C file */
DEFINE_TRACE(mytracepoint);
[...]
trace_mytracepoint(status);
/* In a file registering this tracepoint */
void my_callback(void *data, int status)
{
struct my_struct *my_data = data;
[...]
}
[...]
my_data = kmalloc(sizeof(*my_data), GFP_KERNEL);
init_my_data(my_data);
register_trace_mytracepoint(my_callback, my_data);
The same callback can also be registered to the same tracepoint as long
as the data registered is different. Note, the data must also be used
to unregister the callback:
unregister_trace_mytracepoint(my_callback, my_data);
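The register/unregister rules above can be modeled in userspace with a small table of (function, data) pairs. This is a sketch under stated assumptions, not the tracepoint implementation: register_probe, unregister_probe, fire, sum_probe and struct counter are hypothetical, the table is a fixed array, and errors are reported as -1 rather than kernel error codes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROBES 8

typedef void (*probe_fn)(void *data, int value);

struct probe_entry {
    probe_fn func;
    void *data;
};

static struct probe_entry probes[MAX_PROBES];
static int nr_probes;

/* Refuse an exact (func, data) duplicate; same func with other data is fine. */
int register_probe(probe_fn func, void *data)
{
    for (int i = 0; i < nr_probes; i++)
        if (probes[i].func == func && probes[i].data == data)
            return -1;
    if (nr_probes == MAX_PROBES)
        return -1;
    probes[nr_probes].func = func;
    probes[nr_probes].data = data;
    nr_probes++;
    return 0;
}

/* Unregistering must match both the function and the data pointer. */
int unregister_probe(probe_fn func, void *data)
{
    for (int i = 0; i < nr_probes; i++) {
        if (probes[i].func == func && probes[i].data == data) {
            probes[i] = probes[--nr_probes]; /* order is not preserved */
            return 0;
        }
    }
    return -1;
}

/* What a trace_<name>(value) call does here: pass each probe its data first. */
void fire(int value)
{
    for (int i = 0; i < nr_probes; i++)
        probes[i].func(probes[i].data, value);
}

struct counter {
    int sum;
};

void sum_probe(void *data, int value)
{
    struct counter *c = data;
    c->sum += value;
}
```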
Because of the data parameter, tracepoints declared this way cannot have
zero arguments. That is:
DECLARE_TRACE(mytracepoint, TP_PROTO(void), TP_ARGS());
will cause an error.
If no arguments are needed, a new macro can be used instead:
DECLARE_TRACE_NOARGS(mytracepoint);
Since there are no arguments, the proto and args ...
From: Steven Rostedt <srostedt@redhat.com>
This patch removes the register functions of TRACE_EVENT() to enable
and disable tracepoints. Registering an event is now done
directly in the trace_events.c file, which calls
tracepoint_probe_register() directly.
The prototypes are no longer type checked, but this should not be
an issue since the tracepoints are created automatically by the
macros. If a prototype is incorrect in the TRACE_EVENT() macro, then
other macros will catch it.
The trace_event_class structure now holds the probes to be called
by the callbacks. This removes the need for each event to have
a separate pointer for the probe.
To handle kprobes and syscalls, since they register probes in a
different manner, a "reg" field is added to the ftrace_event_class
structure. If the "reg" field is assigned, then it will be called for
enabling and disabling of the probe for either ftrace or perf. To let
the reg function know what is happening, a new enum (trace_reg) is
created that has the type of control that is needed.
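A simplified model of the optional reg() hook, under stated assumptions: the enum values below only mirror the idea of trace_reg (ftrace and perf register/unregister), and event_reg, default_reg, custom_reg and the sample classes are hypothetical names. Classes without reg() take a default path; a class that sets reg() (as kprobes and syscalls would) handles enabling itself.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative mirror of the trace_reg idea; names are hypothetical. */
enum trace_reg_model {
    REG_REGISTER,
    REG_UNREGISTER,
    REG_PERF_REGISTER,
    REG_PERF_UNREGISTER,
};

struct event_call;

struct event_class {
    int (*reg)(struct event_call *call, enum trace_reg_model type); /* optional */
};

struct event_call {
    struct event_class *class;
    int ftrace_on;
    int perf_on;
};

/* Default path used when the class has no reg() (ordinary tracepoints). */
static int default_reg(struct event_call *call, enum trace_reg_model type)
{
    switch (type) {
    case REG_REGISTER:        call->ftrace_on = 1; return 0;
    case REG_UNREGISTER:      call->ftrace_on = 0; return 0;
    case REG_PERF_REGISTER:   call->perf_on = 1;   return 0;
    case REG_PERF_UNREGISTER: call->perf_on = 0;   return 0;
    }
    return -1;
}

int event_reg(struct event_call *call, enum trace_reg_model type)
{
    if (call->class->reg)
        return call->class->reg(call, type);
    return default_reg(call, type);
}

/* A class with its own reg(), standing in for kprobes or syscalls. */
int custom_reg_calls;

static int custom_reg(struct event_call *call, enum trace_reg_model type)
{
    (void)call;
    (void)type;
    custom_reg_calls++;
    return 0;
}

struct event_class plain_class = { .reg = NULL };
struct event_class kprobe_like_class = { .reg = custom_reg };
struct event_call plain_event = { .class = &plain_class };
struct event_call kprobe_event = { .class = &kprobe_like_class };
```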
With this new rework, the 82 kernel events and 616 syscall events
have their footprint dramatically lowered:
text data bss dec hex filename
5788186 1337252 9351592 16477030 fb6b66 vmlinux.orig
5792282 1333796 9351592 16477670 fb6de6 vmlinux.class
5793448 1333780 9351592 16478820 fb7264 vmlinux.tracepoint
5796926 1337748 9351592 16486266 fb8f7a vmlinux.data
5774316 1306580 9351592 16432488 fabd68 vmlinux.regs
The size went from 16477030 to 16432488, that's a total of 44K
in savings. With tracepoints being continuously added, it is
critical that the footprint becomes minimal.
v3: Updated to handle void *data in beginning of probe parameters.
Also added the tracepoint: check_trace_callback_type_##call().
v2: Changed the callback probes to pass void * and typecast the
value within the function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
include/linux/ftrace_event.h | 19 +++++-
...