This is a series of patches for tracing aiming to reduce the size of events in the trace buffer. For some tracing tasks, the machine is so heavily loaded such that events are produced much faster than they can be written to disk or network. In this situation, the sizes of the trace events and the ring buffer are the primary factors limiting the length of time that can be traced. Even with ktrace (which has significantly smaller event sizes), a fairly minimal set of enabled events, and a 16MB buffer per cpu, we are finding we can only get about 20 seconds of tracing. This is not enough when the problem being diagnosed only occurs at greater timescales and cannot be easily detected at runtime. These patches are just a start to shrinking the size of trace events. I am planning to also make small versions of trace events that are used when CONFIG_SMALL_TRACES is enabled. I'm also open to discussion of other ways to shrink event sizes. The patches can be divided into three major groups. The first group is self explanitory patches, not actually related to shrinking events: tracing: Add a 'buffer_overwrite' debugfs file ring_buffer.c: Remove unused #include <linux/trace_irq.h> ring_buffer: Align buffer_page struct allocations only to fit the flags. The second group is patches to adjust packing and alignment of tracing event structures: ftrace: pack event structures. ftrace: fix event alignment: ftrace:context_switch and ftrace:wakeup ftrace: fix event alignment: module:module_request ftrace: fix event alignment: kvm:kvm_hv_hypercall ftrace: fix event alignment: mce:mce_record ftrace: fix event alignment: skb:kfree_skb ftrace: fix event alignment: jbd2:* ftrace: fix event alignment: ext4:* The third group are patches to remove 8 bytes from struct trace_entry: trace_output.c: adjust conditional expression formatting. small_traces: Add config option to shrink trace events. small_traces: Remove trace output of large fields. small_traces: Remove 8 bytes ...
From: Jiaying Zhang <jiayingz@google.com> Add a "buffer_overwrite" debugfs file for ftrace to control whether the buffer should be overwritten on overflow or not. The default remains to overwrite old events when the buffer is full. This patch adds the option to instead discard newest events when the buffer is full. This is useful to get a snapshot of traces just after enabling traces. Dropping the current event is also a simpler code path. Signed-off-by: David Sharp <dhsharp@google.com> --- Documentation/trace/ftrace.txt | 6 ++++ include/linux/ring_buffer.h | 2 + kernel/trace/ring_buffer.c | 11 +++++++ kernel/trace/trace.c | 59 +++++++++++++++++++++++++++++++++++++--- 4 files changed, 74 insertions(+), 4 deletions(-) diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 557c1ed..9237da3 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -138,6 +138,12 @@ of ftrace. Here is a list of some of the key files: This can only be updated when the current_tracer is set to "nop". + buffer_overwrite: + + This controls what happens when the trace buffer is full. + If "1" (default), the oldest events are discarded and + overwritten. If "0", then the newest events are discarded. + tracing_cpumask: This is a mask that lets the user only trace diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 8d3a248..ab38ac8 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer); int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size); +void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val); + struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer, unsigned long length); int ring_buffer_unlock_commit(struct ring_buffer *buffer, diff --git a/kernel/trace/ring_buffer.c ...
Instead of adding a new file, make this another "trace_option". Thanks, --
To do this, I had to add a special case to set_tracer_flags() to call ring_buffer_change_overwrite. It also means adding something not iterator-related to trace_iterator_flags. Although, maybe it's just the name of the enum and value prefix that should change. --
Add an "overwrite" trace_option for ftrace to control whether the buffer should
be overwritten on overflow or not. The default remains to overwrite old events
when the buffer is full. This patch adds the option to instead discard newest
events when the buffer is full. This is useful to get a snapshot of traces just
after enabling traces. Dropping the current event is also a simpler code path.
---
Documentation/trace/ftrace.txt | 5 +++++
include/linux/ring_buffer.h | 2 ++
kernel/trace/ring_buffer.c | 11 +++++++++++
kernel/trace/trace.c | 17 +++++++++++------
kernel/trace/trace.h | 1 +
5 files changed, 30 insertions(+), 6 deletions(-)
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 557c1ed..2842ea7 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -491,6 +491,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
latencies, as described in "Latency
trace format".
+ overwrite - This controls what happens when the trace buffer is
+ full. If "1" (default), the oldest events are
+ discarded and overwritten. If "0", then the newest
+ events are discarded.
+
sched_switch
------------
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 8d3a248..ab38ac8 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);
int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
+void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
+
struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
unsigned long length);
int ring_buffer_unlock_commit(struct ring_buffer *buffer,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9ed509a..3207147 100644
--- ...This option will remove several fields from trace events to reduce the payload size. Under heavy load, events can be produced faster than they can possibly be written to disk or sent on the network. Shrinking the payload size will enable fitting more traces in the ring buffer, which therefore enables tracing for longer periods, at the cost of losing some relatively unimportant details. Google-Bug-Id: 3224547 Signed-off-by: David Sharp <dhsharp@google.com> --- kernel/trace/Kconfig | 12 ++++++++++++ 1 files changed, 12 insertions(+), 0 deletions(-) diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index ea37e2f..9c24fe5 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -403,6 +403,18 @@ config FUNCTION_PROFILER If in doubt, say N. +config SMALL_TRACES + bool "Shrink trace events" + default n + help + Minimize the size of trace events. Omits fields from trace event + header like preempt_count, and lock_depth. Chooses smaller + alternative event structures. Shrinking the payload size enables + fitting more traces in the ring buffer, which enables tracing for + longer periods when the buffer can not be read as fast as it is + filled, at the cost of losing some details. + + config FTRACE_MCOUNT_RECORD def_bool y depends on DYNAMIC_FTRACE -- 1.7.3.1 --
[ Added Frederic on Cc ] I rather have this as a run time option than a compile time. You could use two different trace event structures. The normal one and a "compact" one. Add to the "trace_options" and when it is set the traces are compact. -- Steve --
I considered that, and I generally thing it's a good idea. However, I also want to use this switch to shrink individual tracepoint event structures. eg: sched switch is a high frequency event and it is 68 bytes (60 after these patches) Can you suggest a syntax for TRACE_EVENT, DECLARE_EVENT_CLASS, etc, that could express the two versions and produce the right code? I'm worried about adding even further complexity to the TRACE_EVENT macros. I could add TRACE_EVENT_SMALL that takes two versions of TP_STRUCT__entry, TP_fast_assign, and TP_printk each, but then this will need to be permuted with your TP_CONDITIONAL patches as well. Thanks, d# --
I would not touch the TRACE_EVENT() structures. They are there as is and
I would not think about changing them. Something like that would never
make it into mainline.
Now what you can do, is to make your own events based off of the same
tracepoints. For example, the TRACE_EVENT(sched_switch...) has in
sched.c:
trace_sched_switch(prev, next);
You could even write a module that does something like this:
register_trace_sched_switch(probe_sched_switch, mydata);
void probe_sched_switch(void *mydata,
struct task_struct *prev,
struct task_struct *next)
{
struct ring_buffer *buffer;
struct ring_buffer_event *event;
struct myentry *entry;
event = trace_current_buffer_lock_reserve(buffer,
mytype, sizeof(*entry),
0, 0);
if (!event)
return;
entry = ring_buffer_event_data(event);
entry->myfield = prev->x;
...
trace_nowake_buffer_unlock_commit(buffer, event,
0, 0);
}
You will need to do a register_ftrace_event() to register that 'mytype'
and how to output it. Otherwise it would just be ignored in the "trace"
file.
All of the above would work fine as a loadable module that you could
easily maintain out of tree, and still uses the internals of the system.
-- Steve
--
But this would improve only google's tracing while this is a general
mainline tracing problem.
The first thing is that we need to get rid of the lock_depth field, the bkl
is dying.
For the rest what about having a bitmap of the fields we want to ignore,
which can be setup from a trace file for ftrace and as an ioctl for perf.
So this bitmap is easy to implement on the common fields.
For the rest, one could choose between using TP_fast_assign()
and TP_cond_assign().
TP_fast_assign() stays as is and doesn't implement bitmap field
ignoring. Those who want conditional record will need
TP_cond_assign().
Well, unfortunately this probably requires us to play
the same trickery than SYSCALL_DEFINE() in that we'll probably
need TP_cond_assign1(), TP_cond_assign2(), TP_cond_assign3(), etc...
#define TP_cond_assignx(nr, assign) \
if (call->bitmask & nr) { \
assign
}
#define TP_cond_assign2(nr, assign, ...) \
TP_cond_assignx(nr, assign) \
TP_cond_assign1(nr + 1, __VA_ARGS__)
#define TP_cond_assign3(nr, assign, ...) \
TP_cond_assignx(nr, assign) \
TP_cond_assign2(nr + 1, __VA_ARGS__)
That will also require a bit more trickery to dynamically
pre-compute the size of the trace entry.
--
Mathieu is working on encapsulating the assignments in their own macros. Instead of doing: __entry->foo = bar; We will have: tp_assign(foo, bar); This way we could probably use this to dynamically figure out what to assign. -- Steve --
Yep, it should also work that way. --
In an incredible exercise of good timing, I'm planning to post them just now :) Comments will be very welcome, Thanks! Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Signed-off-by: David Sharp <dhsharp@google.com> --- kernel/trace/ring_buffer.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 3207147..8ef7cc4 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -5,7 +5,6 @@ */ #include <linux/ring_buffer.h> #include <linux/trace_clock.h> -#include <linux/ftrace_irq.h> #include <linux/spinlock.h> #include <linux/debugfs.h> #include <linux/uaccess.h> -- 1.7.3.1 --
Signed-off-by: David Sharp <dhsharp@google.com> --- include/trace/events/jbd2.h | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h index 7447ea9..2f9f362 100644 --- a/include/trace/events/jbd2.h +++ b/include/trace/events/jbd2.h @@ -41,15 +41,15 @@ DECLARE_EVENT_CLASS(jbd2_commit, TP_STRUCT__entry( __field( int, dev_major ) __field( int, dev_minor ) - __field( char, sync_commit ) __field( int, transaction ) + __field( char, sync_commit ) ), TP_fast_assign( __entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev); __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev); - __entry->sync_commit = commit_transaction->t_synchronous_commit; __entry->transaction = commit_transaction->t_tid; + __entry->sync_commit = commit_transaction->t_synchronous_commit; ), TP_printk("dev %d,%d transaction %d sync %d", @@ -93,17 +93,17 @@ TRACE_EVENT(jbd2_end_commit, TP_STRUCT__entry( __field( int, dev_major ) __field( int, dev_minor ) - __field( char, sync_commit ) __field( int, transaction ) __field( int, head ) + __field( char, sync_commit ) ), TP_fast_assign( __entry->dev_major = MAJOR(journal->j_fs_dev->bd_dev); __entry->dev_minor = MINOR(journal->j_fs_dev->bd_dev); - __entry->sync_commit = commit_transaction->t_synchronous_commit; __entry->transaction = commit_transaction->t_tid; __entry->head = journal->j_tail_sequence; + __entry->sync_commit = commit_transaction->t_synchronous_commit; ), TP_printk("dev %d,%d transaction %d sync %d head %d", -- 1.7.3.1 --
[ Ccing Ted for Acked-by ] -- Steve --
Signed-off-by: David Sharp <dhsharp@google.com> --- arch/x86/kvm/trace.h | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index a6544b8..ab41fb0 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -62,21 +62,21 @@ TRACE_EVENT(kvm_hv_hypercall, TP_ARGS(code, fast, rep_cnt, rep_idx, ingpa, outgpa), TP_STRUCT__entry( - __field( __u16, code ) - __field( bool, fast ) __field( __u16, rep_cnt ) __field( __u16, rep_idx ) __field( __u64, ingpa ) __field( __u64, outgpa ) + __field( __u16, code ) + __field( bool, fast ) ), TP_fast_assign( - __entry->code = code; - __entry->fast = fast; __entry->rep_cnt = rep_cnt; __entry->rep_idx = rep_idx; __entry->ingpa = ingpa; __entry->outgpa = outgpa; + __entry->code = code; + __entry->fast = fast; ), TP_printk("code 0x%x %s cnt 0x%x idx 0x%x in 0x%llx out 0x%llx", -- 1.7.3.1 --
[ Ccing Marcelo and Avi for Acked-by's ] -- Steve --
Looks like a pessimisation. Before: 24 bytes After: 32 bytes (on a 64-bit machine, assuming no packing) -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
I don't understand this. Can you elaborate? What does "32-bit addressable" mean? And "predicated on packing the event structures"? Is the structure __attribute__((packed)), or is it not? -- error compiling committee.c: too many arguments to function --
The ring buffer gives you space that is a multiple of 4 bytes in length, and 32-bit aligned. Therefore it is useless to attempt to align the structure beyond 32-bit boundaries, eg, a 64-bit boundary, because it is unpredictable if the memory the structure will be It is not packed in Linus' tree, but one of the patches before this patch in this patch series adds __attribute__((packed)). This patch assumes that the event packing patch has been applied. This patch should not be applied if the packing patch is not (hence, --
Thanks for the explanations, it makes sense now. -- error compiling committee.c: too many arguments to function --
Ftrace event structures have a 12-byte struct trace_entry at the beginning.
If the structure is aligned, this means that if the first field is 64-bits,
there will be 4 bytes of padding. Ironically, due to the 4-byte ringbuffer
header, this will make 64-bit writes unaligned, if the ring buffer position
is currently 64-bit aligned:
4(rb)+12(ftrace)+4(pad) = 20; 20%8 = 4
Adding __attribute__((packed)) to the event structures removes the extra
space from the trace events, and actually improves alignment of trace
events with a first field that is 64-bits.
About 65 tracepoints have a 4-byte pad at offset 12:
# find events -name format | xargs -n1 awk '
$1=="name:" {name=$2}
$1=="format:"{FS="\t"}
$3=="offset:12;" && $4=="size:4;"{okay=1}
$3=="offset:16;" && !okay {print name}' | wc -l
65
With all 'syscalls' and 'timer' events enabled, this results in a 5%
improvement in a simple 512MB read benchmark with warm caches.
Tested:
setup:
# echo 1 >events/syscalls/enable
# echo 1 >events/timer/enable
# echo 0 > tracing_enabled
off:
# for n in $(seq 10) ; do \
time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
done
on:
# for n in $(seq 10) ; do \
echo 1 >tracing_enabled; \
time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
echo 0 >tracing_enabled; \
echo > trace; \
done
real time mean/median/stdev
w/o patch off: 1.1679/1.164/0.0169
w/o patch on : 1.9432/1.936/0.0274
w/ patch off: 1.1715/1.159/0.0431
w/ patch on : 1.8425/1.836/0.0138
"on" delta: -0.1007 --> -5.2%
Google-Bug-Id: 2895627
Signed-off-by: David Sharp <dhsharp@google.com>
---
include/trace/ftrace.h | 5 +++--
kernel/trace/trace.h | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index a9377c0..51d1f52 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -48,7 +48,8 @@
#define __array(type, item, len) type item[len];
#undef ...Signed-off-by: David Sharp <dhsharp@google.com> --- include/trace/events/ext4.h | 70 +++++++++++++++++++++--------------------- 1 files changed, 35 insertions(+), 35 deletions(-) diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index e5e345f..9887cdb 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -23,21 +23,21 @@ TRACE_EVENT(ext4_free_inode, TP_STRUCT__entry( __field( int, dev_major ) __field( int, dev_minor ) + __field( blkcnt_t, blocks ) __field( ino_t, ino ) - __field( umode_t, mode ) __field( uid_t, uid ) __field( gid_t, gid ) - __field( blkcnt_t, blocks ) + __field( umode_t, mode ) ), TP_fast_assign( __entry->dev_major = MAJOR(inode->i_sb->s_dev); __entry->dev_minor = MINOR(inode->i_sb->s_dev); + __entry->blocks = inode->i_blocks; __entry->ino = inode->i_ino; - __entry->mode = inode->i_mode; __entry->uid = inode->i_uid; __entry->gid = inode->i_gid; - __entry->blocks = inode->i_blocks; + __entry->mode = inode->i_mode; ), TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u blocks %llu", @@ -342,10 +342,10 @@ TRACE_EVENT(ext4_da_writepages, __field( long, pages_skipped ) __field( loff_t, range_start ) __field( loff_t, range_end ) + __field( pgoff_t, writeback_index ) __field( char, for_kupdate ) __field( char, for_reclaim ) __field( char, range_cyclic ) - __field( pgoff_t, writeback_index ) ), TP_fast_assign( @@ -356,10 +356,10 @@ TRACE_EVENT(ext4_da_writepages, __entry->pages_skipped = wbc->pages_skipped; __entry->range_start = wbc->range_start; __entry->range_end = wbc->range_end; + __entry->writeback_index = inode->i_mapping->writeback_index; __entry->for_kupdate = wbc->for_kupdate; __entry->for_reclaim = wbc->for_reclaim; __entry->range_cyclic = wbc->range_cyclic; - __entry->writeback_index = inode->i_mapping->writeback_index; ...
[ Ccing Ted for Acked-by ] -- Steve --
Signed-off-by: David Sharp <dhsharp@google.com> --- include/trace/events/mce.h | 8 ++++---- 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h index 7eee778..4cbbcef 100644 --- a/include/trace/events/mce.h +++ b/include/trace/events/mce.h @@ -17,36 +17,36 @@ TRACE_EVENT(mce_record, TP_STRUCT__entry( __field( u64, mcgcap ) __field( u64, mcgstatus ) - __field( u8, bank ) __field( u64, status ) __field( u64, addr ) __field( u64, misc ) __field( u64, ip ) - __field( u8, cs ) __field( u64, tsc ) __field( u64, walltime ) __field( u32, cpu ) __field( u32, cpuid ) __field( u32, apicid ) __field( u32, socketid ) + __field( u8, cs ) + __field( u8, bank ) __field( u8, cpuvendor ) ), TP_fast_assign( __entry->mcgcap = m->mcgcap; __entry->mcgstatus = m->mcgstatus; - __entry->bank = m->bank; __entry->status = m->status; __entry->addr = m->addr; __entry->misc = m->misc; __entry->ip = m->ip; - __entry->cs = m->cs; __entry->tsc = m->tsc; __entry->walltime = m->time; __entry->cpu = m->extcpu; __entry->cpuid = m->cpuid; __entry->apicid = m->apicid; __entry->socketid = m->socketid; + __entry->cs = m->cs; + __entry->bank = m->bank; __entry->cpuvendor = m->cpuvendor; ), -- 1.7.3.1 --
[ Ccing Hidetoshi and Frederic for Acked-by's ] -- Steve --
Remove flags, preempt_count, and lock_depth from trace_entry. These fields
were used only by the latency tracer output and are relatively unimportant,
while incurring high payload overhead.
Shrink pid to 16 bits. pid_max is usually 32768. The potential for loss of
fidelity here is minimal, and can be guaranteed by setting pid_max.
Google-Bug-Id: 3224547
Signed-off-by: David Sharp <dhsharp@google.com>
---
include/linux/ftrace_event.h | 4 ++++
kernel/trace/trace.c | 7 ++++++-
kernel/trace/trace_events.c | 4 ++++
3 files changed, 14 insertions(+), 1 deletions(-)
diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 8beabb9..3a0bdd1 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -34,10 +34,14 @@ const char *ftrace_print_hex_seq(struct trace_seq *p,
*/
struct trace_entry {
unsigned short type;
+#ifdef CONFIG_SMALL_TRACES
+ short pid;
+#else
unsigned char flags;
unsigned char preempt_count;
int pid;
int lock_depth;
+#endif
};
#define FTRACE_MAX_EVENT \
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 177d4e5..73b6482 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1101,8 +1101,11 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
{
struct task_struct *tsk = current;
- entry->preempt_count = pc & 0xff;
+#ifdef CONFIG_SMALL_TRACES
+ entry->pid = (tsk) ? (typeof(entry->pid))tsk->pid : 0;
+#else
entry->pid = (tsk) ? tsk->pid : 0;
+ entry->preempt_count = pc & 0xff;
entry->lock_depth = (tsk) ? tsk->lock_depth : 0;
entry->flags =
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
@@ -1113,9 +1116,11 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
(need_resched() ? TRACE_FLAG_NEED_RESCHED : 0);
+#endif
}
...Google-Bug-Id: 3224547
Signed-off-by: David Sharp <dhsharp@google.com>
---
kernel/trace/trace.c | 6 ++++++
kernel/trace/trace_output.c | 18 ++++++++++++++++++
2 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ed5c14f..177d4e5 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1747,6 +1747,7 @@ static void s_stop(struct seq_file *m, void *p)
static void print_lat_help_header(struct seq_file *m)
{
+#ifndef CONFIG_SMALL_TRACES
seq_puts(m, "# _------=> CPU# \n");
seq_puts(m, "# / _-----=> irqs-off \n");
seq_puts(m, "# | / _----=> need-resched \n");
@@ -1756,6 +1757,11 @@ static void print_lat_help_header(struct seq_file *m)
seq_puts(m, "# |||||/ delay \n");
seq_puts(m, "# cmd pid |||||| time | caller \n");
seq_puts(m, "# \\ / |||||| \\ | / \n");
+#else
+ seq_puts(m, "# CPU# delay \n");
+ seq_puts(m, "# cmd pid | time | caller \n");
+ seq_puts(m, "# \\ / | \\ | / \n");
+#endif
}
static void print_func_help_header(struct seq_file *m)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 9e33b1f..d5d7575 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -531,6 +531,7 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
* Prints the generic fields of irqs off, in hard or softirq, preempt
* count and lock depth.
*/
+#ifndef CONFIG_SMALL_TRACES
int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
{
int ret;
@@ -566,6 +567,12 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
return trace_seq_printf(s, "%d", entry->lock_depth);
}
+#else
+int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry ...Formatting change only to improve readability. No code changes except to
introduce intermediate variables.
Signed-off-by: David Sharp <dhsharp@google.com>
---
kernel/trace/trace_output.c | 26 +++++++++++++++-----------
1 files changed, 15 insertions(+), 11 deletions(-)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 02272ba..9e33b1f 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -533,20 +533,24 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
*/
int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
{
- int hardirq, softirq;
int ret;
-
- hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
- softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+ int hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
+ int softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+
+ const char irqs_off =
+ (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
+ (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+ '.';
+ const char need_resched =
+ (entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.';
+ const char hardsoft_irq =
+ (hardirq && softirq) ? 'H' :
+ hardirq ? 'h' :
+ softirq ? 's' :
+ '.';
if (!trace_seq_printf(s, "%c%c%c",
- (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
- (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ?
- 'X' : '.',
- (entry->flags & TRACE_FLAG_NEED_RESCHED) ?
- 'N' : '.',
- (hardirq && softirq) ? 'H' :
- hardirq ? 'h' : softirq ? 's' : '.'))
+ irqs_off, need_resched, hardsoft_irq))
return 0;
if (entry->preempt_count)
--
1.7.3.1
--
buffer_page structs need to be aligned to 4 byte boundaries because the page
flags are stored in the two least-significant bits of the pointers in the page
list. Aligning to cache lines is sufficient, but doesn't seem to be necessary.
Reducing the alignement to only 4 bytes may improve cache efficiency.
Testing with Autotest's tracing_microbenchmark, there was no significant change
in overhead with this change.
Signed-off-by: David Sharp <dhsharp@google.com>
---
kernel/trace/ring_buffer.c | 20 ++++++++++++--------
1 files changed, 12 insertions(+), 8 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8ef7cc4..957a8b8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,14 @@ EXPORT_SYMBOL_GPL(ring_buffer_normalize_time_stamp);
#define RB_PAGE_HEAD 1UL
#define RB_PAGE_UPDATE 2UL
-
#define RB_FLAG_MASK 3UL
+#define RB_PAGE_ALIGNMENT (RB_FLAG_MASK+1)
+
+/* Ensure alignment of struct buffer_page */
+static __attribute__((unused)) void check_buffer_page_alignment(void)
+{
+ BUILD_BUG_ON(__alignof__(struct buffer_page) % RB_PAGE_ALIGNMENT != 0);
+}
/* PAGE_MOVED is not part of the mask */
#define RB_PAGE_MOVED 4UL
@@ -1004,8 +1010,8 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
WARN_ON(!nr_pages);
for (i = 0; i < nr_pages; i++) {
- bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
+ bpage = kzalloc_node(sizeof(*bpage), GFP_KERNEL,
+ cpu_to_node(cpu_buffer->cpu));
if (!bpage)
goto free_pages;
@@ -1059,8 +1065,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
lockdep_set_class(&cpu_buffer->reader_lock, buffer->reader_lock_key);
cpu_buffer->lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
- bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu));
+ bpage = kzalloc_node(sizeof(*bpage), ...The reason for the cache line boundaries because the bpages are per cpu, and if they are allocated next to some variable that gets change globally or even another bpage that is used for another cpu buffer, then I would expect we would get some cache line bouncing. Perhaps we should make bpage have its own slab allocation (kmem_cache_alloc). And perhaps even try to keep all bpages that share the same cache line on the same cpu buffer. -- Steve --
Maybe. Although, I was thinking of some patches to keep pages in a free-pool, so that when a cpu buffer needs a new page, it takes from the pool, and when a reader is done with a page, it returns to the free-pool. (This helps with the situation where more events occur on one CPU than others: pages are not locked to a CPU that isn't producing events.) In that case, it would be impossible to predict what cpu a bpage belongs to. But since this is theoretical, maybe it's not relevant. --
Signed-off-by: David Sharp <dhsharp@google.com> --- include/trace/events/module.h | 5 ++--- 1 files changed, 2 insertions(+), 3 deletions(-) diff --git a/include/trace/events/module.h b/include/trace/events/module.h index c7bb2f0..8154f5f 100644 --- a/include/trace/events/module.h +++ b/include/trace/events/module.h @@ -98,14 +98,14 @@ TRACE_EVENT(module_request, TP_ARGS(name, wait, ip), TP_STRUCT__entry( - __field( bool, wait ) __field( unsigned long, ip ) + __field( bool, wait ) __string( name, name ) ), TP_fast_assign( - __entry->wait = wait; __entry->ip = ip; + __entry->wait = wait; __assign_str(name, name); ), @@ -119,4 +119,3 @@ TRACE_EVENT(module_request, /* This part must be outside protection */ #include <trace/define_trace.h> - -- 1.7.3.1 --
[ Ccing Li for an Acked-by. He wrote this event ] -- Steve --
Signed-off-by: David Sharp <dhsharp@google.com>
---
include/trace/events/skb.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 75ce9d5..aa4a56a 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -19,16 +19,16 @@ TRACE_EVENT(kfree_skb,
TP_STRUCT__entry(
__field( void *, skbaddr )
- __field( unsigned short, protocol )
__field( void *, location )
+ __field( unsigned short, protocol )
),
TP_fast_assign(
__entry->skbaddr = skb;
+ __entry->location = location;
if (skb) {
__entry->protocol = ntohs(skb->protocol);
}
- __entry->location = location;
),
TP_printk("skbaddr=%p protocol=%u location=%p",
--
1.7.3.1
--
[ Ccing Neil for Acked-by ] -- Steve --
Signed-off-by: David Sharp <dhsharp@google.com> --- kernel/trace/trace_entries.h | 6 +++--- 1 files changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h index e3dfeca..81ab9c3 100644 --- a/kernel/trace/trace_entries.h +++ b/kernel/trace/trace_entries.h @@ -109,12 +109,12 @@ FTRACE_ENTRY(funcgraph_exit, ftrace_graph_ret_entry, */ #define FTRACE_CTX_FIELDS \ __field( unsigned int, prev_pid ) \ + __field( unsigned int, next_pid ) \ + __field( unsigned int, next_cpu ) \ __field( unsigned char, prev_prio ) \ __field( unsigned char, prev_state ) \ - __field( unsigned int, next_pid ) \ __field( unsigned char, next_prio ) \ - __field( unsigned char, next_state ) \ - __field( unsigned int, next_cpu ) + __field( unsigned char, next_state ) FTRACE_ENTRY(context_switch, ctx_switch_entry, -- 1.7.3.1 --
Maybe the simplest would be to lzo them as they get logged? I assume you already considered that? -Andi --
The tracing subsystem is supposed to be extremely low-overhead. Compressiong the event log would add considerable CPU overhead. If we can shrink the trace events by being more careful about unnecessary padding, and not logging fields that aren't needed, it's all win.... The problem of course with the latter is that different people will have disagreements about what is needed or not. And even adding conditionals will slow down the tracing hotpath. - Ted --
lzo as a rule of thumb is about as expensive as 3x memcpy() That's not really expensive. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Usually the most expensive thing in the trace is the memcpy(). Which is why we try to do it only once. -- Steve --
That is true for the decompression step but not for the compression --
That's a good point. At 10x it's definitely too slow. Too bad, would have been too easy. I guess one could still consider some very cheap data specific delta compressions (I assume there's a lot of redundancy in a tracing stream with similar events occurring in a row or nearby) I used this technique successfully with large log files in the past and overall moving less data around ends up being cheaper. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
