Re: [Patch 00/15] Reduce tracing payload size.

Previous thread: [PATCH RFCv3 0/2] CARMA Board Support by Ira W. Snyder on Friday, December 3, 2010 - 4:57 pm. (3 messages)

Next thread: [PATCH 1/1] Enable jack sense for Thinkpad Edge 13 by Manoj Iyer on Friday, December 3, 2010 - 5:34 pm. (2 messages)
From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

This is a series of patches for tracing aiming to reduce the size of events in
the trace buffer. For some tracing tasks, the machine is so heavily loaded such
that events are produced much faster than they can be written to disk or
network. In this situation, the sizes of the trace events and the ring buffer
are the primary factors limiting the length of time that can be traced.  Even
with ktrace (which has significantly smaller event sizes), a fairly minimal set
of enabled events, and a 16MB buffer per cpu, we are finding we can only get
about 20 seconds of tracing. This is not enough when the problem being
diagnosed only occurs at greater timescales and cannot be easily detected at
runtime.

These patches are just a start to shrinking the size of trace events. I am
planning to also make small versions of trace events that are used when
CONFIG_SMALL_TRACES is enabled. I'm also open to discussion of other ways to
shrink event sizes.


The patches can be divided into three major groups.

The first group is self explanitory patches, not actually related to shrinking
events:

	tracing: Add a 'buffer_overwrite' debugfs file
	ring_buffer.c: Remove unused #include <linux/trace_irq.h>
	ring_buffer: Align buffer_page struct allocations only to fit the flags.

The second group is patches to adjust packing and alignment of tracing event
structures:

	ftrace: pack event structures.
	ftrace: fix event alignment: ftrace:context_switch and ftrace:wakeup
	ftrace: fix event alignment: module:module_request
	ftrace: fix event alignment: kvm:kvm_hv_hypercall
	ftrace: fix event alignment: mce:mce_record
	ftrace: fix event alignment: skb:kfree_skb
	ftrace: fix event alignment: jbd2:*
	ftrace: fix event alignment: ext4:*

The third group are patches to remove 8 bytes from struct trace_entry:

	trace_output.c: adjust conditional expression formatting.
	small_traces: Add config option to shrink trace events.
	small_traces: Remove trace output of large fields.
	small_traces: Remove 8 bytes ...
From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

From: Jiaying Zhang <jiayingz@google.com>

Add a "buffer_overwrite" debugfs file for ftrace to control whether the buffer
should be overwritten on overflow or not. The default remains to overwrite old
events when the buffer is full. This patch adds the option to instead discard
newest events when the buffer is full. This is useful to get a snapshot of
traces just after enabling traces. Dropping the current event is also a simpler
code path.

Signed-off-by: David Sharp <dhsharp@google.com>
---
 Documentation/trace/ftrace.txt |    6 ++++
 include/linux/ring_buffer.h    |    2 +
 kernel/trace/ring_buffer.c     |   11 +++++++
 kernel/trace/trace.c           |   59 +++++++++++++++++++++++++++++++++++++---
 4 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 557c1ed..9237da3 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -138,6 +138,12 @@ of ftrace. Here is a list of some of the key files:
 	This can only be updated when the current_tracer
 	is set to "nop".
 
+  buffer_overwrite:
+
+	This controls what happens when the trace buffer is full.
+	If "1" (default), the oldest events are discarded and 
+	overwritten. If "0", then the newest events are discarded.
+
   tracing_cpumask:
 
 	This is a mask that lets the user only trace
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 8d3a248..ab38ac8 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);
 
 int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
 
+void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
+
 struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
 						   unsigned long length);
 int ring_buffer_unlock_commit(struct ring_buffer *buffer,
diff --git a/kernel/trace/ring_buffer.c ...
From: Steven Rostedt
Date: Friday, December 3, 2010 - 6:57 pm

Instead of adding a new file, make this another "trace_option".

Thanks,



--

From: David Sharp
Date: Wednesday, December 8, 2010 - 1:15 pm

To do this, I had to add a special case to set_tracer_flags() to call
ring_buffer_change_overwrite. It also means adding something not
iterator-related to trace_iterator_flags. Although, maybe it's just
the name of the enum and value prefix that should change.

--

From: David Sharp
Date: Wednesday, December 8, 2010 - 2:46 pm

Add an "overwrite" trace_option for ftrace to control whether the buffer should
be overwritten on overflow or not. The default remains to overwrite old events
when the buffer is full. This patch adds the option to instead discard newest
events when the buffer is full. This is useful to get a snapshot of traces just
after enabling traces. Dropping the current event is also a simpler code path.
---
 Documentation/trace/ftrace.txt |    5 +++++
 include/linux/ring_buffer.h    |    2 ++
 kernel/trace/ring_buffer.c     |   11 +++++++++++
 kernel/trace/trace.c           |   17 +++++++++++------
 kernel/trace/trace.h           |    1 +
 5 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 557c1ed..2842ea7 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -491,6 +491,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
                    latencies, as described in "Latency
                    trace format".
 
+  overwrite - This controls what happens when the trace buffer is
+              full. If "1" (default), the oldest events are
+              discarded and overwritten. If "0", then the newest
+              events are discarded.
+
 sched_switch
 ------------
 
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 8d3a248..ab38ac8 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -100,6 +100,8 @@ void ring_buffer_free(struct ring_buffer *buffer);
 
 int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size);
 
+void ring_buffer_change_overwrite(struct ring_buffer *buffer, int val);
+
 struct ring_buffer_event *ring_buffer_lock_reserve(struct ring_buffer *buffer,
 						   unsigned long length);
 int ring_buffer_unlock_commit(struct ring_buffer *buffer,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9ed509a..3207147 100644
--- ...
From: David Sharp
Date: Monday, December 13, 2010 - 5:39 pm

Steve, any feedback on this patch?

--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

This option will remove several fields from trace events to reduce the
payload size. Under heavy load, events can be produced faster than they
can possibly be written to disk or sent on the network. Shrinking the
payload size will enable fitting more traces in the ring buffer, which
therefore enables tracing for longer periods, at the cost of losing some
relatively unimportant details.

Google-Bug-Id: 3224547

Signed-off-by: David Sharp <dhsharp@google.com>
---
 kernel/trace/Kconfig |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index ea37e2f..9c24fe5 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -403,6 +403,18 @@ config FUNCTION_PROFILER
 
 	  If in doubt, say N.
 
+config SMALL_TRACES
+	bool "Shrink trace events"
+	default n
+	help
+	  Minimize the size of trace events. Omits fields from trace event
+	  header like preempt_count, and lock_depth. Chooses smaller
+	  alternative event structures. Shrinking the payload size enables
+	  fitting more traces in the ring buffer, which enables tracing for
+	  longer periods when the buffer can not be read as fast as it is
+	  filled, at the cost of losing some details.
+
+
 config FTRACE_MCOUNT_RECORD
 	def_bool y
 	depends on DYNAMIC_FTRACE
-- 
1.7.3.1

--

From: Steven Rostedt
Date: Friday, December 3, 2010 - 6:56 pm

[ Added Frederic on Cc ]

I rather have this as a run time option than a compile time.

You could use two different trace event structures. The normal one and a
"compact" one. Add to the "trace_options" and when it is set the traces
are compact.

-- Steve




--

From: David Sharp
Date: Friday, December 3, 2010 - 7:33 pm

I considered that, and I generally thing it's a good idea. However, I
also want to use this switch to shrink individual tracepoint event
structures.

eg: sched switch is a high frequency event and it is 68 bytes (60
after these patches)

Can you suggest a syntax for TRACE_EVENT, DECLARE_EVENT_CLASS, etc,
that could express the two versions and produce the right code?

I'm worried about adding even further complexity to the TRACE_EVENT
macros. I could add TRACE_EVENT_SMALL that takes two versions of
TP_STRUCT__entry, TP_fast_assign, and TP_printk each, but then this
will need to be permuted with your TP_CONDITIONAL patches as well.

Thanks,
d#

--

From: Steven Rostedt
Date: Friday, December 3, 2010 - 7:54 pm

I would not touch the TRACE_EVENT() structures. They are there as is and
I would not think about changing them. Something like that would never
make it into mainline.

Now what you can do, is to make your own events based off of the same
tracepoints. For example, the TRACE_EVENT(sched_switch...) has in
sched.c:

	trace_sched_switch(prev, next);


You could even write a module that does something like this:

	register_trace_sched_switch(probe_sched_switch, mydata);



void probe_sched_switch(void *mydata,
		struct task_struct *prev,
		struct task_struct *next)
{
	struct ring_buffer *buffer;
	struct ring_buffer_event *event;
	struct myentry *entry;

	event = trace_current_buffer_lock_reserve(buffer,
			mytype, sizeof(*entry),
			0, 0);

	if (!event)
		return;

	entry = ring_buffer_event_data(event);

	entry->myfield = prev->x;
	...

	trace_nowake_buffer_unlock_commit(buffer, event,
					0, 0);
}

You will need to do a register_ftrace_event() to register that 'mytype'
and how to output it. Otherwise it would just be ignored in the "trace"
file.

All of the above would work fine as a loadable module that you could
easily maintain out of tree, and still uses the internals of the system.

-- Steve


--

From: Frederic Weisbecker
Date: Thursday, December 9, 2010 - 7:55 am

But this would improve only google's tracing while this is a general
mainline tracing problem.

The first thing is that we need to get rid of the lock_depth field, the bkl
is dying.

For the rest what about having a bitmap of the fields we want to ignore,
which can be setup from a trace file for ftrace and as an ioctl for perf.

So this bitmap is easy to implement on the common fields.

For the rest, one could choose between using TP_fast_assign()
and TP_cond_assign().

TP_fast_assign() stays as is and doesn't implement bitmap field
ignoring. Those who want conditional record will need
TP_cond_assign().
Well, unfortunately this probably requires us to play
the same trickery than SYSCALL_DEFINE() in that we'll probably
need TP_cond_assign1(), TP_cond_assign2(), TP_cond_assign3(), etc...

#define TP_cond_assignx(nr, assign)	\
	if (call->bitmask & nr)	{	\
		assign
	}

#define TP_cond_assign2(nr, assign, ...)	\
	TP_cond_assignx(nr, assign)		\
	TP_cond_assign1(nr + 1, __VA_ARGS__)

#define TP_cond_assign3(nr, assign, ...)        \
        TP_cond_assignx(nr, assign)             \
        TP_cond_assign2(nr + 1, __VA_ARGS__)

That will also require a bit more trickery to dynamically
pre-compute the size of the trace entry.
--

From: Steven Rostedt
Date: Thursday, December 9, 2010 - 8:08 am

Mathieu is working on encapsulating the assignments in their own macros.

Instead of doing:

	__entry->foo = bar;

We will have:

	tp_assign(foo, bar);

This way we could probably use this to dynamically figure out what to
assign.

-- Steve


--

From: Frederic Weisbecker
Date: Thursday, December 9, 2010 - 8:28 am

Yep, it should also work that way.
--

From: Mathieu Desnoyers
Date: Thursday, December 9, 2010 - 9:16 am

In an incredible exercise of good timing, I'm planning to post them just now :)

Comments will be very welcome,

Thanks!

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Signed-off-by: David Sharp <dhsharp@google.com>
---
 kernel/trace/ring_buffer.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3207147..8ef7cc4 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5,7 +5,6 @@
  */
 #include <linux/ring_buffer.h>
 #include <linux/trace_clock.h>
-#include <linux/ftrace_irq.h>
 #include <linux/spinlock.h>
 #include <linux/debugfs.h>
 #include <linux/uaccess.h>
-- 
1.7.3.1

--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Signed-off-by: David Sharp <dhsharp@google.com>
---
 include/trace/events/jbd2.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
index 7447ea9..2f9f362 100644
--- a/include/trace/events/jbd2.h
+++ b/include/trace/events/jbd2.h
@@ -41,15 +41,15 @@ DECLARE_EVENT_CLASS(jbd2_commit,
 	TP_STRUCT__entry(
 		__field(	int,   dev_major                )
 		__field(	int,   dev_minor                )
-		__field(	char,	sync_commit		  )
 		__field(	int,	transaction		  )
+		__field(	char,	sync_commit		  )
 	),
 
 	TP_fast_assign(
 		__entry->dev_major	= MAJOR(journal->j_fs_dev->bd_dev);
 		__entry->dev_minor	= MINOR(journal->j_fs_dev->bd_dev);
-		__entry->sync_commit = commit_transaction->t_synchronous_commit;
 		__entry->transaction	= commit_transaction->t_tid;
+		__entry->sync_commit = commit_transaction->t_synchronous_commit;
 	),
 
 	TP_printk("dev %d,%d transaction %d sync %d",
@@ -93,17 +93,17 @@ TRACE_EVENT(jbd2_end_commit,
 	TP_STRUCT__entry(
 		__field(	int,   dev_major                )
 		__field(	int,   dev_minor                )
-		__field(	char,	sync_commit		  )
 		__field(	int,	transaction		  )
 		__field(	int,	head		  	  )
+		__field(	char,	sync_commit		  )
 	),
 
 	TP_fast_assign(
 		__entry->dev_major	= MAJOR(journal->j_fs_dev->bd_dev);
 		__entry->dev_minor	= MINOR(journal->j_fs_dev->bd_dev);
-		__entry->sync_commit = commit_transaction->t_synchronous_commit;
 		__entry->transaction	= commit_transaction->t_tid;
 		__entry->head		= journal->j_tail_sequence;
+		__entry->sync_commit = commit_transaction->t_synchronous_commit;
 	),
 
 	TP_printk("dev %d,%d transaction %d sync %d head %d",
-- 
1.7.3.1

--

From: Steven Rostedt
Date: Friday, December 3, 2010 - 6:52 pm

[ Ccing Ted for Acked-by ]

-- Steve




--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Signed-off-by: David Sharp <dhsharp@google.com>
---
 arch/x86/kvm/trace.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index a6544b8..ab41fb0 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -62,21 +62,21 @@ TRACE_EVENT(kvm_hv_hypercall,
 	TP_ARGS(code, fast, rep_cnt, rep_idx, ingpa, outgpa),
 
 	TP_STRUCT__entry(
-		__field(	__u16, 		code		)
-		__field(	bool,		fast		)
 		__field(	__u16,		rep_cnt		)
 		__field(	__u16,		rep_idx		)
 		__field(	__u64,		ingpa		)
 		__field(	__u64,		outgpa		)
+		__field(	__u16, 		code		)
+		__field(	bool,		fast		)
 	),
 
 	TP_fast_assign(
-		__entry->code		= code;
-		__entry->fast		= fast;
 		__entry->rep_cnt	= rep_cnt;
 		__entry->rep_idx	= rep_idx;
 		__entry->ingpa		= ingpa;
 		__entry->outgpa		= outgpa;
+		__entry->code		= code;
+		__entry->fast		= fast;
 	),
 
 	TP_printk("code 0x%x %s cnt 0x%x idx 0x%x in 0x%llx out 0x%llx",
-- 
1.7.3.1

--

From: Steven Rostedt
Date: Friday, December 3, 2010 - 6:49 pm

[ Ccing Marcelo and Avi for Acked-by's ]

-- Steve




--

From: Avi Kivity
Date: Saturday, December 4, 2010 - 1:11 am

Looks like a pessimisation.

Before: 24 bytes
After: 32 bytes

(on a 64-bit machine, assuming no packing)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: David Sharp
Date: Monday, December 6, 2010 - 1:38 pm

[Empty message]
From: Avi Kivity
Date: Tuesday, December 7, 2010 - 2:22 am

I don't understand this.  Can you elaborate?  What does "32-bit 
addressable" mean?  And "predicated on packing the event structures"?  
Is the structure __attribute__((packed)), or is it not?

-- 
error compiling committee.c: too many arguments to function

--

From: David Sharp
Date: Tuesday, December 7, 2010 - 2:16 pm

The ring buffer gives you space that is a multiple of 4 bytes in
length, and 32-bit aligned. Therefore it is useless to attempt to
align the structure beyond 32-bit boundaries, eg, a 64-bit boundary,
because it is unpredictable if the memory the structure will be

It is not packed in Linus' tree, but one of the patches before this
patch in this patch series adds __attribute__((packed)). This patch
assumes that the event packing patch has been applied. This patch
should not be applied if the packing patch is not (hence,
--

From: Avi Kivity
Date: Wednesday, December 8, 2010 - 2:18 am

Thanks for the explanations, it makes sense now.

-- 
error compiling committee.c: too many arguments to function

--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Ftrace event structures have a 12-byte struct trace_entry at the beginning.
If the structure is aligned, this means that if the first field is 64-bits,
there will be 4 bytes of padding. Ironically, due to the 4-byte ringbuffer
header, this will make 64-bit writes unaligned, if the ring buffer position
is currently 64-bit aligned:
4(rb)+12(ftrace)+4(pad) = 20;  20%8 = 4

Adding __attribute__((packed)) to the event structures removes the extra
space from the trace events, and actually improves alignment of trace
events with a first field that is 64-bits.

About 65 tracepoints have a 4-byte pad at offset 12:
 # find events -name format | xargs -n1 awk '
   $1=="name:" {name=$2}
   $1=="format:"{FS="\t"}
   $3=="offset:12;" && $4=="size:4;"{okay=1}
   $3=="offset:16;" && !okay {print name}' | wc -l
65

With all 'syscalls' and 'timer' events enabled, this results in a 5%
improvement in a simple 512MB read benchmark with warm caches.

Tested:

setup:
 # echo 1 >events/syscalls/enable
 # echo 1 >events/timer/enable
 # echo 0 > tracing_enabled
off:
 # for n in $(seq 10) ; do \
   time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
   done
on:
 # for n in $(seq 10) ; do \
   echo 1 >tracing_enabled; \
   time dd if=/dev/hda3 of=/dev/null bs=1K count=512K ; \
   echo 0 >tracing_enabled; \
   echo > trace; \
   done

real time      mean/median/stdev
w/o patch off: 1.1679/1.164/0.0169
w/o patch on : 1.9432/1.936/0.0274
w/  patch off: 1.1715/1.159/0.0431
w/  patch on : 1.8425/1.836/0.0138
"on" delta: -0.1007 --> -5.2%

Google-Bug-Id: 2895627

Signed-off-by: David Sharp <dhsharp@google.com>
---
 include/trace/ftrace.h |    5 +++--
 kernel/trace/trace.h   |    2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index a9377c0..51d1f52 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -48,7 +48,8 @@
 #define __array(type, item, len)	type	item[len];
 
 #undef ...
From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Signed-off-by: David Sharp <dhsharp@google.com>
---
 include/trace/events/ext4.h |   70 +++++++++++++++++++++---------------------
 1 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index e5e345f..9887cdb 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -23,21 +23,21 @@ TRACE_EVENT(ext4_free_inode,
 	TP_STRUCT__entry(
 		__field(	int,   dev_major                )
 		__field(	int,   dev_minor                )
+		__field(	blkcnt_t, blocks		)
 		__field(	ino_t,	ino			)
-		__field(	umode_t, mode			)
 		__field(	uid_t,	uid			)
 		__field(	gid_t,	gid			)
-		__field(	blkcnt_t, blocks		)
+		__field(	umode_t, mode			)
 	),
 
 	TP_fast_assign(
 		__entry->dev_major = MAJOR(inode->i_sb->s_dev);
 		__entry->dev_minor = MINOR(inode->i_sb->s_dev);
+		__entry->blocks	= inode->i_blocks;
 		__entry->ino	= inode->i_ino;
-		__entry->mode	= inode->i_mode;
 		__entry->uid	= inode->i_uid;
 		__entry->gid	= inode->i_gid;
-		__entry->blocks	= inode->i_blocks;
+		__entry->mode	= inode->i_mode;
 	),
 
 	TP_printk("dev %d,%d ino %lu mode 0%o uid %u gid %u blocks %llu",
@@ -342,10 +342,10 @@ TRACE_EVENT(ext4_da_writepages,
 		__field(	long,	pages_skipped		)
 		__field(	loff_t,	range_start		)
 		__field(	loff_t,	range_end		)
+		__field(       pgoff_t,	writeback_index		)
 		__field(	char,	for_kupdate		)
 		__field(	char,	for_reclaim		)
 		__field(	char,	range_cyclic		)
-		__field(       pgoff_t,	writeback_index		)
 	),
 
 	TP_fast_assign(
@@ -356,10 +356,10 @@ TRACE_EVENT(ext4_da_writepages,
 		__entry->pages_skipped	= wbc->pages_skipped;
 		__entry->range_start	= wbc->range_start;
 		__entry->range_end	= wbc->range_end;
+		__entry->writeback_index = inode->i_mapping->writeback_index;
 		__entry->for_kupdate	= wbc->for_kupdate;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->writeback_index = inode->i_mapping->writeback_index;
 ...
From: Steven Rostedt
Date: Friday, December 3, 2010 - 6:53 pm

[ Ccing Ted for Acked-by ]

-- Steve




--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Signed-off-by: David Sharp <dhsharp@google.com>
---
 include/trace/events/mce.h |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 7eee778..4cbbcef 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -17,36 +17,36 @@ TRACE_EVENT(mce_record,
 	TP_STRUCT__entry(
 		__field(	u64,		mcgcap		)
 		__field(	u64,		mcgstatus	)
-		__field(	u8,		bank		)
 		__field(	u64,		status		)
 		__field(	u64,		addr		)
 		__field(	u64,		misc		)
 		__field(	u64,		ip		)
-		__field(	u8,		cs		)
 		__field(	u64,		tsc		)
 		__field(	u64,		walltime	)
 		__field(	u32,		cpu		)
 		__field(	u32,		cpuid		)
 		__field(	u32,		apicid		)
 		__field(	u32,		socketid	)
+		__field(	u8,		cs		)
+		__field(	u8,		bank		)
 		__field(	u8,		cpuvendor	)
 	),
 
 	TP_fast_assign(
 		__entry->mcgcap		= m->mcgcap;
 		__entry->mcgstatus	= m->mcgstatus;
-		__entry->bank		= m->bank;
 		__entry->status		= m->status;
 		__entry->addr		= m->addr;
 		__entry->misc		= m->misc;
 		__entry->ip		= m->ip;
-		__entry->cs		= m->cs;
 		__entry->tsc		= m->tsc;
 		__entry->walltime	= m->time;
 		__entry->cpu		= m->extcpu;
 		__entry->cpuid		= m->cpuid;
 		__entry->apicid		= m->apicid;
 		__entry->socketid	= m->socketid;
+		__entry->cs		= m->cs;
+		__entry->bank		= m->bank;
 		__entry->cpuvendor	= m->cpuvendor;
 	),
 
-- 
1.7.3.1

--

From: Steven Rostedt
Date: Friday, December 3, 2010 - 6:50 pm

[ Ccing Hidetoshi and Frederic for Acked-by's ]

-- Steve




--

From: Frederic Weisbecker
Date: Thursday, December 9, 2010 - 6:33 am

No problem for me.

--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Remove flags, preempt_count, and lock_depth from trace_entry. These fields
were used only by the latency tracer output and are relatively unimportant,
while incurring high payload overhead.

Shrink pid to 16 bits. pid_max is usually 32768. The potential for loss of
fidelity here is minimal, and can be guaranteed by setting pid_max.

Google-Bug-Id: 3224547

Signed-off-by: David Sharp <dhsharp@google.com>
---
 include/linux/ftrace_event.h |    4 ++++
 kernel/trace/trace.c         |    7 ++++++-
 kernel/trace/trace_events.c  |    4 ++++
 3 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 8beabb9..3a0bdd1 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -34,10 +34,14 @@ const char *ftrace_print_hex_seq(struct trace_seq *p,
  */
 struct trace_entry {
 	unsigned short		type;
+#ifdef CONFIG_SMALL_TRACES
+	short		pid;
+#else
 	unsigned char		flags;
 	unsigned char		preempt_count;
 	int			pid;
 	int			lock_depth;
+#endif
 };
 
 #define FTRACE_MAX_EVENT						\
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 177d4e5..73b6482 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1101,8 +1101,11 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
 {
 	struct task_struct *tsk = current;
 
-	entry->preempt_count		= pc & 0xff;
+#ifdef CONFIG_SMALL_TRACES
+        entry->pid			= (tsk) ? (typeof(entry->pid))tsk->pid : 0;
+#else
 	entry->pid			= (tsk) ? tsk->pid : 0;
+	entry->preempt_count		= pc & 0xff;
 	entry->lock_depth		= (tsk) ? tsk->lock_depth : 0;
 	entry->flags =
 #ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
@@ -1113,9 +1116,11 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
 		((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
 		((pc & SOFTIRQ_MASK) ? TRACE_FLAG_SOFTIRQ : 0) |
 		(need_resched() ? TRACE_FLAG_NEED_RESCHED : 0);
+#endif
 }
 ...
From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Google-Bug-Id: 3224547

Signed-off-by: David Sharp <dhsharp@google.com>
---
 kernel/trace/trace.c        |    6 ++++++
 kernel/trace/trace_output.c |   18 ++++++++++++++++++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ed5c14f..177d4e5 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1747,6 +1747,7 @@ static void s_stop(struct seq_file *m, void *p)
 
 static void print_lat_help_header(struct seq_file *m)
 {
+#ifndef CONFIG_SMALL_TRACES
 	seq_puts(m, "#                  _------=> CPU#            \n");
 	seq_puts(m, "#                 / _-----=> irqs-off        \n");
 	seq_puts(m, "#                | / _----=> need-resched    \n");
@@ -1756,6 +1757,11 @@ static void print_lat_help_header(struct seq_file *m)
 	seq_puts(m, "#                |||||/     delay             \n");
 	seq_puts(m, "#  cmd     pid   |||||| time  |   caller      \n");
 	seq_puts(m, "#     \\   /      ||||||   \\   |   /           \n");
+#else
+	seq_puts(m, "#               CPU#   delay             \n");
+	seq_puts(m, "#  cmd     pid   | time  |   caller      \n");
+	seq_puts(m, "#     \\   /      |   \\   |   /           \n");
+#endif
 }
 
 static void print_func_help_header(struct seq_file *m)
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 9e33b1f..d5d7575 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -531,6 +531,7 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
  * Prints the generic fields of irqs off, in hard or softirq, preempt
  * count and lock depth.
  */
+#ifndef CONFIG_SMALL_TRACES
 int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
 {
 	int ret;
@@ -566,6 +567,12 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
 
 	return trace_seq_printf(s, "%d", entry->lock_depth);
 }
+#else
+int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry ...
From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Formatting change only to improve readability. No code changes except to
introduce intermediate variables.

Signed-off-by: David Sharp <dhsharp@google.com>
---
 kernel/trace/trace_output.c |   26 +++++++++++++++-----------
 1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index 02272ba..9e33b1f 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -533,20 +533,24 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
  */
 int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
 {
-	int hardirq, softirq;
 	int ret;
-
-	hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
-	softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+	int hardirq = entry->flags & TRACE_FLAG_HARDIRQ;
+	int softirq = entry->flags & TRACE_FLAG_SOFTIRQ;
+
+	const char irqs_off = 
+		(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
+		(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+		'.';
+	const char need_resched =
+		(entry->flags & TRACE_FLAG_NEED_RESCHED) ? 'N' : '.';
+	const char hardsoft_irq =
+		(hardirq && softirq) ? 'H' :
+		hardirq ? 'h' :
+		softirq ? 's' :
+		'.';
 
 	if (!trace_seq_printf(s, "%c%c%c",
-			      (entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
-				(entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ?
-				  'X' : '.',
-			      (entry->flags & TRACE_FLAG_NEED_RESCHED) ?
-				'N' : '.',
-			      (hardirq && softirq) ? 'H' :
-				hardirq ? 'h' : softirq ? 's' : '.'))
+			      irqs_off, need_resched, hardsoft_irq))
 		return 0;
 
 	if (entry->preempt_count)
-- 
1.7.3.1

--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

buffer_page structs need to be aligned to 4 byte boundaries because the page
flags are stored in the two least-significant bits of the pointers in the page
list. Aligning to cache lines is sufficient, but doesn't seem to be necessary.
Reducing the alignement to only 4 bytes may improve cache efficiency.

Testing with Autotest's tracing_microbenchmark, there was no significant change
in overhead with this change.

Signed-off-by: David Sharp <dhsharp@google.com>
---
 kernel/trace/ring_buffer.c |   20 ++++++++++++--------
 1 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8ef7cc4..957a8b8 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -644,8 +644,14 @@ EXPORT_SYMBOL_GPL(ring_buffer_normalize_time_stamp);
 #define RB_PAGE_HEAD		1UL
 #define RB_PAGE_UPDATE		2UL
 
-
 #define RB_FLAG_MASK		3UL
+#define RB_PAGE_ALIGNMENT	(RB_FLAG_MASK+1)
+
+/* Ensure alignment of struct buffer_page */
+static __attribute__((unused)) void check_buffer_page_alignment(void)
+{
+        BUILD_BUG_ON(__alignof__(struct buffer_page) % RB_PAGE_ALIGNMENT != 0);
+}
 
 /* PAGE_MOVED is not part of the mask */
 #define RB_PAGE_MOVED		4UL
@@ -1004,8 +1010,8 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 	WARN_ON(!nr_pages);
 
 	for (i = 0; i < nr_pages; i++) {
-		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
+		bpage = kzalloc_node(sizeof(*bpage), GFP_KERNEL,
+				     cpu_to_node(cpu_buffer->cpu));
 		if (!bpage)
 			goto free_pages;
 
@@ -1059,8 +1065,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
 	lockdep_set_class(&cpu_buffer->reader_lock, buffer->reader_lock_key);
 	cpu_buffer->lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
 
-	bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
-			    GFP_KERNEL, cpu_to_node(cpu));
+	bpage = kzalloc_node(sizeof(*bpage), ...
From: Steven Rostedt
Date: Friday, December 3, 2010 - 6:43 pm

The reason for the cache line boundaries because the bpages are per cpu,
and if they are allocated next to some variable that gets change
globally or even another bpage that is used for another cpu buffer, then
I would expect we would get some cache line bouncing.

Perhaps we should make bpage have its own slab allocation
(kmem_cache_alloc). And perhaps even try to keep all bpages that share
the same cache line on the same cpu buffer.

-- Steve


--

From: David Sharp
Date: Tuesday, December 7, 2010 - 3:44 pm

Maybe. Although, I was thinking of some patches to keep pages in a
free-pool, so that when a cpu buffer needs a new page, it takes from
the pool, and when a reader is done with a page, it returns to the
free-pool. (This helps with the situation where more events occur on
one CPU than others: pages are not locked to a CPU that isn't
producing events.) In that case, it would be impossible to predict
what cpu a bpage belongs to. But since this is theoretical, maybe it's
not relevant.
--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Signed-off-by: David Sharp <dhsharp@google.com>
---
 include/trace/events/module.h |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/module.h b/include/trace/events/module.h
index c7bb2f0..8154f5f 100644
--- a/include/trace/events/module.h
+++ b/include/trace/events/module.h
@@ -98,14 +98,14 @@ TRACE_EVENT(module_request,
 	TP_ARGS(name, wait, ip),
 
 	TP_STRUCT__entry(
-		__field(	bool,		wait		)
 		__field(	unsigned long,	ip		)
+		__field(	bool,		wait		)
 		__string(	name,		name		)
 	),
 
 	TP_fast_assign(
-		__entry->wait	= wait;
 		__entry->ip	= ip;
+		__entry->wait	= wait;
 		__assign_str(name, name);
 	),
 
@@ -119,4 +119,3 @@ TRACE_EVENT(module_request,
 
 /* This part must be outside protection */
 #include <trace/define_trace.h>
-
-- 
1.7.3.1

--

From: Steven Rostedt
Date: Friday, December 3, 2010 - 6:47 pm

[ Ccing Li for an Acked-by. He wrote this event ]

-- Steve




--

From: Li Zefan
Date: Sunday, December 5, 2010 - 6:28 pm

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Signed-off-by: David Sharp <dhsharp@google.com>
---
 include/trace/events/skb.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 75ce9d5..aa4a56a 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -19,16 +19,16 @@ TRACE_EVENT(kfree_skb,
 
 	TP_STRUCT__entry(
 		__field(	void *,		skbaddr		)
-		__field(	unsigned short,	protocol	)
 		__field(	void *,		location	)
+		__field(	unsigned short,	protocol	)
 	),
 
 	TP_fast_assign(
 		__entry->skbaddr = skb;
+		__entry->location = location;
 		if (skb) {
 			__entry->protocol = ntohs(skb->protocol);
 		}
-		__entry->location = location;
 	),
 
 	TP_printk("skbaddr=%p protocol=%u location=%p",
-- 
1.7.3.1

--

From: Steven Rostedt
Date: Friday, December 3, 2010 - 6:52 pm

[ Ccing Neil for Acked-by ]

-- Steve




--

From: Neil Horman
Date: Saturday, December 4, 2010 - 6:38 am

Looks good, thanks!
--

From: David Sharp
Date: Friday, December 3, 2010 - 5:13 pm

Signed-off-by: David Sharp <dhsharp@google.com>
---
 kernel/trace/trace_entries.h |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_entries.h b/kernel/trace/trace_entries.h
index e3dfeca..81ab9c3 100644
--- a/kernel/trace/trace_entries.h
+++ b/kernel/trace/trace_entries.h
@@ -109,12 +109,12 @@ FTRACE_ENTRY(funcgraph_exit, ftrace_graph_ret_entry,
  */
 #define FTRACE_CTX_FIELDS					\
 	__field(	unsigned int,	prev_pid	)	\
+	__field(	unsigned int,	next_pid	)	\
+	__field(	unsigned int,	next_cpu	)       \
 	__field(	unsigned char,	prev_prio	)	\
 	__field(	unsigned char,	prev_state	)	\
-	__field(	unsigned int,	next_pid	)	\
 	__field(	unsigned char,	next_prio	)	\
-	__field(	unsigned char,	next_state	)	\
-	__field(	unsigned int,	next_cpu	)
+	__field(	unsigned char,	next_state	)
 
 FTRACE_ENTRY(context_switch, ctx_switch_entry,
 
-- 
1.7.3.1

--

From: Andi Kleen
Date: Monday, December 6, 2010 - 6:22 am

Maybe the simplest would be to lzo them as they get logged?
I assume you already considered that?

-Andi

--

From: Ted Ts'o
Date: Monday, December 6, 2010 - 6:56 am

The tracing subsystem is supposed to be extremely low-overhead.
Compressiong the event log would add considerable CPU overhead.  If we
can shrink the trace events by being more careful about unnecessary
padding, and not logging fields that aren't needed, it's all win....

The problem of course with the latter is that different people will
have disagreements about what is needed or not.  And even adding
conditionals will slow down the tracing hotpath.

						- Ted
--

From: Andi Kleen
Date: Monday, December 6, 2010 - 7:58 am

lzo as a rule of thumb is about as expensive as 3x memcpy()
That's not really expensive.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Steven Rostedt
Date: Monday, December 6, 2010 - 9:17 am

Usually the most expensive thing in the trace is the memcpy(). Which is
why we try to do it only once.

-- Steve


--

From: Miguel Ojeda
Date: Monday, December 6, 2010 - 9:31 am

That is true for the decompression step but not for the compression
--

From: Andi Kleen
Date: Monday, December 6, 2010 - 9:41 am

That's a good point. At 10x it's definitely too slow. Too bad, would have been too easy.

I guess one could still consider some very cheap data specific delta compressions 
(I assume there's a lot of redundancy in a tracing stream with similar events
occurring in a row or nearby) I used this technique successfully with large log files 
in the past and overall moving less data around ends up being cheaper.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

Previous thread: [PATCH RFCv3 0/2] CARMA Board Support by Ira W. Snyder on Friday, December 3, 2010 - 4:57 pm. (3 messages)

Next thread: [PATCH 1/1] Enable jack sense for Thinkpad Edge 13 by Manoj Iyer on Friday, December 3, 2010 - 5:34 pm. (2 messages)