Re: [PATCH 0/7] [GIT PULL] tracing: updates

Previous thread: Re: [00/45] 2.6.27.46-stable review by Jayson R. King on Wednesday, March 31, 2010 - 6:52 pm. (3 messages)

Next thread: Re: Config NO_BOOTMEM breaks my amd64 box by H. Peter Anvin on Wednesday, March 31, 2010 - 8:16 pm. (2 messages)
From: Steven Rostedt
Date: Wednesday, March 31, 2010 - 8:17 pm

Ingo,

Please pull the latest tip/tracing/core tree, which can be found at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
tip/tracing/core


Li Zefan (3):
      tracing: Convert some signal events to DEFINE_TRACE
      tracing: Update comments
      tracing: Remove side effect from module tracepoints that caused a GPF

Steven Rostedt (4):
      tracing: Fix compile error in module tracepoints when MODULE_UNLOAD not set
      ring-buffer: Add place holder recording of dropped events
      tracing: Show the lost events in the trace_pipe output
      ring-buffer: Add lost event count to end of sub buffer

----
 drivers/oprofile/cpu_buffer.c        |    4 +-
 include/linux/ftrace_event.h         |    1 +
 include/linux/module.h               |    6 +-
 include/linux/ring_buffer.h          |    6 +-
 include/trace/events/module.h        |   18 ++++--
 include/trace/events/signal.h        |   52 +++++++----------
 include/trace/ftrace.h               |   33 ++++++-----
 kernel/module.c                      |    8 +-
 kernel/trace/ring_buffer.c           |  101 ++++++++++++++++++++++++++++++++--
 kernel/trace/ring_buffer_benchmark.c |    2 +-
 kernel/trace/trace.c                 |   30 +++++++---
 kernel/trace/trace_functions_graph.c |    5 +-
 kernel/trace/trace_selftest.c        |    2 +-
 13 files changed, 187 insertions(+), 81 deletions(-)
--

From: Steven Rostedt
Date: Wednesday, March 31, 2010 - 8:17 pm

From: Li Zefan <lizf@cn.fujitsu.com>

Use DECLARE_EVENT_CLASS to remove duplicate code:

text    data     bss     dec     hex filename
  23639    6084       8   29731    7423 kernel/signal.o.orig
  22727    6084       8   28819    7093 kernel/signal.o

2 events are converted:

  signal_queue_overflow: signal_overflow_fail, signal_lose_info

No functional change.

Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BA97FBD.8070703@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/trace/events/signal.h |   52 +++++++++++++++++-----------------------
 1 files changed, 22 insertions(+), 30 deletions(-)

diff --git a/include/trace/events/signal.h b/include/trace/events/signal.h
index a510b75..814566c 100644
--- a/include/trace/events/signal.h
+++ b/include/trace/events/signal.h
@@ -100,18 +100,7 @@ TRACE_EVENT(signal_deliver,
 		  __entry->sa_handler, __entry->sa_flags)
 );
 
-/**
- * signal_overflow_fail - called when signal queue is overflow
- * @sig: signal number
- * @group: signal to process group or not (bool)
- * @info: pointer to struct siginfo
- *
- * Kernel fails to generate 'sig' signal with 'info' siginfo, because
- * siginfo queue is overflow, and the signal is dropped.
- * 'group' is not 0 if the signal will be sent to a process group.
- * 'sig' is always one of RT signals.
- */
-TRACE_EVENT(signal_overflow_fail,
+DECLARE_EVENT_CLASS(signal_queue_overflow,
 
 	TP_PROTO(int sig, int group, struct siginfo *info),
 
@@ -135,6 +124,24 @@ TRACE_EVENT(signal_overflow_fail,
 );
 
 /**
+ * signal_overflow_fail - called when signal queue is overflow
+ * @sig: signal number
+ * @group: signal to process group or not (bool)
+ * @info: pointer to struct siginfo
+ *
+ * Kernel fails to generate 'sig' signal with 'info' siginfo, because
+ * siginfo queue is overflow, and the signal is dropped.
+ * 'group' is not 0 if the signal will be sent to a process group.
+ * ...
From: Steven Rostedt
Date: Wednesday, March 31, 2010 - 8:17 pm

From: Li Zefan <lizf@cn.fujitsu.com>

Make some comments consistent with the code.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BA97FD0.7090202@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/trace/ftrace.h |   33 +++++++++++++++++++--------------
 1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index ea6f9d4..75dd778 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -154,9 +154,11 @@
  *
  *	field = (typeof(field))entry;
  *
- *	p = get_cpu_var(ftrace_event_seq);
+ *	p = &get_cpu_var(ftrace_event_seq);
  *	trace_seq_init(p);
- *	ret = trace_seq_printf(s, <TP_printk> "\n");
+ *	ret = trace_seq_printf(s, "%s: ", <call>);
+ *	if (ret)
+ *		ret = trace_seq_printf(s, <TP_printk> "\n");
  *	put_cpu();
  *	if (!ret)
  *		return TRACE_TYPE_PARTIAL_LINE;
@@ -450,38 +452,38 @@ perf_trace_disable_##name(struct ftrace_event_call *unused)		\
  *
  * static void ftrace_raw_event_<call>(proto)
  * {
+ *	struct ftrace_data_offsets_<call> __maybe_unused __data_offsets;
  *	struct ring_buffer_event *event;
  *	struct ftrace_raw_<call> *entry; <-- defined in stage 1
  *	struct ring_buffer *buffer;
  *	unsigned long irq_flags;
+ *	int __data_size;
  *	int pc;
  *
  *	local_save_flags(irq_flags);
  *	pc = preempt_count();
  *
+ *	__data_size = ftrace_get_offsets_<call>(&__data_offsets, args);
+ *
  *	event = trace_current_buffer_lock_reserve(&buffer,
  *				  event_<call>.id,
- *				  sizeof(struct ftrace_raw_<call>),
+ *				  sizeof(*entry) + __data_size,
  *				  irq_flags, pc);
  *	if (!event)
  *		return;
  *	entry	= ring_buffer_event_data(event);
  *
- *	<assign>;  <-- Here we assign the entries by the __field and
- *			__array macros.
+ *	{ <assign>; }  <-- Here we assign the entries by the __field and
+ *			   __array macros.
  *
- *	trace_current_buffer_unlock_commit(buffer, event, irq_flags, pc);
+ *	if ...
From: Steven Rostedt
Date: Wednesday, March 31, 2010 - 8:17 pm

From: Steven Rostedt <srostedt@redhat.com>

If modules are configured in the build but unloading of modules is not,
then the refcnt is not defined. Place the get/put module tracepoints
under CONFIG_MODULE_UNLOAD since it references this field in the module
structure.

As a side-effect, this patch also reduces the code when MODULE_UNLOAD
is not set, because these unused tracepoints are not created.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/trace/events/module.h |    4 ++++
 kernel/module.c               |    5 +++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/module.h b/include/trace/events/module.h
index a585f81..f07b44a 100644
--- a/include/trace/events/module.h
+++ b/include/trace/events/module.h
@@ -51,6 +51,9 @@ TRACE_EVENT(module_free,
 	TP_printk("%s", __get_str(name))
 );
 
+#ifdef CONFIG_MODULE_UNLOAD
+/* trace_module_get/put are only used if CONFIG_MODULE_UNLOAD is defined */
+
 DECLARE_EVENT_CLASS(module_refcnt,
 
 	TP_PROTO(struct module *mod, unsigned long ip),
@@ -86,6 +89,7 @@ DEFINE_EVENT(module_refcnt, module_put,
 
 	TP_ARGS(mod, ip)
 );
+#endif /* CONFIG_MODULE_UNLOAD */
 
 TRACE_EVENT(module_request,
 
diff --git a/kernel/module.c b/kernel/module.c
index 21591ad..d9e2379 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -59,8 +59,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/module.h>
 
-EXPORT_TRACEPOINT_SYMBOL(module_get);
-
 #if 0
 #define DEBUGP printk
 #else
@@ -467,6 +465,9 @@ MODINFO_ATTR(srcversion);
 static char last_unloaded_module[MODULE_NAME_LEN+1];
 
 #ifdef CONFIG_MODULE_UNLOAD
+
+EXPORT_TRACEPOINT_SYMBOL(module_get);
+
 /* Init the unload section of the module. */
 static void module_unload_init(struct module *mod)
 {
-- 
1.7.0


--

From: Steven Rostedt
Date: Wednesday, March 31, 2010 - 8:17 pm

From: Li Zefan <lizf@cn.fujitsu.com>

Remove the @refcnt argument, because it has side-effects, and arguments with
side-effects are not skipped by the jump over disabled instrumentation and are
executed even when the tracepoint is disabled.

This was also causing a GPF as found by Randy Dunlap:

Subject: [PATCH 3/7] 2.6.33 GP fault only when built with tracing
LKML-Reference: <4BA2B69D.3000309@oracle.com>

Note, the current 2.6.34-rc has a fix for the actual cause of the GPF,
but this fixes one of its triggers.

Tested-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BA97FA7.6040406@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/module.h        |    6 ++----
 include/trace/events/module.h |   14 +++++++-------
 kernel/module.c               |    3 +--
 3 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 5e869ff..393ec39 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -460,8 +460,7 @@ static inline void __module_get(struct module *module)
 	if (module) {
 		preempt_disable();
 		__this_cpu_inc(module->refptr->count);
-		trace_module_get(module, _THIS_IP_,
-				 __this_cpu_read(module->refptr->count));
+		trace_module_get(module, _THIS_IP_);
 		preempt_enable();
 	}
 }
@@ -475,8 +474,7 @@ static inline int try_module_get(struct module *module)
 
 		if (likely(module_is_live(module))) {
 			__this_cpu_inc(module->refptr->count);
-			trace_module_get(module, _THIS_IP_,
-				__this_cpu_read(module->refptr->count));
+			trace_module_get(module, _THIS_IP_);
 		}
 		else
 			ret = 0;
diff --git a/include/trace/events/module.h b/include/trace/events/module.h
index 4b0f48b..a585f81 100644
--- a/include/trace/events/module.h
+++ b/include/trace/events/module.h
@@ -53,9 +53,9 @@ TRACE_EVENT(module_free,
 
 ...
From: Steven Rostedt
Date: Wednesday, March 31, 2010 - 8:17 pm

From: Steven Rostedt <srostedt@redhat.com>

Currently, binary readers of the ring buffer only know where events were
lost, but not how many events were lost at that location.
This information is available, but it would require adding another
field to the sub buffer header to include it.

But when a event can not fit at the end of a sub buffer, it is written
to the next sub buffer. This means there is a good chance that the
buffer may have room to hold this counter. If it does, write
the counter at the end of the sub buffer and set another flag
in the data size field that states that this information exists.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c |   37 +++++++++++++++++++++++++++++++++----
 1 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 8295650..dc6d563 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -320,6 +320,8 @@ EXPORT_SYMBOL_GPL(ring_buffer_event_data);
 
 /* Flag when events were overwritten */
 #define RB_MISSED_EVENTS	(1 << 31)
+/* Missed count stored at end */
+#define RB_MISSED_STORED	(1 << 30)
 
 struct buffer_data_page {
 	u64		 time_stamp;	/* page time stamp */
@@ -340,6 +342,7 @@ struct buffer_page {
 	local_t		 write;		/* index for next write */
 	unsigned	 read;		/* index for next read */
 	local_t		 entries;	/* entries on this page */
+	unsigned long	 real_end;	/* real end of data */
 	struct buffer_data_page *page;	/* Actual data page */
 };
 
@@ -1770,6 +1773,13 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
 	kmemcheck_annotate_bitfield(event, bitfield);
 
 	/*
+	 * Save the original length to the meta data.
+	 * This will be used by the reader to add lost event
+	 * counter.
+	 */
+	tail_page->real_end = tail;
+
+	/*
 	 * If this event is bigger than the minimum size, then
 	 * we need to be careful that we don't subtract the
 	 * write counter enough to allow another writer ...
From: Frederic Weisbecker
Date: Wednesday, March 31, 2010 - 11:13 pm

Since you can not ensure you have a room for that,
is there no way to delay the update of cpu_buffer->last_overrun
until you find a large enough end of page for that?

I mean, you do this in rb_get_page():

       if (overwrite != cpu_buffer->last_overrun) {
               cpu_buffer->lost_events = overwrite - cpu_buffer->last_overrun;
               cpu_buffer->last_overrun = overwrite;
       }

If you don't have enough room to put lost_events in the end of
the buffer, you could keep as is cpu_buffer->last_overrun,
so the next time you swap the reader page, the new delta
will be added to the previous that you couldn't transmit.

That way you ensure the user always know when you lose events.

--

From: Steven Rostedt
Date: Thursday, April 1, 2010 - 5:23 am

When is not the issue since we set the MISSING_EVENTS flag already. It's
how many were lost at that time that we do not store.

I actually found in most cases there is room, and trace-cmd can report
it. When there is no room, it just says events were dropped without
reporting the actual number.


This is what it shows when the number can be stored:

       trace-cmd-4980  [002] 32784.303699: lock_acquire:         0xffff88003ebf0e08 &(&sbi->s_rsv_window_lock)->rlock
       trace-cmd-4980  [002] 32784.303700: lock_release:         0xffff88003ebf0e08 &(&sbi->s_rsv_window_lock)->rlock
CPU:1 [415 EVENTS DROPPED]
       trace-cmd-4981  [001] 32784.303701: lock_release:         0xffff88000c49d7d0 &ei->truncate_mutex
       trace-cmd-4981  [001] 32784.303704: lock_acquire:         0xffffffff816591e0 read rcu_read_lock
       trace-cmd-4980  [002] 32784.303705: lock_acquire:         0xffff88003c0719d8 &(&bgl->locks[i].lock)->rlock

This is what it shows when the number can not:

          <idle>-0     [002] 32784.304459: lock_acquire:         0xffffffff816591e0 read rcu_read_lock
          <idle>-0     [002] 32784.304464: lock_release:         0xffffffff816591e0 rcu_read_lock
CPU:1 [EVENTS DROPPED]
              ls-4984  [001] 32784.304471: lock_acquire:         0xffff88003f8b8018 &(&base->lock)->rlock
              ls-4984  [001] 32784.304471: lock_release:         0xffff88003f8b8018 &(&base->lock)->rlock
              ls-4984  [001] 32784.304471: softirq_exit:         vec=1 [action=TIMER]


The problem with storing it later is that there is no guarantee that it
will be stored later. How far should the output look for that number. By
delaying when to print it, the output would need to search for that
number.

What happens if events were lost again before we found the number? Do we
add them together? Then we wont know how many events were lost between
the two spots where the events were lost. That is, if 448 events were
lost but we could not store the number, and then 318 events ...
From: Frederic Weisbecker
Date: Thursday, April 8, 2010 - 4:42 pm

Yeah. My biggest worry was that the user may not be reported of some
lost events. But as far as we have the total and also a flag that at
least reports we missed events (whatever the number there), then
it's fine.

Thanks.

--

From: Steven Rostedt
Date: Wednesday, March 31, 2010 - 8:17 pm

From: Steven Rostedt <srostedt@redhat.com>

Now that the ring buffer can keep track of where events are lost.
Use this information to the output of trace_pipe:

       hackbench-3588  [001]  1326.701660: lock_acquire: ffffffff816591e0 read rcu_read_lock
       hackbench-3588  [001]  1326.701661: lock_acquire: ffff88003f4091f0 &(&dentry->d_lock)->rlock
       hackbench-3588  [001]  1326.701664: lock_release: ffff88003f4091f0 &(&dentry->d_lock)->rlock
CPU:1 [LOST 673 EVENTS]
       hackbench-3588  [001]  1326.702711: kmem_cache_free: call_site=ffffffff81102b85 ptr=ffff880026d96738
       hackbench-3588  [001]  1326.702712: lock_release: ffff88003e1480a8 &mm->mmap_sem
       hackbench-3588  [001]  1326.702713: lock_acquire: ffff88003e1480a8 &mm->mmap_sem

Even works with the function graph tracer:

 2) ! 170.098 us  |                                            }
 2)   4.036 us    |                                            rcu_irq_exit();
 2)   3.657 us    |                                            idle_cpu();
 2) ! 190.301 us  |                                          }
CPU:2 [LOST 2196 EVENTS]
 2)   0.853 us    |                            } /* cancel_dirty_page */
 2)               |                            remove_from_page_cache() {
 2)   1.578 us    |                              _raw_spin_lock_irq();
 2)               |                              __remove_from_page_cache() {

Note, it does not work with the iterator "trace" file, since it requires
the use of consuming the page from the ring buffer to determine how many
events were lost, which the iterator does not do.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/ftrace_event.h |    1 +
 kernel/trace/trace.c         |   30 ++++++++++++++++++++++--------
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index c0f4b36..39e71b0 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -58,6 ...
From: Steven Rostedt
Date: Wednesday, March 31, 2010 - 8:17 pm

From: Steven Rostedt <srostedt@redhat.com>

Currently, when the ring buffer drops events, it does not record
the fact that it did so. It does inform the writer that the event
was dropped by returning a NULL event, but it does not put in any
place holder where the event was dropped.

This is not a trivial thing to add because the ring buffer mostly
runs in overwrite (flight recorder) mode. That is, when the ring
buffer is full, new data will overwrite old data.

In a produce/consumer mode, where new data is simply dropped when
the ring buffer is full, it is trivial to add the placeholder
for dropped events. When there's more room to write new data, then
a special event can be added to notify the reader about the dropped
events.

But in overwrite mode, any new write can overwrite events. A place
holder can not be inserted into the ring buffer since there never
may be room. A reader could also come in at anytime and miss the
placeholder.

Luckily, the way the ring buffer works, the read side can find out
if events were lost or not, and how many events. Everytime a write
takes place, if it overwrites the header page (the next read) it
updates a "overrun" variable that keeps track of the number of
lost events. When a reader swaps out a page from the ring buffer,
it can record this number, perfom the swap, and then check to
see if the number changed, and take the diff if it has, which would be
the number of events dropped. This can be stored by the reader
and returned to callers of the reader.

Since the reader page swap will fail if the writer moved the head
page since the time the reader page set up the swap, this gives room
to record the overruns without worrying about races. If the reader
sets up the pages, records the overrun, than performs the swap,
if the swap succeeds, then the overrun variable has not been
updated since the setup before the swap.

For binary readers of the ring buffer, a flag is set in the header
of each sub page (sub buffer) of the ring buffer. This ...
From: Ingo Molnar
Date: Friday, April 2, 2010 - 9:39 am

Pulled, thanks a lot Steve!

	Ingo
--

Previous thread: Re: [00/45] 2.6.27.46-stable review by Jayson R. King on Wednesday, March 31, 2010 - 6:52 pm. (3 messages)

Next thread: Re: Config NO_BOOTMEM breaks my amd64 box by H. Peter Anvin on Wednesday, March 31, 2010 - 8:16 pm. (2 messages)