Re: [RFC PATCH 1/2 v2] Unified trace buffer

Previous thread: [RFC PATCH 0/2 v2] Unified trace buffer (take two) by Steven Rostedt on Thursday, September 25, 2008 - 8:58 am. (1 message)

Next thread: [patch 5/6] kmsg: convert vmcp to kmsg api. by Martin Schwidefsky on Thursday, September 25, 2008 - 9:28 am. (1 message)
From: Steven Rostedt
Date: Thursday, September 25, 2008 - 8:58 am

This is probably very buggy. I ran it as a back end for ftrace but only
tested the irqsoff and ftrace tracers. The selftests are busted with it.

But this is an attempt to get a unified buffering system that was
talked about at the LPC meeting.

Now that it boots and runs (albeit, a bit buggy), I decided to post it.
This is some idea that I had to handle this.

I tried to make it as simple as possible.

I'm not going to explain all the stuff I'm doing here, since this code
is under a lot of flux (RFC, POC work), and I don't want to keep updating
this change log. When we finally agree on something, I'll make this
change log worthy.

If you want to know what this patch does, the code below explains it :-p

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  191 +++++++
 kernel/trace/Kconfig        |    3 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1172 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1367 insertions(+)

Index: linux-compile.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-compile.git/include/linux/ring_buffer.h	2008-09-25 01:23:42.000000000 -0400
@@ -0,0 +1,191 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		time_delta:27, type:5;
+	u32		data;
+	u64		array[];
+} __attribute__((__packed__));
+
+/*
+ * Recommend types by Linus Torvalds. Yeah, he didn't say
+ * this was a requirement, but it sounded good regardless.
+ */
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * (data is ignored)
+				 * size is variable depending on
+				 * the left over space on the page.
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extent the time ...
From: Linus Torvalds
Date: Thursday, September 25, 2008 - 10:02 am

Your timestamp handling seems odd. You do it per-event, but I think it 
should happen for all events, ie just do

	*ts += event->time_delta;

_outside_ the case statement, and then in RB_TYPE_TIME_EXTENT you'd do 
either

 - relative:
	*ts += event->data << TS_SHIFT;

 - absolute timestamp events:
	*ts = (event->data << TS_SHIFT) + event->time_delta;

but the bigger issue is that I think the timestamp should be relative to 
the _previous_ event, not relative to the page start. IOW, you really 
should accumulate them. 

IOW, the base timestamp cannot be in the cpu_buffer, it needs to be in the 
iterator data structure, since it updates as you walk over it.

Otherwise the extended TSC format will be _horrible_. You don't want to 
add it in front of every event in the page just because you had a pause at 
the beginning of the page. You want to have a running update, so that you 
only need to add it after there was a pause.

		Linus
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 10:16 am

hehe, this code is in so much flux, that I gave up on keeping the

The problem with this is overwrite mode, which is the only mode ftace 
currently offers. What happens when your writer starts overwriting the 
ring buffer and there is no reader?

What happens is that the start value is gone. You do not have a way to use 
all the deltas to catch up to the remaining events.

-- Steve
--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 10:25 am

Overwrite things on page at a time. Don't you already do that? (I didn't 
check that closely, I just assumed you would do the _much_ simpler "move 
the head to the next page" thing rather than trying to mix head and tail 

Use the page start date for the first event in a page. But within pages, 
make everything depend on previous event.

		Linus
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 10:46 am

Yeah, I decided to do the blast the page instead of just blast the entry.

Hmm, I'm confused. I do this already. I put the page start time on each 
page already. The events on the page are based on the page start time
itself.  The iterator, or writer, keeps track of the last event. When it 
reaches a new page, it reads the new time stamp of the page and starts 
incrementing against that.

-- Steve

--

From: Mathieu Desnoyers
Date: Thursday, September 25, 2008 - 10:35 am

* Linus Torvalds (torvalds@linux-foundation.org) wrote:

How about keeping the timestamps absolute ? (but just keep 27 LSBs)

It would help resynchronizing the timestamps if an event is lost and
would not accumulate error over and over. It would event help detecting
bugs in the tracer by checking if timestamps go backward.

Also, it would remove inter-dependency between consecutive events; we
would not have to know "for sure" what the previous timestamp was when
we write the current event. Just knowing if we need to write the full
TSC is enough (which implies knowing an upper bound), which is a much
more relax constraint than having to know the _exact_ previous
timestamp.

Is there a reason to use delta between events rather than simply write
the 27 LSBs that I would have missed ?


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 10:48 am

One answer is that your counter wrap problem is extended.

That is, you have 27 bits of time between each event to not worry about
wraps. But if you go against the page itself, the last event on that page
is more likely to suffer.

-- Steve
--

From: Mathieu Desnoyers
Date: Thursday, September 25, 2008 - 11:25 am

You can do the exact same thing and manage to keep the absolute time.
You just have to adapt the reader like this : (this would be for
per-event cycle count in the 32 LSBs, slight bitmask adaptation needed
for 27 bits only).

keep a 64 bits TSC value in a per-buffer variable. The previous value is
always re-used for the next read.

let's call it :
tf->buffer.tsc  (u64)

read_event() would look like :

u32 timetamp = read_event_timetamp();

if(timestamp < (0xFFFFFFFFULL & tf->buffer.tsc)) {
    /* overflow */
    tf->buffer.tsc = ((tf->buffer.tsc & 0xFFFFFFFF00000000ULL)
                     + 0x100000000ULL) | (u64)timestamp;
} else {
    /* no overflow */
    tf->buffer.tsc = (tf->buffer.tsc & 0xFFFFFFFF00000000ULL)
                     | (u64)timestamp;
}

This will detect 32 bits overflow and keep the tf->buffer.tsc in sync
with the TSC representation on the traced machine as long as events are
less then 27 bits apart. A "full tsc" header can also be easily managed
with this by updating the tf->buffer.tsc value completely when such
event is met.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

Previous thread: [RFC PATCH 0/2 v2] Unified trace buffer (take two) by Steven Rostedt on Thursday, September 25, 2008 - 8:58 am. (1 message)

Next thread: [patch 5/6] kmsg: convert vmcp to kmsg api. by Martin Schwidefsky on Thursday, September 25, 2008 - 9:28 am. (1 message)