[PATCH v10 Golden] Unified trace buffer

Previous thread: [RFC PATCH 0/2 v3] Unified trace buffer by Steven Rostedt on Thursday, September 25, 2008 - 11:51 am. (1 message)

Next thread: [RFC PATCH 2/2 v3] ftrace: make work with new ring buffer by Steven Rostedt on Thursday, September 25, 2008 - 11:51 am. (1 message)
From: Steven Rostedt
Date: Thursday, September 25, 2008 - 11:51 am

This is probably very buggy. I ran it as a back end for ftrace but only
tested the irqsoff and ftrace tracers. The selftests are busted with it.

But this is an attempt to get a unified buffering system that was
talked about at the LPC meeting.

Now that it boots and runs (albeit, a bit buggy), I decided to post it.
This is some idea that I had to handle this.

I tried to make it as simple as possible.

I'm not going to explain all the stuff I'm doing here, since this code
is under a lot of flux (RFC, POC work), and I don't want to keep updating
this change log. When we finally agree on something, I'll make this
change log worthy.

If you want to know what this patch does, the code below explains it :-p

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  175 ++++++
 kernel/trace/Kconfig        |    3 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1218 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1397 insertions(+)

Index: linux-compile.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-compile.git/include/linux/ring_buffer.h	2008-09-25 13:59:09.000000000 -0400
@@ -0,0 +1,175 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+} __attribute__((__packed__));
+
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * array is ignored
+				 * size is variable depending on
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extent the time delta
+				 * array[0] = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with ...
From: Steven Rostedt
Date: Thursday, September 25, 2008 - 6:02 pm

This version has been cleaned up a bit. I've been running it as
a back end to ftrace, and it has been handling pretty well.

I did not implement the GTOD sync part and will leave that for later.
But this is the basic design that I like and will be the basis
of my future work.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  178 ++++++
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1252 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1435 insertions(+)

Index: linux-trace.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/include/linux/ring_buffer.h	2008-09-25 20:36:12.000000000 -0400
@@ -0,0 +1,178 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+	u32		type:2, len:3, time_delta:27;
+	u32		array[];
+} __attribute__((__packed__));
+
+enum {
+	RB_TYPE_PADDING,	/* Left over page padding
+				 * array is ignored
+				 * size is variable depending on
+				 */
+	RB_TYPE_TIME_EXTENT,	/* Extent the time delta
+				 * array[0] = time delta (28 .. 59)
+				 * size = 8 bytes
+				 */
+	/* FIXME: RB_TYPE_TIME_STAMP not implemented */
+	RB_TYPE_TIME_STAMP,	/* Sync time stamp with external clock
+				 * array[0] = tv_nsec
+				 * array[1] = tv_sec
+				 * size = 16 bytes
+				 */
+
+	RB_TYPE_DATA,		/* Data record
+				 * If len is zero:
+				 *  array[0] holds the actual length
+				 *  array[1..(length+3)/4] holds data
+				 * else
+				 *  length = len << 2
+				 *  array[0..(length+3)/4] holds data
+				 */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define ...
From: Masami Hiramatsu
Date: Thursday, September 25, 2008 - 6:52 pm

Hi Steven,


Thank you for your great work.
It seems good to me(especially, encapsulating events :)).

I have one request of enhancement.

 > +static struct ring_buffer_per_cpu *
 > +ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
 > +{
[...]
 > +	cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
 > +					       cache_line_size()), GFP_KERNEL,
 > +					 cpu_to_node(cpu));

Here, you are using a slab object for page managing array,
the largest object size is 128KB(x86-64), so it can contain
16K pages = 64MB.

As I had improved relayfs, in some rare case(on 64bit arch),
we'd like to use larger buffer than 64MB.

http://sourceware.org/ml/systemtap/2008-q2/msg00103.html

So, I think similar hack can be applicable.

Would it be acceptable for the next version?

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 7:11 pm

I would like to avoid using vmalloc as much as possible, but I do see the 
limitation here. Here's my compromise.

Instead of using vmalloc if the page array is greater than one page, 
how about using vmalloc if the page array is greater than 
KMALLOC_MAX_SIZE?

This would let us keep the vmap area free unless we have no choice.

-- Steve

--

From: Masami Hiramatsu
Date: Thursday, September 25, 2008 - 7:47 pm

Hmm, that's a good idea.
In most cases, per-cpu buffer may be less than 64MB,
so I think it is reasonable.


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Mathieu Desnoyers
Date: Thursday, September 25, 2008 - 8:20 pm

You could also fallback on a 2-level page array when buffer size is >
64MB. The cost is mainly a supplementary pointer dereference, but one
more should not make sure a big difference overall.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Peter Zijlstra
Date: Friday, September 26, 2008 - 12:18 am

I'm still not sure why we don't just link the pages using the page
frames, we don't need the random access, do we?

--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 3:45 am

Yeah we can go back to that (as ftrace does).

1) It can be very error prone. I will need to encapsulate the logic more.

2) I'm still not sure if crash can handle it.


I was going to reply to Masami with this answer, but it makes things more 
complex.  For v1 (non RFC v1) I wanted to start simple. v2 can have this 
enhancement.

-- Steve

--

From: Peter Zijlstra
Date: Friday, September 26, 2008 - 4:00 am

It ought to, and if it can't it should be fixed. Having easy access to
the pageframes is vital to debugging VM issues. So I'd not bother about

Right - I just object to having anything vmalloc.

--

From: Masami Hiramatsu
Date: Friday, September 26, 2008 - 9:57 am

I just requested that the expansion of buffer size limitation too. :)

I don't stick with vmalloc. If that (page frame chain?) can
achieve better performance, I agree that trace buffer uses it.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 10:14 am

v5 is out with this implementation. It may or may not be better 
performance, but the difference is most likely negligible.

Anyway, I'm happing with this last release, and hopefully it can get into
2.6.28.  This would mean I can start basing ftrace on top of it.

-- Steve

--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 3:47 am

Hmm, but this does make changing the buffer size much easier. I'll think 
about it  and perhaps try it out. If I can tidy it up nicer than the 
ftrace code, then I may include it for v1.

-- Steve

--

From: Mathieu Desnoyers
Date: Friday, September 26, 2008 - 9:04 am

Yes, that's a brilliant idea :)

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 10:11 am

[
  Note the removal of the RFC in the subject.
  I am happy with this version. It handles everything I need
  for ftrace.

  New since last version:

   - Fixed timing bug. I did not add the deltas properly when
     reading the buffer.

   - Removed "-1" time stamp normalize test. This made the
     clock go backwards!

   - Removed page pointer array and replaced it with the ftrace
     page struct link list trick. Since this is my second time
     writing this code (first with ftrace), it is actually much
     cleaner than the ftrace code.

   - Implemented buffer resizing. By using the page link list trick,
     this became much simpler.

   Note, the GOTD part is still not implemented, but can be done
   later without affecting this interface.

]

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 ...
From: Arnaldo Carvalho de Melo
Date: Friday, September 26, 2008 - 10:31 am

Why do you need __packed__ here? With or without it the layout is the
same:

[acme@doppio examples]$ pahole packed
struct ring_buffer_event {
	u32 type:2;               /* 0:30  4 */
	u32 len:3;                /* 0:27  4 */
	u32 time_delta:27;        /* 0: 0  4 */
	u32 array[0];             /* 4     0 */

	/* size: 4, cachelines: 1, members: 4 */
	/* last cacheline: 4 bytes */
};

- Arnaldo
--

From: Linus Torvalds
Date: Friday, September 26, 2008 - 10:37 am

Indeed. And on some architectures 'packed' will actually mean that the 
compiler may think that it's unaligned, and then generate much worse code 
to access the fields. So if you align things anyway (and you do), then 
'packed' is the wrong thing to do.

		Linus
--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 10:46 am

OK, I'm making v6 now with various cleanups. I'll nuke it on that one.

-- Steve

--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 10:02 am

btw., now that it's getting into shape, could you please fix the ftrace 

... to not have known bugs, so that we could try it in tip/ftrace and 
make sure it works well in practice?

it's a ton of changes already, it would be nice to get to some stable 
known-working state and do delta patches from that point on, and keep 
its 'works well' quality.

	Ingo
--

From: Steven Rostedt
Date: Saturday, September 27, 2008 - 10:18 am

OK, the patch that I was using was against Linus's tree. I'll port it over 
to linux-tip on Monday and get it past the "proof of concept" stage. 
Actually, the verison I have on my desk works pretty well. The main issues 
to solve is that some other tracers and the self test stick their noses 
into the buffering system, which would need to be fixed.

There's also some bugs in the status numbers printed in the latency_trace 
header. But I have not hit any bugs with the buffering itself.

I'll clean all this up and send out a patch on Monday. My wife is 
mandating that I do not do anymore work over the weekend ;-)

-- Steve

--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 11:05 am

[
  Changes since v5:

  - removed packed attribute from event structure.

  - added parenthesis around ~TS_MASK

  - fixed some comments in header

  - fixed ret value on ring_buffer_write on errors.

  - added check_pages when modifying the size of cpu buffers
]

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

	type = RB_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the buffer.

An event with 82 bytes of data:

	type = RB_TYPE_DATA
	len = 0
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0]: 84 (Note the alignment)
	array[1..14]: <82 bytes of data> <2 bytes empty>

The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 ...
From: Richard Holden
Date: Friday, September 26, 2008 - 11:30 am

Forgive me if I've gotten this wrong but the terminology seems backwards
Here, I would think we only throw away new data if the producer catches up
with the consumer, if the consumer catches up with the producer we're


Just trying to understand the terminology before I look at the code so I'm
sorry if I have just completely misunderstood.

-Richard Holden


--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 11:39 am

Argh! Yes.  I'm the one that is backwards ;-)

Yeah, that is what I meant. Don't you know? You are suppose to understand 

I always get confused with the translation of what the head/tail to 
producer/consumer.

Here I have the producer adding to the tail, and the consumer reading from
the head. Perhaps this is backwards? I could change it.

s/head/foobar/g
s/tail/head/g
s/foobar/tail/g


Sure, thanks.

-- Steve

--

From: Peter Zijlstra
Date: Friday, September 26, 2008 - 11:59 am

Since you're already using the page frame, you can stick this per page
timestamp in there as well, and get the full page for data.

You can either use a struct page overlay like slob does, or add a u64 in
the union that contains struct {private, mapping}.



--

From: Martin Bligh
Date: Friday, September 26, 2008 - 12:46 pm

What did you guys think of Mathieu's idea of sticking the buffer length
in the header here, rather than using padding events? Seemed cleaner
to me.
--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 12:52 pm

Actually I like the padding. This way when I move the event pointer 
forward, I only need to compare it to a constant (PAGE_SIZE), or test to 
see if the event is padding.  Placing this into the buffer page, I will 
have to always compare it to that pointer.

But I guess I could change it to that if needed. That doesn't affect the 
API, as it is only internal.

I'm almost done with v7, perhaps I might try that with v8 to see if I like 
it better.

-- Steve

--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 2:37 pm

OK, I just implemented the size field in the page struct. Seems to work 
well. I'm still keeping the padded event, so in the future, if we ever 
map these pages to userspace or files, these holes will have a type.

Will post later today, need to actually enter a real life for a bit.

-- Steve
--

From: Peter Zijlstra
Date: Friday, September 26, 2008 - 12:14 pm

actually nr_possible makes sense, and you might consider always
allocating buffers (and keeping them for offlined cpus) to avoid massive
allocations/frees cpu-hotplug events.

Mike Travis has been going over the kernel removing constructs like
this, and replacing them with dynamically allocated arrays of

--

From: Mike Travis
Date: Friday, September 26, 2008 - 3:28 pm

The other thing to consider is using a percpu variable.

Cheers,
Mike
--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 4:56 pm

This structure is allocated on request.

-- Steve

--

From: Mike Travis
Date: Friday, September 26, 2008 - 5:05 pm

[Empty message]
From: Steven Rostedt
Date: Friday, September 26, 2008 - 5:18 pm

Actually in this case I chose num_possible_cpus(). Reason being is that 
later I may add an interface to allow the user to select which CPUs they 
want to trace, and this will only allocate a subset of CPU buffers. 
(not going to implement that in the first release).

But to lay the ground work, I set a buffers->cpumask to be that of all the 
cpus with buffers allocated. For now that mask is set to cpu_possible_map. 
Since num_possible_cpus() is defined as cpus_weight_nr(cpu_possible_map) 
I figured that was the better choice.

-- Steve

--

From: Mike Travis
Date: Friday, September 26, 2008 - 5:46 pm

One problem though, it's *theoretically* possible for num_possible to be
less than nr_cpu_ids and a cpu index may extend past the end of your
allocated array.  This would happen if  the cpu indices are allocated
some other way than as each cpu is discovered.  For example, a system
might want a group of cpus in one section (say by node, or socket) and
then a hole in the cpu_possible_map until the next group.  nr_cpu_ids
is guaranteed to be the highest possible cpu + 1.

Cheers,
Mike
--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 5:52 pm

Thanks for the explanation. I'll change buffer->cpus to be set to 
nr_cpu_ids.

-- Steve

--

From: Peter Zijlstra
Date: Friday, September 26, 2008 - 12:17 pm

Arjan, any preferences wrt kerneloops.org?

--

From: Arjan van de Ven
Date: Friday, September 26, 2008 - 4:16 pm

On Fri, 26 Sep 2008 21:17:13 +0200



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Peter Zijlstra
Date: Friday, September 26, 2008 - 1:08 pm

You probably want synchronize_sched() here (and similar other places) to
ensure any active writer on the corresponding cpu is actually stopped.

Which suggests you want to use something like ring_buffer_lock_cpu() and

--

From: Masami Hiramatsu
Date: Friday, September 26, 2008 - 2:14 pm

Would it really be done in the buffer layer?
I think it should be done by each tracer, because buffer layer
can't ensure truly active writers have stopped.


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 2:26 pm

Actually it can ;-)

Since all writes to the buffer at least disable preemption, by issuing a 
synchronize_sched, we can guarantee that after disabling the record, all 
activity will be done.

-- Steve

--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 2:13 pm

[
  Changes since v6:

  - Added shift debug test to test both normalization of
    timestamp, but also the large time deltas. ftrace records too quickly
    to get large deltas :-/

  - Fixed some minor issues with keeping track of time.

  - used slob hack to put more information in the page struct and now
    have the full buffer page free for data. Thanks to Peter Zijlstra
    for suggesting the idea.

  - have the buffer use a cpu mask (initialized to cpu_possible_map)
    to allocate for cpu usage.

  - fixed entries counting.

  - use DIV_ROUND_UP macro (also suggested by Peter)
]

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

	type = RB_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the ...
From: Steven Rostedt
Date: Friday, September 26, 2008 - 7:02 pm

[
  Changes since v7:

  - added the size of data in the page into the page frame.
    Suggested by Martin Bligh and Mathieu Desnoyers

  - Converted all static functions to be named with a rb_ prefix.
    This may conflict with rbtree functions in the future, but if
    this does happen, we will need to rename the functions in this
    file. The rb_ prefixed functions here are all static, so it only
    affects this code. Thanks to Arnaldo Carvalho de Melo.

  - Added some synchronized_sched() where record_disabled is
    incremented. There are other places that expect the caller
    to handle it. Suggested by Peter Zijlstra.

  - Use nr_cpu_ids for max cpu. Thanks to Mike Travis.
]

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

	type = RB_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - ...
From: Steven Rostedt
Date: Friday, September 26, 2008 - 11:06 pm

[
  Changes since version 8:

  Two major bug fixes!

   - Had mix of referencing the pages link list with both
     page->lru and buffer_page->list.  Perhaps they luckily
     were lined up. But I have no idea why this didn't totally
     crash my box.

   - Missed a write stamp update that would cause funny times
]

From: Steven Rostedt <srostedt@redhat.com>
Subject: Unified trace buffer

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RB_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RB_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RB_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 bytes:

	type = RB_TYPE_DATA
	len = 2
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0..1]: <7 bytes of data> <1 byte empty>

This event is saved in 12 bytes of the buffer.

An event with 82 bytes of data:

	type = RB_TYPE_DATA
	len = 0
	time_delta: <time-stamp> - <prev_event-time-stamp>
	array[0]: 84 (Note the alignment)
	array[1..14]: <82 bytes of data> <2 ...
From: Ingo Molnar
Date: Saturday, September 27, 2008 - 11:39 am

small nitpicking review, nothing structural yet:








hm, should not be raw, at least initially. I am 95% sure we'll see 
lockups, we always did when we iterated ftrace's buffer implementation 

please use consistent vertical whitespaces.  Above, in the struct 
ring_buffer definition, you can add another tab to most of the vars - 
that will also make the '**buffers' line look nice.

same for all structs across this file. In my experience, a 50% vertical 

please name it RINGBUFFER_BUG_ON() / RINGBUFFER_WARN_ON(), so that we 
dont have to memorize another set of debug names. [ See 
DEBUG_LOCKS_WARN_ON() in include/linux/debug_locks.h ]


please apply ftrace's standard reverse christmas tree style and move the 


'addr' initialization can move to the definition line - you save two 



( optional:when there's a multi-line loop then i generally try to insert 
  an extra newline when starting the body - to make sure the iterator 

this function is too long, please split it up. The first condition's 

please use standard comment style:

 /*
  * Comment
  */

	Ingo
--

From: Steven Rostedt
Date: Saturday, September 27, 2008 - 12:24 pm

Hi Ingo,

Thanks for the review!


Yeah, Linus pointed this out with the rb_ static function names. But since 
the functions are static I kept them as is. But here we have global names.


Ah, I think I had it more complex and changed it to a literal without 


No biggy. I thought this would be nicer as inline. But I have no problem


It was to prevent lockdep from checking the locks from inside. We had 
issues with ftroce and lockdep in the past, because ftrace would trace the
internals of lockdep, and lockdep would then recurse back into itself to 
trace.  If lockdep itself can get away with not using raw_spinlocks, then



Heh, this was directly from a bug I had and laziness ;-)
I originally just had struct list_head pages (and no *tmp), which kept the 
christmas tree format. But later found that you need to initialize list 







Hmm, this is interesting. I kind of like this because it is not really a 
standard comment. It is a comment about the definitions of the enum. I 
believe if they are above:

  /*
   * Comment
   */
   RB_ENUM_TYPE,

It is not as readable. But if we do:

   RB_ENUM_TYPE,	/*
			 * Comment
			 */

The comment is not at the same line as the enum, which also looks 
unpleasing.

We can't could do:

			/*
   RB_ENUM_TYPE,	 * Comment
			 */
			/*
   RB_ENUM_TYPE2,	 * Comment
			 */

Because the ENUM is also in the comment :-p


I chose this way because we have:

  RB_ENUM_TYPE,		/* Comment
			 * More comment
			 */
  RB_ENUM_TYPE2,	/* Comment
			 */

Since I find this the nices way to describe enums. That last */ is 
good to space the comments apart, otherwise we have:

  RB_ENUM_TYPE,		/* Comment
			 * More comment */
  RB_ENUM_TYPE2,	/* Comment */

That is not as easy to see the separation of one description of enums with 
the other.

-- Steve

--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 12:41 pm

that's even worse i think :-/ And this isnt bikeshed-painting really, 
the RNGBF_ name hurts my eyes and RB_ is definitely confusing to read. 
(as the rbtree constants are in capitals as well and similarly named)

 RING_TYPE_PADDING

or:

 RINGBUF_TYPE_PADDING


yes, way too big. Sometimes we make savings from a 10 bytes function 
already. (but it's always case dependent - if a function has a lot of 
parameters then uninlining can hurt)

the only exception would be if there's normally only a single 

would be nice to make sure that ftrace's recursion checks work as 
intended - and the same goes for lockdep's recursion checks. Yes, we had 
problems in this area, and it would be nice to make sure it all works 

clarification: multi-line loop _condition_. It's pretty rare (this is 
such a case) but sometimes unavoidable - and then the newline helps 



So i suggested to fix it to:

 +	RB_TYPE_TIME_EXTENT,	/*
 +				 * Extent the time delta
 +				 * array[0] = time delta (28 .. 59)
 +				 * size = 8 bytes
 +				 */

ok? I.e. "comment" should have the same visual properties as other 
comments.

I fully agree with moving it next to the enum, i sometimes use that 
style too, it's a nice touch and more readable in this case than 
comment-ahead. (which we use for statements)

	Ingo
--

From: Steven Rostedt
Date: Saturday, September 27, 2008 - 12:54 pm

I don't mind the extra typing, it is just a bit more difficult to keep in

It is a hot path in the internals. Perhaps I'll make an inline function
in the interal code "rb_event_length" and have the other users call.

unsigned ring_buffer_event(struct ring_buffer_event *event)
{
	return rb_event_length(event);

But then we have:

        RB_TYPE_PADDING,        /*
				 * Left over page padding
                                 * array is ignored
                                 * size is variable depending on
                                 * how much padding is needed
                                 */
        RB_TYPE_TIME_EXTENT,    /*
				 * Extent the time delta
                                 * array[0] = time delta (28 .. 59)
                                 * size = 8 bytes
                                 */

Where it is not as easy to see which comment is with which enum. 
Especially when you have many enums. That's why I like the method I used 
with:

        RB_TYPE_PADDING,        /* Left over page padding
                                 * array is ignored
                                 * size is variable depending on
                                 * how much padding is needed
                                 */
        RB_TYPE_TIME_EXTENT,    /* Extent the time delta
                                 * array[0] = time delta (28 .. 59)
                                 * size = 8 bytes
                                 */

Where it is very easy to notice which comment goes with which enum.

-- Steve

--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 1:00 pm

that's really not a hard limit, but yeah.





oh, btw., that's a spelling mistake: s/extend/extend ?

	Ingo
--

From: Steven Rostedt
Date: Monday, September 29, 2008 - 8:05 am

OK, I did a quick survey of what others did in include/linux to handle 
multi line comments for enums. I ignored the single line comments since 
that is pretty standard. Here's what I found:

Those that do:

enum myenum {
	ENUM_PING_PONG,		/* Bounce a ball back and forth
				   till you have a winner. */
	ENUM_HONEY_CONE,	/* Soft and sweet a yummy for
				   the tummy. */
};

include/linux/atmdev.h
include/linux/fd.h
include/linux/hil.h
include/linux/if_pppol2tp.h
include/linux/ivtv.h
include/linux/libata.h
include/linux/mmzone.h
include/linux/reiserfs_fs.h
include/linux/reiserfs_fs_sb.h
include/linux/rtnetlink.h
include/linux/scc.h
include/linux/videodev2.h

Those that do:

enum myenum {
        ENUM_PING_PONG,         /* Bounce a ball back and forth */
                                /* till you have a winner.      */
        ENUM_HONEY_CONE,        /* Soft and sweet a yummy for   */
                                /* the tummy.                   */
};

include/linux/atmsvc.h
include/linux/pktcdvd.h


Those that do (what I did):

enum myenum {
        ENUM_PING_PONG,         /* Bounce a ball back and forth
                                 * till you have a winner.
				 */
        ENUM_HONEY_CONE,        /* Soft and sweet a yummy for
                                 * the tummy.
				 */
};

include/linux/buffer_head.h (with space between the two enums)
include/linux/personality.h


Those that do:

enum myenum {
	/*
	 * Bounce a ball back and forth
	 * till you have a winner.
	 */
        ENUM_PING_PONG,
	/*
	 * Soft and sweet a yummy for
	 * the tummy.
	 */
        ENUM_HONEY_CONE,
};

include/linux/cgroup.h
include/linux/cn_proc.h
include/linux/exportfs.h
include/linux/fb.h
include/linux/hil_mlc.h
include/linux/pci.h
include/linux/reiserfs_fs_i.h


And finally Doc book style:

/**
 * enum myenum
 * @ENUM_PING_PONG: Bounce a ball back and forth
 *                  till you have a winner.
 * @ENUM_HONEY_CONE: Soft and ...
From: Martin Bligh
Date: Saturday, September 27, 2008 - 1:07 pm

Would using tb_ (trace buffer) rather than rb_ help ?
--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 1:34 pm

excellent idea ...

	Ingo
--

From: Steven Rostedt
Date: Monday, September 29, 2008 - 9:10 am

[
  This is the final version of this patch. From now on, I will be sending
  changes on top of this patch.

  Changes since v9:

  All suggestions from Ingo Molnar.

  - Changed comment of enum to DocBook style.

  - Replaced the RB_TYPE_ enums with RINGBUF_TYPE_ prefixes to avoid
    name collision with rbtree. Note, I did not use the TB_ extension
    because I envision a "trace_buffer" layer on top of this layer
    in the future.

  - Moved ring_buffer_event_{length,data} into the .c file and added
    internal inlines. External uses will need to call the function.

  - Broke out rb_add_time_stamp function from rb_reserve_next_event.

  - made the cpu_buffer->lock back to a normal spin lock.

  - The rest are style changes.

]

This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.

The events recorded into the buffer have the following structure:

struct ring_buffer_event {
	u32 type:2, len:3, time_delta:27;
	u32 array[];
};

The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.

There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).

RINGBUF_TYPE_PADDING: this type is used to note extra space at the end
	of a buffer page.

RINGBUF_TYPE_TIME_EXTENT: This type is used when the time between events
	is greater than the 27 bit delta can hold. We add another
	32 bits, and record that in its own event (8 byte size).

RINGBUF_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
	help keep the buffer timestamps in sync.

RINGBUF_TYPE_DATA: The event actually holds user data.

The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.

Example, data size of 7 ...
From: Steven Rostedt
Date: Monday, September 29, 2008 - 9:11 am

Ingo,

I will add this patch to my linux-tip and then I will start porting ftrace 
over to it in an incremental fashion.

-- Steve



--

From: Mathieu Desnoyers
Date: Monday, September 29, 2008 - 4:35 pm

* Steven Rostedt (rostedt@goodmis.org) wrote:

Hi Steven,

You should have a look at mm/slob.c free_slob_page(). I think your page
free will generate a "bad_page" call due to mapping != NULL and mapcount
!= 0. I just ran into this in my own code. :)

Regards,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Monday, September 29, 2008 - 5:01 pm

Hi Mathieu!

Thanks! I must have been lucky some how not to trigger this :-/

I'll add an update patch for this.

-- Steve

--

From: Mathieu Desnoyers
Date: Monday, September 29, 2008 - 5:03 pm

My guess is that you never free your buffers in your test cases. I don't
know if it was expected; probably not if your code is built into the
kernel.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Monday, September 29, 2008 - 5:12 pm

Actually my resize does free the buffers and I did test this. I probably 
never ran the trace when testing the freeing which means those pointers 
could have luckily not have been changed.

-- Steve

--

From: Mathieu Desnoyers
Date: Monday, September 29, 2008 - 8:46 pm

I also got some corruption of the offset field in the struct page I use.
I think it might be related to the fact that I don't set the PG_private
bit (slob does set it when the page is in its free pages list). However,
given I'd like to pass the buffer pages to disk I/O and for network
socket and still keep the ability to re-use it when the I/O has been
performed, I wonder where I should put my 

                       struct list_head list;  /* linked list of buf pages */
                       size_t offset;          /* page offset in the buffer */

fields ? Any ideas ?

They are currently in :

struct buf_page {
        union {
                struct {
                        unsigned long flags;    /* mandatory */
                        atomic_t _count;        /* mandatory */
                        union {                 /* mandatory */
                                atomic_t _mapcount;
                                struct {
                                        u16 inuse;
                                        u16 objects;
                                };
                        };
                        struct list_head list;  /* linked list of buf pages */
                        size_t offset;          /* page offset in the buffer */
                };
                struct page page;
        };
};

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Monday, September 29, 2008 - 9:00 pm

Ah, I believe the disk IO uses the page frame. That might be a bit more 
difficult to pass the data to disk and still keep information on the
page frame.

-- Steve

--

From: Jonathan Corbet
Date: Tuesday, September 30, 2008 - 8:20 am

On Tue, 30 Sep 2008 00:00:11 -0400 (EDT)

Perhaps I'm speaking out of turn, but I have to wonder: am I the only one
who gets uncomfortable looking at these hacks to overload struct page?  It
seems fragile as all hell; woe to he who tries to make a change to struct
page someday and has to track all of this stuff down.

Are the savings gained by using struct page this way really worth the
added complexity?

jon
--

From: Peter Zijlstra
Date: Tuesday, September 30, 2008 - 8:54 am

Its not that complex IMHO, the thing that is ugly are those struct page
overloads, what we could do is try and sanitize the regular struct page
and pull all these things in.

Because the only reason people are doing these overloads is because
struct page in mm_types.h is becomming an unreadable mess.

Trouble is, looking at it I see no easy way out,

--

From: Linus Torvalds
Date: Tuesday, September 30, 2008 - 9:38 am

That's not the scary part. The scary part is that somebody may well want 
to access the trace buffer pages in complex ways.

If you mmap them, for example, you can use VM_PFNMAP to make sure that 
nobody should ever look at the "struct page", but if you want to do things 
like direct-to-disk IO on the trace pages (either with splice() or with 

The "unreadable mess" has exactly the same issues, though: people need to 
realize that when you overload fields in the page structure, you can then 
NEVER EVER use those pages for any other thing. 

For the internal VM code, that's ok. The VM knows that a page is either an 
anonymous page or a file mapping etc, and the overloading wrt mm_types.h 
is explicit. The same goes for SL*B, although it does the overloading 
differently.

Trace buffers are different, though. Do people realize that doing the 
overloading means that you never EVER can use those buffers for anything 
else? Do people realize that it means that splice() and friends are out of 

Quite frankly, we could just put it at the head of the page itself. Having 
a "whole page" for the trace data is not possible anyway, since the trace 
header itself will always eat 8 bytes.

And I do think it would potentially be a better model. Or at least safer.

			Linus
--

From: Steven Rostedt
Date: Tuesday, September 30, 2008 - 9:48 am

Actually, looking at the code, there is no reason I need to keep this in 
the frame buffer itself. I've also encapsulated the accesses to the 
incrementing of the pointers so it would be trivial to try other 
approaches.

The problem we had with the big array struct is that we can want large 
buffers and to do that with pointers means we would need to either come up 
with a large allocator or use vmap.

But I just realized that I could also just make a link list of page 
pointers and do the exact same thing without having to worry about page 
frames.  Again, the way I coded this up, it is quite trivial to replace 
the handling of the pages with other schemes.

-- Steve

--

From: Linus Torvalds
Date: Tuesday, September 30, 2008 - 10:01 am

That might be the best option.

Yes, doing it in the 'struct page' itself is obviously going to save us 
some memory over having specially allocated page headers, but it's not 
like we'd expect to have _that_ many of these, and having a separate 
structure is actually good in that it also would make it simpler/clearer 
when/if you want to add larger pages (or other non-page allocations) into 
the mix.

For example, if somebody really wants bigger areas, they can allocate them 
with vmalloc and/or multi-page allocations, and then add them as easily to 
the list of pages as if it was a normal page. Doing the same with playing 
tricks on 'struct page' would be pretty damn painful.

		Linus
--

From: Steven Rostedt
Date: Wednesday, October 1, 2008 - 8:14 am

The current method of overlaying the page frame as the buffer page pointer
can be very dangerous and limits our ability to do other things with
a page from the buffer, like send it off to disk.

This patch allocates the buffer_page instead of overlaying the page's
page frame. The use of the buffer_page has hardly changed due to this.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/ring_buffer.c |   54 ++++++++++++++++++++++++++-------------------
 1 file changed, 32 insertions(+), 22 deletions(-)

Index: linux-tip.git/kernel/trace/ring_buffer.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ring_buffer.c	2008-10-01 09:37:23.000000000 -0400
+++ linux-tip.git/kernel/trace/ring_buffer.c	2008-10-01 11:03:16.000000000 -0400
@@ -115,16 +115,10 @@ void *ring_buffer_event_data(struct ring
  * Thanks to Peter Zijlstra for suggesting this idea.
  */
 struct buffer_page {
-	union {
-		struct {
-			unsigned long	 flags;		/* mandatory */
-			atomic_t	 _count;	/* mandatory */
-			u64		 time_stamp;	/* page time stamp */
-			unsigned	 size;		/* size of page data */
-			struct list_head list;		/* list of free pages */
-		};
-		struct page page;
-	};
+	u64		 time_stamp;	/* page time stamp */
+	unsigned	 size;		/* size of page data */
+	struct list_head list;		/* list of free pages */
+	void *page;			/* Actual data page */
 };
 
 /*
@@ -133,9 +127,9 @@ struct buffer_page {
  */
 static inline void free_buffer_page(struct buffer_page *bpage)
 {
-	reset_page_mapcount(&bpage->page);
-	bpage->page.mapping = NULL;
-	__free_page(&bpage->page);
+	if (bpage->page)
+		__free_page(bpage->page);
+	kfree(bpage);
 }
 
 /*
@@ -237,11 +231,16 @@ static int rb_allocate_pages(struct ring
 	unsigned i;
 
 	for (i = 0; i < nr_pages; i++) {
+		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
+				    GFP_KERNEL, cpu_to_node(cpu));
+		if (!page)
+			goto free_pages;
+		list_add(&page->list, ...
From: Mathieu Desnoyers
Date: Wednesday, October 1, 2008 - 10:36 am

Hi Steven,

I understand that you want to allocate these struct buffer_page in
memory local to a given cpu node, which is great, but why do you feel
you need to align them on cache_line_size() ?

Hrm.. you put the timestamp in there, so I guess you're concerned about
having a writer on one CPU, a reader on another, and the fact that you
will have cache line bouncing because of that.

Note that if you put the timestamp and the unused bytes in a tiny header
at the beginning of the page, you

1 - make this information directly accessible for disk, network I/O
without any other abstraction layer.
2 - won't have to do such alignment on the struct buffer_page, because
it will only be read once it's been allocated.

My 2 cents ;)


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Wednesday, October 1, 2008 - 10:49 am

That was the approach I actually started with. But someone (I think
Peter) asked me to remove it.

Who knows, perhaps I can put it back. It's not that hard to do. This is
why I used BUF_PAGE_SIZE to determine the size of the buffer page.
Right now it BUF_PAGE_SIZE == PAGE_SIZE, but if we do add a header than
it will be BUF_PAGE_SIZE == PAGE_SIZE - sizeof(header)

-- Steve

--

From: Mathieu Desnoyers
Date: Wednesday, October 1, 2008 - 11:21 am

You could probably use alloc_pages_node instead here...


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Ingo Molnar
Date: Thursday, October 2, 2008 - 1:50 am

applied to tip/tracing/ftrace, with the extended changlog below - i 
think this commit warrants that extra mention.

	Ingo

--------------->
From da78331b4ced2763322d732ac5ba275965853bde Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt@goodmis.org>
Date: Wed, 1 Oct 2008 10:52:51 -0400
Subject: [PATCH] ftrace: type cast filter+verifier

The mmiotrace map had a bug that would typecast the entry from
the trace to the wrong type. That is a known danger of C typecasts,
there's absolutely zero checking done on them.

Help that problem a bit by using a GCC extension to implement a
type filter that restricts the types that a trace record can be
cast into, and by adding a dynamic check (in debug mode) to verify
the type of the entry.

This patch adds a macro to assign all entries of ftrace using the type
of the variable and checking the entry id. The typecasts are now done
in the macro for only those types that it knows about, which should
be all the types that are allowed to be read from the tracer.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/trace/trace.c           |   85 ++++++++++++++++++++++++++++------------
 kernel/trace/trace.h           |   42 ++++++++++++++++++++
 kernel/trace/trace_mmiotrace.c |   14 +++++--
 3 files changed, 112 insertions(+), 29 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c163406..948f7d8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1350,7 +1350,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
 	}
 	switch (entry->type) {
 	case TRACE_FN: {
-		struct ftrace_entry *field = (struct ftrace_entry *)entry;
+		struct ftrace_entry *field;
+
+		trace_assign_type(field, entry);
 
 		seq_print_ip_sym(s, field->ip, sym_flags);
 		trace_seq_puts(s, " (");
@@ -1363,8 +1365,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
 	}
 	case TRACE_CTX:
 	case TRACE_WAKE: ...
From: Ingo Molnar
Date: Thursday, October 2, 2008 - 1:51 am

that was for the type filter commit. The 3 patches i've picked up into 
tip/tracing/ring-buffer are:

 b6eeea4: ftrace: preempt disable over interrupt disable
 52abc82: ring_buffer: allocate buffer page pointer
 da78331: ftrace: type cast filter+verifier

Thanks,

	Ingo
--

From: Ingo Molnar
Date: Thursday, October 2, 2008 - 2:05 am

trivial build fix below.

	Ingo

From 339ce9af3e6cbc02442b0b356c1ecb80a8ae92fb Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 2 Oct 2008 11:04:14 +0200
Subject: [PATCH] ring-buffer: fix build error
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

fix:

 kernel/trace/ring_buffer.c: In function ‘rb_allocate_pages’:
 kernel/trace/ring_buffer.c:235: error: ‘cpu’ undeclared (first use in this function)
 kernel/trace/ring_buffer.c:235: error: (Each undeclared identifier is reported only once
 kernel/trace/ring_buffer.c:235: error: for each function it appears in.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/trace/ring_buffer.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9814571..54a3098 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -232,7 +232,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 
 	for (i = 0; i < nr_pages; i++) {
 		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(cpu));
+				    GFP_KERNEL, cpu_to_node(i));
 		if (!page)
 			goto free_pages;
 		list_add(&page->list, &pages);
--

From: Ingo Molnar
Date: Thursday, October 2, 2008 - 2:38 am

ok, these latest ring-buffer updates cause more serious trouble, i just 
got this boot crash on a testbox:

[    0.324003] calling  tracer_alloc_buffers+0x0/0x14a @ 1
[    0.328008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[    0.332001] IP: [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc
[    0.332001] PGD 0 
[    0.332001] Oops: 0000 [1] SMP 
[    0.332001] CPU 0 
[    0.332001] Modules linked in:
[    0.332001] Pid: 1, comm: swapper Not tainted 2.6.27-rc8-tip-01064-gd163d6b-dirty #1
[    0.332001] RIP: 0010:[<ffffffff8027d28b>]  [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc
[    0.332001] RSP: 0018:ffff88003f9d7de0  EFLAGS: 00010287
[    0.332001] RAX: 0000000000000000 RBX: ffffffff80b08404 RCX: 0000000000000067
[    0.332001] RDX: 0000000000000004 RSI: 00000000000080d0 RDI: ffffffffffffffc0
[    0.332001] RBP: ffff88003f9d7e80 R08: ffff88003f8010b4 R09: 000000000003db02
[    0.332001] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88003f801600
[    0.332001] R13: 0000000000000004 R14: ffff88003f801580 R15: ffff88003f801618
[    0.332001] FS:  0000000000000000(0000) GS:ffffffff80a68280(0000) knlGS:0000000000000000
[    0.332001] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    0.332001] CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0
[    0.332001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.332001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.332001] Process swapper (pid: 1, threadinfo ffff88003f9d6000, task ffff88003f9d8000)
[    0.332001] Stack:  ffff88003f9d7df0 ffff88003f9d7e40 0000000000000283 ffffffff80b08404
[    0.332001]  ffffffff80b08404 ffff88003f801598 0000000000000000 ffff88003f801598
[    0.332001]  ffff88003f801580 0000016000000000 ffff88003f801600 ffff88003f9a2a40
[    0.332001] Call Trace:
[    0.332001]  [<ffffffff80a95f41>] ? tracer_alloc_buffers+0x0/0x14a
[    0.332001]  [<ffffffff80a95f67>] tracer_alloc_buffers+0x26/0x14a
[   ...
From: Steven Rostedt
Date: Thursday, October 2, 2008 - 6:16 am

The above "preempt disable" is the most likely culprit.  I'm trying to get 
towards an interrupt disabled free and lockless code path. But in doing 
so, one must be extra careful. This is why I'm taking baby steps towards 
this approach. Any little error in one of these steps, and you have race 
conditions biting you.

The above replaces interrupt disables with preempt disables, uses the
atomic data disable to protect against reentrancy. But this could also 
have opened up a bug that was not present with the interrupts disabled 

Hmm, that was a trivial patch. Perhaps the trivial ones are the ones that 
are most likely to be error prone. A developer will take much more care in 
developing a patch that is complex than something he sees as trivial ;-)

--

From: Steven Rostedt
Date: Thursday, October 2, 2008 - 6:17 am

-ENOATTACHMENT

-- Steve

--

From: Ingo Molnar
Date: Thursday, October 2, 2008 - 8:50 am

attached.

You can get the broken tree by doing this in tip/master:

  git-merge tip/tracing/ring-buffer

	Ingo
From: Steven Rostedt
Date: Thursday, October 2, 2008 - 11:27 am

I've just checked-out tip/tracing/ring-buffer. That tree is still broken 
too, right? Or do I need to merge it to get the broken version?

-- Steve

--

From: Ingo Molnar
Date: Thursday, October 2, 2008 - 11:55 am

yes, that's very likely broken too in a standalone way as well - but to 
get the exact tree i tested i'd suggest:

  git checkout tip/master
  git merge tip/tracing/ring-buffer

	Ingo
--

From: Steven Rostedt
Date: Thursday, October 2, 2008 - 4:18 pm

My original patch had a compile bug when NUMA was configured. I
referenced cpu when it should have been cpu_buffer->cpu.

Ingo quickly fixed this bug by replacing cpu with 'i' because that
was the loop counter. Unfortunately, the 'i' was the counter of
pages, not CPUs. This caused a crash when the number of pages allocated
for the buffers exceeded the number of pages, which would usually
be the case.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 kernel/trace/ring_buffer.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-tip.git/kernel/trace/ring_buffer.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ring_buffer.c	2008-10-02 09:09:01.000000000 -0400
+++ linux-tip.git/kernel/trace/ring_buffer.c	2008-10-02 18:58:44.000000000 -0400
@@ -232,7 +232,7 @@ static int rb_allocate_pages(struct ring
 
 	for (i = 0; i < nr_pages; i++) {
 		page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(i));
+				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
 		if (!page)
 			goto free_pages;
 		list_add(&page->list, &pages);


--

From: Steven Rostedt
Date: Thursday, October 2, 2008 - 4:36 pm

That should have been:

 "when the number of pages allocated for the buffers exceeded the number

-- Steve

--

From: Mathieu Desnoyers
Date: Thursday, October 2, 2008 - 9:56 pm

Declare NUMA-less cpu_to_node with a check that the cpu parameter exists so
people without NUMA test configs (namely Steven Rostedt and myself who ran into
this error both in the same day with different implementations) stop doing this
trivial mistake.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
---
 include/asm-x86/topology.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6-lttng/include/asm-x86/topology.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/topology.h	2008-10-03 00:37:05.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/topology.h	2008-10-03 00:45:52.000000000 -0400
@@ -182,9 +182,9 @@ extern int __node_distance(int, int);
 
 #else /* !CONFIG_NUMA */
 
-#define numa_node_id()		0
-#define	cpu_to_node(cpu)	0
-#define	early_cpu_to_node(cpu)	0
+#define	numa_node_id()		0
+#define	cpu_to_node(cpu)	((void)(cpu),0)
+#define	early_cpu_to_node(cpu)	cpu_to_node(cpu)
 
 static inline const cpumask_t *_node_to_cpumask_ptr(int node)
 {
-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Thursday, October 2, 2008 - 10:20 pm

Actually the proper way would be to have:

static inline int cpu_to_node(int cpu)
{
	return 0;
}

static inline int early_cpu_to_node(int cpu)
{
	return 0;
}

This way you also get typechecks.

--

From: Mathieu Desnoyers
Date: Friday, October 3, 2008 - 8:56 am

That's how I did it first, but then I looked at asm-generic/topology.h
and have seen it uses #defines. Should we change them too ?


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Friday, October 3, 2008 - 9:26 am

The old way of doing this is with defines. But all new code should be 
static inline functions when feasible. This way we can get typechecking 
on the parameters even when the configuration is disabled.

Even if the rest of the file uses defines, the new code should be
static inlines. Eventually, even the old defines will be converted.

-- Steve

--

From: Mathieu Desnoyers
Date: Friday, October 3, 2008 - 10:21 am

Argh, I think topology.h is utterly broken :-(

Have you noticed the subtile interaction between the

include/asm-x86/topology.h :

#define numa_node_id()          0
#define cpu_to_node(cpu)        0
#define early_cpu_to_node(cpu)  0
...
#include <asm-generic/topology.h>


and
include/asm-generic/topology.h :
#ifndef cpu_to_node
#define cpu_to_node(cpu)        ((void)(cpu),0)
#endif

If any architecture decide for some reason to use a static inline rather
than a define, as currently done with node_to_first_cpu :

include/asm-x86/topology.h :
static inline int node_to_first_cpu(int node)
{
        return first_cpu(cpu_online_map);
}
...
#include <asm-generic/topology.h>

include/asm-generic/topology.h :
#ifndef node_to_first_cpu
#define node_to_first_cpu(node) ((void)(node),0)
#endif

(which will override the static inline !)

It results in an override of the arch-specific version. Nice eh ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Friday, October 3, 2008 - 10:54 am

Seems that they expect cpu_to_node to be a macro if NUMA is not 
configured.

Actually, since the asm-generic/topology.h does have the cpu shown 
(although not in inline format), the solution here is to simply remove
the

#define cpu_to_node() 0

And we can still make the early_cpu_to_node a static inline since it is 
not referenced in the generic code.

-- Steve

--

From: Mathieu Desnoyers
Date: Friday, October 3, 2008 - 11:53 am

Or we take a deep breath and clean this up ?

Ingo, I build tested this on x86_64 (with and without NUMA), x86_32,
powerpc, arm and mips. I applies to both -tip and 2.6.27-rc8. Could it
be pulled into -tip for further testing ?

Note that checkpatch.pl spills a warning telling me to modify include/asm-*/
files (unexisting in my tree) rather than arch/*/include/asm/. Any idea
why ?

Thanks,

Mathieu


topology.h define mess fix

Original goal : Declare NUMA-less cpu_to_node with a check that the cpu
parameter exists so people without NUMA test configs (namely Steven Rostedt and
myself who ran into this error both in the same day with different
implementations) stop doing this trivial mistake.

End result :

Argh, I think topology.h is utterly broken :-(

Have you noticed the subtile interaction between the

include/asm-x86/topology.h :

#define numa_node_id()          0
#define cpu_to_node(cpu)        0
#define early_cpu_to_node(cpu)  0
...
#include <asm-generic/topology.h>


and
include/asm-generic/topology.h :
#ifndef cpu_to_node
#define cpu_to_node(cpu)        ((void)(cpu),0)
#endif

If any architecture decide for some reason to use a static inline rather
than a define, as currently done with node_to_first_cpu :

include/asm-x86/topology.h :
static inline int node_to_first_cpu(int node)
{
        return first_cpu(cpu_online_map);
}
...
#include <asm-generic/topology.h>

include/asm-generic/topology.h :
#ifndef node_to_first_cpu
#define node_to_first_cpu(node) ((void)(node),0)
#endif

(which will override the static inline !)

It results in an override of the arch-specific version. Nice eh ?

This patch fixes this issue by declaring static inlines in
asm-generic/topology.h and by requiring a _complete_ override of the
topology functions when an architecture needs to override them. An
architecture overriding the topology functions should not include
asm-generic/topology.h anymore.

- alpha needs careful checking, as it did not implement ...
From: Luck, Tony
Date: Friday, October 3, 2008 - 1:14 pm

> - Major cross-architecture built test is required.

Some problems on ia64.  With defconfig build (which has
CONFIG_NUMA=y) I see this:

kernel/sched.c: In function 'find_next_best_node':
kernel/sched.c:6920: error: implicit declaration of function 'node_to_cpumask_ptr'
kernel/sched.c:6920: error: '__tmp__' undeclared (first use in this function)
kernel/sched.c:6920: error: (Each undeclared identifier is reported only once
kernel/sched.c:6920: error: for each function it appears in.)
kernel/sched.c: In function 'sched_domain_node_span':
kernel/sched.c:6952: error: 'nodemask' undeclared (first use in this function)
kernel/sched.c:6953: warning: ISO C90 forbids mixed declarations and code
kernel/sched.c:6964: error: implicit declaration of function 'node_to_cpumask_ptr_next'
kernel/sched.c: In function '__build_sched_domains':
kernel/sched.c:7510: error: 'pnodemask' undeclared (first use in this function)

On an "allnoconfig" build (which curiously also has CONFIG_NUMA=y :-) I see

mm/page_alloc.c: In function 'find_next_best_node':
mm/page_alloc.c:2086: error: implicit declaration of function 'node_to_cpumask_ptr'
mm/page_alloc.c:2086: error: 'tmp' undeclared (first use in this function)
mm/page_alloc.c:2086: error: (Each undeclared identifier is reported only once
mm/page_alloc.c:2086: error: for each function it appears in.)
mm/page_alloc.c:2107: error: implicit declaration of function 'node_to_cpumask_ptr_next'

There are most probably more errors ... but this is where the build stopped.

-Tony
--

From: Mathieu Desnoyers
Date: Friday, October 3, 2008 - 3:47 pm

[...]

Ah, I did not select config "generic" for ia64, and thus did not get
CONFIG_NUMA. Here is a v2 which fixes this.

Thanks for testing this.

Mathieu


topology.h define mess fix v2

Update : build fix for ia64 CONFIG_NUMA.

Original goal : Declare NUMA-less cpu_to_node with a check that the cpu
parameter exists so people without NUMA test configs (namely Steven Rostedt and
myself who ran into this error both in the same day with different
implementations) stop doing this trivial mistake.

End result :

Argh, I think topology.h is utterly broken :-(

Have you noticed the subtile interaction between the

include/asm-x86/topology.h :

#define numa_node_id()          0
#define cpu_to_node(cpu)        0
#define early_cpu_to_node(cpu)  0
...
#include <asm-generic/topology.h>


and
include/asm-generic/topology.h :
#ifndef cpu_to_node
#define cpu_to_node(cpu)        ((void)(cpu),0)
#endif

If any architecture decide for some reason to use a static inline rather
than a define, as currently done with node_to_first_cpu :

include/asm-x86/topology.h :
static inline int node_to_first_cpu(int node)
{
        return first_cpu(cpu_online_map);
}
...
#include <asm-generic/topology.h>

include/asm-generic/topology.h :
#ifndef node_to_first_cpu
#define node_to_first_cpu(node) ((void)(node),0)
#endif

(which will override the static inline !)

It results in an override of the arch-specific version. Nice eh ?

This patch fixes this issue by declaring static inlines in
asm-generic/topology.h and by requiring a _complete_ override of the
topology functions when an architecture needs to override them. An
architecture overriding the topology functions should not include
asm-generic/topology.h anymore.

- alpha needs careful checking, as it did not implement parent_node nor
  node_to_first_cpu previously.
- Major cross-architecture built test is required.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt ...
From: Ingo Molnar
Date: Friday, October 3, 2008 - 12:27 am

oh, stupid typo of the year :-)

applied to tip/tracing/ring-buffer, thanks for tracking it down! I've 
reactivated the topic branch for tip/master and i'm running a few tests 
before pushing it out for wider testing.

	Ingo
--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 2:06 am

I'm somewhat at a loss here because I'm unable to find any version of
kernel/trace/trace.c which looks anything like the one which is being

Why was this code using a cast in the first place?  It should be using
entry->some_field_i_dont_have_here?  That was the whole point in using 
the anonymous union in struct trace_entry?

--

From: Ingo Molnar
Date: Thursday, October 2, 2008 - 2:41 am

it's in tip/tracing/ring-buffer (also tip/master), but we are still 
working on it (i just triggered a crash with it) so i havent pushed it 

this whole mega-thread was about removing that union and turning the 
tracer into a type-opaque entity. I warned about the inevitable 
fragility - but with this type filter approach the risks should be 
substantially lower.

	Ingo
--

From: Steven Rostedt
Date: Thursday, October 2, 2008 - 6:06 am

As Ingo mentioned, you don't have this yet. And be happy that you don't 
;-)


Because the ring_buffer now allows for variable length entries, having a 
one size fits all entry is not optimal.

But because C is not the best for typecasting, we have this macro to help 
solve the issue. Instead of registering everything into a single union and 
causing small fields to be large, we have a macro you can register your
type with instead.

-- Steve

--

From: Peter Zijlstra
Date: Tuesday, September 30, 2008 - 10:00 am

The list_head in the page frame should be available regardless of
splice() stuffs.

--

From: Steven Rostedt
Date: Tuesday, September 30, 2008 - 10:41 am

Regardless, there's more info we want to store for each page than the list 
head.  Especially when we start converting this to lockless. I rather get 
out of the overlaying of the page frames, its nice to save the space, but 
really scares the hell out of me. I can just imagine this blowing up if we 
redo the paging, and I dislike this transparent coupling between the 
tracer buffer and the pages.

-- Steve

--

From: Peter Zijlstra
Date: Tuesday, September 30, 2008 - 10:49 am

The problem with storing the page link information inside the page is
that it doesnt transfer to another address space, so if you do indeed
mmap these pages, then the link information is bogus.

Of course, in such a situation you could ignore these headers, but
somehow that doesn't sound too apealing.

--

From: Steven Rostedt
Date: Tuesday, September 30, 2008 - 10:56 am

No that's not what I'm proposing. I'm proposing to allocate a page_header 
structure for every page we alloc, and make a link list of them.
In other words:


struct ring_buffer_per_cpu {
	[...]
	struct list_head pages;
	[...]
};

struct buffer_page {
	[...];
	void *page;
	struct list_head list;
	[...];
};

In ring_buffer_allocate_cpu:

	struct buffer_page *bpage;
	struct unsigned long addr;

	[...]

	for every page() {
		bpage = kzalloc(sizeof(*bpage), GFP_KERNEL);
		addr = get_free_page();
		bpage->page = (void *)addr;
		list_add(&bpage->list, &cpu_buffer->pages);
	}


Obviously need to add the error checking, but you get the idea. Here I do 
not need to change any of the later logic, because we are still dealing 
with the buffer_page. I only need to update way to index the page which is 
already encapsulated in its own function.

-- Steve


--

From: Steven Rostedt
Date: Tuesday, September 30, 2008 - 11:02 am

Of course we would not be declaring a "struct unsigned long" ;-)

-- Steve

--

From: Arnaldo Carvalho de Melo
Date: Friday, September 26, 2008 - 3:31 pm

Nitpick: Why cast to void *?

And sometimes you use the rb_ prefix, in other cases you use the longer
for ring_buffer_, is the ring_ namespace already used? Or can we make it
use rb_ consistently to shorten the names?

- Arnaldo
--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 4:58 pm

I started using the rb_ because I was constantly breaking the 80 character 
line limit with ring_buffer ;-)   OK, for v8, I'll rename all static 
internal functions to rb_ and keep the global ones ring_buffer_

Thanks,

-- Steve

--

From: Linus Torvalds
Date: Friday, September 26, 2008 - 5:13 pm

It would probably be better to use something else than 'rb_', because that 
prefix is already used by the red-black trees, and exported as such (eg 
"rb_next()" etc).

But at least as long as it's static, it's probably not _too_ noticeable if 
the rest of the names don't overlap. We _do_ include <linux/rbtree.h>  
almost everywhere, since we use those things in the VM, in timers etc, so 
it comes in through pretty much all headers.

			Linus
--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 5:23 pm

Well, I just compiled it and it didn't have any name collisions, but that 
doesn't mean that this wont change in the future.

What would you suggest?  buffer_ ? ringbuf_ ?

-- Steve

--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 5:28 pm

For kicks I just added #include <linux/rbtree.h> and it still passed. I 
don't think we'll be adding new functions to rbtree.h, so it may be 
fine to stay with the rb_ prefix.

-- Steve

--

Previous thread: [RFC PATCH 0/2 v3] Unified trace buffer by Steven Rostedt on Thursday, September 25, 2008 - 11:51 am. (1 message)

Next thread: [RFC PATCH 2/2 v3] ftrace: make work with new ring buffer by Steven Rostedt on Thursday, September 25, 2008 - 11:51 am. (1 message)