This is probably very buggy. I ran it as a back end for ftrace but only
tested the irqsoff and ftrace tracers. The selftests are busted with it.
But this is an attempt to get a unified buffering system that was
talked about at the LPC meeting.
Now that it boots and runs (albeit, a bit buggy), I decided to post it.
This is some idea that I had to handle this.
I tried to make it as simple as possible.
I'm not going to explain all the stuff I'm doing here, since this code
is under a lot of flux (RFC, POC work), and I don't want to keep updating
this change log. When we finally agree on something, I'll make this
change log worthy.
If you want to know what this patch does, the code below explains it :-p
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
include/linux/ring_buffer.h | 175 ++++++
kernel/trace/Kconfig | 3
kernel/trace/Makefile | 1
kernel/trace/ring_buffer.c | 1218 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 1397 insertions(+)
Index: linux-compile.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-compile.git/include/linux/ring_buffer.h 2008-09-25 13:59:09.000000000 -0400
@@ -0,0 +1,175 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+ u32 type:2, len:3, time_delta:27;
+ u32 array[];
+} __attribute__((__packed__));
+
+enum {
+ RB_TYPE_PADDING, /* Left over page padding
+ * array is ignored
+ * size is variable depending on
+ */
+ RB_TYPE_TIME_EXTENT, /* Extent the time delta
+ * array[0] = time delta (28 .. 59)
+ * size = 8 bytes
+ */
+ /* FIXME: RB_TYPE_TIME_STAMP not implemented */
+ RB_TYPE_TIME_STAMP, /* Sync time stamp with ...This version has been cleaned up a bit. I've been running it as
a back end to ftrace, and it has been handling pretty well.
I did not implement the GTOD sync part and will leave that for later.
But this is the basic design that I like and will be the basis
of my future work.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
include/linux/ring_buffer.h | 178 ++++++
kernel/trace/Kconfig | 4
kernel/trace/Makefile | 1
kernel/trace/ring_buffer.c | 1252 ++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 1435 insertions(+)
Index: linux-trace.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-trace.git/include/linux/ring_buffer.h 2008-09-25 20:36:12.000000000 -0400
@@ -0,0 +1,178 @@
+#ifndef _LINUX_RING_BUFFER_H
+#define _LINUX_RING_BUFFER_H
+
+#include <linux/mm.h>
+#include <linux/seq_file.h>
+
+struct ring_buffer;
+struct ring_buffer_iter;
+
+/*
+ * Don't reference this struct directly, use the inline items below.
+ */
+struct ring_buffer_event {
+ u32 type:2, len:3, time_delta:27;
+ u32 array[];
+} __attribute__((__packed__));
+
+enum {
+ RB_TYPE_PADDING, /* Left over page padding
+ * array is ignored
+ * size is variable depending on
+ */
+ RB_TYPE_TIME_EXTENT, /* Extent the time delta
+ * array[0] = time delta (28 .. 59)
+ * size = 8 bytes
+ */
+ /* FIXME: RB_TYPE_TIME_STAMP not implemented */
+ RB_TYPE_TIME_STAMP, /* Sync time stamp with external clock
+ * array[0] = tv_nsec
+ * array[1] = tv_sec
+ * size = 16 bytes
+ */
+
+ RB_TYPE_DATA, /* Data record
+ * If len is zero:
+ * array[0] holds the actual length
+ * array[1..(length+3)/4] holds data
+ * else
+ * length = len << 2
+ * array[0..(length+3)/4] holds data
+ */
+};
+
+#define RB_EVNT_HDR_SIZE (sizeof(struct ring_buffer_event))
+#define ...Hi Steven,
Thank you for your great work.
It seems good to me(especially, encapsulating events :)).
I have one request of enhancement.
> +static struct ring_buffer_per_cpu *
> +ring_buffer_allocate_cpu_buffer(struct ring_buffer *buffer, int cpu)
> +{
[...]
> + cpu_buffer->pages = kzalloc_node(ALIGN(sizeof(void *) * pages,
> + cache_line_size()), GFP_KERNEL,
> + cpu_to_node(cpu));
Here, you are using a slab object for page managing array,
the largest object size is 128KB(x86-64), so it can contain
16K pages = 64MB.
As I had improved relayfs, in some rare case(on 64bit arch),
we'd like to use larger buffer than 64MB.
http://sourceware.org/ml/systemtap/2008-q2/msg00103.html
So, I think similar hack can be applicable.
Would it be acceptable for the next version?
Thank you,
--
Masami Hiramatsu
Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division
e-mail: mhiramat@redhat.com
--
I would like to avoid using vmalloc as much as possible, but I do see the limitation here. Here's my compromise. Instead of using vmalloc if the page array is greater than one page, how about using vmalloc if the page array is greater than KMALLOC_MAX_SIZE? This would let us keep the vmap area free unless we have no choice. -- Steve --
Hmm, that's a good idea. In most cases, per-cpu buffer may be less than 64MB, so I think it is reasonable. -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com --
You could also fallback on a 2-level page array when buffer size is > 64MB. The cost is mainly a supplementary pointer dereference, but one more should not make sure a big difference overall. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I'm still not sure why we don't just link the pages using the page frames, we don't need the random access, do we? --
Yeah we can go back to that (as ftrace does). 1) It can be very error prone. I will need to encapsulate the logic more. 2) I'm still not sure if crash can handle it. I was going to reply to Masami with this answer, but it makes things more complex. For v1 (non RFC v1) I wanted to start simple. v2 can have this enhancement. -- Steve --
It ought to, and if it can't it should be fixed. Having easy access to the pageframes is vital to debugging VM issues. So I'd not bother about Right - I just object to having anything vmalloc. --
I just requested that the expansion of buffer size limitation too. :) I don't stick with vmalloc. If that (page frame chain?) can achieve better performance, I agree that trace buffer uses it. Thank you, -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com --
v5 is out with this implementation. It may or may not be better performance, but the difference is most likely negligible. Anyway, I'm happing with this last release, and hopefully it can get into 2.6.28. This would mean I can start basing ftrace on top of it. -- Steve --
Hmm, but this does make changing the buffer size much easier. I'll think about it and perhaps try it out. If I can tidy it up nicer than the ftrace code, then I may include it for v1. -- Steve --
Yes, that's a brilliant idea :) Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
[
Note the removal of the RFC in the subject.
I am happy with this version. It handles everything I need
for ftrace.
New since last version:
- Fixed timing bug. I did not add the deltas properly when
reading the buffer.
- Removed "-1" time stamp normalize test. This made the
clock go backwards!
- Removed page pointer array and replaced it with the ftrace
page struct link list trick. Since this is my second time
writing this code (first with ftrace), it is actually much
cleaner than the ftrace code.
- Implemented buffer resizing. By using the page link list trick,
this became much simpler.
Note, the GOTD part is still not implemented, but can be done
later without affecting this interface.
]
This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.
The events recorded into the buffer have the following structure:
struct ring_buffer_event {
u32 type:2, len:3, time_delta:27;
u32 array[];
};
The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.
There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).
RB_TYPE_PADDING: this type is used to note extra space at the end
of a buffer page.
RB_TYPE_TIME_EXTENT: This type is used when the time between events
is greater than the 27 bit delta can hold. We add another
32 bits, and record that in its own event (8 byte size).
RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
help keep the buffer timestamps in sync.
RB_TYPE_DATA: The event actually holds user data.
The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.
Example, data size of 7 ...Why do you need __packed__ here? With or without it the layout is the
same:
[acme@doppio examples]$ pahole packed
struct ring_buffer_event {
u32 type:2; /* 0:30 4 */
u32 len:3; /* 0:27 4 */
u32 time_delta:27; /* 0: 0 4 */
u32 array[0]; /* 4 0 */
/* size: 4, cachelines: 1, members: 4 */
/* last cacheline: 4 bytes */
};
- Arnaldo
--
Indeed. And on some architectures 'packed' will actually mean that the compiler may think that it's unaligned, and then generate much worse code to access the fields. So if you align things anyway (and you do), then 'packed' is the wrong thing to do. Linus --
OK, I'm making v6 now with various cleanups. I'll nuke it on that one. -- Steve --
btw., now that it's getting into shape, could you please fix the ftrace ... to not have known bugs, so that we could try it in tip/ftrace and make sure it works well in practice? it's a ton of changes already, it would be nice to get to some stable known-working state and do delta patches from that point on, and keep its 'works well' quality. Ingo --
OK, the patch that I was using was against Linus's tree. I'll port it over to linux-tip on Monday and get it past the "proof of concept" stage. Actually, the verison I have on my desk works pretty well. The main issues to solve is that some other tracers and the self test stick their noses into the buffering system, which would need to be fixed. There's also some bugs in the status numbers printed in the latency_trace header. But I have not hit any bugs with the buffering itself. I'll clean all this up and send out a patch on Monday. My wife is mandating that I do not do anymore work over the weekend ;-) -- Steve --
[
Changes since v5:
- removed packed attribute from event structure.
- added parenthesis around ~TS_MASK
- fixed some comments in header
- fixed ret value on ring_buffer_write on errors.
- added check_pages when modifying the size of cpu buffers
]
This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.
The events recorded into the buffer have the following structure:
struct ring_buffer_event {
u32 type:2, len:3, time_delta:27;
u32 array[];
};
The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.
There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).
RB_TYPE_PADDING: this type is used to note extra space at the end
of a buffer page.
RB_TYPE_TIME_EXTENT: This type is used when the time between events
is greater than the 27 bit delta can hold. We add another
32 bits, and record that in its own event (8 byte size).
RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
help keep the buffer timestamps in sync.
RB_TYPE_DATA: The event actually holds user data.
The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.
Example, data size of 7 bytes:
type = RB_TYPE_DATA
len = 2
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0..1]: <7 bytes of data> <1 byte empty>
This event is saved in 12 bytes of the buffer.
An event with 82 bytes of data:
type = RB_TYPE_DATA
len = 0
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0]: 84 (Note the alignment)
array[1..14]: <82 bytes of data> <2 bytes empty>
The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 ...Forgive me if I've gotten this wrong but the terminology seems backwards Here, I would think we only throw away new data if the producer catches up with the consumer, if the consumer catches up with the producer we're Just trying to understand the terminology before I look at the code so I'm sorry if I have just completely misunderstood. -Richard Holden --
Argh! Yes. I'm the one that is backwards ;-) Yeah, that is what I meant. Don't you know? You are suppose to understand I always get confused with the translation of what the head/tail to producer/consumer. Here I have the producer adding to the tail, and the consumer reading from the head. Perhaps this is backwards? I could change it. s/head/foobar/g s/tail/head/g s/foobar/tail/g Sure, thanks. -- Steve --
Since you're already using the page frame, you can stick this per page
timestamp in there as well, and get the full page for data.
You can either use a struct page overlay like slob does, or add a u64 in
the union that contains struct {private, mapping}.
--
What did you guys think of Mathieu's idea of sticking the buffer length in the header here, rather than using padding events? Seemed cleaner to me. --
Actually I like the padding. This way when I move the event pointer forward, I only need to compare it to a constant (PAGE_SIZE), or test to see if the event is padding. Placing this into the buffer page, I will have to always compare it to that pointer. But I guess I could change it to that if needed. That doesn't affect the API, as it is only internal. I'm almost done with v7, perhaps I might try that with v8 to see if I like it better. -- Steve --
OK, I just implemented the size field in the page struct. Seems to work well. I'm still keeping the padded event, so in the future, if we ever map these pages to userspace or files, these holes will have a type. Will post later today, need to actually enter a real life for a bit. -- Steve --
actually nr_possible makes sense, and you might consider always allocating buffers (and keeping them for offlined cpus) to avoid massive allocations/frees cpu-hotplug events. Mike Travis has been going over the kernel removing constructs like this, and replacing them with dynamically allocated arrays of --
The other thing to consider is using a percpu variable. Cheers, Mike --
This structure is allocated on request. -- Steve --
Actually in this case I chose num_possible_cpus(). Reason being is that later I may add an interface to allow the user to select which CPUs they want to trace, and this will only allocate a subset of CPU buffers. (not going to implement that in the first release). But to lay the ground work, I set a buffers->cpumask to be that of all the cpus with buffers allocated. For now that mask is set to cpu_possible_map. Since num_possible_cpus() is defined as cpus_weight_nr(cpu_possible_map) I figured that was the better choice. -- Steve --
One problem though, it's *theoretically* possible for num_possible to be less than nr_cpu_ids and a cpu index may extend past the end of your allocated array. This would happen if the cpu indices are allocated some other way than as each cpu is discovered. For example, a system might want a group of cpus in one section (say by node, or socket) and then a hole in the cpu_possible_map until the next group. nr_cpu_ids is guaranteed to be the highest possible cpu + 1. Cheers, Mike --
Thanks for the explanation. I'll change buffer->cpus to be set to nr_cpu_ids. -- Steve --
Arjan, any preferences wrt kerneloops.org? --
On Fri, 26 Sep 2008 21:17:13 +0200 -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
You probably want synchronize_sched() here (and similar other places) to ensure any active writer on the corresponding cpu is actually stopped. Which suggests you want to use something like ring_buffer_lock_cpu() and --
Would it really be done in the buffer layer? I think it should be done by each tracer, because buffer layer can't ensure truly active writers have stopped. -- Masami Hiramatsu Software Engineer Hitachi Computer Products (America) Inc. Software Solutions Division e-mail: mhiramat@redhat.com --
Actually it can ;-) Since all writes to the buffer at least disable preemption, by issuing a synchronize_sched, we can guarantee that after disabling the record, all activity will be done. -- Steve --
[
Changes since v6:
- Added shift debug test to test both normalization of
timestamp, but also the large time deltas. ftrace records too quickly
to get large deltas :-/
- Fixed some minor issues with keeping track of time.
- used slob hack to put more information in the page struct and now
have the full buffer page free for data. Thanks to Peter Zijlstra
for suggesting the idea.
- have the buffer use a cpu mask (initialized to cpu_possible_map)
to allocate for cpu usage.
- fixed entries counting.
- use DIV_ROUND_UP macro (also suggested by Peter)
]
This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.
The events recorded into the buffer have the following structure:
struct ring_buffer_event {
u32 type:2, len:3, time_delta:27;
u32 array[];
};
The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.
There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).
RB_TYPE_PADDING: this type is used to note extra space at the end
of a buffer page.
RB_TYPE_TIME_EXTENT: This type is used when the time between events
is greater than the 27 bit delta can hold. We add another
32 bits, and record that in its own event (8 byte size).
RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
help keep the buffer timestamps in sync.
RB_TYPE_DATA: The event actually holds user data.
The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.
Example, data size of 7 bytes:
type = RB_TYPE_DATA
len = 2
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0..1]: <7 bytes of data> <1 byte empty>
This event is saved in 12 bytes of the ...[
Changes since v7:
- added the size of data in the page into the page frame.
Suggested by Martin Bligh and Mathieu Desnoyers
- Converted all static functions to be named with a rb_ prefix.
This may conflict with rbtree functions in the future, but if
this does happen, we will need to rename the functions in this
file. The rb_ prefixed functions here are all static, so it only
affects this code. Thanks to Arnaldo Carvalho de Melo.
- Added some synchronized_sched() where record_disabled is
incremented. There are other places that expect the caller
to handle it. Suggested by Peter Zijlstra.
- Use nr_cpu_ids for max cpu. Thanks to Mike Travis.
]
This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.
The events recorded into the buffer have the following structure:
struct ring_buffer_event {
u32 type:2, len:3, time_delta:27;
u32 array[];
};
The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.
There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).
RB_TYPE_PADDING: this type is used to note extra space at the end
of a buffer page.
RB_TYPE_TIME_EXTENT: This type is used when the time between events
is greater than the 27 bit delta can hold. We add another
32 bits, and record that in its own event (8 byte size).
RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
help keep the buffer timestamps in sync.
RB_TYPE_DATA: The event actually holds user data.
The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.
Example, data size of 7 bytes:
type = RB_TYPE_DATA
len = 2
time_delta: <time-stamp> - ...[
Changes since version 8:
Two major bug fixes!
- Had mix of referencing the pages link list with both
page->lru and buffer_page->list. Perhaps they luckily
were lined up. But I have no idea why this didn't totally
crash my box.
- Missed a write stamp update that would cause funny times
]
From: Steven Rostedt <srostedt@redhat.com>
Subject: Unified trace buffer
This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.
The events recorded into the buffer have the following structure:
struct ring_buffer_event {
u32 type:2, len:3, time_delta:27;
u32 array[];
};
The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.
There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).
RB_TYPE_PADDING: this type is used to note extra space at the end
of a buffer page.
RB_TYPE_TIME_EXTENT: This type is used when the time between events
is greater than the 27 bit delta can hold. We add another
32 bits, and record that in its own event (8 byte size).
RB_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
help keep the buffer timestamps in sync.
RB_TYPE_DATA: The event actually holds user data.
The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.
Example, data size of 7 bytes:
type = RB_TYPE_DATA
len = 2
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0..1]: <7 bytes of data> <1 byte empty>
This event is saved in 12 bytes of the buffer.
An event with 82 bytes of data:
type = RB_TYPE_DATA
len = 0
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0]: 84 (Note the alignment)
array[1..14]: <82 bytes of data> <2 ...small nitpicking review, nothing structural yet: hm, should not be raw, at least initially. I am 95% sure we'll see lockups, we always did when we iterated ftrace's buffer implementation please use consistent vertical whitespaces. Above, in the struct ring_buffer definition, you can add another tab to most of the vars - that will also make the '**buffers' line look nice. same for all structs across this file. In my experience, a 50% vertical please name it RINGBUFFER_BUG_ON() / RINGBUFFER_WARN_ON(), so that we dont have to memorize another set of debug names. [ See DEBUG_LOCKS_WARN_ON() in include/linux/debug_locks.h ] please apply ftrace's standard reverse christmas tree style and move the 'addr' initialization can move to the definition line - you save two ( optional:when there's a multi-line loop then i generally try to insert an extra newline when starting the body - to make sure the iterator this function is too long, please split it up. The first condition's please use standard comment style: /* * Comment */ Ingo --
Hi Ingo, Thanks for the review! Yeah, Linus pointed this out with the rb_ static function names. But since the functions are static I kept them as is. But here we have global names. Ah, I think I had it more complex and changed it to a literal without No biggy. I thought this would be nicer as inline. But I have no problem It was to prevent lockdep from checking the locks from inside. We had issues with ftroce and lockdep in the past, because ftrace would trace the internals of lockdep, and lockdep would then recurse back into itself to trace. If lockdep itself can get away with not using raw_spinlocks, then Heh, this was directly from a bug I had and laziness ;-) I originally just had struct list_head pages (and no *tmp), which kept the christmas tree format. But later found that you need to initialize list Hmm, this is interesting. I kind of like this because it is not really a standard comment. It is a comment about the definitions of the enum. I believe if they are above: /* * Comment */ RB_ENUM_TYPE, It is not as readable. But if we do: RB_ENUM_TYPE, /* * Comment */ The comment is not at the same line as the enum, which also looks unpleasing. We can't could do: /* RB_ENUM_TYPE, * Comment */ /* RB_ENUM_TYPE2, * Comment */ Because the ENUM is also in the comment :-p I chose this way because we have: RB_ENUM_TYPE, /* Comment * More comment */ RB_ENUM_TYPE2, /* Comment */ Since I find this the nices way to describe enums. That last */ is good to space the comments apart, otherwise we have: RB_ENUM_TYPE, /* Comment * More comment */ RB_ENUM_TYPE2, /* Comment */ That is not as easy to see the separation of one description of enums with the other. -- Steve --
that's even worse i think :-/ And this isnt bikeshed-painting really, the RNGBF_ name hurts my eyes and RB_ is definitely confusing to read. (as the rbtree constants are in capitals as well and similarly named) RING_TYPE_PADDING or: RINGBUF_TYPE_PADDING yes, way too big. Sometimes we make savings from a 10 bytes function already. (but it's always case dependent - if a function has a lot of parameters then uninlining can hurt) the only exception would be if there's normally only a single would be nice to make sure that ftrace's recursion checks work as intended - and the same goes for lockdep's recursion checks. Yes, we had problems in this area, and it would be nice to make sure it all works clarification: multi-line loop _condition_. It's pretty rare (this is such a case) but sometimes unavoidable - and then the newline helps So i suggested to fix it to: + RB_TYPE_TIME_EXTENT, /* + * Extent the time delta + * array[0] = time delta (28 .. 59) + * size = 8 bytes + */ ok? I.e. "comment" should have the same visual properties as other comments. I fully agree with moving it next to the enum, i sometimes use that style too, it's a nice touch and more readable in this case than comment-ahead. (which we use for statements) Ingo --
I don't mind the extra typing, it is just a bit more difficult to keep in
It is a hot path in the internals. Perhaps I'll make an inline function
in the interal code "rb_event_length" and have the other users call.
unsigned ring_buffer_event(struct ring_buffer_event *event)
{
return rb_event_length(event);
But then we have:
RB_TYPE_PADDING, /*
* Left over page padding
* array is ignored
* size is variable depending on
* how much padding is needed
*/
RB_TYPE_TIME_EXTENT, /*
* Extent the time delta
* array[0] = time delta (28 .. 59)
* size = 8 bytes
*/
Where it is not as easy to see which comment is with which enum.
Especially when you have many enums. That's why I like the method I used
with:
RB_TYPE_PADDING, /* Left over page padding
* array is ignored
* size is variable depending on
* how much padding is needed
*/
RB_TYPE_TIME_EXTENT, /* Extent the time delta
* array[0] = time delta (28 .. 59)
* size = 8 bytes
*/
Where it is very easy to notice which comment goes with which enum.
-- Steve
--
that's really not a hard limit, but yeah. oh, btw., that's a spelling mistake: s/extend/extend ? Ingo --
OK, I did a quick survey of what others did in include/linux to handle
multi line comments for enums. I ignored the single line comments since
that is pretty standard. Here's what I found:
Those that do:
enum myenum {
ENUM_PING_PONG, /* Bounce a ball back and forth
till you have a winner. */
ENUM_HONEY_CONE, /* Soft and sweet a yummy for
the tummy. */
};
include/linux/atmdev.h
include/linux/fd.h
include/linux/hil.h
include/linux/if_pppol2tp.h
include/linux/ivtv.h
include/linux/libata.h
include/linux/mmzone.h
include/linux/reiserfs_fs.h
include/linux/reiserfs_fs_sb.h
include/linux/rtnetlink.h
include/linux/scc.h
include/linux/videodev2.h
Those that do:
enum myenum {
ENUM_PING_PONG, /* Bounce a ball back and forth */
/* till you have a winner. */
ENUM_HONEY_CONE, /* Soft and sweet a yummy for */
/* the tummy. */
};
include/linux/atmsvc.h
include/linux/pktcdvd.h
Those that do (what I did):
enum myenum {
ENUM_PING_PONG, /* Bounce a ball back and forth
* till you have a winner.
*/
ENUM_HONEY_CONE, /* Soft and sweet a yummy for
* the tummy.
*/
};
include/linux/buffer_head.h (with space between the two enums)
include/linux/personality.h
Those that do:
enum myenum {
/*
* Bounce a ball back and forth
* till you have a winner.
*/
ENUM_PING_PONG,
/*
* Soft and sweet a yummy for
* the tummy.
*/
ENUM_HONEY_CONE,
};
include/linux/cgroup.h
include/linux/cn_proc.h
include/linux/exportfs.h
include/linux/fb.h
include/linux/hil_mlc.h
include/linux/pci.h
include/linux/reiserfs_fs_i.h
And finally Doc book style:
/**
* enum myenum
* @ENUM_PING_PONG: Bounce a ball back and forth
* till you have a winner.
* @ENUM_HONEY_CONE: Soft and ...Would using tb_ (trace buffer) rather than rb_ help ? --
excellent idea ... Ingo --
[
This is the final version of this patch. From now on, I will be sending
changes on top of this patch.
Changes since v9:
All suggestions from Ingo Molnar.
- Changed comment of enum to DocBook style.
- Replaced the RB_TYPE_ enums with RINGBUF_TYPE_ prefixes to avoid
name collision with rbtree. Note, I did not use the TB_ extension
because I envision a "trace_buffer" layer on top of this layer
in the future.
- Moved ring_buffer_event_{length,data} into the .c file and added
internal inlines. External uses will need to call the function.
- Broke out rb_add_time_stamp function from rb_reserve_next_event.
- made the cpu_buffer->lock back to a normal spin lock.
- The rest are style changes.
]
This is a unified tracing buffer that implements a ring buffer that
hopefully everyone will eventually be able to use.
The events recorded into the buffer have the following structure:
struct ring_buffer_event {
u32 type:2, len:3, time_delta:27;
u32 array[];
};
The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.
There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).
RINGBUF_TYPE_PADDING: this type is used to note extra space at the end
of a buffer page.
RINGBUF_TYPE_TIME_EXTENT: This type is used when the time between events
is greater than the 27 bit delta can hold. We add another
32 bits, and record that in its own event (8 byte size).
RINGBUF_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
help keep the buffer timestamps in sync.
RINGBUF_TYPE_DATA: The event actually holds user data.
The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.
Example, data size of 7 ...Ingo, I will add this patch to my linux-tip and then I will start porting ftrace over to it in an incremental fashion. -- Steve --
* Steven Rostedt (rostedt@goodmis.org) wrote: Hi Steven, You should have a look at mm/slob.c free_slob_page(). I think your page free will generate a "bad_page" call due to mapping != NULL and mapcount != 0. I just ran into this in my own code. :) Regards, Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Hi Mathieu! Thanks! I must have been lucky some how not to trigger this :-/ I'll add an update patch for this. -- Steve --
My guess is that you never free your buffers in your test cases. I don't know if it was expected; probably not if your code is built into the kernel. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Actually my resize does free the buffers and I did test this. I probably never ran the trace when testing the freeing which means those pointers could have luckily not have been changed. -- Steve --
I also got some corruption of the offset field in the struct page I use.
I think it might be related to the fact that I don't set the PG_private
bit (slob does set it when the page is in its free pages list). However,
given I'd like to pass the buffer pages to disk I/O and for network
socket and still keep the ability to re-use it when the I/O has been
performed, I wonder where I should put my
struct list_head list; /* linked list of buf pages */
size_t offset; /* page offset in the buffer */
fields ? Any ideas ?
They are currently in :
struct buf_page {
union {
struct {
unsigned long flags; /* mandatory */
atomic_t _count; /* mandatory */
union { /* mandatory */
atomic_t _mapcount;
struct {
u16 inuse;
u16 objects;
};
};
struct list_head list; /* linked list of buf pages */
size_t offset; /* page offset in the buffer */
};
struct page page;
};
};
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Ah, I believe the disk IO uses the page frame. That might be a bit more difficult to pass the data to disk and still keep information on the page frame. -- Steve --
On Tue, 30 Sep 2008 00:00:11 -0400 (EDT) Perhaps I'm speaking out of turn, but I have to wonder: am I the only one who gets uncomfortable looking at these hacks to overload struct page? It seems fragile as all hell; woe to he who tries to make a change to struct page someday and has to track all of this stuff down. Are the savings gained by using struct page this way really worth the added complexity? jon --
Its not that complex IMHO, the thing that is ugly are those struct page overloads, what we could do is try and sanitize the regular struct page and pull all these things in. Because the only reason people are doing these overloads is because struct page in mm_types.h is becomming an unreadable mess. Trouble is, looking at it I see no easy way out, --
That's not the scary part. The scary part is that somebody may well want to access the trace buffer pages in complex ways. If you mmap them, for example, you can use VM_PFNMAP to make sure that nobody should ever look at the "struct page", but if you want to do things like direct-to-disk IO on the trace pages (either with splice() or with The "unreadable mess" has exactly the same issues, though: people need to realize that when you overload fields in the page structure, you can then NEVER EVER use those pages for any other thing. For the internal VM code, that's ok. The VM knows that a page is either an anonymous page or a file mapping etc, and the overloading wrt mm_types.h is explicit. The same goes for SL*B, although it does the overloading differently. Trace buffers are different, though. Do people realize that doing the overloading means that you never EVER can use those buffers for anything else? Do people realize that it means that splice() and friends are out of Quite frankly, we could just put it at the head of the page itself. Having a "whole page" for the trace data is not possible anyway, since the trace header itself will always eat 8 bytes. And I do think it would potentially be a better model. Or at least safer. Linus --
Actually, looking at the code, there is no reason I need to keep this in the frame buffer itself. I've also encapsulated the accesses to the incrementing of the pointers so it would be trivial to try other approaches. The problem we had with the big array struct is that we can want large buffers and to do that with pointers means we would need to either come up with a large allocator or use vmap. But I just realized that I could also just make a link list of page pointers and do the exact same thing without having to worry about page frames. Again, the way I coded this up, it is quite trivial to replace the handling of the pages with other schemes. -- Steve --
That might be the best option. Yes, doing it in the 'struct page' itself is obviously going to save us some memory over having specially allocated page headers, but it's not like we'd expect to have _that_ many of these, and having a separate structure is actually good in that it also would make it simpler/clearer when/if you want to add larger pages (or other non-page allocations) into the mix. For example, if somebody really wants bigger areas, they can allocate them with vmalloc and/or multi-page allocations, and then add them as easily to the list of pages as if it was a normal page. Doing the same with playing tricks on 'struct page' would be pretty damn painful. Linus --
The current method of overlaying the page frame as the buffer page pointer
can be very dangerous and limits our ability to do other things with
a page from the buffer, like send it off to disk.
This patch allocates the buffer_page instead of overlaying the page's
page frame. The use of the buffer_page has hardly changed due to this.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/trace/ring_buffer.c | 54 ++++++++++++++++++++++++++-------------------
1 file changed, 32 insertions(+), 22 deletions(-)
Index: linux-tip.git/kernel/trace/ring_buffer.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ring_buffer.c 2008-10-01 09:37:23.000000000 -0400
+++ linux-tip.git/kernel/trace/ring_buffer.c 2008-10-01 11:03:16.000000000 -0400
@@ -115,16 +115,10 @@ void *ring_buffer_event_data(struct ring
* Thanks to Peter Zijlstra for suggesting this idea.
*/
struct buffer_page {
- union {
- struct {
- unsigned long flags; /* mandatory */
- atomic_t _count; /* mandatory */
- u64 time_stamp; /* page time stamp */
- unsigned size; /* size of page data */
- struct list_head list; /* list of free pages */
- };
- struct page page;
- };
+ u64 time_stamp; /* page time stamp */
+ unsigned size; /* size of page data */
+ struct list_head list; /* list of free pages */
+ void *page; /* Actual data page */
};
/*
@@ -133,9 +127,9 @@ struct buffer_page {
*/
static inline void free_buffer_page(struct buffer_page *bpage)
{
- reset_page_mapcount(&bpage->page);
- bpage->page.mapping = NULL;
- __free_page(&bpage->page);
+ if (bpage->page)
+ __free_page(bpage->page);
+ kfree(bpage);
}
/*
@@ -237,11 +231,16 @@ static int rb_allocate_pages(struct ring
unsigned i;
for (i = 0; i < nr_pages; i++) {
+ page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
+ GFP_KERNEL, cpu_to_node(cpu));
+ if (!page)
+ goto free_pages;
+ list_add(&page->list, ...Hi Steven, I understand that you want to allocate these struct buffer_page in memory local to a given cpu node, which is great, but why do you feel you need to align them on cache_line_size() ? Hrm.. you put the timestamp in there, so I guess you're concerned about having a writer on one CPU, a reader on another, and the fact that you will have cache line bouncing because of that. Note that if you put the timestamp and the unused bytes in a tiny header at the beginning of the page, you 1 - make this information directly accessible for disk, network I/O without any other abstraction layer. 2 - won't have to do such alignment on the struct buffer_page, because it will only be read once it's been allocated. My 2 cents ;) -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
That was the approach I actually started with. But someone (I think Peter) asked me to remove it. Who knows, perhaps I can put it back. It's not that hard to do. This is why I used BUF_PAGE_SIZE to determine the size of the buffer page. Right now it BUF_PAGE_SIZE == PAGE_SIZE, but if we do add a header than it will be BUF_PAGE_SIZE == PAGE_SIZE - sizeof(header) -- Steve --
You could probably use alloc_pages_node instead here... -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
applied to tip/tracing/ftrace, with the extended changlog below - i
think this commit warrants that extra mention.
Ingo
--------------->
From da78331b4ced2763322d732ac5ba275965853bde Mon Sep 17 00:00:00 2001
From: Steven Rostedt <rostedt@goodmis.org>
Date: Wed, 1 Oct 2008 10:52:51 -0400
Subject: [PATCH] ftrace: type cast filter+verifier
The mmiotrace map had a bug that would typecast the entry from
the trace to the wrong type. That is a known danger of C typecasts,
there's absolutely zero checking done on them.
Help that problem a bit by using a GCC extension to implement a
type filter that restricts the types that a trace record can be
cast into, and by adding a dynamic check (in debug mode) to verify
the type of the entry.
This patch adds a macro to assign all entries of ftrace using the type
of the variable and checking the entry id. The typecasts are now done
in the macro for only those types that it knows about, which should
be all the types that are allowed to be read from the tracer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/trace/trace.c | 85 ++++++++++++++++++++++++++++------------
kernel/trace/trace.h | 42 ++++++++++++++++++++
kernel/trace/trace_mmiotrace.c | 14 +++++--
3 files changed, 112 insertions(+), 29 deletions(-)
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c163406..948f7d8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1350,7 +1350,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
}
switch (entry->type) {
case TRACE_FN: {
- struct ftrace_entry *field = (struct ftrace_entry *)entry;
+ struct ftrace_entry *field;
+
+ trace_assign_type(field, entry);
seq_print_ip_sym(s, field->ip, sym_flags);
trace_seq_puts(s, " (");
@@ -1363,8 +1365,9 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
}
case TRACE_CTX:
case TRACE_WAKE: ...that was for the type filter commit. The 3 patches i've picked up into tip/tracing/ring-buffer are: b6eeea4: ftrace: preempt disable over interrupt disable 52abc82: ring_buffer: allocate buffer page pointer da78331: ftrace: type cast filter+verifier Thanks, Ingo --
trivial build fix below.
Ingo
From 339ce9af3e6cbc02442b0b356c1ecb80a8ae92fb Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 2 Oct 2008 11:04:14 +0200
Subject: [PATCH] ring-buffer: fix build error
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
fix:
kernel/trace/ring_buffer.c: In function ‘rb_allocate_pages’:
kernel/trace/ring_buffer.c:235: error: ‘cpu’ undeclared (first use in this function)
kernel/trace/ring_buffer.c:235: error: (Each undeclared identifier is reported only once
kernel/trace/ring_buffer.c:235: error: for each function it appears in.)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/trace/ring_buffer.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9814571..54a3098 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -232,7 +232,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
for (i = 0; i < nr_pages; i++) {
page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
- GFP_KERNEL, cpu_to_node(cpu));
+ GFP_KERNEL, cpu_to_node(i));
if (!page)
goto free_pages;
list_add(&page->list, &pages);
--
ok, these latest ring-buffer updates cause more serious trouble, i just got this boot crash on a testbox: [ 0.324003] calling tracer_alloc_buffers+0x0/0x14a @ 1 [ 0.328008] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 0.332001] IP: [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc [ 0.332001] PGD 0 [ 0.332001] Oops: 0000 [1] SMP [ 0.332001] CPU 0 [ 0.332001] Modules linked in: [ 0.332001] Pid: 1, comm: swapper Not tainted 2.6.27-rc8-tip-01064-gd163d6b-dirty #1 [ 0.332001] RIP: 0010:[<ffffffff8027d28b>] [<ffffffff8027d28b>] ring_buffer_alloc+0x207/0x3fc [ 0.332001] RSP: 0018:ffff88003f9d7de0 EFLAGS: 00010287 [ 0.332001] RAX: 0000000000000000 RBX: ffffffff80b08404 RCX: 0000000000000067 [ 0.332001] RDX: 0000000000000004 RSI: 00000000000080d0 RDI: ffffffffffffffc0 [ 0.332001] RBP: ffff88003f9d7e80 R08: ffff88003f8010b4 R09: 000000000003db02 [ 0.332001] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88003f801600 [ 0.332001] R13: 0000000000000004 R14: ffff88003f801580 R15: ffff88003f801618 [ 0.332001] FS: 0000000000000000(0000) GS:ffffffff80a68280(0000) knlGS:0000000000000000 [ 0.332001] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b [ 0.332001] CR2: 0000000000000008 CR3: 0000000000201000 CR4: 00000000000006e0 [ 0.332001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 0.332001] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 0.332001] Process swapper (pid: 1, threadinfo ffff88003f9d6000, task ffff88003f9d8000) [ 0.332001] Stack: ffff88003f9d7df0 ffff88003f9d7e40 0000000000000283 ffffffff80b08404 [ 0.332001] ffffffff80b08404 ffff88003f801598 0000000000000000 ffff88003f801598 [ 0.332001] ffff88003f801580 0000016000000000 ffff88003f801600 ffff88003f9a2a40 [ 0.332001] Call Trace: [ 0.332001] [<ffffffff80a95f41>] ? tracer_alloc_buffers+0x0/0x14a [ 0.332001] [<ffffffff80a95f67>] tracer_alloc_buffers+0x26/0x14a [ ...
The above "preempt disable" is the most likely culprit. I'm trying to get towards an interrupt disabled free and lockless code path. But in doing so, one must be extra careful. This is why I'm taking baby steps towards this approach. Any little error in one of these steps, and you have race conditions biting you. The above replaces interrupt disables with preempt disables, uses the atomic data disable to protect against reentrancy. But this could also have opened up a bug that was not present with the interrupts disabled Hmm, that was a trivial patch. Perhaps the trivial ones are the ones that are most likely to be error prone. A developer will take much more care in developing a patch that is complex than something he sees as trivial ;-) --
attached. You can get the broken tree by doing this in tip/master: git-merge tip/tracing/ring-buffer Ingo
I've just checked-out tip/tracing/ring-buffer. That tree is still broken too, right? Or do I need to merge it to get the broken version? -- Steve --
yes, that's very likely broken too in a standalone way as well - but to get the exact tree i tested i'd suggest: git checkout tip/master git merge tip/tracing/ring-buffer Ingo --
My original patch had a compile bug when NUMA was configured. I
referenced cpu when it should have been cpu_buffer->cpu.
Ingo quickly fixed this bug by replacing cpu with 'i' because that
was the loop counter. Unfortunately, the 'i' was the counter of
pages, not CPUs. This caused a crash when the number of pages allocated
for the buffers exceeded the number of pages, which would usually
be the case.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
kernel/trace/ring_buffer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-tip.git/kernel/trace/ring_buffer.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ring_buffer.c 2008-10-02 09:09:01.000000000 -0400
+++ linux-tip.git/kernel/trace/ring_buffer.c 2008-10-02 18:58:44.000000000 -0400
@@ -232,7 +232,7 @@ static int rb_allocate_pages(struct ring
for (i = 0; i < nr_pages; i++) {
page = kzalloc_node(ALIGN(sizeof(*page), cache_line_size()),
- GFP_KERNEL, cpu_to_node(i));
+ GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
if (!page)
goto free_pages;
list_add(&page->list, &pages);
--
That should have been: "when the number of pages allocated for the buffers exceeded the number -- Steve --
Declare NUMA-less cpu_to_node with a check that the cpu parameter exists so
people without NUMA test configs (namely Steven Rostedt and myself who ran into
this error both in the same day with different implementations) stop doing this
trivial mistake.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
---
include/asm-x86/topology.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Index: linux-2.6-lttng/include/asm-x86/topology.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86/topology.h 2008-10-03 00:37:05.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86/topology.h 2008-10-03 00:45:52.000000000 -0400
@@ -182,9 +182,9 @@ extern int __node_distance(int, int);
#else /* !CONFIG_NUMA */
-#define numa_node_id() 0
-#define cpu_to_node(cpu) 0
-#define early_cpu_to_node(cpu) 0
+#define numa_node_id() 0
+#define cpu_to_node(cpu) ((void)(cpu),0)
+#define early_cpu_to_node(cpu) cpu_to_node(cpu)
static inline const cpumask_t *_node_to_cpumask_ptr(int node)
{
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Actually the proper way would be to have:
static inline int cpu_to_node(int cpu)
{
return 0;
}
static inline int early_cpu_to_node(int cpu)
{
return 0;
}
This way you also get typechecks.
--
That's how I did it first, but then I looked at asm-generic/topology.h and have seen it uses #defines. Should we change them too ? -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
The old way of doing this is with defines. But all new code should be static inline functions when feasible. This way we can get typechecking on the parameters even when the configuration is disabled. Even if the rest of the file uses defines, the new code should be static inlines. Eventually, even the old defines will be converted. -- Steve --
Argh, I think topology.h is utterly broken :-(
Have you noticed the subtile interaction between the
include/asm-x86/topology.h :
#define numa_node_id() 0
#define cpu_to_node(cpu) 0
#define early_cpu_to_node(cpu) 0
...
#include <asm-generic/topology.h>
and
include/asm-generic/topology.h :
#ifndef cpu_to_node
#define cpu_to_node(cpu) ((void)(cpu),0)
#endif
If any architecture decide for some reason to use a static inline rather
than a define, as currently done with node_to_first_cpu :
include/asm-x86/topology.h :
static inline int node_to_first_cpu(int node)
{
return first_cpu(cpu_online_map);
}
...
#include <asm-generic/topology.h>
include/asm-generic/topology.h :
#ifndef node_to_first_cpu
#define node_to_first_cpu(node) ((void)(node),0)
#endif
(which will override the static inline !)
It results in an override of the arch-specific version. Nice eh ?
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Seems that they expect cpu_to_node to be a macro if NUMA is not configured. Actually, since the asm-generic/topology.h does have the cpu shown (although not in inline format), the solution here is to simply remove the #define cpu_to_node() 0 And we can still make the early_cpu_to_node a static inline since it is not referenced in the generic code. -- Steve --
Or we take a deep breath and clean this up ?
Ingo, I build tested this on x86_64 (with and without NUMA), x86_32,
powerpc, arm and mips. I applies to both -tip and 2.6.27-rc8. Could it
be pulled into -tip for further testing ?
Note that checkpatch.pl spills a warning telling me to modify include/asm-*/
files (unexisting in my tree) rather than arch/*/include/asm/. Any idea
why ?
Thanks,
Mathieu
topology.h define mess fix
Original goal : Declare NUMA-less cpu_to_node with a check that the cpu
parameter exists so people without NUMA test configs (namely Steven Rostedt and
myself who ran into this error both in the same day with different
implementations) stop doing this trivial mistake.
End result :
Argh, I think topology.h is utterly broken :-(
Have you noticed the subtile interaction between the
include/asm-x86/topology.h :
#define numa_node_id() 0
#define cpu_to_node(cpu) 0
#define early_cpu_to_node(cpu) 0
...
#include <asm-generic/topology.h>
and
include/asm-generic/topology.h :
#ifndef cpu_to_node
#define cpu_to_node(cpu) ((void)(cpu),0)
#endif
If any architecture decide for some reason to use a static inline rather
than a define, as currently done with node_to_first_cpu :
include/asm-x86/topology.h :
static inline int node_to_first_cpu(int node)
{
return first_cpu(cpu_online_map);
}
...
#include <asm-generic/topology.h>
include/asm-generic/topology.h :
#ifndef node_to_first_cpu
#define node_to_first_cpu(node) ((void)(node),0)
#endif
(which will override the static inline !)
It results in an override of the arch-specific version. Nice eh ?
This patch fixes this issue by declaring static inlines in
asm-generic/topology.h and by requiring a _complete_ override of the
topology functions when an architecture needs to override them. An
architecture overriding the topology functions should not include
asm-generic/topology.h anymore.
- alpha needs careful checking, as it did not implement ...> - Major cross-architecture built test is required. Some problems on ia64. With defconfig build (which has CONFIG_NUMA=y) I see this: kernel/sched.c: In function 'find_next_best_node': kernel/sched.c:6920: error: implicit declaration of function 'node_to_cpumask_ptr' kernel/sched.c:6920: error: '__tmp__' undeclared (first use in this function) kernel/sched.c:6920: error: (Each undeclared identifier is reported only once kernel/sched.c:6920: error: for each function it appears in.) kernel/sched.c: In function 'sched_domain_node_span': kernel/sched.c:6952: error: 'nodemask' undeclared (first use in this function) kernel/sched.c:6953: warning: ISO C90 forbids mixed declarations and code kernel/sched.c:6964: error: implicit declaration of function 'node_to_cpumask_ptr_next' kernel/sched.c: In function '__build_sched_domains': kernel/sched.c:7510: error: 'pnodemask' undeclared (first use in this function) On an "allnoconfig" build (which curiously also has CONFIG_NUMA=y :-) I see mm/page_alloc.c: In function 'find_next_best_node': mm/page_alloc.c:2086: error: implicit declaration of function 'node_to_cpumask_ptr' mm/page_alloc.c:2086: error: 'tmp' undeclared (first use in this function) mm/page_alloc.c:2086: error: (Each undeclared identifier is reported only once mm/page_alloc.c:2086: error: for each function it appears in.) mm/page_alloc.c:2107: error: implicit declaration of function 'node_to_cpumask_ptr_next' There are most probably more errors ... but this is where the build stopped. -Tony --
[...]
Ah, I did not select config "generic" for ia64, and thus did not get
CONFIG_NUMA. Here is a v2 which fixes this.
Thanks for testing this.
Mathieu
topology.h define mess fix v2
Update : build fix for ia64 CONFIG_NUMA.
Original goal : Declare NUMA-less cpu_to_node with a check that the cpu
parameter exists so people without NUMA test configs (namely Steven Rostedt and
myself who ran into this error both in the same day with different
implementations) stop doing this trivial mistake.
End result :
Argh, I think topology.h is utterly broken :-(
Have you noticed the subtile interaction between the
include/asm-x86/topology.h :
#define numa_node_id() 0
#define cpu_to_node(cpu) 0
#define early_cpu_to_node(cpu) 0
...
#include <asm-generic/topology.h>
and
include/asm-generic/topology.h :
#ifndef cpu_to_node
#define cpu_to_node(cpu) ((void)(cpu),0)
#endif
If any architecture decide for some reason to use a static inline rather
than a define, as currently done with node_to_first_cpu :
include/asm-x86/topology.h :
static inline int node_to_first_cpu(int node)
{
return first_cpu(cpu_online_map);
}
...
#include <asm-generic/topology.h>
include/asm-generic/topology.h :
#ifndef node_to_first_cpu
#define node_to_first_cpu(node) ((void)(node),0)
#endif
(which will override the static inline !)
It results in an override of the arch-specific version. Nice eh ?
This patch fixes this issue by declaring static inlines in
asm-generic/topology.h and by requiring a _complete_ override of the
topology functions when an architecture needs to override them. An
architecture overriding the topology functions should not include
asm-generic/topology.h anymore.
- alpha needs careful checking, as it did not implement parent_node nor
node_to_first_cpu previously.
- Major cross-architecture built test is required.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Steven Rostedt ...oh, stupid typo of the year :-) applied to tip/tracing/ring-buffer, thanks for tracking it down! I've reactivated the topic branch for tip/master and i'm running a few tests before pushing it out for wider testing. Ingo --
I'm somewhat at a loss here because I'm unable to find any version of kernel/trace/trace.c which looks anything like the one which is being Why was this code using a cast in the first place? It should be using entry->some_field_i_dont_have_here? That was the whole point in using the anonymous union in struct trace_entry? --
it's in tip/tracing/ring-buffer (also tip/master), but we are still working on it (i just triggered a crash with it) so i havent pushed it this whole mega-thread was about removing that union and turning the tracer into a type-opaque entity. I warned about the inevitable fragility - but with this type filter approach the risks should be substantially lower. Ingo --
As Ingo mentioned, you don't have this yet. And be happy that you don't ;-) Because the ring_buffer now allows for variable length entries, having a one size fits all entry is not optimal. But because C is not the best for typecasting, we have this macro to help solve the issue. Instead of registering everything into a single union and causing small fields to be large, we have a macro you can register your type with instead. -- Steve --
The list_head in the page frame should be available regardless of splice() stuffs. --
Regardless, there's more info we want to store for each page than the list head. Especially when we start converting this to lockless. I rather get out of the overlaying of the page frames, its nice to save the space, but really scares the hell out of me. I can just imagine this blowing up if we redo the paging, and I dislike this transparent coupling between the tracer buffer and the pages. -- Steve --
The problem with storing the page link information inside the page is that it doesnt transfer to another address space, so if you do indeed mmap these pages, then the link information is bogus. Of course, in such a situation you could ignore these headers, but somehow that doesn't sound too apealing. --
No that's not what I'm proposing. I'm proposing to allocate a page_header
structure for every page we alloc, and make a link list of them.
In other words:
struct ring_buffer_per_cpu {
[...]
struct list_head pages;
[...]
};
struct buffer_page {
[...];
void *page;
struct list_head list;
[...];
};
In ring_buffer_allocate_cpu:
struct buffer_page *bpage;
struct unsigned long addr;
[...]
for every page() {
bpage = kzalloc(sizeof(*bpage), GFP_KERNEL);
addr = get_free_page();
bpage->page = (void *)addr;
list_add(&bpage->list, &cpu_buffer->pages);
}
Obviously need to add the error checking, but you get the idea. Here I do
not need to change any of the later logic, because we are still dealing
with the buffer_page. I only need to update way to index the page which is
already encapsulated in its own function.
-- Steve
--
Of course we would not be declaring a "struct unsigned long" ;-) -- Steve --
Nitpick: Why cast to void *? And sometimes you use the rb_ prefix, in other cases you use the longer for ring_buffer_, is the ring_ namespace already used? Or can we make it use rb_ consistently to shorten the names? - Arnaldo --
I started using the rb_ because I was constantly breaking the 80 character line limit with ring_buffer ;-) OK, for v8, I'll rename all static internal functions to rb_ and keep the global ones ring_buffer_ Thanks, -- Steve --
It would probably be better to use something else than 'rb_', because that prefix is already used by the red-black trees, and exported as such (eg "rb_next()" etc). But at least as long as it's static, it's probably not _too_ noticeable if the rest of the names don't overlap. We _do_ include <linux/rbtree.h> almost everywhere, since we use those things in the VM, in timers etc, so it comes in through pretty much all headers. Linus --
Well, I just compiled it and it didn't have any name collisions, but that doesn't mean that this wont change in the future. What would you suggest? buffer_ ? ringbuf_ ? -- Steve --
For kicks I just added #include <linux/rbtree.h> and it still passed. I don't think we'll be adding new functions to rbtree.h, so it may be fine to stay with the rb_ prefix. -- Steve --
