As I told Martin, I was thinking about taking an axe and moving stuff around in relay. Which I just did. This patch reimplements relay with a linked list of pages. Provides read/write wrappers which should be used to read or write from the buffers. It's the core of a layered approach to the design requirements expressed by Martin and discussed earlier. It does not provide _any_ sort of locking on buffer data. Locking should be done by the caller. Given that we might think of very lightweight locking schemes, it makes sense to me that the underlying buffering infrastructure supports event records larger than 1 page. A cache saving 3 pointers is used to keep track of current page used for the buffer for write, read and current subbuffer header pointer lookup. The offset of each page within the buffer is saved in the page frame structure to permit cache lookup without extra locking. TODO : Currently, no splice file operations are implemented. Should come soon. The idea is to splice the buffers directly into files or to the network. This patch is released as early RFC. It builds, that's about it. Testing will come when I implement the splice ops. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> CC: Martin Bligh <mbligh@google.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: prasad@linux.vnet.ibm.com CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Thomas Gleixner <tglx@linutronix.de> CC: Steven Rostedt <rostedt@goodmis.org> CC: od@suse.com CC: "Frank Ch. Eigler" <fche@redhat.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: hch@lst.de CC: David Wilder <dwilder@us.ibm.com> --- include/linux/ltt-relay.h | 291 +++++++++++++++++++++++++++++ ltt/ltt-relay-alloc.c | 450 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 741 insertions(+) Index: linux-2.6-lttng/ltt/ltt-relay-alloc.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ ...
Why? What aspects of Steve's ring-buffer interface will hinder us optimizing the implementation to be as light-weight as you like? The thing is, I'd like to see it that light as well ;-) As for the page-spanning entries, I think we can do that with Steve's system just fine, its just that Linus said its a dumb idea and Steve dropped it, but there is nothing fundamental stopping us from recording a length that is > PAGE_SIZE and copying data into the pages one at a time. Nor do I see it impossible to implement splice on top of Steve's ring-buffer.. So again, why? --
Which defeats the whole purpose of the exercise, we want to provide a single mechanism - including locking - that is usable to all. Otherwise everybody gets to do the hard part themselves, which will undoubtedly result in many broken/suboptimal locking schemes. --
Well, this is my answer to Steven's "this is too complex" comments,
which I suspect is really a "this is too implement to implement". Sorry
Steven, but you do not actually propose anything to address my concerns,
which are : I want to export this data to userspace without tricky
dependencies on the compiler ABI. I also don't want to be limited in
locking infrastructure implementation.
Those are the kind of concerns that are much easier to address in a
layered and modular implementation. If we try to do everything in the
same C file, we end up having typing/memory management/time management
all closely tied.
So I am all for providing a common infrastructure which implements all
this, but I think this infrastructure should itself be layered and
modular.
Also, I have something really really near to the requirements expressed
in LTTng, which is :
Linux Kernel Markers : Event data typing exportable to userspace without
tricky compiler ABI dependency.
TODO : Export marker list to debugfs.
Allow individual marker enable/disable
through debugfs file.
Use per client buffer marker IDs rather
than a global ID table.
Export the markers IDs/format/name through
one small buffer for each client buffer.
ltt-relay : Buffer coherency management. TODO : splice.
ltt-relay-alloc : Buffer allocation and read/write, without vmap.
ltt-tracer : In-kernel API to manage trace allocation,
start/stop.
TODO : Currently has a statically limited set of
buffers. Should be extended so that clients could
register new buffers.
ltt-control : Netlink control which calls the in-kernel
ltt-tracer API.
TODO : switch ...Ok, I'll try to explain my point of view. The thing is : I want those
binary buffers to be exported to userspace, and I fear that the approach
taken by Steven (let's write "simple" C structure directly into the
buffers) will in fact be much more _complex_ (due to subtle compiler
dependencies) that doing our own event payload (event data) format.
Also, things such as
ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.
will probably become a problem when trying to go for a more efficient
locking scheme.
ring_buffer_peek: Look at a next item in the cpu buffer.
This kind of feature is useless for data extraction to userspace and
poses serious synchronization concerns if you have other writers in the
same buffer.
Structure for event records :
struct ring_buffer_event {
u32 type:2, len:3, time_delta:27;
u32 array[];
};
The only good thing about reserving 2 bits for event IDs in there is to
put the most frequent events in those IDs, which is clearly not the
case:
RB_TYPE_PADDING: utterly useless. Can be expressed by a sub-buffer
header telling the length of data written into the subbuffer (what you
guys call a "page", but what I still think might be worthy to be of
variable size, especially with a light locking infrastructure and
considering we might want to export this data to userspace).
RB_TYPE_TIME_EXTENT : I'd reserve a separate bit for this one.
Also, if size _really_ matters, we should just export the event ID and
look up the event payload size through a separate table. If the payload
consists of a binary blob, then the data structure should start by a
payload size and then have a the actual binary blob.
struct ring_buffer_event {
u32 time_ext:1, evid:4, time_lsb:27;
union {
u32 array[];
struct {
u32 ext_id;
u32 array[];
};
struct {
u32 ext_time;
u32 array[];
};
struct {
...I now have both. But I think userspace can now easily see when there I wanted to push the event ids out of the ring buffer logic. Only a few internal ones are used. Otherwise, we'll have one hell of a bit enum Well, you need to record wraps. Probably more often then you record delta As Linus said, anything bigger than a page should be outside this buffer. All the buffer would then need is a pointer to the data. Then the tracer -- Steve --
As I already told you in person, if you have, say 16 pages for your
buffer, you could peek into all the pages which are not being currently
written into (15 other pages). This would permit to remove unnecessary
writer synchronization from a cmpxchg scheme : it would only have to
push the readers upon page switch rather than at every single even.
Pushing the readers involves, at best, a fully synchronized cmpxchg,
(I notice the comment in your v10 says that minimum size for event is 8
bytes, isn't it rather 4 bytes ?)
(len field set to zero for events bigger than 28 bytes.. what do you use
for events with 0 byte payload ? I'd rather use the 28 bytes value to
identify all events >= 28 bytes and then use the first field to identify
the length).
What you propose here is to duplicate the information : have the number
of bytes used in the page header, exported to userspace, and also to
reserve an event ID (which becomes unavailable to encode anything else)
for "padding" event. There is clearly a space loss here.
Also, dealing with end of page padding will become a bit complex : you
have to check whether your padding event fits in the space left at the
end (4-bytes aligned, but minimum event size is 8 bytes, as stated in
your comment ? Is this true ?) Also, what happens if you plan to write
an event bigger than 28 bytes in your subbuffer (or page) and happen to
be at the end of the page ? You'd have to create padding which is larger
than 28 bytes. Your v10 comments seems to indicate the design does not
This is why I propose to dynamically allocate event IDs through the
markers infrastructure. Other you'll have to declare and export such
Uh ?
The smallest record here is 8 bytes too. It has time_ext bit, time_lsb,
evid and evsize. It does not contain any event-specific payload.
With this given implementation :
struct ring_buffer_event {
u32 time_ext:1, evid:4, time_lsb:27;
};
I see the 4 first bytes of this smallest event (the ring_buffer_event).
Nope. You ...Actually, I also have been looking at doing things lockless. Note, the peek operation is for the iteration mode of read, which must disable writes (for now). The iteration mode does not consume the reader, in fact there is no consumer. It is used to give a trace after the fact. No writer should be present anyway. the peek operation will not be used for a consumer read, which allows for I force a zero type payload to have a len of 1. This keeps the minimum at Because my focus is not on userspace. I don't care about user space apps! My user space app is "cat". I don't want to map ever single kind of event into the buffer. Keeping the length in the header keeps everything more robust. I don't have to worry about searching an event hash table, or remap what events are. Hell, I don't want to always register an event, I might just say, load this data and be done with it. The user shouldn't need to always keep track of event sizes on reading. Yep, we add more than 28 bytes of padding. Linus said this was fine. If you don't have enough space to fit your event, go to the next page. If you only have 4 bytes left, yes we have an exception to the rule, the I don't want to be dependent on markers. No need to be, it's too much for OK, my smallest event with data is 8 bytes, yours is 12. What do I want a payload without data for? I see: 4 bytes time_ext, time 2 byts evid 2 bytes evsize There's your 8bytes. You need anoter 4 bytes to hold data, which makes it With no data. I'm not going to worry about saving 4 bytes for a zero I actually don't care which method we use. This is something that we can Actually, when we write to a new page, we save the timestamp. We don't care about pages we overwrite, since the next read page still has the timestamp value that was record when it was first written to. I'm saying, every page needs a timestamp value that can be retrieved, otherwise, how do you retrieve the timestamp of a page that was The ...
I forgot to mention one important detail. The "head" index will stay
on the page frame. That way we do not need to figure out which
head_page we are on. We grab the head_page, we atomically
(local_inc_return) the head pointer on that page. If the return value is
still on the page, we succeeded. We can also increment a value on this
page frame that will prevent recording if we somehow overflowed the buffer
before relinquishing the stack.
That is
reserve_event()
IRQ->
reserve_event();
[...]
IRQ->reserve_event() came back to original head!
Here we do not have a big enough buffer, and this is just stupid ;-)
We would drop packets in this case, and should drop the guy on his
head who came up with the too small buffer.
-- Steve
--
The only compiler dependant thing in there is the bitfield, which could
be recoded using regular bitops.
You can do
stop:
cpu_buffer->stop=1;
smp_wmb();
sched_sync();
start:
smp_wmb();
cpu_buffer->stop=0;
write:
if (unlikely(smp_rmb(), cpu_buffer->stop))) {
return -EBUSY;
or
while (cpu_buffer->stop)
cpu_relax();
}
Sure, its probably possible to rework the merge-iterator to use consume,
but that would require it having storage for 1 event, which might be a
bit messy.
How would your locking deal with this? - it really is a requirement to
Not so, these types are buffer package types, not actual event types, I
But then you'd need a sub-buffer header, which in itself takes space,
this padding solution seems like a fine middle-ground, it only takes
space when you need it and it free otherwise.
I'm really not seeing what any of these proposals have on top of what
Steve currently has. We have the ability to encode up to 28 bytes of
payload in a 4 bytes header, which should suffice for most entries,
Still not quite understanding where you get the MSBs from, how do you
Discarting the whole sub-buffer idea, it could be used to validate
whichever time-stamp logic.
Personally I don't particulary like the sub-buffer concept, and I don't
I don't think you _ever_ want to insert actual video memory in the
trace, what you _might_ possibly want to do, is insert a copy of a
frame, but that you can do with paged buffers like we have, just add an
entry with 3840*1200*4 bytes (one screen in my case), and use sub-writes
to copy everything to page-alinged chunks.
There is nothing techinically prohibiting this in Steve's scheme, except
for Linus telling you you're an idiot for doing it.
The NOP packets you dislike allow us to align regular entries with page
boundaries, which in turn allows this 0-copy stuff. If you use the
sub-write iterator stuff we taked about a while ago, you can leave out
the NOPs.
Having both makes sense (if you ...+struct ring_buffer_event {
+ u32 type:2, len:3, time_delta:27;
+ u32 array[];
+};
I am mostly talking about what goes in the array[] part of the event.
That will stop tracing if you want to read a trace at the same time you
want to write it, which does not permit continuous streaming.
I agree that such start/stop is needed to control tracing on/off, but it
Sure, I plan to support this. This would be a subset of userspace data
extraction. The in-kernel consumer would be a consumer just like a
splice operation which would extract buffers one event at a time. Rather
than doing a splice to write the data to disk or to the network, this
special in-kernel consumer would merge-sort events from the various
buffers it is connected to and pretty-print one event record at a time
to a seq_file.
Instead of peeking in the "next event", which implies closely coupled
locking with the writer, it would therefore have to ability to consume
all the oldest subbuffers which are not being written to.
Being a standard consumer, it would therefore increment the consumer
offset, which is synchronized by the writer at each subbuffer (page)
I actually think we only need 1 bit to represent extended timestamp
headers. The rest of these event types are just unneeded and consume
precious ID space for nothing. I have not seen any justification telling
why reserving these events actually does something that cannot be done
by the extended timestamp header or by reserving a small header at the
Note that you already need to save the timestamp in the subbuffer
header. The question is : should we also save the unused size.
If we don't, keeping a padding event will likely :
- Rarely save space, given the variable size nature of event records,
the likeliness of needing padding is pretty high. A padding event is
bigger than just writing the unused size anyway, given the unused size
can be written in a few bits for a page, but the padding event
...Remember, I plan on supporting two modes. An iterator mode and a consumer
producer mode. Do you have an iterator mode? That's what most of the
ftrace tools use.
The iterator mode will disable writing and use the "peek" mode. The
consumer mode will not, and have its own ways to read. As is right
I do that too. Yes I have duplicate info. But I still like the robustness
False. It is a "special" event that can actually be in 4 bytes. Which is
what would be wasted anyway. Lenght is ignored by the padding event. Its
basically just another way to make sure we are not reading useless data.
From the rb_event_length function:
+ switch (event->type) {
+ case RINGBUF_TYPE_PADDING:
+ /* undefined */
I rather keep the page headers out of the buffer. Let the tracer expose
Create something special for >page_size events. Those are not the norm
This sounds like something special, and can be done with something
different. No, I don't want the ring buffer infrastructure to be complex
mechanism that can do everything. It should be a generic infrastructure
that does what 99% of the tracers want to do. And 100% of those tracers
Again, make a special tracer for this. This is not the common case, and
Only need to handle writes in stack order (interrupts and nmis). I'll move
the tail index into the page itself, and then we can do atomic operations
on both uping the tail and moving the tail page pointer. This will work
Heh, I was thinking of doing the same thing in the page frame. Sounds
like this is growing into what you did. See, I told you that it will.
Actually, I plan on pushing all the work to the reader. The writer will
have full speed. I'm not going to worry about it now, I'm pretty sure we
have the infrastructure in place to make it happen. Another reason I don't
want the ring buffer to interface with the user land directly is because
it lets us change it without worrying.
Remember, there is no internal kernel ...> A 4 byte dataless payload is useless anyway. Not at all convinced that's true - we used it for lots of things. Start and end of irq event is one frequent example. Also, in a compact mode, we can record start and end of syscalls like this (without parameters). --
Ah, sorry, maybe I misread this. No data with no event ids, etc is fairly useless. 4 bytes data including space to record event ids is OK. --
In Steven's scheme, the event IDs in the 4 bytes are reserved for (useless) internal use ;) They can therefore not be used for specific tracer event IDs, which I think is a misuse of the precious bits otherwise available to store really useful event IDs. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I'm using them, so they must not be totally useless. ;-) But ftrace has its own event ids and I don't want the ring buffer to ever have to know about them. -- Steve --
You are actually using them to put redundant information that could be encoded differently and thus save 4 bits per event records, more or less what will be needed by most tracers (15 IDs, 1 reserved for an extended ID field). So the fact that you use them does not mean they are really required, and I don't think such duplicated information actually makes things more solid. Maybe just more obscure ? Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I really like the idea of keeping the tracer event ids out of the ring Well, at least for version 1 these bits stay. I already found two bugs by hitting the event padding when the size said it should not have. This redundant information makes me feel a bit more cozy when they match. -- Steve --
Hi - That's fine for a ring buffer that only ever contains data from one event source. How do you imagine multiplexing working, where one wants to grep a single debugfs file that contains data from different event sources? Punt to another layer? - FChE --
Field goal! Yes! That is exactly what I want. If your trace has only one event, why would it want to record them? The event id belongs in the tracer layer not the ring buffer layer. I would like to make a "trace_buffer" layer that includes all of this. -- Steve --
This is all over 1 bit of information, right? Since you need at least 1 for the timestamp stuff. --
I am not saying anything about the actual number of events with 0 bytes payload I actually have in my own instrumentation, if this is what you mean. I am just saying that it leaves this room available for such events. Even if there is a 32 bits payload associated with those events, the fact that we can encode the event ID in the 32 bits header will bring 4 bits of information could be added to the 32-bits header if we allow tracers to register their first 15 event IDs in those 4 bits. But well... let's keep that for v2. ;) Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
That's true. So do we have a bunch of stuff that we really really need Sounds like a plan ;-) All this stuff is internal representations anyway. --
Stuff like irq_exit has 0 byte payload, is very high rate, and useful, Probably pointers on 32 bits archs. Any "int" value, on 32 or 64 bits, will need careful attention if we want only to export the 28 LSBs. It Yup. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree< |
