RFC RFC RFC RFC RFC!!!! Now did I get your attention that this is a request for comment patch. This is probably very buggy. I ran it as a back end for ftrace but only tested the irqsoff and ftrace tracers. The selftests are busted with it. But this is an attempt to get a unified buffering system that was talked about at the LPC meeting. I did not get a chance to implement all the event recording and printing in the debugfs/tracing/buffers directory. But I got enough to do some ftrace work with it. Now that it boots and runs (albeit, a bit buggy), I decided to post it. This is some idea that I had to handle this. I tried to make it as simple as possible. Basically we have: buffer = ring_buffer_alloc(size, flags, max_event_size, print_func, name); We can record either the fast way of reserving a part of the buffer: event = ring_buffer_lock_reserve(buffer, event_id, length, &flags); event->data = record_this_data; ring_buffer_unlock_commit(buffer, event, flags); Or if we already have the data together: ring_buffer_write(buffer, event_id, length, data); We can read by consuming or iterating: event = ring_buffer_consume(buffer); iter = ring_buffer_start(buffer, iter_flags); event = ring_buffer_read(iter, &next_cpu); ring_buffer_finish(iter); Note, the iteration part stops recording to the buffer. This is a feature. If you want producer/consumer, then you should be using the consumer. Signed-off-by: Steven Rostedt <srostedt@redhat.com> --- include/linux/ring_buffer.h | 138 +++ kernel/trace/Kconfig | 4 kernel/trace/Makefile | 1 kernel/trace/ring_buffer.c | 1565 ++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 1708 insertions(+) Index: linux-compile.git/include/linux/ring_buffer.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-compile.git/include/linux/ring_buffer.h 2008-09-23 17:45:49.000000000 -0400 @@ -0,0 +1,138 ...
Don't like that max_event_size param. Don't like that print_func param. This can, in generic, not work. Due to the simple fact that we might straddle a page boundary. Therefore I think its best to limit our self Don't like the event_id, just stick to plain binary data, and leave By the above, this would have to be per-cpu as you cannot interpret the Why? --- So the interface I would like to see is: struct ringbuffer *ringbuffer_alloc(unsigned long size); void ringbuffer_free(struct ringbuffer *rb); /* * disables preemption, cmpxchg reserves size on local cpu, memcpy * chunks of @buf into the page buffers, enables preemption. */ int ringbuffer_write(struct ringbuffer *buffer, const void *buf, unsigned long size); /* * consumes @size data from @cpu's buffer and copies it into @buf. * has internal synchronization for read pos, returns error when * overwritten - but resets state to next half-buffer. */ int ringbuffer_read(struct ringbuffer *buffer, int cpu, void *buf, unsigned long size); This interface is enough to abstract the fact that our buffer is non-linear and further assumes as little as possible. For your IRQ/preempt tracers you can possibly add another operation: /* * swaps the @cpu buffer of @src and @dst, returns error when dimensions * mis-match. */ int ringbuffer_xchg_cpu(struct ringbuffer *src, struct ringbuffer *dst, int cpu); --- On top of that foundation build an eventbuffer, which knows about encoding/decoding/printing events. This too needs to be a flexible layer - as I suspect the google guys will want their ultra-compressed events back. I'm not quite sure yet how to model this layer most flexible without being a nuisance. So obviously its also this layer that has the whole debugfs interface, but maybe we could even push that one layer up, so as to keep that lockdep keys cannot be in dynamic storage, also mainline raw_spinlock_t People (read SGI) prefer you dynamically allocate this array due to ...
Actually, that is there so that a user reading the buffer could allocate a buffer and read all the events. It also allows for greater than 1 page size events to be recorded, which it currently does not do (the At the meeting we talked about having a way to do pretty print from the That gives you a way to see what buffers are allocated in The code does not allow straddling of pages. Actually this is not much different than what ftrace does today. If the new entry straddles a page, we mark the rest of the page as "not used" and start on the next page. This is actually nicer than straddling pages, which is what logdev does, and one of the things that makes logdev slow. Otherwise, we always need to copy twice. Once into the tracer sturcture and than again into the buffer. Having implemented both concepts, I much prefer this one. It also allows This is up to debate. I know you don't like this extra event layer, but seriously, all my uses with ring buffers has had some kind of event. But then I'm sure you can argue that if you are using a single type you can can the event. I'm open to doing this, but I would like a consensus on this. Because if we do not, then it is a lot more work to keep the iterator knowing about overruns, and becomes a pain in the ass. I ended up in ftrace trying hard to disable tracing when we are reading the trace. If we do not, then the output always becomes corrupted. That is, the end half of the trace does not match the beginning. For static reads (non consuming), in my pass experience, has become the preferred method. Note that the consuming read does not have this limitation, because a consuming read is usually done with in flight tracing. A static read is usually done (for best results) from running a Again this should be up to debate. I'm not that attached to one way or Again this forces a double copy when one should be enough. This is what So you want to push the merge sort into the tracer. This also means that ...
i'd prefer Peter's simplification and not pass event_id along. Since static types are lost anyway (which is the biggest cost and risk of any such abstraction), we have to convert between types early on. Whether event_id is visible in the API is no big difference. (It might be cheaper to not pass it along even if everyone ends up using it - as it has no semantic meaning anyway.) pretty much the only generic tracing information is time and payload size. ( but even a time key is debatable - there are various resolutions needed by different usecases. Some usecases are even fine without any timestamps at all - they just want to know the ordering of events and that's it.) i'd like to see some automatic type protection though, as an off-by-default debug option: encode the record type on storing and double-check it on extraction. So it should be possible to reliably store/restore a typed trace buffer and notice corruption early in testing. because there's one thing that is far more important tracer feature than sheer performance: robustness. Automated type checking in debug mode would also mean we could go for RLE encoding much more agressively. Most of the risks of a more complex, more compressed and pointer-laden data format come from type mismatches and the loss of compiler protection against human errors/stupidity. (running off the end of the page, misinterpreting a pointer, a record, etc.) Ingo --
I'm not sure why this is any harder to deal with in write, than it is
in reserve? We should be able to make reserve handle this just
as well?
If you use write rather than reserve, you have to copy all the data
That would be nice. However, we need to keep at least the length
and timestamp fields common so we can do parsing and the mergesort?
+struct ring_buffer_event {
+ unsigned long long counter;
+ short type;
+ short length;
+ char body[];
+} __attribute__((__packed__))
Is useful when gathering GB of data across 10,000 machines ;-)
Also reduces general overhead for everyone to keep events small.
--
No, imagine the mentioned case where we're straddling a page boundary.
A----| |----B
^------|
So when we reserve we get a pointer into page A, but our reserve length
will run over into page B. A write() method will know how to check for
this and break up the memcpy to copy up-to the end of A and continue
into B.
You cannot expect the reserve/commit interface users to do this
correctly - it would also require one to expose too much internals,
And here I was thinking you guys bit encoded the event id into the
All of it would, basically I have no notion of an event in the
ringbuffer API. You write $something and your read routine would need to
be smart enough to figure it out.
The trivial case is a fixed size entry, in which case you always know
how much to read. A slightly more involved but still easy to understand
example might be a 7bit encoding and using the 8th bit for continuation.
Another option is to start out with a fixed sized header that contains a
length field.
But the raw ringbuffer layer, the one concerned with fiddling the pages
Exactly - which is why a flexible encoding layer makes sense to me -
aside from the abstraction itself.
--
This would not happen. The ring buffer reserve routine will take care of
it. If the requested length will straddle a page, I add a "nop" event
into the end of A, and give the user a pointer starting at B.
A----| |-----B
^^ ^
I'll give you the ftrace example. In ftrace we record the task pid,
preempt count, and interrupt flags. The reserve commit way is this.
event = ring_buffer_reserve(buffer, sizeof(*event));
event->pid = current->pid;
event->pc = preempt_count();
event->flags = flags;
ring_buffer_commit(buffer, event);
Done! One copy of all this data directly into the buffer. But if we use
the write method that you propose:
struct event event;
event.pid = current->pid;
event.pc = preempt_count();
event.flags = flags;
ring_buffer_write(buffer, &event, sizeof(event));
One copy into event has been done, but we are not done yet. Inside the
write we need to do a...
memcopy(buffer->buf, data, size);
Logdev has an internal buffer that does exactly what you are proposing.
I had to give it some minor smarts to improve preformance, and stability.
If you don't even have the length then it is over when you wrap. The
There's no way to know where the next record starts. It gets ugly with
callbacks.
Giving the ring buffer at least the length as a minimum, is what is
needed. And as I have that nop id as well, I would need to put that
either into a type field or counter (I currently use both for these
As I have learned from both ftrace and logdev, giving the ring buffer a
little intelligence, has paid off a lot. I will say, I have not hit that
minimum with my patches. But I will argue that you are going beneath that
But being too minimal and flexible, will actually be counter productive.
That is, you will not be able to enhance it due to the lack of abilities.
That is, the top layers will now be limited.
-- Steve
--
I would suggest just not allowing page straddling. Yeah, it would limit event size to less than a page, but seriously, do people really want more than that? If you have huge events, I suspect it would be a hell of a lot better to support some kind of indirection scheme than to force the ring buffer to handle insane cases. Most people will want the events to be as _small_ as humanly possible. The normal event size should hopefully be in the 8-16 bytes, and I think the RFC patch is already broken because it allocates that insane 64-bit event counter for things. Who the hell wants a 64-bit event counter that much? That's broken. Linus --
The event counter is just the timestamp (quick patch, simple to fix). The term "counter" was bad. It should have been timestamp, which one would want a 64bit timestamp. Or at least a way to figure it out. Yes, we can store a special event called "timestamp" and have a smaller counter. But for simplicity, the 64 bit was easy. The event id was just 16 bits, which I think is way more than enough. The current method has a 16 bit length as well, and prevents crossing of page boundaries. Other than that, I would love to have you review more of this patch. Note, I plan on hacking the "max_event_size", and just have that be the standard "PAGE_SIZE". If you need a preallocated storage to store events, one could just use PAGE_SIZE and fit any event they want into it. Thanks, -- Steve --
Yup, is just a confusing name. we can definitely make this a smaller field by doing an offset time from the last event, but we agreed on 64 bits to keep version 1 simple ;-) I think in retrospect the timestamp events we used with wall time stuck in them were a mistake, as NTP will make them difficult. We should have just recorded wall time at the start of the buffer, and done offsets from there. Without relayfs subbuffers, the offset thing gets trickier, as you'd have to update the "start time" constantly once you'd filled the buffer and were shifting the start pointer. OTOH, I guess it's only 1 cacheline. M. --
One definitely would _not_ want the full 64-bit timestamp. There are two cases: - lots and lots of events just do a 32 bit "timestamp delta" to the previous packet (where the first packet in the queue would be a delta from a value in the queue header - and we can obviously make that value be the TSC so that the first delta is always zero, but it may also make sense to make it a real delta, and the queue header would contain some good synchronization point value). - occasional events Oops. The delta wouldn't fit. So create a new "timestamp update" packet with a 64-bit thing when doing the reservation. There's obviously no cost issue (since this would only happen for things where there was a multi-second delay - or at least an appreciable fraction of a delay - between events) This definitely is worth doing. If we have small trace objects (and many things really do have pretty small traces), using just a 32-bit TSC not only saves 4 bytes per trace event, but it makes it quite reasonable to keep the trace data 4-byte-aligned rather than requiring 8-byte alignment. Of course, if the traces end up being horribly bloated, none of that will matter. But I really would hope that you we keep the header itself to just 8 bytes (and being 2 4-byte entities), so that small payloads are reasonable. And that looks doable, if you have a 16-bit "type" and a 16-bit "size" field. One thing I'd like to do is to also architecturally reserve a few of the types for internal queue management stuff. Things like "padding" objects (or a "end-of-ringbuffer" object), and the TSC overflow object, and a "time sync" object (or heartbeat). So maybe the type would have the high bit set as a "reserved for internal ringbuffer use" or whatever. Linus --
Hi Linus, I agree that the standard "high event rate" use-case, when events are as small as possible, would fit perfectly in 4kB sub-subbfers. However, I see a few use-cases where having the ability to write across page boundaries would be useful. Those will likely be low event-rate situations where it is useful to take a bigger snapshot of a problematic condition, but still to have it synchronized with the rest of the trace data. e.g. : - Writing a whole video frame into the trace upon video card glitch. - Writing a jumbo frame (up to 9000 bytes) into the buffer when a network card error is detected or when some iptables rules (LOG, TRACE ?) are reached. - Dumping a kernel stack (potentially 8KB) in a single event when a kernel OOPS is reached. - Dumping a userspace process stack into the trace upon SIGILL, SIGSEGV and friends. That's only what I come up with from the top of my head, and I am sure we'll find very ingenious users who will find plenty of other use-cases where 4kB events won't be enough. (It reminds me of someone saying "640K ought to be enough for anybody.") ;-) If the write abstraction supports page straddling, I think it would be a huge gain in simplicity for such users because they would not have to break their payload in various events and have to create another event layer on top of all that which would identify events uniquely with a "cookie" or to protect writing events into the buffers with another layer of locking. Besides, there are other memory backends where the buffers can be put that do not depend on the page size, namely video card memory. It can be very useful to collect data that survives reboots. Given that this memory will likely consist of contiguous pages, I see no need to limit the maximum event size to a page on such support. Therefore, I think the support for page-crossing should be placed in the "write" abstraction (which would be specific to the type of memory used to back the buffers) rather that the ...
But these are _all_ things that would be much better off with a "allocate a separate buffer, and just add a pointer to the trace". Why? If for no other reason than the fact that we don't even want to spend lots of time to (atomically) have to copy the big data into the trace buffer! Just allocate the buffer and fill it in (maybe it's pre-allocated already, like when a network packet event happens!) and do all of that independently of the low-level trace code. And then add the trace with the pointer. We want the low-level trace code to be useful for things like interrupt events etc, which makes it a _disaster_ to try to add huge buffers directly to the ring buffer. You also don't want to allocate a multi-megabyte ring buffer for some odd case that happens rarely, when you can allocate the big memory users dynamically. So limiting a trace entry to 4kB does not mean that you can't add more than 4kB to the trace - it just means that you need to have a "data indirection" trace type. Nothing more, nothing less. [ And btw - you'd need that *anyway* for other reasons. You also don't want to have any length fields have to be 32-bit lengths etc - the length field of the trace buffer entry should be something really small like 8 or 16 bits, or even be implicit in the type for some basic event types, so that a trace event doesn't necessarily waste any bits at ALL on the length field ] Linus --
Can't the reserve interface just put a padding event into page A,
Depends how you count ;-) One more time than you would have to
+/* header plus 32-bits of event data */
+struct ktrace_entry {
+ u32 event_type:5, tsc_shifted:27;
+ u32 data;
+};
If you don't have timestamps, you need domain-specific context to merge
the per-cpu buffers back together. As long as these are common format
That's what we discussed at KS/plumbers, and seems like the simplest
When you loop around the ringbuffer, you need to shift the starting "read"
pointer up to the next event, don't you? How do you do that to start on
I like the abstraction, yes ;-) Just not convinced how much we can put in it.
--
Yes, I think having a "padding" entry that just gets skipped on read would Why "tsc_shifted"? I think 27 bits is probably fine, but not by removing precision. Instead of shifting it so it will fit (and dropping low bits as uninteresting), do it by encoding it as a delta against the previous thing. 27 bits would still be sufficient for any high-performance thing that has tons and tons of packets, and if you only have a few trace events you can afford to have the "TSC overflow" event type (and if you want it that dense, you could just make 'data' be the high bits, for a total of 59 bits rather than 64 bits of TSC. 59 bits of cycle counters is perfectly fine unless you are talking trace events over a year or so (I didn't do the math, but let's assume a 4GHz TSC as a reasonable thing even going forward - even _if_ CPU's get faster than that, the TSC is unlikely to tick faster since it's just not worth it from a power standpoint). Ok, I did the math. 1<<27 seconds (assuming the low 32 bits are just fractions) is something like 4+ years. I _really_ don't think we need more than that (or even close to that) in TSC timestamps for tracing within one single buffer. Once you go to the next ring buffer, you'd get a new time-base anyway. Linus --
Right now I have a list of pages that make up the ring buffer. Are you
saying that the first entry in the page should be a timestamp?
Anyway, after talking with Peter Zijlstra, I'm working on RFC-v2, which
splits up the ring buffer a bit more. I'm removing all the debugfs crud,
and I even will remove the merge sort from the ring buffer.
I will now have a ring_buffer API, which will do basic recording. It will
have two modes when allocated. Fixed sized entry mode where you can just
put whatever you want in (I'm still aligning everything by 8 bytes, just
since memory is cheap). Or you can have variable length mode that will
make the following event header:
struct {
unsigned char length;
unsigned char buff[];
};
The length will be shifted 3 since we are 8 byte aligned anyway, making
the largest entry 2046 bytes (2045 bytes of data since 1 byte is already
taken for the length field). If the next entry is not large enough to fit
on the page, I will enter a zero length and that will tell the tracer that
the entry is padding to the end of the page.
For fixed sized entries, a simple calculation of whether an entry can fit
on a page will determine if there is an entry or padding.
Then I will add a trace_buffer API that will add the counting and merge
sort on top of this interface. If you don't care about the
trace_buffering, one could simply use the ring_buffering and be done with
it.
Note, I am still keeping the reserve and commit interface for now.
-- Steve
--
I think the most straightforward model would be that the "head" of the ring buffer (regardless of size in pages) would have that timestamp. Making them per-page is an option, of course, I have no strong opinions either way. The per-page one could have advantages (ie it would give a nice upper limit for just how many entries you have to walk in order to convert an entry into a full timestamp), but I certainly don't think that's a big decision, more of a detail. But if we start out with having the full TSC in each entry, that's easily going to be painful to fix later. If we start out with a delta system, changing the details of where the base is gotten is likely to be exactly that - just a detail. So I'd like the thing to have small headers, and be designed from the So the only reason I'm not thrilled with this is that I really think that timestamping should be inherent, and at the lowest level. Without timestamping, what's the real point? EVERYBODY eventually wants a timestamp. We added it even to the kernel printk()'s. People want them for network packets to user space. X wants it for all its events. It's one of those things that people never do from the beginning, but that everybody eventually wants anyway. So I certainly don't mind layering, but I *do* mind it if it then means that some people will use a broken model and not have timestamps. So I think the timestamping code should just be there - without it, a trace buffer is pointless. Linus --
From: Linus Torvalds <torvalds@linux-foundation.org> Small headers are good, but I'd suggest making sure there is an "offset" or similar field in there. Not that I want to encourage frequent changes to header layout, but if you do need to add something, then this offset field allows you to do so while keeping existing analysis tools working. They will just ignore the new information in the headers, but they will still be able to get at the data bits using the offset. --
Right now I have a "length" field when fixed_length is not specified. This length is currently offest by 8 leaving all 8 bytes aligned. We could change this to be variable? Would that help you? -- Steve --
One thing we said we could do is compile the "decompaction" tools
along with the kernel, in the kernel tree. Then if we change the in-kernel
format, you don't break all the userspace tools. We used:
struct ktrace_time {
u32 seconds;
u32 nanoseconds;
};
struct ktrace_event {
struct ktrace_time time;
u16 cpu;
u16 type;
u32 pid;
u32 data;
u32 elapsed_time; /* ns for interrupt, otherwise us */
u32 expanded_length;
}
The format is much easier to parse for userspace tools, though much
less compact. A simple C tool can turn in-kernel format into userspace
format:
1. Merge the per-cpu buffers into a single stream
2. put cpu ids in
3. Work out which pid was running, from the last context switch
4. Put in elapsed times (keeping track of the start of a system call
and recording the delta at end of system call (or interrupt, etc))
--
Hi - If the common tracing idea still includes kernel-supplied ASCII as an alternative to binary (so that one could grep some debugfs file), then it would be nice to have some common code for decoding the trace records offline. If we can associate a simple specific struct type ("struct ftrace_event") with each trace id type in an object-code-resident table, we could automate some of this. The kernel build process could sniff dwarf data (a la acme's struct-layout-printing dwarves) at or before modpost time to generate a snippet of C code for each such event struct. That C code could be the generic ASCII-printing callback for use by debugfs. A variant could be a userspace-usable version. Given the same id-to-struct-name mapping, Sytemtap could expose event records field by field, by name. - FChE --
Hi Frank, I think we are going a step below this. That is, the ring buffer itself will not be expecting to expose anything to the user interface. That will need to be done in a higher layer. Right now we just want a way to stabilize the ring buffer infrastructure. Then we can add a tracing infrastructure on top that can do the above work. I'm working on having a ring_buffer.c that will do the bare minimum, and a trace_buffer.c that will be a layer on top that will add more functionality. What you are asking for may apply there. Thanks, -- Steve --
OK, then how about this?
Each page will start with a time stamp (I'm still aligning everything by 8
bytes, just because it simplifies things). Then we can have a 3 byte
(24 bit) counter offset? Then we can have a header that looks like:
struct {
unsigned char time[3];
unsigned char length;
unsigned char buff[];
};
This still allows me to have the 2048 byte size buffer.
Or is 24 bits for time too small? The offest will be from the previous
entry, and not the beginning of the page.
If one defines a fixed size entry, we could just use the full 32 bits for
the timestamp, since the length will be ignored in that case, and will
become part of the buffer.
Hence,
struct {
unsigned int time;
unsigned char length;
unsigend char buff[];
OK, the bottom layer will have some kind of timestamps. Now we only need
to agree on what the header will look like.
Thanks,
-- Steve
--
How about we just steal 5 bits from the timestamp to indicate event
lengths up to 32 bytes, and if it's 0, that means there's a length
field following? Also that'd mean you could use a longer length field
and get beyond 256 bytes to 4096, without impacting most events.
struct {
u32 length:5, time_delta:27;
u16 length;
u8 buf[];
};
struct {
u32 length:5, time_delta:27; /* where length == 0 */
u8 buf[];
};
Obviously we could less than 5 bits, even just 1 for a flag ...
--
OK then. Since I like the idea of aligning the buffer to 8 bytes, we can
always shift the length field by 3. So...
For records 256 bytes or less, we only have:
struct {
u32 length:5, time_delta: 27;
u8 buf[];
};
For 257 bytes or more we have:
struct {
u32 length:5 (=0), time_delta: 27;
u16 large_length;
u8 buf[];
};
This is what you want?
-- Steve
--
BTW, if we declare the ring buffer to have fixed length entries, do we still want to record the length? Hmm, probably should. It would make the code easier. If it is fixed length under 256 bytes, we still need to record the timestamp, and the length is only 5 bits. It could probably store something else, but we can leave that for version two ;-) Heck, for now, I'll remove the work I did to add the fix length option, and just use this default. If you have fixed length entries, it will just make it easier to code the above layers. -- Steve --
I rather like this idea, as it gives small entries (the common case) the least overhead but does allow for larger ones. By also putting the time in there you can do the merge sort iterator, Linus was right that everybody wants this anyway. As for delta encoding the time, we could make the tick log the absolute time packet, that's at least 100Hz and it already has to compute the full gtod thing anyway. I don't much like Linus' idea of bringing type information back into the primitive header (sorry Linus ;-)). I'd much rather keep that abstraction in the next layer. --
There is part of the type stuff that belongs in the lower layer, it seems - the padding events for the up-to-end-of-page buffering, and the timestamp extensions. It seems wrong to split those across two layers. But perhaps we can keep a couple of bits for this, and three of the bits to represent the length of the data payload (maybe in 4 byte multiples rather than bytes?) That'd let up to 28 bytes as a payload in a short event. --
Hmm, you've got a point there, then it would be 3 package types: - regular - full time - nop Right - if you use raw tsc you're dependent on clock speed, if we'd normalize that on ns instead you'd need at least: l(10000000)/l(2) 23.25349666421153643532 bits to handle HZ=100, leaving us with 32-2-24 = 6 bits for size. Sounds doable (unless I mis-counted on the 0's). Also, I agree on the 4byte alignment, rather than the 8byte Steve seems to favour. --
Please don't normalize to ns. It's really quite hard, and it's rather _expensive_ on many CPU's. It involves a non-constant 64-bit divide, after all. I bet it can be optimized to be a multiply-by-inverse instead, but it would be a 128-bit (or maybe just 96-bit?) multiply, and the code would be nasty, and likely rather more expensive than the TSC reading itself. Sure, you have to normalize at _some_ point, and normalizing early might make some things simpler, but the main thing that would become easier is people messing about in the raw log buffer on their own directly, which would hopefully be something that we'd discourage _anyway_ (ie we should try to use helper functions for people to do things like "get the next event data", not only because the headers are going to be odd due to trying to pack things together, but because maybe we can more easily extend on them later that way when nobody accesses the headers by hand). And I don't think normalizing later is in any way more fundamentally hard. It just means that you do part of the expensive things after you have gathered the trace, rather than during. Linus --
Agree with you on doing the expensive stuff later. If we wanted to get something that'd pack down to a couple fewer bits, and approximate ns, we could always >> 1 if you were > 2GHz, and >> 2 if you where > 4GHz, etc. which is at least cheap. But do we really need more than 3 bits for size anyway? 28 bytes would fit most events such as system calls, interrupts, page faults, all the common stuff.. Longer than that starts to be expensive in memory consumption anyway, and if you're logging 32 bytes or more ... 2 extra bytes for a length field is a small overhead to pay. M. --
... which is exactly what sched_clock() does, combined with a multiplication. (which is about as expensive as normal linear arithmetics on most CPUs - i.e. in the 1 cycle range) Normalizing has the advantage that we dont have to worry about it ever again. Not about a changing scale due to cpufreq, slowing down or speeding up TSCs due to C2/C3. We have so much TSC breakage all across the spectrum that post-processing it is a nightmare in practice. Plus we want sched_clock() to be fast anyway. in the distant future we not only will have constant-TSC but it wont stop in C2/C3 either at a whim (which they do right now, messing up timestamps). At that stage fast time readout it will be so sane that CPU makers should really provide a nanosec readout - it's easy to do a simple multiplicator and hide the few cycles multiplicator latency to RDTSC (this is continuous time after all so it's easy for the hw). Hm? Ingo --
Hi Ingo, The problem with sched_clock is that it gives a 1 HZ timestamp accuracy for events happening across different CPUs. Within this 1 HZ range, it uses the TSC and clip when it reaches a max. Good enough for scheduler or for tracing events on a single CPU, but I think it is not exactly what we need to reorder events happening across CPUs. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Hmm, sched_clock gives ns accuracy unless the tsc is disabled. And in that case, we don't have any CPU clock :-/ -- Steve --
Even on architectures with non-synchronized TSCs ? -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Yep, even on those ;-) -- Steve --
First off, that's simply not true. Yes, it happens to be true on modern x86-64 CPU's. But in very few other places. Doing even just 64-bit multiples is _expensive_. It's not even _near_ single-cycle. Total and utter bullshit, all of it. Have you forgotten all the oopses due to divide-by-zero because sched_clock() was called early? All that early code that we might well want to trace through? Not only that, but have you forgotten about FTRACE and -pg? Which means that every single C function calls into tracing code, and that can basically only be disabled on a per-file basis? As for C2/C3 - that's just an argument for *not* doing anything at trace time. What do you think happens when you try to trace through those things? You're much better off trying to sort out the problems later, when you don't hold critical locks and are possibly deep down in some buggy ACPI code, and you're trying to trace it exactly _because_ it is buggy. The thing is, the trace timestamp generation should be at least capable of being just a couple of versions of assembly language. If you cannot write it in asm, you lose. You cannot (and MUST NOT) use things like a virtualized TSC by mistake. If the CPU doesn't natively support 'rdtsc' in hardware on x86, for example, you have to have another function altogether for the trace timestamp. And no way in hell do we want to call complex indirection chains that take us all over the map and have fragile dependencies that we have already hit several times wrt things like cpufreq. WE ARE MUCH BETTER OFF WITH EVEN _INCORRECT_ TIME THAN WE ARE WITH FRAGILE Yeah. And we want system calls to be _really_ fast, because they are even more critical than the scheduler. So maybe we can use a "gettime()" system call. IOW, your argument is a non-argument. No way in HELL do we want to mix up sched_clock() in tracing. Quite the reverse. We want to have the ability to trace _into_ sched_clock() and never even have to think about ...
Slight correction. You can annotate the function with "notrace" and that function will not be traced. So the "only be disabled on a per-file Currently my code calls "ring_buffer_time_stamp" to get the time stamp, whatever it will be. Currently it is using sched_clock, but since I have it as a wrapper, it shouldn't be too hard to modify later. I'll also add a "ring_buffer_time_stamp_normalize(ts)" function to be called on reading of the trace. This will normalize whatever time stamp that we use back to ns for the users. For now it will just return "ts" since sched_clock is already normalize. IOW, for my current work, I don't care what timing we use. Others do, and I will try to make the tracing infrastructure let it be easy to change what is used. For now, I'm concentrating on the infrastructure itself. -- Steve --
Ok. It's still true that we absolutely don't want to add random notrace markers to code just because it's shared with the scheduler. And "sched_clock()" is not a single function with just a few well-defined places, nor are all versions of it at all appropriate for tracing (the non-TSC ones are a total joke - it works for scheduling, but not tracing. Yes. The code looked fine, and had a FIXME. I have no objection to using it as a known buggy approximation for TSC in order to not force every architecture to immediately write one when the patch is discussed. But I literally would expect that on x86, we'd basically just have a function that does "rdtsc" for the common case, along with possibly a generic fallback that does "xadd" in the absense of any other reasonable alternative. Linus --
firstly, for the sake of full disclosure, the very first versions of the latency tracer (which, through hundreds of revisions, morphed into ftrace), used raw TSC timestamps. I stuck to that simple design for a _long_ time because i shared your exact views about robustness and simplicity. But it was pure utter nightmare to get the timings right after the fact, and i got a _lot_ of complaints about the quality of timings, and i could never _trust_ the timings myself for certain types of analysis. So i eventually went to the scheduler clock and never looked back. So i've been there, i've done that. In fact i briefly tried to use the _GTOD_ clock for tracing - that was utter nightmare as well, because the scale and breath of the GTOD code is staggering. cpu_clock() proved to be a good middle ground. I'm not a believer in any of this stuff, i just react to how things behave in practice, as none of this is really easy to get right from the theoretical angle. ... so we can certainly go back to the TSC again, i'll be the last one to complain about extra tracing performance and it will certainly make it less fragile. Maybe i was wrong in declaring that a dead end and it will work out fine. [ Worst-case we'll go back to the sched_clock() again, that will always be an easy option. ] ... and regarding sched.o's notrace marking, we have had functional -pg tracing of sched.o for ages, and we could enable it right now. We can trace inside the runqueue lock just fine. Note: we indeed have commit c349e0a0 that proves that this stuff can lock up. But most notrace markings are not for robustness reasons at all. They have two purposes: 1) correctness. sched_clock() itself should be recursion free. (Even _that_ has a recursion check btw., so normally it shouldnt lock up - it just wont produce very nice traces.) 2) we use notrace and -no-pg to filter out very high-frequency and mostly uninteresting debug function calls. lockdep.o is an example, ...
heh, and i even have a link for a latency tracing patch for 2005 that is still alive that proves it: http://people.redhat.com/mingo/latency-tracing-patches/patches/latency-tracing.patch (dont look at the quality of that code too much) It has this line for timestamp generation: + timestamp = get_cycles(); i.e. we used the raw TSC, we used RDTSC straight away, and we used that for _years_, literally. So i can tell you my direct experience with it: i had far more problems with the tracer due to inexact timings and traces that i could not depend on, than i had problems with sched_clock() locking up or crashing. Far more people complained about the accuracy of timings than about performance or about the ability (or inability) to stream gigs of tracing data to user-space. It was a very striking difference: - every second person who used the tracer observed that the timings looked odd at places. - only every 6 months has someone asked whether he could save gigabytes of trace data. For years i maintained a tracer with TSC timestamps, and for years i maintained another tracer that used sched_clock(). Exact timings are a feature most people are willing to spend extra cycles on. You seem to dismiss that angle by calling my arguments bullshit, but i dont know on what basis you dismiss it. Sure, a feature and extra complexity _always_ has a robustness cost. If your argument is that we should move cpu_clock() to assembly to make it more dependable - i'm all for it. Ingo --
Hi Ingo, I completely agree with both Linus and you that accuracy utterly matters. I currently provide a time source meant to meant the tracing requirements and support architectures lacking synchronized TSC (or tsc at all) in my lttng tree. Feel free to have a look. I've had statisfied users relying on these time sources for about 3 years. See the lttng-timestamp-* commits in git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git The one in question here (x86) is here. You'll see that everything fits in a small header and can thus be inlined in the callers. http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=include... Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Umm. cpu_clock() isn't even cross-cpu synchronized, and has actually thrown away all the information that can make it so, afaik. At least the comments say "never more than 2 jiffies difference"). You do realize that if you want to order events across CPU's, we're not talking about "jiffies" here, we're talking about 50-100 CPU _cycles_. You also ignore the early trace issues, and have apparently not used it for FTRACE. You also ignore the fact that without TSC, it goes into the same "crap mode" that is appropriate for the scheduler, but totally useless for tracing. IOW, you say that I call your arguments BS without telling you why, but that's just because you apparently cut out all the things I _did_ tell you why about! The fact is, people who do tracing will want better clocks - and have gotten with other infrastructure - than you have apparently cared about. You've worried about scheduler tracing, and you seem to want to just have everybody use a simple but known-bad approach that was good enough for you. Linus --
Oh, and I didn't notice (because Steven pointed out "notrace" and I didn't see any of them), that in order to get things to work you had just added CFLAGS_REMOVE_lockdep.o = -pg CFLAGS_REMOVE_lockdep_proc.o = -pg CFLAGS_REMOVE_mutex-debug.o = -pg CFLAGS_REMOVE_rtmutex-debug.o = -pg CFLAGS_REMOVE_cgroup-debug.o = -pg CFLAGS_REMOVE_sched_clock.o = -pg CFLAGS_REMOVE_sched.o = -mno-spe -pg all ovr the place, which was part of my argument against this crap in the first place. Yes, by using all that common infrastructure, you can share some code, but you will always hit that case that now you have serious issues with actually marking it. Now the tracer has to have recursion detection if you ever want to trace any function that might be used for the clock - and quite frankly, especially with virtualization, it's not AT ALL obvious what those are all the time.. That is exactly one of the examples I gave for _not_ doing this. Go back and read my previous emails. Rather than talking about how I call your arguments BS without saying why. Linus --
The above I added because they just made the function tracer output quite ugly and bloated. The lockdep got messing when it was debugging the locks used in the tracer that was tracing lockdep. The above had nothing to do with what kind of clock we used. We added recursion protection, but it was still producing ugly output. --
You'll also find in the lib Makefile: ifdef CONFIG_FTRACE ORIG_CFLAGS := $(KBUILD_CFLAGS) KBUILD_CFLAGS = $(subst -pg,,$(ORIG_CFLAGS)) endif Which removes all -pg flags from all in the lib directory. The reason is that a lot of archs (well, I know PPC for sure) use these functions on early boot up, where simply calling mcount will produce a page fault. -- Steve --
Really, we traced all these files for ages. I can restore it if it's worthwile - but lockdep totally kills the usability of function traces, it inserts thousands of uninteresting events over and over again. Note commit c349e0a0, there i added -no-pg for robustness reasons (we locked up) - back then the scheduler clock was within sched.c. Now it is all separated out cleanly in kernel/sched_clock.o and i think we can add yes, that's true. And that's why we absolutely want to have recursion detection anyway - even given ftrace's _totally_ conservative design it's _very_ easy to accidentally recurse somewhere. ftrace has the same robustness design as lockdep has: as little external infrastructure dependencies as possible. And lockdep has recursion checks too, and excessive amounts of paranoia all around the place. Ftrace has the same robustness philosophy too, and yes, despite that we judged cpu_clock() to be worth the risk, because accurate and fast timestamps are a feature and we didnt want to duplicate. If then i'd rather move towards simplifying sched_clock() even more - that's important for the scheduler too. The scheduler clock has _very_ we lived with these kinds of complications for years literally, and i'm not that stupid to not simplify a debugging framework when i can do it - and i'd not mind risk reduction. Paravirt is hugely misdesigned piece of PITA, for basically every piece of infrastructure that we have: locking, GTOD, lockdep, and yes, the tracer too. So ... if you could suggest a method of how we could get good timestamps in a post-processed method with TSC timestamps, that would be great. That would solve all this discussion and i'm not going to argue against saving raw TSC timestamps, as they _are_ simpler and more robust. Ingo --
and note that there's another pragmatic argument: often we notice cpu_clock() bugs by looking at traces. I.e. people fixing trace timestamps _fix the scheduler_. Sometimes it is very hard to notice scheduling artifacts that happen due to small inaccuracies in cpu_clock(). so there's continuous coupling between precise scheduling and good trace timestamps. I'd be willing to pay a lot more for that than the few (rather obvious...) robustness problems we had with sched_clock() in the past. anyway ... i'm not _that_ attached to the idea, we can certainly go back to the original ftrace method of saving raw TSC timestamps and postprocessing. I think users will quickly force us back to a more dependable clock, and if not then you were right and i was wrong ;-) In fact even when we used sched_clock() there were some artifacts: as you pointed it out we dont want to do per event cross-CPU synchronization by default as that is very expensive. Some people wanted GTOD clock for tracing and we very briefly tried that - but that was an utter maintenance nightmare in practice. Ingo --
to prove it, i just applied this patch:
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -21,7 +21,6 @@ CFLAGS_REMOVE_mutex-debug.o = -pg
CFLAGS_REMOVE_rtmutex-debug.o = -pg
CFLAGS_REMOVE_cgroup-debug.o = -pg
CFLAGS_REMOVE_sched_clock.o = -pg
-CFLAGS_REMOVE_sched.o = -mno-spe -pg
endif
obj-$(CONFIG_PROFILING) += profile.o
and sched.o was fully traced again. For example schedule() to idle is 38
function calls:
$ cd /debug/tracing
$ echo ftrace > current_tracer
$ cat trace
# tracer: ftrace
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
[...]
ssh-2734 [001] 52.291772: schedule <-schedule_timeout
ssh-2734 [001] 52.291772: hrtick_clear <-schedule
ssh-2734 [001] 52.291774: _spin_lock <-schedule
ssh-2734 [001] 52.291775: deactivate_task <-schedule
ssh-2734 [001] 52.291776: dequeue_task <-deactivate_task
ssh-2734 [001] 52.291777: dequeue_task_fair <-dequeue_task
ssh-2734 [001] 52.291778: update_curr <-dequeue_task_fair
ssh-2734 [001] 52.291779: calc_delta_mine <-update_curr
ssh-2734 [001] 52.291780: hrtick_start_fair <-dequeue_task_fair
ssh-2734 [001] 52.291782: find_busiest_group <-schedule
ssh-2734 [001] 52.291783: idle_cpu <-find_busiest_group
ssh-2734 [001] 52.291784: target_load <-find_busiest_group
ssh-2734 [001] 52.291785: weighted_cpuload <-target_load
ssh-2734 [001] 52.291786: weighted_cpuload <-find_busiest_group
ssh-2734 [001] 52.291787: cpu_avg_load_per_task <-find_busiest_group
ssh-2734 [001] 52.291788: source_load <-find_busiest_group
ssh-2734 [001] 52.291789: weighted_cpuload <-source_load
...here is the observations from my second blind attempt - this time to
show that we can trace even lockdep internals just fine.
I applied the patch attached below: it removes all the -pg removals from
kernel/Makefile and restores the notrace markings in kernel/lockdep.c
that commit 1d09daa5 ("ftrace: use Makefile to remove tracing from
lockdep") removed.
then i enabled the whole lockdep machinery (to insert as much extra code
as possible):
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
and enabled ftrace function tracing:
CONFIG_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
[ mcount-record is a new ftrace feature for v2.6.28 ]
and rebuilt and rebooted into the kernel, and enabled ftrace.
It worked just fine this time too, and i was able to trace inside
lockdep and lockstat internals:
<idle>-0 [001] 71.120828: _spin_lock <-tick_do_update_jiffies64
<idle>-0 [001] 71.120829: __lock_acquire <-lock_acquire
<idle>-0 [001] 71.120829: lock_acquired <-_spin_lock
<idle>-0 [001] 71.120829: print_lock_contention_bug <-lock_acquired
<idle>-0 [001] 71.120830: do_timer <-tick_do_update_jiffies64
this wasnt done since May this year (since commit 1d09daa5).
It's absolutely not unrobust. Yes, more code will always regress more
than no code, but it's a cost/benefit balance not a design must-have.
And to prove me wrong i'm sure we'll have some really bad cpu_clock()
regression within the next 24 hours ;)
Why did we do commit 1d09daa5, 6ec56232 and c349e0a0 which marked all
these .o's notrace? Partly to address a real (meanwhile fixed)
regression wrt. sched_clock() [and this supports your point], partly to
remove extra unnecessary trace entries from debug infrastructure, and
partly out of pure paranoia: i didnt ...Now do the same on a CPU that doesn't have TSC. And notice how useless the timestamps are. Linus --
i do not understand this argument of yours. (really)
1) is your point that we might lock up?
2) or perhaps that the timestamps update only once every jiffy, and are
in essence useless because they show the same value again and again?
the latter is true, and that's why we were pushed hard in the past by
tracer users towards using GTOD timestamps. Everyone's favorite
suggestion was: "why dont you use gettimeofday internally in the
tracer???".
We resisted that because GTOD timestamps are totally crazy IMO:
- it is 1-2 orders of magnitude more code than cpu_clock() and
all sched_clock() variants altogether.
- it's also pretty fragile code that uses non-trivial locking
internally.
- pmtimer takes like 6000-10000 cycles to read. hpet ditto. Not to talk
about the PIT. Same on other architectures.
[ ... and as usual, only Sparc64 is sane in this field. ]
for a some time we had a runtime option in the latency tracer that
allowed the GTOD clock to be used (default-off) - but even that one was
too much and too fragile so we removed it - it never got upstream.
Fortunately this is not a big issue as almost everything on this planet
that runs Linux and has a kernel developer or user sitting in front of
it has a TSC - and if it doesnt have a TSC it doesnt have any other
high-precision time source to begin with. So worst-case sched_clock()
falls back to a sucky jiffies approximation:
unsigned long long __attribute__((weak)) sched_clock(void)
{
return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
}
3) ... or perhaps is it your point more highlevel, that we shouldnt be
dealing with timestamps in a central manner _at all_ in the tracer, and
we should make them purely optional?
I indeed _had_ a few cases (bugs i debugged) where i was not interested
at all in the timestamps, just in their relative ordering. For that we
had a switch in the latency tracer that turned on (expensive!) central
synchronization [a shared global atomic ...Have you at all followed the discussion about the people who asked for cross-CPU ordering? They wanted not timestamps at all, but global atomic counter updates. Which is very reasonable, if timestamps don't work (and jiffies certainly doesn't, especially in a NOHZ environment). IOW, my whole argument is that what tracers want is _not_ the same thing as what "sched_clock()" wants. Linus --
ah, i see what you mean. I think that's an orthogonal property. Well, it's important in some cases, not that important in other cases. Historically we've been flip-flopping on that issue in ftrace, whether it should be coherent by default or not. We had at least three of four variations of global synchronization. (one was an atomic generation counter, another variant a global lock) Eventually people noticed the overhead and asked for it to be faster. If all you do is to trace high-freq events on all CPUs and you are _not_ interested in the precise interactions, the overhead of global synchronization can hurt a lot. In any case, SMP coherency of trace events is an independent property of the tracer, and preferably something that can be turned on/off. Ingo --
Just a note. The current ring buffering system that I'm proposing keeps its own time stamp counter (currently sched_clock) that will most likely be updated later. I'm trying to keep this ring buffer system as dumb as possible. It does not even implement the merge sort. That's up to the tracer to handle. There's nothing stopping the trace from adding some atomic counter to each event to help it sort. So yes, the tracer can implement anything it wants on top of the ring buffer ;-) -- Steve --
Mathieu seems to disagree, it would be good if he can share some specifics so we can work on resolving those. --
correct. The price is all the notifier/callback overhead and the loss of type checking of the record contents. But that's an unavoidable price of yes, very nice! :) Ingo --
let me outline why that flip-flopping occured.
- coherent tracer: has built-in serialization of global events. If the
tracer shows events to be after each other, they were after each
other.
- incoherent tracer: events might be mixed up slightly on the micro
scale.
In your tree you'll see dozens of fixes from me in the past 10 years or
so where i used various tracers to find some bug. Some of them were done
with coherent tracers, some of them were done with incoherent tracers.
Here a few common patterns that influenced which kind of tracer i used:
- SMP races. There it's really important to see the ordering of events,
and coherent tracers (where the ordering of events as displayed by
the tracer can be trusted) are used by default
- EXCEPT: _very_ often an SMP race goes away if we add global
synchronization to trace events. Sometimes the pure delay can hide
races. So incoherent tracers are very important here.
- analysis of performance problems: here incoherent tracers win hands
down. It's important to see all events on all CPUs, but it's not at
all important to see the precise micro-ordering of events on a global
basis. We want to see rough workload behavior, how tasks iteract -
and most importantly, we want tracing to be as low-overhead as
possible.
- [ in many cases coherency does not matter because we only look at a
single CPU's or app's trace. ]
for example on an 16-way CPU, when i run a high-event-count workload and
add global serialization to events, the workload can easily be slower by
10% or more. Sometimes even the characteristics of the workload changes
due to having a globally synchronized tracer.
So ... neither coherent nor incoherent tracers are a clear, obvious
default.
IMO incoherent is the more useful default in terms of being able to find
bugs with it, because it has a higher utility factor. People have to
interpret traces anyway, and the main usecase of ftrace is that i ask ...Steve got the _worst-case_ cpu_clock() difference down to 60 usecs not so long ago. It might have regressed since then, it's really hard to do it without cross-CPU synchronization. ( But it's not impossible, as Steve has proven it, because physical time goes on linearly on each CPU so we have a chance to do it: by accurately correlating the GTOD timestamps we get at to-idle/from-idle times to the TSC. ) And note that i'm not only talking about cross-CPU synchronization, i'm also talking about _single CPU_ timestamps. How do you get it right with TSCs via a pure postprocessing method? A very large body of modern CPUs will halt the TSC when they go into idle. (about 70% of the installed base or so) Note, we absolutely cannot do accurate timings in a pure TSC-post-processing environment: unless you want to trace _every_ to-idle and from-idle event, which can easily be tens of thousands of extra events per seconds. What we could do perhaps is a hybrid method: - save a GTOD+TSC pair at important events, such as to-idle and from-idle, and in the periodic sched_tick(). [ perhaps also save it when we change cpufreq. ] - save the (last_GTOD, _relative_-TSC) pair in the trace entry with that we have a chance to do good post-processed correlation - at the cost of having 12-16 bytes of timestamp, per trace entry. Or we could upscale the GTOD to 'TSC time', at go-idle and from-idle. Which is rather complicated with cpufreq - which frequency do we want to upscale to if we have a box with three available frequencies? We could ignore cpufreq altogether - but then there goes dependable tracing on i very much used early code tracing with ftrace in the past. In fact once i debugged and early boot hang that happened so early before _PRINTK_ was not functional yet (!). So, to solve this bug, i hacked ftrace to use early_printk(), to print out the last 10,000 functions executed before the hang - and that's how i found the reason for the hang - ...
The "full timestamp" records should include:
* absolute tsc
* absolute monotonic timestamp
* new tsc freqency
If you then make sure that all the cpufreq/idle/suspend-resume code
emits appropriate records when changing the tsc frequency, then you
should always be able to fully regenerate an absolute timestamp.
If you generate the monotonic timestamp with a good clocksource, then
you should be able to correlate the timestamps between cpus.
Oddly enough, this is identical to the Xen clocksource's use of the tsc ;)
J
--
Simple solution: turn off cpufreq whilst tracing is on ;-) Harder: Keep a timebase and frequency divisor on a per-cpu basis and calculate your offsets from there. This brings you down to HPET resolution though --
note, CONFIG_LOCKSTAT - which hooks deep inside lockdep, uses cpu_clock() too for timings, for similar reasons. Lockdep and lockstat both have a very robust design and a very good track record to prove it. you'd be correct in pointing out that we _do_ have a relatively high regression count in cpu_clock()/sched_clock(), but the reason for that is that its implementation balances on the narrow edge of doability. It implements a very unstable set of requirements: "absolutely fast" pitted against "as accurate as possible". That is two conflicting requirements and it is a very fine line to walk and everyone tries to have their own variant of it and their own balance. We try as hard as possible to use the TSC even in face of C2/C3, cpufreq, unstable TSCs, etc. The moment we go too much towards performance we regress accuracy and hurt the scheduler's quality and vice versa. and note that my years long experience in the tracing field show that it has a _very_ similar need for accuracy versus performance, so it was a good match for ftrace. Ingo --
That suggests that frequency changes should be recorded at a lower layer
as well, along with full timestamps and deltas, so that a log parser can
correctly generate the timestamps without having to parse higher-level
trace records. Maybe the simplest thing to do is make the full
timestamp records also include frequency and offset information (to deal
with discontinuities caused by things like a vcpu switching between
physical cpus with unsynced tscs).
J
--
Yes and no. The reason I say "and no" is that it's not technically really possible to atomically give the exact TSC at which the frequency change took place. We just don't have the information, and I doubt we will ever have it. As such, there is no point in trying to make it a low-level special op, because we'd _still_ end up being totally equivalent with just doing as regular trace-event, with a regular TSC field, and then just fill the data field with the new frequency. But yes, I do think we'd need to have that as a trace packet type. I thought I even said so in my RFC for packet types. Ahh, it was in the but yes, we obviously need the frequency in order to calculate some kind of wall-clock time (it doesn't _have_ to be in the same packet type as the thing that tries to sync with a real clock, but it makes sense for it to be there. That said, if people think they can do a good job of ns conversion, I'll stop arguing. Quite frankly, I think people are wrong about that, and quite frankly, I think that anybody who looks even for one second at those "alternate" sched_clock() implementations should realize that they aren't suitable, but whatever. I'm not writing the code, I can only try to convince people to not add the insane call-chains we have now. Linus --
hm, i'd really hope hw makers see the light and actually make the hw do it all. Signs are that they are cozying up to these ideas. Good and fast timestamps are important, and it is _infinitely_ more easy to do it in hw than in sw. Firstly they need a low-frequency (10khz-100khz) shared clock line across all CPUs. A single line - and since it's low frequency it could be overlaid on some existing data line and filtered out. That works across NUMA nodes as well and physics allows it to be nanosec accurate up to dozens of meters or so. Then they need some really cheap way to realize what absolute value the clock counts, and read it out every now and then in the CPU, and approximate it inbetween, and have a secondary stage cheap few-transitors long-latency multiplicator that keeps passing on the nanosec-ish value to a register/MSR that can be read out by the instruction. This trivially works fine even if the CPU is turned off. It uses nary any power as it's low freq, and can be spread across larger system designs too. In fact it would be a totally exciting new capability for things like analysis of SMP events. PEBS/BTS could be extended to save this kind of timestamp, and suddenly one could see _very_ accurately what happens between CPUs, without expensive bus snooping kit. and CPUs wont go beyond the '~1nsec' event granularity for quite some time anyway - so nanoseconds is not a time scale that gets obsoleted quickly. [ i guess this proves it that everyone has his pipe dream ;-) ] Ingo --
Can this possibly be true? I mean, light travels only one foot every nanosecond. Can it really keep nanosecond accuracy up to dozens of meters away? If you send the same signal to CPU1 that is 1 foot away, as well as send it to CPU2 that is 2 feet away. CPU2 will get that signal at least 1 nanosec after CPU1 receives it. Of course if the hardware is smart enough to know this topology, then it could account for these delays in traffic. -- Steve --
Sure. NTP keeps machines within 1ms (or better) of each other even
though the network latency is much higher and jittery.
J
--
yes. And there are radio telescope arrays that are synced up to do delta interferometry, over thousands of kilometers. Syncing up time over a few dozen meters is no challenge - and the reason for that ease is that physical time is neatly and uniformly broadcasted by nature in a pretty dependable way, at around 300 thousand kilometers per second. the challenge is to make it cheap enough for commodity hw. I.e. no extra CPU pins or lines in critical parts of the board, no extra power, low transistor count, no impact on any critical path, short and reliable clock readout after powerup, etc. But that is quite possible too IMO, and the payback is very real. [ OTOH, this is a world that still ships FreeDOS on many whitebox PC instead of putting Linux on it, so dont expect logic to prevail in all cases ;-) ] Ingo --
Yup, it can be done. Ingo's proposal reminds me on a project we did in 1999 for distributed computing. |------------| |master clock| |1 MHz |===========[slave 1]=========[slave 2] .......... |sync pulse | |1 Hz | |------------| The slave implementation was a simple PLL driven by the master clock with an output frequency of 1GHz. The micro seconds counter was driven by the 1MHz clock and the nanoseconds part by the PLL clock. The nanoseconds counter was implemented so it stopped counting at 999 and it was reset to 0 when the 1MHz pulse came in. That made sure that the PLL inaccuracy was corrected with every 1MHz pulse. When a slave attached itself to the system, then it waited for the 1Hz sync pulse, queried the absolute time from the master via the network and synced itself to the next 1Hz sync pulse. The signal runtime was compensated by the position of the node in the topology, i.e. the distance to the master clock. That's simple math. Signal runtime is known for a given wiring / PCB layout. So you just have to know the distance. For twisted pair the signal speed is ~ 0.6 * c =~ 180000 km/s. So a node which is 1m away from the master gets the signal delayed by: 0.001 / 180000 s ~= 5.5 nsec That's constant and easy to account for. You can get into the +/- 10nsec accuracy accross a large distributed system that way. The sad part is, that we talked to AMD/Intel about this back then and they both thought it would be a nice idea for cross socket synced counters and could be easy implemented in hardware. Then they went off and did the sh*t which we have to deal with right now. Thanks, tglx --
Well, you don't need the tsc at the precise moment of the frequency
change. You just need to emit the current tsc+frequency+wallclock time
before you emit any more delta records after the frequency change. You
Yeah. If you ever mention wallclock time in the event stream, you have
to tie it to your local timebase (tsc+frequency) to make the whole thing
Yeah. Unfortunately, in the virtual case - unless you're virtualizing
the tsc itself, which is horrible - you can't really control or measure
how the tsc is going to behave, because its all under the hypervisor's
control. A "cpu" could be migrated between different physical cpus, the
whole machine could be migrated between hosts, or suspended, etc, making
it very hard to use the naked tsc. In that case the only real option is
to use a hypervisor-supplied timebase (which for Xen and KVM is a
tsc-based scheme exactly like we've been discussing, except the
hypervisor provides the tsc timing parameters).
asm/x86/kernel/pvclock.c does the tsc to ns conversion with just adds
and multiplies, but unfortunately it can't be expressed in C because it
uses the extra precision the x86 gives for multiplies.
J
--
hm, i'm not sure you've thought through this delta record idea. Take a system that goes idle thousands of times a second. (it's easy - just some networking workload) Now take a tracer that just wants to emit a trace entry every now and then. Once or twice a second or so. Note that suddenly you have thousands of totally artificial 'delta' time records between two real events, and have to post-process all your way up between these events to get to the real timeline. ... it is totally impractical and costly. and then i havent even mentioned some other big complications: - the numeric errors that mount up over thousands of delta events - the memory overhead over thousands of entries - the fact that cpufreq and PLL changes are rarely atomic and that the TSC can flip-flop between two frequencies. TSC based delta time post-processing is just not practical. Micro-time has to be synced up to GTOD in some manner - OR you might declare that the TSC _has_ to never stop and has to be constant. (no modern x86 CPU meets that criterium at the moment though.) ... and the moment you accept the fact that the GTOD _has_ to be mixed into it, all the rest follows pretty much automatically: either you store the (GTOD,freq,TSC) triple and post-process that absolute timestamp, and you accept the in-memory cost of that and do post-processing, or you compress that triple in situ and store the result only. [ You will then also want some fall-back implementation for CPUs that have no TSCs, and for architectures that have no default implementation - something jiffies based. And you want some special hooks for paravirt, as always. ] I.e. you will end up having something quite close to cpu_clock()/sched_clock(). _Or_ if you manage to get any better than that for tracing then please tell us about it because we want to apply those concepts to cpu_clock()/sched_clock() too :-) Ingo --
No, as I said: "You just need to emit the current
tsc+frequency+wallclock time before you emit any more delta records
after the frequency change."
When an event which affects the tsc occurs, like a frequency change or
pause, set a flag. When you're next about to emit a delta, check the
You need to know the frequency at the time you sample the tsc, and you
need to know when the frequency changes. If you don't, you can't use
the tsc for time, regardless of whether you process it immediately or
Right. You store (GTOD,freq,tsc) every time you need that information,
and then interpolate with the tsc while you know its monotonic.
Unless your tsc is completely screwed, the (GTOD,freq,tsc) triple is
going to be stored at a fairly low frequency, and won't fill your event
buffer very much (though it might be a large proportion of your recorded
Well, whatever the best timer the platform has. And maybe its already
processed into real time, in which case you just emit raw deltas and
Well, its what Xen does already for time. It works well.
J
--
This is where I get confused. When do we add this to the trace buffer? Every record (currently) records the delta time from the last event. If a frequency changes, do we need to record the tuple at that moment? If so, we have an issue. We can be in the middle of tracing the GTOD and we can not recurse back into the GTOD for fear of deadlocking on the GTOD spinlocks. Do we add this data to the buffers when it happens? This means every allocated ring buffer (and yes you can have more than one), will need to register a call back so that it can record this information into the buffer. Or am I just not getting it? -- Steve --
No, you just need to save away a consistent (tsc,freq,GTOD) triple at
some point in time; it doesn't matter when it is, so long as the tsc
frequency hasn't changed since then. Once you have that, you can insert
it into your trace buffer, and then use it as the base for subsequent
deltas.
And as I mentioned in the other mail, if you don't mind backtracking
when you process the log, you can even record (tsc,freq) (tsc,GTOD)
Right, you don't need to read the GTOD when you actually write a record.
J
--
Even in my last version I added a "TIME_STAMP" type that can be used in the future to add some kind of synchronization into the trace, that reads GTOD or whatnot. But as you can see, I've been trying to implement these various ideas, since the devil is in the details and the code is the details. How do you get this GTOD read in the ring buffer? If the ring buffer does it without any knowledge from the tracer, it may be doing it a inappropriate times. This would also imply that the GTOD infrastructure itself is reentrent safe. Imagine tracing the GTOD code when the buffer decides it is about time to add the GTOD timestamp into the buffer. Can the GTOD handle this recursion. If the GTOD has spinlocks, probably not. Perhaps we can add a ring_buffer_write_safe() method that would prevent the trace from doing these. Or we can add a way for the tracer to trigger If we do end up making a new clock API, I imagine that each arch will define their own. -- Steve --
It doesn't need to read the GTOD synchronously when writing the trace
record. When a tsc event occurs, it needs to store the
GTOD/tsc/frequency somewhere at that point. When you next write a trace
record, if that structure has been updated, you write it into the trace
before emitting the next delta timestamp.
In fact you can read the GTOD at almost any time between tsc frequency
updates, so you can defer it to whenever is convenient. At the time the
tsc changes frequency, you emit: (tsc,frequency), and a bit later
(before it changes again) you emit (tsc,GTOD). That allows you to
retroactively compute GTODs for all timestamps. Obviously it would be
good to get them as close as possible together.
J
--
OK, let me rephrase my question. How and where do we record this? Do we keep this information in some global variable that we must compare to every time we add a new item in the trace? Do we have the buffer register a call back to record this information? -- Steve --
Something like (total pseudocode):
struct tsc_time_parameters {
int version; /* even - values OK; odd - values being updated */
u64 tsc;
u32 tsc_freq;
u64 gtod;
};
DEFINE_PERCPU(struct tsc_time_parameters, tsc_params);
/* To be called after a tsc frequency change, before any new
trace records are being emitted, in a context where we can call get_GTOD() */
void update_tsc_params(void)
{
struct tsc_time_parameters *p = __get_percpu_var(tsc_params);
p->version |= 1;
wmb();
p->tsc = get_tsc();
p->tsc_freq = get_tsc_freq();
p->gtod = get_GTOD();
wmb();
p->version++;
wmb();
}
DEFINE_PERCPU(unsigned, current_tsc_version);
DEFINE_PERCPU(u64, prev_tsc);
/* may be called in any context */
u64 get_trace_timestamp_delta(void)
{
const struct tsc_time_parameters *p = &__get_percpu_var(tsc_params);
unsigned *current_version = &__get_cpu_var(current_tsc_version);
u64 prev = __get_cpu_var(prev_tsc);
u64 now, ret;
/* check the current tsc_params version against the last one we emitted;
if the version is odd, then we interrupted the parameters as they were
being updated, so just emit a new delta with the old parameters */
if (unlikely(*current_version != p->version && !(p->version & 1))) {
/* XXX probably need a loop to deal with p->version changing under our feet */
emit_tsc_freq_record(p);
prev = p->tsc;
__get_cpu_var(current_tsc_version) = p->version;
}
now = read_tsc();
ret = now - prev;
__get_cpu_var(prev_tsc) = now;
return ret;
}
J
--
[ "When people ask me what language my mother tongue is, I simply reply 'C'" - Steven Rostedt ] This is exactly why I have that saying ;-) We probably wont to check here that p didn't change again. Hmm, the beginning of each patch will need to record the global tsc, as well as this information. Simply because in overwrite mode, we do not want to lose it if the producer is faster than te consumer. -- Steve --
No, they could probably be static, depending where everything ends up.
Yes, and any other place the tsc might get affected, like going into a
Yeah, you may want to put the whole thing in a loop to make sure that
the version is consistent. You might end up emitting multiple redundant
Each patch? Page?
J
--
Yep, s/patch/page/. I was already half asleep when I replied. ;) -- Steve --
I've been just using sched_clock() which already normalizes to ns. But I use a wrapper (ring_buffer_time_stamp) so we can decide on how to keep track later. If we do not normalize, then we must come up yet another generic way to read the CPU clock for all archs. And then we also need to come up with another generic way to normalize it later for output. If I'm missing that this already exists, then I'll go and use it, but I do not think that tracing is worthy enough to implement this timing infrastructure just to get faster traces. -- Steve --
Why would any of this be "generic"? Quite the reverse. It should be as trace-buffer specific as possible, so that we do *not* share any code or any constraints with other people. Just do rdtsc at first, and make it depend on x86. If the thing is made simple enough, it will be a couple of lines of code for architectures to read their own timestamp counters. And since the normalization is then no longer in the critical part, _that_ can be architecture-independent, but obviously still trace-specific. You need to know the frequency, and that involves having frequency events in the trace if it changes, but if you don't see any frequency events you just take "current frequency". And doing it at trace parse time, we can some day enable a boot trace that actually WORKS. Have you looked at the timestamp events we get from "sched_clock()" in early bootup? They show up in the kernel logs when you have CONFIG_PRINTK_TIME. And they are totally and utterly broken and _useless_ for the early stages right now. And they shouldn't have to be that way. Yeah, we'll never be able to trace stuff that happens really early (tracing will obviously always need kernel page tables and some really basic stuf working), but we should be able to trace through things like TSC calibration for boot time analysis. It wasn't that long ago that we had the whole discussion about TSC calibration taking 200ms. Or the early ACPI code. And get meaningful data. Linus --
generic as in, could be implement in architecture dependent ways but with a "generic" interface. IOW, I don't want the trace to be dependent on any arch. ftrace already runs on x86, ppc, sparc64, mips, arm, sh, and I could do a HAVE_RING_BUFFER_TIMESTAMP config option for archs that implement it, and just use something dumb for those that don't. For now I'll keep to sched_clock, just because it makes it easy for me. But with The one thing that seemed to me most apparent from talking to people at LPC, is that they want a simple ring buffer API. If every tracer that uses this must come up with its own time keeping management, I don't think this will be used at all (except by those that are maintaining tracers My logdev code has a define option to use bootmem for its buffers, and it also uses an atomic counter to try to keep things in order. Heck, at early boot, the events happen in order anyway, since it is still a single CPU system. -- Steve --
No, no. The timestamp code is all in the ring buffer code. That was why I refused to have the layering without it. And hell no, nobody should *ever* read the "tsc_delta" fields etc. Those are entirely internal to the buffering. If any user _ever_ reads or writes those on its own, it's a bug, plain and simple. So when you read trace events, you should get the event data and the timestamp from the trace buffer routines. Nobody should ever even _see_ the internal trace buffer implementation! Linus --
Pretty much all CPUs word align on a 8 byte bounder (until we get those 128bit boxes running), but not all can word align on 4 bytes. I was hoping to make the buffer output somewhat the same across archs. Otherwise, this is going to be quite a complex mess IMHO. -- Steve --
Mmm. Either I'm confused, or we're talking about different things. If we just record the TSC unshifted, in 27 bits, at 4GHz, that gives us about 1/30 of a second? So we either shift, use > 27 bits, or record at least 30 events a second, none of which I like much ... --
If we use 32 bits instead of 27, then the timestamp events are only about once per second, which is probably fine for overhead ... ? I think we're OK losing 5 bits of precision, that's only 32 cycles, given all the CPU reordering stuff we've talked about here, not-quite-synced TSCs, etc. I suspect you thought we were shifting by much more than this, in reality it was 5-10 bits, with timestamp events inbetween, though we put wall time in them, which I think was a mistake. (note: I'm not suggesting we have to use this compact a format, at least by default) --
You'd have them ONCE IN A BLUE MOON. If there is nothing going on, you don't need the timestamps at all. Yes, some people will _want_ a heartbeat, of course, and there might be serialization points where we want to serialize some external clock to the TSC, but that's a separate issue from generating a full TSC value. We may well decide that once a second we want a whole packet with TSC _and_ HR-timer sync information, for example. But that's a separate issue. And if there is a lot of things going on (eg you're tracing things like a block device performance issue), you have events _much_ closer to each other than 1/30th of a second, and you again _never_ need a full 59-bit timestamp. Because the delta would be cumulative, and as long as you are generating events, you're also updating the base TSC. The only case you'll see lots of those timestamps (where "lots" is guaranteed to be less than 30 in one second) would be if you're tracing something that literally does a couple of events per second. But then you sure as hell don't need to worry about performance _or_ memory use. If you have some trace that gives you five hits per second (and spread out, to boot!), you'll generate twice the number of trace entries because each entry would always be preceded by an extended thing, but hey, do we really worry about five trace events per second? And quite frankly, most of the tracing I've ever done really does fall into the "either nothing" or "a flood of events" thing. The "5-25 events per second at regular intervals" case really doesn't sound very common at all - not that I would worry about it if it was since it's going to be a really simple case.. And btw, this would be an issue only on really fast CPU's anyway. If you're in the mobile space, your TSC will be clocking at less than that, so the slower the machine, the less the overhead. Again, that's exactly what you want - the overhead is not some fixed thing that is detemined by the fastest ...
Yeah, you're right - we can just mark it dirty, and 'pre-log' the timestamp events when someone calls a reserve and we haven't logged anything for more time than we can store. Did not think of that. Was only 5 bits for us, not an extra 37, but still, is much better. Is a 5-bit event id generic enough though? --
Actually, I was keeping the event id completely out of the ring buffer and let a higher layer deal with that. For padding, I just made the length field zero. For overflows of the timestamp, we can reserve the -1 timestamp as a trigger to read the tsc again and put the full 64 bits into the record. Just an idea. -- Steve --
No, we don't shift (we don't want to lose precision), and we don't use more than 27 bits by default. the TSC at each entry should be a _delta_. It's the difference from the last one. And if you get less than 30 events per second, and you need a bigger difference, you insert an extra "sync" tracepoint that has a 59-bit thing (27 bits _plus_ the extra 'data'). Yes, it adds 8 bytes (assuming that minimal format), but it does so only for any trace event that is more than 1/30th of a second from its previous one. IOW, think of this not in bytes, but in bytes-per-second. It adds at most 8*30=240 bytes per second, but what it _saves_ is that when you have tens of thousands of events, it shaves 4 bytes FOR EACH EVENT. See? Also, quite often, the clock won't be running at 4GHz even if the CPU might. Intel already doesn't make the TSC be the nominal frequency, and other architectures with TSC's have long had the TSC be something like a "divide-by-16" clock rather than every single cycle because it's more power-efficient. So there is often a built-in shift, and I doubt we'll see 10GHz TSC's even if we see 10GHz CPU's (which many people consider unlikely anyway, but I'm not going to bet against technology). Linus --
The reason why Martin did use only a 27 bits TSC in ktrace was that they were statically limited to 32 event types. I doubt this will suffice for general purpose kernel tracing. For simplicity, I would just start with a header made of the 32 TSC LSBs, 16 bits for events ID and 16 bits for event size in the buffer header. We can always create extra-compact schemes later on which can be tied to specific buffers. I actually have I agree that, in the end, we will end up with "delta" information given by the timestamp, but there is a way to encode that very simply without having to compute any time delta between events : we just have to keep the bits we are interested to save (say, the 32 LSBs) and write that as a time value. Then, whenever we have to write this value, we either have a heartbeat system making sure we always detect 32 bit overflows by writing an event at least once per 32-bit overflow or by adding the full 64-bits timestamp as a prefix to the event when this occurs (as you proposed). Note that the latter proposal imply extra computation at the tracing site, which could have some performance impact. There are a few reasons why I would prefer to stay away from enconding time deltas and use direct LSB tsc representation in the event headers. First, deltas make it hard to deal with missing information (lost events, lost buffers); it those cases, you simply don't know what the delta is. OTOH, if you encode directly the LSBs read from the cycle counter, you can more easily deal with such lack of information (lost events) and lost subbuffers by writing an extended 64-bits event header when needed. The benefit of using the bigger event header when required over using heartbeat, even if it makes the tracing fastpath a bit slower, is that is won't impact systems using dynamic ticks. Heartbeats are generally bad at that because their require the system to be woken up periodically. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE ...
Well, I actually think we could do the same - for the "internal" types.
So why not do something like 4-5 bits for the basic type information, and
then oen of those cases is a "freeform" thing, and the others are reserved
for other uses.
So a trace entry header could easily look something like
struct trace_entry {
u32 tsc_delta:27,
type:5;
u32 data;
u64 array[];
}
and then depending on the that 5-bit type, the "data" field in the header
means different things, and the size of the trace_entry also is different.
So it could be something like
- case 0: EnfOfPage marker
(data is ignored)
size = 8
- case 1: TSCExtend marker
data = extended TSC (bits 28..59)
size = 8
- case 2: TimeStamp marker
data = tv_nsec
array[0] = tv_sec
size = 16
- case 3: LargeBinaryBlob marker
data = 32-bit length of binary data
array[0] = 64-bit pointer to binary blob
array[1] = 64-bit pointer to "free" function
size = 24
- case 4: SmallBinaryBlob marker
data = inline length in bytes, must be < 4096
array[0..(len+7)/8] = inline data, padded
size = (len+15) & ~7
- case 5: AsciiFormat marker
data = number of arguments
array[0] = 64-bit pointer to static const format string
array[1..arg] = argument values
size = 8*(2+arg)
...
ie we use a few bits for "trace _internal_ type fields", and then for a
few of those types we have internal meanings, and other types just means
that the user can fill in the data itself.
IOW, you _could_ have an interface like
ascii_marker_2(ringbuffer,
"Reading sector %lu-%lu",
sector, sector+nsec);
and what it would create would be a fairly small trace packet that looks
something like
.type = 5,
.tsc_delta = ...,
.data = 2,
.array[0] = (const char *) "Reading sector %lu-%lu\n"
.array[1] = xx,
.array[2] = yy
and you would not actually print it out as ASCII until somebody read it
from the kernel (and any "binary" interface would get the string as a
string, not as a ...Btw, in case it wasn't clear, those two are totally different things. The "case 1" thing is the thing that gets inserted automatically by the trace code when it's needed because the 27-bit TSC is too limited. The "case 2" thing is to allow us to occasionally synchronize with some global known wall-time clock like the HPET + xtime. IOW, it would be something that on demand creates a mapping from wall clock to TSC for that particular CPU. I guess I should perhaps have put the TSC frequency in there in that "case 2" thing too. Maybe that should be in "data" (in kHz) and tv_sec/tv_nsec should be in array[0..1], and the time sync packet would be 24 bytes. Linus --
We could use a page header instead to contain the "unused_size"
information. It does not need to be an event per se. Putting this
information in the page header makes it easy for a consumer to just read
the amount of bytes needed, excluding the padding (turns up to be useful
for network streaming of trace data). Also, it frees up one event ID for
other uses. I think the event ID real estate is pretty important,
because every event ID we don't keep for "internal uses" could be used
I would prefer to put the extended timestamp within the event header
instead of creating a separate entry for this for atomicity concerns
(what happens if a long interrupt executes between the TSCExtend marker
event and the event expecting to be written right next to it ?). If we
choose to have such "full tsc" event headers, we would have to reserve 1
selection bit. It leaves us 4 bits for event IDs. If we remove
heartbeats, we need even less "internal" IDs.
Given that we can allocate event IDs per buffer, if in general we assume
that buffer users will have few event IDs, I think we can find a way to
minimize the number of "internal" event IDs and keep possibly all the 4
bits (16 IDs) for real tracer events. One way to achieve it is to encode
the extra typing information within a table (dumped in a separate buffer)
This one could even be a standard event put in a single buffer. There is
I agree that exporting semantic information is important. Moreover, I
think this should also be made available when the trace is exported in
binary format to userspace. The markers currently in mainline has been
designed to do this efficiently. With small adaptation of the markers,
one could do a :
trace_mark(block, read,
"Reading sector sec_from to sec_to",
"sec_from %lu sec_to %lu",
sector, sector + nsec);
The nice part about the markers is that it keeps tables for the
description "Reading sector sec_from to sec_to" and the typing
information "sec_from %lu sec_to ...The log entries should be reserved with interrupts disabled anyway, and they are per-CPU, so there are no atomicity issues. For NMI's, things get more exciting. I'd really prefer NMI's to go to a separate ring buffer entirely, because otherwise consistency gets really hard. Using lockless algorithms for a variable-sized pool of pages is a disaster waiting to happen. I don't think we can currently necessarily reasonably trace NMI's, but it's something to keep in mind as required support eventually. Linus --
I actually do use a lockless algorithm in LTTng and don't have to disable interrupts around tracing. This is how I get the kind of performance the Google folks expect. I would recommend to stay with interrupt disable + per-cpu spinlock (slow and heavy locking) for v1, but to keep in mind that we might want to go for a more lightweight LTTng does it, no disaster happened in the past 2-3 years. :) I guess we could manage to deal with NMI tracing specfically using the NMI tracing is a nice-to-have (and lttng does provide it), but the core thing is really performance; disabling interrupts happens to be fairly slow on many architectures. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I remembered other concerns about 27 vs 32 bits TSC decision, which are
rather important. First, if we have a 27 bits TSC, with overflow every
33ms at 4GHz, we assume the kernel will _never_ have an interrupt
latency longer than this for correct heartbeat behavior. The sad thing
is that it is not uncommon to see such interrupt latencies once in a
while. It's especially bad if the correctness of the timestamps gathered
by the one tool that would be helpful to debug such interrupt latency
problem is broken by the thing it try to instrument.
Given that heartbeat events have this particular downside, we may think
writing an extended TSC value in the event header will solve the
problem, but it actually creates a new kind of problem when we try to
relax buffer locking. The basic idea is that it's easy to compute the
delta from the previous timestamp taken on the same CPU when disabling
interrupts because we are serializing all operations for this specific
core. However, a more lightweight locking (without interrupt disabling)
involves using a loop such as:
reserve space (preempt is disabled) :
unsigned long old, new;
do {
old = read_offset();
timestamp = get_cycles();
new = old + header_size() + event_size();
} while ((local_cmpxchg(&buf->offset, old, new) != old);
write event header
write event payload
commit
So we can make sure the timestamps will never go backward in a given
buffer (this is insured by the fact that this cmpxchg ties together
timestamp read and buffer space reservation).
If we want to implement detection of TSC rollover at the tracing site,
we would have to remember the last TSC value read, and this is where it
becomes racy. We cannot read the previous event slot, because we have no
guarantee it's been written to (only reserve operation is ordered. write
and commit are unordered). We would therefore have to keep a copy of the
previous TSC value, but given we don't do any locking, it would only be
safe to only keep data that can be read/writte ...We do no such thing. Guys, the heartbeat is a _separate_ thing from overflow handling. You don't handle overflow by having a heartbeat that beats fifty times a second just to insert events, just so that the TSC delta would always fit in 27 bits. That would work, but be stupid. It would mean that you fill up your event buffer with uninteresting crud just because nothing happens. Yes, many people want to have a heartbeat (a "Mark" event) every once in a while, but what I suggest is independent of heartbeats, even if it _could_ be implemented that way. What I suggest is simply that when you insert an event, you always read the full 64 bits of TSC (on x86 - others will do other things), and then you insert the delta against the last one. After all, you cannot read just 27 bits of the TSC anyway. You _have_ to read the whole 64 bits, and then you subtract the pervious trace event TSC (that you have in the per-CPU trace buffer header) from that. You now have a delta value. And if the delta doesn't fit in 27 bits, you generate a 59-bit TSC event! None of this has _anything_ to do with interrupt latency. There is no dependency on a heartbeat, or any dependency on always inserting a trace event at least 30 times a second. There's no worry about conversions, and these are all trivial single assembly instructions to do (or a couple, on a 32-bit architecture that needs to do a sub/sbb pair and test two different registers to see if the result fits in 27 bits). The only issue is that if you insert trace events more seldom, you'll always get the extra TSC event as well, inserted automatically in front of the event you explicitly inserted. The tracer doesn't need to know. Linus --
I think we all agree that a supplementary copy is no wanted, but I think this question is orthogonal to having a write wrapper. The way we can do both is by using reserve/commit to deal with space reservation, and a write() to perform the actual data write into the buffers once space has been reserved. Reserve/commit would allocate a variable-sized "slot" into the buffer. We reserve X amount of bytes, and it returns the offset from the buffer start where the allocated slot is. This reserve/commit mechanism deals with synchronization (cli/spinlock or cmpxchg_local scheme...). We can then use this offset to see in which page(s) we have to write. This offset + len can in fact cross multiple page boundaries. Doing this elegantly could involve a page array that would represent the buffer data : struct page **buffer; And be given as parameter to the read() and write() methods, which would deal with page-crossing. e.g. /* * Perform an aligned write of the input data into the buffer. * * buffer : page pointer array * woffset : offset in the page pointer array where write starts from * data : input data * len : length of data to copy * * Note : if a NULL buffer is passed, no copy is performed, but the * alignment and offset calculation is done. Useful to calculate the * size to reserve. * * return : length written */ size_t write(struct page **buffer, size_t woffset, void *data, size_t len); Therefore, we could have code which writes in the buffers, without extra copy, and without using vmap, in multiple writes for a single event, which would deal with data alignment, e.g. : size_t woffset, evsize = 0; evsize += write(NULL, evsize, &var1, sizeof(var1)); evsize += write(NULL, evsize, &var2, sizeof(var2)); evsize += write(NULL, evsize, &var3, sizeof(var3)); woffset = reserve(..., evsize); woffset += write(buffer, woffset, &var1, sizeof(var1)); woffset += write(buffer, woffset, &var2, sizeof(var2)); woffset += write(buffer, woffset, &var3, ...
Mathieu, I'm starting to think that you are just too smart for your own good ;-) No it does not make sense. Well, it does not after looking at it for 10 seconds. Which brings up my point, the interface must be simple, and not cause people to spend minutes trying to figure out what the trace is doing. -- Steve --
Yes, we can do the sub-write, how about:
struct ringbuffer_write_state
ringbuffer_write_start(struct ringbuffer *buffer, unsigned long size);
int ringbuffer_write(struct ringbuffer_write_state *state,
const void *buf, unsigned long size);
void ringbuffer_write_finish(struct ringbuffer_write_state *state);
That way write_start() can do the reserve and set a local write
iterator. write() can then do whatever, either the direct copy of break
it up - will error on overflowing the reserved size. write_finish() will
clean up (sti, preempt_enable etc..)
--
Yup, that looks neat. I don't know if it's worth separating data
alignment concerns from this part of the infrastructure so it stays
simple. OTOH, embedding automatic alignment of data elements would be
easy to do here, e.g. :
struct ringbuffer_write_state
ringbuffer_write_start(struct ringbuffer *buffer, unsigned long size);
int ringbuffer_write(struct ringbuffer_write_state *state,
const void *buf,
unsigned long size,
unsigned long alignment);
#define ringbuffer_compute_size(size, alignment) \
ringbuffer_write(NULL, NULL, size, alignment)
void ringbuffer_write_finish(struct ringbuffer_write_state *state);
So ringbuffer_compute_size could be useds to compute the total slot size
needed to write the event before doing the ringbuffer_write_start(). It
would be good to keep ringbuffer_write() mostly as a static inline so
the compiler could statically compile in much of these operations.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
