Re: [RFC PATCH 1/3] Unified trace buffer

Previous thread: [PATCH 2/3] corruption check: run the corruption checks from a work queue by Arjan van de Ven on Tuesday, September 23, 2008 - 9:54 pm. (8 messages)

Next thread: [RFC PATCH 0/3] An Unified tracing buffer (attempt) by Steven Rostedt on Tuesday, September 23, 2008 - 10:10 pm. (1 message)
From: Steven Rostedt
Date: Tuesday, September 23, 2008 - 10:10 pm

RFC RFC RFC RFC RFC!!!!

Now did I get your attention that this is a request for comment patch.

This is probably very buggy. I ran it as a back end for ftrace but only
tested the irqsoff and ftrace tracers. The selftests are busted with it.

But this is an attempt to get a unified buffering system that was
talked about at the LPC meeting.

I did not get a chance to implement all the event recording and printing
in the debugfs/tracing/buffers directory. But I got enough to do
some ftrace work with it.

Now that it boots and runs (albeit, a bit buggy), I decided to post it.
This is some idea that I had to handle this.

I tried to make it as simple as possible. Basically we have:

buffer = ring_buffer_alloc(size, flags, max_event_size, print_func, name);

We can record either the fast way of reserving a part of the buffer:

event = ring_buffer_lock_reserve(buffer, event_id, length, &flags);
event->data = record_this_data;
ring_buffer_unlock_commit(buffer, event, flags);

Or if we already have the data together:

ring_buffer_write(buffer, event_id, length, data);

We can read by consuming or iterating:

event = ring_buffer_consume(buffer);

iter = ring_buffer_start(buffer, iter_flags);
event = ring_buffer_read(iter, &next_cpu);
ring_buffer_finish(iter);

Note, the iteration part stops recording to the buffer. This is a feature.
If you want producer/consumer, then you should be using the consumer.


Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 include/linux/ring_buffer.h |  138 +++
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/ring_buffer.c  | 1565 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1708 insertions(+)

Index: linux-compile.git/include/linux/ring_buffer.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-compile.git/include/linux/ring_buffer.h	2008-09-23 17:45:49.000000000 -0400
@@ -0,0 +1,138 ...
From: Peter Zijlstra
Date: Wednesday, September 24, 2008 - 8:03 am

Don't like that max_event_size param.
Don't like that print_func param.

This can, in generic, not work. Due to the simple fact that we might
straddle a page boundary. Therefore I think its best to limit our self

Don't like the event_id, just stick to plain binary data, and leave

By the above, this would have to be per-cpu as you cannot interpret the

Why?


---

So the interface I would like to see is:

struct ringbuffer *ringbuffer_alloc(unsigned long size);
void ringbuffer_free(struct ringbuffer *rb);

/*
 * disables preemption, cmpxchg reserves size on local cpu, memcpy 
 * chunks of @buf into the page buffers, enables preemption.
 */
int ringbuffer_write(struct ringbuffer *buffer, const void *buf, unsigned long size);

/*
 * consumes @size data from @cpu's buffer and copies it into @buf.
 * has internal synchronization for read pos, returns error when
 * overwritten - but resets state to next half-buffer.
 */
int ringbuffer_read(struct ringbuffer *buffer, int cpu, void *buf, unsigned long size);

This interface is enough to abstract the fact that our buffer
is non-linear and further assumes as little as possible.


For your IRQ/preempt tracers you can possibly add another operation:

/*
 * swaps the @cpu buffer of @src and @dst, returns error when dimensions
 * mis-match.
 */
int ringbuffer_xchg_cpu(struct ringbuffer *src, struct ringbuffer *dst, int cpu);

---

On top of that foundation build an eventbuffer, which knows about
encoding/decoding/printing events.

This too needs to be a flexible layer - as I suspect the google guys
will want their ultra-compressed events back.

I'm not quite sure yet how to model this layer most flexible without
being a nuisance.

So obviously its also this layer that has the whole debugfs interface,
but maybe we could even push that one layer up, so as to keep that

lockdep keys cannot be in dynamic storage, also mainline raw_spinlock_t


People (read SGI) prefer you dynamically allocate this array due to ...
From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 8:44 am

Actually, that is there so that a user reading the buffer could allocate 
a buffer and read all the events. It also allows for greater than 1 page 
size events to be recorded, which it currently does not do (the 

At the meeting we talked about having a way to do pretty print from the

That gives you a way to see what buffers are allocated in

The code does not allow straddling of pages. Actually this is not much
different than what ftrace does today. If the new entry straddles a page,
we mark the rest of the page as "not used" and start on the next page.
This is actually nicer than straddling pages, which is what logdev does,
and one of the things that makes logdev slow.

Otherwise, we always need to copy twice. Once into the tracer sturcture
and than again into the buffer.

Having implemented both concepts, I much prefer this one. It also allows 

This is up to debate. I know you don't like this extra event layer,
but seriously, all my uses with ring buffers has had some kind of event.
But then I'm sure you can argue that if you are using a single type you
can can the event.

I'm open to doing this, but I would like a consensus on this.



Because if we do not, then it is a lot more work to keep the iterator 
knowing about overruns, and becomes a pain in the ass. I ended up in
ftrace trying hard to disable tracing when we are reading the trace. If we 
do not, then the output always becomes corrupted. That is, the end half 
of the trace does not match the beginning.

For static reads (non consuming), in my pass experience, has become the 
preferred method. Note that the consuming read does not have this 
limitation, because a consuming read is usually done with in flight 
tracing. A static read is usually done (for best results) from running a 

Again this should be up to debate. I'm not that attached to one way or 

Again this forces a double copy when one should be enough. This is what

So you want to push the merge sort into the tracer. This also means that ...
From: Ingo Molnar
Date: Thursday, September 25, 2008 - 3:38 am

i'd prefer Peter's simplification and not pass event_id along. Since 
static types are lost anyway (which is the biggest cost and risk of any 
such abstraction), we have to convert between types early on. Whether 
event_id is visible in the API is no big difference.

(It might be cheaper to not pass it along even if everyone ends up using 
it - as it has no semantic meaning anyway.)

pretty much the only generic tracing information is time and payload 
size. ( but even a time key is debatable - there are various resolutions 
needed by different usecases. Some usecases are even fine without any 
timestamps at all - they just want to know the ordering of events and 
that's it.)

i'd like to see some automatic type protection though, as an 
off-by-default debug option: encode the record type on storing and 
double-check it on extraction. So it should be possible to reliably 
store/restore a typed trace buffer and notice corruption early in 
testing.

because there's one thing that is far more important tracer feature than 
sheer performance: robustness.

Automated type checking in debug mode would also mean we could go for 
RLE encoding much more agressively. Most of the risks of a more complex, 
more compressed and pointer-laden data format come from type mismatches 
and the loss of compiler protection against human errors/stupidity. 
(running off the end of the page, misinterpreting a pointer, a record, 
etc.)

	Ingo
--

From: Martin Bligh
Date: Wednesday, September 24, 2008 - 8:47 am

I'm not sure why this is any harder to deal with in write, than it is
in reserve? We should be able to make reserve handle this just
as well?

If you use write rather than reserve, you have to copy all the data

That would be nice. However, we need to keep at least the length
and timestamp fields common so we can do parsing and the mergesort?

+struct ring_buffer_event {
+       unsigned long long counter;
+       short type;
+       short length;
+       char body[];
+} __attribute__((__packed__))


Is useful when gathering GB of data across 10,000 machines ;-)
Also reduces general overhead for everyone to keep events small.
--

From: Peter Zijlstra
Date: Wednesday, September 24, 2008 - 9:11 am

No, imagine the mentioned case where we're straddling a page boundary.

A----|   |----B
    ^------|

So when we reserve we get a pointer into page A, but our reserve length
will run over into page B. A write() method will know how to check for
this and break up the memcpy to copy up-to the end of A and continue
into B.

You cannot expect the reserve/commit interface users to do this
correctly - it would also require one to expose too much internals,


And here I was thinking you guys bit encoded the event id into the

All of it would, basically I have no notion of an event in the
ringbuffer API. You write $something and your read routine would need to
be smart enough to figure it out.

The trivial case is a fixed size entry, in which case you always know
how much to read. A slightly more involved but still easy to understand
example might be a 7bit encoding and using the 8th bit for continuation.

Another option is to start out with a fixed sized header that contains a
length field.

But the raw ringbuffer layer, the one concerned with fiddling the pages

Exactly - which is why a flexible encoding layer makes sense to me -
aside from the abstraction itself.

--

From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 9:26 am

This would not happen. The ring buffer reserve routine will take care of 
it. If the requested length will straddle a page, I add a "nop" event
into the end of A, and give the user a pointer starting at B.

 A----|  |-----B
     ^^  ^

I'll give you the ftrace example. In ftrace we record the task pid,
preempt count, and interrupt flags.  The reserve commit way is this.

event = ring_buffer_reserve(buffer, sizeof(*event));
event->pid = current->pid;
event->pc = preempt_count();
event->flags = flags;
ring_buffer_commit(buffer, event);

Done! One copy of all this data directly into the buffer. But if we use
the write method that you propose:

struct event event;

event.pid = current->pid;
event.pc = preempt_count();
event.flags = flags;
ring_buffer_write(buffer, &event, sizeof(event));

One copy into event has been done, but we are not done yet. Inside the
write we need to do a...

memcopy(buffer->buf, data, size);


Logdev has an internal buffer that does exactly what you are proposing.
I had to give it some minor smarts to improve preformance, and stability.
If you don't even have the length then it is over when you wrap. The
There's no way to know where the next record starts. It gets ugly with
callbacks.

Giving the ring buffer at least the length as a minimum, is what is 
needed. And as I have that nop id as well, I would need to put that
either into a type field or counter (I currently use both for these 



As I have learned from both ftrace and logdev, giving the ring buffer a
little intelligence, has paid off a lot. I will say, I have not hit that
minimum with my patches. But I will argue that you are going beneath that

But being too minimal and flexible, will actually be counter productive. 
That is, you will not be able to enhance it due to the lack of abilities.

That is, the top layers will now be limited.

-- Steve

--

From: Linus Torvalds
Date: Wednesday, September 24, 2008 - 9:24 am

I would suggest just not allowing page straddling.

Yeah, it would limit event size to less than a page, but seriously, do 
people really want more than that? If you have huge events, I suspect it 
would be a hell of a lot better to support some kind of indirection 
scheme than to force the ring buffer to handle insane cases.

Most people will want the events to be as _small_ as humanly possible. The 
normal event size should hopefully be in the 8-16 bytes, and I think the 
RFC patch is already broken because it allocates that insane 64-bit event 
counter for things. Who the hell wants a 64-bit event counter that much? 
That's broken.

		Linus
--

From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 9:37 am

The event counter is just the timestamp (quick patch, simple to fix). The 
term "counter" was bad. It should have been timestamp, which one would 
want a 64bit timestamp. Or at least a way to figure it out. Yes, we can 
store a special event called "timestamp" and have a smaller counter. But 
for simplicity, the 64 bit was easy. The event id was just 16 bits, which 
I think is way more than enough.

The current method has a 16 bit length as well, and prevents crossing of
page boundaries.

Other than that, I would love to have you review more of this patch.

Note, I plan on hacking the "max_event_size", and just have that be the
standard "PAGE_SIZE". If you need a preallocated storage to store events, 
one could just use PAGE_SIZE and fit any event they want into it.

Thanks,

-- Steve

--

From: Martin Bligh
Date: Wednesday, September 24, 2008 - 9:56 am

Yup, is just a confusing name. we can definitely make this a smaller field
by doing an offset time from the last event, but we agreed on 64 bits to
keep version 1 simple ;-)

I think in retrospect the timestamp events we used with wall time stuck
in them were a mistake, as NTP will make them difficult. We should have
just recorded wall time at the start of the buffer, and done offsets from
there.

Without relayfs subbuffers, the offset thing gets trickier, as you'd have
to update the "start time" constantly once you'd filled the buffer and
were shifting the start pointer. OTOH, I guess it's only 1 cacheline.

M.
--

From: Linus Torvalds
Date: Wednesday, September 24, 2008 - 10:25 am

One definitely would _not_ want the full 64-bit timestamp.

There are two cases:

 - lots and lots of events

   just do a 32 bit "timestamp delta" to the previous packet (where the 
   first packet in the queue would be a delta from a value in the queue 
   header - and we can obviously make that value be the TSC so that the 
   first delta is always zero, but it may also make sense to make it a 
   real delta, and the queue header would contain some good 
   synchronization point value).

 - occasional events

   Oops. The delta wouldn't fit. So create a new "timestamp update" packet 
   with a 64-bit thing when doing the reservation. There's obviously no 
   cost issue (since this would only happen for things where there was a 
   multi-second delay - or at least an appreciable fraction of a delay - 
   between events)

This definitely is worth doing. If we have small trace objects (and many 
things really do have pretty small traces), using just a 32-bit TSC not 
only saves 4 bytes per trace event, but it makes it quite reasonable to 
keep the trace data 4-byte-aligned rather than requiring 8-byte alignment.

Of course, if the traces end up being horribly bloated, none of that will 
matter. But I really would hope that you we keep the header itself to just 
8 bytes (and being 2 4-byte entities), so that small payloads are 
reasonable. And that looks doable, if you have a 16-bit "type" and a 
16-bit "size" field.

One thing I'd like to do is to also architecturally reserve a few of the 
types for internal queue management stuff. Things like "padding" objects 
(or a "end-of-ringbuffer" object), and the TSC overflow object, and a 
"time sync" object (or heartbeat). So maybe the type would have the high 
bit set as a "reserved for internal ringbuffer use" or whatever.

			Linus
--

From: Mathieu Desnoyers
Date: Wednesday, September 24, 2008 - 11:01 am

Hi Linus,

I agree that the standard "high event rate" use-case, when events are as
small as possible, would fit perfectly in 4kB sub-subbfers. However,
I see a few use-cases where having the ability to write across page
boundaries would be useful. Those will likely be low event-rate
situations where it is useful to take a bigger snapshot of a problematic
condition, but still to have it synchronized with the rest of the trace
data. e.g. :

- Writing a whole video frame into the trace upon video card glitch.
- Writing a jumbo frame (up to 9000 bytes) into the buffer when a
  network card error is detected or when some iptables rules (LOG, TRACE
  ?) are reached.
- Dumping a kernel stack (potentially 8KB) in a single event when a
  kernel OOPS is reached.
- Dumping a userspace process stack into the trace upon SIGILL, SIGSEGV
  and friends.

That's only what I come up with from the top of my head, and I am sure
we'll find very ingenious users who will find plenty of other use-cases
where 4kB events won't be enough.

(It reminds me of someone saying "640K ought to be enough for anybody.")
;-)

If the write abstraction supports page straddling, I think it would be a
huge gain in simplicity for such users because they would not have to
break their payload in various events and have to create another event
layer on top of all that which would identify events uniquely with a
"cookie" or to protect writing events into the buffers with another
layer of locking.

Besides, there are other memory backends where the buffers can be put
that do not depend on the page size, namely video card memory. It can be
very useful to collect data that survives reboots. Given that this
memory will likely consist of contiguous pages, I see no need to limit
the maximum event size to a page on such support. Therefore, I think the
support for page-crossing should be placed in the "write" abstraction
(which would be specific to the type of memory used to back the
buffers) rather that the ...
From: Linus Torvalds
Date: Wednesday, September 24, 2008 - 1:49 pm

But these are _all_ things that would be much better off with a "allocate 
a separate buffer, and just add a pointer to the trace".

Why? If for no other reason than the fact that we don't even want to spend 
lots of time to (atomically) have to copy the big data into the trace 
buffer!

Just allocate the buffer and fill it in (maybe it's pre-allocated already, 
like when a network packet event happens!) and do all of that 
independently of the low-level trace code. And then add the trace with the 
pointer.

We want the low-level trace code to be useful for things like interrupt 
events etc, which makes it a _disaster_ to try to add huge buffers 
directly to the ring buffer. You also don't want to allocate a 
multi-megabyte ring buffer for some odd case that happens rarely, when you 
can allocate the big memory users dynamically.

So limiting a trace entry to 4kB does not mean that you can't add more 
than 4kB to the trace - it just means that you need to have a "data 
indirection" trace type. Nothing more, nothing less.

[ And btw - you'd need that *anyway* for other reasons. You also don't 
  want to have any length fields have to be 32-bit lengths etc - the 
  length field of the trace buffer entry should be something really small 
  like 8 or 16 bits, or even be implicit in the type for some basic event 
  types, so that a trace event doesn't necessarily waste any bits at ALL 
  on the length field ]

		Linus
--

From: Martin Bligh
Date: Wednesday, September 24, 2008 - 9:49 am

Can't the reserve interface just put a padding event into page A,

Depends how you count ;-) One more time than you would have to

+/* header plus 32-bits of event data */
+struct ktrace_entry {
+       u32 event_type:5, tsc_shifted:27;
+       u32 data;
+};


If you don't have timestamps, you need domain-specific context to merge
the per-cpu buffers back together. As long as these are common format

That's what we discussed at KS/plumbers, and seems like the simplest

When you loop around the ringbuffer, you need to shift the starting "read"
pointer up to the next event, don't you? How do you do that to start on

I like the abstraction, yes ;-) Just not convinced how much we can put in it.
--

From: Linus Torvalds
Date: Wednesday, September 24, 2008 - 10:36 am

Yes, I think having a "padding" entry that just gets skipped on read would 

Why "tsc_shifted"?

I think 27 bits is probably fine, but not by removing precision. Instead 
of shifting it so it will fit (and dropping low bits as uninteresting), do 
it by encoding it as a delta against the previous thing. 27 bits would 
still be sufficient for any high-performance thing that has tons and tons 
of packets, and if you only have a few trace events you can afford to have 
the "TSC overflow" event type (and if you want it that dense, you could 
just make 'data' be the high bits, for a total of 59 bits rather than 64 
bits of TSC.

59 bits of cycle counters is perfectly fine unless you are talking trace 
events over a year or so (I didn't do the math, but let's assume a 4GHz 
TSC as a reasonable thing even going forward - even _if_ CPU's get faster 
than that, the TSC is unlikely to tick faster since it's just not worth it 
from a power standpoint).

Ok, I did the math. 1<<27 seconds (assuming the low 32 bits are just 
fractions) is something like 4+ years. I _really_ don't think we need more 
than that (or even close to that) in TSC timestamps for tracing within one 
single buffer.

Once you go to the next ring buffer, you'd get a new time-base anyway.

		Linus
--

From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 10:49 am

Right now I have a list of pages that make up the ring buffer. Are you 
saying that the first entry in the page should be a timestamp?

Anyway, after talking with Peter Zijlstra, I'm working on RFC-v2, which
splits up the ring buffer a bit more. I'm removing all the debugfs crud, 
and I even will remove the merge sort from the ring buffer.

I will now have a ring_buffer API, which will do basic recording. It will 
have two modes when allocated. Fixed sized entry mode where you can just 
put whatever you want in (I'm still aligning everything by 8 bytes, just 
since memory is cheap). Or you can have variable length mode that will 
make the following event header:

struct {
	unsigned char length;
	unsigned char buff[];
};

The length will be shifted 3 since we are 8 byte aligned anyway, making 
the largest entry 2046 bytes (2045 bytes of data since 1 byte is already 
taken for the length field). If the next entry is not large enough to fit
on the page, I will enter a zero length and that will tell the tracer that 
the entry is padding to the end of the page.

For fixed sized entries, a simple calculation of whether an entry can fit 
on a page will determine if there is an entry or padding.

Then I will add a trace_buffer API that will add the counting and merge 
sort on top of this interface. If you don't care about the 
trace_buffering, one could simply use the ring_buffering and be done with 
it.

Note, I am still keeping the reserve and commit interface for now.

-- Steve

--

From: Linus Torvalds
Date: Wednesday, September 24, 2008 - 1:23 pm

I think the most straightforward model would be that the "head" of the 
ring buffer (regardless of size in pages) would have that timestamp. 
Making them per-page is an option, of course, I have no strong opinions 
either way. The per-page one could have advantages (ie it would give a 
nice upper limit for just how many entries you have to walk in order to 
convert an entry into a full timestamp), but I certainly don't think 
that's a big decision, more of a detail.

But if we start out with having the full TSC in each entry, that's easily 
going to be painful to fix later. If we start out with a delta system, 
changing the details of where the base is gotten is likely to be exactly 
that - just a detail.

So I'd like the thing to have small headers, and be designed from the 

So the only reason I'm not thrilled with this is that I really think that 
timestamping should be inherent, and at the lowest level.

Without timestamping, what's the real point? EVERYBODY eventually wants a 
timestamp. We added it even to the kernel printk()'s. People want them for 
network packets to user space. X wants it for all its events. It's one of 
those things that people never do from the beginning, but that everybody 
eventually wants anyway.

So I certainly don't mind layering, but I *do* mind it if it then means 
that some people will use a broken model and not have timestamps. So I 
think the timestamping code should just be there - without it, a trace 
buffer is pointless.

		Linus
--

From: David Miller
Date: Wednesday, September 24, 2008 - 1:37 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

Small headers are good, but I'd suggest making sure there is an
"offset" or similar field in there.

Not that I want to encourage frequent changes to header layout, but if
you do need to add something, then this offset field allows you to do
so while keeping existing analysis tools working.  They will just
ignore the new information in the headers, but they will still be able
to get at the data bits using the offset.
--

From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 1:48 pm

Right now I have a "length" field when fixed_length is not specified. This 
length is currently offest by 8 leaving all 8 bytes aligned.

We could change this to be variable? Would that help you?

-- Steve

--

From: Martin Bligh
Date: Wednesday, September 24, 2008 - 1:51 pm

One thing we said we could do is compile the "decompaction" tools
along with the kernel, in the kernel tree. Then if we change the in-kernel
format, you don't break all the userspace tools. We used:

struct ktrace_time {
        u32 seconds;
        u32 nanoseconds;
};

struct ktrace_event {
        struct ktrace_time time;
        u16 cpu;
        u16 type;
        u32 pid;
        u32 data;
        u32 elapsed_time;               /* ns for interrupt, otherwise us */
        u32 expanded_length;
}

The format is much easier to parse for userspace tools, though much
less compact. A simple C tool can turn in-kernel format into userspace
format:

1. Merge the per-cpu buffers into a single stream
2. put cpu ids in
3. Work out which pid was running, from the last context switch
4. Put in elapsed times (keeping track of the start of a system call
and recording the delta at end of system call (or interrupt, etc))
--

From: Frank Ch. Eigler
Date: Wednesday, September 24, 2008 - 2:24 pm

Hi -


If the common tracing idea still includes kernel-supplied ASCII as an
alternative to binary (so that one could grep some debugfs file), then
it would be nice to have some common code for decoding the trace
records offline.

If we can associate a simple specific struct type ("struct
ftrace_event") with each trace id type in an object-code-resident
table, we could automate some of this.  The kernel build process could
sniff dwarf data (a la acme's struct-layout-printing dwarves) at or
before modpost time to generate a snippet of C code for each such
event struct.  That C code could be the generic ASCII-printing
callback for use by debugfs.  A variant could be a userspace-usable
version.  Given the same id-to-struct-name mapping, Sytemtap could
expose event records field by field, by name.


- FChE
--

From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 2:33 pm

Hi Frank,

I think we are going a step below this. That is, the ring buffer itself 
will not be expecting to expose anything to the user interface.

That will need to be done in a higher layer. Right now we just want a way 
to stabilize the ring buffer infrastructure. Then we can add a tracing 
infrastructure on top that can do the above work.

I'm working on having a ring_buffer.c that will do the bare minimum, and a 
trace_buffer.c that will be a layer on top that will add more 
functionality. What you are asking for may apply there.

Thanks,

-- Steve

--

From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 1:47 pm

OK, then how about this?

Each page will start with a time stamp (I'm still aligning everything by 8 
bytes, just because it simplifies things). Then we can have a 3 byte
(24 bit) counter offset? Then we can have a header that looks like:

struct {
	unsigned char time[3];
	unsigned char length;
	unsigned char buff[];
};

This still allows me to have the 2048 byte size buffer.

Or is 24 bits for time too small? The offest will be from the previous
entry, and not the beginning of the page.

If one defines a fixed size entry, we could just use the full 32 bits for 
the timestamp, since the length will be ignored in that case, and will 
become part of the buffer.

Hence,

struct {
	unsigned int time;
	unsigned char length;
	unsigend char buff[];


OK, the bottom layer will have some kind of timestamps. Now we only need 
to agree on what the header will look like.

Thanks,

-- Steve

--

From: Martin Bligh
Date: Wednesday, September 24, 2008 - 2:03 pm

How about we just steal 5 bits from the timestamp to indicate event
lengths up to 32 bytes, and if it's 0, that means there's a length
field following? Also that'd mean you could use a longer length field
and get beyond 256 bytes to 4096, without impacting most events.

struct {
        u32 length:5, time_delta:27;
        u16 length;
        u8 buf[];
};

struct {
        u32 length:5, time_delta:27;        /* where length == 0 */
        u8 buf[];
};

Obviously we could less than 5 bits, even just 1 for a flag ...
--

From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 2:17 pm

OK then. Since I like the idea of aligning the buffer to 8 bytes, we can 
always shift the length field by 3. So...

For records 256 bytes or less, we only have:

struct {
	u32 length:5, time_delta: 27;
	u8 buf[];
};

For 257 bytes or more we have:

struct {
	u32 length:5 (=0), time_delta: 27;
	u16 large_length;
	u8 buf[];
};

This is what you want?

-- Steve

--

From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 2:51 pm

BTW, if we declare the ring buffer to have fixed length entries, do we 
still want to record the length?

Hmm, probably should. It would make the code easier. If it is fixed length 
under 256 bytes, we still need to record the timestamp, and the length is 
only 5 bits. It could probably store something else, but we can leave that 
for version two ;-)

Heck, for now, I'll remove the work I did to add the fix length option, 
and just use this default. If you have fixed length entries, it will
just make it easier to code the above layers.

-- Steve

--

From: Peter Zijlstra
Date: Thursday, September 25, 2008 - 3:41 am

I rather like this idea, as it gives small entries (the common case) the
least overhead but does allow for larger ones.

By also putting the time in there you can do the merge sort iterator,
Linus was right that everybody wants this anyway.

As for delta encoding the time, we could make the tick log the absolute
time packet, that's at least 100Hz and it already has to compute the
full gtod thing anyway.

I don't much like Linus' idea of bringing type information back into the
primitive header (sorry Linus ;-)). I'd much rather keep that
abstraction in the next layer.

--

From: Martin Bligh
Date: Thursday, September 25, 2008 - 7:33 am

There is part of the type stuff that belongs in the lower layer, it seems -
the padding events for the up-to-end-of-page buffering, and the timestamp
extensions. It seems wrong to split those across two layers.

But perhaps we can keep a couple of bits for this, and three of the bits
to represent the length of the data payload (maybe in 4 byte multiples
rather than bytes?) That'd let up to 28 bytes as a payload in a short event.
--

From: Peter Zijlstra
Date: Thursday, September 25, 2008 - 7:53 am

Hmm, you've got a point there, then it would be 3 package types:
 
 - regular
 - full time
 - nop


Right - if you use raw tsc you're dependent on clock speed, if we'd
normalize that on ns instead you'd need at least:

l(10000000)/l(2)
23.25349666421153643532

bits to handle HZ=100, leaving us with 32-2-24 = 6 bits for size.

Sounds doable (unless I mis-counted on the 0's).

Also, I agree on the 4byte alignment, rather than the 8byte Steve seems
to favour.

--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 8:05 am

Please don't normalize to ns.

It's really quite hard, and it's rather _expensive_ on many CPU's. It 
involves a non-constant 64-bit divide, after all. I bet it can be 
optimized to be a multiply-by-inverse instead, but it would be a 128-bit 
(or maybe just 96-bit?) multiply, and the code would be nasty, and likely 
rather more expensive than the TSC reading itself.

Sure, you have to normalize at _some_ point, and normalizing early might 
make some things simpler, but the main thing that would become easier is 
people messing about in the raw log buffer on their own directly, which 
would hopefully be something that we'd discourage _anyway_ (ie we should 
try to use helper functions for people to do things like "get the next 
event data", not only because the headers are going to be odd due to 
trying to pack things together, but because maybe we can more easily 
extend on them later that way when nobody accesses the headers by hand).

And I don't think normalizing later is in any way more fundamentally hard. 
It just means that you do part of the expensive things after you have 
gathered the trace, rather than during.

			Linus
--

From: Martin Bligh
Date: Thursday, September 25, 2008 - 8:25 am

Agree with you on doing the expensive stuff later. If we wanted to get
something that'd pack down to a couple fewer bits, and approximate ns,
we could always >> 1 if you were > 2GHz, and >> 2 if you where > 4GHz,
etc. which is at least cheap.

But do we really need more than 3 bits for size anyway? 28 bytes
would fit most events such as system calls, interrupts, page faults,
all the common stuff.. Longer than that starts to be expensive
in memory consumption anyway, and if you're logging 32 bytes
or more ... 2 extra bytes for a length field is a small overhead to pay.

M.
--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 8:36 am

... which is exactly what sched_clock() does, combined with a 
multiplication. (which is about as expensive as normal linear 
arithmetics on most CPUs - i.e. in the 1 cycle range)

Normalizing has the advantage that we dont have to worry about it ever 
again. Not about a changing scale due to cpufreq, slowing down or 
speeding up TSCs due to C2/C3. We have so much TSC breakage all across 
the spectrum that post-processing it is a nightmare in practice. Plus we 
want sched_clock() to be fast anyway.

in the distant future we not only will have constant-TSC but it wont 
stop in C2/C3 either at a whim (which they do right now, messing up 
timestamps). At that stage fast time readout it will be so sane that CPU 
makers should really provide a nanosec readout - it's easy to do a 
simple multiplicator and hide the few cycles multiplicator latency to 
RDTSC (this is continuous time after all so it's easy for the hw).

Hm?

	Ingo
--

From: Mathieu Desnoyers
Date: Thursday, September 25, 2008 - 9:23 am

Hi Ingo,

The problem with sched_clock is that it gives a 1 HZ timestamp accuracy
for events happening across different CPUs. Within this 1 HZ range, it
uses the TSC and clip when it reaches a max. Good enough for scheduler
or for tracing events on a single CPU, but I think it is not exactly
what we need to reorder events happening across CPUs.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 9:32 am

Hmm,

sched_clock gives ns accuracy unless the tsc is disabled. And in that 
case, we don't have any CPU clock :-/


-- Steve

--

From: Mathieu Desnoyers
Date: Thursday, September 25, 2008 - 10:20 am

Even on architectures with non-synchronized TSCs ?


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 10:32 am

Yep, even on those ;-)

-- Steve
--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 9:40 am

First off, that's simply not true.

Yes, it happens to be true on modern x86-64 CPU's. But in very few other 
places. Doing even just 64-bit multiples is _expensive_. It's not even 
_near_ single-cycle.


Total and utter bullshit, all of it.

Have you forgotten all the oopses due to divide-by-zero because 
sched_clock() was called early? All that early code that we might well 
want to trace through?

Not only that, but have you forgotten about FTRACE and -pg? Which means 
that every single C function calls into tracing code, and that can 
basically only be disabled on a per-file basis? 

As for C2/C3 - that's just an argument for *not* doing anything at trace 
time. What do you think happens when you try to trace through those 
things? You're much better off trying to sort out the problems later, when 
you don't hold critical locks and are possibly deep down in some buggy 
ACPI code, and you're trying to trace it exactly _because_ it is buggy.

The thing is, the trace timestamp generation should be at least capable of 
being just a couple of versions of assembly language. If you cannot write 
it in asm, you lose. You cannot (and MUST NOT) use things like a 
virtualized TSC by mistake. If the CPU doesn't natively support 'rdtsc' in 
hardware on x86, for example, you have to have another function altogether 
for the trace timestamp.

And no way in hell do we want to call complex indirection chains that take 
us all over the map and have fragile dependencies that we have already hit 
several times wrt things like cpufreq.

WE ARE MUCH BETTER OFF WITH EVEN _INCORRECT_ TIME THAN WE ARE WITH FRAGILE 

Yeah. And we want system calls to be _really_ fast, because they are even 
more critical than the scheduler. So maybe we can use a "gettime()" system 
call.

IOW, your argument is a non-argument. No way in HELL do we want to mix up 
sched_clock() in tracing. Quite the reverse. We want to have the ability 
to trace _into_ sched_clock() and never even have to think about ...
From: Steven Rostedt
Date: Thursday, September 25, 2008 - 9:53 am

Slight correction. You can annotate the function with "notrace" and 
that function will not be traced. So the "only be disabled on a per-file

Currently my code calls "ring_buffer_time_stamp" to get the time stamp, 
whatever it will be.  Currently it is using sched_clock, but since I have 
it as a wrapper, it shouldn't be too hard to modify later.

I'll also add a "ring_buffer_time_stamp_normalize(ts)" function to be 
called on reading of the trace. This will normalize whatever time stamp 
that we use back to ns for the users. For now it will just return "ts" 
since sched_clock is already normalize.

IOW, for my current work, I don't care what timing we use. Others do, and 
I will try to make the tracing infrastructure let it be easy to change 
what is used. For now, I'm concentrating on the infrastructure itself.

-- Steve

--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 10:07 am

Ok. It's still true that we absolutely don't want to add random notrace 
markers to code just because it's shared with the scheduler. And 
"sched_clock()" is not a single function with just a few well-defined 
places, nor are all versions of it at all appropriate for tracing (the 
non-TSC ones are a total joke - it works for scheduling, but not tracing. 

Yes. The code looked fine, and had a FIXME. I have no objection to using 
it as a known buggy approximation for TSC in order to not force every 
architecture to immediately write one when the patch is discussed. But I 
literally would expect that on x86, we'd basically just have a function 
that does "rdtsc" for the common case, along with possibly a generic 
fallback that does "xadd" in the absense of any other reasonable 
alternative.

				Linus
--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 12:55 pm

firstly, for the sake of full disclosure, the very first versions of the 
latency tracer (which, through hundreds of revisions, morphed into 
ftrace), used raw TSC timestamps.

I stuck to that simple design for a _long_ time because i shared your 
exact views about robustness and simplicity. But it was pure utter 
nightmare to get the timings right after the fact, and i got a _lot_ of 
complaints about the quality of timings, and i could never _trust_ the 
timings myself for certain types of analysis.

So i eventually went to the scheduler clock and never looked back.

So i've been there, i've done that. In fact i briefly tried to use the 
_GTOD_ clock for tracing - that was utter nightmare as well, because the 
scale and breath of the GTOD code is staggering.

cpu_clock() proved to be a good middle ground. I'm not a believer in any 
of this stuff, i just react to how things behave in practice, as none of 
this is really easy to get right from the theoretical angle.

... so we can certainly go back to the TSC again, i'll be the last one 
to complain about extra tracing performance and it will certainly make 
it less fragile. Maybe i was wrong in declaring that a dead end and it 
will work out fine. [ Worst-case we'll go back to the sched_clock() 
again, that will always be an easy option. ]

... and regarding sched.o's notrace marking, we have had functional -pg 
tracing of sched.o for ages, and we could enable it right now. We can 
trace inside the runqueue lock just fine. Note: we indeed have commit 
c349e0a0 that proves that this stuff can lock up.

But most notrace markings are not for robustness reasons at all. They 
have two purposes:

1) correctness. sched_clock() itself should be recursion free. (Even
   _that_ has a recursion check btw., so normally it shouldnt lock up - 
   it just wont produce very nice traces.)

2) we use notrace and -no-pg to filter out very high-frequency and
   mostly uninteresting debug function calls. lockdep.o is an example, 
   ...
From: Ingo Molnar
Date: Thursday, September 25, 2008 - 1:12 pm

heh, and i even have a link for a latency tracing patch for 2005 that is 
still alive that proves it:

   http://people.redhat.com/mingo/latency-tracing-patches/patches/latency-tracing.patch

(dont look at the quality of that code too much)

It has this line for timestamp generation:

+       timestamp = get_cycles();

i.e. we used the raw TSC, we used RDTSC straight away, and we used that 
for _years_, literally.

So i can tell you my direct experience with it: i had far more problems 
with the tracer due to inexact timings and traces that i could not 
depend on, than i had problems with sched_clock() locking up or 
crashing.

Far more people complained about the accuracy of timings than about 
performance or about the ability (or inability) to stream gigs of 
tracing data to user-space.

It was a very striking difference:

  - every second person who used the tracer observed that the timings 
    looked odd at places.

  - only every 6 months has someone asked whether he could save 
    gigabytes of trace data.

For years i maintained a tracer with TSC timestamps, and for years i 
maintained another tracer that used sched_clock(). Exact timings are a 
feature most people are willing to spend extra cycles on.

You seem to dismiss that angle by calling my arguments bullshit, but i 
dont know on what basis you dismiss it. Sure, a feature and extra 
complexity _always_ has a robustness cost. If your argument is that we 
should move cpu_clock() to assembly to make it more dependable - i'm all 
for it.

	Ingo
--

From: Mathieu Desnoyers
Date: Thursday, September 25, 2008 - 1:29 pm

Hi Ingo,

I completely agree with both Linus and you that accuracy utterly
matters. I currently provide a time source meant to meant the tracing
requirements and support architectures lacking synchronized TSC (or tsc
at all) in my lttng tree. Feel free to have a look. I've had statisfied
users relying on these time sources for about 3 years.


See the lttng-timestamp-* commits in
git://git.kernel.org/pub/scm/linux/kernel/git/compudj/linux-2.6-lttng.git

The one in question here (x86) is here. You'll see that everything fits
in a small header and can thus be inlined in the callers.

http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=include...

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 1:24 pm

Umm. cpu_clock() isn't even cross-cpu synchronized, and has actually 
thrown away all the information that can make it so, afaik. At least the 
comments say "never more than 2 jiffies difference"). You do realize that 
if you want to order events across CPU's, we're not talking about 
"jiffies" here, we're talking about 50-100 CPU _cycles_.

You also ignore the early trace issues, and have apparently not used it 
for FTRACE. You also ignore the fact that without TSC, it goes into the 
same "crap mode" that is appropriate for the scheduler, but totally 
useless for tracing.

IOW, you say that I call your arguments BS without telling you why, but 
that's just because you apparently cut out all the things I _did_ tell you 
why about!

The fact is, people who do tracing will want better clocks - and have 
gotten with other infrastructure - than you have apparently cared about. 
You've worried about scheduler tracing, and you seem to want to just have 
everybody use a simple but known-bad approach that was good enough for 
you.

			Linus
--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 1:29 pm

Oh, and I didn't notice (because Steven pointed out "notrace" and I didn't 
see any of them), that in order to get things to work you had just added

CFLAGS_REMOVE_lockdep.o = -pg
CFLAGS_REMOVE_lockdep_proc.o = -pg
CFLAGS_REMOVE_mutex-debug.o = -pg
CFLAGS_REMOVE_rtmutex-debug.o = -pg
CFLAGS_REMOVE_cgroup-debug.o = -pg
CFLAGS_REMOVE_sched_clock.o = -pg
CFLAGS_REMOVE_sched.o = -mno-spe -pg

all ovr the place, which was part of my argument against this crap in the 
first place.

Yes, by using all that common infrastructure, you can share some code, but 
you will always hit that case that now you have serious issues with 
actually marking it. Now the tracer has to have recursion detection if you 
ever want to trace any function that might be used for the clock - and 
quite frankly, especially with virtualization, it's not AT ALL obvious 
what those are all the time..

That is exactly one of the examples I gave for _not_ doing this. Go back 
and read my previous emails. Rather than talking about how I call your 
arguments BS without saying why.

			Linus
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 1:47 pm

The above I added because they just made the function tracer output
quite ugly and bloated.  The lockdep got messing when it was debugging
the locks used in the tracer that was tracing lockdep. The above had
nothing to do with what kind of clock we used.  We added recursion
protection, but it was still producing ugly output.

--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 2:01 pm

You'll also find in the lib Makefile:

ifdef CONFIG_FTRACE
ORIG_CFLAGS := $(KBUILD_CFLAGS)
KBUILD_CFLAGS = $(subst -pg,,$(ORIG_CFLAGS))
endif

Which removes all -pg flags from all in the lib directory. The reason is 
that a lot of archs (well, I know PPC for sure) use these functions on 
early boot up, where simply calling mcount will produce a page fault.

-- Steve

--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 2:10 pm

Really, we traced all these files for ages. I can restore it if it's 
worthwile - but lockdep totally kills the usability of function traces, 
it inserts thousands of uninteresting events over and over again.

Note commit c349e0a0, there i added -no-pg for robustness reasons (we 
locked up) - back then the scheduler clock was within sched.c. Now it is 
all separated out cleanly in kernel/sched_clock.o and i think we can add 

yes, that's true. And that's why we absolutely want to have recursion 
detection anyway - even given ftrace's _totally_ conservative design 
it's _very_ easy to accidentally recurse somewhere.

ftrace has the same robustness design as lockdep has: as little external 
infrastructure dependencies as possible. And lockdep has recursion 
checks too, and excessive amounts of paranoia all around the place.

Ftrace has the same robustness philosophy too, and yes, despite that we 
judged cpu_clock() to be worth the risk, because accurate and fast 
timestamps are a feature and we didnt want to duplicate.

If then i'd rather move towards simplifying sched_clock() even more - 
that's important for the scheduler too. The scheduler clock has _very_ 

we lived with these kinds of complications for years literally, and i'm 
not that stupid to not simplify a debugging framework when i can do it - 
and i'd not mind risk reduction. Paravirt is hugely misdesigned piece of 
PITA, for basically every piece of infrastructure that we have: locking, 
GTOD, lockdep, and yes, the tracer too.

So ... if you could suggest a method of how we could get good timestamps 
in a post-processed method with TSC timestamps, that would be great. 
That would solve all this discussion and i'm not going to argue against 
saving raw TSC timestamps, as they _are_ simpler and more robust.

	Ingo
--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 2:16 pm

and note that there's another pragmatic argument: often we notice 
cpu_clock() bugs by looking at traces. I.e. people fixing trace 
timestamps _fix the scheduler_. Sometimes it is very hard to notice 
scheduling artifacts that happen due to small inaccuracies in 
cpu_clock().

so there's continuous coupling between precise scheduling and good trace 
timestamps. I'd be willing to pay a lot more for that than the few 
(rather obvious...) robustness problems we had with sched_clock() in the 
past.

anyway ... i'm not _that_ attached to the idea, we can certainly go back 
to the original ftrace method of saving raw TSC timestamps and 
postprocessing. I think users will quickly force us back to a more 
dependable clock, and if not then you were right and i was wrong ;-)

In fact even when we used sched_clock() there were some artifacts: as 
you pointed it out we dont want to do per event cross-CPU 
synchronization by default as that is very expensive. Some people wanted 
GTOD clock for tracing and we very briefly tried that - but that was an 
utter maintenance nightmare in practice.

	Ingo
--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 2:41 pm

to prove it, i just applied this patch:

Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -21,7 +21,6 @@ CFLAGS_REMOVE_mutex-debug.o = -pg
 CFLAGS_REMOVE_rtmutex-debug.o = -pg
 CFLAGS_REMOVE_cgroup-debug.o = -pg
 CFLAGS_REMOVE_sched_clock.o = -pg
-CFLAGS_REMOVE_sched.o = -mno-spe -pg
 endif
 
 obj-$(CONFIG_PROFILING) += profile.o

and sched.o was fully traced again. For example schedule() to idle is 38 
function calls:

$ cd /debug/tracing
$ echo ftrace > current_tracer
$ cat trace

# tracer: ftrace
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
[...]
             ssh-2734  [001]    52.291772: schedule <-schedule_timeout
             ssh-2734  [001]    52.291772: hrtick_clear <-schedule
             ssh-2734  [001]    52.291774: _spin_lock <-schedule
             ssh-2734  [001]    52.291775: deactivate_task <-schedule
             ssh-2734  [001]    52.291776: dequeue_task <-deactivate_task
             ssh-2734  [001]    52.291777: dequeue_task_fair <-dequeue_task
             ssh-2734  [001]    52.291778: update_curr <-dequeue_task_fair
             ssh-2734  [001]    52.291779: calc_delta_mine <-update_curr
             ssh-2734  [001]    52.291780: hrtick_start_fair <-dequeue_task_fair
             ssh-2734  [001]    52.291782: find_busiest_group <-schedule
             ssh-2734  [001]    52.291783: idle_cpu <-find_busiest_group
             ssh-2734  [001]    52.291784: target_load <-find_busiest_group
             ssh-2734  [001]    52.291785: weighted_cpuload <-target_load
             ssh-2734  [001]    52.291786: weighted_cpuload <-find_busiest_group
             ssh-2734  [001]    52.291787: cpu_avg_load_per_task <-find_busiest_group
             ssh-2734  [001]    52.291788: source_load <-find_busiest_group
             ssh-2734  [001]    52.291789: weighted_cpuload <-source_load
           ...
From: Ingo Molnar
Date: Thursday, September 25, 2008 - 2:56 pm

here is the observations from my second blind attempt - this time to 
show that we can trace even lockdep internals just fine.

I applied the patch attached below: it removes all the -pg removals from 
kernel/Makefile and restores the notrace markings in kernel/lockdep.c 
that commit 1d09daa5 ("ftrace: use Makefile to remove tracing from 
lockdep") removed.

then i enabled the whole lockdep machinery (to insert as much extra code 
as possible):

 CONFIG_DEBUG_SPINLOCK=y
 CONFIG_DEBUG_MUTEXES=y
 CONFIG_DEBUG_LOCK_ALLOC=y
 CONFIG_PROVE_LOCKING=y
 CONFIG_LOCKDEP=y
 CONFIG_LOCK_STAT=y
 CONFIG_DEBUG_LOCKDEP=y
 CONFIG_TRACE_IRQFLAGS=y

and enabled ftrace function tracing:

 CONFIG_FTRACE=y
 CONFIG_DYNAMIC_FTRACE=y
 CONFIG_FTRACE_MCOUNT_RECORD=y

[ mcount-record is a new ftrace feature for v2.6.28 ]

and rebuilt and rebooted into the kernel, and enabled ftrace.

It worked just fine this time too, and i was able to trace inside 
lockdep and lockstat internals:

          <idle>-0     [001]    71.120828: _spin_lock <-tick_do_update_jiffies64
          <idle>-0     [001]    71.120829: __lock_acquire <-lock_acquire
          <idle>-0     [001]    71.120829: lock_acquired <-_spin_lock
          <idle>-0     [001]    71.120829: print_lock_contention_bug <-lock_acquired
          <idle>-0     [001]    71.120830: do_timer <-tick_do_update_jiffies64

this wasnt done since May this year (since commit 1d09daa5).

It's absolutely not unrobust. Yes, more code will always regress more 
than no code, but it's a cost/benefit balance not a design must-have. 
And to prove me wrong i'm sure we'll have some really bad cpu_clock() 
regression within the next 24 hours ;)

Why did we do commit 1d09daa5, 6ec56232 and c349e0a0 which marked all 
these .o's notrace? Partly to address a real (meanwhile fixed) 
regression wrt. sched_clock() [and this supports your point], partly to 
remove extra unnecessary trace entries from debug infrastructure, and 
partly out of pure paranoia: i didnt ...
From: Linus Torvalds
Date: Thursday, September 25, 2008 - 2:58 pm

Now do the same on a CPU that doesn't have TSC. And notice how useless the 
timestamps are.

		Linus
--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 3:14 pm

i do not understand this argument of yours. (really)

1) is your point that we might lock up?


2) or perhaps that the timestamps update only once every jiffy, and are 
in essence useless because they show the same value again and again?

the latter is true, and that's why we were pushed hard in the past by 
tracer users towards using GTOD timestamps. Everyone's favorite 
suggestion was: "why dont you use gettimeofday internally in the 
tracer???".

We resisted that because GTOD timestamps are totally crazy IMO:

- it is 1-2 orders of magnitude more code than cpu_clock() and 
  all sched_clock() variants altogether.

- it's also pretty fragile code that uses non-trivial locking
  internally.

- pmtimer takes like 6000-10000 cycles to read. hpet ditto. Not to talk
  about the PIT. Same on other architectures.

[ ... and as usual, only Sparc64 is sane in this field. ]

for a some time we had a runtime option in the latency tracer that 
allowed the GTOD clock to be used (default-off) - but even that one was 
too much and too fragile so we removed it - it never got upstream.

Fortunately this is not a big issue as almost everything on this planet 
that runs Linux and has a kernel developer or user sitting in front of 
it has a TSC - and if it doesnt have a TSC it doesnt have any other 
high-precision time source to begin with. So worst-case sched_clock() 
falls back to a sucky jiffies approximation:

unsigned long long __attribute__((weak)) sched_clock(void)
{
        return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
}


3) ... or perhaps is it your point more highlevel, that we shouldnt be 
dealing with timestamps in a central manner _at all_ in the tracer, and 
we should make them purely optional?

I indeed _had_ a few cases (bugs i debugged) where i was not interested 
at all in the timestamps, just in their relative ordering. For that we 
had a switch in the latency tracer that turned on (expensive!) central 
synchronization [a shared global atomic ...
From: Linus Torvalds
Date: Thursday, September 25, 2008 - 4:33 pm

Have you at all followed the discussion about the people who asked for 
cross-CPU ordering? They wanted not timestamps at all, but global atomic 
counter updates. Which is very reasonable, if timestamps don't work (and 
jiffies certainly doesn't, especially in a NOHZ environment).

IOW, my whole argument is that what tracers want is _not_ the same thing 
as what "sched_clock()" wants.

		Linus
--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 10:16 am

ah, i see what you mean. I think that's an orthogonal property.

Well, it's important in some cases, not that important in other cases.

Historically we've been flip-flopping on that issue in ftrace, whether 
it should be coherent by default or not. We had at least three of four 
variations of global synchronization. (one was an atomic generation 
counter, another variant a global lock)

Eventually people noticed the overhead and asked for it to be faster.

If all you do is to trace high-freq events on all CPUs and you are _not_ 
interested in the precise interactions, the overhead of global 
synchronization can hurt a lot.

In any case, SMP coherency of trace events is an independent property of 
the tracer, and preferably something that can be turned on/off.

	Ingo
--

From: Steven Rostedt
Date: Saturday, September 27, 2008 - 10:38 am

Just a note.  The current ring buffering system that I'm proposing keeps 
its own time stamp counter (currently sched_clock) that will most likely 
be updated later. I'm trying to keep this ring buffer system as dumb as 
possible. It does not even implement the merge sort. That's up to the 
tracer to handle. There's nothing stopping the trace from adding some 
atomic counter to each event to help it sort.

So yes, the tracer can implement anything it wants on top of the ring 
buffer ;-)

-- Steve

--

From: Peter Zijlstra
Date: Saturday, September 27, 2008 - 10:50 am

Mathieu seems to disagree, it would be good if he can share some
specifics so we can work on resolving those.

--

From: Steven Rostedt
Date: Saturday, September 27, 2008 - 11:18 am

[Empty message]
From: Ingo Molnar
Date: Saturday, September 27, 2008 - 11:42 am

correct. The price is all the notifier/callback overhead and the loss of 
type checking of the record contents. But that's an unavoidable price of 

yes, very nice! :)

	Ingo
--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 10:36 am

let me outline why that flip-flopping occured.

 - coherent tracer: has built-in serialization of global events. If the
   tracer shows events to be after each other, they were after each
   other.

 - incoherent tracer: events might be mixed up slightly on the micro
   scale.

In your tree you'll see dozens of fixes from me in the past 10 years or 
so where i used various tracers to find some bug. Some of them were done 
with coherent tracers, some of them were done with incoherent tracers.

Here a few common patterns that influenced which kind of tracer i used:

 - SMP races. There it's really important to see the ordering of events, 
   and coherent tracers (where the ordering of events as displayed by 
   the tracer can be trusted) are used by default

 - EXCEPT: _very_ often an SMP race goes away if we add global
   synchronization to trace events. Sometimes the pure delay can hide
   races. So incoherent tracers are very important here.

 - analysis of performance problems: here incoherent tracers win hands 
   down. It's important to see all events on all CPUs, but it's not at 
   all important to see the precise micro-ordering of events on a global
   basis. We want to see rough workload behavior, how tasks iteract - 
   and most importantly, we want tracing to be as low-overhead as 
   possible.

 - [ in many cases coherency does not matter because we only look at a 
     single CPU's or app's trace. ]

for example on an 16-way CPU, when i run a high-event-count workload and 
add global serialization to events, the workload can easily be slower by 
10% or more. Sometimes even the characteristics of the workload changes 
due to having a globally synchronized tracer.

So ... neither coherent nor incoherent tracers are a clear, obvious 
default.

IMO incoherent is the more useful default in terms of being able to find 
bugs with it, because it has a higher utility factor. People have to 
interpret traces anyway, and the main usecase of ftrace is that i ask ...
From: Ingo Molnar
Date: Thursday, September 25, 2008 - 1:52 pm

Steve got the _worst-case_ cpu_clock() difference down to 60 usecs not 
so long ago. It might have regressed since then, it's really hard to do 
it without cross-CPU synchronization.

( But it's not impossible, as Steve has proven it, because physical time
  goes on linearly on each CPU so we have a chance to do it: by
  accurately correlating the GTOD timestamps we get at to-idle/from-idle
  times to the TSC. )

And note that i'm not only talking about cross-CPU synchronization, i'm 
also talking about _single CPU_ timestamps. How do you get it right with 
TSCs via a pure postprocessing method? A very large body of modern CPUs 
will halt the TSC when they go into idle. (about 70% of the installed 
base or so)

Note, we absolutely cannot do accurate timings in a pure 
TSC-post-processing environment: unless you want to trace _every_ 
to-idle and from-idle event, which can easily be tens of thousands of 
extra events per seconds.

What we could do perhaps is a hybrid method:

 - save a GTOD+TSC pair at important events, such as to-idle and
   from-idle, and in the periodic sched_tick(). [ perhaps also save it 
   when we change cpufreq. ]

 - save the (last_GTOD, _relative_-TSC) pair in the trace entry

with that we have a chance to do good post-processed correlation - at 
the cost of having 12-16 bytes of timestamp, per trace entry.

Or we could upscale the GTOD to 'TSC time', at go-idle and from-idle. 
Which is rather complicated with cpufreq - which frequency do we want to 
upscale to if we have a box with three available frequencies? We could 
ignore cpufreq altogether - but then there goes dependable tracing on 

i very much used early code tracing with ftrace in the past. In fact 
once i debugged and early boot hang that happened so early before 
_PRINTK_ was not functional yet (!).

So, to solve this bug, i hacked ftrace to use early_printk(), to print 
out the last 10,000 functions executed before the hang - and that's how 
i found the reason for the hang - ...
From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 2:14 pm

The "full timestamp" records should include:

    * absolute tsc
    * absolute monotonic timestamp
    * new tsc freqency

If you then make sure that all the cpufreq/idle/suspend-resume code
emits appropriate records when changing the tsc frequency, then you
should always be able to fully regenerate an absolute timestamp.

If you generate the monotonic timestamp with a good clocksource, then
you should be able to correlate the timestamps between cpus.

Oddly enough, this is identical to the Xen clocksource's use of the tsc ;)

    J
--

From: Martin Bligh
Date: Thursday, September 25, 2008 - 2:15 pm

Simple solution: turn off cpufreq whilst tracing is on ;-)

Harder: Keep a timebase and frequency divisor on a per-cpu basis
and calculate your offsets from there. This brings you down to
HPET resolution though
--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 1:20 pm

note, CONFIG_LOCKSTAT - which hooks deep inside lockdep, uses 
cpu_clock() too for timings, for similar reasons. Lockdep and lockstat 
both have a very robust design and a very good track record to prove it.

you'd be correct in pointing out that we _do_ have a relatively high 
regression count in cpu_clock()/sched_clock(), but the reason for that 
is that its implementation balances on the narrow edge of doability. It 
implements a very unstable set of requirements: "absolutely fast" pitted 
against "as accurate as possible".

That is two conflicting requirements and it is a very fine line to walk 
and everyone tries to have their own variant of it and their own 
balance. We try as hard as possible to use the TSC even in face of 
C2/C3, cpufreq, unstable TSCs, etc. The moment we go too much towards 
performance we regress accuracy and hurt the scheduler's quality and 
vice versa.

and note that my years long experience in the tracing field show that it 
has a _very_ similar need for accuracy versus performance, so it was a 
good match for ftrace.

	Ingo
--

From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 2:02 pm

That suggests that frequency changes should be recorded at a lower layer
as well, along with full timestamps and deltas, so that a log parser can
correctly generate the timestamps without having to parse higher-level
trace records.  Maybe the simplest thing to do is make the full
timestamp records also include frequency and offset information (to deal
with discontinuities caused by things like a vcpu switching between
physical cpus with unsynced tscs).

    J
--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 2:55 pm

Yes and no.

The reason I say "and no" is that it's not technically really possible to 
atomically give the exact TSC at which the frequency change took place. We 
just don't have the information, and I doubt we will ever have it.

As such, there is no point in trying to make it a low-level special op, 
because we'd _still_ end up being totally equivalent with just doing as 
regular trace-event, with a regular TSC field, and then just fill the data 
field with the new frequency.

But yes, I do think we'd need to have that as a trace packet type. I 
thought I even said so in my RFC for packet types. Ahh, it was in the 

but yes, we obviously need the frequency in order to calculate some kind 
of wall-clock time (it doesn't _have_ to be in the same packet type as the 
thing that tries to sync with a real clock, but it makes sense for it to 
be there.

That said, if people think they can do a good job of ns conversion, I'll 
stop arguing. Quite frankly, I think people are wrong about that, and 
quite frankly, I think that anybody who looks even for one second at those 
"alternate" sched_clock() implementations should realize that they aren't 
suitable, but whatever. I'm not writing the code, I can only try to 
convince people to not add the insane call-chains we have now.

			Linus
--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 3:25 pm

hm, i'd really hope hw makers see the light and actually make the hw do 
it all. Signs are that they are cozying up to these ideas.

Good and fast timestamps are important, and it is _infinitely_ more easy 
to do it in hw than in sw.

Firstly they need a low-frequency (10khz-100khz) shared clock line 
across all CPUs. A single line - and since it's low frequency it could 
be overlaid on some existing data line and filtered out. That works 
across NUMA nodes as well and physics allows it to be nanosec accurate 
up to dozens of meters or so. Then they need some really cheap way to 
realize what absolute value the clock counts, and read it out every now 
and then in the CPU, and approximate it inbetween, and have a secondary 
stage cheap few-transitors long-latency multiplicator that keeps passing 
on the nanosec-ish value to a register/MSR that can be read out by the 
instruction.

This trivially works fine even if the CPU is turned off. It uses nary 
any power as it's low freq, and can be spread across larger system 
designs too. In fact it would be a totally exciting new capability for 
things like analysis of SMP events. PEBS/BTS could be extended to save 
this kind of timestamp, and suddenly one could see _very_ accurately 
what happens between CPUs, without expensive bus snooping kit.

and CPUs wont go beyond the '~1nsec' event granularity for quite some 
time anyway - so nanoseconds is not a time scale that gets obsoleted 
quickly.

[ i guess this proves it that everyone has his pipe dream ;-) ]

	Ingo
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 3:45 pm

Can this possibly be true? I mean, light travels only one foot every
nanosecond. Can it really keep nanosecond accuracy up to dozens of  meters 
away? If you send the same signal to CPU1 that is 1 foot away, as well as 
send it to CPU2 that is 2 feet away. CPU2 will get that signal at least 1 
nanosec after CPU1 receives it.

Of course if the hardware is smart enough to know this topology, then it 
could account for these delays in traffic.

-- Steve

--

From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 4:04 pm

Sure.  NTP keeps machines within 1ms (or better) of each other even
though the network latency is much higher and jittery.

    J

--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 4:25 pm

yes. And there are radio telescope arrays that are synced up to do delta 
interferometry, over thousands of kilometers. Syncing up time over a few 
dozen meters is no challenge - and the reason for that ease is that 
physical time is neatly and uniformly broadcasted by nature in a pretty 
dependable way, at around 300 thousand kilometers per second.

the challenge is to make it cheap enough for commodity hw. I.e. no extra 
CPU pins or lines in critical parts of the board, no extra power, low 
transistor count, no impact on any critical path, short and reliable 
clock readout after powerup, etc. But that is quite possible too IMO, 
and the payback is very real.

[ OTOH, this is a world that still ships FreeDOS on many whitebox PC
  instead of putting Linux on it, so dont expect logic to prevail
  in all cases ;-) ]

	Ingo
--

From: Thomas Gleixner
Date: Friday, September 26, 2008 - 7:04 am

Yup, it can be done. Ingo's proposal reminds me on a project we did in
1999 for distributed computing.

|------------|
|master clock|
|1 MHz       |===========[slave 1]=========[slave 2] ..........
|sync pulse  |
|1 Hz        |
|------------|

The slave implementation was a simple PLL driven by the master clock
with an output frequency of 1GHz. The micro seconds counter was driven
by the 1MHz clock and the nanoseconds part by the PLL clock. 

The nanoseconds counter was implemented so it stopped counting at 999
and it was reset to 0 when the 1MHz pulse came in. That made sure that
the PLL inaccuracy was corrected with every 1MHz pulse.

When a slave attached itself to the system, then it waited for the 1Hz
sync pulse, queried the absolute time from the master via the network
and synced itself to the next 1Hz sync pulse.

The signal runtime was compensated by the position of the node in the
topology, i.e. the distance to the master clock. That's simple math.
Signal runtime is known for a given wiring / PCB layout. So you just
have to know the distance.

For twisted pair the signal speed is ~ 0.6 * c =~ 180000 km/s. 

So a node which is 1m away from the master gets the signal delayed by:
 0.001 / 180000 s ~= 5.5 nsec

That's constant and easy to account for. You can get into the +/-
10nsec accuracy accross a large distributed system that way.

The sad part is, that we talked to AMD/Intel about this back then and
they both thought it would be a nice idea for cross socket synced
counters and could be easy implemented in hardware. Then they went off
and did the sh*t which we have to deal with right now.

Thanks,

	tglx
--

From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 3:39 pm

Well, you don't need the tsc at the precise moment of the frequency
change.  You just need to emit the current tsc+frequency+wallclock time
before you emit any more delta records after the frequency change.  You

Yeah.  If you ever mention wallclock time in the event stream, you have
to tie it to your local timebase (tsc+frequency) to make the whole thing

Yeah.  Unfortunately, in the virtual case - unless you're virtualizing
the tsc itself, which is horrible - you can't really control or measure
how the tsc is going to behave, because its all under the hypervisor's
control.  A "cpu" could be migrated between different physical cpus, the
whole machine could be migrated between hosts, or suspended, etc, making
it very hard to use the naked tsc.  In that case the only real option is
to use a hypervisor-supplied timebase (which for Xen and KVM is a
tsc-based scheme exactly like we've been discussing, except the
hypervisor provides the tsc timing parameters).

asm/x86/kernel/pvclock.c does the tsc to ns conversion with just adds
and multiplies, but unfortunately it can't be expressed in C because it
uses the extra precision the x86 gives for multiplies.

    J
--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 3:55 pm

hm, i'm not sure you've thought through this delta record idea.

Take a system that goes idle thousands of times a second. (it's easy - 
just some networking workload)

Now take a tracer that just wants to emit a trace entry every now and 
then. Once or twice a second or so.

Note that suddenly you have thousands of totally artificial 'delta' time 
records between two real events, and have to post-process all your way 
up between these events to get to the real timeline.

... it is totally impractical and costly.

and then i havent even mentioned some other big complications:

 - the numeric errors that mount up over thousands of delta events

 - the memory overhead over thousands of entries

 - the fact that cpufreq and PLL changes are rarely atomic and that the
   TSC can flip-flop between two frequencies.

TSC based delta time post-processing is just not practical. Micro-time 
has to be synced up to GTOD in some manner - OR you might declare that 
the TSC _has_ to never stop and has to be constant. (no modern x86 CPU 
meets that criterium at the moment though.)

... and the moment you accept the fact that the GTOD _has_ to be mixed 
into it, all the rest follows pretty much automatically: either you 
store the (GTOD,freq,TSC) triple and post-process that absolute 
timestamp, and you accept the in-memory cost of that and do 
post-processing, or you compress that triple in situ and store the 
result only.

[ You will then also want some fall-back implementation for CPUs that 
  have no TSCs, and for architectures that have no default 
  implementation - something jiffies based. And you want some special 
  hooks for paravirt, as always. ]

I.e. you will end up having something quite close to 
cpu_clock()/sched_clock().

_Or_ if you manage to get any better than that for tracing then please 
tell us about it because we want to apply those concepts to 
cpu_clock()/sched_clock() too :-)

	Ingo
--

From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 6:17 pm

No, as I said:  "You just need to emit the current
tsc+frequency+wallclock time before you emit any more delta records
after the frequency change."

When an event which affects the tsc occurs, like a frequency change or
pause, set a flag.  When you're next about to emit a delta, check the

You need to know the frequency at the time you sample the tsc, and you
need to know when the frequency changes.  If you don't, you can't use
the tsc for time, regardless of whether you process it immediately or

Right.  You store (GTOD,freq,tsc) every time you need that information,
and then interpolate with the tsc while you know its monotonic.

Unless your tsc is completely screwed, the (GTOD,freq,tsc) triple is
going to be stored at a fairly low frequency, and won't fill your event
buffer very much (though it might be a large proportion of your recorded

Well, whatever the best timer the platform has.  And maybe its already
processed into real time, in which case you just emit raw deltas and

Well, its what Xen does already for time.  It works well.

    J
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 6:27 pm

This is where I get confused.  When do we add this to the trace buffer?
Every record (currently) records the delta time from the last event.
If a frequency changes, do we need to record the tuple at that moment?

If so, we have an issue. We can be in the middle of tracing the GTOD
and we can not recurse back into the GTOD for fear of deadlocking on
the GTOD spinlocks.

Do we add this data to the buffers when it happens?  This means every
allocated ring buffer (and yes you can have more than one), will need to
register a call back so that it can record this information into the
buffer.

Or am I just not getting it?

-- Steve

--

From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 6:49 pm

No, you just need to save away a consistent (tsc,freq,GTOD) triple at
some point in time; it doesn't matter when it is, so long as the tsc
frequency hasn't changed since then.  Once you have that, you can insert
it into your trace buffer, and then use it as the base for subsequent
deltas.

And as I mentioned in the other mail, if you don't mind backtracking
when you process the log, you can even record (tsc,freq) (tsc,GTOD)

Right, you don't need to read the GTOD when you actually write a record.

    J
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 3:59 pm

Even in my last version I added a "TIME_STAMP" type that can be used in 
the future to add some kind of synchronization into the trace, that reads
GTOD or whatnot.

But as you can see, I've been trying to implement these various ideas, 
since the devil is in the details and the code is the details.

How do you get this GTOD read in the ring buffer? If the ring buffer does
it without any knowledge from the tracer, it may be doing it a 
inappropriate times. This would also imply that the GTOD infrastructure 
itself is reentrent safe.  Imagine tracing the GTOD code when the buffer 
decides it is about time to add the GTOD timestamp into the buffer. Can 
the GTOD handle this recursion. If the GTOD has spinlocks, probably not.

Perhaps we can add a ring_buffer_write_safe() method that would prevent 
the trace from doing these. Or we can add a way for the tracer to trigger 

If we do end up making a new clock API, I imagine that each arch will 
define their own.

-- Steve

--

From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 6:27 pm

It doesn't need to read the GTOD synchronously when writing the trace
record.  When a tsc event occurs, it needs to store the
GTOD/tsc/frequency somewhere at that point.  When you next write a trace
record, if that structure has been updated, you write it into the trace
before emitting the next delta timestamp.

In fact you can read the GTOD at almost any time between tsc frequency
updates, so you can defer it to whenever is convenient.  At the time the
tsc changes frequency, you emit: (tsc,frequency), and a bit later
(before it changes again) you emit (tsc,GTOD).  That allows you to
retroactively compute GTODs for all timestamps.  Obviously it would be
good to get them as close as possible together.

    J
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 6:35 pm

OK, let me rephrase my question.

How and where do we record this?  Do we keep this information in some 
global variable that we must compare to every time we add a new item in 
the trace?

Do we have the buffer register a call back to record this information?

-- Steve

--

From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 7:07 pm

Something like (total pseudocode):

struct tsc_time_parameters {
	int version;		/* even - values OK; odd - values being updated */
	u64 tsc;
	u32 tsc_freq;
	u64 gtod;
};

DEFINE_PERCPU(struct tsc_time_parameters, tsc_params);

/* To be called after a tsc frequency change, before any new
   trace records are being emitted, in a context where we can call get_GTOD() */
void update_tsc_params(void)
{
	struct tsc_time_parameters *p = __get_percpu_var(tsc_params);

	p->version |= 1;
	wmb();

	p->tsc = get_tsc();
	p->tsc_freq = get_tsc_freq();
	p->gtod = get_GTOD();

	wmb();
	p->version++;
	wmb();
}

DEFINE_PERCPU(unsigned, current_tsc_version);
DEFINE_PERCPU(u64, prev_tsc);

/* may be called in any context */
u64 get_trace_timestamp_delta(void)
{
	const struct tsc_time_parameters *p = &__get_percpu_var(tsc_params);
	unsigned *current_version = &__get_cpu_var(current_tsc_version);
	u64 prev = __get_cpu_var(prev_tsc);
	u64 now, ret;

	/* check the current tsc_params version against the last one we emitted;
		if the version is odd, then we interrupted the parameters as they were
		being updated, so just emit a new delta with the old parameters */
	if (unlikely(*current_version != p->version && !(p->version & 1))) {
		/* XXX probably need a loop to deal with p->version changing under our feet */
		emit_tsc_freq_record(p);
		prev = p->tsc;
		__get_cpu_var(current_tsc_version) = p->version;
	}

	now = read_tsc();
	ret = now - prev;
	__get_cpu_var(prev_tsc) = now;

	return ret;
}


    J
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 7:25 pm

[
  "When people ask me what language my mother tongue is,
   I simply reply 'C'" - Steven Rostedt
]

This is exactly why I have that saying ;-)







We probably wont to check here that p didn't change again.


Hmm, the beginning of each patch will need to record the global tsc, as 
well as this information. Simply because in overwrite mode, we do not want 
to lose it if the producer is faster than te consumer.

-- Steve

--

From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 10:31 pm

No, they could probably be static, depending where everything ends up. 

Yes, and any other place the tsc might get affected, like going into a


Yeah, you may want to put the whole thing in a loop to make sure that
the version is consistent.  You might end up emitting multiple redundant

Each patch?  Page?

    J
--

From: Steven Rostedt
Date: Friday, September 26, 2008 - 3:41 am

Yep, s/patch/page/. I was already half asleep when I replied. ;)

-- Steve

--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 8:26 am

I've been just using sched_clock() which already normalizes to ns. But I 
use a wrapper (ring_buffer_time_stamp) so we can decide on how to keep 
track later.

If we do not normalize, then we must come up yet another generic way to read 
the CPU clock for all archs. And then we also need to come up with another 
generic way to normalize it later for output.

If I'm missing that this already exists, then I'll go and use it, but I do 
not think that tracing is worthy enough to implement this timing 
infrastructure just to get faster traces.

-- Steve

--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 10:22 am

Why would any of this be "generic"?

Quite the reverse. It should be as trace-buffer specific as possible, so 
that we do *not* share any code or any constraints with other people.

Just do rdtsc at first, and make it depend on x86. If the thing is made 
simple enough, it will be a couple of lines of code for architectures to 
read their own timestamp counters.

And since the normalization is then no longer in the critical part, _that_ 
can be architecture-independent, but obviously still trace-specific. You 
need to know the frequency, and that involves having frequency events in 
the trace if it changes, but if you don't see any frequency events you 
just take "current frequency".

And doing it at trace parse time, we can some day enable a boot trace that 
actually WORKS. Have you looked at the timestamp events we get from 
"sched_clock()" in early bootup? They show up in the kernel logs when you 
have CONFIG_PRINTK_TIME. And they are totally and utterly broken and 
_useless_ for the early stages right now. And they shouldn't have to be 
that way.

Yeah, we'll never be able to trace stuff that happens really early 
(tracing will obviously always need kernel page tables and some really 
basic stuf working), but we should be able to trace through things like 
TSC calibration for boot time analysis. It wasn't that long ago that we 
had the whole discussion about TSC calibration taking 200ms. Or the early 
ACPI code. And get meaningful data.

			Linus
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 10:39 am

generic as in, could be implement in architecture dependent ways but with
a "generic" interface. IOW, I don't want the trace to be dependent on
any arch. ftrace already runs on x86, ppc, sparc64, mips, arm, sh, and 

I could do a HAVE_RING_BUFFER_TIMESTAMP config option for archs that 
implement it, and just use something dumb for those that don't. For now
I'll keep to sched_clock, just because it makes it easy for me. But with

The one thing that seemed to me most apparent from talking to people
at LPC, is that they want a simple ring buffer API. If every tracer that
uses this must come up with its own time keeping management, I don't think
this will be used at all (except by those that are maintaining tracers

My logdev code has a define option to use bootmem for its buffers, and it 
also uses an atomic counter to try to keep things in order. Heck, at early 
boot, the events happen in order anyway, since it is still a single CPU 
system.

-- Steve

--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 11:14 am

No, no.

The timestamp code is all in the ring buffer code. That was why I refused 
to have the layering without it.

And hell no, nobody should *ever* read the "tsc_delta" fields etc. Those 
are entirely internal to the buffering. If any user _ever_ reads or writes 
those on its own, it's a bug, plain and simple.

So when you read trace events, you should get the event data and the 
timestamp from the trace buffer routines. Nobody should ever even _see_ 
the internal trace buffer implementation!

			Linus
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 8:20 am

Pretty much all CPUs word align on a 8 byte bounder (until we get those 
128bit boxes running), but not all can word align on 4 bytes. I was hoping 
to make the buffer output somewhat the same across archs.

Otherwise, this is going to be quite a complex mess IMHO.

-- Steve

--

From: Martin Bligh
Date: Wednesday, September 24, 2008 - 10:54 am

Mmm. Either I'm confused, or we're talking about different things.

If we just record the TSC unshifted, in 27 bits, at 4GHz, that gives us
about 1/30 of a second? So we either shift, use > 27 bits, or record
at least 30 events a second, none of which I like much ...
--

From: Martin Bligh
Date: Wednesday, September 24, 2008 - 11:04 am

If we use 32 bits instead of 27, then the timestamp events are only
about once per second, which is probably fine for overhead ... ?

I think we're OK losing 5 bits of precision, that's only 32 cycles,
given all the CPU reordering stuff we've talked about here,
not-quite-synced TSCs, etc. I suspect you thought we were
shifting by much more than this, in reality it was 5-10 bits,
with timestamp events inbetween, though we put wall time in
them, which I think was a mistake.

(note: I'm not suggesting we have to use this compact a format,
at least by default)
--

From: Linus Torvalds
Date: Wednesday, September 24, 2008 - 1:39 pm

You'd have them ONCE IN A BLUE MOON.

If there is nothing going on, you don't need the timestamps at all.

Yes, some people will _want_ a heartbeat, of course, and there might be 
serialization points where we want to serialize some external clock to the 
TSC, but that's a separate issue from generating a full TSC value. We may 
well decide that once a second we want a whole packet with TSC _and_ 
HR-timer sync information, for example. But that's a separate issue.

And if there is a lot of things going on (eg you're tracing things like a 
block device performance issue), you have events _much_ closer to each 
other than 1/30th of a second, and you again _never_ need a full 59-bit 
timestamp. Because the delta would be cumulative, and as long as you are 
generating events, you're also updating the base TSC.

The only case you'll see lots of those timestamps (where "lots" is 
guaranteed to be less than 30 in one second) would be if you're tracing 
something that literally does a couple of events per second. But then you 
sure as hell don't need to worry about performance _or_ memory use. If you 
have some trace that gives you five hits per second (and spread out, to 
boot!), you'll generate twice the number of trace entries because each 
entry would always be preceded by an extended thing, but hey, do we really 
worry about five trace events per second?

And quite frankly, most of the tracing I've ever done really does fall 
into the "either nothing" or "a flood of events" thing. The "5-25 events 
per second at regular intervals" case really doesn't sound very common at 
all - not that I would worry about it if it was since it's going to be a 
really simple case..

And btw, this would be an issue only on really fast CPU's anyway. If 
you're in the mobile space, your TSC will be clocking at less than that, 
so the slower the machine, the less the overhead. Again, that's exactly 
what you want - the overhead is not some fixed thing that is detemined by 
the fastest ...
From: Martin Bligh
Date: Wednesday, September 24, 2008 - 1:56 pm

Yeah, you're right - we can just mark it dirty, and 'pre-log' the timestamp
events when someone calls a reserve and we haven't logged anything
for more time than we can store. Did not think of that. Was only 5 bits
for us, not an extra 37, but still, is much better.

Is a 5-bit event id generic enough though?
--

From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 2:08 pm

Actually, I was keeping the event id completely out of the ring buffer and 
let a higher layer deal with that. For padding, I just made the length 
field zero.

For overflows of the timestamp, we can reserve the -1 timestamp as a 
trigger to read the tsc again and put the full 64 bits into the record.

Just an idea.

-- Steve

--

From: Linus Torvalds
Date: Wednesday, September 24, 2008 - 1:30 pm

No, we don't shift (we don't want to lose precision), and we don't use 
more than 27 bits by default.

the TSC at each entry should be a _delta_. It's the difference from the 
last one. And if you get less than 30 events per second, and you need a 
bigger difference, you insert an extra "sync" tracepoint that has a 59-bit 
thing (27 bits _plus_ the extra 'data').

Yes, it adds 8 bytes (assuming that minimal format), but it does so only 
for any trace event that is more than 1/30th of a second from its previous 
one. IOW, think of this not in bytes, but in bytes-per-second. It adds at 
most 8*30=240 bytes per second, but what it _saves_ is that when you have 
tens of thousands of events, it shaves 4 bytes FOR EACH EVENT.

See?

Also, quite often, the clock won't be running at 4GHz even if the CPU 
might. Intel already doesn't make the TSC be the nominal frequency, and 
other architectures with TSC's have long had the TSC be something like a 
"divide-by-16" clock rather than every single cycle because it's more 
power-efficient.

So there is often a built-in shift, and I doubt we'll see 10GHz TSC's even 
if we see 10GHz CPU's (which many people consider unlikely anyway, but 
I'm not going to bet against technology).

			Linus
--

From: Mathieu Desnoyers
Date: Wednesday, September 24, 2008 - 1:53 pm

The reason why Martin did use only a 27 bits TSC in ktrace was that they
were statically limited to 32 event types. I doubt this will suffice for
general purpose kernel tracing. For simplicity, I would just start with
a header made of the 32 TSC LSBs, 16 bits for events ID and 16 bits for
event size in the buffer header. We can always create extra-compact
schemes later on which can be tied to specific buffers. I actually have

I agree that, in the end, we will end up with "delta" information given
by the timestamp, but there is a way to encode that very simply without
having to compute any time delta between events : we just have to keep
the bits we are interested to save (say, the 32 LSBs) and write that as
a time value. Then, whenever we have to write this value, we either have
a heartbeat system making sure we always detect 32 bit overflows by
writing an event at least once per 32-bit overflow or by adding the full
64-bits timestamp as a prefix to the event when this occurs (as you
proposed). Note that the latter proposal imply extra computation at the
tracing site, which could have some performance impact.

There are a few reasons why I would prefer to stay away from enconding
time deltas and use direct LSB tsc representation in the event headers.
First, deltas make it hard to deal with missing information (lost
events, lost buffers); it those cases, you simply don't know what the
delta is. OTOH, if you encode directly the LSBs read from the cycle
counter, you can more easily deal with such lack of information (lost
events) and lost subbuffers by writing an extended 64-bits event header
when needed.

The benefit of using the bigger event header when required over using
heartbeat, even if it makes the tracing fastpath a bit slower, is that
is won't impact systems using dynamic ticks. Heartbeats are generally
bad at that because their require the system to be woken up
periodically.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE ...
From: Linus Torvalds
Date: Wednesday, September 24, 2008 - 3:28 pm

Well, I actually think we could do the same - for the "internal" types.

So why not do something like 4-5 bits for the basic type information, and 
then oen of those cases is a "freeform" thing, and the others are reserved 
for other uses.

So a trace entry header could easily look something like

	struct trace_entry {
		u32 tsc_delta:27,
		     type:5;
		u32 data;
		u64 array[];
	}

and then depending on the that 5-bit type, the "data" field in the header 
means different things, and the size of the trace_entry also is different.

So it could be something like

 - case 0: EnfOfPage marker
	(data is ignored)
	size = 8

 - case 1: TSCExtend marker
	data = extended TSC (bits 28..59)
	size = 8

 - case 2: TimeStamp marker
	data = tv_nsec
	array[0] = tv_sec
	size = 16

 - case 3: LargeBinaryBlob marker
	data = 32-bit length of binary data
	array[0] = 64-bit pointer to binary blob
	array[1] = 64-bit pointer to "free" function
	size = 24

 - case 4: SmallBinaryBlob marker
	data = inline length in bytes, must be < 4096
	array[0..(len+7)/8] = inline data, padded
	size = (len+15) & ~7

 - case 5: AsciiFormat marker
	data = number of arguments
	array[0] = 64-bit pointer to static const format string
	array[1..arg] = argument values
	size = 8*(2+arg)

  ...

ie we use a few bits for "trace _internal_ type fields", and then for a 
few of those types we have internal meanings, and other types just means 
that the user can fill in the data itself.

IOW, you _could_ have an interface like

	ascii_marker_2(ringbuffer,
		"Reading sector %lu-%lu",
		sector, sector+nsec);

and what it would create would be a fairly small trace packet that looks 
something like

	.type = 5,
	.tsc_delta = ...,
	.data = 2,
	.array[0] = (const char *) "Reading sector %lu-%lu\n"
	.array[1] = xx,
	.array[2] = yy

and you would not actually print it out as ASCII until somebody read it 
from the kernel (and any "binary" interface would get the string as a 
string, not as a ...
From: Linus Torvalds
Date: Wednesday, September 24, 2008 - 3:41 pm

Btw, in case it wasn't clear, those two are totally different things.

The "case 1" thing is the thing that gets inserted automatically by the 
trace code when it's needed because the 27-bit TSC is too limited.

The "case 2" thing is to allow us to occasionally synchronize with some 
global known wall-time clock like the HPET + xtime. IOW, it would be 
something that on demand creates a mapping from wall clock to TSC for that 
particular CPU. 

I guess I should perhaps have put the TSC frequency in there in that "case 
2" thing too. Maybe that should be in "data" (in kHz) and tv_sec/tv_nsec 
should be in array[0..1], and the time sync packet would be 24 bytes.

			Linus
--

From: Mathieu Desnoyers
Date: Thursday, September 25, 2008 - 10:15 am

We could use a page header instead to contain the "unused_size"
information. It does not need to be an event per se. Putting this
information in the page header makes it easy for a consumer to just read
the amount of bytes needed, excluding the padding (turns up to be useful
for network streaming of trace data). Also, it frees up one event ID for
other uses. I think the event ID real estate is pretty important,
because every event ID we don't keep for "internal uses" could be used

I would prefer to put the extended timestamp within the event header
instead of creating a separate entry for this for atomicity concerns
(what happens if a long interrupt executes between the TSCExtend marker
event and the event expecting to be written right next to it ?). If we
choose to have such "full tsc" event headers, we would have to reserve 1
selection bit. It leaves us 4 bits for event IDs. If we remove
heartbeats, we need even less "internal" IDs.

Given that we can allocate event IDs per buffer, if in general we assume
that buffer users will have few event IDs, I think we can find a way to
minimize the number of "internal" event IDs and keep possibly all the 4
bits (16 IDs) for real tracer events. One way to achieve it is to encode
the extra typing information within a table (dumped in a separate buffer)

This one could even be a standard event put in a single buffer. There is

I agree that exporting semantic information is important. Moreover, I
think this should also be made available when the trace is exported in
binary format to userspace. The markers currently in mainline has been
designed to do this efficiently. With small adaptation of the markers,
one could do a :

  trace_mark(block, read,
             "Reading sector sec_from to sec_to",
             "sec_from %lu sec_to %lu",
             sector, sector + nsec);

The nice part about the markers is that it keeps tables for the
description "Reading sector sec_from to sec_to" and the typing
information "sec_from %lu sec_to ...
From: Linus Torvalds
Date: Thursday, September 25, 2008 - 10:29 am

The log entries should be reserved with interrupts disabled anyway, and 
they are per-CPU, so there are no atomicity issues.

For NMI's, things get more exciting. I'd really prefer NMI's to go to a 
separate ring buffer entirely, because otherwise consistency gets really 
hard. Using lockless algorithms for a variable-sized pool of pages is a 
disaster waiting to happen.

I don't think we can currently necessarily reasonably trace NMI's, but 
it's something to keep in mind as required support eventually.

		Linus
--

From: Mathieu Desnoyers
Date: Thursday, September 25, 2008 - 10:42 am

I actually do use a lockless algorithm in LTTng and don't have to
disable interrupts around tracing. This is how I get the kind of
performance the Google folks expect. I would recommend to stay with
interrupt disable + per-cpu spinlock (slow and heavy locking) for v1,
but to keep in mind that we might want to go for a more lightweight

LTTng does it, no disaster happened in the past 2-3 years. :)

I guess we could manage to deal with NMI tracing specfically using the

NMI tracing is a nice-to-have (and lttng does provide it), but the core
thing is really performance; disabling interrupts happens to be fairly
slow on many architectures.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Mathieu Desnoyers
Date: Thursday, September 25, 2008 - 9:37 am

I remembered other concerns about 27 vs 32 bits TSC decision, which are
rather important. First, if we have a 27 bits TSC, with overflow every
33ms at 4GHz, we assume the kernel will _never_ have an interrupt
latency longer than this for correct heartbeat behavior. The sad thing
is that it is not uncommon to see such interrupt latencies once in a
while. It's especially bad if the correctness of the timestamps gathered
by the one tool that would be helpful to debug such interrupt latency
problem is broken by the thing it try to instrument.

Given that heartbeat events have this particular downside, we may think
writing an extended TSC value in the event header will solve the
problem, but it actually creates a new kind of problem when we try to
relax buffer locking. The basic idea is that it's easy to compute the
delta from the previous timestamp taken on the same CPU when disabling
interrupts because we are serializing all operations for this specific
core. However, a more lightweight locking (without interrupt disabling)
involves using a loop such as:

reserve space (preempt is disabled) :
unsigned long old, new;

do {
  old = read_offset();
  timestamp = get_cycles();
  new = old + header_size() + event_size();
} while ((local_cmpxchg(&buf->offset, old, new) != old);
write event header
write event payload
commit

So we can make sure the timestamps will never go backward in a given
buffer (this is insured by the fact that this cmpxchg ties together
timestamp read and buffer space reservation).

If we want to implement detection of TSC rollover at the tracing site,
we would have to remember the last TSC value read, and this is where it
becomes racy. We cannot read the previous event slot, because we have no
guarantee it's been written to (only reserve operation is ordered. write
and commit are unordered). We would therefore have to keep a copy of the
previous TSC value, but given we don't do any locking, it would only be
safe to only keep data that can be read/writte ...
From: Linus Torvalds
Date: Thursday, September 25, 2008 - 9:49 am

We do no such thing.

Guys, the heartbeat is a _separate_ thing from overflow handling.

You don't handle overflow by having a heartbeat that beats fifty times a 
second just to insert events, just so that the TSC delta would always fit 
in 27 bits. That would work, but be stupid. It would mean that you fill up 
your event buffer with uninteresting crud just because nothing happens.

Yes, many people want to have a heartbeat (a "Mark" event) every once in a 
while, but what I suggest is independent of heartbeats, even if it _could_ 
be implemented that way. What I suggest is simply that when you insert an 
event, you always read the full 64 bits of TSC (on x86 - others will do 
other things), and then you insert the delta against the last one.

After all, you cannot read just 27 bits of the TSC anyway. You _have_ to 
read the whole 64 bits, and then you subtract the pervious trace event TSC 
(that you have in the per-CPU trace buffer header) from that. You now have 
a delta value.

And if the delta doesn't fit in 27 bits, you generate a 59-bit TSC event!

None of this has _anything_ to do with interrupt latency. There is no 
dependency on a heartbeat, or any dependency on always inserting a trace 
event at least 30 times a second. There's no worry about conversions, and 
these are all trivial single assembly instructions to do (or a couple, on 
a 32-bit architecture that needs to do a sub/sbb pair and test two 
different registers to see if the result fits in 27 bits).

The only issue is that if you insert trace events more seldom, you'll 
always get the extra TSC event as well, inserted automatically in front of 
the event you explicitly inserted. The tracer doesn't need to know.

			Linus
--

From: Steven Rostedt
Date: Thursday, September 25, 2008 - 10:02 am

Note: RFC v2 implements this.

--

From: Mathieu Desnoyers
Date: Wednesday, September 24, 2008 - 9:13 am

I think we all agree that a supplementary copy is no wanted, but I think
this question is orthogonal to having a write wrapper. The way we can do
both is by using reserve/commit to deal with space reservation, and a
write() to perform the actual data write into the buffers once space has
been reserved.

Reserve/commit would allocate a variable-sized "slot" into the buffer.
We reserve X amount of bytes, and it returns the offset from the
buffer start where the allocated slot is. This reserve/commit mechanism
deals with synchronization (cli/spinlock or cmpxchg_local scheme...).

We can then use this offset to see in which page(s) we have to write.
This offset + len can in fact cross multiple page boundaries.

Doing this elegantly could involve a page array that would represent the
buffer data :

struct page **buffer;

And be given as parameter to the read() and write() methods, which would
deal with page-crossing.

e.g.

/*
 * Perform an aligned write of the input data into the buffer.
 *
 * buffer : page pointer array
 * woffset : offset in the page pointer array where write starts from
 * data : input data
 * len : length of data to copy
 *
 * Note : if a NULL buffer is passed, no copy is performed, but the
 * alignment and offset calculation is done. Useful to calculate the
 * size to reserve.
 *
 * return : length written
 */
size_t write(struct page **buffer, size_t woffset, void *data, size_t len);

Therefore, we could have code which writes in the buffers, without extra
copy, and without using vmap, in multiple writes for a single event,
which would deal with data alignment, e.g. :

size_t woffset, evsize = 0;

evsize += write(NULL, evsize, &var1, sizeof(var1));
evsize += write(NULL, evsize, &var2, sizeof(var2));
evsize += write(NULL, evsize, &var3, sizeof(var3));

woffset = reserve(..., evsize);

woffset += write(buffer, woffset, &var1, sizeof(var1));
woffset += write(buffer, woffset, &var2, sizeof(var2));
woffset += write(buffer, woffset, &var3, ...
From: Steven Rostedt
Date: Wednesday, September 24, 2008 - 9:31 am

Mathieu,

I'm starting to think that you are just too smart for your own good ;-)
No it does not make sense. Well, it does not after looking at it for 10 
seconds.

Which brings up my point, the interface must be simple, and not cause 
people to spend minutes trying to figure out what the trace is doing.

-- Steve

--

From: Peter Zijlstra
Date: Wednesday, September 24, 2008 - 9:39 am

Yes, we can do the sub-write, how about:

struct ringbuffer_write_state 
ringbuffer_write_start(struct ringbuffer *buffer, unsigned long size);

int ringbuffer_write(struct ringbuffer_write_state *state, 
                     const void *buf, unsigned long size);

void ringbuffer_write_finish(struct ringbuffer_write_state *state);


That way write_start() can do the reserve and set a local write
iterator. write() can then do whatever, either the direct copy of break
it up - will error on overflowing the reserved size. write_finish() will
clean up (sti, preempt_enable etc..)

--

From: Mathieu Desnoyers
Date: Wednesday, September 24, 2008 - 9:51 am

Yup, that looks neat. I don't know if it's worth separating data
alignment concerns from this part of the infrastructure so it stays
simple. OTOH, embedding automatic alignment of data elements would be
easy to do here, e.g. :

struct ringbuffer_write_state 
ringbuffer_write_start(struct ringbuffer *buffer, unsigned long size);

int ringbuffer_write(struct ringbuffer_write_state *state, 
                     const void *buf,
                     unsigned long size,
                     unsigned long alignment);

#define ringbuffer_compute_size(size, alignment) \
  ringbuffer_write(NULL, NULL, size, alignment)

void ringbuffer_write_finish(struct ringbuffer_write_state *state);

So ringbuffer_compute_size could be useds to compute the total slot size
needed to write the event before doing the ringbuffer_write_start(). It
would be good to keep ringbuffer_write() mostly as a static inline so
the compiler could statically compile in much of these operations.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

[