Re: Unified tracing buffer

Previous thread: [patch 1/3] cpu alloc: Use in slub by Christoph Lameter on Friday, September 19, 2008 - 1:37 pm. (1 message)

Next thread: sparse_irq: userspace ABI breakage in -tip tree by Arjan van de Ven on Friday, September 19, 2008 - 3:53 pm. (4 messages)
From: Martin Bligh
Date: Friday, September 19, 2008 - 2:33 pm

During kernel summit and Plumbers conference, Linus and others
expressed a desire for a unified
tracing buffer system for multiple tracing applications (eg ftrace,
lttng, systemtap, blktrace, etc) to use.
This provides several advantages, including the ability to interleave
data from multiple sources,
not having to learn 200 different tools, duplicated code/effort, etc.

Several of us got together last night and tried to cut this down to
the simplest usable system
we could agree on (and nobody got hurt!). This will form version 1.
I've sketched out a few
enhancements we know that we want, but have agreed to leave these
until version 2.
The answer to most questions about the below is "yes we know, we'll
fix that in version 2"
(or 3). Simplicity was the rule ...

Sketch of design.  Enjoy flaming me. Code will follow shortly.


STORAGE
-------

We will support multiple buffers for different tracing systems, with
separate names, event id spaces.
Event ids are 16 bit, dynamically allocated.
A "one line of text" print function will be provided for each event,
or use the default (probably hex printf)
Will provide a "flight data recorder" mode, and a "spool to disk" mode.

Circular buffer per cpu, protected by per-cpu spinlock_irq
Word aligned records.
Variable record length, header will start with length record.
Timestamps in fixed timebase, monotonically increasing (across all CPUs)


INPUT_FUNCTIONS
---------------

allocate_buffer (name, size)
        return buffer_handle

register_event (buffer_handle, event_id, print_function)
        You can pass in a requested event_id from a fixed set, and
will be given it, or an error
        0 means allocate me one dynamically
        returns event_id     (or -E_ERROR)

record_event (buffer_handle, event_id, length, *buf)


OUTPUT
------

Data will be output via debugfs, and provide the following output streams:

/debugfs/tracing/<name>/buffers/text
    clear text stream (will merge the per-cpu streams via ...
From: Randy Dunlap
Date: Friday, September 19, 2008 - 2:42 pm

Arch-specific "word"?
or some fixed-size-for-all-systems (so that trace buffers can be

what timestamp resolution?


---
~Randy
--

From: Martin Bligh
Date: Friday, September 19, 2008 - 2:57 pm

u16


Mmmm. I don't see anything wrong with making it just 8 byte aligned, personally.

ns is probably sufficient for output, but may need to be higher
internally to get
correct ordering of events across CPUs. So, as long as we record this in
the buffer header, the internal resolution shouldn't be critical. The text
print output ... I'd say ns? We can put it relative to wall time in there,
as long as we record it in the buffer header at trace start. I guess we should
document the buffer header ;-)
--

From: Olaf Dabrunz
Date: Friday, September 19, 2008 - 3:41 pm

Ids for event types. Either allocated dynamically, if the tracer needs
new ids on each use, or statically assigned for others (like my fctrace
or Steven's ftrace, I believe). Should we have a reserved range / registry
for static allocation, maybe something like a very simple version of

Unaligned can be much slower. I guess some very quick tracers can

Yes. :)

-- 
Olaf Dabrunz (od/odabrunz), SUSE Linux Products GmbH, Nürnberg

--

From: Martin Bligh
Date: Friday, September 19, 2008 - 3:19 pm

Sure, but it's per-tracer, so hopefully won't be a big problem (eg fctrace
would have a different event-id namespace from blktrace)
--

From: Olaf Dabrunz
Date: Saturday, September 20, 2008 - 1:10 am

Ah, that is right. We can distinguish them.

-- 
Olaf Dabrunz (od/odabrunz), SUSE Linux Products GmbH, Nürnberg

--

From: Steven Rostedt
Date: Saturday, September 20, 2008 - 1:29 am

Right!

 We stated in our little meeting that the true event id association is 
buffer id / event id tuple. We will not be assigning ranges for events 
for specific tracers. Ftrace will not have its own range. The static ids 
are reserved for the static trace points and some various static trace 
types that the average kernel developer may use.

Think "string event type" for a event type that will simply hold an ASCII 
string.

-- Steve

--

From: Mathieu Desnoyers
Date: Saturday, September 20, 2008 - 4:40 am

Just to be sure of the "that the average kernel developer may use"
meaning : I would recommend keeping those static ID range only for
internal buffering mechanism events. E.g., if we need to add a
periodical event in every stream so we can detect timestamp wrap-around,
that would be part of the buffering infrastructure itself, and thus
reserve an event ID.

core_heartbeat (u64 timestamp)

Same thing if we want to export the table that maps:

  event name <-> event ID <-> event typing (includes event size info)

That table can be presented into the buffers (possible a single metadata
buffer) in the form of two static event IDs :

core_id_name (u16 id, const char *name)
core_id_type (u16 id, const char *type)

Here another assumption for portability is to declare event type as a
possibly extended format string. Alternative suggestions are welcome.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Saturday, September 20, 2008 - 1:26 am

ftrace has two outputs. One is ns from start of boot, the other is ns from 
start of trace. Not sure we need to make wall time in there, unless we add 
it in the trace header (as you stated). Probably best for Version #2.

-- Steve

--

From: Mathieu Desnoyers
Date: Saturday, September 20, 2008 - 4:44 am

For simplicity and efficiency, I would try to keep the timestamp
recorded for every event as close as what we are reading from the
hardware without manipulation. We can save the timestamp value at the
beginning of the trace (in the buffer header) and we can also save the
scaling value in the same header so we can transform stuff like cycle
counts into ns, ps from boot time or from trace start when we
pretty-print.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Olaf Dabrunz
Date: Friday, September 19, 2008 - 3:28 pm

Hi Martin and everyone,

thanks for writing this up.


ASCII "0" and "1"?

And my e-mail is odabrunz@novell.com or preferably od@suse.de, as in
this mail.

Thanks, :)

-- 
Olaf Dabrunz (od/odabrunz), SUSE Linux Products GmbH, Nürnberg

--

From: Martin Bligh
Date: Friday, September 19, 2008 - 3:09 pm

Oops. s/binary/boolean/ and yes, in ASCII.
--

From: Frank Ch. Eigler
Date: Friday, September 19, 2008 - 4:18 pm

Hi -





How do you imagine record_event() being used from the point of view of
the instrumented module?  Is it to be protected by some sort of test
of the control variable?  Is the little binary event buffer supposed
to be constructed unconditionally?  (On the stack?)


You should compare this to markers and tracepoints.  It sounds to me like
this is not that different from

     trace_mark (event_name, "%*b", length, buf);

where the goofy "%*b" could be some magic to identify the proposed
"everything is a short blob" event record type.

By the way, systemtap supports formatted printing that generates
binary records via directives like "%4b" for 4-byte ints.  I wonder if
that would be a suitable facility for this and/or markers to allow
instrumentation code to *generate* those binary event records.


Do you believe that fans of tracepoints would support a single

Can you spell out this part a little more?  I wonder because at the
tracing miniconf on Wednesday we talked about systemtap's likely need
to *consume* these trace events as they are being generated.

If systemtap can only see them as a binary blob or a rendered ascii
string, they would not be as useful as if the record was decomposable
in kernel.  Perhaps the event-type-registration call can declare the
binary struct, like a perl pack directive ... or a marker (binary)

This sort of control is (or should be) already available for markers.


- FChE
--

From: Steven Rostedt
Date: Saturday, September 20, 2008 - 1:50 am

[ Note, It is very late for me and I probably should be finishing up 
packing for my trip home, but I'm stupid and decided to respond now 
instead ]



The event buffer is allocated when you create the buffer. The tracer will 

This is completely separate from the trace buffer itself. The trace points 
and trace markers simply write whatever they want into the trace buffer.
The tracepoints and trace markers are something to hook points of code to 
do some type of tracing. The trace buffer record_event is how it will do 


The "record_event" would not (I repeat, "not") be in general code. It will 
be used by the different tracers. The markers/tracepoints will be in the 
code that can hooked to do things like profiling, or if you want, record 

What does this mean exactly. I should have asked more details about this
but I was too worried about time constraints (since Linus was speaking 
next to think about it at the time.

That is, when you read a marker, who and when does that data get consumed?

I don't understand the above. I'm also thinking that we are 
miscommunicating a bit here.

Let make make an example with something that I know, ftrace.

We hit the tracepoint in, lets say, scheduler. Previously at 
initialization time, ftrace would have registered with this trace point 
and would have allocated a buffer. When the tracepoint is actually hit, it 
jumps to the function that ftrace registered. Then this function would 
record the event into the buffer using 'record_event'. At a later time, 
the user could read the buffer from the filesystem using either the pretty 
print format or raw binary format.

This code is not replacing tracepoints or markers. When you say you will 
consume at reading, it sounds like you don't even need to use the buffer 
mechanism and the trace points/markers is good enough for you. The 
tracepoints and markers are not something that is being replaced. The 
trace buffers are just something to use that we can record data to that we ...
From: Mathieu Desnoyers
Date: Saturday, September 20, 2008 - 6:37 am

I would propose disable the event source ASAP when disabled, without any
conditional test if possible. This is actually what the Kernel Markers
does with the help of Immediate Values. It should also come with a
finer-grained filtering based on a global "enable" variable and
per-buffer "enable" variables so a tracer can atomically start
collecting all of its event types. We could then require both
global tracing and per-buffer tracing to be enabled to write into the
buffer.

I think I see where Frank is going, and I agree that this is more or
less what I have had in mind : the Markers could be used as a global
event ID registry and could hold the event name, ID, types (format
string) table. Therefore, the markers would simply become the "Write to
buffer" interface, which would associate IDs automatically and keep the
format strings into a table dumped into a metainformation trace buffer
at trace start and whenever IDs are dynamically registered while the

As I said above, the tracepoints are meant to be a in-kernel API which
instruments the kernel code. It leaves the markers, which are meant to
be exposed to userspace anyway, for such record_event use. They would
actually accomplish two things : they would register the event (just
declaring the marker puts the event in a special section, which is our
mapping table) and would also record the event when enabled and

Given that systemtap may need to access the kernel state and the moment
the instrumentation is reached, I think it implies they have to be
called _before_ the data is writter to the buffers. We can thus see
SystemTAP as a very powerful filtering mechanism. SystemTAP could also
choose to directly use the instrumentation available (kprobes,
tracepoints, ...) when it does not need to act as a filter, but more
like a statistic gathering module. So I think we should simply provide a
callback filter registration mechanism in the filtering chain, thus
having a filtering pseudo-code looking like :

if ...
From: Steven Rostedt
Date: Saturday, September 20, 2008 - 6:51 am

[ Note, I'm now at the airport sipping on coffee, and I had 2 hours

No, absolutely not!

Sorry, I don't want to touch markers. I'm fine with tracepoints, but
there should be no need to use a damn marker if I want to use the tracer.
I shouldn't need to even touch tracepoints to use the trace buffer.
That is making things too complicated again. The tracepoints and markers 
should allow you to hook into the buffers. They are separate. I can 
imagine using tracepoints without needing buffers and I can see using the 
buffers without using tracepoints or markers. They are separate things. Do 
not bind the use of the buffers around markers.


Markers are great for you and for many others, but this is about the 
tracing mechanism and one should not be forced to use markers if they want 
to do a trace.

--

From: Steven Rostedt
Date: Saturday, September 20, 2008 - 7:54 am

Mathieu,

Think about the function tracer itself. It gets called at every funtion, 
where I record the interrupts enabled state, task pid, preempt state, 
function addr, and parent function addr. (that's just off the top of my 
head, I may even record more).

What I don't want is a:

function_call(unsigned long func, unsigned long parent)
{
	struct ftrace_event event;

	event.pid = current->pid;
	event.pc = preempt_count();
	event.irq = local_irq_flags();
	event.func = func;
	event.parent = parent;

	trace_mark(func_event_id, "%p",
		sizeof(event), &event);
}


and then to turn on function tracing, I need to hook into this marker. I'd 
rather just push the data right into the buffer here without having to 
make another function call to hook into this.

I'd rather have instead a simple:

	struct ftrace_event *event;

	event = ring_buffer_reserve(func_event_id,
				sizeof(*event));

	event->pid = current->pid;
	event->pc = preempt_count();
	event->irq = local_irq_flags();
	event->func = func;
	event->parent = parent;

	ring_buffer_commit(event);


-- Steve

--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 11:45 am

Hi Steven,

As I expressed above, this is merely one way I propose data could be
exported to user-space. If you have other simpler design ideas in mind,
I look forward to hear them so we can discuss the technical difficulties
associated with that kind of exercice : sending binary data across the
kernel-userspace boundary.


The scheme you propose here is based on a few inherent assumptions :

- You assume ring_buffer_reserve() and ring_buffer_commit() are static
  inline and thus does not turn into function calls.
- You assume these are small enough so they can be inlined without
  causing L1 insn cache trashing when tracing is activated.
- You therefore assume they use a locking scheme that lets them be
  really really compact (e.g. interrupt disable and spin lock).
- You assume that the performance impact of doing a function call is
  bigger than the impact of locking, which is false by at least a factor
  10.

Interrupt disable and spin locks are _really_ slow. So I think putting
the function call concern up front here is really a matter of premature
optimization gone wrong.

I've got burned in the past history of LTTng. The first versions has a
code generator which created specialized code to serialize the
information into the buffers, exactly like you propose to do. But the
overall impact on kernel code size ended up being too big because we
have to repeat all the code to deal with the buffers for every different
type.

However, I think there might be a way to satisfy us both. An information
source like dynamic function trace happen to fit in a particular
use-case where one single execution site is used to format the data
received as parameter for a _lot_ of instrumented sites, and the type
and event names happen to be the same everywhere. This would therefore
benefit widely of having the capability to write directly into the
buffers.

The thing is that I would like ftrace to expose the types it expects to
write into the trace buffers so a generic trace buffer ...
From: Steven Rostedt
Date: Monday, September 22, 2008 - 2:39 pm

I don't assume anything. I will have the requirement that reserve and 
commit must be paired, and for the first version, hold locks.

Maybe I should rename it to: ring_buffer_lock_reserve and 
ring_buffer_unlock_commit. To show this.


YUCK YUCK YUCK!!!!

Mathieu,

Do I have to bring up the argument of simplicity again? I will never use
such an API.  Mine was very simple, I have to spend 10 minutes trying to

These are not special cases, these are what I use often. They are not 
special for me.

-- Steve

--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 8:27 pm

By saying you don't want to do any function call, the only technical
reason I see for you wanting that is performance, and thus you would
assume the above. If not, why don't you want to make another function

I was actually waiting for you to propose an alternative, but I fear you
already did without me noticing :)

How do you deal with exporting data across kernel/user boundary in your
proposal exactly ? How does this work on architecture with 64-bits
kernel and 32-bits userland... ? A simple C structure copy might be
simple to _code_, but hellish to export to userspace and lead to hard to
debug binary incompatibilities (different gcc flags, 32/64 bits
user/kernel). And this is without telling about the non-portability of
the exported data.

If gcc/icc-knowledgeful people can reassure me by certifying it won't
generate a mess, fine, but until then, I stay very doubtful about
solutions involving to imply binary compability between kernel and
userland.

And common.. 10 minutes to understand the above code. Your _are_ kidding
me right ? Would that help if I create a small 4 lineish wrapper around
the buffer write ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Peter Zijlstra
Date: Friday, September 19, 2008 - 5:07 pm

Oddly whitespace damaged mail..


I'd hoped for an interface like:

struct ringbuffer *ringbuffer_alloc(const char *name, size_t size);
void ringbuffer_free(struct ringbuffer *buffer);
int ringbuffer_write(struct ringbuffer *buffer, const char *buf, size_t size);
int ringbuffer_read(struct ringbuffer *buffer, int cpu, char *buf, size_t size);

On top of which you'd do the event thing, the register event with a
callback idea makes sense, except I'd split the consumption into two:
 - one method to pull the binary event out, which knows how long it
ought to be etc..
 - one method to convert the binary event to ASCII

You don't always need the latter one, esp if you're dumping to disk.

You can also generalize the merge sorting forward iterator when you have
that, by providing an event compare function.

By doing it like this folks can focus on utterly optimizing the
ringbuffer to death, and other folks can toy around with doing fancy
event encodings (/me pitches asn.1-der encoded structured data and runs
like crazy)



--

From: K.Prasad
Date: Monday, September 22, 2008 - 7:07 am

In conjunction with the previous email on this thread
(http://lkml.org/lkml/2008/9/22/160), may I suggest
the equivalent interfaces in -mm tree (2.6.27-rc5-mm1) to be:

relay_printk(<some struct with default filenames/pathnames>, <string>,
...) ;
relay_dump(<some struct with default filenames/pathnames>, <binary
data>);
and
relay_cleanup_all(<the struct name>); - Single interface that cleans up
all files/directories/output data created under a logical entity.

Thanks,
K.Prasad

--

From: Peter Zijlstra
Date: Monday, September 22, 2008 - 7:45 am

Dude, relayfs is such a bad performing mess that extending it seems like
a bad idea. Better to write something new and delete everything relayfs
related.

Also, it seems prudent to separate the ring-buffer implementation from
the event encoding/decoding facilities.



--

From: Martin Bligh
Date: Monday, September 22, 2008 - 9:29 am

There did seem to be pretty universal agreement that we'd rather not

Right - in conversation I had with Mathieu later, he suggested cleaning up
relayfs - I fear this will delay us far too long, and get bogged down.
If we can get one clean circular buffer implementation, then both
relayfs and the tracing could share that common solution,
--

From: Peter Zijlstra
Date: Monday, September 22, 2008 - 9:36 am

Currently only blktrace and kvmtrace use relayfs, and I've heard people
talk about converting both to use lttng/ftrace infrastructure. At which
point relayfs is orphaned and ready for removal.

--

From: Masami Hiramatsu
Date: Monday, September 22, 2008 - 1:50 pm

Hi Peter,
Systemtap is still a heavy user of relayfs. :-)
Anyway, if new buffering mechanism is enough for us, I think
we're happy to move on it.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 8:05 pm

LTTng sits on top of relay for buffer allocation and for the mmap
operation (that's about it, it overrides the rest).

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 7:49 pm

LTTng only uses relay for buffer mapping and mmap to userspace. The rest
of internal buffer management is done within LTTng by overriding relay
callbacks. One thing we could think of is to incrementally fix relay

Sure, but still I think both are needed, even if they are separated as
two different layers (as they should).


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Tom Zanussi
Date: Monday, September 22, 2008 - 10:25 pm

Hmm, I haven't seen complaints lately about about relayfs being 'bad
performing'.  The write/reserve functions are pretty fast - they
don't do much else in the fast path other than update an index, but
if they're still too slow, please let me know how to make them faster.
In any case, I'll post a couple patches in a few minutes that give
complete control over the write path for anyone who doesn't want to be
hampered by the existing versions.

As for the interface, yeah, it has gathered some some cruft over time
and has turned out to be too complex for most people.  The reason a lot
of that complexity is there in the first place though, ironically, is
that it was put there in explicit support of the requirements of
LTT/LTTng (sub-buffers, padding, mmap, etc), which supposedly
represented the needs of all 'industrial-strength' tracers at the time.
Well, four years after the 'troll merge' that initially got relayfs
streamlined and into the kernel, in anticipation of a soon-to-follow
streamlined LTT/LTTng which has yet to emerge, apparently those
requirements are no longer valid and neither LTTng nor anything else
needs the capabilities of relayfs.  That's fine, if it isn't needed, it
isn't needed.  But since it no longer has to conform to the requirements
of any imaginary tracer, maybe it should be put through yet another
streamlining effort and everything that's not required to support
current users removed:

- get rid of anything having to do with padding, nobody needs it and its
only affect has been to horribly distort and complicate a lot of the
code
- get rid of sub-buffers, they just cause confusion
- get rid of mmap, nobody uses it
- no sub-buffers and no mmap support means we can get rid of most of the
callbacks, and a lot of API confusion along with them
- add relay flags - they probably should have been used from the
beginning and options made explicit instead of being shoehorned into the
callback functions.

Going even further, why not just replace the current ...
From: Peter Zijlstra
Date: Tuesday, September 23, 2008 - 2:31 am

- get rid of the vmap buffers as they cause tlb pressure and eat up
precious vspace on 32 bit platforms.

--

From: Mathieu Desnoyers
Date: Tuesday, September 23, 2008 - 11:13 am

Although I agree on the basic idea, namely to use a sane amount of TLB
entries for tracing, I disagree on the way proposed to reach this goal.
Such memory management concerns belong to the mm field and should not be
done "oh so cleverly" by a buffer management infrastructure in the back
of the kernel memory management infrastructure.

I think we should instead try to figure out what is currently missing in
the kernel vmap mechanism (probably the ability to vmap from large 4MB
pages after boot), and fix _that_ instead (if possible), which would not
only benefit to tracing, but also to module support.

Also, I would like to keep a contiguous address mapping within buffers
so we could keep the buffer read/write code as simple as possible,
leveraging the existing CPU MM unit.

I added Christoph Lameter to the CC list, he always comes with clever
ideas. :)

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Christoph Lameter
Date: Tuesday, September 23, 2008 - 11:33 am

With some custom code one can vmap 2MB pages on x86. See the VMEMMAP support
in the x86 arch. The code in mm/sparse-vmemmap.c could be abstracted for a
general 2MB mapping API to reduce TLB pressure for the buffers. If there are
concerns about fragmentation then one could fallback to 4kb TLBs. See the

Oh mostly we are just recycling the old ideas.
--

From: Linus Torvalds
Date: Tuesday, September 23, 2008 - 11:56 am

No. Don't go there. Piece of absolute shit.

The problem with VMAP is that it's _limited_. We don't have reasonable 
virtual address space holes for x86-32.

The other is that physically contiguos buffers are hard to come by. 
Certainly not an acceptable solution.

The third is that if you have multiple buffers, you need to look them up 
in software anyway, so the whole notion of mis-using the TLB to avoid a 
software lookup is TOTAL CRAP.

Don't do virtual mapping. IT IS BROKEN. IT IS A TOTAL AND UTTER PIECE OF 
SHIT.

I will absolutely not take any general-purpse tracing code if I'm aware of 
it mis-using the TLB to play games.

		Linus
--

From: Mathieu Desnoyers
Date: Tuesday, September 23, 2008 - 6:50 am

LTTng uses relay mmap. That's about the only feature of relay it uses
along with memory allocation.  It however implements its own buffer
management mechanism with poll() and ioctl GET_SUBBUF/PUT_SUBBUF to read
subbuffers. But these ops are all within LTTng.

BTW it would be good to change relay so it can take a buffer pointer as
input for relay_open. That would help getting memory mapped in the
linear mapping to be used for tracing when known at boot time.


Sounds interesting. So then vmsplice would be used to support sending

I am not sure of that. I think there is some room for relay improvements
we could work on. As for the mechanism used to insure data coherency, I
think relay does not provide any. Could it be changed to an interrupt
disable+spinlock ? Then, in a second phase, we can optimize it by using

I did the same with LTTV :) Writing userspace tools, including GUIs and
everything, can be quite a big task.

I would be good to keep in mind that a layered infrastructure would be
good. A bit like network packet encapsulation, we could have :

Layer 2 : Event payload (dealt by a unified event encoding
                         infrastructure)
          - Structure defined by event ID/type mapping table

Layer 1 : Events (dealt by a unified buffer layout infrastructure)
          - Event header
            - Timestamp
            - Event ID
            - Event size

Layer 0 : Buffers (dealt by a unified buffering infrastructure)
          - Buffer header
          - Subbuffers


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Martin Bligh
Date: Tuesday, September 23, 2008 - 7:00 am

Actually, I think if you did all that, it'd be pretty close to what we
want anyway ...
--

From: K.Prasad
Date: Tuesday, September 23, 2008 - 10:55 am

In the perspective of having a layered infrastructure, can we consider
the interfaces later added over relay (to be used as a wrapper), namely
relay_printk() and relay_dump()?

Also add the following features to it and we get close to the
functionality that is sought:

- Add callbacks to append fine-granular timestamp information depending
upon user's requirement
- Ability to provision more custom-defined control files that can suit
independent tracer's requirements

These interfaces already come along with the following features (some
repetition here from my previous email for the sake of completeness):

- Very minimal work required to log data using the interfaces. Usage is
made simple to resemble the printk(). Like

	struct relay_printk_data *tpk;
	tpk->parent_dir = "PARENT";
	tpk->dir = "DIR";
	relay_printk(tpk, <String to be output>);
	relay_dump(tpk, <Some binary data to output>); 

Output at:
	<debugfs_mount>/PARENT/DIR/<TRACE FILES>

- Assumes default values for most tunables, such as per-CPU buffer size,
relay-flags (such as global vs local per-cpu buffers, flight recorder vs
overwrite mode), thus reducing the work required for setting up these
interfaces. They can be over-written in case of advanced needs.

- Well defined control operations to start, stop tracing operations, status
files to indicate buffer overflow, etc.

- Given the recent patches that Tom Zanussi has sent to bring in the
erstwhile 'trace' functionality into relay itself, there can be a lot of
code-reduction in relay.c (in -mm) thereby leading to a light-weight
implementation.

Thanks,
K.Prasad

--

From: Martin Bligh
Date: Tuesday, September 23, 2008 - 11:27 am

Might well work, but let's see what relayfs comes out looking like.

You really don't want to store strings in the buffer, it's horribly
inefficient. I think the intent was to store binary data from tagged
events, along with the format strings, and do all the expansion later.
--

From: Tom Zanussi
Date: Tuesday, September 23, 2008 - 8:50 pm

OK, then, I'll continue with the cleanup patchset and see where it
goes...

Tom

--

From: K.Prasad
Date: Tuesday, September 23, 2008 - 10:42 pm

Hi Tom,
	Kindly let us know if the patches are available in some
downloadable location or have been maintained in a git tree.

I'm planning to re-base the relay_* interfaces (erstwhile 'trace' code),
to work on top of your patches.

Thanks,
K.Prasad
 
--

From: Tom Zanussi
Date: Wednesday, September 24, 2008 - 11:07 pm

Here's the current relay cleanup patchset.  The first two patches make
the write path completely replaceable, the third adds flags along with
some related cleanup, and the next 5 remove the padding in several
stages. 

It's a work in progress, but because I wanted the intermediate stages to
actually work and not break anything, some of these patches, especially
05, are just temporary and will be removed in the next iteration.

I didn't have time to clean up the first 3 either - I'll also do that
the next time around.

Anyway, removing the padding has simplified the read/splice code
significantly; it's always been a source of headaches so I'm glad it's
gone, and it doesn't seem to have broken anything - a quick test using
blktrace in both read() and splice() modes didn't show any problems.

In the next round I plan to do vmap and sub-buffer removal.

Tom




--

From: Tom Zanussi
Date: Wednesday, September 24, 2008 - 11:07 pm

Clean up relay_switch_subbuf() and make waking up consumers optional.

Over time, relay_switch_subbuf() has accumulated some cruft - this
patch cleans it up and at the same time makes available some of it
available as common functions that any subbuf-switch implementor would
need (this is partially in preparation for the next patch, which makes
the subbuf-switch function completely replaceable).  It also removes
the hard-coded reader wakeup and moves it into a replaceable callback
called notify_consumers(); this allows any given tracer to implement
consumer notification as it sees fit.
---
 include/linux/relay.h |   51 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/relay.c        |   43 +++++++++++++++++++++++------------------
 2 files changed, 75 insertions(+), 19 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 953fc05..2242004 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -159,6 +159,15 @@ struct rchan_callbacks
 	 * The callback should return 0 if successful, negative if not.
 	 */
 	int (*remove_buf_file)(struct dentry *dentry);
+
+	/*
+	 * wakeup_readers - sub-buffer was switched, let readers know
+	 * @buf: the channel buffer
+	 *
+	 * Called during sub-buffer switch.  Users who don't want any
+	 * wakeups should implement an empty version.
+	 */
+        void (*wakeup_readers)(struct rchan_buf *buf);
 };
 
 /*
@@ -186,6 +195,48 @@ extern size_t relay_switch_subbuf(struct rchan_buf *buf,
 				  size_t length);
 
 /**
+ *	relay_event_toobig - is event too big to fit in a sub-buffer?
+ *	@buf: relay channel buffer
+ *	@length: length of event
+ *
+ *	Returns 1 if too big, 0 otherwise.
+ *
+ *	switch_subbuf() helper function
+ */
+static inline int relay_event_toobig(struct rchan_buf *buf, size_t length)
+{
+	return length > buf->chan->subbuf_size;
+}
+
+/**
+ *	relay_update_filesize - add to filesize of relay file
+ *	@buf: relay channel buffer
+ *	@length: length to add
+ *
+ ...
From: Tom Zanussi
Date: Wednesday, September 24, 2008 - 11:07 pm

Make the relay sub-buffer switch code replaceable.

With this patch, tracers now have complete control over the relay
write (or reserve) path if they choose to do so, by implementing their
own version of the sub-buffer switch function (switch_subbuf()), in
addition to their own local write/reserve functions.  Tracers who
choose not to do so automatically default to the normal behavior.
---
 include/linux/relay.h |   22 +++++++++++++++++-----
 kernel/relay.c        |   13 ++++++++-----
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 2242004..a1dcfc1 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -168,6 +168,18 @@ struct rchan_callbacks
 	 * wakeups should implement an empty version.
 	 */
         void (*wakeup_readers)(struct rchan_buf *buf);
+
+	/*
+	 * switch_subbuf - sub-buffer switch callback
+	 * @buf: the channel buffer
+	 * @length: size of current event
+	 *
+	 * Returns either the length passed in or 0 if full.
+	 *
+	 * Performs sub-buffer-switch tasks such as updating filesize,
+	 * waking up readers, etc.
+	 */
+	size_t (*switch_subbuf)(struct rchan_buf *buf, size_t length);
 };
 
 /*
@@ -191,8 +203,8 @@ extern void relay_subbufs_consumed(struct rchan *chan,
 extern void relay_reset(struct rchan *chan);
 extern int relay_buf_full(struct rchan_buf *buf);
 
-extern size_t relay_switch_subbuf(struct rchan_buf *buf,
-				  size_t length);
+extern size_t switch_subbuf_default_callback(struct rchan_buf *buf,
+					     size_t length);
 
 /**
  *	relay_event_toobig - is event too big to fit in a sub-buffer?
@@ -259,7 +271,7 @@ static inline void relay_write(struct rchan *chan,
 	local_irq_save(flags);
 	buf = chan->buf[smp_processor_id()];
 	if (unlikely(buf->offset + length > chan->subbuf_size))
-		length = relay_switch_subbuf(buf, length);
+		length = chan->cb->switch_subbuf(buf, length);
 	memcpy(buf->data + buf->offset, data, length);
 	buf->offset += length;
 ...
From: Tom Zanussi
Date: Wednesday, September 24, 2008 - 11:07 pm

relay should probably have had a flags param from the beginning; it
wasn't originally added because it wasn't originally needed - it
probably would have helped avoid some of the callback contortions
that were added due to a lack of flags.  This adds them and does a
small amount of low-hanging cleanup, and is also in preparation for
some new flags in future patches.
---
 block/blktrace.c      |    5 ++---
 include/linux/relay.h |   19 ++++++++++---------
 kernel/relay.c        |   20 ++++++++++----------
 virt/kvm/kvm_trace.c  |    9 ++++-----
 4 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/block/blktrace.c b/block/blktrace.c
index eb9651c..150c5f7 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry)
 static struct dentry *blk_create_buf_file_callback(const char *filename,
 						   struct dentry *parent,
 						   int mode,
-						   struct rchan_buf *buf,
-						   int *is_global)
+						   struct rchan_buf *buf)
 {
 	return debugfs_create_file(filename, mode, parent, buf,
 					&relay_file_operations);
@@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 		goto err;
 
 	bt->rchan = relay_open("trace", dir, buts->buf_size,
-				buts->buf_nr, &blk_relay_callbacks, bt);
+			       buts->buf_nr, &blk_relay_callbacks, bt, 0UL);
 	if (!bt->rchan)
 		goto err;
 
diff --git a/include/linux/relay.h b/include/linux/relay.h
index a1dcfc1..648b4da 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -28,6 +28,12 @@
 #define RELAYFS_CHANNEL_VERSION		7
 
 /*
+ * relay channel flags
+ */
+#define RCHAN_MODE_OVERWRITE		0x00000001	/* 'flight' mode */
+#define RCHAN_GLOBAL_BUFFER		0x00000002	/* not using per-cpu */
+
+/*
  * Per-cpu relay channel buffer
  */
 struct rchan_buf
@@ -66,11 +72,11 @@ struct rchan
 	void *private_data;		/* for user-defined data */
 	size_t last_toobig;		/* tried to log event > subbuf size */
 ...
From: Tom Zanussi
Date: Wednesday, September 24, 2008 - 11:07 pm

Add reserved param to switch-subbuf, in preparation for non-pad write/reserve.

Because a write/reserve can now cross sub-buffer boundaries, we use
the length returned as a remainder for the new sub-buffer, and use the
reserved param to return a pointer to the reserved space, or NULL if
it couldn't be reserved.  This patch also changes write/reserve to
preserve their current behavior despite that change.  This all goes
away in a future patch, but is here now so things don't break.
---
 include/linux/relay.h |   24 ++++++++++++++++--------
 kernel/relay.c        |   12 +++++++++---
 2 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 648b4da..13163b0 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -173,13 +173,16 @@ struct rchan_callbacks
 	 * switch_subbuf - sub-buffer switch callback
 	 * @buf: the channel buffer
 	 * @length: size of current event
+	 * @reserved: a pointer to the space reserved
 	 *
 	 * Returns either the length passed in or 0 if full.
 	 *
 	 * Performs sub-buffer-switch tasks such as updating filesize,
 	 * waking up readers, etc.
 	 */
-	size_t (*switch_subbuf)(struct rchan_buf *buf, size_t length);
+	size_t (*switch_subbuf)(struct rchan_buf *buf,
+				size_t length,
+				void **reserved);
 };
 
 /*
@@ -205,7 +208,8 @@ extern void relay_reset(struct rchan *chan);
 extern int relay_buf_full(struct rchan_buf *buf);
 
 extern size_t switch_subbuf_default_callback(struct rchan_buf *buf,
-					     size_t length);
+					     size_t length,
+					     void **reserved);
 
 /**
  *	relay_event_toobig - is event too big to fit in a sub-buffer?
@@ -268,12 +272,14 @@ static inline void relay_write(struct rchan *chan,
 {
 	unsigned long flags;
 	struct rchan_buf *buf;
+	void *reserved;
 
 	local_irq_save(flags);
 	buf = chan->buf[smp_processor_id()];
+	reserved = buf->data + buf->offset;
 	if (unlikely(buf->offset + length > chan->subbuf_size))
-		length = ...
From: Tom Zanussi
Date: Wednesday, September 24, 2008 - 11:07 pm

Map the first sub-buffer at the end of the buffer, for temporary convenience.

Make relay buffers 'circular' for writing by mapping the first subbuf
at end of last subbuf.  This is so we can do writes across last subbuf
boundary without adding special write logic.  This is a temporary
state of affairs and it all goes away in a future patch, but it's here
now so things will still work.
---
 kernel/relay.c |   26 +++++++++++++++-----------
 1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/kernel/relay.c b/kernel/relay.c
index 9ea9240..9a08fec 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -125,20 +125,20 @@ static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
 /**
  *	relay_alloc_buf - allocate a channel buffer
  *	@buf: the buffer struct
- *	@size: total size of the buffer
  *
  *	Returns a pointer to the resulting buffer, %NULL if unsuccessful. The
  *	passed in size will get page aligned, if it isn't already.
  */
-static void *relay_alloc_buf(struct rchan_buf *buf, size_t *size)
+static void *relay_alloc_buf(struct rchan_buf *buf)
 {
 	void *mem;
-	unsigned int i, j, n_pages;
+	unsigned int i, j, n_pages, n_subbuf_pages;
 
-	*size = PAGE_ALIGN(*size);
-	n_pages = *size >> PAGE_SHIFT;
+	buf->chan->alloc_size = PAGE_ALIGN(buf->chan->alloc_size);
+	n_pages = buf->chan->alloc_size >> PAGE_SHIFT;
+	n_subbuf_pages = PAGE_ALIGN(buf->chan->subbuf_size) >> PAGE_SHIFT;
 
-	buf->page_array = relay_alloc_page_array(n_pages);
+	buf->page_array = relay_alloc_page_array(n_pages + n_subbuf_pages);
 	if (!buf->page_array)
 		return NULL;
 
@@ -148,11 +148,14 @@ static void *relay_alloc_buf(struct rchan_buf *buf, size_t *size)
 			goto depopulate;
 		set_page_private(buf->page_array[i], (unsigned long)buf);
 	}
-	mem = vmap(buf->page_array, n_pages, VM_MAP, PAGE_KERNEL);
+	for (i = 0; i < n_subbuf_pages; i++)
+		buf->page_array[n_pages + i] = buf->page_array[i];
+	mem = vmap(buf->page_array, n_pages + n_subbuf_pages, VM_MAP,
+		   ...
From: Tom Zanussi
Date: Wednesday, September 24, 2008 - 11:07 pm

Replace relay_reserve/relay_write with non-padded versions.

The old versions of relay_reserve/relay_write would write/reserve an
event only if the whole thing could fit in the remaining space of the
current sub-buffer; if it couldn't it would add padding to the current
sub-buffer and reserve in the next.  The new versions don't add
padding but use up all the space in a sub-buffer and write the
remainder in the next sub-buffer.  They won't however write a partial
event - if there's not enough space for the event in the current
sub-buffer and the next sub-buffer isn't free, the whole reserve/write
will fail.
---
 include/linux/relay.h |   41 +++++++++++++++++++----------
 kernel/relay.c        |   69 +++++++++++++++++++++++++++----------------------
 2 files changed, 65 insertions(+), 45 deletions(-)

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 13163b0..c42b2d3 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -207,9 +207,9 @@ extern void relay_subbufs_consumed(struct rchan *chan,
 extern void relay_reset(struct rchan *chan);
 extern int relay_buf_full(struct rchan_buf *buf);
 
-extern size_t switch_subbuf_default_callback(struct rchan_buf *buf,
-					     size_t length,
-					     void **reserved);
+extern size_t relay_switch_subbuf_default_callback(struct rchan_buf *buf,
+						   size_t length,
+						   void **reserved);
 
 /**
  *	relay_event_toobig - is event too big to fit in a sub-buffer?
@@ -270,17 +270,23 @@ static inline void relay_write(struct rchan *chan,
 			       const void *data,
 			       size_t length)
 {
-	unsigned long flags;
+	size_t remainder = length;
 	struct rchan_buf *buf;
+	unsigned long flags;
 	void *reserved;
 
 	local_irq_save(flags);
 	buf = chan->buf[smp_processor_id()];
 	reserved = buf->data + buf->offset;
-	if (unlikely(buf->offset + length > chan->subbuf_size))
-		length = chan->cb->switch_subbuf(buf, length, &reserved);
+	if (unlikely(buf->offset + length > buf->chan->subbuf_size)) ...
From: Tom Zanussi
Date: Wednesday, September 24, 2008 - 11:07 pm

Remove padding-related code from relay_read()/relay_splice_read() et al.

Because we no longer write padding, we no longer have to read it or
account for it anywhere else, greatly simplifying the related code.
---
 kernel/relay.c |  149 ++++++++------------------------------------------------
 1 files changed, 20 insertions(+), 129 deletions(-)

diff --git a/kernel/relay.c b/kernel/relay.c
index 15e4de2..21b3e19 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -966,72 +966,13 @@ static void relay_file_read_consume(struct rchan_buf *buf,
 				    size_t bytes_consumed)
 {
 	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
-	size_t read_subbuf;
-
-	if (buf->subbufs_produced == buf->subbufs_consumed &&
-	    buf->offset == buf->bytes_consumed)
-		return;
-
-	if (buf->bytes_consumed + bytes_consumed > subbuf_size) {
-		relay_subbufs_consumed(buf->chan, buf->cpu, 1);
-		buf->bytes_consumed = 0;
-	}
 
 	buf->bytes_consumed += bytes_consumed;
-	if (!read_pos)
-		read_subbuf = buf->subbufs_consumed % n_subbufs;
-	else
-		read_subbuf = read_pos / buf->chan->subbuf_size;
-	if (buf->bytes_consumed + buf->padding[read_subbuf] == subbuf_size) {
-		if ((read_subbuf == buf->subbufs_produced % n_subbufs) &&
-		    (buf->offset == subbuf_size))
-			return;
-		relay_subbufs_consumed(buf->chan, buf->cpu, 1);
-		buf->bytes_consumed = 0;
-	}
-}
 
-/*
- *	relay_file_read_avail - boolean, are there unconsumed bytes available?
- */
-static int relay_file_read_avail(struct rchan_buf *buf, size_t read_pos)
-{
-	size_t subbuf_size = buf->chan->subbuf_size;
-	size_t n_subbufs = buf->chan->n_subbufs;
-	size_t produced = buf->subbufs_produced;
-	size_t consumed = buf->subbufs_consumed;
-
-	relay_file_read_consume(buf, read_pos, 0);
-
-	consumed = buf->subbufs_consumed;
-
-	if (unlikely(buf->offset > subbuf_size)) {
-		if (produced == consumed)
-			return 0;
-		return 1;
-	}
-
-	if (unlikely(produced - consumed >= n_subbufs)) ...
From: Tom Zanussi
Date: Wednesday, September 24, 2008 - 11:08 pm

Clean up remaining padding-related junk.

Removes the rest of the padding-related junk.  Also simplifies the
subbuf_start callback a bit.
---
 block/blktrace.c      |    5 +++--
 include/linux/relay.h |   12 ++----------
 kernel/relay.c        |   19 ++++---------------
 virt/kvm/kvm_trace.c  |    7 ++++---
 4 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/block/blktrace.c b/block/blktrace.c
index 150c5f7..271b7b7 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -334,8 +334,9 @@ static const struct file_operations blk_msg_fops = {
  * Keep track of how many times we encountered a full subbuffer, to aid
  * the user space app in telling how many lost events there were.
  */
-static int blk_subbuf_start_callback(struct rchan_buf *buf, void *subbuf,
-				     void *prev_subbuf, size_t prev_padding)
+static int blk_subbuf_start_callback(struct rchan_buf *buf,
+				     void *subbuf,
+				     int first_subbuf)
 {
 	struct blk_trace *bt;
 
diff --git a/include/linux/relay.h b/include/linux/relay.h
index c42b2d3..85f43a0 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -51,8 +51,6 @@ struct rchan_buf
 	struct page **page_array;	/* array of current buffer pages */
 	unsigned int page_count;	/* number of current buffer pages */
 	unsigned int finalized;		/* buffer has been finalized */
-	size_t *padding;		/* padding counts per sub-buffer */
-	size_t prev_padding;		/* temporary variable */
 	size_t bytes_consumed;		/* bytes consumed in cur read subbuf */
 	size_t early_bytes;		/* bytes consumed before VFS inited */
 	unsigned int cpu;		/* this buf's cpu */
@@ -88,23 +86,17 @@ struct rchan_callbacks
 	 * subbuf_start - called on buffer-switch to a new sub-buffer
 	 * @buf: the channel buffer containing the new sub-buffer
 	 * @subbuf: the start of the new sub-buffer
-	 * @prev_subbuf: the start of the previous sub-buffer
-	 * @prev_padding: unused space at the end of previous sub-buffer
+	 * @first_subbuf: boolean, is this the first ...
From: Tom Zanussi
Date: Monday, September 22, 2008 - 10:27 pm

Clean up relay_switch_subbuf() and make waking up consumers optional.
    
Over time, relay_switch_subbuf() has accumulated some cruft - this
patch cleans it up and at the same time makes available some of it
available as common functions that any subbuf-switch implementor would
need (this is partially in preparation for the next patch, which makes
the subbuf-switch function completely replaceable).  It also removes
the hard-coded reader wakeup and moves it into a replaceable callback
called notify_consumers(); this allows any given tracer to implement
consumer notification as it sees fit.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 953fc05..17f0515 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -159,6 +159,15 @@ struct rchan_callbacks
 	 * The callback should return 0 if successful, negative if not.
 	 */
 	int (*remove_buf_file)(struct dentry *dentry);
+
+	/*
+	 * notify_consumers - new sub-buffer available, let consumers know
+	 * @buf: the channel buffer
+	 *
+	 * Called during sub-buffer switch.  Applications which don't
+	 * want to notify anyone should implement an empty version.
+	 */
+        void (*notify_consumers)(struct rchan_buf *buf);
 };
 
 /*
@@ -186,6 +195,48 @@ extern size_t relay_switch_subbuf(struct rchan_buf *buf,
 				  size_t length);
 
 /**
+ *	relay_event_toobig - is event too big to fit in a sub-buffer?
+ *	@buf: relay channel buffer
+ *	@length: length of event
+ *
+ *	Returns 1 if too big, 0 otherwise.
+ *
+ *	switch_subbuf() helper function.
+ */
+static inline int relay_event_toobig(struct rchan_buf *buf, size_t length)
+{
+	return length > buf->chan->subbuf_size;
+}
+
+/**
+ *	relay_update_filesize - increase relay file i_size by length
+ *	@buf: relay channel buffer
+ *	@length: length to add
+ *
+ *	switch_subbuf() helper function.
+ */
+static inline void relay_update_filesize(struct rchan_buf *buf, size_t length)
+{
+	if ...
From: Andrew Morton
Date: Tuesday, September 23, 2008 - 1:15 pm

On Tue, 23 Sep 2008 00:27:02 -0500

Does this comment format and placement get properly processed by the

What locking protects the non-atomic modification of the 64-bit i_size

This also needs caller-provided locking.  That's part of the function's

I think you can put the unlikely() into relay_event_toobig() and gcc
--

From: Tom Zanussi
Date: Monday, September 22, 2008 - 10:27 pm

Make the relay sub-buffer switch code replaceable.
    
With this patch, tracers now have complete control over the relay
write (or reserve) path if they choose to do so, by implementing their
own version of the sub-buffer switch function (switch_subbuf()), in
addition to their own local write/reserve functions.  Tracers who
choose not to do so automatically default to the normal behavior.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>

diff --git a/include/linux/relay.h b/include/linux/relay.h
index 17f0515..52e4d61 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -168,6 +168,18 @@ struct rchan_callbacks
 	 * want to notify anyone should implement an empty version.
 	 */
         void (*notify_consumers)(struct rchan_buf *buf);
+
+	/*
+	 * switch_subbuf - sub-buffer switch callback
+	 * @buf: the channel buffer
+	 * @length: size of current event
+	 *
+	 * Returns either the length passed in or 0 if full.
+	 *
+	 * Performs sub-buffer-switch tasks such as updating filesize,
+	 * waking up readers, etc.
+	 */
+	size_t (*switch_subbuf)(struct rchan_buf *buf, size_t length);
 };
 
 /*
@@ -191,8 +203,9 @@ extern void relay_subbufs_consumed(struct rchan *chan,
 extern void relay_reset(struct rchan *chan);
 extern int relay_buf_full(struct rchan_buf *buf);
 
-extern size_t relay_switch_subbuf(struct rchan_buf *buf,
-				  size_t length);
+extern size_t switch_subbuf_default_callback(struct rchan_buf *buf,
+					     size_t length);
+
 
 /**
  *	relay_event_toobig - is event too big to fit in a sub-buffer?
@@ -259,7 +272,7 @@ static inline void relay_write(struct rchan *chan,
 	local_irq_save(flags);
 	buf = chan->buf[smp_processor_id()];
 	if (unlikely(buf->offset + length > chan->subbuf_size))
-		length = relay_switch_subbuf(buf, length);
+		length = chan->cb->switch_subbuf(buf, length);
 	memcpy(buf->data + buf->offset, data, length);
 	buf->offset += length;
 	local_irq_restore(flags);
@@ -285,7 +298,7 @@ static inline void ...
From: Andrew Morton
Date: Tuesday, September 23, 2008 - 1:17 pm

On Tue, 23 Sep 2008 00:27:30 -0500

It would be nice to keep the `relay_' prefix on the exported relay
interface?  Something called `switch_subbuf_default_callback' could
belong to pretty much anywhere in the kernel.


--

From: Tom Zanussi
Date: Monday, September 22, 2008 - 10:27 pm

Add channel flags to relay, remove global callback param.
    
relay should probably have had a flags param from the beginning; it
wasn't originally added because it wasn't originally needed - it
probably would have helped avoid some of the callback contortions
that were added due to a lack of flags.  This adds them and does a
small amount of low-hanging cleanup, and is also in preparation for
some new flags in future patches.

Signed-off-by: Tom Zanussi <tzanussi@gmail.com>

diff --git a/block/blktrace.c b/block/blktrace.c
index eb9651c..150c5f7 100644
--- a/block/blktrace.c
+++ b/block/blktrace.c
@@ -356,8 +356,7 @@ static int blk_remove_buf_file_callback(struct dentry *dentry)
 static struct dentry *blk_create_buf_file_callback(const char *filename,
 						   struct dentry *parent,
 						   int mode,
-						   struct rchan_buf *buf,
-						   int *is_global)
+						   struct rchan_buf *buf)
 {
 	return debugfs_create_file(filename, mode, parent, buf,
 					&relay_file_operations);
@@ -424,7 +423,7 @@ int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 		goto err;
 
 	bt->rchan = relay_open("trace", dir, buts->buf_size,
-				buts->buf_nr, &blk_relay_callbacks, bt);
+			       buts->buf_nr, &blk_relay_callbacks, bt, 0UL);
 	if (!bt->rchan)
 		goto err;
 
diff --git a/include/linux/relay.h b/include/linux/relay.h
index 52e4d61..18fd269 100644
--- a/include/linux/relay.h
+++ b/include/linux/relay.h
@@ -25,7 +25,13 @@
 /*
  * Tracks changes to rchan/rchan_buf structs
  */
-#define RELAYFS_CHANNEL_VERSION		7
+#define RELAYFS_CHANNEL_VERSION		8
+
+/*
+ * relay channel flags
+ */
+#define RCHAN_MODE_OVERWRITE		0x00000001	/* 'flight' mode */
+#define RCHAN_GLOBAL_BUFFER		0x00000002	/* not using per-cpu */
 
 /*
  * Per-cpu relay channel buffer
@@ -66,11 +72,11 @@ struct rchan
 	void *private_data;		/* for user-defined data */
 	size_t last_toobig;		/* tried to log event > subbuf size */
 	struct rchan_buf *buf[NR_CPUS]; /* per-cpu channel ...
From: Andrew Morton
Date: Tuesday, September 23, 2008 - 1:20 pm

On Tue, 23 Sep 2008 00:27:56 -0500

What is the significance of this change?  Does it affect the
kernel<->userspace interface?  Is it back-compatible with existing
	
--

From: Tom Zanussi
Date: Tuesday, September 23, 2008 - 8:57 pm

Thanks for the comments on these patches - I'll include the changes in
the next go-round.


No, nothing to do with the kernel-userspace interface.  The channel
version is included in the channel struct and was meant as an aid in
deciphering channel data in crash dumps.


--

From: Marcel Holtmann
Date: Friday, September 19, 2008 - 5:26 pm

I talked with Thomas and Steven about it during lunch and we might wanna
also use it for sniffing/monitoring/capturing efforts. Inaky and I
talked about unifying things like usbmon, hcidump, tcpdump etc. with a
common interface into the kernel. Currently every subsystem does it
differently. Especially when it comes to add monitor support to SDIO we
ran into the problem that we don't just wanna invent another interface.

Right now I haven't spent much time on this and regret that I didn't sit
together with you guys yesterday, but it was not that high priority on
my list of nice things to have. Anyway, since I am mostly interested in
subsystems that copy a lot of packets around, it would be nice to only
reference to the data. For example for structures like SKBs and URBs and
only have to copy the data if we have a consumer. My thoughts :)

Regards

Marcel


--

From: Steven Rostedt
Date: Saturday, September 20, 2008 - 2:03 am

Martin,

First I like to express my appreciation to you for writing this up. Not 
only that, but being the one person from keeping us from killing each 
other ;-)




I don't remember talking about the "spool to disk" for version 1.
We still want to do this? I thought we would have overwrite mode (flight
data record), and a "throw all new data away when the producer fills the 


I was talking with Thomas about this, and we probably want (and I'm sure 
Mathieu and others would agree), a...

  event_handle = reserve_event(buffer_handle, event_id, length)

as well as a..

  comit_event(event_handle).


Oh, and all commands should start with the namespace.

  ring_buffer_alloc()
  ring_buffer_free()
  ring_buffer_record_event()


Ah, I thought we were going to have:

  /debugfs/tracing/buffers/<name>/<buffer crap>

and each tracer have

  /debugfs/tracing/<name>/<trace command crap>

This way we can easily see all the buffers in one place that are allocated
without having to see a tracer name first.

The reason I like the way I propose, is that a utility that needs to read 
all the buffers, doesn't need to go into directories that don't even have 

I wonder if we should make this another sub dir:



Sounds good,

Thanks!

-- Steve

--

From: Mathieu Desnoyers
Date: Saturday, September 20, 2008 - 6:55 am

Yes, I think the spool to disk mode will be the default mode needed by
a big amount people who want to stream data out continuously. The flight
recorder is needed mostly for event backlog analysis. I think we have to

It's also easy to be sizeof(void *) aligned, as long as we export
sizeof(void *) in the buffer header so we keep portability. But we can
keep that for v2. It's also good to write a magic number in the trace

How about :

  trace_mark(ftrace_evname, "size %lu binary %pW",
    sizeof(mystruct), mystruct);
  or
  trace_mark(sched_wakeup, "target_pid %ld", task->pid);

Note the namespacing with buffers being "ftrace" and "sched" here.

That would encapsulate the whole
  - Event ID registration
  - Event type registration
  - Sending data out
  - Enabling the event source directly at the source

We can then export the markers through a debugfs file and let userland
enable them one by one and possibly connect systemtap filters on them
(one table of registered filters, one table for the markers, a command

We could even rename markers if required, I don't really care. e.g. :
  trace_mark -> ring_buffer_record_event()

people can still do ls debugfs/tracing/*/buffers/. But yes, we did agree
on having the buffers/ subdir outside of the "trace command crap". It
makes the buffers easier to see in the directory tree, and makes it
clear that those buffers can be used by other users than the actual

Sure.

If needed, we could change the markers to take two separate parameters :

trace_mark(tracer_name, event_name, "format", args)


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Arjan van de Ven
Date: Saturday, September 20, 2008 - 7:12 am

On Sat, 20 Sep 2008 09:55:48 -0400

I would like to ask for the following from the start: have a field for
a longer description of the marker that describes it's usage and
context. Getting this there from the start is critical, because only
when adding the marker point do people still really remember why/what
(and having to type a good description also helps them to realize if
this is the right point or not). This can then be exposed to the user
so he has a standing chance of knowing what the marker is about.

It also has a standing chance of being updated when the code changes
this way

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 11:52 am

I agree, and I think it might be required in both markers and
tracepoints.

Given that tracepoints are declared in a global header
(DECLARE_TRACE()), I would add this kind of description here. Tracepoint
uses within the kernel code (statements like :
  trace_sched_switch(prev, next);
added to the scheduler) would therefore be tied to the description
without having to contain it in the core kernel code.

Markers, on the other hand, could become the "event description"
interface which is exported to userspace. Considering that, I guess it's
as important to let a precise description follow the markers.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Jason Baron
Date: Thursday, October 2, 2008 - 8:28 am

hi,

Tracepoints and markers seem to both have their place, with tracepoints
being integral to kernel users, and markers being important for
userspace. However, it seems to me like there is overlap in the
code and an extra level of indirection when markers are layered on
tracespoints. could they be merged a bit more?

What if we extended DEFINE_TRACE() to also create a
'set_marker(marker_cb)' function where 'marker_cb' has the function signature:

marker_cb(<tracepoint prototype>, *marker_probe_func);

We then also create 'register_marker_##name' function in DEFINE_TRACE(),
which allows one to regiser marker callbacks in the usual way.

Then 'marker_cb' function is then called in '__DO_TRACE' if anybody has
registered a marker (which can set the tracepoint.state appropriately).

The 'marker_cb' function then marshalls its arguemnts and passes them
through to the marker functions that were registered.

I think in this way we can simplify the tracepoints and markers by
combining them to a large extent.

thanks,

-Jason




--

From: Mathieu Desnoyers
Date: Friday, October 3, 2008 - 9:11 am

I think what you propose here is already in y LTTng tree in a different
form. It's a patch to markers to allow declaring a marker which enables
an associated tracepoint when enabled. This way, we can have a marker
(exposed to userspace) connecting itself automatically to a tracepoint
when enabled.

It's here :
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commitdiff;h=d...

Note that locking depends on the psrwlock patch so we can have nested
module list readers. Otherwise locking becomes _really_ messy. :-(


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Jason Baron
Date: Friday, October 3, 2008 - 11:37 am

That patch simplifies using markers with tracepoints and couples
markers and tracepoints much more closely. But I was proposing to make
the coupling tighter...

Couldn't 'marker_probe_register()' register the marker directly with
the tracepoint callsite? Have DEFINE_TRACE() take an additional argument
which references a marker callback funtion. That function would look 
like (very loose C code):

marker_blah_callback(TPPROTO(arg1, arg2), marker_probe_func *probe,
private_data)
{
	probe(private_data, "%arg1 %arg2", arg1->a, arg2->b);
}

The 'marker_blah_callback()' would be invoked from within DO_TRACE() for
each marker that has been registered with the associated tracepoint, in
a similar way to how we iterate over the tracepoint callbacks, we can
iterate over the registered markers and pass them to the
'marker_blah_callback()' function. 

By associating the marker_blah_callback() in DEFINE_TRACE(), we only
need to look in one file to understand what is associated with a
particular tracepoint. I think marker.c and tracepoint.c could also be
consolidated at that point.

thanks,

-Jason







--

From: Mathieu Desnoyers
Date: Friday, October 3, 2008 - 12:10 pm

I don't want the tracepoints to be coupled with markers (which are a
userspace API). The other way around is fine : letting a marker
automatically enable a tracepoint makes sense, but the opposite would
tie the in-kernel API (tracepoint) to the external marker
representation, and I would like to avoid that.

And how do you plan to deal with :

TPPROTO(arg1, arg2) == void ?

C won't let you define stuff like :

blah(void, marker_probe_func *probe, void *private_data)

The devil is in the details.... ;)


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Jason Baron
Date: Friday, October 3, 2008 - 12:25 pm

The interface to markers is still marker_probe_register() and
marker_probe_unregister(). I don't see how that changes with this

it'd be simple enough to pass the the noargs requirement down as an
extra argument to DO_TRACE(), and then invoke the callback with no arguments.

thanks,

-Jason


--

From: Mathieu Desnoyers
Date: Friday, October 3, 2008 - 12:56 pm

"Have DEFINE_TRACE() take an additional argument which references a
marker callback funtion." -> it would tie the tracepoint definition to a
marker. Or am I misunderstanding something ?


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Jason Baron
Date: Friday, October 3, 2008 - 1:25 pm

Not sure. Maybe the confusion is that I am really talking about two
callbacks here. First, there is a tracepoint->marker callback which is
the 'marker_blah_callback()' that I mentioned above, and is the one
which is referenced in DEFINE_TRACE(). There is also the marker->userspace 
callback which is registered via something similar to marker_probe_register(),
only it is registered directly with the tracepoint. 

I think this potentially better address's Arjan's concern b/c it ties
the 'tracepoint->marker' callback directly to the tracepoint. And this
'tracepoint->marker' callback function in essense documents the marker
interface for a tracepoint. And this proposal documents the interfaces
(both tracepoints and markers) all in one place. 

If I'm not clear, I can prototype it if you think that would help?

thanks,

-Jason

--

From: Frank Ch. Eigler
Date: Friday, October 3, 2008 - 2:52 pm

Hi -


I'm glad the discussion seems to be slowly turning toward the event
layering issues, but ... "markers being a userspace API"?  That seems
to be grossly misleading terminology.  It's a kernel API like anything
else being discussed here.  What's different about it
(vs. tracepoints) is that it could be one of the first clients of the
unified ring buffer widget that concretely addresses the issue of
encoding abstract traceworthy events into serialized data.

(Of course, merging the lower level aspects of the marker/tracepoint
implementations is all well and good.)

- FChE
--

From: KOSAKI Motohiro
Date: Sunday, September 21, 2008 - 8:09 pm

Sorry, disagreed.

I believe any marker point should be independent to tracer.
Otherwise, number of trace_mark() will blow up easily.



--

From: Peter Zijlstra
Date: Monday, September 22, 2008 - 2:57 am

I really think we should separate the ringbuffer management from the
event stuff.

--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 7:36 pm

Sure, I am strongly in favor of separating those two, given they
represent two different things. However, the requirement I have heard at
KS2008 was to provide 

- Unified buffering mechanism
- Timestamps synchronized across all buffers
- Unified event IDs management, so events from various sources could be
  shared between tools.
- As of my understanding, unified event structure, which can be exported
  to userspace and be shared across different tools.
- Unified buffer control/management mechanism.

These all represent different infrastructure parts, but are all needed
if we want tools to be able to share the data exported through those
buffers.

Relay is a good example of having only a _single_ of these layers in
common : there is currently no way the different relay users can share
the data they collect because they have simply no idea how others
structure their data.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: K.Prasad
Date: Monday, September 22, 2008 - 6:57 am

With due apologies for pitching-in late, I thought I'd bring visibility to
the two new interfaces - namely relay_printk() and relay_dump() - now a part
of -mm tree (since 2.6.27-rc5-mm1) are meant to address such needs;
although not completely in its present form but quite substantially.
(Refer: Documentation/filesystems/relay.txt). As far as re-usability is
concerned, many parts of this interface are directly adopted from
SystemTap's runtime. Blktrace had been made to work using these interfaces
(http://tinyurl.com/4q9d4p) reducing about ~130 lines of code from the
blktrace related files.

With more effort, say additions such as a)ability to specify custom
names for files b)ability to create user-defined control files (in
addition to what comes default) will make it usable along with tracers
such as ftrace (ref:http://tinyurl.com/3ppbwh) (and is something that I
intended to work upon).

While relay_printk() interface brings a high-level abstract interface
over 'relay' by masking all the setup/tear-down details and the ability
to use per-CPU buffers; relay_dump() is its equivalent that performs
binary dumping through debugfs interface (a requirement for the unified
tracing buffer, as I learn from the email). Also the use of default
file-names, debugfs output path results in huge reduction of setup code
required by the end-user along with the ability to override the defaults
if required in a special case. Examples of the resulting code-brevity can
be seen at samples/relay/*.c in 2.6.27-rc5-mm1 tree.

I am quite sure that with minimal changes to infrastructure underlying
beneath these two interfaces, we can meet out most of the requirements
stated above; and am open for suggestions.

Kindly let me know what the community thinks about the same.

Thanks,
--

From: Masami Hiramatsu
Date: Monday, September 22, 2008 - 12:45 pm

Hi Martin,


I agree to integrate tracing buffer mechanism, but I don't think
your proposal is the simplest one.

To simplify, I think the layered buffering mechanism is desirable.
- The lowest layer just provides named circular buffers(both per-cpu and
  uni-buffer in system) and read/write scheme.
- Next layer provides user/kernel interface including controls.
- Top layer defines packet(and event) formatting utilities.
- Additionally, it would better provide some library routines(timestamp,
  event-id synchronize, and so on).

Since this unified buffer is used from different kind of tracers/loggers,
I don't think all of them(and future tracers) want to be tied down by
"event-id"+"parameter" format.
So, Sorry, I disagree about that the tracing buffer defines its *data format*,
it's just overkill for me.

Thank you,


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Martin Bligh
Date: Monday, September 22, 2008 - 1:13 pm

I think you're right that we can layer this, and we didn't make a particularly
good job of splitting those things out. I'll try to pull together
another revision
to reflect this and other suggested changes.

One thing that I think is still important is to have a unified timestamp
mechanism on everything, so we can co-ordinate different things back
together in userspace from different trace tools, so I intend to keep
that at a lower level, but I think you're right that the event id, etc can
move up into separate layers.
--

From: Masami Hiramatsu
Date: Monday, September 22, 2008 - 3:25 pm

Hi Martin,



I'm not so sure that the unified 'timestamp' must be required by all tracers.
If you just need to merge and sort per-cpu data, you can use an atomic
sequential number for it.
IMHO, the unified 'timestamp' would better be an option, because some
architectures can't support it. I think preparing timestamp-callback
function will help us.

By the way, systemtap uses two modes;

- single-channel mode
 In this mode, all cpus share one buffer channel to write and read.
 each writer locks spinlock and write a probe-local data to buffer.

- per-cpu buffer mode
 In this mode, we use an atomic sequential number for ordering data. If
 user doesn't need it(because they have their own timestamps), they can
 choose not to use that seq-number.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Darren Hart
Date: Monday, September 22, 2008 - 4:11 pm

There have been several posts on the timestamp for the events.  From a
real-time perspective, this timestamp will be a very important datapoint for
each event, and the more accurate/higher resolution the better.  Some thoughts.

o pretty print resolution should definitely be nanosecond (IMHO)

o internal storage should be "whatever is fastest" with the transformation to
 ns data stored in the trace header (as I believe Mathieu mentioned).

o for archs where the clock isn't synchronized across CPUs, perhaps for now it
 would be adequate to record the per cpu timestamps in the trace header and
 include the cpu id for each event as well.  This is in keeping with the
 previous suggestion to collect the most primitive data available without
 doing any sort of transformation at trace time.

--
Darren Hart
--

From: Masami Hiramatsu
Date: Monday, September 22, 2008 - 5:04 pm

Hi Darren,


Sure, I know the precise timestamp is required for real-time sensitive
tracers. however, there are some other cases. for example debugging,
we don't need timestamps, but just want to know the order of events. :-)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Martin Bligh
Date: Monday, September 22, 2008 - 4:16 pm

An atomic sequential number is:

(a) far less meaningful than a timestamp for the user
(b) more expensive to compute in many cases.

I think we came up with a way to approximate this, using a callback every
ms or so as the higher order bits, and a sequential counter in the lower
order for those broken platforms.

But perhaps it would be better if we started with a discussion of which
platforms can't do global timestamps, and why not? I know some of them
are fixable, but perhaps not all.
--

From: Masami Hiramatsu
Date: Monday, September 22, 2008 - 5:05 pm

Hi Martin,


Sure, atomic counter might be more expensive but accurate for ordering.
(and it can use on almost all architectures)
The cost depends on the architecture and system configuration.

So, I think it is preferable user to choose their timestamp rather
than fix it. For example, calling callback when writing a log entry
as following;

write_log(struct buffer *buffer, char *data, int len)
{
    /* reserve a logging space */
    char *buf = reserve(buffer, len + buffer->timestamp.len);

    /* write a timestamp */
    buf = buffer->timestamp.write(buf);

    /* write a body */
    memcpy(buf, data, len);
}

And unified buffer prepares default timestamp.write callbacks.

char * timestamp_write(char * buf); // write arch-specific timestamp
char * seqnum_write(char * buf); // write an sequence number



For example, my laptop (this machine/Core2Duo) doesn't return correct TSC. :-(

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Martin Bligh
Date: Monday, September 22, 2008 - 5:12 pm

OK, that'd fix 99% of it, even if we only get time within 1ms or so (but still

Can you define incorrect for me (in this case)?

We had similar problems with some AMD platforms that we can fix by syncing
the TSCs on exit_idle, etc.
--

From: Masami Hiramatsu
Date: Tuesday, September 23, 2008 - 7:49 am

Hi Martin,


On my laptop, TSC is disabled at boot time.

$ dmesg | grep TSC
checking TSC synchronization [CPU#0 -> CPU#1]:
Measured 4246549092 cycles TSC warp between CPUs, turning off TSC clock.
Marking TSC unstable due to: check_tsc_sync_source failed.

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips	: 3998.45
clflush size	: 64
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts pni
monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm
bogomips	: 3994.43
clflush size	: 64
power management:

Actually, I had measured TSC drifting and reported to systemtap-bugzilla
http://sources.redhat.com/bugzilla/show_bug.cgi?id=3916#c19

Curiously, I've tested on another Core2Duo laptop, which cpu is same model and
same stepping, but on that laptop I couldn't see TSC drifting.

Hmm, very interesting. :-)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi ...
From: Mathieu Desnoyers
Date: Tuesday, September 23, 2008 - 8:04 am

* Masami Hiramatsu (mhiramat@redhat.com) wrote:
[...]

... @ 2.00GHz and cpu MHz : 1000.000 ; isn't that a bit odd ? (same for
both cpus)

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Masami Hiramatsu
Date: Tuesday, September 23, 2008 - 8:30 am

2.00GHz is the maximum(model) frequency. And 'cpu MHz' means
current frequency. (yep, now I'm using cpufreq)
Anyway, when I measured TSC drift, I killed cpuspeed service and
fixed freq to 2000. ;-)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Linus Torvalds
Date: Tuesday, September 23, 2008 - 9:01 am

Ahh. I have an idea..

Maybe that thing does thermal throttling? 

Fixing the frequency at the highest setting is actually one of the worst 
things you can do, because if the device is thermally limited, it will 
still do the whole throttling thing, but now it won't do it by changing 
the frequency any more, it will do it by essentially forxing the external 
frequency down.

And that is going to be *very* inefficient. You really really don't want 
that. Your performance will actually be _worse_ than if the CPU went to a 
lower frequency. And it might explain the unreliable TSC too, because I 
suspect constant TSC is really constant only wrt the bus clock to the CPU.

The termal throttling thing is a "protect the CPU from overheating" last 
ditch effort, and because it doesn't lower voltage, it isn't actually at 
all as efficient at saving power (and thus cooling the CPU) as a real 
frequency change event would be.

And fixing the frequency to the highest frequency in a tight laptop 
enclosure is the best way to force that behavior (in contrast - in a 
desktop system with sufficient cooling, it's usually not a problem at all 
to just say "run at highest frequency").

And btw, that also explains why you had so *big* changes in frequency: the 
throttling I think happens with a 1/8 duty cycle thing, iirc.

It's supposed to be very rare with Core 2. Thermal throttling was quite 
common with the P4 one, and was the main overheating protection initially. 
These days, you should only see it for really badly designed devices that 
simply don't have enough thermal cooling, but if the design calls for 
mostly running at low frequency because it's some thing-and-light notebook 
with insufficient cooling (or some thick-and-heavy thing that is just 
total crap), and you force it to always run at full speed, I can imagine 
it kicking in to protect the CPU.

It's obviously also going to be much easier to see if the ambient 
temperature is high. If you want to get best peformance, ...
From: Masami Hiramatsu
Date: Tuesday, September 23, 2008 - 10:04 am

Hi Linus,

Thank you for your advice. I tested it again according your advice,
I did:
- service cpuspeed stop
- echo 1000000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed
  and checked /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq is
  1000000.
- echo 1 >  /proc/acpi/thermal_zone/THM/polling_frequency
- cooling with spray-can :)
- cat /proc/acpi/thermal_zone/THM/temperature
  temperature:             39 C

and ran the test.
---
p0: c:1107576,	 ns:990280	 ratio:111
p0: c:1805640,	 ns:1008787	 ratio:178
p0: c:1998324,	 ns:1000127	 ratio:199
p0: c:946380,	 ns:990280	 ratio:95
p0: c:871728,	 ns:1000267	 ratio:87
p0: c:1807380,	 ns:1007949	 ratio:179
p0: c:1784808,	 ns:1000127	 ratio:178
p0: c:1768488,	 ns:991676	 ratio:178
p0: c:1802292,	 ns:1008299	 ratio:178
p0: c:1787088,	 ns:1000406	 ratio:178
p0: c:1999176,	 ns:1000896	 ratio:199
p0: c:881364,	 ns:991956	 ratio:88
p0: c:1802712,	 ns:1008019	 ratio:178
p0: c:1787088,	 ns:998590	 ratio:178
---
this seems not so stable yet. :-(

After test I checked temperature again.
# cat /proc/acpi/thermal_zone/THM/temperature
temperature:             39 C

Hmm, 39 C is not so high. I wouldn't be surprised even if this
is an individual product bug. Anyway, currently, Linux itself
works well on this laptop with hpet.:-)


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Thomas Gleixner
Date: Tuesday, September 23, 2008 - 10:30 am

Do you have C-States enabled on that machine ?

ls /sys/devices/system/cpu/cpu0/cpuidle/

has it more than a state0 entry ?

If yes, please do:

cat /sys/devices/system/cpu/cpu0/cpuidle/stateX/usage

where X is the highest number in there.

cat /proc/acpi/processor/CPU0/power

might be useful as well.

Thanks,

	tglx
--

From: Masami Hiramatsu
Date: Tuesday, September 23, 2008 - 11:59 am

Hi Thomas,


Yes, there are state0 to state3
# ls /sys/devices/system/cpu/cpu0/cpuidle/

# cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage

# cat /proc/acpi/processor/CPU0/power
active state:            C0
max_cstate:              C8
bus master activity:     00000000
maximum allowed latency: 2000000000 usec
states:
    C1:                  type[C1] promotion[--] demotion[--] latency[001] usage[00000016] duration[00000000000000000000]
    C2:                  type[C2] promotion[--] demotion[--] latency[001] usage[00037969] duration[00000000000024288003]
    C3:                  type[C3] promotion[--] demotion[--] latency[057] usage[00171818] duration[00000000001881257636]

Could these help you?


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Thomas Gleixner
Date: Tuesday, September 23, 2008 - 12:36 pm

C3 stops the TSC. So depending on how many C3 entries you have on the
different cores, your TSCs will drift apart. Some BIOSes do even a
lousy job trying to fixup the TSCs on exit from C3, which makes things

Yup, explains your TSC observation. Nothing we can do about. Broken by
system design :( Welcome in the wonderful world of Inhell/BIOS/ACPI !

Thanks,

	tglx
--

From: Martin Bligh
Date: Tuesday, September 23, 2008 - 12:38 pm

We have linux patches that sync the TSC on exit_idle. I'll see if I can get
Michael to send them out.

M.
--

From: Thomas Gleixner
Date: Tuesday, September 23, 2008 - 12:41 pm

Are you sure that they sync it precicely enough that there is no user
space observable way of time going backwards between cores ?

Thanks,

	tglx
--

From: Martin Bligh
Date: Tuesday, September 23, 2008 - 12:50 pm

I think the tolerance is about 500 cycles. If that's not sufficient, I guess
we'll have to either live with some slight misordering (which people have
pointed out is kind of inevitable anyway) on these broken machines?
It was sufficient for what we were using it for, but maybe not for everyone.
--

From: Thomas Gleixner
Date: Tuesday, September 23, 2008 - 1:03 pm

Well, I dont care about the trace reordering at all. I care about user
space visible time going backwards issues observed via the
gettimeofday vsyscall. 500 cycles should be fine, I doubt that we can
migrate in less than that :)

I guess you try this only for machines where the TSC runs with
constant frequency, right ?

Looking forward to your patches.

Thanks,

	tglx


 
--

From: Martin Bligh
Date: Tuesday, September 23, 2008 - 2:02 pm

We don't do DVFS on the 'broken' TSC machines, just halt.
But yes, it's selective
--

From: Masami Hiramatsu
Date: Tuesday, September 23, 2008 - 1:03 pm

Thank you for analyzing! :-)
Hmm, then could I fix that by fixing my dsdt...?


-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Thomas Gleixner
Date: Tuesday, September 23, 2008 - 1:08 pm

You can limit c-states so you dont do down to the C3 state, but there
is a trade off vs. power saving.

Lets wait for Martins magic TSC patches first :)

     tglx
--

From: Linus Torvalds
Date: Tuesday, September 23, 2008 - 8:46 am

Hmm.. Very interesting.

It smells of a non-stable TSC, but your Core2 Cpu shouldn't have that 

I'd have assumed it was just some initial offset issue, but your 
bug-report seems to say that it really does change the TSC frequency when 
the CPU frequency changes. That should _not_ happen on a core2 CPU, afaik. 
I didn't even know it could be a setup issue, but it does really smell 
like your TSC frequency changes.

Now, unstable TSC's are not uncommon per se, and most older Intel CPU's 
will do it, it's just that I thought it was fixed in Core2 (and later P4's 
for that matter).

The rule *should* be that:
 - family = 15 (netburst), model 3 or higher has constant TSC
 - family = 6 (PPro), model 14 or higher (Core, Core 2)
have constant TSCs. This is quite clearly documented: see Intel ia docs, 
vol 3B, 18.10 "Time-stamp counter".

Very odd. I wonder what your laptop does to screw this up. 

I also suspect that since we already _noticed_ that the TSC isn't stable, 
we should also have then cleared the "constant TSC" bit. And we apparently 
didn't.

Btw, your CPU looks quite odd in other respects too. Judging by your 
bugzilla entry, the TSC sometimes ticks at 2GHz (fine), sometimes at 1Ghz 
(also fine), and sometimes at 667/500MHz judging by the ratios you show 
for TSC/timer tick.

And that last one is really odd, afaik most 2GHz Core 2 duo's will have a 
lower frequency of 1GHz. Is that some special low-power version, perhaps? 
Or maybe it isn't a speedstep-able CPU at all, and the system actually 
changes the *bus* frequency (and then the CPU frequency is some constant 
factor of that). If so, the system messes with the CPU in bad ways.

And btw, I'm almost certain that what you see isn't actually any "CPU 
drift" in the sense that I strongly suspect that the TSC's for both cores 
will change frequency together. So because the TSC isn't stable, it's not 
a good time-source, but despite that it's not necessarily a bad way to 
compare events across cores.

To ...
From: Linus Torvalds
Date: Monday, September 22, 2008 - 5:39 pm

Don't be silly.

An atomic counter is no more accurate for ordering than anything else.

Why?

Because all it tells you is the ordering of the atomic increment, not of 
the caller. The atomic increment is not related to all the other ops that 
the code that you trace actually does in any shape or form, and so the 
ordering of the trace doesn't actually imply anything for the ordering of 
the operations you are tracing!

Except for a single CPU, of course, but for that case you don't need a 
sequence number either, since the ordering is entirely determined by the 
ring buffer itself.

So the counter will be more expensive (cross-cpu cache bouncing for EVERY 
SINGLE EVENT), less useful (no real meaning for people who DO want to have 
a timestamp), and it's really no more "ordered" than anything that bases 
itself on a TSC.

The fact is, you cannot order operations based on log messages unless you 
have a lock around the whole caller - absolutely _no_ amount of locking or 
atomic accesses in the log itself will guarantee ordering of the upper 
layers.

And sure, if you have locking at a higher layer, then a sequence number is 
sufficient, but on the other hand, so is a well-synchronized TSC.

So personally, I think that the optimal solution is:

 - let each ring buffer be associated with a "gettimestamp()" function, so 
   that everybody _can_ set it to something of their own. But default to 
   something sane, namely a raw TSC thing.

 - Add synchronization events to the ring buffer often enough that you can 
   make do with a _raw_ (ie unscaled) 32-bit timestamp. Possibly by simply 
   noticing when the upper 32 bits change, although you could possibly do 
   it with a heartbeat too.

 - Similarly, add a synchronization event when the TSC frequency changes.

 - Make the synchronization packet contain the full 64-bit TSC base, in 
   addition to TSC frequency info _and_ the timebase.

 - From those synchronization events, you should be able to get a very 
   ...
From: Roland Dreier
Date: Monday, September 22, 2008 - 6:26 pm

> Because all it tells you is the ordering of the atomic increment, not of 
 > the caller. The atomic increment is not related to all the other ops that 
 > the code that you trace actually does in any shape or form, and so the 
 > ordering of the trace doesn't actually imply anything for the ordering of 
 > the operations you are tracing!

This reminds me of a naive question that occurred to me while we were
discussing this at KS.  Namely, what does "ordering" mean for events?

An example I'm all too familiar with is the lack of ordering of MMIO on
big SGI systems -- if you forget an mmiowb(), then two CPUs taking a
spinlock and doing writel() inside the spinlock and then dropping the
spinlock (which should be enough to "order" things) might see the
writel() reach the final device "out of order" because the write has to
travel through a routed system fabric.

Just like Einstein said, it really seems to me that the order of things
depends on your frame of reference.

 - R.
--

From: Steven Rostedt
Date: Monday, September 22, 2008 - 6:39 pm

In my logdev tracer (see http://rostedt.homelinux.com/logdev) I used an 
atomic counter to keep "order". But what I would say to people what this 
order means, is that order is among multiple traces between multiple CPUS.
That is if you have.

   CPU 1                                CPU 2
trace_point_a                        trace_point_c
trace_point_b                        trace_point_d

If you see in the trace:

trace_point_a
trace_point_c

You really do not know which happened first. Simply because trace_point_c 
could have been hit first, but for interrupts and nmis and what not, 
trace_point_a could have easily been recorded first. But to me, 
trace_points are more like memory barriers.

If I see:

trace_point_c
trace_point_a
trace_point_b
trace_point_d

I can assume that everything before trace_point_c happened before 
everything after trace_point_a, and that all before trace_point_b happened 
before trace_point_d.

One can not assume that the trace points themselves are in order. But you 
can assume that the things outside the trace points are, like memory 
barriers. I have found lots of race conditions with my logdev, and it was 
due to this "memory barrier" likeness to be able to see the races.

Unfortunately, if you are using an out of sync TSC, you lose even the 
memory barrier characteristic of the trace.

-- Steve

--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 7:02 pm

Exactly as Linus said, event ordering comes down to this : a choice
between heavy locking around the real operation traced and the tracing
statement itself (irq disable/spinlock) or the acknowledgement that the
ordering is only insured across the actual tracing _instrumentation_.

A worse case scenario would be to get an interrupt between the "real"
operation (e.g. a memory or mmio write) and the tracing statement, be
scheduled out, which would let a lot of stuff happen between the actual
impact of the operation on kernel memory and the tracing statement
itself.

If we want to be _sure_ such thing never happen, we would then have to
pay the price of heavy locking and that would not be pretty, especially
for complex data structure modifications comes in play. I don't really
think anyone with an half-sane mind would want to slow down such
critical kernel operations for the benefit of totally ordered tracing.

However, in many cases where ordering matters, e.g. to instrument
spinlocks themselves, if we put the instrumentation within the critical
section rather than outside of it, then we benefit from the existing
kernel locking (but only for events related to this specific spinlock).
This is the same for many synchronization primitives, except for atomic
operations, where we have to accept that the order will be imperfect.

So only in the specific case of instrumentation of things like locking,
where it is possible to insure that instrumentation is synchronized with
the instrumented operation, does it make a difference to choose the TSC
(which implies a slight delta between the TSCs due to cache line delays
at synchronization and delay due to TSCs drifts caused by temperature)
over an atomic increment.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Darren Hart
Date: Monday, September 22, 2008 - 7:26 pm

Hrm, i think that overlooks the other reason to use a time based counter over
an atomic increment: you might care about time.  Perhaps one might be less
concerned with actual order tightly grouped events and more concerned with the
actual time delta between more temporally distant events.  In that case, using
a clocksource would still be valuable. Although admitedtly the caller could
embed that in their payload, but since we seem to agree we need some kind of
counter, the time-based counter appears to be the most flexible.

Thanks,

-- 
Darren Hart
--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 7:31 pm

See my answer to Linus for a proposal on how to do both :)


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Linus Torvalds
Date: Monday, September 22, 2008 - 8:26 pm

Heh. Yes. In general, there is no single ordering unless you actually use 
a serializing lock on all CPU's involved. 

And exactly as in the theory of relativity, two people on different CPU's 
can actually validly _disagree_ about the ordering of the same event. 
There are things that act as "light-cones" and are borders for what 
everybody can agree on, but basically, in the absence of explicit locks, 
it is very possible that no such thing as "ordering" may even exist.

Now, an atomic increment on a single counter obviously does imply *one* 
certain ordering, but it really only defines the ordering of that counter 
itself. It does not at all necessarily imply any ordering on the events 
that go on around the counter in other unrelated cachelines.

Which is exactly why even a global counter in no way orders "events" in 
general, unless those events have something else that does so.

		Linus
--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 8:36 pm

Unless I am missing something, in the case we use an atomic operation
which implies memory barriers (cmpxchg and atomic_add_return does), one
can be sure that all memory operations done before the barrier are
completed at the barrier and that all memory ops following the barrier
will happen after.

Did you have something else in mind ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Linus Torvalds
Date: Monday, September 22, 2008 - 9:05 pm

Sure (if you have a barrier - not all architectures will imply that for an 
incrment).

But that still doesn't mean a thing.

You have two events (a) and (b), and you put trace-points on each. In your 
trace, you see (a) before (b) by comparing the numbers. But what does that 
mean? 

The actual event that you traced is not the trace-point - the trace-point 
is more like a fancy "printk". And the fact that one showed up before 
another in the trace buffer, doesn't mean that the events _around_ the 
trace happened in the same order.

You can use the barriers to make a partial ordering, and if you have a 
separate tracepoint for entry into a region and exit, you can perhaps show 
that they were totally disjoint. Or maybe they were partially overlapping, 
and you'll never know exactly how they overlapped.

Example:

	trace(..);
	do_X();

being executed on two different CPU's. In the trace, CPU#1 was before 
CPU#2. Does that mean that "do_X()" happened first on CPU#1?

No.

The only way to show that would be to put a lock around the whole trace 
_and_ operation X, ie

	spin_lock(..);
	trace(..);
	do_X();
	spin_unlock(..);

and now, if CPU#1 shows up in the trace first, then you know that do_X() 
really did happen first on CPU#1. Otherwise you basically know *nothing*, 
and the ordering of the trace events was totally and utterly meaningless.

See? Trace events themselves may be ordered, but the point of the trace 
event is never to know the ordering of the trace itself - it's to know the 
ordering of the code we're interested in tracing. The ordering of the 
trace events themselves is irrelevant and not useful.

And I'd rather see people _understand_ that, than if they think the 
ordering is somehow something they can trust.

Btw, if you _do_ have locking, then you can also know that the "do_X()" 
operations will be essentially as far apart in some theoretical notion of 
"time" (let's imagine that we do have global time, even if we don't) as 
the cost of ...
From: Steven Rostedt
Date: Monday, September 22, 2008 - 8:43 pm

Hmm, I've been pretty spoiled by x86 mostly ording things correctly, and 
the not x86 boxes I've used has mostly been UP.

But, with that, with a global atomic counter, and the following trace:

cpu 0: trace_point_a
cpu 1: trace_point_c
cpu 0: trace_point_b
cpu 1: trace_point_d


Could the event a really come after event d, even though we already hit 
event b?

But I guess you are stating the fact that what the computer does 
internally, no one really knows. Without the help of real memory barriers, 
ording of memory accesses is mostly determined by tarot cards.

But basically, the perceived result of assembly commands is suppose to be 
accurate at the user level.  The traces that I've used not only shows the 
order (or perceived order) of events, but also the output of the corrupted 
data when the race happens. Usually with the two together, you can pretty 
much guarantee that the traced events actually did occur in the order 
presented.

But without some perceived accurate ording, even when you do see the 
corrupted data, the events can easily be misleading, even on an arch that 
usually orders the events as seen by the user.

-- Steve

--

From: Masami Hiramatsu
Date: Monday, September 22, 2008 - 9:10 pm

yes, if event c is an interrupt event :-).

  cpu 0       cpu 1
             hit event d
hit event a
log event a
             irq event c
             log event c
hit event b
log event b
             log event d

so, I think if we really need to order events, we have to stop
irq right after hitting an event.

Anyway, in most case, I think it works, but as accurate as
synchronized-TSC if hardware supports it.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: Martin Bligh
Date: Monday, September 22, 2008 - 9:17 pm

How could you fix that in any practical way?
--

From: Masami Hiramatsu
Date: Tuesday, September 23, 2008 - 8:23 am

Hi Martin,


In that case, I think it's easy to know that event c is irq
related event by checking source code. :-)
And practically, in most cases, I think we can presume what
happened by event arguments and subsequent events.

Anyway, since there must be a delay between hitting event(and
as linus said, happening event) and logging it, I think buffering
mechanism itself should focus on ensuring only writing and
reading order.

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com


--

From: Steven Rostedt
Date: Tuesday, September 23, 2008 - 3:53 am

heh, This is assuming that event c is in an IRQ handler.

Since I control where event c is, I can prevent that. I'm talking about
the CPU doing something funny that would have c come after d.

But I didn't specify exactly what the events were, so I'll accept that 
explanation ;-)

--

From: Linus Torvalds
Date: Monday, September 22, 2008 - 9:19 pm

Each tracepoint will basically give a partial ordering (if you make it so, 
of course - and on x86 it's hard to avoid it).

And with many trace-points, you can narrow down ordering if you're lucky.

But say that you have code like

	CPU#1		CPU#2

	trace_a		trace_c
	..		..
	trace_b		trace_d

and since each CPU itself is obviously strictly ordered, you a priori know 
that a < b, and c < d. But your trace buffer can look many different ways:

 - a -> b -> c -> d
   c -> d -> a -> b

   Now you do know that what happened between c and d must all have 
   happened entirely after/before the things that happened between
   a and b, and there is no overlap.

   This is only assuming the x86 full memory barrier from a "lock xadd" of 
   course, but those are the semantics you'd get on x86. On others, the 
   ordering might not be that strong.

 - a -> c -> b -> d
   a -> c -> d -> b

   With these trace point orderings, you really don't know anything at all 
   about the order of any access that happened in between. CPU#1 might 

Well, x86 defines a memory order. But what I'm trying to explain is that 
memory order still doesn't actually specify what happens to the code that 
actually does tracing! The trace is only going to show the order of the 
tracepoints, not the _other_ memory accesses. So you'll have *some* 
information, but it's very partial.

And the thing is, all those other memory accesses are the ones that do all 
the real work. You'll know they happened _somewhere_ between two 
tracepoints, but not much more than that.

This is why timestamps aren't really any worse than sequence numbers in 
all practical matters. They'll get you close enough that you can consider 
them equivalent to a cache-coherent counter, just one that you don't have 
to take a cache miss for, and that increments on its own!

Quite a lot of CPU's have nice, dependable, TSC's that run at constant 
frequency. 

And quite a lot of traces care a _lot_ about real time. When you do IO ...
From: Mathieu Desnoyers
Date: Tuesday, September 23, 2008 - 7:12 am

Hrm, Documentation/atomic_ops.txt states that :

"Unlike the above routines, it is required that explicit memory
barriers are performed before and after the operation.  It must be
done such that all memory operations before and after the atomic
operation calls are strongly ordered with respect to the atomic
operation itself."

So on architectures with weaker ordering, the kernel atomic ops already
require that explicit smp_mb() are inserted before and after the atomic
increment. The same applies to cmpxchg.

Therefore I think it's ok, given the semantic provided by these two
atomic operations, to assume they imply a smp_mb() for any given
architecture. If not, then the architecture-specific implementation

Yep. If two "real kernel" events happen to belong to the same
overlapping time window, there is not much we can know about their
order. Adding tracing statements before and after traced kernel
operations could help to make this window as small as possible, but I
doubt it's worth the performance penality and event duplication (and
incremented trace size).


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Mathieu Desnoyers
Date: Monday, September 22, 2008 - 7:30 pm

Hi Linus,


If possible, I would recommend to stay as far away as possible from
using frequency change events to support broken frequency scaling.
First, some architectures, like AMD, does not trigger an event for every
frequency changes on the CPU (if my memory is right, I think the
southbridge temperature throttling does not generate any event) and we
also have to consider the delay between the frequency change and and
moment the instrumentation would be called. Each frequency change event
would therefore increase the over TSC imprecision and I doubt we really
want to rely on this for our time base. I welcome people to prove me
wrong or paranoid by providing results proving the correct synchronicity
of such approach over time. :-) I however explain how LTTng deals with

One gettimestamp() we could think of, to satisfy both people expecting
tracing to perform a memory barrier and for people expecting a
timestamp, would be to write timestamps taken from synchronized TSCs
into a variable with a cmpxchg, which would succeed only if the TSC
value is higher than the value present in this variable. That would
give both the memory barrier behavior and the timestamping. Given the
cache-line bouncing implied, I would only recommend to activate this
when really needed, e.g. when debugging race issues.

I currently use this scheme in LTTng to deal with broken x86
architecture with non-synchronized TSCs. Basically, all CPUs laging behind
others take the TSC value read from memory and adds 200 cycles (or how
many cycles it takes to actually read tsc) from the last cycle counter
value written to memory by the previous time base read. I also plan to
execute an heartbeat on every CPUs someday to give an upper bound to the
imprecision of such TSCs so it could be use for some sort of performance

I currently use the scheme you propose here on architectures
lacking TSC. I hook in the timer interrupt to increment the MSBs.
I doubt anyone will ever go the way of locking every caller, but ...
From: Masami Hiramatsu
Date: Monday, September 22, 2008 - 8:06 pm

Hi Linus,


Indeed.
If TSC(or similar time counter) can provide synchronized-time, I don't
have any comment on that(AFAIK, latest x86 and ia64 can provide it).


Thank you very much for giving me a good idea!

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: KOSAKI Motohiro
Date: Tuesday, September 23, 2008 - 7:36 am

I can't imazine a merit of the single-channel mode.
Could you please explain it?

Because some architecture don't have fine grained timestamp?
if so, could you explain which architecture don't have it?



--

From: Frank Ch. Eigler
Date: Tuesday, September 23, 2008 - 8:02 am

Hi -


It could be a way of saving some memory and merging hassle for
low-throughput data.  (Remember that systemtap enables in-situ
analysis of events so that often only brief final results need be sent
along need be sent out.)  If timestampwise cross-cpu merging can be
done on demand by the hypothetical future buffer widget, then little
reason remains not to use it.

- FChE
--

From: Masami Hiramatsu
Date: Tuesday, September 23, 2008 - 8:21 am

Hi Motohiro,


Actually, single-channel mode is for not-frequently event tracing.
At least systemtap case, sometimes we just want to collect data
and watch it periodically(as like as 'top'). Or, just monitoring
errors as additional printk. in these cases, overhead is not so
important.

I think the main reason of using single-channel mode is simplicity of
userspace reader. We can use 'cat' or 'tail' to read the buffer on-line.
I'm not sure how much overhead ftrace-like buffer merging routine has,
but if kernel provides an interface which gives us single-merged buffer

I heard that get_cycles always returns 0 on some cpu (ARM, MIPS, etc)...
(I think some of their platforms support variants of get_cycles)

Thank you,

-- 
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@redhat.com

--

From: KOSAKI Motohiro
Date: Tuesday, September 23, 2008 - 10:59 am

Yup, I also think it is better.

Thanks.



--

From: Martin Bligh
Date: Tuesday, September 23, 2008 - 11:28 am

That was the plan, yes. Merge sort is cheap ;-)
--

From: Andi Kleen
Date: Monday, September 22, 2008 - 8:33 pm

This is what relayfs always was promised to be, but apparently
never quite became. But before adding a new one I would recommend
to remove relayfs first.

-Andi
--

From: Martin Bligh
Date: Monday, September 22, 2008 - 8:47 pm

It's a little different, though similar. relayfs is an unstructured buffer.
This would be a sequence of events with a common timestamp format,
and hopefully other commonalties too.

I agree that the underlying buffer structure could be shared, as has
been pointed out (buried in this long thread). However, in another buried
comment, it was pointed out that relayfs would have no users once this
was done, so ...

I don't think we can remove relayfs before adding this and switching over
the users though (possibly it could all be done at the same time, but messy)
--

From: Andi Kleen
Date: Monday, September 22, 2008 - 10:04 pm

It's only two in tree users (+ systemtap) so shouldn't be a big hazzle.
And they're all tracers so presumably they all pass time stamps. But please
don't add a multitude of new methods to do this.

BTW I've been using something like that since early 2000. It was called
ktrace and was based on a old SGI patch.

-Andi

-- 
ak@linux.intel.com
--

Previous thread: [patch 1/3] cpu alloc: Use in slub by Christoph Lameter on Friday, September 19, 2008 - 1:37 pm. (1 message)

Next thread: sparse_irq: userspace ABI breakage in -tip tree by Arjan van de Ven on Friday, September 19, 2008 - 3:53 pm. (4 messages)