Re: [RFC/Requirements/Design] h/w error reporting

From: Luck, Tony
Date: Tuesday, November 9, 2010 - 5:56 pm

At the Linux plumbers conference we had an interesting discussion on
the current state and future direction for hardware error reporting.
Thanks to Mauro for setting up the session, and to all those who
attended.  Cc: list created by looking at the most vocal on the last
thread on this subject - but everyone is invited to chime in.

Here are my notes on what was said - please add anything that I
missed/forgot ... or with your own thoughts on this topic.

The current situation
---------------------

We have several subsystems & methods for reporting hardware errors:

1) EDAC ("Error Detection and Correction").  In its original form
this consisted of a platform specific driver that read topology
information and error counts from chipset registers and reported
the results via a sysfs interface.  For example:

# ls -l /sys/devices/system/edac/mc/mc0
total 0
-r--r--r-- 1 root root 4096 Nov  8 14:48 ce_count
-r--r--r-- 1 root root 4096 Nov  8 14:48 ce_noinfo_count
drwxr-xr-x 2 root root    0 Nov  8 14:47 csrow0
drwxr-xr-x 2 root root    0 Nov  8 14:47 csrow1
lrwxrwxrwx 1 root root    0 Nov  8 14:48 device -> ../../../../pci0000:00/0000:00:10.0
-r--r--r-- 1 root root 4096 Nov  8 14:48 mc_name
--w------- 1 root root 4096 Nov  8 14:48 reset_counters
-rw-r--r-- 1 root root 4096 Nov  8 14:48 sdram_scrub_rate
-r--r--r-- 1 root root 4096 Nov  8 14:48 seconds_since_reset
-r--r--r-- 1 root root 4096 Nov  8 14:48 size_mb
-r--r--r-- 1 root root 4096 Nov  8 14:48 ue_count
-r--r--r-- 1 root root 4096 Nov  8 14:48 ue_noinfo_count

Some chipset drivers also report some PCI device error information,
and others provide mechanisms to inject errors for testing.
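
As a rough illustration of how a userspace tool might consume the sysfs
interface listed above, the sketch below reads a single counter such as
ce_count. The helper name read_counter is hypothetical, and the path in
the usage note is simply the one from the listing; other machines may
expose different memory controllers or none at all.

```c
#include <assert.h>
#include <stdio.h>

/* Read one unsigned counter (for example ce_count) from a sysfs-style
 * text file. Returns 0 and stores the value in *out on success,
 * -1 if the file cannot be opened or does not start with a number. */
int read_counter(const char *path, unsigned long *out)
{
    assert(path && out);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    int ok = (fscanf(f, "%lu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A caller would pass "/sys/devices/system/edac/mc/mc0/ce_count" and treat
a return of -1 as "no EDAC driver loaded for this controller".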

2) mcelog - x86 specific decoding of machine check bank registers
reporting in binary form via /dev/mcelog. Recent additions make use
of the APEI extensions that were documented in version 4.0a of the
ACPI specification to acquire more information about errors without
having to rely on reading chipset registers directly. A user ...
From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 3:14 am

Well, the direction is that we are unifying ftrace and perf events and we are 
actively phasing out individual ftrace plugins as matching events become available 
(we already removed a few).

Most new tools use the perf syscall and tool writers have expressed the very 
understandable desire that all events (and their reporting facility) be enumerated 
and accessible via a unified API/ABI.

While it often seems easier for subsystems to just do their own ad-hoc 
logging/reporting in the short run (every subsystem tends to think it has its own 
very specific requirements for logging - while users/tool-authors can only shake 
their head in disbelief when looking at the myriads of incompatible and inconsistent 
facilities). The tooling requirement for unification is strong here and can not be 

Note that Boris has been working on extending perf events into this area as well, 
see this recent submission of patches on lkml:

  [PATCH 20/20] ras: Add RAS daemon

One thing is clear: any 'health subsystem' should not do its own flavor of error 
reporting - instead we want to unify various forms of event logging into a common 
facility.

RAS/EDAC could do its own hardware-specific settings via a separate subsystem - 
although even many of those can be expressed via their respective events. (and we 
are open on the perf events side to give callbacks/facilities for such use)

The synergies of unified event reporting are very strong.

Thanks,

	Ingo
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 7:40 am

Hi Ingo,

The thing that was brought up at KS was the problems between the way
perf and ftrace get their data. We have two buffer systems and two
interfaces for that. Forget the debugfs for now, the ftrace ring buffer
was designed for fast high speed tracing, with or without a reader. The
perf buffer was designed for analyzing a specific task (although it can
do more, but for a single task it shines).

The format of data that ftrace uses and the format perf uses is also
currently incompatible.

Linus said flat out that if he gets one complaint that a tool breaks
because of a format change or because an ABI disappears, he will revert
the patch that made that change immediately.

During the tracing summit at Linux Plumbers, Thomas stated that we have
two choices.

1) We can keep the status-quo and just have two separate interfaces
(whether both would be supported by the perf user tool was not
discussed)

2) We come up with a new syscall (or syscalls) that can be designed for
both the needs of perf and ftrace. This syscall would be kept out of
mainline until everyone was happy with it. After we are happy with it
and have tools that work well with it, we will push it to mainline. Then
the old interfaces would still be supported but nothing new added to
them. And all new development would be with the new syscall(s) and
eventually we deprecate the old interface. This would truly unify ftrace
and perf.

The second option was agreed upon by myself, Thomas, Frederic, and
Peter, and it was OK'd by Linus.

What do you think about it?



--

From: Peter Zijlstra
Date: Wednesday, November 10, 2010 - 7:43 am

I would still like to see lots more detail before we commit to anything,
sure the second way is the only way out, but you still need to come up
with a trace data format and a control ABI.

Without those it's pretty pointless to even talk about stuff.


--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 8:09 am

Great! Let's start with that then. Could you list some of the basic
needs of perf? And then I can start talking about what I need for
ftrace, and we also should keep in mind things we may want to do in the

Agreed, but we really do want to find a way out, thus let's start the
conversation. Basically, start from square one. We now have both ftrace
and perf, and we know what we need. We can start working on something
with both in mind, and perhaps keeping track of other things.

I added Mathieu too. I know to you LTTng does not exist, but he can at
least give ideas about something we may not have thought about and may
want to do in the future.

-- Steve



--

From: Mathieu Desnoyers
Date: Wednesday, November 10, 2010 - 8:28 am

* Steven Rostedt (rostedt@goodmis.org) wrote:

If we come up with an ABI that fulfills my user needs, I'll switch to it and
deprecate LTTng. I've actually worked almost full time on common kernel tracer
infrastructures this year, leaving LTTng development almost to a standing halt.

How do you guys want to proceed? Do you want to throw your requirements over
the wall or do you want me to propose a trace format we can discuss?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Peter Zijlstra
Date: Wednesday, November 10, 2010 - 8:30 am

Needs for what? I've already got a full control ABI and I can already
redirect output to other buffers ;-)

I don't have enumeration of what all is redirected to what, I pretend
that people know wth they're doing.. so if you want session lists of
what tracepoints are active on which buffers and the like you'll have to
come up with something for that.

As for the buffer, I prefer a u64 aligned data stream, but the very
least I need is frame encapsulation. What I don't want _ever_ is stupid
sub-buffers. And no they're not needed, see the discussion about sync
markers a while back.

I also don't want to support the stupid concurrent read/write from tail.

What I do want is both mmap() and splice(), this means buffer size needs
to be specified at buffer creation.

I currently support overwrite (flight-recorder) and non-overwrite modes
depending on PROT_WRITE, I guess that can easily be pushed into the
buffer create call.

As to the mmap() part, it needs a control page to expose the head/tail
pointers and some data.

And as you know I need to write > PAGE_SIZE entries.
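
To make the requirements above concrete, here is a minimal
single-threaded sketch of a u64-aligned data stream with frame
encapsulation and a control page holding head/tail. It is not perf's
actual layout: the struct names, the 4096-byte size, and the header
fields are all assumptions for illustration, and a real implementation
would need memory barriers between writer and reader. Only non-overwrite
mode is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE 4096u          /* fixed at buffer creation (assumed value) */

struct ctrl_page {              /* hypothetical control page shared via mmap() */
    uint64_t head;              /* bytes written so far, advanced by the writer */
    uint64_t tail;              /* bytes consumed so far, advanced by the reader */
};

struct frame_hdr {              /* hypothetical frame encapsulation */
    uint32_t type;
    uint32_t len;               /* payload bytes, excluding header and padding */
};

static uint64_t align8(uint64_t n) { return (n + 7) & ~(uint64_t)7; }

static void copy_in(unsigned char *data, uint64_t pos, const void *src, size_t n)
{
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        data[(pos + i) % BUF_SIZE] = s[i];   /* wrap around the ring */
}

static void copy_out(const unsigned char *data, uint64_t pos, void *dst, size_t n)
{
    unsigned char *d = dst;
    for (size_t i = 0; i < n; i++)
        d[i] = data[(pos + i) % BUF_SIZE];
}

/* Append one frame. Returns 0, or -1 if the frame does not fit. */
int frame_write(struct ctrl_page *c, unsigned char *data,
                uint32_t type, const void *payload, uint32_t len)
{
    assert(c && data && (payload || len == 0));
    if (len > BUF_SIZE - sizeof(struct frame_hdr))
        return -1;
    uint64_t total = align8(sizeof(struct frame_hdr) + len);
    if (total > BUF_SIZE - (c->head - c->tail))
        return -1;              /* non-overwrite: refuse when full */
    struct frame_hdr h = { type, len };
    copy_in(data, c->head, &h, sizeof h);
    copy_in(data, c->head + sizeof h, payload, len);
    c->head += total;           /* head stays a multiple of 8 */
    return 0;
}

/* Consume one frame. Returns the payload length, or -1 if the buffer
 * is empty or the payload is larger than cap. */
int frame_read(struct ctrl_page *c, const unsigned char *data,
               uint32_t *type, void *out, uint32_t cap)
{
    assert(c && data && type && out);
    if (c->head == c->tail)
        return -1;
    struct frame_hdr h;
    copy_out(data, c->tail, &h, sizeof h);
    if (h.len > cap)
        return -1;
    copy_out(data, c->tail + sizeof h, out, h.len);
    *type = h.type;
    c->tail += align8(sizeof h + h.len);
    return (int)h.len;
}
```

Because every frame carries its own length, a payload may wrap around
the end of the ring and is not limited to any page-sized unit.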
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 8:53 am

You are talking about what you said that ftrace ring buffer is totally
broken, that if the writer is writing to the tail, and the reader is
reading from the head, it is broken?

Let me get this straight.

We have a writer constantly writing to the tail of the buffer. On
another CPU we have a reader, that will start at that tail (where the
writer just wrote) and go backwards.

What happens if the writer continues writing? Do we stop the reader and
have it write what it just wrote? Or just consider that the reader goes
the opposite direction of the writer, and when it hits the writer, it
continues, since now it has the new data again.

Now the question comes, how do we show this data to the user? Does the
user need to sort the data? If the reader reads X amount of data, it
gets X from where the writer just wrote. Then the writer writes Y data.
The reader reads X amount again, but X > Y, do we read the Y where the
writer wrote, and then read the buffer part that is older than the
previous read? Thus the user now has the burden to sort the buffer?

I'm really confused as to how to use a buffer like this.

-- Steve


--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 9:52 am

BTW, the sub-buffers are just an implementation detail. I suspect that
we'll have to end up with something that splits the buffer up. Whether
we have 'markers' or something else. They all break down the buffer into

I was thinking about this more. I guess it can work if the reader always
goes the opposite direction of the writer. It's just any user that uses
this will need to cope with it. I would personally like both methods
implemented. One as the "broken design" (as you put it) which removes
the burden of sorting from the user. But the "fast design" which




Sure.

Also, let's not focus now on implementation. Let's try to concentrate on
what we want the tools to be able to do.

For example, I would like:

Very small entries, and pick and choose what I want in my entries.

A way to read it fast to a file or over the network (splice).

The read backwards seems like a cool idea, but I would not want to throw
away the read forwards part either.

How we implement this, we can work together on.

-- Steve


--

From: Borislav Petkov
Date: Wednesday, November 10, 2010 - 10:05 am

It would also be cool to be able to allocate those buffers as early as
possible, even before MCA is enabled, so that I won't have to copy
MCE data which got logged before the tracing subsystem got enabled to
the buffers proper.

-- 
Regards/Gruss,
Boris.
--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 10:41 am

We could even have some (small) statically enabled build-time buffer that could be 
enabled straight away before any allocators are enabled.

Thanks,

	Ingo
--

From: Luck, Tony
Date: Wednesday, November 10, 2010 - 10:50 am

Agreed - in fact the error reporting paths will also want
some pre-allocated guaranteed space at all times.  Allocating
memory from within an NMI or Machine Check handler would
cause too many problems.
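
A minimal userspace sketch of that idea, using C11 atomics in place of
kernel primitives: the storage is static, so nothing is allocated at
event time, and space is reserved with a compare-and-swap loop rather
than a lock. The names pool_reserve and POOL_SIZE are hypothetical, and
the pool is never recycled here, which a real error log would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define POOL_SIZE 4096u                    /* assumed size of the reserved area */

static unsigned char pool[POOL_SIZE];      /* static storage: no allocator involved */
static atomic_size_t pool_used;            /* bytes handed out so far */

/* Reserve len bytes from the static pool without taking a lock.
 * Returns NULL when the remaining space is too small. */
void *pool_reserve(size_t len)
{
    assert(len > 0);

    size_t old = atomic_load(&pool_used);
    do {
        if (len > POOL_SIZE - old)
            return NULL;                   /* exhausted: caller must drop or count the event */
        /* on failure, old is reloaded with the current value and we retry */
    } while (!atomic_compare_exchange_weak(&pool_used, &old, old + len));

    return &pool[old];
}
```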

-Tony
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 11:09 am

I'm not sure it needs to be small. We can have a persistent buffer that
may be resized at any time. It can start off small, or with a kernel
command line, be as big as you want it.

Basically, what ftrace has now.

Also, with a way that root user can get a handle on this buffer, and
just trace global events with it.

-- Steve


--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 11:52 am

Yeah.

Thanks,

	Ingo
--

From: Frederic Weisbecker
Date: Wednesday, November 10, 2010 - 10:25 am

If the size of the sub-buffers is tunable (all the same size inside a
whole buffer, but that size is tunable), then someone who doesn't want
to use sub-buffers can just use a single big sub-buffer :)

--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 10:48 am

Yep. The obvious direction is to extend the event buffering ABI we already have, 
with whatever additions that are needed:

 - document that we already support flight recorder mode

 - a more compressed record format

 - NOP filler events up to page boundary, for better splice and for better flight 
   recorder

 - splice support

etc. That's how it evolved until now and it's all very extensible.
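
The "NOP filler up to page boundary" item can be sketched as a small
calculation: before writing a record, check whether it would straddle a
page, and if so emit a filler covering the rest of the current page.
The function name and the 4096-byte page size are assumptions, and the
sketch only covers records no larger than one page.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u   /* assumed page size */

/* Number of filler bytes to emit at byte offset pos so that a record of
 * len bytes does not straddle a page boundary. Returns 0 when the record
 * already fits in the current page. */
size_t filler_needed(size_t pos, size_t len)
{
    assert(len > 0 && len <= PAGE_SZ);

    size_t room = PAGE_SZ - (pos % PAGE_SZ);   /* bytes left in this page */
    return (len <= room) ? 0 : room;
}
```

With whole records per page, a reader can hand complete pages to
splice() without first parsing across page boundaries.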

Steve, could you please list the additions you have in mind, in order of priority?

Thanks,

	Ingo
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 11:05 am

A few things that pop up quickly are:

1) lockless

2) as-fast-as-possible

3) support all tasks / all CPUs and still have as-fast-as-possible

Peter said at LPC that the perf buffering system was not designed to
handle high speed tracing. But he also said he does not like the way the
ftrace buffering works.

I think if we take a step back, we can come up with a new buffering/ABI
system that can satisfy everyone. We will still support the current
method now, but I really don't think it is designed with everything we
had in mind. I do not envision that we can "evolve" to where we want to
be. We may have to bite the bullet, just like iptables did when they saw
the failures of ipchains, and redesign something new now that we
understand what the requirements are.

I do think we need to come up with something new but still support the
old methods. Thomas came up with this idea, and Peter, Frederic and
myself agreed.

-- Steve


--

From: Luck, Tony
Date: Wednesday, November 10, 2010 - 11:23 am

This is a clear requirement for use in h/w error
reporting too.  Taking locks in NMI or machine
check handler isn't an option.

-Tony
--

From: Peter Zijlstra
Date: Wednesday, November 10, 2010 - 11:31 am

Don't worry, lots of PMIs are NMIs, perf needs to be fully NMI safe
otherwise things simply don't work.
--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 11:49 am

Yep, in fact perf was fully NMI safe earlier than the ftrace ring-buffer.

When perf code is NMI unsafe we notice it very quickly. I regularly record millions 
of events per second.

Thanks,

	Ingo
--

From: Peter Zijlstra
Date: Wednesday, November 10, 2010 - 11:24 am

You're not very good at listening, I said the perf infrastructure and
event handling mechanism isn't geared towards full throughput but
instead towards sampling.

There is lots of code between getting the event and landing it in the
buffer. The buffer itself is perfectly suited for high speed low
overhead stuff, the perf data format possibly not because it's not
bitfield happy.




--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 11:41 am

Note that even that is an implementation detail that can be changed: even with a 
sampling model the sampling bits are in a flag word, so common combinations can be 
checked for quickly and open-coded into flat fall-through code - if the sample 
decoding ever shows up as overhead. (It doesn't even need any ABI changes.)


Even that can be tweaked via allowing more compressed records. I doubt it will help 
as much, but it's still an incremental change that can be validated carefully.

Fact is that we have an ABI, happy users, happy tools and happy developers, so going 
incrementally is important and allows us to validate and measure every step while 
still having a full tool-space in place - and it will help everyone, in addition to 
the ftrace/lttng usecases.

We'll need to embark on this incremental path instead of a rewrite-the-world thing. 
As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can 
and will do better here.

Thanks,

	Ingo
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 12:00 pm

Thus you are saying that we stick to the status quo, and also ignore the
fact that perf was a rewrite-the-world from ftrace to begin with.

-- Steve


--

From: Frederic Weisbecker
Date: Wednesday, November 10, 2010 - 12:11 pm

Perhaps you and Mathieu can summarize your requirements here and then explain
why extending the current ABI wouldn't work. It's quite normal that people
try to find a solution fully backward compatible in the first place. If
it's not possible, fine, but then justify it.

--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 12:30 pm

Yeah, that's pretty much the only reasonable approach really. It also makes every 
single step testable and verifiable, and often optional as well:

 - How much do we win from more compressed records? Do we win? Do we want _larger_,
   less encoded records on some CPUs because their construction overhead and cache
   behavior is better there?

 - How much does splice() help?

 - How much do the sampling fast-path approaches help? How many apps will make use
   of them?

Those are all issues that are virtually undecidable individually if the approach is 
an all-or-nothing flag-day thing.

Fact is that the perf events based tool space is vibrant and alive, and new uses are 
popping up every week. We'd be utter fools to not embark on an iterative approach 
here. It does not even limit us in any way technically.

The days of full tracing subsystem rewrites are over Steve, i'm afraid it is time 
for us to grow up ;-)

Thanks,

	Ingo
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 12:48 pm

Probably more if we eventually fix the bug that causes a copy of the
page when it is not needed.

-- Steve


--

From: Mathieu Desnoyers
Date: Wednesday, November 10, 2010 - 1:23 pm

Sure, here are the requirements my user-base has, followed by a listing of Perf
and Ftrace pain points, some of which are directly derived from their respective
ABIs, others partially caused by their implementation and partially caused by
their ABI.

- Low overhead is key
  - 150 ns per event (cache-hot)
  - Zero-copy (splice to disk/network, mmap for zero-copy in-place data
    analysis)
- Compactness of traces
  - e.g. 96 bits per event (including typical 64-bit payload), no PID saved per
    event.
- Scalability to multi-core and multi-processor
  - Per-CPU buffers, time-stamp reading both scalable to many cpus *and* accurate
- Production-grade tracer reliability
  - Trace clock accuracy within 100ns, ordering can be inferred based on
    lock/interrupt handler knowledge, ability to know when ordering might be
    wrong.
- Flight recorder mode
  - Support concurrent read while writer is overwriting buffer data
    (Thomas Gleixner named these "trace-shots")
- Support multiple trace sessions in parallel
  - Engineer + Operator + flight recorder for automated bug reports
- Availability of trace buffers for crash diagnosis
  - Save to disk, network, use kexec or persistent memory
- Heterogeneous environment support
  - Portability
  - Distinct host/target environment support
  - Management of multiple target kernel versions
  - No dependency on kernel image to analyze traces
    (traces contain complete information)
- Live view/analysis of trace streams via the network
  - Impact on buffer flushing, power saving, idle, ...
- Synchronized system-wide (hypervisor, kernel and user-space) traces
- Scalability of analysis tools to very large data sets (> 10GB)
- Standardization of trace format across analysis tools
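
To show what the "96 bits per event" compactness goal implies, here is
one hypothetical encoding: a 32-bit header holding an event id and a
timestamp delta, followed by the 64-bit payload, with no PID stored.
The 5-bit/27-bit split and the function names are assumptions for
illustration, not a layout stated in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EVENT_BYTES 12u   /* 96 bits: 32-bit header + 64-bit payload */
#define ID_BITS      5u   /* assumed width of the event id */
#define TS_BITS     27u   /* assumed width of the timestamp delta */

/* Pack id and timestamp delta into one 32-bit word and append the payload. */
void event_encode(unsigned char out[EVENT_BYTES],
                  uint32_t id, uint32_t ts_delta, uint64_t payload)
{
    assert(id < (1u << ID_BITS));
    assert(ts_delta < (1u << TS_BITS));

    uint32_t hdr = (id << TS_BITS) | ts_delta;
    memcpy(out, &hdr, sizeof hdr);
    memcpy(out + sizeof hdr, &payload, sizeof payload);
}

/* Reverse of event_encode: split the header word and copy out the payload. */
void event_decode(const unsigned char in[EVENT_BYTES],
                  uint32_t *id, uint32_t *ts_delta, uint64_t *payload)
{
    assert(in && id && ts_delta && payload);

    uint32_t hdr;
    memcpy(&hdr, in, sizeof hdr);
    memcpy(payload, in + sizeof hdr, sizeof *payload);
    *id = hdr >> TS_BITS;
    *ts_delta = hdr & ((1u << TS_BITS) - 1u);
}
```

A format this small leaves no room for per-event context, so anything
such as the current task has to be recovered from other events in the
same stream.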


* Ring Buffer issues with Perf:

- Perf does not support flight recorder tracing (concurrent read/write)
  - Sub-buffers are needed to support concurrent read/writes in flight recorder
    mode. Peter still has to convince me otherwise (if he cares).
  - ...
From: Luck, Tony
Date: Wednesday, November 10, 2010 - 1:54 pm

When I hear somebody say "flight recorder" - I think of "black boxes"
in airplanes that log data while the flight is running, and are only
looked at offline later.  So I'm confused by the "concurrent read/write"
requirement.

Perhaps you could explain the use cases of your "flight recorder",
because it seems that the name doesn't fit exactly, and this is
causing me (and maybe others) some confusion.

Thanks

-Tony
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 2:06 pm

Hmm, I had this argument with Mathieu before, but I guess I mistakenly
let him win ;-)

I call "flight recorder" mode "overwrite" mode. Basically there are two
modes. They only have meaning when the ring buffer is full and a write
takes place.

1) producer/consumer mode - When the writer reaches the reader, all new
events are discarded. This means that you lose the latest events while
you keep older events around.

2) overwrite mode (flight recorder) - when the writer reaches the
reader, it pushes the reader forward, and writes the new events over the
old ones. This way, new events are always kept, whereas old events
are lost.

1 is much easier to implement than 2, especially when doing it in a
lockless way.
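
The difference between the two modes can be shown with a toy,
single-threaded ring of four integer slots. This is only a sketch of
the semantics described above, with hypothetical names; it says nothing
about how either mode is made lockless in ftrace or perf.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4u

enum rb_mode { RB_PRODUCER_CONSUMER, RB_OVERWRITE };

struct rb {
    int slot[SLOTS];
    size_t head;        /* total events written into the ring */
    size_t tail;        /* total events consumed or overwritten */
    size_t lost;        /* events dropped in either mode */
    enum rb_mode mode;
};

/* Write one event. Returns 0 if stored, -1 if discarded. */
int rb_write(struct rb *r, int ev)
{
    assert(r);
    if (r->head - r->tail == SLOTS) {          /* writer has reached the reader */
        r->lost++;
        if (r->mode == RB_PRODUCER_CONSUMER)
            return -1;                         /* discard the new event */
        r->tail++;                             /* push the reader forward: oldest is lost */
    }
    r->slot[r->head++ % SLOTS] = ev;
    return 0;
}

/* Read the oldest remaining event. Returns 0 on success, -1 if empty. */
int rb_read(struct rb *r, int *ev)
{
    assert(r && ev);
    if (r->head == r->tail)
        return -1;
    *ev = r->slot[r->tail++ % SLOTS];
    return 0;
}
```

Writing events 1 through 6 with no reader leaves 1, 2, 3, 4 in a
producer/consumer ring and 3, 4, 5, 6 in an overwrite ring; both report
two lost events.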

I guess I should have fought harder to keep the terminology of
"overwrite" mode. This is the third time in the last week I had to
explain what "flight recorder" mode was. Whereas overwrite mode was a
bit more obvious.

-- Steve


--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 2:34 pm

Ah, I forgot to mention use cases.

Even when recording a trace (lots of data, so it is saved to disk and
not all in kernel memory) I like to know that if something happens and
disables the trace, I have all the trace information up to the point of
failure.

With producer/consumer mode, you risk the reader being late and the
events in the trace that led up to failure (or any other anomaly) being
dropped.

-- Steve


--

From: Mathieu Desnoyers
Date: Wednesday, November 10, 2010 - 3:51 pm

As Steven pointed out, the flight recorder buffers are set to overwrite the
oldest data when the buffer is filled. Therefore, the tracer can be used in
closed-circuit mode (without extracting the data out of the memory buffers) to
keep a trace of the recent events. The trace can be extracted when an
interesting condition (trigger) occurs.

A typical use-case is to let it run on an end-user machine to enhance
application crash diagnosis with tracing information, albeit using a very small
fraction of the system resources to do so.

The reason why "concurrent read/write" is required is for server-class machines
which need to be able to continuously gather trace data to report/find/locate
problematic scenarios happening. This means we're not only interested in one
single failure, but rather in a whole set of erroneous/warning conditions that
need to be reported. Stopping tracing every time data is gathered is
inappropriate, because it would hide errors/warnings that would be happening
during data collection.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Thomas Gleixner
Date: Wednesday, November 10, 2010 - 4:12 pm

Aargh! Just because it can be done all in one with an insane amount of
complexity does not mean that it's an absolute requirement and a good
solution.

So if you want to have both the flight recorder crash documentation
and the ongoing monitoring then use two separate sessions with
separate modes and be done with it.

Cramming both into the same session is just insane.

The first rule is "Keep It Simple!".  Period.

Thanks,

	tglx
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 4:20 pm

It's not that complex. Both Mathieu and I have implemented it. Really,
I've seen a lot more complex code. Just because it does not fit into the
CS101 course does not mean that it is totally complex.

-- Steve


--

From: Thomas Gleixner
Date: Wednesday, November 10, 2010 - 4:45 pm

I know you have implemented it and it's not about CS101, it's about
the insanity of what we have in the ftrace ringbuffer code. It does
not fulfil in any way the "Keep it Simple" requirement. Period.

And as I said earlier on IRC, you are trying to create the

  ZeroImpactFilteringMultiSessionFlightRecorderOverwriteFifoSplicePerfMmapTracer

which is a nice wet dream, but completely unrealistic to achieve in
one go.

When I yelled at you folks in Boston last week and suggested to come
up with a syscall for buffers and the corresponding configuration
interfaces along with a unified record format, then I certainly did
not ask for the ZIFMSFROFSPMT thingy and a rewrite the world approach.

I told you back in 2008 that you need to think hard about the
interfaces and start with a reasonable simple implementation. Then
proceed from there.

The overall achievement so far is an ongoing ringbuffer pissing
contest, zero interfaces and lengthy explanations of which kind of tracer
madness is preferred by whom.

I don't call this progress. If you did not get the message last week,
then you have it in writing now to digest as long as it takes:

 Get your gear together and come up with sensible gradual approaches
 which bring us to a better progress ratio than 0/year.

Thanks,

	tglx
--

From: Ted Ts'o
Date: Thursday, November 11, 2010 - 11:25 am

At least when I've used ftrace for the "flight recorder" use case, I'm
not tracing as well.  What I do is enable a bunch of trace points,
maybe I've sprinkled in some "trace_printk()'s" into various kernel
code paths, and then I run the workload which locks up the kernel.
When it locks up, I've used sysrq-z to dump out the ftrace ring buffer,
and usually _exactly_ what I need to debug the lock up is waiting for
me in the ring buffer.

So this use case is incredibly useful, and I hope whatever folks do
with the new-fangled API, that somehow "overwrite mode" is supported.
Even if, for speed reasons, what you do is wait for the head to
overrun the tail and then bump the tail up by 50% so that we lose half
the log (so that whatever expensive locking is necessary only happens
once in a while), I at least would find that quite acceptable.

The other feature/requirements request I would make is that there
should be a way that common kernel abstractions, such as converting a
dev_t to either a MAJOR/MINOR number pair, or to a device name, be
made available.  For now I've changed the tracepoints to translate
MAJOR/MINOR and drop integers into the ring buffer, and a generic
workaround in the future is to always drop strings into the ring
buffer instead of allowing the translation to be done in TP_printk
(which doesn't work for perf; it causes the userspace perf client to
fall over and die, without even skipping the problematic tracepoint
record --- boo, hiss).  But that can be relatively inefficient,
because we're now having to drop potentially fairly large text strings
into the ring buffer, because of limitations that perf has in its output
transformation step.
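
The "drop integers into the ring buffer" workaround can be sketched from
the userspace side. Here the glibc major()/minor() macros from
<sys/sysmacros.h> stand in for the kernel's MAJOR()/MINOR(), and the
record layout and function names are hypothetical. The point is that
the record carries two plain integers, so the consumer needs no
kernel-side translation to print them.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysmacros.h>

/* Hypothetical record payload: the device is stored as two integers
 * instead of a raw dev_t or a device-name string. */
struct dev_record {
    unsigned int dev_major;
    unsigned int dev_minor;
};

/* Split a dev_t once, at record time. */
struct dev_record dev_record_from(dev_t dev)
{
    struct dev_record r = { major(dev), minor(dev) };
    return r;
}

/* Format the record as "major:minor". Returns the snprintf() result. */
int dev_record_format(const struct dev_record *r, char *buf, size_t len)
{
    assert(r && buf);
    return snprintf(buf, len, "%u:%u", r->dev_major, r->dev_minor);
}
```

This costs eight bytes per record, against an arbitrarily long string if
the device name were stored instead.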

I know that because perf is doing its output transformation in
userspace, there are fundamental limitations about what it can do.
But it would be nice if it could be expanded at least _somewhat_, and
either way, there needs to be some clear documentation about what it
can and can not accept.  And if these limitations means ...
From: Mathieu Desnoyers
Date: Wednesday, November 10, 2010 - 4:28 pm

I'm afraid this is not what I proposed above. I'm open to use different tracing
sessions for different things. However, the server-class case needs to
continuously gather data so that "trace-shots" can be gathered when problems
occur. But if you hit two problems back to back, you don't want to lose the
trace leading to the second issue. Hence the motivation for supporting

I'd like to start with an implementation that skips some of these requirements
initially, but what I really think we need to figure out is how we organize our
ABIs to finally support these requirements.

Thanks,


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Thomas Gleixner
Date: Wednesday, November 10, 2010 - 4:58 pm

Realistically, you are interested in the first one, simply because in
99.9% of the cases the second problem is caused by the first one. Do we
really need to care about the 0.1% which fall into the other category?

Not at all. Simply because the likelihood of those back to back events
_AND_ giving us the 0.1% case is approaching zero.

Of course you can argue with your academic hat on that I'm ignoring
that we might catch this rare "easter and xmas fall on the same day"

I did not say that you should not think about this, but the progress
so far in more than TWO YEARS is exactly ZERO. And that's what I'm
concerned about.

Thanks,

	tglx
--

From: Ingo Molnar
Date: Thursday, November 11, 2010 - 2:17 am

Note, there is an existing ABI in place, please use that. (It's highly extensible so 
it can support just about any ABI experiment that can even be turned into a smooth 
ABI replacement.)

I think Frederic just started iterating it - but if Mathieu and Steve help out it 

Yes, indeed that is the main problem i see as well.

Most of the problems listed in the various documents can be solved iteratively in 
the existing facilities. There is not a single requirement where Peter or I said 
'No, this cannot be done, go away!'. Each and every item was answered with: 'sure, 
we can do that' - or at worst with a 'do we really need it?'. Each and every item 
fits naturally into existing goals as well - so it's not like some different world 
view is being forced on anyone.

We only have one basic condition: please introduce these things step by step in the 
existing ABI.

This is a must-have for tools, and there is another very important factor as well: a 
couple of items can have disadvantages beyond the claimed advantages, so we want to 
be able to evaluate the effects in isolation, test them and if needed, undo them. It 
will settle the 'do we really need this?' kind of sub-arguments for sub-features.

So being intelligent about it, being iterative is my only requirement to you guys: 
you are free to change anything, go wild, but please make it iterative and don't try 
to fork the tooling and developer community.

The time has come to not grow the list of requirements but to shrink it.

Thanks,

	Ingo
--

From: Mathieu Desnoyers
Date: Thursday, November 11, 2010 - 6:37 am

Hi Ingo,


Is there any way we could proceed without piling up work-arounds over Perf's ABI?
At this point, the only benefit of growing from the Perf ABI is comparable to
dragging a ball and chain all along. Yes, I agree that a smooth transition
should be the target, but I disagree on the means. I propose to come up with a
new ABI and eventually move the perf tools to this ABI, which is not a split in
the tracing developers community; rather more a unification.

Which do you prefer: a sequence of continuous ABI breakages or a single ABI
switch when the new ABI is ready? In terms of contributor and user pain, I
think the second option is much better. We'll have to keep the old Perf ABI
around for a while anyway to keep users and tool developers happy. As Linus
pointed out at KS, this ABI is now cast in stone. If we need to break it, the
only thing we can do is create a new one. He said that he will personally revert
any ABI-breaking tracing patch if he ever receives a single complaint from a
user. This is not a context in which we want to start playing games with the
existing ABIs.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Frederic Weisbecker
Date: Wednesday, November 10, 2010 - 2:30 pm

Yeah, but the main point here is to explain why/how reaching those goals is not
efficiently possible through an extension of the current ABI, in practice.

I'm going to try for some of them. Note that when I talk about ABI breakage,
it actually means: create a new ABI and support the old one, schedule its
deprecation in the long term.




We could do splice in perf through an extension of the current ABI.
The rest seems more about kernel internals.



In perf we save the pid from two places:

- perf headers, see PERF_SAMPLE_TID
- from the common fields of the trace events

Ftrace too for common fields.

It's useful to keep PERF_SAMPLE_TID for low overhead events (like
perf little sampling). Otherwise we can certainly deduce the pid
from context switch trace events.

But the pid in the trace event headers remains. We probably should
get rid of that.
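
To illustrate the point about deducing the pid from context switch events,
here is a minimal sketch. It is not perf or ftrace code: the record layout,
the EV_SCHED_SWITCH id and MAX_CPUS are assumptions made up for the example.
A reader walks a time-ordered stream and remembers, per CPU, the pid that
the last switch event brought in.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPUS 4            /* hypothetical limit for this sketch */
#define EV_SCHED_SWITCH 1     /* hypothetical event type id */

/* Hypothetical, simplified record: no pid is stored per event. */
struct ev {
	int cpu;
	unsigned short type;
	int next_pid;         /* only meaningful for EV_SCHED_SWITCH */
};

/*
 * Walk a time-ordered event stream and fill out_pid[i] with the pid that
 * was running on the event's CPU, as deduced from earlier context switches.
 * Events seen before the first switch on a CPU get -1 (unknown).
 */
static void deduce_pids(const struct ev *evs, size_t n, int *out_pid)
{
	int cur[MAX_CPUS];
	size_t i;

	for (i = 0; i < MAX_CPUS; i++)
		cur[i] = -1;

	for (i = 0; i < n; i++) {
		assert(evs[i].cpu >= 0 && evs[i].cpu < MAX_CPUS);
		/* The switch event itself is attributed to the outgoing task. */
		out_pid[i] = cur[evs[i].cpu];
		if (evs[i].type == EV_SCHED_SWITCH)
			cur[evs[i].cpu] = evs[i].next_pid;
	}
}
```

The obvious cost is that events recorded before the first switch on a CPU
have no known owner, so a real tool would need an initial snapshot of the
running tasks.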

There are also the other common fields:

	struct trace_entry {
		unsigned short		type;


Type is needed by perf. If we have one buffer per event, we could
retrieve which event we are dealing with. But if buffers are
multiplexed per cpu, we need this.


	unsigned char		flags;


Useful for ftrace, not for perf which will be able to save regs
soon.


	unsigned char		preempt_count;


Dunno. Should be optional.


	int			pid;


Kill!


	int			lock_depth;


Killed ;)


};
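
A rough sketch of why the type field is needed once several events share
one buffer: the reader has nothing else to tell it how to decode the next
record. The event ids, payload sizes and record layout below are
hypothetical and much simpler than the real ring buffer format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event ids and payload sizes, for illustration only. */
enum { EV_A = 0, EV_B = 1, EV_NR };
static const size_t payload_len[EV_NR] = { 4, 8 };

/*
 * Count records per event type in a buffer where records of different
 * events are interleaved. Each record is a 2-byte type followed by a
 * type-dependent payload. Returns 0 on success, -1 on a malformed buffer.
 */
static int count_by_type(const unsigned char *buf, size_t len,
			 unsigned int counts[EV_NR])
{
	size_t off = 0;

	memset(counts, 0, EV_NR * sizeof(counts[0]));
	while (off < len) {
		unsigned short type;

		if (len - off < sizeof(type))
			return -1;
		memcpy(&type, buf + off, sizeof(type));
		off += sizeof(type);
		/* Without the type we could not know the record length. */
		if (type >= EV_NR || len - off < payload_len[type])
			return -1;
		counts[type]++;
		off += payload_len[type];
	}
	return 0;
}
```

With one buffer per event, the type would be implied by the buffer and the
2-byte prefix could be dropped from every record.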



=> ABI breakage needed. Can be made through an ABI extension though, but

Use splice for save to disk or network. But I don't understand the kexec
thing.







ditto.
This works well for perf and ftrace currently. Have you


We all try to ensure backward compatibility. It only gets broken
because of unwanted regressions or scheduled deprecation in the


You can select them independently, except for trace events, for which

What is this userspace tracing? Is this userspace tracing made in ...
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 2:54 pm

Yeah, I discussed this with Mathieu. There's a pretty trivial fix for

If an interrupt (or softirq) preempts the recorded trace, then the events
recorded in that interrupt all get the same timestamp as the event it
preempted, which gives the impression that all of those events happened at once.

Again, this is just a side effect and the fix is trivial. But may

That was actually a decision made by Linus, but it is trivial to change, as
there's nothing hard coded about the design that forces us to have page
size sub buffers. I don't even think that it would require an ABI

Yep, they are large, but can be trimmed. This would require no ABI
breakage since these headers are also described in the event
formats. Thus the current tools should cope with the headers
changing. In fact they were designed to, since the lock-depth was known

He's talking about tracing the tracepoints in a loaded module. We
currently have no way to add them while a trace is happening. The trace
formats do not exist and may not exist (if module is unloaded) when the
trace ends.

But who really loads and then unloads a module during tracing? As pretty
much all kernel developers cringe at the fact that modules get

Me too, let's go shopping!

-- Steve


--

From: Frederic Weisbecker
Date: Wednesday, November 10, 2010 - 3:19 pm

If you have this patch ready that fixes my ring buffer inexperience,
please send it right away. I might accommodate the self-ABI breakage,
depending on what it is...

--

From: Frederic Weisbecker
Date: Wednesday, November 10, 2010 - 3:49 pm

Hmm, in practice this is an ABI breakage as we have scripts that rely on
the common_pid field for example. We can fix this, but older tools won't



Hehe :)

Anyway this can be expressed through an ABI extension,
using a kind of lazy tracepoint registration or so.

--

From: Mathieu Desnoyers
Date: Wednesday, November 10, 2010 - 5:11 pm

Agreed, although 65536 type IDs is probably overkill for the common case.
I prefer to go for approaches with a header that contains a smaller number of





Yep, you'd have to support the two formats side by side for a while anyway. So

That's right. It's more in the trace-clock area. Let's keep this problem for


There were more details below on the impact of supporting flight recorder on the
trace format (using sub-buffers, etc). The ABI impact is more than just a flag,


This one is when the kernel is crashed. So there is not much still available,
certainly not splice(). :) The idea is to keep the trace buffers around in the

Portable bitfields come to mind. And no, it's not enough to just reverse the

The setup is that the traces are gathered on telecom switches, and brought to a
host machine for viewing. The user has to deal with traces gathered from various
kernel versions.

I did push Steven to support cross-endianness and self-describing types in
Ftrace in the past, and I have to admit that a large part of this requirement is

Yep, this one means that the trace metadata (currently exported through
debugfs) should travel along with the trace stream. One way to do it would



There are ways to lay out the trace data so that a userspace tool can dig through

I'm working for the Linux Foundation CELF group and Ericsson, with the
Multi-Core Association, to come up with a standardized trace format across
trace providers in the industry, so that we can use the same tools to analyze
traces taken from heterogeneous systems (hardware traces, OS traces, user-space
traces...).

Given the live analysis and low-overhead requirements, being able to generate

Nope, this one is an ABI breakage. The current mmap shared control head/tail
values used for synchronization between the kernel (writer) and user-space
(reader) do not allow concurrent read/write in flight recorder mode. We need,


Because we need to get exclusive access to the next sub-buffer ...
--

From: Steven Rostedt
Date: Thursday, November 11, 2010 - 9:10 am

Note, ftrace currently has over 600 event types. Unless we compact it
down into bits, using two bytes is fine.

-- Steve


--

From: Mathieu Desnoyers
Date: Thursday, November 11, 2010 - 9:34 am

I understand that overall the number of events overflows 256. However, this does
not mean they are typically all activated. Moreover, if we add multiple buffer
support, these events don't necessarily have to end up in the same buffer. This
numeric identifier is only there to distinguish between events ending in the
same buffer after all.

So the number of available events does not count as an argument for choosing the
typical number of bytes to represent the event IDs. The number of events
typically activated for a trace session does.
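
As a quick illustration of that arithmetic (a sketch only, with a helper
name invented for the example), the ID width follows from how many distinct
events can land in one buffer, not from how many exist overall:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Smallest number of whole bytes able to hold `nr_ids` distinct event ids
 * (ids 0 .. nr_ids - 1). Hypothetical helper, not part of any tracer.
 */
static size_t id_bytes_needed(unsigned long nr_ids)
{
	size_t bytes = 1;
	unsigned long max;

	assert(nr_ids > 0);
	max = nr_ids - 1;          /* largest id that must be representable */
	while (max > 0xff) {
		max >>= 8;
		bytes++;
	}
	return bytes;
}
```

With the roughly 600 event types Steven mentions all routed to one buffer,
this gives 2 bytes; a session that enables, say, 40 of them (a made-up
figure) fits in 1 byte.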

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 12:11 pm

No, I'm saying we don't do new things just for the sake of it being new, without
exhausting existing facilities.

None of the examples/arguments offered so far seemed to necessitate throwing away

No, the thing is that there were no tools and no ABI - perf was mostly about the ABI
and about the user-space tooling - ftrace didn't really have that (and oprofile had
deep problems).

Thanks,

	Ingo
--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 12:16 pm

OK, not the buffering, but the infrastructure. That's not much of a
difference to this topic. Which is about the ABI which includes all from

Well, we need to separate out the buffer in perf regardless, since it is
very entwined in the code. Does it now support flight recorder mode?

-- Steve


--

From: Steven Rostedt
Date: Wednesday, November 10, 2010 - 12:38 pm

Question:

Can we make perf reduce what it records, thus speeding up recording,
without breaking the ABI?

Can we add a splice-based, non-mmap flight recorder mode without
breaking the ABI?

If we can make perf as fast as ftrace in its recording, and maybe even
faster if we have the ability to select what is recorded and compress
the events, I'm all for it.

-- Steve



--

From: Ingo Molnar
Date: Wednesday, November 10, 2010 - 11:27 am

Yeah - that's a self-evident goal for just about any kernel code.

	Ingo
--
