At the Linux plumbers conference we had an interesting discussion on the current state and future direction for hardware error reporting. Thanks to Mauro for setting up the session, and to all those who attended.

Cc: list created by looking at the most vocal on the last thread on this subject - but everyone is invited to chime in.

Here are my notes on what was said - please add anything that I missed/forgot ... or with your own thoughts on this topic.

The current situation
---------------------

We have several subsystems & methods for reporting hardware errors:

1) EDAC ("Error Detection and Correction"). In its original form this consisted of a platform specific driver that read topology information and error counts from chipset registers and reported the results via a sysfs interface. For example:

    # ls -l /sys/devices/system/edac/mc/mc0
    total 0
    -r--r--r-- 1 root root 4096 Nov  8 14:48 ce_count
    -r--r--r-- 1 root root 4096 Nov  8 14:48 ce_noinfo_count
    drwxr-xr-x 2 root root    0 Nov  8 14:47 csrow0
    drwxr-xr-x 2 root root    0 Nov  8 14:47 csrow1
    lrwxrwxrwx 1 root root    0 Nov  8 14:48 device -> ../../../../pci0000:00/0000:00:10.0
    -r--r--r-- 1 root root 4096 Nov  8 14:48 mc_name
    --w------- 1 root root 4096 Nov  8 14:48 reset_counters
    -rw-r--r-- 1 root root 4096 Nov  8 14:48 sdram_scrub_rate
    -r--r--r-- 1 root root 4096 Nov  8 14:48 seconds_since_reset
    -r--r--r-- 1 root root 4096 Nov  8 14:48 size_mb
    -r--r--r-- 1 root root 4096 Nov  8 14:48 ue_count
    -r--r--r-- 1 root root 4096 Nov  8 14:48 ue_noinfo_count

Some chipset drivers also report some PCI device error information and others provide mechanisms to inject errors for testing.

2) mcelog - x86 specific decoding of machine check bank registers reporting in binary form via /dev/mcelog. Recent additions make use of the APEI extensions that were documented in version 4.0a of the ACPI specification to acquire more information about errors without having to rely on reading chipset registers directly. A user ...
Well, the direction is that we are unifying ftrace and perf events and we are actively phasing out individual ftrace plugins as matching events become available (we already removed a few). Most new tools use the perf syscall and tool writers have expressed the very understandable desire that all events (and their reporting facility) be enumerated and accessible via a unified API/ABI. While it often seems easier for subsystems to just do their own ad-hoc logging/reporting in the short run (every subsystem tends to think it has its own very specific requirements for logging - while users/tool-authors can only shake their head in disbelief when looking at the myriads of incompatible and inconsistent facilities). The tooling requirement for unification is strong here and can not be Note that Boris has been working on extending perf events into this area as well, see this recent submission of patches on lkml: [PATCH 20/20] ras: Add RAS daemon One thing is clear: any 'health subsystem' should not do its own flavor of error reporting - instead we want to unify various forms of event logging into a common facility. RAS/EDAC could do its own hardware-specific settings via a separate subsystem - although even many of those can be expressed via their respective events. (and we are open on the perf events side to give callbacks/facilities for such use) The synergies of unified event reporting are very strong. Thanks, Ingo --
Hi Ingo, The thing that was brought up at KS was the problems between the way perf and ftrace get their data. We have two buffer systems and two interfaces for that. Forget the debugfs for now, the ftrace ring buffer was designed for fast high speed tracing, with or without a reader. The perf buffer was designed for analyzing a specific task (although it can do more, but for a single task it shines). The format of data that ftrace uses and the format perf uses is also currently incompatible. Linus said flat out that if he gets one complaint that a tool breaks because a format change or an ABI disappears, he will revert the patch that did that change immediately. During the tracing summit at Linux Plumbers, Thomas stated that we have two choices. 1) We can keep the status-quo and just have two separate interfaces (whether both would be supported by the perf user tool was not discussed) 2) We come up with a new syscall (or syscalls) that can be designed for both the needs of perf and ftrace. This syscall would be kept out of mainline until everyone was happy with it. After we are happy with it and have tools that work well with it, we will push it to mainline. Then the old interfaces would still be supported but nothing new added to them. And all new development would be with the new syscall(s) and eventually we deprecate the old interface. This would truly unify ftrace and perf. The second option was agreed upon by myself, Thomas, Frederic, and Peter, and it was OK'd by Linus. What do you think about it? --
I would still like to see lots more detail before we commit to anything. Sure, the second way is the only way out, but you still need to come up with a trace data format and a control ABI. Without those it's pretty pointless to even talk about stuff. --
Great! Let's start with that then. Could you list some of the basic needs of perf? And then I can start talking about what I need for ftrace, and we also should keep in mind things we may want to do in the Agreed, but we really do want to find a way out, thus lets start the conversation. Basically, start from square one. We now have both ftrace and perf, and we know what we need. We can start working on something with both in mind, and perhaps keeping track of other things. I added Mathieu too. I know to you LTTng does not exist, but he can at least give ideas about something we may not have thought about and may want to do in the future. -- Steve --
* Steven Rostedt (rostedt@goodmis.org) wrote: If we come up with an ABI that fulfills my user needs, I'll switch to it and deprecate LTTng. I've actually worked almost full time on common kernel tracer infrastructures this year, leaving LTTng development almost to a standing halt. How do you guys want to proceed ? Do you want to throw your requirements over the wall or do you want me to propose a trace format we can discuss on ? Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Needs for what? I've already got a full control ABI and I can already redirect output to other buffers ;-) I don't have enumeration of what all is redirected to what, I pretend that people know wth they're doing.. so if you want session lists of what tracepoints are active on which buffers and the like you'll have to come up with something for that. As for the buffer, I prefer a u64 aligned data stream, but the very least I need is frame encapsulation. What I don't want _ever_ is stupid sub-buffers. And no they're not needed, see the discussion about sync markers a while back. I also don't want to support the stupid concurrent read/write from tail. What I do want is both mmap() and splice(), this means buffer size needs to be specified at buffer creation. I currently support overwrite (flight-recorder) and non-overwrite modes depending on PROT_WRITE, I guess that can easily be pushed into the buffer create call. As to the mmap() part, it needs a control page to expose the head/tail pointers and some data. And as you know I need to write > PAGE_SIZE entries. --
You are talking about your claim that the ftrace ring buffer is totally broken: that if the writer is writing to the tail, and the reader is reading from the head, it is broken? Let me get this straight. We have a writer constantly writing to the tail of the buffer. On another CPU we have a reader, that will start at that tail (where the writer just wrote) and go backwards. What happens if the writer continues writing? Do we stop the reader and have it write what it just wrote? Or just consider that the reader goes the opposite direction of the writer, and when it hits the writer, it continues, since now it has the new data again. Now the question comes, how do we show this data to the user? Does the user need to sort the data? If the reader reads X amount of data, it gets X from where the writer just wrote. Then the writer writes Y data. The reader reads X amount again, but X > Y, do we read the Y where the writer wrote, and then read the buffer part that is older than the previous read? Thus the user now has the burden to sort the buffer? I'm really confused as to how to use a buffer like this. -- Steve --
BTW, the sub buffers are just an implementation detail. I suspect that we'll have to end up with something that splits the buffer up. Whether we have 'markers' or something else. They all break down the buffer into I was thinking about this more. I guess it can work if the reader always goes the opposite direction of the writer. It's just any user that uses this will need to cope with it. I would personally like both methods implemented. One as the "broken design" (as you put it) which removes the burden of sorting from the user. But the "fast design" which Sure. Also, let's not focus now on implementation. Let's try to concentrate on what we want the tools to be able to do. For example, I would like: Very small entries, and pick and choose what I want in my entries. A way to read it fast to a file or over the network (splice). The read backwards seems like a cool idea, but I would not want to throw away the read forwards part either. How we implement this, we can work together on. -- Steve --
It would also be cool to be able to allocate those buffers as early as possible, even if before MCA is enabled, so that I won't have to copy MCE data which got logged before the tracing subsystem got enabled to the buffers proper. -- Regards/Gruss, Boris. --
We could even have some (small) statically enabled build-time buffer that could be enabled straight away before any allocators are enabled. Thanks, Ingo --
Agreed - in fact the error reporting paths will also want some pre-allocated guaranteed space at all times. Allocating memory from within an NMI or Machine Check handler would cause too many problems. -Tony --
I'm not sure it needs to be small. We can have a persistent buffer that may be resized at any time. It can start off small, or with a kernel command line, be as big as you want it. Basically, what ftrace has now. Also, with a way that root user can get a handle on this buffer, and just trace global events with it. -- Steve --
If the size of the sub-buffers is tunable (all the same size inside a whole buffer, but that size is tunable), then someone who doesn't want to use subbuffers can just use a single big subbuffer :) --
Yep. The obvious direction is to extend the event buffering ABI we already have, with whatever additions that are needed: - document that we already support flight recorder mode - a more compressed record format - NOP filler events up to page boundary, for better splice and for better flight recorder - splice support etc. That's how it evolved until now and it's all very extensible. Steve, could you please list the additions you have in mind, in order of priority? Thanks, Ingo --
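To make the "NOP filler events up to page boundary" item a bit more concrete, here is a minimal sketch of how a writer might size such a filler so that no record straddles a page. The names PAGE_SZ and filler_len() are hypothetical, and the sketch assumes records no larger than a page, which does not cover the > PAGE_SIZE entries Peter mentioned earlier in the thread. It is not taken from the perf or ftrace code.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u	/* hypothetical page size, for illustration only */

/*
 * Return the number of filler (NOP) bytes to emit at 'offset' so that a
 * record of 'len' bytes does not cross a page boundary. Returns 0 when
 * the record already fits in what is left of the current page.
 */
static size_t filler_len(size_t offset, size_t len)
{
	size_t room = PAGE_SZ - (offset % PAGE_SZ);

	assert(len <= PAGE_SZ);	/* sketch only handles sub-page records */
	return (len <= room) ? 0 : room;
}
```

With every page starting on a record boundary, whole pages could in principle be handed to splice() or dropped in overwrite mode without parsing the records before them.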
A few things that pop up quickly are: 1) lockless 2) as-fast-as-possible 3) support all tasks / all CPUs and still have as-fast-as-possible Peter said at LPC that the perf buffering system was not designed to handle high speed tracing. But he also said he does not like the way the ftrace buffering works. I think if we take a step back, we can come up with a new buffering/ABI system that can satisfy everyone. We will still support the current method now, but I really don't think it is designed with everything we had in mind. I do not envision that we can "evolve" to where we want to be. We may have to bite the bullet, just like iptables did when they saw the failures of ipchains, and redesign something new now that we understand what the requirements are. I do think we need to come up with something new but still support the old methods. Thomas came up with this idea, and Peter, Frederic and myself agreed. -- Steve --
This is a clear requirement for use in h/w error reporting too. Taking locks in NMI or machine check handler isn't an option. -Tony --
Don't worry, lots of PMIs are NMIs, perf needs to be fully NMI safe otherwise things simply don't work. --
Yep, in fact perf was fully NMI safe earlier than the ftrace ring-buffer. When perf code is NMI unsafe we notice it very quickly. I regularly record millions of events per second. Thanks, Ingo --
You're not very good at listening, I said the perf infrastructure and event handling mechanism isn't geared towards full throughput but instead on sampling. There is lots of code between getting the event and landing it in the buffer. The buffer itself is perfectly suited for high speed low overhead stuffs, the perf data format possibly not because it's not bitfield happy. --
Note that even that is an implementational detail that can be changed: even with a sampling model the sampling bits are in a flag word, so common combinations can be checked for quickly and open-coded into flat fall-through code - if the sample decoding ever shows up as overhead. (It doesnt even need any ABI changes.) Even that can be tweaked via allowing more compressed records. I doubt it will help as much, but it's still an incremental change that can be validated carefully. Fact is that we have an ABI, happy users, happy tools and happy developers, so going incrementally is important and allows us to validate and measure every step while still having a full tool-space in place - and it will help everyone, in addition to the ftrace/lttng usecases. We'll need to embark on this incremental path instead of a rewrite-the-world thing. As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can and will do better here. Thanks, Ingo --
Thus you are saying that we stick to the status quo, and also ignore the fact that perf was a rewrite-the-world from ftrace to begin with. -- Steve --
Perhaps you and Mathieu can summarize your requirements here and then explain why extending the current ABI wouldn't work. It's quite normal that people try to find a solution fully backward compatible in the first place. If it's not possible, fine, but then justify it. --
Yeah, that's pretty much the only reasonable approach really. It also makes every single step testable and verifiable, and often optional as well: - How much do we win from more compressed records? Do we win? Do we want _larger_, less encoded records on some CPUs because their construction overhead and cache behavior is better there? - How much does splice() help? - How much do the sampling fast-path approaches help. How many apps will make use of them? Those are all issues that are virtually undecidable individually if the approach is an all-or-nothing flag-day thing. Fact is that the perf events based tool space is vibrant and alive, and new uses are popping up every week. We'd be utter fools to not embark on an iterative approach here. It does not even limit us in any way technically. The days of full tracing subsystem rewrites are over Steve, i'm afraid it is time for us to grow up ;-) Thanks, Ingo --
Probably more if we eventually fix the bug that causes a copy of the page when it is not needed. -- Steve --
Sure, here are the requirements my user-base have, followed by a listing of Perf
and Ftrace pain points, some of which are directly derived from their respective
ABIs, others partially caused by their implementation and partially caused by
their ABI.
- Low overhead is key
- 150 ns per event (cache-hot)
- Zero-copy (splice to disk/network, mmap for zero-copy in-place data
analysis)
- Compactness of traces
- e.g. 96 bits per event (including typical 64-bit payload), no PID saved per
event.
- Scalability to multi-core and multi-processor
- Per-CPU buffers, time-stamp reading both scalable to many cpus *and* accurate
- Production-grade tracer reliability
- Trace clock accuracy within 100ns, ordering can be inferred based on
lock/interrupt handler knowledge, ability to know when ordering might be
wrong.
- Flight recorder mode
- Support concurrent read while writer is overwriting buffer data
(Thomas Gleixner named these "trace-shots")
- Support multiple trace sessions in parallel
- Engineer + Operator + flight recorder for automated bug reports
- Availability of trace buffers for crash diagnosis
- Save to disk, network, use kexec or persistent memory
- Heterogeneous environment support
- Portability
- Distinct host/target environment support
- Management of multiple target kernel versions
- No dependency on kernel image to analyze traces
(traces contain complete information)
- Live view/analysis of trace streams via the network
- Impact on buffer flushing, power saving, idle, ...
- Synchronized system-wide (hypervisor, kernel and user-space) traces
- Scalability of analysis tools to very large data sets (> 10GB)
- Standardization of trace format across analysis tools
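As an illustration of the "96 bits per event (including typical 64-bit payload)" compactness figure in the list above, here is a minimal sketch of one way such a record could be laid out: a 32-bit header followed by a 64-bit payload. The split of the header into a 5-bit event ID and a 27-bit timestamp delta is an assumption made for this example, as are all the names; the list itself does not specify a layout. The sketch also writes in host byte order, so it ignores the cross-endianness requirement.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EV_TS_BITS 27u				/* assumed: low bits hold a timestamp delta */
#define EV_TS_MASK ((1u << EV_TS_BITS) - 1u)
#define EV_ID_MASK 0x1fu			/* assumed: 5-bit event ID in the top bits */

/* Write a 12-byte (96-bit) record to 'out'; returns the bytes written. */
static size_t event_encode(uint8_t *out, uint32_t id, uint32_t ts_delta,
			   uint64_t payload)
{
	uint32_t hdr = ((id & EV_ID_MASK) << EV_TS_BITS) | (ts_delta & EV_TS_MASK);

	memcpy(out, &hdr, sizeof(hdr));			/* host byte order */
	memcpy(out + sizeof(hdr), &payload, sizeof(payload));
	return sizeof(hdr) + sizeof(payload);
}

/* Extract the event ID from an encoded record. */
static uint32_t event_id(const uint8_t *in)
{
	uint32_t hdr;

	memcpy(&hdr, in, sizeof(hdr));
	return hdr >> EV_TS_BITS;
}

/* Extract the timestamp delta from an encoded record. */
static uint32_t event_ts_delta(const uint8_t *in)
{
	uint32_t hdr;

	memcpy(&hdr, in, sizeof(hdr));
	return hdr & EV_TS_MASK;
}
```

Note there is no PID field: as the list says, the PID would have to be recovered from other events (for example context switches) rather than saved per event.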
* Ring Buffer issues with Perf:
- Perf does not support flight recorder tracing (concurrent read/write)
- Sub-buffers are needed to support concurrent read/writes in flight recorder
mode. Peter still has to convince me otherwise (if he cares).
- ...
When I hear somebody say "flight recorder" - I think of "black boxes" in airplanes that log data while the flight is running, and are only looked at offline later. So I'm confused by the "concurrent read/write" requirement. Perhaps you could explain the use cases of your "flight recorder", because it seems that the name doesn't fit exactly, and this is causing me (and maybe others) some confusion. Thanks -Tony --
Hmm, I had this argument with Mathieu before, but I guess I mistakenly let him win ;-) I call "flight recorder" mode "overwrite" mode. Basically there are two modes. They only have meaning when the ring buffer is full and a write takes place.

1) producer/consumer mode - when the writer reaches the reader, all new events are discarded. This means that you lose the latest events while you keep older events around.

2) overwrite mode (flight recorder) - when the writer reaches the reader, it pushes the reader forward, and writes the new events over the old ones. This way, new events are always present, whereas old events are lost.

1 is much easier to implement than 2, especially when doing it in a lockless way. I guess I should have fought harder to keep the terminology of "overwrite" mode. This is the third time in the last week I had to explain what "flight recorder" mode was, whereas overwrite mode was a bit more obvious. -- Steve --
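The difference between the two modes can be shown with a small single-threaded sketch. This is not the ftrace or perf ring buffer: it has no locking or lockless tricks, fixed-size slots, and all names (struct ring, rb_write(), RB_SLOTS, ...) are hypothetical. It only shows what happens to a write when the buffer is full.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SLOTS 4	/* hypothetical, tiny capacity for illustration */

enum rb_mode { RB_PRODUCER_CONSUMER, RB_OVERWRITE };

struct ring {
	uint64_t slot[RB_SLOTS];
	size_t head;	/* total events written so far */
	size_t tail;	/* total events read or overwritten so far */
	size_t lost;	/* events discarded (mode 1) or overwritten (mode 2) */
	enum rb_mode mode;
};

static void rb_init(struct ring *rb, enum rb_mode mode)
{
	rb->head = rb->tail = rb->lost = 0;
	rb->mode = mode;
}

/* Returns false only when the new event itself was discarded. */
static bool rb_write(struct ring *rb, uint64_t ev)
{
	if (rb->head - rb->tail == RB_SLOTS) {	/* writer reached the reader */
		rb->lost++;
		if (rb->mode == RB_PRODUCER_CONSUMER)
			return false;		/* newest event is dropped */
		rb->tail++;			/* push reader forward: oldest is lost */
	}
	rb->slot[rb->head % RB_SLOTS] = ev;
	rb->head++;
	return true;
}

/* Returns false when there is nothing left to read. */
static bool rb_read(struct ring *rb, uint64_t *ev)
{
	if (rb->tail == rb->head)
		return false;
	*ev = rb->slot[rb->tail % RB_SLOTS];
	rb->tail++;
	return true;
}
```

Writing events 1 through 6 into a 4-slot buffer with no reader leaves 1..4 in producer/consumer mode and 3..6 in overwrite mode. The hard part discussed in this thread, a writer moving the reader's position while the reader is active on another CPU, is exactly what this sketch leaves out.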
Ah, I forgot to mention use cases. Even when recording a trace (lots of data, so it is saved to disk and not all in kernel memory) I like to know that if something happens and disables the trace, I have all the trace information up to the point of failure. With producer/consumer mode, you risk the reader being late and the events in the trace that led up to failure (or any other anomaly) were dropped. -- Steve --
As Steven pointed out, the flight recorder buffers are set to overwrite the oldest data when the buffer is filled. Therefore, the tracer can be used in closed-circuit mode (without extracting the data out of the memory buffers) to keep a trace of the recent events. The trace can be extracted when an interesting condition (trigger) occurs. A typical use-case is to let it run on an end-user machine to enhance application crash diagnosis with tracing information, albeit using a very small fraction of the system resources to do so. The reason why "concurrent read/write" is required is for server-class machines which need to continuously be able to gather trace data to report/find/locate problematic scenarios happening. This means we're not only interested in one single failure, but rather in a whole set of erroneous/warning conditions that need to be reported. Stopping tracing every time data is gathered is inappropriate, because it would hide errors/warnings that would be happening during data collection. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Aargh! Just because it can be done all in one with an insane amount of complexity does not mean that it's an absolute requirement and a good solution. So if you want to have both the flight recorder crash documentation and the ongoing monitoring then use two separate sessions with separate modes and be done with it. Cramming both into the same session is just insane. The first rule is "Keep It Simple!". Period. Thanks, tglx --
It's not that complex. Both Mathieu and I have implemented it. Really, I've seen a lot more complex code. Just because it does not fit into the CS101 course does not mean that it is totally complex. -- Steve --
I know you have implemented it and it's not about CS101, it's about the insanity of what we have in the ftrace ringbuffer code. It does not fulfil in any way the "Keep it Simple" requirement. Period. And as I said earlier on IRC, you are trying to create the ZeroImpactFilteringMultiSessionFlightRecorderOverwriteFifoSplicePerfMmapTracer which is a nice wet dream, but completely unrealistic to achieve in one go. When I yelled at you folks in Boston last week and suggested to come up with a syscall for buffers and the corresponding configuration interfaces along with a unified record format, then I certainly did not ask for the ZIFMSFROFSPMT thingy and a rewrite the world approach. I told you back in 2008 that you need to think hard about the interfaces and start with a reasonable simple implementation. Then proceed from there. The overall achievement so far is an ongoing ringbuffer pissing contest, zero interfaces and lengthy explanations which kind of tracer madness is preferred by whom. I don't call this progress. If you did not get the message last week, then you have it in writing now to digest as long as it takes: Get your gear together and come up with sensible gradual approaches which bring us to a better progress ratio than 0/year. Thanks, tglx --
At least when I've used ftrace for the "flight recorder" use case, I'm not tracing as well. What I do is enable a bunch of trace points, maybe I've sprinkled in some "trace_printk()'s" into various kernel code paths, and then I run the workload which locks up the kernel. When it locks up, I've used sysrq-z to dump out the ftrace ring buffer, and usually _exactly_ what I need to debug the lock up is waiting for me in the ring buffer. So this use case is incredibly useful, and I hope whatever folks do with the new-fangled API, that somehow "overwrite mode" is supported. Even if for speed reasons, what you do is wait until the head overruns the tail, and then the tail gets bumped up by 50% and we lose half the log (so that whatever expensive locking is necessary only happens once in a while), I at least would find that quite acceptable. The other feature/requirements request I would make is that there should be a way that common kernel abstractions, such as converting a dev_t to either a MAJOR/MINOR number pair, or to a device name, be made available. For now I've changed the tracepoints to translate MAJOR/MINOR and drop integers into the ring buffer, and a generic workaround in the future is to always drop strings into the ring buffer instead of allowing the translation to be done in TP_printk (which doesn't work for perf; it causes the userspace perf client to fall over and die, without even skipping the problematic tracepoint record --- boo, hiss). But that can be relatively inefficient, because we're now having to drop potentially fairly large text strings into the ring buffer, because of limitations that perf has in its output transformation step. I know that because perf is doing its output transformation in userspace, there are fundamental limitations about what it can do. But it would be nice if it could be expanded at least _somewhat_, and either way, there needs to be some clear documentation about what it can and can not accept. 
And if these limitations means ...
I'm afraid this is not what I proposed above. I'm open to use different tracing sessions for different things. However, the server-class case needs to continuously gather data so that "trace-shots" can be gathered when problems occur. But if you hit two problems back to back, you don't want to lose the trace leading to the second issue. Hence the motivation for supporting I'd like to start with an implementation that skips some of these requirements initially, but what I really think we need to figure out is how we organize our ABIs to finally support these requirements. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Realistically, you are interested in the first one, simply because in 99.9% of the cases the second problem is caused by the first one. Do we really need to care about the 0.1% which fall into the other category? Not at all. Simply because the likelihood of those back to back events _AND_ giving us the 0.1% case is approaching zero. Of course you can argue with your academic hat on that I'm ignoring that we might catch this rare "easter and xmas fall on the same day" I did not say that you should not think about this, but the progress so far in more than TWO YEARS is exactly ZERO. And that's what I'm concerned about. Thanks, tglx --
Note, there is an existing ABI in place, please use that. (It's highly extensible so it can support just about any ABI experiment that can even be turned into a smooth ABI replacement.) I think Frederic just started iterating it - but if Mathieu and Steve help out it Yes, indeed that is the main problem i see as well. Most of the problems listed in the various documents can be solved iteratively in the existing facilities. There is not a single requirement where Peter or me said 'No, this cannot be done, go away!'. Each and every item was answered with: 'sure, we can do that' - or at worst with a 'do we really need it?'. Each and every item fits naturally into existing goals as well - so it's not like some different world view is being forced on anyone. We only have one basic condition: please introduce these things step by step in the existing ABI. This is a must-have for tools, and there is another very important factor as well: a couple of items can have disadvantages beyond the claimed advantages, so we want to be able to evaluate the effects in isolation, test them and if needed, undo them. It will settle the 'do we really need this?' kind of sub-arguments for sub-features. So being intelligent about it, being iterative is my only requirement to you guys: you are free to change anything, go wild, but please make it iterative and dont try to fork the tooling and developer community. The time has come to not grow the list of requirements but to shrink it. Thanks, Ingo --
Hi Ingo, Is there any way we could proceed without piling up work-arounds over Perf's ABI ? At this point, the only benefit of growing from the Perf ABI is comparable to dragging a ball and chain all along. Yes, I agree that a smooth transition should be the target, but I disagree on the means. I propose to come up with a new ABI and eventually move the perf tools to this ABI, which is not a split in the tracing developers community; rather more a unification. Which do you prefer: a sequence of continuous ABI breakages or a single ABI switch when the new ABI is ready ? In terms of contributor and user pain, I think the second option is much better. We'll have to keep the old Perf ABI around for a while anyway to keep users and tool developers happy. As Linus pointed out at KS, this ABI is now cast in stone. If we need to break it, the only thing we can do is create a new one. He said that he will personally revert any ABI-breaking tracing patch if he ever receives a single complaint from a user. This is not a context in which we want to start playing games with the existing ABIs. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Yeah, but the main point here is to explain why/how reaching those goals is not
efficiently possible through an extension of the current ABI, in practice.
I'm going to try for some of them. Note when I'll talk about ABI breakage,
it actually means: create a new ABI and support the old one, schedule its
deprecation in the long term.
We could do splice in perf through an extension of the current ABI.
The rest seems more about kernel internals.
In perf we save the pid from two places:
- perf headers, see PERF_SAMPLE_TID
- from the common fields of the trace events
Ftrace too for common fields.
It's useful to keep PERF_SAMPLE_TID for low overhead events (like
perf little sampling). Otherwise we can certainly deduce the pid
from context switch trace events.
But the pid in the trace event headers remains. We probably should
get rid of that.
There are also the other common fields:
struct trace_entry {
unsigned short type;
Type is needed by perf. If we have one buffer per event, we could
retrieve which event we are dealing with. But if buffers are
multiplexed per cpu, we need this.
unsigned char flags;
Useful for ftrace, not for perf which will be able to save regs
soon.
unsigned char preempt_count;
Dunno. Should be optional.
int pid;
Kill!
int lock_depth;
Killed ;)
};
=> Abi breakage needed. Can be made through an ABI extension though, but
Use splice for save to disk or network. But I don't understand the kexec
thing.
ditto.
This works well for perf and ftrace currently. Have you
We all try to ensure backward compatibility. It only gets broken
because of unwanted regressions or scheduled deprecation in the
You can select them independently, except for trace events, for which
What is this userspace tracing? Is this userspace tracing made in ...Yeah, I discussed this with Mathieu. There's a pretty trivial fix for If an interrupt (or softirq) preempts the recorded trace, then events that are recorded in that interrupt all get the same time as the event it preempted. Giving us the assumption that all events happened at once. Again, this is just a side effect and the fix is trivial. But may That was actually a decision made by Linus. But is trivial to change. As there's nothing hard coded about the design that forces us to have page size sub buffers. I don't even think that it would require an ABI Yep, they are large, but can be trimmed. This would require no ABI breakage since these headers are also described in the event formats. Thus the current tools should cope with the headers changing. In fact they were designed to since the lock-depth was known He's talking about tracing the tracepoints in a loaded module. We currently have no way to add them while a trace is happening. The trace formats do not exist and may not exist (if module is unloaded) when the trace ends. But who really loads and then unloads a module during tracing? As pretty much all kernel developers cringe at the fact that modules get Me too, let's go shopping! -- Steve --
If you have this patch ready that fixes my ring buffer inexperience, please send it right away. I might accommodate the self-ABI breakage, depending on what it is... --
Hmm, in practice this is an ABI breakage as we have scripts that rely on the common_pid field for example. We can fix this, but older tools won't Hehe :) Anyway this can be expressed through an ABI extension, using a kind of lazy tracepoint registration or so. --
Agreed, although 65536 types ID is probably overkill for the common case. I prefer to go for approaches with a header that contains a smaller number of Yep, you'd have to support the two formats side-to-side for a while anyway. So That's right. It's more in the trace-clock area. Let's keep this problem for There were more details below on the impact of supporting flight recorder on the trace format (using sub-buffers, etc). The ABI impact is more than just a flag, This one is when the kernel is crashed. So there is not much still available, certainly not splice(). :) The idea is to keep the trace buffers around in the Portable bitfields comes to my mind. And no, it's not enough to just reverse the The setup is that the traces are gathered on telecom switches, and brought to a host machine for viewing. The user has to deal with traces gathered from various kernel versions. I did push Steven to support cross-endianness and self-describing types in Ftrace in the past, and I have to admit that a large part of this requirement is Yep, this one involves that the trace metadata (currently exported through debugfs) should make its way along with the trace stream. One way to do it would There are ways to layout the trace data so that a userspace tool can dig through I'm working for the Linux Foundation CELF group and Ericsson, with the Multi-Core Association, to come up with a standardized trace format across trace providers in the industry, so that we can use the same tools to analyze traces taken from heterogeneous systems (hardware traces, OS traces, user-space traces...). Given the live analysis and low-overhead requirements, being able to generate Nope, this one is an ABI breakage. The current mmap shared control head/tail values used for synchronization between the kernel (writer) and user-space (reader) does not allow concurrent read/write in flight recorder mode. We need, Because we need to get exclusive access to the next sub-buffer ...
Note, ftrace currently has over 600 event types. Unless we compact it down into bits, using two bytes is fine. -- Steve --
I understand that overall the number of events overflows 256. However, this does not mean they are typically all activated. Moreover, if we add multiple buffer support, these events don't necessarily have to end up in the same buffer. This numeric identifier is only there to distinguish between events ending in the same buffer after all. So the number of available events does not count as an argument for choosing the typical number of bytes to represent the event IDs. The number of events typically activated for a trace session does. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
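The sizing argument in the two messages above comes down to simple arithmetic: the ID field only has to distinguish the events that can land in the same buffer. A minimal sketch, with a hypothetical helper name id_bits():

```c
/*
 * Minimum number of bits needed to give each of 'n' events that share a
 * buffer a distinct numeric ID (0 bits when there is at most one event).
 */
static unsigned int id_bits(unsigned int n)
{
	unsigned int bits = 0;

	if (n == 0)
		return 0;
	for (n -= 1; n != 0; n >>= 1)
		bits++;
	return bits;
}
```

With the "over 600 event types" Steve mentions all routed to one buffer, this gives 10 bits, which is why a single byte is not enough and two bytes are comfortable. A session that only activates, say, 30 events in a buffer would need 5 bits.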
No, i'm saying we dont do new things just for the sake of it being new, without exhausting existing facilities. None of the examples/arguments offered so far seemed to necessitate throwing away No, the thing is that there were no tools and no ABI - perf was mostly about the ABI and about the user-space tooling - ftrace didnt really have that (and oprofile had deep problems). Thanks, Ingo --
OK, not the buffering, but the infrastructure. That's not much of a difference to this topic. Which is about the ABI which includes all from Well, we need to separate out the buffer in perf regardless, since it is very entwined in the code. Does it now support flight recorder mode? -- Steve --
Question: Can we make perf lower what it records, thus speeding up the amount it records, without breaking the ABI? Can we add flight recorder mode splice version, non mmap, without breaking the ABI? If we can make perf as fast as ftrace in its recording, and maybe even faster if we have the ability to select what is recorded and compress the events, I'm all for it. -- Steve --
Yeah - that's a self-evident goal for just about any kernel code. Ingo --
