Hi, There seem to have been some churn regarding Perf problems with per-cpu memory allocation which uses vmalloc. Long story short: faulting NMIs reactivate NMIs faster than supposed, because x86 re-enables NMIs at the first iret encountered, which leads to nested NMIs. x86_32 cannot use vmalloc_sync_all() to sychronize the TLBs from every processes because the vmalloc area is mapped in a different address space for each process on this architecture. A second alternative is to duplicate the per-cpu allocation API to have a variant using kmalloc only. This would lead to code and API duplication and should probably be kept as last resort. A third solution to this problem is to make the page fault handler aware of NMIs and ensure it can be called from this context. This third solution is proposed by this patchset. So I'm respinning this patchset which has been sitting for a while, used for about 1-2 years in the LTTng tree without problems, already tested in a -tip sub-branch in the past. It uses a ret/popf instruction pair instead of iret when it detects that a trap handler is nested over an NMI. A second patch takes care of making the page fault handler nmi-safe by using the cr3 register rather than accessing ->current, which could be in the middle of being changed by a context switch. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Ok, please tell me where I am wrong then.. by looking into
arch/x86/mm/fault.c, I see that vmalloc_sync_all() touches pgd_list
entries while the pgd_lock spinlock is taken, with interrupts disabled.
So it's protected against concurrent pgd_list modification from
a - vmalloc_sync_all() on other CPUs
b - local interrupts
However, a completely normal interrupt can come on a remote CPU, run
vmalloc_fault() and issue a set_pgd concurrently. Therefore I conclude
this interrupt disable is not there to insure any kind of protection
against concurrent updates.
Also, we see that vmalloc_fault has comments such as :
(for x86_32)
* Do _not_ use "current" here. We might be inside
* an interrupt in the middle of a task switch..
So it takes the pgd_addr from cr3, not from current. Using only the
stack/registers makes this NMI-safe even if "current" is invalid when
the NMI comes. This is caused by the fact that __switch_to will update
the registers before updating current_task without disabling interrupts.
You are right in that x86_64 does not seems to play as safely as x86_32
on this matter; it uses current->mm. Probably it shouldn't assume
"current" is valid. Actually, I don't see where x86_64 disables
interrupts around __switch_to, so this would seem to be a race
condition. Or have I missed something ?
(Ingo)
hm, i guess it's still useful to keep the
__ARCH_WANT_INTERRUPTS_ON_CTXSW case working too. On -rt we used to
enable it to squeeze a tiny bit more latency out of the system.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: akpm@osdl.org
CC: mingo@elte.hu
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
---
arch/x86/mm/fault.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6-lttng/arch/x86/mm/fault.c
===================================================================
--- ...On Wed, Jul 14, 2010 at 8:49 AM, Mathieu Desnoyers
I agree that NMI handlers shouldn't touch vmalloc space. But now that
percpu data is mapped through the VM, I do agree that other CPU's may
potentially need to touch that data, and an interrupt (including an
NMI) might be the first to create the mapping.
And that's why the faulting code needs to be interrupt-safe for the
vmalloc area.
However, it does look like the current scheduler should make it safe
to access "current->mm->pgd" from regular interrupts, so the problem
is apparently only an NMI issue. So exactly what are the circumstances
that create and expose percpu data on a CPU _without_ mapping it on
that CPU?
IOW, I'm missing some background here. I agree that at least some
basic percpu data should generally be available for an NMI handler,
but at the same time I wonder how come that basic percpu data wasn't
already mapped?
The traditional irq vmalloc race was something like:
- one CPU does a "fork()", which copies the basic kernel mappings
- in another thread a driver does a vmalloc(), which creates a _new_
mapping that didn't get copied.
- later on a switch_to() switches to the newly forked process that
missed the vmalloc initialization
- we take an interrupt for the driver that needed the new vmalloc
space, but now it doesn't have it, and we fill it in at run-time for
the (rare) race.
and I'm simply not seeing how fork() could ever race with percpu data setup.
So please just document the sequence that actually needs the page
table setup for the NMI/percpu case.
This patch (1/2) doesn't look horrible per se. I have no problems with
it. I just want to understand why it is needed.
Linus
--
The problem originally addressed by this patch is the case where a NMI handler try to access vmalloc'd per-cpu data, which goes as follow: - One CPU does a fork(), which copies the basic kernel mappings. - Perf allocates percpu memory for buffer control data structures. This mapping does not get copied. - Tracing is activated. - switch_to() to the newly forked process which missed the new percpu allocation. - We take a NMI, which touches the vmalloc'd percpu memory in the Perf tracing handler, therefore leading to a page fault in NMI context. Here, we might be in the middle of switch_to(), where ->current might not be in sync with the current cr3 register. The three choices we have to handle this that I am aware of are: 1) supporting page faults in NMI context, which imply removing ->current dependency and supporting iret-less return path. 2) duplicating the percpu alloc API with a variant that maps to kmalloc. 3) using vmalloc_sync_all() after creating the mapping. (only works for x86_64, not x86_32). Choice 3 seems like a no-go on x86_32, choice 2 seems like a last-resort (involves API duplication and reservation of a fixed-amount of per-cpu memory at boot). Hence the proposal of choice 1. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
On Wed, Jul 14, 2010 at 10:06 AM, Mathieu Desnoyers
[ And patch 2/2 is much more intrusive, and touches a critical path
too.. If it was just the 1/2 series, I don't think I would care. For
Ok. I was wondering why anybody would allocate core percpu variables
so late that this would ever be an issue, but I guess perf is a
reasonable such case. And reasonable to do from NMI.
That said - grr. I really wish there was some other alternative than
adding yet more complexity to the exception return path. That "iret
re-enables NMI's unconditionally" thing annoys me.
In fact, I wonder if we couldn't just do a software NMI disable
instead? Hav ea per-cpu variable (in the _core_ percpu areas that get
allocated statically) that points to the NMI stack frame, and just
make the NMI code itself do something like
NMI entry:
- load percpu NMI stack frame pointer
- if non-zero we know we're nested, and should ignore this NMI:
- we're returning to kernel mode, so return immediately by using
"popf/ret", which also keeps NMI's disabled in the hardware until the
"real" NMI iret happens.
- before the popf/iret, use the NMI stack pointer to make the NMI
return stack be invalid and cause a fault
- set the NMI stack pointer to the current stack pointer
NMI exit (not the above "immediate exit because we nested"):
clear the percpu NMI stack pointer
Just do the iret.
Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
we'll take a fault, and that re-does our "delayed" NMI - and NMI's
will stay masked.
And if we didn't have a nested NMI, that iret will now unmask NMI's,
and everything is happy.
Doesn't the above sound like a good solution? In other words, we solve
the whole problem by simply _fixing_ the crazy Intel "iret-vs-NMI"
semantics. And we don't need to change the hotpath, and we'll just
_allow_ nested faults within NMI's.
Am I missing something? Maybe I'm not as clever as I think I am... But
I _feel_ clever.
Linus
--
Yeah. Frederic (re-)discovered this problem via very hard to debug crashes when he extended perf call-graph tracing to have a bit larger buffer and used Ok. We can solve it by allocating the space from the non-vmalloc percpu area - I think at this point [NMI re-entry] we've corrupted the top of the NMI kernel stack already, due to entering via the IST stack mechanism, which is non-nesting and which enters at the same point - right? We could solve that by copying that small stack frame off before entering the 'generic' NMI routine - but it all feels a bit pulled in by the hair. I feel uneasy about taking pagefaults from the NMI handler. Even if we implemented it all correctly, who knows what CPU erratas are waiting there to be discovered, etc ... I think we should try to muddle through by preventing these situations from happening (and adding a WARN_ONCE() to the vmalloc page-fault handler would certainly help as well), and only go to more clever schemes if no other option looks sane anymore? Thanks, Ingo --
Yeah, you're right, but we could easily fix that up. We know we don't
need any stack for the nested case, so all we would need to do is to
just subtract a small bit off %rsp, and copy the three words or so to
Why? It's much cleaner than making the _real_ codepaths much worse.
Linus
--
There is also the fact we need to handle the lost NMI, by defering its treatment or so. That adds even more complexity. --
On Wed, Jul 14, 2010 at 12:36 PM, Frederic Weisbecker I don't think your read my proposal very deeply. It already handles them by taking a fault on the iret of the first one (that's why we point to the stack frame - so that we can corrupt it and force a fault). Linus --
It only handles the case of a single NMI coming in. What happens in this
scenario?
- NMI (1) comes in.
- takes a fault
- iret
- NMI (2) comes in.
- nesting detected, popf/ret
- takes another fault
- NMI (3) comes in.
- nesting detected, popf/ret
- iret faults
- executes only one extra NMI handler
We miss NMI (3) here. I think this is an important change from a semantic where,
AFAIK, the hardware should be allowed to assume that the CPU will execute as
many nmi handlers are there are NMIs acknowledged.
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--
On Wed, Jul 14, 2010 at 1:17 PM, Mathieu Desnoyers [ two nested NMI's ] The _right_ thing happens. What do you think the hardware would have done itself? The NMI was blocked. It wouldn't get replayed twice. If you have two NMI's happening while another NMI is active, you will get a single NMI after the first NMI has completed. So stop these _idiotic_ complaints. And face the music: - NMI's aren't that important. They are a _hell_ of a lot less important than the standard page fault path, for example. - We do _not_ want to add more NMI magic outside of the NMI codepaths. It's much better to handle NMI special cases in the NMI code, rather than sprinkle them in random other codepaths (like percpu allocators) that have NOTHING WHAT-SO-EVER to do with NMI's! --
If it ever became an issue, we could even do what softirqs do and re-execute the NMI handler. At least for things like PMU NMIs we have to handle them once they have been (re-)issued, or we'd get a stuck PMU. But in any case it should be a non-issue. Ingo --
Hrm, Frederic, I hate to ask that but.. what are you doing with those percpu 8k data structures exactly ? :) Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
So, when an event triggers in perf, we sometimes want to capture the stacktrace
that led to the event.
We want this stacktrace (here we call that a callchain) to be recorded
locklessly. So we want this callchain buffer per cpu, with the following
type:
#define PERF_MAX_STACK_DEPTH 255
struct perf_callchain_entry {
__u64 nr;
__u64 ip[PERF_MAX_STACK_DEPTH];
};
That makes 2048 bytes. But per cpu is not enough for the callchain to be recorded
locklessly, we also need one buffer per context: task, softirq, hardirq, nmi, as
an event can trigger in any of these.
Since we disable preemption, none of these contexts can nest locally. In
fact hardirqs can nest but we just don't care about this corner case.
So, it makes 2048 * 4 = 8192 bytes. And that per cpu.
--
Ah OK, so you mean that perf now has 2 different ring buffer implementations ? How about using a single one that is generic enough to handle perf and ftrace needs instead ? (/me runs away quickly before the lightning strikes) ;) -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
That's no ring buffer. It's a temporary linear buffer to fill a stacktrace, and get its effective size before committing it to the real ring buffer. Sure that involves two copies. But I don't have a better solution in mind than using a pre-buffer for that, since we can't know the size of the stacktrace in advance. We could always reserve the max stacktrace size, but that would be wasteful. --
I was more thinking along the lines of making sure a ring buffer has the proper support for your use-case. It shares a lot of requirements with a standard ring buffer: - Need to be lock-less - Need to reserve space, write data in a buffer By configuring a ring buffer with 4k sub-buffer size (that's configurable dynamically), all we need to add is the ability to squash a previously saved record from the buffer. I am confident we can provide a clean API for this that would allow discard of previously committed entry as long as we are still on the same non-fully-committed sub-buffer. This fits your use-case exactly, so that's fine. You could have one 4k ring buffer per cpu per execution context. I wonder if each Linux architecture have support for separated thread vs softirtq vs irq vs nmi stacks ? Even then, given you have only one stack for all shared irqs, you need something that is concurrency-aware at the ring buffer level. These small stack-like ring buffers could be used to save your temporary stack copy. When you really need to save it to a larger ring buffer along with a trace, then you just take a snapshot of the stack ring buffers. So you get to use one single ring buffer synchronization and memory allocation mechanism, that everyone has reviewed. The advantage is that we would not be having this nmi race discussion in the first place: the generic ring buffer uses "get page" directly rather than relying on vmalloc, because these bugs have already been identified and dealt with years ago. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
squash? truncate you mean? So we can allocate/reserve the largest possible event size and write the actual event and then truncate to the actually used size? I really dislike how that will end up with huge holes in the buffer when you get nested events. Also, I think you're forgetting that doing the stack unwind is a very costly pointer chase, adding a simple linear copy really doesn't seem like a problem. Additionally, if you have multiple consumers you can simply copy the stacktrace again, avoiding the whole pointer chase exercise. While you could conceivably copy from one ringbuffer into another that will result Why would that be relevant? We can have NMI inside IRQ inside soft-IRQ inside task context in general (dismissing the nested IRQ mess). You don't need to have a separate stack for each context in order to nest OK, why? Your proposal includes the exact same extra copy but introduces That's like saying don't use percpu_alloc() but open-code the thing using kmalloc()/get_pages().. I really don't see any merit in that. --
This reluctance against splitting a buffer into sub-buffers might contribute to explain the poor performance experienced with the Perf ring buffer. These "sub-buffers" are really nothing new: these are called "periods" in the audio world. They help lowering the ring buffer performance overhead because: 1) They allow writing into the ring buffer without SMP-safe synchronization primitives and memory barriers for each record. Synchronization is only needed across sub-buffer boundaries, which amortizes the cost over a large number of events. 2) They are much more splice (and, in general, page-exchange) friendly, because records written after a synchronization point start at the beginning of a page. This removes the need for extra copies. So I have to ask: do you detest the sub-buffer concept only because you are tied to the current Perf userspace ABI which cannot support this without an ABI change ? I'm trying to help out here, but it does not make the task easy if we have both hands tied in our back because we have to keep backward ABI compatibility for a Nope. I'm thinking that we can use a buffer just to save the stack as we call functions and return, e.g. call X -> reserve space to save "X" and arguments. call Y -> same for Y. call Z -> same for Z. return -> discard event for Z. return -> discard event for Y. if we grab the buffer content at that point, then we have X and its arguments, which is the function currently executed. That would require the ability to uncommit and unreserve an event, which is not a problem as long as we have not I thought that this buffer was chasing the function entry/exits rather than doing a stack unwind, but I might be wrong. Perhaps Frederic could tell us more Assuming Frederic is saving information to this stack-like ring buffer at each function entry and discarding at each function return, then we don't have the pointer chase. What I am proposing does not even involve a copy: when we want to take a snapshot, we just ...
The only SMP barrier we (should) have is when we update the user visible
head pointer. The buffer code itself uses local{,64}_t for all other
atomic ops.
If you want to amortize that barrier, simply hold off the head update
This just doesn't make any sense at all, I could splice full pages just
fine, splice keeps page order so these synchronization points aren't
critical in any way.
The only problem I have with splice atm is that we don't have a buffer
interface without mmap() and we cannot splice pages out from under
Dude, its a published user<->kernel ABI, also you're not saying why you
would want to break it. In your other email you allude to things like
flight recorder mode, that could be done with the current set-up, no
need to break the ABI at all. All you need to do is track the tail
We don't have a callback on function entry, and I'm not going to use
Again, I'm not really seeing the point of using sub-buffers at all.
Also, what happens when we write an event after Y? Then the discard must
No, its a pure stack unwind from NMI context. When we get an event (PMI,
tracepoint, whatever) we write out event, if the consumer asked for a
And suppose the stack-trace was all of 16 entries (not uncommon for a
kernel stack), then you waste a whole page for 128 bytes (assuming your
sub-buffer is page sized). I'll take the memcopy, thank you.
--
To throw some hard numbers into the discussion, i found two random callgraph
perf.data's on my boxes (both created prior the start of this discussion) and
here is the distribution of their call-chain length:
aldebaran:~> perf report -D | grep 'chain: nr:' | cut -d: -f3- | sort -n | uniq -c
2 4
21 6
23 8
13 9
20 10
29 11
21 12
25 13
54 14
112 15
72 16
77 17
35 18
38 19
48 20
29 21
10 22
97 23
3 24
1 25
2 26
2 28
2 29
1 30
2 31
So the peak/average here is around 15 entries.
The other one:
phoenix:~> perf report -D | grep 'chain: nr:' | cut -d: -f3- | sort -n | uniq -c
1 2
70 3
222 4
112 5
116 6
329 7
241 8
163 9
203 10
287 11
159 12
4 13
6 14
22 15
2 16
11 17
5 18
Here the average is even lower - around 8 entries.
Thanks,
Ingo
--
Extracted from: http://lkml.org/lkml/2010/7/9/368 (executive summary) * Throughput * Flight recorder mode Ring Buffer Library 83 ns/entry (512kB sub-buffers, no reader) 89 ns/entry (512kB sub-buffers: read 0.3M entries/s) Ftrace Ring Buffer: 103 ns/entry (no reader) 187 ns/entry (read by event: read 0.4M entries/s) Perf record (flight recorder mode unavailable) * Discard mode Ring Buffer Library: 96 ns/entry discarded 257 ns/entry written (read: 2.8M entries/s) Perf Ring Buffer: 423 ns/entry written (read: 2.3M entries/s) (Note that this number is based on the perf event approximation output (based on a 24 bytes/entry estimation) rather than the benchmark module count due its inaccuracy, which is caused by perf not letting the benchmark module know about discarded events.) It is really hard to get a clear picture of the data write overhead with perf, because you _need_ to consume data. Making perf support flight recorder mode I understand your point about amortized synchronization. However I still don't see how you can achieve flight recorder mode, efficient seek on multi-GB traces without reading the whole event stream, and live streaming without sub-buffers If you need to read non-filled pages, then you need to splice pages piece-wise. This does not fit well with flight recorder tracing, for which the solution Steven and I have found is to atomically exchange pages (for Ftrace) or The problem Perf has is probably more with flight recorder (overwrite) tracing How do you plan to read the data concurrently with the writer overwriting the Given that this buffer is simply used to dump the stack unwind result then I So why the copy ? Frederic seems to put the stack unwind in a special temporary Well, now that I understand what you are trying to achieve, I retract my proposal of using a stack-like ring buffer ...
I don't consider reading while writing (in overwrite mode) a valid case. Because I don't want to support truncating reservations (because that leads to large nops for nested events) and when the event needs to go to multiple buffers you can re-use the stack-dump without having to do the unwind again. The problem with the continuation thing Linus suggested is that it would bloat the output 3 fold. A stack entry is a single u64. If you want to wrap that in a continuation event you need: a header (u64), a cookie (u64) and the entry (u64). Continuation events might make heaps of sense for larger data pieces, but I don't see them being practical for such small pieces. --
How inconvenient. It happens that the relatively large group of users I am working for do care for this use-case. They cannot afford to stop tracing as soon as they hit "one bug". This "bug" could be a simple odd scenario that they Agreed in this case. Truncating reservations might make sense for filtering, but even there I have a strong preference for filtering directly on the information received as parameter, before performing buffer space reservation, whenever Yep. What I did in a past life in earlier LTTng versions was to use a 2-pass unwind. The first pass is the most costly because it brings all the data into the L1 cache. This first pass is used to compute the array size you need to save the whole stack frame, but it does not copy anything. The second pass performs the copy. This was surprisingly efficient. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Snapshot is fine, just swivel the whole buffer. --
There is a very important trade-off between the amount of information that can be kept around in memory to take as snapshot and the amount of system memory reserved for buffers. The sub-buffer scheme is pretty good at that: the whole memory reserved (except the extra reader-owned sub-buffer) is available to save the flight recorder trace. With the multiple-buffer scheme you propose, only one of the buffers can be used to save data. This is very limiting, especially for embedded systems in telecom switches that does not have that much memory: all the memory reserved for the buffer that is currently inactive is simply wasted. It does not even allow the user to gather a longer snapshot. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
For example, would you like to read system audit log always after stop the audit? NO, that's a most important requirement for tracers, especially for system admins (they're the most important users of Linux) to check the system health and catch system troubles. For performance measurement and checking hotspot, one-shot tracing is enough. But it's just for developers. But for the real world computing, Linux is just an OS, users want to run their system, middleware and applications, without troubles. But when they hit a trouble, they wanna shoot it ASAP. The flight recorder mode is mainly for those users. Thank you, -- Masami HIRAMATSU 2nd Research Dept. Hitachi, Ltd., Systems Development Laboratory E-mail: masami.hiramatsu.pt@hitachi.com --
You cannot over-write and consistently read the buffer, that's plain impossible. With sub-buffers you can swivel a sub-buffer and consistently read that, but there is no guarantee the next sub-buffer you steal was indeed adjacent to the previous buffer you stole as that might have gotten over-written by the active writer while you were stealing the previous one. If you want to snapshot buffers, do that, simply swivel the whole trace buffer, and continue tracing in a new one, then consume the old trace in a consistent manner. I really see no value in being able to read unrelated bits and pieces of a buffer. So no, I will _not_ support reading an over-write buffer while there is an active reader. --
If you think it is impossible, then you should really go have a look at the generic ring buffer library, at LTTng and at Ftrace. It looks like we're all We don't care about taking the next adjascent sub-buffer. We care about always grabbing the oldest sub-buffer that has been written up to the currentmost So you need to allocate many trace buffers to accomplish the same and an extra layer on top that does this buffer exchange, I don't call that "simple". Note that only two trace buffers might not be enough if you have repeated failures in a short time window; the consumer might take some time to extract all these. Compared to that, the sub-buffer scheme only needs a single buffer with 2 (or more) sub-buffers, plus an extra sub-buffer owned by the reader that we exchange with the sub-buffer we want to grab for reading. The reader always grabs the sub-buffer with the oldest data into it. The number of sub-buffers used is the limit on the number of snapshots that can be taken in a relatively short time Within a sub-buffer, events are adjascent, and between sub-buffers, events are guaranteed to be in order (oldest to newest event). It is only in the case where buffers are relatively small compared to the data throughput that the writer can overwrite information that would have been useful for a snapshot (e.g. overwriting relatively recent information while the reader reads the oldest sub-buffer), but in that case users simply have to tune they buffer size (I guess you mean active writer) Here you argue that you don't need to support this feature at the ring buffer level because you can have a group of ring buffers that does it instead. How is your multiple-buffer scheme any simpler than sub-buffers ? Either you have to allocate many of them up front, or, if you want to do it on-demand, you have to perform memory allocation in NMI context. I don't see any of these two solutions as particularly appealing. Thanks, Mathieu -- Mathieu Desnoyers Operating System ...
Right, we cannot ensure that. In over-written mode, reader could lose some data, because of overwriting by writers. (or writer may fails to write new data on buffer in non-overwritten mode) However, I think that doesn't mean this mode is completely useless. If we can know when(where) the data was lost, the rest of data Hmm, would that consume much more memory compared with sub-buffer ring buffer if we have spare buffers? Or, if allocating it after reader opens buffer, will it also slow I think there is a trade-off of perfect snapshot and consuming memory, I hope you to reconsider how over-write buffer is useful even if it is far from perfect. Thank you, -- Masami HIRAMATSU 2nd Research Dept. Hitachi, Ltd., Systems Development Laboratory E-mail: masami.hiramatsu.pt@hitachi.com --
It all depends on the frequency on your events and on the amount of memory used for the buffer. If you are tracing syscalls in a semi-idle box with a ring buffer of 500 MB per cpu, you really don't care about the writer catching up the reader: it will simply not happen. OTOH if you are tracing function graphs, no buffer size will ever be enough: the writer will always be faster and catch up the reader. Using the sub-buffer scheme though, and allowing concurrent writer and reader in overwriting mode, we can easily tell the user about the writer beeing faster and content that have been lost. On top of these informations, the user can chose what to do: trying with a larger buffer or so. See? It's not our role to say: the result might be unreliable if the user does silly settings (not enough memory, reader too slow for random reasons, too high frequency events or so...). Let the user deal with that and just inform him about unreliable results. This is what ftrace does currently. Also the snapshot thing doesn't look like a replacement. If you are tracing on a low memory embedded system, you consume a lot of memory to keep the snapshot alive, it means the live buffer can be critically lowered and you might in turn lose traces there. That said it's an interesting feature that may fit on other kind of environments or for other needs. Off-topic: It's sad that about tracing, we often have to figure out the needs from embedded world, or learn from indirect sources. In the end we rarely know from them directly. Except may be in confs.... --
I'm not quite sure why. Is it something fundamental, or just an implementation issue? One thing that I think could easily make sense in a _lot_ of buffering areas is the notion of a "continuation" buffer. We know we have cases where we want to attach a lot of data to one particular event, but the buffering itself is inevitably always going to have some limits on atomicity etc. And quite often, the event that _generates_ the data is not necessarily going to have all that data in one contiguous region, and doing a scatter-gather memcpy to get it that way is not good either. At the same time, I do _not_ believe that the kernel ring-buffer code should handle pointers to sub-buffers etc, or worry about iovec-like arrays of smaller ranges. So if _that_ is what you mean by "concept of sub-buffers", then I agree with you. But what I do think might make a lot of sense is to allow buffer fragments, and just teach user space to do de-fragmentation. Where it would be important that the de-fragmentation really is all in user space, and not really ever visible to the ring-buffer implementation itself (and there would not, for example, be any guarantees that the fragments would be contiguous - there could be other events in the buffer in between fragments). Maybe we could even say that fragments might be across different CPU ring-buffers, and user-space needs to sort it out if it wants to (where "sort it out" literally would mean having to sort and re-attach them in the right order, since there wouldn't be any ordering between them). From a kernel perspective, the only thing you need for fragment handling would be to have a buffer entry that just says "I'm fragment number X of event ID Y". Nothing more. Everything else would be up to the parser in user space to work out. In other words - if you have something like the current situation, where you want to save a whole back-trace, INSTEAD of allocating a large max-sized buffer for it and "linearizing" the back-trace in order to ...
The real issue here, IMHO, is that Perf has tied gory ring buffer implementation details to the userspace perf ABI, and there is now strong unwillingness from Perf developers to break this ABI. About the sub-buffer definition: it only means that a buffer is splitted into many regions. Their boundary are synchronization points between the data producer and consumer. This involves padding the end of regions when records do not fit in the remaining space. I think that the problem lays in that Peter wants all his ring-buffer data to be side-to-side, without padding. He needs this because the perf ABI, presented to the user-space perf program, requires this: every implementation detail is exposed to user-space through a mmap'd memory region (yeah, even the control data is touched by both the kernel and userland through that shared page). When Perf has been initially proposed, I've thought that because the perf user-space tool is shipped along with the kernel sources, we could change the ABI easily afterward, but Peter seems to disagree and wants it to stay the as it is for backward compatibility and not offending contributors. If I had known this when the ABI first came in, I would have surely nack'd it. Now we are stucked with this ABI which exposes every tiny ring buffer implementation detail to userspace, which simply kills any future enhancement. Thanks, Mathieu P.S.: I'm holding back reply to the rest of your email to increase focus on the fundamental perf ABI problem. -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Looks useful. There's a steady trickle of new events and we already use type encapsulation for things like trace events - which are only made sense of later on in user-space. We may want to add things like a NOP event to pad out the end of page Thanks, Ingo --
/me once again experiences the subtle difference between 'Y' and 'N' when postponing a mail So adding fragments would be possible as well. We've got the space for such extensions in the ABI and the basic model of streaming information is not affected. [ The control structure of the mmap area is there for performance/wakeup optimizations (and to allow the kernel to lose information on producer overload, while still giving user-space an idea that we lost data and how much) - it does not affect semantics and does not limit us. ] So there's no design limitation - Peter simply prefers one possible solution over another and outlined his reasons - we should hash that out based on the technical arguments. Thanks, Ingo --
I am glad to hear this. So should I understand that if we show that the current perf ABI imposes significant design constraints and results in poor performance and inability to support flight recorder mode (which is needed to unify the ring buffers), we can deprecate the ABI ? Or simply write the page (or sub-buffer) size information in a page (or sub-buffer) header. The gain here is that by doing so we don't have to reserve an event ID for the NOP event, which adds one extra ID reserved in _each_ event header. You might be tempted to say "oh, it's just a single value, who cares ?", but with the amount of data we're moving, being able to represent the event header on a very small amount of bits really makes a difference. Bloat creeps in one single bit at a time until we start not caring about adding whole integers, and when we're there the game was over long ago: performance suffer deeply. The huge size of the perf event headers is another factor that might explain its poor performance by the way. I am doubtful about an "optimization" that affects what should be a slow path: user-space wakeup for delivering a multiple events at once. Have you checked if This can be performed with a standard system call rather than playing games with a shared pages into which both the kernel and user-space write. The advantage is that by letting user-space calling the kernel (rather than just writing "I'm done" in that page by updating the consumer value), we can let the kernel perform tasks that might enable us to implement flight recorder mode all Well, so far, the main limitation I can see is that it does not allow us to do Another argument I've seen from Peter is that he prefers the perf kernel-userspace interaction to happen through this shared page to diminish the number of traced events generated by perf activity. But I find this argument unconvincing, because it really only applies to system call tracing: the rest of tracing will be affected by the perf user-space process ...
On Tue, Aug 3, 2010 at 12:45 PM, Mathieu Desnoyers
The thing is - I think my outlined buffer fragmentation model would
work fine with the perf ABI too. Exactly because there is no deep
structure, just the same "stream of small events" both from a kernel
and a user model standpoint. Sure, the stream would now contain a new
event type, but that's trivial. It would still be _entirely_
reasonable to have the actual data in the exact same ring buffer,
including the whole mmap'ed area.
Of course, when user space actually parses it, user space would have
to eventually defragment the event by allocating a new area and
copying the fragments together in the right order, but that's pretty
trivial to do. It certainly doesn't affect the current mmap'ed
interface in the least.
Now, whether the perf people feel they want that kind of
functionality, I don't know. It's possible that they simply do not
want to handle events that are complex enough that they would have
arbitrary size.
Linus
--
Yes, indeed. Your scheme (using a "cookie" to identify multiple related events, each of them being the "continuation" of the previous event with the same cookie) would work on top of basically all ring buffers implementations. We already use something similar to follow socket buffers and block device buffers I agree. Although I think the scheme you propose can sit on top of the ring buffer and does not necessarily need to be at the bottom layer. The sub-buffer disagreement Peter and I have is related to the ring buffer core. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
The sub-buffer thing that both ftrace and lttng have is creating a large buffer from a lot of small buffers, I simply don't see the point of doing that. It adds complexity and limitations for very little gain. Their benefit is known synchronization points into the stream, you can parse each sub-buffer independently, but you can always break up a continuous stream into smaller parts or use a transport that includes index points or whatever. Their down side is that you can never have individual events larger than the sub-buffer, you need to be aware of the sub-buffer when reserving space etc.. --
The first major gain is the ability to implement flight recorder tracing (overwrite mode), which Perf still lacks. A second major gain: having these sub-buffers lets the trace analyzer seek in the trace very efficiently by allowing it to perform a binary search for time to find the appropriate sub-buffer. It becomes immensely useful with large traces. The third major gain: for live streaming of traces, having sub-buffer lets you "package" the event data you send over the network into sub-buffers. So the trace analyzer, receiving this information live while the trace is being recorded, can start using the information when the full sub-buffer is received. It does not have to play games with the last event (or event header) perhaps being incompletely sent, which imply that you absolutely _need_ to save the event size along with each event header (you cannot simply let the analyzer parse the event payload to determine the size). Here again, space wasted. Furthermore, this deals with information loss: a trace is still readable even if a sub-buffer must be discarded. Making sure events don't cross sub-buffer boundaries simplify a lot of things, starting with dealing with "overwritten" sub-buffers in flight recorder mode. I understand that you could perform amortized synchronization without sub-buffers. I however don't see how flight recorder, efficient seek on multi-GB traces (without reading the whole event stream), and live streaming can be True. But with configurable sub-buffer size (can be from 4kB to many MB), I Only the ring buffer needs to be aware of that. It returns an error if the event is larger than the sub-buffer size. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
http://lkml.org/lkml/2009/7/6/178 I've send out something like that several times, but nobody took it (that is, tested it and provided a user). Note how it doesn't require You can add sync events with a specific magic cookie in. Once you find the cookie you can sync and start reading it reliably -- the advantage is that sync events are very easy to have as an option and don't See the sync events. Also, a transport can rewrite the stream any which way it pretty well wants to, as long as the kernel<->user interface is reliable an unreliable user<->user transport can repackage it to suit See the above patch, simply parse the events and push the tail pointer ahead of the reservation before you trample on it. If you worry about the cost of parsing the events, you can amortize that by things like keeping the offset of the first event in every page in the pageframe, or the offset of the next sync event or whatever scheme you want. Again, no need for sub-buffers. Also, not having sub-buffers makes reservation easier since you don't need to worry about those empty tails. --
+static void perf_output_tail(struct perf_mmap_data *data, unsigned int head)
...
+ unsigned long tail, new;
...
+ unsigned long size;
+ while (tail + size - head < 0) {
....
+ }
How is the while condition ever be supposed to be true ? I guess nobody took it
You need to read the whole trace to find these cookies (even if it is just once
at the beginning if you create an index). My experience with users have shown me
that the delay between stopping trace gathering having the data shown to the
user is very important, because this is repeatedly done while debugging a
Perf, on its reserve/commit fast paths:
perf_output_begin: 543 bytes
(perf_output_get_handle is inlined)
perf_output_put_handle: 201 bytes
perf_output_end: 77 bytes
calls perf_output_put_handle
Total for perf: 821 bytes
Generic Ring Buffer Library reserve/commit fast paths:
Reserve: 511 bytes
Commit: 266 bytes
Total for Generic Ring Buffer: 777 bytes
So the generic ring buffer is not only faster and supports sub-buffers (along
with all the nice features this brings); its reserve and commit hot paths
I am guessing you plan to rely on these sync events to know which data "blocs"
I'm not sure that patch is ready for prime-time yet. As you point out in your
following email, you need to stop tracing to consume data, which does not fit my
Hrm ? AFAIK, the page-frame is an internal kernel-only data structure. That
won't be exported to user-space, so how is the parser supposed to see this
So far I've shown that you sub-buffer-less implementation is not even simpler
than a implementation using sub-buffers.
By the way, even with your sub-buffer free scheme, you cannot write an event
bigger than your buffer size. So you have a likewise limitation in terms of
maximum event size (so you already have to test this on your fast path).
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System ...I know, I never claimed it was, it was always an illustration of how to Depends on what you want to do, you can start reading at any point in the stream and be guaranteed to find a sync point within sync-distance Yeah, because after having had to wait for 36h for the problem to trigger that extra minute really kills. All I can say is that in my experience brain throughput is the limiting All I can say is that less code doesn't equal less complex (nor faster per-se). Nor have I spend all my time on writing the ring-buffer, there's more interesting things to do. And the last time I ran perf on perf, the buffer wasn't the thing that was taking most time. And unlike what you claim below, it most certainly can deal with events Its about the kernel parsing the buffer to push the tail ahead of the reserve window, so that you have a reliable point to start reading the trace from -- or didn't you actually get the intent of that patch? --
Even if you want to index all sync points you can quickly skip through the file using the sync-distance, after which you'll have, on average, only 1/2 avg-event-size to read before you find your next sync point. So suppose you have a 1M sync-distance, and an effective average event size of 128 bytes, then for a 4G file, you can find all sync points by only reading ~262144 bytes (not counting for the fact that the pagecache will bring in full pages, which would result in something like 16M to be read in total or somesuch -- which, again assumes read-ahead isn't going to play tricks on you). --
How do you distinguish between sync events and random payload data ? Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
At _any_ point in the stream ? So if I take, let's say, a few kB of Perf ring buffer data and I choose to encode than into an event into another buffer (e.g. we're tracing part of the network traffic). Then we end up in a situation where the event payload will contain your "so special" sync point data. Basically, you have no guarantee that you won't mix up standard event data and your synchronization event headers. Here I have to bring up the fact that Linux kernel developers are not the only tracer users. Traces of multi-GB can be generated easily within a few seconds/minutes on many workloads, so we're not talking of many-hours-traces here. But if we need to read the whole trace before it can be shown, we're adding a significant delay before the trace can be accessed. In my experience, both brain and data gathering throughputs are limiting factors to debugging. Drawing fancy pictures merely helps speeding up the brain process Less code = less instruction cache overhead. I've also shown that the LTTng code is at least twice faster. In terms of complexity, it is not much more complex; I also took the extra care of doing the formal proofs to make sure the corner-cases were dealt with, which I don't reckon neither Steven nor yourself I must admit that I probably spent much more time working on the ring buffer than you did. It looks like one's interest can only focus on so many areas at once. So if you are not that interested in ring buffers, can you at least stop opposing to people who care deeply ? If we agree that we don't care about the same use-cases, there might be room for many ring buffers in the kernel. It's just a shame that we have to multiply amount of code-review. We have to note that this goes against Linus' request for Very interesting. I know the trace clock performance are terrible too. But let's What I said below was: perf cannot write events larger than its buffer size. So it already has to take that "test" branch for maximum event size. I said ...
Yes Mathieu, you did a formal proof. Good for you. But honestly, it is starting to get very annoying to hear you constantly stating that, because, to most kernel developers, it is meaningless. Any slight modification of your algorithm, renders the proof invalid. You are not the only one that has done a proof to an algorithm in the kernel, but you are definitely the only one that constantly reminds people that you have done so. Congrats on your PhD, and in academia, proofs are important. But this is a ring buffer, not a critical part of the workings of the kernel. There are much more critical and fragile parts of the kernel that work fine without a formal proof. Paul McKenney did a proof for RCU not for us, but just to help give him a warm fuzzy about it. RCU is much more complex than the ftrace ring buffer, and it also is much more critical. If Paul gets it wrong, a machine will crash. He's right to worry. And even Paul told me that no formal proof makes up for large scale testing. Which BTW, the ftrace ring buffer has gone through. Someday I may go ahead and do that proof, but I did do a very intensive state diagram, and I'm quite confident that it works. It's been deployed for quite a bit, and the design has yet to be a factor in any bug report of the ring buffer. -- Steve --
Egad! Go on vacation and the world falls apart. So, I want to allocate a 10Meg buffer. I need to make sure the kernel has 10megs of memory available. If the memory is quite fragmented, then too bad, I lose out. Oh wait, I could also use vmalloc. But then again, now I'm blasting valuable TLB entries for a tracing utility, thus making the tracer have a even bigger impact on the entire system. BAH! I originally wanted to go with the continuous buffer, but I was convinced after trying to implement it, that it was a bad choice. Specifically, because of needing to 1) get large amounts of memory that is continuous, or 2) eating up TLB entries and causing the system to perform poorer. I chose page size "sub-buffers" to solve the above. It also made implementing splice trivial. OK, I admit, I never thought about mmapping the buffers, just because I figured splice was faster. But I do have patches that allow a user to mmap the entire ring buffer, but only in a "producer/consumer" mode. Note, I use page size sub-buffers, but the design could work with any size sub-buffers. I just never implemented that (even though, when I The answer to that is to make a macro to do the assignment of the event, and add a new API. event = ring_buffer_reserve_unlimited(); ring_buffer_assign(event, data1); ring_buffer_assign(event, data2); ring_buffer_commit(event); The ring_buffer_reserve_unlimited() could reserve a bunch of space beyond one ring buffer. It could reserve data in fragments. Then the ring_buffer_assgin() could either copy directly to the event (if the event exists on one sub buffer) or do a copy the space was fragmented. Of course, userspace would need to know how to read it. And it can get complex due to interrupts coming in and also reserving between fragments, or what happens if a partial fragment is overwritten. But all these are not impossible to solve. -- Steve --
FYI: the generic ring buffer also implements the mmap() interface for the flight The main difference between our designs is that Ftrace use a linked list and the generic ring buffer lib. uses a sub-buffer/page table. Considering the use-case of reading available flight recorder pages in reverse order I've hear about at LinuxCon (heard about it from both from Peter and Masami, and it actually makes a whole lot of sense, because the data we care about the most and want to read ASAP is the last subbuffers), I think the page table is more appropriate (and flexible) than a single-direction linked list, because it allows to pick a Dealing with fragmentation, sub-buffer loss, etc. is then pushed up to the trace analyzer. While I agree that we have to keep the burden of complexity out of the kernel as much as possible, I also think that an elegant design at the data producer level which keeps the trace reader/analyzer simple and reliable should be favored if it keeps a similar level of complexity in the kernel code. A good argument supporting this is that some tracing users want to use a mmap() scheme directly on the trace buffers to analyze the data directly in user-space on the traced machine. In these cases, the complexity/overhead added to the analyzer will impact the overall performance of the system being traced. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
With memory compaction, the cpu churns for a while, then you have your buffer. Of course there's still no guarantee, just a significantly Most trace entries will occupy much less than a page, and are accessed sequentially, so I don't think this will have a large impact. -- error compiling committee.c: too many arguments to function --
The bigger the buffers, the lower the probabilities of success are. My users often allocate buffers as large as a few GB per cpu. Relying on compaction does You seem to underestimate the frequency at which trace events can be generated. E.g., by the time you run the scheduler once (which we can consider a very hot kernel path), some tracing modes will generate thousands of events, which will touch a very significant amount of TLB entries. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Wow. Even if you could compact that much memory, it would take quite a Let's say a trace entry occupies 40 bytes and a TLB miss costs 200 cycles on average. So we have 100 entries per page costing 200 cycles; amortized each entry costs 2 cycles. There's an additional cost caused by the need to re-fill the TLB later, but you incur that anyway if the scheduler caused a context switch. Of course, my assumptions may be completely off (likely larger entries but smaller miss costs). Has a vmalloc based implementation been tested? It seems so much easier than the other alternatives. -- error compiling committee.c: too many arguments to function --
A quick test (shown below) gives the cost of a TLB miss on the Intel Xeon E5404: Number of cycles added over test baseline: tlb and cache hit: 12.42 tlb hit, l2 hit, l1 miss 17.88 tlb hit,l2+l1 miss 32.34 tlb and cache miss 449.58 So it's closer to 500 per tlb miss. Also, your analysis does not seem to correctly represent reality of the TLB trashing cost. On a workload walking over a large number of random pages (e.g. a large hash table) all the time, eating just a few more TLB entries will impact the number of misses over the entire workload. So it's not much the misses that we see at the tracing site that is the problem, but also the extra misses taken by the application caused by the extra pressure on TLB. So just a few more TLB entries taken by the tracer will likely hurt The performance hit is not taken if the scheduler schedules another thread with Depending on the tracer design, the avg. event size can range from 12 bytes (lttng is very agressive in event size compaction) to about 40 bytes (perf); so for this you are mostly right. However, as explained above, the TLB miss cost is I tested it in the past, and must admit that I changed from a vmalloc-based implementation to page-based using software cross-page write primitives based on feedback from Steven and Ingo. Diminishing TLB trashing seemed like a good approach, and using vmalloc on 32-bit machines is a pain, because users have to tweak the vmalloc region size at boot. So all in all, I moved to a vmalloc-less implementation without much more thought. If you feel we should test the performance of both approaches, we could do it in the generic ring buffer library (it allows both type of allocation backends). However, we'd have to find the right type of TLB-trashing real-world workload to have meaningful results. This might be the hardest part. Thanks, Mathieu # tlbmiss.c #include <sys/time.h> #include <time.h> #include <stdio.h> #include ...
The cache miss would not be avoided if the TLB was hit, so it should not be accounted as part of the costs (though a TLB miss will increase cache pressure). Also, your test does not allow the cpu to pipeline anything; in reality, different workloads have different TLB miss costs: - random reads (pointer chasing) incur almost the full impact since the processor is stalled - sequential writes can be completely pipelined and suffer almost no impact Let's say this doubles the impact. So 10 cycles per trace entry. Will I really think this should be benchmarked. If the user workload thrashes the TLB, it should use huge pages itself, that will make it immune from kernel TLB thrashing and give it a nice For the vmalloc area hit, it's lower. For the user application, it may specJBB is a well known TLB intensive workload, known to benefit greatly from large pages. <snip test> For a similar test see http://people.redhat.com/akivity/largepage.c. -- error compiling committee.c: too many arguments to function --
Forgot to comment about the i386 issue - that really is a blocker if you absolutely need to support large trace buffers on 32-bit machines. I would urge all those people to move to x86_64 and be done with it, but I don't know all the use cases. It's possible to hack this to work by having a private mm_struct and switching to it temporarily, but it will be horribly slow. -- error compiling committee.c: too many arguments to function --
Heh. For a moment there I thought you were describing the the way XFS writes transactions into it's log. Replace "CPU ring-buffers" with "in-core log buffers", "userspace parsing" with "log recovery" and "event ID" with "transaction ID", and the concept you describe is eerily similar. That includes the fact that transactions are not contiguous in the log, can interleave fragments between concurrent transaction commits and they can span multiple log buffers, too. It works pretty well for scaling concurrent writers.... I'll get back in my box now ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com --
That's certainly a good model when you have to stream into a persistent-storage transaction log space with multiple writers. The difference is that with instrumentation we are generally able to make things per task or per cpu so there's no real multi-CPU 'concurrent writers' concurrency. You dont have that luxory/simplicity when logging to storage, of course! Thanks, Ingo --
[ /me removes the duplicate email of himself! ] To be fair, that's just a temp buffer. -- Steve (/me sends this to try to remove the dup email he's getting ) --
On Wed, Jul 14, 2010 at 12:14 PM, Linus Torvalds
.. but if the option is to never take a fault at all from the NMI
handler, and that is doable, than that would be good, of course.
But that may not be fixable. At least not without much more pain than
just adding a fairly simple hack to the NMI path itself, and keeping
all the NMI pain away from all the other cases.
And doing the per-cpu NMI nesting hack would actually also work as a
way to essentially block NMI's from critical regions. With my NMI
nestign avoidance suggestion, you could literally do something like
this to block NMI's:
/* This is just a fake stack structure */
struct nmi_blocker {
unsigned long rflags;
unsigned long cs;
unsigned long rip;
};
void block_nmi_on_this_cpu(struct nmi_blocker *blocker)
{
get_cpu();
memset(blocker, 0, sizeof(*blocker));
per_cpu_nmi_stack_frame = blocker;
}
void unblock_nmi_on_this_cpu(struct nmi_blocker *blocker)
{
per_cpu_nmi_stack_frame = NULL;
barrier();
/* Did an NMI happen? If so, we're now running NMI-blocked by hardware,
* we need to emulate the NMI and do a real 'iret' here
*/
if (blocker->cs == INVALID_CS)
asm volatile(".. build stack frame, call NMI routine ..");
put_cpu();
}
or similar. Wouldn't that be nice to have as a capability?
Linus
--
> or similar. Wouldn't that be nice to have as a capability? It means the NMI watchdog would get useless if these areas become common. Again I suspect all of this is not really needed anyways if vmalloc_sync_all() works properly. That would solve the original problem Mathieu was trying to solve for per_cpu data. The rule would be just to call vmalloc_sync_all() properly when changing per CPU data too. In fact I'm pretty sure it worked originally. Perhaps it regressed? -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Yep, that would solve the page fault in nmi problem altogether without adding I'd first ask the obvious to Perf authors: does perf issue vmalloc_sync_all() between percpu data allocation and tracing activation ? The generic ring buffer library I posted last week does it already as a precaution for this very specific reason (making sure NMIs never trigger page faults). Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
I suspect the low level per cpu allocation functions should just call it. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Yes, specifically the point at which we allocate new per cpu memory blocks. -hpa --
Hello, We can definitely do that but walking whole page table list is too heavy to do automatically at that level especially when all users other than NMI would be fine w/ the default lazy approach. If Linus' approach doesn't pan out, I think the right thing to do would be adding a wrapper to vmalloc_sync_all() and let perf code call it after percpu allocation. Thanks. -- tejun --
Ok, I should try. Until now I didn't because I clearly misunderstand the vmalloc internals. I'm not even quite sure why a memory allocated with vmalloc sometimes can be not mapped (and then fault once for this to sync). Some people have tried to explain me but the picture is still vague to me. And moreover, I'm not quite sure whether vmalloc_sync_all() syncs the pgd for every tasks or so... Tejun seemed to say it's not necessary the case in x86-32... Again I think I haven't totally understood the details. Thanks. --
So the issue is that the system can have thousands and thousands of
page tables (one for each process), and what do you do when you add a
new kernel virtual mapping?
You can:
- make sure that you only ever use _one_ single top-level entry for
all vmalloc issues, and can make sure that all processes are created
with that static entry filled in. This is optimal, but it just doesn't
work on all architectures (eg on 32-bit x86, it would limit the
vmalloc space to 4MB in non-PAE, whatever)
- at vmalloc time, when adding a new page directory entry, walk all
the tens of thousands of existing page tables under a lock that
guarantees that we don't add any new ones (ie it will lock out fork())
and add the required pgd entry to them.
- or just take the fault and do the "fill the page tables" on demand.
Quite frankly, most of the time it's probably better to make that last
choice (unless your hardware makes it easy to make the first choice,
which is obviously simplest for everybody). It makes it _much_ cheaper
to do vmalloc. It also avoids that nasty latency issue. And it's just
simpler too, and has no interesting locking issues with how/when you
expose the page tables in fork() etc.
So the only downside is that you do end up taking a fault in the
(rare) case where you have a newly created task that didn't get an
even newer vmalloc entry. And that fault can sometimes be in an
interrupt or an NMI. Normally it's trivial to handle that fairly
simple nested fault. But NMI has that inconvenient "iret unblocks
NMI's, because there is no dedicated 'nmiret' instruction" problem on
x86.
Linus
--
Adding new PGDs should happen only very rarely (in fact at most once per boot on i386-PAE36 with only 4 entries, most used by user space), most of the time when you do a vmalloc it changes only lower level tables. The PGD for the kernel mappings is already set up. On x86-64 it can happen more often in theory, but in practice it should be also extremly rare because a PGD is so large. That's why I'm not sure this problem even happened. It should be extremly rare that you exactly add that PGD during the per cpu allocation. It can happen in theory, but for such a rare case take a lock and walking everything should be fine. -Andi --
Actually, that's _exactly_ the wrong kind of thinking.
Bad latency is bad latency, even when it happens rarely. So latency
problems kill - even when they are rare. So you want to avoid them.
And walking every possible page table is a _huge_ latency problem when
it happens.
In contrast, what's the advantage of doing thigns synchronously while
holding a lock? It's that you can avoid a few page faults, and get
better CPU use. But that's _stupid_ if it's something that is very
rare to begin with.
So the very rarity argues for the lazy approach. If it wasn't rare,
there would be a much stronger argument for trying to do things
up-front.
Linus
--
But then, even if you ensure that, don't we need to also fill lower level entries for the new mapping. Also, why is this a worry for vmalloc but not for kmalloc? Don't we also But then how did the previous tasks get this new mapping? You said we don't walk through every process page tables for vmalloc. I would understand this race if we were to walk on every processes page tables and add the new mapping on them, but we missed one new task that Yeah. So the parts of the problem I don't understand are: - why don't we have this problem with kmalloc() ? - did I understand well the race that makes the fault necessary, ie: we walk the tasklist lockless, add the new mapping if not present, but we might miss a task lately forked, but the fault will fix that. Thanks. --
No because those are always shared for the kernel and have been filled in for init_mm. Also most updates only update the lower tables anyways, top level updates are extremly rare. In fact on PAE36 they should only happen at most once, if at all, and most likely at early boot anyways where you only have a single task. On x86-64 they will only happen once every 512GB of vmalloc. The new task will always get a copy of the reference init_mm, which was already updated. -Andi --
Ok, got it. But then, in the example here with perf, I'm allocating 8192 bytes per cpu and my total memory amount is of 2 GB. And it always fault at least once on access, after the allocation. I really doubt it's because we are adding a new top level page table, considering the amount of memory I have. It seems to me that the mapping of a newly allocated vmalloc area is always inserted through the lazy way (update on fault). Or there is something I'm missing. Thanks. --
If I understand your question, you do not need to worry about the lower
level entries because all the processes will share the same top level.
process 1's GPD ------,
|
+------> PMD --> ...
|
process 2' GPD -------'
Thus we have one page entry shared by all processes. The issue happens
when the vm space crosses the PMD boundary and we need to update all the
GPD's of all processes to point to the new PMD we need to add to handle
Because all of memory (well 800 some megs on 32 bit) is mapped into
memory for all processes. That is, kmalloc only uses this memory (as
does get_free_page()). All processes have a PMD (or PUD, whatever) that
maps this memory. The issues only arises when we use new virtual memory,
which vmalloc does. Vmalloc may map to physical memory that is already
mapped to all processes but the address that the vmalloc uses to access
that memory is not yet mapped.
The usual reason the kernel uses vmalloc is to get a contiguous range of
memory. The vmalloc can map several pages as one contiguous piece of
memory that in reality is several different pages scattered around
physical memory. kmalloc can only map pages that are contiguous in
physical memory. That is, if kmalloc gets 8192 bytes on an arch with
4096 byte pages, it will allocate two consecutive pages in physical
memory. If two contiguous pages are not available even if thousand of
single pages are, the kmalloc will fail, where as the vmalloc will not.
An allocation of vmalloc can use two different pages and just map the
page table to make them contiguous in view of the kernel. Note, this
comes at a cost. One is when we do this, we suffer the case where we
need to update a bunch of page tables. The other is that we must waste
TLB entries to point to these separate pages. Kmalloc and
get_free_page() uses the big memory mappings. That is, if the TLB allows
us to map large pages, we can do that for kernel memory since we just
want the ...Oh right. We point to that PMD, and the update has been made itself inside Totally! You've made it very clear to me. Moreover I did not know we can have such variable page size. I mean I thought Yeah :) Thanks a lot for your explanations! --
In x86_64, if bit 7 in the PDE (Page Directory Entry) is set then it points to a 2 Meg page. Otherwise it points to a page table which will have 512 PTE's pointing to 4K pages. Download: http://support.amd.com/us/Processor_TechDocs/24593.pdf It has nice diagrams that explains this. Check out page 207 (fig 5-17) and 210 (fig 5-22). The phys_pmd_init() in arch/x86/mm/init_64.c will try to map memory using 2M pages if it can, otherwise it falls back to 4K pages. -- Steve --
Yes, but now they are all mapped by the one *shared* top-level entry. Think about it. [ Time passes ] End result: if you can map the whole vmalloc area with a single top-level entry that is shared by all processes, and can then just fill in the lower levels when doing actual allocations, it means that all processes will automatically get the entries added, and do not need any fixups. In other words, the page tables will be automatically correct and filled in for everybody - without having to traverse any lists, without any extra locking, and without any races. So this is efficient and simple, and never needs any faulting to fill in page tables later on. (Side note: "single top-level entry" could equally well be "multiple preallocated entries covering the whole region": the important part is not really the "single entry", but the "preallocated and filled into No. The kmalloc space is all in the 1:1 kernel mapping, and is always mapped. Even with PAGEALLOC_DEBUG, it's always mapped at the top level, and even if a particular page is unmapped/remapped for debugging, it is done so in the shared kernel page tables (which ends up being the above trivial case - there is just a single set of page We always add the mapping to the "init_mm" page tables when it is created (just a single mm), and when fork creates a new page table, it will always copy the kernel mapping parts from the old one. So the _common_ case is that all normal mappings are already set up in page tables, including newly created page tables. The uncommon case is when there is a new page table created _and_ a new vmalloc mapping, and the race that happens between those events. Whent hat new page table is then later used (and it can be _much_ later, of course: we're now talking process scheduling, so IO delays etc are relevant), it won't necessarily have the page table entries for vmalloc stuff that was created since the page tables were created. So we fill _those_ in dynamically. But vmalloc mappings ...
On Thu, Jul 15, 2010 at 7:51 AM, Linus Torvalds
Btw, this is true to the degree that I would _much_ rather just get
rid of the crazy "vmalloc_sync_all()" crap entirely, and make it clear
that non-lazy vmalloc page table fill is a bug.
Because quite frankly, it _is_ a bug to depend on non-lazy vmalloc.
The whole function is only implemented on 32-bit x86, so any code that
thinks it needs it is either just wrong, or will only work on 32-bit
x86 anyway (and on other architectures by pure chance, likely because
their VM fill granularity is so big that they never saw the problem).
So getting rid of vmalloc_sync_all() entirely would be a good thing.
Then we wouldn't have that silly and pointless interface, and we
wouldn't have that crazy "this only does something on x86-32,
everywhere else it's a placebo".
Linus
--
Such new page table created that might race is only about top level page right? Otherwise it wouldn't race since the top level entries are shared and then updates inside lower level pages are naturally propagated, if I understood you well. So, if only top level pages that gets added can generate such lazily mapping update, I wonder why I experienced this fault everytime with my patches. I allocated 8192 bytes per cpu in a x86-32 system that has only 2 GB. I doubt there is a top level page table update there at this time with such a small amount of available memory. But still it faults once on access. Yeah, agreed. But I'm still confused about when exactly we need to fault Yeah agreed. Thanks for your explanations! --
A few trace_printks and a tracing_off() on fault would probably show exactly what was happening ;-) -- Steve --
* Linus Torvalds (torvalds@linux-foundation.org) wrote Let's try to figure out how far we can go with this idea. First, to answer Ingo's critic, let's assume we do a stack frame copy before entering the Maybe incrementing a per-cpu missed NMIs count could be appropriate here so we I assume you mean "popf/ret" here. So assuming we use a frame copy, we should change the nmi stack pointer in the nesting 0 nmi stack copy, so the nesting 0 That would mean bringing back the NMI stack pointer to the (nesting - 1) nmi This would be rather: - Copy the nesting 0 stack copy back onto the real nmi stack. - clear the percpu nmi stack pointer Which presumably faults if we changed the return stack for an invalid one and executes as many NMIs as there are "missed nmis" counted (missed nmis should probably be read with an xchg() instruction). So, one question persists, regarding the "** !" comment: what do we do if an NMI comes in exactly at that point ? I'm afraid it will overwrite the "real" nmi stack, and therefore drop all the "pending" nmis by setting the nmi stack return address to a valid one. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
On Wed, Jul 14, 2010 at 1:39 PM, Mathieu Desnoyers
No. As mentioned, there is no such counter in real hardware either.
Look at what happens for the not-nested case:
- NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
NMI's disabled
- NMI2 triggers. Nothing happens, the NMI's are disabled.
- NMI3 triggers. Again, nothing happens, the NMI's are still disabled
- the NMI handler returns.
- What happens now?
How many NMI interrupts do you get? ONE. Exactly like my "emulate it
in software" approach. The hardware doesn't have any counters for
Yes, that was as typo. The whole point of using popf was obviously to
I think you're confused. Or I am by your question.
The NMI code would literally just do:
- check if the NMI was nested, by looking at whether the percpu
nmi-stack-pointer is non-NULL
- if it was nested, do nothing, an return with a popf/ret. The only
stack this sequence might needs is to save/restore the register that
we use for the percpu value (although maybe we can just co a "cmpl
$0,%_percpu_seg:nmi_stack_ptr" and not even need that), and it's
atomic because at this point we know that NMI's are disabled (we've
not _yet_ taken any nested faults)
- if it's a regular (non-nesting) NMI, we'd basically do
6* pushq 48(%rsp)
to copy the five words that the NMI pushed (ss/esp/eflags/cs/eip)
and the one we saved ourselves (if we needed any, maybe we can make do
with just 5 words).
- then we just save that new stack pointer to the percpu thing with a simple
movq %rsp,%__percpu_seg:nmi_stack_ptr
and we're all done. The final "iret" will do the right thing (either
fault or return), and there are no races that I can see exactly
because we use a single nmi-atomic instruction (the "iret" itself) to
either re-enable NMI's _or_ test whether we should re-do an NMI.
There is a single-instruction window that is interestign in the return
path, which is the window between the two final instructions:
movl ...The NMI latch records the second NMI. Note this is edge-sensitive like NMI2 latched above causes the NMI handler to be invoked as the next Two. :) Maciej --
You just count differently. I don't count the first one (the "real"
NMI). That obviously happens. So I only count how many interrupts we
need to fake. That's my "one". That's the one that happens as a result
of the fault that we take on the iret in the emulated model.
So there is no need to count anything. We take a fault on the iret if
we got a nested NMI (regardless of how _many_ such nested NMI's we
took). That's the "latch", exactly like in the hardware. No counter.
(Yeah, yeah, you can call it a "one-bit counter", but I don't think
that's a counter. It's just a bit of information).
Linus
--
Hardware has something like a strapped-high D flip-flop (NMI goes to the clock input) with an extra reset input I presume -- this dates back to 8086 when the transistor count mattered with accuracy higher than 1e6. ;) Maciej --
So I figure that given Maciej's response, we can get at most 2 nested nmis, no more, no less. So I was probably going too far with the counter, but we need 2. However, failure to deliver the second NMI in this case would not match the Ah, right, you only need to do the copy and use the copy for the nesting level 0 NMI handler. The nested NMI can work on the "real" nmi stack because we never Yes, this was this exact instruction window I was worried about. I see another possible failure mode: - NMI - page fault - iret - NMI - set nmi_stack_ptr to 0, popf/lret. - page fault (yep, another one!) - iret - movl $0,%__percpu_seg:nmi_stack_ptr - iret So in this case, movl/iret are executed with NMIs enabled. So if an NMI comes in after the movl instruction, it will not detect that it is nested and will re-use the percpu "nmi_stack_ptr" stack, squashing the "faulty" stack ptr with a brand new one which won't trigger a fault. I'm afraid that in this case, the last NMI handler will iret to the "nesting 0" handler at the iret instruction, which will in turn return to itself, breaking all hell loose in the meantime (endless iret loop). So this also calls for special-casing an NMI nested on top of the following iret - movl $0,%__percpu_seg:nmi_stack_ptr - iret <----- instruction. At the beginning of the NMI handler, we could detect if we are either nested over an NMI (checking nmi_stack_ptr != NULL) or if we are at this If we can find a clean way to handle this NMI vs iret problem outside of the entry_*.S code, within NMI-specific code, I'm indeed all for it. entry_*.s is already complicated enough as it is. I think checking the %rip at NMI entry could work out. Thanks! Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
On Wed, Jul 14, 2010 at 3:21 PM, Mathieu Desnoyers
I think the %rip check should be pretty simple - exactly because there
is only a single point where the race is open between that 'mov' and
the 'iret'. So it's simpler than the (similar) thing we do for
debug/nmi stack fixup for sysenter that has to check a range.
The only worry is if that crazy paravirt code wants to paravirtualize
the iretq. Afaik, paravirt does that exactly because they screw up
iret handling themselves. Maybe we could stop doing that stupid iretq
paravirtualization, and just tell the paravirt people to do the same
thing I propose, and just allow nesting.
Linus
--
We screw around with iret because there's a separate interrupt mask flag
which can't be set atomically with respect to a stack/ring change (well,
there's more to it, but I won't confuse matters).
But it only really matters if the PV guest can also get NMI-like
interrupts. While Xen supports NMI for PV guests, we don't use it much
and I haven't looked into implementing support for it yet.
J
--
Umm, I know. It's what this whole discussion (non-paravirtualized) is
all about. And I have a suggestion that should fix the
non-paravirtualized case _without_ actually touching anything but the
NMI code itself.
What I tried to say is that the paravirtualized people should take a
look at my suggestion, and see if they can apply the logic to their
NMI handling too. And in the process totally remove the need for
paravirtualizing iret, exactly because the approach handles the magic
NMI lock logic entirely in the NMI handler itself.
Because I think the same thing that would make us not need to worry
about nested page faults _during_ NMI (because we could make the NMI
code do the right thing even when the hardware doesn't lock out NMI's
for us) is also the exact same logic that the paravirt monitor could
do for its own NMI handling.
Wouldn't it be nice to be able to remove the need to paravirtualize iret?
Linus
--
Nothing in this thread is ringing any alarm bells from that perspective,
so I don't much mind either way. When I get around to dealing with
paravirtualized NMI, I'll look at the state of things and see how to go
from there. (Xen's iret hypercall takes a flag to say whether to unmask
NMIs, which will probably come in handy.)
I don't think any of the other pure PV implementations have NMI either,
Of course. But we also need to do an iret in a hypercall to handle ring
switching in some cases, so we still need it aside from the interrupt issue.
J
--
On Wed, Jul 14, 2010 at 3:37 PM, Linus Torvalds So this is what I think it might look like, with the %rip in place. And I changed the "nmi_stack_ptr" thing to have both the pointer and a flag - because it turns out that in the single-instruction race case, we actually want the old pointer. Totally untested, of course. But _something_ like this might work: # # Two per-cpu variables: a "are we nested" flag (one byte), and # a "if we're nested, what is the %rsp for the nested case". # # The reason for why we can't just clear the saved-rsp field and # use that as the flag is that we actually want to know the saved # rsp for the special case of having a nested NMI happen on the # final iret of the unnested case. # nmi: cmpb $0,%__percpu_seg:nmi_stack_nesting jne nmi_nested_corrupt_and_return cmpq $nmi_iret_address,0(%rsp) je nmi_might_be_nested # create new stack is_unnested_nmi: # Save some space for nested NMI's. The exception itself # will never use more space, but it might use less (since # if will be a kernel-kernel transition). But the nested # exception will want two save registers and a place to # save the original CS that it will corrupt subq $64,%rsp # copy the five words of stack info. 96 = 64 + stack # offset of ss. pushq 96(%rsp) # ss pushq 96(%rsp) # rsp pushq 96(%rsp) # eflags pushq 96(%rsp) # cs pushq 96(%rsp) # rip # and set the nesting flags movq %rsp,%__percpu_seg:nmi_stack_ptr movb $0xff,%__percpu_seg:nmi_stack_nesting regular_nmi_code: ... # regular NMI code goes here, and can take faults, # because this sequence now has proper nested-nmi # handling ... nmi_exit: movb $0,%__percpu_seg:nmi_stack_nesting nmi_iret_address: iret # The saved rip points to the final NMI iret, after we've cleared # nmi_stack_ptr. Check the CS segment to make sure. nmi_might_be_nested: cmpw $__KERNEL_CS,8(%rsp) jne is_unnested_nmi # This is the case when we hit just as we're supposed to do the final # iret of ...
On Wed, Jul 14, 2010 at 6:23 PM, Linus Torvalds
I didn't fill in the iret fault details, because I thought that would
be trivial. We get an exception, it's a slow case, how hard can it be
to just call the NMI code?
But thinking some more about it, it doesn't feel as trivial any more.
We want to set up that same nesting code for the faked NMI call, but
now I made it be two separate variables, and they need to be set in an
NMI-safe way without us actually having access to the whole NMI
blocking that the CPU does for a real NMI.
So there's a few subtleties there too. Probably need to make the two
percpu values adjacent, and use cmpxchg16b in the "emulate NMI on
exception" code to set them both atomically. Or something. So I think
it's doable, but it's admittedly more complicated than I thought it
would be.
.. and obviously there's nothing that guarantees that the code I
already posted is correct either. The whole concept might be total
crap.
Linus
--
I'm wondering if we really have to handle this with a fault. Couldn't we just send iret to the following nmi handler instead ? (chaining nmis) I think we can even find a way to handle the fact that the fake nmi does not run with nmis disabled. We could keep the nmi nested bit set if we find out that iret will branch to the fake nmi. We would then have to make sure the fake nmi entry point is a little further than the standard nmi entry point: somewhere Hrm, we could probably get away with only keeping the nmi_stack_nested per-cpu variable. The nmi_stack_ptr could be known statically if we set it at a fixed offset from the bottom of stack rather than using an offset relative to the top (which can change depending if we are nested over the kernel or userspace). Call me optimistic if you want, but I think we're really getting somewhere. :) Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
On Thu, Jul 15, 2010 at 11:43 AM, Linus Torvalds
Hmm. Of course - one way of solving this might be to just make the
32-bit case switch stacks in software. That might be a good idea
regardless, and would not be complicated. We already do that for
sysenter, but the NMI case would be simpler because we don't need to
worry about being re-entered by NMI/DEBUG during the stack switch.
And since we have to play some games with moving the data on the stack
around _anyway_, doing the whole "switch stacks entirely rather than
just subtract a bit from the old stack" would be fairly logical.
So I think you may end up being right: we don't need to save the
original NMI stack pointer, because we can make sure that the
replacement stack (that we need anyway) is always deterministic.
Linus
--
Hi Linus, I modified your code, intenting to handle the fake NMI entry gracefully given that NMIs are not necessarily disabled at the entry point. It uses a "need fake NMI" flag rather than playing games with CS and faults. When a fake NMI is needed, it simply jumps back to the beginning of regular nmi code. NMI exit code and fake NMI entry are made reentrant with respect to NMI handler interruption by testing, at the very beginning of the NMI handler, if a NMI is nested over the whole nmi_atomic .. nmi_atomic_end code region. This code assumes NMIs have a separate stack. This code is still utterly untested and might eat your Doritos, only provided for general enjoyment. Thanks, Mathieu # # Two per-cpu variables: a "are we nested" flag (one byte). # a "do we need to execute a fake NMI" flag (one byte). # The %rsp at which the stack copy is saved is at a fixed address, which leaves # enough room at the bottom of NMI stack for the "real" NMI entry stack. This # assumes we have a separate NMI stack. # The NMI stack copy top of stack is at nmi_stack_copy. # The NMI stack copy "rip" is at nmi_stack_copy_rip, which is set to # nmi_stack_copy-32. # nmi: # Test if nested over atomic code. cmpq $nmi_atomic,0(%rsp) jae nmi_addr_is_ae # Test if nested over general NMI code. cmpb $0,%__percpu_seg:nmi_stack_nesting jne nmi_nested_set_fake_and_return # create new stack is_unnested_nmi: # Save some space for nested NMI's. The exception itself # will never use more space, but it might use less (since # if will be a kernel-kernel transition). # Save %rax on top of the stack (need to temporarily use it) pushq %rax movq %rsp, %rax movq $nmi_stack_copy,%rsp # copy the five words of stack info. rip starts at 8+0(%rax). pushq 8+32(%rax) # ss pushq 8+24(%rax) # rsp pushq 8+16(%rax) # eflags pushq 8+8(%rax) # cs pushq 8+0(%rax) # rip movq 0(%rax),%rax # restore %rax set_nmi_nesting: # and set the nesting flags movb ...
On Thu, Jul 15, 2010 at 3:01 PM, Mathieu Desnoyers
That is totally bogus. The NMI can be nested by exceptions and
function calls - the whole _point_ of this thing. So testing "rip" for
anything else than the specific final "iret" is meaningless. You will
It also needs to be made per-cpu (and the flags be per-cpu).
Then you could in fact possibly test the stack pointer for whether it
is in the NMI stack area, and use the value of %rsp itself as the
flag. So you could avoid the flag entirely. Because testing %rsp is
valid - testing %rip is not.
That would also avoid the race, because %rsp (as a flag) now gets
cleared atomically by the "iret". So that might actually solve things.
Linus
--
This seems really clean to me. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
On Thu, Jul 15, 2010 at 3:16 PM, Linus Torvalds
Hmm. So on x86-32, it's easy: if the NMI is nested, you can literally
look at the current %rsp value, and see if it's within the NMI stack
region.
But on x86-64, due to IST, you need to look at the saved-rsp value on
the stack, since the %rsp always gets reset to the NMI stack region
regardless of where it was before.
Why do we force IST use for NMI, btw? Maybe we shouldn't, and just use
the normal kernel stack mechanisms?
Linus
--
The reasons for using TSS (32 bits) or IST (64 bits) are: concern about the size of the regular kernel stack, and a concern that the kernel stack pointer may not be in a usable state. The former is not a problem here: we're doing a stack switch anyway, and so the additional overhead on the main stack is pretty minimal, but the latter may be. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
If you don't use IST the SYSCALL entry is racy during the window when RSP is not set up yet (same for MCE etc.) -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Right, the kernel stack is not ready. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
Well, it may not be ready for the _current_ NMI handler, but if we're
going to do a stack switch in sw on NMI anyway... ?
Linus
--
No, the problem is that without IST it'll try to drop the NMI stack frame itself *on the user stack*. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
Oh, because SS has already been cleared, but rsp still points to the
user stack? Ok, that does seem insurmountable.
Linus
--
Well, SS doesn't matter for 64 bits, but yes, RSP still points to the user stack. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
The CPU written initial stack frame would still go on the wrong stack. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
There are 2 tests done on NMI handler entry: 1) test if nested over nmi_atomic region (which is a very restrained region around nmi_exit, which does not do any function call nor take traps). 2) test if the per-cpu nmi_nesting flag is set. That could be used as a way to detect "nesting over NMI", but I'm not entirely sure it would deal with the "we need a fake NMI" flag set/clear (more or less equivalent to setting CS to 0 in your implementation and then back to some other value). The "set" is done with NMIs disabled, but the "clear" is done at fake Well, I'm still unconvinced there is anything to solve, as I built my NMI entry with 2 tests: one for "nmi_atomic" code range and the other for per-cpu nesting flag. Given that I set/clear the per-cpu nesting flag either with NMIs off or within the nmi_atomic code range, this should all work fine. Unless I am missing something else ? Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Hi Linus, What I omitted in my original description paragraph is that I also test for NMIs nested over NMI "regular code" with a "nesting" per-cpu flag, which deals with the concerns you expressed in your reply about function calls and traps. I'm self-replying to keep track of Avi's comment about the need to save/restore cr2 at the beginning/end of the NMI handler, so we don't end up corrupting a VM CR2 if we have the following scenario: trap in VM, NMI, trap in NMI. So I added cr2 awareness to the code snippet below, so we should be close to have something that starts to make sense. (although I'm not saying it's bug-free yet) ;) Please note that I'll be off on vacation for 2 weeks starting this evening (back on August 2) without Internet access, so my answers might be delayed. Thanks ! Mathieu Code originally written by Linus Torvalds, modified by Mathieu Desnoyers intenting to handle the fake NMI entry gracefully given that NMIs are not necessarily disabled at the entry point. It uses a "need fake NMI" flag rather than playing games with CS and faults. When a fake NMI is needed, it simply jumps back to the beginning of regular nmi code. NMI exit code and fake NMI entry are made reentrant with respect to NMI handler interruption by testing, at the very beginning of the NMI handler, if a NMI is nested over the whole nmi_atomic .. nmi_atomic_end code region. It also tests for nested NMIs by keeping a per-cpu "nmi nested" flag"; this deals with detection of nesting over the "regular nmi" execution. This code assumes NMIs have a separate stack. # # Two per-cpu variables: a "are we nested" flag (one byte). # a "do we need to execute a fake NMI" flag (one byte). # The %rsp at which the stack copy is saved is at a fixed address, which leaves # enough room at the bottom of NMI stack for the "real" NMI entry stack. This # assumes we have a separate NMI stack. # The NMI stack copy top of stack is at nmi_stack_copy. # The NMI stack copy "rip" is at nmi_stack_copy_rip, ...
On Thu, Jul 15, 2010 at 11:31 AM, Mathieu Desnoyers
I thought about trying that, but I don't think it's true. At least not
for the 32-bit case.
The thing is, the 32-bit case will re-use the kernel stack if it
happens in kernel space, and will thus start from a random space (and
won't push all the information anyway). So a nested NMI really doesn't
know where the original NMI stack is to be found unless we save it
off.
In the case of x86-64, I think the stack will always be at a fixed
address, and push a fixed amount of data (because we use the IST
thing). So there you could probably just use the flag, but you'd still
have to handle the 32-bit case, and quite frankly, I think it would be
much nicer if the logic could be shared for the 32-bit and 64-bit
cases.
But maybe I'm missing something.
Linus
--
The first thing that strikes me is that we could be interrupted by a standard interrupt on top of the iret instruction below. This interrupt handler could in turn be interrupted by a NMI, so the NMI handler would not know that it is nested over nmi_iret_address. Maybe we could simply disable interrupts explicitly at the beginning of the handler, so they get re-enabled by iret below upon return from nmi ? Doing that would ensure that only NMIs can interrupt us. I'll look a bit more at the code and come back with more comments if things come up. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
On Thu, Jul 15, 2010 at 9:44 AM, Mathieu Desnoyers
No, that can never happen.
Why? Simple: regular interrupts aren't ever enabled in eflags. So the
only kinds of traps we can get are NMI's (that don't follow the normal
rules), and exceptions.
Of course, if there is some trap that re-enables interrupts even if
the trap happened in an interrupt-disabled region, then that would
change things, but that would be a bad bug regardless (and totally
independently of any NMI issues). So in that sense it's a "could
happen", but it's something that would be a totally separate bug.
Linus
--
Ah, you're right, since NMIs are an intr gate, IF is disabled in the EFLAGS all Yep, this kind of scenario would therefore be a bug that does not belong to the specific realm of nmis. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Yes, the only specific issue here is NMI -> trap -> IRET -> [nested NMI], which is what this whole thing is about. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
By trading off some memory, we don't need this trickery. We can
allocate two nmi stacks, so the code becomes:
nmi:
cmpb $0, %__percpu_seg:nmi_stack_nesting
je unnested_nmi
cmpq $nmi_iret,(%rsp)
jne unnested_nmi
cmpw $__KERNEL_CS,8(%rsp)
jne unnested_nmi
popf
retfq
unnested_nmi:
xorq $(nmi_stack_1 ^ nmi_stack_2),%__percpu_seg:tss_nmi_ist_entry
movb $1, __percpu_seg:nmi_stack_nesting
regular_nmi:
...
regular_nmi_end:
movb $0, __percpu_seg:nmi_stack_nesting
nmi_iret:
iretq
--
error compiling committee.c: too many arguments to function
--
I really don't think you need even that. See earlier in the discussion
about how we could just test %rsp itself. Which makes all the %rip
testing totally unnecessary, because we don't even need any flags,and
we have no races because %rsp is atomically changed with taking the
exception.
Lookie here, the %rsp comparison really isn't that hard:
nmi:
pushq %rax
pushq %rdx
movq %rsp,%rdx # current stack top
movq 40(%rsp),%rax # old stack top
xor %rax,%rdx # same 8kB aligned area?
shrq $13,%rdx # ignore low 13 bits
je it_is_a_nested_nmi # looks nested..
non_nested:
...
... ok, we're not nested, do normal NMI handling ...
...
popq %rdx
popq %rax
iret
it_is_a_nested_nmi:
cmpw $0,48(%rsp) # double-check that it really was a nested exception
jne non_nested # from user space or something..
# this is the nested case
# NOTE! NMI's are blocked, we don't take any exceptions etc etc
addq $-160,%rax # 128-byte redzone on the old stack + 4 words
movq (%rsp),%rdx
movq %rdx,(%rax) # old %rdx
movq 8(%rsp),%rdx
movq %rdx,8(%rax) # old %rax
movq 32(%rsp),%rdx
movq %rdx,16(%rax) # old %rflags
movq 16(%rsp),%rdx
movq %rdx,24(%rax) # old %rip
movq %rax,%rsp
popq %rdx
popq %rax
popf
ret $128 # restore %rip and %rsp
doesn't that look pretty simple?
NOTE! OBVIOUSLY TOTALLY UNTESTED!
Linus
--
Too simple - an MCE will switch to its own stack, failing the test. Now that we have correctable MCEs, that's not a good idea. So the plain everyday sequence NMI #PF MCE (uncompleted) NMI will fail. Plus, even in the non-nested case, you have to copy the stack frame, or the nested NMI will corrupt it. With stack switching, the nested NMI is allocated its own frame. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Ahh, true. And I think we do DEBUG traps with IST too.
So we do need the explicit flag over the region. Too bad. I was hoping
to handle the nested case without having to set up the percpu segment
(that whole conditional swapgs thing, which is extra painful in NMI).
And at that point, if you require the separate flag anyway, the %rsp
range test is equivalent to the %rip range test.
Linus
--
Well, we have to do that anyway for the non-nested case. So we just do it before checking whether we're nested or not, and undo it on the popf; retf path. -- error compiling committee.c: too many arguments to function --
On Sun, Jul 18, 2010 at 10:36 AM, Linus Torvalds
The non_nested case still needs to start off with moving it's stack
frame to a safe area that won't be overwritten by any nesting NMI's
(note that they cannot nest at this point, since we've done nothing
that can fault). So we'd still need that
7* pushq 48(%rsp)
which copies the five words that got pushed by hardware, and the two
register-save locations that we used for the nesting check and special
return.
After we've done those 7 pushes, we can then run code that may take a
fault. Because when the fault returns with an "iret" and re-enables
NMI's, our nesting code is ready.
So all told, we need a maximum of about 216 bytes of stack for the
nested NMI case: 56 bytes for the seven copied words, and the 160
bytes that we build up _under_ the stack pointer for the nested case.
And we need the NMI stack itself to be aligned in order for that
"ignore low bits" check to work. Although we don't actually have to do
that "xor+shr", we could do the test equally well with a "sub+unsigned
compare against stack size".
Other than that, I think the extra test that we're really nested might
It migth be safer to check the saved CS rather than the saved SS on
the stack to see that we really are in kernel mode. It's possible that
somebody could load a NULL SS in user mode and then just not use the
stack - and try to make it look like they are in kernel mode for when
the NMI happens. Now, I _think_ that loading a zero SS is supposed to
trap, but checking CS is still likely to be the better test for "were
we in kernel mode". That's where the CPL is really encoded, after all.
So that "cmpw $0,48(%rsp)" is probably ok, but it would likely be
better to do it as
testl $3,24(%rsp)
jne non_nested
instead. That's what entry_64.S does everywhere else.
Oh, and the non-nested case obviously needs all the regular "make the
kernel state look right" code. Like the swapgs stuff etc if required.
My example code was really meant ...Are you sure you don't want to use Mathieu's 2/2 patch? We are fixing the x86 problem that iret re-enables NMIs, and you don't want to touch anything else but the NMI code. But it may be saner to just fix the places that call iret. We can perhaps encapsulate those into a single macro that we can get right and will be correct everywhere it is used. The ugliest part of Mathieu's code is dealing with paravirt, but paravirt is ugly to begin with. Doing this prevents nested NMIs as well as all the unknowns that will come with dealing with nested NMIs. Where as, handling all iret's should be straight forward, although a bit more intrusive than what we would like. Just saying, -- Steve --
Yeah, I'm pretty sure. Unless somebody can show that it's faster, I
really don't want to muck with regular iret's. Also, as shown during
the discussion, even with Mathieu's 2/2 patch, we'd _still_ need NMI
to also save cr2 etc.
So the sane thing to do is to put all the NMI crap where it belongs.
NMI's need to know about the fact that them taking exceptions is
special. That whole "vmalloc_sync_all()" is simply pure brokenness.
In other words, it is _not_ just about 'iret' fixup. It's a bigger
thing. NMI's are special, and we don't want to spread that specialness
around.
Linus
--
Implements an alternative iret with popf and return so trap and exception handlers can return to the NMI handler without issuing iret. iret would cause NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to copy the return instruction pointer to the top of the previous stack, issue a popf, loads the previous esp and issue a near return (ret). It allows placing dynamically patched static jumps in asm gotos, which will be used for optimized tracepoints, in NMI code since returning from a breakpoint would be valid. Accessing vmalloc'd memory, which allows executing module code or accessing vmapped or vmalloc'd areas from NMI context, would also be valid. This is very useful to tracers like LTTng. This patch makes all faults, traps and exception safe to be called from NMI context *except* single-stepping, which requires iret to restore the TF (trap flag) and jump to the return address in a single instruction. Sorry, no kprobes support in NMI handlers because of this limitation. This cannot be emulated with popf/lret, because lret would be single-stepped. It does not apply to "immediate values" because they do not use single-stepping. This code detects if the TF flag is set and uses the iret path for single-stepping, even if it reactivates NMIs prematurely. Test to detect if nested under a NMI handler is only done upon the return from trap/exception to kernel, which is not frequent. Other return paths (return from trap/exception to userspace, return from interrupt) keep the exact same behavior (no slowdown). alpha and avr32 use the active count bit 31. This patch moves them to 28. TODO : test alpha and avr32 active count modification TODO : test with lguest, xen, kvm. tested on x86_32 (tests implemented in a separate patch) : - instrumented the return path to export the EIP, CS and EFLAGS values when taken so we know the return path code has been executed. - trace_mark, using immediate values, with 10ms delay with the breakpoint activated. Runs ...
Watch out for the RF flag too, that is not set correctly by POPFD -- that may be important for faulting instructions that also have a hardware What about the VM flag for VM86 tasks? It cannot be changed by POPFD either. How about only using the special return path when a nested exception is about to return to the NMI handler? You'd avoid all the odd cases then that do not happen in the NMI context. Maciej --
This is exactly what this patch does :-) It selects the return path with + testl $NMI_MASK,TI_preempt_count(%ebp) + jz resume_kernel /* Not nested over NMI ? */ In addition, about int3 breakpoints use in the kernel, AFAIK the handler does not explicitly set the RF flag, and the breakpoint instruction (int3) appears not to set it. (from my understanding of Intel's Intel Architecture Software Developer’s Manual Volume 3: System Programming 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C) So it should be safe to set a int3 breakpoint in a NMI handler with this patch. It's just the "single-stepping" feature of kprobes which is problematic. Luckily, only int3 is needed for code patching bypass. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Ah, OK then -- I understood you actually tested the value of TF in the The CPU only sets RF itself in the image saved in certain cases -- you'd see it set in the page fault handler for example, so that once the handler has finished any instruction breakpoint does not hit (presumably again, because the instruction breakpoint debug exception has the highest Actually the breakpoint exception handler should actually probably set RF explicitly, but that depends on the exact debugging scenario, so I can't comment on it further. I don't know how INT3 is used in this context, so I'm just noting this may be a danger zone. Maciej --
It tests it too. When it detects that the return path is about to return to a NMI handler, it checks if the TF flag is set. If it is set, then "iret" is really needed, because TF can only single-step an instruction when set by "iret". The popf/ret scheme would otherwise trap at the "ret" instruction that follows popf. Anyway, single-stepping is really discouraged in nmi handlers, Well, the only case where I think it might make sense to allow a breakpoint in NMI handler code would be to temporarily replace a static branch, which should In the case of temporary bypass, the int3 is only there to divert the instruction execution flow to somewhere else, and we come back to the original code at the address following the instruction which has the breakpoint. So basically, we never come back to the original instruction, ever. We might as well just clear the RF flag from the EFLAGS image before popf. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Hmm, with Pentium Pro and more recent processors there is actually a nasty hack that will let you get away with POPF/RET and TF set. ;) You Yes, if you return to elsewhere, then that's actually quite desirable IMHO. This RF flag is quite complicated to handle and there are some errata involved too. If I understand it correctly, all fault-class exception handlers are expected to set it manually in the image to be restored if they return to the original faulting instruction (that includes the debug exception handler if it was invoked as a fault, i.e. in response to an instruction breakpoint). Then all trap-class exception handlers are expected to clear the flag (and that includes the debug exception handler if it was invoked as a trap, e.g. in response to a data breakpoint or a single step). I haven't checked if Linux gets these bits right, but it may be worth doing so. For the record -- GDB hardly cares, because it removes any instruction breakpoints before it is asked to resume execution of an instruction that has a breakpoint set at, single-steps the instruction with all the other threads locked out and then reinserts the breakpoints so that they can hit again. Then it proceeds with whatever should be done next to fulfil the execution request. Maciej --
You need to save/restore cr2 in addition, otherwise the following hits you - page fault - processor writes cr2, enters fault handler - nmi - page fault - cr2 overwritten I guess you would usually not notice the corruption since you'd just see a spurious fault on the page the NMI handler touched, but if the first fault happened in a kvm guest, then we'd corrupt the guest's cr2. But the whole thing strikes me as overkill. If it's 8k per-cpu, what's wrong with using a per-cpu pointer to a kmalloc() area? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
OK, just to make sure: you mean we'd have to save/restore the cr2 register at the beginning/end of the NMI handler execution, right ? The shouldn't we Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is much more than perf) can potentially cause large latencies, which could be squashed by allowing page faults in NMI handlers. This looks like a stronger argument to me. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
You need to fix all other code too that walks tasks lists to avoid all those. % gid for_each_process | wc -l In fact the mm-struct walk is cheaper than a task-list walk because there are always less than tasks. -Andi --
This can very well be done incrementally. And I agree, these should eventually targeted too, especially those which hold locks. We've already started hearing about tasklist lock live-locks in the past year, so I think we're pretty much at the point where it should be looked at. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Why is that kernel code calling vmalloc_sync_all()? If it is only NMI which cannot take vmalloc faults, why bother? If not, why not? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Modules come as yet another example of stuff that is loaded in vmalloc'd space and can be accesses from NMI context. That would include oprofile, tracers, and probably others I'm forgetting about. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Module loading can certainly take a vmalloc_sync_all() (though I agree it's unpleasant). Anything else? Note perf is not modular at this time, but could be made so with preempt/sched notifiers to hook the context switch. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Actually, module loading is already a performance problem; a lot of distros load sometimes hundreds of modules on startup, and it's heavily serialized, so I can see this being desirable to skip. I really hope noone ever gets the idea of touching user space from an NMI handler, though, and expecting it to work... -hpa --
There aren't that many processes at this time (or there shouldn't be, don't know how fork-happy udev is at this stage), so the sync should be pretty fast. In any case, we can sync only modules that contain NMI I think the concern here is about an NMI handler's code running in vmalloc space, or is it something else? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Code or data, yes; including percpu data. -hpa --
Use kmalloc and percpu pointers, it's not that onerous. Oh, and you can access vmalloc space by switching cr3 temporarily to init_mm's, no? Obviously not a very performant solution, at least without PCIDs, but can be used if needed. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
What people don't seem to understand is that WE SHOULD NOT MAKE NMI
FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING
WHAT-SO-EVER TO DO WITH NMI.
I'm shouting, because this point seems to have been continually
missed. It was missed in the original patches, and it's been missed in
the discussions.
Non-NMI code should simply never have to even _think_ about NMI's. Why
should it? It should just do whatever comes "natural" within its own
context.
This is why I've been pushing for the "let's just fix NMI" approach.
Not adding random hacks to other code sequences that have nothing
what-so-ever to do with NMI.
So don't add NMI code to the page fault code. Not to the debug code,
or to the module loading code. Don't say "use special allocations
because the NMI code may care about these particular data structures".
Because that way lies crap and unmaintainability.
Linus
--
But we're not talking about non-NMI code. The 8k referred to in the original patch are buffers used by NMI stack recording. Module code vmalloc_sync_all() is only need by code that is executed during NMI, "fixing NMI" will result in code that is understandable by maybe three people after long and hard thinking. NMI can happen in too many semi-defined contexts, so there will be plenty of edge cases. I'm not If NMI code can call random hooks and access random data, yes. But I don't think we're at that point yet. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Yes, we are. We're talking about breakpoints (look at the subject
line), and you are very much talking about things like that _idiotic_
vmalloc_sync_all() by module loading code etc etc.
Every _single_ "solution" I have seen - apart from my suggestion - has
been about making code "special" because some other code might run in
an NMI. Module init sequences having to do idiotic things just because
they have data structures that might get accessed by NMI.
And the thing is, if we just do NMI's correctly, and allow nesting,
ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
stupid things elsewhere.
In other words, why the hell are you arguing? Help Mathieu write the
low-level NMI handler right, and remove that idiotic
"vmalloc_sync_all()" that is fundamentally broken and should not
exist. Rather than talk about adding more of that kind of crap.
Linus
--
One issue I have with nesting NMIs is that you need a nesting limit, otherwise you'll overflow the NMI stack. We just got rid of nesting for normal interrupts because of this stack overflow problem which hit in real situations. In some cases you can get quite high NMI frequencies, e.g. with performance counters. Now the current performance counter handlers do not nest by themselves of course, but they might nest with other longer running NMI users. I think none of the current handlers are likely to nest for very long, but there's more and more NMI coded all the time, so it's definitely a concern. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Have you actually looked at the suggestion I (and now Mathieu)
suggested code for?
The nesting is very limited. NMI's would nest just once, and when that
happens, the nested NMI would never use more than something like a
hundred bytes of stack (most of which is what the CPU pushes
directly). And there would be no device interrupts that nest, and
practically the faults that nest obviously aren't going to be complex
faults either (ie the page fault would be the simple case that never
calls to 'handle_vm_fault()', but handles it all in
arch/x86/mm/fault.c.
IOW, there is absolutely _no_ issues with nesting. It's two levels
deep, and a much smaller stack footprint than our regular exception
nesting for those two levels too.
And your argument that there would be more and more NMI usage only
makes it more important that we handle NMI's without going crazy. Just
handle them cleanly instead of making them something totally special.
Linus
--
Maybe I'm misunderstanding everything (and it has been a lot of emails in the thread), but the case I was thinking of would be if the second NMI faults too, and then another one comes in after the IRET etc. -Andi --
No, the nested NMI cannot fault, because it never even enters C code. It literally just returns immediately after having noticed it is nested (and corrupted the stack of the original one, so that the original NMI will re-do itself at return).. So the nested NMI will use some few tens of bytes of stack. In fact, it will use the stack "above" the stack that the original NMI handler is using, because it will reset the stack pointer back to the top of the NMI stack. So in a very real sense, it is not even extending the stack, it is just re-using a small part of the same stack that the original NMI used (and that we copied away so that it doesn't matter that it gets re-used) As to another small but important detail: the _nested_ NMI actually returns using "popf+ret", leaving NMI's blocked again. Thus guaranteeing forward progress and lack of NMI storms. To summarize: - the "original" (first-level) NMI can take faults (like the page fault to fill in vmalloc pages lazily, or debug faults). That will actually cause two stack frames (or three, if you debug a page fault that happened while NMI was active). So there is certainly exception nesting going on, but we're talking _much_ less stack than normal stack usage where the nesting can be deep and in complex routines. - any "nested" NMI's will not actually use any more stack at all than a non-nested one, because we've pre-reserved space for them (and we _had_ to reserve space for them due to IST) - even if we get NMI's during the execution of the original NMI, there can be only one such "spurious" NMI per nested exception. So if we take a single page fault, that exception will re-enable NMI (because it returns with "iret"), and as a result we may take a _single_ new nested NMI until we disable NMI's again. In other words, the approach is not all that different from doing "lazy irq disable" like powerpc does for regular interrupts. For NMI's, we do it because it's impossible (on x86) to disable NMI's without actually ...
We're not proposing to actually "nest" NMIs per se. We copy the stack at the beginning of the NMI handler (and then use the copy) to permit nesting of faults over NMI handlers. Following NMIs that would "try" to nest over the NMI handler would see their regular execution postponed until the end of the currently running NMI handler. It's OK for these "nested" NMI handlers to use the bottom of NMI stack because the NMI handler on which they are trying to nest is only using the stack copy. These "nested" handlers return to the original NMI handler very early just after setting a "pending nmi" flag. There is more to it (e.g. handling NMI handler exit atomically with respect to incoming NMIs); please refer to the last assembly code snipped I sent to Linus a little earlier today for details. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Well, I'd put it in the nmi handler registration code, but you're right. A user placing breakpoints can't even tell whether the Well, at least we'll get a good test case for kvm's nmi blocking emulation (it's tricky since if we fault on an iret sometimes nmis get unblocked even though the instruction did not complete). -- error compiling committee.c: too many arguments to function --
USB hotplug is a use-case happening randomly after the system is well there and running; I'm afraid this does not fit in your module loading expectations. It triggers tons of events, many of these actually load modules. Thanks, -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
How long would vmalloc_sync_all take with a few thousand mm_struct take? We share the pmds, yes? So it's a few thousand memory accesses. The direct impact is probably negligible, compared to actually loading the module from disk. All we need is to make sure the locking doesn't slow down unrelated stuff. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
It's not the memory accesses, it's the need to synchronize all the CPUs. -hpa --
I'm missing something. Why do we need to sync all cpus? the vmalloc_sync_all() I'm reading doesn't. Even if we do an on_each_cpu() somewhere, it isn't the end of the world. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
Also you have to remember that vmalloc_sync_all() only does something when the top level page is actually updated. That is very rare. (in many cases it should happen at most once per boot) Most mapping changes update lower levels, and those are already shared. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
I think the concern was also potentially doing things like backtraces
etc that may need access to the module data structures (I think the
ELF headers end up all being in vmalloc space too, for example).
The whole debugging thing is also an issue. Now, I obviously am not a
big fan of remote debuggers, but everybody tells me I'm wrong. And
putting a breakpoint on NMI is certainly not insane if you are doing
debugging in the first place. So it's not necessarily always about the
page faults.
Linus
--
We already have infrastructure for kprobes to prevent breakpoints on critical code (the __kprobes section). In principle kgdb/kdb could be taught about honoring those too. That wouldn't help for truly external JTAG debuggers, but I would assume those generally can (should) handle any contexts anyways. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
It doesn't help with NMI code calling other functions, or with data breakpoints. -- error compiling committee.c: too many arguments to function --
On startup you don't have many processes. If there's a problem it's surely not the fault of vmalloc_sync_all() BTW in my experience one reason module loading was traditionally slow was that it did a stop_machine(). I think(?) that has been fixed at some point. But even with that's it's more an issue on larger It can make sense for a backtrace in a profiler. In fact perf is nearly doing it I believe, but moves it to the self IPI handler in most cases. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Interesting, is the self IPI guaranteed to execute synchronously after the NMI's IRET? Or can the core IRET faster than the APIC and so we get the backtrace at the wrong place? (and does it matter? the NMI itself is not always accurate) -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
self ipi runs after the next STI (or POPF) -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Perf actually already does that to unwind user-space stacks... ;-) See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users. What we do is a manual page table walk (using __get_user_pages_fast) and simply bail when the page is not available. That said, I think that the thing that started the whole per-cpu-per-context temp stack-frame storage story also means that that function is now broken and can lead to kmap_atomic corruption. I really should brush up that stack based kmap_atomic thing, last time I got stuck on FRV wanting things. Linus should I refresh that whole series and give a FRV a slow but working implementation and then let David Howells sort out things if he cares about that? --
That's not really "touching user space", though. -hpa --
That doesn't make sense. vmalloc_all_sync() should work on 32bit too. It just needs to walk all processes and fix up every page table. -Andi --
Yeah, I've been taken aback when Tejun told me that a few moments ago. I initially thought that vmalloc_sync_all() synchronized all page mappings of all processes on x86_32. But apparently that does not seem to be the case. I'm adding him in CC. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Well it worked when it was originally written. That was for the case of a NMI handler in a module. If it doesn't work fix it. I don't think the NMI-safe fault is really needed with it. -Andi --
Hello, Yeah, vmalloc_sync_all() synchronizes everything by walking every page table, so it should work. I was saying that just flushing TLB wouldn't cut it because multiple top level page table entries can be used to map vmalloc areas. It seems that both 32 and 64bit does that tho. Thanks. -- tejun --
