For the record, I think the patching code gross overkill.
A stop_machine (or lightweight variant using IPI) would be sufficient and
vastly simpler. Trying to patch NMI handlers while they're running is
already crazy.I'd keep this version up your sleeve for they day when it's needed.
Rusty.
-
I wouldn't mind if it was limited to the code within do_nmi(), but then
we would have to accept potential GPF ifA - the NMI or MCE code calls any external kernel code (printk,
notify_die, spin_lock/unlock, die_nmi, lapic_wd_event (perfctr code,
calls printk too for debugging)...B - we try to patch this code at the wrong moment
I could live with that, but I would prefer to have a solid, non flaky
solution. My goal is to help the kernel quality _improve_ rather than
deteriorate. Therefore, if one decides to use the immediate values to
leave dormant spinlock instrumentation in the kernel, I wouldn't want it
to have undesirable side-effects (GPF) when the instrumentation isIf we choose to go this way, stop_machine would have to do a sync_core()
on every CPU before it reactivates interrupts for this to respect
Intel's errata. It's not just a matter of not executing the code while
it is modified; the issue here is that we must insure that we don't have
an incoherent trace cache. So, as is, stop_machine would not respect
the errata.--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
Sure, but as I pointed out previously, such calls are already best effort.
You can do very little safely from do_nmi(), and calling printk isn't one of
them, nor is grabbing a spinlock (well, actually you could as long as it'sYes, I don't think stop_machine is actually what you want anyway, since you
are happy to run in interrupt context. An IPI-based scheme is probably
better, and also has the side effect of iret doing the sync you need, IIUC.Hope that clarifies,
Rusty.
-
yes and no.. do_nmi uses the "bust spinlocks" exactly for this. So this
is ok by design. Other than this, we can end up mixing up the console
data output with different sources of characters, but I doubt somethingYup, see arch/x86/kernel/nmi_64.c : nmi_watchdog_tick()
It defines a spinlock to "Serialise the printks". I guess it's good to
protect against other nmi watchdogs running on other CPUs concurrently,So should we put a warning telling "enabling tracing or profiling on a
production system that also uses NMI watchdog could potentially cause a
crash" ? The rarer a bug is, the more difficult it is to debug. It does
not make the bug hurt less when it happens.The normal thing to do when a potential deadlock is detected is to fix
it, not to leave it there under the premise that it doesn't matter since
it happens rarely. In our case, where we know there is a potential race,
I don't see any reason not to make sure it never happens. What's the
cost of it ?arch/x86/kernel/immediate.o : 2.4K
let's compare..
kernel/stop_machine.o : 3.9K
so I think that code size is not an issue there, especially since the
Yup, looping in IPIs with interrupts disabled should do the job there.
It's just awful for interrupt latency on large SMP systems :( Being
currently bad at it is not a reason to make it worse. If we have a CPU
that is within a high latency irq disable region when we send the IPI,
we can easily end up waiting for this critical section to end with
interrupts disabled on all CPUs. The fact that we would wait for the
longest interrupt disable region with IRQs disabled implies that we
increase the maximum latency of the system, by design. I'm not sure I
would like to be the new longest record-beating IRQ off region.--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
Yeah, in the "machine is dying" case, we do make best effort to get the
message out. Worrying about a GPF on the off chance we happen to bestop_machine is necessary (although it could use simplification, patches
Yet you could have implemented it in the time it took us to have this
conversation.Just because code is clever doesn't mean it should go in. There are enough
things in the kernel which have to be complex that we should always be on the
lookout for things which can be made simpler.Cheers,
Rusty.
-
Ok, let's start with a simple version then. I'll plan to resubmit the
more sophisticated one when I submit a patch to make my markers use the
immediate values.Here it is, for a first round of review :
Immediate Values - x86 Optimization
x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
non atomic writes to a code region only touched by us (nobody can execute it
since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.Ok, so the most flexible solution that I see, that should fit for both
i386 and x86_64 would be :
1 byte : "=Q" : Any register accessible as rh: a, b, c, and d.
2, 4 bytes : "=R" : Legacy register—the eight integer registers available
on all i386 processors (a, b, c, d, si, di, bp, sp). 8
bytes : (only for x86_64)
"=r" : A register operand is allowed provided that it is in a
general register.
That should make sure x86_64 won't try to use REX prefixed opcodes for
1, 2 and 4 bytes values.- Create the instruction in a discarded section to calculate its size. This is
how we can align the beginning of the instruction on an address that will
permit atomic modificatino of the immediate value without knowing the size of
the opcode used by the compiler.
- Bugfix : 8 bytes 64 bits immediate value was declared as "4 bytes" in the
immediate structure.
- Change the immediate.c update code to support variable length opcodes.- Vastly simplified, using a busy looping IPI with interrupts disabled.
Does not protect against NMI nor MCE.Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Klee...
Since immediate values are by definition an optimization, I think it makes
sense to insist they be 1, 2, 4 or 8 bytes. A BUILD_BUG_ON() in the rightThis is now almost entirely generic code, but I suppose we can let the next
I think it would be easier to just fast-path the num_online_cpus == 1 case,
even if you want to keep this "update_early" interface.But I like your IPI algorithm: very tight.
Thanks,
Rusty.
-
* Rusty Russell (rusty@rustcorp.com.au) wrote:
Almost.. actually, I have to call those architecture specific primitives :
- text_poke : a memcpy that copies to write protected memory (disables the
WP bit)
- sync_core : synchronize the core, only defined on x86I'll also have to add a flush_icache_range to support powerpc correctly,
but this one turns out to be implemented on each architecture and
therefore is not a problem.So I guess the proper way to do this would be to add :
#define text_poke memcpy
#define text_poke_early text_poke
#define sync_core()to asm-$ARCH/cacheflush.h for each architecture we add support for
HAS_IMMEDIATE ?Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
I'll move it to kernel/immediate.c. We can therefore remove the powerpc
Nope, that could lead to problems. I call core_immediate_update()
_very_ early, before boot_cpu_init() is called. Therefore,
cpu_online_map is not set yet. I am not sure the benefit of using
num_online_cpus outweights the added fragility wrt other boot processThanks. Many thanks for the review, I'll repost the patchset for
comments.Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
Ah, I see the problem. It would in fact be clearer for us to move
boot_cpu_init() up to just after smp_setup_processor_id() in start_kernelI think it's still a win, though worth a comment that we always go via the
non-IPI path for the early boot case.Cheers,
Rusty.
-
Ok, another potential problem then :
If we fast-path the num_online_cpus == 1, then updates done after the
boot process will have to go through this code. Therefore, we would have
to disable interrupts in the early boot code.However, I doubt it is safe to use the paravirtualized
local_irq_disable/enable before the paravirt code is executed ?--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
Mathieu Desnoyers wrote:
- Use "=g" constraint for char immediate value inline assembly."=g" is the same as "=rmi" which is inherently bogus. In your actual
code you use "=r", the correct constraint is "=q".-hpa
-
Hi Peter,
Yup, =g wasn't what I was looking for at all, the header comment is
bogus.From
http://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constra...`r'
A register operand is allowed provided that it is in a general register.From
http://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Const...
Intel 386 config/i386/constraints.mdq
Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 64-bit mode, any integer register.I am worried that "=q" might exclude the si and di registers in 32-bit mode.
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
For "char" (8-bit) values, sp/bp/si/di are illegal in 32-bit mode.
Hence "=q".
-hpa
-
Ah! yep, I see, so we say:
1 byte : "=q"
2 bytes : "=r"
4 bytes : "=r"
8 bytes : "=r"--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
That's right.
-hpa
-
Andrew,
This version of the immediate values x86 optimization should be used
instead of the one originally submitted. It uses the correct constraints
and uses the new asm.h instead of asm-compat.h.Immediate Values - x86 Optimization
x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
non atomic writes to a code region only touched by us (nobody can execute it
since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use "=r" constraint for char immediate value inline assembly.
- Use "=q" for 1 byte immediate values to exclude si, di, bp, sp from the usable
register set.
- Use asm-x86/asm.h.Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
---
arch/x86/Kconfig.i386 | 3
arch/x86/Kconfig.x86_64 | 3
arch/x86/kernel/Makefile_32 | 1
arch/x86/kernel/Makefile_64 | 1
arch/x86/kernel/immediate.c | 330 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps_32.c | 10 -
include/asm-x86/immediate.h | 100 +++++++++++++
7 files changed, 444 insertions(+), 4 deletions(-)Index: linux-2.6-lttng/arch/x86/kernel/Makefile_32
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_32 2007-11-13 15:28:11.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/Makefile_32 2007-11-13 15:29:13.000000000 -0500
@@ -34,6 +34,7 @@ obj-$(CONFIG_KPROBES) += kprobes_32.o
obj-$(CONFI...
Something else to watch out for... in 64-bit mode the lengths most of
these will depend on which register is used, since whether or not a REX
prefix is needed will vary.As far as I can tell, you're assuming fixed length instructions, which
is wrong unless you manually constrain yourself to only legacy registers.-hpa
-
This is what I was pointing out in this previous message :
http://lkml.org/lkml/2007/10/20/92"I am still trying to figure out if we must assume that gas will produce
different length opcodes for mov instructions. The choice is:
- Either I use a "r" constraint and let gcc produce the instructions,
that I need to assume to have correct size so I can align their
immediate values (therefore, taking the offset from the end of the
instruction will not help). Here, if gas changes its behavior
dramatically for a given immediate value size, it will break.- Second choice is to stick to a particular register, choosing the one
with the less side-effect, and encoding the instruction ourselves. I
start to think that this second solution might be safer, even though
we wouldn't let the compiler select the register which has the less
impact by itself."Andi seemed to trust gas stability and you answered:
"The comment was referring to x86-64, but I incorrectly remembered that
applying to "movq $imm,%reg" as opposed to loading from an absolute
address. gas actually has a special opcode (movabs) for the 64-bit
version of the latter variant, which is only available with %rax and its
subregisters.Nevermind, in other words. It's still true, though, that the immediate
will always be the last thing in the instruction -- that's a fixture of
the instruction format."So, in the end, is there a way to make x86_64 use a fixed-size opcode
for the 1, 2, 4 and 8 bytes load immediates or we will have to force the
use of a specific register ?(and we can't take a pointer from the end of the instruction, because we
need to align the immediate value correctly)Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
For a 64-bit load, you'll always have a REX prefix. For 8-, 16- and
32-bit load, the length of the instruction will depend on the register
chosen, unless you constrain to either all legacy or all upper
registers, or you force gas to generate a prefix, but I don't think
there is a way to do that that will work with assemblers all the way
back to 2.12, which is at least what we officially support (I have no
idea if assemblers that far back actually *work*, mind you.)-hpa
-
Ok, so the most flexible solution that I see, that should fit for both
i386 and x86_64 would be :1 byte : "=Q" : Any register accessible as rh: a, b, c, and d.
2, 4 bytes : "=R" : Legacy register—the eight integer registers available
on all i386 processors (a, b, c, d, si, di, bp, sp).
8 bytes : (only for x86_64)
"=r" : A register operand is allowed provided that it is in a
general register.That should make sure x86_64 won't try to use REX prefixed opcodes for
1, 2 and 4 bytes values.Does it make sense ?
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
That's probably the best bet.
-hpa
-
Immediate Values - x86 Optimization
x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
non atomic writes to a code region only touched by us (nobody can execute it
since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.Ok, so the most flexible solution that I see, that should fit for both
i386 and x86_64 would be :
1 byte : "=Q" : Any register accessible as rh: a, b, c, and d.
2, 4 bytes : "=R" : Legacy register—the eight integer registers available
on all i386 processors (a, b, c, d, si, di, bp, sp). 8
bytes : (only for x86_64)
"=r" : A register operand is allowed provided that it is in a
general register.
That should make sure x86_64 won't try to use REX prefixed opcodes for
1, 2 and 4 bytes values.Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
---
arch/x86/Kconfig.i386 | 3
arch/x86/Kconfig.x86_64 | 3
arch/x86/kernel/Makefile_32 | 1
arch/x86/kernel/Makefile_64 | 1
arch/x86/kernel/immediate.c | 330 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps_32.c | 10 -
include/asm-x86/immediate.h | 109 ++++++++++++++
7 files changed, 453 insertions(+), 4 deletions(-)Index: linux-2.6-lttng/arch/x86/kernel/Makefile_32
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_32 2007-11-13 17:...
I just had a couple of utterly sick ideas.
Consider this variant (this example is for a 32-bit immediate on x86-64,
but the obvious macroizations apply):.section __discard,"a",@progbits
1: movl $0x12345678,%r9d
2:
.previous.section __immediate,"a",@progbits
.quad foo_immediate, (3f)-4, 4
.previous.org . + ((-.-(2b-1b)) & 3), 0x90
movl $0x12345678,%r9d
3:The idea is that the instruction is emitted into a section, which is
marked DISCARD in the linker script. That lets us actually measure the
length, and since we know the immediate is always at the end of the
instruction... done!-hpa
-
(the patch follows at the end of the discussion, it applies on top of
the "Immediate Values x86 Optimization (update 2)" patch.could we simply declare a
.section __discard,"",@progbits instead ?
I think this should remove the need to tweak the linker script, since
So in general we would have :
.org . + ((-.-(2b-1b)) & (immediate_size - 1)), 0x90
Here is the result, on i386, for an immediate value used by markers. I
normally use a 1 byte immediate value for this, but I tried to change
the data type to check the result. The resulting assembly and binary
generated on x86_64 follows.* A 1 byte immediate value (no need to align) :
#APP
.section __immediate,"a",@progbits
.long __mark_kernel_sched_try_wakeup.30104+8, (3f)-1, 1
.previous
mov $0,%al
3:.LVL1116:
#NO_APP
.LBE2180:
testb %al, %al
jne .L1028* A 2 bytes immediate value :
#APP
.section __discard,"",@progbits
1:
mov $0,%ax
2:
.previous
.section __immediate,"a",@progbits
.long __mark_kernel_sched_try_wakeup.30104+8, (3f)-2, 2
.previous
.org . + ((-.-(2b-1b)) & (2-1)), 0x90
mov $0,%ax
3:.LVL1116:
#NO_APP
.LBE2180:
testw %ax, %ax
jne .L1028* A 4 bytes immediate value :
#APP
.section __discard,"",@progbits
1:
mov $0,%eax
2:
.previous
.section __immediate,"a",@progbits
.long __mark_kernel_sched_try_wakeup.30104+8, (3f)-4, 4
.previous
.org . + ((-.-(2b-1b)) & (4-1)), 0x90
mov $0,%eax
3:.LVL1116:
#NO_APP
.LBE2180:
testl %eax, %eax
jne .L1028And on x86_64 :
* A 1 byte immediate value :
#APP
.section __immediate,"a",@progbits
.quad __mark_kernel_sched_try_wakeup.31114+16, (3f)-1, 1
.previous
mov $0,%al
3:.LVL705:
#NO_APP
.LBE2003:
testb %al, %al
je .L676Resulting ...
The allocation flag is a loader property, not a linker property. It
would still be manifest in the vmlinux file; I think this is
undesirable. Perhaps not a big deal, but neither is adding a /DISCARD/
statement to the linker script:/DISCARD/ : { *(__discard) }
-hpa
-
Add a __discard sectionto the linker script. Code produced in this section will
not be put in the vmlinux file. This is useful when we have to calculate the
size of an instruction before actually declaring it (for alignment purposes for
instance). This is used by the immediate values.Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
---
arch/x86/kernel/vmlinux_32.lds.S | 1 +
arch/x86/kernel/vmlinux_64.lds.S | 1 +
2 files changed, 2 insertions(+)Index: linux-2.6-lttng/arch/x86/kernel/vmlinux_32.lds.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/vmlinux_32.lds.S 2007-11-14 14:10:43.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/vmlinux_32.lds.S 2007-11-14 14:11:32.000000000 -0500
@@ -205,6 +205,7 @@ SECTIONS
/* Sections to be discarded */
/DISCARD/ : {
*(.exitcall.exit)
+ *(__discard)
}STABS_DEBUG
Index: linux-2.6-lttng/arch/x86/kernel/vmlinux_64.lds.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/vmlinux_64.lds.S 2007-11-14 14:10:46.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/vmlinux_64.lds.S 2007-11-14 14:11:48.000000000 -0500
@@ -227,6 +227,7 @@ SECTIONS
/DISCARD/ : {
*(.exitcall.exit)
*(.eh_frame)
+ *(__discard)
}STABS_DEBUG
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
Acked-by: H. Peter Anvin <hpa@zytor.com>
-
Wow! I like this idea. I'll give it a try in a test sandbox to see if
gas or ld complains. I'll keep you posted.--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
| Amit K. Arora | [RFC] Heads up on sys_fallocate() |
| H. Peter Anvin | Re: [RFC 00/15] x86_64: Optimize percpu accesses |
| Nicolas Pitre | Re: [RFC patch 08/18] cnt32_to_63 should use smp_rmb() |
| Bart Van Assche | Integration of SCST in the mainstream Linux kernel |
git: | |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| David Miller | [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Natalie Protasevich | [BUG] New Kernel Bugs |
