Mathieu Desnoyers wrote: - Use "=g" constraint for char immediate value inline assembly. "=g" is the same as "=rmi" which is inherently bogus. In your actual code you use "=r", the correct constraint is "=q". -hpa -
Hi Peter, Yup, =g wasn't what I was looking for at all, the header comment is bogus. From http://gcc.gnu.org/onlinedocs/gcc/Simple-Constraints.html#Simple-Constraints `r' A register operand is allowed provided that it is in a general register. From http://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints Intel 386 config/i386/constraints.md q Any register accessible as rl. In 32-bit mode, a, b, c, and d; in 64-bit mode, any integer register. I am worried that "=q" might exclude the si and di registers in 32-bit mode. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
For "char" (8-bit) values, sp/bp/si/di are illegal in 32-bit mode. Hence "=q". -hpa -
Ah! yep, I see, so we say: 1 byte : "=q" 2 bytes : "=r" 4 bytes : "=r" 8 bytes : "=r" -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
Andrew, This version of the immediate values x86 optimization should be used instead of the one originally submitted. It uses the correct constraints and uses the new asm.h instead of asm-compat.h. Immediate Values - x86 Optimization x86 optimization of the immediate values which uses a movl with code patching to set/unset the value used to populate the register used as variable source. Changelog: - Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing non atomic writes to a code region only touched by us (nobody can execute it since we are protected by the immediate_mutex). - Put immediate_set and _immediate_set in the architecture independent header. - Use $0 instead of %2 with (0) operand. - Add x86_64 support, ready for i386+x86_64 -> x86 merge. - Use "=r" constraint for char immediate value inline assembly. - Use "=q" for 1 byte immediate values to exclude si, di, bp, sp from the usable register set. - Use asm-x86/asm.h. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> CC: Andi Kleen <ak@muc.de> CC: "H. Peter Anvin" <hpa@zytor.com> CC: Chuck Ebbert <cebbert@redhat.com> CC: Christoph Hellwig <hch@infradead.org> CC: Jeremy Fitzhardinge <jeremy@goop.org> --- arch/x86/Kconfig.i386 | 3 arch/x86/Kconfig.x86_64 | 3 arch/x86/kernel/Makefile_32 | 1 arch/x86/kernel/Makefile_64 | 1 arch/x86/kernel/immediate.c | 330 ++++++++++++++++++++++++++++++++++++++++++++ arch/x86/kernel/traps_32.c | 10 - include/asm-x86/immediate.h | 100 +++++++++++++ 7 files changed, 444 insertions(+), 4 deletions(-) Index: linux-2.6-lttng/arch/x86/kernel/Makefile_32 =================================================================== --- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_32 2007-11-13 15:28:11.000000000 -0500 +++ linux-2.6-lttng/arch/x86/kernel/Makefile_32 2007-11-13 15:29:13.000000000 -0500 @@ -34,6 +34,7 @@ obj-$(CONFIG_KPROBES) += kprobes_32.o obj-$(CONFIG_MODULES) += module_32.o ...
Something else to watch out for... in 64-bit mode the lengths most of these will depend on which register is used, since whether or not a REX prefix is needed will vary. As far as I can tell, you're assuming fixed length instructions, which is wrong unless you manually constrain yourself to only legacy registers. -hpa -
This is what I was pointing out in this previous message : http://lkml.org/lkml/2007/10/20/92 "I am still trying to figure out if we must assume that gas will produce different length opcodes for mov instructions. The choice is: - Either I use a "r" constraint and let gcc produce the instructions, that I need to assume to have correct size so I can align their immediate values (therefore, taking the offset from the end of the instruction will not help). Here, if gas changes its behavior dramatically for a given immediate value size, it will break. - Second choice is to stick to a particular register, choosing the one with the less side-effect, and encoding the instruction ourselves. I start to think that this second solution might be safer, even though we wouldn't let the compiler select the register which has the less impact by itself." Andi seemed to trust gas stability and you answered: "The comment was referring to x86-64, but I incorrectly remembered that applying to "movq $imm,%reg" as opposed to loading from an absolute address. gas actually has a special opcode (movabs) for the 64-bit version of the latter variant, which is only available with %rax and its subregisters. Nevermind, in other words. It's still true, though, that the immediate will always be the last thing in the instruction -- that's a fixture of the instruction format." So, in the end, is there a way to make x86_64 use a fixed-size opcode for the 1, 2, 4 and 8 bytes load immediates or we will have to force the use of a specific register ? (and we can't take a pointer from the end of the instruction, because we need to align the immediate value correctly) Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
For a 64-bit load, you'll always have a REX prefix. For 8-, 16- and 32-bit load, the length of the instruction will depend on the register chosen, unless you constrain to either all legacy or all upper registers, or you force gas to generate a prefix, but I don't think there is a way to do that that will work with assemblers all the way back to 2.12, which is at least what we officially support (I have no idea if assemblers that far back actually *work*, mind you.) -hpa -
Ok, so the most flexible solution that I see, that should fit for both
i386 and x86_64 would be :
1 byte : "=Q" : Any register accessible as rh: a, b, c, and d.
2, 4 bytes : "=R" : Legacy register—the eight integer registers available
on all i386 processors (a, b, c, d, si, di, bp, sp).
8 bytes : (only for x86_64)
"=r" : A register operand is allowed provided that it is in a
general register.
That should make sure x86_64 won't try to use REX prefixed opcodes for
1, 2 and 4 bytes values.
Does it make sense ?
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
That's probably the best bet. -hpa -
Immediate Values - x86 Optimization
x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.
Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
non atomic writes to a code region only touched by us (nobody can execute it
since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.
Ok, so the most flexible solution that I see, that should fit for both
i386 and x86_64 would be :
1 byte : "=Q" : Any register accessible as rh: a, b, c, and d.
2, 4 bytes : "=R" : Legacy register—the eight integer registers available
on all i386 processors (a, b, c, d, si, di, bp, sp). 8
bytes : (only for x86_64)
"=r" : A register operand is allowed provided that it is in a
general register.
That should make sure x86_64 won't try to use REX prefixed opcodes for
1, 2 and 4 bytes values.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
---
arch/x86/Kconfig.i386 | 3
arch/x86/Kconfig.x86_64 | 3
arch/x86/kernel/Makefile_32 | 1
arch/x86/kernel/Makefile_64 | 1
arch/x86/kernel/immediate.c | 330 ++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps_32.c | 10 -
include/asm-x86/immediate.h | 109 ++++++++++++++
7 files changed, 453 insertions(+), 4 deletions(-)
Index: linux-2.6-lttng/arch/x86/kernel/Makefile_32
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/Makefile_32 2007-11-13 17:06:16.000000000 -0500
+++ ...I just had a couple of utterly sick ideas. Consider this variant (this example is for a 32-bit immediate on x86-64, but the obvious macroizations apply): .section __discard,"a",@progbits 1: movl $0x12345678,%r9d 2: .previous .section __immediate,"a",@progbits .quad foo_immediate, (3f)-4, 4 .previous .org . + ((-.-(2b-1b)) & 3), 0x90 movl $0x12345678,%r9d 3: The idea is that the instruction is emitted into a section, which is marked DISCARD in the linker script. That lets us actually measure the length, and since we know the immediate is always at the end of the instruction... done! -hpa -
Wow! I like this idea. I'll give it a try in a test sandbox to see if gas or ld complains. I'll keep you posted. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
(the patch follows at the end of the discussion, it applies on top of
the "Immediate Values x86 Optimization (update 2)" patch.
could we simply declare a
.section __discard,"",@progbits instead ?
I think this should remove the need to tweak the linker script, since
So in general we would have :
.org . + ((-.-(2b-1b)) & (immediate_size - 1)), 0x90
Here is the result, on i386, for an immediate value used by markers. I
normally use a 1 byte immediate value for this, but I tried to change
the data type to check the result. The resulting assembly and binary
generated on x86_64 follows.
* A 1 byte immediate value (no need to align) :
#APP
.section __immediate,"a",@progbits
.long __mark_kernel_sched_try_wakeup.30104+8, (3f)-1, 1
.previous
mov $0,%al
3:
.LVL1116:
#NO_APP
.LBE2180:
testb %al, %al
jne .L1028
* A 2 bytes immediate value :
#APP
.section __discard,"",@progbits
1:
mov $0,%ax
2:
.previous
.section __immediate,"a",@progbits
.long __mark_kernel_sched_try_wakeup.30104+8, (3f)-2, 2
.previous
.org . + ((-.-(2b-1b)) & (2-1)), 0x90
mov $0,%ax
3:
.LVL1116:
#NO_APP
.LBE2180:
testw %ax, %ax
jne .L1028
* A 4 bytes immediate value :
#APP
.section __discard,"",@progbits
1:
mov $0,%eax
2:
.previous
.section __immediate,"a",@progbits
.long __mark_kernel_sched_try_wakeup.30104+8, (3f)-4, 4
.previous
.org . + ((-.-(2b-1b)) & (4-1)), 0x90
mov $0,%eax
3:
.LVL1116:
#NO_APP
.LBE2180:
testl %eax, %eax
jne .L1028
And on x86_64 :
* A 1 byte immediate value :
#APP
.section __immediate,"a",@progbits
.quad __mark_kernel_sched_try_wakeup.31114+16, (3f)-1, 1
.previous
mov $0,%al
3:
.LVL705:
#NO_APP
.LBE2003:
testb %al, %al
je .L676
Resulting binary ...The allocation flag is a loader property, not a linker property. It
would still be manifest in the vmlinux file; I think this is
undesirable. Perhaps not a big deal, but neither is adding a /DISCARD/
statement to the linker script:
/DISCARD/ : { *(__discard) }
-hpa
-
Add a __discard sectionto the linker script. Code produced in this section will
not be put in the vmlinux file. This is useful when we have to calculate the
size of an instruction before actually declaring it (for alignment purposes for
instance). This is used by the immediate values.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen <ak@muc.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Chuck Ebbert <cebbert@redhat.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
---
arch/x86/kernel/vmlinux_32.lds.S | 1 +
arch/x86/kernel/vmlinux_64.lds.S | 1 +
2 files changed, 2 insertions(+)
Index: linux-2.6-lttng/arch/x86/kernel/vmlinux_32.lds.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/vmlinux_32.lds.S 2007-11-14 14:10:43.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/vmlinux_32.lds.S 2007-11-14 14:11:32.000000000 -0500
@@ -205,6 +205,7 @@ SECTIONS
/* Sections to be discarded */
/DISCARD/ : {
*(.exitcall.exit)
+ *(__discard)
}
STABS_DEBUG
Index: linux-2.6-lttng/arch/x86/kernel/vmlinux_64.lds.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/vmlinux_64.lds.S 2007-11-14 14:10:46.000000000 -0500
+++ linux-2.6-lttng/arch/x86/kernel/vmlinux_64.lds.S 2007-11-14 14:11:48.000000000 -0500
@@ -227,6 +227,7 @@ SECTIONS
/DISCARD/ : {
*(.exitcall.exit)
*(.eh_frame)
+ *(__discard)
}
STABS_DEBUG
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-
Acked-by: H. Peter Anvin <hpa@zytor.com> -
For the record, I think the patching code gross overkill. A stop_machine (or lightweight variant using IPI) would be sufficient and vastly simpler. Trying to patch NMI handlers while they're running is already crazy. I'd keep this version up your sleeve for they day when it's needed. Rusty. -
I wouldn't mind if it was limited to the code within do_nmi(), but then we would have to accept potential GPF if A - the NMI or MCE code calls any external kernel code (printk, notify_die, spin_lock/unlock, die_nmi, lapic_wd_event (perfctr code, calls printk too for debugging)... B - we try to patch this code at the wrong moment I could live with that, but I would prefer to have a solid, non flaky solution. My goal is to help the kernel quality _improve_ rather than deteriorate. Therefore, if one decides to use the immediate values to leave dormant spinlock instrumentation in the kernel, I wouldn't want it to have undesirable side-effects (GPF) when the instrumentation is If we choose to go this way, stop_machine would have to do a sync_core() on every CPU before it reactivates interrupts for this to respect Intel's errata. It's not just a matter of not executing the code while it is modified; the issue here is that we must insure that we don't have an incoherent trace cache. So, as is, stop_machine would not respect the errata. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
Sure, but as I pointed out previously, such calls are already best effort. You can do very little safely from do_nmi(), and calling printk isn't one of them, nor is grabbing a spinlock (well, actually you could as long as it's Yes, I don't think stop_machine is actually what you want anyway, since you are happy to run in interrupt context. An IPI-based scheme is probably better, and also has the side effect of iret doing the sync you need, IIUC. Hope that clarifies, Rusty. -
yes and no.. do_nmi uses the "bust spinlocks" exactly for this. So this is ok by design. Other than this, we can end up mixing up the console data output with different sources of characters, but I doubt something Yup, see arch/x86/kernel/nmi_64.c : nmi_watchdog_tick() It defines a spinlock to "Serialise the printks". I guess it's good to protect against other nmi watchdogs running on other CPUs concurrently, So should we put a warning telling "enabling tracing or profiling on a production system that also uses NMI watchdog could potentially cause a crash" ? The rarer a bug is, the more difficult it is to debug. It does not make the bug hurt less when it happens. The normal thing to do when a potential deadlock is detected is to fix it, not to leave it there under the premise that it doesn't matter since it happens rarely. In our case, where we know there is a potential race, I don't see any reason not to make sure it never happens. What's the cost of it ? arch/x86/kernel/immediate.o : 2.4K let's compare.. kernel/stop_machine.o : 3.9K so I think that code size is not an issue there, especially since the Yup, looping in IPIs with interrupts disabled should do the job there. It's just awful for interrupt latency on large SMP systems :( Being currently bad at it is not a reason to make it worse. If we have a CPU that is within a high latency irq disable region when we send the IPI, we can easily end up waiting for this critical section to end with interrupts disabled on all CPUs. The fact that we would wait for the longest interrupt disable region with IRQs disabled implies that we increase the maximum latency of the system, by design. I'm not sure I would like to be the new longest record-beating IRQ off region. -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
Ok, let's start with a simple version then. I'll plan to resubmit the
more sophisticated one when I submit a patch to make my markers use the
immediate values.
Here it is, for a first round of review :
Immediate Values - x86 Optimization
x86 optimization of the immediate values which uses a movl with code patching
to set/unset the value used to populate the register used as variable source.
Changelog:
- Use text_poke_early with cr0 WP save/restore to patch the bypass. We are doing
non atomic writes to a code region only touched by us (nobody can execute it
since we are protected by the immediate_mutex).
- Put immediate_set and _immediate_set in the architecture independent header.
- Use $0 instead of %2 with (0) operand.
- Add x86_64 support, ready for i386+x86_64 -> x86 merge.
- Use asm-x86/asm.h.
Ok, so the most flexible solution that I see, that should fit for both
i386 and x86_64 would be :
1 byte : "=Q" : Any register accessible as rh: a, b, c, and d.
2, 4 bytes : "=R" : Legacy register—the eight integer registers available
on all i386 processors (a, b, c, d, si, di, bp, sp). 8
bytes : (only for x86_64)
"=r" : A register operand is allowed provided that it is in a
general register.
That should make sure x86_64 won't try to use REX prefixed opcodes for
1, 2 and 4 bytes values.
- Create the instruction in a discarded section to calculate its size. This is
how we can align the beginning of the instruction on an address that will
permit atomic modificatino of the immediate value without knowing the size of
the opcode used by the compiler.
- Bugfix : 8 bytes 64 bits immediate value was declared as "4 bytes" in the
immediate structure.
- Change the immediate.c update code to support variable length opcodes.
- Vastly simplified, using a busy looping IPI with interrupts disabled.
Does not protect against NMI nor MCE.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Andi Kleen ...Since immediate values are by definition an optimization, I think it makes sense to insist they be 1, 2, 4 or 8 bytes. A BUILD_BUG_ON() in the right This is now almost entirely generic code, but I suppose we can let the next I think it would be easier to just fast-path the num_online_cpus == 1 case, even if you want to keep this "update_early" interface. But I like your IPI algorithm: very tight. Thanks, Rusty. -
Ok, another potential problem then : If we fast-path the num_online_cpus == 1, then updates done after the boot process will have to go through this code. Therefore, we would have to disable interrupts in the early boot code. However, I doubt it is safe to use the paravirtualized local_irq_disable/enable before the paravirt code is executed ? -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
* Rusty Russell (rusty@rustcorp.com.au) wrote: Almost.. actually, I have to call those architecture specific primitives : - text_poke : a memcpy that copies to write protected memory (disables the WP bit) - sync_core : synchronize the core, only defined on x86 I'll also have to add a flush_icache_range to support powerpc correctly, but this one turns out to be implemented on each architecture and therefore is not a problem. So I guess the proper way to do this would be to add : #define text_poke memcpy #define text_poke_early text_poke #define sync_core() to asm-$ARCH/cacheflush.h for each architecture we add support for HAS_IMMEDIATE ? Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -
