One of the things that bothered me about the latest ftrace code was this annoying daemon that would wake up once a second to see if it had work to do. If it did not, it would go to sleep, otherwise it would do its work and then go to sleep. You see, the reason for this is that for ftrace to maintain performance when configured in but disabled, it would need to change all the locations that called "mcount" (enabled with the gcc -pg option) into nops. The "-pg" option in gcc sets up a function profiler to call this function called "mcount". If you simply have "mcount" return, it will still add 15 to 18% overhead in performance. Changing all the calls to nops moved the overhead into noise. To get rid of this, I had the mcount code record the location that called it. Later, the "ftraced" daemon would wake up and look to see if any new functions were recorded. If so, it would call kstop_machine and convert the calls to "nops". We needed kstop_machine because bad things happen on SMP if you modify code that happens to be in the instruction cache of another CPU. This "ftraced" kernel thread would be a happy little worker, but it caused some pains. One, is that it woke up once a second, and Ted Tso got mad at me because it would show up on PowerTop. I could easily make the default 5 seconds, and even have it runtime configurable, with a trivial patch. I have not got around to doing that yet. The other annoying thing, and this one bothers me the most, is that we can not have this enabled on a production -rt kernel. The latency caused by the kstop_machine when doing work is lost in the noise on a non-rt kernel, but it can be up to 800 microseconds, and that would kill the -rt kernel. The reason this bothered me the most, is that -rt is where it came from, and ftraced was not treating its motherland very well. Along came Gregory Haskins, who was bickering about having ftrace enabled on a production -rt kernel. I told him the reasons that this would be bad and then he started ...
Hi Steven, If you really want to stop using stop_machine, I think you should have a look at my immediate values infrastructure : see: http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x8... particularly replace_instruction_safe(), which uses a temporary breakpoint to safely replace an instruction on a live SMP system without using stop_machine. Feel free to try it in ftrace. :) You may need to tweak the imv_notifier(), which is responsible for executing the breakpoint. The only special thing this handler has to know is the non-relocatable instructions you might want to insert (it only cares about the new instructions, not the old ones being replaced). The current code deals with 5-bytes jumps. Note that the instruction is executed in the breakpoint handler with preemption disabled, which might not be well suited for a call instruction. I would recommend to patch in a 5-bytes jmp with 0 offset (e9 00 00 00 00) when disabled (this makes a single 5-bytes instruction and thus makes sure no instruction pointer can iret in the middle). When enabling the site, you could patch-in the original call, but you should tweak the imv_notifier() so that it uses the jmp offset 0 in the bypass instead of the function call, because preemption would otherwise be disabled around the call when executed in the breakpoint bypass. Therefore, simply statically forcing the bypass code to e9 00 00 00 00 in all the cases (a nop) would do the job for your needs. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I originally used jumps instead of nops, but unfortunately, they actually hurt performance more than adding nops. Ingo told me it was probably due to using up the jump predictions of the CPU. -- Steve --
Hrm, are you sure you use a single 5-bytes nop instruction then, or do you use a mix of various nop sizes (add_nops) on some architectures ? You can consume the branch prediction buffers for conditional branches, but I doubt static jumps have this impact ? I don't see what "jump predictions" you are referring to here exactly. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I use (for x86) what is in include/asm-x86/nops.h depending on what the I don't know the details, but we definitely saw a drop in preformance between using nops and static jumps. -- Steve --
That's bad : #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4 #define K8_NOP5 K8_NOP3 K8_NOP2 #define K7_NOP5 K7_NOP4 ASM_NOP1 So, when you try, later, to replace these instructions with a single 5-bytes instruction, a preempted thread could iret in the middle of your Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00 instead of the 5-bytes add_nops ? On which architectures ? -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I ran this on my Dell (intel Xeon), which IIRC did show the performance degration. I unfortunately don't have the time to redo those tests, but you are welcome to. Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump. In fact, the comments in that file still say it is a jmp. Remember, my first go was to use the jmp. -- Steve --
kstop_machine does not guarantee that you won't have _any_ thread preempted with IP pointing exactly in the middle of your instructions _before_ the modification scheduled back in _after_ the modification and thus causing an illegal instruction. I'll try to find time to compare : multi-instructions 5-bytes nops (although this approach is just buggy) 5-bytes jump to the next address 2-bytes jump to offset +3. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Hmm, good point. Unless... Can a processor be preempted in a middle of nops? What do nops do for a processor? Can it skip them nicely in one shot? This means I'll have to do the benchmarks again, and see what the performance difference of a jmp and a nop is significant. I'm thinking that if the processor can safely skip nops without any type of processing, this may be the reason that nops are better than a jmp. A jmp causes the processor to do a little more work. I might even run a test to see if I can force a processor that uses the three-two nops to preempt between them. I can add a test in x86 ftrace.c to check to see which nop was used, and use the jmp if the arch does not have a 5 byte nop. I'm assuming that jmp is more expensive than the nops because otherwise a jmp 0 would have been used as a 5 byte nop. -- Steve --
Given that those are multiple instructions, I think a processor has all the rights to preempt in the middle of them. And even if some specific architecture, for any obscure reason, happens to merge them, I don't Yup, although one architecture not triggering this doesn't say much about the various x86 flavors out there. In any case - if you trigger the problem, we have to fix it. - if you do not succeed to trigger the problem, we will have to test it on a wider architecture range and maybe end up fixit it anyway to play safe with the specs. I would propose the following alternative : Create new macros in include/asm-x86/nops.h : /* short jump, offset 3 bytes : skips total of 5 bytes */ #define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n" #if defined(CONFIG_MK7) #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5 #elif defined(CONFIG_X86_P6_NOP) #define ATOMIC_NOP5 P6_NOP5 #elif defined(CONFIG_X86_64) #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5 #else #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5 #endif And then optimize if necessary. You will probably find plenty of knowledgeable people who will know better 5-bytes nop instruction more efficient than this "generic" short jump offset 0x3. Then you can use the (buggy) 3nops/2nops as a performance baseline and see the performance hit on each architecture. First get it right, then make it fast.... -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
[ patch included ]
I'm stubborn, I want to get it right _and_ keep it fast.
I still want the NOPS. Using jmps will hurt performance and that would
keep this turned off on all distros.
But lets think outside the box here (and we will ignore Alan's cat).
Right now the issue is that we might preempt after the first nop, and when
we enable the code, that task will crash when it tries to read the second
nop.
Since we are doing the modifications from kstop_machine, all the tasks are
stopped. We can simply look to see if the tasks have been preempted in
kernel space and if so, is their instruction pointer pointing to the
second nop. If it is, move the ip forward.
Here's a patch that does just that for both i386 and x86_64.
I added a field in the thread_info struct called "ip". This is a pointer
to the location of the task ip in the stack if it was preempted in kernel
space. Null otherwise:
jz restore_all
+ lea PT_EIP(%esp), %eax
+ movl %eax, TI_ip(%ebp)
call preempt_schedule_irq
+ GET_THREAD_INFO(%ebp)
+ movl $0, TI_ip(%ebp)
jmp need_resched
Then, just before we enable tracing (we only need to do this when we
enable tracing, since that is when we have a two instruction nop), we look
at all the tasks. If the task->thread_info->ip is set, this means that it
was preempted just before going back to the kernel.
We look at the **ip and see if it compares with the second nop. If it
does, we increment the ip by the size of that nop:
if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
/* Match, move the ip forward */
*ip += x86_nop5_part2;
We do this just once before enabling all the locations, and we only do it
if we have a two part nop.
Interesting enough, I wrote a module that did the following:
void (*silly_func)(void);
void do_something_silly(void)
{
}
static int my_thread(void *arg)
{
int i;
...For me it would seem better to just not use two part 5 byte nops instead of adding such hacks. I doubt there are that many of them anyways. I bet you won't be able to measure any difference between the different nop types in any macro benchmark. -Andi --
I wish we had a true 5 byte nop. The alternative is a jmp 0, which is measurable. This is replacing mcount from a kernel compile with the -pg option. With a basic build (not counting modules), I have over 15,000 locations that are turned into these 5 byte nops. # objdump -dr vmlinux.o | grep mcount |wc 15152 45489 764924 If we use the jmp 0, then yes, we will see the overhead. The double nop that is used for 5 bytes, is significantly better than the jump. -- Steve --
I guess the next step is to test if it is slow. -- Steve --
I don't think so. Multiple redundant prefixes can be really expensive on some uarchs. A no-op that isn't cheap isn't a no-op at all, it's a slow-op. Linus --
A quick meaningless benchmark showed a slight perfomance hit. Here's 10 runs of "hackbench 50" using the two part 5 byte nop: run 1 Time: 4.501 run 2 Time: 4.855 run 3 Time: 4.198 run 4 Time: 4.587 run 5 Time: 5.016 run 6 Time: 4.757 run 7 Time: 4.477 run 8 Time: 4.693 run 9 Time: 4.710 run 10 Time: 4.715 avg = 4.6509 And 10 runs using the above 5 byte nop: run 1 Time: 4.832 run 2 Time: 5.319 run 3 Time: 5.213 run 4 Time: 4.830 run 5 Time: 4.363 run 6 Time: 4.391 run 7 Time: 4.772 run 8 Time: 4.992 run 9 Time: 4.727 run 10 Time: 4.825 avg = 4.8264 # cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 65 model name : Dual-Core AMD Opteron(tm) Processor 2220 stepping : 3 cpu MHz : 2799.992 cache size : 1024 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy bogomips : 5599.98 clflush size : 64 power management: ts fid vid ttp tm stc There's 4 of these. Just to make sure, I ran the above nop test again: [ this is reverse from the above runs ] run 1 Time: 4.723 run 2 Time: 5.080 run 3 Time: 4.521 run 4 Time: 4.841 run 5 Time: 4.696 run 6 Time: 4.946 run 7 Time: 4.754 run 8 Time: 4.717 run 9 Time: 4.905 run 10 Time: 4.814 avg = 4.7997 And again the two part nop: run 1 Time: 4.434 run 2 Time: 4.496 run 3 Time: 4.801 run 4 Time: 4.714 run 5 Time: 4.631 run 6 Time: 5.178 run 7 Time: 4.728 run 8 Time: 4.920 run 9 Time: 4.898 run 10 Time: 4.770 avg = 4.757 This time it was close, but still seems to have some difference. heh, perhaps it's just noise. -- Steve --
Hi Steven,
I tried to run my own tests to see if I could get to know if these
numbers are actually meaningful at all. My results seems to show that
there is not any significant difference between the various
configurations, and actually that the only one tendency I see is that
the 2-bytes jump offset 0x03 would be slightly faster than the 3/2 nop
on Intel Xeon. But we would have to run these a bit more often to
confirm that I guess.
I am just trying to get a sense of whether we are really trying hard to
optimize something worthless in practice, and to me it looks like it.
But it could be the architecture I am using that brings these results.
Mathieu
Intel Xeon dual quad-core
Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
3/2 nop used :
K8_NOP3 K8_NOP2
#define K8_NOP2 ".byte 0x66,0x90\n"
#define K8_NOP3 ".byte 0x66,0x66,0x90\n"
** Summary **
Test A : make -j20 2.6.27-rc2 kernel (real time)
Avg. std.dev
Case 1 : ftrace not compiled-in. 1m9.76s 0.41s
Case 2 : 3/2 nops 1m9.95s 0.36s
Case 3 : 2-bytes jump, offset 0x03 1m9.10s 0.40s
Case 4 : 5-bytes jump, offset 0x00 1m9.25s 0.34s
Test B : hackbench 15
Case 1 : ftrace not compiled-in. 0.349s 0.007s
Case 2 : 3/2 nops 0.351s 0.014s
Case 3 : 2-bytes jump, offset 0x03 0.350s 0.007s
Case 4 : 5-bytes jump, offset 0x00 0.351s 0.010s
** Detail **
* Test A
benchmark : make -j20 2.6.27-rc2 kernel
make clean; make -j20; make clean done before the tests to prime caches.
Same .config used.
Case 1 : ftrace not compiled-in.
real 1m9.980s
user 7m27.664s
sys 0m48.771s
real 1m9.330s
user 7m27.244s
sys 0m50.567s
real 1m9.393s
user 7m27.408s
sys 0m50.511s
real 1m9.674s
user 7m28.088s
sys 0m50.327s
real 1m10.441s
user 7m27.736s
sys 0m49.687s
real time
average : 1m9.76s
std. dev. : 0.41s
after a reboot with the same kernel ...Small correction : my architecture uses the P6_NOP5, which is an atomic 5-bytes nop (just looked at the runtime output of find_nop_table()). 5: nopl 0x00(%eax,%eax,1) #define P6_NOP5 ".byte 0x0f,0x1f,0x44,0x00,0\n" But the results stands. Maybe I should try to force a test with a K8_NOP3 K8_NOP2 nop. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Hi Steven, I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and Intel Pentium 4 boxes to compare a baseline (function doing a bit of memory read and arithmetic operations) to cases where nops are used. Here are the results. The kernel module used for the benchmarks is below, feel free to run it on your own architectures. Xeon : NR_TESTS 10000000 test empty cycles : 165472020 test 2-bytes jump cycles : 166666806 test 5-bytes jump cycles : 166978164 test 3/2 nops cycles : 169259406 test 5-bytes nop with long prefix cycles : 160000140 test 5-bytes P6 nop cycles : 163333458 AMD64 : NR_TESTS 10000000 test empty cycles : 145142367 test 2-bytes jump cycles : 150000178 test 5-bytes jump cycles : 150000171 test 3/2 nops cycles : 159999994 test 5-bytes nop with long prefix cycles : 150000156 test 5-bytes P6 nop cycles : 150000148 Intel Pentium 4 : NR_TESTS 10000000 test empty cycles : 290001045 test 2-bytes jump cycles : 310000568 test 5-bytes jump cycles : 310000478 test 3/2 nops cycles : 290000565 test 5-bytes nop with long prefix cycles : 311085510 test 5-bytes P6 nop cycles : 300000517 test Generic 1/4 5-bytes nops cycles : 310000553 test K7 1/4 5-bytes nops cycles : 300000533 These numbers show that both on Xeon and AMD64, the .byte 0x66,0x66,0x66,0x66,0x90 (osp osp osp osp nop, which is not currently used in nops.h) is the fastest nop on both architectures. The currently used 3/2 nops looks like a _very_ bad choice for AMD64 cycle-wise. The currently used 5-bytes P6 nop used on Xeon seems to be a bit slower than the 0x66,0x66,0x66,0x66,0x90 nop ...
Note that the biggest problems of a jump-based nop are likely to happen when there are I$ misses and/or when there are other jumps involved. Ie a some microarchitectures tend to have issues with jumps to jumps, or when there are multiple control changes in the same (possibly partial) cacheline because the instruction stream prediction may be predecoded in the L1 I$, and multiple branches in the same cacheline - or in the same execution cycle - can pollute that kind of thing. So microbenchmarking this way will probably make some things look unrealistically good. On the P4, the trace cache makes things even more interesting, since it's another level of I$ entirely, with very different behavior for the hit case vs the miss case. And I$ misses for the kernel are actually fairly high. Not in microbenchmarks that tend to have very repetive behavior and a small I$ footprint, but in a lot of real-life loads the *bulk* of all action is in user space, and then the kernel side is often invoced with few loops (the kernel has very few loops indeed) and a cold I$. So your numbers are interesting, but it would be really good to also get some info from Intel/AMD who may know about microarchitectural issues for the cases that don't show up in the hot-I$-cache environment. Linus --
Must be careful to miss the big picture here. We have two assumptions here in this thread: - Normal alternative() nops are relatively infrequent, typically in points with enough pipeline bubbles anyways, and it likely doesn't matter how they are encode. And also they don't have an issue with mult part instructions anyways because they're not patched at runtime, so always the best known can be used. - The one case where nops are very frequent and matter and multipart is a problem is with ftrace noping out the call to mcount at runtime because that happens on every function entry. Even there the overhead is not that big, but at least measurable in kernel builds. Now the numbers have shown that just by not using frame pointer ( -pg right now implies frame pointer) you can get more benefit than what you lose from using non optimal nops. So for me the best strategy would be to get rid of the frame pointer and ignore the nops. This unfortunately would require going away from -pg and instead post process gcc output to insert "call mcount" manually. But the nice advantage of that is that you could actually set up a custom table of callers built in a ELF section and with that you don't actually need the runtime patching (which is only done currently because there's no global table of mcount calls), but could do everything in stop_machine(). Without runtime patching you also don't need single part nops. I think that would be the best option. I especially like it because it would prevent forcing frame pointer which seems to be costlier than any kinds of nosp. -Andi --
How would you deal with inlines? Using debug information? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
> How would you deal with inlines? Using debug information? -pg already ignores inlines, so they aren't even traced today. It pretty much has to, assume an inline gets spread out by the global optimizer over the rest of the function, where would the mcount calls be inserted? -Andi --
Good point. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
I agree that if frame pointer brings a too big overhead, it should not be used. Sorry to ask, I feel I must be missing something, but I'm trying to figure out where you propose to add the "call mcount" ? In the caller or in the callee ? In the caller, I guess it would replace the normal function call, call a trampoline which would jump to the normal code. In the callee, as what is currently done with -pg, the callee would have a call mcount at the beginning of the function. Or is it a different scheme I don't see ? I am trying to figure out how you happen to do all that without dynamic code modification and manage not to hurt performance. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
callee like gcc. caller would be likely more bloated because there are more calls than functions. Also if it was at the callee more code would be needed because the function currently The dynamic code modification is only needed because there is no global table of the mcount call sites. So instead it discovers them at runtime, but that requires runtime save patching With a custom call scheme one could just build up a table of call sites at link time using an ELF section and then when tracing is enabled/disabled always patch them all in one go in a stop_machine(). Then you wouldn't need parallel execution safe patching anymore and it doesn't matter what the nops look like. The other advantage is that it would allow getting rid of the frame pointer. -Andi --
I agree that the custom call scheme could let you know the mcount call site addresses at link time, so you could replace the call instructions with nops (at link time, so you actually don't know much about the exact hardware the kernel will be running on, which makes it harder to choose the best nop). To me, it seems that doing this at link time, as you propose, is the best approach, as it won't impact the system bootup time as much as the current ftrace scheme. However, I disagree with you on one point : if you use nops which are made of multiple instructions smaller than 5 bytes, enabling the tracer (patching all the sites in a stop_machine()) still present the risk of having a preempted thread with a return IP pointing directly in the middle of what will become a 5-bytes call instruction. When the thread will be scheduled again after the stop_machine, an illegal instruction fault (or any random effect) will occur. Therefore, building a table of mcount call sites in a ELF section, declaring _single_ 5-bytes nop instruction in the instruction stream that would fit for all target architectures in lieue of mcount call, so it can be later patched-in with the 5-bytes call at runtime seems like a good way to go. Mathieu P.S. : It would be good to have a look at the alternative.c lock prefix vs preemption race I identified a few weeks ago. Actually, this currently existing cpu hotplug bug is related to the preemption issue I just explained here. ref. http://lkml.org/lkml/2008/7/30/265, especially: "As a general rule, never try to combine smaller instructions into a bigger one, except in the case of adding a lock-prefix to an instruction : this case insures that the non-lock prefixed instruction is still valid after the change has been done. We could however run into a nasty non-synchronized atomic instruction use in SMP mode if a thread happens to be scheduled out right after the lock prefix. Hopefully the alternative code uses the refrigerator... (hrm, it ...
If a kernel thread is preempted in single-cpu mode right after the NOP (nop about to be turned into a lock prefix), then we CPU hotplug a CPU, and then the thread is scheduled back again, a SMP-unsafe atomic operation will be used on shared SMP variables, leading to corruption. No corruption would happen in the reverse case : going from SMP to UP is ok because we split a bit instruction into tiny pieces, which does not present this condition. Changing the 0x90 (single-byte nop) currently used into a 0x3E DS segment override prefix should fix this issue. Since the default of the atomic instructions is to use the DS segment anyway, it should not affect the behavior. ** BIG FAT WARNING (tm) ** : this patch assumes that the 0x3E prefix will leave atomic operations as-is (thus assuming they normally touch data in the DS segment). Is there an obnoxious corner-case I would have missed ? This source : AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions States "Instructions that Reference a Non-Stack Segment—If an instruction encoding references any base register other than rBP or rSP, or if an instruction contains an immediate offset, the default segment is the data segment (DS). These instructions can use the segment-override prefix to select one of the non-default segments, as shown in Table 1-5." Therefore, forcing the DS segment on the atomic operations, which already use the DS segment, should not change. This source : http://wiki.osdev.org/X86_Instruction_Encoding States "In 64-bit the CS, SS, DS and ES segment overrides are ignored." (since it's a wiki, it would have to be confirmed, but at least the worse that happens is that the DS override is ignored) This patch applies to 2.6.27-rc2, but would also have to be applied to earlier kernels (2.6.26, 2.6.25, ...). Performance impact of the fix : tests done on "xaddq" and "xaddl" shows it actually improves performances on Intel Xeon, AMD64, Pentium M. It does not change the ...
I believe this should be okay. In 32-bit mode some of the security and hypervisor frameworks want to set segment limits, but I don't believe they ever would set DS and SS inconsistently, or that we'd handle a #GP versus an #SS differently (segment violations on the stack segment are #SS, not #GP.) To be 100% sure we'd have to pick apart the modr/m byte to figure out what the base register is but I think that's total overkill. I have a vague notion that DS: prefixes came with a penalty on older CPUs, so we may want to do this only when CPU hotplug is enabled, to avoid penalizing older embedded systems. -hpa --
I guess some testing of this patch under an virtualized Linux would not hurt. Anyone have a setup ready ? The test case is simple : Run a kernel on a multi-CPU virtual guest. export NR_CPUS=... Reading the "Intel Architecture Optimizations Manual" for older Intels : http://developer.intel.com/design/pentium/manuals/242816.htm Chapter 3.7 Prefixed Opcodes The penality for instructions prefixed with other prefixes than 0x0f, 0x66 or 0x67 seems to be 1 added clock cycle to detect the prefix when it cannot be paired. Since we are choosing between the existing 0x90 nop followed by the atomic instruction and this prefix applied to the atomic instruction, we have to consider the penality cost of this nop. From the same manual, the NOP is decoded into 1 micro-op. Unless these architectures (386SX/DX, 486, Pentium Pro, Pentium MMX, Pentium II) can execute more than 1 micro-op per cycle, I doubt the DS prefix would cause any degradation compared to the 0x90 nop. And this would free the lower stages of the pipeline from executing this NOP micro-op. I guess some quick performance tests with the modules I provide on my website (URL in the patch header) could confirm or infirm this. Actually, I just removed the dust from an old Pentium II, here are the results. There is no performance overhead nor degradation. NR_TESTS 10000000 test empty cycles : 200833494 test test 1-byte nop xadd cycles : 340000130 test test DS override prefix xadd cycles : 340000126 * test test LOCK xadd cycles : 530000078 processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 5 model name : Pentium II (Deschutes) stepping : 2 cpu MHz : 350.867 cache size : 512 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 2 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr bogomips : ...
The kernel sets ds and ss to the same selector, so they're always going
to have the same underlying descriptor.
My only concern is whether there are any locked instructions which are
explicitly using a cs: override for those odd corners of the kernel. I
don't think so.
That said, I wonder how useful it is to do the SMP->UP code transition.
How often does a kernel go from being SMP to UP in a situation where we
really care about the performance? And that won't be shortly be
becoming SMP again anyway?
J
--
I agree, it only seems worthwhile to do this substitution when from the config+hardware it's determined at boot that nr_cpus_possible==1 (i.e. no CPU hotplug possibility). Thanks, Roland --
A virtualized guest kernel could use that to limit its use of the overall machine CPU resources in different time periods. Policies can determine how many physical CPU a virtual machine can be tied to, and that may change depending on e.g. the workload or time of day. Having the ability to efficiently switch to UP for a long period of time seems important in this use-case. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Not very convinced. Unplugging cpus is a pretty coarse way to control
resource use. There are lots of other mechanisms. And it's not like an
uncontended lock is all that expensive these days...
J
--
I can't argue about the benefit of using VM CPU pinning to manage
resources because I don't use it myself, but I ran some tests out of
curiosity to find if uncontended locks were that cheap, and it turns out
they aren't. Here are the results :
Xeon 2.0GHz
Summary
no lock prefix (s) with lock prefix (s) Speedup
make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
Detail :
1 CPU, replace smp lock prefixes with DS segment selector prefixes
make -j1 kernel/
real 0m34.067s
user 0m30.630s
sys 0m2.980s
real 0m33.867s
user 0m30.582s
sys 0m3.024s
real 0m33.939s
user 0m30.738s
sys 0m2.876s
real 0m33.913s
user 0m30.806s
sys 0m2.808s
avg : 33.94s
std. dev. : 0.07s
hackbench 50
Time: 2.978
Time: 2.982
Time: 3.010
Time: 2.984
Time: 2.982
avg : 2.99
std. dev. : 0.01
1 CPU, noreplace-smp
make -j1 kernel/
real 0m35.326s
user 0m30.630s
sys 0m3.260s
real 0m34.325s
user 0m30.802s
sys 0m3.084s
real 0m35.568s
user 0m30.722s
sys 0m3.168s
real 0m34.435s
user 0m30.886s
sys 0m2.996s
avg.: 34.91s
std. dev. : 0.27s
hackbench 50
Time: 3.733
Time: 3.750
Time: 3.761
Time: 3.737
Time: 3.741
avg : 3.74
std. dev. : 0.01
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Absolutely. Locked ops show up not just in microbenchmarks looping over the instruction, they show up in "real" benchmarks too. We added a single locked instruction (maybe it was two) to the page fault handling code some time ago, and the reason I noticed it was that it actually made the page fault cost visibly more expensive in lmbench. That was a _single_ instruction in the hot path (or maybe two). And the page fault path is some of the most timing critical in the whole kernel - if you have everything cached, the cost of doing the page faults to populate new processes for some fork/exec-heavy workload (and compiling the kernel is just one of those - any traditional unix behaviour will show this) is critical. This is one of the things AMD does a _lot_ better than Intel. Intel tends to have a 30-50 cycle cost (with later P4s being *much* worse), while AMD tends to have a cost of around 10-15 cycles. It's one of the things Intel promises to have improved in the next-gen uarch (Nehalem), an while I am not supposed to give out any benchmarks, I can confirm that Intel is getting much better at it. But it's going to be visible still, and it's really a _big_ issue on P4. (Of course, on P4, the page fault exception cost itself is so high that the cost of atomics may be _relatively_ less noticeable in that particular path) Linus --
For reference, could you also compare replace smp lock with NOPs? -hpa --
Sure, here are the updated tables. Basically, they show no significant
difference between the NOP and the DS segment selector prefix
approaches.
Xeon 2.0GHz
Summary
* replace smp lock prefixes with DS segment selector prefixes
no lock prefix (s) with lock prefix (s) Speedup
make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
* replace smp lock prefixes with 0x90 nops
no lock prefix (s) with lock prefix (s) Speedup
make -j1 kernel/ 34.16 +/- 0.32 34.91 +/- 0.27 2.2 %
hackbench 50 3.00 +/- 0.01 3.74 +/- 0.01 24.7 %
Detail :
1 CPU, replace smp lock prefixes with DS segment selector prefixes
make -j1 kernel/
real 0m34.067s
user 0m30.630s
sys 0m2.980s
real 0m33.867s
user 0m30.582s
sys 0m3.024s
real 0m33.939s
user 0m30.738s
sys 0m2.876s
real 0m33.913s
user 0m30.806s
sys 0m2.808s
avg : 33.94s
std. dev. : 0.07s
hackbench 50
Time: 2.978
Time: 2.982
Time: 3.010
Time: 2.984
Time: 2.982
avg : 2.99
std. dev. : 0.01
1 CPU, noreplace-smp
make -j1 kernel/
real 0m35.326s
user 0m30.630s
sys 0m3.260s
real 0m34.325s
user 0m30.802s
sys 0m3.084s
real 0m35.568s
user 0m30.722s
sys 0m3.168s
real 0m34.435s
user 0m30.886s
sys 0m2.996s
avg.: 34.91s
std. dev. : 0.27s
hackbench 50
Time: 3.733
Time: 3.750
Time: 3.761
Time: 3.737
Time: 3.741
avg : 3.74
std. dev. : 0.01
1 CPU, replace smp lock prefixes with 0x90 nops
make -j1 kernel/
real 0m34.139s
user 0m30.782s
sys 0m2.820s
real 0m34.010s
user 0m30.630s
sys 0m2.976s
real 0m34.777s
user 0m30.658s
sys 0m2.916s
real 0m33.924s
user 0m30.634s
sys 0m2.924s
real 0m33.962s
user 0m30.774s
sys 0m2.800s
real 0m34.141s
user 0m30.770s
sys 0m2.828s
avg : 34.16
std. dev. : 0.32
hackbench 50
Time: 2.999
Time: 2.994
Time: 3.004
Time: 2.991
Time: ...BTW, are you changing the initial prefix to DS too? Ie, are you doing a
nop->lock->ds transition, or ds->lock->ds?
J
--
Yeah, I thought about this case yesterday, good thing you ask.
include/asm-x86/alternative.h defines LOCK_PREFIX as :
#define LOCK_PREFIX \
".section .smp_locks,\"a\"\n" \
_ASM_ALIGN "\n" \
_ASM_PTR "661f\n" /* address */ \
".previous\n" \
"661:\n\tlock; "
So we have the locked instructions built into the kernel, not the nop'd
one. Therefore, the only transition I am doing for my benchmarks is :
lock->ds
but I tried to switch back to SMP and it worked fine.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Ah, OK. I'd thought we started unlocked, but given that I've just been
disassembling the kernel and looking at the lock prefixes, that's a bit
of a braino on my part.
BTW, using the ds prefix allows us to undo the hack of dealing with
locked instructions with exception handlers. There was a bug where if
we do lock->nop, then the address of a faulting instruction changes, so
we need exception records for both the locked and unlocked forms. Using
ds means the instruction size doesn't change, so we only need one
exception record. I don't remember off hand where that happens.
J
--
Using %ds: rather than nop really seems to solve a whole lot of problems, and might even be faster to boot. It really sounds like a no-brainer. -hpa --
So should I wait a bit for more comments or straightforwardly submit this as a patch rather than RFC ? Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Looks like all the relevant people have reviewed it now, so I don't
think there's much more to say.
J
--
I'm just worried about this comment from Harvey Harrison :
arch/x86/mm/fault.c : is_prefetch()
* Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes.
* In X86_64 long mode, the CPU will signal invalid
* opcode if some of these prefixes are present so
* X86_64 will never get here anyway
*/
This comment refers to :
0x26 : ES segment override prefix
0x2E : CS segment override prefix
0x36 : SS segment override prefix
0x3E : DS segment override prefix
AMD documentation seems to indicate that these prefix will be null, not
that the cpu would signal "invalid opcodes" :
"AMD 64-Bit Technology" A.7
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/x86-64_overvie...
"In 64-bit mode, the DS, ES, SS and CS segment-override prefixes have no effect.
These four prefixes are no longer treated as segment-override prefixes in the
context of multipleprefix rules. Instead, they are treated as null prefixes."
Intel does not seem to state anything particular about these prefixes
for the 64-bit mode.
So, is this comment misleading, or is it using the term "invalid opcode"
in a way that does not imply generating a fault ?
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
They do not signal faults, there just aren't any base addresses behind them. Some AMD chips allow limits to be set on these segments -- apparently added on behalf of some hypervisor makers; I suspect that VMX/SVM is making that quickly obsolete. So it should be just fine. Acked-by: H. Peter Anvin <hpa@zytor.com> -hpa --
I would say that comment is wrong. But we'd never put a lock prefix on
a prefetch, so you won't be added a ds prefix either.
J
--
Locking prefetches is definitely wrong :) I don't know if the comment is correct; I've certainly never heard of that myself. -hpa --
Actually, unless I have blown my T-test completely, they show with a 80% and 74% confidence (respective for the two benchmarks) that the DS case is slightly *better* (0.26% and 0.20% better, respective), which makes it a no-brainer. Doing around 10 runs of each is likely to confirm this conclusion by pushing it into the 90+% interval. Note that since the difference is so small, and so can also be due to some kind of systematic error (lower ambient temperature during the DS run making the disk drive slightly faster, what have you.) -hpa --
I doubt it, because I made a "cache priming" run before the tests, which made sure the data was all populated in memory. But yeah, having different cpu clock calibration values/cpu clock frequency between reboots could also cause that kind of difference. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I did more runs (20 runs of each) to compare the nop case to the DS prefix case. Results in seconds. They actually does not seems to show a significant difference. NOP 34.155 33.955 34.012 35.299 35.679 34.141 33.995 35.016 34.254 33.957 33.957 34.008 35.013 34.494 33.893 34.295 34.314 34.854 33.991 34.132 DS 34.080 34.304 34.374 35.095 34.291 34.135 33.940 34.208 35.276 34.288 33.861 33.898 34.610 34.709 33.851 34.256 35.161 34.283 33.865 35.078 Used http://www.graphpad.com/quickcalcs/ttest1.cfm?Format=C to do the T-test (yeah, I'm lazy) : Group Group One (DS prefix) Group Two (nops) Mean 34.37815 34.37070 SD 0.46108 0.51905 SEM 0.10310 0.11606 N 20 20 P value and statistical significance: The two-tailed P value equals 0.9620 By conventional criteria, this difference is considered to be not statistically significant. Confidence interval: The mean of Group One minus Group Two equals 0.00745 95% confidence interval of this difference: From -0.30682 to 0.32172 Intermediate values used in calculations: t = 0.0480 df = 38 standard error of difference = 0.155 So, unless these calculus are completely bogus, the difference between the nop and the DS case seems not to be statistically significant. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
OK, let me clarify my point a bit. If you've got a kernel which is
switching between UP and SMP on a regular basis, you're going to be
taking the hit for the locked instructions whenever you're in the SMP
state anyway. It's only going to be a workload where you're mostly UP
with occasional excursions into being SMP that patching out the lock
prefixes is actually going to make a difference.
And that just doesn't seem like a very likely use-case to me. Certainly
I don't think it would ever happen on a physical machine. And it
doesn't seem all that likely on a virtual machine either. Certainly
resources are more dynamic in a virtual environment, but I think there's
a fairly good chance that the domain knows from the outset whether it's
going to be UP or SMP, or does the UP->SMP transition once.
(That said, the XenServer product does precisely what I say is unusual:
the dom0 kernel hotplugs all the cpus so it can do ucode updates, etc,
Yeah, that's more severe than I would have expected. Perhaps I have AMD
numbers in my head.
J
--
He doesn't specify what kind of Xeon it is (there are three completely different CPUs under the Xeon brand name: P3 Xeon, P4 Xeon, and Core 2 Xeon), but I think it's a P4 Xeon. Netburst sucked in so many ways, and this is one of them. -hpa --
I've given up trying to remember what "E5405" might mean, but that
feature list suggests a relatively modern microarchitecture.
J
--
E5405 is the 45 nm Core 2 Quad Xeon. -hpa --
Here are the specs. It's a brand new 64bits dual quad-core (Christmas in August) :-) None of the specs seems to talk about neither Core2 or Pentium 4 though. Those appellations seems to hold for desktop processors, not server processors... processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz stepping : 6 cpu MHz : 2000.109 cache size : 6144 KB physical id : 0 siblings : 1 core id : 0 cpu cores : 1 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc up arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm bogomips : 4000.21 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management: Booting processor 7/7 ip 6000 Initializing CPU#7 Calibrating delay using timer specific routine.. 4000.26 BogoMIPS (lpj=8000538) CPU: L1 I cache: 32K, L1 D cache: 32K CPU: L2 cache: 6144K CPU: Physical Processor ID: 1 CPU: Processor Core ID: 3 CPU7: Intel(R) Xeon(R) CPU E5405 @ 2.00GHz stepping 06 checking TSC synchronization [CPU#0 -> CPU#7]: passed. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
Is it a Dell thingy? Turns out my main test box has exactly the same
processor.
J
--
Nope, it's a custom-made machine from a company in the Montreal area. It has a Supermicro X7DAL-E motherboard and, well, I got 16GB of ram in it. Pretty useful to take large traces. Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
FWIW, I think its a Harpertown: http://processorfinder.intel.com/details.aspx?sSpec=3DSLAP2 -Greg
[ Finally got my goodmis email back ] The new code does not discover the places at runtime. The old code did that. The "to kill a daemon" removed the runtime discovery and replaced it The current patch set, pretty much does exactly this. Yes, I patch at boot up all in one go, before the other CPUS are even active. This is the only advantage that you have. -- Steve --
Ok. But it's a serious one. It gives slightly more gain as your whole complicated patching exercise. Ok maybe it would be better to just properly fix gcc, but the problem is it takes forever for the user base to actually start using a new gcc :/ -Andi --
Yup, I agree. Actually, the tests I ran shows that using jumps as nops Yes, I am aware of these "high locality" effects. I use these tests as a starting point to find out which nops are good candidates, and then it can be later validated with more thorough testing on real workloads, which will suffer from higher standard deviation. Interestingly enough, the P6_NOPS seems to be a poor choice both at the macro and micro levels for the Intel Xeon (referring to As long as the whole kernel agrees on which instructions should be used for frequently used nops, the instruction trace cache should behave I assume the effect of I$ miss to be the same for all the tested scenarios (except on P4, and maybe except for the jump cases), given that in each case we load 5-bytes worth of instructions. Even considering this, the results I get show that the choices made in the Yep. I think it may make a difference if we use jumps, but I doubt it will change anything to the other various nops. Still, having that information would be good. Some more numbers follow for older architectures. Intel Pentium 3, 550MHz NR_TESTS 10000000 test empty cycles : 510000254 test 2-bytes jump cycles : 510000077 test 5-bytes jump cycles : 510000101 test 3/2 nops cycles : 500000072 test 5-bytes nop with long prefix cycles : 500000107 test 5-bytes P6 nop cycles : 500000069 (current choice ok) test Generic 1/4 5-bytes nops cycles : 514687590 test K7 1/4 5-bytes nops cycles : 530000012 Intel Pentium 3, 933MHz NR_TESTS 10000000 test empty cycles : 510000565 test 2-bytes jump cycles : 510000133 test 5-bytes jump cycles : 510000363 test 3/2 nops cycles : 500000358 test 5-bytes nop with long prefix cycles : 500000331 test 5-bytes P6 nop cycles : ...
Are we sure there aren't some good available 5-byte nops?
e.g. for 64-bit maybe 3e 48 8d 76 00
for 32-bit maybe 3e 8d 74 26 00
Have the chip folks confirmed that exactly the set we have in
asm-x86/nops.h are the only optimal ones?
Thanks,
Roland
--
They were based on the respective optimization manuals, so yes. -Andi --
Scary usage. How much is the difference? BTW a good way to get most of that back would be to not use frame pointers if you have them enabled ;-) Unfortunately recently the "select FRAME_POINTER"s keep creeping all over so far too many kernels suffer from this slow mode now :-/ -Andi --
Funny, CONFIG_FTRACE happens to select that. Now the question is, would
mcount work without it?
i386:
ENTRY(mcount)
pushl %eax
pushl %ecx
pushl %edx
movl 0xc(%esp), %eax
movl 0x4(%ebp), %edx <-- we record the parent ip here
subl $MCOUNT_INSN_SIZE, %eax
call *ftrace_trace_function
popl %edx
popl %ecx
popl %eax
ret
x86_64:
ENTRY(mcount)
subq $0x38, %rsp
movq %rax, (%rsp)
movq %rcx, 8(%rsp)
movq %rdx, 16(%rsp)
movq %rsi, 24(%rsp)
movq %rdi, 32(%rsp)
movq %r8, 40(%rsp)
movq %r9, 48(%rsp)
movq 0x38(%rsp), %rdi
movq 8(%rbp), %rsi <-- we record the parent pointer here
subq $MCOUNT_INSN_SIZE, %rdi
call *ftrace_trace_function
movq 48(%rsp), %r9
movq 40(%rsp), %r8
movq 32(%rsp), %rdi
movq 24(%rsp), %rsi
movq 16(%rsp), %rdx
movq 8(%rsp), %rcx
movq (%rsp), %rax
addq $0x38, %rsp
retq
-- Steve
--
Not without fixing gcc first. It would work if gcc always called mcount the first thing before setting up the stack frame. Not sure why it doesn't do that. Still do a benchmark of frame pointer vs no frame pointer kernel and you'll see, especially on a older CPUs without special hardware to avoid stack stalls (e.g. not Core2) BTW always forcing frame pointers also means that ftrace is far from near zero overhead even when disabled. That is unless you find a way to nop the frame pointers too, but that would be likely very difficult because the code will actually use it. -Andi --
I could take out the frame pointer option and pass in zero for the parent in such that case. But the parent is quite useful. But if it helps with performance, it may be worth doing this. I'll see what it does with just a kernel build. -- Steve --
Right now gcc errors out unfortunately. Would need to ask gcc developers to fix this first. But when they move mcount to be always first you can still get at the parent, it will always be at 4(%esp) or 8(%rsp). An alternative would be also run an assembler post processor that inserts mcount calls without compiler help. -Andi --
I checked out the latest git tree from Linus, and set up a basic config for it. I built it once. Then I rebooted into various versions of the kernel and timed the makes. This is what I did for each kernel. # cd /work/git/linux-compile.git # make distclean # cp config-use .config # make oldconfig # time make Here are the results. Just for curiosity sake. Using the two part nop: real 3m26.751s user 1m36.064s sys 0m39.450s Using the 5 byte nop with the four 0x66: real 3m31.543s user 1m36.197s sys 0m39.309s Using jmp 0 as a nop: real 3m30.025s user 1m35.948s sys 0m39.540s Then I realized that I had the irqsoff, and preemptoff tracers configured. These do add an overhead when configured in, even when they are disabled. I disabled them and ran the code with the two part nop again: Using two part nop without other tracers: real 3m26.851s user 1m36.425s sys 0m35.437s I then disabled ftrace completely: no ftrace times: real 3m25.807s user 1m36.992s sys 0m35.715s Then I disabled frame pointers as well: real 3m25.104s user 1m35.726s sys 0m34.797s The important numbers here is the sys time. The user and real should not be that much affected by these changes. Well, the real would, but not the user. I have an old toshiba that I want to run this on too. That might give more interesting results. But it's getting late, and I'll do that another time. Looking at the results, and yes I should run this more than once, but I just wanted an idea, this box did not show much difference between the two part nop, the jmp and the 5 byte nop. Running with or without ftrace enabled did not make a difference. Heck, the ftrace disabled was actually slower. But it does seem that Andi is correct about the frame pointers. It did slow the compile down by almost a second. Just some food for thought. -- Steve --
[ updated patch ]
I put that code in the scheduler and it does indeed preempt in the nops!
It also uncovered a nasty bug I had in my code:
Notice the bug? Hint, *ip is of type unsigned long *.
Here's the updated fix:
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
arch/x86/kernel/alternative.c | 29 ++++++++++++++++++++-------
arch/x86/kernel/asm-offsets_32.c | 1
arch/x86/kernel/asm-offsets_64.c | 1
arch/x86/kernel/entry_32.S | 4 +++
arch/x86/kernel/entry_64.S | 4 +++
arch/x86/kernel/ftrace.c | 41 +++++++++++++++++++++++++++++++++++++++
include/asm-x86/ftrace.h | 5 ++++
include/asm-x86/thread_info.h | 4 +++
kernel/trace/ftrace.c | 12 +++++++++++
9 files changed, 94 insertions(+), 7 deletions(-)
Index: linux-tip.git/arch/x86/kernel/alternative.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/alternative.c 2008-06-05 11:52:24.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/alternative.c 2008-08-08 16:20:23.000000000 -0400
@@ -140,13 +140,26 @@ static const unsigned char *const p6_nop
};
#endif
+/*
+ * Some versions of x86 CPUs have a two part NOP5. This
+ * can break ftrace if a process is preempted between
+ * the two. ftrace needs to know what the second nop
+ * is to handle this case.
+ */
+int x86_nop5_part2;
+
#ifdef CONFIG_X86_64
extern char __vsyscall_0;
const unsigned char *const *find_nop_table(void)
{
- return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
- boot_cpu_data.x86 < 6 ? k8_nops : p6_nops;
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+ boot_cpu_data.x86 < 6) {
+ x86_nop5_part2 = 2; /* K8_NOP2 */
+ return k8_nops;
+ } else
+ /* keep k86_nop5_part2 NULL */
+ return p6_nops;
}
#else /* CONFIG_X86_64 */
@@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab
static const struct nop {
int cpuid;
const unsigned char *const ...Hi Steven, I'm actually a bit worried about this scheduler-centric approach. The problem is that any trap that could be generated in the middle of the 3/2 nops (think of performance counters if the APIC is set as a trap gate) which would stack an interruptible trap handler on top of those instructions would lead to having a return IP pointing in the wrong spot, but because the scheduler would interrupt the trap handler and not the trapped code, it would not detect this. I am not sure wheter we do actually have the possibility to have these interruptible trap handlers considering the way the kernel sets up the gates, but I think the "let's figure out where the IP stopped" approach is a bit complex and fragile. Trying to find a fast atomic 5-bytes nop, involving the microarchitecture guys, seems like a better approach to me. -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
I agree that the fast atomic 5-byte nop is the optimal approach, but until we have that, we need to do this. I don't think this approach is too complex, I wrote the code in about an hour, and the patch isn't that big. albeit my one bug, which was a CS101 type bug (pointer arithmetic), there wasn't anything to me that was complex. The thing is, I want this to be enabled in a production kernel. If there is a 1 in a million chance that this theoretical bug with the trap happening in a middle of the double nop, it only affects those that enable the tracer. No one else. The code only happens when tracing is being enabled. If we need to put in a non optimal nop in, to replace the call to mcount, this affects everyone. Anyone that has CONFIG_FTRACE on, and never plans on running a trace. If it comes down to slowing everyone down, against hitting a 1 in a million theoretical bug, I'll take my chances on the bug. Again, that code is only performed when tracing is being enabled. That is, when it converts all 20 thousand nops into calls to a tracer function. If the crash happens, it will only happen after tracing has started. Then we can easily point the bug at the tracer. IOW, I do not want to slow down Linus' box. -- Steve --
Sure. If you have two nops in a row (and the kernel definition of the NOP array does _not_ guarantee that it's a single-instruction one), you may get a profile hit (ie any interrupt) on the second one. It's less _likely_, but it certainly is not architecturally in any way guaranteed Depends on the microarchitecture. But many will squash it in the decode stage, and generate no uops for them at all, so it's purely a decode throughput issue and has absolutely _zero_ impact for any later CPU See above. It needs to decode them, and the decoder itself may well have some limitations - for example, the longer nops may not even decode in all decoders, which is why some uarchs might prefer two short nops to one long one, but _generally_ a nop will not make it any further than the decoder. But separate nops do count as separate instructions, ie they will hit all the normal decode limits (mostly three or four instructions decoded per Yes. A CPU core _could_ certainly do special decoding for 'jmp 0' too, but I don't know any that do. The 'jmp' is much more likely to be special in the front-end and the decoder, and can easily cause things like the prefetcher to hickup (ie it tries to start prefetching at the "new" target address). So I would absolutely _expect_ a 'jmp' to be noticeably more expensive than one of the special nop's that can be squashed by the decoder. A nop that is squashed by the decoder will literally take absolutely no other resources. It doesn't even need to be tracked from an instruction completion standpoint (which _may_ end up meaning that a profiler would actually never see a hit on the second nop, but it's quite likely to depend on which decoder it hits etc). Linus --
The CPU can be interrupted between any two instructions, with a couple
of exceptions (it must, otherwise an unbounded run of nops could cause
an unbounded interrupt latency). If you don't want eip to be in the
middle of your instruction site, you need to make sure it's nopped out
with a single instruction.
Unfortunately, aside from P6_NOP5, none of the standard asm/nops.h nops
are suitable. Something like "0x66, 0x66, 0x66, 0x66, 0x90" would work
everywhere, but its possible putting more than 3 prefixes on the
instruction will kick the decoder into a slow path, which may have a
A no-op jmp could be easily converted to a nop early in the pipeline.
But I don't know whether anyone bothers to do that.
J
--
You can walk the task list and fix them up, if you have to. Of course, this could be extremely slow with a million threads. Maybe just walking the runqueues would be sufficient? Rusty. --
True, we could do the run queues too. But we are only looping once through the tasks and checking a single pointer on it, which would only be set if the task was preempted while in the kernel. As for being slow, this only happens when we enable the function tracer, which is slow anyway. It does not need to happen on disabling the tracer. Note, this is enabling the calls to the tracer function. The tracer itself can enable and disable quickly by turning on or off a variable. This is just the "setup" part. Not something that happens often. -- Steve --
I have a few concerns about this scheme:
One is that you assume that all text is in a .text section, and that the
offsets you compute on a per-section basis are going to be valid in the
final kernel image. At the very least you should make each offset
relative to its own function rather than inter-function.
But it seems pretty fragile overall. Things like -ffunction-sections
and section gc will invalidate the table you build up.
It seems to me that you can acheive the same effect by:
1. link vmlinux with "ld -q", which will leave the relocation
information intact in the final object
2. use "readelf -r" to extract the relocation info, find the
references to mcount and build a table
3. link that table into the kernel image
4. repeat to get a stable result
Note that steps 2-4 are identical to kallsyms, both in intent and
detail. The only difference is the precise table you build and the
command you use to extract the info from the kernel. From a quick peek
at the way Makefile implements kallsyms, it looks to me like it would be
fairly easy to generalize so that you can do the mcount reloc processing
in the same relink passes as kallsyms with minimal overhead on the
kernel build time.
It just seems incredibly fiddly to be doing all this per-object.
--
I assume neither ;-)
This works with .text.sched .text.init .text.cpu etc. I use the first
I don't see how. I'm referencing a function as my pointer. If this was
true, then I wouldn't be able to do
static void my_func(void) { [...] }
struct func_struct {
void (*func)(void);
} f = { my_func; };
My references end up exactly the same as the "f->func" reference does.
If my references are broken, so is this f->func. Which would be bad to
This doesn't seem any less complex than what I did. With this, I would
have to come up with another way to handle modules. This will make
I could work on this. But again, this will put more work into module.c
Actually, this isn't that much fiddling. What I did was simply make a list
of pointers to the call sites to mcount. The pointers used the first
function of each section to offset against. I relinked in the list making
it just like a list done within C. All relocations and magic after this
will be handled just fine.
-- Steve
--
The scheme you've implemented can apply fine to a .ko file after it's made. They are just .o's really. It is presumably faster to do one step per final .ko rather than many tiny ones (per source file). That might be a benefit to doing it all at the end for vmlinux too. I think the best way would be to have a vmlinux.o that we actually use in the link, rather than just analyzing and discarding. Then you can just do your existing hack on vmlinux.o before it's linked into vmlinux. Thanks, Roland --
OK, I am playing with this. And you are right, it does work with vmlinux.o file. It adds almost another minute to the build anytime something is changed. Where the per-object way may take longer doing a full build, but does not add that extra minute ever build. Which would people prefer? I little longer full build and do the modification on every object, or have a shorter full build, but every build will take that extra minute to do? Note, if you have DYNMAIC_FTRACE disabled, this is a nop, and will _not_ affect you. -- Steve --
Another advantage of doing it at the end on the vmlinux.o, is that you do not need to recompile the entire kernel when you change the config option. But you do need to recompile all modules. -- Steve --
I spoke to soon. I was only looking at the result of the vmlinux.o. But it seems that kallsyms skips using this file and relinks everything itself. Thus the __mcount_loc I added never made it into the vmlinux. The more I try to get this method to work, the more complex it becomes. And I still did not change the recordmcount.pl at all. Which means, by trying to do it against the vmlinux.o, I'm actually making it more complex than it already is. Right now, doing it to every object seems to be the safest approach. -- Steve --
I have a very strong preference to do it per obj rather than at the end, that way it can be distributed and make these fancy smp boxen useful. Bonus points if you can make distcc dig it. The vmlinux link path is the thing that is killing build performance, anything to shorten that is golden, anything increasing it needs _very_ good justification. --
Yes, the CC used in the script does indeed use distcc (if you are using it). ;-) --
I have patches to actually use vmlinux.o as part of the link process, but as I hit some subtle issues with um and sparc they are not yet ready. I plan to dust them off again and fix up the sparc and um issues but that is not this month. Sam --
Is this new framework needed for x86 specific reasons only? From what I gathered here, ftraced defers mcount patching simply because there's no way to update a 5-byte nop atomically. If so, why can't mcount site patching be left to arch specific ftrace code? For !SMP or archs which generate word sized mcount branch calls (e.g, ARM) is there really no way to patch mcount sites synchronously from inside ftrace_record_ip by disabling interrupts? Regards, Abhishek Sagar --
There's two topics in this thread. 1) the x86 issue of the 5 byte instruction. The problem with x86 is that on some CPUs the nop used consists of two nops to fill the 5 bytes. There is no way to change that atomically. The workarounds for this is the arch specific ftrace_pre_enable() that will make sure no process is about to execute the second part of that nop. 2) Getting rid of the daemon. The daemon is used to patch the code dynamically later on bootup. Now an arch may or may not be able to modify code in SMP, but I've been told that this is dangerous to do even on PPC. Dynamically modifying text that might be in the pipeline on another CPU may or may not be dangerous on all archs. The fix here is to convert the mcount calls to nops at boot up. This is really ideal on all archs. This means we know ever mcount call, and we get rid of the requirement that we need to run the code once before we can trace it. The kstop_machine is now only left at the start and stop of tracing. -- Steve --
Kprobes leaves this mess for individual archs to handle (see arch_arm_kprobe). What would be nice to have is an explanation as to which archs are safe from this and under what conditions. Also, for !SMP, this boot-time patching approach and the associated build time paraphernalia would simply be a bloat. There we can simply have a daemonless ftrace which patches mcount sites synchronously. Why not This solution indeed would fit all archs well but for some it may be an overkill (or is it?...I'd like to know that :-\ ). Oh and as you pointed out, it would mean that we have to no longer run the code before starting to trace it. But why don't we do that now? How about calling the tracer from ftrace_record_ip? All we need to do This a nice fix. I'm just looking to find out what the self modifying code problem here translates to on different archs for my own understanding :-) Thanks, Abhishek Sagar --
One of the things I tried to do was to make ftrace port easily to all archs. David Miller ported it to Sparc64 in 20 minutes and that was mostly compiling. Doing a kprobe fix would require much more knowledge to each It is not an issue at all. The mcount list is discarded with the init section, and the change to nops is relatively fast. We do make a list of all entries anyway, so that we can pick and choose which functions we want to enable tracing on. This recording at boot up should be fine on all archs, SMP or UP and will I want ftrace_record_ip to be as fast as possible, and also having it call the registered function means we need to test if it was set in the filter too. This is too much for what record_ip does. And as I said, doing it all Now, what I can do, is remove the kstop machine from UP. Hmm, I need to check if kstop_machine code itself is smart enough to simply do a local_irq_save; run function; local_irq_restore, if it was on UP and not SMP. -- Steve --
I agree with the point about porting. My kprobe reference was simply to point out that an existing module seems to be overwriting code on the fly and doesn't take care of much on non-x86 archs. Ftrace has a unique 5-byte call replacement issue Alrighty. Thanks, Abhishek Sagar --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/ |
