Re: [PATCH 0/5] ftrace: to kill a daemon

Previous thread: [PATCH] x86: audit syscalls based on type of syscall not type of binary by Michael Davidson on Thursday, August 7, 2008 - 11:14 am. (1 message)

Next thread: [PATCH 3/5] ftrace: enable mcount recording for modules by Steven Rostedt on Thursday, August 7, 2008 - 11:20 am. (3 messages)
From: Steven Rostedt
Date: Thursday, August 7, 2008 - 11:20 am

One of the things that bothered me about the latest ftrace code was
this annoying daemon that would wake up once a second to see if it
had work to do. If it did not, it would go to sleep, otherwise it would
do its work and then go to sleep.

You see, the reason for this is that for ftrace to maintain performance
when configured in but disabled, it would need to change all the
locations that called "mcount" (enabled with the gcc -pg option) into
nops.  The "-pg" option in gcc sets up a function profiler to call this
function called "mcount". If you simply have "mcount" return, it will
still add 15 to 18% overhead in performance. Changing all the calls to
nops moved the overhead into noise.

To get rid of this, I had the mcount code record the location that called
it. Later, the "ftraced" daemon would wake up and look to see if
any new functions were recorded. If so, it would call kstop_machine
and convert the calls to "nops". We needed kstop_machine because bad
things happen on SMP if you modify code that happens to be in the
instruction cache of another CPU.

This "ftraced" kernel thread would be a happy little worker, but it caused
some pains. One, is that it woke up once a second, and Ted Tso got mad
at me because it would show up on PowerTop. I could easily make the
default 5 seconds, and even have it runtime configurable, with a trivial
patch. I have not got around to doing that yet.

The other annoying thing, and this one bothers me the most, is that we
can not have this enabled on a production -rt kernel.  The latency caused
by the kstop_machine when doing work is lost in the noise on a non-rt
kernel, but it can be up to 800 microseconds, and that would kill
the -rt kernel.  The reason this bothered me the most, is that -rt is
where it came from, and ftraced was not treating its motherland very well.

Along came Gregory Haskins, who was bickering about having ftrace enabled
on a production -rt kernel. I told him the reasons that this would be bad
and then he started ...
From: Mathieu Desnoyers
Date: Thursday, August 7, 2008 - 11:47 am

Hi Steven,

If you really want to stop using stop_machine, I think you should have a
look at my immediate values infrastructure :

see:
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x8...

particularly replace_instruction_safe(), which uses a temporary
breakpoint to safely replace an instruction on a live SMP system without
using stop_machine. Feel free to try it in ftrace. :)

You may need to tweak the imv_notifier(), which is responsible for
executing the breakpoint. The only special thing this handler has to
know is the non-relocatable instructions you might want to insert (it
only cares about the new instructions, not the old ones being replaced).
The current code deals with 5-bytes jumps. Note that the instruction is
executed in the breakpoint handler with preemption disabled, which might
not be well suited for a call instruction.

I would recommend to patch in a 5-bytes jmp with 0 offset
(e9 00 00 00 00) when disabled (this makes a single 5-bytes instruction
and thus makes sure no instruction pointer can iret in the middle).

When enabling the site, you could patch-in the original call, but you
should tweak the imv_notifier() so that it uses the jmp offset 0 in the
bypass instead of the function call, because preemption would otherwise
be disabled around the call when executed in the breakpoint bypass.

Therefore, simply statically forcing the bypass code to e9 00 00 00 00
in all the cases (a nop) would do the job for your needs.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Thursday, August 7, 2008 - 1:42 pm

I originally used jumps instead of nops, but unfortunately, they actually 
hurt performance more than adding nops. Ingo told me it was probably due
to using up the jump predictions of the CPU.

-- Steve

--

From: Mathieu Desnoyers
Date: Friday, August 8, 2008 - 10:22 am

Hrm, are you sure you use a single 5-bytes nop instruction then, or do
you use a mix of various nop sizes (add_nops) on some architectures ?

You can consume the branch prediction buffers for conditional branches,
but I doubt static jumps have this impact ? I don't see what "jump
predictions" you are referring to here exactly.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 10:36 am

I use (for x86) what is in include/asm-x86/nops.h depending on what the

I don't know the details, but we definitely saw a drop in preformance 
between using nops and static jumps.

-- Steve

--

From: Mathieu Desnoyers
Date: Friday, August 8, 2008 - 10:46 am

That's bad :

#define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4

#define K8_NOP5 K8_NOP3 K8_NOP2

#define K7_NOP5 K7_NOP4 ASM_NOP1

So, when you try, later, to replace these instructions with a single
5-bytes instruction, a preempted thread could iret in the middle of your

Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00
instead of the 5-bytes add_nops ? On which architectures ?


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 11:13 am

I ran this on my Dell (intel Xeon), which IIRC did show the performance 
degration. I unfortunately don't have the time to redo those tests, but 
you are welcome to.

Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump.
In fact, the comments in that file still say it is a jmp. Remember, my 
first go was to use the jmp.

-- Steve

--

From: Peter Zijlstra
Date: Friday, August 8, 2008 - 11:15 am

5 single byte nops?

--

From: Mathieu Desnoyers
Date: Friday, August 8, 2008 - 11:21 am

kstop_machine does not guarantee that you won't have _any_ thread
preempted with IP pointing exactly in the middle of your instructions
_before_ the modification scheduled back in _after_ the modification and
thus causing an illegal instruction.


I'll try to find time to compare :

multi-instructions 5-bytes nops (although this approach is just buggy)
5-bytes jump to the next address
2-bytes jump to offset +3.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 11:41 am

Hmm, good point. Unless...

Can a processor be preempted in a middle of nops?  What do nops do for a 
processor? Can it skip them nicely in one shot?

This means I'll have to do the benchmarks again, and see what the 
performance difference of a jmp and a nop is significant. I'm thinking 
that if the processor can safely skip nops without any type of processing, 
this may be the reason that nops are better than a jmp. A jmp causes the 
processor to do a little more work.

I might even run a test to see if I can force a processor that uses the 
three-two nops to preempt between them.

I can add a test in x86 ftrace.c to check to see which nop was used, and 
use the jmp if the arch does not have a 5 byte nop.

I'm assuming that jmp is more expensive than the nops because otherwise
a jmp 0 would have been used as a 5 byte nop.

-- Steve
--

From: Mathieu Desnoyers
Date: Friday, August 8, 2008 - 12:05 pm

Given that those are multiple instructions, I think a processor has all
the rights to preempt in the middle of them. And even if some specific
architecture, for any obscure reason, happens to merge them, I don't

Yup, although one architecture not triggering this doesn't say much
about the various x86 flavors out there. In any case
- if you trigger the problem, we have to fix it.
- if you do not succeed to trigger the problem, we will have to test it
  on a wider architecture range and maybe end up fixit it anyway to play
  safe with the specs.


I would propose the following alternative :

Create new macros in include/asm-x86/nops.h :

/* short jump, offset 3 bytes : skips total of 5 bytes */
#define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n"

#if defined(CONFIG_MK7)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#elif defined(CONFIG_X86_P6_NOP)
#define ATOMIC_NOP5 P6_NOP5
#elif defined(CONFIG_X86_64)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#else
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#endif

And then optimize if necessary. You will probably find plenty of
knowledgeable people who will know better 5-bytes nop instruction more
efficient than this "generic" short jump offset 0x3.

Then you can use the (buggy) 3nops/2nops as a performance baseline and
see the performance hit on each architecture.

First get it right, then make it fast....


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 4:38 pm

[ patch included ]


I'm stubborn, I want to get it right _and_ keep it fast.

I still want the NOPS. Using jmps will hurt performance and that would 
keep this turned off on all distros.

But lets think outside the box here (and we will ignore Alan's cat).

Right now the issue is that we might preempt after the first nop, and when 
we enable the code, that task will crash when it tries to read the second 
nop.

Since we are doing the modifications from kstop_machine, all the tasks are 
stopped. We can simply look to see if the tasks have been preempted in 
kernel space and if so, is their instruction pointer pointing to the 
second nop. If it is, move the ip forward.

Here's a patch that does just that for both i386 and x86_64.

I added a field in the thread_info struct called "ip". This is a pointer 
to the location of the task ip in the stack if it was preempted in kernel 
space. Null otherwise:

         jz restore_all
 +       lea PT_EIP(%esp), %eax
 +       movl %eax, TI_ip(%ebp)
         call preempt_schedule_irq
 +       GET_THREAD_INFO(%ebp)
 +       movl $0, TI_ip(%ebp)
         jmp need_resched


Then, just before we enable tracing (we only need to do this when we 
enable tracing, since that is when we have a two instruction nop), we look 
at all the tasks. If the task->thread_info->ip is set, this means that it
was preempted just before going back to the kernel.

We look at the **ip and see if it compares with the second nop. If it 
does, we increment the ip by the size of that nop:

               if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
                       /* Match, move the ip forward */
                       *ip += x86_nop5_part2;



We do this just once before enabling all the locations, and we only do it 
if we have a two part nop.

Interesting enough, I wrote a module that did the following:

void (*silly_func)(void);

void do_something_silly(void)
{
}


static int my_thread(void *arg)
{
        int i;
        ...
From: Andi Kleen
Date: Friday, August 8, 2008 - 5:23 pm

For me it would seem better to just not use two part 5 byte nops
instead of adding such hacks.  I doubt there are that many of them
anyways. I bet you won't be able to measure any difference between the
different nop types in any macro benchmark.

-Andi
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 5:36 pm

I wish we had a true 5 byte nop. The alternative is a jmp 0, which is 
measurable.  This is replacing mcount from a kernel compile with the -pg 
option. With a basic build (not counting modules), I have over 15,000 
locations that are turned into these 5 byte nops.

# objdump -dr vmlinux.o | grep mcount |wc
  15152   45489  764924

If we use the jmp 0, then yes, we will see the overhead. The double nop 
that is used for 5 bytes, is significantly better than the jump.

-- Steve

--

From: Jeremy Fitzhardinge
Date: Friday, August 8, 2008 - 5:47 pm

0x66 0x66 0x66 0x66 0x90

    J
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 5:51 pm

I guess the next step is to test if it is slow.

-- Steve

--

From: Linus Torvalds
Date: Friday, August 8, 2008 - 5:51 pm

I don't think so. Multiple redundant prefixes can be really expensive on 
some uarchs.

A no-op that isn't cheap isn't a no-op at all, it's a slow-op.

			Linus
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 6:25 pm

A quick meaningless benchmark showed a slight perfomance hit.

Here's 10 runs of "hackbench 50" using the two part 5 byte nop:

run 1
Time: 4.501
run 2
Time: 4.855
run 3
Time: 4.198
run 4
Time: 4.587
run 5
Time: 5.016
run 6
Time: 4.757
run 7
Time: 4.477
run 8
Time: 4.693
run 9
Time: 4.710
run 10
Time: 4.715
avg = 4.6509


And 10 runs using the above 5 byte nop:

run 1
Time: 4.832
run 2
Time: 5.319
run 3
Time: 5.213
run 4
Time: 4.830
run 5
Time: 4.363
run 6
Time: 4.391
run 7
Time: 4.772
run 8
Time: 4.992
run 9
Time: 4.727
run 10
Time: 4.825
avg = 4.8264

# cat /proc/cpuinfo
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 65
model name	: Dual-Core AMD Opteron(tm) Processor 2220
stepping	: 3
cpu MHz		: 2799.992
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fdiv_bug	: no
hlt_bug		: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic 
cr8_legacy
bogomips	: 5599.98
clflush size	: 64
power management: ts fid vid ttp tm stc

There's 4 of these.

Just to make sure, I ran the above nop test again:

[ this is reverse from the above runs ]

run 1
Time: 4.723
run 2
Time: 5.080
run 3
Time: 4.521
run 4
Time: 4.841
run 5
Time: 4.696
run 6
Time: 4.946
run 7
Time: 4.754
run 8
Time: 4.717
run 9
Time: 4.905
run 10
Time: 4.814
avg = 4.7997

And again the two part nop:

run 1
Time: 4.434
run 2
Time: 4.496
run 3
Time: 4.801
run 4
Time: 4.714
run 5
Time: 4.631
run 6
Time: 5.178
run 7
Time: 4.728
run 8
Time: 4.920
run 9
Time: 4.898
run 10
Time: 4.770
avg = 4.757


This time it was close, but still seems to have some difference.

heh, perhaps it's just noise.

-- Steve

--

From: Mathieu Desnoyers
Date: Tuesday, August 12, 2008 - 11:31 pm

Hi Steven,

I tried to run my own tests to see if I could get to know if these
numbers are actually meaningful at all. My results seems to show that
there is not any significant difference between the various
configurations, and actually that the only one tendency I see is that
the 2-bytes jump offset 0x03 would be slightly faster than the 3/2 nop
on Intel Xeon. But we would have to run these a bit more often to
confirm that I guess.

I am just trying to get a sense of whether we are really trying hard to
optimize something worthless in practice, and to me it looks like it.
But it could be the architecture I am using that brings these results.

Mathieu

Intel Xeon dual quad-core
Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz

3/2 nop used :
K8_NOP3 K8_NOP2
#define K8_NOP2 ".byte 0x66,0x90\n"
#define K8_NOP3 ".byte 0x66,0x66,0x90\n"

** Summary **

Test A : make -j20 2.6.27-rc2 kernel (real time)
                                          Avg.      std.dev
Case 1 : ftrace not compiled-in.          1m9.76s   0.41s
Case 2 : 3/2 nops                         1m9.95s   0.36s
Case 3 : 2-bytes jump, offset 0x03        1m9.10s   0.40s
Case 4 : 5-bytes jump, offset 0x00        1m9.25s   0.34s

Test B : hackbench 15

Case 1 : ftrace not compiled-in.          0.349s    0.007s
Case 2 : 3/2 nops                         0.351s    0.014s
Case 3 : 2-bytes jump, offset 0x03        0.350s    0.007s
Case 4 : 5-bytes jump, offset 0x00        0.351s    0.010s



** Detail **

* Test A

benchmark : make -j20 2.6.27-rc2 kernel
make clean; make -j20; make clean done before the tests to prime caches.
Same .config used.


Case 1 : ftrace not compiled-in.

real	1m9.980s
user	7m27.664s
sys	0m48.771s

real	1m9.330s
user	7m27.244s
sys	0m50.567s

real	1m9.393s
user	7m27.408s
sys	0m50.511s

real	1m9.674s
user	7m28.088s
sys	0m50.327s

real	1m10.441s
user	7m27.736s
sys	0m49.687s

real time
average : 1m9.76s
std. dev. : 0.41s

after a reboot with the same kernel ...
From: Mathieu Desnoyers
Date: Wednesday, August 13, 2008 - 8:38 am

Small correction : my architecture uses the P6_NOP5, which is an atomic
5-bytes nop (just looked at the runtime output of find_nop_table()).

5: nopl 0x00(%eax,%eax,1)
#define P6_NOP5 ".byte 0x0f,0x1f,0x44,0x00,0\n"

But the results stands. Maybe I should try to force a test with a
K8_NOP3 K8_NOP2 nop.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Mathieu Desnoyers
Date: Wednesday, August 13, 2008 - 10:52 am

Hi Steven,

I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and
Intel Pentium 4 boxes to compare a baseline (function doing a bit of
memory read and arithmetic operations) to cases where nops are used.
Here are the results. The kernel module used for the benchmarks is
below, feel free to run it on your own architectures.

Xeon :

NR_TESTS                                    10000000
test empty cycles :                        165472020
test 2-bytes jump cycles :                 166666806
test 5-bytes jump cycles :                 166978164
test 3/2 nops cycles :                     169259406
test 5-bytes nop with long prefix cycles : 160000140
test 5-bytes P6 nop cycles :               163333458


AMD64 :

NR_TESTS                                    10000000
test empty cycles :                        145142367
test 2-bytes jump cycles :                 150000178
test 5-bytes jump cycles :                 150000171
test 3/2 nops cycles :                     159999994
test 5-bytes nop with long prefix cycles : 150000156
test 5-bytes P6 nop cycles :               150000148


Intel Pentium 4 :

NR_TESTS                                    10000000
test empty cycles :                        290001045
test 2-bytes jump cycles :                 310000568
test 5-bytes jump cycles :                 310000478
test 3/2 nops cycles :                     290000565
test 5-bytes nop with long prefix cycles : 311085510
test 5-bytes P6 nop cycles :               300000517
test Generic 1/4 5-bytes nops cycles :     310000553
test K7 1/4 5-bytes nops cycles :          300000533


These numbers show that both on Xeon and AMD64, the 

   .byte 0x66,0x66,0x66,0x66,0x90

(osp osp osp osp nop, which is not currently used in nops.h)

is the fastest nop on both architectures.

The currently used 3/2 nops looks like a _very_ bad choice for AMD64
cycle-wise.

The currently used 5-bytes P6 nop used on Xeon seems to be a bit slower
than the 0x66,0x66,0x66,0x66,0x90 nop ...
From: Linus Torvalds
Date: Wednesday, August 13, 2008 - 11:27 am

Note that the biggest problems of a jump-based nop are likely to happen 
when there are I$ misses and/or when there are other jumps involved. Ie a 
some microarchitectures tend to have issues with jumps to jumps, or when 
there are multiple control changes in the same (possibly partial) 
cacheline because the instruction stream prediction may be predecoded in 
the L1 I$, and multiple branches in the same cacheline - or in the same 
execution cycle - can pollute that kind of thing.

So microbenchmarking this way will probably make some things look 
unrealistically good. 

On the P4, the trace cache makes things even more interesting, since it's 
another level of I$ entirely, with very different behavior for the hit 
case vs the miss case.

And I$ misses for the kernel are actually fairly high. Not in 
microbenchmarks that tend to have very repetive behavior and a small I$ 
footprint, but in a lot of real-life loads the *bulk* of all action is in 
user space, and then the kernel side is often invoced with few loops (the 
kernel has very few loops indeed) and a cold I$.

So your numbers are interesting, but it would be really good to also get 
some info from Intel/AMD who may know about microarchitectural issues for 
the cases that don't show up in the hot-I$-cache environment.

			Linus
--

From: Andi Kleen
Date: Wednesday, August 13, 2008 - 11:41 am

Must be careful to miss the big picture here.

We have two assumptions here in this thread:

- Normal alternative() nops are relatively infrequent, typically
in points with enough pipeline bubbles anyways, and it likely doesn't
matter how they are encode. And also they don't have an issue
with mult part instructions anyways because they're not patched
at runtime, so always the best known can be used.

- The one case where nops are very frequent and matter and multipart
is a problem is with ftrace noping out the call to mcount at runtime 
because that happens on every function entry.
Even there the overhead is not that big, but at least measurable 
in kernel builds.

Now the numbers have shown that just by not using frame pointer (
-pg right now implies frame pointer) you can get more benefit 
than what you lose from using non optimal nops.

So for me the best strategy would be to get rid of the frame pointer
and ignore the nops. This unfortunately would require going away
from -pg and instead post process gcc output to insert "call mcount"
manually. But the nice advantage of that is that you could actually 
set up a custom table of callers built in a ELF section and with
that you don't actually need the runtime patching (which is only
done currently because there's no global table of mcount calls),
but could do everything in stop_machine(). Without
runtime patching you also don't need single part nops. 

I think that would be the best option. I especially like it because
it would prevent forcing frame pointer which seems to be costlier
than any kinds of nosp.

-Andi

--

From: Avi Kivity
Date: Wednesday, August 13, 2008 - 11:45 am

How would you deal with inlines?  Using debug information?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Andi Kleen
Date: Wednesday, August 13, 2008 - 11:51 am

> How would you deal with inlines?  Using debug information?

-pg already ignores inlines, so they aren't even traced today.

It pretty much has to, assume an inline gets spread out by
the global optimizer over the rest of the function, where would
the mcount calls be inserted?

-Andi

--

From: Avi Kivity
Date: Wednesday, August 13, 2008 - 11:56 am

Good point.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Mathieu Desnoyers
Date: Wednesday, August 13, 2008 - 12:30 pm

I agree that if frame pointer brings a too big overhead, it should not
be used.

Sorry to ask, I feel I must be missing something, but I'm trying to
figure out where you propose to add the "call mcount" ? In the caller or
in the callee ?

In the caller, I guess it would replace the normal function call, call a
trampoline which would jump to the normal code.

In the callee, as what is currently done with -pg, the callee would have
a call mcount at the beginning of the function.

Or is it a different scheme I don't see ? I am trying to figure out how
you happen to do all that without dynamic code modification and manage
not to hurt performance.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Andi Kleen
Date: Wednesday, August 13, 2008 - 12:37 pm

callee like gcc. caller would be likely more bloated because
there are more calls than functions. Also if it was at the 
callee more code would be needed because the function currently

The dynamic code modification is only needed because there is no
global table of the mcount call sites. So instead it discovers
them at runtime, but that requires runtime save patching

With a custom call scheme one could just build up a table of 
call sites at link time using an ELF section and then when
tracing is enabled/disabled always patch them all in one go
in a stop_machine(). Then you wouldn't need parallel execution safe
patching anymore and it doesn't matter what the nops look like.

The other advantage is that it would allow getting rid of
the frame pointer.

-Andi

--

From: Mathieu Desnoyers
Date: Wednesday, August 13, 2008 - 1:01 pm

I agree that the custom call scheme could let you know the mcount call
site addresses at link time, so you could replace the call instructions
with nops (at link time, so you actually don't know much about the
exact hardware the kernel will be running on, which makes it harder to
choose the best nop). To me, it seems that doing this at link time,
as you propose, is the best approach, as it won't impact the system
bootup time as much as the current ftrace scheme.

However, I disagree with you on one point : if you use nops which are
made of multiple instructions smaller than 5 bytes, enabling the tracer
(patching all the sites in a stop_machine()) still present the risk of
having a preempted thread with a return IP pointing directly in the
middle of what will become a 5-bytes call instruction. When the thread
will be scheduled again after the stop_machine, an illegal instruction
fault (or any random effect) will occur.

Therefore, building a table of mcount call sites in a ELF section,
declaring _single_ 5-bytes nop instruction in the instruction stream
that would fit for all target architectures in lieue of mcount call, so
it can be later patched-in with the 5-bytes call at runtime seems like a
good way to go.

Mathieu

P.S. : It would be good to have a look at the alternative.c lock prefix
vs preemption race I identified a few weeks ago. Actually, this
currently existing cpu hotplug bug is related to the preemption issue I
just explained here. ref. http://lkml.org/lkml/2008/7/30/265,
especially:

"As a general rule, never try to combine smaller instructions into a
bigger one, except in the case of adding a lock-prefix to an instruction :
this case insures that the non-lock prefixed instruction is still
valid after the change has been done. We could however run into a nasty
non-synchronized atomic instruction use in SMP mode if a thread happens
to be scheduled out right after the lock prefix. Hopefully the
alternative code uses the refrigerator... (hrm, it ...
From: Mathieu Desnoyers
Date: Wednesday, August 13, 2008 - 4:41 pm

If a kernel thread is preempted in single-cpu mode right after the NOP (nop
about to be turned into a lock prefix), then we CPU hotplug a CPU, and then the
thread is scheduled back again, a SMP-unsafe atomic operation will be used on
shared SMP variables, leading to corruption. No corruption would happen in the
reverse case : going from SMP to UP is ok because we split a bit instruction
into tiny pieces, which does not present this condition.

Changing the 0x90 (single-byte nop) currently used into a 0x3E DS segment
override prefix should fix this issue. Since the default of the atomic
instructions is to use the DS segment anyway, it should not affect the
behavior.

** BIG FAT WARNING (tm) ** : this patch assumes that the 0x3E prefix will
leave atomic operations as-is (thus assuming they normally touch data in the DS
segment). Is there an obnoxious corner-case I would have missed ?

This source :
AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System
Instructions
States
"Instructions that Reference a Non-Stack Segment—If an instruction encoding
references any base register other than rBP or rSP, or if an instruction
contains an immediate offset, the default segment is the data segment (DS).
These instructions can use the segment-override prefix to select one of the
non-default segments, as shown in Table 1-5."

Therefore, forcing the DS segment on the atomic operations, which already use
the DS segment, should not change.

This source :
http://wiki.osdev.org/X86_Instruction_Encoding
States
"In 64-bit the CS, SS, DS and ES segment overrides are ignored."
(since it's a wiki, it would have to be confirmed, but at least the worse that
happens is that the DS override is ignored)

This patch applies to 2.6.27-rc2, but would also have to be applied to earlier
kernels (2.6.26, 2.6.25, ...).

Performance impact of the fix : tests done on "xaddq" and "xaddl" shows it
actually improves performances on Intel Xeon, AMD64, Pentium M. It does not
change the ...
From: H. Peter Anvin
Date: Wednesday, August 13, 2008 - 5:01 pm

I believe this should be okay.  In 32-bit mode some of the security and 
hypervisor frameworks want to set segment limits, but I don't believe 
they ever would set DS and SS inconsistently, or that we'd handle a #GP 
versus an #SS differently (segment violations on the stack segment are 
#SS, not #GP.)  To be 100% sure we'd have to pick apart the modr/m byte 
to figure out what the base register is but I think that's total overkill.

I have a vague notion that DS: prefixes came with a penalty on older 
CPUs, so we may want to do this only when CPU hotplug is enabled, to 
avoid penalizing older embedded systems.

	-hpa
--

From: Mathieu Desnoyers
Date: Wednesday, August 13, 2008 - 6:13 pm

I guess some testing of this patch under an virtualized Linux would not
hurt. Anyone have a setup ready ? The test case is simple : Run a kernel
on a multi-CPU virtual guest.

export NR_CPUS=...

Reading the "Intel Architecture Optimizations Manual" for older Intels :

http://developer.intel.com/design/pentium/manuals/242816.htm
Chapter 3.7 Prefixed Opcodes

The penality for instructions prefixed with other prefixes than 0x0f,
0x66 or 0x67 seems to be 1 added clock cycle to detect the prefix when
it cannot be paired.

Since we are choosing between the existing 0x90 nop followed by the
atomic instruction and this prefix applied to the atomic instruction,
we have to consider the penality cost of this nop. From the same manual,
the NOP is decoded into 1 micro-op.

Unless these architectures (386SX/DX, 486, Pentium Pro, Pentium MMX,
Pentium II) can execute more than 1 micro-op per cycle, I doubt the DS
prefix would cause any degradation compared to the 0x90 nop. And this
would free the lower stages of the pipeline from executing this NOP
micro-op.

I guess some quick performance tests with the modules I provide on my
website (URL in the patch header) could confirm or infirm this.
Actually, I just removed the dust from an old Pentium II, here are the
results. There is no performance overhead nor degradation.

NR_TESTS                                    10000000
test empty cycles :                        200833494
test test 1-byte nop xadd cycles :         340000130
test test DS override prefix xadd cycles : 340000126 *
test test LOCK xadd cycles :               530000078

processor : 0
vendor_id : GenuineIntel
cpu family  : 6
model   : 5
model name  : Pentium II (Deschutes)
stepping  : 2
cpu MHz   : 350.867
cache size  : 512 KB
fdiv_bug  : no
hlt_bug   : no
f00f_bug  : no
coma_bug  : no
fpu   : yes
fpu_exception : yes
cpuid level : 2
wp    : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr
bogomips  : ...
From: Jeremy Fitzhardinge
Date: Wednesday, August 13, 2008 - 6:22 pm

The kernel sets ds and ss to the same selector, so they're always going 
to have the same underlying descriptor.

My only concern is whether there are any locked instructions which are 
explicitly using a cs: override for those odd corners of the kernel.  I 
don't think so.

That said, I wonder how useful it is to do the SMP->UP code transition.  
How often does a kernel go from being SMP to UP in a situation where we 
really care about the performance?  And that won't be shortly be 
becoming SMP again anyway?

    J
--

From: Roland McGrath
Date: Wednesday, August 13, 2008 - 6:26 pm

I agree, it only seems worthwhile to do this substitution when from the
config+hardware it's determined at boot that nr_cpus_possible==1 (i.e. no
CPU hotplug possibility).


Thanks,
Roland
--

From: Mathieu Desnoyers
Date: Wednesday, August 13, 2008 - 6:49 pm

A virtualized guest kernel could use that to limit its use of the
overall machine CPU resources in different time periods. Policies can
determine how many physical CPU a virtual machine can be tied to, and
that may change depending on e.g. the workload or time of day. Having
the ability to efficiently switch to UP for a long period of time seems
important in this use-case.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Jeremy Fitzhardinge
Date: Wednesday, August 13, 2008 - 8:35 pm

Not very convinced.  Unplugging cpus is a pretty coarse way to control 
resource use.  There are lots of other mechanisms.  And it's not like an 
uncontended lock is all that expensive these days...

    J

--

From: Mathieu Desnoyers
Date: Thursday, August 14, 2008 - 8:18 am

I can't argue about the benefit of using VM CPU pinning to manage
resources because I don't use it myself, but I ran some tests out of
curiosity to find if uncontended locks were that cheap, and it turns out
they aren't. Here are the results :

Xeon 2.0GHz


Summary

                  no lock prefix (s)   with lock prefix (s)    Speedup
make -j1 kernel/      33.94 +/- 0.07         34.91 +/- 0.27      2.8 %
hackbench 50           2.99 +/- 0.01          3.74 +/- 0.01     25.1 %

Detail :


1 CPU, replace smp lock prefixes with DS segment selector prefixes

make -j1 kernel/

real	0m34.067s
user	0m30.630s
sys	0m2.980s

real	0m33.867s
user	0m30.582s
sys	0m3.024s

real	0m33.939s
user	0m30.738s
sys	0m2.876s

real	0m33.913s
user	0m30.806s
sys	0m2.808s

avg : 33.94s
std. dev. : 0.07s

hackbench 50

Time: 2.978
Time: 2.982
Time: 3.010
Time: 2.984
Time: 2.982

avg : 2.99
std. dev. : 0.01



1 CPU, noreplace-smp

make -j1 kernel/

real	0m35.326s
user	0m30.630s
sys	0m3.260s

real	0m34.325s
user	0m30.802s
sys	0m3.084s

real	0m35.568s
user	0m30.722s
sys	0m3.168s

real	0m34.435s
user	0m30.886s
sys	0m2.996s

avg.: 34.91s
std. dev. : 0.27s

hackbench 50

Time: 3.733
Time: 3.750
Time: 3.761
Time: 3.737
Time: 3.741

avg : 3.74
std. dev. : 0.01

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Linus Torvalds
Date: Thursday, August 14, 2008 - 9:10 am

Absolutely. 

Locked ops show up not just in microbenchmarks looping over the 
instruction, they show up in "real" benchmarks too. We added a single 
locked instruction (maybe it was two) to the page fault handling code some 
time ago, and the reason I noticed it was that it actually made the page 
fault cost visibly more expensive in lmbench. That was a _single_ 
instruction in the hot path (or maybe two).

And the page fault path is some of the most timing critical in the whole 
kernel - if you have everything cached, the cost of doing the page faults 
to populate new processes for some fork/exec-heavy workload (and compiling 
the kernel is just one of those - any traditional unix behaviour will show 
this) is critical.

This is one of the things AMD does a _lot_ better than Intel. Intel tends 
to have a 30-50 cycle cost (with later P4s being *much* worse), while AMD 
tends to have a cost of around 10-15 cycles.

It's one of the things Intel promises to have improved in the next-gen 
uarch (Nehalem), an while I am not supposed to give out any benchmarks, I 
can confirm that Intel is getting much better at it. But it's going to be 
visible still, and it's really a _big_ issue on P4.

(Of course, on P4, the page fault exception cost itself is so high that 
the cost of atomics may be _relatively_ less noticeable in that particular 
path)

		Linus
--

From: H. Peter Anvin
Date: Thursday, August 14, 2008 - 9:13 am

For reference, could you also compare replace smp lock with NOPs?

	-hpa
--

From: Mathieu Desnoyers
Date: Thursday, August 14, 2008 - 9:58 am

Sure, here are the updated tables. Basically, they show no significant
difference between the NOP and the DS segment selector prefix
approaches.


Xeon 2.0GHz


Summary


* replace smp lock prefixes with DS segment selector prefixes
                  no lock prefix (s)   with lock prefix (s)    Speedup
make -j1 kernel/      33.94 +/- 0.07         34.91 +/- 0.27      2.8 %
hackbench 50           2.99 +/- 0.01          3.74 +/- 0.01     25.1 %


* replace smp lock prefixes with 0x90 nops
                  no lock prefix (s)   with lock prefix (s)    Speedup
make -j1 kernel/      34.16 +/- 0.32         34.91 +/- 0.27      2.2 %
hackbench 50           3.00 +/- 0.01          3.74 +/- 0.01     24.7 %



Detail :


1 CPU, replace smp lock prefixes with DS segment selector prefixes

make -j1 kernel/

real	0m34.067s
user	0m30.630s
sys	0m2.980s

real	0m33.867s
user	0m30.582s
sys	0m3.024s

real	0m33.939s
user	0m30.738s
sys	0m2.876s

real	0m33.913s
user	0m30.806s
sys	0m2.808s

avg : 33.94s
std. dev. : 0.07s

hackbench 50

Time: 2.978
Time: 2.982
Time: 3.010
Time: 2.984
Time: 2.982

avg : 2.99
std. dev. : 0.01



1 CPU, noreplace-smp

make -j1 kernel/

real	0m35.326s
user	0m30.630s
sys	0m3.260s

real	0m34.325s
user	0m30.802s
sys	0m3.084s

real	0m35.568s
user	0m30.722s
sys	0m3.168s

real	0m34.435s
user	0m30.886s
sys	0m2.996s

avg.: 34.91s
std. dev. : 0.27s

hackbench 50

Time: 3.733
Time: 3.750
Time: 3.761
Time: 3.737
Time: 3.741

avg : 3.74
std. dev. : 0.01



1 CPU, replace smp lock prefixes with 0x90 nops

make -j1 kernel/

real	0m34.139s
user	0m30.782s
sys	0m2.820s

real	0m34.010s
user	0m30.630s
sys	0m2.976s

real	0m34.777s
user	0m30.658s
sys	0m2.916s

real	0m33.924s
user	0m30.634s
sys	0m2.924s

real	0m33.962s
user	0m30.774s
sys	0m2.800s

real	0m34.141s
user	0m30.770s
sys	0m2.828s


avg : 34.16
std. dev. : 0.32


hackbench 50

Time: 2.999
Time: 2.994
Time: 3.004
Time: 2.991
Time: ...
From: Jeremy Fitzhardinge
Date: Thursday, August 14, 2008 - 10:05 am

BTW, are you changing the initial prefix to DS too?  Ie, are you doing a 
nop->lock->ds transition, or ds->lock->ds?

    J
--

From: Mathieu Desnoyers
Date: Thursday, August 14, 2008 - 10:30 am

Yeah, I thought about this case yesterday, good thing you ask.

include/asm-x86/alternative.h defines LOCK_PREFIX as :

#define LOCK_PREFIX \
                ".section .smp_locks,\"a\"\n"   \
                _ASM_ALIGN "\n"                 \
                _ASM_PTR "661f\n" /* address */ \
                ".previous\n"                   \
                "661:\n\tlock; "

So we have the locked instructions built into the kernel, not the nop'd
one. Therefore, the only transition I am doing for my benchmarks is :

lock->ds

but I tried to switch back to SMP and it worked fine.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Jeremy Fitzhardinge
Date: Thursday, August 14, 2008 - 10:43 am

Ah, OK.  I'd thought we started unlocked, but given that I've just been 
disassembling the kernel and looking at the lock prefixes, that's a bit 
of a braino on my part.

BTW, using the ds prefix allows us to undo the hack of dealing with 
locked instructions with exception handlers.  There was a bug where if 
we do lock->nop, then the address of a faulting instruction changes, so 
we need exception records for both the locked and unlocked forms.  Using 
ds means the instruction size doesn't change, so we only need one 
exception record.  I don't remember off hand where that happens.

    J
--

From: H. Peter Anvin
Date: Thursday, August 14, 2008 - 11:37 am

Using %ds: rather than nop really seems to solve a whole lot of 
problems, and might even be faster to boot.  It really sounds like a 
no-brainer.

	-hpa
--

From: Mathieu Desnoyers
Date: Thursday, August 14, 2008 - 11:53 am

So should I wait a bit for more comments or straightforwardly submit
this as a patch rather than RFC ?

Mathieu



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Jeremy Fitzhardinge
Date: Thursday, August 14, 2008 - 12:29 pm

Looks like all the relevant people have reviewed it now, so I don't 
think there's much more to say.

    J
--

From: Mathieu Desnoyers
Date: Thursday, August 14, 2008 - 1:31 pm

I'm just worried about this comment from Harvey Harrison :

arch/x86/mm/fault.c : is_prefetch()

                         * Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes.
                         * In X86_64 long mode, the CPU will signal invalid
                         * opcode if some of these prefixes are present so
                         * X86_64 will never get here anyway
                         */

This comment refers to :

0x26 : ES segment override prefix
0x2E : CS segment override prefix
0x36 : SS segment override prefix
0x3E : DS segment override prefix

AMD documentation seems to indicate that these prefix will be null, not
that the cpu would signal "invalid opcodes" :

"AMD 64-Bit Technology" A.7
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/x86-64_overvie...

"In 64-bit mode, the DS, ES, SS and CS segment-override prefixes have no effect.
These four prefixes are no longer treated as segment-override prefixes in the
context of multipleprefix rules. Instead, they are treated as null prefixes."

Intel does not seem to state anything particular about these prefixes
for the 64-bit mode.

So, is this comment misleading, or is it using the term "invalid opcode"
in a way that does not imply generating a fault ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: H. Peter Anvin
Date: Thursday, August 14, 2008 - 1:39 pm

They do not signal faults, there just aren't any base addresses behind 
them.  Some AMD chips allow limits to be set on these segments -- 
apparently added on behalf of some hypervisor makers; I suspect that 
VMX/SVM is making that quickly obsolete.

So it should be just fine.

Acked-by: H. Peter Anvin <hpa@zytor.com>

	-hpa

--

From: Jeremy Fitzhardinge
Date: Thursday, August 14, 2008 - 2:46 pm

I would say that comment is wrong.  But we'd never put a lock prefix on 
a prefetch, so you won't be added a ds prefix either.

    J
--

From: H. Peter Anvin
Date: Thursday, August 14, 2008 - 3:26 pm

Locking prefetches is definitely wrong :)

I don't know if the comment is correct; I've certainly never heard of 
that myself.

	-hpa
--

From: H. Peter Anvin
Date: Thursday, August 14, 2008 - 10:17 am

Actually, unless I have blown my T-test completely, they show with a 80% 
and 74% confidence (respective for the two benchmarks) that the DS case 
is slightly *better* (0.26% and 0.20% better, respective), which makes 
it a no-brainer.  Doing around 10 runs of each is likely to confirm this 
conclusion by pushing it into the 90+% interval.

Note that since the difference is so small, and so can also be due to 
some kind of systematic error (lower ambient temperature during the DS 
run making the disk drive slightly faster, what have you.)

	-hpa
--

From: Mathieu Desnoyers
Date: Thursday, August 14, 2008 - 11:09 am

I doubt it, because I made a "cache priming" run before the tests, which
made sure the data was all populated in memory. But yeah, having
different cpu clock calibration values/cpu clock frequency between
reboots could also cause that kind of difference.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Mathieu Desnoyers
Date: Thursday, August 14, 2008 - 12:49 pm

I did more runs (20 runs of each) to compare the nop case to the DS
prefix case. Results in seconds. They actually does not seems to show a
significant difference.

NOP

34.155
33.955
34.012
35.299
35.679
34.141
33.995
35.016
34.254
33.957
33.957
34.008
35.013
34.494
33.893
34.295
34.314
34.854
33.991
34.132

DS

34.080
34.304
34.374
35.095
34.291
34.135
33.940
34.208
35.276
34.288
33.861
33.898
34.610
34.709
33.851
34.256
35.161
34.283
33.865
35.078

Used http://www.graphpad.com/quickcalcs/ttest1.cfm?Format=C to do the
T-test (yeah, I'm lazy) :

 Group      Group One (DS prefix)       Group Two (nops)
 Mean                    34.37815               34.37070
 SD                       0.46108                0.51905
 SEM                      0.10310                0.11606
 N                             20                     20       

P value and statistical significance:
  The two-tailed P value equals 0.9620
  By conventional criteria, this difference is considered to be not
  statistically significant.

Confidence interval:
  The mean of Group One minus Group Two equals 0.00745
  95% confidence interval of this difference: From -0.30682 to 0.32172 

Intermediate values used in calculations:
  t = 0.0480
  df = 38
  standard error of difference = 0.155 

So, unless these calculus are completely bogus, the difference between
the nop and the DS case seems not to be statistically significant.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Jeremy Fitzhardinge
Date: Thursday, August 14, 2008 - 10:04 am

OK, let me clarify my point a bit.  If you've got a kernel which is 
switching between UP and SMP on a regular basis, you're going to be 
taking the hit for the locked instructions whenever you're in the SMP 
state anyway.  It's only going to be a workload where you're mostly UP 
with occasional excursions into being SMP that patching out the lock 
prefixes is actually going to make a difference. 

And that just doesn't seem like a very likely use-case to me.  Certainly 
I don't think it would ever happen on a physical machine.  And it 
doesn't seem all that likely on a virtual machine either. Certainly 
resources are more dynamic in a virtual environment, but I think there's 
a fairly good chance that the domain knows from the outset whether it's 
going to be UP or SMP, or does the UP->SMP transition once.

(That said, the XenServer product does precisely what I say is unusual: 
the dom0 kernel hotplugs all the cpus so it can do ucode updates, etc, 

Yeah, that's more severe than I would have expected.  Perhaps I have AMD 
numbers in my head.

    J
--

From: H. Peter Anvin
Date: Thursday, August 14, 2008 - 10:18 am

He doesn't specify what kind of Xeon it is (there are three completely 
different CPUs under the Xeon brand name: P3 Xeon, P4 Xeon, and Core 2 
Xeon), but I think it's a P4 Xeon.

Netburst sucked in so many ways, and this is one of them.

	-hpa


--

From: Jeremy Fitzhardinge
Date: Thursday, August 14, 2008 - 10:28 am

I've given up trying to remember what "E5405" might mean, but that 
feature list suggests a relatively modern microarchitecture.


    J
--

From: H. Peter Anvin
Date: Thursday, August 14, 2008 - 10:31 am

E5405 is the 45 nm Core 2 Quad Xeon.

	-hpa
--

From: Mathieu Desnoyers
Date: Thursday, August 14, 2008 - 10:46 am

Here are the specs. It's a brand new 64bits dual quad-core (Christmas in
August) :-) None of the specs seems to talk about neither Core2 or
Pentium 4 though. Those appellations seems to hold for desktop
processors, not server processors...

processor : 0
vendor_id : GenuineIntel
cpu family  : 6
model   : 23
model name  : Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
stepping  : 6
cpu MHz   : 2000.109
cache size  : 6144 KB
physical id : 0
siblings  : 1
core id   : 0
cpu cores : 1
apicid    : 0
initial apicid  : 0
fpu   : yes
fpu_exception : yes
cpuid level : 10
wp    : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc up arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx
tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips  : 4000.21
clflush size  : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:


Booting processor 7/7 ip 6000
Initializing CPU#7
Calibrating delay using timer specific routine.. 4000.26 BogoMIPS (lpj=8000538)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 6144K
CPU: Physical Processor ID: 1
CPU: Processor Core ID: 3
CPU7: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz stepping 06
checking TSC synchronization [CPU#0 -> CPU#7]: passed.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Jeremy Fitzhardinge
Date: Thursday, August 14, 2008 - 10:49 am

Is it a Dell thingy?  Turns out my main test box has exactly the same 
processor.

    J
--

From: Mathieu Desnoyers
Date: Thursday, August 14, 2008 - 10:55 am

Nope, it's a custom-made machine from a company in the Montreal area. It
has a Supermicro X7DAL-E motherboard and, well, I got 16GB of ram in it.
Pretty useful to take large traces.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Friday, August 15, 2008 - 2:34 pm

[ Finally got my goodmis email back ]


The new code does not discover the places at runtime. The old code did 
that. The "to kill a daemon" removed the runtime discovery and replaced it 

The current patch set, pretty much does exactly this. Yes, I patch
at boot up all in one go, before the other CPUS are even active.

This is the only advantage that you have.

-- Steve

--

From: Andi Kleen
Date: Friday, August 15, 2008 - 2:51 pm

Ok. But it's a serious one.  It gives slightly more gain as your whole 
complicated patching exercise.

Ok maybe it would be better to just properly fix gcc, but the problem
is it takes forever for the user base to actually start using a new
gcc :/

-Andi
--

From: Mathieu Desnoyers
Date: Wednesday, August 13, 2008 - 12:16 pm

Yup, I agree. Actually, the tests I ran shows that using jumps as nops

Yes, I am aware of these "high locality" effects. I use these tests as a
starting point to find out which nops are good candidates, and then it
can be later validated with more thorough testing on real workloads,
which will suffer from higher standard deviation.

Interestingly enough, the P6_NOPS seems to be a poor choice both at the
macro and micro levels for the Intel Xeon (referring to

As long as the whole kernel agrees on which instructions should be used
for frequently used nops, the instruction trace cache should behave

I assume the effect of I$ miss to be the same for all the tested
scenarios (except on P4, and maybe except for the jump cases), given
that in each case we load 5-bytes worth of instructions. Even
considering this, the results I get show that the choices made in the

Yep. I think it may make a difference if we use jumps, but I doubt it
will change anything to the other various nops. Still, having that
information would be good.

Some more numbers follow for older architectures.

Intel Pentium 3, 550MHz

NR_TESTS                                    10000000
test empty cycles :                        510000254
test 2-bytes jump cycles :                 510000077
test 5-bytes jump cycles :                 510000101
test 3/2 nops cycles :                     500000072
test 5-bytes nop with long prefix cycles : 500000107
test 5-bytes P6 nop cycles :               500000069 (current choice ok)
test Generic 1/4 5-bytes nops cycles :     514687590
test K7 1/4 5-bytes nops cycles :          530000012

Intel Pentium 3, 933MHz

NR_TESTS                                    10000000
test empty cycles :                        510000565
test 2-bytes jump cycles :                 510000133
test 5-bytes jump cycles :                 510000363
test 3/2 nops cycles :                     500000358
test 5-bytes nop with long prefix cycles : 500000331
test 5-bytes P6 nop cycles :               ...
From: Roland McGrath
Date: Friday, August 8, 2008 - 5:53 pm

Are we sure there aren't some good available 5-byte nops?

e.g. for 64-bit maybe 3e 48 8d 76 00
     for 32-bit maybe 3e 8d 74 26 00

Have the chip folks confirmed that exactly the set we have in
asm-x86/nops.h are the only optimal ones?


Thanks,
Roland
--

From: Andi Kleen
Date: Friday, August 8, 2008 - 6:13 pm

They were based on the respective optimization manuals, so yes.

-Andi
--

From: Andi Kleen
Date: Friday, August 8, 2008 - 6:19 pm

Scary usage.  How much is the difference? 

BTW a good way to get most of that back would be to not use frame 
pointers if you have them enabled ;-) Unfortunately recently
the "select FRAME_POINTER"s keep creeping all over so far too many kernels
suffer from this slow mode now :-/

-Andi
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 6:30 pm

Funny, CONFIG_FTRACE happens to select that. Now the question is, would 
mcount work without it?

i386:
ENTRY(mcount)
        pushl %eax
        pushl %ecx
        pushl %edx
        movl 0xc(%esp), %eax
        movl 0x4(%ebp), %edx  <-- we record the parent ip here
        subl $MCOUNT_INSN_SIZE, %eax

        call *ftrace_trace_function

        popl %edx
        popl %ecx
        popl %eax
	ret


x86_64:
ENTRY(mcount)
        subq $0x38, %rsp
        movq %rax, (%rsp)
        movq %rcx, 8(%rsp)
        movq %rdx, 16(%rsp)
        movq %rsi, 24(%rsp)
        movq %rdi, 32(%rsp)
        movq %r8, 40(%rsp)
        movq %r9, 48(%rsp)

        movq 0x38(%rsp), %rdi
        movq 8(%rbp), %rsi <-- we record the parent pointer here
        subq $MCOUNT_INSN_SIZE, %rdi

        call   *ftrace_trace_function

        movq 48(%rsp), %r9
        movq 40(%rsp), %r8
        movq 32(%rsp), %rdi
        movq 24(%rsp), %rsi
        movq 16(%rsp), %rdx
        movq 8(%rsp), %rcx
        movq (%rsp), %rax
        addq $0x38, %rsp
	retq

-- Steve

--

From: Andi Kleen
Date: Friday, August 8, 2008 - 6:55 pm

Not without fixing gcc first. It would work if gcc always called
mcount the first thing before setting up the stack frame. Not 
sure why it doesn't do that.

Still do a benchmark of frame pointer vs no frame pointer kernel
and you'll see, especially on a older CPUs without special hardware
to avoid stack stalls (e.g. not Core2)

BTW always forcing frame pointers also means that ftrace is far
from near zero overhead even when disabled. That is unless you
find a way to nop the frame pointers too, but that would be likely
very difficult because the code will actually use it.

-Andi
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 7:03 pm

I could take out the frame pointer option and pass in zero for the parent 
in such that case. But the parent is quite useful. But if it helps
with performance, it may be worth doing this.


I'll see what it does with just a kernel build.

-- Steve
--

From: Andi Kleen
Date: Friday, August 8, 2008 - 7:23 pm

Right now gcc errors out unfortunately. Would need to ask gcc developers
to fix this first. But when they move mcount to be always first
you can still get at the parent, it will always be at 4(%esp) or 8(%rsp).

An alternative would be also run an assembler post processor that
inserts mcount calls without compiler help.

-Andi
--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 9:12 pm

I checked out the latest git tree from Linus, and set up a basic config 
for it. I built it once. Then I rebooted into various versions of the 
kernel and timed the makes. This is what I did for each kernel.

# cd /work/git/linux-compile.git
# make distclean
# cp config-use .config
# make oldconfig
# time make

Here are the results. Just for curiosity sake.

Using the two part nop:
 real    3m26.751s
 user    1m36.064s
 sys     0m39.450s

Using the 5 byte nop with the four 0x66:
 real    3m31.543s
 user    1m36.197s
 sys     0m39.309s

Using jmp 0 as a nop:
 real    3m30.025s
 user    1m35.948s
 sys     0m39.540s


Then I realized that I had the irqsoff, and preemptoff tracers configured.
These do add an overhead when configured in, even when they are disabled.
I disabled them and ran the code with the two part nop again:

Using two part nop without other tracers:
 real    3m26.851s
 user    1m36.425s
 sys     0m35.437s

I then disabled ftrace completely:

no ftrace times:
 real    3m25.807s
 user    1m36.992s
 sys     0m35.715s

Then I disabled frame pointers as well:
 real    3m25.104s
 user    1m35.726s
 sys     0m34.797s

The important numbers here is the sys time. The user and real should not
be that much affected by these changes. Well, the real would, but not the 
user.

I have an old toshiba that I want to run this on too. That might give more 
interesting results. But it's getting late, and I'll do that another time.

Looking at the results, and yes I should run this more than once, but I 
just wanted an idea, this box did not show much difference between the two 
part nop, the jmp and the 5 byte nop.

Running with or without ftrace enabled did not make a difference. Heck, 
the ftrace disabled was actually slower.

But it does seem that Andi is correct about the frame pointers. It did 
slow the compile down by almost a second.

Just some food for thought.

-- Steve

--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 5:30 pm

[ updated patch ]


I put that code in the scheduler and it does indeed preempt in the nops!
It also uncovered a nasty bug I had in my code:



Notice the bug? Hint, *ip is of type unsigned long *.

Here's the updated fix:

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
---
 arch/x86/kernel/alternative.c    |   29 ++++++++++++++++++++-------
 arch/x86/kernel/asm-offsets_32.c |    1 
 arch/x86/kernel/asm-offsets_64.c |    1 
 arch/x86/kernel/entry_32.S       |    4 +++
 arch/x86/kernel/entry_64.S       |    4 +++
 arch/x86/kernel/ftrace.c         |   41 +++++++++++++++++++++++++++++++++++++++
 include/asm-x86/ftrace.h         |    5 ++++
 include/asm-x86/thread_info.h    |    4 +++
 kernel/trace/ftrace.c            |   12 +++++++++++
 9 files changed, 94 insertions(+), 7 deletions(-)

Index: linux-tip.git/arch/x86/kernel/alternative.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/alternative.c	2008-06-05 11:52:24.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/alternative.c	2008-08-08 16:20:23.000000000 -0400
@@ -140,13 +140,26 @@ static const unsigned char *const p6_nop
 };
 #endif
 
+/*
+ * Some versions of x86 CPUs have a two part NOP5. This
+ * can break ftrace if a process is preempted between
+ * the two. ftrace needs to know what the second nop
+ * is to handle this case.
+ */
+int x86_nop5_part2;
+
 #ifdef CONFIG_X86_64
 
 extern char __vsyscall_0;
 const unsigned char *const *find_nop_table(void)
 {
-	return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
-	       boot_cpu_data.x86 < 6 ? k8_nops : p6_nops;
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+	    boot_cpu_data.x86 < 6) {
+		x86_nop5_part2 = 2; /* K8_NOP2 */
+		return k8_nops;
+	} else
+		/* keep k86_nop5_part2 NULL */
+		return p6_nops;
 }
 
 #else /* CONFIG_X86_64 */
@@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab
 static const struct nop {
 	int cpuid;
 	const unsigned char *const ...
From: Mathieu Desnoyers
Date: Monday, August 11, 2008 - 11:21 am

Hi Steven,

I'm actually a bit worried about this scheduler-centric approach. The
problem is that any trap that could be generated in the middle of the
3/2 nops (think of performance counters if the APIC is set as a trap
gate) which would stack an interruptible trap handler on top of those
instructions would lead to having a return IP pointing in the wrong
spot, but because the scheduler would interrupt the trap handler and not
the trapped code, it would not detect this.

I am not sure wheter we do actually have the possibility to have these
interruptible trap handlers considering the way the kernel sets up the
gates, but I think the "let's figure out where the IP stopped" approach
is a bit complex and fragile.

Trying to find a fast atomic 5-bytes nop, involving the
microarchitecture guys, seems like a better approach to me.


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Steven Rostedt
Date: Monday, August 11, 2008 - 12:28 pm

I agree that the fast atomic 5-byte nop is the optimal approach, but until 
we have that, we need to do this.

I don't think this approach is too complex, I wrote the code in about an 
hour, and the patch isn't that big. albeit my one bug, which was a CS101 
type bug (pointer arithmetic), there wasn't anything to me that was 
complex.

The thing is, I want this to be enabled in a production kernel. If there 
is a 1 in a million chance that this theoretical bug with the trap 
happening in a middle of the double nop, it only affects those that enable 
the tracer. No one else. The code only happens when tracing is being 
enabled.

If we need to put in a non optimal nop in, to replace the call to mcount, 
this affects everyone. Anyone that has CONFIG_FTRACE on, and never plans 
on running a trace.

If it comes down to slowing everyone down, against hitting a 1 in a 
million theoretical bug, I'll take my chances on the bug. Again, that code 
is only performed when tracing is being enabled. That is, when it converts 
all 20 thousand nops into calls to a tracer function. If the crash 
happens, it will only happen after tracing has started. Then we can easily 
point the bug at the tracer.

IOW, I do not want to slow down Linus' box.

-- Steve

--

From: Linus Torvalds
Date: Friday, August 8, 2008 - 12:04 pm

Sure. If you have two nops in a row (and the kernel definition of the NOP 
array does _not_ guarantee that it's a single-instruction one), you may 
get a profile hit (ie any interrupt) on the second one. It's less 
_likely_, but it certainly is not architecturally in any way guaranteed 

Depends on the microarchitecture. But many will squash it in the decode 
stage, and generate no uops for them at all, so it's purely a decode 
throughput issue and has absolutely _zero_ impact for any later CPU 

See above. It needs to decode them, and the decoder itself may well have 
some limitations - for example, the longer nops may not even decode in all 
decoders, which is why some uarchs might prefer two short nops to one long 
one, but _generally_ a nop will not make it any further than the decoder. 
But separate nops do count as separate instructions, ie they will hit all 
the normal decode limits (mostly three or four instructions decoded per 

Yes. A CPU core _could_ certainly do special decoding for 'jmp 0' too, but 
I don't know any that do. The 'jmp' is much more likely to be special in 
the front-end and the decoder, and can easily cause things like the 
prefetcher to hickup (ie it tries to start prefetching at the "new" target 
address).

So I would absolutely _expect_ a 'jmp' to be noticeably more expensive 
than one of the special nop's that can be squashed by the decoder. 

A nop that is squashed by the decoder will literally take absolutely no 
other resources. It doesn't even need to be tracked from an instruction 
completion standpoint (which _may_ end up meaning that a profiler would 
actually never see a hit on the second nop, but it's quite likely to 
depend on which decoder it hits etc).

			Linus
--

From: Jeremy Fitzhardinge
Date: Friday, August 8, 2008 - 12:08 pm

The CPU can be interrupted between any two instructions, with a couple 
of exceptions (it must, otherwise an unbounded run of nops could cause 
an unbounded interrupt latency).  If you don't want eip to be in the 
middle of your instruction site, you need to make sure it's nopped out 
with a single instruction.

Unfortunately, aside from P6_NOP5, none of the standard asm/nops.h nops 
are suitable.  Something like "0x66, 0x66, 0x66, 0x66, 0x90" would work 
everywhere, but its possible putting more than 3 prefixes on the 
instruction will kick the decoder into a slow path, which may have a 

A no-op jmp could be easily converted to a nop early in the pipeline.  
But I don't know whether anyone bothers to do that.

    J
--

From: Rusty Russell
Date: Sunday, August 10, 2008 - 7:41 pm

You can walk the task list and fix them up, if you have to.  Of course, this 
could be extremely slow with a million threads.  Maybe just walking the 
runqueues would be sufficient?

Rusty.
--

From: Steven Rostedt
Date: Monday, August 11, 2008 - 5:33 am

True, we could do the run queues too. But we are only looping once 
through the tasks and checking a single pointer on it, which would only be 
set if the task was preempted while in the kernel.

As for being slow, this only happens when we enable the function tracer, 
which is slow anyway. It does not need to happen on disabling the tracer.

Note, this is enabling the calls to the tracer function. The tracer itself 
can enable and disable quickly by turning on or off a variable. This is 
just the "setup" part. Not something that happens often.

-- Steve

--

From: Jeremy Fitzhardinge
Date: Thursday, August 7, 2008 - 2:11 pm

I have  a few concerns about this scheme:

One is that you assume that all text is in a .text section, and that the 
offsets you compute on a per-section basis are going to be valid in the 
final kernel image.  At the very least you should make each offset 
relative to its own function rather than inter-function.

But it seems pretty fragile overall.  Things like -ffunction-sections 
and section gc will invalidate the table you build up.

It seems to me that you can acheive the same effect by:

   1. link vmlinux with "ld -q", which will leave the relocation
      information intact in the final object
   2. use "readelf -r" to extract the relocation info, find the
      references to mcount and build a table
   3. link that table into the kernel image
   4. repeat to get a stable result

Note that steps 2-4 are identical to kallsyms, both in intent and 
detail.  The only difference is the precise table you build and the 
command you use to extract the info from the kernel.  From a quick peek 
at the way Makefile implements kallsyms, it looks to me like it would be 
fairly easy to generalize so that you can do the mcount reloc processing 
in the same relink passes as kallsyms with minimal overhead on the 
kernel build time.

It just seems incredibly fiddly to be doing all this per-object.


--

From: Steven Rostedt
Date: Thursday, August 7, 2008 - 2:29 pm

I assume neither ;-)

This works with .text.sched .text.init .text.cpu etc. I use the first 

I don't see how. I'm referencing a function as my pointer. If this was
true, then I wouldn't be able to do

  static void my_func(void) { [...] }
  struct func_struct {
     void (*func)(void);
  } f = { my_func; };

My references end up exactly the same as the "f->func" reference does.
If my references are broken, so is this f->func. Which would be bad to

This doesn't seem any less complex than what I did. With this, I would 
have to come up with another way to handle modules. This will make

I could work on this. But again, this will put more work into module.c 

Actually, this isn't that much fiddling. What I did was simply make a list
of pointers to the call sites to mcount. The pointers used the first 
function of each section to offset against. I relinked in the list making 
it just like a list done within C.  All relocations and magic after this
will be handled just fine.

-- Steve

--

From: Roland McGrath
Date: Thursday, August 7, 2008 - 3:26 pm

The scheme you've implemented can apply fine to a .ko file after it's made.
They are just .o's really.  It is presumably faster to do one step per
final .ko rather than many tiny ones (per source file).  

That might be a benefit to doing it all at the end for vmlinux too.  I
think the best way would be to have a vmlinux.o that we actually use in the
link, rather than just analyzing and discarding.  Then you can just do your
existing hack on vmlinux.o before it's linked into vmlinux.


Thanks,
Roland
--

From: Steven Rostedt
Date: Thursday, August 7, 2008 - 6:21 pm

OK, I am playing with this. And you are right, it does work with vmlinux.o 
file.  It adds almost another minute to the build anytime something is 
changed. Where the per-object way may take longer doing a full build, but 
does not add that extra minute ever build.

Which would people prefer?  I little longer full build and do the 
modification on every object, or have a shorter full build, but every 
build will take that extra minute to do?

Note, if you have DYNMAIC_FTRACE disabled, this is a nop, and will _not_ 
affect you.

-- Steve

--

From: Steven Rostedt
Date: Thursday, August 7, 2008 - 6:24 pm

Another advantage of doing it at the end on the vmlinux.o, is that you do 
not need to recompile the entire kernel when you change the config option. 
But you do need to recompile all modules.

-- Steve

--

From: Steven Rostedt
Date: Thursday, August 7, 2008 - 6:56 pm

I spoke to soon. I was only looking at the result of the vmlinux.o. But it 
seems that kallsyms skips using this file and relinks everything itself.
Thus the __mcount_loc I added never made it into the vmlinux. The more I 
try to get this method to work, the more complex it becomes. And I still 
did not change the recordmcount.pl at all. Which means, by trying to do it 
against the vmlinux.o, I'm actually making it more complex than it already 
is.

Right now, doing it to every object seems to be the safest approach.

-- Steve


--

From: Peter Zijlstra
Date: Friday, August 8, 2008 - 12:22 am

I have a very strong preference to do it per obj rather than at the end,
that way it can be distributed and make these fancy smp boxen useful.

Bonus points if you can make distcc dig it.

The vmlinux link path is the thing that is killing build performance,
anything to shorten that is golden, anything increasing it needs _very_
good justification.

--

From: Steven Rostedt
Date: Friday, August 8, 2008 - 4:31 am

Yes, the CC used in the script does indeed use distcc (if you are using 
it).

;-)

--

From: Sam Ravnborg
Date: Thursday, August 7, 2008 - 9:54 pm

I have patches to actually use vmlinux.o as part of the link process,
but as I hit some subtle issues with um and sparc they are not
yet ready. I plan to dust them off again and fix up the sparc and
um issues but that is not this month.

	Sam
--

From: Abhishek Sagar
Date: Saturday, August 9, 2008 - 2:48 am

Is this new framework needed for x86 specific reasons only? From what
I gathered here, ftraced defers mcount patching simply because there's
no way to update a 5-byte nop atomically. If so, why can't mcount site
patching be left to arch specific ftrace code? For !SMP or archs which
generate word sized mcount branch calls (e.g, ARM) is there really no
way to patch  mcount sites synchronously from inside ftrace_record_ip
by disabling interrupts?

Regards,
Abhishek Sagar
--

From: Steven Rostedt
Date: Saturday, August 9, 2008 - 6:01 am

There's two topics in this thread.

1) the x86 issue of the 5 byte instruction. The problem with x86 is that on
some CPUs the nop used consists of two nops to fill the 5 bytes. There is 
no way to change that atomically. The workarounds for this is the arch 
specific ftrace_pre_enable() that will make sure no process is about to 
execute the second part of that nop.

2) Getting rid of the daemon. The daemon is used to patch the code 
dynamically later on bootup. Now an arch may or may not be able to modify 
code in SMP, but I've been told that this is dangerous to do even on PPC.

Dynamically modifying text that might be in the pipeline on another CPU 
may or may not be dangerous on all archs.

The fix here is to convert the mcount calls to nops at boot up. This is 
really ideal on all archs. This means we know ever mcount call, and we get 
rid of the requirement that we need to run the code once before we can 
trace it.

The kstop_machine is now only left at the start and stop of tracing.

-- Steve

--

From: Abhishek Sagar
Date: Saturday, August 9, 2008 - 8:01 am

Kprobes leaves this mess for individual archs to handle (see
arch_arm_kprobe). What would be nice to have is an explanation as to
which archs are safe from this and under what conditions. Also, for
!SMP, this boot-time patching approach and the associated build time
paraphernalia would simply be a bloat. There we can simply have a
daemonless ftrace which patches mcount sites synchronously. Why not

This solution indeed would fit all archs well but for some it may be
an overkill (or is it?...I'd like to know that :-\ ).

Oh and as you pointed out, it would mean that we have to no longer run
the code before starting to trace it. But why don't we do that now?
How about calling the tracer from ftrace_record_ip? All we need to do

This a nice fix. I'm just looking to find out what the self modifying
code problem here translates to on different archs for my own
understanding :-)

Thanks,
Abhishek Sagar
--

From: Steven Rostedt
Date: Saturday, August 9, 2008 - 8:37 am

One of the things I tried to do was to make ftrace port easily to all 
archs. David Miller ported it to Sparc64 in 20 minutes and that was mostly 
compiling.  Doing a kprobe fix would require much more knowledge to each 

It is not an issue at all. The mcount list is discarded with the init 
section, and the change to nops is relatively fast. We do make a list of 
all entries anyway, so that we can pick and choose which functions we want 
to enable tracing on.

This recording at boot up should be fine on all archs, SMP or UP and will 

I want ftrace_record_ip to be as fast as possible, and also having it call
the registered function means we need to test if it was set in the filter 
too. This is too much for what record_ip does. And as I said, doing it all 


Now, what I can do, is remove the kstop machine from UP. Hmm, I need to 
check if kstop_machine code itself is smart enough to simply do a 
local_irq_save; run function; local_irq_restore, if it was on UP and not 
SMP.

-- Steve

--

From: Abhishek Sagar
Date: Saturday, August 9, 2008 - 10:14 am

I agree with the point about porting. My kprobe reference was simply
to point out that
an existing module seems to be overwriting code on the fly and doesn't
take care of much
on non-x86 archs. Ftrace has a unique 5-byte call replacement issue


Alrighty.

Thanks,
Abhishek Sagar
--

Previous thread: [PATCH] x86: audit syscalls based on type of syscall not type of binary by Michael Davidson on Thursday, August 7, 2008 - 11:14 am. (1 message)

Next thread: [PATCH 3/5] ftrace: enable mcount recording for modules by Steven Rostedt on Thursday, August 7, 2008 - 11:20 am. (3 messages)