Re: [patch 0/2] Immediate Values - jump patching update

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: Ingo Molnar <mingo@...>
Cc: <akpm@...>, <linux-kernel@...>, H. Peter Anvin <hpa@...>
Date: Monday, April 28, 2008 - 10:35 am

* Ingo Molnar (mingo@elte.hu) wrote:

Yes, thanks,


Let's see :

Paraphrasing the
"Intel® 64 and IA-32 Architectures Optimization Reference Manual" :
http://www.intel.com/design/processor/manuals/248966.pdf

3.5.1.8 Using NOPs

The 1-byte nop, xchg %eax,%eax, is treated specially by the cpu. It
takes only one µop and does not have register dependencies. However,
what we would need here is likely a 5-bytes nop.

The generic 5-bytes nop found in asm-x86/nops.h is :

#define GENERIC_NOP1 ".byte 0x90\n"
#define GENERIC_NOP4 ".byte 0x8d,0x74,0x26,0x00\n"
#define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4

In Intel's guide, they propose a single instruction :
NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)

(0F 1F 44 00 00H)

Which would be required for correct code patching, since we cannot
safely turn a 1+4 bytes sequence of instruction into a single 5-bytes
call without suffering from possible preempted threads returning in the
middle of the instruction. Since the 5-bytes nop does not seem to be
available on architectures below P6, I think this would be a backward
compatibility problem.

K8 NOP 5 also uses a combination of K8_NOP3 K8_NOP2, so this is not only
backward compatibility, but also a concern for current AMD
compatibility. (same issue for K7 5-bytes nop).

However, if we forget about this issue and take for granted that we can
use a single nop instruction that would only take a single µop and have
no external effect (that would be the ideal P6 and + world), the
closest comparison with the jmp instruction I found is the jcc
(conditional branch) instruction in the same manual, Appendix C, where
they list jcc as having 0 latency when not taken and to have a bandwidth
of 0.5, compared to a latency of 1 and bandwidth of 0.33 to 0.5 for nop.

So, just there, a single jmp instruction would seem to have a lower
latency than the nop. (0 cycle vs 1 cycle)

Another interested reading in this document is about conditional
branches :

3.4.1 Optimizing the Front End

Some interesting guide lines :
- Eliminate branches whenever possible
- Separate branches so that they occur no more frequently than every 3 µops
  where possible

3.4.1.1 Eliminating branches improve performance because :
- It eliminates the number of BTB (branch target buffer) entries.
  However, cond. branches which are never taken do not consume BTB
  resources (that's interesting).

3.4.1.3 Static Prediction

Basically, if there is no BTB entry for a branch, if the target address
is forward, static prediction goes forward, and if it goes backward
(loop), static pred. is to go backward. (no surprise here)

Given that the unlikely code is normally at the bottom of functions, the
static prediction would correctly predict the not taken case. So,
disabled markers would not consume any BTB entry, but each enabled
marker would consume a BTB entry.

So, in the end, it looks like the non-taken jcc vs jmp instructions have
the same impact on the system, except that the non-taken jcc must be
preceded by movb (latency 1, bw 0.5 to 1) and test instructions (latency
1, bw 0.33 to 0.5), for a total of 2 latency cycles for the
movb+test+jcc vs 0 latency cycles for the jmp.

Therefore, jmp wins over jcc in the disabled case (latency 0 vs 2) and
in the enabled case even more, since it won't consume any BTB entry.

Oh, and for NOPs vs jmp/jcc, I think not having a single instruction
5-byte nop on every architectures (below P6 and AMD) just disqualifies
it. Even if it would qualify, jmp wins the latency game 0 to 1.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
[patch 0/2] Immediate Values - jump patching update, Mathieu Desnoyers, (Sun Apr 27, 11:34 pm)
Re: [patch 0/2] Immediate Values - jump patching update, H. Peter Anvin, (Mon Apr 28, 1:21 pm)
Re: [patch 0/2] Immediate Values - jump patching update, H. Peter Anvin, (Mon Apr 28, 5:03 pm)
Re: [patch 0/2] Immediate Values - jump patching update, H. Peter Anvin, (Mon Apr 28, 6:25 pm)
Re: [patch 0/2] Immediate Values - jump patching update, H. Peter Anvin, (Mon Apr 28, 7:06 pm)
Re: [patch 0/2] Immediate Values - jump patching update, Mathieu Desnoyers, (Mon Apr 28, 9:46 pm)
Re: [patch 0/2] Immediate Values - jump patching update, H. Peter Anvin, (Mon Apr 28, 10:07 pm)
Re: [patch 0/2] Immediate Values - jump patching update, Mathieu Desnoyers, (Tue Apr 29, 8:18 am)
Re: [patch 0/2] Immediate Values - jump patching update, H. Peter Anvin, (Tue Apr 29, 11:35 am)
Re: [patch 0/2] Immediate Values - jump patching update, Mathieu Desnoyers, (Sun May 4, 10:54 am)
Re: [patch 0/2] Immediate Values - jump patching update, H. Peter Anvin, (Sun May 4, 5:05 pm)
Re: [patch 0/2] Immediate Values - jump patching update, Frank Ch. Eigler, (Mon Apr 28, 8:47 pm)
Re: [patch 0/2] Immediate Values - jump patching update, H. Peter Anvin, (Mon Apr 28, 9:08 pm)
Re: [patch 0/2] Immediate Values - jump patching update, Pavel Machek, (Wed May 14, 10:53 am)
Re: [patch 0/2] Immediate Values - jump patching update, Mathieu Desnoyers, (Mon Apr 28, 10:35 am)