* Ingo Molnar (mingo@elte.hu) wrote:

Yes, thanks. Let's see:

Paraphrasing the "Intel® 64 and IA-32 Architectures Optimization Reference Manual":
http://www.intel.com/design/processor/manuals/248966.pdf

3.5.1.8 Using NOPs

The 1-byte nop, xchg %eax,%eax, is treated specially by the CPU. It takes only one µop and does not have register dependencies.

However, what we would need here is likely a 5-byte nop. The generic 5-byte nop found in asm-x86/nops.h is:

  #define GENERIC_NOP1 ".byte 0x90\n"
  #define GENERIC_NOP4 ".byte 0x8d,0x74,0x26,0x00\n"
  #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4

In Intel's guide, they propose a single instruction:

  NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement) (0F 1F 44 00 00H)

A single instruction is required for correct code patching: we cannot safely turn a 1+4 byte sequence of instructions into a single 5-byte call, because a thread preempted between the two nops would return into the middle of the new instruction.

Since the single-instruction 5-byte nop does not seem to be available on architectures below P6, I think this would be a backward compatibility problem. K8_NOP5 is also a combination (K8_NOP3 K8_NOP2), so this is not only a backward compatibility issue but also a concern for current AMD compatibility (same issue for the K7 5-byte nop).

However, if we forget about this issue and take for granted that we can use a single nop instruction that takes a single µop and has no external effect (the ideal "P6 and later" world), the closest comparison with the jmp instruction I found is the jcc (conditional branch) instruction in the same manual, Appendix C. There, jcc is listed as having 0 latency when not taken and a bandwidth of 0.5, compared to a latency of 1 and a bandwidth of 0.33 to 0.5 for nop. So, just there, a single jmp instruction would seem to have a lower latency than the nop (0 cycles vs 1 cycle).

Another interesting read in this document is about conditional branches:

3.4.1 Optimizing the Front End

Some interesting guidelines:

- Eliminate branches whenever possible.
- Separate branches so that they occur no more frequently than every 3 µops where possible.

3.4.1.1 Eliminating Branches

Eliminating branches improves performance because:

- It reduces the number of BTB (branch target buffer) entries needed. However, conditional branches which are never taken do not consume BTB resources (that's interesting).

3.4.1.3 Static Prediction

Basically, if there is no BTB entry for a branch, a forward branch is statically predicted as not taken, and a backward branch (loop) is statically predicted as taken (no surprise here). Given that the unlikely code is normally at the bottom of functions, static prediction would correctly predict the not-taken case.

So, disabled markers would not consume any BTB entry, but each enabled marker would consume a BTB entry.

So, in the end, it looks like the non-taken jcc and the jmp instructions have the same impact on the system, except that the non-taken jcc must be preceded by a movb (latency 1, bandwidth 0.5 to 1) and a test instruction (latency 1, bandwidth 0.33 to 0.5), for a total of 2 latency cycles for movb+test+jcc vs 0 latency cycles for the jmp.

Therefore, jmp wins over jcc in the disabled case (latency 0 vs 2), and even more so in the enabled case, since it won't consume any BTB entry.

Oh, and for nops vs jmp/jcc, I think not having a single-instruction 5-byte nop on every architecture (below P6, and AMD) just disqualifies it. Even if it qualified, jmp wins the latency game 0 to 1.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
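P.S. To make the code-patching argument about the 5-byte nop concrete, here is a rough user-space sketch in C. It is not kernel code. The byte sequences are the ones quoted above; first_insn_len(), pad_is_single_insn() and encode_call5() are hypothetical helper names made up for this illustration, and first_insn_len() recognizes only the two pads discussed here rather than decoding x86 in general. The point it shows: GENERIC_NOP5 has an instruction boundary at offset 1, so a thread can be sitting there when the 5 bytes are rewritten into a call, whereas the single-instruction nop has no interior boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The two 5-byte pads discussed in the message. */
static const uint8_t generic_nop5[5] = { 0x90, 0x8d, 0x74, 0x26, 0x00 }; /* GENERIC_NOP1 + GENERIC_NOP4 */
static const uint8_t p6_nop5[5]      = { 0x0f, 0x1f, 0x44, 0x00, 0x00 }; /* NOP DWORD PTR [EAX + EAX*1 + 0] */

/*
 * Hypothetical helper: length of the first instruction in a 5-byte pad.
 * Assumption: it only knows the two sequences above and returns 0 for
 * anything else. It is not a general x86 decoder.
 */
static size_t first_insn_len(const uint8_t pad[5])
{
    if (pad[0] == 0x90)
        return 1;   /* 1-byte nop, followed by a separate 4-byte nop */
    if (pad[0] == 0x0f && pad[1] == 0x1f && pad[2] == 0x44)
        return 5;   /* 2 opcode bytes + ModRM + SIB + disp8 */
    return 0;
}

/* A pad can be replaced by one 5-byte call only if no thread can resume inside it. */
static int pad_is_single_insn(const uint8_t pad[5])
{
    return first_insn_len(pad) == 5;
}

/*
 * Encode "call rel32" (opcode 0xe8) placed at address `site` and targeting
 * `target`. The displacement is relative to the end of the 5-byte call.
 */
static void encode_call5(uint8_t out[5], uint32_t site, uint32_t target)
{
    uint32_t rel = target - (site + 5);

    assert(out != NULL);
    out[0] = 0xe8;
    out[1] = (uint8_t)(rel & 0xff);         /* little-endian rel32 */
    out[2] = (uint8_t)((rel >> 8) & 0xff);
    out[3] = (uint8_t)((rel >> 16) & 0xff);
    out[4] = (uint8_t)((rel >> 24) & 0xff);
}
```

With these helpers, pad_is_single_insn(generic_nop5) is false and pad_is_single_insn(p6_nop5) is true, which is the whole objection to the 1+4 sequence. A real patching path would also have to deal with cross-CPU instruction fetch, which this sketch ignores.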
