Changelog:
- Remove imv_set_early (removed from API).
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
---
 Documentation/immediate.txt | 221 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/immediate.txt 2008-02-01 07:42:01.000000000 -0500
@@ -0,0 +1,221 @@
+                Using the Immediate Values
+
+                    Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+
+* Purpose of immediate values
+
+Immediate values are used to compile into the kernel variables that sit within
+the instruction stream. They are meant to be rarely updated but read often.
+Using immediate values for these variables will save cache lines.
+
+This infrastructure is specialized in supporting dynamic patching of the values
+in the instruction stream when multiple CPUs are running without disturbing the
+normal system behavior.
+
+Compiling code meant to be rarely enabled at runtime can be done using
+if (unlikely(imv_read(var))) as condition surrounding the code. The
+smallest data type required for the test (an 8-bit char) is preferred, since
+some architectures, such as powerpc, only allow up to 16-bit immediate values.
+
+
+* Usage
+
+In order to use the "immediate" macros, you should include linux/immediate.h.
+
+#include <linux/immediate.h>
+
+DEFINE_IMV(char, this_immediate);
+EXPORT_IMV_SYMBOL(this_immediate);
+
+
+And use, in the body of a function:
+
+Use imv_set(this_immediate) to set the immediate value.
+
+Use imv_read(this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _imv_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+You can choose to give the immediate an initial static value by using, for
+instance:
+
+DEFINE_IMV(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
+
+
+* Performance improvement
+
+
+  * Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 100000
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:                      77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:        +12.1369 cycles
+Memory load based test:         +559.1066 cycles
+
+So let's say we have a ping flood coming at
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
+7674 packets per second. If we put 2 markers for irq entry/exit, it
+brings us to 15348 marker sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
+
+If we plan to use this for memory allocation, spinlock, and all sorts of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second, which brings us to 0.4% slowdown with the
+instruction stream based test compared to 29% slowdown with the memory
+load based test on a system with high memory pressure.
+
+
+
+  * Markers impact under heavy memory load
+
+I ran a kernel with my LTTng instrumentation set in a test that
+generates memory pressure (from userspace) by trashing L1 and L2 caches
+between calls to getppid() (note: syscall_trace is active and calls
+a marker upon syscall entry and syscall exit; markers are disarmed).
+This test is done in user-space, so there are some delays due to incoming
+IRQs and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20
+nice level)
+
+My first set of results, linear cache trashing, turned out not to be
+very interesting, because it seems like the linearity of the memset on a
+full array is somehow detected and it does not "really" trash the
+caches.
+
+Now the most interesting result: Random walk L1 and L2 trashing
+surrounding a getppid() call.
+
+- Markers compiled out (but syscall_trace execution forced)
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.033 cycles
+getppid: 1681.4 cycles
+With memory pressure
+Reading timestamps takes 102.938 cycles
+getppid: 15691.6 cycles
+
+
+- With the immediate values based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.006 cycles
+getppid: 1681.84 cycles
+With memory pressure
+Reading timestamps takes 100.291 cycles
+getppid: 11793 cycles
+
+
+- With global variables based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 107.999 cycles
+getppid: 1669.06 cycles
+With memory pressure
+Reading timestamps takes 102.839 cycles
+getppid: 12535 cycles
+
+The result is quite interesting in that the kernel is slower without
+markers than with markers. I explain it by the fact that the data
+accessed is not laid out in the same manner in the cache lines when the
+markers are compiled in or out. It seems that compiling in the markers
+aligns the function's data better in this case.
+
+But since the interesting comparison is between the immediate values and
+global variables based markers, and because they share the same memory
+layout, except for the movl being replaced by a movz, we see that the
+global variable based markers (2 markers) add 742 cycles to each system
+call (syscall entry and exit are traced and memory locations for both
+global variables lie on the same cache line).
+
+
+- Test redone with fewer iterations, but with error estimates
+
+10 runs of 100 iterations each: Tests done on a 3GHz P4. Here I run getppid with
+syscall trace inactive, comparing the case with memory pressure and without
+memory pressure. (Sorry, my system is not set up to execute syscall_trace this
+time, but it will make the point anyway.)
+
+No memory pressure
+Reading timestamps: 150.92 cycles, std dev. 1.01 cycles
+getppid: 1462.09 cycles, std dev. 18.87 cycles
+
+With memory pressure
+Reading timestamps: 578.22 cycles, std dev. 269.51 cycles
+getppid: 17113.33 cycles, std dev. 1655.92 cycles
+
+
+Now for memory read timing: (10 runs, branches per test: 100000)
+Memory read based branch:
+  644.09 cycles, std dev. 11.39 cycles
+L1 cache hit based branch:
+  88.16 cycles, std dev. 1.35 cycles
+
+
+So, now that we have the raw results, let's calculate:
+
+Memory read: +644.09
