Sure, here is the result of a small test comparing:
1 - Branch depending on a cache miss (has to fetch in memory, caused by a 128
bytes stride)). This is the test that is likely to look like what
side-effect the original profile_hit code was causing, under the
assumption that the kernel is already using L1 and L2 caches at
their full capacity and that a supplementary data load would cause
cache trashing.
2 - Branch depending on L1 cache hit. Just for comparison.
3 - Branch depending on a load immediate in the instruction stream.
It has been compiled with gcc -O2. Tests done on a 3GHz P4.
In the first test series, the branch is not taken:
number of tests : 1000
number of branches per test : 81920
memory hit cycles per iteration (mean) : 48.252
L1 cache hit cycles per iteration (mean) : 16.1693
instruction stream based test, cycles per iteration (mean) : 16.0432
In the second test series, the branch is taken and an integer is
incremented within the block:
number of tests : 1000
number of branches per test : 81920
memory hit cycles per iteration (mean) : 48.2691
L1 cache hit cycles per iteration (mean) : 16.396
instruction stream based test, cycles per iteration (mean) : 16.0441
Therefore, the memory fetch based test seems to be 200% slower than the
load immediate based test.
(I am adding these results to the documentation)
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
-