the counter argument was that, going by a specific analysis of sched.o,
this results in slower code. The reason is that the "function call
parameter preparation" halo around that 5-byte patch site is larger than
a single conditional branch to an offline place in the current function.

i.e. the current optimized marker approach does roughly this:

  [ .... fastpath head .......... ]
  [ immediate value instruction   ] --->
  [ branch instruction            ] ---> these two get NOP-ed out
  [ .... fastpath tail .......... ]
  [ ............................. ]
  [ ... offline area ............ ]
  [ ... parameter preparation ... ]
  [ ... marker call ............. ]

your proposed 5-byte call NOP approach (which btw. was what i proposed
multiple times in the past 2 years) would do this:

  [ .... fastpath head .......... ]
  [ ... parameter preparation ... ]
  [ .... 5-byte CALL ............ ] ---> NOP-ed out
  [ .... fastpath tail .......... ]
  [ ............................. ]

in the first case we have little "marker parameter/value preparation"
cost: it all happens in the 'offline area' _by GCC_. I.e. the fastpath
is relatively undisturbed.

in the latter case, the whole 'parameter preparation' phase has to happen
around the 5-byte CALL site, in the fastpath. In the specific,
assembly-level analysis of sched.o, Mathieu showed this to be a
pessimisation. We are better off inserting that conditional and letting
gcc generate the call than forcing it into the middle of the fastpath -
even if we end up NOP-ing out the call.
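
to make the difference concrete, here is a minimal C sketch of the two
shapes. All names in it (marker_enabled, marker_call, marker_nop,
marker_site, fastpath_branch, fastpath_call) are made up for illustration
and are not the kernel marker API. The plain variable and the function
pointer only stand in for the patched immediate value and the NOP-ed
5-byte CALL; nothing is patched at runtime here. __builtin_expect is the
GCC builtin that the kernel's unlikely() is built on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel marker API. */
static int marker_enabled;         /* models the patched immediate value */
static unsigned long marker_hits;  /* makes marker calls observable */
static long marker_last;           /* last payload seen by the marker */

#define unlikely(x) __builtin_expect(!!(x), 0)

/* The probe itself; noinline so it stays a real call. */
static void __attribute__((noinline))
marker_call(const char *name, long a, long b)
{
	assert(name != NULL);
	marker_hits++;
	marker_last = a + b;
}

/* Disabled probe for the second variant: does nothing. */
static void marker_nop(const char *name, long a, long b)
{
	(void)name; (void)a; (void)b;
}

/*
 * Variant 1: immediate value + conditional branch. The argument
 * setup sits inside the unlikely block, so the compiler is free
 * to place it in an offline area of the function.
 */
long fastpath_branch(long prev, long next)
{
	long sum = prev + next;			/* fastpath head */

	if (unlikely(marker_enabled))
		marker_call("sched_switch", prev * 3, next * 5);

	return sum * 2;				/* fastpath tail */
}

/*
 * Variant 2: unconditional call site. The arguments are prepared
 * inline on every pass, whether or not the call does anything.
 * An indirect call is only an approximation of a NOP-ed CALL.
 */
static void (*marker_site)(const char *, long, long) = marker_nop;

long fastpath_call(long prev, long next)
{
	long sum = prev + next;			/* fastpath head */

	marker_site("sched_switch", prev * 3, next * 5);

	return sum * 2;				/* fastpath tail */
}
```

building this with gcc -O2 -S and comparing the two functions is one way
to look at where the argument setup ends up; whether it really moves off
the fastpath depends on the compiler version and flags.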

wrt. complexity i agree with you - if the current optimization cannot be
made to work correctly we have to fall back to a simpler variant, even if
it's slower.

	Ingo