I was asked just what that overhead *is* ... and it surprised me.
A summary of the results is appended to this note.
Fortunately it turns out those problems all go away if the gpiolib
code uses a *raw* spinlock to guard its table lookups. With a raw
spinlock, any performance impact of gpiolib seems to be well under
a microsecond in this bitbang context (and not objectionable).
Preempt became free; enabling debug options had only a minor cost.
That's as it should be, since the only substantive changes were to
grab and release a lock, do one table lookup a bit differently, and
add one indirect function call ... changes which should not have
any visible performance impact on per-bit codepaths, and which one
might expect to cost on the order of a dozen instructions.
So the next version of this code will include a few minor bugfixes,
and will also use a raw spinlock to protect that table. A raw lock
seems appropriate there in any case, since non-sleeping GPIOs should
be accessible from hardirq contexts even on RT kernels.
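
For anyone who wants to picture the per-access path being discussed, here
is a rough userspace sketch of "grab and release a lock, do one table
lookup, add one indirect call". It is not the gpiolib code: every name in
it (sketch_chip, sketch_register, sketch_gpio_get_value, the table size)
is made up for illustration, and a C11 atomic_flag busy-wait lock stands
in for a kernel raw spinlock, on the assumption that the relevant property
is simply "never sleeps".

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical names throughout; a userspace stand-in, not gpiolib. */
#define SKETCH_NR_GPIOS 8

struct sketch_chip {
	/* the one added indirection: a per-chip accessor */
	int (*get)(struct sketch_chip *chip, unsigned offset);
	unsigned base;		/* first GPIO number this chip handles */
	unsigned level_bits;	/* fake pin state, one bit per offset */
};

/* Busy-waiting lock that never sleeps, standing in for a raw spinlock. */
static atomic_flag table_lock = ATOMIC_FLAG_INIT;
static struct sketch_chip *table[SKETCH_NR_GPIOS];

static void table_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&table_lock,
						 memory_order_acquire))
		;	/* spin; never block or sleep */
}

static void table_lock_release(void)
{
	atomic_flag_clear_explicit(&table_lock, memory_order_release);
}

/* Example accessor: report the fake level of one pin. */
static int sketch_chip_get(struct sketch_chip *chip, unsigned offset)
{
	return (chip->level_bits >> offset) & 1u;
}

/* Point ngpio consecutive table slots, starting at chip->base, at chip. */
static void sketch_register(struct sketch_chip *chip, unsigned ngpio)
{
	assert(chip->base + ngpio <= SKETCH_NR_GPIOS);
	table_lock_acquire();
	for (unsigned i = 0; i < ngpio; i++)
		table[chip->base + i] = chip;
	table_lock_release();
}

/* Per-bit path: lock, one table lookup, unlock, one indirect call. */
static int sketch_gpio_get_value(unsigned gpio)
{
	struct sketch_chip *chip;

	assert(gpio < SKETCH_NR_GPIOS);
	table_lock_acquire();
	chip = table[gpio];
	table_lock_release();

	if (!chip)
		return -1;	/* no chip registered for this number */
	return chip->get(chip, gpio - chip->base);
}
```

The lock is held only across the table read, so the cost added to each
access is the lock/unlock pair plus the indirect call, which is the sort
of thing one would expect to stay in the "dozen instructions" range.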
If anyone has any strong arguments against using a raw spinlock
to protect that table, it'd be nice to know them sooner rather
than later.
- Dave
SUMMARY:
Using the i2c-gpio driver on a preempt kernel with all the usual
kernel debug options enabled, the per-bit times (*) went up in a
bad way: from about 6.4 usec/bit (original GPIO code on this board)
up to about 11.2 usec/bit (just switching to gpiolib), which is
well into "objectionable overhead" territory for bit access.
Just enabling preempt shot the time up to 7.4 usec/bit ... which is
also objectionable (it's all-the-time overhead that is clearly
needless), but much less so.
Converting the table lock to be a raw spinlock essentially removed
all non-debug overheads. It took enabling all those debug options
plus internal gpiolib debugging overhead to get those times up to
the 7.4 usec/bit that previously applied even with just preempt.
(*) Those times being eyeballed medians; I didn't make time to find
a way to export a few thousand measurements from the tool and
do the math. The typical range was +/- one usec.
The numbers include udelay() calls, so the relevant point is
the time *delta* attributable only to increased gpiolib costs,
not the base time (with udelays). The delta probably reflects
on the order of four GPIO calls: set two different bits, clear
one of them, and read it to make sure it cleared.
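
    To make the delta explicit, here is a small sketch of the arithmetic
    using the medians quoted above, in nanoseconds to avoid floating point.
    It assumes 6.4 usec/bit is the right baseline for both comparisons and
    takes "four GPIO calls per bit" at face value; the helper names are
    invented for this note. With a typical range of +/- one usec, the
    per-call figures are only rough.

```c
#include <assert.h>

/* Median per-bit times from the summary, in nanoseconds. */
enum {
	BASE_NS                  = 6400,	/* original GPIO code */
	GPIOLIB_PREEMPT_DEBUG_NS = 11200,	/* gpiolib, preempt + debug */
	GPIOLIB_PREEMPT_NS       = 7400,	/* gpiolib, preempt only */
	ASSUMED_CALLS_PER_BIT    = 4,		/* assumption from the note */
};

/* Extra time per bit relative to the original GPIO code. */
static int overhead_per_bit_ns(int measured_ns)
{
	return measured_ns - BASE_NS;
}

/* Extra time per GPIO call, if the delta is spread evenly over calls. */
static int overhead_per_call_ns(int measured_ns, int calls)
{
	assert(calls > 0);
	return overhead_per_bit_ns(measured_ns) / calls;
}
```

    Under those assumptions the preempt-plus-debug case works out to
    4.8 usec/bit extra (about 1.2 usec per call), and the preempt-only
    case to 1.0 usec/bit extra (about 0.25 usec per call).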