OK, I hoped so, but just double-checked to be sure. :)
When talking to MMIO you often don't need to force the outstanding writes
to complete before you exit some driver's code. They will eventually
reach the device and do their thing in due course.
A notable exception is side effects that need to be synchronised to
prevent races.  For example, to avoid wasting processing time on
handling spurious interrupts, you do want to make sure that the write
telling the device its pending interrupt has been recorded by the
handler reaches the respective device's register before the interrupt
is cleared in the interrupt controller.
On the other hand, you do not need to force out a write that merely asks
the device to look for more data in the outgoing DMA descriptor ring.
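
To illustrate the two cases above, here is a minimal sketch in C.  It is
a user-space toy model, not real driver code: `struct toy_bus`,
`struct toy_dev`, `mmio_write()`, `mmio_read()` and the register names are
all made up for illustration.  The only assumption about the hardware is
that writes may be posted and that a read from the device pushes them out
ahead of itself (as is the case with PCI, for example).

```c
#include <stddef.h>
#include <stdint.h>

#define WB_DEPTH 8			/* hypothetical write buffer depth */

/* Hypothetical device with two registers. */
struct toy_dev {
	uint32_t irq_pending;		/* 1 while the device asserts its IRQ */
	uint32_t tx_kick;		/* write 1: go look at the TX ring */
};

/* Toy model of a bus that posts writes instead of completing them. */
struct toy_bus {
	struct { uint32_t *reg; uint32_t val; } wb[WB_DEPTH];
	size_t n;			/* number of writes still posted */
};

/* Deliver all posted writes to the device, in order. */
void bus_drain(struct toy_bus *b)
{
	for (size_t i = 0; i < b->n; i++)
		*b->wb[i].reg = b->wb[i].val;
	b->n = 0;
}

/* A write is merely queued; it reaches the device "eventually". */
void mmio_write(struct toy_bus *b, uint32_t *reg, uint32_t val)
{
	if (b->n == WB_DEPTH)
		bus_drain(b);
	b->wb[b->n].reg = reg;
	b->wb[b->n].val = val;
	b->n++;
}

/* A read cannot pass posted writes, so it pushes them out first. */
uint32_t mmio_read(struct toy_bus *b, uint32_t *reg)
{
	bus_drain(b);
	return *reg;
}

/* Case 1: the acknowledgement has to have landed before we return
   and the interrupt gets cleared in the interrupt controller. */
void dev_ack_irq(struct toy_bus *b, struct toy_dev *d)
{
	mmio_write(b, &d->irq_pending, 0);
	(void)mmio_read(b, &d->irq_pending);	/* read back to flush */
}

/* Case 2: nothing races with the kick, so leave the write posted. */
void dev_kick_tx(struct toy_bus *b, struct toy_dev *d)
{
	mmio_write(b, &d->tx_kick, 1);
}
```

In this model `dev_kick_tx()` returns with its write still sitting in the
buffer, which is fine, whereas `dev_ack_irq()` pays for one read precisely
because the device must have seen the write by the time it returns.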
Ah, framebuffers. The DEC Alpha people somehow managed to get them
right. :) What you say is of course true for a dumb framebuffer -- but
who cares about dumb framebuffers these days?
A half-decent graphics controller will provide a set of typical masked
raster operations: STORE, AND, OR, XOR, etc. so that you don't have to
issue RMW cycles to the framebuffer's memory -- all you need are bulk writes,
where the order does not really matter and which can be pipelined (the
graphics controller may be able to replicate writes too, such as across
the whole scanline -- good for the bandwidth!).
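
As a rough illustration of why such raster operations remove the need
for RMW cycles, here is a toy model in C.  It does not describe the
registers of any real adapter: `struct toy_gfx`, `gfx_write()` and
`gfx_fill()` are hypothetical names, and the "controller" is just a
function standing in for what the hardware would do on its side of the bus.

```c
#include <stddef.h>
#include <stdint.h>

enum rop { ROP_STORE, ROP_AND, ROP_OR, ROP_XOR };

#define FB_PIXELS 16			/* hypothetical framebuffer size */

/* Hypothetical controller: one mode register plus pixel memory. */
struct toy_gfx {
	enum rop rop;
	uint32_t mem[FB_PIXELS];
};

/* What the controller does on its side of the bus for every write:
   it combines the incoming data with the old pixel by itself. */
void gfx_write(struct toy_gfx *g, size_t i, uint32_t src)
{
	switch (g->rop) {
	case ROP_STORE: g->mem[i] = src; break;
	case ROP_AND:   g->mem[i] &= src; break;
	case ROP_OR:    g->mem[i] |= src; break;
	case ROP_XOR:   g->mem[i] ^= src; break;
	}
}

/* The CPU side: select the operation once, then stream plain writes.
   The CPU never reads the framebuffer, and since each write targets
   a different pixel their relative order does not matter. */
void gfx_fill(struct toy_gfx *g, enum rop rop, size_t start, size_t len,
	      uint32_t src)
{
	g->rop = rop;
	for (size_t i = 0; i < len; i++)
		gfx_write(g, start + i, src);
}
```

For example, `gfx_fill(&g, ROP_XOR, 2, 4, 0x0000ffff)` inverts the low
16 bits of four pixels with four writes and no reads at all.  On real
hardware the write to the mode register would presumably have to be
ordered against the pixel writes that follow it.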
You may still have to issue some barriers around accesses to the
framebuffer's control registers, but that's about it.  And the TGA X11
driver undoubtedly gets these things right, or otherwise nobody could have
used it and the adapters it supports with an Alpha (as a side note: that
graphics chip/software applies to MIPS-based DECstation systems too).
This is all early 1990s' technology, no rocket science anymore. :)
There's a technical report on the techniques used somewhere on the web --
look for "Smart Frame Buffer" (and don't forget to check its date ;) ).
In general: don't break the CPU because you've got a broken piece of
software -- fix the piece instead!
I stand by my choice -- inefficiency from unnecessary (implicit) ordering
barriers accumulates.  These operations are so slow (with latencies
possibly counted in hundreds of CPU cycles) that it really matters
whether you need ten or just one, especially with the speeds of
contemporary processors.
Maciej
--