I meant the latter.
I agree the docs are unclear here. They contain an example of cached and
uncached stores (which Ralf has already pointed to), but no clear
explanation for a mix of loads and stores. Sure, it's safer to keep both
the sync and the uncached load.
There is no such thing as performance when it comes to uncached loads.
Case #2 requires:
1. sync
2. additional operations (usually just a read) to pull the data through
the input buffers on the IO bus.
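
A rough sketch of those two steps in C, for illustration only. The names
io_sync() and mmio_write_flush() are made up here and do not come from any
kernel tree. It assumes the GCC/Clang builtin __sync_synchronize(), which
on MIPS is expected to emit a sync instruction, and it assumes reg points
into an uncached mapping of the device:

```c
#include <stdint.h>

/* Hypothetical barrier helper: assumed to become "sync" on MIPS. */
static inline void io_sync(void)
{
    __sync_synchronize();
}

/*
 * Hypothetical helper: store to a device register, then make sure the
 * store has really reached the device before going on.
 */
static inline uint32_t mmio_write_flush(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;     /* uncached store, may still sit in a write buffer */
    io_sync();      /* step 1: sync, drain the CPU write buffer */
    return *reg;    /* step 2: read back, so the data gets through the
                       buffers on the IO bus before the load returns */
}
```

A driver can wrap every register write like this. Code that just treats
the device as a plain memory array has no place to put such calls.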
While it's ok to put that in MMIO reads/writes as you've done, it's
almost impossible to program an X server that way, for example. This
beast treats a frame buffer as a memory array with strong ordering.
That's why I'd vote for case #3. Not because it outperforms #2 in
real life (who cares about a 0.0001% gain), but because IO devices
require strong ordering.
Gleb.
--