On Fri, Aug 06, 2010 at 02:20:32PM -0700, Daniel Stodden wrote:
With the barrier model we have in current kernels you basically need to
a) do a drain (typically inside the guest) and b) have a cache flush
command if you have volatile write cache semantics. The cache flush
command will be used for pre-flushes, standalone flushes and, if you
don't have a FUA bit in the protocol, post-flushes.
Which backend? Currently filesystems can in theory rely on the ordering
semantics, although very few do. And we've not seen a working
implementation of it other than draining - the _TAG versions exist,
but they are basically untested, and no one has solved the issues of
error handling for them yet.
Basically the only thing you need is a cache flush command right now;
that covers everything the Linux kernel needs, and the same goes for
Windows and possibly other guests. The draining is something imposed on us by
the current Linux barrier semantics, and I'm confident it will be a
thing of the past by Linux 2.6.37.
I think you're another victim of the overloaded barrier concept. What
the Linux barrier flag does is two only slightly related things:
a) give the filesystem a way to flush volatile write caches and thus
guarantee data integrity
b) provide block level ordering loosely modeled after the SCSI ordered
tag model
Practice has shown that we just need (a) in most cases. There are only
two filesystems that can theoretically take advantage of (b), and even
there I'm sure we could do better without the block level draining.
The _TAG implementation of barriers is related to (b) only - the pure
QUEUE_ORDERED_TAG is only safe if you do not have a volatile write
cache - to do cache flushing you need the QUEUE_ORDERED_TAG_FLUSH or
QUEUE_ORDERED_TAG_FUA modes. In theory we could also add another
mode that not only integrates the post-flush in the _FUA mode but
also a pre-flush into a single command, but so far there wasn't
any demand, most likely because no on-the-wire storage protocol
implements it.
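
To make the flush side of this concrete, here is a rough userspace
model of which commands a single barrier write expands to. It is only
a sketch of the rules described above: the enum and function names are
made up for illustration and are not kernel identifiers, and the
ordering side (drain or tag) is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step names for this model; not kernel identifiers. */
enum barrier_step {
    STEP_PREFLUSH,   /* cache flush issued before the barrier write */
    STEP_WRITE,      /* plain barrier write */
    STEP_WRITE_FUA,  /* barrier write with the FUA bit set */
    STEP_POSTFLUSH   /* cache flush issued after the barrier write */
};

/*
 * Fill `out` (room for 3 entries) with the commands one barrier write
 * expands to, and return how many entries were written.
 */
size_t plan_barrier_write(bool volatile_cache, bool has_fua,
                          enum barrier_step out[3])
{
    size_t n = 0;

    if (!volatile_cache) {
        /* No volatile write cache: nothing to flush. */
        out[n++] = STEP_WRITE;
        return n;
    }

    /* Volatile cache: earlier writes must reach stable storage first. */
    out[n++] = STEP_PREFLUSH;

    if (has_fua) {
        /* FUA makes the barrier write itself durable. */
        out[n++] = STEP_WRITE_FUA;
    } else {
        /* Without FUA a second flush takes its place. */
        out[n++] = STEP_WRITE;
        out[n++] = STEP_POSTFLUSH;
    }
    return n;
}
```

In this model the no-cache case corresponds to the plain mode, the
three-step case to the _FLUSH mode and the two-step case to the _FUA
mode.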
The typical one is f(data)sync for the case where there have been no
modifications of metadata, or when using an external log device.
Having no metadata modifications is quite typical for databases or
virtualization images, or other bulk storage that doesn't allocate space
on the fly.
It does issue normal write barriers when you have dirty metadata, else
it sends empty barriers if supported.
Err, that was a question. For O_SYNC/O_DSYNC you don't need the
explicit fsync. For O_DIRECT you do (or use O_SYNC/O_DSYNC in addition).
No, if you're using O_DIRECT you still need f(data)sync to flush out
the host disk cache.
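
As a minimal userspace sketch of the two cases (the function names,
the 4096-byte alignment and the file handling are assumptions for
illustration, not taken from any particular backend):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* exposes O_DIRECT on Linux/glibc */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0           /* not exposed here: degrades to buffered I/O */
#endif

#define BLK 4096             /* assumed alignment; real devices may differ */

/* O_DSYNC: pwrite() returns only once the data is stable, so no
 * explicit fsync/fdatasync call follows. */
int write_dsync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0)
        return -1;

    int rc = (pwrite(fd, buf, len, 0) == (ssize_t)len) ? 0 : -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* O_DIRECT alone: the page cache is bypassed, but the disk write cache
 * is not, so fdatasync() is still needed afterwards. */
int write_direct(const char *path, char fill)
{
    void *buf;
    if (posix_memalign(&buf, BLK, BLK) != 0)
        return -1;
    memset(buf, fill, BLK);

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {            /* e.g. filesystem without O_DIRECT support */
        free(buf);
        return -1;
    }

    int rc = 0;
    if (pwrite(fd, buf, BLK, 0) != (ssize_t)BLK)
        rc = -1;
    if (rc == 0 && fdatasync(fd) < 0)   /* flush the disk cache */
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    free(buf);
    return rc;
}
```

Opening with O_DIRECT | O_DSYNC instead would fold the flush into each
write, at the cost of paying for it on every request.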
All this will depend a lot on the filesystem. But if you're not
doing any allocation and you're not using O_SYNC/O_DSYNC most
filesystems will not send any barrier at all. The obvious exception is
btrfs because it has to allocate new blocks anyway due to its
copy-on-write scheme.
--