On Fri, Mar 27, 2009 at 07:14:26PM +0000, Alan Cox wrote:
The thought that I had was to create a new system call, fbarrier()
which has the semantics that it will request the filesystem to make
sure that (at least) changes that have been made data blocks to date
should be forced out to disk when the next metadata operation is
committed. For ext3 in data=ordered mode, this would be a no-op. For
other filesystems that had fast/efficient fsync()'s, it could simply
be an fsync(). For other filesystems, it could trigger an
asynchronous writeout, if the journal commit will wait for the
writeout to complete. For yet other filesystems, it might set a flag
that will cause the filesystem to start a synchronous writeout of the
file as part of the commit operations. The bottom line was that what
we could *then* tell application programmers to do is
open/write/fbarrier/close/rename. (And for operating systems where
they don't have fbarrier, they can use autoconf magic to replace
fbarrier with fsync.)
We could potentially make close() imply fbarrier(), but there are
plenty of times when that might not be such a great idea. If we do
that, we're back to requiring synchronous data writes for all files on
close(), which might lead to huge latencies, just as ext3's
data=ordered mode did. And in many cases, where the files in
questions can be easily regenerated (such as object files in a kernel
tree build), there really is no reason why it's a good idea to force
the blocks to disk on close(). In the highly unusual case where we
crash in the middle of a kernel build; we can do a "make clean; make"
and regenerate the object files.
The fundamental idea here is not all files need to be forced to disk
on close. Not all files need fsync(), or even fbarrier(). We can
make the system go much more quickly if we can make a distinction
between these two cases. It can also make SSD drives last longer if
we don't force blocks to disk for non-precious files. If people
disagree with this premise, we can go back to something very much like
ext3's data=ordered mode; but then we get *all* of the problems of
ext3's data=ordered mode, including the unexpected filesystem
latencies that Linus and Ingo have been complaining about so much.
The two are very much related.
Anyway, this is just one idea; I'm not claiming that fbarrier() is the
perfect solution --- but it is one I plan to propose at the upcoming
Linux Storage and Filesystem workshop in San Francisco in a week or
so. Maybe someone else will have a better idea.
- Ted
--