Btw, something semi-related I've been looking at recently:

Currently O_DIRECT writes bypass all kernel caches, but they do use the disk caches. We currently don't have any barrier support for them at all, which is really bad for data integrity in virtualized environments. I've started thinking about how to implement this.

The simplest scheme would be to mark the last request of each O_DIRECT write as a barrier request. This works nicely from the FS perspective and works with all hardware supporting barriers. It's massive overkill though - we really only need to flush the cache after our request, and not before. And for SCSI we would be much better off just setting the FUA bit on the commands and not requiring a full cache flush at all.

The next scheme would be to simply always do a cache flush after the direct I/O write has completed, but given that blkdev_issue_flush blocks until the command is done that would a) require everyone to use the end_io callback and b) spend a lot of time in that workqueue. This only requires one full cache flush, but it's still suboptimal. I have prototyped this for XFS, but I don't really like it.

The best scheme would be to get some highlevel FUA request in the block layer which gets emulated by a post-command cache flush.
--
I've talked to Chris about this in the past too, but I never got around
to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
without making too many changes, and we do have FUA support on most SATA
drives too. Basically just a check in the driver for whether the
request is O_DIRECT and a WRITE, ala:

	if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
		WRITE_FUA;

I know that FUA is used by that other OS, so I think we should be golden
on the hw support side.
--
Jens Axboe
--
I've been thinking about this too, and for optimal performance with VMs and also with databases, I think FUA is too strong. (It's also too weak, on drives which don't have FUA).

I would like to be able to get the same performance and integrity as the kernel filesystems can get, and that means using barrier flushes when a kernel filesystem would use them, and FUA when a kernel filesystem would use that. Preferably the same whether userspace is using a file or a block device.

The conclusion I came to is that O_DIRECT users need a barrier flush primitive. FUA can either be deduced by the elevator, or signalled explicitly by userspace.

Fortunately there's already a sensible API for both: fdatasync (and aio_fsync) to mean flush, and O_DSYNC (or inferred from flush-after-one-write) to mean FUA.

Those apply to files, but they could be made to have the same effect with block devices, which would be nice for applications which can use both. I'll talk about files from here on; assume the idea is to provide the same functions for block devices.

It turns out that applications needing integrity must use fdatasync or O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may choose to use buffered writes at any time, with no signal to the application. O_DSYNC or fdatasync ensures that unknown buffered writes will be committed. This is true for other operating systems too, for the same reason, except some other unixes will convert all writes to buffered writes, not just corner cases, under various circumstances that it's hard for applications to detect.

So there's already a good match to using fdatasync and/or O_DSYNC for O_DIRECT integrity.

If we define fdatasync's behaviour to be that it always causes a barrier flush if there have been any WRITE commands to a disk since the last barrier flush, in addition to its behaviour of flushing cached pages, that would be enough, and VM and database applications would have good support for integrity. Of course O_DSYNC ...
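The rule proposed above (flush the drive cache on fdatasync only if WRITE commands have gone to the disk since the last flush) can be sketched as a small state machine. This is a minimal model for illustration only: struct drive_state, drive_note_write() and drive_fdatasync() are hypothetical names, not kernel interfaces, and the model ignores the page cache side of fdatasync entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-drive bookkeeping; not a real kernel structure. */
struct drive_state {
	bool write_cache_enabled; /* drive has a volatile write-back cache */
	bool cache_dirty;         /* non-FUA writes seen since the last flush */
	unsigned flushes_issued;  /* number of CACHE FLUSH commands sent */
};

/* Record a WRITE sent to the drive. A FUA write goes straight to the
 * medium, so it does not leave anything behind in the volatile cache. */
static void drive_note_write(struct drive_state *d, bool fua)
{
	assert(d != NULL);
	if (d->write_cache_enabled && !fua)
		d->cache_dirty = true;
}

/* Model of the drive-level half of fdatasync(): issue a CACHE FLUSH
 * only when there is something to flush. Returns true if one was sent. */
static bool drive_fdatasync(struct drive_state *d)
{
	assert(d != NULL);
	if (!d->cache_dirty)
		return false;
	d->flushes_issued++;
	d->cache_dirty = false;
	return true;
}
```

In this model a second fdatasync with no intervening writes costs nothing, and a run of FUA-only writes never triggers a flush, which is the behaviour the proposal is after.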
I thought about this a lot. It would be sensible to only require the FUA semantics if O_SYNC is specified. But from looking around at users of O_DIRECT no one seems to actually specify O_SYNC with it. And on Linux where O_SYNC really means O_DSYNC that's pretty sensible - if O_DIRECT bypasses the filesystem cache there is nothing else left to sync for a non-extending write. That is until those pesky disk write back caches come into play that no application writer wants or

The fallback was a relatively recent addition to the O_DIRECT semantics for broken filesystems that can't handle holes very well. Fortunately enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC) semantics for that already.
--
In measurements I've done, disabling a disk's write cache results in much slower ext3 filesystem writes than using barriers. Others report similar results. This is with disks that don't have NCQ; good NCQ may be better.

Using FUA for all writes should be equivalent to writing with write cache disabled.

A journalling filesystem or database tends to write like this:

    (guest) WRITE
    (guest) WRITE
    (guest) WRITE
    (guest) WRITE
    (guest) WRITE
    (guest) CACHE FLUSH
    (guest) WRITE
    (guest) CACHE FLUSH
    (guest) WRITE
    (guest) WRITE
    (guest) WRITE

When a guest does that, for integrity it can be mapped to this on the host with FUA:

    (host) WRITE FUA
    (host) WRITE FUA
    (host) WRITE FUA
    (host) WRITE FUA
    (host) WRITE FUA
    (host) WRITE FUA
    (host) WRITE FUA
    (host) WRITE FUA
    (host) WRITE FUA

or

    (host) WRITE
    (host) WRITE
    (host) WRITE
    (host) WRITE
    (host) WRITE
    (host) CACHE FLUSH
    (host) WRITE
    (host) CACHE FLUSH
    (host) WRITE
    (host) WRITE
    (host) WRITE

We know from measurements that disabling the disk write cache is much slower than using barriers, at least with some disks.

Assuming that WRITE FUA is equivalent to disabling write cache, we may expect the WRITE FUA version to run much slower than the CACHE FLUSH version.

It's also too weak, of course, on drives which don't support FUA. Then you have to use CACHE FLUSH anyway, so the code should support that (or disable the write cache entirely, which also performs badly). If you don't handle drives without FUA, then you're back to "integrity sometimes, user must check type of hardware", which is something we're trying to get away from. Integrity should not be a surprise when the

O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes inode metadata (like mtime) too. O_DIRECT|O_DSYNC is better.

O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for integrity problems when direct writes are converted ...
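The two host mappings in the message above can be written out as a small model, which makes the command counts easy to compare. This is only a sketch: enum cmd, map_with_fua(), map_with_flush() and count_cmd() are made-up names for illustration, and it models nothing about timing, which is what the measurements are actually about.

```c
#include <assert.h>
#include <stddef.h>

enum cmd { CMD_WRITE, CMD_WRITE_FUA, CMD_CACHE_FLUSH };

/* Mapping 1: every guest write becomes WRITE FUA on the host. Guest
 * cache flushes are dropped, since nothing is left in the volatile
 * cache to flush. Returns the number of host commands written to host[],
 * which must have room for n entries. */
static size_t map_with_fua(const enum cmd *guest, size_t n, enum cmd *host)
{
	size_t out = 0;

	assert(guest != NULL && host != NULL);
	for (size_t i = 0; i < n; i++) {
		if (guest[i] == CMD_CACHE_FLUSH)
			continue;
		host[out++] = CMD_WRITE_FUA;
	}
	return out;
}

/* Mapping 2: pass writes and cache flushes through one-to-one. */
static size_t map_with_flush(const enum cmd *guest, size_t n, enum cmd *host)
{
	assert(guest != NULL && host != NULL);
	for (size_t i = 0; i < n; i++)
		host[i] = guest[i];
	return n;
}

/* Count how many commands of one kind appear in a stream. */
static size_t count_cmd(const enum cmd *cmds, size_t n, enum cmd what)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (cmds[i] == what)
			count++;
	return count;
}
```

For the eleven-command guest stream shown above, the first mapping yields nine WRITE FUA commands and the second yields nine plain writes plus two cache flushes.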
On a scsi disk and a SATA SSD with NCQ I get different results. Most workloads, in particular metadata-intensive ones and large streaming writes, are noticeably better just turning off the write cache. The only ones that benefit from it are relatively small writes without O_SYNC or many fsyncs. This is however using XFS which tends to issue much

For a workload that only does FUA writes, yeah. That is however the use case for virtual machines. As I'm looking into those issues I will run

As mentioned in the previous mails FUA would only be an optimization

It did not happen on IRIX, where O_DIRECT originated, neither does it happen on Linux when using XFS. Then again at least on Linux we provide O_SYNC (that is Linux O_SYNC, aka Posix O_DSYNC)

That is what I meant. Only doing cache flushes/FUA for O_DIRECT|O_DSYNC is not what users naively expect. And the wording in our manpages also suggests this behaviour, although it is not entirely clear:

    O_DIRECT (Since Linux 2.4.10)
        Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The I/O is synchronous, that is, at the completion of a read(2) or write(2), data is guaranteed to have been transferred. See NOTES below for further discussion.

(And yeah, the whole wording is horrible, I will send an update once

No. In the generic code and filesystems I looked at it simply has no effect at all.
--
With normal S-ATA disks, streaming write workloads on ext3 run twice as fast with barriers & write cache enabled in my testing. Small file workloads were more even if I remember correctly... --
IRIX has an unusually sane O_DIRECT - at least according to its
documentation. This is write(2):
When attempting to write to a file with O_DIRECT or FDIRECT set,
the portion being written can not be locked in memory by any
process. In this case, -1 will be returned and errno will be set
to EBUSY.
AIX however says this:
In order to avoid consistency issues between programs that use
Direct I/O and programs that use normal cached I/O, Direct I/O is
by default used in an exclusive use mode. If there are multiple
opens of a file and some of them are direct and others are not,
the file will stay in its normal cached access mode. Only when
the file is open exclusively by Direct I/O programs will the file
be placed in Direct I/O mode.
Similarly, if the file is mapped into virtual memory via the
shmat() or mmap() system calls, then file will stay in normal
cached mode.
The JFS or JFS2 will attempt to move the file into Direct I/O
mode any time the last conflicting, non-direct access is
eliminated (either by close(), munmap(), or shmdt()
subroutines). Changing the file from normal mode to Direct I/O
mode can be rather expensive since it requires writing all
modified pages to disk and removing all the file's pages from
Oh, I agree with that. That comes from observing that quasi-portable
code using O_DIRECT needs to use O_DSYNC too because several OSes and
filesystems on those OSes revert to buffered writes under some
circumstances, in which case you want O_DSYNC too. That has nothing
to do with hardware caches, but it's a lucky coincidence that
fdatasync() would form a nice barrier function, and O_DIRECT|O_DSYNC
Perhaps in the same way that fsync/fdatasync aren't clear on disk
One thing it's unhelpful about is the performance. O_DIRECT tends to
improve performance for applications that do their own caching, it
also improves performance in whole systems when caching ...

Can you forward a pointer to an Irix man page which describes its O_DIRECT semantics (or at least what they claim in their man pages)? I was looking for one on the web, but I couldn't seem to find any on-line web pages for Irix.

It'd be nice if we could also get permission from SGI to quote relevant sections in the "Clarifying Direct I/O Semantics" wiki page, in case we end up quoting more than what someone might consider fair game for fair use, but for now, I'd be really happy getting something that I could look at for reference purposes. Was there anything more than what you quoted in the Irix write(2) man page about O_DIRECT?

Thanks,

- Ted
--
I agree. I do however fear about everything using O_DIRECT that is around now. Less so about the databases and HPC workloads on expensive hardware because they usually run on vendor approved scsi disks that have the write back cache disabled, but rather things like virtualization software or other things that get run on commodity hardware.

Then again they already don't get what they expect and never did, so if we clearly document and communicate the O_SYNC (that is Linux

The disk write cache really is an implementation detail, it has no business in Posix.

Posix seems to define the semantics for fdatasync and cor relatively well (that is if you like the specification speak in there):

    "The fdatasync() function forces all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronised I/O completion state."

    "synchronised I/O data integrity completion

     o For read, when the operation has been completed or diagnosed if unsuccessful. The read is complete only when an image of the data has been successfully transferred to the requesting process. If there were any pending write requests affecting the data to be read at the time that the synchronised read operation was requested, these write requests shall be successfully transferred prior to reading the data.

     o For write, when the operation has been completed or diagnosed if unsuccessful. The write is complete only when the data specified in the write request is successfully transferred and all file system information required to retrieve the data is successfully transferred."

Given that it talks about data being retrievable, a volatile cache does not

IRIX only came pre-packaged with SGI MIPS systems. Which as most of the more expensive hardware was not configured with write through caches. Which btw is still the case for all more expensive hardware I have. The whole issue with volatile write back cache is just too much of a data integrity ...
I'm thinking, while we're looking at this, that now is a really good time to split up O_SYNC and O_DSYNC. We have separate fsync and fdatasync, so it should be quite tidy now. Then we can document using O_DSYNC on Linux, which is fine for older versions because it has the same value as O_SYNC at the moment. -- Jamie --
Technically we could easily make O_SYNC really mean O_SYNC and implement a separate O_DSYNC at the kernel level. The question is how to handle this at the libc level. Currently glibc defines O_DSYNC to be O_SYNC. We would need to update glibc to pass through O_DSYNC for newer kernels and make sure it falls back to O_SYNC for older ones. I'm not sure how feasible this is, but maybe Ulrich has some better ideas. --
The problem with O_* extensions is that the syscall doesn't fail if the flag is not handled. This is a problem in the open implementation which can only be fixed with a new syscall.

Why can't we just go on and say we interpret O_SYNC like O_SYNC and O_SYNC|O_DSYNC like O_DSYNC? The POSIX spec explicitly requires that the latter be handled like O_SYNC.

We could handle it by allocating two bits, only one of which is handled in the kernel. If the O_DSYNC definition for userlevel would be different from the kernel definition then the kernel could interpret O_SYNC|O_DSYNC like O_DSYNC. The libc would then have to translate the userlevel O_DSYNC into the kernel O_DSYNC. If the libc is too old for the kernel and the application, the userlevel flag would be passed to the kernel and nothing bad happens.

The cleaner alternative is to have a sys_newopen which checks for unknown flags and fails in that case.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ --
What about the following variant:

 - given that our current O_SYNC really is and always has been actually Posix O_DSYNC, keep the numerical value and rename it to O_DSYNC in the headers.
 - Add a new O_SYNC definition:

	#define O_SYNC	(O_DSYNC|O_REALLY_SYNC)

   and do full O_SYNC handling in new kernels if O_REALLY_SYNC is present.
--
If this is true, then this proposal would work, yes. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ --
I'll put it on my todo list. While reading through the Posix specs I came up with some questions that you might be able to answer: - O_RSYNC basically means we need to commit atime updates before a read returns, right? It would be easy to implement it in a slightly suboptimal fashion, but is there any point? --
Any ABI change like this takes a long time to trickle down. If this is agreed to be the correct approach then adding the O_* definitions earlier is better. Even if it isn't yet implemented. Then, once the kernel side is implemented, programs are ready to use it. I No, that's not it. O_RSYNC on its own just means the data is successfully transferred to the calling process (always the case). O_RSYNC|O_DSYNC means that if a read request hits data that is currently in a cache and not yet on the medium, then the write to medium is successful before the read succeeds. O_RSYNC|O_SYNC means the same plus the integrity of file meta information (access time etc). -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ --
Yeah. The implementation really is trivial in 2.6.32 - we basically just need to change one function to check the new O_REALLY_SYNC flag and pass down a 0 instead of a 1 to another routine in the generic fs code, plus doing the same in a few filesystems opencoding it instead of using the generic helpers.

So the logistics of doing the flags really is the biggest work here. And I'm not entirely sure how to do it correctly. Can we just switch the current O_SYNC definition in the kernel headers to O_DSYNC while

That includes a write from another process? So O_RSYNC basically means doing a range-fdatasync before the actual read request? Again, we could implement this easily if we care enough.
--
I don't think you have to change anything. As I wrote before, the kernel ignores unknown O_* flags. It's usually a problem. Here it is a No, that's not a good idea. This would mean a program compiled with newer headers is using O_SYNC which isn't known to old kernels and I think it can be useful at times. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ --
Ok, let's agree on how to proceed:

once 2.6.31 is out we will do the following

 - do a global s/O_SYNC/O_DSYNC/g over the whole kernel tree
 - add this to include/asm-generic/fcntl.h and in modified form to arch headers not using it:

	#ifndef O_FULLSYNC
	#define O_FULLSYNC	02000000
	#endif

	#ifndef O_RSYNC
	#define O_RSYNC		04000000
	#endif

	#define O_SYNC	(O_FULLSYNC|O_DSYNC)

 - during the normal merge window I will add a real implementation for O_FULLSYNC and O_RSYNC

P.S. better naming suggestions for O_FULLSYNC welcome
--
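To see why the two-bit encoding degrades gracefully, here is a minimal sketch of how a new and an old kernel would each decode the flags. The SK_* constants and both decode functions are hypothetical: SK_FULLSYNC borrows the value proposed for O_FULLSYNC above, SK_DSYNC is an assumed stand-in for the existing O_SYNC bit, and the sketch assumes the extra bit has no effect unless the data-sync bit is also set.

```c
#include <assert.h>

/* Illustrative values only; see the assumptions stated above. */
#define SK_DSYNC	010000u
#define SK_FULLSYNC	02000000u
#define SK_SYNC		(SK_FULLSYNC | SK_DSYNC)

/* The scheme only works if the two bits are distinct. */
static_assert((SK_DSYNC & SK_FULLSYNC) == 0, "sync bits must not overlap");

enum sync_mode { SYNC_NONE, SYNC_DATA, SYNC_FULL };

/* New kernel: understands both bits. The extra bit only upgrades
 * data-sync to full sync; on its own it is assumed to do nothing. */
static enum sync_mode new_kernel_mode(unsigned flags)
{
	if (!(flags & SK_DSYNC))
		return SYNC_NONE;
	return (flags & SK_FULLSYNC) ? SYNC_FULL : SYNC_DATA;
}

/* Old kernel: knows only the original bit and silently ignores the
 * unknown one, so a new binary still gets data-sync behaviour. */
static enum sync_mode old_kernel_mode(unsigned flags)
{
	return (flags & SK_DSYNC) ? SYNC_DATA : SYNC_NONE;
}
```

A program built with new headers and passing SK_SYNC gets full sync on a new kernel and the old data-sync behaviour on an old one, while an old binary passing only the original bit keeps exactly the semantics it always had.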
Basically you are just ensuring that the metadata changes are being synced together with the data changes, so how about O_ISYNC (inode sync)? --
Yeah. Thinking about this a bit more we should define this flag much more clearly. In the obvious implementation it would not actually do anything if it's set on its own. We would only check it if O_DSYNC is already set to decide if we want to set the datasync argument to ->fsync to 0 or 1 for the generic filesystems (and similar things for filesystems not using the generic helper). If we deem that this is too unsafe we could make sure O_DSYNC always gets set on this flag in ->open, but if we make sure O_SYNC is defined like the one above in the kernel headers and glibc we should be fine. Although in that case a name that doesn't suggest that it actually does something useful would be better. --
If you are going to automatically set O_DSYNC in open(), then fcntl(F_SETFL) might get a bit nasty. Imagine using it after the open in order to clear the O_ISYNC flag; you'll still be left with the O_DSYNC (which you never set in the first place). That would be confusing... Cheers Trond --
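The F_SETFL surprise described above can be shown with a tiny model of the flag word. This is a sketch under stated assumptions: sim_open_autoset() and sim_clear_flag() are made-up helpers, the SK_* values are illustrative, and real fcntl(F_SETFL) handling is not modelled beyond plain bit manipulation.

```c
#include <assert.h>

/* Illustrative bit values, not the real kernel definitions. */
#define SK_DSYNC	010000u
#define SK_ISYNC	02000000u

static_assert((SK_DSYNC & SK_ISYNC) == 0, "sync bits must not overlap");

/* Hypothetical open() behaviour: setting the inode-sync bit
 * automatically turns on the data-sync bit as well. */
static unsigned sim_open_autoset(unsigned flags)
{
	if (flags & SK_ISYNC)
		flags |= SK_DSYNC;
	return flags;
}

/* What an application does with F_GETFL/F_SETFL to drop one flag. */
static unsigned sim_clear_flag(unsigned flags, unsigned flag)
{
	return flags & ~flag;
}
```

Opening with only SK_ISYNC and then clearing SK_ISYNC leaves SK_DSYNC behind, a flag the application never asked for, which is the confusing outcome being warned about.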
Indeed, that's a killer argument for the first variant. We just need to make it extremely clear (manpage _and_ comments) that only O_SYNC is an exposed user interface and that O_WHATEVER_SYNC is an implementation detail. --
O_FULLSYNC might get confused with MacOS X's F_FULLSYNC, which means
something else: fsync through hardware volatile write caches.
(Might we even want to provide O_FULLSYNC and O_FULLDATASYNC to mean
that, eventually?)
O_ISYNC is a bit misleading if we don't really offer "flush just the
inode state" by itself.
So it should at least start with underscores: __O_ISYNC.
How about __O_SYNC_NEW with
#define O_SYNC (O_DSYNC|__O_SYNC_NEW)
I think that tells people reading the headers a bit about what to
expect on older kernels too.
-- Jamie
--
On several unixes, O_RSYNC means it will send the read to the hardware, not relying on the cache. This can be used to verify the data which was written earlier, whether by O_DSYNC or fdatasync. -- Jamie --
I'm sure I read that in a couple of OS man pages, but I can't find it again. Maybe it was something more obscure than the mainstream unices; maybe I imagined it. Ho hum. For now, forget I said anything. -- Jamie --
That looks good for the kernel.
However, for userspace, there's an issue with applications which were
compiled with an old libc and used O_SYNC. Most of them probably
expected O_SYNC behaviour but all they got was O_DSYNC, because Linux
didn't do it right.
But they *didn't know* that.
When using a newer kernel which actually implements O_SYNC behaviour,
I'm thinking those applications which asked for O_SYNC should get it,
even though they're still linked with an old libc.
That's because this thread is the first time I've heard that Linux
O_SYNC was really the weaker O_DSYNC in disguise, and judging from the
many Googlings I've done about O_SYNC in applications and on different
OS, it'll be news to other people too.
(I always thought the "#define O_DSYNC O_SYNC" was because Linux
didn't implement the weaker O_DSYNC).
(Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
headers? That doesn't make sense: O_RSYNC has nothing to do with
writing.)
To achieve that, libc could implement two versions of open() at the
same time as it updates header files. The new libc's __old_open() would
do:
	/* Only O_DSYNC is set for apps built against old libc which
	   were compiled ... */
	if (flags & O_DSYNC)
		flags |= O_SYNC;
I'm not exactly sure how symbol versioning works, but perhaps the
header file in the new libc would need __REDIRECT_NTH to map open() to
__new_open(), which just calls the kernel. This is to ensure .o and
.a files built with an old libc's headers but then linked to a new
libc will get __old_open().
Although libc's __new_open() could have this:
/* Old kernels only look at O_DSYNC. It's better than nothing. */
if (flags & O_SYNC)
flags |= O_DSYNC;
Imho, it's better to not do that, and instead have
#define O_SYNC (O_DSYNC|__O_SYNC_KERNEL)
as Chris suggests, in the libc header the same as the kernel header,
because that way applications which use the syscall() function or have
to ...

It looks like we're not the only ones. AIX has:

http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc...

    Before the O_DSYNC open mode existed, AIX applied O_DSYNC semantics to O_SYNC. For binary compatibility reasons, this behavior still exists. If true O_SYNC behavior is required, then both O_DSYNC and O_SYNC open flags must be specified. Exporting the XPG_SUS_ENV=ON environment variable also enables true O_SYNC behavior.

-- Jamie
--
Right. But these programs apparently can live with the broken semantics. I don't worry too much about this. If people really need In general yes, but it's too expensive. Again, existing programs expect O_SYNC is a superset of O_RSYNC. In the absence of a true O_RSYNC that's the next best thing. Of course I didn't know the Linux O_SYNC is Why should it be better? You're replacing something the compiler can do with zero cost with active code. Again, these O_* constant changes are sufficient. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ --
That's an error - O_SYNC is not a superset of O_RSYNC. O_SYNC (by itself) only affects writes. O_RSYNC only affects reads.

In the absence of O_RSYNC support in the kernel, it's better to not define O_RSYNC at all in userspace. That tells applications they can call fsync/fdatasync themselves before reading to get an equivalent effect.

In fact O_RSYNC, when implemented correctly, can be used by applications to get the effect of range-fsync/fdatasync when such system calls aren't available (by reading a range), but not as efficiently of course. Defining O_RSYNC as O_SYNC fails to do that.

-- Jamie --
You misread; I said the zero cost thing is better.
The only reason you might use the active code is this:
/* Upgrade O_DSYNC to O_SYNC. */
flags = fcntl(fd, F_GETFL, 0);
flags = (flags | O_SYNC) & ~O_DSYNC;
fcntl(fd, F_SETFL, flags);
I'm not sure if that should work in POSIX.
-- Jamie
--
Are you sure about this? From http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01704 :

    Error description

    LINUX O_DIRECT/O_SYNC TAKES TOO MANY IOS

    Problem summary

    On AIX, the O_SYNC and O_DSYNC are different values and performance improvement are available because the inode does not need to be flushed for mtime changes only. On Linux the flags are the same, so performance is lost when databases open files with O_DIRECT and O_SYNC.

-- Jamie
--
That is for GPFS, an out-of-tree filesystem with binary components. It could be that they took Linux O_SYNC for real O_SYNC. Any filesystem using the generic helpers in Linux has gotten the O_DSYNC semantics at least as long as I have worked on Linux filesystems, which is getting close to 10 years now. I'll do some code archaeology before we move ahead with this, to be sure. --
Um, actually, we don't. If we did that, we would have to wait for a journal commit to complete before allowing the write(2) to complete, which would be especially painfully slow for ext3.

This question recently came up on the ext4 developer's list, because of a question of how direct I/O to a preallocated (uninitialized) extent should be handled. Are we supposed to guarantee synchronous updates of the metadata by the time write(2) returns, or not? One of the ext4 developers (I can't remember if it was Mingming or Eric) asked an XFS developer what they did in that case, and I believe the answer they were given was that XFS started a commit, but did *not* wait for the commit to complete before returning from the Direct I/O write. In fact, they were told (I believe this was from an SGI engineer, but I don't remember the name; we can track that down if it's important) that if an application wanted to guarantee metadata would be updated for an extending write, they had to use fsync() or O_SYNC/O_DSYNC.

Perhaps they were given an incorrect answer, but it's clear the semantics of exactly how Direct I/O works in edge cases isn't well defined, or at least clearly and widely understood.

I have an early draft (for discussion only) of what we think it means and what is currently implemented in Linux, which I've put up, (again, let me emphasize) for *discussion* here:

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

Comments are welcome, either on the wiki's talk page, or directly to me, or to the linux-fsdevel or linux-ext4.

- Ted
--
I think you mean "not well specified". ;-) Joel -- Life's Little Instruction Book #511 "Call your mother." Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 --
In the section on perhaps not waiting for buffered fallback, we need to clarify that O_DIRECT reads need to know to look in the pagecache. That is, if we decide that extending O_DIRECT writes without fsync can return before the data hits the storage, the caller shouldn't also have to call fsync() just to call read() of data they just wrote! Joel -- To spot the expert, pick the one who predicts the job will take the longest and cost the most. Joel Becker Principal Software Developer Oracle E-mail: joel.becker@oracle.com Phone: (650) 506-8127 --
Yeah, I guess we can only do that if the filesystem guarantees coherence between the page cache and O_DIRECT reads; it's been a long while since I've studied that code, so I'm not sure whether all filesystems that support O_DIRECT provide this coherency (since I thought it was provided in the generic O_DIRECT routines, isn't it?) or not. - Ted --
It's provided in the generic code, yes (or at least appears to). Note that xfstests has quite a few tests exercising it. --
The way the O_DIRECT fallback is implemented currently is that data does hit the disk before return, thanks to a:

	err = do_sync_mapping_range(file->f_mapping, pos, endbyte,
				    SYNC_FILE_RANGE_WAIT_BEFORE|
				    SYNC_FILE_RANGE_WRITE|
				    SYNC_FILE_RANGE_WAIT_AFTER);

which I expected to also sync the required metadata to disk, which it doesn't. Which btw are really horrible semantics given that we export that beast to userspace as a separate system call.
--
And that's not even a hardware cache issue, just whether filesystem
metadata is written.
AIX behaves like XFS according to documentation:
[ http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc... ]
Direct I/O and Data I/O Integrity Completion
Although direct I/O writes are done synchronously, they do not
provide synchronized I/O data integrity completion, as defined by
POSIX. Applications that need this feature should use O_DSYNC in
addition to O_DIRECT. O_DSYNC guarantees that all of the data and
enough of the metadata (for example, indirect blocks) have written
to the stable store to be able to retrieve the data after a system
crash. O_DIRECT only writes the data; it does not write the
metadata.
That's another reason to use O_DIRECT|O_DSYNC in moderately portable
I haven't read it yet. One thing which comes to mind is it would be
good to summarise what other OSes as well as Linux do with O_DIRECT
w.r.t. data-finding metadata, preallocation, file extending, hole
filling, unaligned access and what alignment is required, block
devices vs. files and different filesystems and behaviour-modifying
mount options, file open for buffered I/O on another descriptor, file
has mapped pages, mlocked pages, and of course drive cache write
through or not.
-- Jamie
--
...or use fsync() when they need to guarantee that data has been
atomically written, but not before. This becomes critically important
if the application is writing into a sparse file, or writing into
uninitialized blocks that were allocated using fallocate(); otherwise,
with O_DIRECT|O_DSYNC, the file system would have to do a commit
It's a wiki; contributions to define all of that are welcome. :-)
We may want to carefully consider what we want to guarantee for all
time to application writers, and what we might want to leave open to
allow for performance optimizations by the kernel, though.
- Ted
--
That would have been Eric asking me. My answer was that O_DIRECT does not imply any new data integrity guarantees associated with a write(2) call - it just avoids system caches. You get the same guarantees of resiliency as a non-O_DIRECT write(2) call at completion - it may or may not be there if you crash. If you want some guarantee of integrity, then you need to use O_DSYNC, O_SYNC or call f[data]sync(2) just like all other IO.

Also, note that direct IO is not necessarily synchronous - you can do asynchronous direct IO.....

Cheers, Dave.
-- Dave Chinner david@fromorbit.com --
I agree with all of the above, except:
1. If the automatic O_SYNC fallback mentioned by Christopher is
currently implemented at all, even in a subset of filesystems,
then I think it should be removed.
An app which wants integrity should be calling fsync/fdatasync or
using O_DSYNC/O_SYNC explicitly - with fsync/fdatasync giving
more control over batching.
If it doesn't do any of those things, it may be using O_DIRECT
for performance, and not wish to be penalised by an expensive
O_SYNC on every individual write. Especially when O_SYNC is
fixed to commit drive caches.
2. I agree with everything Dave said about needing to use some other
mechanism for an integrity commit; O_DIRECT is not enough.
We can't realistically make O_DIRECT (by itself) do integrity
commits anyway, because on some drives that involves committing
the drive cache, and it would be a large performance regression.
Given O_DIRECT is often used for its performance, that's not an
option.
3. Currently none of the options provides good integrity commit.
All of them fail to commit drive caches under some circumstances;
even fsync on ext3 with barriers enabled (because it doesn't
commit a journal record if there were writes but no inode change
with data=ordered).
This should be changed (or at least made optionally available),
and that's all the more reason to avoid commit operations except
when requested.
4. On drives which need it, fdatasync/fsync must trigger a drive
cache flush even when there is no dirty page cache to write,
because dirty pages may have been written in the background
already, and because O_DIRECT writes dirty the drive cache but
not the page cache.
A per-drive flag would make sense to optimise this: It is set by
any non-FUA writes sent to the drive while the drive's writeback
cache is enabled, and cleared when any cache flush ...

Could you clarify what you meant by "it" above? I'm not sure I understood what you were referring to.

Also, it sounds like you and Dave are mostly agreeing with what I've written here; is that true?

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

I'm trying to get consensus that this is both (a) an accurate description of the state of affairs in Linux, and (b) that it is what we think things should be, before I start circulating it around application developers (especially database developers), to make sure

I agree we *should* do this, but we're going to take a pretty serious performance hit when we do. Mac OS chickened out and added an F_FULLSYNC option:

http://developer.apple.com/documentation/Darwin/Reference/Manpages/man2/fcntl.2.html

The concern is that there are GUI programmers that want to update state files after every window resize or move, and after every click in a web browser. These GUI programmers then get cranky when changes get lost after proprietary video drivers cause the laptop to lock up. If we make fsync() too burdensome, then fewer and fewer applications will use it. Evidently the MacOS developers decided that the number of applications that really cared about doing device cache flushes was much smaller than the vast number of applications that need a lightweight file flush. Should we do the same?

It seems like an awful cop-out, but having seen, up front and personal, how "aggressively stupid" some desktop programmers can be[1], I can **certainly** understand why Apple chose the F_FULLSYNC route.

[1] http://josefsipek.net/blahg/?p=364

- Ted (who really needs to get himself an O_PONIES t-shirt :-)
--
I meant the automatic O_SYNC fallback. In other words, if O_DIRECT falls back to buffered writing, Chris said it automatically did O_SYNC, and you followed up by saying it doesn't :-) All I'm saying is if there's _some_ code doing O_SYNC writing when O_DIRECT falls back to buffered, it should be ripped out. Leave the

I know about that one. (I've done quite a lot of research on O_DIRECT and fsync behaviours). It's really unfortunate that they didn't provide F_FULLDATASYNC, which is what a database or VM would ideally use. I think VxFS provides a whole suite of mount options to adjust what

If fsync is cheap but doesn't commit changes properly - what's the point in encouraging applications to use it? Without drive cache flushes, they will still lose changes occasionally. (Btw, don't blame proprietary video drivers. I see too many lockups

I did see a few of those threads, and I think your solution was genius. Genius at keeping people quiet that is :-) But it's also a good default. fsync() isn't practical in shell scripts or Makefiles, although that's really because "mv" lacks the fsync option...

Personally I side with "want some kind of full-system asynchronous transactionality please". (Possibly aka. O_PONIES :-)

-- Jamie
--
Just doing FUA should be pretty easy. In fact, from my reading of the code we already use FUA for barriers if supported; that is, we only drain the queue, do a pre-flush for the barrier, and then issue the actual barrier write as FUA. I can play around with getting rid of the pre-flush and doing cache-flush-based emulation if FUA is not supported, if you're fine with that.
--
I've never really understood why FUA is considered equivalent to a barrier. Our barrier semantics are that all I/Os before the barrier should be safely on disk after the barrier executes. The FUA semantics are that *this write* should be safely on disk after it executes ... it can still leave preceding writes in the cache. I can see that, if you're only interested in metadata, making every metadata write a FUA and leaving the cache to sort out data writes does give FS image consistency. How does FUA give us Linux barrier semantics?

James
--
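The distinction drawn in the question above can be made concrete with a toy model of a drive that has a volatile write cache. This is only a sketch for illustration, not kernel code: the names toy_drive, toy_write, toy_flush and toy_volatile_count are made up here, and a real drive tracks far more state than one flag per block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 8

/* Hypothetical toy drive: one pair of flags per block. */
struct toy_drive {
	bool in_cache[NBLOCKS];	/* latest data only in the volatile cache */
	bool on_media[NBLOCKS];	/* latest data is on stable storage */
};

/*
 * A normal write lands in the volatile cache.  A FUA write goes to
 * stable storage, but says nothing about other blocks that are still
 * sitting in the cache.
 */
static void toy_write(struct toy_drive *d, size_t blk, bool fua)
{
	assert(blk < NBLOCKS);
	d->in_cache[blk] = !fua;
	d->on_media[blk] = fua;
}

/* A cache flush commits every cached block, whoever wrote it. */
static void toy_flush(struct toy_drive *d)
{
	for (size_t i = 0; i < NBLOCKS; i++) {
		if (d->in_cache[i]) {
			d->in_cache[i] = false;
			d->on_media[i] = true;
		}
	}
}

/* Number of blocks that would be lost on power failure. */
static size_t toy_volatile_count(const struct toy_drive *d)
{
	size_t n = 0;

	for (size_t i = 0; i < NBLOCKS; i++)
		n += d->in_cache[i] ? 1 : 0;
	return n;
}
```

In this model, two normal writes followed by one FUA write leave two blocks volatile, so the FUA write alone gives no guarantee about what came before it. Only a flush ahead of the FUA write (or after it) brings the count to zero, which is the ordering property the barrier semantics ask for.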
FUA by itself doesn't.
Think about what use cases we have for barriers and/or FUA right now:
- a cache flush. Obviously this can only be implemented as a cache flush.
- a barrier flush bio, which can be implemented as
  o cache flush, write, cache flush
  o or, more efficiently, as cache flush, write with FUA bit set
Now there is a third use case, O_SYNC and O_DIRECT writes, which actually
do have FUA-like semantics: we only guarantee the I/O is on disk,
but we make no guarantees about ordering vs earlier writes.
Currently we (as in those few filesystems that bother, despite the
VFS/generic helpers making it really hard) implement O_SYNC by:
- doing one or multiple normal writes, and waiting on them
- then issuing a cache flush, either explicitly via blkdev_issue_flush
  or implicitly as part of a barrier write for metadata
This could be done more efficiently by simply setting the FUA bit on these
requests if we had an API for it. The same should apply to O_DIRECT,
except that currently we don't even try.
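
Seen from userspace, the "write, wait, then flush" pattern described above roughly corresponds to pwrite() followed by fdatasync(), while opening with O_DSYNC asks for each write to be durable when it returns. The following is a minimal sketch assuming a POSIX system; the helper names write_then_flush and open_dsync are hypothetical. Whether the kernel turns either form into a drive cache flush, a FUA write, or neither depends on the kernel version, filesystem and hardware, which is exactly what this thread is debating.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the whole buffer at the given offset,
 * then ask the kernel to commit the file data with fdatasync().
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
static int write_then_flush(int fd, const void *buf, size_t len, off_t off)
{
	const char *p = buf;

	assert(buf != NULL);
	while (len > 0) {
		ssize_t n = pwrite(fd, p, len, off);

		if (n <= 0)	/* treat a zero-length write as an error */
			return -1;
		p += n;
		off += n;
		len -= (size_t)n;
	}
	return fdatasync(fd);
}

/*
 * Hypothetical helper: open a file so that every write is synchronous
 * for data (O_DSYNC), the per-write analogue of the helper above.
 */
static int open_dsync(const char *path)
{
	return open(path, O_RDWR | O_CREAT | O_DSYNC, 0644);
}
```

The first helper batches any number of writes behind a single flush request, which matches the "one or multiple normal writes, then a cache flush" scheme. The O_DSYNC form states the durability requirement per write, which is the shape that could map onto FUA if an API for it existed.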
--
