adding proper O_SYNC/O_DSYNC, was Re: O_DIRECT and barriers

From: Christoph Hellwig
Date: Thursday, August 20, 2009 - 3:12 pm

Btw, something semi-related I've been looking at recently:

Currently O_DIRECT writes bypass all kernel caches, but they do
use the disk caches.  We currently don't have any barrier support for
them at all, which is really bad for data integrity in virtualized
environments.  I've started thinking about how to implement this.

The simplest scheme would be to mark the last request of each
O_DIRECT write as a barrier request.  This works nicely from the FS
perspective and works with all hardware supporting barriers.  It's
massive overkill though - we really only need to flush the cache
after our request, and not before.  And for SCSI we would be much
better off just setting the FUA bit on the commands and not requiring a
full cache flush at all.

The next scheme would be to simply always do a cache flush after
the direct I/O write has completed, but given that blkdev_issue_flush
blocks until the command is done that would a) require everyone to
use the end_io callback and b) spend a lot of time in that workqueue.
This only requires one full cache flush, but it's still suboptimal.

I have prototyped this for XFS, but I don't really like it.

The best scheme would be to get some highlevel FUA request in the
block layer which gets emulated by a post-command cache flush.
--

From: Jens Axboe
Date: Friday, August 21, 2009 - 4:40 am

I've talked to Chris about this in the past too, but I never got around
to benchmarking FUA for O_DIRECT. It should be pretty easy to wire up
without making too many changes, and we do have FUA support on most SATA
drives too. Basically just a check in the driver for whether the
request is O_DIRECT and a WRITE, ala:

        if (rq_data_dir(rq) == WRITE && rq_is_sync(rq))
                WRITE_FUA;

I know that FUA is used by that other OS, so I think we should be golden
on the hw support side.

-- 
Jens Axboe

--

From: Jamie Lokier
Date: Friday, August 21, 2009 - 6:54 am

I've been thinking about this too, and for optimal performance with
VMs and also with databases, I think FUA is too strong.  (It's also
too weak, on drives which don't have FUA).

I would like to be able to get the same performance and integrity as
the kernel filesystems can get, and that means using barrier flushes
when a kernel filesystem would use them, and FUA when a kernel
filesystem would use that.  Preferably the same whether userspace is
using a file or a block device.

The conclusion I came to is that O_DIRECT users need a barrier flush
primitive.  FUA can either be deduced by the elevator, or signalled
explicitly by userspace.

Fortunately there's already a sensible API for both: fdatasync (and
aio_fsync) to mean flush, and O_DSYNC (or inferred from
flush-after-one-write) to mean FUA.

Those apply to files, but they could be made to have the same effect
with block devices, which would be nice for applications which can use
both.  I'll talk about files from here on; assume the idea is to
provide the same functions for block devices.

It turns out that applications needing integrity must use fdatasync or
O_DSYNC (or O_SYNC) *already* with O_DIRECT, because the kernel may
choose to use buffered writes at any time, with no signal to the
application.  O_DSYNC or fdatasync ensures that unknown buffered
writes will be committed.  This is true for other operating systems
too, for the same reason, except some other unixes will convert all
writes to buffered writes, not just corner cases, under various
circumstances that it's hard for applications to detect.

So there's already a good match to using fdatasync and/or O_DSYNC for
O_DIRECT integrity.
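
As a rough userspace illustration of that pairing, the sketch below
writes one aligned block through a descriptor opened with
O_DIRECT|O_DSYNC.  The helper names, the 4096-byte alignment and the
retry without O_DIRECT when the filesystem answers EINVAL are all
assumptions made for this example; nothing here is code proposed in
the thread.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096         /* assumed alignment; real devices may differ */

/* Open `path` with O_DSYNC plus `extra_flags` and write one block at
 * offset 0.  Returns 0 on success, -1 on failure with errno set. */
static int write_block(const char *path, const void *buf, int extra_flags)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DSYNC | extra_flags, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = pwrite(fd, buf, BLOCK_SIZE, 0);
    int saved_errno = errno;

    if (close(fd) != 0 && n == BLOCK_SIZE)
        return -1;                      /* errno comes from close() */
    if (n != BLOCK_SIZE) {
        /* A short write is treated as an I/O error in this sketch. */
        errno = (n < 0) ? saved_errno : EIO;
        return -1;
    }
    return 0;
}

/* Write one block filled with `fill`.  O_DIRECT is tried first; if the
 * filesystem rejects it with EINVAL, fall back to a buffered write.
 * O_DSYNC is set in both cases, so the data is committed either way. */
int write_block_durably(const char *path, unsigned char fill)
{
    void *buf = NULL;
    int err = posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE);
    if (err != 0) {
        errno = err;
        return -1;
    }
    memset(buf, fill, BLOCK_SIZE);

    int rc = write_block(path, buf, O_DIRECT);
    if (rc != 0 && errno == EINVAL)
        rc = write_block(path, buf, 0);

    free(buf);
    return rc;
}
```

The point of keeping O_DSYNC on both paths is the one made above: the
application cannot tell whether a given write went direct or buffered,
so the integrity request has to be independent of O_DIRECT.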

If we define fdatasync's behaviour to be that it always causes a
barrier flush if there have been any WRITE commands to a disk since
the last barrier flush, in addition to its behaviour of flushing
cached pages, that would be enough for VM and database applications
to have good support for integrity.  Of course O_DSYNC ...
From: Christoph Hellwig
Date: Friday, August 21, 2009 - 7:26 am

I thought about this a lot.  It would be sensible to only require
the FUA semantics if O_SYNC is specified.  But from looking around at
users of O_DIRECT no one seems to actually specify O_SYNC with it.
And on Linux where O_SYNC really means O_DSYNC that's pretty sensible -
if O_DIRECT bypasses the filesystem cache there is nothing else
left to sync for a non-extending write.  That is until those pesky disk
write back caches come into play that no application writer wants or

The fallback was a relatively recent addition to the O_DIRECT semantics
for broken filesystems that can't handle holes very well.  Fortunately
enough we do force O_SYNC (that is Linux O_SYNC aka Posix O_DSYNC)
semantics for that already.

--

From: Jamie Lokier
Date: Friday, August 21, 2009 - 8:24 am

In measurements I've done, disabling a disk's write cache results in
much slower ext3 filesystem writes than using barriers.  Others report
similar results.  This is with disks that don't have NCQ; good NCQ may
be better.

Using FUA for all writes should be equivalent to writing with write
cache disabled.

A journalling filesystem or database tends to write like this:

   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE
   (guest) CACHE FLUSH
   (guest) WRITE
   (guest) CACHE FLUSH
   (guest) WRITE
   (guest) WRITE
   (guest) WRITE

When a guest does that, for integrity it can be mapped to this on the
host with FUA:

   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA
   (host) WRITE FUA

or

   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) WRITE
   (host) CACHE FLUSH
   (host) WRITE
   (host) CACHE FLUSH 
   (host) WRITE
   (host) WRITE
   (host) WRITE

We know from measurements that disabling the disk write cache is much
slower than using barriers, at least with some disks.

Assuming that WRITE FUA is equivalent to disabling write cache, we may
expect the WRITE FUA version to run much slower than the CACHE FLUSH
version.
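
The two mappings above can be written down as a small translation
function.  This is only a model of the command streams, not block
layer code; the enum names and the function are made up for the
illustration.

```c
#include <stddef.h>

enum op     { OP_WRITE, OP_FLUSH, OP_WRITE_FUA };
enum policy { POLICY_FUA, POLICY_FLUSH };

/* Translate n guest commands into host commands under the given policy.
 * `out` must have room for n entries.  Returns the number of host
 * commands produced. */
size_t map_guest_to_host(const enum op *guest, size_t n,
                         enum policy policy, enum op *out)
{
    size_t produced = 0;

    for (size_t i = 0; i < n; i++) {
        if (policy == POLICY_FUA) {
            /* Every write is forced to the medium on its own, so a
             * guest flush has nothing left to flush and is dropped. */
            if (guest[i] != OP_FLUSH)
                out[produced++] = OP_WRITE_FUA;
        } else {
            /* Writes and flushes are passed through unchanged. */
            out[produced++] = guest[i];
        }
    }
    return produced;
}
```

For the eleven-command guest sequence shown earlier this yields nine
WRITE FUA commands under the first policy and the same eleven commands
under the second, which is why the relative cost of FUA against a
cached write decides which mapping wins.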

It's also too weak, of course, on drives which don't support FUA.
Then you have to use CACHE FLUSH anyway, so the code should support
that (or disable the write cache entirely, which also performs badly).
If you don't handle drives without FUA, then you're back to "integrity
sometimes, user must check type of hardware", which is something we're
trying to get away from.  Integrity should not be a surprise when the

O_DIRECT with true POSIX O_SYNC is a bad idea, because it flushes
inode metadata (like mtime) too.  O_DIRECT|O_DSYNC is better.

O_DIRECT without O_SYNC, O_DSYNC, fsync or fdatasync is asking for
integrity problems when direct writes are converted ...
From: Christoph Hellwig
Date: Friday, August 21, 2009 - 10:45 am

On a SCSI disk and a SATA SSD with NCQ I get different results.  Most
workloads, in particular metadata-intensive ones and large streaming
writes, are noticeably better just turning off the write cache.  The only
ones that benefit from it are relatively small writes without O_SYNC
or many fsyncs.  This is however using XFS which tends to issue much


For a workload that only does FUA writes, yeah.  That is however the use
case for virtual machines.  As I'm looking into those issues I will run

As mentioned in the previous mails FUA would only be an optimization


It did not happen on IRIX, where O_DIRECT originated,
neither does it happen on Linux when using XFS.  Then again at least on
Linux we provide O_SYNC (that is Linux O_SYNC, aka Posix O_DSYNC)

That is what I meant.  Only doing cache flushes/FUA for O_DIRECT|O_DSYNC
is not what users naively expect.  And the wording in our manpages also
suggests this behaviour, although it is not entirely clear:


O_DIRECT (Since Linux 2.4.10)

	Try to minimize cache effects of the I/O to and from this file.  In
	general this will degrade performance, but it is useful in special
	situations, such as when applications do their own caching.  File I/O
	is done directly to/from user space buffers.  The I/O is synchronous,
	that is,  at the completion of a read(2) or write(2), data is
	guaranteed to have been transferred.  See NOTES below for further
	discussion.

(And yeah, the whole wording is horrible, I will send an update once


No.  In the generic code and filesystems I looked at it simply has no
effect at all.

--

From: Ric Wheeler
Date: Friday, August 21, 2009 - 12:18 pm

With normal S-ATA disks, streaming write workloads on ext3 run twice as 
fast with barriers & write cache enabled in my testing.

Small file workloads were more even if I remember correctly...


--

From: Jamie Lokier
Date: Friday, August 21, 2009 - 5:50 pm

IRIX has an unusually sane O_DIRECT - at least according to its
documentation.  This is write(2):

     When attempting to write to a file with O_DIRECT or FDIRECT set,
     the portion being written can not be locked in memory by any
     process. In this case, -1 will be returned and errno will be set
     to EBUSY.

AIX however says this:

     In order to avoid consistency issues between programs that use
     Direct I/O and programs that use normal cached I/O, Direct I/O is
     by default used in an exclusive use mode. If there are multiple
     opens of a file and some of them are direct and others are not,
     the file will stay in its normal cached access mode. Only when
     the file is open exclusively by Direct I/O programs will the file
     be placed in Direct I/O mode.

     Similarly, if the file is mapped into virtual memory via the
     shmat() or mmap() system calls, then file will stay in normal
     cached mode.

     The JFS or JFS2 will attempt to move the file into Direct I/O
     mode any time the last conflicting or non-direct access is
     eliminated (either by close(), munmap(), or shmdt()
     subroutines). Changing the file from normal mode to Direct I/O
     mode can be rather expensive since it requires writing all
     modified pages to disk and removing all the file's pages from


Oh, I agree with that.  That comes from observing that quasi-portable
code using O_DIRECT needs to use O_DSYNC too because several OSes and
filesystems on those OSes revert to buffered writes under some
circumstances, in which case you want O_DSYNC too.  That has nothing
to do with hardware caches, but it's a lucky coincidence that
fdatasync() would form a nice barrier function, and O_DIRECT|O_DSYNC

Perhaps in the same way that fsync/fdatasync aren't clear on disk

One thing it's unhelpful about is the performance.  O_DIRECT tends to
improve performance for applications that do their own caching, it
also improves performance in whole systems when caching ...
From: Theodore Tso
Date: Friday, August 21, 2009 - 7:19 pm

Can you forward a pointer to an Irix man page which describes its
O_DIRECT semantics (or at least what they claim in their man pages)?
I was looking for one on the web, but I couldn't seem to find any
on-line web pages for Irix.  

It'd be nice if we could also get permission from SGI to quote
relevant sections in the "Clarifying Direct I/O Semantics" wiki page,
in case we end up quoting more than what someone
might consider fair game for fair use, but for now, I'd be really
happy getting something that I could look at for reference purposes.
Was there anything more than what you quoted in the Irix write(2) man
page about O_DIRECT?

Thanks,

						- Ted
--

From: Theodore Tso
Date: Friday, August 21, 2009 - 7:31 pm

Never mind, I found it.  (And I've added the relevant bits to the wiki
article).

					- Ted
--

From: Christoph Hellwig
Date: Sunday, August 23, 2009 - 7:34 pm

I agree.  I do however fear about everything using O_DIRECT that is
around now.  Less so about the databases and HPC workloads on expensive
hardware because they usually run on vendor approved SCSI disks that
have the write back cache disabled, but rather things like
virtualization software or other things that get run on commodity
hardware.

Then again they already don't get what they expect and never did,
so if we clearly document and communicate the O_SYNC (that is Linux

The disk write cache really is an implementation detail, it has no
business in Posix.

Posix seems to define the semantics for fdatasync and co. relatively
well (that is if you like the specification speak in there):

"The fdatasync() function forces all currently queued I/O operations
 associated with the file indicated by file descriptor fildes to the
 synchronised I/O completion state."

"synchronised I/O data integrity completion

 o For read, when the operation has been completed or diagnosed if
   unsuccessful. The read is complete only when an image of the data has
   been successfully transferred to the requesting process. If there were
   any pending write requests affecting the data to be read at the time
   that the synchronised read operation was requested, these write
   requests shall be successfully transferred prior to reading the
   data.
 o For write, when the operation has been completed or diagnosed if
   unsuccessful. The write is complete only when the data specified in the
   write request is successfully transferred and all file system
   information required to retrieve the data is successfully transferred."

Given that it talks about retrievable data, a volatile cache does not

IRIX only came pre-packaged with SGI MIPS systems.  Which as most of
the more expensive hardware was not configured with write through
caches.  Which btw is still the case for all more expensive hardware
I have.  The whole issue with volatile write back cache is just too
much of a data integrity ...
From: Jamie Lokier
Date: Thursday, August 27, 2009 - 7:34 am

I'm thinking, while we're looking at this, that now is a really good
time to split up O_SYNC and O_DSYNC.

We have separate fsync and fdatasync, so it should be quite tidy now.

Then we can document using O_DSYNC on Linux, which is fine for older
versions because it has the same value as O_SYNC at the moment.

-- Jamie
--

From: Christoph Hellwig
Date: Thursday, August 27, 2009 - 10:10 am

Technically we could easily make O_SYNC really mean O_SYNC and implement
a separate O_DSYNC at the kernel level.

The question is how to handle this at the libc level.  Currently glibc
defines O_DSYNC to be O_SYNC.  We would need to update glibc to pass
through O_DSYNC for newer kernels and make sure it falls back to O_SYNC
for older ones.  I'm not sure how feasible this is, but maybe Ulrich has
some better ideas.
--

From: Ulrich Drepper
Date: Thursday, August 27, 2009 - 10:24 am

The problem with O_* extensions is that the syscall doesn't fail if the 
flag is not handled.  This is a problem in the open implementation which 
can only be fixed with a new syscall.

Why can't we just go on and say we interpret O_SYNC like O_SYNC and
O_SYNC|O_DSYNC like O_DSYNC?  The POSIX spec explicitly requires that
the latter be handled like O_SYNC.

We could handle it by allocating two bits, only one is handled in the 
kernel.  If the O_DSYNC definition for userlevel would be different from 
the kernel definition then the kernel could interpret O_SYNC|O_DSYNC 
like O_DSYNC.  The libc would then have to translate the userlevel 
O_DSYNC into the kernel O_DSYNC.  If the libc is too old for the kernel 
and the application, the userlevel flag would be passed to the kernel 
and nothing bad happens.

The cleaner alternative is to have a sys_newopen which checks for 
unknown flags and fails in that case.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--

From: Christoph Hellwig
Date: Friday, August 28, 2009 - 8:46 am

What about the following variant:

 - given that our current O_SYNC really is, and always has been, actually
   Posix O_DSYNC, keep the numerical value and rename it to O_DSYNC in
   the headers.
 - Add a new O_SYNC definition:

	#define O_SYNC		(O_DSYNC|O_REALLY_SYNC)

   and do full O_SYNC handling in new kernels if O_REALLY_SYNC is
   present.
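
To make the compatibility argument concrete, here is a small model of
how an old and a new kernel would each read the composed flag.  The
X_* names and bit values are hypothetical and chosen only for this
sketch, and the rule that the extra bit is only honoured together with
the data-sync bit is an assumption of the model.

```c
/* Hypothetical bit values, chosen only for this illustration. */
#define X_DSYNC        010000u     /* the bit old kernels already honour */
#define X_REALLY_SYNC  04000000u   /* new bit, unknown to old kernels */
#define X_SYNC         (X_DSYNC | X_REALLY_SYNC)

enum sync_mode { SYNC_NONE, SYNC_DATA, SYNC_FULL };

/* An old kernel only knows X_DSYNC and silently ignores other bits. */
enum sync_mode old_kernel_mode(unsigned flags)
{
    return (flags & X_DSYNC) ? SYNC_DATA : SYNC_NONE;
}

/* A new kernel looks at the extra bit only when X_DSYNC is also set. */
enum sync_mode new_kernel_mode(unsigned flags)
{
    if (!(flags & X_DSYNC))
        return SYNC_NONE;
    return (flags & X_REALLY_SYNC) ? SYNC_FULL : SYNC_DATA;
}
```

With this layout a program built against new headers and passing
X_SYNC still gets data-sync behaviour from an old kernel, and full
sync from a new one, without any runtime translation in libc.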
--

From: Ulrich Drepper
Date: Friday, August 28, 2009 - 9:06 am

If this is true, then this proposal would work, yes.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--

From: Christoph Hellwig
Date: Friday, August 28, 2009 - 9:17 am

I'll put it on my todo list.  While reading through the Posix specs
I came up with some questions that you might be able to answer:

 - O_RSYNC basically means we need to commit atime updates before a
   read returns, right?  It would be easy to implement 
   it in a slightly suboptimal fashion, but is there any point?

--

From: Ulrich Drepper
Date: Friday, August 28, 2009 - 9:33 am

Any ABI change like this takes a long time to trickle down.

If this is agreed to be the correct approach then adding the O_* 
definitions earlier is better.  Even if it isn't yet implemented.  Then, 
once the kernel side is implemented, programs are ready to use it.  I 

No, that's not it.

O_RSYNC on its own just means the data is successfully transferred to 
the calling process (always the case).

O_RSYNC|O_DSYNC means that if a read request hits data that is currently 
in a cache and not yet on the medium, then the write to medium is 
successful before the read succeeds.

O_RSYNC|O_SYNC means the same plus the integrity of file meta 
information (access time etc).

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--

From: Christoph Hellwig
Date: Friday, August 28, 2009 - 9:41 am

Yeah.  The implementation really is trivial in 2.6.32 - we basically
just need to change one function to check the new O_REALLY_SYNC flag
and pass down a 0 instead of a 1 to another routine in the generic
fs code, plus doing the same in a few filesystems opencoding it instead
of using the generic helpers.

So the logistics of doing the flags really is the biggest work here.
And I'm not entirely sure how to do it correctly.  Can we just switch
the current O_SYNC definition in the kernel headers to O_DSYNC while

That includes a write from another process?  So O_RSYNC basically means
doing a range-fdatasync before the actual read request?

Again, we could implement this easily if we care enough.

--

From: Ulrich Drepper
Date: Friday, August 28, 2009 - 1:51 pm

I don't think you have to change anything.  As I wrote before, the 
kernel ignores unknown O_* flags.  It's usually a problem.  Here it is a 

No, that's not a good idea.  This would mean a program compiled with 
newer headers is using O_SYNC which isn't known to old kernels and 


I think it can be useful at times.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--

From: Christoph Hellwig
Date: Friday, August 28, 2009 - 2:08 pm

Ok, let's agree on how to proceed:


once 2.6.31 is out we will do the following

 - do a global s/O_SYNC/O_DSYNC/g over the whole kernel tree
 - add this to include/asm-generic/fcntl.h and in modified form
   to arch headers not using it:

#ifndef O_FULLSYNC
#define O_FULLSYNC	02000000
#endif

#ifndef O_RSYNC
#define O_RSYNC		04000000
#endif

#define O_SYNC	(O_FULLSYNC|O_DSYNC)

 - during the normal merge window I will add a real implementation
   for O_FULLSYNC and O_RSYNC

P.S. better naming suggestions for O_FULLSYNC welcome
--

From: Trond Myklebust
Date: Friday, August 28, 2009 - 2:16 pm

Basically you are just ensuring that the metadata changes are being
synced together with the data changes, so how about O_ISYNC (inode
sync)?


--

From: Christoph Hellwig
Date: Friday, August 28, 2009 - 2:29 pm

Yeah.  Thinking about this a bit more we should define this flag
much more clearly.  In the obvious implementation it would not actually
do anything if it's set on its own.  We would only check it if O_DSYNC
is already set to decide if we want to set the datasync argument to
->fsync to 0 or 1 for the generic filesystems (and similar things for
filesystems not using the generic helper).

If we deem that this is too unsafe we could make sure O_DSYNC always
gets set on this flag in ->open, but if we make sure O_SYNC is defined
like the one above in the kernel headers and glibc we should be fine.

Although in that case a name that doesn't suggest that it actually does
something useful would be better.
--

From: Trond Myklebust
Date: Friday, August 28, 2009 - 2:43 pm

If you are going to automatically set O_DSYNC in open(), then
fcntl(F_SETFL) might get a bit nasty.

Imagine using it after the open in order to clear the O_ISYNC flag;
you'll still be left with the O_DSYNC (which you never set in the first
place). That would be confusing...
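
A tiny model of that sequence, for illustration.  The X_* bit values
and both function names are hypothetical; the code only mimics the
flag arithmetic and does not call open() or fcntl().

```c
/* Hypothetical bit values for illustration only. */
#define X_DSYNC  010000u
#define X_ISYNC  04000000u

/* Model of the variant under discussion: open() adds X_DSYNC
 * whenever X_ISYNC is requested. */
unsigned model_open_flags(unsigned requested)
{
    if (requested & X_ISYNC)
        requested |= X_DSYNC;
    return requested;
}

/* Model of fcntl(F_SETFL) clearing a single flag on an open file. */
unsigned model_clear_flag(unsigned current, unsigned flag)
{
    return current & ~flag;
}
```

Opening with only X_ISYNC and then clearing X_ISYNC leaves X_DSYNC
behind, a flag the caller never asked for.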

Cheers
  Trond

--

From: Christoph Hellwig
Date: Friday, August 28, 2009 - 3:39 pm

Indeed, that's a killer argument for the first variant.  We just need
to make it extremely clear (manpage _and_ comments) that only O_SYNC is
an exposed user interface and that O_WHATEVER_SYNC is an implementation
detail.
--

From: Jamie Lokier
Date: Sunday, August 30, 2009 - 9:44 am

O_FULLSYNC might get confused with MacOS X's F_FULLSYNC, which means
something else: fsync through hardware volatile write caches.

(Might we even want to provide O_FULLSYNC and O_FULLDATASYNC to mean
that, eventually?)

O_ISYNC is a bit misleading if we don't really offer "flush just the
inode state" by itself.

So it should at least start with underscores: __O_ISYNC.

How about __O_SYNC_NEW with

    #define O_SYNC     (O_DSYNC|__O_SYNC_NEW)

I think that tells people reading the headers a bit about what to
expect on older kernels too.

-- Jamie
--

From: Jamie Lokier
Date: Friday, August 28, 2009 - 9:46 am

On several unixes, O_RSYNC means it will send the read to the
hardware, not relying on the cache.  This can be used to verify the
data which was written earlier, whether by O_DSYNC or fdatasync.

-- Jamie
--

From: Jamie Lokier
Date: Friday, August 28, 2009 - 5:59 pm

I'm sure I read that in a couple of OS man pages, but I can't find it
again.  Maybe it was something more obscure than the mainstream
unices; maybe I imagined it.  Ho hum.  For now, forget I said anything.

-- Jamie
--

From: Jamie Lokier
Date: Friday, August 28, 2009 - 9:44 am

That looks good for the kernel.

However, for userspace, there's an issue with applications which were
compiled with an old libc and used O_SYNC.  Most of them probably
expected O_SYNC behaviour but all they got was O_DSYNC, because Linux
didn't do it right.

But they *didn't know* that.

When using a newer kernel which actually implements O_SYNC behaviour,
I'm thinking those applications which asked for O_SYNC should get it,
even though they're still linked with an old libc.

That's because this thread is the first time I've heard that Linux
O_SYNC was really the weaker O_DSYNC in disguise, and judging from the
many Googlings I've done about O_SYNC in applications and on different
OS, it'll be news to other people too.

(I always thought the "#define O_DSYNC O_SYNC" was because Linux
didn't implement the weaker O_DSYNC).

(Oh, and Ulrich: Why is there a "#define O_RSYNC O_SYNC" in the Glibc
headers?  That doesn't make sense: O_RSYNC has nothing to do with
writing.)

To achieve that, libc could implement two versions of open() at the
same time as it updates header files.  The new libc's __old_open() would
do:

    /* Only O_DSYNC is set for apps built against old libc which
       were compiled with O_SYNC. */
    if (flags & O_DSYNC)
        flags |= O_SYNC;

I'm not exactly sure how symbol versioning works, but perhaps the
header file in the new libc would need __REDIRECT_NTH to map open() to
__new_open(), which just calls the kernel.  This is to ensure .o and
.a files built with an old libc's headers but then linked to a new
libc will get __old_open().

Although libc's __new_open() could have this:

    /* Old kernels only look at O_DSYNC.  It's better than nothing. */
    if (flags & O_SYNC)
        flags |= O_DSYNC;

Imho, it's better to not do that, and instead have

    #define O_SYNC          (O_DSYNC|__O_SYNC_KERNEL)

as Chris suggests, in the libc header the same as the kernel header,
because that way applications which use the syscall() function or have
to ...
From: Jamie Lokier
Date: Friday, August 28, 2009 - 9:50 am

It looks like we're not the only ones.  AIX has:

http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc...

    Before the O_DSYNC open mode existed, AIX applied O_DSYNC semantics to
    O_SYNC. For binary compatibility reasons, this behavior still
    exists. If true O_SYNC behavior is required, then both O_DSYNC and
    O_SYNC open flags must be specified. Exporting the XPG_SUS_ENV=ON
    environment variable also enables true O_SYNC behavior.

-- Jamie
--

From: Ulrich Drepper
Date: Friday, August 28, 2009 - 2:08 pm

Right.  But these programs apparently can live with the broken 
semantics.  I don't worry too much about this.  If people really need 

In general yes, but it's too expensive.  Again, existing programs expect 

O_SYNC is a superset of O_RSYNC.  In the absence of a true O_RSYNC 
that's the next best thing.  Of course I didn't know the Linux O_SYNC is 

Why should it be better?  You're replacing something the compiler can do 
with zero cost with active code.


Again, these O_* constant changes are sufficient.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--

From: Jamie Lokier
Date: Sunday, August 30, 2009 - 9:58 am

That's an error - O_SYNC is not a superset of O_RSYNC.

O_SYNC (by itself) only affects writes.

O_RSYNC only affects reads.

In the absence of O_RSYNC support in the kernel, it's better to not
define O_RSYNC at all in userspace.  That tells applications they can
call fsync/fdatasync themselves before reading to get an equivalent
effect.

In fact O_RSYNC, when implemented correctly, can be used by
applications to get the effect of range-fsync/fdatasync when such
system calls aren't available (by reading a range), but not as
efficiently of course.  Defining O_RSYNC as O_SYNC fails to do that.
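
A minimal sketch of the "call fdatasync yourself before reading"
fallback, assuming a regular file on Linux.  The helper names are made
up for this example, and it syncs the whole file rather than a range,
so it is coarser than a real O_RSYNC|O_DSYNC read would need to be.

```c
#define _POSIX_C_SOURCE 200809L  /* for pread() and fdatasync() prototypes */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read `len` bytes at `off`, first forcing pending writes for the file
 * to stable storage.  Returns bytes read, or -1 on error. */
ssize_t read_after_datasync(int fd, void *buf, size_t len, off_t off)
{
    /* Whole-file fdatasync(): coarser than a range sync, but portable. */
    if (fdatasync(fd) != 0)
        return -1;
    return pread(fd, buf, len, off);
}

/* Small round trip: a buffered write followed by a "synced" read.
 * Returns 0 if the data read back matches what was written. */
int demo_roundtrip(const char *path)
{
    static const char msg[] = "hello";
    char back[sizeof msg];
    int ok = 0;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, msg, sizeof msg) == (ssize_t)sizeof msg &&
        read_after_datasync(fd, back, sizeof back, 0) == (ssize_t)sizeof back &&
        memcmp(msg, back, sizeof msg) == 0)
        ok = 1;

    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

An application can select this path at build time when O_RSYNC is not
defined, which is the signal argued for above.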

-- Jamie
--

From: Jamie Lokier
Date: Sunday, August 30, 2009 - 10:48 am

You misread; I said the zero cost thing is better.

The only reason you might use the active code is this:

    /* Upgrade O_DSYNC to O_SYNC. */

    flags = fcntl(fd, F_GETFL, 0);
    flags = (flags | O_SYNC) & ~O_DSYNC;
    fcntl(fd, F_SETFL, flags);

I'm not sure if that should work in POSIX.

-- Jamie
--

From: Jamie Lokier
Date: Friday, August 28, 2009 - 4:06 pm

Are you sure about this?

From http://www-01.ibm.com/support/docview.wss?uid=isg1IZ01704 :

    Error description

       LINUX O_DIRECT/O_SYNC TAKES TOO MANY IOS

    Problem summary

       On AIX, the O_SYNC and O_DSYNC are different values and
       performance improvement are available because the inode does
       not need to be flushed for mtime changes only.
       On Linux the flags are the same, so performance is lost
       when databases open files with O_DIRECT and O_SYNC.

-- Jamie
--

From: Christoph Hellwig
Date: Friday, August 28, 2009 - 4:46 pm

That is for GPFS, an out of tree filesystem with binary components.
It could be that they took Linux O_SYNC for real O_SYNC.  Any filesystem
using the generic helpers in Linux has gotten the O_DSYNC semantics at
least as long as I have worked on Linux filesystems, which is getting
close to 10 years now.  I'll do some code archaeology before we move
ahead with this to be sure.

--

From: Theodore Tso
Date: Friday, August 21, 2009 - 3:08 pm

Um, actually, we don't.  If we did that, we would have to wait for a
journal commit to complete before allowing the write(2) to complete,
which would be especially painfully slow for ext3.

This question recently came up on the ext4 developer's list, because
of a question of how direct I/O to a preallocated (uninitialized)
extent should be handled.  Are we supposed to guarantee synchronous
updates of the metadata by the time write(2) returns, or not?  One of
the ext4 developers (I can't remember if it was Mingming or Eric)
asked an XFS developer what they did in that case, and I believe the
answer they were given was that XFS started a commit, but did *not*
wait for the commit to complete before returning from the Direct I/O
write.  In fact, they were told (I believe this was from an SGI
engineer, but I don't remember the name; we can track that down if
it's important) that if an application wanted to guarantee metadata
would be updated for an extending write, they had to use fsync() or
O_SYNC/O_DSYNC.  

Perhaps they were given an incorrect answer, but it's clear the
semantics of exactly how Direct I/O works in edge cases isn't well
defined, or at least clearly and widely understood.

I have an early draft (for discussion only) of what we think it means and
what is currently implemented in Linux, which I've put up (again, let
me emphasize) for *discussion* here:

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

Comments are welcome, either on the wiki's talk page, or directly to
me, or to the linux-fsdevel or linux-ext4 lists.

						- Ted
--

From: Joel Becker
Date: Friday, August 21, 2009 - 3:38 pm

I think you mean "not well specified". ;-)

Joel

-- 

Life's Little Instruction Book #511

	"Call your mother."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
--

From: Joel Becker
Date: Friday, August 21, 2009 - 3:45 pm

In the section on perhaps not waiting for buffered fallback, we
need to clarify that O_DIRECT reads need to know to look in the
pagecache.  That is, if we decide that extending O_DIRECT writes without
fsync can return before the data hits the storage, the caller shouldn't
also have to call fsync() just to call read() of data they just wrote!

Joel

-- 

To spot the expert, pick the one who predicts the job will take the
longest and cost the most.

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
--

From: Theodore Tso
Date: Friday, August 21, 2009 - 7:11 pm

Yeah, I guess we can only do that if the filesystem guarantees
coherence between the page cache and O_DIRECT reads; it's been a long
while since I've studied that code, so I'm not sure whether all
filesystems that support O_DIRECT provide this coherency (since I
thought it was provided in the generic O_DIRECT routines, isn't it?)
or not.

							- Ted
--

From: Christoph Hellwig
Date: Sunday, August 23, 2009 - 7:42 pm

It's provided in the generic code, yes (or at least appears to).  

Note that xfstests has quite a few tests exercising it.
--

From: Christoph Hellwig
Date: Sunday, August 23, 2009 - 7:37 pm

The way the O_DIRECT fallback is implemented currently is that data does
hit the disk before return, thanks to a:

	err = do_sync_mapping_range(file->f_mapping, pos, endbyte,
					SYNC_FILE_RANGE_WAIT_BEFORE|
					SYNC_FILE_RANGE_WRITE|
					SYNC_FILE_RANGE_WAIT_AFTER);

which I expected to also sync the required metadata to disk, which
it doesn't.  Which btw are really horrible semantics given that
we export that beast to userspace as a separate system call.

--

From: Jamie Lokier
Date: Friday, August 21, 2009 - 5:56 pm

And that's not even a hardware cache issue, just whether filesystem
metadata is written.

AIX behaves like XFS according to documentation:

    [ http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.genprogc... ]

    Direct I/O and Data I/O Integrity Completion

    Although direct I/O writes are done synchronously, they do not
    provide synchronized I/O data integrity completion, as defined by
    POSIX. Applications that need this feature should use O_DSYNC in
    addition to O_DIRECT. O_DSYNC guarantees that all of the data and
    enough of the metadata (for example, indirect blocks) have written
    to the stable store to be able to retrieve the data after a system
    crash. O_DIRECT only writes the data; it does not write the
    metadata.
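
As a rough illustration of the O_DIRECT|O_DSYNC combination that text
recommends, here is a minimal C sketch for Linux.  The helper names
write_once() and write_block_dsync(), the 4096-byte IO_ALIGN value and
the retry without O_DIRECT when the filesystem answers EINVAL are all
assumptions made for the example; real alignment requirements depend
on the device and filesystem.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IO_ALIGN 4096           /* assumed buffer/length alignment */

/* Open with O_DSYNC plus extra_flags, write one block, close.
 * Returns 0 on success; on failure returns -1 with errno set. */
static int write_once(const char *path, int extra_flags, const void *buf)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC | extra_flags, 0600);
    if (fd < 0)
        return -1;              /* errno was set by open() */

    ssize_t n = write(fd, buf, IO_ALIGN);
    int err = (n < 0) ? errno : (n == IO_ALIGN ? 0 : EIO);
    if (close(fd) != 0 && err == 0)
        err = errno;
    errno = err;
    return err == 0 ? 0 : -1;
}

/* Write one aligned block filled with `fill`, asking for synchronized
 * data integrity completion on every write() via O_DSYNC. */
static int write_block_dsync(const char *path, char fill)
{
    void *buf = NULL;

    assert(path != NULL);
    if (posix_memalign(&buf, IO_ALIGN, IO_ALIGN) != 0)
        return -1;
    memset(buf, fill, IO_ALIGN);

    int rc = write_once(path, O_DIRECT, buf);
    if (rc != 0 && errno == EINVAL)     /* O_DIRECT not supported here */
        rc = write_once(path, 0, buf);  /* keep O_DSYNC, drop O_DIRECT */

    free(buf);
    return rc;
}
```

Calling write_block_dsync("/some/file", 'x') returns 0 once the block
has been written with O_DSYNC semantics, with or without O_DIRECT.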

That's another reason to use O_DIRECT|O_DSYNC in moderately portable

I haven't read it yet.  One thing which comes to mind is it would be
good to summarise what other OSes as well as Linux do with O_DIRECT
w.r.t. data-finding metadata, preallocation, file extending, hole
filling, unaligned access and what alignment is required, block
devices vs. files and different filesystems and behaviour-modifying
mount options, file open for buffered I/O on another descriptor, file
has mapped pages, mlocked pages, and of course drive cache write
through or not.

-- Jamie
--

From: Theodore Tso
Date: Friday, August 21, 2009 - 7:06 pm

...or use fsync() when they need to guarantee that data has been
atomically written, but not before.  This becomes critically important
if the application is writing into a sparse file, or writing into
uninitialized blocks that were allocated using fallocate(); otherwise,
with O_DIRECT|O_DSYNC, the file system would have to do a commit

It's a wiki; contributions to define all of that are welcome.  :-)

We may want to carefully consider what we want to guarantee for all
time to application writers, and what we might want to leave open to
allow for performance optimizations by the kernel, though.

      	  	      		       	   	   - Ted
--

From: Dave Chinner
Date: Tuesday, August 25, 2009 - 11:34 pm

That would have been Eric asking me. My answer was that O_DIRECT does
not imply any new data integrity guarantees associated with a
write(2) call - it just avoids system caches. You get the same
guarantees of resiliency as a non-O_DIRECT write(2) call at
completion - it may or may not be there if you crash. If you want
some guarantee of integrity, then you need to use O_DSYNC, O_SYNC or
call f[data]sync(2) just like all other IO.
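
In application terms that rule looks the same with or without
O_DIRECT: write, then ask for the commit explicitly.  A minimal sketch,
assuming POSIX and an ordinary buffered file; the helper name
append_batch() and its record format are made up for the example.  It
issues several writes and pays for one fdatasync(2) at the end.

```c
#define _POSIX_C_SOURCE 200809L /* declares fdatasync() under strict -std modes */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: append `count` NUL-terminated records to `path`
 * and commit them with a single fdatasync().  Returns 0 only if every
 * write and the final fdatasync() succeeded.
 */
static int append_batch(const char *path, const char *const *records, size_t count)
{
    assert(path != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    int rc = 0;
    for (size_t i = 0; i < count && rc == 0; i++) {
        size_t len = strlen(records[i]);
        size_t done = 0;
        while (done < len) {
            ssize_t n = write(fd, records[i] + done, len - done);
            if (n < 0 && errno == EINTR)
                continue;               /* interrupted, retry */
            if (n <= 0) {
                rc = -1;                /* real error or no progress */
                break;
            }
            done += (size_t)n;
        }
    }

    /* Nothing above is guaranteed to survive a crash until this returns. */
    if (rc == 0 && fdatasync(fd) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Opening with O_DSYNC instead would request the same commit on every
individual write(), which is the trade-off discussed in the rest of
the thread.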

Also, note that direct IO is not necessarily synchronous - you can
do asynchronous direct IO.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Jamie Lokier
Date: Wednesday, August 26, 2009 - 8:01 am

I agree with all of the above, except:

  1. If the automatic O_SYNC fallback mentioned by Christoph is
     currently implemented at all, even in a subset of filesystems,
     then I think it should be removed.

     An app which wants integrity should be calling fsync/fdatasync or
     using O_DSYNC/O_SYNC explicitly - with fsync/fdatasync giving
     more control over batching.

     If it doesn't do any of those things, it may be using O_DIRECT
     for performance, and not wish to be penalised by an expensive
     O_SYNC on every individual write.  Especially when O_SYNC is
     fixed to commit drive caches.

  2. I agree with everything Dave said about needing to use some other
     mechanism for an integrity commit; O_DIRECT is not enough.

     We can't realistically make O_DIRECT (by itself) do integrity
     commits anyway, because on some drives that involves committing
     the drive cache, and it would be a large performance regression.
     Given O_DIRECT is often used for its performance, that's not an
     option.

  3. Currently none of the options provides good integrity commit.

     All of them fail to commit drive caches under some circumstances;
     even fsync on ext3 with barriers enabled (because it doesn't
     commit a journal record if there were writes but no inode change
     with data=ordered).

     This should be changed (or at least made optionally available),
     and that's all the more reason to avoid commit operations except
     when requested.

  4. On drives which need it, fdatasync/fsync must trigger a drive
     cache flush even when there is no dirty page cache to write,
     because dirty pages may have been written in the background
     already, and because O_DIRECT writes dirty the drive cache but
     not the page cache.

     A per-drive flag would make sense to optimise this: It is set by
     any non-FUA writes sent to the drive while the drive's writeback
     cache is enabled, and cleared when any cache flush ...
--

From: Theodore Tso
Date: Wednesday, August 26, 2009 - 11:47 am

Could you clarify what you meant by "it" above?  I'm not sure I
understood what you were referring to.

Also, it sounds like you and Dave are mostly agreeing with what
I've written here; is that true?

http://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics

I'm trying to get consensus that this is both (a) an accurate
description of the state of affairs in Linux, and (b) that it is what
we think things should be, before I start circulating it around
application developers (especially database developers), to make sure

I agree we *should* do this, but we're going to take a pretty serious
performance hit when we do.  Mac OS chickened out and added an
F_FULLFSYNC option:

http://developer.apple.com/documentation/Darwin/Reference/Manpages/man2/fcntl.2.html
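
For reference, the fcntl command is spelled F_FULLFSYNC in Mac OS X's
<fcntl.h>.  A portable wrapper might look like the sketch below; the
helper name full_sync() is hypothetical, and falling back to plain
fsync() when F_FULLFSYNC is missing or refused is an assumption of the
example, not something Apple's page prescribes.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask for the strongest flush the platform
 * offers.  Where F_FULLFSYNC exists it also asks the drive to flush
 * its write cache; elsewhere this is just fsync().
 * Returns 0 on success, -1 on error.
 */
static int full_sync(int fd)
{
    assert(fd >= 0);
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
    /* Not every filesystem supports F_FULLFSYNC: fall through. */
#endif
    return fsync(fd);
}
```

On Linux the #ifdef branch compiles away, so what full_sync() really
guarantees there depends on the fsync() behaviour being debated here.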

The concern is that there are GUI programmers that want to update state
files after every window resize or move, and after every click in a web
browser.  These GUI programmers then get cranky when changes get lost
after proprietary video drivers cause the laptop to lock up.  If we
make fsync() too burdensome, then fewer and fewer applications will
use it.  Evidently the MacOS developers decided the number of
applications that really cared about doing device cache flushes was
much smaller than the vast number of applications that need a
lightweight file flush.  Should we do the same?

It seems like an awful cop-out, but having seen, up front and
personal, how "aggressively stupid" some desktop programmers can be[1],
I can **certainly** understand why Apple chose the F_FULLFSYNC route.

[1] http://josefsipek.net/blahg/?p=364

    		  	     	      - Ted
				      (who really needs to get himself
				       an O_PONIES t-shirt :-)
--

From: Jamie Lokier
Date: Thursday, August 27, 2009 - 7:50 am

I meant the automatic O_SYNC fallback, in other words, if O_DIRECT
falls back to buffered writing, Chris said it automatically did
O_SYNC, and you followed up by saying it doesn't :-)

All I'm saying is if there's _some_ code doing O_SYNC writing when
O_DIRECT falls back to buffered, it should be ripped out.  Leave the

I know about that one.  (I've done quite a lot of research on O_DIRECT
and fsync behaviours).  It's really unfortunate that they didn't
provide F_FULLDATASYNC, which is what a database or VM would ideally
use.

I think Vxfs provides a whole suite of mount options to adjust what

If fsync is cheap but doesn't commit changes properly - what's the
point in encouraging applications to use it?  Without drive cache
flushes, they will still lose changes occasionally.

(Btw, don't blame proprietary video drivers.  I see too many lockups

I did see a few of those threads, and I think your solution was genius.
Genius at keeping people quiet that is :-)

But it's also a good default.  fsync() isn't practical in shell
scripts or Makefiles, although that's really because "mv" lacks the
fsync option...

Personally I side with "want some kind of full-system asynchronous
transactionality please".  (Possibly aka. O_PONIES :-)

-- Jamie
--

From: Christoph Hellwig
Date: Friday, August 21, 2009 - 7:20 am

Just doing FUA should be pretty easy, in fact from my reading of the
code we already use FUA for barriers if supported, that is only drain
the queue, do a pre-flush for a barrier and then issue the actual
barrier write as FUA.

I can play around with getting rid of the pre-flush and doing cache
flush based emulation if FUA is not supported if you're fine with that.
--

From: James Bottomley
Date: Friday, August 21, 2009 - 8:06 am

I've never really understood why FUA is considered equivalent to a
barrier.  Our barrier semantics are that all I/Os before the barrier
should be safely on disk after the barrier executes.  The FUA semantics
are that *this write* should be safely on disk after it executes ... it
can still leave preceding writes in the cache.  I can see that if you're
only interested in metadata that making every metadata write a FUA and
leaving the cache to sort out data writes does give FS image
consistency.

How does FUA give us linux barrier semantics?

James


--

From: Christoph Hellwig
Date: Friday, August 21, 2009 - 8:23 am

FUA by itself doesn't.

Think what use cases we have for barriers and/or FUA right now:

 - a cache flush.  Can only be implemented as a cache flush, obviously.
 - a barrier flush bio - can be implemented as
     o cache flush, write, cache flush
     o or more efficiently as cache flush, write with FUA bit set
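
To make the difference concrete, here is a toy C model of a drive with
a volatile write-back cache.  It is purely illustrative: struct
toy_disk and every function name are invented for the sketch and have
nothing to do with the real block layer.  It models the two barrier
implementations listed above, and a lone FUA write for comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 8

/* Toy drive: `media` survives power loss, `cache` does not. */
struct toy_disk {
    int  media[NBLOCKS];
    int  cache[NBLOCKS];
    bool dirty[NBLOCKS];
};

static void disk_write(struct toy_disk *d, int blk, int val, bool fua)
{
    assert(blk >= 0 && blk < NBLOCKS);
    if (fua) {
        d->media[blk] = val;    /* FUA: only this write is forced to media */
        d->dirty[blk] = false;  /* it supersedes any cached copy of blk */
    } else {
        d->cache[blk] = val;    /* ordinary write may sit in the cache */
        d->dirty[blk] = true;
    }
}

/* Cache flush: every cached write reaches the media. */
static void disk_flush(struct toy_disk *d)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (d->dirty[i]) {
            d->media[i] = d->cache[i];
            d->dirty[i] = false;
        }
    }
}

/* Power loss: whatever was only in the cache is gone. */
static void disk_power_loss(struct toy_disk *d)
{
    memset(d->dirty, 0, sizeof d->dirty);
}

/* Barrier, variant 1: cache flush, write, cache flush. */
static void barrier_flush_write_flush(struct toy_disk *d, int blk, int val)
{
    disk_flush(d);
    disk_write(d, blk, val, false);
    disk_flush(d);
}

/* Barrier, variant 2: cache flush, then the write with FUA set. */
static void barrier_flush_fua(struct toy_disk *d, int blk, int val)
{
    disk_flush(d);
    disk_write(d, blk, val, true);
}
```

In this model both barrier variants leave an earlier cached write and
the barrier write on the media after disk_power_loss(), while a plain
disk_write(..., true) persists only its own block and loses the earlier
cached write.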

now there is a third use case for O_SYNC, O_DIRECT writes which actually
do have FUA-like semantics, that is we only guarantee the I/O is on disk,
but we do not make guarantees about ordering vs earlier writes.
Currently we (as in those few filesystems bothering despite the
VFS/generic helpers making it really hard) implement O_SYNC by:

 - doing one or multiple normal writes, and wait on them
 - then issue a cache flush - either explicitly blkdev_issue_flush
   or implicitly as part of a barrier write for metadata

this could be done more efficiently by simply setting the FUA bit on
these requests if we had an API for it.  For O_DIRECT the same should
apply, except that currently we don't even try.

--
