Re: O_DIRECT question

To: Andrea Arcangeli <andrea@...>
Cc: Denis Vlasenko <vda.linux@...>, Bill Davidsen <davidsen@...>, Michael Tokarev <mjt@...>, Linus Torvalds <torvalds@...>, Viktor <vvp01@...>, Aubrey <aubreylee@...>, Hua Zhong <hzhong@...>, Hugh Dickins <hugh@...>, Linux-kernel <linux-kernel@...>
Date: Tuesday, January 30, 2007 - 7:07 pm

Andrea Arcangeli wrote:

I thought it obvious that we were talking about non-recoverable errors 
that DO make it to the application.  Any kind of mission-critical app 
most definitely does care about write errors.  You don't want your db 
completing the transaction when it was only half recorded.  It needs to 
know the write failed so it can back out and/or recover the data and 
record it elsewhere.  You certainly don't want the users to think 
everything is fine, walk away, and have the system continue to limp on, 
making things worse by the second.


If the OS crashes due to an IO error reading user data, then there is 
something seriously wrong and beyond the scope of this discussion.  It 
suffices to say that due to the semantics of write() and sound 
engineering practice, the application expects to be notified of errors 
so it can try to recover, or fail gracefully.  Whether it chooses to 
fail gracefully as you say it should, or recovers from the error, it 
needs to know that an error happened, and where it was.


It most certainly matters where the error happened, because "you are 
screwed" is not an acceptable outcome in a mission-critical application. 
A well-engineered solution will deal with errors as best as possible, 
not simply give up and tell the user they are screwed because the 
designer was lazy.  There is a reason that read and write return the 
number of bytes _actually_ transferred, and the application is supposed 
to check that result to verify proper operation.
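
To make the point about checking the return value concrete, here is a 
minimal sketch of the kind of loop an application might use.  The helper 
name write_all() and its "done" out-parameter are my own illustration, not 
anything from the kernel or libc; only write(), errno and EINTR are real 
interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write exactly len bytes from buf to fd, retrying
 * after short writes and EINTR.  Returns 0 on success.  On failure it
 * returns -1 with errno set, and *done holds the number of bytes written
 * before the error, so the caller knows where the failure happened.
 */
static int write_all(int fd, const void *buf, size_t len, size_t *done)
{
    const char *p = buf;

    assert(buf != NULL && done != NULL);
    *done = 0;
    while (*done < len) {
        ssize_t n = write(fd, p + *done, len - *done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* real error: report it, do not hide it */
        }
        if (n == 0) {
            errno = EIO;        /* no progress: treat as an error, do not spin */
            return -1;
        }
        *done += (size_t)n;     /* short write: loop for the remainder */
    }
    return 0;
}
```

A caller that gets -1 back can look at errno and at the byte count to 
decide whether to retry, roll back, or record the data elsewhere.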


No, there is a slight difference.  An fsync() flushes all dirty buffers 
in an undefined order.  Using O_DIRECT or O_SYNC, you can control the 
flush order because you can simply wait for one set of writes to 
complete before starting another set that must not be written until 
after the first are on the disk.  You can emulate that by placing an 
fsync between both sets of writes, but that will flush any other dirty 
buffers whose ordering you do not care about.  Also there is no aio 
version of fsync.
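
As a rough sketch of the ordering argument, assuming a simple 
record-then-marker layout that I made up for illustration (the function 
name ordered_commit() and the file format are hypothetical), an O_SYNC 
descriptor lets the application hold back the second write until the 
first has completed:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical example: append a data record, then a commit marker, to a
 * file opened with O_SYNC.  Each write() returns only after the kernel
 * reports that write as complete, so the marker is never issued before
 * the record has been written.  Returns 0 on success, -1 on any error or
 * short write.
 */
static int ordered_commit(const char *path, const char *record,
                          const char *marker)
{
    assert(path != NULL && record != NULL && marker != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    size_t rlen = strlen(record);
    size_t mlen = strlen(marker);
    int rc = -1;

    /* && short-circuits: the marker is written only if the record was. */
    if (write(fd, record, rlen) == (ssize_t)rlen &&
        write(fd, marker, mlen) == (ssize_t)mlen)
        rc = 0;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Unlike an fsync() placed between the two writes, this waits only for the 
writes issued on this descriptor, not for every other dirty buffer of the 
file.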


sync has no effect on reading, so that test is pointless.  direct saves 
the CPU overhead of the buffer copy, but isn't good if the cache isn't 
entirely cold.  The large buffer size really has little to do with it; 
rather, it is the fact that the writes to null do not block dd from 
making the next read for any length of time.  If dd were blocking on an 
actual output device, that would leave the input device idle for the 
portion of the time that dd was blocked.

In any case, this is a totally different example than your previous one 
which had dd _writing_ to a disk, where it would block for long periods 
of time due to O_SYNC, thereby preventing it from reading from the input 
buffer in a timely manner.  By not reading the input pipe frequently, it 
becomes full and thus, tar blocks.  In that case the large buffer size 
is actually a detriment because with a smaller buffer size, dd would not 
be blocked as long and so it could empty the pipe more frequently 
allowing tar to block less.



You seem to have missed the point of this thread.  Denis Vlasenko's 
message that you replied to simply pointed out that O_DIRECT and 
O_SYNC + madvise are semantically equivalent, so O_DIRECT could be 
dropped provided that O_SYNC + madvise can be fixed to perform as well.  
Several people, including Linus, seem to like this idea and think it is 
quite possible.


Messages in current thread:
Re: O_DIRECT question, Phillip Susi, (Tue Jan 30, 2:50 pm)
Re: O_DIRECT question, Andrea Arcangeli, (Tue Jan 30, 3:57 pm)
Re: O_DIRECT question, Phillip Susi, (Tue Jan 30, 7:07 pm)
Re: O_DIRECT question, Michael Tokarev, (Wed Jan 31, 5:37 am)
Re: O_DIRECT question, Andrea Arcangeli, (Tue Jan 30, 10:28 pm)
Re: O_DIRECT question, Andrea Arcangeli, (Tue Jan 30, 4:06 pm)