Re: O_DIRECT question

To: Phillip Susi <psusi@...>
Cc: Denis Vlasenko <vda.linux@...>, Bill Davidsen <davidsen@...>, Michael Tokarev <mjt@...>, Linus Torvalds <torvalds@...>, Viktor <vvp01@...>, Aubrey <aubreylee@...>, Hua Zhong <hzhong@...>, Hugh Dickins <hugh@...>, Linux-kernel <linux-kernel@...>
Date: Tuesday, January 30, 2007 - 10:28 pm

On Tue, Jan 30, 2007 at 06:07:14PM -0500, Phillip Susi wrote:

An I/O error is not an acceptable outcome in a mission critical app.
All mission critical setups should be fault tolerant, so if RAID
cannot recover at the first sign of error, the whole system should
instantly go down and let the secondary take over from it. See Slony
etc...

Trying to recover the recoverable by mucking with the data, making even
_more_ writes to a failing disk before taking a physical mirror image of
the disk (the readable part), isn't a good idea IMHO. At best you could
retry writing to the same sector, hoping somebody disconnected the SCSI
cable by mistake.


You can track the range where it happened with fsync too, as I said in
the previous email, and you can take the big database lock and then
read-write, read-write every single block in that range until you find
the failing place, if you really want to. Read-write in place should be
safe.
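
To make the read-write scan above concrete, here is a minimal sketch in C. The helper name find_failing_block, its parameters, and the choice to fsync after every block are assumptions made for illustration, not something taken from any real database. It also assumes the caller already holds whatever lock keeps other writers out of the range.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: rewrite every block in [start, end) in place.
 * Returns 0 if every block was read, rewritten and synced without error,
 * 1 if a block failed (its offset is stored in *bad), and -1 if the
 * scratch buffer could not be allocated. */
static int find_failing_block(int fd, off_t start, off_t end,
                              size_t blksz, off_t *bad)
{
    assert(blksz > 0);

    char *buf = malloc(blksz);
    int rc = 0;

    if (buf == NULL)
        return -1;

    for (off_t off = start; off < end; off += (off_t)blksz) {
        /* Read the block back first so the rewrite changes no data. */
        ssize_t n = pread(fd, buf, blksz, off);
        if (n == 0)
            break;                      /* end of file, nothing left */

        /* Write the same bytes to the same place and force them out,
         * so an error is reported for this block and not a later one. */
        if (n < 0 || pwrite(fd, buf, (size_t)n, off) != n || fsync(fd) != 0) {
            *bad = off;
            rc = 1;
            break;
        }
    }

    free(buf);
    return rc;
}
```

Syncing after each block is slow, but it is what ties a reported error to one specific offset; syncing once at the end would only tell you that something in the range failed.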


Doing fsync after every write will provide the same ordering
guarantee as O_SYNC; I thought it was obvious what I meant here.

The whole point is that most of the time you don't need it; you need
an fsync after a couple of writes. All SMTP servers use fsync for the
same reason: they also have to journal their writes to avoid losing
email when there is a power loss.
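
As an illustration of "an fsync after a couple of writes", the following C sketch appends several journal records with plain write() calls and then issues a single fsync as the commit point. The names journal_open, write_all and journal_commit are hypothetical, and the record format (NUL-terminated strings) is an assumption chosen to keep the example short.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: open (or create) an append-only journal, without O_SYNC. */
static int journal_open(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
}

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical commit: queue all records in the page cache, then pay
 * for one fsync instead of one synchronous write per record.
 * Returns 0 on success, -1 if any write or the fsync failed. */
static int journal_commit(int fd, const char *const records[], size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (write_all(fd, records[i], strlen(records[i])) != 0)
            return -1;
    return fsync(fd);
}
```

With O_SYNC on the descriptor, each of those write() calls would wait for the disk; here only the final fsync does.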

If you use writev or aio pwrite you can do well with O_SYNC too though.
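
A rough sketch of the writev case, in C: with O_SYNC on the descriptor, gathering the pieces of a record into one writev() call means one synchronous write for the whole record. The function names open_sync and write_record_sync and the header/body split are assumptions for illustration only.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical: open a log for appending with synchronous writes. */
static int open_sync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
}

/* Hypothetical: write header and body with a single writev() call.
 * Returns 0 if both pieces were written in full, -1 on an error or
 * a short write. */
static int write_record_sync(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)hdr;   /* writev does not modify the data */
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    ssize_t want = (ssize_t)(iov[0].iov_len + iov[1].iov_len);
    ssize_t n = writev(fd, iov, 2);

    return n == want ? 0 : -1;
}
```

Two separate write() calls on the same O_SYNC descriptor would each wait for the disk, which is the cost the single gathered call avoids.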


Please have a second look at aio_abi.h:

	IOCB_CMD_FSYNC = 2,
	IOCB_CMD_FDSYNC = 3,

There must be a reason why they exist, right?


O_DIRECT bypasses the cache, so the cache is freezing, not just cold.


The objective was to measure the pipeline stall. If you stall it for
some other reason anyway, what's the point?


It would run slower with a smaller buffer size because it would block
too, and it would read and write slower too. For my backup usage,
keeping tar blocked is actually a feature, because the load of the
backup decreases. What matters to me is the MB/sec of the writes and
the MB/sec of the reads (to lower the load). I don't care too much about
how long it takes as long as things run as efficiently as possible
when they run. The rate limiting effect of the blocking isn't a
problem for me.


I replied to that email to point out the fundamental differences
between O_SYNC and O_DIRECT. If you don't like what I said, I'm sorry,
but that's how things work today and I don't see how it could
change (unless of course we remove performance from the equation;
then indeed they'll be much the same).

Perhaps an IOCB_CMD_PREADAHEAD plus MAP_SHARED backed by largepages,
loaded with a new syscall that reads a piece at a time into the large
pagecache, could be an alternative design, or perhaps splice could
obsolete O_DIRECT. I just have a very hard time seeing how
O_SYNC+madvise could ever obsolete O_DIRECT.

Messages in current thread:
Re: O_DIRECT question, Phillip Susi, (Tue Jan 30, 2:50 pm)
Re: O_DIRECT question, Andrea Arcangeli, (Tue Jan 30, 3:57 pm)
Re: O_DIRECT question, Phillip Susi, (Tue Jan 30, 7:07 pm)
Re: O_DIRECT question, Michael Tokarev, (Wed Jan 31, 5:37 am)
Re: O_DIRECT question, Andrea Arcangeli, (Tue Jan 30, 10:28 pm)
Re: O_DIRECT question, Andrea Arcangeli, (Tue Jan 30, 4:06 pm)