On Tue, Jan 30, 2007 at 06:07:14PM -0500, Phillip Susi wrote:

An I/O error is not an acceptable outcome in a mission critical app. All mission critical setups should be fault tolerant, so if raid cannot recover at the first sign of error, the whole system should instantly go down and let the secondary take over from it. See slony etc... Trying to recover the recoverable by mucking with the data, making even _more_ writes to a failing disk before doing a physical mirror image of the disk (the readable part), isn't a good idea IMHO. At best you could retry writing to the same sector, hoping somebody disconnected the scsi cable by mistake. You can track the range where it happened with fsync too, as I said in the previous email, and you can take the big database lock and then read-write read-write every single block in that range until you find the failing place, if you really want to. read-write in place should be safe.

Doing fsync after every write will provide the same ordering guarantee as O_SYNC; I thought it was obvious what I meant here. The whole point is that most of the time you don't need it, you need an fsync after a couple of writes. All smtp servers use fsync for the same reason: they also have to journal their writes to avoid losing email when there is a power loss. If you use writev or aio pwrite you can do well with O_SYNC too, though.

Please have a second look at aio_abi.h:

	IOCB_CMD_FSYNC = 2,
	IOCB_CMD_FDSYNC = 3,

There must be a reason why they exist, right?

O_DIRECT bypasses the cache, so the cache is freezing, not just cold.

The objective was to measure the pipeline stall. If you stall it for some other reason anyway, what's the point?

It would run slower with a smaller buffer size because it would block too, and it would read and write slower too. For my backup usage, keeping tar blocked is actually a feature, so the load of the backup decreases.
What matters to me is the MB/sec of the writes and the MB/sec of the reads (to lower the load). I don't care too much about how long it takes, as long as things run as efficiently as possible when they run. The rate limiting effect of the blocking isn't a problem to me.

I replied to that email to point out the fundamental differences between O_SYNC and O_DIRECT. If you don't like what I said I'm sorry, but that's how things are running today and I don't see it as quite possible to change (unless of course we remove performance from the equation, then indeed they'll be much the same).

Perhaps an IOCB_CMD_PREADAHEAD plus MAP_SHARED backed by largepages, loaded with a new syscall that reads a piece at a time into the large pagecache, could be an alternative design, or perhaps splice could obsolete O_DIRECT. I just have a very hard time seeing how O_SYNC+madvise could ever obsolete O_DIRECT.
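
To make the "fsync after a couple of writes" point above concrete, here is a minimal sketch of a journal append that syncs once per batch of records instead of on every write. The names journal_append and write_all, the record format and the batch parameter are made up for illustration; they are not taken from any smtp server or from the kernel. With batch set to 1 it degenerates into fsync after every write, which is the case described above as giving the same ordering guarantee as O_SYNC.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: append `count` NUL-terminated records to `path`,
 * calling fsync() once per `batch` records plus once for any remainder.
 * Returns the number of fsync() calls made, or -1 on error.
 */
static int journal_append(const char *path, const char *const *records,
			  size_t count, size_t batch)
{
	int fd, syncs = 0;
	size_t i, pending = 0;

	assert(batch > 0);

	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
	if (fd < 0)
		return -1;

	for (i = 0; i < count; i++) {
		if (write_all(fd, records[i], strlen(records[i])) < 0)
			goto fail;
		if (++pending == batch) {
			/* Everything written so far must reach the disk
			 * before the next batch is started. */
			if (fsync(fd) < 0)
				goto fail;
			syncs++;
			pending = 0;
		}
	}

	/* Flush a trailing partial batch. */
	if (pending > 0) {
		if (fsync(fd) < 0)
			goto fail;
		syncs++;
	}

	if (close(fd) < 0)
		return -1;
	return syncs;

fail:
	close(fd);
	return -1;
}
```

A caller would pass an array of records and a batch size, e.g. journal_append("/var/spool/example/journal", recs, nrecs, 2) (the path is only a placeholder). An fsync failure is reported to the caller and not retried, in line with the argument above that more writes to a failing disk are not a good idea.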
