Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

To: Andrew Morton <akpm@...>
Cc: Jan Kara <jack@...>, Hidehiro Kawai <hidehiro.kawai.ez@...>, <sct@...>, <adilger@...>, <linux-kernel@...>, <linux-ext4@...>, <jbacik@...>, <cmm@...>, <yumiko.sugita.yf@...>, <satoshi.oshima.fk@...>
Date: Wednesday, June 4, 2008 - 5:22 pm

On Wed, Jun 04, 2008 at 11:19:11AM -0700, Andrew Morton wrote:

As I told Kawai-san when I met with him and his colleagues in Tokyo
last week, it is the responsibility of the storage stack to retry
errors as appropriate.  From the filesystem perspective, a read or a
write operation succeeds, or fails.  A read or write operation could
take a long time before returning, but the storage stack doesn't get
to return a "fail, but try again at some point; maybe we'll succeed
later, or if you try writing to a different block".  The only sane
thing for a filesystem to do is to treat any failure as a hard
failure.

It is similarly insane to ask a filesystem to figure out that a newly
plugged in USB stick is the same one that the user had accidentally
unplugged 30 seconds ago.  We don't want to put that kind of low-level
knowledge about storage details in each different filesystem.

A much better place to put that kind of smarts is in a multipath
module which sits in between the device and the filesystem.  It can
retry writes after a transient failure, if a path goes down or if an
iSCSI device temporarily drops off the network.  But if a filesystem
gets a write failure, it has to assume that the write failure is
permanent.
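
To make the layering concrete, here is a minimal userspace sketch of
the idea, not actual multipath or block-layer code.  All of the names
(lower_write_fn, layered_write, flaky_dev, flaky_write) are made up
for illustration.  The intermediate layer absorbs transient failures
by retrying; the filesystem above it only ever sees success or -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook into the lower layer: 0 on success, -1 on failure. */
typedef int (*lower_write_fn)(void *dev, unsigned long block, const void *buf);

/*
 * Retry a write up to max_attempts times inside the intermediate layer.
 * The caller (the filesystem) gets only 0 or -EIO, never "try again".
 */
int layered_write(lower_write_fn write_fn, void *dev, unsigned long block,
                  const void *buf, int max_attempts)
{
    assert(write_fn != NULL);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (write_fn(dev, block, buf) == 0)
            return 0;               /* transient failure was absorbed here */
    }
    return -EIO;                    /* reported upward as a hard failure */
}

/* Illustrative fake device that fails a fixed number of times first. */
struct flaky_dev {
    int failures_left;
    int calls;
};

int flaky_write(void *dev, unsigned long block, const void *buf)
{
    struct flaky_dev *d = dev;
    (void)block;
    (void)buf;
    d->calls++;
    if (d->failures_left > 0) {
        d->failures_left--;
        return -1;
    }
    return 0;
}
```

With a flaky_dev that fails twice, layered_write() with a budget of
five attempts succeeds on the third call.  With one that never
recovers, the caller gets -EIO after the budget is spent and has to
treat it as permanent.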

The question though is what should you do if you have a write failure
in various different parts of the disk?  If you have a write failure
in a data block, you can return -EIO to the user.  You could try
reallocating to find another block, and try writing to that alternate
location (although with modern filesystems that do block remapping,
this is largely pointless, since an EIO failure on write probably
means you've lost connectivity to the disk or the disk has run out of
spare blocks).  But for a failure to write to a critical part of
the filesystem, like the inode table, or failure to write to the
journal, what the heck can you do?  Remounting read-only is probably
the best thing you can do.
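
As a rough sketch of that policy (illustrative only; none of these
names exist in ext3 or jbd), the decision could be modelled like this:
a failed data-block write is just reported to the caller, while a
failed write to the inode table or the journal also flips the
filesystem to read-only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical classification of what a failed write was aimed at. */
enum write_target { TARGET_DATA, TARGET_INODE_TABLE, TARGET_JOURNAL };

enum fs_action { ACTION_RETURN_EIO, ACTION_REMOUNT_RO };

/* Hypothetical stand-in for the mounted filesystem's state. */
struct fs_state {
    int read_only;
};

/* Data errors go back to the user; metadata and journal errors are fatal. */
enum fs_action classify_write_failure(enum write_target target)
{
    switch (target) {
    case TARGET_DATA:
        return ACTION_RETURN_EIO;
    case TARGET_INODE_TABLE:
    case TARGET_JOURNAL:
        return ACTION_REMOUNT_RO;
    }
    return ACTION_REMOUNT_RO;       /* unknown target: be conservative */
}

/* Apply the policy; the caller always sees -EIO for the failed write. */
int handle_write_failure(struct fs_state *fs, enum write_target target)
{
    assert(fs != NULL);
    if (classify_write_failure(target) == ACTION_REMOUNT_RO)
        fs->read_only = 1;
    return -EIO;
}
```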

In theory, if it is a failure to write to the journal, you could fall
back to non-journaled operation, and if ext3 could support running w/o
a journal, that is possibly an option --- but again, it's very likely
that the disk is totally gone (i.e., the user pulled the USB stick
without unmounting), or the disk is out of spare blocks in its bad
block remapping pool, and the system is probably going to be in deep
trouble --- and the next failure to write some data might be critical
application data.  You probably *are* better off failing the system
hard, and letting the HA system swap in the hot spare backup, if this
is some critical service.

That being said, ext3 can be tuned (and it is the default today,
although I should probably change the default to be remount-ro), so
that its behaviour on write errors is, "don't worry, be happy", and
just leave the filesystem mounted read/write.  That's actually quite
dangerous for a critical production server, however.....
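
For reference, the knob in question is the errors= mount option
(continue, remount-ro, or panic), with the on-disk default set via
tune2fs -e.  The snippet below is only a toy userspace model of the
three settings, with made-up names; it is not the kernel's option
parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the three ext3 error behaviours. */
enum errors_policy {
    ERRORS_CONTINUE,    /* "don't worry, be happy": stay read/write */
    ERRORS_REMOUNT_RO,  /* remount the filesystem read-only */
    ERRORS_PANIC,       /* halt the system */
    ERRORS_INVALID
};

/* Map a single mount option string to a policy. */
enum errors_policy parse_errors_option(const char *opt)
{
    assert(opt != NULL);
    if (strcmp(opt, "errors=continue") == 0)
        return ERRORS_CONTINUE;
    if (strcmp(opt, "errors=remount-ro") == 0)
        return ERRORS_REMOUNT_RO;
    if (strcmp(opt, "errors=panic") == 0)
        return ERRORS_PANIC;
    return ERRORS_INVALID;
}

/* Only "continue" leaves the filesystem writable after an error. */
int stays_writable_after_error(enum errors_policy policy)
{
    return policy == ERRORS_CONTINUE;
}
```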

							- Ted
--
