Re: IO error semantics

Previous thread: How to use mkfs.ext4 "stride=" on RAID correctly? by Mike Mestnik on Wednesday, January 13, 2010 - 5:12 pm. (6 messages)

Next thread: [REPOST][PATCH][RFC] vfs: add message print mechanism for the mount/umount into the VFS layer by Toshiyuki Okajima on Wednesday, January 13, 2010 - 11:48 pm. (11 messages)
From: Hidehiro Kawai
Date: Wednesday, January 13, 2010 - 11:12 pm

This patch fixes the similar bug fixed by commit 95450f5a.

If a directory is modified, its data block is journaled as metadata
and finally written back to the right place.  Now, we assume a
transient write erorr happens on that writeback.  Uptodate flag of
the buffer is cleared by write error, so next access on the buffer
causes a reread from disk.  This means it breaks the filesystems
consistency.

To prevent old directory data from being reread, this patch set
uptodate flag again in the case of after write error before issuing
the read operation.  The write error on the directory's data block
is detected at the time of journal checkpointing or discarded if a
rewrite by another modification succeeds, so no problem.

I tested this patch by using fault injection approach.

By the way, I think the right fix is to keep uptodate flag on write
error, but it gives a big impact.  We have to confirm whether
over 200 buffer_uptodate's are used for real uptodate check or write
error check.  For now, I adopt the quick-fix solution.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/ext3/inode.c |   13 +++++++++++++
 fs/ext3/namei.c |   15 ++++++++++++++-
 2 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 455e6e6..17c7a5f 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1077,10 +1077,23 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
 		return bh;
 	if (buffer_uptodate(bh))
 		return bh;
+
+	/*
+	 * uptodate flag may have been cleared by a previous (transient)
+	 * write IO error.  In this case, we don't want to re-read its
+	 * old on-disk data.  Actually the buffer has the latest data,
+	 * so set uptodate flag again.
+	 */
+	if (buffer_write_io_error(bh)) {
+		set_buffer_uptodate(bh);
+		return bh;
+	}
+
 	ll_rw_block(READ_META, 1, &bh);
 	wait_on_buffer(bh);
 	if (buffer_uptodate(bh))
 		return bh;
+
 	put_bh(bh);
 	*err = -EIO;
 	return NULL;
diff --git ...
From: Hidehiro Kawai
Date: Thursday, January 14, 2010 - 2:05 am

Hi,


After sending this patch, I noticed that I have to deal with
the bh_uptodate_or_lock() case as well.  Actually, I confirmed
a data block sharing happens between two inodes. Allocate a new
block, then modified bitmap goes to the fs, but it fails due to
a transient IO error.  Next access on the bitmap buffer cause a
reread from disk.  As a result, the allocated block becomes 
a FREE block!  So this block can be shared by different two inodes.

I'll send the revised version later.

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Hidehiro Kawai
Date: Thursday, January 14, 2010 - 3:14 am

This patch fixes the similar bug fixed by commit 95450f5a.

If a directory is modified, its data block is journaled as metadata
and finally written back to the right place.  Now, we assume a
transient write erorr happens on that writeback.  Uptodate flag of
the buffer is cleared by write error, so next access on the buffer
causes a reread from disk.  This means it breaks the filesystems
consistency.

To prevent old directory data from being reread, this patch set
uptodate flag again in the case of after write error before issuing
the read operation.  The write error on the directory's data block
is detected at the time of journal checkpointing or discarded if a
rewrite by another modification succeeds, so no problem.

Similarly, this kind of consistency breakage can be caused by
a transient write error on a bitmap block.

I tested this patch by using fault injection approach.

By the way, I think the right fix is to keep uptodate flag on write
error, but it gives a big impact.  We have to confirm whether
over 200 buffer_uptodate's are used for real uptodate check or write
error check.  For now, I adopt the quick-fix solution.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/ext3/balloc.c |   12 ++++++++++++
 fs/ext3/inode.c  |   13 +++++++++++++
 fs/ext3/namei.c  |   15 ++++++++++++++-
 3 files changed, 39 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 27967f9..5dc5ccf 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -156,6 +156,18 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group)
 	if (likely(bh_uptodate_or_lock(bh)))
 		return bh;
 
+	/*
+	 * uptodate flag may have been cleared by a previous (transient)
+	 * write IO error.  In this case, we don't want to reread its
+	 * old on-disk data.  Actually the buffer has the latest data,
+	 * so set uptodate flag again.
+	 */
+	if (buffer_write_io_error(bh)) {
+		set_buffer_uptodate(bh);
+		unlock_buffer(bh);
+		return ...
From: Jan Kara
Date: Thursday, January 14, 2010 - 7:18 am

Yes that needs to be solved. I also looked into it and it's too much work
to do it in a one big sweep. But I think we could do the conversion
filesystem by filesystem - see below.
  Admittedly, I don't like your solution very much. It looks strange to
check write_io_error when *reading* the buffer and I'm afraid we could
introduce bugs e.g. by clearing write_io_error bit so that ext3_bread would
then fail to detect the error condition or by some other code deciding to
read the buffer from disk via other function than just ext3_bread. So I
think it would be much better to set back uptodate flag shortly after the
failed write or not clear it at all.
  Now here's how I think we could achieve that without having to change all
other filesystems: We could create a new superblock flag which would mean
that "this filesystem handles write_io_error and doesn't want to clear
uptodate flag". Filesystems with this capability would set this flag when
calling get_sb_bdev. And if write IO fails we check this flag
(via bh->b_bdev->bd_super->s_flags) and clear / not clear uptodate flag
accordingly. What do you think?
  I know it's more work than your quick fix but it should fix all these
problems for ext3 once for all and it would be much cleaner...

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Friday, January 15, 2010 - 3:38 am

Thanks for your comment!

Your suggestion is what I wanted to do ultimately, and it seems
to go well.  I'll send a rivised patch later.

Best regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Nick Piggin
Date: Sunday, January 17, 2010 - 10:18 pm

Yes this is something I've had problems with as well, although I never
properly solved the issue with auditing / back compatibility with
filesystems. So it is good to see people working on a real solution :)

Clearing uptodate flag on write errors is a really nasty thing to do. It
means that failed writeback cannot be retried, can also break application
consistency for data blocks, similarly to filesystem consistency for
metadata blocks, and might even cause oopses and weird problems when
!uptodate pages/buffers are not expected, mmapped pages, for example, or

I agree, and this sounds like a decent solution.

We also need to remove some ClearPageUptodate calls I think (similar
issues), so keep those in mind too. Unfortunately it looks like there
are also a lot of filesystem specific tests of PageUptodate... but you
could also move those under the new compatibility s_flag.

I don't know of a really good way to inject and test filesystem errors.
Make request failures causes most fs to quickly go readonly or have
bigger problems. If you're careful like try to only fail read IOs for
data, or only fail write IOs not involved in integrity or journal
operations, then test programs just tend to abort pretty quickly. Does
--

From: Nick Piggin
Date: Sunday, January 17, 2010 - 11:05 pm

This might be a good time to bring up IO error behaviour again. I got
into some debates I think on Andi's hwpoison thread a while back, but
probably not appropriate thread to find a real solution to this.

The problem we have now is that IO error semantics are not well defined.
It is hard to even enumerate all the issues.

read IOs
  how to retry? appropriate defaults should happen at the block layer I
  think. Should retry behaviour be tunable by the mm/fs, or should that
  be coded explicitly as submission retry loops? Either way does imply
  there is either similar defaults for all types (or maybe classes) of
  drivers, or some way to query/set this.

  It would be nice to be able to set fs/driver behaviour from userspace
  too, in a generic (not driver or fs specific way). But defaults should
  be reasonable and similar between all, I guess.

write IOs
  This is more interesting. How to handle write IO errors. In my opinion
  we must not invalidate the data before an IO error is returned to
  somebody (whether it be fsync or a synchronous write syscall). Any
  earlier and the app just gets RAW consistency randomly violated. And I
  think it is important to treat IO errors as transparently as possible
  until the error can be detected.

  I happen to think that actually we should go further and not
  invalidate the data at all. This makes implementation simpler, and
  also allows us to retry writes like we can retry reads. It's also
  problematic to throw out errors at that point because *sync syscalls
  coming from elsewhere could result in loss of error reporting (think,
  sys_sync).

  If we go this way, we probably need another syscall and fs helper call
  to invalidate the dirty data when we give up on retries. truncate_range
  probably not appropriate because it is much harder to implement and
  maybe we want to try to get at the most recent data that is on disk.

  Also do we need to think about O_SYNC or -o sync type of writes that
  are implemented via ...
From: Dave Chinner
Date: Monday, January 18, 2010 - 5:24 am

It's more complex than that - there are classes of errors to
consider as well. e.g transient vs permanent.

Transient is from stuff like FC path failures - failover can take up
to 240s to occur, and then the IO will generally complete
successfully.  Permanent errors are those that involve data loss e.g
bad sectors on single disks or on degraded RAID devices.

The action to take is generally different for different error
classes - transient errors can be retried later, while permanent
errors won't change no matter how many retries you do.  IOWs, we'll
need help from the block layer to enable us to determine the error

I don't think generic handling is really possible - filesystems may
have different ways of recovering e.g. duplicate copies of data or
metadata or internal ECC that can be used to recovery the bad
region. Also, depending where the error occurs, the filesystem might

We already pass the error via mapping_set_error() calls when the
error occurs and checking in it filemap_fdatawait_range().  However,
where we check the error we've lost all context and what range the
error occurred on. I don't see any easy way to track such an
error for later invalidation except maybe by a new radix tree tag.
That would allow later invalidation of only the specific range the

The worst problem with this is what happens when you can't write
back to the filesystem because of IO errors, but you still allow more
incoming writes? It's not far from IO error to running out of memory


How to handle this comes down to the type of error that occurred. In
the case of permanent error, the second read after the invalidation
probably should return EIO because you have no idea whether what is on
disk is the old, the new, some combination of the two or some other

It's a damn hard problem and many of the details are filesystem
specific. However, if we want high grade reliability from our
systems then we have to tackle these problems at some point in time.

FWIW, I started to document some of ...
From: Nick Piggin
Date: Monday, January 18, 2010 - 7:00 am

Yes. Is this something that should be visible above the block layer
though? If it is known transient, should it remain uncompleted until it
is successful?

Known permanent errors yes could avoid any need for retries. Leaving
cases where the lower layers don't really know (in which case we'd

For write errors, you could also do block re-allocation, which would be

Definitely there will be filesystem specific issues. But I mean that
some common things could be specified (like how long / how many times

If we always leave the error pages / buffers as dirty and uptodate,
then we can walk the radix tree dirty bits. IO errors are only really
reported by syncing calls anyway which walk dirty bits already.

If we wanted a purely querying syscall, it probably doesn't need to so
so performance critical as to require a new tag rather than just

Again, keeping pages dirty so we'll start synchronous dirty pagecache
throttling eventually.

That could cause problems of its own as well, but I don't know what else
we can do. I don't think we can throw out the dirty data by default (the

Well by this I just mean the dirty, unwritten pagecache and its associated
fs private structures. For errors in filesystem metadata yes it is a lot
harder. I guess filesystems simply need to check and handle errors on a

I'm not sure if that is important because you would have the same
problems if the read was not preceded by a write (or if the write came
from previous boot, or a different machine etc).

If we want to catch IO errors not detected by the block layer, it really


We do want to start by making this as _simple_ as possible. Even the
existing rudimentary error reporting by the block layer is not used in a
consistent way (or at all, in many cases).

So I think squashing corrupted data errors into transient/permanent


Yes this needs support, which I've talked about in hwpoison discussions.
Currently (or last time I checked) it just causes corrupted dirty
pagecache to appear as an IO ...
From: Dave Chinner
Date: Monday, January 18, 2010 - 3:51 pm

I think it needs to be exposed because if the filesystem has
multiple copies of the data it can read from the other location

Personally I don't think users aren't going to be able to make
intelligent decisions about what do with such a knob. I'd prefer to
just make it a fixed policy first, and only provide tunables if

Agreed - there will be some common things fall out, but I'd like to
see an analysis done first before we try to extract the common

The drive for the document I was writing was big, high performance
filesystems (think petabyte scale) and machines that might cache a
TB or two of a single file in memory. At that point, finding a

write_cache_pages() decrements nr_to_write even if there was a write
error on that page. Hence the throttling in balance_dirty_pages
won't kick in if lots of errors occur during synchronous writeback
because it will think the number of pages it asked to be written

A certain number of retries is certainly worth attempting for
errors that we can't directly report (background writeback), but
whether that should be done for sync/fsync is an open question in my


If the filesystem has been unmounted, then we have to assume that
corrective action has been taken (i.e. we've reported a problem,

Yes, though there are plenty of different types of errors the block
layer detects but report simply as "EIO". e.g. on Irix, the block
layer would report EXDEV rather than EIO for transient path-failure

True. My main point is, though, we can't really make that
classification without understanding the whole scope of errors that
can occur and ensuring that we get the correct errors reported from
the lower layers first. i.e. this is not just a pagecache/filesystem
level problem - the lower layers have to do the right thing before we

Yeah - a week rarely goes by when we don't get a report of an XFS
filesystem hung due to something below it just stopping mid-IO
(DM, md, drivers, and/or hardware). e.g what appears to be a 
DM-related hang reported ...
From: Anton Altaparmakov
Date: Monday, January 18, 2010 - 4:33 pm

Hi,


Yes it would.  (-:

FWIW, Windows does this with Microsoft's NTFS driver.  When a write fails due to a bad block, the block is marked as bad (recorded in the bad cluster list and marked as allocated in the in-use bitmap so no-one tries to allocate it), a new block is allocated, inode metadata is updated to reflect the change in the logical to physical block map of the file the block belongs to, and the write is then re-tried to its new location.

I have never bothered implementing it in NTFS on Linux partially because there doesn't seem any obvious way to do it inside the file system.  I think the VFS and/or the block layer would have to offer help there in some way.  What I mean for example is that if ->writepage fails then the failure is only detected inside the asynchronous i/o completion handler at which point the page is not locked any more, it is marked as being under writeback, and we are in IRQ context (or something) and thus it is not easy to see how we can from there get to doing all the above needed actions that require memory allocations, disk i/o, etc...  I suppose a separate thread could do it where we just schedule the work to be done.  But problem with that is that that work later on might fail so we can't simply pretend the block was written successfully yet we do not want to report an error or the upper layers would pick it up even though we hopefully will correct it in due course...

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

--

From: Ric Wheeler
Date: Monday, January 25, 2010 - 8:23 am

For permanent write errors, I would expect any modern drive to do a 
sector remapping internally. We should never need to track this kind of 
information for any modern device that I know of (S-ATA, SAS, SSD's and 
raid arrays should all handle this).

Would not seem to be worth the complexity.

Also keep in mind that retrying IO errors is not always a good thing - 
devices retry failed IO multiple times internally. Adding additional 
retry loops up the stack only makes our unavoidable IO error take much 
longer to hit!

Ric


--

From: Greg Freemyer
Date: Monday, January 25, 2010 - 9:15 am

I thought write errors returned by modern drives (last 15 years) in
general were caused by bad cables, controllers, power supplies, etc.

When a media error is returned on write it indicated the spare sector
area of the drive was full.

Thus a media write error is a major error.  I would think, if
anything, we should turn the filesystem readonly upon a write media
error.  Not try to hide such a major problem.

Greg
--

From: tytso
Date: Monday, January 25, 2010 - 10:47 am

... and if the device is run out of all of its blocks in its spare
blocks pool, it's probably well past the time to replace said disk.

BTW, I really liked Dave Chinner's summary of the issues involved; I
ran into Kawai-san last week at Linux.conf.au, and we discussed pretty
much the same thing over lunch.  (i.e., that it's a hard problem, and
in some cases we need to retry the writes, such as a transient FC path
problem --- but some kind of write throttling is critical or we could
end up choking the VM due to too many pages getting dirtied and no way
of cleaning them.)

						- Ted
--

From: Ric Wheeler
Date: Monday, January 25, 2010 - 10:50 am

Also note that retrying writes (or reads for that matter) often are counter 
productive. For those of us who have suffered with trying to migrate data off of 
an old, failing disk onto a new, shiny one, excessive retries can be painful...

ric

--

From: Nick Piggin
Date: Monday, January 25, 2010 - 10:59 am

That is probably true most of the time. So some sane defaults should
be attempted that work for most cases.

After that, retrying I was imagining should be driven by the
application. So: attempting to read or fsync again.

What should not happen is for the page to be marked !dirty or !uptodate.
This randomly breaks write to read consistency without necessarily even
any error reported, so it seems really hard for an app to do the right
thing there.

--

From: Nick Piggin
Date: Monday, January 25, 2010 - 10:55 am

Well I just don't think we can ever discard them by default. Therefore
we must default to not discarding them, therefore we need to solve or
work around the dirty page congestion problem some how.


--

From: Dave Chinner
Date: Monday, January 25, 2010 - 11:19 pm

We have done this for a long time in XFS. e.g. If we can't issue IO
on the page (e.g. allocation fails or we are in a shutdown
situation already) we invalidate the page immediately, clear the
page uptodate flag and return an error to mark the address space
with an error. See xfs_page_state_convert() for more detail.

And besides, if there is an error of some kind sufficient to shut
down the filesystem, the last thing you want to do is write more
data to it and potentially make the problem worse, especially if
async transactions that the data write might rely on were cancelled

Agreed. The way XFS treats data IO errors is because that's the only
thing we can do right now if we want the system to continue to function
in the face of IO errors....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

Previous thread: How to use mkfs.ext4 "stride=" on RAID correctly? by Mike Mestnik on Wednesday, January 13, 2010 - 5:12 pm. (6 messages)

Next thread: [REPOST][PATCH][RFC] vfs: add message print mechanism for the mount/umount into the VFS layer by Toshiyuki Okajima on Wednesday, January 13, 2010 - 11:48 pm. (11 messages)