This patch fixes a bug similar to the one fixed by commit 95450f5a.
If a directory is modified, its data block is journaled as metadata
and finally written back to the right place. Now, assume a
transient write error happens on that writeback. The uptodate flag of
the buffer is cleared by the write error, so the next access to the
buffer causes a reread from disk. This breaks the filesystem's
consistency.
To prevent old directory data from being reread, this patch sets the
uptodate flag again after a write error, before the read operation
is issued. The write error on the directory's data block
is detected at the time of journal checkpointing or discarded if a
rewrite by another modification succeeds, so this causes no problem.
I tested this patch using a fault injection approach.
By the way, I think the right fix is to keep the uptodate flag on write
error, but that would have a big impact. We would have to confirm whether
each of the more than 200 buffer_uptodate() calls is used as a real
uptodate check or as a write error check. For now, I have adopted the
quick-fix solution.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/ext3/inode.c | 13 +++++++++++++
fs/ext3/namei.c | 15 ++++++++++++++-
2 files changed, 27 insertions(+), 1 deletions(-)
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 455e6e6..17c7a5f 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1077,10 +1077,23 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
 		return bh;
 	if (buffer_uptodate(bh))
 		return bh;
+
+	/*
+	 * uptodate flag may have been cleared by a previous (transient)
+	 * write IO error. In this case, we don't want to re-read its
+	 * old on-disk data. Actually the buffer has the latest data,
+	 * so set uptodate flag again.
+	 */
+	if (buffer_write_io_error(bh)) {
+		set_buffer_uptodate(bh);
+		return bh;
+	}
+
 	ll_rw_block(READ_META, 1, &bh);
 	wait_on_buffer(bh);
 	if (buffer_uptodate(bh))
 		return bh;
+
 	put_bh(bh);
 	*err = -EIO;
 	return NULL;
diff --git ...

Hi,

After sending this patch, I noticed that I have to deal with the
bh_uptodate_or_lock() case as well. Actually, I confirmed that a data
block can end up shared between two inodes. A new block is allocated
and the modified bitmap is written to the filesystem, but the write
fails due to a transient IO error. The next access to the bitmap buffer
causes a reread from disk. As a result, the allocated block becomes a
FREE block! So this block can be shared by two different inodes.

I'll send the revised version later.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center
--
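The bitmap scenario described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `struct toy_buffer` and every `toy_*` function are hypothetical names invented for the illustration. The model assumes that a failed writeback clears the uptodate flag and records a write error, as described in this thread, and it compares the read path with and without the quick fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy userspace model, NOT kernel code. All names are hypothetical.
 * One cached bitmap byte and the copy that is currently on disk.
 */
struct toy_buffer {
	unsigned char cache;	/* in-memory data (always the latest) */
	unsigned char disk;	/* on-disk data (stale after a failed write) */
	bool uptodate;
	bool write_io_error;
};

/* Writeback; a failure clears uptodate and records the write error. */
void toy_writeback(struct toy_buffer *b, bool fail)
{
	assert(b != NULL);
	if (fail) {
		b->uptodate = false;
		b->write_io_error = true;
		return;
	}
	b->disk = b->cache;
	b->write_io_error = false;
}

/*
 * Read path. With quick_fix set, a buffer whose last write failed is
 * trusted again instead of being reread from disk.
 */
unsigned char toy_read(struct toy_buffer *b, bool quick_fix)
{
	assert(b != NULL);
	if (b->uptodate)
		return b->cache;
	if (quick_fix && b->write_io_error) {
		b->uptodate = true;
		return b->cache;
	}
	b->cache = b->disk;	/* reread: stale data replaces the latest data */
	b->uptodate = true;
	return b->cache;
}

/* Allocate bit 0, let the writeback fail, then look at bit 0 again. */
bool toy_block_still_allocated(bool quick_fix)
{
	struct toy_buffer b = { .cache = 0, .disk = 0,
				.uptodate = true, .write_io_error = false };

	b.cache |= 1u;			/* mark block 0 as allocated */
	toy_writeback(&b, true);	/* transient write error */
	return (toy_read(&b, quick_fix) & 1u) != 0;
}
```

In this model, `toy_block_still_allocated(false)` reports the block as free again after the reread, which corresponds to the double allocation described above, while `toy_block_still_allocated(true)` keeps it allocated.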
This patch fixes a bug similar to the one fixed by commit 95450f5a.
If a directory is modified, its data block is journaled as metadata
and finally written back to the right place. Now, assume a
transient write error happens on that writeback. The uptodate flag of
the buffer is cleared by the write error, so the next access to the
buffer causes a reread from disk. This breaks the filesystem's
consistency.
To prevent old directory data from being reread, this patch sets the
uptodate flag again after a write error, before the read operation
is issued. The write error on the directory's data block
is detected at the time of journal checkpointing or discarded if a
rewrite by another modification succeeds, so this causes no problem.
Similarly, this kind of consistency breakage can be caused by
a transient write error on a bitmap block.
I tested this patch using a fault injection approach.
By the way, I think the right fix is to keep the uptodate flag on write
error, but that would have a big impact. We would have to confirm whether
each of the more than 200 buffer_uptodate() calls is used as a real
uptodate check or as a write error check. For now, I have adopted the
quick-fix solution.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/ext3/balloc.c | 12 ++++++++++++
fs/ext3/inode.c | 13 +++++++++++++
fs/ext3/namei.c | 15 ++++++++++++++-
3 files changed, 39 insertions(+), 1 deletions(-)
diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 27967f9..5dc5ccf 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -156,6 +156,18 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group)
 	if (likely(bh_uptodate_or_lock(bh)))
 		return bh;
+	/*
+	 * uptodate flag may have been cleared by a previous (transient)
+	 * write IO error. In this case, we don't want to reread its
+	 * old on-disk data. Actually the buffer has the latest data,
+	 * so set uptodate flag again.
+	 */
+	if (buffer_write_io_error(bh)) {
+		set_buffer_uptodate(bh);
+		unlock_buffer(bh);
+		return ...

Yes that needs to be solved. I also looked into it and it's too much
work to do it in one big sweep. But I think we could do the
conversion filesystem by filesystem - see below.

Admittedly, I don't like your solution very much. It looks strange to
check write_io_error when *reading* the buffer and I'm afraid we could
introduce bugs, e.g. by clearing the write_io_error bit so that
ext3_bread would then fail to detect the error condition, or by some
other code deciding to read the buffer from disk via a function other
than ext3_bread. So I think it would be much better to set the uptodate
flag back shortly after the failed write or not clear it at all.

Now here's how I think we could achieve that without having to change
all other filesystems: We could create a new superblock flag which
would mean "this filesystem handles write_io_error and doesn't want
the uptodate flag cleared". Filesystems with this capability would set
this flag when calling get_sb_bdev. And if write IO fails we check this
flag (via bh->b_bdev->bd_super->s_flags) and clear / not clear the
uptodate flag accordingly. What do you think? I know it's more work
than your quick fix but it should fix all these problems for ext3 once
and for all and it would be much cleaner...

--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
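The superblock-flag idea proposed above could look roughly like the following userspace sketch. It is an illustration under stated assumptions, not an implementation of the proposal: the flag name `TOY_SB_KEEP_UPTODATE`, the structures and the `toy_*` functions are all hypothetical, and the real write-completion path in the kernel is more involved than this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy userspace model, NOT kernel code. The flag name below is
 * hypothetical; the thread does not name the proposed flag.
 */
#define TOY_SB_KEEP_UPTODATE 0x1u

struct toy_super {
	unsigned int s_flags;
};

struct toy_bh {
	bool uptodate;
	bool write_io_error;
	const struct toy_super *sb;	/* stands in for bh->b_bdev->bd_super */
};

/*
 * Write completion. On failure the write error is always recorded, but
 * the uptodate flag is only cleared for filesystems that did not opt in.
 */
void toy_end_write(struct toy_bh *bh, bool io_ok)
{
	assert(bh != NULL && bh->sb != NULL);
	if (io_ok) {
		bh->uptodate = true;
		return;
	}
	bh->write_io_error = true;
	if (!(bh->sb->s_flags & TOY_SB_KEEP_UPTODATE))
		bh->uptodate = false;
}

/* Fail one write on a buffer and report whether it is still uptodate. */
bool toy_uptodate_after_failed_write(unsigned int s_flags)
{
	struct toy_super sb = { .s_flags = s_flags };
	struct toy_bh bh = { .uptodate = true, .write_io_error = false,
			     .sb = &sb };

	toy_end_write(&bh, false);
	assert(bh.write_io_error);	/* the error is recorded either way */
	return bh.uptodate;
}
```

With the flag set, the buffer stays uptodate after a failed write and the read paths need no special case; without it, the existing behaviour is kept, which is what would allow the conversion to proceed filesystem by filesystem.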
Thanks for your comment! Your suggestion is what I wanted to do
ultimately, and it seems it would work well. I'll send a revised patch
later.

Best regards,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center
--
Yes this is something I've had problems with as well, although I never
properly solved the issue with auditing / back compatibility with
filesystems. So it is good to see people working on a real solution :)

Clearing uptodate flag on write errors is a really nasty thing to do.
It means that failed writeback cannot be retried, can also break
application consistency for data blocks, similarly to filesystem
consistency for metadata blocks, and might even cause oopses and weird
problems when !uptodate pages/buffers are not expected, mmapped pages,
for example, or

I agree, and this sounds like a decent solution. We also need to remove
some ClearPageUptodate calls I think (similar issues), so keep those in
mind too. Unfortunately it looks like there are also a lot of
filesystem specific tests of PageUptodate... but you could also move
those under the new compatibility s_flag.

I don't know of a really good way to inject and test filesystem errors.
Making requests fail causes most fs to quickly go readonly or have
bigger problems. If you're careful, e.g. you try to only fail read IOs
for data, or only fail write IOs not involved in integrity or journal
operations, then test programs just tend to abort pretty quickly. Does
--
This might be a good time to bring up IO error behaviour again. I got
into some debates I think on Andi's hwpoison thread a while back, but
that was probably not the appropriate thread to find a real solution
to this.

The problem we have now is that IO error semantics are not well
defined. It is hard to even enumerate all the issues.

read IOs

How to retry? Appropriate defaults should happen at the block layer I
think. Should retry behaviour be tunable by the mm/fs, or should that
be coded explicitly as submission retry loops? Either way does imply
there are either similar defaults for all types (or maybe classes) of
drivers, or some way to query/set this. It would be nice to be able to
set fs/driver behaviour from userspace too, in a generic (not driver or
fs specific) way. But defaults should be reasonable and similar between
all, I guess.

write IOs

This is more interesting. How to handle write IO errors. In my opinion
we must not invalidate the data before an IO error is returned to
somebody (whether it be fsync or a synchronous write syscall). Any
earlier and the app just gets RAW consistency randomly violated. And I
think it is important to treat IO errors as transparently as possible
until the error can be detected.

I happen to think that actually we should go further and not invalidate
the data at all. This makes implementation simpler, and also allows us
to retry writes like we can retry reads. It's also problematic to throw
out errors at that point because *sync syscalls coming from elsewhere
could result in loss of error reporting (think, sys_sync).

If we go this way, we probably need another syscall and fs helper call
to invalidate the dirty data when we give up on retries. truncate_range
is probably not appropriate because it is much harder to implement and
maybe we want to try to get at the most recent data that is on disk.

Also do we need to think about O_SYNC or -o sync type of writes that
are implemented via ...
It's more complex than that - there are classes of errors to consider
as well. e.g. transient vs permanent. Transient is from stuff like FC
path failures - failover can take up to 240s to occur, and then the IO
will generally complete successfully. Permanent errors are those that
involve data loss, e.g. bad sectors on single disks or on degraded RAID
devices.

The action to take is generally different for different error classes -
transient errors can be retried later, while permanent errors won't
change no matter how many retries you do. IOWs, we'll need help from
the block layer to enable us to determine the error

I don't think generic handling is really possible - filesystems may
have different ways of recovering, e.g. duplicate copies of data or
metadata or internal ECC that can be used to recover the bad region.
Also, depending where the error occurs, the filesystem might

We already pass the error via mapping_set_error() calls when the error
occurs and checking it in filemap_fdatawait_range(). However, where we
check the error we've lost all context and what range the error
occurred on. I don't see any easy way to track such an error for later
invalidation except maybe by a new radix tree tag. That would allow
later invalidation of only the specific range the

The worst problem with this is what happens when you can't write back
to the filesystem because of IO errors, but you still allow more
incoming writes? It's not far from IO error to running out of memory

How to handle this comes down to the type of error that occurred. In
the case of permanent error, the second read after the invalidation
probably should return EIO because you have no idea whether what is on
disk is the old, the new, some combination of the two or some other

It's a damn hard problem and many of the details are filesystem
specific. However, if we want high grade reliability from our systems
then we have to tackle these problems at some point in time.

FWIW, I started to document some of ...
Yes. Is this something that should be visible above the block layer
though? If it is known transient, should it remain uncompleted until it
is successful? Known permanent errors yes could avoid any need for
retries. Leaving cases where the lower layers don't really know (in
which case we'd

For write errors, you could also do block re-allocation, which would be

Definitely there will be filesystem specific issues. But I mean that
some common things could be specified (like how long / how many times

If we always leave the error pages / buffers as dirty and uptodate,
then we can walk the radix tree dirty bits. IO errors are only really
reported by syncing calls anyway which walk dirty bits already. If we
wanted a purely querying syscall, it probably doesn't need to be so
performance critical as to require a new tag rather than just

Again, keeping pages dirty so we'll start synchronous dirty pagecache
throttling eventually. That could cause problems of its own as well,
but I don't know what else we can do. I don't think we can throw out
the dirty data by default (the

Well by this I just mean the dirty, unwritten pagecache and its
associated fs private structures. For errors in filesystem metadata yes
it is a lot harder. I guess filesystems simply need to check and handle
errors on a

I'm not sure if that is important because you would have the same
problems if the read was not preceded by a write (or if the write came
from previous boot, or a different machine etc). If we want to catch IO
errors not detected by the block layer, it really

We do want to start by making this as _simple_ as possible. Even the
existing rudimentary error reporting by the block layer is not used in
a consistent way (or at all, in many cases). So I think squashing
corrupted data errors into transient/permanent

Yes this needs support, which I've talked about in hwpoison
discussions. Currently (or last time I checked) it just causes
corrupted dirty pagecache to appear as an IO ...
I think it needs to be exposed because if the filesystem has multiple
copies of the data it can read from the other location

Personally I don't think users are going to be able to make
intelligent decisions about what to do with such a knob. I'd prefer to
just make it a fixed policy first, and only provide tunables if

Agreed - there will be some common things that fall out, but I'd like
to see an analysis done first before we try to extract the common

The drive for the document I was writing was big, high performance
filesystems (think petabyte scale) and machines that might cache a TB
or two of a single file in memory. At that point, finding a

write_cache_pages() decrements nr_to_write even if there was a write
error on that page. Hence the throttling in balance_dirty_pages won't
kick in if lots of errors occur during synchronous writeback because it
will think the number of pages it asked to be written

A certain number of retries is certainly worth attempting for errors
that we can't directly report (background writeback), but whether that
should be done for sync/fsync is an open question in my

If the filesystem has been unmounted, then we have to assume that
corrective action has been taken (i.e. we've reported a problem,

Yes, though there are plenty of different types of errors the block
layer detects but reports simply as "EIO". e.g. on Irix, the block
layer would report EXDEV rather than EIO for transient path-failure

True. My main point is, though, we can't really make that
classification without understanding the whole scope of errors that can
occur and ensuring that we get the correct errors reported from the
lower layers first. i.e. this is not just a pagecache/filesystem level
problem - the lower layers have to do the right thing before we

Yeah - a week rarely goes by when we don't get a report of an XFS
filesystem hung due to something below it just stopping mid-IO (DM, md,
drivers, and/or hardware). e.g. what appears to be a DM-related hang
reported ...
Hi,

Yes it would. (-:

FWIW, Windows does this with Microsoft's NTFS driver. When a write
fails due to a bad block, the block is marked as bad (recorded in the
bad cluster list and marked as allocated in the in-use bitmap so no-one
tries to allocate it), a new block is allocated, inode metadata is
updated to reflect the change in the logical to physical block map of
the file the block belongs to, and the write is then re-tried to its
new location.

I have never bothered implementing it in NTFS on Linux partially
because there doesn't seem any obvious way to do it inside the file
system. I think the VFS and/or the block layer would have to offer help
there in some way. What I mean for example is that if ->writepage fails
then the failure is only detected inside the asynchronous i/o
completion handler at which point the page is not locked any more, it
is marked as being under writeback, and we are in IRQ context (or
something) and thus it is not easy to see how we can from there get to
doing all the above needed actions that require memory allocations,
disk i/o, etc... I suppose a separate thread could do it where we just
schedule the work to be done. But problem with that is that that work
later on might fail so we can't simply pretend the block was written
successfully yet we do not want to report an error or the upper layers
would pick it up even though we hopefully will correct it in due
course...

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
--
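The remapping sequence described above (mark the block bad, allocate a new one, update the block map, retry the write) can be sketched as a synchronous userspace toy. This is an illustration only, with hypothetical names throughout (`struct toy_volume`, `toy_write_remap()` and friends); it is not how any NTFS driver is implemented, and it deliberately ignores the hard part raised above, namely that on Linux the failure is only seen in the asynchronous completion handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NBLOCKS 8

/* Toy userspace model, NOT a real filesystem. All names are hypothetical. */
struct toy_volume {
	bool in_use[TOY_NBLOCKS];	/* allocation bitmap */
	bool bad[TOY_NBLOCKS];		/* bad cluster list */
	bool media_bad[TOY_NBLOCKS];	/* simulated hardware: writes here fail */
	int data[TOY_NBLOCKS];
};

/* Simulated device write; returns false when the block is unwritable. */
bool toy_dev_write(struct toy_volume *v, int blk, int val)
{
	assert(v != NULL && blk >= 0 && blk < TOY_NBLOCKS);
	if (v->media_bad[blk])
		return false;
	v->data[blk] = val;
	return true;
}

/* First-fit allocator; returns -1 when the volume is full. */
int toy_alloc(struct toy_volume *v)
{
	for (int i = 0; i < TOY_NBLOCKS; i++) {
		if (!v->in_use[i]) {
			v->in_use[i] = true;
			return i;
		}
	}
	return -1;
}

/*
 * Write val through the logical->physical mapping *map. On failure the
 * block is recorded as bad (it stays marked in use, so it is never
 * handed out again), a new block is allocated, the mapping is updated
 * and the write is retried at the new location.
 */
int toy_write_remap(struct toy_volume *v, int *map, int val)
{
	while (!toy_dev_write(v, *map, val)) {
		v->bad[*map] = true;
		int nb = toy_alloc(v);
		if (nb < 0)
			return -1;	/* no good block left */
		*map = nb;
	}
	return 0;
}

/* Blocks 0 and 1 are unwritable; the data should end up in block 2. */
int toy_remap_demo(void)
{
	struct toy_volume v = {0};
	int map = 0;

	v.in_use[0] = true;
	v.media_bad[0] = true;
	v.media_bad[1] = true;
	if (toy_write_remap(&v, &map, 42) != 0)
		return -1;
	assert(v.bad[0] && v.bad[1] && v.in_use[0] && v.in_use[1]);
	assert(v.data[map] == 42);
	return map;
}
```

In this toy, `toy_remap_demo()` ends with the data in block 2 and blocks 0 and 1 on the bad list. A real implementation would have to defer this work out of the completion context, for example to a separate thread as suggested above, and would still have to decide what to report if the deferred work itself fails.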
For permanent write errors, I would expect any modern drive to do a
sector remapping internally. We should never need to track this kind of
information for any modern device that I know of (S-ATA, SAS, SSDs and
RAID arrays should all handle this). It would not seem to be worth the
complexity.

Also keep in mind that retrying IO errors is not always a good thing -
devices retry failed IO multiple times internally. Adding additional
retry loops up the stack only makes our unavoidable IO error take much
longer to hit!

Ric
--
I thought write errors returned by modern drives (last 15 years) in
general were caused by bad cables, controllers, power supplies, etc.
When a media error is returned on write it indicates the spare sector
area of the drive is full. Thus a media write error is a major error.

I would think, if anything, we should turn the filesystem readonly upon
a write media error. Not try to hide such a major problem.

Greg
--
... and if the device has run out of all of the blocks in its spare
block pool, it's probably well past the time to replace said disk.

BTW, I really liked Dave Chinner's summary of the issues involved; I
ran into Kawai-san last week at Linux.conf.au, and we discussed pretty
much the same thing over lunch. (i.e., that it's a hard problem, and in
some cases we need to retry the writes, such as a transient FC path
problem --- but some kind of write throttling is critical or we could
end up choking the VM due to too many pages getting dirtied and no way
of cleaning them.)

- Ted
--
Also note that retrying writes (or reads for that matter) is often
counterproductive. For those of us who have suffered with trying to
migrate data off of an old, failing disk onto a new, shiny one,
excessive retries can be painful...

ric
--
That is probably true most of the time. So some sane defaults should be
attempted that work for most cases. After that, retrying I was
imagining should be driven by the application. So: attempting to read
or fsync again.

What should not happen is for the page to be marked !dirty or
!uptodate. This randomly breaks write to read consistency without
necessarily even any error reported, so it seems really hard for an app
to do the right thing there.
--
Well I just don't think we can ever discard them by default. Therefore
we must default to not discarding them, therefore we need to solve or
work around the dirty page congestion problem somehow.
--
We have done this for a long time in XFS. e.g. If we can't issue IO on
the page (e.g. allocation fails or we are in a shutdown situation
already) we invalidate the page immediately, clear the page uptodate
flag and return an error to mark the address space with an error. See
xfs_page_state_convert() for more detail.

And besides, if there is an error of some kind sufficient to shut down
the filesystem, the last thing you want to do is write more data to it
and potentially make the problem worse, especially if async
transactions that the data write might rely on were cancelled

Agreed. The way XFS treats data IO errors is because that's the only
thing we can do right now if we want the system to continue to function
in the face of IO errors....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
--
