Subject: [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2) This patch set is the take 2 of fixing error handling problem in ext3/JBD. The previous discussion can be found here: http://lkml.org/lkml/2008/5/14/10 The same problem should also be in ext4/JBD, but I haven't prepared it yet. Problem ======= Currently some error checkings are missing, so the journal cannot abort correctly. This causes breakage of the ordered mode rule and filesystem corruption. Missing error checkings are: (1) error check for dirty buffers flushed before the commit (addressed by PATCH 1/5 and 2/5) (2) error check for the metadata writes to the journal before the commit (addressed by PATCH 3/5) (3) error check for checkpointing and replay (addressed by PATCH 4/5 and 5/5) Changes from take 1 =================== [PATCH 1/5] o not changed [PATCH 2/5] o rewrite my coment in journal_dirty_data() comprehensibly [PATCH 3/5] o check for errors and abort the journal just before journal_write_commit_record() instead of after writing metadata buffers [PATCH 4/5 and 5/5] o separate the ext3 part from the jbd part in a patch o use JFS_ABORT for checkpointing failures instead of introducing JFS_CP_ABORT flag o don't update only the journal super block, but also j_tail and j_tail_sequence when the journal has aborted (at least we only have to avoid updating the super block, but keeping j_tail*'s values will be good thing because it may protect someone from adding bugs in the future) o journal_destroy() returns -EIO when the journal has aborted so that ext3_put_super() can detect the abort o journal_flush() uses j_checkpoint_mutex to avoid a race with __log_wait_for_space() The last item targets a newly found problem. journal_flush() can be called while processing __log_wait_for_space(). In this case, cleanup_journal_tail() can be called between __journal_drop_transaction() and journal_abort(), then the transaction with checkpointing ...
Subject: [PATCH 1/5] jbd: strictly check for write errors on data buffers
In ordered mode, we should abort journaling when an I/O error has
occurred on a file data buffer in the committing transaction. But
there can be data buffers which are not checked for error:
(a) the buffer which has already been written out by pdflush
(b) the buffer which has been unlocked before scanned in the
t_locked_list loop
This patch adds missing error checks and aborts journaling
appropriately.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitacih.com>
Acked-by: Jan Kara <jack@suse.cz>
---
fs/jbd/commit.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
Index: linux-2.6.26-rc4/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc4/fs/jbd/commit.c
@@ -172,7 +172,7 @@ static void journal_do_submit_data(struc
/*
* Submit all the data buffers to disk
*/
-static void journal_submit_data_buffers(journal_t *journal,
+static int journal_submit_data_buffers(journal_t *journal,
transaction_t *commit_transaction)
{
struct journal_head *jh;
@@ -180,6 +180,7 @@ static void journal_submit_data_buffers(
int locked;
int bufs = 0;
struct buffer_head **wbuf = journal->j_wbuf;
+ int err = 0;
/*
* Whenever we unlock the journal and sleep, things can get added
@@ -253,6 +254,8 @@ write_out_data:
put_bh(bh);
} else {
BUFFER_TRACE(bh, "writeout complete: unfile");
+ if (unlikely(!buffer_uptodate(bh)))
+ err = -EIO;
__journal_unfile_buffer(jh);
jbd_unlock_bh_state(bh);
if (locked)
@@ -271,6 +274,8 @@ write_out_data:
}
spin_unlock(&journal->j_list_lock);
journal_do_submit_data(wbuf, bufs);
+
+ return err;
}
/*
@@ -410,8 +415,7 @@ void journal_commit_transaction(journal_
* Now start flushing things to disk, in the order they appear
* on the transaction lists. Data blocks go first.
...On Mon, 02 Jun 2008 19:43:57 +0900 Why should we do that? --
I see two reasons: 1) If fs below us is returning IO errors, we don't really know how severe it is so it's safest to stop accepting writes. Also user notices the problem early this way. I agree that with the growing size of disks and thus probability of seeing IO error, we should probably think of something cleverer than this but aborting seems better than just doing nothing. 2) If the IO error is just transient (i.e., link to NAS is disconnected for a while), we would silently break ordering mode guarantees (user could be able to see old / uninitialized data). Honza PS: Changed Andreas's address in the email to the new one... -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Does any other filesystem driver turn the fs read-only on the first write-IO-error? It seems like a big policy change to me. For a lot of applications it's effectively a complete outage and people might get a bit upset if this happens on the first blip from their NAS. <waves vigorously at linux-ext4 people> --
As I told Kawai-san when I met with him and his colleagues in Tokyo last week, it is the responsibility of the storage stack to retry errors as appropriate. From the filesystem perspective, a read or a write operation succeeds, or fails. A read or write operation could take a long time before returning, but the storage stack doesn't get to return a "fail, but try again at some point; maybe we'll succeed later, or if you try writing to a different block". The only sane thing for a filesystem to do is to treat any failure as a hard failure. It is similarly insane to ask a filesystem to figure out that a newly plugged in USB stick is the same one that the user had accidentally unplugged 30 seconds ago. We don't want to put that kind of low-level knowlede about storage details in each different filesystem. A much better place to put that kind of smarts is in a multipath module which sits in between the device and the filesystem. It can retry writes from a transient failure, if a path goes down or if a iSCSI device temporarily drops off the network. But if a filesystem gets a write failure, it has to assume that the write failure is permanent. The question though is what should you do if you have a write failure in various different parts of the disk? If you have a write failure in a data block, you can return -EIO to the user. You could try reallocating to find another block, and try writing to that alternate location (although with modern filesystems that do block remapping, this is largely pointless, since an EIO failure on write probably means you've lost connectivity to the disk or the disk as run out of spare blocks). But for a failure to write to the a critical part of the filesystem, like the inode table, or failure to write to the journal, what the heck can you do? Remounting read-only is probably the best thing you can do. In theory, if it is a failure to write to the journal, you could fall back to no-journaled operation, and if ext3 could support ...
On Wed, 4 Jun 2008 17:22:02 -0400 To that sector, yes. But to the entire partition? (Well, if the entire partition became unwriteable then we don't have a problem. It's "parts are writeable" or "it became writeable again Absolutely. But afaict this patch changes things so that if we get a write failure in a data block we make the entire fs read-only. Which, as I said, is often "dead box". Ah, yes, I agree, a write error to the journal or to metadata is a quite different thing. An unwriteable journal block is surely terminal. And a write error during metadata checkpointing is pretty horrid ebcause we've (potentially) already told userspace that the write was successful. --
For failures to write to data blocks, I don't think any other filesystems turn the filesystem read-only. Not that many other filesystems do the remount on read-only thing in general; remounting No, but I think it's insane to put any of this into the filesystem. I don't want to put words in other people's mouths, but Mingming and I were chatting with Chris Mason the last two days at a BTRFS workshop (which is why we've been a bit slow on responding), and when discussed this thread informally, he agreed that it was really bad idea to try I agree it's a bad idea. OTOH, we really need a good way of notifying a system daemon or administrator, and not just rely on the application to DTRT when it receives an -EIO. Probably what we should do is (1) return -EIO, (2) send a uevent that includes as much information as possible. At the minimum, the block that had the write (or read) error, and if available the file name involved, and application and/or pid involved. That way, the policy of what should happen in case of a data I/O error can be informed by what write just failed. It might be that if the failure is to some critical application data, the right thing to do is to kill of the application server and let the HA system bring up the hotbackup. Or if the failure is writing to a log file, some other recovery procedure is the right thing. But forcing the entire filesystem read-only just because of a write Agreed, and it's not appropriate. I could imagine that for some setups it is the right policy, but the kernel should not be setting policy like this. Maybe as a new tunable in the superblock, or maybe via a round-trip to userspace via a uevent, but certainly not as the new default behavior. Apologies, I'm still catching up on patches sent in the past few days, so I haven't had a chance to do a detailed review on Kawai-san's patches yet. I do agree that if this patch is forcing the entire filesystem read-only on a write-failure to a data block, it's probably not ...
OK, you've convinced me that aborting journal because of EIO on data block isn't the best thing to do. But it is a bit problematic to return EIO to userspace - we usually find this in commit code so we work on behalf of kjournald and user already thinks the write has succeeded (which is basically problem of any cached write). So the best thing we can do is set AS_EIO on the mapping and hope userspace will notice (we report such error when user does fsync/sync/sync_file_range). Sending some event to userspace is certainly possible but can hardly be Yes, I believe a tunable in superblock controlling how do we behave on Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
My patch doesn't change the policy. JBD aborts the journal when it detects I/O error in file data since 2.6.11. Perhaps this patch: http://marc.info/?l=linux-kernel&m=110483888632225 I agree. I understood that there is a case where we don't want to make the fs read-only when writing file data failed. OTOH there are people who want to make the fs read-only to avoid the damage from expanding. Introducing the tunable would be better. I'm going to send a patch to make this behavior tunable if some of you agree on this way. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
Looking at the code paths touched by patch you referenced, you are correct. And Andrew even signed off on it. :-) But if someone was only examining the patch, it wasn't obvious that the journal was getting aborted when the JBD layer was forcing buffers from t_sync_datalist to disk. So I suspect the change went in without proper consideration of the net effect. You just called it out explicitly in the subject line, which caused Andrew to ask some good Note that doing this right may be tricky, since in the case where we aren't aborting the journal, we need to set the appropriate flags in the page cache so that when the user calls fsync() or close(), that they get the EIO error. - Ted --
Sigh. An object lesson in the value of good changelogging :( I guess we need to undo this. And yes, propagating errors into AS_EIO is the way. I guess that's safe without holding lock_page(), as long as the bh is pinned. --
Something like the following instead if -EIO and journal abort:
if (!buffer_uptodate(bh)) {
set_bit(AS_EIO, &bh->b_page->mapping->flags);
SetPageError(bh->b_page);
}
It seems end_buffer_async_write() does this already, but
journal_do_submit_data() uses end_buffer_write_sync() and it does not
do either of those operations.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
--
Thank you for your suggestion. I wrote an additional patch to do
that below. Please apply it as the 6th patch of this patch series.
BTW, I'm developing a patch which makes "abort the journal if a file
data buffer has an error" tunable. I'll send it in another thread
because it's not a bug fix patch.
Regards,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center
Subject: JBD: don't abort if flushing file data failed
In ordered mode, it is not appropriate behavior to abort the journal
when we failed to write file data. This patch calls printk()
instead of aborting the journal. Additionally, set AS_EIO into
the address_space object of the buffer which is written out by
journal_do_submit_data() and failed so that fsync() can get -EIO.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/jbd/commit.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
Index: linux-2.6.26-rc4/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc4/fs/jbd/commit.c
@@ -432,8 +432,11 @@ void journal_commit_transaction(journal_
wait_on_buffer(bh);
spin_lock(&journal->j_list_lock);
}
- if (unlikely(!buffer_uptodate(bh)))
+ if (unlikely(!buffer_uptodate(bh))) {
+ set_bit(AS_EIO, &bh->b_page->mapping->flags);
+ SetPageError(bh->b_page);
err = -EIO;
+ }
if (!inverted_lock(journal, bh)) {
put_bh(bh);
spin_lock(&journal->j_list_lock);
@@ -452,8 +455,11 @@ void journal_commit_transaction(journal_
}
spin_unlock(&journal->j_list_lock);
- if (err)
- journal_abort(journal, err);
+ if (err) {
+ printk(KERN_WARNING
+ "JBD: Detected IO errors during flushing file data\n");
+ err = 0;
+ }
journal_write_revoke_records(journal, commit_transaction);
--
Actually, you should be more careful here because if the data buffer has been truncated in the currently running transaction, it can happen that b_page->mapping is NULL. It is a question how to safely access Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Hi, Thank you for pointing out this problem. I confirmed that b_page->mapping can be NULL. I'm making sure that the locking page approach works out well. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
I'm not completely sure which "it" you mean in "it is the default today" (seems like you mean errors=continue) but isn't remount-ro already the ext3 default? see commit: 1eca93f9cafdec4a332ace9b0fc0d3886d430c28 --
I agree with Andrew. The historical policy of ext2/3/4 is that write errors for FILE DATA propagate to the application via EIO, regardless of whether ordered mode is active or not. If filesystem METADATA is involved, yes this should cause an ext3_error() to be called and the policy for what to do next is controlled by the administrator. If the journal is aborted for a file data write error, the node maybe panics (errors=panic) or is rebooted by the administrator, then e2fsck is run and no problem is found (which it will not because e2fsck does not check file data), then all that has been accomplished is to reboot the node. Applications should check the error return codes from their writes, and call fsync on the file before closing it if they are really worried about whether the data made it to disk safely, otherwise even a read-only filesystem will not cause the application to notice anything. Now, if we have a problem where the write error from the journal checkpoint is not being propagated back to the original buffer, that is a different issue. For ordered-mode data I don't _think_ that the data buffers are mirrored when a transaction is closed, so as long as the !buffer_uptodate() error is propagated back to the caller on fsync() that is enough. We might also consider having a separate mechanism to handle write failures, but I don't think that that should be intermixed with the ext3 error handling for metadata errors. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. --
Hi, Aborting the journal leads the filesystem to read-only state, and it can prevent inconsistent state in file data from diffusing furthermore. I think it makes sense for ordered mode users; they don't want to lose file updates as much as possible, but they don't want to do fsync. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
Subject: [PATCH 2/5] jbd: ordered data integrity fix
In ordered mode, if a buffer being dirtied exists in the committing
transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it.
But we don't have to remove the buffer from the committing
transaction when the buffer couldn't be written out, otherwise it
would miss the error and the committing transaction would not abort.
This patch adds an error check before removing the buffer from the
committing transaction.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/jbd/transaction.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
Index: linux-2.6.26-rc4/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/transaction.c
+++ linux-2.6.26-rc4/fs/jbd/transaction.c
@@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
journal_t *journal = handle->h_transaction->t_journal;
int need_brelse = 0;
struct journal_head *jh;
+ int ret = 0;
if (is_handle_aborted(handle))
- return 0;
+ return ret;
jh = journal_add_journal_head(bh);
JBUFFER_TRACE(jh, "entry");
@@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
time if it is redirtied */
}
- /* journal_clean_data_list() may have got there first */
+ /*
+ * We cannot remove the buffer with io error from the
+ * committing transaction, because otherwise it would
+ * miss the error and the commit would not abort.
+ */
+ if (unlikely(!buffer_uptodate(bh))) {
+ ret = -EIO;
+ goto no_journal;
+ }
+
if (jh->b_transaction != NULL) {
JBUFFER_TRACE(jh, "unfile from commit");
__journal_temp_unlink_buffer(jh);
@@ -1108,7 +1118,7 @@ no_journal:
}
JBUFFER_TRACE(jh, "exit");
journal_put_journal_head(jh);
- return 0;
+ return ret;
}
/**
--
You can add Acked-by: Jan Kara <jack@suse.cz> -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
On Mon, 02 Jun 2008 19:45:04 +0900 I changed this text to read "if a file data buffer being dirtied". Please do be careful when describing all these buffers, else it gets --
Hi, Sorry for my misleading description. I'll be careful hereafter. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
Subject: [PATCH 3/5] jbd: abort when failed to log metadata buffers If we failed to write metadata buffers to the journal space and succeeded to write the commit record, stale data can be written back to the filesystem as metadata in the recovery phase. To avoid this, when we failed to write out metadata buffers, abort the journal before writing the commit record. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> --- fs/jbd/commit.c | 3 +++ 1 file changed, 3 insertions(+) Index: linux-2.6.26-rc4/fs/jbd/commit.c =================================================================== --- linux-2.6.26-rc4.orig/fs/jbd/commit.c +++ linux-2.6.26-rc4/fs/jbd/commit.c @@ -734,6 +734,9 @@ wait_for_iobuf: /* AKPM: bforget here */ } + if (err) + journal_abort(journal, err); + jbd_debug(3, "JBD: commit phase 6\n"); if (journal_write_commit_record(journal, commit_transaction)) --
Acked-by: Jan Kara <jack@suse.cz> -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
On Mon, 02 Jun 2008 19:46:02 +0900 I assume this has all been tested? How are you finding these problems and testing the fixes? Fault injection? Does it make sense to proceed into phase 6 here, or should we bale out of commit at this point? --
Hi,
Yes, I tested all cases except for the following case (related to
I caused invocations of journal_flush() and __log_wait_for_space() and
a write error simultaneously, but I haven't confirmed the race had
I found these problems by reading souce codes, then tested them
by the fault injection approach. To inject a fault, I used a
What I really want to do is that don't write the commit record when
metadata buffers couldn't be written to the journal.
It should be no problem in the case of writing revoke records failure
because the recovery process detects the invalid control block with
a noncontiguous sequence number.
But it is nonsense to write the commit record even though we failed
to write control blocks to the journal. So I think it makes sense
to catch errors for all writes to the journal here and abort the
journal to avoid writing the commit record.
* * * * * *
The following SystemTap script was used to inject a fault.
Please don't use this script without changing. It is hard-coded
for my environment.
global target_inode_block = 64
/*
* Inject a fault when a particular metadata buffer is journaled.
*/
%{
#include <linux/buffer_head.h>
#include <linux/jbd.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
enum fi_state_bits {
BH_Faulty = BH_Unshadow + 1,
};
%}
function fault_inject (scmd: long) %{
struct scsi_cmnd *cmd = (void *)((unsigned long)THIS->scmd);
cmd->cmnd[0] |= (7 << 5);
cmd->cmd_len = 255;
%}
global do_fault_inject
global faulty_sector
probe module("jbd").function("journal_write_metadata_buffer") {
if ($jh_in->b_bh->b_blocknr == target_inode_block) {
do_fault_inject[tid()] = 1
}
}
probe module("jbd").function("journal_write_metadata_buffer").return {
do_fault_inject[tid()] = 0
}
probe module("jbd").function("journal_file_buffer") {
if (do_fault_inject[tid()] && $jlist == 4 /* BJ_IO */) {
faulty_sector[$jh->b_bh->b_blocknr * 8 + 63] = 1
printf("mark faulty @ ...Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io When a checkpointing IO fails, current JBD code doesn't check the error and continue journaling. This means latest metadata can be lost from both the journal and filesystem. This patch leaves the failed metadata blocks in the journal space and aborts journaling in the case of log_do_checkpoint(). To achieve this, we need to do: 1. don't remove the failed buffer from the checkpoint list where in the case of __try_to_free_cp_buf() because it may be released or overwritten by a later transaction 2. log_do_checkpoint() is the last chance, remove the failed buffer from the checkpoint list and abort the journal 3. when checkpointing fails, don't update the journal super block to prevent the journaled contents from being cleaned. For safety, don't update j_tail and j_tail_sequence either 4. when checkpointing fails, notify this error to the ext3 layer so that ext3 don't clear the needs_recovery flag, otherwise the journaled contents are ignored and cleaned in the recovery phase 5. if the recovery fails, keep the needs_recovery flag 6. prevent cleanup_journal_tail() from being called between __journal_drop_transaction() and journal_abort() (a race issue between journal_flush() and __log_wait_for_space() Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> --- fs/jbd/checkpoint.c | 48 +++++++++++++++++++++++++++++++----------- fs/jbd/journal.c | 28 +++++++++++++++++++----- fs/jbd/recovery.c | 7 ++++-- include/linux/jbd.h | 2 - 4 files changed, 64 insertions(+), 21 deletions(-) Index: linux-2.6.26-rc4/fs/jbd/checkpoint.c =================================================================== --- linux-2.6.26-rc4.orig/fs/jbd/checkpoint.c +++ linux-2.6.26-rc4/fs/jbd/checkpoint.c @@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j int ret = 0; struct buffer_head *bh = jh2bh(jh); - if (jh->b_jlist == BJ_None && !buffer_locked(bh) && ...
Here you update result whenever retry is < 0 and below when result == 0. I think it's better to have these two consistent (not that it would be OK, so this way you've basically serialized all users of log_do_checkpoint(). That should be fine because performance-wise interesting is only log_wait_for_space() and that was already serialized before. So this change is fine with me. Only please add a comment in front of log_do_checkpoint() that it's supposed to be called with j_checkpoint_mutex held so that EIO propagation works correctly. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Hello, Thank you for reviewing. I added a comment. I'll send the revised patch soon. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io When a checkpointing IO fails, current JBD code doesn't check the error and continue journaling. This means latest metadata can be lost from both the journal and filesystem. This patch leaves the failed metadata blocks in the journal space and aborts journaling in the case of log_do_checkpoint(). To achieve this, we need to do: 1. don't remove the failed buffer from the checkpoint list where in the case of __try_to_free_cp_buf() because it may be released or overwritten by a later transaction 2. log_do_checkpoint() is the last chance, remove the failed buffer from the checkpoint list and abort the journal 3. when checkpointing fails, don't update the journal super block to prevent the journaled contents from being cleaned. For safety, don't update j_tail and j_tail_sequence either 4. when checkpointing fails, notify this error to the ext3 layer so that ext3 don't clear the needs_recovery flag, otherwise the journaled contents are ignored and cleaned in the recovery phase 5. if the recovery fails, keep the needs_recovery flag 6. prevent cleanup_journal_tail() from being called between __journal_drop_transaction() and journal_abort() (a race issue between journal_flush() and __log_wait_for_space() Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> --- fs/jbd/checkpoint.c | 49 +++++++++++++++++++++++++++++++----------- fs/jbd/journal.c | 28 ++++++++++++++++++------ fs/jbd/recovery.c | 7 ++++-- include/linux/jbd.h | 2 - 4 files changed, 65 insertions(+), 21 deletions(-) Index: linux-2.6.26-rc4/fs/jbd/checkpoint.c =================================================================== --- linux-2.6.26-rc4.orig/fs/jbd/checkpoint.c +++ linux-2.6.26-rc4/fs/jbd/checkpoint.c @@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j int ret = 0; struct buffer_head *bh = jh2bh(jh); - if (jh->b_jlist == BJ_None && !buffer_locked(bh) && ...
Sorry, I forgot to remove the subject at the first line... -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
Is OK ;) Sometimes it helps, when people use WeirdRandomStuff in their patch titles, or when the patch comes in reply to some email thread which has an irrelevant Subject:. --
You can add: Acked-by: Jan Kara <jack@suse.cz> Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Hi,
I noticed a problem of this patch. Please see below.
3. is implemented as described below.
(1) if log_do_checkpoint() detects an I/O error during
checkpointing, it calls journal_abort() to abort the journal
(2) if the journal has aborted, don't update s_start and s_sequence
in the on-disk journal superblock
So, if the journal aborts, journaled data will be replayed on the
next mount.
Now, please remember that some dirty metadata buffers are written
back to the filesystem without journaling if the journal aborted.
We are happy if all dirty metadata buffers are written to the disk,
the integrity of the filesystem will be kept. However, replaying
the journaled data can overwrite the latest on-disk metadata blocks
partly with old data. It would break the filesystem.
My idea to resolve this problem is that we don't write out metadata
buffers which belong to uncommitted transactions if journal has
aborted. Although the latest filesystem updates will be lost,
we can ensure the integrity. It will also be effective for the
kernel panic in the middle of writing metadata buffers without
journaling (this would occur in the `mount -o errors=panic' case.)
Which integrity or latest state should we choose?
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/jbd/commit.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: linux-2.6.26-rc5-mm3/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc5-mm3/fs/jbd/commit.c
@@ -486,9 +486,10 @@ void journal_commit_transaction(journal_
jh = commit_transaction->t_buffers;
/* If we're in abort mode, we just un-journal the buffer and
- release it for background writing. */
+ release it. */
if (is_journal_aborted(journal)) {
+ clear_buffer_jbddirty(jh2bh(jh));
JBUFFER_TRACE(jh, "journal is aborting: refile");
journal_refile_buffer(journal, jh);
...Yes, it would. But how do you think it can happen that a metadata buffer will be written back to the filesystem when it is a part of running transaction? Note that checkpointing code specifically checks whether the buffer being written back is part of a running transaction and if so, it waits for commit before writing back the buffer. So I don't think this can Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Checkpointing code checks it and may call log_wait_commit(), but this problem is caused by transactions which have not started checkpointing. For example, the tail transaction has an old update for block_B and the running transaction has a new update for block_B. Then, the committing transaction fails to write the commit record, it aborts the journal, and new block_B will be written back to the file system without journaling. Because this patch doesn't separate between normal abort and checkpointing related abort, the tail transaction is left in the journal space. So by replaying the tail transaction, new block_B is overwritten with old one. It can happen in the case of the checkpointing related abort. For example, assuming the tail transaction has an update for block_A, the next transaction has an old update for block_B, and the running transaction has a new update for block_B. Now, the running transaction needs more log space, and it calls log_do_checkpoint(). But it aborts the journal because it detected write error on block_A. In this case, new block_B will be overwritten when the old block_B in the second transaction to the tail is replayed. Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
Yes, and this is expected an correct. When we cannot properly finish a transaction, we have to discard everything in it. A bug would be (and I think it could currently happen) if we already checkpointed the previous transaction and then written over block_B new data from the uncommitted transaction. I think we have to avoid that - i.e., in case we abort the journal we should not mark buffers dirty when processing the forget loop. But this is not too serious since fsck has to be run anyway and it will Well, the scenario has to be a bit different (if we need more space than there is in the journal, we commit the running transaction, do checkpoint and start a new transaction) but something like what you describe could happen. But again I think that this is a correct behavior - i.e., discard all the data in the running transaction when the journal is aborted before the transaction is properly committed. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Hi Jan, Thank you for your reply. Yes. The filesystem should be marked with an error, so fsck will check and recover the filesystem on boot. But this means the filesystem loses some latest updates even if it was cleanly unmounted (although some file data has been lost.) I'm a bit afraid that some people would think of this as a regression due to this PATCH 4/5. At least, to avoid undesirable replay, we had better keep journaled data only when the I think it is correct behavior to discard _all_ metadata updates in the running transaction on abort. But, we don't hope some of metadata updates are often discarded, do we? Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
Hi, I don't think this makes any difference. Look: We have transaction A modifying block B fully committed to the journal. Now there is a running (or committing, it does not really matter) transaction R also modifying block B. Until R gets fully committed, no block modified by R is checkpointed to the device - checkpointing code takes care of that and it must be so to satisfy journaling guarantees. So if we abort journal (for whatever reason) before R is fully committed, no change in R will be seen on the filesystem regardless whether you cleanup the journal or not. And you cannot do much differenly - the principle of journaling is that either all changes in the transaction get to disk or none of them. So until the transaction is properly committed, you have the only way to satisfy this condition - take the "none of them" choice. Now fsck could of course be clever and try to use updates in not fully committed transaction but personally I don't think it's worth the effort. Please correct me if I just misunderstood your point... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Hi, No, changes in R will be seen on the filesystem. The metadata buffer for block B is marked as dirty when it is unfiled whether the journal has aborted or not. Eventually the buffer will be written-back to the filesystem by pdflush. Actually I have confirmed this behavior by using SystemTap. So if both journal abort and system crash happen at the same time, the filesystem would become inconsistent state. As I stated, replaying the journaled block B in transaction A may also corrupt the filesystem, because changes in transaction R are reflected halfway. That is why I sent a patch to Thanks, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
Hello, Ah, ok. Thanks for explanation. Yes, we should not mark buffers dirty when journal is aborted and we unfile them. That should fix the whole problem. Bye Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Subject: [PATCH 5/5] ext3: abort ext3 if the journal has aborted
If the journal has aborted due to a checkpointing failure, we
have to keep the contents of the journal space. ext3_put_super()
detects the journal abort, then it invokes ext3_abort() to make
the filesystem read only and keep needs_recovery flag.
Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/ext3/ioctl.c | 12 ++++++------
fs/ext3/super.c | 11 +++++++++--
2 files changed, 15 insertions(+), 8 deletions(-)
Index: linux-2.6.26-rc4/fs/ext3/ioctl.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/ext3/ioctl.c
+++ linux-2.6.26-rc4/fs/ext3/ioctl.c
@@ -239,7 +239,7 @@ setrsvsz_out:
case EXT3_IOC_GROUP_EXTEND: {
ext3_fsblk_t n_blocks_count;
struct super_block *sb = inode->i_sb;
- int err;
+ int err, err2;
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
@@ -254,16 +254,16 @@ setrsvsz_out:
}
err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
journal_lock_updates(EXT3_SB(sb)->s_journal);
- journal_flush(EXT3_SB(sb)->s_journal);
+ err2 = journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
group_extend_out:
mnt_drop_write(filp->f_path.mnt);
- return err;
+ return (err) ? err : err2;
}
case EXT3_IOC_GROUP_ADD: {
struct ext3_new_group_data input;
struct super_block *sb = inode->i_sb;
- int err;
+ int err, err2;
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
@@ -280,11 +280,11 @@ group_extend_out:
err = ext3_group_add(sb, &input);
journal_lock_updates(EXT3_SB(sb)->s_journal);
- journal_flush(EXT3_SB(sb)->s_journal);
+ err2 = journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
group_add_out:
mnt_drop_write(filp->f_path.mnt);
- return err;
+ return (err) ? err : err2;
}
Index: ...It's a bit scary that LVM will start creating filesystem snapshot when
journal_flush() failed but since there's no way to propagate error further
up, I guess we cannot do anything better.
You can add:
Acked-by: Jan Kara <jack@suse.cz>
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
Just a small note - repeating the subject in the changelog makes it a bit harder for Andrew to merge the patches. Standard merging scripts prepend subject of the email to the changelog below and so in your case the message would be there twice... Just that you know when you create the patch series next time. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Hello, Thank you for letting me know. I'll not do such odd thing next time. Regards, -- Hidehiro Kawai Hitachi, Systems Development Laboratory Linux Technology Center --
