Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased)
This is the rebased version against 2.6.26-rc2, so there is no
essential difference from the previous post.
The previous post can be found at: http://lkml.org/lkml/2008/4/18/154
(The previous post may have been filtered out as SPAM mails
due to a trouble in the mail submission.)This patch set fixes several error handling problems. As the
result, we can save the filesystem from file data and structural
corruption especially caused by temporal I/O errors. Do temporal
I/O errors occur so often? At least it will be not uncommon
for iSCSI storages.
This fixes have been done only for ext3/JBD parts. The ext4/JBD2
version has not been prepared yet, but merging this patch set
will be worthwhile because it takes away possible filesystem
corruption.[PATCH 1/4] jbd: strictly check for write errors on data buffers
Without this patch, some file data in ordered mode aren't
checked for errors. This means user processes can continue to
update the filesystem without noticing the write failure.
Furthermore, the page cache which we failed to write becomes
reclaimable. So if the page cache is reclaimed then we
succeed to read its data from the disk, the data corruption
will occur because the data is old.
Jan's ordered mode rewrite patch also fixes this problem, but
this patch will be needed at least for the current kernel.[PATCH 2/4] jbd: ordered data integrity fix
This patch fixes the ordered mode violation problem caused
by write error. Jan's ordered mode rewrite patch will also
fix this problem.[PATCH 3/4] jbd: abort when failed to log metadata buffers
Without this patch, the filesystem can corrupt along with
the following scenario:1. fail to write a metadata buffer to block B in the journal
2. succeed to write the commit record
3. the system crashes, reboots and mount the filesystem
4. in the recovery phase, succeed to read data from block B
5. write back th...
Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io
When a checkpointing IO fails, current JBD code doesn't check the
error and continue journaling. This means latest metadata can be
lost from both the journal and filesystem.This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of log_do_checkpoint().
To achieve this, we need to do:1. don't remove the failed buffer from the checkpoint list where in
the case of __try_to_free_cp_buf() because it may be released or
overwritten by a later transaction
2. log_do_checkpoint() is the last chance, remove the failed buffer
from the checkpoint list and abort journaling
3. when checkpointing fails, don't update the journal super block to
prevent the journalled contents from being cleaned
4. when checkpointing fails, don't clear the needs_recovery flag,
otherwise the journalled contents are ignored and cleaned in the
recovery phase
5. if the recovery fails, keep the needs_recovery flagSigned-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/ext3/ioctl.c | 12 ++++----
fs/ext3/super.c | 13 +++++++--
fs/jbd/checkpoint.c | 59 ++++++++++++++++++++++++++++++++++--------
fs/jbd/journal.c | 21 +++++++++-----
fs/jbd/recovery.c | 7 +++-
include/linux/jbd.h | 7 ++++
6 files changed, 91 insertions(+), 28 deletions(-)Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c
+++ linux-2.6.26-rc2/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
int ret = 0;
struct buffer_head *bh = jh2bh(jh);- if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
+ if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
+ !buffer_dirty(bh) && buffer_uptodate(bh)) {
JBUFFER_TRACE(jh, "remove from checkpoint list");
ret = __jou...
Hello,
Actually, I don't think this new flag is needed (not that it would really
harm anything). At least at the places where you check it you can as well
check whether the journal is aborted (maybe except for journal_destroy()
Please split ext3 changes into a separate patch. There's no reason
why they should be together with JBD fixes (i.e. JBD fixes don't depend on
I think you should test here whether the journal is aborted (and
EXT3_MOUNT_ABORT isn't set) and if so, call ext3_abort() and just
I don't like this much. I'd rather completely avoid updating j_tail and
j_tail_sequence above in case the journal is aborted but I'd write the
Why do you change the loop? If we fail to checkpoint some buffer, we stopHonza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
Hi,
Thank you for review.As you say, JFS_CP_ABORT isn't necessarily needed in the most cases,
but it is needed for __journal_abort_checkpoint() which report the
checkpoint related error by printk only once. If we use JFS_ABORT
to check whether this call is second time, the error message may be
never printed out because the journal has been aborted by another
reason. If we don't check for the second call, the error message
will be printed out several times.Instead of using __journal_abort_checkpoint(), we'll be able to do
the similar thing by adding the printk() directly in journal_destroy(),
journal_flush(), and __log_wait_for_space() (but it can be more
than one time).I agree that we should not add another flag as much as possible.
So I'll try to revise to remove the flag if you agree to addYour idea looks good. But as Josef pointed out, journal_destroy()
frees sbi->s_journal, so we need to make journal_destroy() return thelog_do_checkpoint() calls cleanup_journal_tail(), which advances
j_tail and j_tail_sequence. So if we adopt the policy that we don't
modify j_tail and j_tail_sequence when checkpointing failed, we should
also fix cleanup_journal_tail().I adopted the policy that we don't update the journal super block
when checkpointing failed. When checkpointing failed, journal_abort()
is called before cleanup_journal_tail(). So what we want to writeCertainly, fsck or journal_recover() replay the unwritten-back metadata
block. journal_destroy() also call log_do_checkpoint() against all
un-checkpointed transactions. I just thought `flush' means writing out
all data which we should write to disk. There is no problem if I
don't change the loop.Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center--
Hello,
Yes, I think that checking out for JFS_ABORT is the right thing to do.
Once the journal has aborted for some reason, it is enough that we print
some error message (and that is responsibility of the first caller of
journal_abort()). Printing that checkpointing has not succeeded as well is
Well, as I said, one flag in journal is not a big deal but I found it a
bit confusing why there's a special flag for checkpoint abort when standard
I see. The thing I'm afraid of with this policy is, that when sometime later
we add somewhere journal_update_superblock() and forget about checking
whether journal isn't aborted, we will magically get filesystem corruption
when IO error happens and that would be really hard to debug. So I'd
"flush" means "make journal empty". If we are not able to do it all, it
does not really matter how much do we manage to write out. So I wouldn't
change the loop.
Honza--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
Hi,
A checkpointing failure is a bit special. If the journal is aborted
by journal_commit_transaction(), the integrity of the filesystem is
ensured although the latest changes will be lost. However, if the
journal is aborted by log_do_checkpoint() and the replay also failed,
the filesystem may be corrupted because some of the metadata blocks
are in old states. In this case, the user will have to recover the
filesystem manually and carefully. So I think printing the specialI can understand your concern as the current JBD updates the journal
super block even if a checkpoint has failed. I think it will be
improved by adding a comment to journal_update_superblock().
For example: don't invoke this function if checkpointing has
failed, otherwise unwritten-back journaled data can be lost.Or we may stop both modifying j_tail and updating the super block
when the journal has aborted (especially for a reason of checkpoint
failure). Of course, __journal_abort_soft() still updates the
super block once.Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center--
Is this really different? If we spot error during checkpointing, we
abort journal (so data still remain in the journal because checkpointing
has failed). On next mount, either replay succeeds and everything is fine,
or we spot error during replay and we should just refuse to mount the
filesystem. So we are correct again - user then has to run fsck to solve
the problem and that will try it's best to get most data readable. So I
OK, the last option is also fine with me. Thanks for your work.Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
--
I understand.
I have thought the following scenario. When we failed to checkpoint
and replay, copy all readable blocks from the bad disk to a good disk,
and we can replay all journaled data on the good disk. But it will be
overkill. Normally we run the fsck when we failed to mount, and it's
enough for us.I'm going to just use JFS_ABORT instead of introducing JFS_CP_ABORT
in the next revision.Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center--
Is this bit here really needed? If we abort the journal the fs will be mounted
read only and we should never get in here. Is there a case where we could abort
the journal and not be flipped to being read-only? If there is such a case I
would think that we should fix that by making the fs read-only, and not haveOther than that one question it looks good to me. Thanks much,
Josef
--
Actually, journal_abort() (which could be called from journal_destroy())
does nothing to the filesystem as such. So at this moment, ext3 can still
happily think everything is OK. We only detect aborted journal in
ext3_journal_start_sb() and call ext3_abort() in that case, which does all
that is needed...Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
Hmm you're right, I was thinking we did some other stuff before put_super which
would have caught a journal abort but it looks like thats not the case. Still
shouldn't do is_checkpoint_aborted(sbi->s_journal) since journal_destroy()
kfree's the journal. Should probably move the is_journal_aborted() check above
that or something. Thanks,Josef
--
Hi,
Thank you for review.
Good catch, I will fix it.
Thanks!--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center--
Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers
If we failed to write metadata buffers to the journal space and
succeeded to write the commit record, stale data can be written
back to the filesystem as metadata in the recovery phase.To avoid this, when we failed to write out metadata buffers,
abort the journal before writing the commit record.Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/jbd/commit.c | 3 +++
1 file changed, 3 insertions(+)Index: linux-2.6.26-rc2/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc2/fs/jbd/commit.c
@@ -703,6 +703,9 @@ wait_for_iobuf:
__brelse(bh);
}+ if (err)
+ journal_abort(journal, err);
+
J_ASSERT (commit_transaction->t_shadow_list == NULL);jbd_debug(3, "JBD: commit phase 5\n");
--
Shouldn't this rather be further just before
journal_write_commit_record()? We should abort also if writing revokeHonza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
Hi,
Thank you for review.
Unlike metadata blocks, each revoke block has a descriptor with the
sequence number of the commiting transaction. If we failed to write
a revoke block, there should be an old control block, metadata block,
or zero-filled block where we tried to write the revoke block.
In the recovery process, this old invalid block is detected by
checking its magic number and sequence number, then the transaction
is ignored even if we have succeeded to write the commit record.
So I think we don't need to check for errors just after writing
revoke records.--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center--
Hello,
Yes, I agree that not doing such check will not cause data corruption but
still I think that in case we fail to properly commit a transaction, we
should detect the error and abort the journal...Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
Hi,
I see. I'll move the aborting point to just before
journal_write_commit_record() in the next version.Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center--
Subject: [PATCH 2/4] jbd: ordered data integrity fix
In ordered mode, if a buffer being dirtied exists in the committing
transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it.
But we don't have to remove the buffer from the committing
transaction when the buffer couldn't be written out, otherwise it
breaks the ordered mode rule.Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
fs/jbd/transaction.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)Index: linux-2.6.26-rc2/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/transaction.c
+++ linux-2.6.26-rc2/fs/jbd/transaction.c
@@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
journal_t *journal = handle->h_transaction->t_journal;
int need_brelse = 0;
struct journal_head *jh;
+ int ret = 0;if (is_handle_aborted(handle))
- return 0;
+ return ret;jh = journal_add_journal_head(bh);
JBUFFER_TRACE(jh, "entry");
@@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
time if it is redirtied */
}- /* journal_clean_data_list() may have got there first */
+ /*
+ * We shouldn't remove the buffer from the committing
+ * transaction if it has failed to be written.
+ * Otherwise, it breaks the ordered mode rule.
+ */
+ if (unlikely(!buffer_uptodate(bh))) {
+ ret = -EIO;
+ goto no_journal;
+ }
+
if (jh->b_transaction != NULL) {
JBUFFER_TRACE(jh, "unfile from commit");
__journal_temp_unlink_buffer(jh);
@@ -1108,7 +1118,7 @@ no_journal:
}
JBUFFER_TRACE(jh, "exit");
journal_put_journal_head(jh);
- return 0;
+ return ret;
}/**
--
Hmm, could you elaborate a bit more what exactly is broken and how does
this help to fix it? Because even if we find EIO happened on data buffer,
we currently don't do anything else than just remove the buffer from the
transaction and abort the journal. And even if we later managed to write
the data buffer from other process before the journal is aborted, ordered
mode guarantees are satisfied - we only guarantee that too old data cannot
be seen, newer can be seen easily... Thanks.--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
Hi,
Thanks for your comment.
In the case where I stated the above, error checking is postponed to
the next (currently running) transaction because the buffer is removed
from the committing transaction before checked for an error. This can
happen repeatedly, then the error won't be detected "for a long time".
However, finally the error is detected by, for example,
journal_commit_transaction(), we can abort the journal. So this
problem is not so serious than the other patches which I sent.Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center--
Hello,
OK, I see. So I agree with the change but please add this explanation
(like: cannot remove buffer with io error from the committing transaction
because otherwise it would miss the error and commit would not abort) to
the comment in journal_dirty_data(). Thanks.Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
Subject: [PATCH 1/4] jbd: strictly check for write errors on data buffers
In ordered mode, we should abort journaling when an I/O error has
occurred on a file data buffer in the committing transaction. But
there can be data buffers which are not checked for error:(a) the buffer which has already been written out by pdflush
(b) the buffer which has been unlocked before scanned in the
t_locked_list loopThis patch adds missing error checks and aborts journaling
appropriately.Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitacih.com>
---
fs/jbd/commit.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)Index: linux-2.6.26-rc2/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc2/fs/jbd/commit.c
@@ -172,7 +172,7 @@ static void journal_do_submit_data(struc
/*
* Submit all the data buffers to disk
*/
-static void journal_submit_data_buffers(journal_t *journal,
+static int journal_submit_data_buffers(journal_t *journal,
transaction_t *commit_transaction)
{
struct journal_head *jh;
@@ -180,6 +180,7 @@ static void journal_submit_data_buffers(
int locked;
int bufs = 0;
struct buffer_head **wbuf = journal->j_wbuf;
+ int err = 0;/*
* Whenever we unlock the journal and sleep, things can get added
@@ -253,6 +254,8 @@ write_out_data:
put_bh(bh);
} else {
BUFFER_TRACE(bh, "writeout complete: unfile");
+ if (unlikely(!buffer_uptodate(bh)))
+ err = -EIO;
__journal_unfile_buffer(jh);
jbd_unlock_bh_state(bh);
if (locked)
@@ -271,6 +274,8 @@ write_out_data:
}
spin_unlock(&journal->j_list_lock);
journal_do_submit_data(wbuf, bufs);
+
+ return err;
}/*
@@ -410,8 +415,7 @@ void journal_commit_transaction(journal_
* Now start flushing things to disk, in the order they appear
* on the transaction lists. Data blocks go first.
*/
- err = 0;
- journal_submi...
Hi,
Acked-by: Jan Kara <jack@suse.cz>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| david | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| David Woodhouse | [PATCH 1/3] firmware: allow firmware files to be built into kernel image |
| Peter Zijlstra | [PATCH 00/23] per device dirty throttling -v8 |
git: | |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Josip Rodin | bnx2_poll panicking kernel |
| Patrick McHardy | Re: [GIT]: Networking |
