Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

Previous thread: [PATCH] kobject: reorder kobject to save space on 64 bit builds by Richard Kennedy on Monday, June 2, 2008 - 3:07 am. (2 messages)

Next thread: [PATCH net-2.6] [NETFILTER] Misc Cleanups. by Rami Rosen on Monday, June 2, 2008 - 4:11 am. (2 messages)
From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 3:40 am

Subject: [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2)

This patch set is the take 2 of fixing error handling problem in
ext3/JBD.  The previous discussion can be found here:
http://lkml.org/lkml/2008/5/14/10

The same problem should also be in ext4/JBD, but I haven't prepared
it yet.


Problem
=======
Currently some error checkings are missing, so the journal cannot abort
correctly.  This causes breakage of the ordered mode rule and filesystem
corruption.  Missing error checkings are:

(1) error check for dirty buffers flushed before the commit
    (addressed by PATCH 1/5 and 2/5)
(2) error check for the metadata writes to the journal before the
    commit (addressed by PATCH 3/5)
(3) error check for checkpointing and replay (addressed by PATCH 4/5
    and 5/5)


Changes from take 1
===================
[PATCH 1/5]
o not changed

[PATCH 2/5]
o rewrite my coment in journal_dirty_data() comprehensibly

[PATCH 3/5]
o check for errors and abort the journal just before
  journal_write_commit_record() instead of after writing metadata
  buffers

[PATCH 4/5 and 5/5]
o separate the ext3 part from the jbd part in a patch
o use JFS_ABORT for checkpointing failures instead of introducing
  JFS_CP_ABORT flag
o don't update only the journal super block, but also j_tail and
  j_tail_sequence when the journal has aborted (at least we only
  have to avoid updating the super block, but keeping j_tail*'s
  values will be good thing because it may protect someone from
  adding bugs in the future)
o journal_destroy() returns -EIO when the journal has aborted so that
  ext3_put_super() can detect the abort
o journal_flush() uses j_checkpoint_mutex to avoid a race with
  __log_wait_for_space()

The last item targets a newly found problem.  journal_flush() can be
called while processing __log_wait_for_space().  In this case,
cleanup_journal_tail() can be called between
__journal_drop_transaction() and journal_abort(), then 
the transaction with checkpointing ...
From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 3:43 am

Subject: [PATCH 1/5] jbd: strictly check for write errors on data buffers

In ordered mode, we should abort journaling when an I/O error has
occurred on a file data buffer in the committing transaction.  But
there can be data buffers which are not checked for error:

  (a) the buffer which has already been written out by pdflush
  (b) the buffer which has been unlocked before scanned in the
      t_locked_list loop

This patch adds missing error checks and aborts journaling
appropriately.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitacih.com>
Acked-by: Jan Kara <jack@suse.cz>
---
 fs/jbd/commit.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc4/fs/jbd/commit.c
@@ -172,7 +172,7 @@ static void journal_do_submit_data(struc
 /*
  *  Submit all the data buffers to disk
  */
-static void journal_submit_data_buffers(journal_t *journal,
+static int journal_submit_data_buffers(journal_t *journal,
 				transaction_t *commit_transaction)
 {
 	struct journal_head *jh;
@@ -180,6 +180,7 @@ static void journal_submit_data_buffers(
 	int locked;
 	int bufs = 0;
 	struct buffer_head **wbuf = journal->j_wbuf;
+	int err = 0;
 
 	/*
 	 * Whenever we unlock the journal and sleep, things can get added
@@ -253,6 +254,8 @@ write_out_data:
 			put_bh(bh);
 		} else {
 			BUFFER_TRACE(bh, "writeout complete: unfile");
+			if (unlikely(!buffer_uptodate(bh)))
+				err = -EIO;
 			__journal_unfile_buffer(jh);
 			jbd_unlock_bh_state(bh);
 			if (locked)
@@ -271,6 +274,8 @@ write_out_data:
 	}
 	spin_unlock(&journal->j_list_lock);
 	journal_do_submit_data(wbuf, bufs);
+
+	return err;
 }
 
 /*
@@ -410,8 +415,7 @@ void journal_commit_transaction(journal_
 	 * Now start flushing things to disk, in the order they appear
 	 * on the transaction lists.  Data blocks go first.
 	 ...
From: Andrew Morton
Date: Tuesday, June 3, 2008 - 3:30 pm

On Mon, 02 Jun 2008 19:43:57 +0900

Why should we do that?
--

From: Jan Kara
Date: Wednesday, June 4, 2008 - 3:19 am

I see two reasons:
1) If fs below us is returning IO errors, we don't really know how severe
it is so it's safest to stop accepting writes. Also user notices the
problem early this way. I agree that with the growing size of disks and
thus probability of seeing IO error, we should probably think of something
cleverer than this but aborting seems better than just doing nothing.

2) If the IO error is just transient (i.e., link to NAS is disconnected for
a while), we would silently break ordering mode guarantees (user could be
able to see old / uninitialized data).

								Honza

PS: Changed Andreas's address in the email to the new one...
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Andrew Morton
Date: Wednesday, June 4, 2008 - 11:19 am

Does any other filesystem driver turn the fs read-only on the first
write-IO-error?

It seems like a big policy change to me.  For a lot of applications
it's effectively a complete outage and people might get a bit upset if
this happens on the first blip from their NAS.

<waves vigorously at linux-ext4 people>
--

From: Theodore Tso
Date: Wednesday, June 4, 2008 - 2:22 pm

As I told Kawai-san when I met with him and his colleagues in Tokyo
last week, it is the responsibility of the storage stack to retry
errors as appropriate.  From the filesystem perspective, a read or a
write operation succeeds, or fails.  A read or write operation could
take a long time before returning, but the storage stack doesn't get
to return a "fail, but try again at some point; maybe we'll succeed
later, or if you try writing to a different block".  The only sane
thing for a filesystem to do is to treat any failure as a hard
failure.

It is similarly insane to ask a filesystem to figure out that a newly
plugged in USB stick is the same one that the user had accidentally
unplugged 30 seconds ago.  We don't want to put that kind of low-level
knowlede about storage details in each different filesystem.

A much better place to put that kind of smarts is in a multipath
module which sits in between the device and the filesystem.  It can
retry writes from a transient failure, if a path goes down or if a
iSCSI device temporarily drops off the network.  But if a filesystem
gets a write failure, it has to assume that the write failure is
permanent.

The question though is what should you do if you have a write failure
in various different parts of the disk?  If you have a write failure
in a data block, you can return -EIO to the user.  You could try
reallocating to find another block, and try writing to that alternate
location (although with modern filesystems that do block remapping,
this is largely pointless, since an EIO failure on write probably
means you've lost connectivity to the disk or the disk as run out of
spare blocks).  But for a failure to write to the a critical part of
the filesystem, like the inode table, or failure to write to the
journal, what the heck can you do?  Remounting read-only is probably
the best thing you can do.

In theory, if it is a failure to write to the journal, you could fall
back to no-journaled operation, and if ext3 could support ...
From: Andrew Morton
Date: Wednesday, June 4, 2008 - 2:58 pm

On Wed, 4 Jun 2008 17:22:02 -0400



To that sector, yes.  But to the entire partition?

(Well, if the entire partition became unwriteable then we don't have a
problem.  It's "parts are writeable" or "it became writeable again

Absolutely.

But afaict this patch changes things so that if we get a write failure
in a data block we make the entire fs read-only.  Which, as I said, is
often "dead box".


Ah, yes, I agree, a write error to the journal or to metadata is a
quite different thing.  An unwriteable journal block is surely
terminal.  And a write error during metadata checkpointing is pretty
horrid ebcause we've (potentially) already told userspace that the
write was successful.



--

From: Theodore Tso
Date: Wednesday, June 4, 2008 - 3:51 pm

For failures to write to data blocks, I don't think any other
filesystems turn the filesystem read-only.  Not that many other
filesystems do the remount on read-only thing in general; remounting

No, but I think it's insane to put any of this into the filesystem.  I
don't want to put words in other people's mouths, but Mingming and I
were chatting with Chris Mason the last two days at a BTRFS workshop
(which is why we've been a bit slow on responding), and when discussed
this thread informally, he agreed that it was really bad idea to try

I agree it's a bad idea.  OTOH, we really need a good way of notifying
a system daemon or administrator, and not just rely on the application
to DTRT when it receives an -EIO.  Probably what we should do is (1)
return -EIO, (2) send a uevent that includes as much information as
possible.  At the minimum, the block that had the write (or read)
error, and if available the file name involved, and application and/or
pid involved.  That way, the policy of what should happen in case of a
data I/O error can be informed by what write just failed.  It might be
that if the failure is to some critical application data, the right
thing to do is to kill of the application server and let the HA system
bring up the hotbackup.  Or if the failure is writing to a log file,
some other recovery procedure is the right thing.   

But forcing the entire filesystem read-only just because of a write

Agreed, and it's not appropriate.  I could imagine that for some
setups it is the right policy, but the kernel should not be setting
policy like this.  Maybe as a new tunable in the superblock, or maybe
via a round-trip to userspace via a uevent, but certainly not as the
new default behavior.

Apologies, I'm still catching up on patches sent in the past few days,
so I haven't had a chance to do a detailed review on Kawai-san's
patches yet.  I do agree that if this patch is forcing the entire
filesystem read-only on a write-failure to a data block, it's probably
not ...
From: Jan Kara
Date: Thursday, June 5, 2008 - 2:35 am

OK, you've convinced me that aborting journal because of EIO on data
block isn't the best thing to do. But it is a bit problematic to return EIO
to userspace - we usually find this in commit code so we work on behalf of
kjournald and user already thinks the write has succeeded (which is
basically problem of any cached write). So the best thing we can do is set
AS_EIO on the mapping and hope userspace will notice (we report such error
when user does fsync/sync/sync_file_range).
  Sending some event to userspace is certainly possible but can hardly be
  Yes, I believe a tunable in superblock controlling how do we behave on

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Thursday, June 5, 2008 - 4:33 am

My patch doesn't change the policy.  JBD aborts the journal when
it detects I/O error in file data since 2.6.11.  Perhaps this patch:
http://marc.info/?l=linux-kernel&m=110483888632225

I agree.  I understood that there is a case where we don't want to
make the fs read-only when writing file data failed.  OTOH there are
people who want to make the fs read-only to avoid the damage from
expanding.  Introducing the tunable would be better.
I'm going to send a patch to make this behavior tunable if some of you
agree on this way.

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Theodore Tso
Date: Thursday, June 5, 2008 - 7:29 am

Looking at the code paths touched by patch you referenced, you are
correct.  And Andrew even signed off on it.  :-)

But if someone was only examining the patch, it wasn't obvious that
the journal was getting aborted when the JBD layer was forcing buffers
from t_sync_datalist to disk.  So I suspect the change went in without
proper consideration of the net effect.  You just called it out
explicitly in the subject line, which caused Andrew to ask some good

Note that doing this right may be tricky, since in the case where we
aren't aborting the journal, we need to set the appropriate flags in
the page cache so that when the user calls fsync() or close(), that
they get the EIO error.

						- Ted
--

From: Andrew Morton
Date: Thursday, June 5, 2008 - 9:20 am

Sigh. An object lesson in the value of good changelogging :(

I guess we need to undo this.  And yes, propagating errors into AS_EIO
is the way.  I guess that's safe without holding lock_page(), as long
as the bh is pinned.

--

From: Andreas Dilger
Date: Thursday, June 5, 2008 - 11:49 am

Something like the following instead if -EIO and journal abort:

		if (!buffer_uptodate(bh)) {
			set_bit(AS_EIO, &bh->b_page->mapping->flags);
			SetPageError(bh->b_page);
		}

It seems end_buffer_async_write() does this already, but
journal_do_submit_data() uses end_buffer_write_sync() and it does not
do either of those operations.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--

From: Hidehiro Kawai
Date: Monday, June 9, 2008 - 3:09 am

Thank you for your suggestion.  I wrote an additional patch to do
that below.  Please apply it as the 6th patch of this patch series.

BTW, I'm developing a patch which makes "abort the journal if a file
data buffer has an error" tunable.  I'll send it in another thread
because it's not a bug fix patch.

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


Subject: JBD: don't abort if flushing file data failed

In ordered mode, it is not appropriate behavior to abort the journal
when we failed to write file data.  This patch calls printk()
instead of aborting the journal.  Additionally, set AS_EIO into
the address_space object of the buffer which is written out by
journal_do_submit_data() and failed so that fsync() can get -EIO.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/commit.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc4/fs/jbd/commit.c
@@ -432,8 +432,11 @@ void journal_commit_transaction(journal_
 			wait_on_buffer(bh);
 			spin_lock(&journal->j_list_lock);
 		}
-		if (unlikely(!buffer_uptodate(bh)))
+		if (unlikely(!buffer_uptodate(bh))) {
+			set_bit(AS_EIO, &bh->b_page->mapping->flags);
+			SetPageError(bh->b_page);
 			err = -EIO;
+		}
 		if (!inverted_lock(journal, bh)) {
 			put_bh(bh);
 			spin_lock(&journal->j_list_lock);
@@ -452,8 +455,11 @@ void journal_commit_transaction(journal_
 	}
 	spin_unlock(&journal->j_list_lock);
 
-	if (err)
-		journal_abort(journal, err);
+	if (err) {
+		printk(KERN_WARNING
+		       "JBD: Detected IO errors during flushing file data\n");
+		err = 0;
+	}
 
 	journal_write_revoke_records(journal, commit_transaction);
 



--

From: Jan Kara
Date: Wednesday, June 11, 2008 - 5:35 am

Actually, you should be more careful here because if the data buffer has
been truncated in the currently running transaction, it can happen that
b_page->mapping is NULL. It is a question how to safely access

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Thursday, June 12, 2008 - 6:19 am

Hi,


Thank you for pointing out this problem.  I confirmed that
b_page->mapping can be NULL.  I'm making sure that the locking page
approach works out well.

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Mike Snitzer
Date: Wednesday, June 4, 2008 - 8:28 pm

I'm not completely sure which "it" you mean in "it is the default
today" (seems like you mean errors=continue) but isn't remount-ro
already the ext3 default?

see commit: 1eca93f9cafdec4a332ace9b0fc0d3886d430c28
--

From: Andreas Dilger
Date: Wednesday, June 4, 2008 - 2:58 pm

I agree with Andrew.  The historical policy of ext2/3/4 is that write
errors for FILE DATA propagate to the application via EIO, regardless
of whether ordered mode is active or not.  If filesystem METADATA is
involved, yes this should cause an ext3_error() to be called and the
policy for what to do next is controlled by the administrator.

If the journal is aborted for a file data write error, the node maybe
panics (errors=panic) or is rebooted by the administrator, then e2fsck
is run and no problem is found (which it will not because e2fsck does
not check file data), then all that has been accomplished is to reboot
the node.

Applications should check the error return codes from their writes,
and call fsync on the file before closing it if they are really worried
about whether the data made it to disk safely, otherwise even a read-only
filesystem will not cause the application to notice anything.

Now, if we have a problem where the write error from the journal checkpoint
is not being propagated back to the original buffer, that is a different
issue.  For ordered-mode data I don't _think_ that the data buffers are
mirrored when a transaction is closed, so as long as the !buffer_uptodate()
error is propagated back to the caller on fsync() that is enough.

We might also consider having a separate mechanism to handle write failures,
but I don't think that that should be intermixed with the ext3 error handling
for metadata errors.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--

From: Hidehiro Kawai
Date: Wednesday, June 4, 2008 - 3:53 am

Hi,


Aborting the journal leads the filesystem to read-only state, and it
can prevent inconsistent state in file data from diffusing furthermore. 
I think it makes sense for ordered mode users; they don't want to
lose file updates as much as possible, but they don't want to do fsync.

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 3:45 am

Subject: [PATCH 2/5] jbd: ordered data integrity fix

In ordered mode, if a buffer being dirtied exists in the committing
transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it.
But we don't have to remove the buffer from the committing
transaction when the buffer couldn't be written out, otherwise it
would miss the error and the committing transaction would not abort.

This patch adds an error check before removing the buffer from the
committing transaction.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/transaction.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/transaction.c
+++ linux-2.6.26-rc4/fs/jbd/transaction.c
@@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
 	journal_t *journal = handle->h_transaction->t_journal;
 	int need_brelse = 0;
 	struct journal_head *jh;
+	int ret = 0;
 
 	if (is_handle_aborted(handle))
-		return 0;
+		return ret;
 
 	jh = journal_add_journal_head(bh);
 	JBUFFER_TRACE(jh, "entry");
@@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
 				   time if it is redirtied */
 			}
 
-			/* journal_clean_data_list() may have got there first */
+			/*
+			 * We cannot remove the buffer with io error from the
+			 * committing transaction, because otherwise it would
+			 * miss the error and the commit would not abort.
+			 */
+			if (unlikely(!buffer_uptodate(bh))) {
+				ret = -EIO;
+				goto no_journal;
+			}
+
 			if (jh->b_transaction != NULL) {
 				JBUFFER_TRACE(jh, "unfile from commit");
 				__journal_temp_unlink_buffer(jh);
@@ -1108,7 +1118,7 @@ no_journal:
 	}
 	JBUFFER_TRACE(jh, "exit");
 	journal_put_journal_head(jh);
-	return 0;
+	return ret;
 }
 
 /**


--

From: Jan Kara
Date: Monday, June 2, 2008 - 4:59 am

You can add
  Acked-by: Jan Kara <jack@suse.cz>

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Andrew Morton
Date: Tuesday, June 3, 2008 - 3:33 pm

On Mon, 02 Jun 2008 19:45:04 +0900

I changed this text to read "if a file data buffer being dirtied". 
Please do be careful when describing all these buffers, else it gets
--

From: Hidehiro Kawai
Date: Wednesday, June 4, 2008 - 3:55 am

Hi,

 
Sorry for my misleading description.  I'll be careful hereafter. 

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 3:46 am

Subject: [PATCH 3/5] jbd: abort when failed to log metadata buffers

If we failed to write metadata buffers to the journal space and
succeeded to write the commit record, stale data can be written
back to the filesystem as metadata in the recovery phase.

To avoid this, when we failed to write out metadata buffers,
abort the journal before writing the commit record.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/commit.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6.26-rc4/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc4/fs/jbd/commit.c
@@ -734,6 +734,9 @@ wait_for_iobuf:
 		/* AKPM: bforget here */
 	}
 
+	if (err)
+		journal_abort(journal, err);
+
 	jbd_debug(3, "JBD: commit phase 6\n");
 
 	if (journal_write_commit_record(journal, commit_transaction))


--

From: Jan Kara
Date: Monday, June 2, 2008 - 5:00 am

Acked-by: Jan Kara <jack@suse.cz>

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Andrew Morton
Date: Tuesday, June 3, 2008 - 3:35 pm

On Mon, 02 Jun 2008 19:46:02 +0900

I assume this has all been tested?

How are you finding these problems and testing the fixes?  Fault
injection?

Does it make sense to proceed into phase 6 here, or should we bale out
of commit at this point?

--

From: Hidehiro Kawai
Date: Wednesday, June 4, 2008 - 3:57 am

Hi,


Yes,  I tested all cases except for the following case (related to

I caused invocations of journal_flush() and __log_wait_for_space() and
a write error simultaneously, but I haven't confirmed the race had

I found these problems by reading souce codes, then tested them
by the fault injection approach.  To inject a fault, I used a

What I really want to do is that don't write the commit record when
metadata buffers couldn't be written to the journal.
It should be no problem in the case of writing revoke records failure
because the recovery process detects the invalid control block with
a noncontiguous sequence number.
But it is nonsense to write the commit record even though we failed
to write control blocks to the journal.  So I think it makes sense
to catch errors for all writes to the journal here and abort the
journal to avoid writing the commit record.

* * * * * *

The following SystemTap script was used to inject a fault.
Please don't use this script without changing.  It is hard-coded
for my environment.


global target_inode_block = 64
/*
 * Inject a fault when a particular metadata buffer is journaled.
 */

%{
#include <linux/buffer_head.h>
#include <linux/jbd.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>

enum fi_state_bits {
	BH_Faulty = BH_Unshadow + 1,
};
%}

function fault_inject (scmd: long) %{
	struct scsi_cmnd *cmd = (void *)((unsigned long)THIS->scmd);
	cmd->cmnd[0] |= (7 << 5);
	cmd->cmd_len = 255;
%}

global do_fault_inject
global faulty_sector
probe module("jbd").function("journal_write_metadata_buffer") {
	if ($jh_in->b_bh->b_blocknr == target_inode_block) {
		do_fault_inject[tid()] = 1
	}
}
probe module("jbd").function("journal_write_metadata_buffer").return {
	do_fault_inject[tid()] = 0
}

probe module("jbd").function("journal_file_buffer") {
	if (do_fault_inject[tid()] && $jlist == 4 /* BJ_IO */) {
		faulty_sector[$jh->b_bh->b_blocknr * 8 + 63] = 1
		printf("mark faulty @ ...
From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 3:47 am

Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io

When a checkpointing IO fails, current JBD code doesn't check the
error and continue journaling.  This means latest metadata can be
lost from both the journal and filesystem.

This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of log_do_checkpoint().
To achieve this, we need to do:

1. don't remove the failed buffer from the checkpoint list where in
   the case of __try_to_free_cp_buf() because it may be released or
   overwritten by a later transaction
2. log_do_checkpoint() is the last chance, remove the failed buffer
   from the checkpoint list and abort the journal
3. when checkpointing fails, don't update the journal super block to
   prevent the journaled contents from being cleaned.  For safety,
   don't update j_tail and j_tail_sequence either
4. when checkpointing fails, notify this error to the ext3 layer so
   that ext3 don't clear the needs_recovery flag, otherwise the
   journaled contents are ignored and cleaned in the recovery phase
5. if the recovery fails, keep the needs_recovery flag
6. prevent cleanup_journal_tail() from being called between
   __journal_drop_transaction() and journal_abort() (a race issue
   between journal_flush() and __log_wait_for_space()

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/checkpoint.c |   48 +++++++++++++++++++++++++++++++-----------
 fs/jbd/journal.c    |   28 +++++++++++++++++++-----
 fs/jbd/recovery.c   |    7 ++++--
 include/linux/jbd.h |    2 -
 4 files changed, 64 insertions(+), 21 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/checkpoint.c
+++ linux-2.6.26-rc4/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
 	int ret = 0;
 	struct buffer_head *bh = jh2bh(jh);
 
-	if (jh->b_jlist == BJ_None && !buffer_locked(bh) && ...
From: Jan Kara
Date: Monday, June 2, 2008 - 5:44 am

Here you update result whenever retry is < 0 and below when result == 0.
I think it's better to have these two consistent (not that it would be
  OK, so this way you've basically serialized all users of
log_do_checkpoint(). That should be fine because performance-wise interesting
is only log_wait_for_space() and that was already serialized before. So
this change is fine with me. Only please add a comment in front of
log_do_checkpoint() that it's supposed to be called with j_checkpoint_mutex
held so that EIO propagation works correctly.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 9:31 pm

Hello,

Thank you for reviewing.



I added a comment.

I'll send the revised patch soon.

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 9:40 pm

Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io

When a checkpointing IO fails, current JBD code doesn't check the
error and continue journaling.  This means latest metadata can be
lost from both the journal and filesystem.

This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of log_do_checkpoint().
To achieve this, we need to do:

1. don't remove the failed buffer from the checkpoint list where in
   the case of __try_to_free_cp_buf() because it may be released or
   overwritten by a later transaction
2. log_do_checkpoint() is the last chance, remove the failed buffer
   from the checkpoint list and abort the journal
3. when checkpointing fails, don't update the journal super block to
   prevent the journaled contents from being cleaned.  For safety,
   don't update j_tail and j_tail_sequence either
4. when checkpointing fails, notify this error to the ext3 layer so
   that ext3 don't clear the needs_recovery flag, otherwise the
   journaled contents are ignored and cleaned in the recovery phase
5. if the recovery fails, keep the needs_recovery flag
6. prevent cleanup_journal_tail() from being called between
   __journal_drop_transaction() and journal_abort() (a race issue
   between journal_flush() and __log_wait_for_space()

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/checkpoint.c |   49 +++++++++++++++++++++++++++++++-----------
 fs/jbd/journal.c    |   28 ++++++++++++++++++------
 fs/jbd/recovery.c   |    7 ++++--
 include/linux/jbd.h |    2 -
 4 files changed, 65 insertions(+), 21 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/checkpoint.c
+++ linux-2.6.26-rc4/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
 	int ret = 0;
 	struct buffer_head *bh = jh2bh(jh);
 
-	if (jh->b_jlist == BJ_None && !buffer_locked(bh) && ...
From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 10:11 pm

Sorry, I forgot to remove the subject at the first line...

-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Andrew Morton
Date: Monday, June 2, 2008 - 10:20 pm

Is OK ;)

Sometimes it helps, when people use WeirdRandomStuff in their patch
titles, or when the patch comes in reply to some email thread which has
an irrelevant Subject:.

--

From: Jan Kara
Date: Tuesday, June 3, 2008 - 1:02 am

You can add:
  Acked-by: Jan Kara <jack@suse.cz>

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Monday, June 23, 2008 - 4:14 am

Hi,

I noticed a problem of this patch.  Please see below.


3. is implemented as described below.
  (1) if log_do_checkpoint() detects an I/O error during
      checkpointing, it calls journal_abort() to abort the journal
  (2) if the journal has aborted, don't update s_start and s_sequence
      in the on-disk journal superblock

So, if the journal aborts, journaled data will be replayed on the
next mount.

Now, please remember that some dirty metadata buffers are written
back to the filesystem without journaling if the journal aborted.
We are happy if all dirty metadata buffers are written to the disk,
the integrity of the filesystem will be kept.  However, replaying
the journaled data can overwrite the latest on-disk metadata blocks
partly with old data.  It would break the filesystem.

My idea to resolve this problem is that we don't write out metadata
buffers which belong to uncommitted transactions if journal has
aborted.  Although the latest filesystem updates will be lost,
we can ensure the integrity.  It will also be effective for the
kernel panic in the middle of writing metadata buffers without
journaling (this would occur in the `mount -o errors=panic' case.)

Which integrity or latest state should we choose?

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/jbd/commit.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc5-mm3/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc5-mm3/fs/jbd/commit.c
@@ -486,9 +486,10 @@ void journal_commit_transaction(journal_
 		jh = commit_transaction->t_buffers;
 
 		/* If we're in abort mode, we just un-journal the buffer and
-		   release it for background writing. */
+		   release it. */
 
 		if (is_journal_aborted(journal)) {
+			clear_buffer_jbddirty(jh2bh(jh));
 			JBUFFER_TRACE(jh, "journal is aborting: refile");
 			journal_refile_buffer(journal, jh);
 ...
From: Jan Kara
Date: Monday, June 23, 2008 - 5:22 am

Yes, it would. But how do you think it can happen that a metadata buffer
will be written back to the filesystem when it is a part of running
transaction? Note that checkpointing code specifically checks whether the
buffer being written back is part of a running transaction and if so, it
waits for commit before writing back the buffer. So I don't think this can

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Tuesday, June 24, 2008 - 4:52 am

Checkpointing code checks it and may call log_wait_commit(), but this
problem is caused by transactions which have not started checkpointing.

For example, the tail transaction has an old update for block_B and
the running transaction has a new update for block_B.  Then, the
committing transaction fails to write the commit record, it aborts the
journal, and new block_B will be written back to the file system without
journaling.  Because this patch doesn't separate between normal abort
and checkpointing related abort, the tail transaction is left in the
journal space.  So by replaying the tail transaction, new block_B is
overwritten with old one.

It can happen in the case of the checkpointing related abort.
For example, assuming the tail transaction has an update for block_A,
the next transaction has an old update for block_B, and the running
transaction has a new update for block_B.
Now, the running transaction needs more log space, and it calls
log_do_checkpoint().  But it aborts the journal because it detected
write error on block_A.  In this case, new block_B will be
overwritten when the old block_B in the second transaction to the tail
is replayed.


Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Jan Kara
Date: Tuesday, June 24, 2008 - 6:33 am

Yes, and this is expected an correct. When we cannot properly finish a
transaction, we have to discard everything in it. A bug would be (and I
think it could currently happen) if we already checkpointed the previous
transaction and then written over block_B new data from the uncommitted
transaction. I think we have to avoid that - i.e., in case we abort the
journal we should not mark buffers dirty when processing the forget loop.
But this is not too serious since fsck has to be run anyway and it will
  Well, the scenario has to be a bit different (if we need more space than
there is in the journal, we commit the running transaction, do checkpoint
and start a new transaction) but something like what you describe could
happen. But again I think that this is a correct behavior - i.e., discard
all the data in the running transaction when the journal is aborted before
the transaction is properly committed. 

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Friday, June 27, 2008 - 1:06 am

Hi Jan,
Thank you for your reply.



Yes.  The filesystem should be marked with an error, so fsck will check
and recover the filesystem on boot.  But this means the filesystem loses
some latest updates even if it was cleanly unmounted (although some file
data has been lost.)  I'm a bit afraid that some people would think of
this as a regression due to this PATCH 4/5.  At least, to avoid
undesirable replay, we had better keep journaled data only when the

I think it is correct behavior to discard _all_ metadata updates
in the running transaction on abort.  But, we don't hope some of
metadata updates are often discarded, do we?

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Jan Kara
Date: Friday, June 27, 2008 - 3:24 am

Hi,

  I don't think this makes any difference. Look: We have transaction A
modifying block B fully committed to the journal. Now there is a running
(or committing, it does not really matter) transaction R also modifying block
B. Until R gets fully committed, no block modified by R is checkpointed
to the device - checkpointing code takes care of that and it must be so
to satisfy journaling guarantees.
  So if we abort journal (for whatever reason) before R is fully committed,
no change in R will be seen on the filesystem regardless whether you
cleanup the journal or not.
  And you cannot do much differenly - the principle of journaling is that
either all changes in the transaction get to disk or none of them. So until
the transaction is properly committed, you have the only way to satisfy
this condition - take the "none of them" choice.
  Now fsck could of course be clever and try to use updates in not fully
committed transaction but personally I don't think it's worth the effort.
  Please correct me if I just misunderstood your point...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Sunday, June 29, 2008 - 10:09 pm

Hi,


No, changes in R will be seen on the filesystem.
The metadata buffer for block B is marked as dirty when it is unfiled
whether the journal has aborted or not.  Eventually the buffer will be
written-back to the filesystem by pdflush.  Actually I have confirmed
this behavior by using SystemTap.  So if both journal abort and
system crash happen at the same time, the filesystem would become
inconsistent state.  As I stated, replaying the journaled block B in
transaction A may also corrupt the filesystem, because changes in
transaction R are reflected halfway.  That is why I sent a patch to

Thanks,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

From: Jan Kara
Date: Monday, July 7, 2008 - 3:07 am

Hello,

  Ah, ok. Thanks for explanation. Yes, we should not mark buffers dirty
when journal is aborted and we unfile them. That should fix the whole
problem.
								Bye
									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 3:48 am

Subject: [PATCH 5/5] ext3: abort ext3 if the journal has aborted

If the journal has aborted due to a checkpointing failure, we
have to keep the contents of the journal space.  ext3_put_super()
detects the journal abort, then it invokes ext3_abort() to make
the filesystem read only and keep needs_recovery flag.

Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
---
 fs/ext3/ioctl.c |   12 ++++++------
 fs/ext3/super.c |   11 +++++++++--
 2 files changed, 15 insertions(+), 8 deletions(-)

Index: linux-2.6.26-rc4/fs/ext3/ioctl.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/ext3/ioctl.c
+++ linux-2.6.26-rc4/fs/ext3/ioctl.c
@@ -239,7 +239,7 @@ setrsvsz_out:
 	case EXT3_IOC_GROUP_EXTEND: {
 		ext3_fsblk_t n_blocks_count;
 		struct super_block *sb = inode->i_sb;
-		int err;
+		int err, err2;
 
 		if (!capable(CAP_SYS_RESOURCE))
 			return -EPERM;
@@ -254,16 +254,16 @@ setrsvsz_out:
 		}
 		err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
 		journal_lock_updates(EXT3_SB(sb)->s_journal);
-		journal_flush(EXT3_SB(sb)->s_journal);
+		err2 = journal_flush(EXT3_SB(sb)->s_journal);
 		journal_unlock_updates(EXT3_SB(sb)->s_journal);
 group_extend_out:
 		mnt_drop_write(filp->f_path.mnt);
-		return err;
+		return (err) ? err : err2;
 	}
 	case EXT3_IOC_GROUP_ADD: {
 		struct ext3_new_group_data input;
 		struct super_block *sb = inode->i_sb;
-		int err;
+		int err, err2;
 
 		if (!capable(CAP_SYS_RESOURCE))
 			return -EPERM;
@@ -280,11 +280,11 @@ group_extend_out:
 
 		err = ext3_group_add(sb, &input);
 		journal_lock_updates(EXT3_SB(sb)->s_journal);
-		journal_flush(EXT3_SB(sb)->s_journal);
+		err2 = journal_flush(EXT3_SB(sb)->s_journal);
 		journal_unlock_updates(EXT3_SB(sb)->s_journal);
 group_add_out:
 		mnt_drop_write(filp->f_path.mnt);
-		return err;
+		return (err) ? err : err2;
 	}
 
 
Index: ...
From: Jan Kara
Date: Monday, June 2, 2008 - 5:49 am

It's a bit scary that LVM will start creating filesystem snapshot when
journal_flush() failed but since there's no way to propagate error further
up, I guess we cannot do anything better.
  You can add:
    Acked-by: Jan Kara <jack@suse.cz>

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Jan Kara
Date: Monday, June 2, 2008 - 5:05 am

Just a small note - repeating the subject in the changelog makes it a bit
harder for Andrew to merge the patches. Standard merging scripts prepend
subject of the email to the changelog below and so in your case the message
would be there twice... Just that you know when you create the patch series
next time.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Hidehiro Kawai
Date: Monday, June 2, 2008 - 9:30 pm

Hello,


Thank you for letting me know.  I'll not do such odd thing next time.

Regards,
-- 
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

--

Previous thread: [PATCH] kobject: reorder kobject to save space on 64 bit builds by Richard Kennedy on Monday, June 2, 2008 - 3:07 am. (2 messages)

Next thread: [PATCH net-2.6] [NETFILTER] Misc Cleanups. by Rami Rosen on Monday, June 2, 2008 - 4:11 am. (2 messages)