Re: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

Previous thread: 2.6.25-rc4-ext4-1 released by Theodore Ts'o on Wednesday, March 5, 2008 - 6:18 pm. (1 message)

Next thread: [PATCH 01/16] acpi: replace remaining __FUNCTION__ occurrences by Harvey Harrison on Wednesday, March 5, 2008 - 7:24 pm. (1 message)
From: Duane Griffin
Date: Wednesday, March 5, 2008 - 6:59 pm

At present, as discussed in this LKML thread,
http://marc.info/?l=linux-kernel&m=117607695406580, when a dirty ext3
filesystem is mounted read-only it writes to the disk while replaying the
journal log and cleaning up the orphan list. This behaviour may surprise users
and can potentially cause data corruption/loss (e.g. if a system is suspended,
booted into a different OS, then resumed).

This patch series attempts to address this by using a block translation table
instead of replaying the journal on a read-only filesystem.

Patches 1-3 are independent cleanups/bug-fixes for things I came across while
working on this. They could be submitted separately and are not required for
following patches.

Patch 4 is a refactoring change that simplifies the code prior to later
substantive changes.

Patch 5 introduces the translation table and support for a truly read-only
journal into jbd.

Patch 6 uses the facility introduced in patch 5 to add support for true
read-only ext3.

For testing I've been using qemu VMs to create and mount dirtied filesystems. I
have a set of scripts that fully automates creating a dirty filesystem then
checking mounting read-only and read-write produces consistent results. On my
system it can get through around ~30 iteration overnight. If anyone is
interested in the scripts please let me know. Any suggestions for additional
tests or enhancements that could be made to the scripts would be gratefully
received.

TODO:
 * Add R/W remount support
 * Port to ext4

Cheers,
 fs/ext3/balloc.c        |    2 +-
 fs/ext3/ialloc.c        |    2 +-
 fs/ext3/inode.c         |    8 +-
 fs/ext3/resize.c        |    2 +-
 fs/ext3/super.c         |  123 ++++++++++-----
 fs/ext3/xattr.c         |    8 +-
 fs/jbd/checkpoint.c     |    2 +-
 fs/jbd/commit.c         |    2 +-
 fs/jbd/journal.c        |   68 +++++---
 fs/jbd/recovery.c       |  402 +++++++++++++++++++++++++++++++++--------------
 fs/jbd/revoke.c         |  133 ++++++----------
 fs/ocfs2/journal.c      ...
From: Duane Griffin
Date: Wednesday, March 5, 2008 - 6:59 pm

The revocation table initialisation/destruction code is repeated for each of
the two revocation tables stored in the journal. Refactoring the duplicated
code into functions is tidier, simplifies the logic in initialisation in
particular, and slightly reduces the code size.

There should not be any functional change.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
---

This change is an independent cleanup which is not required by following
patches in this series. Also it should perhaps be two separate patches?

 fs/jbd/revoke.c |  126 +++++++++++++++++++++++--------------------------------
 1 files changed, 52 insertions(+), 74 deletions(-)

diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
index ad2eacf..5f64df4 100644
--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -195,108 +195,86 @@ void journal_destroy_revoke_caches(void)
 	revoke_table_cache = NULL;
 }
 
-/* Initialise the revoke table for a given journal to a given size. */
-
-int journal_init_revoke(journal_t *journal, int hash_size)
+static int journal_init_revoke_table(struct jbd_revoke_table_s *table, int size)
 {
-	int shift, tmp;
-
-	J_ASSERT (journal->j_revoke_table[0] == NULL);
-
-	shift = 0;
-	tmp = hash_size;
+	int shift = 0;
+	int tmp = size;
 	while((tmp >>= 1UL) != 0UL)
 		shift++;
 
-	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
-	if (!journal->j_revoke_table[0])
-		return -ENOMEM;
-	journal->j_revoke = journal->j_revoke_table[0];
-
-	/* Check that the hash_size is a power of two */
-	J_ASSERT(is_power_of_2(hash_size));
-
-	journal->j_revoke->hash_size = hash_size;
-
-	journal->j_revoke->hash_shift = shift;
-
-	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
-	if (!journal->j_revoke->hash_table) {
-		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
-		journal->j_revoke = NULL;
+	table->hash_size = size;
+	table->hash_shift = shift;
+	table->hash_table = kmalloc(
+		size * sizeof(struct ...
From: Duane Griffin
Date: Wednesday, March 5, 2008 - 6:59 pm

If an error occurs during jbd cache initialisation it is possible for the
journal_head_cache to be NULL when journal_destroy_journal_head_cache is
called. Replace the J_ASSERT with an if block to handle the situation
correctly.

Note that even with this fix things will break badly if ext3 is statically
compiled in and cache initialisation fails.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
---

At the moment the cache destruction methods are inconsistent in whether they
set the cache pointers to NULL after destroying them. To be precise,
journal_destroy_handle_cache doesn't, the others do. I haven't changed that,
although I was sorely tempted. I think it would be better to be consistent and
that NULLing the pointers is preferable. Any objections?

This patch is an independent bug fix and following patches in this series do
not depend on it.

 fs/jbd/journal.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 3943a89..3f334a3 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -1635,9 +1635,10 @@ static int journal_init_journal_head_cache(void)
 
 static void journal_destroy_journal_head_cache(void)
 {
-	J_ASSERT(journal_head_cache != NULL);
-	kmem_cache_destroy(journal_head_cache);
-	journal_head_cache = NULL;
+	if (journal_head_cache) {
+		kmem_cache_destroy(journal_head_cache);
+		journal_head_cache = NULL;
+	}
 }
 
 /*
-- 
1.5.3.7

--

From: Duane Griffin
Date: Wednesday, March 5, 2008 - 6:59 pm

JBD debugfs entries should only be created if cache initialisation is
successful. At the moment they are being created unconditionally which will
leave them dangling if cache (and hence module) initialisation fails.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
---

This patch is an independent bug fix and following patches in this series do
not depend on it.

 fs/jbd/journal.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 3f334a3..a868277 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -1942,9 +1942,10 @@ static int __init journal_init(void)
 	BUILD_BUG_ON(sizeof(struct journal_superblock_s) != 1024);
 
 	ret = journal_init_caches();
-	if (ret != 0)
+	if (ret == 0)
+		jbd_create_debugfs_entry();
+	else
 		journal_destroy_caches();
-	jbd_create_debugfs_entry();
 	return ret;
 }
 
-- 
1.5.3.7

--

From: Duane Griffin
Date: Wednesday, March 5, 2008 - 6:59 pm

The do_one_pass function iterates through the jbd log operating on blocks
depending on the pass. During the replay pass descriptor blocks are processed
by an inner loop which replays them. The nesting makes the code difficult to
follow or to modify. Splitting the inner loop and replay code into separate
functions simplifies matters.

The refactored code no longer unnecessarily reads revoked data blocks, so
potential read errors from such blocks will cause differing behaviour. Aside
from that there should be no functional effect.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
---

Note the TODO before the block read in the outer loop below. We could possibly
attempt to continue after a failed read as we do when replaying descriptor
blocks. However if it was a descriptor block we'd likely bomb out on the next
iteration anyway, so I'm not sure it would be worth it. Thoughts?

 fs/jbd/recovery.c |  243 ++++++++++++++++++++++++++++-------------------------
 1 files changed, 127 insertions(+), 116 deletions(-)

diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
index 2b8edf4..e2ac78f 100644
--- a/fs/jbd/recovery.c
+++ b/fs/jbd/recovery.c
@@ -307,6 +307,101 @@ int journal_skip_recovery(journal_t *journal)
 	return err;
 }
 
+static int replay_data_block(
+	journal_t *journal, struct buffer_head *obh, char *data,
+	int flags, unsigned long blocknr)
+{
+	struct buffer_head *nbh;
+
+	/* Find a buffer for the new data being restored */
+	nbh = __getblk(journal->j_fs_dev, blocknr, journal->j_blocksize);
+	if (nbh == NULL)
+		return -ENOMEM;
+
+	lock_buffer(nbh);
+	memcpy(nbh->b_data, obh->b_data, journal->j_blocksize);
+	if (flags & JFS_FLAG_ESCAPE)
+		*((__be32 *)data) = cpu_to_be32(JFS_MAGIC_NUMBER);
+
+	BUFFER_TRACE(nbh, "marking dirty");
+	set_buffer_uptodate(nbh);
+	mark_buffer_dirty(nbh);
+	BUFFER_TRACE(nbh, "marking uptodate");
+	/* ll_rw_block(WRITE, 1, &nbh); */
+	unlock_buffer(nbh);
+	brelse(nbh);
+
+	return 0;
+}
+
+/* Replay a descriptor block: write ...
From: Duane Griffin
Date: Wednesday, March 5, 2008 - 6:59 pm

Read-only log recovery allows a dirty journalled filesystem to be mounted and
provide a consistent view of its data without writing to the disk. Instead of
replaying the log a mapping is constructed between modified filesystem blocks
and the journal blocks containing their data.

The mapping is stored in a hashtable that is created and populated during
journal recovery. The hash function used is the same one used by the journal
revocation code. The size of the hashtable is currently being arbitrarily set
to 256 entries. Given that we know the number of block mappings when we create
the table it may be worth dynamically calculating an appropriate size instead
of hard-coding it.

Signed-off-by: Duane Griffin <duaneg@dghda.com>
---

I'm tempted to add warnings on attempts to modify a read-only journal, does
that sound useful? I'd also be grateful to anyone with suggestions for better
member names. :)

 fs/ext3/super.c     |    2 +-
 fs/jbd/checkpoint.c |    2 +-
 fs/jbd/commit.c     |    2 +-
 fs/jbd/journal.c    |   56 ++++++++++------
 fs/jbd/recovery.c   |  187 +++++++++++++++++++++++++++++++++++++++++++++------
 fs/jbd/revoke.c     |    7 +--
 fs/ocfs2/journal.c  |    4 +-
 include/linux/jbd.h |   41 ++++++++++-
 8 files changed, 246 insertions(+), 55 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 18769cc..4b9ff65 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2168,7 +2168,7 @@ static int ext3_load_journal(struct super_block *sb,
 	if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER))
 		err = journal_wipe(journal, !really_read_only);
 	if (!err)
-		err = journal_load(journal);
+		err = journal_load(journal, !really_read_only);
 
 	if (err) {
 		printk(KERN_ERR "EXT3-fs: error loading journal.\n");
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index a5432bb..84d4579 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -454,7 +454,7 @@ int cleanup_journal_tail(journal_t *journal)
 	journal->j_tail = ...
From: Duane Griffin
Date: Wednesday, March 5, 2008 - 6:59 pm

If a filesystem is being mounted read-only then load the journal read-only
(i.e. do not replay the log) and do not process the orphan list.

At present, when a dirty ext3 filesystem is mounted read-only, it writes to the
disk while replaying the journal log and cleaning up the orphan list. This
behaviour may surprise users and can potentially cause data corruption/loss
(e.g. if a system is suspended, booted into a different OS, then resumed).

Introduce block accessor wrapper functions that check for blocks requiring
translation and access them from the journal as needed. Convert the ext3 code
to use the wrappers anywhere that may be called on a read-only filesystem. Code
that is only called when writing to the filesystem is not converted.

For a discussion of the need for this feature, see:
http://marc.info/?l=linux-kernel&m=117607695406580

Tested by creating dirty filesystems, mounting a copy read-write, taking the
md5sum of all its files, mounting a different copy read-only, confirming the
md5sums match, unmounting it, then confirming the block device mounted
read-only has not been modified. Testing has been done with both internal and
external journals.

NOTE: For now I'm simply preventing filesystems requiring recovery from being
remounted read-write. This breaks booting with an uncleanly mounted root
filesystem!

Signed-off-by: Duane Griffin <duaneg@dghda.com>
---

I went back and forth on converting all code to use the accessors. In the end
it felt silly to shunt write code through a wrapper that was only useful in a
read-only filesystem. On the other hand, the inconsistency bothers me and
presumably increases the risks of future changes directly operating on
filesysem blocks when they need to go through the wrappers. I'd appreciate
second opinions on this point, in particular.

Also, would it perhaps be better to split this into separate patches to
introduce the accessors and add the no-replay code?

 fs/ext3/balloc.c        |    2 +-
 fs/ext3/ialloc.c        | ...
From: Andrew Morton
Date: Wednesday, March 5, 2008 - 8:42 pm

I'll grab the first three for now, thanks.

Someone(tm) should do the jbd2 versions..
--

From: Duane Griffin
Date: Thursday, March 6, 2008 - 4:20 am

I'll do them tonight, unless someone beats me to it.

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan
--

From: Daniel Phillips
Date: Wednesday, March 12, 2008 - 8:22 pm

Hi Duane,

Thanks for doing this.  Some perhaps not so obvious fallout from the bad 
old way of doing things is that ddnap (zumastor) hits an issue in 
replication.  Since ddsnap allows journal replay on the downstream 
server and also needs to have an unaltered snapshot to apply deltas 
against, if we do not take special care, Ext3 will come along and 
modify the downstream snapshot even when told not to.  Our solution: 
take two snapshots per replication cycle (pretty cheap) so that one can 
be clean and the other can be stepped on at will by the journal replay.
Ugh.

With your hack, we can eventually drop the double snapshot, provided no 
other filesystem is similarly badly behaved.

Re your page translation table: we already have a page translation 
table, it is called the page cache.  If you could figure out which file 
(or metadata) each journal block belongs to, you could just load the 
page table pages back in and presto, done.  No need to replay the 
journal at all, you are already back to journal+disk = consistent 
state.

I probably have missed a detail or two since I haven't looked closely at 
how orphan inodes work, revokes, probably other things, but there is 
the basic idea.  SCT, does my reasoning hold water?  (In fact, 
ddsnap "replays" its own journal in exactly this way.  Cache state is 
reconstructed and no actual journal flush is performed.)

Anyway, this is just a theoretical comment, it is in no way a suggestion 
for a rewrite.  The reason for that being, you do not have any 
convenient way to map physical journal blocks back to files and 
metadata.  Maybe if we do implement reverse mapping for Ext3/4 later 
(not just a pipe dream) we could revisit this and lose your extra 
mapping.  As it stands your solution seems well built, after a quick 
readthrough.  Nice looking code.  I think you added about 250 lines 
overall, so tight too.  Thanks again.

Daniel
--

From: Duane Griffin
Date: Thursday, March 13, 2008 - 5:35 am

Ah, good to know, thanks. It looks like you folks are doing some



Thanks very much, I appreciate it!

Cheers,
Duane.

-- 
"I never could learn to drink that blood and call it wine" - Bob Dylan
--

Previous thread: 2.6.25-rc4-ext4-1 released by Theodore Ts'o on Wednesday, March 5, 2008 - 6:18 pm. (1 message)

Next thread: [PATCH 01/16] acpi: replace remaining __FUNCTION__ occurrences by Harvey Harrison on Wednesday, March 5, 2008 - 7:24 pm. (1 message)