At present, as discussed in this LKML thread, http://marc.info/?l=linux-kernel&m=117607695406580, when a dirty ext3 filesystem is mounted read-only it writes to the disk while replaying the journal log and cleaning up the orphan list. This behaviour may surprise users and can potentially cause data corruption/loss (e.g. if a system is suspended, booted into a different OS, then resumed). This patch series attempts to address this by using a block translation table instead of replaying the journal on a read-only filesystem. Patches 1-3 are independent cleanups/bug-fixes for things I came across while working on this. They could be submitted separately and are not required for following patches. Patch 4 is a refactoring change that simplifies the code prior to later substantive changes. Patch 5 introduces the translation table and support for a truly read-only journal into jbd. Patch 6 uses the facility introduced in patch 5 to add support for true read-only ext3. For testing I've been using qemu VMs to create and mount dirtied filesystems. I have a set of scripts that fully automates creating a dirty filesystem then checking mounting read-only and read-write produces consistent results. On my system it can get through around ~30 iteration overnight. If anyone is interested in the scripts please let me know. Any suggestions for additional tests or enhancements that could be made to the scripts would be gratefully received. TODO: * Add R/W remount support * Port to ext4 Cheers, fs/ext3/balloc.c | 2 +- fs/ext3/ialloc.c | 2 +- fs/ext3/inode.c | 8 +- fs/ext3/resize.c | 2 +- fs/ext3/super.c | 123 ++++++++++----- fs/ext3/xattr.c | 8 +- fs/jbd/checkpoint.c | 2 +- fs/jbd/commit.c | 2 +- fs/jbd/journal.c | 68 +++++--- fs/jbd/recovery.c | 402 +++++++++++++++++++++++++++++++++-------------- fs/jbd/revoke.c | 133 ++++++---------- fs/ocfs2/journal.c ...
The revocation table initialisation/destruction code is repeated for each of
the two revocation tables stored in the journal. Refactoring the duplicated
code into functions is tidier, simplifies the logic in initialisation in
particular, and slightly reduces the code size.
There should not be any functional change.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
---
This change is an independent cleanup which is not required by following
patches in this series. Also it should perhaps be two separate patches?
fs/jbd/revoke.c | 126 +++++++++++++++++++++++--------------------------------
1 files changed, 52 insertions(+), 74 deletions(-)
diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
index ad2eacf..5f64df4 100644
--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -195,108 +195,86 @@ void journal_destroy_revoke_caches(void)
revoke_table_cache = NULL;
}
-/* Initialise the revoke table for a given journal to a given size. */
-
-int journal_init_revoke(journal_t *journal, int hash_size)
+static int journal_init_revoke_table(struct jbd_revoke_table_s *table, int size)
{
- int shift, tmp;
-
- J_ASSERT (journal->j_revoke_table[0] == NULL);
-
- shift = 0;
- tmp = hash_size;
+ int shift = 0;
+ int tmp = size;
while((tmp >>= 1UL) != 0UL)
shift++;
- journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
- if (!journal->j_revoke_table[0])
- return -ENOMEM;
- journal->j_revoke = journal->j_revoke_table[0];
-
- /* Check that the hash_size is a power of two */
- J_ASSERT(is_power_of_2(hash_size));
-
- journal->j_revoke->hash_size = hash_size;
-
- journal->j_revoke->hash_shift = shift;
-
- journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
- if (!journal->j_revoke->hash_table) {
- kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
- journal->j_revoke = NULL;
+ table->hash_size = size;
+ table->hash_shift = shift;
+ table->hash_table = kmalloc(
+ size * sizeof(struct ...If an error occurs during jbd cache initialisation it is possible for the
journal_head_cache to be NULL when journal_destroy_journal_head_cache is
called. Replace the J_ASSERT with an if block to handle the situation
correctly.
Note that even with this fix things will break badly if ext3 is statically
compiled in and cache initialisation fails.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
---
At the moment the cache destruction methods are inconsistent in whether they
set the cache pointers to NULL after destroying them. To be precise,
journal_destroy_handle_cache doesn't, the others do. I haven't changed that,
although I was sorely tempted. I think it would be better to be consistent and
that NULLing the pointers is preferable. Any objections?
This patch is an independent bug fix and following patches in this series do
not depend on it.
fs/jbd/journal.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 3943a89..3f334a3 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -1635,9 +1635,10 @@ static int journal_init_journal_head_cache(void)
static void journal_destroy_journal_head_cache(void)
{
- J_ASSERT(journal_head_cache != NULL);
- kmem_cache_destroy(journal_head_cache);
- journal_head_cache = NULL;
+ if (journal_head_cache) {
+ kmem_cache_destroy(journal_head_cache);
+ journal_head_cache = NULL;
+ }
}
/*
--
1.5.3.7
--
JBD debugfs entries should only be created if cache initialisation is successful. At the moment they are being created unconditionally which will leave them dangling if cache (and hence module) initialisation fails. Signed-off-by: Duane Griffin <duaneg@dghda.com> --- This patch is an independent bug fix and following patches in this series do not depend on it. fs/jbd/journal.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c index 3f334a3..a868277 100644 --- a/fs/jbd/journal.c +++ b/fs/jbd/journal.c @@ -1942,9 +1942,10 @@ static int __init journal_init(void) BUILD_BUG_ON(sizeof(struct journal_superblock_s) != 1024); ret = journal_init_caches(); - if (ret != 0) + if (ret == 0) + jbd_create_debugfs_entry(); + else journal_destroy_caches(); - jbd_create_debugfs_entry(); return ret; } -- 1.5.3.7 --
The do_one_pass function iterates through the jbd log operating on blocks
depending on the pass. During the replay pass descriptor blocks are processed
by an inner loop which replays them. The nesting makes the code difficult to
follow or to modify. Splitting the inner loop and replay code into separate
functions simplifies matters.
The refactored code no longer unnecessarily reads revoked data blocks, so
potential read errors from such blocks will cause differing behaviour. Aside
from that there should be no functional effect.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
---
Note the TODO before the block read in the outer loop below. We could possibly
attempt to continue after a failed read as we do when replaying descriptor
blocks. However if it was a descriptor block we'd likely bomb out on the next
iteration anyway, so I'm not sure it would be worth it. Thoughts?
fs/jbd/recovery.c | 243 ++++++++++++++++++++++++++++-------------------------
1 files changed, 127 insertions(+), 116 deletions(-)
diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
index 2b8edf4..e2ac78f 100644
--- a/fs/jbd/recovery.c
+++ b/fs/jbd/recovery.c
@@ -307,6 +307,101 @@ int journal_skip_recovery(journal_t *journal)
return err;
}
+static int replay_data_block(
+ journal_t *journal, struct buffer_head *obh, char *data,
+ int flags, unsigned long blocknr)
+{
+ struct buffer_head *nbh;
+
+ /* Find a buffer for the new data being restored */
+ nbh = __getblk(journal->j_fs_dev, blocknr, journal->j_blocksize);
+ if (nbh == NULL)
+ return -ENOMEM;
+
+ lock_buffer(nbh);
+ memcpy(nbh->b_data, obh->b_data, journal->j_blocksize);
+ if (flags & JFS_FLAG_ESCAPE)
+ *((__be32 *)data) = cpu_to_be32(JFS_MAGIC_NUMBER);
+
+ BUFFER_TRACE(nbh, "marking dirty");
+ set_buffer_uptodate(nbh);
+ mark_buffer_dirty(nbh);
+ BUFFER_TRACE(nbh, "marking uptodate");
+ /* ll_rw_block(WRITE, 1, &nbh); */
+ unlock_buffer(nbh);
+ brelse(nbh);
+
+ return 0;
+}
+
+/* Replay a descriptor block: write ...Read-only log recovery allows a dirty journalled filesystem to be mounted and
provide a consistent view of its data without writing to the disk. Instead of
replaying the log a mapping is constructed between modified filesystem blocks
and the journal blocks containing their data.
The mapping is stored in a hashtable that is created and populated during
journal recovery. The hash function used is the same one used by the journal
revocation code. The size of the hashtable is currently being arbitrarily set
to 256 entries. Given that we know the number of block mappings when we create
the table it may be worth dynamically calculating an appropriate size instead
of hard-coding it.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
---
I'm tempted to add warnings on attempts to modify a read-only journal, does
that sound useful? I'd also be grateful to anyone with suggestions for better
member names. :)
fs/ext3/super.c | 2 +-
fs/jbd/checkpoint.c | 2 +-
fs/jbd/commit.c | 2 +-
fs/jbd/journal.c | 56 ++++++++++------
fs/jbd/recovery.c | 187 +++++++++++++++++++++++++++++++++++++++++++++------
fs/jbd/revoke.c | 7 +--
fs/ocfs2/journal.c | 4 +-
include/linux/jbd.h | 41 ++++++++++-
8 files changed, 246 insertions(+), 55 deletions(-)
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 18769cc..4b9ff65 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2168,7 +2168,7 @@ static int ext3_load_journal(struct super_block *sb,
if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER))
err = journal_wipe(journal, !really_read_only);
if (!err)
- err = journal_load(journal);
+ err = journal_load(journal, !really_read_only);
if (err) {
printk(KERN_ERR "EXT3-fs: error loading journal.\n");
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index a5432bb..84d4579 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -454,7 +454,7 @@ int cleanup_journal_tail(journal_t *journal)
journal->j_tail = ...If a filesystem is being mounted read-only then load the journal read-only (i.e. do not replay the log) and do not process the orphan list. At present, when a dirty ext3 filesystem is mounted read-only, it writes to the disk while replaying the journal log and cleaning up the orphan list. This behaviour may surprise users and can potentially cause data corruption/loss (e.g. if a system is suspended, booted into a different OS, then resumed). Introduce block accessor wrapper functions that check for blocks requiring translation and access them from the journal as needed. Convert the ext3 code to use the wrappers anywhere that may be called on a read-only filesystem. Code that is only called when writing to the filesystem is not converted. For a discussion of the need for this feature, see: http://marc.info/?l=linux-kernel&m=117607695406580 Tested by creating dirty filesystems, mounting a copy read-write, taking the md5sum of all its files, mounting a different copy read-only, confirming the md5sums match, unmounting it, then confirming the block device mounted read-only has not been modified. Testing has been done with both internal and external journals. NOTE: For now I'm simply preventing filesystems requiring recovery from being remounted read-write. This breaks booting with an uncleanly mounted root filesystem! Signed-off-by: Duane Griffin <duaneg@dghda.com> --- I went back and forth on converting all code to use the accessors. In the end it felt silly to shunt write code through a wrapper that was only useful in a read-only filesystem. On the other hand, the inconsistency bothers me and presumably increases the risks of future changes directly operating on filesysem blocks when they need to go through the wrappers. I'd appreciate second opinions on this point, in particular. Also, would it perhaps be better to split this into separate patches to introduce the accessors and add the no-replay code? fs/ext3/balloc.c | 2 +- fs/ext3/ialloc.c | ...
I'll grab the first three for now, thanks. Someone(tm) should do the jbd2 versions.. --
I'll do them tonight, unless someone beats me to it. Cheers, Duane. -- "I never could learn to drink that blood and call it wine" - Bob Dylan --
Hi Duane, Thanks for doing this. Some perhaps not so obvious fallout from the bad old way of doing things is that ddnap (zumastor) hits an issue in replication. Since ddsnap allows journal replay on the downstream server and also needs to have an unaltered snapshot to apply deltas against, if we do not take special care, Ext3 will come along and modify the downstream snapshot even when told not to. Our solution: take two snapshots per replication cycle (pretty cheap) so that one can be clean and the other can be stepped on at will by the journal replay. Ugh. With your hack, we can eventually drop the double snapshot, provided no other filesystem is similarly badly behaved. Re your page translation table: we already have a page translation table, it is called the page cache. If you could figure out which file (or metadata) each journal block belongs to, you could just load the page table pages back in and presto, done. No need to replay the journal at all, you are already back to journal+disk = consistent state. I probably have missed a detail or two since I haven't looked closely at how orphan inodes work, revokes, probably other things, but there is the basic idea. SCT, does my reasoning hold water? (In fact, ddsnap "replays" its own journal in exactly this way. Cache state is reconstructed and no actual journal flush is performed.) Anyway, this is just a theoretical comment, it is in no way a suggestion for a rewrite. The reason for that being, you do not have any convenient way to map physical journal blocks back to files and metadata. Maybe if we do implement reverse mapping for Ext3/4 later (not just a pipe dream) we could revisit this and lose your extra mapping. As it stands your solution seems well built, after a quick readthrough. Nice looking code. I think you added about 250 lines overall, so tight too. Thanks again. Daniel --
Ah, good to know, thanks. It looks like you folks are doing some Thanks very much, I appreciate it! Cheers, Duane. -- "I never could learn to drink that blood and call it wine" - Bob Dylan --
