Re: [PATCH 1/1] Allow ext4 to run without a journal.

Previous thread: - revert-percpu_counter-new-function-percpu_counter_sum_and_set.patch removed from -mm tree by akpm on Wednesday, December 10, 2008 - 12:28 pm. (1 message)

Next thread: [PATCH] ext2: ensure link targets are NULL-terminated by Duane Griffin on Thursday, December 11, 2008 - 12:16 pm. (4 messages)
From: Frank Mayhar
Date: Wednesday, December 10, 2008 - 4:54 pm

A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal.  Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs.  We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal.  It just so happens that running without a journal is
slightly faster for most everything.

We did
	iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache.  Results:

                     ext2        ext4, default   ext4, no journal
  initial writes   13.0 MB/s        15.4 MB/s          15.7 MB/s
  rewrites         13.1 MB/s        15.6 MB/s          15.9 MB/s
  reads            15.2 MB/s        16.9 MB/s          17.2 MB/s
  re-reads         15.3 MB/s        16.9 MB/s          17.2 MB/s
  random readers    5.6 MB/s         5.6 MB/s           5.7 MB/s
  random writers    5.1 MB/s         5.3 MB/s           5.4 MB/s 

So it seems that, so far, this was a useful exercise.

The patch follows.  It was

Signed-off-by: Frank Mayhar <fmayhar@google.com>

---

 fs/ext4/balloc.c    |    4 +-
 fs/ext4/ext4_jbd2.c |   60 +++++++++++++------
 fs/ext4/ext4_jbd2.h |   95 ++++++++++++++++++++++++++----
 fs/ext4/ext4_sb.h   |    1 +
 fs/ext4/extents.c   |   12 ++--
 fs/ext4/ialloc.c    |   25 ++++----
 fs/ext4/inode.c     |  130 +++++++++++++++++++++++++++-------------
 fs/ext4/ioctl.c     |    2 +-
 fs/ext4/mballoc.c   |   17 +++---
 fs/ext4/migrate.c   |    5 +-
 fs/ext4/namei.c     |   50 ++++++++--------
 fs/ext4/resize.c    |   31 +++++-----
 fs/ext4/super.c     |  163 +++++++++++++++++++++++++++++++++++++--------------
 fs/ext4/xattr.c     |   21 ++++---
 14 files changed, 419 insertions(+), 197 deletions(-)

diff --git a/fs/ext4/balloc.c ...
From: Frank Mayhar
Date: Monday, December 15, 2008 - 10:35 am

Has anyone had a chance to take a look at this?  Andreas?  Ted?  Anyone?
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

--

From: Theodore Tso
Date: Monday, December 15, 2008 - 9:00 pm

For some reason it didn't show up in my mailbox; maybe an errant SPAM
checker grabbed it?  (For some reason Herbert Xu's mail server
recently blacklisted my mail server, so he can't get any e-mails from
me, whether it's a reply to LKML or an invite to the kernel summit.
More recently Baracuda Networks, which handles mit.edu's spam
filtering, blacklisted the mail server for the vendor-sec mailing
list.  Go figure.)  

Fortunately your patch did show up in patchwork:

	http://patchwork.ozlabs.org/project/linux-ext4/

I've been travelling, so I haven't had a chance to look at it,
although I did note that it had showed up over the weekend when I was
going over the patch queue and reconciling it with Patchwork, and
wondered why I hadn't seen earlier...

I will try to get it shortly.

Regards,

						- Ted
--

From: Theodore Tso
Date: Tuesday, December 16, 2008 - 11:00 pm

Note, the patch allow-ext4-to-run-without-a-journal conflicted with a
large number of other patches in the patch queue.  I fixed it up as
best I could, and reviewed all of the changed patches, and I think
it's OK.  Please be cautious when first testing this, though.

     	  	    	     	  		- Ted
--

From: Frank Mayhar
Date: Wednesday, December 17, 2008 - 10:52 am

Yeah, sorry, for reasons that are probably obvious this patch was
relative to the ext4 stable tree, not to the latest bits.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

--

From: Theodore Tso
Date: Tuesday, December 16, 2008 - 8:12 pm

This will cause warnings on systems where pointers are 32-bits.  I'd
suggest using a 32-bit magic number.  Also, the number you have picked
is divisble by 4, which means at least in theory it could be a valid
pointer.  If we make EXT4_NOJOURNAL_MAGIC a 32-bit number, then a
number which is divisible by 4 definitely could be a valid pointer on
a 64-bit system.  I'd suggest keeping it simple, and simply using 0x1.
This is known not be a valid pointer, since it's not 32-bit word
aligned, and there is precedent for using such a number --- for
example, SIG_IGN is ((void *) 1) on many architectures, and page 0 is
generally unmapped to catch null pointer dereferences.  So I'd suggest
keeping things simple instead of using a fancy char string.

Also, why not use EXT4_NOJOURNAL_MAGIC constant directly, instead of
assigning it as a constant value to s_nojournal_flag in struct
ext4_sb_info?  That bloats the data structure for no good reason that

I don't see the point in doing this as an inline function; either way,
each time the inline function is executed, it will result a function
call, either to __ext4_journal_dirty_metadata() or
__ext4_write_dirty_metadata(), both of which are called exactly once
by this inline function.  So why not move the conditional and the
contents of _ext4_write_dirty_metadata() and
__ext4_journal_dirty_metadata() into __ext4_handle_dirty_metadata(),
and then do this:

#define ext4_handle_dirty_metadata(handle, inode, bh) \
	__ext4_handle_dirty_metadata(__func__, handle, inode, bh)

Ext4_handle_dirty_metadata() gets called a *lot*, so this will shrink
the icache footprint of ext4 filesystem, and for CPU's that have
decent branch prediction, centralizing the condition into a single
location instead of an inline will also be a win, once again
demonstrating that with modern CPU's, optimizations that minimize code

This shouldn't be a BUG_ON; this gets triggered if there is an inode
which is marked as needing data journalling.  So it is not a *bug*,
but at ...
From: Frank Mayhar
Date: Wednesday, December 17, 2008 - 10:50 am

This is just a quick response to one issue; I need to look at the rest
in more detail but they seem valid from a quick skim.


The way this works (which I did at Andreas' suggestion) is that when
there is no journal the "current handle" is really a pointer to
ext4_sb_info.  I need to distinguish that structure from a valid handle,
so I added the magic field to do so; the first field of a handle_t is a
pointer (to a transaction) so I need something I can distinguish from
that.  (And your comments regarding the magic value are very helpful,
thanks.)

If you look at ext4_handle_valid you'll see where it's actually used.  I
admit this is very non-obvious (so far it has confused at least two
different reviewers); if you have a more clear way to handle it I'm all

Yeah, I have an open issue to remove all of these BUG_ONs after we've
had a chance to test this fairly thoroughly.  In the meantime I was
leaving them in as a sanity check.  I'll remove them from the next


If by "we" you mean you, I'll just disregard those bits entirely.  If
you mean me, I'll chat with my manager but I think it's doable since

Thanks!  And thanks for the comments, I will address them and generate a
new patch.  Please think about the magic-number/ext4_sb_info thing,

Ah, good point.  If we want to recover the journal and there isn't one
that's probably bad, yes.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

--

From: Theodore Tso
Date: Saturday, January 3, 2009 - 8:52 am

I've modified your patch to reflect the comments I made two weeks ago.
Were there any other changes you have planned?  I'd like to get this
pushed to Linus soon....

						- Ted

ext4: Allow ext4 to run without a journal

From: Frank Mayhar <fmayhar@google.com>

A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal.  Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs.  We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal.  It just so happens that running without a journal is
slightly faster for most everything.

We did
	iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache.  Results:

                     ext2        ext4, default   ext4, no journal
  initial writes   13.0 MB/s        15.4 MB/s          15.7 MB/s
  rewrites         13.1 MB/s        15.6 MB/s          15.9 MB/s
  reads            15.2 MB/s        16.9 MB/s          17.2 MB/s
  re-reads         15.3 MB/s        16.9 MB/s          17.2 MB/s
  random readers    5.6 MB/s         5.6 MB/s           5.7 MB/s
  random writers    5.1 MB/s         5.3 MB/s           5.4 MB/s 

So it seems that, so far, this was a useful exercise.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 35f5f9a..31ebeb5 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -531,11 +531,11 @@ do_more:
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
-	err = ext4_journal_dirty_metadata(handle, bitmap_bh);
+	err = ext4_handle_dirty_metadata(handle, NULL, bitmap_bh);
 
 	/* And the group descriptor block */
 ...
From: Frank Mayhar
Date: Saturday, January 3, 2009 - 1:19 pm

My apologies, I have a new patch that has been tested quite a bit more
(although I still wouldn't say quite thoroughly yet, but close), all I
need to do is set it up correctly for you guys and send it to the list.
Unfortunately other stuff (including the holidays) has gotten in the
way.  Expect it on Monday or Tuesday.  Note that this version will have
a handful of small but important bugs fixed, as well.

One question:  Shold I make the patch relative to the stable tree or to
the latest bits?
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

--

From: Theodore Tso
Date: Saturday, January 3, 2009 - 6:17 pm

At this point can you make your patch relative to whatever you used
last time (and remind me what it is)?  Or, can you give me a delta
between what you did last time and what you have now that has the
fixes?  The problem is that the patch touches so much of ext4 that
it's quite painful to maintain it in the ext4 patch queue --- it
either invalidates a large number of patches after the fact, or the
patch needs a huge amount of work to fix up patch conflicts if it is
applied afterwards --- and if it is placed in the middle of the patch
queue, both conditions apply.

So I will have to do a lot of fix ups either way.  So unless you're
willing to pull down the ext4 patch queue, at:

	git://repo.or.cz/ext4-patch-queue.git

... and worry about making it apply in place, it's probably easier for
you to give me delta's versus the last patch, and I'll integrate in
the changes by hand.

						- Ted
--

From: Theodore Tso
Date: Sunday, January 4, 2009 - 10:31 pm

Here is an updated patch that adds a check to make sure the right
thing happens if an ext3 filesystem that needs recovery is mounted -o
noload.  This also fixes a potential oops in
ext4_mark_recovery_complete() journal doesn't exist, and we avoid
adding and remove orphans from the orphan list when the journal
doesn't exist.

					- Ted

ext4: Allow ext4 to run without a journal

From: Frank Mayhar <fmayhar@google.com>

A few weeks ago I posted a patch for discussion that allowed ext4 to run
without a journal.  Since that time I've integrated the excellent
comments from Andreas and fixed several serious bugs.  We're currently
running with this patch and generating some performance numbers against
both ext2 (with backported reservations code) and ext4 with and without
a journal.  It just so happens that running without a journal is
slightly faster for most everything.

We did
	iozone -T -t 4 s 2g -r 256k -T -I -i0 -i1 -i2

which creates 4 threads, each of which create and do reads and writes on
a 2G file, with a buffer size of 256K, using O_DIRECT for all file opens
to bypass the page cache.  Results:

                     ext2        ext4, default   ext4, no journal
  initial writes   13.0 MB/s        15.4 MB/s          15.7 MB/s
  rewrites         13.1 MB/s        15.6 MB/s          15.9 MB/s
  reads            15.2 MB/s        16.9 MB/s          17.2 MB/s
  re-reads         15.3 MB/s        16.9 MB/s          17.2 MB/s
  random readers    5.6 MB/s         5.6 MB/s           5.7 MB/s
  random writers    5.1 MB/s         5.3 MB/s           5.4 MB/s 

So it seems that, so far, this was a useful exercise.

Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 35f5f9a..31ebeb5 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -531,11 +531,11 @@ do_more:
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
-	err = ...
From: Andreas Dilger
Date: Wednesday, December 17, 2008 - 11:51 pm

This would imply that there is very little value in keeping ext2 or
ext3 around for the long term anymore (though of course I'm not going
to suggest that we get rid of them today).  It is now possible to mount
ext2 filesystems directly with ext4, and it seems clear that ext4 is

While I wouldn't go so far as Ted suggests to use "1" as the magic value,
it is definitely reasonable to not use a valid pointer value (so it should
be an odd number) and keeping it out of the kernel address space is also
useful.  Ensuring that it is not a valid pointer is sufficient, because
the first element of struct handle_s is a pointer and it can never be an

Adding a clear comments explaining this (not just a one-liner) at 


Since current->journal_info is a void * you shouldn't need to cast here.


Also, I thought there was somewhere that the "handle" was turned back into
ext4_sb_info again?  I can't see this after looking through the code.
The EXT4_NOJOURNAL_MAGIC and s_nojournal_flag is only used to determine if
a handle is not valid, and I can't see anywhere that a handle is turned
into ext4_sb_info.

If there is nowhere that you need to translate a handle into a superblock
anymore then having a pointer to a static variable would be sufficient.
There is no need for a magic value, and dereferencing it will give an oops:

static int dummy_handle = 0x0bad;
void *ext4_no_journal_in_use = &dummy_handle;

static inline int ext4_handle_valid(handle_t *handle)
{
	if (handle == ext4_no_journal_in_use)
		return 0;
	return 1;
}

handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
{

	/*
	 * We're not journaling.  Return a pointer to our flag that
	 * indicates that fact.
	 */
	current->journal_info = ext4_no_journal_in_use;
	return current->journal_info;
}



Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--

From: Jan Kara
Date: Thursday, December 18, 2008 - 11:00 am

Interesting although I'm not that surprised because those tests seem
to do a lot of data changes (which are never journaled in fact) and tiny
amount of metadata changes. If you run some benchmark doing lots of
directory operations, I guess the numbers would be considerably
different. Maybe trying dbench (I know it's kind of stupid ;) or
postmark will show the differences better.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
--

From: Michael Rubin
Date: Thursday, December 18, 2008 - 11:13 am

Actually we have some compile bench numbers (coming to this list soon)
that also surprised us.
The stages of compile bench that I believe are dominated by directory
operations are also

I admit also we see huge variance using dbench on subsequent runs. To
the point where I don't know how much I trust it's numbers.
Is this a tool that people on this list have a lot of faith in?

mrubin
--

From: Jan Kara
Date: Thursday, December 18, 2008 - 11:27 am

Yes, compilebench excercises directory operations quite heavily so
  Not really (at least as far as I'm concerned :). I had to deal with
dbench recently tracking some reported performance regression and lost
a lot of my faith in it ;). It measures something but the numbers
vary greatly and also the load it puts on the filesystem is not much
realistic (lots of rewrites of the same pages and file deletes just
after creation). I think compilebench is a better choice, I just didn't
remember its name when I was writing the email.

								Honza
-- 
Jan Kara <jack@suse.cz>
SuSE CR Labs
--

From: Curt Wohlgemuth
Date: Thursday, December 18, 2008 - 11:29 am

I should be able to post some numbers we're seeing for compilebench on
various filesystems next week, including ext4 without a journal
(Frank's patch).

Curt

--

Previous thread: - revert-percpu_counter-new-function-percpu_counter_sum_and_set.patch removed from -mm tree by akpm on Wednesday, December 10, 2008 - 12:28 pm. (1 message)

Next thread: [PATCH] ext2: ensure link targets are NULL-terminated by Duane Griffin on Thursday, December 11, 2008 - 12:16 pm. (4 messages)