[PATCH] Give kjournald a IOPRIO_CLASS_RT io priority

Previous thread: VERIFY YOUR NETVISION.NET.IL EMAIL ACCOUNT IMMEDIATELY! by NETVISION.NET.IL on Wednesday, October 1, 2008 - 5:34 pm. (1 message)

Next thread: by sena seneviratne on Wednesday, October 1, 2008 - 10:21 am. (1 message)
From: Arjan van de Ven
Date: Wednesday, October 1, 2008 - 8:00 pm

From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 1 Oct 2008 19:58:18 -0700
Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority

With latencytop, I noticed that the (in memory) atime updates during a
kernel build had latencies of 6 seconds or longer; this is obviously not so
nice behavior. Other EXT3 journal related operations had similar or even
longer latencies.

Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work on behalf of other processes and effectively this leads to a twisted 
kind of "mass priority inversion" type of behavior.

The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.

The patch below makes kjournald of the IOPRIO_CLASS_RT priority to break
this priority inversion behavior. With this patch, the latencies for atime
updates (and similar operation) go down by a factor of 3x to 4x !

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/ioprio.c            |    3 ++-
 fs/jbd/journal.c       |   12 ++++++++++++
 include/linux/ioprio.h |    2 ++
 3 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/fs/ioprio.c b/fs/ioprio.c
index da3cc46..3bd95dc 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -27,7 +27,7 @@
 #include <linux/security.h>
 #include <linux/pid_namespace.h>
 
-static int set_task_ioprio(struct task_struct *task, int ioprio)
+int set_task_ioprio(struct task_struct *task, int ioprio)
 {
 	int err;
 	struct io_context *ioc;
@@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
 	task_unlock(task);
 	return err;
 }
+EXPORT_SYMBOL_GPL(set_task_ioprio);
 
 asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
 {
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index aa7143a..2ed3d8f 100644
--- ...
From: Andrew Morton
Date: Wednesday, October 1, 2008 - 9:56 pm

You proposed this a while back and it didn't happen and I forget
why and the changelog doesn't mention any of that?
--

From: Jens Axboe
Date: Wednesday, October 1, 2008 - 11:27 pm

I think you called for benchmark results, which I don't think happened.
The patch definitely makes sense, so we should just make sure that we
don't regress elsewhere. Honestly, I'd be surprised if we did...

How about I just toss it into the 2.6.28 testing mix, plenty of time for
testing and such?

-- 
Jens Axboe

--

From: Andrew Morton
Date: Wednesday, October 1, 2008 - 11:55 pm

Now I think about it, didn't the earlier patch tweak CPU priority and

Many performance regressions don't get noticed for six or twelve
months, by which time everyone is suffering from them (see
kernel/sched.c).

kjournald does huge amounts of not-terribly-important writeback.  One
obvious risk is that by making all that bulk writeback high-priority,
read-latency-sensitive applications might suffer latency spikes.



Now, kjournald is _supposed_ to be mostly asynchronous wrt foreground
operations.  And once upon a time (seven years ago) it mostly was.  But
there was some horrid race which I ended up fixing by introducing one
point where synchronous userspace actions had to block behind kjournald
activity.  I spent quite some time on it but couldn't come up with
anything better.  It had fairly bad effects on some workloads.

I've forgotten where that code is now, but I don't think it was ever
revisited.  It should be.

So.  Where are these atime updaters getting blocked?
--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 12:45 am

Behind other IO activity I suppose, since it's marked async. A more
appropriate fix may be to mark atime updates as sync IO.

Once upon a time I actually did start xxxing meta data IO from the file
system, in ext3. That was mainly for blktrace so we could inspect what
data was what at the trace end, but if we had full coverage (or better,
at least), we could use that to bump priority as well. Probably the
priority should be left at the default, but the IO should be upgraded to
SYNC instead. That'll ensure that it goes out fairly quickly (like other
IO at the same class), but not get preferential treatment apart from
that. Seems like the right thing to do.

-- 
Jens Axboe

--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 1:03 am

No, they might be getting blocked at a higher level.

An async atime update gets recorded into the current transaction. 
kjournald is working on the committing transaction.  We try to keep
those separated, to prevent user processes from getting blocked behind
kjournald activity.

But sometimes that doesn't work (including the place where I knowingly
broke it).  If we can find and fix the offending piece of jbd logic (a
big if) then all is peachy.

If the above theory turns out to be true then diddling IO priorities
is but a workaround.
--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 1:22 am

If diddling with io priorities makes a big difference (and Arjan said it
does), it's clearly stuck inside the io scheduler waiting to be
dispatched. If it was marked sync, it would be going through much
faster. As async, it'll get mixed in with other async writeout and that
doesn't get a lot of attention in case of sync activity.

-- 
Jens Axboe

--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 1:43 am

Maybe.  And maybe marking all journal writeout and all commit activity
as sync would have the same effect with less downside.

But it'd be better to not wait on this IO at all, if poss..
--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 1:46 am

Agree, would be much better...

-- 
Jens Axboe

--

From: Theodore Tso
Date: Thursday, October 2, 2008 - 5:04 am

This is true unless the journal gets too full, and we need to do a
checkpoint operation --- at which point, everything stops.  If this
was metadata-intensive a benchmark, and the journal wasn't large
enough, this could be the problem.  (And if you make the journal
bigger, then when you *do* finally get forced to do a checkpoint
operation, things get stalled for even longer.)

Arjan, is this *really* about atime updates?  I thought most poeple
these days run with noatime or relatime.  If people *really* want true
atime semantics, the best way to solve this problem would be to have
two dirty flags in the inode --- an "atime dirty" and a "dirty" flag.
The atime dirty bit would not actually cause the inode to get written
to disk, unless either (a) we are unmounting the filesystem, or (b) we
are trying to shrink the inode cache due to memory pressure.  If when
we write the inode out to disk, only the atime dirty bit is set, we
can also skip journalling the inode table block.  So if there are
people who really care about true atime semantics, without getting
killed by the I/O writes, there are some solutions we can pursue.

But if this is really about the "entangled fsync problem", where we
have a large number of processes writing a large amount of async data,
and then we have a single process writing a small amount of data and
then calling fsync(), then that's a different (and very long-standing)
problem in ext3/4.  Raising the I/O priority is probably the only
thing we can do in this circumstance.  We could try to do some kind of
complex priority inheritance scheme, but it would certainly be much
simpler to raise the I/O priority.  We could choose a level just below
realtime priority, but the reality is that if a real-time priority is
trying to write to the filesystem, and we are doing a checkpointing
opration, we're going to be blocking the real-time process anyway, and
it will be a priority inversion.  So perhaps the simplest and best
algorithm would be to use a priority level ...
From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 6:16 am

On Thu, 2 Oct 2008 08:04:44 -0400

I can very easily reproduce it; my mail client (claws) stalls due to
this several seconds at least once every 5 to 10 minutes...
(usually when I'm typing an email.. grumble)



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Theodore Tso
Date: Thursday, October 2, 2008 - 6:46 am

Are you running with noatime or relatime?

One quick thing to try that might show what is going on is to run
debugfs on the device where your mail directory is located, while
claws is running, and then use the debugfs command "logdump -a" to
dump out the contents of the journal.  You can then use ncheck and
icheck to take the FS block numbers from logdump and translate them to
inode numbers, and then from inode numbers to pathnames.  That might
give us some insight as to what is going on.

I can whip up a patch which adds some markers which we could use to
find out more information about what is happening.

     	      		  	     	- Ted
--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 7:33 am

On Thu, 2 Oct 2008 09:46:59 -0400

yes; I have

mount -o remount,relatime / &
for i in `pidof kjournald` ; do ionice -c1 -p $i ; done


I'll try (sadly I'm in a training session all day today so I might not

interesting testcase of the markers concept.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Theodore Tso
Date: Saturday, October 4, 2008 - 7:12 am

Sorry for the delay, I ran into a minor bug in the Modules.marker
generation support that prevented Systemtap from being able to use
markers.  (It was busted since 2.6.27-rc1, so I guess that gives us
some sense how often developers use Systemtap.  :-)

It looks like Andrew's workaround seems to help you out, but if you're
willing to run this while your mail reader is running, and correlate
it with with the large latency spikes, we might get some interesting
results.

Anyway, here's the patch (against ext4, although could pretty easily
move this to ext3 --- but you can mount an ext3 filesystem as ext4,
although for the moment you do have to run the command "tune2fs -E
test_fs /dev/hdXX" first), and a sample systemtap script.  You'll also
need the markers patch, and of course the latest systemtap from their
git repository.

						- Ted


P.S.  Make sure you install systemtap in /usr/local, instead of trying
to run it out of the build tree.  See for an interesting report from
Roland McGrath about what happens if you make this mistake:

	http://sources.redhat.com/ml/systemtap/2008-q3/msg00809.html

I really think Systemtap has a lot of potential if only they could get
past some "minor usability concerns".  So one thing that I think would
really help the Systemtap folks is if more people gave them more, ah,
"constructive feedback" to their mailing list.
From: Joseph Fannin
Date: Saturday, October 4, 2008 - 10:14 am

Shouldn't he also be sure to mount the FS with the "noextents" option,
or has ext4 learned not to to enable extents if they're not already
enabled?

--
Joseph Fannin
jfannin@gmail.com

--

From: Theodore Tso
Date: Saturday, October 4, 2008 - 2:27 pm

The latter; ext4 will no longer enable extents if not already enabled
as of commit e4079a11, dated July 11th.  That means 2.6.26 with any of
the ext4 patchsets (highly recommended if you want to use ext4) or any
mainline release post 2.6.26-git4/2.6.27-rc1.

A number of the ext4 tutorials/howto's are available on the web page
are out of date, including the one on IBM developer works.  The best
place to get up-to-date information on using ext4 is
http://ext4.wiki.kernel.org.

Regards,

						- Ted

P.S.  Note that there will be advantages to using even an unconverted
filesystem with the ext3 filesystem format with the ext4 code base,
even if you don't enable extents or any other ext4-specific feature;
the delayed allocation and soon-to-be-added inode table readahead
features (in the 2.6.27-rc8-ext4-2 patchset, and which will be pushed
during the next merge window) in the ext4 code base will improve
performance even on an "ext3 filesystem".
--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 6:12 am

my reproducer is sadly very simple (claws-mail is my mail client that uses maildir)

Process claws-mail (4896)                  Total: 2829.7 msec 
EXT3: Waiting for journal access                  2491.0 msec         88.4 % 
Writing back inodes				   160.9 msec          5.7 % 
synchronous write                                   78.8 msec          3.0 %

is an example of such a trace (this is with patch, without patch the numbers are about 3x bigger)

Waiting for journal access is "journal_get_write_access"
Writing back inodes is "writeback_inodes"
synchronous write is "do_sync_write"


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 1:24 pm

Right.  Probably the lock_buffer() in do_get_write_access().  kjournald
is checkpointing the committing transaction (writing metadata buffers
back into the fs) and a user process operating on the current
transaction is trying to get access to one of those buffers but has to
wait for the writeout to complete first.

It wasn't always thus.  Back in, umm, 2.5.0 we did

	/*
	 * The buffer_locked() || buffer_dirty() tests here are simply an
	 * optimisation tweak.  If anyone else in the system decides to
	 * lock this buffer later on, we'll blow up.  There doesn't seem
	 * to be a good reason why they should do this.
	 */
	if (jh->b_cp_transaction &&
	    (buffer_locked(jh2bh(jh)) || buffer_dirty(jh2bh(jh)))) {
		unlock_journal(journal);
		lock_buffer(jh2bh(jh));

and I _think_ it was the loss of that which hurt us a lot. 
773fc4c63442fbd8237b4805627f6906143204a8 or thereabouts in the old git
tree.

It would be very good if we could again decouple the committing and
current transactions, but I fear that none of us remember sufficiently
well how it all works (or, more importantly, how it all doesn't work
when you make a change).

Of course, that could all be wrong and we could be stuck somewhere
else.  A good way to diagnose this stuff would be

--- a/kernel/sched.c~a
+++ a/kernel/sched.c
@@ -5567,10 +5567,14 @@ EXPORT_SYMBOL(yield);
 void __sched io_schedule(void)
 {
 	struct rq *rq = &__raw_get_cpu_var(runqueues);
+	unsigned long in, out;
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
+	in = jiffies;
 	schedule();
+	out = jiffies;
+	WARN_ON(time_after(out, in + 1 * HZ));
 	atomic_dec(&rq->nr_iowait);
 	delayacct_blkio_end();
 }
_

perhaps for varying values of "1".
--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 9:01 pm

On Thu, 2 Oct 2008 13:24:57 -0700

I'll run with this. (well with a value of "1000" for 1 ;-)

Also I'll tell latencytop to give me the full backtrace as well ;-)


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 9:23 pm

On Thu, 2 Oct 2008 21:01:17 -0700

caught a few, a few were over 3 1/2 seconds in stall time 
(specifically I know this for the last one)

(I've stripped out the ? entries to keep them reasonable)

[  410.168277] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[  410.168347] Pid: 699, comm: kjournald Not tainted 2.6.27-rc8-tip #50
[  410.168366]  [<c042ee64>] warn_on_slowpath+0x41/0x65
[  410.168414]  [<c070ec83>] io_schedule+0x77/0xb0
[  410.168421]  [<c04abc72>] sync_buffer+0x33/0x37
[  410.168429]  [<c070f123>] __wait_on_bit+0x36/0x5d
[  410.168445]  [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
[  410.168471]  [<c04abbd1>] __wait_on_buffer+0x19/0x1c
[  410.168478]  [<c04de796>] journal_commit_transaction+0x484/0xcb2
[  410.168519]  [<c04e14ae>] kjournald+0xc7/0x1ea
[  410.168544]  [<c043fb87>] kthread+0x3b/0x61
[  410.168559]  [<c040499f>] kernel_thread_helper+0x7/0x10
[  410.168567]  =======================
[  410.168572] ---[ end trace de523043f88bd9a7 ]---
[  451.605034] ------------[ cut here ]------------
[  451.605041] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[  451.605114] Pid: 699, comm: kjournald Tainted: G        W 2.6.27-rc8-tip #50
[  451.605133]  [<c042ee64>] warn_on_slowpath+0x41/0x65
[  451.605182]  [<c070ec83>] io_schedule+0x77/0xb0
[  451.605189]  [<c04abc72>] sync_buffer+0x33/0x37
[  451.605198]  [<c070f123>] __wait_on_bit+0x36/0x5d
[  451.605213]  [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
[  451.605240]  [<c04abbd1>] __wait_on_buffer+0x19/0x1c
[  451.605248]  [<c04de796>] journal_commit_transaction+0x484/0xcb2
[  451.605300]  [<c04e14ae>] kjournald+0xc7/0x1ea
[  451.605324]  [<c043fb87>] kthread+0x3b/0x61
[  451.605339]  [<c040499f>] kernel_thread_helper+0x7/0x10
[  451.605348]  =======================
[  451.605353] ---[ end trace de523043f88bd9a7 ]---
[  514.780993] ------------[ cut here ]------------
[  514.781001] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[  514.781074] Pid: 699, comm: kjournald ...
From: Andrew Morton
Date: Thursday, October 2, 2008 - 9:40 pm

hm, they're all kjournald getting stuck on lock_buffer().  That _may_

That's the one - the lock_buffer() in do_get_write_access().  It's a
major contention site and it'd be a major win if we could fix it.  Even
if we resorted to some nasty thing like taking a temp copy of the
buffer's contents.
--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 9:43 pm

I also notice it's part of "file_update_time". Do we really need to go all the way 
down to this level of synchronicity for that?
(I also randomly wonder if we, in the write path, dirty the inode twice, once for size once for item, and
if we then also reserve two slots in the journal for that..... but I'm showing
my total ignorance of JBD internals here)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 9:50 pm

Well, we've tossed that around many times but never implemented it. 
Once you get into the details it gets a bit nasty.  Need to keep the
dirtiness state in the VFS (or fs) inode, and going backwards from a
plain old buffer_head at commit time isn't possible.  We usually
tempfixed the problem by adding increasingly fancy ways of not doing the
atime update at all.

Of course, fixing this running-vs-committing contention point would fix

That shouldn't be the case - once we have write access to the buffer it
remains freely modifiable for the rest of the transaction period.  I

I'm going on senile memories of JDB five years ago, but the concepts
didn't change much.
--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 10:00 pm

On Thu, 2 Oct 2008 21:50:26 -0700

given that this is the write path, I was assuming this was mtime rather

yes clearly. It's waaay above my paygrade to hack on though; JBD is one
of those places in the kernel that scare me for doing fundamental

I hope you're right otherwise we'd always hit this; once for the size
change, then block for the mtime. That would thoroughly suck; so much
so that you just must be right.



-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 10:24 pm

If someone has an hour or so to kill, we can pretend to fix it:


 fs/ext3/super.c         |   12 +++++++++++-
 fs/jbd/transaction.c    |   13 +++++++++++--
 include/linux/ext3_fs.h |    1 +
 include/linux/jbd.h     |    1 +
 4 files changed, 24 insertions(+), 3 deletions(-)

diff -puN fs/ext3/super.c~a fs/ext3/super.c
--- a/fs/ext3/super.c~a
+++ a/fs/ext3/super.c
@@ -617,6 +617,8 @@ static int ext3_show_options(struct seq_
 		seq_puts(seq, ",barrier=1");
 	if (test_opt(sb, NOBH))
 		seq_puts(seq, ",nobh");
+	if (test_opt(sb, AKPM))
+		seq_puts(seq, ",akpm");
 
 	if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
 		seq_puts(seq, ",data=journal");
@@ -757,7 +759,7 @@ enum {
 	Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
 	Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
 	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
-	Opt_grpquota
+	Opt_grpquota, Opt_akpm,
 };
 
 static match_table_t tokens = {
@@ -808,6 +810,7 @@ static match_table_t tokens = {
 	{Opt_usrquota, "usrquota"},
 	{Opt_barrier, "barrier=%u"},
 	{Opt_resize, "resize"},
+	{Opt_akpm, "akpm"},
 	{Opt_err, NULL},
 };
 
@@ -1097,6 +1100,9 @@ set_qf_format:
 			set_opt(sbi->s_mount_opt, QUOTA);
 			set_opt(sbi->s_mount_opt, GRPQUOTA);
 			break;
+		case Opt_akpm:
+			set_opt(sbi->s_mount_opt, AKPM);
+			break;
 		case Opt_noquota:
 			if (sb_any_quota_enabled(sb) ||
 			    sb_any_quota_suspended(sb)) {
@@ -1986,6 +1992,10 @@ static void ext3_init_journal_params(str
 		journal->j_flags |= JFS_BARRIER;
 	else
 		journal->j_flags &= ~JFS_BARRIER;
+	if (test_opt(sb, AKPM))
+		journal->j_flags |= JFS_AKPM;
+	else
+		journal->j_flags &= ~JFS_AKPM;
 	spin_unlock(&journal->j_state_lock);
 }
 
diff -puN include/linux/ext3_fs.h~a include/linux/ext3_fs.h
--- a/include/linux/ext3_fs.h~a
+++ a/include/linux/ext3_fs.h
@@ -380,6 +380,7 @@ struct ext3_inode {
 #define EXT3_MOUNT_QUOTA		0x80000 /* Some quota option set */
 #define ...
From: Arjan van de Ven
Date: Friday, October 3, 2008 - 10:21 am

On Thu, 2 Oct 2008 22:24:38 -0700

I've been running with this for a few hours now
while I can't see IO is perfectly smooth now, it does feel a lot better
and latencytop isn't showing me the huge spikes in latency either in
the mail client.
(there's a lot of little stuff, but not the 4+ seconds)
-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Theodore Tso
Date: Wednesday, October 8, 2008 - 8:00 pm

I've ported the patch to the ext4 filesystem, and dropped it into the
unstable portion of the ext4 patch queue.  If we can get someone (hi,
Ric!) to run fs_mark with and without -o akpm_lock_hack, I suspect we
will find that it makes quite a large difference on that particular
benchmark, since it is fsync-heavy to force a large number of
transaction, and the creation of the inodes should cause multiple
blocks that will be entangled between the current and committing
transactions.

	      	     	     	      	      	 - Ted

ext4: akpm's locking hack to fix locking delays

This is a port of the following patch from Andrew Morton to ext4:

	http://lkml.org/lkml/2008/10/3/22

This fixes a major contention problem in do_get_write_access() when a
buffer is modified in both the current and committing transaction.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: akpm@linux-foundation.org
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f46a513..23822fb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -540,6 +540,7 @@ do {									       \
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
 #define EXT4_MOUNT_I_VERSION            0x2000000 /* i_version support */
 #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
+#define EXT4_MOUNT_AKPM_LOCK_HACK	0x10000000 /* akpm lock hack */
 /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
 #define clear_opt(o, opt)		o &= ~EXT4_MOUNT_##opt
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 67ebefb..f4e7157 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -752,6 +752,8 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_puts(seq, ",journal_async_commit");
 	if (test_opt(sb, NOBH))
 		seq_puts(seq, ",nobh");
+	if (test_opt(sb, AKPM_LOCK_HACK))
+		seq_puts(seq, ",akpm_lock_hack");
 	if (!test_opt(sb, EXTENTS))
 		seq_puts(seq, ",noextents");
 	if (test_opt(sb, I_VERSION))
@@ -911,7 +913,7 @@ enum ...
From: Andrew Morton
Date: Wednesday, October 8, 2008 - 8:38 pm

fsync?  Yes, I suppose so.  Repeated modifications to the same
inodes/directories/bitmaps blocks/etc will hurt.

A quick test on other quantified workloads would be useful too.  If the
results look promising then someone(tm) will need to work out how to

More specifically: "under checkpoint writeback in the committing
transaction when the committing transaction requests write access".

--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 9:45 pm

On Thu, 2 Oct 2008 21:40:00 -0700

one of them is (based on timestamps of the printk) is less than 3
seconds before the "real" one, while the real one's delay was 4 seconds.
To me that implies they're at least somewhat correlated...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Andi Kleen
Date: Wednesday, October 1, 2008 - 11:57 pm

XFS tried this some time ago too.

I think the issue was that real user supplied RT applications don't want to 
compete with a "pseudo RT" kjournald.

So it would really need a new priority class between RT and normal priority.

-Andi

-- 
ak@linux.intel.com
--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 12:55 am

Good point. I think we should mark the IO as sync, and maintain the same
priority level. Any IO that ends up being waited on is sync by
definition, we just need to expand the coverage a bit.

-- 
Jens Axboe

--

From: Dave Chinner
Date: Thursday, October 2, 2008 - 2:33 am

That's what XFS has always done - mark the journal I/O as sync.
Still, once you load up the elevator, the sync I/O can still get
delayed for hundreds of milliseconds before dispatch, which was
why I started looking at boosting the priority of the log I/O.
It proved to be much more effective at getting the log I/O
dispatched than the existing "mark it sync" technique....

The RT folk were happy with the idea of journal I/O using the
highest non-RT priority for the journal, but I never got around
to testing that out as I had a bunnch of other stuff to fix at
the time.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 2:45 am

Sure, just marking it as sync is not a magic bullet. It'll be in the
first priority for that class, but it'll share bandwidth with other
processes. So if you have lots of IO going on, it can take hundreds of
miliseconds before being dispatched.

RT will always be faster, but mainly by virtue of there being no RT IO
in the first place. And of course preferential treatment is given to

That's a good idea, just bump the priority a little bit. Arjan, did you
test that out? I'd suggest just trying prio level 0 and still using
best-effort scheduling. Probably still need the sync marking, would be
interesting to experiment with though.

-- 
Jens Axboe

--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 6:14 am

On Thu, 2 Oct 2008 11:45:37 +0200

I looked at 0 but it appears the 0 is the default for everyone...
if everyone just defaulted to > 0 then yes I would have picked 0.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 6:27 am

That's not correct, class BE and value 4 is the default (and the code
defaults to that, if you haven't set a value yourself):

#define IOPRIO_NORM     (4)
static inline int task_ioprio(struct io_context *ioc)
{
        if (ioprio_valid(ioc->ioprio))
                return IOPRIO_PRIO_DATA(ioc->ioprio);

        return IOPRIO_NORM;
}

static inline int task_ioprio_class(struct io_context *ioc)
{
        if (ioprio_valid(ioc->ioprio))
                return IOPRIO_PRIO_CLASS(ioc->ioprio);

        return IOPRIO_CLASS_BE;
}

So if you use IOPRIO_CLASS_BE and 0 for the ioprio, you will have the
highest priority of the default scheduling class.

-- 
Jens Axboe

--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 6:36 am

On Thu, 2 Oct 2008 15:27:47 +0200

ok

I checked not by looking at the code, but running ionice -p <pid> on a


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 6:47 am

Yes, it'll report '0' which means 'not set'. The kernel inteprets 'not
set' as the default values, BE/4. There's a big diffence, since '0'
means that we track CPU nice values where as if it returned be/4 then
that is a strict/fixed setting.

-- 
Jens Axboe

--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 7:26 am

On Thu, 2 Oct 2008 15:47:37 +0200

argh. "0" means both "not set" and "highest priority".

how about we fix this?

right now "query + set" isn't idempotent.....
it should be able to be that.


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 9:42 am

Arjan, please read the interface, it does not...

0 means not set, period. What matters is the class setting, if that is
non-zero then it is a valid setting. See

#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | data)

So highest priority BE is IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0), which
is of course valid.

-- 
Jens Axboe

--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 12:04 pm

On Thu, 2 Oct 2008 11:45:37 +0200

ok 0 works ok enough in quick testing as well...... updated patch below

From df64cc4e2ab0c102bbac609dd948958a6f804fd3 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 1 Oct 2008 19:58:18 -0700
Subject: [PATCH] Give kjournald a higher io priority

With latencytop, I noticed that the (in memory) file updates during my
workload (reading mail) had latencies of 6 seconds or longer; this is
obviously not so nice behavior. Other EXT3 journal related operations had
similar or even longer latencies.

Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work and effectively this leads to a twisted kind of "mass priority
inversion" type of behavior.

The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.

The patch below makes kjournald of a slighlty higher priority than normal
applications, reducing these latencies significantly.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
 fs/ioprio.c            |    3 ++-
 fs/jbd/journal.c       |   12 ++++++++++++
 include/linux/ioprio.h |    2 ++
 3 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/fs/ioprio.c b/fs/ioprio.c
index da3cc46..3bd95dc 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -27,7 +27,7 @@
 #include <linux/security.h>
 #include <linux/pid_namespace.h>
 
-static int set_task_ioprio(struct task_struct *task, int ioprio)
+int set_task_ioprio(struct task_struct *task, int ioprio)
 {
 	int err;
 	struct io_context *ioc;
@@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
 	task_unlock(task);
 	return err;
 }
+EXPORT_SYMBOL_GPL(set_task_ioprio);
 
 asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
 {
diff --git a/fs/jbd/journal.c ...
From: Jens Axboe
Date: Thursday, October 2, 2008 - 12:22 pm

Can we agree on this patch?

-- 
Jens Axboe

--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 2:37 pm

On Thu, 2 Oct 2008 21:22:23 +0200

This change will cause _all_ kjournald writeout to have elevated
priority.  The majority of that writeout (in data=ordered mode) is file
data, which we didn't intend to change.

The risk here is that this will *worsen* latency for plain old read(),
because now kjournald writeout will be favoured.

There is in fact a good argument for _reducing_ kjournald's IO
priority, not increasing it!

A better approach might be to mark the relevant buffers/bios as needing
higher priority at submit_bh() time (if that's possible).  At least
that way we don't accidentally elevate the priority of the bulk data.


It's a bit of a hack, sorry :(
--

From: Dave Chinner
Date: Thursday, October 2, 2008 - 4:58 pm

You can do that for submit_bio() by calling bio_set_prio() before
submision - I did that for elevating only the XFS journal I/O.
submit_bh() doesn't have any way of passing a priority through to it
right now...

I should resurrect the XFS patches I had an retest them....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 5:06 pm

On Fri, 3 Oct 2008 09:58:49 +1000

Yup.  There are plenty of spare bits in buffer_head.b_state. 
set_buffer_kludge()?

--

From: Andrew Morton
Date: Thursday, October 2, 2008 - 5:20 pm

On Thu, 2 Oct 2008 17:06:46 -0700

And if we do decide to run set_buffer_kludge() against these blocks
then it should be done at the time when they are marked dirty, when JBD
refiles them for checkpoint writeback.

This is because kjournald isn't the only process which writes these
buffers.  Pretty often (usually?) it is pdflush.  The increased IO
prioritisation is a property of the buffer, not of the task which
submits it for IO.  I suspect.
--

From: Arjan van de Ven
Date: Thursday, October 2, 2008 - 6:05 am

On Wed, 1 Oct 2008 21:56:38 -0700


Jens didn't like the APi I used back then.
Since then the kernel grew (by Jens) a clean API to do the same...


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Jens Axboe
Date: Thursday, October 2, 2008 - 10:11 am

Actually I don't think I changed anything, did I? IIRC, I just pointed
you at the API :-)

-- 
Jens Axboe

--

Previous thread: VERIFY YOUR NETVISION.NET.IL EMAIL ACCOUNT IMMEDIATELY! by NETVISION.NET.IL on Wednesday, October 1, 2008 - 5:34 pm. (1 message)

Next thread: by sena seneviratne on Wednesday, October 1, 2008 - 10:21 am. (1 message)