From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 1 Oct 2008 19:58:18 -0700
Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
With latencytop, I noticed that the (in memory) atime updates during a
kernel build had latencies of 6 seconds or longer; this is obviously not so
nice behavior. Other EXT3 journal related operations had similar or even
longer latencies.
Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work on behalf of other processes and effectively this leads to a twisted
kind of "mass priority inversion" type of behavior.
The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.
The patch below makes kjournald of the IOPRIO_CLASS_RT priority to break
this priority inversion behavior. With this patch, the latencies for atime
updates (and similar operation) go down by a factor of 3x to 4x !
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
fs/ioprio.c | 3 ++-
fs/jbd/journal.c | 12 ++++++++++++
include/linux/ioprio.h | 2 ++
3 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/fs/ioprio.c b/fs/ioprio.c
index da3cc46..3bd95dc 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -27,7 +27,7 @@
#include <linux/security.h>
#include <linux/pid_namespace.h>
-static int set_task_ioprio(struct task_struct *task, int ioprio)
+int set_task_ioprio(struct task_struct *task, int ioprio)
{
int err;
struct io_context *ioc;
@@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
task_unlock(task);
return err;
}
+EXPORT_SYMBOL_GPL(set_task_ioprio);
asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
{
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index aa7143a..2ed3d8f 100644
--- ...You proposed this a while back and it didn't happen and I forget why and the changelog doesn't mention any of that? --
I think you called for benchmark results, which I don't think happened. The patch definitely makes sense, so we should just make sure that we don't regress elsewhere. Honestly, I'd be surprised if we did... How about I just toss it into the 2.6.28 testing mix, plenty of time for testing and such? -- Jens Axboe --
Now I think about it, didn't the earlier patch tweak CPU priority and Many performance regressions don't get noticed for six or twelve months, by which time everyone is suffering from them (see kernel/sched.c). kjournald does huge amounts of not-terribly-important writeback. One obvious risk is that by making all that bulk writeback high-priority, read-latency-sensitive applications might suffer latency spikes. Now, kjournald is _supposed_ to be mostly asynchronous wrt foreground operations. And once upon a time (seven years ago) it mostly was. But there was some horrid race which I ended up fixing by introducing one point where synchronous userspace actions had to block behind kjournald activity. I spent quite some time on it but couldn't come up with anything better. It had fairly bad effects on some workloads. I've forgotten where that code is now, but I don't think it was ever revisited. It should be. So. Where are these atime updaters getting blocked? --
Behind other IO activity I suppose, since it's marked async. A more appropriate fix may be to mark atime updates as sync IO. Once upon a time I actually did start xxxing meta data IO from the file system, in ext3. That was mainly for blktrace so we could inspect what data was what at the trace end, but if we had full coverage (or better, at least), we could use that to bump priority as well. Probably the priority should be left at the default, but the IO should be upgraded to SYNC instead. That'll ensure that it goes out fairly quickly (like other IO at the same class), but not get preferential treatment apart from that. Seems like the right thing to do. -- Jens Axboe --
No, they might be getting blocked at a higher level. An async atime update gets recorded into the current transaction. kjournald is working on the committing transaction. We try to keep those separated, to prevent user processes from getting blocked behind kjournald activity. But sometimes that doesn't work (including the place where I knowingly broke it). If we can find and fix the offending piece of jbd logic (a big if) then all is peachy. If the above theory turns out to be true then diddling IO priorities is but a workaround. --
If diddling with io priorities makes a big difference (and Arjan said it does), it's clearly stuck inside the io scheduler waiting to be dispatched. If it was marked sync, it would be going through much faster. As async, it'll get mixed in with other async writeout and that doesn't get a lot of attention in case of sync activity. -- Jens Axboe --
Maybe. And maybe marking all journal writeout and all commit activity as sync would have the same effect with less downside. But it'd be better to not wait on this IO at all, if poss.. --
Agree, would be much better... -- Jens Axboe --
This is true unless the journal gets too full, and we need to do a checkpoint operation --- at which point, everything stops. If this was metadata-intensive a benchmark, and the journal wasn't large enough, this could be the problem. (And if you make the journal bigger, then when you *do* finally get forced to do a checkpoint operation, things get stalled for even longer.) Arjan, is this *really* about atime updates? I thought most poeple these days run with noatime or relatime. If people *really* want true atime semantics, the best way to solve this problem would be to have two dirty flags in the inode --- an "atime dirty" and a "dirty" flag. The atime dirty bit would not actually cause the inode to get written to disk, unless either (a) we are unmounting the filesystem, or (b) we are trying to shrink the inode cache due to memory pressure. If when we write the inode out to disk, only the atime dirty bit is set, we can also skip journalling the inode table block. So if there are people who really care about true atime semantics, without getting killed by the I/O writes, there are some solutions we can pursue. But if this is really about the "entangled fsync problem", where we have a large number of processes writing a large amount of async data, and then we have a single process writing a small amount of data and then calling fsync(), then that's a different (and very long-standing) problem in ext3/4. Raising the I/O priority is probably the only thing we can do in this circumstance. We could try to do some kind of complex priority inheritance scheme, but it would certainly be much simpler to raise the I/O priority. We could choose a level just below realtime priority, but the reality is that if a real-time priority is trying to write to the filesystem, and we are doing a checkpointing opration, we're going to be blocking the real-time process anyway, and it will be a priority inversion. So perhaps the simplest and best algorithm would be to use a priority level ...
On Thu, 2 Oct 2008 08:04:44 -0400 I can very easily reproduce it; my mail client (claws) stalls due to this several seconds at least once every 5 to 10 minutes... (usually when I'm typing an email.. grumble) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Are you running with noatime or relatime?
One quick thing to try that might show what is going on is to run
debugfs on the device where your mail directory is located, while
claws is running, and then use the debugfs command "logdump -a" to
dump out the contents of the journal. You can then use ncheck and
icheck to take the FS block numbers from logdump and translate them to
inode numbers, and then from inode numbers to pathnames. That might
give us some insight as to what is going on.
I can whip up a patch which adds some markers which we could use to
find out more information about what is happening.
- Ted
--
On Thu, 2 Oct 2008 09:46:59 -0400 yes; I have mount -o remount,relatime / & for i in `pidof kjournald` ; do ionice -c1 -p $i ; done I'll try (sadly I'm in a training session all day today so I might not interesting testcase of the markers concept. -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Sorry for the delay, I ran into a minor bug in the Modules.marker generation support that prevented Systemtap from being able to use markers. (It was busted since 2.6.27-rc1, so I guess that gives us some sense how often developers use Systemtap. :-) It looks like Andrew's workaround seems to help you out, but if you're willing to run this while your mail reader is running, and correlate it with with the large latency spikes, we might get some interesting results. Anyway, here's the patch (against ext4, although could pretty easily move this to ext3 --- but you can mount an ext3 filesystem as ext4, although for the moment you do have to run the command "tune2fs -E test_fs /dev/hdXX" first), and a sample systemtap script. You'll also need the markers patch, and of course the latest systemtap from their git repository. - Ted P.S. Make sure you install systemtap in /usr/local, instead of trying to run it out of the build tree. See for an interesting report from Roland McGrath about what happens if you make this mistake: http://sources.redhat.com/ml/systemtap/2008-q3/msg00809.html I really think Systemtap has a lot of potential if only they could get past some "minor usability concerns". So one thing that I think would really help the Systemtap folks is if more people gave them more, ah, "constructive feedback" to their mailing list.
Shouldn't he also be sure to mount the FS with the "noextents" option, or has ext4 learned not to to enable extents if they're not already enabled? -- Joseph Fannin jfannin@gmail.com --
The latter; ext4 will no longer enable extents if not already enabled as of commit e4079a11, dated July 11th. That means 2.6.26 with any of the ext4 patchsets (highly recommended if you want to use ext4) or any mainline release post 2.6.26-git4/2.6.27-rc1. A number of the ext4 tutorials/howto's are available on the web page are out of date, including the one on IBM developer works. The best place to get up-to-date information on using ext4 is http://ext4.wiki.kernel.org. Regards, - Ted P.S. Note that there will be advantages to using even an unconverted filesystem with the ext3 filesystem format with the ext4 code base, even if you don't enable extents or any other ext4-specific feature; the delayed allocation and soon-to-be-added inode table readahead features (in the 2.6.27-rc8-ext4-2 patchset, and which will be pushed during the next merge window) in the ext4 code base will improve performance even on an "ext3 filesystem". --
my reproducer is sadly very simple (claws-mail is my mail client that uses maildir) Process claws-mail (4896) Total: 2829.7 msec EXT3: Waiting for journal access 2491.0 msec 88.4 % Writing back inodes 160.9 msec 5.7 % synchronous write 78.8 msec 3.0 % is an example of such a trace (this is with patch, without patch the numbers are about 3x bigger) Waiting for journal access is "journal_get_write_access" Writing back inodes is "writeback_inodes" synchronous write is "do_sync_write" -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Right. Probably the lock_buffer() in do_get_write_access(). kjournald
is checkpointing the committing transaction (writing metadata buffers
back into the fs) and a user process operating on the current
transaction is trying to get access to one of those buffers but has to
wait for the writeout to complete first.
It wasn't always thus. Back in, umm, 2.5.0 we did
/*
* The buffer_locked() || buffer_dirty() tests here are simply an
* optimisation tweak. If anyone else in the system decides to
* lock this buffer later on, we'll blow up. There doesn't seem
* to be a good reason why they should do this.
*/
if (jh->b_cp_transaction &&
(buffer_locked(jh2bh(jh)) || buffer_dirty(jh2bh(jh)))) {
unlock_journal(journal);
lock_buffer(jh2bh(jh));
and I _think_ it was the loss of that which hurt us a lot.
773fc4c63442fbd8237b4805627f6906143204a8 or thereabouts in the old git
tree.
It would be very good if we could again decouple the committing and
current transactions, but I fear that none of us remember sufficiently
well how it all works (or, more importantly, how it all doesn't work
when you make a change).
Of course, that could all be wrong and we could be stuck somewhere
else. A good way to diagnose this stuff would be
--- a/kernel/sched.c~a
+++ a/kernel/sched.c
@@ -5567,10 +5567,14 @@ EXPORT_SYMBOL(yield);
void __sched io_schedule(void)
{
struct rq *rq = &__raw_get_cpu_var(runqueues);
+ unsigned long in, out;
delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
+ in = jiffies;
schedule();
+ out = jiffies;
+ WARN_ON(time_after(out, in + 1 * HZ));
atomic_dec(&rq->nr_iowait);
delayacct_blkio_end();
}
_
perhaps for varying values of "1".
--
On Thu, 2 Oct 2008 13:24:57 -0700 I'll run with this. (well with a value of "1000" for 1 ;-) Also I'll tell latencytop to give me the full backtrace as well ;-) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
On Thu, 2 Oct 2008 21:01:17 -0700 caught a few, a few were over 3 1/2 seconds in stall time (specifically I know this for the last one) (I've stripped out the ? entries to keep them reasonable) [ 410.168277] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0() [ 410.168347] Pid: 699, comm: kjournald Not tainted 2.6.27-rc8-tip #50 [ 410.168366] [<c042ee64>] warn_on_slowpath+0x41/0x65 [ 410.168414] [<c070ec83>] io_schedule+0x77/0xb0 [ 410.168421] [<c04abc72>] sync_buffer+0x33/0x37 [ 410.168429] [<c070f123>] __wait_on_bit+0x36/0x5d [ 410.168445] [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3 [ 410.168471] [<c04abbd1>] __wait_on_buffer+0x19/0x1c [ 410.168478] [<c04de796>] journal_commit_transaction+0x484/0xcb2 [ 410.168519] [<c04e14ae>] kjournald+0xc7/0x1ea [ 410.168544] [<c043fb87>] kthread+0x3b/0x61 [ 410.168559] [<c040499f>] kernel_thread_helper+0x7/0x10 [ 410.168567] ======================= [ 410.168572] ---[ end trace de523043f88bd9a7 ]--- [ 451.605034] ------------[ cut here ]------------ [ 451.605041] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0() [ 451.605114] Pid: 699, comm: kjournald Tainted: G W 2.6.27-rc8-tip #50 [ 451.605133] [<c042ee64>] warn_on_slowpath+0x41/0x65 [ 451.605182] [<c070ec83>] io_schedule+0x77/0xb0 [ 451.605189] [<c04abc72>] sync_buffer+0x33/0x37 [ 451.605198] [<c070f123>] __wait_on_bit+0x36/0x5d [ 451.605213] [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3 [ 451.605240] [<c04abbd1>] __wait_on_buffer+0x19/0x1c [ 451.605248] [<c04de796>] journal_commit_transaction+0x484/0xcb2 [ 451.605300] [<c04e14ae>] kjournald+0xc7/0x1ea [ 451.605324] [<c043fb87>] kthread+0x3b/0x61 [ 451.605339] [<c040499f>] kernel_thread_helper+0x7/0x10 [ 451.605348] ======================= [ 451.605353] ---[ end trace de523043f88bd9a7 ]--- [ 514.780993] ------------[ cut here ]------------ [ 514.781001] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0() [ 514.781074] Pid: 699, comm: kjournald ...
hm, they're all kjournald getting stuck on lock_buffer(). That _may_ That's the one - the lock_buffer() in do_get_write_access(). It's a major contention site and it'd be a major win if we could fix it. Even if we resorted to some nasty thing like taking a temp copy of the buffer's contents. --
I also notice it's part of "file_update_time". Do we really need to go all the way down to this level of synchronicity for that? (I also randomly wonder if we, in the write path, dirty the inode twice, once for size once for item, and if we then also reserve two slots in the journal for that..... but I'm showing my total ignorance of JBD internals here) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Well, we've tossed that around many times but never implemented it. Once you get into the details it gets a bit nasty. Need to keep the dirtiness state in the VFS (or fs) inode, and going backwards from a plain old buffer_head at commit time isn't possible. We usually tempfixed the problem by adding increasingly fancy ways of not doing the atime update at all. Of course, fixing this running-vs-committing contention point would fix That shouldn't be the case - once we have write access to the buffer it remains freely modifiable for the rest of the transaction period. I I'm going on senile memories of JDB five years ago, but the concepts didn't change much. --
On Thu, 2 Oct 2008 21:50:26 -0700 given that this is the write path, I was assuming this was mtime rather yes clearly. It's waaay above my paygrade to hack on though; JBD is one of those places in the kernel that scare me for doing fundamental I hope you're right otherwise we'd always hit this; once for the size change, then block for the mtime. That would thoroughly suck; so much so that you just must be right. -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
If someone has an hour or so to kill, we can pretend to fix it:
fs/ext3/super.c | 12 +++++++++++-
fs/jbd/transaction.c | 13 +++++++++++--
include/linux/ext3_fs.h | 1 +
include/linux/jbd.h | 1 +
4 files changed, 24 insertions(+), 3 deletions(-)
diff -puN fs/ext3/super.c~a fs/ext3/super.c
--- a/fs/ext3/super.c~a
+++ a/fs/ext3/super.c
@@ -617,6 +617,8 @@ static int ext3_show_options(struct seq_
seq_puts(seq, ",barrier=1");
if (test_opt(sb, NOBH))
seq_puts(seq, ",nobh");
+ if (test_opt(sb, AKPM))
+ seq_puts(seq, ",akpm");
if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
seq_puts(seq, ",data=journal");
@@ -757,7 +759,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
- Opt_grpquota
+ Opt_grpquota, Opt_akpm,
};
static match_table_t tokens = {
@@ -808,6 +810,7 @@ static match_table_t tokens = {
{Opt_usrquota, "usrquota"},
{Opt_barrier, "barrier=%u"},
{Opt_resize, "resize"},
+ {Opt_akpm, "akpm"},
{Opt_err, NULL},
};
@@ -1097,6 +1100,9 @@ set_qf_format:
set_opt(sbi->s_mount_opt, QUOTA);
set_opt(sbi->s_mount_opt, GRPQUOTA);
break;
+ case Opt_akpm:
+ set_opt(sbi->s_mount_opt, AKPM);
+ break;
case Opt_noquota:
if (sb_any_quota_enabled(sb) ||
sb_any_quota_suspended(sb)) {
@@ -1986,6 +1992,10 @@ static void ext3_init_journal_params(str
journal->j_flags |= JFS_BARRIER;
else
journal->j_flags &= ~JFS_BARRIER;
+ if (test_opt(sb, AKPM))
+ journal->j_flags |= JFS_AKPM;
+ else
+ journal->j_flags &= ~JFS_AKPM;
spin_unlock(&journal->j_state_lock);
}
diff -puN include/linux/ext3_fs.h~a include/linux/ext3_fs.h
--- a/include/linux/ext3_fs.h~a
+++ a/include/linux/ext3_fs.h
@@ -380,6 +380,7 @@ struct ext3_inode {
#define EXT3_MOUNT_QUOTA 0x80000 /* Some quota option set */
#define ...On Thu, 2 Oct 2008 22:24:38 -0700 I've been running with this for a few hours now while I can't see IO is perfectly smooth now, it does feel a lot better and latencytop isn't showing me the huge spikes in latency either in the mail client. (there's a lot of little stuff, but not the 4+ seconds) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
I've ported the patch to the ext4 filesystem, and dropped it into the unstable portion of the ext4 patch queue. If we can get someone (hi, Ric!) to run fs_mark with and without -o akpm_lock_hack, I suspect we will find that it makes quite a large difference on that particular benchmark, since it is fsync-heavy to force a large number of transaction, and the creation of the inodes should cause multiple blocks that will be entangled between the current and committing transactions. - Ted ext4: akpm's locking hack to fix locking delays This is a port of the following patch from Andrew Morton to ext4: http://lkml.org/lkml/2008/10/3/22 This fixes a major contention problem in do_get_write_access() when a buffer is modified in both the current and committing transaction. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: akpm@linux-foundation.org diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index f46a513..23822fb 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -540,6 +540,7 @@ do { \ #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */ #define EXT4_MOUNT_I_VERSION 0x2000000 /* i_version support */ #define EXT4_MOUNT_DELALLOC 0x8000000 /* Delalloc support */ +#define EXT4_MOUNT_AKPM_LOCK_HACK 0x10000000 /* akpm lock hack */ /* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */ #ifndef _LINUX_EXT2_FS_H #define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 67ebefb..f4e7157 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -752,6 +752,8 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs) seq_puts(seq, ",journal_async_commit"); if (test_opt(sb, NOBH)) seq_puts(seq, ",nobh"); + if (test_opt(sb, AKPM_LOCK_HACK)) + seq_puts(seq, ",akpm_lock_hack"); if (!test_opt(sb, EXTENTS)) seq_puts(seq, ",noextents"); if (test_opt(sb, I_VERSION)) @@ -911,7 +913,7 @@ enum ...
fsync? Yes, I suppose so. Repeated modifications to the same inodes/directories/bitmaps blocks/etc will hurt. A quick test on other quantified workloads would be useful too. If the results look promising then someone(tm) will need to work out how to More specifically: "under checkpoint writeback in the committing transaction when the committing transaction requests write access". --
On Thu, 2 Oct 2008 21:40:00 -0700 one of them is (based on timestamps of the printk) is less than 3 seconds before the "real" one, while the real one's delay was 4 seconds. To me that implies they're at least somewhat correlated... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
XFS tried this some time ago too. I think the issue was that real user supplied RT applications don't want to compete with a "pseudo RT" kjournald. So it would really need a new priority class between RT and normal priority. -Andi -- ak@linux.intel.com --
Good point. I think we should mark the IO as sync, and maintain the same priority level. Any IO that ends up being waited on is sync by definition, we just need to expand the coverage a bit. -- Jens Axboe --
That's what XFS has always done - mark the journal I/O as sync. Still, once you load up the elevator, the sync I/O can still get delayed for hundreds of milliseconds before dispatch, which was why I started looking at boosting the priority of the log I/O. It proved to be much more effective at getting the log I/O dispatched than the existing "mark it sync" technique.... The RT folk were happy with the idea of journal I/O using the highest non-RT priority for the journal, but I never got around to testing that out as I had a bunnch of other stuff to fix at the time. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Sure, just marking it as sync is not a magic bullet. It'll be in the first priority for that class, but it'll share bandwidth with other processes. So if you have lots of IO going on, it can take hundreds of miliseconds before being dispatched. RT will always be faster, but mainly by virtue of there being no RT IO in the first place. And of course preferential treatment is given to That's a good idea, just bump the priority a little bit. Arjan, did you test that out? I'd suggest just trying prio level 0 and still using best-effort scheduling. Probably still need the sync marking, would be interesting to experiment with though. -- Jens Axboe --
On Thu, 2 Oct 2008 11:45:37 +0200 I looked at 0 but it appears the 0 is the default for everyone... if everyone just defaulted to > 0 then yes I would have picked 0. -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
That's not correct, class BE and value 4 is the default (and the code
defaults to that, if you haven't set a value yourself):
#define IOPRIO_NORM (4)
static inline int task_ioprio(struct io_context *ioc)
{
if (ioprio_valid(ioc->ioprio))
return IOPRIO_PRIO_DATA(ioc->ioprio);
return IOPRIO_NORM;
}
static inline int task_ioprio_class(struct io_context *ioc)
{
if (ioprio_valid(ioc->ioprio))
return IOPRIO_PRIO_CLASS(ioc->ioprio);
return IOPRIO_CLASS_BE;
}
So if you use IOPRIO_CLASS_BE and 0 for the ioprio, you will have the
highest priority of the default scheduling class.
--
Jens Axboe
--
On Thu, 2 Oct 2008 15:27:47 +0200 ok I checked not by looking at the code, but running ionice -p <pid> on a -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Yes, it'll report '0' which means 'not set'. The kernel inteprets 'not set' as the default values, BE/4. There's a big diffence, since '0' means that we track CPU nice values where as if it returned be/4 then that is a strict/fixed setting. -- Jens Axboe --
On Thu, 2 Oct 2008 15:47:37 +0200 argh. "0" means both "not set" and "highest priority". how about we fix this? right now "query + set" isn't idempotent..... it should be able to be that. -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Arjan, please read the interface, it does not... 0 means not set, period. What matters is the class setting, if that is non-zero then it is a valid setting. See #define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | data) So highest priority BE is IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0), which is of course valid. -- Jens Axboe --
On Thu, 2 Oct 2008 11:45:37 +0200
ok 0 works ok enough in quick testing as well...... updated patch below
From df64cc4e2ab0c102bbac609dd948958a6f804fd3 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 1 Oct 2008 19:58:18 -0700
Subject: [PATCH] Give kjournald a higher io priority
With latencytop, I noticed that the (in memory) file updates during my
workload (reading mail) had latencies of 6 seconds or longer; this is
obviously not so nice behavior. Other EXT3 journal related operations had
similar or even longer latencies.
Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work and effectively this leads to a twisted kind of "mass priority
inversion" type of behavior.
The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.
The patch below makes kjournald of a slighlty higher priority than normal
applications, reducing these latencies significantly.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
---
fs/ioprio.c | 3 ++-
fs/jbd/journal.c | 12 ++++++++++++
include/linux/ioprio.h | 2 ++
3 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/fs/ioprio.c b/fs/ioprio.c
index da3cc46..3bd95dc 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -27,7 +27,7 @@
#include <linux/security.h>
#include <linux/pid_namespace.h>
-static int set_task_ioprio(struct task_struct *task, int ioprio)
+int set_task_ioprio(struct task_struct *task, int ioprio)
{
int err;
struct io_context *ioc;
@@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
task_unlock(task);
return err;
}
+EXPORT_SYMBOL_GPL(set_task_ioprio);
asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
{
diff --git a/fs/jbd/journal.c ...Can we agree on this patch? -- Jens Axboe --
On Thu, 2 Oct 2008 21:22:23 +0200 This change will cause _all_ kjournald writeout to have elevated priority. The majority of that writeout (in data=ordered mode) is file data, which we didn't intend to change. The risk here is that this will *worsen* latency for plain old read(), because now kjournald writeout will be favoured. There is in fact a good argument for _reducing_ kjournald's IO priority, not increasing it! A better approach might be to mark the relevant buffers/bios as needing higher priority at submit_bh() time (if that's possible). At least that way we don't accidentally elevate the priority of the bulk data. It's a bit of a hack, sorry :( --
You can do that for submit_bio() by calling bio_set_prio() before submision - I did that for elevating only the XFS journal I/O. submit_bh() doesn't have any way of passing a priority through to it right now... I should resurrect the XFS patches I had an retest them.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
On Fri, 3 Oct 2008 09:58:49 +1000 Yup. There are plenty of spare bits in buffer_head.b_state. set_buffer_kludge()? --
On Thu, 2 Oct 2008 17:06:46 -0700 And if we do decide to run set_buffer_kludge() against these blocks then it should be done at the time when they are marked dirty, when JBD refiles them for checkpoint writeback. This is because kjournald isn't the only process which writes these buffers. Pretty often (usually?) it is pdflush. The increased IO prioritisation is a property of the buffer, not of the task which submits it for IO. I suspect. --
On Wed, 1 Oct 2008 21:56:38 -0700 Jens didn't like the APi I used back then. Since then the kernel grew (by Jens) a clean API to do the same... -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Actually I don't think I changed anything, did I? IIRC, I just pointed you at the API :-) -- Jens Axboe --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build with CONFIG_SSB_PCIHOST=n |
| Linux Kernel Mailing |
