Subject: Give kjournald a IOPRIO_CLASS_RT io priority
From: Arjan van de Ven <arjan@linux.intel.com>With latencytop, I noticed that the (in memory) atime updates during a
kernel build had latencies of 600 msec or longer; this is obviously not so
nice behavior. Other EXT3 journal related operations had similar or even
longer latencies.Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work and effectively this leads to a twisted kind of "mass priority
inversion" type of behavior.The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.The patch below makes kjournald of the IOPRIO_CLASS_RT priority to break
this priority inversion behavior. With this patch, the latencies for atime
updates (and similar operation) go down by a factor of 3x to 4x !Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
diff -purN linux-2.6.23-rc9.org/fs/jbd/journal.c linux-2.6.23-rc9.lt/fs/jbd/journal.c
--- linux-2.6.23-rc9.org/fs/jbd/journal.c 2007-10-02 05:24:52.000000000 +0200
+++ linux-2.6.23-rc9.lt/fs/jbd/journal.c 2007-10-14 00:06:55.000000000 +0200
@@ -35,6 +35,7 @@
#include <linux/kthread.h>
#include <linux/poison.h>
#include <linux/proc_fs.h>
+#include <linux/ioprio.h>#include <asm/uaccess.h>
#include <asm/page.h>
@@ -131,6 +132,8 @@ static int kjournald(void *arg)
printk(KERN_INFO "kjournald starting. Commit interval %ld seconds\n",
journal->j_commit_interval / HZ);+ current->ioprio = (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | 4;
+
/*
* And now, wait forever for commit wakeup events.
*/
-
On Mon, 15 Oct 2007 10:46:47 -0700
Seems a pretty fundamental change which could do with some careful
benchmarking, methinks.See, your patch amounts to "do more seeks to improve one test case".
Might be worth a code comment?
-
On Mon, 15 Oct 2007 11:47:38 -0700
FWIW, I have marked the kjournald processes on my system
realtime with "rtprio -c 1 `pidof kjournald`" and the
usual desktop stalls that plague my system have not yetThe big problem I have seen here is that processes end up
waiting on kjournald to do something, and kjournald is
waiting due to the IO scheduler.This can cause a lot of low IO (high IO priority) processes
to indirectly get stuck behind a few high IO (low priority)
processes.Since you have been involved a lot with ext3 development,
which kinds of workloads do you think will show a performance
degradation with Arjan's patch? What should I test?--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
-
On Mon, 15 Oct 2007 16:13:15 -0400
Gee. Expect the unexpected ;)
One problem might be when kjournald is doing its ordered-mode data
writeback at the start of commit. That writeout will now be
higher-priority and might impact other tasks which are doing synchronous
file overwrites (ie: no commits) or O_DIRECT reads or writes or just plain
old reads.If the aggregate number of seeks over the long term is the same as before
then of course the overall throughput should be the same, in which case the
impact might only be upon latency. However if for some reason the total
number of seeks is increased then there will be throughput impacts as well.So as a starting point I guess one could set up a
copy-a-kernel-tree-in-a-loop running in the background and then see what
impact that has upon a large-linear-read, upon a
read-a-different-kernel-tree and upon some database-style multithreaded
O_DIRECT reader/overwriter.
-
Yes, completely agree! I think Arjans patch makes a heap of sense, but
It should not be merged as-is, instead I'll provide a function to do
this. It should also set current->io_context->ioprio_changed. Since no
IO has been done yet at this point it doesn't matter. But we should cut
a piece of set_task_ioprio() out and provide that as a kernel helper for
this sort of thing.Even just writing it as:
current->ioprio = (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | IOPRIO_NORM;
would be more readable. Or perhaps this would suffice, given the above
restriction that IO hasn't been done yet:current->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, IOPRIO_NORM);
--
Jens Axboe-
Arjan gave the relevant hard numbers:
| With latencytop, I noticed that the (in memory) atime updates during a
| kernel build had latencies of 600 msec or longer [...]
|
| With this patch, the latencies for atime updates (and similar
| operation) go down by a factor of 3x to 4x !atime update latencies went down by a factor of 3x-4x ...
but what bothers me even more is the large picture. Linux's development
is still fundamentally skewed towards bandwidth (which goes up with
hardware advances anyway), while the focus on latencies is very lacking
(which users do care about much more and which usually does _not_
improve with improved hardware), so i cannot see why we shouldnt apply
this. Reminds me of the illogical, almost superstitious resistence
against the relatime patch. (which is not in 2.6.24 mind you - killed
for good)if bandwidth hurts anywhere, it will be pointed out and fixed, we've got
like tons of bandwidth benchmarks and it's _easy_ to fix bandwidth
problems. But _finally_ we now have desktop latency tools, hard numbers
and patches that fix them, but what do we do ... we put up extra
roadblocks??so lets just goddamn apply this _trivial_ patch. This isnt an intrusive
1000 line rewrite that is hard to revert. If it causes any bandwidth
problems, it will be just as trivial to undo. If we do anything else we
just stiffle the still young and very much under-represented "lets fix
latencies that bothers people" movement. If anything we need _positive_
discrimination for latency related fixes (which treatment this fix does
not need at all - all it needs is _equal_ footing with the countless
bandwidth patches that go into the kernel all the time), otherwise it
will never take off and become as healthy as bandwidth optimizations.
Ok?Ingo
-
I think the situation is that we've asked for some additional
what-can-be-hurt-by-this testing.Yes, we could sling it out there and wait for the reports. But often
that's a pretty painful process and regressions can be discovered too late
for us to do anything about them.-
reverting this oneliner is trivial. Finding bandwidth problems and
tracking them down to this oneliner change is relatively easy too.
Finding latency problems and fixing them is _not_ trivial.Boot up a Linux desktop and start OOo or firefox, and measure the time
it takes to start the app up. 10-20 seconds on a top-of-the-line
quad-core 3.2 GHz system - which is a shame. Same box can do in excess
of 1GB/sec block IO. Yes, one could blame the apps but in reality most
of the blame is mostly on the kernel side. We do not make bloat and
latency suckage apparent enough to user-space (due to lack of
intelligent instrumentation), we make latencies hard to fix, we have an
acceptance bias towards bandwidth fixes (because they are easier to
measure and justify) - and that's all what it takes to let such a
situation get out of control.and i can bring up the scheduler as an example. CFS broke the bandwidth
performance of one particular app and it took only a few days to get it
back under control. But it was months to get good latency behavior out
of the scheduler. And that is with the help of excellent scheduler
instrumentation. In the IO space the latency situation is much, much
worse. Really.Ingo
-
None of which is an argument for simply not bothering to do a bit more
developer testing before merging.
i mean this one:
http://people.redhat.com/mingo/relatime-patches/improve-relatime.patch
this actually makes relatime practical.
Ingo
-
| Jeremy Fitzhardinge | Re: [RFC 00/15] x86_64: Optimize percpu accesses |
| Vladislav Bolkhovitin | Re: Integration of SCST in the mainstream Linux kernel |
| Mike Galbraith | Re: regression: CD burning (k3b) went broke |
git: | |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Linus Torvalds | Re: [GIT]: Networking |
| Michael Grollman | Re: 8169 Intermittent ifup Failure Issue With RTL8102E Chipset in Intel's New D945... |
