As proposed by Chris, Dave and Jan, don't start foreground writeback IO inside balance_dirty_pages(). Instead, simply let it idle sleep for some time to throttle the dirtying task. In the mean while, kick off the per-bdi flusher thread to do background writeback IO. This patch introduces the basic framework, which will be further consolidated by the next patches. RATIONALS ========= The current balance_dirty_pages() is rather IO inefficient. - concurrent writeback of multiple inodes (Dave Chinner) If every thread doing writes and being throttled start foreground writeback, it leads to N IO submitters from at least N different inodes at the same time, end up with N different sets of IO being issued with potentially zero locality to each other, resulting in much lower elevator sort/merge efficiency and hence we seek the disk all over the place to service the different sets of IO. OTOH, if there is only one submission thread, it doesn't jump between inodes in the same way when congestion clears - it keeps writing to the same inode, resulting in large related chunks of sequential IOs being issued to the disk. This is more efficient than the above foreground writeback because the elevator works better and the disk seeks less. - IO size too small for fast arrays and too large for slow USB sticks The write_chunk used by current balance_dirty_pages() cannot be directly set to some large value (eg. 128MB) for better IO efficiency. Because it could lead to more than 1 second user perceivable stalls. Even the current 4MB write size may be too large for slow USB sticks. The fact that balance_dirty_pages() starts IO on itself couples the IO size to wait time, which makes it hard to do suitable IO size while keeping the wait time under control. For the above two reasons, it's much better to shift IO to the flusher threads and let balance_dirty_pages() just wait for enough time or progress. Jan Kara, Dave Chinner and me explored the ...
Hi Wu, As you know, I am not a expert in this area. So I hope my review can help understanding other newbie like me and make clear this document. :) I didn't look into the code. before it, I would like to clear your concept. Seems good until now. Dumb question. I can't see the difference between old and new, La depends on A. A depends on Wa. T is constant? Then, throttle_bandwidth depends on Wa. Wa depends on the number of dirtied pages during some interval. So if light dirtier become heavy, at last light dirtier and heavy dirtier will have a same weight. It means throttle_bandwidth is same. It's a same with old result. Please, open my eyes. :) -- Kind regards, Minchan Kim --
Hi Minchan, Yeah, it's some big change of "concept" :) Sorry for the late reply, as I'm still tuning things and some details may change as a result. The biggest challenge now is the stability of the control algorithms. Everything is floating around and I'm trying to keep the fluctuations down by borrowing some equation from the That's not a problem. It's the proper behavior to converge for two T is the bdi's share of the global dirty limit. It's stable in normal, Sure, each task will be throttled at different bandwidth if there Thanks, Fengguang --
On Wed, 17 Nov 2010 12:27:21 +0800 writeback has always had these semi-bogus assumptions that all pages are the same, and it can sometimes go very wrong. A chronic case would be a 4GB i386 machine where only 1/4 of memory is useable for GFP_KERNEL allocations, filesystem metadata and /dev/sdX pagecache. When you think about it, a lot of the throttling work being done in writeback is really being done on behalf of the page allocator (and hence page reclaim). But what happens if the workload is mainly hammering away at ZONE_NORMAL, but writeback is considering ZONE_NORMAL to be the same thing as ZONE_HIGHMEM? Or vice versa, where page-dirtyings are all happening in lowmem? Can writeback then think that there are plenty of clean pages (because it's looking at highmem as well) so little or no throttling is happening? If so, what effect does this have upon GFP_KERNEL/GFP_USER allocation? And bear in mind that the user can tune the dirty levels. If they're set to 10% on a machine on which 25% of memory is lowmem then ill effects might be rare. But if the user tweaks the thresholds to 30% then can we get into problems? Such as a situation where 100% of lowmem is dirty and throttling isn't cutting in? So please have a think about that and see if you can think of ways in which this assumption can cause things to go bad. I'd suggest writing some targetted tests which write to /dev/sdX (to generate lowmem-only dirty pages) and which read from /dev/sdX (to request allocation of lowmem pages). Run these tests in conjunction with tests which exercise the highmem zone as well and check that everything behaves as expected. Of course, this all assumes that you have a 4GB i386 box :( It's almost getting to the stage where we need a fake-zone-highmem option for x86_64 boxes just so we can test this stuff. --
If you feel like playing with sub-jiffies timeouts (a way to avoid that
HZ=>100 assumption), the below (totally untested) patch might be of
help..
---
Subject: hrtimer: Provide io_schedule_timeout*() functions
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/hrtimer.h | 7 +++++++
kernel/hrtimer.c | 15 +++++++++++++++
kernel/sched.c | 17 +++++++++++++++++
3 files changed, 39 insertions(+), 0 deletions(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index dd9954b..9e0f67e 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -419,6 +419,13 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);
extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
struct task_struct *tsk);
+extern int io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
+ const enum hrtimer_mode mode);
+extern int io_schedule_hrtimeout_range_clock(ktime_t *expires,
+ unsigned long delta, const enum hrtimer_mode mode, int clock);
+extern int io_schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode);
+
+
extern int schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
const enum hrtimer_mode mode);
extern int schedule_hrtimeout_range_clock(ktime_t *expires,
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 72206cf..ef2d93c 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1838,6 +1838,14 @@ int __sched schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
}
EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
+int __sched io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
+ const enum hrtimer_mode mode)
+{
+ return io_schedule_hrtimeout_range_clock(expires, delta, mode,
+ CLOCK_MONOTONIC);
+}
+EXPORT_SYMBOL_GPL(io_schedule_hrtimeout_range);
+
/**
* schedule_hrtimeout - sleep until timeout
* @expires: timeout value (ktime_t)
@@ -1866,3 +1874,10 @@ int __sched ...Assuming there are HZ=10 users. - when choosing such a coarse granularity, do they really care about responsiveness? :) - will the use of hrtimer add a little code size and/or runtime overheads, and hence hurt the majority HZ=100 users? Thanks, --
Right. It seems not rewarding enough to check signal_pending(). We've already been able to response to signals much faster than before Fair enough. I do missed the D state (without the long wait :). Here is the patch. Thanks, Fengguang --- Subject: writeback: do uninterruptible sleep in balance_dirty_pages() Date: Thu Dec 02 09:31:19 CST 2010 Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong. If it's going to do that then it must break out if signal_pending(), otherwise it's pretty much guaranteed to degenerate into a busywait loop. Plus we *do* want these processes to appear in D state and to contribute to load average. So it should be TASK_UNINTERRUPTIBLE. -- Andrew Morton Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- mm/page-writeback.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- linux-next.orig/mm/page-writeback.c 2010-12-02 09:30:29.000000000 +0800 +++ linux-next/mm/page-writeback.c 2010-12-02 09:30:34.000000000 +0800 @@ -636,7 +636,7 @@ pause: pages_dirtied, pause); bdi_update_write_bandwidth(bdi, &bw_time, &bw_written); - __set_current_state(TASK_INTERRUPTIBLE); + __set_current_state(TASK_UNINTERRUPTIBLE); io_schedule_timeout(pause); bdi_update_write_bandwidth(bdi, &bw_time, &bw_written); --
Ah I seem to find the root cause. See the attached graphs. Ext4 should
be calling redirty_page_for_writepage() to redirty ~300MB pages on
every ~10s. The redirties happen in big bursts, so not surprisingly
the dd task's dirty weight will suddenly drop to 0.
It should be the same ext4 issue discussed here:
http://www.spinics.net/lists/linux-fsdevel/msg39555.html
Thanks,
Fengguang
May be it is reasonable to introduce new mount option which control
dynamic delalloc on/off behavior for example like this:
0) -odelalloc=off : analog of nodelalloc
1) -odelalloc=normal : Default mode (disable delalloc if close to full fs)
2) -odelalloc=force : delalloc mode always enabled, so we have to do
writeback more aggressive in case of ENOSPC.
So one can force delalloc and can safely use this writeback mode in
multi-user environment. Openvz already has this. I'll prepare the patch
--
Yeah, I'd really rather not do that. There are significant downsides with your proposed odelalloc=force mode. One of which is that we could run out of space and not notice. If the application doesn't call fsync() and check the return value, and simply closes()'s the file and then exits, when the writeback threads do get around to writing the file, the block allocation could fail, and oops, data gets lost. There's a _reason_ why we disable delalloc when we're close to a full fs. The only alternative is to super conservative when doing your block reservation calculations, and in that case, you end up returning ENOSPC far too soon. - Ted --
