Re: [PATCH 01/13] writeback: IO-less balance_dirty_pages()

Previous thread: [PATCH 02/13] writeback: consolidate variable names in balance_dirty_pages() by Wu Fengguang on Tuesday, November 16, 2010 - 9:27 pm. (1 message)

Next thread: [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages() by Wu Fengguang on Tuesday, November 16, 2010 - 9:27 pm. (5 messages)
From: Wu Fengguang
Date: Tuesday, November 16, 2010 - 9:27 pm

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

This patch introduces the basic framework, which will be further
consolidated by the next patches.

RATIONALS
=========

The current balance_dirty_pages() is rather IO inefficient.

- concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.

Jan Kara, Dave Chinner and me explored the ...
From: Minchan Kim
Date: Wednesday, November 17, 2010 - 3:34 am

Hi Wu,

As you know, I am not a expert in this area.
So I hope my review can help understanding other newbie like me and
make clear this document. :)
I didn't look into the code. before it, I would like to clear your concept.


Seems good until now.


Dumb question.

I can't see the difference between old and new,
La depends on A.
A depends on Wa.
T is constant?
Then, throttle_bandwidth depends on Wa.
Wa depends on the number of dirtied pages during some interval.
So if light dirtier become heavy, at last light dirtier and heavy
dirtier will have a same weight.
It means throttle_bandwidth is same. It's a same with old result.

Please, open my eyes. :)




-- 
Kind regards,
Minchan Kim
--

From: Wu Fengguang
Date: Sunday, November 21, 2010 - 7:01 pm

Hi Minchan,


Yeah, it's some big change of "concept" :)

Sorry for the late reply, as I'm still tuning things and some details
may change as a result. The biggest challenge now is the stability of
the control algorithms. Everything is floating around and I'm trying
to keep the fluctuations down by borrowing some equation from the

That's not a problem. It's the proper behavior to converge for two

T is the bdi's share of the global dirty limit. It's stable in normal,

Sure, each task will be throttled at different bandwidth if there



Thanks,
Fengguang
--

From: Andrew Morton
Date: Wednesday, November 17, 2010 - 4:08 pm

On Wed, 17 Nov 2010 12:27:21 +0800

writeback has always had these semi-bogus assumptions that all pages
are the same, and it can sometimes go very wrong.

A chronic case would be a 4GB i386 machine where only 1/4 of memory is
useable for GFP_KERNEL allocations, filesystem metadata and /dev/sdX
pagecache.

When you think about it, a lot of the throttling work being done in
writeback is really being done on behalf of the page allocator (and
hence page reclaim).  But what happens if the workload is mainly
hammering away at ZONE_NORMAL, but writeback is considering ZONE_NORMAL
to be the same thing as ZONE_HIGHMEM?

Or vice versa, where page-dirtyings are all happening in lowmem?  Can
writeback then think that there are plenty of clean pages (because it's
looking at highmem as well) so little or no throttling is happening? 
If so, what effect does this have upon GFP_KERNEL/GFP_USER allocation?

And bear in mind that the user can tune the dirty levels.  If they're
set to 10% on a machine on which 25% of memory is lowmem then ill
effects might be rare.  But if the user tweaks the thresholds to 30%
then can we get into problems?  Such as a situation where 100% of
lowmem is dirty and throttling isn't cutting in?



So please have a think about that and see if you can think of ways in
which this assumption can cause things to go bad.  I'd suggest
writing some targetted tests which write to /dev/sdX (to generate
lowmem-only dirty pages) and which read from /dev/sdX (to request
allocation of lowmem pages).  Run these tests in conjunction with tests
which exercise the highmem zone as well and check that everything
behaves as expected.

Of course, this all assumes that you have a 4GB i386 box :( It's almost
getting to the stage where we need a fake-zone-highmem option for
x86_64 boxes just so we can test this stuff.

--

From: Peter Zijlstra
Date: Thursday, November 18, 2010 - 6:04 am

If you feel like playing with sub-jiffies timeouts (a way to avoid that
HZ=>100 assumption), the below (totally untested) patch might be of
help..


---
Subject: hrtimer: Provide io_schedule_timeout*() functions

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/hrtimer.h |    7 +++++++
 kernel/hrtimer.c        |   15 +++++++++++++++
 kernel/sched.c          |   17 +++++++++++++++++
 3 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index dd9954b..9e0f67e 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -419,6 +419,13 @@ extern long hrtimer_nanosleep_restart(struct restart_block *restart_block);
 extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
 				 struct task_struct *tsk);
 
+extern int io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
+						const enum hrtimer_mode mode);
+extern int io_schedule_hrtimeout_range_clock(ktime_t *expires,
+		unsigned long delta, const enum hrtimer_mode mode, int clock);
+extern int io_schedule_hrtimeout(ktime_t *expires, const enum hrtimer_mode mode);
+
+
 extern int schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
 						const enum hrtimer_mode mode);
 extern int schedule_hrtimeout_range_clock(ktime_t *expires,
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 72206cf..ef2d93c 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1838,6 +1838,14 @@ int __sched schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
 }
 EXPORT_SYMBOL_GPL(schedule_hrtimeout_range);
 
+int __sched io_schedule_hrtimeout_range(ktime_t *expires, unsigned long delta,
+				     const enum hrtimer_mode mode)
+{
+	return io_schedule_hrtimeout_range_clock(expires, delta, mode,
+					      CLOCK_MONOTONIC);
+}
+EXPORT_SYMBOL_GPL(io_schedule_hrtimeout_range);
+
 /**
  * schedule_hrtimeout - sleep until timeout
  * @expires:	timeout value (ktime_t)
@@ -1866,3 +1874,10 @@ int __sched ...
From: Wu Fengguang
Date: Thursday, November 18, 2010 - 6:26 am

Assuming there are HZ=10 users.

- when choosing such a coarse granularity, do they really care about
  responsiveness? :)

- will the use of hrtimer add a little code size and/or runtime
  overheads, and hence hurt the majority HZ=100 users?

Thanks,
--

From: Wu Fengguang
Date: Wednesday, December 1, 2010 - 6:56 pm

Right. It seems not rewarding enough to check signal_pending().  We've
already been able to response to signals much faster than before

Fair enough. I do missed the D state (without the long wait :).
Here is the patch.

Thanks,
Fengguang
---
Subject: writeback: do uninterruptible sleep in balance_dirty_pages()
Date: Thu Dec 02 09:31:19 CST 2010

Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong.  If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop.  Plus
we *do* want these processes to appear in D state and to contribute to
load average.

So it should be TASK_UNINTERRUPTIBLE.                 -- Andrew Morton

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2010-12-02 09:30:29.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-02 09:30:34.000000000 +0800
@@ -636,7 +636,7 @@ pause:
 					  pages_dirtied,
 					  pause);
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
-		__set_current_state(TASK_INTERRUPTIBLE);
+		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 
--

From: Wu Fengguang
Date: Sunday, December 5, 2010 - 9:14 am

Ah I seem to find the root cause. See the attached graphs. Ext4 should
be calling redirty_page_for_writepage() to redirty ~300MB pages on
every ~10s. The redirties happen in big bursts, so not surprisingly
the dd task's dirty weight will suddenly drop to 0.

It should be the same ext4 issue discussed here:

        http://www.spinics.net/lists/linux-fsdevel/msg39555.html

Thanks,
Fengguang
From: Dmitry
Date: Monday, December 6, 2010 - 2:52 am

May be it is reasonable to introduce new mount option which control
dynamic delalloc on/off behavior for example like this:
0) -odelalloc=off : analog of nodelalloc
1) -odelalloc=normal : Default mode (disable delalloc if close to full fs)
2) -odelalloc=force  : delalloc mode always enabled, so we have to do
                     writeback more aggressive in case of ENOSPC.

So one can force delalloc and can safely use this writeback mode in 
multi-user environment. Openvz already has this. I'll prepare the patch
--

From: Ted Ts'o
Date: Monday, December 6, 2010 - 5:34 am

Yeah, I'd really rather not do that.  There are significant downsides
with your proposed odelalloc=force mode.  One of which is that we
could run out of space and not notice.  If the application doesn't
call fsync() and check the return value, and simply closes()'s the
file and then exits, when the writeback threads do get around to
writing the file, the block allocation could fail, and oops, data gets
lost.  There's a _reason_ why we disable delalloc when we're close to
a full fs.  The only alternative is to super conservative when doing
your block reservation calculations, and in that case, you end up
returning ENOSPC far too soon.

						- Ted
--

Previous thread: [PATCH 02/13] writeback: consolidate variable names in balance_dirty_pages() by Wu Fengguang on Tuesday, November 16, 2010 - 9:27 pm. (1 message)

Next thread: [PATCH 03/13] writeback: per-task rate limit on balance_dirty_pages() by Wu Fengguang on Tuesday, November 16, 2010 - 9:27 pm. (5 messages)