Re: [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio

Previous thread: [PATCH 08/13] writeback: pass writeback_control down to move_expired_inodes() by Wu Fengguang on Thursday, August 5, 2010 - 9:10 am. (1 message)

Next thread: THIS IS MY SECOND MAIL TO YOU. by Christopher Andrew on Thursday, August 5, 2010 - 8:33 am. (1 message)
From: Wu Fengguang
Date: Thursday, August 5, 2010 - 9:10 am

Andrew,

These are writeback patches intended for 2.6.36.

It's combined from 2 previous patchsets:

	writeback cleanups and trivial fixes <http://lkml.org/lkml/2010/7/10/153>
	writeback: try to write older pages first <http://lkml.org/lkml/2010/7/22/47>

changelog:
- removed patch "writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages()"
- added patch "writeback: explicit low bound for vm.dirty_ratio"
- use "if (list_empty(&wb->b_io))" directly in "writeback: sync expired inodes first in background writeback"
- fix misplaced chunk for removing more_io in include/trace/events/ext4.h
- update comments in "writeback: fix queue_io() ordering"
- update comments in "writeback: add comment to the dirty limits functions"
- patch "writeback: try more writeback as long as something was written" will
  no longer livelock sync() with Jan's sync() livelock avoidance patches

	[PATCH 01/13] writeback: reduce calls to global_page_state in balance_dirty_pages()
	[PATCH 02/13] writeback: avoid unnecessary calculation of bdi dirty thresholds
	[PATCH 03/13] writeback: add comment to the dirty limits functions
	[PATCH 04/13] writeback: dont redirty tail an inode with dirty pages
	[PATCH 05/13] writeback: fix queue_io() ordering
	[PATCH 06/13] writeback: merge for_kupdate and !for_kupdate cases
	[PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
	[PATCH 08/13] writeback: pass writeback_control down to move_expired_inodes()
	[PATCH 09/13] writeback: the kupdate expire timestamp should be a moving target
	[PATCH 10/13] writeback: kill writeback_control.more_io
	[PATCH 11/13] writeback: sync expired inodes first in background writeback
	[PATCH 12/13] writeback: try more writeback as long as something was written
	[PATCH 13/13] writeback: introduce writeback_control.inodes_written

Thanks,
Fengguang

--

From: Wu Fengguang
Date: Thursday, August 5, 2010 - 9:10 am

Reducing the number of times balance_dirty_pages calls global_page_state
reduces the cache references and so improves write performance on a
variety of workloads.

'perf stats' of simple fio write tests shows the reduction in cache
access.  Where the test is fio 'write,mmap,600Mb,pre_read' on AMD
AthlonX2 with 3Gb memory (dirty_threshold approx 600 Mb) running each
test 10 times, dropping the fasted & slowest values then taking the
average & standard deviation

		average (s.d.) in millions (10^6)
2.6.31-rc8	648.6 (14.6)
+patch		620.1 (16.5)

Achieving this reduction is by dropping clip_bdi_dirty_limit as it
rereads the counters to apply the dirty_threshold and moving this check
up into balance_dirty_pages where it has already read the counters.

Also by rearrange the for loop to only contain one copy of the limit
tests allows the pdflush test after the loop to use the local copies of
the counters rather than rereading them.

In the common case with no throttling it now calls global_page_state 5
fewer times and bdi_stat 2 fewer.

Fengguang:

This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
accurate we don't need to do routinely clip. A simple dirty limit check
would be enough.

The check is necessary because, in principle we should throttle
everything calling balance_dirty_pages() when we're over the total
limit, as said by Peter.

We now set and clear dirty_exceeded not only based on bdi dirty limits,
but also on the global dirty limit. The global limit check is added in
place of clip_bdi_dirty_limit() for safety and not intended as a
behavior change. The bdi limits should be tight enough to keep all dirty
pages under the global limit at most time; occasional small exceeding
should be OK though. The change makes the logic more obvious: the global
limit is the ultimate goal and shall be always ...
From: Wu Fengguang
Date: Thursday, August 5, 2010 - 9:11 am

Dynamically compute the dirty expire timestamp at queue_io() time.
Also remove writeback_control.older_than_this which is no longer used.

writeback_control.older_than_this used to be determined at entrance to
the kupdate writeback work. This _static_ timestamp may go stale if the
kupdate work runs on and on. The flusher may then stuck with some old
busy inodes, never considering newly expired inodes thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  delaying the expired dirty pages to the end of LRU lists, triggering
  the evil pageout(). Nevertheless this patch merely addresses part
  of the problem.

[kitayama@cl.bb4u.ne.jp] fix btrfs and ext4 references

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 --
 fs/fs-writeback.c                |   24 +++++++++---------------
 include/linux/writeback.h        |    2 --
 include/trace/events/writeback.h |    6 +-----
 mm/backing-dev.c                 |    1 -
 mm/page-writeback.c              |    1 -
 6 files changed, 10 insertions(+), 26 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:28:29.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:28:29.000000000 +0800
@@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
 				struct list_head *dispatch_queue,
 				struct writeback_control *wbc)
 {
+	unsigned long expire_interval = 0;
+	unsigned long older_than_this;
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
+	if (wbc->for_kupdate) {
+		expire_interval = ...
From: Wu Fengguang
Date: Thursday, August 5, 2010 - 9:10 am

Unify the logic for kupdate and non-kupdate cases.
There won't be starvation because the inodes requeued into b_more_io
will later be spliced _after_ the remaining inodes in b_io, hence won't
stand in the way of other inodes in the next run.

It avoids unnecessary redirty_tail() calls, hence the update of
i_dirtied_when. The timestamp update is undesirable because it could
later delay the inode's periodic writeback, or may exclude the inode
from the data integrity sync operation (which checks timestamp to
avoid extra work and livelock).

===
How the redirty_tail() comes about:

It was a long story.. This redirty_tail() was introduced with
wbc.more_io.  The initial patch for more_io actually does not have the
redirty_tail(), and when it's merged, several 100% iowait bug reports
arised:

reiserfs:
        http://lkml.org/lkml/2007/10/23/93

jfs:    
        commit 29a424f28390752a4ca2349633aaacc6be494db5
        JFS: clear PAGECACHE_TAG_DIRTY for no-write pages
        
ext2:   
        http://www.spinics.net/linux/lists/linux-ext4/msg04762.html

They are all old bugs hidden in various filesystems that become
"visible" with the more_io patch. At the time, the ext2 bug is thought
to be "trivial", so not fixed. Instead the following updated more_io
patch with redirty_tail() is merged: 

	http://www.spinics.net/linux/lists/linux-ext4/msg04507.html

This will in general prevent 100% on ext2 and possibly other unknown FS bugs.

CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   43 ++++++++++---------------------------------
 1 file changed, 10 insertions(+), 33 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:21:43.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:22:02.000000000 +0800
@@ -378,45 +378,22 @@ writeback_single_inode(struct inode *ino
 		if ...
From: Wu Fengguang
Date: Thursday, August 5, 2010 - 9:10 am

Force a user visible low bound of 5% for the vm.dirty_ratio interface.

Currently global_dirty_limits() applies a low bound of 5% for
vm_dirty_ratio.  This is not very user visible -- if the user sets
vm.dirty_ratio=1, the operation seems to succeed but will be rounded up
to 5% when used.

Another problem is inconsistency: calc_period_shift() uses the plain
vm_dirty_ratio value, which may be a problem when vm.dirty_ratio is set
to < 5 by the user.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 kernel/sysctl.c     |    3 ++-
 mm/page-writeback.c |   10 ++--------
 2 files changed, 4 insertions(+), 9 deletions(-)

--- linux-next.orig/kernel/sysctl.c	2010-08-05 22:48:34.000000000 +0800
+++ linux-next/kernel/sysctl.c	2010-08-05 22:48:47.000000000 +0800
@@ -126,6 +126,7 @@ static int ten_thousand = 10000;
 
 /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
 static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
+static int dirty_ratio_min = 5;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -1031,7 +1032,7 @@ static struct ctl_table vm_table[] = {
 		.maxlen		= sizeof(vm_dirty_ratio),
 		.mode		= 0644,
 		.proc_handler	= dirty_ratio_handler,
-		.extra1		= &zero,
+		.extra1		= &dirty_ratio_min,
 		.extra2		= &one_hundred,
 	},
 	{
--- linux-next.orig/mm/page-writeback.c	2010-08-05 22:48:42.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-05 22:48:47.000000000 +0800
@@ -415,14 +415,8 @@ void global_dirty_limits(unsigned long *
 
 	if (vm_dirty_bytes)
 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
-	else {
-		int dirty_ratio;
-
-		dirty_ratio = vm_dirty_ratio;
-		if (dirty_ratio < 5)
-			dirty_ratio = 5;
-		dirty = (dirty_ratio * available_memory) / 100;
-	}
+	else
+		dirty = (vm_dirty_ratio * available_memory) / 100;
 
 	if (dirty_background_bytes)
 		background = DIV_ROUND_UP(dirty_background_bytes, ...
From: Wu Fengguang
Date: Friday, August 6, 2010 - 5:44 am

Right.
         # echo 111 > /proc/sys/vm/dirty_ratio

Looks like a serious problem. I'm now much more reserved on pushing

Good point. Here is the patch with updated changelog.

Thanks,
Fengguang
---
Subject: writeback: explicit low bound for vm.dirty_ratio
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Thu Jul 15 10:28:57 CST 2010

Force a user visible low bound of 5% for the vm.dirty_ratio interface.

This is an interface change. When doing

	echo N > /proc/sys/vm/dirty_ratio

where N < 5, the old behavior is pretend to accept the value, while
the new behavior is to reject it explicitly with -EINVAL.  This will
possibly break user space if they checks the return value.

Currently global_dirty_limits() applies a low bound of 5% for
vm_dirty_ratio.  This is not very user visible -- if the user sets
vm.dirty_ratio=1, the operation seems to succeed but will be rounded up
to 5% when used.

Another problem is inconsistency: calc_period_shift() uses the plain
vm_dirty_ratio value, which may be a problem when vm.dirty_ratio is set
to < 5 by the user.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 kernel/sysctl.c     |    3 ++-
 mm/page-writeback.c |   10 ++--------
 2 files changed, 4 insertions(+), 9 deletions(-)

--- linux-next.orig/kernel/sysctl.c	2010-08-05 22:48:34.000000000 +0800
+++ linux-next/kernel/sysctl.c	2010-08-05 22:48:47.000000000 +0800
@@ -126,6 +126,7 @@ static int ten_thousand = 10000;
 
 /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
 static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
+static int dirty_ratio_min = 5;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -1031,7 +1032,7 @@ static struct ctl_table vm_table[] = {
 		.maxlen		= sizeof(vm_dirty_ratio),
 		.mode		= 0644,
 		.proc_handler	= dirty_ratio_handler,
-		.extra1		= &zero,
+		.extra1		= &dirty_ratio_min,
 		.extra2		= ...
From: KOSAKI Motohiro
Date: Monday, August 9, 2010 - 8:12 pm

Umm.. I dislike this change. Is there any good reason to refuse explicit 
admin's will? Why 1-4% is so bad? Internal clipping can be changed later
but explicit error behavior is hard to change later.

personally I prefer to
 - accept all value, or
 - clipping value in dirty_ratio_handler 

Both don't have explicit ABI change.

Thanks.



--

From: Neil Brown
Date: Monday, August 9, 2010 - 8:57 pm

On Tue, 10 Aug 2010 12:12:06 +0900 (JST)

As a data-point, I had a situation a while back where I needed a value below
1 to get desired behaviour.  The system had lots of RAM and fairly slow
write-back (over NFS) so a 'sync' could take minutes.

So I would much prefer allowing not only 1-4, but also fraction values!!!

I can see no justification at all for setting a lower bound of 5.  Even zero
can be useful - for testing purposes mostly.


--

From: Jan Kara
Date: Tuesday, August 10, 2010 - 6:29 am

If you run on a recent kernel, /proc/sys/vm/dirty_background_bytes and
dirty_bytes is what was introduced exactly for these purposes. Not that I
would think that magic clipping at 5% is a good thing...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Wu Fengguang
Date: Tuesday, August 10, 2010 - 11:12 am

Neil, that's perfectly legitimate need which I overlooked.
It seems that the vm.dirty_bytes parameter will work for your case.

Thanks,
Fengguang
--

From: Wu Fengguang
Date: Tuesday, August 10, 2010 - 11:06 am

Good point. Sorry for being ignorance. Neil is right that there is no
reason to impose some low bound. So the first option looks good.

Thanks,
Fengguang
--

From: Andrew Morton
Date: Thursday, August 5, 2010 - 4:08 pm

On Fri, 06 Aug 2010 00:10:51 +0800

I won't be here Friday.  I need to get my junk into mainline on Monday.
Everybody and his dog is working on writeback, which is good, but
there's a lot of flux here.

So, no, I think it's too late to be thinking about 2.6.36-rc1.  Let's
slow down a bit, review and test each other's proposals and work out
what we want to do in the longer term.  After we've done that, we can
calmly and carefully take a look to see if there are any nice goodies
which we want to slip into 2.6.36-rc3 or thereabouts.

--

From: Peter Zijlstra
Date: Friday, August 6, 2010 - 3:17 am

On Fri, 2010-08-06 at 00:10 +0800, Wu Fengguang wrote:


Another thing solved by the introduction of per-bdi dirty limits (and
now per-bdi flushing) is the whole stacked-bdi writeout deadlock.

Although I'm not sure we want/need to mention that here.
--

From: Wu Fengguang
Date: Saturday, August 7, 2010 - 9:47 am

The changelog looks like a suitable place :)

Thanks,
Fengguang
---
Subject: writeback: add comment to the dirty limits functions
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Thu Jul 15 09:54:25 CST 2010

Document global_dirty_limits(), bdi_dirty_limit() and task_dirty_limit().

Note that another thing solved by the introduction of per-bdi dirty
limits (and now per-bdi flushing) is the whole stacked-bdi writeout
deadlock.						-- Peter

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-08-03 23:14:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-05 00:37:17.000000000 +0800
@@ -261,11 +261,18 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * task_dirty_limit - scale down dirty throttling threshold for one task
  *
  * task specific dirty limit:
  *
  *   dirty -= (dirty/8) * p_{t}
+ *
+ * To protect light/slow dirtying tasks from heavier/fast ones, we start
+ * throttling individual tasks before reaching the bdi dirty limit.
+ * Relatively low thresholds will be allocated to heavy dirtiers. So when
+ * dirty pages grow large, heavy dirtiers will be throttled first, which will
+ * effectively curb the growth of dirty pages. Light dirtiers with high enough
+ * dirty threshold may never get throttled.
  */
 static unsigned long task_dirty_limit(struct task_struct *tsk,
 				       unsigned long bdi_dirty)
@@ -390,6 +397,15 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+/**
+ * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ *
+ * Calculate the dirty thresholds based on sysctl parameters
+ * - ...
Previous thread: [PATCH 08/13] writeback: pass writeback_control down to move_expired_inodes() by Wu Fengguang on Thursday, August 5, 2010 - 9:10 am. (1 message)

Next thread: THIS IS MY SECOND MAIL TO YOU. by Christopher Andrew on Thursday, August 5, 2010 - 8:33 am. (1 message)