Re: [PATCH 4/6] writeback: sync expired inodes first in background writeback

Previous thread: [Patch] kexec: increase max of kexec segments and use dynamic allocation by Amerigo Wang on Wednesday, July 21, 2010 - 11:13 pm. (9 messages)

Next thread: [PATCH 1/6] writeback: pass writeback_control down to move_expired_inodes() by Wu Fengguang on Wednesday, July 21, 2010 - 10:09 pm. (4 messages)
From: Wu Fengguang
Date: Wednesday, July 21, 2010 - 10:09 pm

A background flush work may run for ever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is
- enqueue all newly expired inodes at each queue_io() time
- retry with a halved expire interval until there are some inodes to sync

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-22 12:56:42.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-22 13:07:51.000000000 +0800
@@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long older_than_this = 0; /* reset to kill gcc warning */
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -232,8 +232,15 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
-			break;
+		    inode_dirtied_after(inode, older_than_this)) {
+			if (wbc->for_background &&
+			    list_empty(dispatch_queue) && list_empty(&tmp)) {
+				expire_interval >>= 1;
+				older_than_this = jiffies - expire_interval;
+				continue;
+			} else
+				break;
+		}
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -521,7 +528,8 @@ void writeback_inodes_wb(struct bdi_writ
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (!(wbc->for_kupdate || ...
From: Jan Kara
Date: Friday, July 23, 2010 - 11:15 am

Hmm, this logic looks a bit arbitrary to me. What I actually don't like
very much about this is that when there aren't inodes older than, say, 2
seconds, you'll end up queueing just inodes between 2s and 1s. So I'd
rather just queue inodes older than the limit and, if there are none, just
queue all other dirty inodes.
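
To make the difference concrete, here is a minimal user-space sketch (not
kernel code) of the two selection policies. It assumes inode dirty ages in
milliseconds as a stand-in for dirtied_when and jiffies; the names
queue_halving() and queue_two_sweeps() are hypothetical and exist only for
this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Halving policy, modeled on the patch above: retry with a halved
 * threshold until at least one inode qualifies. Returns how many
 * inodes would be queued. ages_ms[] and interval_ms are illustrative
 * stand-ins for dirtied_when and expire_interval.
 */
static size_t queue_halving(const unsigned long *ages_ms, size_t n,
			    unsigned long interval_ms)
{
	size_t queued;

	assert(ages_ms != NULL || n == 0);
	for (;;) {
		queued = 0;
		for (size_t i = 0; i < n; i++)
			if (ages_ms[i] >= interval_ms)
				queued++;
		/* stop once something qualified or the threshold hit 0 */
		if (queued || !interval_ms)
			return queued;
		interval_ms >>= 1;
	}
}

/*
 * Two-sweep policy: queue the expired inodes; if there are none,
 * queue every dirty inode.
 */
static size_t queue_two_sweeps(const unsigned long *ages_ms, size_t n,
			       unsigned long interval_ms)
{
	size_t queued = 0;

	assert(ages_ms != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (ages_ms[i] >= interval_ms)
			queued++;
	return queued ? queued : n;
}
```

With ages of 1500, 1200, 800 and 300 ms and a 30000 ms initial interval,
the halving variant stops at a 937 ms threshold and queues only the two
oldest inodes, while the two-sweep variant queues all four.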

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Wu Fengguang
Date: Monday, July 26, 2010 - 4:51 am

You are proposing

-				expire_interval >>= 1;
+				expire_interval = 0;

IMO this does not really simplify the code or the concept. If we can get
the "smoother" behavior of the original patch without extra cost, why not?

Thanks,
--

From: Jan Kara
Date: Monday, July 26, 2010 - 5:12 am

I agree there's no substantial code simplification. But I see a
substantial "behavior" simplification (just two sweeps instead of 10 or
so). But I don't really insist on the two sweeps, it's just that I don't
see a justification for the exponential backoff here... I mean what's the
point if the interval we queue gets really small? Why not just use
expire_interval/2 as a step if you want a smoother behavior?
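
As a rough check on the sweep counts, the following user-space sketch
counts the worst-case retries before the interval reaches zero, i.e. when
no inode qualifies at any nonzero threshold. It assumes an initial interval
of 3000 jiffies (the default 30 second expiry at HZ=100); the helper names
are hypothetical.

```c
#include <assert.h>

/* Retries until the interval reaches 0 when it is halved each time. */
static unsigned int retries_halving(unsigned long interval)
{
	unsigned int n = 0;

	while (interval) {
		interval >>= 1;
		n++;
	}
	return n;
}

/* Retries until the interval reaches 0 when a fixed step is subtracted. */
static unsigned int retries_fixed_step(unsigned long interval,
				       unsigned long step)
{
	unsigned int n = 0;

	assert(step > 0);	/* a zero step would never terminate */
	while (interval) {
		interval -= (interval < step) ? interval : step;
		n++;
	}
	return n;
}
```

Under that assumption, halving takes 12 retries to get from 3000 to 0,
while a fixed step of half the initial interval (1500) takes 2.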

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Mel Gorman
Date: Monday, July 26, 2010 - 3:57 am

Ok, intuitively this would appear to tie into pageout where we want
older inodes to be cleaned first by background flushers to limit the
number of dirty pages encountered by page reclaim. If this is accurate,

This needs a comment.

I think what it is saying is that if background flush is active but no
inodes are old enough, consider newer inodes. This is on the assumption
that page reclaim has encountered dirty pages and the dirty inodes are

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Wu Fengguang
Date: Monday, July 26, 2010 - 5:00 am

Good suggestion. I'll add these lines:

This is to help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So
syncing older inodes first helps reduce the dirty pages reached by
the page reclaim code.

Thanks,
Fengguang
--

From: Jan Kara
Date: Monday, July 26, 2010 - 5:20 am

Well, this kind of implicitly assumes that once a page is written, it
doesn't get accessed anymore, right? Which I imagine is often true but
not for all workloads... Anyway I think this behavior is a good start
also because it is kind of natural to users to see "old" files written

								Honza  
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Wu Fengguang
Date: Monday, July 26, 2010 - 5:31 am

Thanks,
--

From: Jan Kara
Date: Monday, July 26, 2010 - 5:39 am

Sorry, I probably wasn't clear enough :) I meant: The claim that "older
inodes contain older dirty pages, which are closer to the end of the
LRU lists" assumes that once a page is written it doesn't get accessed
again. For example, files which get continual random access (like DB files)
can have a rather old dirtied_when but some of their pages are accessed quite

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Wu Fengguang
Date: Monday, July 26, 2010 - 5:47 am

Ah yes. That leads to another fact: smaller inodes tend to have a
stronger correlation between the inode's dirty age and its pages' dirty age.

This is one of the reasons not to sync a huge dirty inode in one shot.
Instead of

        sync  1G for inode A
        sync 10M for inode B
        sync 10M for inode C
        sync 10M for inode D

It's better to

        sync 128M for inode A
        sync  10M for inode B
        sync  10M for inode C
        sync  10M for inode D
        sync 128M for inode A
        sync 128M for inode A
        sync 128M for inode A
        sync  10M for inode E (newly expired)
        sync 128M for inode A
        ...
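
The chunked ordering above can be sketched in user space as a simple
round-robin pass with a per-inode cap. This is only an illustration under
stated assumptions: struct dirty_inode, sync_pass() and
passes_until_clean() are hypothetical names, sizes are in MB, and newly
expired inodes (like inode E above) are not modeled.

```c
#include <assert.h>
#include <stddef.h>

struct dirty_inode {
	char name;		/* label as used in the schedule above */
	unsigned long dirty_mb;	/* dirty data still to be written */
};

/*
 * One writeback pass: write at most chunk_mb from each inode, in list
 * order. Returns the total amount written in this pass.
 */
static unsigned long sync_pass(struct dirty_inode *inodes, size_t n,
			       unsigned long chunk_mb)
{
	unsigned long written = 0;

	assert(chunk_mb > 0);
	for (size_t i = 0; i < n; i++) {
		unsigned long w = inodes[i].dirty_mb < chunk_mb ?
				  inodes[i].dirty_mb : chunk_mb;

		inodes[i].dirty_mb -= w;
		written += w;
	}
	return written;
}

/* Number of passes that still write something before all inodes are clean. */
static unsigned int passes_until_clean(struct dirty_inode *inodes, size_t n,
				       unsigned long chunk_mb)
{
	unsigned int passes = 0;

	while (sync_pass(inodes, n, chunk_mb))
		passes++;
	return passes;
}
```

Starting from A = 1024 MB and B, C, D = 10 MB each with a 128 MB cap, the
first pass cleans the three small inodes completely and leaves 896 MB on
A, which then needs seven more passes.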

Thanks,
Fengguang
--

From: Wu Fengguang
Date: Monday, July 26, 2010 - 5:56 am

Yes this should be commented. How about this one?

@@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
        while (!list_empty(delaying_queue)) {
                inode = list_entry(delaying_queue->prev, struct inode, i_list);
                if (expire_interval &&
-                   inode_dirtied_after(inode, older_than_this))
+                   inode_dirtied_after(inode, older_than_this)) {
+                       /*
+                        * background writeback will start with expired inodes,
+                        * and then fresh inodes. This order helps reduce
+                        * the number of dirty pages that reach the end of
+                        * the LRU lists and cause trouble for page reclaim.
+                        */
+                       if (wbc->for_background &&
+                           list_empty(dispatch_queue) && list_empty(&tmp)) {
+                               expire_interval = 0;
+                               continue;
+                       }
                        break;
+               }
                if (sb && sb != inode->i_sb)
                        do_sb_sort = 1;
                sb = inode->i_sb;

Thanks,
Fengguang
--

From: Wu Fengguang
Date: Monday, July 26, 2010 - 6:11 am

Thanks. Here is the updated patch.
---
Subject: writeback: sync expired inodes first in background writeback
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Wed Jul 21 20:11:53 CST 2010

A background flush work may run for ever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is
- enqueue all newly expired inodes at each queue_io() time
- enqueue all dirty inodes if there are no more expired inodes to sync

This will help reduce the number of dirty pages encountered by page
reclaim, e.g. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So
syncing older inodes first helps reduce the dirty pages reached by
the page reclaim code.

CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-26 20:19:01.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-26 21:10:42.000000000 +0800
@@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long older_than_this = 0; /* reset to kill gcc warning */
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
+		    inode_dirtied_after(inode, older_than_this)) {
+			/*
+			 * background ...
From: Mel Gorman
Date: Tuesday, July 27, 2010 - 2:45 am

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Sunday, August 1, 2010 - 8:15 am

Maybe I am rather late. 

Nitpick.
uninitialized_var() would be more consistent. :)

I haven't followed this patch series closely, but this patch series is a
fundamental way to go for reducing pageout.
-- 
Kind regards,
Minchan Kim
--
