Re: [PATCH] vmscan: remove wait_on_page_writeback() from pageout()

Previous thread: Email Upgrading Alert !!! by Derek Whitney on Tuesday, July 27, 2010 - 10:18 pm. (1 message)

Next thread: [tip:x86/asm] x86, asm: Clean up and simplify set_64bit() by tip-bot for H. Peter Anvin on Wednesday, July 28, 2010 - 12:18 am. (2 messages)
From: Wu Fengguang
Date: Wednesday, July 28, 2010 - 12:17 am

Fix "system goes unresponsive under memory pressure and lots of
dirty/writeback pages" bug.

	http://lkml.org/lkml/2010/4/4/86

In the above thread, Andreas Mohr described that

	Invoking any command locked up for minutes (note that I'm
	talking about attempted additional I/O to the _other_,
	_unaffected_ main system HDD - such as loading some shell
	binaries -, NOT the external SSD18M!!).

This happens when the two conditions are both meet:
- under memory pressure
- writing heavily to a slow device

OOM also happens in Andreas' system. The OOM trace shows that 3
processes are stuck in wait_on_page_writeback() in the direct reclaim
path. One in do_fork() and the other two in unix_stream_sendmsg(). They
are blocked on this condition:

	(sc->order && priority < DEF_PRIORITY - 2)

which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too
permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim
for the order-1 fork() allocation runs into a range of 512KB
hard-to-reclaim LRU pages, it will be stalled.

It's a severe problem in three ways.

Firstly, it can easily happen in daily desktop usage.  vmscan priority
can easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even
if the system has 50% globally reclaimable pages, it still has good
opportunity to have 0.1% sized hard-to-reclaim ranges. For example, a
simple dd can easily create a big range (up to 20%) of dirty pages in
the LRU lists.

Secondly, once triggered, it will stall unrelated processes (not doing IO
at all) in the system. This "one slow USB device stalls the whole system"
avalanching effect is very bad.

Thirdly, once stalled, the stall time could be intolerable long for the
users.  When there are 20MB queued writeback pages and USB 1.1 is
writing them in 1MB/s, wait_on_page_writeback() will stuck for up to 20
seconds.  Not to mention it may be called multiple times.

So raise the bar to only ...
From: Wu Fengguang
Date: Wednesday, July 28, 2010 - 1:46 am

The wait_on_page_writeback() call inside pageout() is virtually dead code.

        shrink_inactive_list()
          shrink_page_list(PAGEOUT_IO_ASYNC)
            pageout(PAGEOUT_IO_ASYNC)
          shrink_page_list(PAGEOUT_IO_SYNC)
            pageout(PAGEOUT_IO_SYNC)

Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
pageout(ASYNC) converts dirty pages into writeback pages, the second
shrink_page_list(SYNC) waits on the clean of writeback pages before
calling pageout(SYNC). The second shrink_page_list(SYNC) can hardly run
into dirty pages for pageout(SYNC) unless in some race conditions.

And the wait page-by-page behavior of pageout(SYNC) will lead to very
long stall time if running into some range of dirty pages. So it's bad
idea anyway to call wait_on_page_writeback() inside pageout().

CC: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/vmscan.c |   13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

--- linux-next.orig/mm/vmscan.c	2010-07-28 16:22:21.000000000 +0800
+++ linux-next/mm/vmscan.c	2010-07-28 16:23:35.000000000 +0800
@@ -324,8 +324,7 @@ typedef enum {
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
  */
-static pageout_t pageout(struct page *page, struct address_space *mapping,
-						enum pageout_io sync_writeback)
+static pageout_t pageout(struct page *page, struct address_space *mapping)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -384,14 +383,6 @@ static pageout_t pageout(struct page *pa
 			return PAGE_ACTIVATE;
 		}
 
-		/*
-		 * Wait on writeback if requested to. This happens when
-		 * direct reclaiming a large contiguous area and the
-		 * first attempt to free a range of pages fails.
-		 */
-		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
-			wait_on_page_writeback(page);
-
 		if ...
From: Mel Gorman
Date: Wednesday, July 28, 2010 - 2:10 am

It's possible for the second call to run into dirty pages as there is a
congestion_wait() call between the first shrink_page_list() call and the

True, but this is also lumpy reclaim which is depending on a contiguous
range of pages. It's better for it to wait on the selected range of pages
which is known to contain at least one old page than excessively scan and

I recognise that you are probably thinking of the stall-due-to-fork problem
but I'd expect the patch that raises the bar for <= PAGE_ALLOC_COSTLY_ORDER
to be sufficient. If not, I think it still makes sense to call

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Wu Fengguang
Date: Wednesday, July 28, 2010 - 2:30 am

OK there is a <=0.1s time window. Then what about the data set size?
After first shrink_page_list(ASYNC), there will be hardly any pages
left in the page_list except for the already under-writeback pages and
other unreclaimable pages. So it still asks for some race conditions
for hitting the second pageout(SYNC) -- some unreclaimable pages

The main intention of this patch is to remove semi-dead code.
I'm less disturbed by the long stall time now with the previous patch ;)

Thanks,
Fengguang
--

From: Mel Gorman
Date: Wednesday, July 28, 2010 - 2:45 am

Ok, big was an exagguration for IO but during this window the page can

We are hitting this window because otherwise the trace points would not be
reporting sync IO in pageout(). Take from an ftrace-based report

Direct reclaims                               1176
Direct reclaim pages scanned                184337
Direct reclaim write file async I/O           2317
Direct reclaim write anon async I/O          35551
Direct reclaim write file sync I/O            1817
Direct reclaim write anon sync I/O           15920

For the last line to have a positive value, we must have called
pageout(PAGEOUT_IO_ASYNC) and then hit a dirty page during the
pageout(PAGEOUT_IO_SYNC) call.

Here is one fairly plausible scenario where we end up waiting on
writeback despite the previous pageout() call.

shrink_inactive_list()
  shrink_page_list(PAGEOUT_IO_ASYNC)
    Check PageWriteback
    Unmap page (set_dirty_page, if PTE was dirty)
    pageout(PAGEOUT_IO_ASYNC, IO starts, page in writeback)
    call congestion_wait()

During this 0.1s window, the process references the page and faults in.
As it is lumpy reclaim, the page could have been young even though it
was physically located near an old page

  shrink_page_list(PAGEOUT_IO_SYNC)
    Check PageWriteback (Lets assume it is written back for this example)
    Unmap page again (dirty page again, if PTE was dirty)

Unfortuately, while the code may not be currently doing the most
efficient thing with respect to lumpy reclaim, it's not dead either :/

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KOSAKI Motohiro
Date: Wednesday, July 28, 2010 - 2:43 am

Today, I was successful to reproduce the Andres's issue. and I disagree this
opinion.
The root cause is, congestion_wait() mean "wait until clear io congestion". but
if the system have plenty dirty pages, flusher threads are issueing IO conteniously.
So, io congestion is not cleared long time. eventually, congestion_wait(BLK_RW_ASYNC, HZ/10)
become to equivalent to sleep(HZ/10).

I would propose followint patch instead.

And I've found synchronous lumpy reclaim have more serious problem. I woule like to
explain it as another mail.

Thanks.



From 0266fb2c23aef659cd4e89fccfeb464f23257b74 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Tue, 27 Jul 2010 14:36:44 +0900
Subject: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

congestion_wait() mean "waiting for number of requests in IO queue is
under congestion threshold".
That said, if the system have plenty dirty pages, flusher thread push
new request to IO queue conteniously. So, IO queue are not cleared
congestion status for a long time. thus, congestion_wait(HZ/10) is
almostly equivalent schedule_timeout(HZ/10).

If the system 512MB memory, DEF_PRIORITY mean 128kB scan and 4096 times
shrink_inactive_list call. 4096 times 0.1sec stall makes crazy insane
long stall. That shouldn't.

In the other hand, this synchronous lumpy reclaim donesn't need this
congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) cause to
call wait_on_page_writeback() and it provide sufficient waiting.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 97170eb..2aa16eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1304,8 +1304,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	 */
 	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
 			sc->lumpy_reclaim_mode) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
-
 ...
From: Mel Gorman
Date: Wednesday, July 28, 2010 - 2:50 am

Is Andres's issue not covered by the patch "vmscan: raise the bar to
PAGEOUT_IO_SYNC stalls" because wait_on_page_writeback() was the

I think the final paragraph makes a lot of sense. If a lumpy reclaimer is
going to get stalled on wait_on_page_writeback(), it should be a sufficient
throttling mechanism.


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KOSAKI Motohiro
Date: Wednesday, July 28, 2010 - 2:59 am

Well, "vmscan: raise the bar to PAGEOUT_IO_SYNC stalls" is completely bandaid and
much IO under slow USB flash memory device still cause such problem even if the patch is applied.

But removing wait_on_page_writeback() doesn't solve the issue perfectly because current
lumpy reclaim have multiple sick. again, I'm writing explaining mail.....



--

From: Wu Fengguang
Date: Saturday, July 31, 2010 - 10:27 pm

As for this patch, raising the bar to PAGEOUT_IO_SYNC reduces both
calls to congestion_wait() and wait_on_page_writeback(). So it

Let's submit the two known working fixes first?

Thanks,
Fengguang
--

From: Wu Fengguang
Date: Saturday, July 31, 2010 - 10:49 pm

To base the discussion on more subjective numbers, I run the attached
debug patch on my desktop. I have 4GB memory without swap. Two
processes are started:

        usemem 3g --sleep 300
        cp /dev/zero /tmp

I didn't get any stall reports with DEF_PRIORITY/3, so I lowered it to
DEF_PRIORITY-2 and get the below reports. The test is a good testimony
for "vmscan: raise the bar to PAGEOUT_IO_SYNC stalls": it's working as
expected and does avoid the stalls. When the priority goes down to 4,
your patch to remove congestion_wait() will come into play. So they
are both good patches to have.

[Sorry I cannot afford to add more stresses to test your patch.
 My remote test box was stressed to death yesterday and asks for
 some physical reset tomorrow. My desktop has already corrupted the
 .zsh_history file twice and can't risk losing more data..]

Thanks,
Fengguang

[ 1113.740511] reclaim stall: priority 9
[ 1113.844288] reclaim stall: priority 9
[ 1113.944292] reclaim stall: priority 9
[ 1114.048274] reclaim stall: priority 9
[ 1114.252816] reclaim stall: priority 9
[ 1114.352293] reclaim stall: priority 9
[ 1114.452265] reclaim stall: priority 9
[ 1114.552258] reclaim stall: priority 9
[ 1114.652259] reclaim stall: priority 9
[ 1114.752259] reclaim stall: priority 9
[ 1114.852252] reclaim stall: priority 9
[ 1114.952254] reclaim stall: priority 9
[ 1115.052251] reclaim stall: priority 9
[ 1115.152254] reclaim stall: priority 9
[ 1115.252242] reclaim stall: priority 9
[ 1115.452251] reclaim stall: priority 8
[ 1115.552250] reclaim stall: priority 8
[ 1116.403126] reclaim stall: priority 9
[ 1116.440364] reclaim stall: priority 9
[ 1116.478632] reclaim stall: priority 9
[ 1116.500230] reclaim stall: priority 9
[ 1116.540243] reclaim stall: priority 9
[ 1116.576266] reclaim stall: priority 9
[ 1116.600223] reclaim stall: priority 9
[ 1116.640221] reclaim stall: priority 9
[ 1116.676219] reclaim stall: priority 9
[ 1116.740233] reclaim stall: priority 9
[ ...
From: KOSAKI Motohiro
Date: Sunday, August 1, 2010 - 1:32 am

Definitely, I can't oppose obvious test result (by another your mail) :-)

OK, should go!



Thanks.




--

From: Wu Fengguang
Date: Sunday, August 1, 2010 - 1:35 am

Great. Shall I go first? My changelog has more background :)

Thanks,
Fengguang
--

From: Wu Fengguang
Date: Saturday, July 31, 2010 - 10:17 pm

Good point. Maybe more clear to say: "It takes 4096 shrink_page_list()

Agreed.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks,
Fengguang
--

From: Minchan Kim
Date: Wednesday, July 28, 2010 - 9:29 am

Although we remove it in pageout, shrink_page_list still has it.
So it would result in long stall. And it is for lumpy reclaim which 
means it's a trade-off, I think. 

-- 
Kind regards,
Minchan Kim
--

From: Andrew Morton
Date: Wednesday, July 28, 2010 - 10:30 am

The idea is that vmscan doesn't call ->writepage if the underlying
queue is congested.  may_write_to_queue()->bdi_queue_congested() should
return false and we skip the write.

If that logic is broken then that would explain a few things...
--

From: KOSAKI Motohiro
Date: Wednesday, July 28, 2010 - 6:01 pm

we already have it in may_write_to_queue(). but kswapd and zone-reclaim have
PF_SWAPWRITE then ignore queue congestion. (btw, I believe zone-reclaim 
shouldn't use PF_SWAPWRITE). so, kswapd get stuck in get_request_wait() frequently. 

following commit explain why kswapd have to ignore queue congestion....

commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
Author: akpm <akpm>
Date:   Sun Dec 22 01:07:33 2002 +0000

    [PATCH] Give kswapd writeback higher priority than pdflush

    The `low latency page reclaim' design works by preventing page
    allocators from blocking on request queues (and by preventing them from
    blocking against writeback of individual pages, but that is immaterial
    here).

    This has a problem under some situations.  pdflush (or a write(2)
    caller) could be saturating the queue with highmem pages.  This
    prevents anyone from writing back ZONE_NORMAL pages.  We end up doing
    enormous amounts of scenning.


And following commit made hard limit in io queue and changed vmscan writeout
behavior a lot if my understanding is correct. 


commit 082cf69eb82681f4eacb3a5653834c7970714bef
Author: Jens Axboe <axboe@suse.de>
Date:   Tue Jun 28 16:35:11 2005 +0200

    [PATCH] ll_rw_blk: prevent huge request allocations

    Currently we cap request allocations at q->nr_requests, but we allow a
    batching io context to allocate up to 32 more (default setting).  This
    can flood the queue with request allocations, with only a few batching
    processes.  The real fix would be to limit the number of batchers, but
    as that isn't currently tracked, I suggest we just cap the maximum
    number of allocated requests to eg 50% over the limit.

    This was observed in real life, users typically see this as vmstat bo
    numbers going off the wall with seconds of no queueing afterwards.
    Behaviour this bursty is not beneficial.

    Signed-off-by: Jens Axboe <axboe@suse.de>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff ...
From: Andrea Arcangeli
Date: Friday, July 30, 2010 - 6:17 am

Lumpy reclaim just made the system totally unusable with frequent
order 9 allocations. I nuked it long ago and replaced it with mem
compaction. You may try aa.git to test how thing goes without lumpy
reclaim. I recently also started to use mem compaction for order 1/2/3
allocations as there's no point not to use it for them, and to call
mem compaction from kswapd to satisfy order 2 GFP_ATOMIC in
replacement of blind responsiveness-destroyer lumpy.

Not sure why people insists on lumpy when we've memory compaction that
won't alter the working set and it's more effective.
--

From: Mel Gorman
Date: Friday, July 30, 2010 - 6:31 am

Yes, it's very disruptive and has been for a while. It was not much of a
problem when resizing the static hugepage pool but is a disaster for

A full-scale replacement is overkill but I can see why it would be done
in the short-term. There are times when lumpy reclaim is still needed -
specifically when the allocation failure is due to a lack of memory rather
than fragmentation. There will also be cases where compaction can't work

Compaction is preferred, no doubt about it but lumpy reclaim cannot be
dismissed. I know lumpy reclaim is too disruptive and Kosaki noticed the same
and it's currently doing some pretty stupid things. There are a few ideas
knocking around publicly on how to reduce its impact while increasing its
effectiveness. I have a few old ideas knocking around as well that I just
need the time to get around to. I hope to get at it after the fuss over
writeback is addressed.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Hellwig
Date: Saturday, July 31, 2010 - 10:33 am

It looks much, much worse on my system.  Basically all inode structures,
and also tons of frequently allocated xfs structures fall into this
category,  None of them actually anywhere near the size of a page, which
makes me wonder why we do such high order allocations:

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfsd4_stateowners      0      0    424   19    2 : tunables    0    0    0 : slabdata      0      0      0
kvm_vcpu               0      0  10400    3    8 : tunables    0    0    0 : slabdata      0      0      0
kmalloc_dma-512       32     32    512   16    2 : tunables    0    0    0 : slabdata      2      2      0
mqueue_inode_cache     18     18    896   18    4 : tunables    0    0    0 : slabdata      1      1      0
xfs_inode         279008 279008   1024   16    4 : tunables    0    0    0 : slabdata  17438  17438      0
xfs_efi_item          44     44    360   22    2 : tunables    0    0    0 : slabdata      2      2      0
xfs_efd_item          44     44    368   22    2 : tunables    0    0    0 : slabdata      2      2      0
xfs_trans             40     40    800   20    4 : tunables    0    0    0 : slabdata      2      2      0
xfs_da_state          32     32    488   16    2 : tunables    0    0    0 : slabdata      2      2      0
nfs_inode_cache        0      0   1016   16    4 : tunables    0    0    0 : slabdata      0      0      0
isofs_inode_cache      0      0    632   25    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache        0      0    664   12    2 : tunables    0    0    0 : slabdata      0      0      0
hugetlbfs_inode_cache     14     14    584   14    2 : tunables    0    0    0 : slabdata      1      1      0
ext4_inode_cache       0      0    968   16    4 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache      21     21    776   21  ...
From: Pekka Enberg
Date: Saturday, July 31, 2010 - 10:55 am

Do you have CONFIG_SLUB enabled? It does high order allocations by
--

From: Christoph Hellwig
Date: Saturday, July 31, 2010 - 10:59 am

Yes. This is a kernel using slub.

--

From: Pekka Enberg
Date: Saturday, July 31, 2010 - 11:09 am

You can pass "slub_debug=o" as a kernel parameter to disable higher
order allocations if you want to test things. The per-cache default
order is calculated in calculate_order() of mm/slub.c. How many CPUs
do you have on your system?
--

Previous thread: Email Upgrading Alert !!! by Derek Whitney on Tuesday, July 27, 2010 - 10:18 pm. (1 message)

Next thread: [tip:x86/asm] x86, asm: Clean up and simplify set_64bit() by tip-bot for H. Peter Anvin on Wednesday, July 28, 2010 - 12:18 am. (2 messages)