Fix "system goes unresponsive under memory pressure and lots of dirty/writeback pages" bug. http://lkml.org/lkml/2010/4/4/86 In the above thread, Andreas Mohr described that Invoking any command locked up for minutes (note that I'm talking about attempted additional I/O to the _other_, _unaffected_ main system HDD - such as loading some shell binaries -, NOT the external SSD18M!!). This happens when the two conditions are both meet: - under memory pressure - writing heavily to a slow device OOM also happens in Andreas' system. The OOM trace shows that 3 processes are stuck in wait_on_page_writeback() in the direct reclaim path. One in do_fork() and the other two in unix_stream_sendmsg(). They are blocked on this condition: (sc->order && priority < DEF_PRIORITY - 2) which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim for the order-1 fork() allocation runs into a range of 512KB hard-to-reclaim LRU pages, it will be stalled. It's a severe problem in three ways. Firstly, it can easily happen in daily desktop usage. vmscan priority can easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if the system has 50% globally reclaimable pages, it still has good opportunity to have 0.1% sized hard-to-reclaim ranges. For example, a simple dd can easily create a big range (up to 20%) of dirty pages in the LRU lists. Secondly, once triggered, it will stall unrelated processes (not doing IO at all) in the system. This "one slow USB device stalls the whole system" avalanching effect is very bad. Thirdly, once stalled, the stall time could be intolerable long for the users. When there are 20MB queued writeback pages and USB 1.1 is writing them in 1MB/s, wait_on_page_writeback() will stuck for up to 20 seconds. Not to mention it may be called multiple times. So raise the bar to only ...
The wait_on_page_writeback() call inside pageout() is virtually dead code.
shrink_inactive_list()
shrink_page_list(PAGEOUT_IO_ASYNC)
pageout(PAGEOUT_IO_ASYNC)
shrink_page_list(PAGEOUT_IO_SYNC)
pageout(PAGEOUT_IO_SYNC)
Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
pageout(ASYNC) converts dirty pages into writeback pages, the second
shrink_page_list(SYNC) waits on the clean of writeback pages before
calling pageout(SYNC). The second shrink_page_list(SYNC) can hardly run
into dirty pages for pageout(SYNC) unless in some race conditions.
And the wait page-by-page behavior of pageout(SYNC) will lead to very
long stall time if running into some range of dirty pages. So it's bad
idea anyway to call wait_on_page_writeback() inside pageout().
CC: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/vmscan.c | 13 ++-----------
1 file changed, 2 insertions(+), 11 deletions(-)
--- linux-next.orig/mm/vmscan.c 2010-07-28 16:22:21.000000000 +0800
+++ linux-next/mm/vmscan.c 2010-07-28 16:23:35.000000000 +0800
@@ -324,8 +324,7 @@ typedef enum {
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
*/
-static pageout_t pageout(struct page *page, struct address_space *mapping,
- enum pageout_io sync_writeback)
+static pageout_t pageout(struct page *page, struct address_space *mapping)
{
/*
* If the page is dirty, only perform writeback if that write
@@ -384,14 +383,6 @@ static pageout_t pageout(struct page *pa
return PAGE_ACTIVATE;
}
- /*
- * Wait on writeback if requested to. This happens when
- * direct reclaiming a large contiguous area and the
- * first attempt to free a range of pages fails.
- */
- if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
- wait_on_page_writeback(page);
-
if ...It's possible for the second call to run into dirty pages as there is a congestion_wait() call between the first shrink_page_list() call and the True, but this is also lumpy reclaim which is depending on a contiguous range of pages. It's better for it to wait on the selected range of pages which is known to contain at least one old page than excessively scan and I recognise that you are probably thinking of the stall-due-to-fork problem but I'd expect the patch that raises the bar for <= PAGE_ALLOC_COSTLY_ORDER to be sufficient. If not, I think it still makes sense to call -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
OK there is a <=0.1s time window. Then what about the data set size? After first shrink_page_list(ASYNC), there will be hardly any pages left in the page_list except for the already under-writeback pages and other unreclaimable pages. So it still asks for some race conditions for hitting the second pageout(SYNC) -- some unreclaimable pages The main intention of this patch is to remove semi-dead code. I'm less disturbed by the long stall time now with the previous patch ;) Thanks, Fengguang --
Ok, big was an exagguration for IO but during this window the page can
We are hitting this window because otherwise the trace points would not be
reporting sync IO in pageout(). Take from an ftrace-based report
Direct reclaims 1176
Direct reclaim pages scanned 184337
Direct reclaim write file async I/O 2317
Direct reclaim write anon async I/O 35551
Direct reclaim write file sync I/O 1817
Direct reclaim write anon sync I/O 15920
For the last line to have a positive value, we must have called
pageout(PAGEOUT_IO_ASYNC) and then hit a dirty page during the
pageout(PAGEOUT_IO_SYNC) call.
Here is one fairly plausible scenario where we end up waiting on
writeback despite the previous pageout() call.
shrink_inactive_list()
shrink_page_list(PAGEOUT_IO_ASYNC)
Check PageWriteback
Unmap page (set_dirty_page, if PTE was dirty)
pageout(PAGEOUT_IO_ASYNC, IO starts, page in writeback)
call congestion_wait()
During this 0.1s window, the process references the page and faults in.
As it is lumpy reclaim, the page could have been young even though it
was physically located near an old page
shrink_page_list(PAGEOUT_IO_SYNC)
Check PageWriteback (Lets assume it is written back for this example)
Unmap page again (dirty page again, if PTE was dirty)
Unfortuately, while the code may not be currently doing the most
efficient thing with respect to lumpy reclaim, it's not dead either :/
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
Today, I was successful to reproduce the Andres's issue. and I disagree this
opinion.
The root cause is, congestion_wait() mean "wait until clear io congestion". but
if the system have plenty dirty pages, flusher threads are issueing IO conteniously.
So, io congestion is not cleared long time. eventually, congestion_wait(BLK_RW_ASYNC, HZ/10)
become to equivalent to sleep(HZ/10).
I would propose followint patch instead.
And I've found synchronous lumpy reclaim have more serious problem. I woule like to
explain it as another mail.
Thanks.
From 0266fb2c23aef659cd4e89fccfeb464f23257b74 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Tue, 27 Jul 2010 14:36:44 +0900
Subject: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()
congestion_wait() mean "waiting for number of requests in IO queue is
under congestion threshold".
That said, if the system have plenty dirty pages, flusher thread push
new request to IO queue conteniously. So, IO queue are not cleared
congestion status for a long time. thus, congestion_wait(HZ/10) is
almostly equivalent schedule_timeout(HZ/10).
If the system 512MB memory, DEF_PRIORITY mean 128kB scan and 4096 times
shrink_inactive_list call. 4096 times 0.1sec stall makes crazy insane
long stall. That shouldn't.
In the other hand, this synchronous lumpy reclaim donesn't need this
congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) cause to
call wait_on_page_writeback() and it provide sufficient waiting.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 97170eb..2aa16eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1304,8 +1304,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
*/
if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
sc->lumpy_reclaim_mode) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
-
...Is Andres's issue not covered by the patch "vmscan: raise the bar to PAGEOUT_IO_SYNC stalls" because wait_on_page_writeback() was the I think the final paragraph makes a lot of sense. If a lumpy reclaimer is going to get stalled on wait_on_page_writeback(), it should be a sufficient throttling mechanism. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Well, "vmscan: raise the bar to PAGEOUT_IO_SYNC stalls" is completely bandaid and much IO under slow USB flash memory device still cause such problem even if the patch is applied. But removing wait_on_page_writeback() doesn't solve the issue perfectly because current lumpy reclaim have multiple sick. again, I'm writing explaining mail..... --
As for this patch, raising the bar to PAGEOUT_IO_SYNC reduces both calls to congestion_wait() and wait_on_page_writeback(). So it Let's submit the two known working fixes first? Thanks, Fengguang --
To base the discussion on more subjective numbers, I run the attached
debug patch on my desktop. I have 4GB memory without swap. Two
processes are started:
usemem 3g --sleep 300
cp /dev/zero /tmp
I didn't get any stall reports with DEF_PRIORITY/3, so I lowered it to
DEF_PRIORITY-2 and get the below reports. The test is a good testimony
for "vmscan: raise the bar to PAGEOUT_IO_SYNC stalls": it's working as
expected and does avoid the stalls. When the priority goes down to 4,
your patch to remove congestion_wait() will come into play. So they
are both good patches to have.
[Sorry I cannot afford to add more stresses to test your patch.
My remote test box was stressed to death yesterday and asks for
some physical reset tomorrow. My desktop has already corrupted the
.zsh_history file twice and can't risk losing more data..]
Thanks,
Fengguang
[ 1113.740511] reclaim stall: priority 9
[ 1113.844288] reclaim stall: priority 9
[ 1113.944292] reclaim stall: priority 9
[ 1114.048274] reclaim stall: priority 9
[ 1114.252816] reclaim stall: priority 9
[ 1114.352293] reclaim stall: priority 9
[ 1114.452265] reclaim stall: priority 9
[ 1114.552258] reclaim stall: priority 9
[ 1114.652259] reclaim stall: priority 9
[ 1114.752259] reclaim stall: priority 9
[ 1114.852252] reclaim stall: priority 9
[ 1114.952254] reclaim stall: priority 9
[ 1115.052251] reclaim stall: priority 9
[ 1115.152254] reclaim stall: priority 9
[ 1115.252242] reclaim stall: priority 9
[ 1115.452251] reclaim stall: priority 8
[ 1115.552250] reclaim stall: priority 8
[ 1116.403126] reclaim stall: priority 9
[ 1116.440364] reclaim stall: priority 9
[ 1116.478632] reclaim stall: priority 9
[ 1116.500230] reclaim stall: priority 9
[ 1116.540243] reclaim stall: priority 9
[ 1116.576266] reclaim stall: priority 9
[ 1116.600223] reclaim stall: priority 9
[ 1116.640221] reclaim stall: priority 9
[ 1116.676219] reclaim stall: priority 9
[ 1116.740233] reclaim stall: priority 9
[ ...Definitely, I can't oppose obvious test result (by another your mail) :-) OK, should go! Thanks. --
Great. Shall I go first? My changelog has more background :) Thanks, Fengguang --
Good point. Maybe more clear to say: "It takes 4096 shrink_page_list() Agreed. Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Thanks, Fengguang --
Although we remove it in pageout, shrink_page_list still has it. So it would result in long stall. And it is for lumpy reclaim which means it's a trade-off, I think. -- Kind regards, Minchan Kim --
The idea is that vmscan doesn't call ->writepage if the underlying queue is congested. may_write_to_queue()->bdi_queue_congested() should return false and we skip the write. If that logic is broken then that would explain a few things... --
we already have it in may_write_to_queue(). but kswapd and zone-reclaim have
PF_SWAPWRITE then ignore queue congestion. (btw, I believe zone-reclaim
shouldn't use PF_SWAPWRITE). so, kswapd get stuck in get_request_wait() frequently.
following commit explain why kswapd have to ignore queue congestion....
commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
Author: akpm <akpm>
Date: Sun Dec 22 01:07:33 2002 +0000
[PATCH] Give kswapd writeback higher priority than pdflush
The `low latency page reclaim' design works by preventing page
allocators from blocking on request queues (and by preventing them from
blocking against writeback of individual pages, but that is immaterial
here).
This has a problem under some situations. pdflush (or a write(2)
caller) could be saturating the queue with highmem pages. This
prevents anyone from writing back ZONE_NORMAL pages. We end up doing
enormous amounts of scenning.
And following commit made hard limit in io queue and changed vmscan writeout
behavior a lot if my understanding is correct.
commit 082cf69eb82681f4eacb3a5653834c7970714bef
Author: Jens Axboe <axboe@suse.de>
Date: Tue Jun 28 16:35:11 2005 +0200
[PATCH] ll_rw_blk: prevent huge request allocations
Currently we cap request allocations at q->nr_requests, but we allow a
batching io context to allocate up to 32 more (default setting). This
can flood the queue with request allocations, with only a few batching
processes. The real fix would be to limit the number of batchers, but
as that isn't currently tracked, I suggest we just cap the maximum
number of allocated requests to eg 50% over the limit.
This was observed in real life, users typically see this as vmstat bo
numbers going off the wall with seconds of no queueing afterwards.
Behaviour this bursty is not beneficial.
Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
diff ...Lumpy reclaim just made the system totally unusable with frequent order 9 allocations. I nuked it long ago and replaced it with mem compaction. You may try aa.git to test how thing goes without lumpy reclaim. I recently also started to use mem compaction for order 1/2/3 allocations as there's no point not to use it for them, and to call mem compaction from kswapd to satisfy order 2 GFP_ATOMIC in replacement of blind responsiveness-destroyer lumpy. Not sure why people insists on lumpy when we've memory compaction that won't alter the working set and it's more effective. --
Yes, it's very disruptive and has been for a while. It was not much of a problem when resizing the static hugepage pool but is a disaster for A full-scale replacement is overkill but I can see why it would be done in the short-term. There are times when lumpy reclaim is still needed - specifically when the allocation failure is due to a lack of memory rather than fragmentation. There will also be cases where compaction can't work Compaction is preferred, no doubt about it but lumpy reclaim cannot be dismissed. I know lumpy reclaim is too disruptive and Kosaki noticed the same and it's currently doing some pretty stupid things. There are a few ideas knocking around publicly on how to reduce its impact while increasing its effectiveness. I have a few old ideas knocking around as well that I just need the time to get around to. I hope to get at it after the fuss over writeback is addressed. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
It looks much, much worse on my system. Basically all inode structures, and also tons of frequently allocated xfs structures fall into this category, None of them actually anywhere near the size of a page, which makes me wonder why we do such high order allocations: slabinfo - version: 2.1 # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail> nfsd4_stateowners 0 0 424 19 2 : tunables 0 0 0 : slabdata 0 0 0 kvm_vcpu 0 0 10400 3 8 : tunables 0 0 0 : slabdata 0 0 0 kmalloc_dma-512 32 32 512 16 2 : tunables 0 0 0 : slabdata 2 2 0 mqueue_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0 xfs_inode 279008 279008 1024 16 4 : tunables 0 0 0 : slabdata 17438 17438 0 xfs_efi_item 44 44 360 22 2 : tunables 0 0 0 : slabdata 2 2 0 xfs_efd_item 44 44 368 22 2 : tunables 0 0 0 : slabdata 2 2 0 xfs_trans 40 40 800 20 4 : tunables 0 0 0 : slabdata 2 2 0 xfs_da_state 32 32 488 16 2 : tunables 0 0 0 : slabdata 2 2 0 nfs_inode_cache 0 0 1016 16 4 : tunables 0 0 0 : slabdata 0 0 0 isofs_inode_cache 0 0 632 25 4 : tunables 0 0 0 : slabdata 0 0 0 fat_inode_cache 0 0 664 12 2 : tunables 0 0 0 : slabdata 0 0 0 hugetlbfs_inode_cache 14 14 584 14 2 : tunables 0 0 0 : slabdata 1 1 0 ext4_inode_cache 0 0 968 16 4 : tunables 0 0 0 : slabdata 0 0 0 ext2_inode_cache 21 21 776 21 ...
Do you have CONFIG_SLUB enabled? It does high order allocations by --
Yes. This is a kernel using slub. --
You can pass "slub_debug=o" as a kernel parameter to disable higher order allocations if you want to test things. The per-cache default order is calculated in calculate_order() of mm/slub.c. How many CPUs do you have on your system? --
