Fix "system goes unresponsive under memory pressure and lots of dirty/writeback pages" bug. http://lkml.org/lkml/2010/4/4/86 In the above thread, Andreas Mohr described that Invoking any command locked up for minutes (note that I'm talking about attempted additional I/O to the _other_, _unaffected_ main system HDD - such as loading some shell binaries -, NOT the external SSD18M!!). This happens when the two conditions are both meet: - under memory pressure - writing heavily to a slow device OOM also happens in Andreas' system. The OOM trace shows that 3 processes are stuck in wait_on_page_writeback() in the direct reclaim path. One in do_fork() and the other two in unix_stream_sendmsg(). They are blocked on this condition: (sc->order && priority < DEF_PRIORITY - 2) which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim for the order-1 fork() allocation runs into a range of 512KB hard-to-reclaim LRU pages, it will be stalled. It's a severe problem in three ways. Firstly, it can easily happen in daily desktop usage. vmscan priority can easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if the system has 50% globally reclaimable pages, it still has good opportunity to have 0.1% sized hard-to-reclaim ranges. For example, a simple dd can easily create a big range (up to 20%) of dirty pages in the LRU lists. And order-1 to order-3 allocations are more than common with SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of high order slab caches. For example, the order-1 radix_tree_node slab cache may stall applications at swap-in time; the order-3 inode cache on most filesystems may stall applications when trying to read some file; the order-2 proc_inode_cache may stall applications when trying to open a /proc file. Secondly, once triggered, it will stall unrelated ...
rebased onto Wu's patch
----------------------------------------------
From 35772ad03e202c1c9a2252de3a9d3715e30d180f Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Sun, 1 Aug 2010 17:23:41 +0900
Subject: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()
congestion_wait() mean "waiting for number of requests in IO queue is
under congestion threshold".
That said, if the system have plenty dirty pages, flusher thread push
new request to IO queue conteniously. So, IO queue are not cleared
congestion status for a long time. thus, congestion_wait(HZ/10) is
almostly equivalent schedule_timeout(HZ/10).
If the system 512MB memory, DEF_PRIORITY mean 128kB scan and It takes 4096
shrink_page_list() calls to scan 128kB (i.e. 128kB/32=4096) memory.
4096 times 0.1sec stall makes crazy insane long stall. That shouldn't.
In the other hand, this synchronous lumpy reclaim donesn't need this
congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) cause to
call wait_on_page_writeback() and it provide sufficient waiting.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/vmscan.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 972c8f0..c5e673e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1339,8 +1339,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
-
/*
* The attempt at page out may have made some
* of the pages active, mark them inactive again.
--
1.6.5.2
--
page size? DEF_PRIORITY is 12. 512MB >> DEF_PRIORITY = 512MB / 4096 = 128kB 128kB scan mean 4096 times shrink_list(). because one shrink_list() scan Umm.. I haven't catch this mention. --
Hi KOSAKI, Just a nitpick. Why is it a problem? HZ/10 is upper bound we intended. If is is rahter high, we can low it. But totally I agree on this patch. It would be better to remove it Reviewed-by: Minchan Kim <minchan.kim@gmail.com> -- Kind regards, Minchan Kim --
because all of _unnecessary_ sleep is evil. the problem is, congestion_wait() mean "wait until queue congestion will be cleared, iow, wait all of IO". but we want to wait until _my_ IO finished. So, if flusher thread conteniously push new IO into the queue, that makes big difference. Thanks. --
On Mon, Aug 2, 2010 at 1:13 PM, KOSAKI Motohiro Agree. Please include this explanation in description to make it kind if you resent this patch. Thanks -- Kind regards, Minchan Kim --
