Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

Previous thread: [PATCH 1/1] khubd -- switch USB product/manufacturer/serial handling to RCU by Andy Whitcroft on Monday, June 14, 2010 - 4:15 am. (3 messages)

Next thread: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2 by Mel Gorman on Monday, June 14, 2010 - 4:17 am. (73 messages)
From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem are complex. It has already been observed
on XFS that the stack overflows but the problem is not XFS-specific.

This patch prevents direct reclaim writing back pages by not setting
may_writepage in scan_control. Instead, dirty pages are placed back on the
LRU lists for either background writing by the BDI threads or kswapd. If
in direct lumpy reclaim and dirty pages are encountered, the process will
kick the background flusher threads before trying again.

Memory control groups do not have a kswapd-like thread nor do pages get
direct reclaimed from the page allocator. Instead, memory control group
pages are reclaimed when the quota is being exceeded or the group is being
shrunk. As it is not expected that the entry points into page reclaim are
deep call chains, memcg is still allowed to writeback dirty pages.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   76 ++++++++++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 57 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4856a2a..574e816 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
 	return PAGE_SUCCESS;
 }
 
+/* kswapd and memcg can writeback as they are unlikely to overflow stack */
+static inline bool reclaim_can_writeback(struct scan_control *sc)
+{
+	return current_is_kswapd() || sc->mem_cgroup != NULL;
+}
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -701,6 +707,9 @@ static noinline_for_stack void clean_page_list(struct list_head *page_list,
 	list_splice(&ret_pages, page_list);
 }
 
+/* ...
From: Rik van Riel
Date: Monday, June 14, 2010 - 2:55 pm

I'm not entirely convinced on this bit, but am willing to
be convinced by the data.

-- 
All rights reversed
--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 4:45 am

Which bit?

You're not convinced that kswapd should be allowed to write back?
You're not convinced that memcg should be allowed to write back?
You're not convinced that direct reclaim writing back pages can overflow
	the stack?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Rik van Riel
Date: Tuesday, June 15, 2010 - 6:34 am

If direct reclaim can overflow the stack, so can direct
memcg reclaim.  That means this patch does not solve the
stack overflow, while admitting that we do need the
ability to get specific pages flushed to disk from the
pageout code.

-- 
All rights reversed
--

From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 6:37 am

Can you explain what the hell memcg reclaim is and why it needs
to reclaim from random contexts?
It seems everything that has a cg in its name that I stumbled over
lately is some ugly wart..

--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 6:54 am

Kamezawa Hiroyuki has the full story here but here is a summary.

memcg is the Memory Controller cgroup
(Documentation/cgroups/memory.txt). It's intended for the control of the
amount of memory usable by a group of processes but its behaviour in
terms of reclaim differs from global reclaim. It has its own LRU lists
and kswapd operates on them. What is surprising is that direct reclaim
for a process in the control group also does not operate within the
cgroup.

Reclaim from a cgroup happens from the fault path. The new page is
"charged" to the cgroup. If it exceeds its allocated resources, some
pages within the group are reclaimed in a path that is similar to direct
reclaim except for its entry point.
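
As a rough illustration of the charge path described above, here is a minimal
userspace sketch. The struct and function names are hypothetical and are not
the kernel's memcg API; the model only shows the idea that a charge which
would exceed the group's limit first tries to reclaim from the same group.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a memory cgroup; not struct mem_cgroup. */
struct memcg_model {
	size_t usage;       /* pages currently charged to the group */
	size_t limit;       /* maximum pages the group may hold */
	size_t reclaimable; /* charged pages that reclaim could free */
};

/* Try to free up to nr pages from the group; returns the number freed. */
static size_t memcg_model_reclaim(struct memcg_model *cg, size_t nr)
{
	size_t freed = nr < cg->reclaimable ? nr : cg->reclaimable;

	cg->reclaimable -= freed;
	cg->usage -= freed;
	return freed;
}

/*
 * Charge one new page to the group. If the group is at its limit, reclaim
 * from the group itself first. Returns true if the page was charged.
 * In this simplified model the new page is not counted as reclaimable.
 */
static bool memcg_model_charge(struct memcg_model *cg)
{
	if (cg->usage >= cg->limit) {
		size_t need = cg->usage - cg->limit + 1;

		if (memcg_model_reclaim(cg, need) < need)
			return false;
	}
	cg->usage++;
	return true;
}
```

In this model the reclaim runs in the context of whoever triggered the
charge, which is why the depth of the caller's stack matters.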

So, memcg is not reclaiming from a random context, there is a limited
number of cases where a memcg is reclaiming and it is not expected to

The wart in this case is that the behaviour of page reclaim within a
memcg and globally differ a fair bit.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, June 15, 2010 - 5:30 pm

On Tue, 15 Jun 2010 14:54:08 +0100

No, we don't use kswapd. But we have some hooks in kswapd for implementing
soft-limit. Soft-limit gives kswapd a hint, "please reclaim memory
from this memcg", when global memory is exhausted and kswapd runs.

What a memcg uses when it hits its limit is just direct reclaim.
(*) Justifying the use of CPU by kswapd because a memcg hits its limit is
    difficult for me. So, I haven't used kswapd until now.
    When direct-reclaim is used, cost-of-reclaim will be charged against


Sorry. But there has been a very long story to reach the current implementation.
But don't worry, if memcg is not activated (not mounted), it doesn't affect
the behavior of processes ;)


may need some diet :(


Thanks,
-Kame


--

From: Rik van Riel
Date: Tuesday, June 15, 2010 - 7:02 am

The page fault code will call the cgroup accounting code.

When a cgroup goes over its memory limit, __mem_cgroup_try_charge
will call mem_cgroup_hierarchical_reclaim, which will then go

No argument there.  It took me a few minutes to find the code
path above :)

-- 
All rights reversed
--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 6:59 am

What path is taken with memcg != NULL that could overflow the stack? I
couldn't spot one but mm/memcontrol.c is a bit tangled so finding all
its use cases is tricky. The critical path I had in mind though was
direct reclaim and for that path, memcg == NULL or did I miss something?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Rik van Riel
Date: Tuesday, June 15, 2010 - 7:04 am

mem_cgroup_hierarchical_reclaim -> try_to_free_mem_cgroup_pages

-- 
All rights reversed
--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 7:16 am

But in turn, where is mem_cgroup_hierarchical_reclaim called from direct
reclaim? It appears to be only called from the fault path or as a result
of the memcg changing size.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, June 15, 2010 - 5:17 pm

On Tue, 15 Jun 2010 15:16:01 +0100
yes. It's only called from 
	- page fault
	- add_to_page_cache()

I think we'll see no stack problem. Since memcg doesn't wake up kswapd to
reclaim memory, it needs direct writeback.

Thanks,
-Kame

--

From: Rik van Riel
Date: Tuesday, June 15, 2010 - 5:29 pm

Of course, a memcg page fault could still be triggered
from copy_to_user or copy_from_user, with a fairly
arbitrary stack frame above...

-- 
All rights reversed
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, June 15, 2010 - 5:39 pm

On Tue, 15 Jun 2010 20:29:49 -0400

Hmm. But I don't expect copy_from/to_user to be called with a very deep stack.

Should I prepare a thread for reclaiming memcg pages?
Because we shouldn't limit kswapd's cpu time by CFS cgroup, waking up
kswapd just because "a memcg hit its limit" isn't fun.

Hmm, or do you recommend no dirty page writeback when a memcg hits its limit?
Maybe we'll see a lot of swapping.

I want to go with this for a while. Changing memcg's behavior will take
some amount of time, and there are only a few developers.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, June 15, 2010 - 7:20 pm

On Wed, 16 Jun 2010 10:40:36 +0900

BTW, copy_from_user/copy_to_user is a _real_ problem. I'm afraid of the
following much more than memcg.

handle_mm_fault()
-> handle_pte_fault()
-> do_wp_page()
-> balance_dirty_pages_ratelimited()
-> balance_dirty_pages()
-> writeback_inodes_wbc()
-> writeback_inodes_wb()
-> writeback_sb_inodes()
-> writeback_single_inode()
-> do_writepages()
-> generic_writepages()
-> write_cache_pages()   // use on-stack pagevec.
-> writepage()

This may consume much more stack than memcg->writeback after the vmscan.c diet.

Bye.
-Kame

--

From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 10:11 pm

Yes, this is a massive issue.  Strangely enough I just wondered about
this callstack as balance_dirty_pages is the only place calling into the
per-bdi/sb writeback code directly instead of offloading it to the
flusher threads.  It's something that should be fixed rather quickly
IMHO.  write_cache_pages and other bits of this writeback code can use
quite large amounts of stack.
--

From: Jens Axboe
Date: Wednesday, June 16, 2010 - 3:51 am

I've had the same thought as well, bdp() should just signal a writeback
instead. Much cleaner than doing cleaning from that point.
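
A minimal userspace sketch of that idea, under stated assumptions: the names
below are hypothetical and are not the kernel's writeback API. The dirtying
path only records a request and sets a wakeup flag; the actual writing is
done later by a separate flusher context that starts from a shallow stack.

```c
#include <stdbool.h>

/* Hypothetical model of a flusher thread's pending work; illustration only. */
struct flusher_model {
	bool wakeup_pending;       /* set by the dirtier, cleared by the flusher */
	unsigned long nr_to_write; /* pages the flusher has been asked to clean */
};

/* Called from the (possibly deep) dirtying path: queue work, do no I/O. */
static void request_writeback(struct flusher_model *f, unsigned long nr_pages)
{
	f->nr_to_write += nr_pages;
	f->wakeup_pending = true;
}

/* Called from the flusher's own context; returns the pages "written". */
static unsigned long flusher_run(struct flusher_model *f)
{
	unsigned long written = f->nr_to_write;

	f->nr_to_write = 0;
	f->wakeup_pending = false;
	return written;
}
```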

-- 
Jens Axboe

--

From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 10:07 pm

Actually it is.  The poll code mentioned earlier in this thread is just
one nasty example.  I'm pretty sure there are tons of others in ioctl
code, as various ioctl implementations have been found to be massive
stack hogs in the past, even worse for out of tree drivers.

--

From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 10:06 pm

The page fault code should be fine, but add_to_page_cache can be called
with quite deep stacks.  Two examples are grab_cache_page_write_begin
which already was part of one of the stack overflows mentioned in this
thread, or find_or_create_page which can be called via
_xfs_buf_lookup_pages, which can be called from under the whole XFS
allocator, or via grow_dev_page which might have a similarly deep
stack for users of the normal buffer cache.  Although for the
find_or_create_page we usually should not have __GFP_FS set in the
gfp_mask.

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 16, 2010 - 5:25 pm

On Wed, 16 Jun 2010 01:06:40 -0400

Hmm. ok, then, memory cgroup needs some care.

BTW, why does xbf_buf_create() use GFP_KERNEL even if it can be blocked?
The memory cgroup only limits pages for users; it doesn't intend to
limit kernel pages. If this buffer is not for users (visible page cache) but for
an internal structure, I'll have to add code to skip the memory cgroup check
when gfp_mask doesn't have GFP_MOVABLE.


Thanks,
-Kame

--

From: Christoph Hellwig
Date: Wednesday, June 16, 2010 - 11:16 pm

You mean xfs_buf_allocate?  It doesn't in the end.  It goes through the
xfs_kmem helper which clears __GFP_FS if we're currently inside a
filesystem transaction (PF_FSTRANS is set) or a caller specifically
requested it to be disabled even without that by passing the
XBF_DONT_BLOCK flag.
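
The masking rule described above can be sketched roughly as follows. This is
a userspace illustration only: the bit values, the SKETCH_ names and the
helper are hypothetical, not the XFS or kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical bit values for illustration; the kernel's values differ. */
#define SKETCH_GFP_WAIT   0x1u
#define SKETCH_GFP_IO     0x2u
#define SKETCH_GFP_FS     0x4u
#define SKETCH_GFP_KERNEL (SKETCH_GFP_WAIT | SKETCH_GFP_IO | SKETCH_GFP_FS)

/*
 * Start from a GFP_KERNEL-like mask and drop the FS bit when the caller is
 * inside a filesystem transaction or has explicitly asked not to re-enter
 * the filesystem (the XBF_DONT_BLOCK case in the text).
 */
static unsigned int sketch_alloc_mask(bool in_fs_transaction, bool dont_block)
{
	unsigned int mask = SKETCH_GFP_KERNEL;

	if (in_fs_transaction || dont_block)
		mask &= ~SKETCH_GFP_FS;
	return mask;
}
```

With the FS bit cleared, reclaim entered from such an allocation is not
supposed to call back into the filesystem to write pages.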

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 16, 2010 - 11:23 pm

On Thu, 17 Jun 2010 02:16:47 -0400
Ah, sorry. My question was wrong.

If xfs_buf_allocate() is not for pages on LRU but for kernel memory,
memory cgroup has no reason to charge against it because we can't reclaim
memory which is not on LRU.

Then, I wonder if I may have to add the following check

	if (!(gfp_mask & __GFP_RECLAIMABLE)) {
		/* ignore this. we just charge against reclaimable memory on LRU. */
		return 0;
	}

to mem_cgroup_charge_cache() which is a hook for accounting page-cache.


Thanks,
-Kame

--
