When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem is complex. It has already been observed
on XFS that the stack overflows, but the problem is not XFS-specific.
This patch prevents direct reclaim writing back pages by not setting
may_writepage in scan_control. Instead, dirty pages are placed back on the
LRU lists for either background writing by the BDI threads or kswapd. If
in direct lumpy reclaim and dirty pages are encountered, the process will
kick the background flusher threads before trying again.
Memory control groups do not have a kswapd-like thread nor do pages get
direct reclaimed from the page allocator. Instead, memory control group
pages are reclaimed when the quota is being exceeded or the group is being
shrunk. As it is not expected that the entry points into page reclaim are
deep call chains, memcg is still allowed to write back dirty pages.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 76 ++++++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 57 insertions(+), 19 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4856a2a..574e816 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
 	return PAGE_SUCCESS;
 }
+/* kswapd and memcg can writeback as they are unlikely to overflow stack */
+static inline bool reclaim_can_writeback(struct scan_control *sc)
+{
+	return current_is_kswapd() || sc->mem_cgroup != NULL;
+}
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -701,6 +707,9 @@ static noinline_for_stack void clean_page_list(struct list_head *page_list,
list_splice(&ret_pages, page_list);
}
+/* ...
I'm not entirely convinced on this bit, but am willing to be convinced by the data. -- All rights reversed --
Which bit? You're not convinced that kswapd should be allowed to write back? You're not convinced that memcg should be allowed to write back? You're not convinced that direct reclaim writing back pages can overflow the stack? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
If direct reclaim can overflow the stack, so can direct memcg reclaim. That means this patch does not solve the stack overflow, while admitting that we do need the ability to get specific pages flushed to disk from the pageout code. -- All rights reversed --
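The predicate under dispute can be modelled outside the kernel to make the three reclaim contexts explicit. This is only a sketch: `struct scan_control_model` and its `is_kswapd` field are hypothetical stand-ins for `struct scan_control` and `current_is_kswapd()`, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct scan_control. */
struct scan_control_model {
	const void *mem_cgroup; /* non-NULL when reclaiming on behalf of a memcg */
	bool is_kswapd;         /* stands in for current_is_kswapd() */
};

/* Mirrors the patch's rule: only kswapd and memcg reclaim may write back. */
static bool model_reclaim_can_writeback(const struct scan_control_model *sc)
{
	assert(sc != NULL);
	return sc->is_kswapd || sc->mem_cgroup != NULL;
}
```

Under this rule, global direct reclaim (not kswapd, no memcg) is the only context barred from writeback, which is exactly why memcg reclaim from a deep stack remains the open question.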
Can you explain what the hell memcg reclaim is and why it needs to reclaim from random contexts? It seems everything that has a cg in its name that I stumbled over lately seems to be some ugly wart.. --
Kamezawa Hiroyuki has the full story here but here is a summary. memcg is the Memory Controller cgroup (Documentation/cgroups/memory.txt). It's intended for the control of the amount of memory usable by a group of processes but its behaviour in terms of reclaim differs from global reclaim. It has its own LRU lists and kswapd operates on them. What is surprising is that direct reclaim for a process in the control group also does not operate within the cgroup. Reclaim from a cgroup happens from the fault path. The new page is "charged" to the cgroup. If it exceeds its allocated resources, some pages within the group are reclaimed in a path that is similar to direct reclaim except for its entry point. So, memcg is not reclaiming from a random context, there is a limited number of cases where a memcg is reclaiming and it is not expected to The wart in this case is that the behaviour of page reclaim within a memcg differs a fair bit from global reclaim. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
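The charge-then-reclaim behaviour described above can be sketched as a small userspace model. Everything here is an assumption made for illustration: `struct memcg_model`, `memcg_model_charge()` and `memcg_model_reclaim()` are hypothetical names, and the real accounting in mm/memcontrol.c is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memory cgroup; not the kernel's struct mem_cgroup. */
struct memcg_model {
	size_t limit;         /* maximum pages the group may hold */
	size_t usage;         /* pages currently charged to the group */
	size_t reclaimable;   /* charged pages that reclaim is able to free */
	size_t reclaim_calls; /* how often the charge path had to reclaim */
};

/* Free up to nr pages from the group's own pages; returns pages freed. */
static size_t memcg_model_reclaim(struct memcg_model *cg, size_t nr)
{
	size_t freed = nr < cg->reclaimable ? nr : cg->reclaimable;

	cg->reclaimable -= freed;
	cg->usage -= freed;
	cg->reclaim_calls++;
	return freed;
}

/*
 * Charge one new page to the group. If the group is at its limit, reclaim
 * from the group itself first. The new page is assumed to be reclaimable.
 */
static bool memcg_model_charge(struct memcg_model *cg)
{
	assert(cg->reclaimable <= cg->usage);

	if (cg->usage >= cg->limit)
		memcg_model_reclaim(cg, cg->usage - cg->limit + 1);
	if (cg->usage >= cg->limit)
		return false; /* reclaim could not make room */

	cg->usage++;
	cg->reclaimable++;
	return true;
}
```

The point of the model is the entry point: reclaim is triggered by the charge itself, not by the page allocator running short of free pages.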
On Tue, 15 Jun 2010 14:54:08 +0100
No, we don't use kswapd. But we have some hooks in kswapd for implementing
soft-limit. Soft-limit is for giving a hint for kswapd "please reclaim memory
from this memcg" when global memory exhausts and kswapd runs.
What a memcg uses when it hits its limit is just direct reclaim.
(*) Justifying the use of CPU by a kswapd just because a memcg hits its limit
is difficult for me. So, I haven't used kswapd until now.
When direct-reclaim is used, cost-of-reclaim will be charged against
Sorry, but there is a very long story behind the current implementation.
But don't worry, if memcg is not activated (not mounted), it doesn't affect
the behavior of processes ;)
may need some diet :(
Thanks,
-Kame
--
The page fault code will call the cgroup accounting code. When a cgroup goes over its memory limit, __mem_cgroup_try_charge will call mem_cgroup_hierarchical_reclaim, which will then go No argument there. It took me a few minutes to find the code path above :) -- All rights reversed --
What path is taken with memcg != NULL that could overflow the stack? I couldn't spot one but mm/memcontrol.c is a bit tangled so finding all its use cases is tricky. The critical path I had in mind though was direct reclaim and for that path, memcg == NULL or did I miss something? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
mem_cgroup_hierarchical_reclaim -> try_to_free_mem_cgroup_pages -- All rights reversed --
But in turn, where is mem_cgroup_hierarchical_reclaim called from direct reclaim? It appears to be only called from the fault path or as a result of the memcg changing size. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
On Tue, 15 Jun 2010 15:16:01 +0100 yes. It's only called from - page fault - add_to_page_cache() I think we'll see no stack problem. Now, memcg doesn't wake up kswapd for reclaiming memory, so it needs direct writeback. Thanks, -Kame --
Of course, a memcg page fault could still be triggered from copy_to_user or copy_from_user, with a fairly arbitrary stack frame above... -- All rights reversed --
On Tue, 15 Jun 2010 20:29:49 -0400 Hmm. But I don't expect copy_from/to_user to be called in a very deep stack. Should I prepare a thread for reclaiming memcg pages? Because we shouldn't limit kswapd's cpu time by CFS cgroup, waking up kswapd just because "a memcg hit its limit" isn't fun. Hmm, or do you recommend no-dirty-page-writeback when a memcg hits its limit? Maybe we'll see a lot of swapping. I want to go with this for a while; changing memcg's behavior will take some time, as there are only a few developers. Thanks, -Kame --
On Wed, 16 Jun 2010 10:40:36 +0900
BTW, copy_from_user/copy_to_user is a _real_ problem. I'm afraid of the following much more than memcg:

handle_mm_fault()
-> handle_pte_fault()
-> do_wp_page()
-> balance_dirty_pages_ratelimited()
-> balance_dirty_pages()
-> writeback_inodes_wbc()
-> writeback_inodes_wb()
-> writeback_sb_inodes()
-> writeback_single_inode()
-> do_writepages()
-> generic_writepages()
-> write_cache_pages() // uses an on-stack pagevec
-> writepage()

This may consume much more stack than memcg->writeback after the vmscan.c diet. Bye. -Kame --
Yes, this is a massive issue. Strangely enough I just wondered about this callstack as balance_dirty_pages is the only place calling into the per-bdi/sb writeback code directly instead of offloading it to the flusher threads. It's something that should be fixed rather quickly IMHO. write_cache_pages and other bits of this writeback code can use quite large amounts of stack. --
I've had the same thought as well, bdp() should just signal a writeback instead. Much cleaner than doing cleaning from that point. -- Jens Axboe --
Actually it is. The poll code mentioned earlier in this thread is just one nasty example. I'm pretty sure there are tons of others in ioctl code, as various ioctl implementations have been found to be massive stack hogs in the past, even worse for out of tree drivers. --
The page fault code should be fine, but add_to_page_cache can be called with quite deep stacks. Two examples are grab_cache_page_write_begin which already was part of one of the stack overflows mentioned in this thread, or find_or_create_page which can be called via _xfs_buf_lookup_pages, which can be called from under the whole XFS allocator, or via grow_dev_page which might have a similarly deep stack for users of the normal buffer cache. Although for the find_or_create_page we usually should not have __GFP_FS set in the gfp_mask. --
On Wed, 16 Jun 2010 01:06:40 -0400 Hmm. ok, then, memory cgroup needs some care. BTW, why does xbf_buf_create() use GFP_KERNEL even if it can be blocked? Memory cgroup just limits pages for users; it doesn't intend to limit kernel pages. If this buffer is not for the user (visible page cache) but for an internal structure, I'll have to add code for ignoring the memory cgroup check when gfp_mask doesn't have GFP_MOVABLE. Thanks, -Kame --
You mean xfs_buf_allocate? It doesn't in the end. It goes through the xfs_kmem helper which clears __GFP_FS if we're currently inside a filesystem transaction (PF_FSTRANS is set) or a caller specifically requested it to be disabled even without that by passing the XBF_DONT_BLOCK flag. --
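The flag handling described above can be illustrated with a minimal userspace sketch. The bit values, `MODEL_DONT_BLOCK`, and `model_buf_gfp()` are all hypothetical; they stand in for the kernel's gfp bits, XBF_DONT_BLOCK, and the XFS helper, whose actual code is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; the kernel's real values differ. */
#define MODEL_GFP_WAIT   0x1u
#define MODEL_GFP_IO     0x2u
#define MODEL_GFP_FS     0x4u
#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)

/* Stands in for XBF_DONT_BLOCK. */
#define MODEL_DONT_BLOCK 0x1u

/*
 * Start from a GFP_KERNEL-like mask and drop the FS bit when the caller is
 * inside a filesystem transaction or explicitly asked not to block.
 */
static unsigned int model_buf_gfp(unsigned int buf_flags, bool in_fs_transaction)
{
	unsigned int gfp = MODEL_GFP_KERNEL;

	assert((buf_flags & ~MODEL_DONT_BLOCK) == 0); /* only known flags */
	if (in_fs_transaction || (buf_flags & MODEL_DONT_BLOCK))
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Reclaim may only call back into the filesystem when the FS bit is set. */
static bool model_may_enter_fs(unsigned int gfp)
{
	return (gfp & MODEL_GFP_FS) != 0;
}
```

With the FS bit cleared, reclaim triggered by such an allocation would not re-enter filesystem writeback, which is why these callers are less of a stack concern than their call depth suggests.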
On Thu, 17 Jun 2010 02:16:47 -0400
Ah, sorry. My question was wrong.
If xfs_buf_allocate() is not for pages on LRU but for kernel memory,
memory cgroup has no reason to charge against it because we can't reclaim
memory which is not on LRU.
Then, I wonder whether I may have to add the following check

	if (!(gfp_mask & __GFP_RECLAIMABLE)) {
		/* ignore this. we just charge against reclaimable memory on LRU. */
		return 0;
	}

to mem_cgroup_charge_cache() which is a hook for accounting page-cache.
Thanks,
-Kame
--
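The check proposed in the last message can be sketched as a self-contained model. This is an illustration under stated assumptions, not the memcg code: `struct charge_model`, `model_charge_cache()`, and the value of `MODEL_GFP_RECLAIMABLE` are hypothetical stand-ins for the cgroup, the mem_cgroup_charge_cache() hook, and __GFP_RECLAIMABLE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit value standing in for __GFP_RECLAIMABLE. */
#define MODEL_GFP_RECLAIMABLE 0x8u

/* Hypothetical per-group accounting state. */
struct charge_model {
	size_t charged; /* pages charged so far */
	size_t limit;   /* maximum pages the group may be charged */
};

/*
 * Returns 0 when the page was charged or deliberately ignored,
 * and -1 when charging would exceed the group's limit.
 */
static int model_charge_cache(struct charge_model *cg, unsigned int gfp_mask)
{
	assert(cg != NULL);

	/* Memory that cannot be reclaimed from the LRU is not accounted. */
	if (!(gfp_mask & MODEL_GFP_RECLAIMABLE))
		return 0;

	if (cg->charged >= cg->limit)
		return -1;

	cg->charged++;
	return 0;
}
```

In this model an allocation without the reclaimable bit never fails the charge and never counts against the group, so it can never be the trigger for memcg reclaim.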
