Re: [PATCH v3 3/3] Prevent activation of page in madvise_dontneed

Previous thread: [PATCH] HPET: Fix HPET readout for small deltas by Borislav Petkov on Monday, November 29, 2010 - 8:09 am. (2 messages)

Next thread: Webmail Biztonsági figyelmeztetés by irodrigu on Monday, November 29, 2010 - 8:19 am. (1 message)
From: Minchan Kim
Date: Monday, November 29, 2010 - 8:23 am

Recently there is a report about working set page eviction due to rsync
workload. application programmers want to use fadvise but it's not easy.
You could see detailed description on [1/3].

 - [1/3] is to move invalidated page which is dirty/writeback on active list
   into inactive list's head.
 - [2/3] is for moving invalidated page into inactive list's tail when the
   page's writeout is completed.
 - [3/3] is to not calling mark_page_accessed in case of madvise(DONTNEED).

Minchan Kim (3):
  deactivate invalidated pages
  Reclaim invalidated page ASAP
  Prevent activation of page in madvise_dontneed

 include/linux/mm.h   |    4 +-
 include/linux/swap.h |    1 +
 mm/madvise.c         |    4 +-
 mm/memory.c          |   38 +++++++++++-------
 mm/mmap.c            |    4 +-
 mm/page-writeback.c  |   12 +++++-
 mm/swap.c            |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/truncate.c        |   16 ++++++--
 8 files changed, 155 insertions(+), 26 deletions(-)

--

From: Minchan Kim
Date: Monday, November 29, 2010 - 8:23 am

Recently, there are reported problem about thrashing.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It happens by backup workloads(ex, nightly rsync).
That's because the workload makes just use-once pages
and touches pages twice. It promotes the page into
active list so that it results in working set page eviction.

Some app developer want to support POSIX_FADV_NOREUSE.
But other OSes don't support it, either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

By other approach, app developers use POSIX_FADV_DONTNEED.
But it has a problem. If kernel meets page is writing
during invalidate_mapping_pages, it can't work.
It is very hard for application programmer to use it.
Because they always have to sync data before calling
fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could
be discardable. At last, they can't use deferred write of kernel
so that they could see performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)

In fact, invalidation is very big hint to reclaimer.
It means we don't use the page any more. So let's move
the writing page into inactive list's head.

Why I need the page to head, Dirty/Writeback page would be flushed
sooner or later. It can prevent writeout of pageout which is less
effective than flusher's writeout.

Originally, I reused lru_demote of Peter with some change so added
his Signed-off-by.

Reported-by: Ben Gamari <bgamari.foss@gmail.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>

Adnrew. Before applying this series, please drop below two patches.
 mm-deactivate-invalidated-pages.patch
 mm-deactivate-invalidated-pages-fix.patch

Changelog since v2:
 - mapped page leaves alone - suggested by Mel
 - pass part ...
From: KOSAKI Motohiro
Date: Monday, November 29, 2010 - 6:01 pm

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



--

From: Johannes Weiner
Date: Monday, November 29, 2010 - 10:22 pm

The wording is a bit confusing, I find.  It sounds a bit like the
flusher's chance is increased by moving it to the inactive list in the
first place, but the key is that it is moved to the head instead of,
what one would expect, the tail of the list.  How about:

	If the page can not be invalidated, it is moved to the
	inactive list to speed up its reclaim.  It is moved to the
	head of the list, rather than the tail, to give the flusher
	threads some time to write it out, as this is much more


Why is that?  Unless I missed something, the only thing that needs
protection is the per-cpu pagevec reference the only user of this
function passes in.  But this should be the caller's concern and is


How about:

/**
 * lru_deactivate_page - forcefully deactivate a page
 * @page: page to deactivate
 *
 * This function hints the VM that @page is a good reclaim candidate,
 * for example if its invalidation fails due to the page being dirty
 * or under writeback.

I would love that lru_ prefix for most of the API in this file.  In
fact, the file should probably be called lru.c.  But for now, can you


This would also be less confusing if it would say

	Move it to the head of the inactive LRU (rather than the tail)
	for using [...]

But I am not sure that this detail is interesting at this point.  It
would be more interesting to name the reasons for why the page is
moved to the inactive list in the first place:

	If the page is dirty or under writeback, we can not invalidate
	it now.  But we assume that attempted invalidation is a hint
	that the page is no longer of interest and try to speed up its

	Hannes
--

From: Minchan Kim
Date: Monday, November 29, 2010 - 11:18 pm

Hi Hannes,


Looks good to me.

Good point.

__lru_deactivate is self-contained.
It is valuable enough using other places.

Yes. It's unnecessary.
I didn't consider enoughly.


No matter. I can change it. but deactivate_page will be asymmetric
about that deactivate_page move active page into inactive
forcefully(two step) while activate_page does one step activation.

Will fix.
I hope listen you guys's opinions about [2/3], too. :)




-- 
Kind regards,
Minchan Kim
--

From: Minchan Kim
Date: Monday, November 29, 2010 - 8:23 am

invalidate_mapping_pages is very big hint to reclaimer.
It means user doesn't want to use the page any more.
So in order to prevent working set page eviction, this patch
move the page into tail of inactive list by PG_reclaim.

Please, remember that pages in inactive list are working set
as well as active list. If we don't move pages into inactive list's
tail, pages near by tail of inactive list can be evicted although
we have a big clue about useless pages. It's totally bad.

Now PG_readahead/PG_reclaim is shared.
fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
preventing fast reclaiming readahead marker page.

In this series, PG_reclaim is used by invalidated page, too.
If VM find the page is invalidated and it's dirty, it sets PG_reclaim
to reclaim asap. Then, when the dirty page will be writeback,
clear_page_dirty_for_io will clear PG_reclaim unconditionally.
It disturbs this serie's goal.

I think it's okay to clear PG_readahead when the page is dirty, not
writeback time. So this patch moves ClearPageReadahead.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>

Changelog since v2:
 - put ClearPageReclaim in set_page_dirty - suggested by Wu.
 
Changelog since v1:
 - make the invalidated page reclaim asap - suggested by Andrew.
---
 mm/page-writeback.c |   12 +++++++++++-
 mm/swap.c           |   48 +++++++++++++++++++++++++++++++++++-------------
 2 files changed, 46 insertions(+), 14 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fc93802..88587a5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
+	/*
+	 * readahead/lru_deactivate_page could ...
From: Mel Gorman
Date: Monday, November 29, 2010 - 9:57 am

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Monday, November 29, 2010 - 3:41 pm

My fault. 
Resend.

Subject: [PATCH v3 2/3] Reclaim invalidated page ASAP

invalidate_mapping_pages is very big hint to reclaimer.
It means user doesn't want to use the page any more.
So in order to prevent working set page eviction, this patch
move the page into tail of inactive list by PG_reclaim.

Please, remember that pages in inactive list are working set
as well as active list. If we don't move pages into inactive list's
tail, pages near by tail of inactive list can be evicted although
we have a big clue about useless pages. It's totally bad.

Now PG_readahead/PG_reclaim is shared.
fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
preventing fast reclaiming readahead marker page.

In this series, PG_reclaim is used by invalidated page, too.
If VM find the page is invalidated and it's dirty, it sets PG_reclaim
to reclaim asap. Then, when the dirty page will be writeback,
clear_page_dirty_for_io will clear PG_reclaim unconditionally.
It disturbs this serie's goal.

I think it's okay to clear PG_readahead when the page is dirty, not
writeback time. So this patch moves ClearPageReadahead.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>

Changelog since v2:
 - put ClearPageReclaim in set_page_dirty - suggested by Wu.

Changelog since v1:
 - make the invalidated page reclaim asap - suggested by Andrew.
---
 mm/page-writeback.c |   12 +++++++++++-
 mm/swap.c           |   49 ++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fc93802..88587a5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
 {
 	struct address_space ...
From: Mel Gorman
Date: Tuesday, November 30, 2010 - 2:16 am

I should have said this the last time but if you do another revision,
make active a "bool". There is a very slow migration of int to bool in

Whether you update active's type or not;

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Tuesday, November 30, 2010 - 7:04 am

-- 
Kind regards,
Minchan Kim
--

From: Johannes Weiner
Date: Tuesday, November 30, 2010 - 4:20 am

If we lose the race with writeback, the completion handler won't see
PG_reclaim, won't move the page, and we have an unwanted clean cache
page on the active list.  Given the pagevec caching of those pages it
could be rather likely that IO completes before the above executes.

Shouldn't this be

	if (PageWriteback() || PageDirty()) {
		SetPageReclaim()
		move_to_inactive_head()
	} else {
		move_to_inactive_tail()
	}

instead?

	Hannes
--

From: Minchan Kim
Date: Tuesday, November 30, 2010 - 7:03 am

Fair enough.


-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Monday, November 29, 2010 - 6:10 pm

I still dislike this one. I doubt this trick makes much benefit in real
world workload.



--

From: Mel Gorman
Date: Tuesday, November 30, 2010 - 2:18 am

I would agree except as said elsewhere, it's a chicken and egg problem.
We don't have a real world test because fadvise is not useful in its
current iteration. I'm hoping that there will be a test comparing

rsync		on vanilla kernel
rsync		on patched kernel
rsync+patch	on vanilla kernel
rsync+patch	on patched kernel

Are the results of such a test likely to happen?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Tuesday, November 30, 2010 - 7:01 am

Ben, Could you get the rsync execution time(user/sys) and 
'cat /proc/vmstat' result before/after?
If Ben is busy, I will try to get a data. but I need enough time.

I expect rsync+patch on patched kernel should have a less allocstall, less pgscan 

-- 
Kind regards,
Minchan Kim
--

From: Ben Gamari
Date: Tuesday, November 30, 2010 - 7:11 am

Yes, absolutely, although I'm sorry it has taken so long. Between
thanksgiving and the impending end of the semester things have been a
bit hectic. Nevertheless, I just finished putting together a script to
record some metics from /proc/vmstat and /proc/[pid]/statm, so at this
point I'm ready to finally take some data. Any suggestions for
particular patterns to look for in the numbers or other metrics to
record are welcome. Cheers,

- Ben
--

From: Minchan Kim
Date: Monday, November 29, 2010 - 8:23 am

Now zap_pte_range alwayas activates pages which are pte_young &&
!VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED,
it's unnecessary since the page wouldn't use any more.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Wu Fengguang <fengguang.wu@intel.com>

Changelog since v2:
 - remove unnecessary description
Changelog since v1: 
 - change word from promote to activate
 - add activate argument to zap_pte_range and family function

---
 include/linux/mm.h |    4 ++--
 mm/madvise.c       |    4 ++--
 mm/memory.c        |   38 +++++++++++++++++++++++---------------
 mm/mmap.c          |    4 ++--
 4 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e097df6..6032881 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -779,11 +779,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
-		unsigned long size, struct zap_details *);
+		unsigned long size, struct zap_details *, bool activate);
 unsigned long unmap_vmas(struct mmu_gather **tlb,
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
-		struct zap_details *);
+		struct zap_details *, bool activate);
 
 /**
  * mm_walk - callbacks for walk_page_range
diff --git a/mm/madvise.c b/mm/madvise.c
index 319528b..8bc4b2d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -171,9 +171,9 @@ static long madvise_dontneed(struct vm_area_struct * vma,
 			.nonlinear_vma = vma,
 			.last_index = ULONG_MAX,
 		};
-		zap_page_range(vma, start, end - start, ...
From: KOSAKI Motohiro
Date: Monday, November 29, 2010 - 6:08 pm

Looks good to me.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>



--

From: Johannes Weiner
Date: Tuesday, November 30, 2010 - 4:35 am

I would prefer naming the parameter 'ignore_references' or something
similar, so that it reflects the immediate effect on the zappers'
behaviour, not what mark_page_accessed() might end up doing.

Other than that, the patch looks good to me.

	Hannes
--

From: Minchan Kim
Date: Tuesday, November 30, 2010 - 5:50 pm

Fair enough.
Will fix.
Maybe it would take a long time until sending next version.




-- 
Kind regards,
Minchan Kim
--

From: Hugh Dickins
Date: Tuesday, November 30, 2010 - 11:34 am

Everyone else seems pretty happy with this, and I've not checked
at all whether it achieves your purpose; but personally I'd much
prefer a smaller patch which adds your "activate" or "ignore_references"
flag to struct zap_details, instead of passing this exceptional arg
down lots of levels.  That's precisely the purpose of zap_details,
to gather together a few things that aren't needed in the common case
(though I admit the NULL details defaulting may be ugly).

Hugh
--

From: Minchan Kim
Date: Tuesday, November 30, 2010 - 5:49 pm

Hi Hugh,


Before I sent RFC, I tried it and suffered from NULL detail as you said.
But it's valuable to look on it, again.
Since other guys don't opposed this patch's goal, I will have a time
for unifying it into zap_details.




-- 
Kind regards,
Minchan Kim
--

Previous thread: [PATCH] HPET: Fix HPET readout for small deltas by Borislav Petkov on Monday, November 29, 2010 - 8:09 am. (2 messages)

Next thread: Webmail Biztonsági figyelmeztetés by irodrigu on Monday, November 29, 2010 - 8:19 am. (1 message)