Recently there is a report about working set page eviction due to rsync workload. application programmers want to use fadvise but it's not easy. You could see detailed description on [1/3]. - [1/3] is to move invalidated page which is dirty/writeback on active list into inactive list's head. - [2/3] is for moving invalidated page into inactive list's tail when the page's writeout is completed. - [3/3] is to not calling mark_page_accessed in case of madvise(DONTNEED). Minchan Kim (3): deactivate invalidated pages Reclaim invalidated page ASAP Prevent activation of page in madvise_dontneed include/linux/mm.h | 4 +- include/linux/swap.h | 1 + mm/madvise.c | 4 +- mm/memory.c | 38 +++++++++++------- mm/mmap.c | 4 +- mm/page-writeback.c | 12 +++++- mm/swap.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++ mm/truncate.c | 16 ++++++-- 8 files changed, 155 insertions(+), 26 deletions(-) --
Recently, there are reported problem about thrashing. (http://marc.info/?l=rsync&m=128885034930933&w=2) It happens by backup workloads(ex, nightly rsync). That's because the workload makes just use-once pages and touches pages twice. It promotes the page into active list so that it results in working set page eviction. Some app developer want to support POSIX_FADV_NOREUSE. But other OSes don't support it, either. (http://marc.info/?l=linux-mm&m=128928979512086&w=2) By other approach, app developers use POSIX_FADV_DONTNEED. But it has a problem. If kernel meets page is writing during invalidate_mapping_pages, it can't work. It is very hard for application programmer to use it. Because they always have to sync data before calling fadivse(..POSIX_FADV_DONTNEED) to make sure the pages could be discardable. At last, they can't use deferred write of kernel so that they could see performance loss. (http://insights.oetiker.ch/linux/fadvise.html) In fact, invalidation is very big hint to reclaimer. It means we don't use the page any more. So let's move the writing page into inactive list's head. Why I need the page to head, Dirty/Writeback page would be flushed sooner or later. It can prevent writeout of pageout which is less effective than flusher's writeout. Originally, I reused lru_demote of Peter with some change so added his Signed-off-by. Reported-by: Ben Gamari <bgamari.foss@gmail.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Adnrew. Before applying this series, please drop below two patches. mm-deactivate-invalidated-pages.patch mm-deactivate-invalidated-pages-fix.patch Changelog since v2: - mapped page leaves alone - suggested by Mel - pass part ...
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --
The wording is a bit confusing, I find. It sounds a bit like the flusher's chance is increased by moving it to the inactive list in the first place, but the key is that it is moved to the head instead of, what one would expect, the tail of the list. How about: If the page can not be invalidated, it is moved to the inactive list to speed up its reclaim. It is moved to the head of the list, rather than the tail, to give the flusher threads some time to write it out, as this is much more Why is that? Unless I missed something, the only thing that needs protection is the per-cpu pagevec reference the only user of this function passes in. But this should be the caller's concern and is How about: /** * lru_deactivate_page - forcefully deactivate a page * @page: page to deactivate * * This function hints the VM that @page is a good reclaim candidate, * for example if its invalidation fails due to the page being dirty * or under writeback. I would love that lru_ prefix for most of the API in this file. In fact, the file should probably be called lru.c. But for now, can you This would also be less confusing if it would say Move it to the head of the inactive LRU (rather than the tail) for using [...] But I am not sure that this detail is interesting at this point. It would be more interesting to name the reasons for why the page is moved to the inactive list in the first place: If the page is dirty or under writeback, we can not invalidate it now. But we assume that attempted invalidation is a hint that the page is no longer of interest and try to speed up its Hannes --
Hi Hannes, Looks good to me. Good point. __lru_deactivate is self-contained. It is valuable enough using other places. Yes. It's unnecessary. I didn't consider enoughly. No matter. I can change it. but deactivate_page will be asymmetric about that deactivate_page move active page into inactive forcefully(two step) while activate_page does one step activation. Will fix. I hope listen you guys's opinions about [2/3], too. :) -- Kind regards, Minchan Kim --
invalidate_mapping_pages is very big hint to reclaimer.
It means user doesn't want to use the page any more.
So in order to prevent working set page eviction, this patch
move the page into tail of inactive list by PG_reclaim.
Please, remember that pages in inactive list are working set
as well as active list. If we don't move pages into inactive list's
tail, pages near by tail of inactive list can be evicted although
we have a big clue about useless pages. It's totally bad.
Now PG_readahead/PG_reclaim is shared.
fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
preventing fast reclaiming readahead marker page.
In this series, PG_reclaim is used by invalidated page, too.
If VM find the page is invalidated and it's dirty, it sets PG_reclaim
to reclaim asap. Then, when the dirty page will be writeback,
clear_page_dirty_for_io will clear PG_reclaim unconditionally.
It disturbs this serie's goal.
I think it's okay to clear PG_readahead when the page is dirty, not
writeback time. So this patch moves ClearPageReadahead.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Changelog since v2:
- put ClearPageReclaim in set_page_dirty - suggested by Wu.
Changelog since v1:
- make the invalidated page reclaim asap - suggested by Andrew.
---
mm/page-writeback.c | 12 +++++++++++-
mm/swap.c | 48 +++++++++++++++++++++++++++++++++++-------------
2 files changed, 46 insertions(+), 14 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fc93802..88587a5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);
+ /*
+ * readahead/lru_deactivate_page could ...-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
My fault.
Resend.
Subject: [PATCH v3 2/3] Reclaim invalidated page ASAP
invalidate_mapping_pages is very big hint to reclaimer.
It means user doesn't want to use the page any more.
So in order to prevent working set page eviction, this patch
move the page into tail of inactive list by PG_reclaim.
Please, remember that pages in inactive list are working set
as well as active list. If we don't move pages into inactive list's
tail, pages near by tail of inactive list can be evicted although
we have a big clue about useless pages. It's totally bad.
Now PG_readahead/PG_reclaim is shared.
fe3cba17 added ClearPageReclaim into clear_page_dirty_for_io for
preventing fast reclaiming readahead marker page.
In this series, PG_reclaim is used by invalidated page, too.
If VM find the page is invalidated and it's dirty, it sets PG_reclaim
to reclaim asap. Then, when the dirty page will be writeback,
clear_page_dirty_for_io will clear PG_reclaim unconditionally.
It disturbs this serie's goal.
I think it's okay to clear PG_readahead when the page is dirty, not
writeback time. So this patch moves ClearPageReadahead.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Changelog since v2:
- put ClearPageReclaim in set_page_dirty - suggested by Wu.
Changelog since v1:
- make the invalidated page reclaim asap - suggested by Andrew.
---
mm/page-writeback.c | 12 +++++++++++-
mm/swap.c | 49 ++++++++++++++++++++++++++++++++++++-------------
2 files changed, 47 insertions(+), 14 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fc93802..88587a5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1250,6 +1250,17 @@ int set_page_dirty(struct page *page)
{
struct address_space ...I should have said this the last time but if you do another revision, make active a "bool". There is a very slow migration of int to bool in Whether you update active's type or not; Acked-by: Mel Gorman <mel@csn.ul.ie> -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
If we lose the race with writeback, the completion handler won't see
PG_reclaim, won't move the page, and we have an unwanted clean cache
page on the active list. Given the pagevec caching of those pages it
could be rather likely that IO completes before the above executes.
Shouldn't this be
if (PageWriteback() || PageDirty()) {
SetPageReclaim()
move_to_inactive_head()
} else {
move_to_inactive_tail()
}
instead?
Hannes
--
Fair enough. -- Kind regards, Minchan Kim --
I still dislike this one. I doubt this trick makes much benefit in real world workload. --
I would agree except as said elsewhere, it's a chicken and egg problem. We don't have a real world test because fadvise is not useful in its current iteration. I'm hoping that there will be a test comparing rsync on vanilla kernel rsync on patched kernel rsync+patch on vanilla kernel rsync+patch on patched kernel Are the results of such a test likely to happen? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Ben, Could you get the rsync execution time(user/sys) and 'cat /proc/vmstat' result before/after? If Ben is busy, I will try to get a data. but I need enough time. I expect rsync+patch on patched kernel should have a less allocstall, less pgscan -- Kind regards, Minchan Kim --
Yes, absolutely, although I'm sorry it has taken so long. Between thanksgiving and the impending end of the semester things have been a bit hectic. Nevertheless, I just finished putting together a script to record some metics from /proc/vmstat and /proc/[pid]/statm, so at this point I'm ready to finally take some data. Any suggestions for particular patterns to look for in the numbers or other metrics to record are welcome. Cheers, - Ben --
Now zap_pte_range alwayas activates pages which are pte_young && !VM_SequentialReadHint(vma). But in case of calling MADV_DONTNEED, it's unnecessary since the page wouldn't use any more. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Wu Fengguang <fengguang.wu@intel.com> Changelog since v2: - remove unnecessary description Changelog since v1: - change word from promote to activate - add activate argument to zap_pte_range and family function --- include/linux/mm.h | 4 ++-- mm/madvise.c | 4 ++-- mm/memory.c | 38 +++++++++++++++++++++++--------------- mm/mmap.c | 4 ++-- 4 files changed, 29 insertions(+), 21 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index e097df6..6032881 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -779,11 +779,11 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, unsigned long size); unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address, - unsigned long size, struct zap_details *); + unsigned long size, struct zap_details *, bool activate); unsigned long unmap_vmas(struct mmu_gather **tlb, struct vm_area_struct *start_vma, unsigned long start_addr, unsigned long end_addr, unsigned long *nr_accounted, - struct zap_details *); + struct zap_details *, bool activate); /** * mm_walk - callbacks for walk_page_range diff --git a/mm/madvise.c b/mm/madvise.c index 319528b..8bc4b2d 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -171,9 +171,9 @@ static long madvise_dontneed(struct vm_area_struct * vma, .nonlinear_vma = vma, .last_index = ULONG_MAX, }; - zap_page_range(vma, start, end - start, ...
Looks good to me. Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> --
I would prefer naming the parameter 'ignore_references' or something similar, so that it reflects the immediate effect on the zappers' behaviour, not what mark_page_accessed() might end up doing. Other than that, the patch looks good to me. Hannes --
Fair enough. Will fix. Maybe it would take a long time until sending next version. -- Kind regards, Minchan Kim --
Everyone else seems pretty happy with this, and I've not checked at all whether it achieves your purpose; but personally I'd much prefer a smaller patch which adds your "activate" or "ignore_references" flag to struct zap_details, instead of passing this exceptional arg down lots of levels. That's precisely the purpose of zap_details, to gather together a few things that aren't needed in the common case (though I admit the NULL details defaulting may be ugly). Hugh --
Hi Hugh, Before I sent RFC, I tried it and suffered from NULL detail as you said. But it's valuable to look on it, again. Since other guys don't opposed this patch's goal, I will have a time for unifying it into zap_details. -- Kind regards, Minchan Kim --
