Re: [PATCH 04/12] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events

Previous thread: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim by Mel Gorman on Monday, June 14, 2010 - 4:17 am. (22 messages)

Next thread: [PATCH] x86: add power_end event to process_*.c cpu_idle routine by Robert Schöne on Monday, June 14, 2010 - 4:37 am. (1 message)
From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

This is a merging of two series - the first of which reduces stack usage
in page reclaim and the second which writes contiguous pages during reclaim
and avoids writeback in direct reclaimers.

Changelog since V1
  o Merge with series that reduces stack usage in page reclaim in general
  o Allow memcg to writeback pages as they are not expected to overflow stack
  o Drop the contiguous-write patch for the moment

There is a problem in the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls
into the filesystems writepage function. This patch series aims to trace
writebacks so it can be evaulated how many dirty pages are being written,
reduce stack usage of page reclaim in general and avoid direct reclaim
writing back pages and overflowing the stack.

The first 4 patches are a forward-port of trace points that are partly based
on trace points defined by Larry Woodman but never merged. They trace parts of
kswapd, direct reclaim, LRU page isolation and page writeback. The tracepoints
can be used to evaluate what is happening within reclaim and whether things
are getting better or worse. They do not have to be part of the final series
but might be useful during discussion and for later regression testing -
particularly around the percentage of time spent in reclaim.

The 6 patches after that reduce the stack footprint of page reclaim by moving
large allocations out of the main call path. Functionally they should be
similar although there is a timing change on when pages get freed exactly.
This is aimed at giving filesystems as much stack as possible if kswapd is
to writeback pages directly.

Patch 11 puts dirty pages as it finds them onto a temporary list and then
writes them all out with a helper function. This simplifies patch 12 and
also increases the chances that IO requests can be optimally merged.

Patch 12 prevents direct reclaim writing out pages at all and instead dirty
pages are put back on the LRU. For ...
From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

This patch adds a trace event for when page reclaim queues a page for IO and
records whether it is synchronous or asynchronous. Excessive synchronous
IO for a process can result in noticeable stalls during direct reclaim.
Excessive IO from page reclaim may indicate that the system is seriously
under provisioned for the amount of dirty pages that exist.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/vmscan.h |   23 +++++++++++++++++++++++
 mm/vmscan.c                   |    2 ++
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a331454..b26daa9 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -154,6 +154,29 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
 		__entry->nr_lumpy_dirty,
 		__entry->nr_lumpy_failed)
 );
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+	TP_PROTO(struct page *page,
+		int sync_io),
+
+	TP_ARGS(page, sync_io),
+
+	TP_STRUCT__entry(
+		__field(struct page *, page)
+		__field(int, sync_io)
+	),
+
+	TP_fast_assign(
+		__entry->page = page;
+		__entry->sync_io = sync_io;
+	),
+
+	TP_printk("page=%p pfn=%lu sync_io=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->sync_io)
+);
 		
 #endif /* _TRACE_VMSCAN_H */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25bf05a..58527c4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -399,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
 		}
+		trace_mm_vmscan_writepage(page,
+			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
-- 
1.7.1

--

From: Rik van Riel
Date: Monday, June 14, 2010 - 9:48 am

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Larry Woodman
Date: Monday, June 14, 2010 - 2:02 pm

Acked-by: Larry Woodman <lwoodman@redhat.com>


--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Now, max_scan of shrink_inactive_list() is always passed less than
SWAP_CLUSTER_MAX. then, we can remove scanning pages loop in it.
This patch also help stack diet.

detail
 - remove "while (nr_scanned < max_scan)" loop
 - remove nr_freed (now, we use nr_reclaimed directly)
 - remove nr_scan (now, we use nr_scanned directly)
 - rename max_scan to nr_to_scan
 - pass nr_to_scan into isolate_pages() directly instead
   using SWAP_CLUSTER_MAX

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |  211 ++++++++++++++++++++++++++++-------------------------------
 1 files changed, 100 insertions(+), 111 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 29e1ecd..6a42ebc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1132,15 +1132,21 @@ static int too_many_isolated(struct zone *zone, int file,
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
+	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+	struct page *page;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1155,129 +1161,112 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned ...
From: Rik van Riel
Date: Monday, June 14, 2010 - 11:06 am

Other than the weird whitespace below,


-- 
All rights reversed
--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 3:13 am

Not sure where this came out of. It's not in my local patch file nor in


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

Remove temporary variable that is only used once and does not help
clarify code.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a42ebc..246e084 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1721,13 +1721,12 @@ static void shrink_zone(int priority, struct zone *zone,
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	struct zoneref *z;
 	struct zone *zone;
 	bool all_unreclaimable = true;
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
-					sc->nodemask) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 		/*
@@ -1773,7 +1772,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long writeback_threshold;
 
 	get_mems_allowed();
@@ -1785,7 +1783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	 * mem_cgroup will not do shrink_slab.
 	 */
 	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		for_each_zone_zonelist(zone, z, zonelist, gfp_zone(sc->gfp_mask)) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-- 
1.7.1

--

From: Rik van Riel
Date: Monday, June 14, 2010 - 11:14 am

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

This patch adds an event for when pages are isolated en-masse from the
LRU lists. This event augments the information available on LRU traffic
and can be used to evaluate lumpy reclaim.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/vmscan.h |   46 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |   14 ++++++++++++
 2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f76521f..a331454 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -109,6 +109,52 @@ TRACE_EVENT(mm_vmscan_direct_reclaim_end,
 	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
 );
 
+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+	TP_PROTO(int order,
+		unsigned long nr_requested,
+		unsigned long nr_scanned,
+		unsigned long nr_taken,
+		unsigned long nr_lumpy_taken,
+		unsigned long nr_lumpy_dirty,
+		unsigned long nr_lumpy_failed,
+		int isolate_mode),
+
+	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+	TP_STRUCT__entry(
+		__field(int, order)
+		__field(unsigned long, nr_requested)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
+		__field(unsigned long, nr_lumpy_taken)
+		__field(unsigned long, nr_lumpy_dirty)
+		__field(unsigned long, nr_lumpy_failed)
+		__field(int, isolate_mode)
+	),
+
+	TP_fast_assign(
+		__entry->order = order;
+		__entry->nr_requested = nr_requested;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
+		__entry->nr_lumpy_taken = nr_lumpy_taken;
+		__entry->nr_lumpy_dirty = nr_lumpy_dirty;
+		__entry->nr_lumpy_failed = nr_lumpy_failed;
+		__entry->isolate_mode = isolate_mode;
+	),
+
+	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu ...
From: Rik van Riel
Date: Monday, June 14, 2010 - 9:47 am

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Larry Woodman
Date: Monday, June 14, 2010 - 2:02 pm

Acked-by: Larry Woodman <lwoodman@redhat.com>


--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

Page reclaim cleans individual pages using a_ops->writepage() because from
the VM perspective, it is known that pages in a particular zone must be freed
soon, it considers the target page to be the oldest and it does not want
to wait while background flushers cleans other pages. From a filesystem
perspective this is extremely inefficient as it generates a very seeky
IO pattern leading to the perverse situation where it can take longer to
clean all dirty pages than it would have otherwise.

This patch queues all dirty pages at once to maximise the chances that
the write requests get merged efficiently. It also makes the next patch
that avoids writeout from direct reclaim more straight-forward.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |  175 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 131 insertions(+), 44 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 019f0af..4856a2a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -323,6 +323,55 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+int write_reclaim_page(struct page *page, struct address_space *mapping,
+						enum pageout_io sync_writeback)
+{
+	int res;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = SWAP_CLUSTER_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 1,
+	};
+
+	if (!clear_page_dirty_for_io(page))
+		return PAGE_CLEAN;
+
+	SetPageReclaim(page);
+	res = mapping->a_ops->writepage(page, &wbc);
+	/*
+	 * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
+	 * the page lock has been dropped by ->writepage, that mapping could
+	 * be anything
+	 */
+	if (res < 0)
+		handle_write_error(mapping, page, res);
+	if (res == AOP_WRITEPAGE_ACTIVATE) {
+		ClearPageReclaim(page);
+		return PAGE_ACTIVATE;
+	}
+
+	/*
+	 * Wait on writeback if requested to. This happens when
+	 * direct reclaiming a large contiguous area and the
+	 * first attempt to free a ...
From: Rik van Riel
Date: Monday, June 14, 2010 - 2:13 pm

Reclaiming clean pages should be fast enough that this should


Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 3:18 am

Indeed, this was a bit weak. The original point of the patch was to write
contiguous pages belonging to the same inode when they were encountered in
that batch which made a bit more sense but didn't work out at first


Thanks again :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Chinner
Date: Monday, June 14, 2010 - 4:11 pm

Seeing as you have a list of pages for IO, perhaps they could be sorted
before issuing ->writepage on them.

That is, while this patch issues all the IO in one hit, it doesn't
change the order in which the IO is issued - it is still issued in
LRU order. Given that they are issued in a short period of time now,
rather than across a longer scan period, it is likely that it will
not be any faster as:

	a) IO will not be started as soon, and
	b) the IO scheduler still only has a small re-ordering
	   window and will choke just as much on random IO patterns.

However, there is a list_sort() function that could be used to sort
the list; sorting the list of pages by mapping and page->index
within the mapping would result in all the pages on each mapping
being sent down in ascending offset order at once - exactly how the
filesystems want IO to be sent to it.  Perhaps this is a simple
improvement that can be made to this code that will make a big
difference to worst case performance.

FWIW, I did this for delayed metadata buffer writeback in XFS
recently (i.e. sort the queue of (potentially tens of thousands of)
buffers in ascending block order before dispatch) and that showed a
10-15% reduction in seeks on simple kernel compile workloads. This
shows that if we optimise IO patterns at higher layers where the
sort window is much, much larger than in the IO scheduler, then
overall system performance improves....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Andrew Morton
Date: Monday, June 14, 2010 - 4:21 pm

On Tue, 15 Jun 2010 09:11:44 +1000

Yup.

But then, this all really should be done at the block layer so other
io-submitting-paths can benefit from it.

IOW, maybe "the sort queue is the submission queue" wasn't a good idea.

--

From: Dave Chinner
Date: Monday, June 14, 2010 - 5:39 pm

That was what we did in the past with really, really deep IO
scheduler queues. That leads to IO latency and OOM problems because
we could lock gigabytes of memory away under IO and take minutes to
clean it.

Besides, there really isn't the right context in the block layer to
be able to queue and prioritise large amounts of IO without

Perhaps, but IMO sorting should be done where the context allows it
to be done most efficiently. Sorting is most effective when ever a
significant queue of IO is formed, whether it be in the filesystem,
the VFS, the VM or the block layer because the IO stack is very much
a GIGO queue.

Simply put, there's nothing the lower layers can do to optimise bad
IO patterns from the higher layers because they have small sort
windows which are necessary to keep IO latency in check. Hence if
the higher layers feed the lower layers crap they simply don't have
the context or depth to perform the same level of optimistations we
can do easily higher up the stack.

IOWs, IMO anywhere there is a context with significant queue of IO,
that's where we should be doing a better job of sorting before that
IO is dispatched to the lower layers. This is still no guarantee of
better IO (e.g. if the filesystem fragments the file) but it does
give the lower layers a far better chance at optimal allocation and
scheduling of IO...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Rik van Riel
Date: Monday, June 14, 2010 - 6:16 pm

Can we kick flushing for the whole inode at once from
vmscan.c?

That way we should:
1) ensure that the page we want is written to disk, and
2) we flush out related pages at the same time, getting
    a decent IO pattern

Chances are that if we want to evict one page from a
file, we'll also want to evict other pages from that
same file.  In fact, chances are a good number of them
will live nearby on the LRU list.

Does this make sense?

Would it be hard to add a "please flush this file"
way to call the filesystem flushing threads?

-- 
All rights reversed
--

From: Andrew Morton
Date: Monday, June 14, 2010 - 6:45 pm

Passing the igrab()bed inode into the flusher threads would fix the
iput_final() problems, as long as the alloc_pages() caller never blocks
indefinitely waiting for the work which the flusher threads are doing. 

Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages()
caller holds VFS locks and is waiting for the flusher threads while all
the flusher threads are stuck under iput_final() waiting for those VFS
locks.

That's fixable by not using igrab()/iput().  You can use lock_page() to
pin the address_space.  Pass the address of the locked page across to
the flusher threads so they don't try to lock it a second time, or just
use trylocking on that writeback path or whatever.

--

From: Rik van Riel
Date: Monday, June 14, 2010 - 9:08 pm

Any thread that does not have __GFP_FS set in its gfp_mask
cannot wait for the flusher to complete. This is regardless
of the mechanism used to kick the flusher.

Then again, those threads cannot call ->writepage today
either, so we should be fine keeping that behaviour.

Threads that do have __GFP_FS in their gfp_mask can wait
for the flusher in various ways.  Maybe the lock_page()
method can be simplified by having the flusher thread
unlock the page the moment it gets it, and then run the
normal flusher code?

The pageout code (in shrink_page_list) already unlocks
the page anyway before putting it back on the relevant
LRU list.  It would be easy enough to skip that unlock
and let the flusher thread take care of it.

-- 
All rights reversed
--

From: Andrew Morton
Date: Monday, June 14, 2010 - 9:37 pm

mm...  kinda.  A bare order-zero __GFP_WAIT allocation can still wait

I'm not sure.  iput_final() can take a lot of locks, both VFS and
heaven knows what within the individual filesystems.  Is it the case
that all allocations which occur under all of those locks is always

Well, _something_ has to pin the address_space.  A single locked page

Once that page is unlocked, we can't touch *mapping - its inode can be
concurrently reclaimed.  Although I guess the technique in
handle_write_error() can be reused.

--

From: Nick Piggin
Date: Monday, June 14, 2010 - 10:12 pm

__GFP_FS is set with i_mutex held in places, and there is nothing to

Nasty. That guy needs to be using lock_page_nosync().

--

From: Nick Piggin
Date: Monday, June 14, 2010 - 10:43 pm

--

Need lock_page_nosync here because we have no reference to the mapping when
taking the page lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -296,7 +296,7 @@ static int may_write_to_queue(struct bac
 static void handle_write_error(struct address_space *mapping,
 				struct page *page, int error)
 {
-	lock_page(page);
+	lock_page_nosync(page);
 	if (page_mapping(page) == mapping)
 		mapping_set_error(mapping, error);
 	unlock_page(page);
--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 6:23 am

Thanks, I've picked this up and merged it into the series and removed

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 4:01 am

kswapd really should be a last effort tool to clean filesystem pages.
If it does enough I/O for this to matter significantly we need to
fix the VM to move more work to the flusher threads instead of trying

We already have that API, in Jens' latest tree that's
sync_inodes_sb/writeback_inodes_sb.  We could also add a non-waiting
variant if required, but I think the big problem with kswapd is that
we want to wait on I/O completion under circumstances.
--

From: Rik van Riel
Date: Tuesday, June 15, 2010 - 6:32 am

However, kswapd does not need to wait on I/O completion of
any page in particular - it just wants to wait on I/O
completion of any inactive pages in the zone (or memcg)
where memory is being freed.

-- 
All rights reversed
--

From: Andrew Morton
Date: Monday, June 14, 2010 - 6:39 pm

None of what you said had much to do with what I said.

What you've described are implementation problems in the current block
layer because it conflates "sorting" with "queueing".  I'm saying "fix
that".

And...  sorting at the block layer will always be superior to sorting
at the pagecache layer because the block layer sorts at the physical
block level and can handle not-well-laid-out files and can sort and merge
pages from different address_spaces.

Still, I suspect none of it will improve anything anyway.  Those pages
are still dirty, possibly-locked and need to go to disk.  It doesn't
matter from the MM POV whether they sit in some VM list or in the
request queue.

Possibly there may be some benefit to not putting so many of these
unreclaimable pages into the queue all at the the same time.  But
that's a shortcoming in the block code: we should be able to shove
arbitrary numbers of dirty page (segments) into the queue and not gum
the system up.  Don't try to work around that in the VM.

--

From: Dave Chinner
Date: Monday, June 14, 2010 - 8:20 pm

Yes it, can do that. And it still does that even if the higher
layers sort their I/O dispatch better,

Filesystems try very hard to allocate adjacent logical offsets in a
file in adjacent physical blocks on disk - that's the whole point of
extent-indexed filesystems. Hence with modern filesystems there is
generally a direct correlation between the page {mapping,index}
tuple and the physical location of the mapped block.

i.e. there is generally zero physical correlation between pages in
different mappings, but there is a high physical correlation
between the index of pages on the same mapping. Hence by sorting
where we have a {mapping,index} context, we push out IO that is
much more likely to be in contiguous physical chunks that the
current random page shootdown.

We optimise applications to use these sorts of correlations all the
time to improve IO patterns. Why can't we make the same sort of


I think you know perfectly well why the system gums up when we
increase block layer queue depth: it's the fact that the _VM_ relies
on block layer queue congestion to limit the amount of dirty memory
in the system.

We've got a feedback loop between the block layer and the VM that
only works if block device queues are kept shallow. Keeping the
number of dirty pages under control is a VM responsibility, but it
is putting limitations on the block layer to ensure that the VM
works correctly.  If you want the block layer to have deep queues,
then someone needs to fix the VM not to require knowledge of the
internal operation of the block layer for correct operation.

Adding a few lines of code to sort a list in the VM is far, far
easier than redesigning the write throttling code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Andrew Morton
Date: Monday, June 14, 2010 - 9:15 pm

Yes you can.  That's exactly what you're recommending!  Only you're
recommending doing it at the wrong level.  The fs-writeback radix-tree
walks do it at the wrong level too.  Sorting should be done within, or
in a layer above the block queues, not within the large number of

Nope.  Large-number-of-small-files is a pretty common case.  If the fs
doesn't handle that well (ie: by placing them nearby on disk), it's

We can, but it shouldn't be specific to page reclaim.  Other places

The only difference is that pages which are in the queue (current

mm, a little bit still, I guess.  Mainly because dirty memory
management isn't zone aware, so even though we limit dirty memory
globally, a particular zone(set) can get excessively dirtied.

Most of this problem happen on the balance_dirty_pages() path, where we

It's a hack and a workaround.  And I suspect it won't make any
difference, especially given Mel's measurements of the number of dirty
pages he's seeing coming off the LRU.  Although those numbers may well
be due to the new quite-low dirty memory thresholds.  

It would be interesting to code up a little test patch though, see if
there's benefit to be had going down this path.

--

From: Dave Chinner
Date: Monday, June 14, 2010 - 11:36 pm

Umm, I suggested sorting a queue dirty pages that was build by
reclaim before dispatching them. How does that translate to

If you feed a filesystem garbage IO, you'll get garbage performance
and there's nothing that a block layer sort queue can do to fix the
damage it does to both performance and filesystem fragmentation
levels. It's not just about IO issue - delayed allocation pretty
much requires writeback to be issuing well formed IOs to reap the

Filesystems already handle this case just fine as we see it from
writeback all the time. Untarring a kernel is a good example of
this...

I suggested sorting all the IO to be issued into per-mapping page
groups because:
	a) makes IO issued from reclaim look almost exactly the same
	   to the filesytem as if writeback is pushing out the IO.
	b) it looks to be a trivial addition to the new code.


I doubt Mel's tests cases will show anything - they simply didn't
show enough IO issued from reclaim to make any difference.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Tuesday, June 15, 2010 - 3:55 am

The solution is not to sort pages on their way to be submitted either,
really.

What I do in fsblock is to maintain a block-nr sorted tree of dirty
blocks. This works nicely because fsblock dirty state is properly
synchronized with page dirty state. So writeout can just walk this in
order and it provides pretty optimal submission pattern of any
interleavings of data and metadata. No need for buffer boundary or
hacks like that. (needs some intelligence for delalloc, though).

But even with all that, it's not the complete story. It doesn't know
about direct IO, sync IO, or fsyncs, and it would be very hard and
ugly to try to synchronise and sort all that from the pagecache level.
It also is a heuristic in terms of optimal block scheduling behaviour.
With smarter devices and drivers there might be better ways to go.

So what is needed is to get as much info into the block layer as
possible. As Andrew says, there shouldn't be such a big difference
between pages being writeback or dirty in pagecache.

--

From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 4:10 am

I think worrying about indirect blocks really doesn't matter much
these days.  For one thing extent based filesystems have a lot less
of these, and second for a journaling filesystem we only need to log
modification to the indirect blocks and not actually write them back
in place during the sync.  At least for XFS the actual writeback can
happen a lot later, as part of the ordered list of delwri buffers.

--

From: Nick Piggin
Date: Tuesday, June 15, 2010 - 4:20 am

That's true, more importantly I meant any interleavings of data from
more than one file too.

--

From: Dave Chinner
Date: Tuesday, June 15, 2010 - 4:20 pm

How does this work with delayed allocation where there is no block
number associated with the page until writeback calls the allocation
routine?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Tuesday, June 15, 2010 - 11:04 pm

It doesn't. I have been thinking about how best to make that work.
The mm/writeback is not in a good position to know what to do, so
the fs would have to help.

So either an fs callback, or the fs would have to insert the blocks
(or some marker) into the tree itself. It's relatively easy to do for
a single file (just walk the radix-tree and do delalloc conversions),
but between multiple files is harder (current code has the same problem
though).

--

From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 4:08 am

That's why we still have block layer sorting.  But for the problem
of larger files doing the sorting above the filesystem is a lot
more efficient, not just primarily due to the I/O patters but because
it makes life for the filesystem writeback code and allocator a lot

Which is actually more or less true - if we do larger amounts of
writeback from kswapd we're toast anyway and performance and allocation
patters go down the toilet.  Then again throwing a list_sort in is
a rather trivial addition.  Note that in addition to page->index we
can also sort by the inode number in the sort function.  At least for
XFS and the traditional ufs derived allocators that will give you
additional locality for small files.

--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 4:43 am

I tested with a dirty ratio of 40 but I didn't see a major problem.
It's still a case with the tests I saw that direct reclaim of dirty pages
was a relatively rare event except when lumpy reclaim was involved. What
did change is the amount of scanning the direct reclaim and kswapd had to do
(both increased quite a bit) but the percentage of dirty pages encountered was
roughly the same (1-2% of scanned pages were dirty in the case of sysbench).

This is sysbench only rather than flooding with more data.

DIRTY RATIO == 20
FTrace Reclaim Statistics
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                               9843      13398      51651 
Direct reclaim pages scanned                871367    1008709    3080593 
Direct reclaim write async I/O               24883      30699          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                       7070819    6961672   11268341 
Kswapd wakeups                                1578       1500        943 
Kswapd pages scanned                      22016558   21779455   17393431 
Kswapd reclaim write async I/O             1161346    1101641    1717759 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)             26.11      45.04       2.97 
Time kswapd awake (ms)                     5105.06    5135.93    6086.32 

User/Sys Time Running Test (seconds)        734.52    712.39     703.9
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               9710.02   9589.20   9334.45
Percentage Time kswapd Awake                 0.06%     0.00%     0.00%

DIRTY RATIO == 40
FTrace Reclaim Statistics
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                              29945      41887     163006 
Direct reclaim pages scanned               2853804    3075288   13142072 
Direct reclaim write async ...
From: tytso
Date: Tuesday, June 15, 2010 - 6:07 am

I suspect the right answer is we need to sort both at the block layer
and either (a) before you pass things to the filesystem layer, or if
you don't do that (b) the filesystem will be forced to do its own
queuing/sorting at the very least for delayed allocation pages, so the
allocator can do something sane.  And given that there are multiple
file systems that support delayed allocation, it would be nice if this
could be recognized by the writeback code, as opposed to having btrfs,
xfs, ext4, all having to implement something very similar at the fs
layer.

							- Ted
--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 8:44 am

The patch to sort the list being cleaned by reclaim looks like this.
It's not actually tested

vmscan: Sort pages being queued for IO before submitting to the filesystem

While page reclaim submits dirty pages in batch, it doesn't change the
order in which the IO is issued - it is still issued in LRU order. Given
that they are issued in a short period of time now, rather than across a
longer scan period, it is likely that it will not be any faster as:

        a) IO will not be started as soon, and
        b) the IO scheduler still only has a small re-ordering
           window and will choke just as much on random IO patterns.

This patch uses list_sort() function to sort
the list; sorting the list of pages by mapping and page->index
within the mapping would result in all the pages on each mapping
being sent down in ascending offset order at once - exactly how the
filesystems want IO to be sent to it.

Credit mostly goes to Dave Chinner for this idea and the changelog text.

----

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 68b3d22..02ab246 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -32,6 +32,7 @@
 #include <linux/topology.h>
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
+#include <linux/list_sort.h>
 #include <linux/notifier.h>
 #include <linux/rwsem.h>
 #include <linux/delay.h>
@@ -651,6 +652,34 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 		__pagevec_free(&freed_pvec);
 }
 
+/* Sort based on mapping then index */
+static int page_writeback_cmp(void *data, struct list_head *a, struct list_head *b)
+{
+	struct page *ap = list_entry(a, struct page, lru);
+	struct page *bp = list_entry(b, struct page, lru);
+	pgoff_t diff;
+
+	/*
+	 * Page not locked but it's not critical, the mapping is just for sorting
+	 * If the mapping is no longer valid, it's of little consequence
+	 */
+	if (ap->mapping != bp->mapping) {
+		if (ap->mapping < bp->mapping)
+			return -1;
+		if (ap->mapping > bp->mapping)
+			return ...
From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 3:57 am

Even if has not effect on the actual I/O patters it has a massive
benefit for the filesystem.  When probing delalloc/unwritten space at
least XFS does try to convert a larger extent forward from the index,
but doesn't bother to go backwards.  By providing the trivial sort here
we make life a lot easier for the filesystem.

In addition to that we do get better I/O patters especially with short
queues as smart writepage implementatons will also submit the next few
pages, which is essentially free given how the storage works.
This means we already have a page cleaned before we might even submit it
without sorting.

--

From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 3:53 am

how about:

	if (PageWriteback(page) {
		if (sync_writeback == PAGEOUT_IO_SYNC)
			wait_on_page_writeback(page);
	} else {
		/* synchronous write or broken a_ops? */
		ClearPageReclaim(page);

A rather pointless line of 80 chars.  I see the point for long string

I think there is.  There's quite a few places that do hold multiple
pages locked, but they always lock pages in increasing page->inxex order.
Given that this locks basically in random order it could cause problems
for those places.

--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 4:11 am

With the page lock released, the mapping may be no longer valid. Nick
posted a patch in relation to it that I need to look at. The comment was
because Andrew highlight that this was buggy and I wanted to make sure I



Hmm, ok. In that case, I'll have to release the locks on the list and
reacquire them. It was something I would have preferred to avoid. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Nick Piggin
Date: Tuesday, June 15, 2010 - 4:13 am

It's just a pretty simple use-after-free. Maybe people forget it because

There shouldn't be a problem _holding_ the locks, but there is a problem
waiting for multiple locks out of page->index order.

But there is a problem with holding the lock of a lot of pages while
calling ->writepage on them. So yeah, you can't do that.

Hmm, I should rediff that lockdep page_lock patch and get it merged.
(although I don't know if that can catch these all these problems easily)

--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

When shrink_inactive_list() isolates pages, it updates a number of
counters using temporary variables to gather them. These consume stack
and it's in the main path that calls ->writepage(). This patch moves the
accounting updates outside of the main path to reduce stack usage.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |   63 +++++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 165c2f5..019f0af 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1073,7 +1073,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 			ClearPageActive(page);
 			nr_active++;
 		}
-		count[lru]++;
+		if (count)
+			count[lru]++;
 	}
 
 	return nr_active;
@@ -1153,12 +1154,13 @@ static int too_many_isolated(struct zone *zone, int file,
  * TODO: Try merging with migrations version of putback_lru_pages
  */
 static noinline_for_stack void putback_lru_pages(struct zone *zone,
-				struct zone_reclaim_stat *reclaim_stat,
+				struct scan_control *sc,
 				unsigned long nr_anon, unsigned long nr_file,
  				struct list_head *page_list)
 {
 	struct page *page;
 	struct pagevec pvec;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	pagevec_init(&pvec, 1);
 
@@ -1197,6 +1199,37 @@ static noinline_for_stack void putback_lru_pages(struct zone *zone,
 	pagevec_release(&pvec);
 }
 
+static noinline_for_stack void update_isolated_counts(struct zone *zone, 
+					struct scan_control *sc,
+					unsigned long *nr_anon,
+					unsigned long *nr_file,
+					struct list_head *isolated_list)
+{
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+
+	nr_active = clear_active_flags(isolated_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
+
+	__mod_zone_page_state(zone, ...
From: Rik van Riel
Date: Monday, June 14, 2010 - 12:42 pm

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

shrink_page_list() sets up a pagevec to release pages as according as they
are free. It uses significant amounts of stack on the pagevec. This
patch adds pages to be freed via pagevec to a linked list which is then
freed en-masse at the end. This avoids using stack in the main path that
potentially calls writepage().

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   37 +++++++++++++++++++++++++++++--------
 1 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34c5c87..165c2f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,6 +620,25 @@ static enum page_references page_check_references(struct page *page,
 	return PAGEREF_RECLAIM;
 }
 
+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+	struct pagevec freed_pvec;
+	struct page *page, *tmp;
+
+	pagevec_init(&freed_pvec, 1);
+
+	list_for_each_entry_safe(page, tmp, free_pages, lru) {
+		list_del(&page->lru);
+		if (!pagevec_add(&freed_pvec, page)) {
+			__pagevec_free(&freed_pvec);
+			pagevec_reinit(&freed_pvec);
+		}
+	}
+
+	if (pagevec_count(&freed_pvec))
+		__pagevec_free(&freed_pvec);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -628,13 +647,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
+	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -809,10 +827,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+
+		/*
+		 * Is there need to periodically free_page_list? It would
+		 * ...
From: Rik van Riel
Date: Monday, June 14, 2010 - 12:24 pm

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Andrew Morton
Date: Wednesday, June 16, 2010 - 4:48 pm

On Mon, 14 Jun 2010 12:17:50 +0100

hm, spose so.  I cen't see any trivial way to eliminate the local

That's an open-coded pagevec_free().
--

From: Mel Gorman
Date: Thursday, June 17, 2010 - 3:46 am

Fair point, will correct. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

This patch adds two trace events for kswapd waking up and going asleep for
the purposes of tracking kswapd activity and two trace events for direct
reclaim beginning and ending. The information can be used to work out how
much time a process or the system is spending on the reclamation of pages
and in the case of direct reclaim, how many pages were reclaimed for that
process.  High frequency triggering of these events could point to memory
pressure problems.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/gfpflags.h |   37 +++++++++++++
 include/trace/events/kmem.h     |   38 +-------------
 include/trace/events/vmscan.h   |  115 +++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                     |   24 +++++++--
 4 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h

diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we wil get:
+ *
+ *  GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",				\
+	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
+	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
+	{(unsigned long)GFP_USER,		"GFP_USER"},		\
+	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
+	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
+	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
+	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
+	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
+	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
+	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned ...
From: Rik van Riel
Date: Monday, June 14, 2010 - 8:45 am

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Larry Woodman
Date: Monday, June 14, 2010 - 2:01 pm

Acked-by: Larry Woodman <lwoodman@redhat.com>


--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
uses significant amounts of stack doing this. This patch splits
shrink_inactive_list() to take the stack usage out of the main path so
that callers to writepage() do not contain an unused pagevec on the
stack.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |   95 +++++++++++++++++++++++++++++++++-------------------------
 1 files changed, 54 insertions(+), 41 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 246e084..34c5c87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,19 +1129,65 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 /*
+ * TODO: Try merging with migrations version of putback_lru_pages
+ */
+static noinline_for_stack void putback_lru_pages(struct zone *zone,
+				struct zone_reclaim_stat *reclaim_stat,
+				unsigned long nr_anon, unsigned long nr_file,
+ 				struct list_head *page_list)
+{
+	struct page *page;
+	struct pagevec pvec;
+
+	pagevec_init(&pvec, 1);
+
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	spin_lock(&zone->lru_lock);
+	while (!list_empty(page_list)) {
+		int lru;
+		page = lru_to_page(page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
+		}
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+}
+
+/*
  ...
From: Christoph Hellwig
Date: Tuesday, June 15, 2010 - 3:47 am

I hate to nitpick on this, but with noinline_for_stack a prototype
really is unreadbale unless the function identifier goes to the next
line.  Compare the one above with:

static noinline_for_stack void
putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
		unsigned long nr_anon, unsigned long nr_file,

Same here, just even worse due to the spill over ove 80 characters.

--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 8:56 am

It's a fair nitpick and so is cleaned up.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrew Morton
Date: Wednesday, June 16, 2010 - 4:43 pm

On Mon, 14 Jun 2010 12:17:49 +0100

You can get the entire pagevec off the stack - just make it a
static-to-shrink_inactive_list() pagevec-per-cpu.

Locking just requires pinning to a CPU.  We could trivially co-opt
shrink_inactive_list()'s spin_lock_irq() for that, but
pagevec_release() can be relatively expensive so it'd be sad to move
that inside spin_lock_irq().  It'd be better to slap a
get_cpu()/put_cpu() around the whole thing.

--

From: Mel Gorman
Date: Thursday, June 17, 2010 - 3:30 am

That idea has been floated as well. I didn't pursue it because Dave
said that giving page reclaim a stack diet was never going to be the
full solution so I didn't think the complexity was justified.

I kept some of the stack reduction stuff because a) it was there and b)

It'd be something interesting to try out when nothing else was happening but
I'm not going to focus on it for the moment unless I think it will really
help this stack overflow problem.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Since 2.6.28 zone->prev_priority is unused. Then it can be removed
safely. It reduce stack usage slightly.

Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
can be integrate again, it's useful. but four (or more) times trying
haven't got good performance number. Thus I give up such approach.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |   15 ------------
 mm/page_alloc.c        |    2 -
 mm/vmscan.c            |   57 ------------------------------------------------
 mm/vmstat.c            |    2 -
 4 files changed, 0 insertions(+), 76 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = ...
From: Rik van Riel
Date: Monday, June 14, 2010 - 11:04 am

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Andrew Morton
Date: Wednesday, June 16, 2010 - 4:37 pm

On Mon, 14 Jun 2010 12:17:46 +0100

This would have been badder in earlier days when we were using the
scanning priority to decide when to start unmapping pte-mapped pages -
page reclaim would have been recirculating large blobs of mapped pages
around the LRU until the priority had built to the level where we
started to unmap them.

However that priority-based decision got removed and right now I don't
recall what it got replaced with.  Aren't we now unmapping pages way
too early and suffering an increased major&minor fault rate?  Worried.


Things which are still broken after we broke prev_priority:

- If page reclaim is having a lot of trouble, prev_priority would
  have permitted do_try_to_free_pages() to call disable_swap_token()
  earlier on.  As things presently stand, we'll do a lot of
  thrash-detection stuff before (presumably correctly) dropping the
  swap token.

  So.  What's up with that?  I don't even remember _why_ we disable
  the swap token once the scanning priority gets severe and the code
  comments there are risible.  And why do we wait until priority==0
  rather than priority==1?

- Busted prev_priority means that lumpy reclaim will act oddly. 
  Every time someone goes into do some recalim, they'll start out not
  doing lumpy reclaim.  Then, after a while, they'll get a clue and
  will start doing the lumpy thing.  Then they return from reclaim and
  the next recalim caller will again forget that he should have done
  lumpy reclaim.

  I dunno what the effects of this are in the real world, but it
  seems dumb.

And one has to wonder: if we're making these incorrect decisions based
upon a bogus view of the current scanning difficulty, why are these
various priority-based thresholding heuristics even in there?  Are they
doing anything useful?

So..  either we have a load of useless-crap-and-cruft in there which
should be lopped out, or we don't have a load of useless-crap-and-cruft

The patch forgot to remove mem_cgroup_get_reclaim_priority() ...
From: Rik van Riel
Date: Wednesday, June 16, 2010 - 4:45 pm

We keep a different set of statistics to decide whether to
reclaim only page cache pages, or both page cache and
anonymous pages. The function get_scan_ratio parses those

The reason is that we never page out the pages belonging to the
process owning the swap token (with the exception of that process
evicting its own pages).

If that process has a really large RSS in the current zone, and
we are having problems freeing pages, it may be beneficial to
also evict pages from that process.

Now that the LRU lists are split out into file backed and swap
backed, it may be a lot easier to find pages to evict.  That
may mean we could notice we're getting into trouble at much
higher priority levels and disable the swap token at a higher
priority level.

I do not believe prev_priority will be very useful here, since

How common are lumpy reclaims, anyway?

Isn't it more likely that in-between every two higher-order
reclaims, a number of order zero reclaims will be happening?

In that case, the prev_priority logic may have introduced the

The prev_priority code was useful when we had filesystem and
swap backed pages mixed on the same LRU list.

I am not convinced it still has any use.

-- 
All rights reversed
--

From: Andrew Morton
Date: Wednesday, June 16, 2010 - 5:18 pm

On Wed, 16 Jun 2010 19:45:29 -0400

I wasn't talking about anon-vs-file.  I was referring to mapped-file
versus not-mapped file.  If the code sees a mapped page come off the
tail of the LRU it'll just unmap and reclaim the thing.  This policy
caused awful amounts of paging activity when someone started doing lots
of read() activity, which is why the VM was changed to value mapped
pagecache higher than unmapped pagecache.  Did this biasing get

hm, lots of "may"s there.



A lot more than they should be, I suspect, given the recent trend
towards asking for higher-order allocations.  Kernel developers should


No, stop saying swap! ;)

It's all to do with mapped pagecache versus unmapped pagecache.  "ytf
does my browser get paged out all the time".



--

From: Rik van Riel
Date: Wednesday, June 16, 2010 - 5:34 pm

It changed a little, but we still have it:

1) we do not deactivate active file pages if the active file
    list is smaller than the inactive file list - this protects
    the working set from streaming IO

2) we keep mapped referenced executable pages on the active file
    list if they got accessed while on the active list, while

I suspect it does, but I have not actually tested that code

For one, memory sizes today are a lot larger than they were
when 2.6.0 came out.

Secondly, we now know more exactly what is on each LRU list.
That should greatly reduce unnecessary turnover of the list.

For example, if we know there is no swap space available, we
will not bother scanning the anon LRU lists.

If we know there is not enough file cache left to get us up
to the zone high water mark, we will not bother scanning the
few remaining file pages.

Because of those simple checks (in get_scan_priority), I do
not expect that we will have to scan through all of memory
as frequently as we had to do in 2.6.0.

Furthermore, we unconditionally deactivate most active pages
and have a working used-once scheme for pages on the anon
lists.  This should also contribute to a reduction in the

We have other measures in place now to protect the working set
on the file LRU lists (see above).  We are able to have those
measures in the kernel because we no longer have mixed LRU
lists.

-- 
All rights reversed
--

From: KOSAKI Motohiro
Date: Friday, June 25, 2010 - 1:29 am

sorry for the long delay.

May I explain my experience? I'd like to explain why prev_priority wouldn't
works nowadays. 

First of all, Yes, current vmscan still a lot of UP centric code. it 
expose some weakness on some dozens CPUs machine. I think we need 
more and more improvement.

The problem is, current vmscan mix up per-system-pressure, per-zone-pressure
and per-task-pressure a bit. example, prev_priority try to boost priority to
other concurrent priority. but If the another task have mempolicy restriction,
It's unnecessary, but also makes wrong big latency and exceeding reclaim.
per-task based priority + prev_priority adjustment make the emulation of
per-system pressure. but it have two issue 1) too rough and brutal emulation
2) we need per-zone pressure, not per-system.

another example, currently DEF_PRIORITY is 12. it mean the lru rotate about
2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer.
but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the
system have higher memory pressure than priority==0 (1/4096*10,000 > 2).
prev_priority can't solve such multithreads workload issue.

In other word, prev_priority concept assume the sysmtem don't have lots
threads.

And, I don't think lumpy reclaim threshold is big matter, because It was
introduced to case aim7 corner case issue. I don't think such situation
will occur frequently in the real workload. thus end users can't observe
such logic.

For mapped-vs-unmapped thing, I dunnno the exactly reason. That was
introduced by Rik, unfortunatelly I had not joined its activity at 
making design time. I can only say, while my testing the current code 
works good.

That said, my conclusion is opposite. For long term view, we should
consider to kill reclaim priority completely. Instead, we should

Sure. thanks.
Will fix.


btw, current zone reclaim have wrong swap token usage.

	static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
	{
	(snip)
	       ...
From: Mel Gorman
Date: Monday, June 28, 2010 - 3:35 am

My fault, I unintentionally deleted your name from the send script.

The intention of the code was to note that orders < PAGE_ALLOC_COSTLY_ORDER,
there was an expectatation that those pages would be free or nearly free
without page reclaim taking special steps with lumpy reclaim. If this has
changed, it's almost certainly because of a greater dependence on high-order
pages than previously which should be resisted (it has cropped up a few
times recently). I do have a script that uses ftrace to count call sites
using high-order allocations and how often they occur which would be of use

I'm not aware of current problems with lumpy reclaim related stalls or
problems but it's not something I have specifically investigated. If
there is a known example workload that is felt to trigger lumpy reclaim
more than it should, someone point me in the general direction and I'll



-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Monday, June 14, 2010 - 4:17 am

This patch adds a simple post-processing script for the reclaim-related
trace events.  It can be used to give an indication of how much traffic
there is on the LRU lists and how severe latencies due to reclaim are.
Example output looks like the following

Reclaim latencies expressed as order-latency_in_ms
uname-3942             9-200.179000000004 9-98.7900000000373 9-99.8330000001006
kswapd0-311            0-662.097999999998 0-2.79700000002049 \
	0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \
	0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \
	0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \
	0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \
	0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \
	9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \
	9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \
	9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \
	9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \
	9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \
	9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \
	9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \
	9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \
	9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \
	9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \
	9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \
	0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \
	0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \
	1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \
	9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \
	9-1.54499999992549 9-771.985000000335 9-4.88700000010431 ...
From: Rik van Riel
Date: Monday, June 14, 2010 - 10:55 am

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
--

From: Larry Woodman
Date: Monday, June 14, 2010 - 2:03 pm

Acked-by: Larry Woodman <lwoodman@redhat.com>


--

From: Christoph Hellwig
Date: Monday, June 14, 2010 - 8:10 am

This stuff looks good to me from the filesystem POV.

You might want to throw in a follow on patch to remove the PF_MEMALLOC
checks from the various ->writepage methods.

--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 4:45 am

Will do, thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KAMEZAWA Hiroyuki
Date: Monday, June 14, 2010 - 5:08 pm

On Mon, 14 Jun 2010 12:17:41 +0100


Execution time is reduced. Does this shows removing "I/O noise" by direct
reclaim makes the system happy ? or writeback in direct reclaim give
us too much costs ?

It seems I'll have to consider about avoiding direct-reciam in memcg, later.

BTW, I think we'll have to add wait-for-pages-to-be-cleaned trick in
direct reclaim if we want to avoid too much scanning, later.


Thank you for interesting test.

Regards,
-Kame

--

From: Mel Gorman
Date: Tuesday, June 15, 2010 - 4:49 am

It could be accounted for by the fact that the direct reclaimers are
stalled less in direct reclaim. They make more forward progress needing

I think it's accounted for by just making more forward progress rather than
IO noise. The throughput results for sysbench are all over the place because
the disk was maxed so I'm shying away from drawing any conclusions on the

This happens for lumpy reclaim. I didn't think it was justified for
normal reclaim based on the percentage of dirty pages encountered during
scanning. If the percentage of dirty pages scanned, we'll need to first
figure out why that happened and then if stalling when they are

You're welcome.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

Previous thread: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim by Mel Gorman on Monday, June 14, 2010 - 4:17 am. (22 messages)

Next thread: [PATCH] x86: add power_end event to process_*.c cpu_idle routine by Robert Schöne on Monday, June 14, 2010 - 4:37 am. (1 message)