Re: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure

Previous thread: [PATCH 07/11] Memory compaction core by Mel Gorman on Tuesday, March 23, 2010 - 5:25 am. (22 messages)

Next thread: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index by Mel Gorman on Tuesday, March 23, 2010 - 5:25 am. (2 messages)
From: Mel Gorman
Date: Tuesday, March 23, 2010 - 5:25 am

The fragmentation index may indicate that a failure it due to external
fragmentation, a compaction run complete and an allocation failure still
fail. There are two obvious reasons as to why

  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met

In the event of compaction and allocation failure, this patch prevents
compaction happening for a short interval. It's only recorded on the
preferred zone but that should be enough coverage. This could have been
implemented similar to the zonelist_cache but the increased size of the
zonelist did not appear to be justified.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |   35 +++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h     |    7 +++++++
 mm/page_alloc.c            |    5 ++++-
 3 files changed, 46 insertions(+), 1 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b851428..bc7059d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -14,6 +14,32 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* defer_compaction - Do not compact within a zone until a given time */
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+	/*
+	 * This function is called when compaction fails to result in a page
+	 * allocation success. This is somewhat unsatisfactory as the failure
+	 * to compact has nothing to do with time and everything to do with
+	 * the requested order, the number of free pages and watermarks. How
+	 * to wait on that is more unclear, but the answer would apply to
+	 * other areas where the VM waits based on time.
+	 */
+	zone->compact_resume = resume;
+}
+
+static ...
From: Christoph Lameter
Date: Tuesday, March 23, 2010 - 11:31 am

20ms? How was that interval determined?

--

From: Mel Gorman
Date: Tuesday, March 23, 2010 - 11:39 am

I was having some sort of fit when I wrote that obviously. Try this on
for size

The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible  

deferred makes more sense.

What I was thinking at the time was that compact_resume was stored in struct

Matches the time the page allocator would defer to an event like
congestion. The choice is somewhat arbitrary. Ideally, there would be
some sort of event that would re-enable compaction but there wasn't an
obvious candidate so I used time.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Tuesday, March 23, 2010 - 12:27 pm

There are frequent uses of HZ/10 as well especially in vmscna.c. A longer
time may be better?  HZ/50 looks like an interval for writeout. But this
is related to reclaim?


 backing-dev.h    <global>                      283 long congestion_wait(int sync, long timeout);
1 backing-dev.c    <global>                      762 EXPORT_SYMBOL(congestion_wait);
2 usercopy_32.c    __copy_to_user_ll             754 congestion_wait(BLK_RW_ASYNC, HZ/50);
3 pktcdvd.c        pkt_make_request             2557 congestion_wait(BLK_RW_ASYNC, HZ);
4 dm-crypt.c       kcryptd_crypt_write_convert   834 congestion_wait(BLK_RW_ASYNC, HZ/100);
5 file.c           fat_file_release              137 congestion_wait(BLK_RW_ASYNC, HZ/10);
6 journal.c        reiserfs_async_progress_wait  990 congestion_wait(BLK_RW_ASYNC, HZ / 10);
7 kmem.c           kmem_alloc                     61 congestion_wait(BLK_RW_ASYNC, HZ/50);
8 kmem.c           kmem_zone_alloc               117 congestion_wait(BLK_RW_ASYNC, HZ/50);
9 xfs_buf.c        _xfs_buf_lookup_pages         343 congestion_wait(BLK_RW_ASYNC, HZ/50);
a backing-dev.c    congestion_wait               751 long congestion_wait(int sync, long timeout)
b memcontrol.c     mem_cgroup_force_empty       2858 congestion_wait(BLK_RW_ASYNC, HZ/10);
c page-writeback.c throttle_vm_writeout          674 congestion_wait(BLK_RW_ASYNC, HZ/10);
d page_alloc.c     __alloc_pages_high_priority  1753 congestion_wait(BLK_RW_ASYNC, HZ/50);
e page_alloc.c     __alloc_pages_slowpath       1924 congestion_wait(BLK_RW_ASYNC, HZ/50);
f vmscan.c         shrink_inactive_list         1136 congestion_wait(BLK_RW_ASYNC, HZ/10);
g vmscan.c         shrink_inactive_list         1220 congestion_wait(BLK_RW_ASYNC, HZ/10);
h vmscan.c         do_try_to_free_pages         1837 congestion_wait(BLK_RW_ASYNC, HZ/10);
i vmscan.c         balance_pgdat                2161 congestion_wait(BLK_RW_ASYNC, HZ/10);

--

From: Mel Gorman
Date: Wednesday, March 24, 2010 - 3:37 am

In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone for a period of time. The zone that
is deferred is the first zone in the zonelist - i.e. the preferred zone.
To defer compaction in the other zones, the information would need to
be stored in the zonelist or implemented similar to the zonelist_cache.
This would impact the fast-paths and is not justified at this time.


HZ/10 is somewhat of an arbitrary choice as well and there isn't data on
which is better and which is worse. If the zone is full of dirty data, then
HZ/10 makes sense for IO. If it happened to be mainly clean cache but under
heavy memory pressure, then reclaim would be a relatively fast event and a
shorter wait makes sense of HZ/50.

Thing is, if we start with a short timer and it's too short, COMPACTFAIL
will be growing steadily. If we choose a long time and it's too long, there
is no counter to indicate it was a bad choice. Hence, I'd prefer the short
timer to start with and ideally resume compaction after some event in the
future rather than depending on time.


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Wednesday, March 24, 2010 - 12:54 pm

Yes.

--

From: Andrew Morton
Date: Wednesday, March 24, 2010 - 1:53 pm

On Tue, 23 Mar 2010 12:25:46 +0000

um.  "Two wrongs don't make a right".  We should fix the other sites,
not use them as excuses ;)

What _is_ a good measure of "time" in this code?  "number of pages
scanned" is a pretty good one in reclaim.  We want something which will
adapt itself to amount-of-memory, number-of-cpus, speed-of-cpus,
nature-of-workload, etc, etc.

Is it possible to come up with some simple metric which approximately

--

From: Mel Gorman
Date: Thursday, March 25, 2010 - 2:40 am

Heh, one of those sites is currently in dispute. Specifically, the patch
that replaces congestion_wait() with a waitqueue that is woken when
watermarks are reached. I wrote that comment around about the same time
that patch was being developed which is why I found the situation

In this case, a strong possibility is number of pages freed since deferral.
It's not perfect though because heavy memory pressure would mean those
pages are getting allocated again and the compaction is still not going
to succeed. I could use NR_FREE_PAGES to make a guess at how much has
changed since and whether it's worth trying to compact again but even
that is not perfect.

Lets say for example that compaction failed because the zone was mostly slab
pages. If all those were freed and replaced with migratable pages then the
counters would look similar but compaction will now succeed.  I could make
some sort of guess based on number of free, anon and file pages in the zone but
ultimately it would be hard to tell if the heuristic was any better than time.

I think this is only worth worrying about if a workload is found where

I think a simple metric would be based on free anon and file pages but
I think we would need a workload that was hitting compact_fail to devise

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

Previous thread: [PATCH 07/11] Memory compaction core by Mel Gorman on Tuesday, March 23, 2010 - 5:25 am. (22 messages)

Next thread: [PATCH 06/11] Export fragmentation index via /proc/extfrag_index by Mel Gorman on Tuesday, March 23, 2010 - 5:25 am. (2 messages)