Re: Free memory never fully used, swapping

From: Mel Gorman
Date: Thursday, November 25, 2010 - 9:12 am

On Thu, Nov 25, 2010 at 01:03:28AM -0800, Simon Kirby wrote:

Ok. A consequence of this is that kswapd balancing a node will still try
to balance Normal even if DMA32 has enough memory. This could account
for some of kswapd being mean.


It's possible to reduce the maximum order that SLUB uses but let's not
resort to that as a workaround just yet. In case it needs to be
eliminated as a source of problems later, the relevant kernel parameter
is slub_max_order=.


kswapd is not woken up because we stay in the allocator fastpath once
that much memory has been freed.


Allocator slowpath.


Watermarks are probably not met though.


Technically it could, but watermark maintenance is important.


Yep.


It's probably fighting to keep *all* zones happy even though it's not strictly
necessary. I suspect it's fighting the most for Normal.


It's not required. The logic for kswapd is "balance all zones" and
Normal is one of the zones. Even though you know that DMA32 is just
fine, kswapd doesn't.


Watermarks. The steady stream of order-3 allocations is telling the
allocator and kswapd that pages of this size must be available. It doesn't
know that SLUB can happily fall back to smaller pages because that
information is lost. Even removing __GFP_WAIT won't help because kswapd
still gets woken up for atomic allocation requests.
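
To illustrate why a zone with plenty of free memory can still fail an
order-3 watermark check, here is a simplified userspace model of the
per-order check. It is only a sketch: struct zone_model,
watermark_ok_model() and MODEL_MAX_ORDER are made-up names, and the real
zone_watermark_ok() also accounts for lowmem reserves and allocation
flags, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ORDER 11	/* assumption: number of orders tracked */

/* Hypothetical stand-in for struct zone: free block counts per order. */
struct zone_model {
	unsigned long nr_free[MODEL_MAX_ORDER];
};

/* Total free pages across all orders. */
static long total_free_pages(const struct zone_model *z)
{
	long pages = 0;

	for (int o = 0; o < MODEL_MAX_ORDER; o++)
		pages += (long)(z->nr_free[o] << o);
	return pages;
}

/*
 * Simplified check: blocks smaller than the requested order cannot
 * satisfy the request, so they are discounted one order at a time
 * while the required minimum is halved at each step.
 */
static bool watermark_ok_model(const struct zone_model *z, int order,
			       long mark)
{
	long min = mark;
	long free_pages;

	assert(order >= 0 && order < MODEL_MAX_ORDER);
	free_pages = total_free_pages(z) - (1L << order) + 1;

	if (free_pages <= min)
		return false;
	for (int o = 0; o < order; o++) {
		free_pages -= (long)(z->nr_free[o] << o);
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}
```

With 10000 free order-0 pages, four free order-3 blocks and a mark of
1024 pages, this model passes the order-0 check but fails the order-3
check, even though almost all of the zone is free.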


Ok, this is true. kswapd in balance_pgdat() has given up on the order
but that information is lost when sleeping_prematurely() is called so it
constantly loops. That is a mistake. balance_pgdat() could return the order
so sleeping_prematurely() doesn't do the wrong thing.
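
As a rough illustration of that idea, the sketch below models the
kswapd loop in userspace. All names (struct node_model, balance_model(),
premature_model() and the two *_can_sleep() helpers) are hypothetical;
the only point is that the sleep check should use the order balancing
finished at, not the order it started with.

```c
#include <stdbool.h>

/* Hypothetical node state: highest order whose watermark is met. */
struct node_model {
	int highest_ok_order;
};

/*
 * Model of balancing: if the requested order cannot be satisfied,
 * give up on it and fall back to order-0. Returns the order that
 * was being reclaimed for at the end.
 */
static int balance_model(const struct node_model *node, int order)
{
	if (order > node->highest_ok_order)
		return 0;	/* gave up on the high order */
	return order;
}

/* Model of the sleep check: premature if 'order' is not yet met. */
static bool premature_model(const struct node_model *node, int order)
{
	return order > node->highest_ok_order;
}

/* Old behaviour: the check is made at the original order. */
static bool old_kswapd_can_sleep(const struct node_model *node, int order)
{
	(void)balance_model(node, order);
	return !premature_model(node, order);
}

/* Proposed behaviour: the check uses the order balancing ended at. */
static bool new_kswapd_can_sleep(const struct node_model *node, int order)
{
	order = balance_model(node, order);
	return !premature_model(node, order);
}
```

For a node where only order-0 is satisfiable and a wakeup at order-3,
the old variant never reports that it is safe to sleep, while the new
variant does once balancing has dropped to order-0.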


SLUB can be forced to use smaller orders but I don't think that's the
right fix here.


Yes, but we'd see more high-order atomic allocation (e.g. jumbo frames)
failures as a result so that fix would cause other regressions.


So, the key here is that kswapd didn't need to balance all zones; any one of
them would have been fine.


It's not because sleeping_prematurely() interferes with it.


It doesn't.


I think there are at least two fixes required here.

1. sleeping_prematurely() must be aware that balance_pgdat() has dropped
   the order.
2. kswapd is trying to balance all zones for higher orders even though
   it doesn't really have to.
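
The second point can be sketched in isolation as a small userspace
model. premature_by_zones() and the zone_ok[] array are hypothetical
stand-ins for walking the node's zones and checking each against its
high watermark; details such as skipping unpopulated zones are omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the sleep decision. zone_ok[i] says whether
 * zone i meets its high watermark at the given order. Returns true
 * if going to sleep now would be premature.
 */
static bool premature_by_zones(const bool *zone_ok, size_t nr_zones,
			       int order)
{
	bool all_zones_ok = true;
	bool any_zone_ok = false;

	for (size_t i = 0; i < nr_zones; i++) {
		if (zone_ok[i])
			any_zone_ok = true;
		else
			all_zones_ok = false;
	}

	/* High-order: one balanced zone is enough. Order-0: all must be. */
	return order ? !any_zone_ok : !all_zones_ok;
}
```

With DMA32 balanced and Normal not, this model lets kswapd sleep after
an order-3 wakeup but keeps it awake for order-0.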

This patch has potential fixes for both of these problems. I have a split-out
series but I'm posting it as a single patch to see if it allows kswapd to
go to sleep as expected for you and whether it stops hammering the Normal
zone unnecessarily. I tested it locally here (albeit with compaction
enabled) and it did reduce the amount of time kswapd spent awake.

==== CUT HERE ====
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..25fe08d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -645,6 +645,7 @@ typedef struct pglist_data {
 	wait_queue_head_t kswapd_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
+	enum zone_type high_zoneidx;
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
@@ -660,7 +661,7 @@ typedef struct pglist_data {
 
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(void *data);
-void wakeup_kswapd(struct zone *zone, int order);
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..344b597 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1921,7 +1921,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
 	struct zone *zone;
 
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
-		wakeup_kswapd(zone, order);
+		wakeup_kswapd(zone, order, high_zoneidx);
 }
 
 static inline int
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..00529a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2118,15 +2118,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 #endif
 
 /* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 {
 	int i;
+	bool all_zones_ok = true;
+	bool any_zone_ok = false;
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return 1;
 
-	/* If after HZ/10, a zone is below the high mark, it's premature */
+	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -2138,10 +2140,20 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 
 		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
 								0, 0))
-			return 1;
+			all_zones_ok = false;
+		else
+			any_zone_ok = true;
 	}
 
-	return 0;
+	/*
+	 * For high-order requests, any zone meeting the watermark is enough
+	 *   to allow kswapd go back to sleep
+	 * For order-0, all zones must be balanced
+	 */
+	if (order)
+		return !any_zone_ok;
+	else
+		return !all_zones_ok;
 }
 
 /*
@@ -2168,6 +2180,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 {
 	int all_zones_ok;
+	int any_zone_ok;
 	int priority;
 	int i;
 	unsigned long total_scanned;
@@ -2201,6 +2214,7 @@ loop_again:
 			disable_swap_token();
 
 		all_zones_ok = 1;
+		any_zone_ok = 0;
 
 		/*
 		 * Scan in the highmem->dma direction for the highest
@@ -2310,10 +2324,12 @@ loop_again:
 				 * spectulatively avoid congestion waits
 				 */
 				zone_clear_flag(zone, ZONE_CONGESTED);
+				if (i <= pgdat->high_zoneidx)
+					any_zone_ok = 1;
 			}
 
 		}
-		if (all_zones_ok)
+		if (all_zones_ok || (order && any_zone_ok))
 			break;		/* kswapd: all done */
 		/*
 		 * OK, kswapd is getting into trouble.  Take a nap, then take
@@ -2336,7 +2352,7 @@ loop_again:
 			break;
 	}
 out:
-	if (!all_zones_ok) {
+	if (!(all_zones_ok || (order && any_zone_ok))) {
 		cond_resched();
 
 		try_to_freeze();
@@ -2361,7 +2377,13 @@ out:
 		goto loop_again;
 	}
 
-	return sc.nr_reclaimed;
+	/*
+	 * Return the order we were reclaiming at so sleeping_prematurely()
+	 * makes a decision on the order we were last reclaiming at. However,
+	 * if another caller entered the allocator slow path while kswapd
+	 * was awake, order will remain at the higher level
+	 */
+	return order;
 }
 
 /*
@@ -2417,6 +2439,7 @@ static int kswapd(void *p)
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 		new_order = pgdat->kswapd_max_order;
 		pgdat->kswapd_max_order = 0;
+		pgdat->high_zoneidx = MAX_NR_ZONES - 1;
 		if (order < new_order) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
@@ -2464,7 +2487,7 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balance_pgdat(pgdat, order);
+			order = balance_pgdat(pgdat, order);
 		}
 	}
 	return 0;
@@ -2473,7 +2496,7 @@ static int kswapd(void *p)
 /*
  * A zone is low on free memory, so wake its kswapd task to service it.
  */
-void wakeup_kswapd(struct zone *zone, int order)
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx)
 {
 	pg_data_t *pgdat;
 
@@ -2483,8 +2506,10 @@ void wakeup_kswapd(struct zone *zone, int order)
 	pgdat = zone->zone_pgdat;
 	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
 		return;
-	if (pgdat->kswapd_max_order < order)
+	if (pgdat->kswapd_max_order < order) {
 		pgdat->kswapd_max_order = order;
+		pgdat->high_zoneidx = min(pgdat->high_zoneidx, high_zoneidx);
+	}
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
--