Simon Kirby reported the following problem:

  We're seeing cases on a number of servers where cache never fully grows to use all available memory. Sometimes we see servers with 4 GB of memory that never seem to have less than 1.5 GB free, even with a constantly-active VM. In some cases, these servers also swap out while this happens, even though they are constantly reading the working set into memory. We have been seeing this happening for a long time; I don't think it's anything recent, and it still happens on 2.6.36.

After some debugging work by Simon, Dave Hansen and others, the prevailing theory became that kswapd is reclaiming order-3 pages requested by SLUB and is too aggressive about it.

There are two apparent problems here. On the target machine, there is a small Normal zone in comparison to DMA32. As kswapd tries to balance all zones, it would continually try reclaiming for Normal even though DMA32 was balanced enough for callers. The second problem is that sleeping_prematurely() uses the requested order, not the order kswapd finally reclaimed at. This keeps kswapd artificially awake.

This series aims to alleviate these problems but needs testing to confirm it alleviates the actual problem and wider review to think if there is a better alternative approach. Local tests passed but unfortunately do not reproduce the same problem, so the results are inconclusive.

include/linux/mmzone.h | 3 +-
mm/page_alloc.c | 2 +-
mm/vmscan.c | 90 ++++++++++++++++++++++++++++++++++++++++-------
3 files changed, 79 insertions(+), 16 deletions(-)
--
When reclaiming for high orders, kswapd is responsible for balancing a
node but it should not reclaim excessively. It avoids excessive reclaim
by treating the node as balanced if any zone in the node is balanced.
In cases where zone sizes are imbalanced (e.g. ZONE_DMA alongside both
ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep prematurely because
just one small zone was balanced.

This alters the sleep logic of kswapd slightly. It counts the number of pages
that make up the balanced zones. If the total number of balanced pages is
more than a quarter of the node, kswapd will go back to sleep. This should
keep a node balanced without reclaiming an excessive number of pages.
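
The counting rule described above can be sketched in userspace C. This is only an illustration, not the kernel code: struct zone_sketch, balanced_pages() and node_balanced() are hypothetical names, and it assumes each balanced zone contributes its full present page count to the total, which the truncated diff below does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_sketch {
	unsigned long present_pages;	/* pages in this zone */
	bool watermark_ok;		/* zone meets its high watermark */
};

/* Sum the pages of every zone that currently meets its watermark. */
static unsigned long balanced_pages(const struct zone_sketch *zones,
				    size_t nr_zones)
{
	unsigned long balanced = 0;
	size_t i;

	assert(zones != NULL || nr_zones == 0);
	for (i = 0; i < nr_zones; i++)
		if (zones[i].watermark_ok)
			balanced += zones[i].present_pages;
	return balanced;
}

/* Same comparison as pgdat_balanced(): balanced > node pages / 4. */
static bool node_balanced(const struct zone_sketch *zones, size_t nr_zones)
{
	unsigned long node_present_pages = 0;
	size_t i;

	for (i = 0; i < nr_zones; i++)
		node_present_pages += zones[i].present_pages;
	return balanced_pages(zones, nr_zones) > node_present_pages / 4;
}
```

With a 4096-page DMA zone, a 750000-page DMA32 zone and a 250000-page Normal zone, a balanced DMA zone alone is far below the quarter threshold, so under this rule the node would not count as balanced and kswapd would stay awake.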
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 30 ++++++++++++++++++++++--------
1 files changed, 22 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9891efd..77c511f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2117,12 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
}
#endif
+/*
+ * pgdat_balanced is used when checking if a node is balanced for high-order
+ * allocations. Only zones that meet watermarks make up "balanced".
+ * The total of balanced pages must be at least 25% of the node for the
+ * node to be considered balanced. Forcing all zones to be balanced for high
+ * orders can cause excessive reclaim when there are imbalanced zones.
+ * Similarly, we do not want kswapd to go to sleep because ZONE_DMA happens
+ * to be balanced when ZONE_DMA32 is huge in comparison and unbalanced
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced)
+{
+ return balanced > pgdat->node_present_pages / 4;
+}
+
/* is kswapd sleeping prematurely? */
static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
{
int i;
+ unsigned long balanced = 0;
bool all_zones_ok = true;
- bool any_zone_ok = false;
/* If a direct reclaimer woke kswapd within HZ/10, it's ...
--
Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
there was a race pushing a zone below its watermark. If the race
happened, it stays awake. However, balance_pgdat() can decide to reclaim
at a lower order if it decides that high-order reclaim is not working as
expected. This information is not passed back to sleeping_prematurely().
The impact is that kswapd remains awake reclaiming pages long after it
should have gone to sleep. This patch passes the adjusted order to
sleeping_prematurely and uses the same logic as balance_pgdat to decide
if it's ok to go to sleep.
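
As a rough illustration of the decision this patch describes, the following userspace sketch follows the rule stated in the patch's own comment: for high-order requests one balanced zone is enough, for order-0 every zone must be balanced. The function name and the zone_ok array are hypothetical. The array stands in for the per-zone zone_watermark_ok() results at the order kswapd actually reclaimed at, and the final return statements are an assumption, since the diff is truncated at that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch: zone_ok[i] says whether zone i meets its high
 * watermark at the (possibly lowered) order kswapd reclaimed at.
 * Returns true if kswapd would be going to sleep too early.
 */
static bool sleeping_prematurely_sketch(const bool *zone_ok, size_t nr_zones,
					int order, long remaining)
{
	bool all_zones_ok = true;
	bool any_zone_ok = false;
	size_t i;

	assert(zone_ok != NULL || nr_zones == 0);

	/* Woken early by a direct reclaimer: sleeping is premature. */
	if (remaining)
		return true;

	for (i = 0; i < nr_zones; i++) {
		if (!zone_ok[i])
			all_zones_ok = false;
		else
			any_zone_ok = true;
	}

	/* High order: one balanced zone is enough. Order-0: all zones. */
	if (order)
		return !any_zone_ok;
	return !all_zones_ok;
}
```

The point of passing the adjusted order is visible here: if balance_pgdat() fell back to order-0 but the check still ran at the originally requested order, the zone_ok inputs would be computed against a stricter watermark and kswapd could stay awake without need.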
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 30 ++++++++++++++++++++++++------
1 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 67e4283..9891efd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2118,15 +2118,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
#endif
/* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
{
int i;
+ bool all_zones_ok = true;
+ bool any_zone_ok = false;
/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
return 1;
- /* If after HZ/10, a zone is below the high mark, it's premature */
+ /* Check the watermark levels */
for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;
@@ -2138,10 +2140,20 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
0, 0))
- return 1;
+ all_zones_ok = false;
+ else
+ any_zone_ok = true;
}
- return 0;
+ /*
+ * For high-order requests, any zone meeting the watermark is enough
+ * to allow kswapd go back to sleep
+ * For order-0, all zones must be balanced
+ */
+ if (order)
return ...
--
When the allocator enters its slow path, kswapd is woken up to balance the
node. It continues working until all zones within the node are balanced. For
order-0 allocations, this makes perfect sense but for higher orders it can
have unintended side-effects. If the zone sizes are imbalanced, kswapd
may reclaim heavily on a smaller zone discarding an excessive number of
pages. The user-visible behaviour is that kswapd is awake and reclaiming
even though plenty of pages are free from a suitable zone.
This patch alters the "balance" logic to stop kswapd if any suitable zone
becomes balanced to reduce the number of pages it reclaims from other zones.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 2 +-
mm/vmscan.c | 48 +++++++++++++++++++++++++++++++++++++++---------
3 files changed, 42 insertions(+), 11 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..25fe08d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -645,6 +645,7 @@ typedef struct pglist_data {
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
int kswapd_max_order;
+ enum zone_type high_zoneidx;
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -660,7 +661,7 @@ typedef struct pglist_data {
extern struct mutex zonelists_mutex;
void build_all_zonelists(void *data);
-void wakeup_kswapd(struct zone *zone, int order);
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx);
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
int classzone_idx, int alloc_flags);
enum memmap_context {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..344b597 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1921,7 +1921,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
struct zone *zone;
for_each_zone_zonelist(zone, z, zonelist, ...
--
From my understanding, the patch will stop reclaiming the high zone if a low zone meets the high-order allocation, even if the high zone doesn't meet it. This will, for example, make a high-order allocation from a high zone fall back to a low zone and quickly exhaust the low zone, for example DMA. This will break some drivers. --
Have you seen patch [3/3]? I think it mitigates the issue you pointed out. --
Yes, it improves things a lot, but the problem is still possible on small systems. --
OK, I got you. So please define what you mean by "small systems". We obviously can't make perfect VM heuristics, so we need to compare pros and cons. Of course, I'd be glad if you have a better idea and can show it. --
If you don't care about small systems, let's consider a normal i386 system with an 896M normal zone and an 896M*3 high zone. The normal zone will quickly be exhausted by high-order high zone allocations, leaving a later allocation which does need the normal zone to fail. --
That doesn't happen. Slab doesn't allocate from highmem, and page cache allocation always uses order-0. When do high-order high zone allocations happen? --
IIRC, ARM supports highmem. But you are right, slub doesn't allocate from ...
OK, thanks, I missed this. Then how about an x86_64 box with 896M DMA32 and 896*3M NORMAL? Some PCI devices can only DMA to the DMA32 zone. --
First, DMA32 is 4GB. Second, modern high-end systems don't use 32-bit PCI devices. Third, when we think about desktop users, 4GB is not a small amount; nowadays, a typical desktop has only 2GB or 4GB of memory. In other words, I agree the issue you pointed out exists _potentially_, but I don't think it is more frequent than Simon's case. When deciding heuristics, we can't avoid thinking about how frequent an issue is. It's very important. Of course, if you have a better idea, I won't oppose it. --
DMA32 isn't 4G, because there is a hole under 4G for PCI BARs. I don't think 32-bit PCI devices are rare either. But anyway, if you insist this isn't a big issue, I'm OK with it. --
Indeed this is possible and it's a situation confirmed by Simon. Patch 3 should cover it because replacing "are any zones ok?" with "are zones ...

The lowmem reserve would prevent that happening so the drivers would be fine. The real impact is that kswapd would stop when DMA was balanced even though it was really DMA32 or Normal that needed to be balanced for proper behaviour.

On lowmem reserves though, there is another buglet in sleeping_prematurely. The classzone_idx it uses means that the wrong lowmem_reserve is used for the majority of allocation requests.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
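
To make the lowmem reserve argument more concrete, here is a minimal userspace sketch of an order-0 watermark check that takes the reserve into account. It is a simplification and not the kernel's zone_watermark_ok(): struct lowmem_zone_sketch, watermark_ok_order0() and the three-zone layout are hypothetical, and the high-order part of the real check is left out. The assumption is that classzone_idx names the highest zone the allocation was allowed to use, so a request that could have been served from Normal sees a larger reserve when it falls back to DMA.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NR_ZONES_SKETCH 3	/* hypothetical layout: DMA, DMA32, Normal */

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct lowmem_zone_sketch {
	long free_pages;
	/* pages held back from allocations that could use zone [idx] */
	long lowmem_reserve[MAX_NR_ZONES_SKETCH];
};

/*
 * Order-0 check only: the zone passes if its free pages exceed the
 * watermark plus the reserve that applies to the caller's classzone.
 */
static bool watermark_ok_order0(const struct lowmem_zone_sketch *z,
				long mark, int classzone_idx)
{
	assert(classzone_idx >= 0 && classzone_idx < MAX_NR_ZONES_SKETCH);
	return z->free_pages > mark + z->lowmem_reserve[classzone_idx];
}
```

In this sketch a DMA zone with 1000 free pages and reserves of 0, 500 and 2000 pages passes the check for a DMA-only request but fails it for a request that could have used Normal, which is how the reserve keeps fallback allocations from draining the low zone. It also shows why passing the wrong classzone_idx matters: the comparison is then made against the wrong reserve.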
