Re: [PATCH 07/11] Memory compaction core

Previous thread: [PATCH 10/11] Direct compact when a high-order allocation fails by Mel Gorman on Tuesday, March 23, 2010 - 5:25 am. (19 messages)

Next thread: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure by Mel Gorman on Tuesday, March 23, 2010 - 5:25 am. (8 messages)
From: Mel Gorman
Date: Tuesday, March 23, 2010 - 5:25 am

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas and consumes the free pages within making them available for the
migration scanner. The pages isolated for migration are then migrated to
the newly isolated free pages.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |    8 +
 include/linux/mm.h         |    1 +
 include/linux/swap.h       |    6 +
 include/linux/vmstat.h     |    1 +
 mm/Makefile                |    1 +
 mm/compaction.c            |  348 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   39 +++++
 mm/vmscan.c                |    5 -
 mm/vmstat.c                |    5 +
 9 files changed, 409 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..6201371
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,8 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE	0
+#define COMPACT_COMPLETE	1
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3b473a..f920815 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -335,6 +335,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
+int ...
From: Christoph Lameter
Date: Tuesday, March 23, 2010 - 10:56 am

Put the above in a separate patch?

--

From: Mel Gorman
Date: Tuesday, March 23, 2010 - 11:15 am

I can if you prefer but it's so small, I didn't think it obscured the
clarity of the patch anyway. I would have somewhat expected the two
patches to be merged together before going upstream.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Tuesday, March 23, 2010 - 11:33 am

It exposes the definitions needed to run __isolate_lru_page(). The
definitions could be useful for other uses of page migrations.
--

From: Mel Gorman
Date: Tuesday, March 23, 2010 - 11:58 am

Sure. Patch split out now and looks like

=== CUT HERE ===
Subject: [PATCH 07/12] Move definition for LRU isolation modes to a header

Currently, vmscan.c defines the isolation modes for
__isolate_lru_page(). Memory compaction needs access to these modes for
isolating pages for migration.  This patch exports them.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/swap.h |    5 +++++
 mm/vmscan.c          |    5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..986b12d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
 	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
+#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..ef89600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -839,11 +839,6 @@ keep:
 	return nr_reclaimed;
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /*
  * Attempt to remove the specified page from its LRU.  Only take this page
  * if it is of the appropriate PageActive status.  Pages which are being
-- 
1.6.5

--

From: Christoph Lameter
Date: Tuesday, March 23, 2010 - 12:20 pm

Acked-by: Christoph Lameter <cl@linux-foundation.org>

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, March 23, 2010 - 6:03 pm

On Tue, 23 Mar 2010 12:25:42 +0000

I think lru_add_drain() or lru_add_drain_all() should be called somewhere
when we do __isolate_lru_page(). But it's (_all is) slow....

But,

--

From: Minchan Kim
Date: Tuesday, March 23, 2010 - 6:47 pm

On Wed, Mar 24, 2010 at 10:03 AM, KAMEZAWA Hiroyuki

migrate_prep does it.

-- 
Kind regards,
Minchan Kim
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, March 23, 2010 - 6:53 pm

On Wed, 24 Mar 2010 10:47:41 +0900
Thanks.

Hmm...then, lru_add_drain_all() is called at each (32page migrate) itelation.
Isn't it too slow to be called in such frequency ?

Thanks,
-Kame




--

From: Minchan Kim
Date: Tuesday, March 23, 2010 - 7:10 pm

On Wed, Mar 24, 2010 at 10:53 AM, KAMEZAWA Hiroyuki




-- 
Kind regards,
Minchan Kim
--

From: Mel Gorman
Date: Wednesday, March 24, 2010 - 3:57 am

Indeed we can. It's moved now.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrew Morton
Date: Wednesday, March 24, 2010 - 1:33 pm

On Tue, 23 Mar 2010 12:25:42 +0000

General comment: it looks like there are some codepaths which could
hold zone->lock for a long time.  It's unclear that they're all

It's a bit strange to test this when we're about to oops anyway.  The

--

From: Jonathan Corbet
Date: Wednesday, March 24, 2010 - 1:59 pm

On Wed, 24 Mar 2010 13:33:47 -0700

...except that we've seen a fair number of null pointer dereference
exploits that have told us something altogether different.  Are we
*sure* we don't want to test for null pointers...?

jon
--

From: Andrew Morton
Date: Wednesday, March 24, 2010 - 2:14 pm

On Wed, 24 Mar 2010 14:59:46 -0600

It's hard to see what the test gains us really - the kernel has
zillions of pointer derefs, any of which could be NULL if we have a
bug.  Are we more likely to have a bug here than elsewhere?

This one will oops on a plain old read, so it's a bit moot in this
case.
--

From: Christoph Lameter
Date: Wednesday, March 24, 2010 - 2:19 pm

If the object pointed to is larger than page size and we are
referencing a member with an offset larger than page size later then we
may create an exploit without checks.

But the structure here is certainly smaller than that. So no issue here.



--

From: Andrea Arcangeli
Date: Wednesday, March 24, 2010 - 2:19 pm

Hi Jonathan,


Examples? Maybe WARN_ON != oops, but VM_BUG_ON still an oops that is
and without serial console it would go lost too. I personally don't
see how it's needed. Plus those things are mostly for debug to check
for invariant condition, how long it takes to sort it out isn't very
relevant. So I'm on Andrew camp ;).
--

From: Jonathan Corbet
Date: Wednesday, March 24, 2010 - 2:28 pm

On Wed, 24 Mar 2010 22:19:24 +0100

I don't quite understand the question; are you asking for examples of
exploits?

	http://lwn.net/Articles/347006/
	http://lwn.net/Articles/360328/
	http://lwn.net/Articles/342330/
	...

As to whether this particular test makes sense, I don't know.  But the
idea that we never need to test about-to-be-dereferenced pointers for
NULL does worry me a bit.

jon
--

From: Andrea Arcangeli
Date: Wednesday, March 24, 2010 - 2:47 pm

As far as I can tell, VM_BUG_ON would make _zero_ differences there.

I think you mistaken a VM_BUG_ON for a:

  if (could_be_null->something) {
     WARN_ON(1);
     return -ESOMETHING;
  }

adding a VM_BUG_ON(inode->something) would _still_ be as exploitable
as the null pointer deference, because it's a DoS. It's not really a
big deal of an exploit but it _sure_ need fixing.

The whole point is that VM_BUG_ON(!something) before something->else
won't move the needle as far as your null pointer deference exploits

Being worried is good idea, as we don't want DoS bugs ;). It's just
that VM_BUG_ON isn't a solution to the problem (and the really
important thing, it's not improving its detectability either), fixing
the actual bug is the solution.
--

From: Jonathan Corbet
Date: Wednesday, March 24, 2010 - 2:54 pm

On Wed, 24 Mar 2010 22:47:42 +0100

Ah, but that's the point: these NULL pointer dereferences were not DoS
vulnerabilities - they were full privilege-escalation affairs.  Since
then, some problems have been fixed and some distributors have started
shipping smarter configurations.  But, on quite a few systems a NULL
dereference still has the potential to be fully exploitable; if there's
a possibility of it happening I think we should test for it.  A DoS is
a much better outcome...

jon
--

From: Andrea Arcangeli
Date: Wednesday, March 24, 2010 - 3:06 pm

You're pointing the finger at lack of VM_BUG_ON but the finger should
be pointed in the code that shall enforce mmap_min_addr. That is the
exploitable bug. I can't imagine any other ways VM_BUG_ON could help
in preventing an exploit. Let's concentrate on mmap_min_addr and leave
the code fast.

If it's a small structure (<4096 bytes) we're talking about, I stand
that VM_BUG_ON() is just pure CPU overhead.

I do agree however for structures that may grow larger than 4096 bytes
VM_BUG_ON isn't bad idea, and furthermore I think it's wrong to keep
the min address at only 4096 bytes, it shall be like 100M or
something. Then all of them can go away. That is way more effective
than having to remember to add VM_BUG_ON(!null) when cpu can do it
zero cost.
--

From: Andrea Arcangeli
Date: Wednesday, March 24, 2010 - 2:57 pm

Ooops, I wrote ->something to indicate that "could_be_null" was going
to later be dereferenced for ->something and here we're checking if it
could be null when we dereference something, but now I think it could
be very confusing as I use strict C for all the rest, so maybe I

here the same !inode.
--

From: Mel Gorman
Date: Thursday, March 25, 2010 - 2:13 am

I don't think so. There are two points where zone-related locks are
held.

zone->lock is held in isolate_freepages() while it gets the free pages
	necessary for migration to complete. The size of the list of pages
	being migrated is constrained by COMPACT_CLUSTER_MAX so it is bounded
	by that. Worst case scenario is the zone is almost fully
	scanned.

zone->lru_lock is held in isolate_migratepages) while it gets pages for
	migration. It's released if COMPACT_CLUSTER_MAX pages are
	isolated. Again, worst case scenario is that the zone is
	almost fully scanned.

The worst-case scenario in both cases is the lock is held while the zone
is scanned. The concern would be if we managed to scan almost a full
zone and that zone is very large. I could add an additional check to
release the lock when a large number of pages has been scanned but I
don't think it's necessary. I find it very unlikely that a large zone

It was paranoia after the bugs related to NULL-offsets but unnecessary
paranoia in this case. It would require migration to be very broken for it to
trigger. Even if it was, I cannot imagine a case where it would be exploited
because it's a small structure and not offset by any userspace-supplied

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

Previous thread: [PATCH 10/11] Direct compact when a high-order allocation fails by Mel Gorman on Tuesday, March 23, 2010 - 5:25 am. (19 messages)

Next thread: [PATCH 11/11] Do not compact within a preferred zone after a compaction failure by Mel Gorman on Tuesday, March 23, 2010 - 5:25 am. (8 messages)