This patch is the core of a mechanism which compacts memory in a zone by relocating movable pages towards the end of the zone. A single compaction run involves a migration scanner and a free scanner. Both scanners operate on pageblock-sized areas in the zone. The migration scanner starts at the bottom of the zone and searches for all movable pages within each area, isolating them onto a private list called migratelist. The free scanner starts at the top of the zone and searches for suitable areas and consumes the free pages within making them available for the migration scanner. The pages isolated for migration are then migrated to the newly isolated free pages. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> --- include/linux/compaction.h | 8 + include/linux/mm.h | 1 + include/linux/swap.h | 6 + include/linux/vmstat.h | 1 + mm/Makefile | 1 + mm/compaction.c | 348 ++++++++++++++++++++++++++++++++++++++++++++ mm/page_alloc.c | 39 +++++ mm/vmscan.c | 5 - mm/vmstat.c | 5 + 9 files changed, 409 insertions(+), 5 deletions(-) create mode 100644 include/linux/compaction.h create mode 100644 mm/compaction.c diff --git a/include/linux/compaction.h b/include/linux/compaction.h new file mode 100644 index 0000000..6201371 --- /dev/null +++ b/include/linux/compaction.h @@ -0,0 +1,8 @@ +#ifndef _LINUX_COMPACTION_H +#define _LINUX_COMPACTION_H + +/* Return values for compact_zone() */ +#define COMPACT_INCOMPLETE 0 +#define COMPACT_COMPLETE 1 + +#endif /* _LINUX_COMPACTION_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index f3b473a..f920815 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -335,6 +335,7 @@ void put_page(struct page *page); void put_pages_list(struct list_head *pages); void split_page(struct page *page, unsigned int order); +int ...
Put the above in a separate patch? --
I can if you prefer but it's so small, I didn't think it obscured the clarity of the patch anyway. I would have somewhat expected the two patches to be merged together before going upstream. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
It exposes the definitions needed to run __isolate_lru_page(). The definitions could be useful for other uses of page migrations. --
Sure. Patch split out now and looks like === CUT HERE === Subject: [PATCH 07/12] Move definition for LRU isolation modes to a header Currently, vmscan.c defines the isolation modes for __isolate_lru_page(). Memory compaction needs access to these modes for isolating pages for migration. This patch exports them. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/linux/swap.h | 5 +++++ mm/vmscan.c | 5 ----- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 1f59d93..986b12d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page) __lru_cache_add(page, LRU_ACTIVE_FILE); } +/* LRU Isolation modes. */ +#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */ +#define ISOLATE_ACTIVE 1 /* Isolate active pages. */ +#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */ + /* linux/mm/vmscan.c */ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); diff --git a/mm/vmscan.c b/mm/vmscan.c index 79c8098..ef89600 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -839,11 +839,6 @@ keep: return nr_reclaimed; } -/* LRU Isolation modes. */ -#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */ -#define ISOLATE_ACTIVE 1 /* Isolate active pages. */ -#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */ - /* * Attempt to remove the specified page from its LRU. Only take this page * if it is of the appropriate PageActive status. Pages which are being -- 1.6.5 --
Acked-by: Christoph Lameter <cl@linux-foundation.org> --
On Tue, 23 Mar 2010 12:25:42 +0000 I think lru_add_drain() or lru_add_drain_all() should be called somewhere when we do __isolate_lru_page(). But it's (_all is) slow.... But, --
On Wed, Mar 24, 2010 at 10:03 AM, KAMEZAWA Hiroyuki migrate_prep does it. -- Kind regards, Minchan Kim --
On Wed, 24 Mar 2010 10:47:41 +0900 Thanks. Hmm...then, lru_add_drain_all() is called at each (32page migrate) itelation. Isn't it too slow to be called in such frequency ? Thanks, -Kame --
On Wed, Mar 24, 2010 at 10:53 AM, KAMEZAWA Hiroyuki -- Kind regards, Minchan Kim --
Indeed we can. It's moved now. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
On Tue, 23 Mar 2010 12:25:42 +0000 General comment: it looks like there are some codepaths which could hold zone->lock for a long time. It's unclear that they're all It's a bit strange to test this when we're about to oops anyway. The --
On Wed, 24 Mar 2010 13:33:47 -0700 ...except that we've seen a fair number of null pointer dereference exploits that have told us something altogether different. Are we *sure* we don't want to test for null pointers...? jon --
On Wed, 24 Mar 2010 14:59:46 -0600 It's hard to see what the test gains us really - the kernel has zillions of pointer derefs, any of which could be NULL if we have a bug. Are we more likely to have a bug here than elsewhere? This one will oops on a plain old read, so it's a bit moot in this case. --
If the object pointed to is larger than page size and we are referencing a member with an offset larger than page size later then we may create an exploit without checks. But the structure here is certainly smaller than that. So no issue here. --
Hi Jonathan, Examples? Maybe WARN_ON != oops, but VM_BUG_ON still an oops that is and without serial console it would go lost too. I personally don't see how it's needed. Plus those things are mostly for debug to check for invariant condition, how long it takes to sort it out isn't very relevant. So I'm on Andrew camp ;). --
On Wed, 24 Mar 2010 22:19:24 +0100 I don't quite understand the question; are you asking for examples of exploits? http://lwn.net/Articles/347006/ http://lwn.net/Articles/360328/ http://lwn.net/Articles/342330/ ... As to whether this particular test makes sense, I don't know. But the idea that we never need to test about-to-be-dereferenced pointers for NULL does worry me a bit. jon --
As far as I can tell, VM_BUG_ON would make _zero_ differences there.
I think you mistaken a VM_BUG_ON for a:
if (could_be_null->something) {
WARN_ON(1);
return -ESOMETHING;
}
adding a VM_BUG_ON(inode->something) would _still_ be as exploitable
as the null pointer deference, because it's a DoS. It's not really a
big deal of an exploit but it _sure_ need fixing.
The whole point is that VM_BUG_ON(!something) before something->else
won't move the needle as far as your null pointer deference exploits
Being worried is good idea, as we don't want DoS bugs ;). It's just
that VM_BUG_ON isn't a solution to the problem (and the really
important thing, it's not improving its detectability either), fixing
the actual bug is the solution.
--
On Wed, 24 Mar 2010 22:47:42 +0100 Ah, but that's the point: these NULL pointer dereferences were not DoS vulnerabilities - they were full privilege-escalation affairs. Since then, some problems have been fixed and some distributors have started shipping smarter configurations. But, on quite a few systems a NULL dereference still has the potential to be fully exploitable; if there's a possibility of it happening I think we should test for it. A DoS is a much better outcome... jon --
You're pointing the finger at lack of VM_BUG_ON but the finger should be pointed in the code that shall enforce mmap_min_addr. That is the exploitable bug. I can't imagine any other ways VM_BUG_ON could help in preventing an exploit. Let's concentrate on mmap_min_addr and leave the code fast. If it's a small structure (<4096 bytes) we're talking about, I stand that VM_BUG_ON() is just pure CPU overhead. I do agree however for structures that may grow larger than 4096 bytes VM_BUG_ON isn't bad idea, and furthermore I think it's wrong to keep the min address at only 4096 bytes, it shall be like 100M or something. Then all of them can go away. That is way more effective than having to remember to add VM_BUG_ON(!null) when cpu can do it zero cost. --
Ooops, I wrote ->something to indicate that "could_be_null" was going to later be dereferenced for ->something and here we're checking if it could be null when we dereference something, but now I think it could be very confusing as I use strict C for all the rest, so maybe I here the same !inode. --
I don't think so. There are two points where zone-related locks are held. zone->lock is held in isolate_freepages() while it gets the free pages necessary for migration to complete. The size of the list of pages being migrated is constrained by COMPACT_CLUSTER_MAX so it is bounded by that. Worst case scenario is the zone is almost fully scanned. zone->lru_lock is held in isolate_migratepages) while it gets pages for migration. It's released if COMPACT_CLUSTER_MAX pages are isolated. Again, worst case scenario is that the zone is almost fully scanned. The worst-case scenario in both cases is the lock is held while the zone is scanned. The concern would be if we managed to scan almost a full zone and that zone is very large. I could add an additional check to release the lock when a large number of pages has been scanned but I don't think it's necessary. I find it very unlikely that a large zone It was paranoia after the bugs related to NULL-offsets but unnecessary paranoia in this case. It would require migration to be very broken for it to trigger. Even if it was, I cannot imagine a case where it would be exploited because it's a small structure and not offset by any userspace-supplied -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
