This series is based on the patches in mmotm-2010-04-15-14-42.patch but rebased against 2.6.34-rc4 and collapsed. The cumulative fixes were getting tricky to follow and this series should give a clearer view of what is going on. It should be possible to drop all the compaction-related patches in mmotm and drop this series directly in its place. Changelog since mmotm o Have defer_compaction use an exponential backoff instead of time o Rename COMPACT_INCOMPLETE and document the return values Changelog since V7 (these are the changes made while in mmotm) o Have compaction depend on CONFIG_HUGETLB_PAGE instead of CONFIG_HUGETLBFS o Move extfrag_index and unusable_index to debugfs o Many minor cleanups in the core patch o Drop unsafe optimisation on PageBuddy as zone->lock is not held o Call kernel_map_pages after split_free_page o Update documentation on compact_memory proc entry o Initialise cc->zone properly when compacting all of memory o Do not display vmstats related to compaction when disabled Changelog since V6 o Avoid accessing anon_vma when migrating unmapped PageSwapCache pages Changelog since V5 o Rebase to mmotm-2010-03-24-14-48 o Add more reviewed-by's o Correct one spelling in vmstat.c and some leader clarifications o Split the LRU isolation modes into a separate path o Correct a NID change o Call migrate_prep less frequently o Remove unnecessary inlining o Do not interfere with memory hot-remove o Do not compact for orders <= PAGE_ALLOC_COSTLY_ORDER o page_mapped instead of page_mapcount and allow swapcache to migrate o Avoid too many pages being isolated for migration o Handle PageSwapCache pages during migration Changelog since V4 o Remove unnecessary check for PageLRU and PageUnevictable o Fix isolated accounting o Close race window between page_mapcount and rcu_read_lock o Added a lot more Reviewed-by tags Changelog since V3 o Document sysfs entries (subseqently, merged independently) o ...
Currently, vmscan.c defines the isolation modes for __isolate_lru_page(). Memory compaction needs access to these modes for isolating pages for migration. This patch exports them. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Christoph Lameter <cl@linux-foundation.org> --- include/linux/swap.h | 5 +++++ mm/vmscan.c | 5 ----- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 94ec325..32af03c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page) __lru_cache_add(page, LRU_ACTIVE_FILE); } +/* LRU Isolation modes. */ +#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */ +#define ISOLATE_ACTIVE 1 /* Isolate active pages. */ +#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */ + /* linux/mm/vmscan.c */ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); diff --git a/mm/vmscan.c b/mm/vmscan.c index 1070f83..8bdd85c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -839,11 +839,6 @@ keep: return nr_reclaimed; } -/* LRU Isolation modes. */ -#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */ -#define ISOLATE_ACTIVE 1 /* Isolate active pages. */ -#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */ - /* * Attempt to remove the specified page from its LRU. Only take this page * if it is of the appropriate PageActive status. Pages which are being -- 1.6.5 --
The fragmentation index may indicate that a failure is due to external fragmentation but after a compaction run completes, it is still possible for an allocation to fail. There are two obvious reasons as to why o Page migration cannot move all pages so fragmentation remains o A suitable page may exist but watermarks are not met In the event of compaction followed by an allocation failure, this patch defers further compaction in the zone (1 << compact_defer_shift) times. If the next compaction attempt also fails, compact_defer_shift is increased up to a maximum of 6. If compaction succeeds, the defer counters are reset again. The zone that is deferred is the first zone in the zonelist - i.e. the preferred zone. To defer compaction in the other zones, the information would need to be stored in the zonelist or implemented similar to the zonelist_cache. This would impact the fast-paths and is not justified at this time. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/linux/compaction.h | 39 +++++++++++++++++++++++++++++++++++++++ include/linux/mmzone.h | 9 +++++++++ mm/page_alloc.c | 5 ++++- 3 files changed, 52 insertions(+), 1 deletions(-) diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 3719325..5ac5155 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -22,6 +22,36 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write, extern int fragmentation_index(struct zone *zone, unsigned int order); extern unsigned long try_to_compact_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); + +/* Do not skip compaction more than 64 times */ +#define COMPACT_MAX_DEFER_SHIFT 6 + +/* + * Compaction is deferred when compaction fails to result in a page + * allocation success. 1 << compact_defer_limit compactions are skipped up + * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT + */ +static inline void defer_compaction(struct zone ...
The kernel applies some heuristics when deciding if memory should be compacted or reclaimed to satisfy a high-order allocation. One of these is based on the fragmentation. If the index is below 500, memory will not be compacted. This choice is arbitrary and not based on data. To help optimise the system and set a sensible default for this value, this patch adds a sysctl extfrag_threshold. The kernel will only compact memory if the fragmentation index is above the extfrag_threshold. [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured] Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- Documentation/sysctl/vm.txt | 16 +++++++++++++++- include/linux/compaction.h | 3 +++ kernel/sysctl.c | 15 +++++++++++++++ mm/compaction.c | 12 +++++++++++- 4 files changed, 44 insertions(+), 2 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 3b3fa1b..6274970 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/vm: - dirty_ratio - dirty_writeback_centisecs - drop_caches +- extfrag_threshold - hugepages_treat_as_movable - hugetlb_shm_group - laptop_mode @@ -149,6 +149,20 @@ user should run `sync' first. ============================================================== +extfrag_threshold + +This parameter affects whether the kernel will compact memory or direct +reclaim to satisfy a high-order allocation. /proc/extfrag_index shows what +the fragmentation index for each order is in each zone in the system. Values +tending towards 0 imply allocations would fail due to lack of memory, +values towards 1000 imply failures are due to fragmentation and -1 implies +that the allocation will succeed as long as watermarks are met. + +The kernel will not compact memory in a zone if the +fragmentation index is <= extfrag_threshold. The default value is ...
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.
This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
include/linux/rmap.h | 23 +++++++++++++++++++++++
mm/migrate.c | 12 ++++++++++++
mm/rmap.c | 10 +++++-----
3 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..567d43f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -29,6 +29,9 @@ struct anon_vma {
#ifdef CONFIG_KSM
atomic_t ksm_refcount;
#endif
+#ifdef CONFIG_MIGRATION
+ atomic_t migrate_refcount;
+#endif
/*
* NOTE: the LSB of the head.next is set by
* mm_take_all_locks() _after_ taking the above lock. So the
@@ -81,6 +84,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
return 0;
}
#endif /* CONFIG_KSM */
+#ifdef CONFIG_MIGRATION
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+ atomic_set(&anon_vma->migrate_refcount, 0);
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+ return atomic_read(&anon_vma->migrate_refcount);
+}
+#else
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+ return 0;
+}
+#endif /* CONFIG_MIGRATE */
static inline struct anon_vma *page_anon_vma(struct page *page)
{
diff --git a/mm/migrate.c b/mm/migrate.c
index 5938db5..b768a1d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -543,6 +543,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
int rcu_locked = 0;
int charge = ...On Tue, 20 Apr 2010 22:01:03 +0100 Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Hmm. Can't we move this part to mm/rmap.c ? as anon_vma_drop_external_reference() or some. Thanks, -Kame --
Ordinarily when a high-order allocation fails, direct reclaim is entered to free pages to satisfy the allocation. With this patch, it is determined if an allocation failed due to external fragmentation instead of low memory and if so, the calling process will compact until a suitable page is freed. Compaction by moving pages in memory is considerably cheaper than paging out to disk and works where there are locked pages or no swap. If compaction fails to free a page of a suitable size, then reclaim will still occur. Direct compaction returns as soon as possible. As each block is compacted, it is checked if a suitable page has been freed and if so, it returns. [akpm@linux-foundation.org: Fix build errors] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> --- include/linux/compaction.h | 24 ++++++++-- include/linux/vmstat.h | 1 + mm/compaction.c | 117 ++++++++++++++++++++++++++++++++++++++++++++ mm/page_alloc.c | 63 +++++++++++++++++++++++ mm/vmstat.c | 16 +++++- 5 files changed, 215 insertions(+), 6 deletions(-) diff --git a/include/linux/compaction.h b/include/linux/compaction.h index ba98cfe..eed40ec 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -1,15 +1,31 @@ #ifndef _LINUX_COMPACTION_H #define _LINUX_COMPACTION_H -/* Return values for compact_zone() */ -#define COMPACT_CONTINUE 0 -#define COMPACT_PARTIAL 1 -#define COMPACT_COMPLETE 2 +/* Return values for compact_zone() and try_to_compact_pages() */ +/* compaction didn't start as it was not possible or direct reclaim was more suitable */ +#define COMPACT_SKIPPED 0 +/* compaction should continue to another pageblock */ +#define COMPACT_CONTINUE 1 +/* direct compaction partially compacted a zone and there are suitable pages */ +#define COMPACT_PARTIAL 2 +/* The full zone was compacted */ +#define COMPACT_COMPLETE 3 #ifdef ...
==
From: Andrea Arcangeli <aarcange@redhat.com>
Preempt is enabled so it must use count_vm_event.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1768,7 +1768,7 @@ __alloc_pages_direct_compact(gfp_t gfp_m
alloc_flags, preferred_zone,
migratetype);
if (page) {
- __count_vm_event(COMPACTSUCCESS);
+ count_vm_event(COMPACTSUCCESS);
return page;
}
--
Reviewed-by: Mel Gorman <mel@csn.ul.ie> Andrew, this is a fix to the patch mmcompaction-direct-compact-when-a-high-order-allocation-fails.patch -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
You're welcome.
I updated current aa.git origin/master and origin/anon_vma_chain
branches (post THP-23*).
There's also another patch I've in my tree that you didn't picked up
and I wonder what's the issue here. This less a bugfix because it
seems to only affect lockdep, I don't know why lockdep forbids to call
migrate_prep with any lock held (in this case the mmap_sem). migrate.c
is careful to comply with it, compaction.c isn't. It's not mandatory
to succeed for compaction, so in doubt I just commented it out. It'll
also decrease the IPI load so I wasn't very concerned to re-enable it.
-----
Subject: disable migrate_prep()
From: Andrea Arcangeli <aarcange@redhat.com>
I get trouble from lockdep if I leave it enabled:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.34-rc3 #50
-------------------------------------------------------
largepages/4965 is trying to acquire lock:
(events){+.+.+.}, at: [<ffffffff8105b788>] flush_work+0x38/0x130
but task is already holding lock:
(&mm->mmap_sem){++++++}, at: [<ffffffff8141b022>] do_page_fault+0xd2/0x430
flush_work apparently wants to run free from lock and it bugs in:
lock_map_acquire(&cwq->wq->lockdep_map);
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -383,7 +383,9 @@ static int compact_zone(struct zone *zon
cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
cc->free_pfn &= ~(pageblock_nr_pages-1);
+#if 0
migrate_prep();
+#endif
while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
unsigned long nr_migrate, nr_remaining;
--
Simple, I didn't spot it. If you pointed it out to me, I didn't take I haven't seen this problem. The testing I'd have been doing with compaction It's not mandatory but the LRU lists should be drained so they can be properly isolated. It'd make a slight difference to success rates as there will be While true, is compaction density that high under normal workloads? I guess it would be if a scanner was constantly trying to promote pages. If the IPI load is out of hand, I'm ok with disabling in some cases. For example, I'd be ok with it being skipped if it was part of a daemon doing speculative promotion but I'd prefer it to still be used if the static hugetlbfs pool Hmm, I'm not seeing where in the fault path flush_work is getting called from. Can you point it out to me please? We already do some IPI work in the page allocator although it happens after direct reclaim and only for high-order pages. What happens there and what happens in migrate_prep are very similar so if there was a problem with IPI -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
That explains it! But anything can call alloc_pages(order>0) with some Yes success rate will be slightly worse but this also applies to all regular vmscan paths that don't send IPI but they only flush the local queue with lru_add_drain, simply pages won't be freed until there will be some other cpu holding the refcount on them, it is not specific to compaction.c but it applies to vmscan.c and vmscan likely not wanting to send an IPI flood because it could too if it wanted. But I guess I should at least use lru_add_drain() in replacement of Where? I never triggered other issues in the page allocator with lockdep, just this one pops up. --
Or a migrate_prep_local? migrate_prep() was there in case there was extensive
work that needed to be done. At least if the two versions were beside each
other, it would be a bit clearer if migrate_prep was ever modified. It'd
Actually, it's not an IPI in this case is it? As it's schedule_on_each_cpu,
it should be added to a workqueue and executed by keventd at some point in the
Ah, it's the difference between schedule_on_each_cpu that migrate_prep does
and on_each_cpu that the page allocator uses. That's why I haven't seen
it before.
How about the following as an alternative to dropp migrate_prep?
==== CUT HERE ====
mm,compaction: Do not schedule work on other CPUs for compaction
Migration normally requires a call to migrate_prep() as a preparation
step. This schedules work on all CPUs for pagevecs to be drained. This
makes sense for move_pages and memory hot-remove but is unnecessary
for memory compaction.
To avoid queueing work on multiple CPUs, this patch introduces
migrate_prep_local() which drains just local pagevecs.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/migrate.h | 2 ++
mm/compaction.c | 2 +-
mm/migrate.c | 11 ++++++++++-
3 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 05d2292..6dec3ef 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -19,6 +19,7 @@ extern int fail_migrate_page(struct address_space *,
struct page *, struct page *);
extern int migrate_prep(void);
+extern int migrate_prep_local(void);
extern int migrate_vmas(struct mm_struct *mm,
const nodemask_t *from, const nodemask_t *to,
unsigned long flags);
@@ -32,6 +33,7 @@ static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, int offlining) { return -ENOSYS; }
static inline int migrate_prep(void) { return -ENOSYS; }
+static inline int migrate_prep_local(void) { return -ENOSYS; }
...Yep this is what I'd like too... btw in the comments you also mention IPI but I guess that's ok. About the cost I'm not sure but I would expect the cost of this to be even higher because it also has to run the scheduler unlike a real IPI. --
Fair point. I wouldn't be too sure what the relative costs are but either way it's bad and unnecessary in the case of compaction. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Add a per-node sysfs file called compact. When the file is written to, each zone in that node is compacted. The intention that this would be used by something like a job scheduler in a batch system before a job starts so that the job can allocate the maximum number of hugepages without significant start-up cost. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- Documentation/ABI/testing/sysfs-devices-node | 7 +++++++ drivers/base/node.c | 3 +++ include/linux/compaction.h | 16 ++++++++++++++++ mm/compaction.c | 23 +++++++++++++++++++++++ 4 files changed, 49 insertions(+), 0 deletions(-) create mode 100644 Documentation/ABI/testing/sysfs-devices-node diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node new file mode 100644 index 0000000..453a210 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-devices-node @@ -0,0 +1,7 @@ +What: /sys/devices/system/node/nodeX/compact +Date: February 2010 +Contact: Mel Gorman <mel@csn.ul.ie> +Description: + When this file is written to, all memory within that node + will be compacted. When it completes, memory will be freed + into blocks which have as many contiguous pages as possible diff --git a/drivers/base/node.c b/drivers/base/node.c index 057979a..2bdd8a9 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -9,6 +9,7 @@ #include <linux/memory.h> #include <linux/node.h> #include <linux/hugetlb.h> +#include <linux/compaction.h> #include <linux/cpumask.h> #include <linux/topology.h> #include <linux/nodemask.h> @@ -246,6 +247,8 @@ int register_node(struct node *node, int num, struct node *parent) ...
This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary value is written to the file, all zones are compacted. The expected user of such a trigger is a job scheduler that prepares the system before the target application runs. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> --- Documentation/sysctl/vm.txt | 10 +++++++ include/linux/compaction.h | 6 ++++ kernel/sysctl.c | 10 +++++++ mm/compaction.c | 61 +++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 87 insertions(+), 0 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 56366a5..3b3fa1b 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -19,6 +19,7 @@ files can be found in mm/swap.c. Currently, these files are in /proc/sys/vm: - block_dump +- compact_memory - dirty_background_bytes - dirty_background_ratio - dirty_bytes @@ -64,6 +65,15 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. ============================================================== +compact_memory + +Available only when CONFIG_COMPACTION is set. When 1 is written to the file, +all zones are compacted such that free memory is available in contiguous +blocks where possible. This can be important for example in the allocation of +huge pages although processes will also directly compact memory as required. + +============================================================== + dirty_background_bytes Contains the amount of dirty memory at which the pdflush background writeback diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 465ca88..5723888 100644 --- ...
This patch is the core of a mechanism which compacts memory in a zone by relocating movable pages towards the end of the zone. A single compaction run involves a migration scanner and a free scanner. Both scanners operate on pageblock-sized areas in the zone. The migration scanner starts at the bottom of the zone and searches for all movable pages within each area, isolating them onto a private list called migratelist. The free scanner starts at the top of the zone and searches for suitable areas and consumes the free pages within making them available for the migration scanner. The pages isolated for migration are then migrated to the newly isolated free pages. [aarcange@redhat.com: Fix unsafe optimisation] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> --- include/linux/compaction.h | 9 + include/linux/mm.h | 1 + include/linux/swap.h | 1 + include/linux/vmstat.h | 3 + mm/Makefile | 1 + mm/compaction.c | 394 ++++++++++++++++++++++++++++++++++++++++++++ mm/page_alloc.c | 45 +++++ mm/vmstat.c | 7 + 8 files changed, 461 insertions(+), 0 deletions(-) create mode 100644 include/linux/compaction.h create mode 100644 mm/compaction.c diff --git a/include/linux/compaction.h b/include/linux/compaction.h new file mode 100644 index 0000000..465ca88 --- /dev/null +++ b/include/linux/compaction.h @@ -0,0 +1,9 @@ +#ifndef _LINUX_COMPACTION_H +#define _LINUX_COMPACTION_H + +/* Return values for compact_zone() */ +#define COMPACT_CONTINUE 0 +#define COMPACT_PARTIAL 1 +#define COMPACT_COMPLETE 2 + +#endif /* _LINUX_COMPACTION_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index 3058bbc..eb21256 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -334,6 +334,7 @@ void put_page(struct page *page); void put_pages_list(struct list_head *pages); void split_page(struct ...
The fragmentation fragmentation index, is only meaningful if an allocation
would fail and indicates what the failure is due to. A value of -1 such as
in many of the examples above states that the allocation would succeed.
If it would fail, the value is between 0 and 1. A value tending towards
0 implies the allocation failed due to a lack of memory. A value tending
towards 1 implies it failed due to external fragmentation.
For the most part, the huge page size will be the size
of interest but not necessarily so it is exported on a per-order and per-zo
Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/vmstat.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 84 insertions(+), 0 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d3e0fa1..23a5899 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -16,6 +16,7 @@
#include <linux/cpu.h>
#include <linux/vmstat.h>
#include <linux/sched.h>
+#include <linux/math64.h>
#ifdef CONFIG_VM_EVENT_COUNTERS
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -420,6 +421,33 @@ static void fill_contig_page_info(struct zone *zone,
(order - suitable_order);
}
}
+
+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(unsigned int order, struct contig_page_info *info)
+{
+ unsigned long requested = 1UL << order;
+
+ if (!info->free_blocks_total)
+ return 0;
+
+ /* Fragmentation index ...The unusable free space index measures how much of the available free
memory cannot be used to satisfy an allocation of a given size and is
a value between 0 and 1. The higher the value, the more of free memory
is unusable and by implication, the worse the external fragmentation
is. For the most part, the huge page size will be the size of interest
but not necessarily so it is exported on a per-order and per-zone basis
Node 0, zone DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
Node 0, zone Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
[akpm@linux-foundation.org: Fix allnoconfig]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
mm/vmstat.c | 150 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 149 insertions(+), 1 deletions(-)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fa12ea3..d3e0fa1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -379,7 +379,50 @@ void zone_statistics(struct zone *preferred_zone, struct zone *z)
}
#endif
-#ifdef CONFIG_PROC_FS
+#ifdef CONFIG_COMPACTION
+struct contig_page_info {
+ unsigned long free_pages;
+ unsigned long free_blocks_total;
+ unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+ unsigned int suitable_order,
+ struct contig_page_info *info)
+{
+ unsigned int order;
+
+ info->free_pages = 0;
+ info->free_blocks_total = ...CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture being able to hot-remove memory. The main users of page migration such as sys_move_pages(), sys_migrate_pages() and cpuset process migration are only beneficial on NUMA so it makes sense. As memory compaction will operate within a zone and is useful on both NUMA and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the user selects CONFIG_COMPACTION as an option. [akpm@linux-foundation.org: Depend on CONFIG_HUGETLB_PAGE] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --- mm/Kconfig | 18 +++++++++++++++--- 1 files changed, 15 insertions(+), 3 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index 9c61158..a275a7d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -172,6 +172,16 @@ config SPLIT_PTLOCK_CPUS default "4" # +# support for memory compaction +config COMPACTION + bool "Allow for memory compaction" + def_bool y + select MIGRATION + depends on EXPERIMENTAL && HUGETLB_PAGE && MMU + help + Allows the compaction of memory for the allocation of huge pages. + +# # support for page migration # config MIGRATION @@ -180,9 +190,11 @@ config MIGRATION depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE help Allows the migration of the physical location of pages of processes - while the virtual addresses are not changed. This is useful for - example on NUMA systems to put pages nearer to the processors accessing - the page. + while the virtual addresses are not changed. This is useful in + two situations. The first is on NUMA systems to put pages nearer + to the processors accessing. The second is when allocating huge + pages as migration can relocate pages to satisfy a huge page + allocation instead of reclaiming. config PHYS_ADDR_T_64BIT def_bool 64BIT || ...
PageAnon pages that are unmapped may or may not have an anon_vma so are
not currently migrated. However, a swap cache page can be migrated and
fits this description. This patch identifies page swap caches and allows
them to be migrated but ensures that no attempt to made to remap the pages
would would potentially try to access an already freed anon_vma.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
mm/migrate.c | 53 ++++++++++++++++++++++++++++++++++++-----------------
1 files changed, 36 insertions(+), 17 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index b114635..4afd6fe 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -485,7 +485,8 @@ static int fallback_migrate_page(struct address_space *mapping,
* < 0 - error code
* == 0 - success
*/
-static int move_to_new_page(struct page *newpage, struct page *page)
+static int move_to_new_page(struct page *newpage, struct page *page,
+ int remap_swapcache)
{
struct address_space *mapping;
int rc;
@@ -520,10 +521,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
else
rc = fallback_migrate_page(mapping, newpage, page);
- if (!rc)
- remove_migration_ptes(page, newpage);
- else
+ if (rc) {
newpage->mapping = NULL;
+ } else {
+ if (remap_swapcache)
+ remove_migration_ptes(page, newpage);
+ }
unlock_page(newpage);
@@ -540,6 +543,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, &result);
+ int remap_swapcache = 1;
int rcu_locked = 0;
int charge = 0;
struct mem_cgroup *mem = NULL;
@@ -601,18 +605,33 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
rcu_read_lock();
rcu_locked = 1;
- /*
- * If the page has no mappings any more, just bail. An
- * unmapped anon page is ...You are going to keep the migration ptes after the page has been unlocked? Or is remap_swapcache true if its not a swapcache page? Maybe you meant if (!remap_swapcache) Looks like you meant !remap_swapcache --
Yes, because it's not known if the anon_vma for the unmapped swapcache page still exists or not. Now, a bug has been reported where a migration PTE is found where the page is not locked. I'm trying to determine if it's the same No, remap_swapcache could just be called "remap". If it's 0, it's If remap_swapcache is 1, the anon_vma is valid (or irrelevant because it's a file) and it's safe to remap the page by removing the migration PTEs. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Call this "can_remap"? --
can_do - ba dum tisch. While you are looking though, maybe you can confirm something for me. 1. Is leaving a migration PTE like this behind reasonable? (I think yes particularly as the page was already unmapped so it's not a new fault incurred) 2. Is the BUG_ON check in include/linux/swapops.h#migration_entry_to_page() now wrong? (I think yes, but I'm not sure and I'm having trouble verifying it) Thanks. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
The design of page migration only allows for the existence of these as long as the page is locked. Not sure what would happen if you leave this hanging around. Paths that are not prepared for a migration_pte may The bug check ensures that migration entries only occur when the page is locked. This patch changes that behavior. This is going too oops therefore in unmap_and_move() when you try to remove the migration_ptes from an unlocked page. --
If there are other paths, then migration of unmapped PageSwapCache is plain unsafe and this patch would have to go. It'd limit compaction somewhat but without the series, it would appear that memory hot-remove It's not unmap_and_move() that the problem is occurring on but during a page fault - presumably in do_swap_page but I'm not 100% certain. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
remove_migration_pte() calls migration_entry_to_page(). So it must do that only if the page is still locked. You need to ensure that the page is not unlocked in move_to_new_page() if the migration ptes are kept. move_to_new_page() only unlocks the new page not the original page. So that is safe. And it seems that the old page is also unlocked in unmap_and_move() only after the migration_ptes have been removed? So we are fine after all...? --
Correct, but the other call path is
do_swap_page
-> migration_entry_wait
-> migration_entry_to_page
with migration_entry_wait expecting the page to be locked. There is a dangling
migration PTEs coming from somewhere. I thought it was from unmapped swapcache
You'd think but migration PTEs are being left behind in some circumstance. I
thought it was due to this series, but it's unlikely. It's more a case that
compaction heavily exercises migration.
We can clean up the old migration PTEs though when they are encountered
like in the following patch for example? I'll continue investigating why
this dangling migration pte exists as closing that race would be a
better fix.
==== CUT HERE ====
mm,migration: Remove dangling migration ptes pointing to unlocked pages
Due to some yet-to-be-identified race, it is possible for migration PTEs
to be left behind, When later paged-in, a BUG is triggered that assumes
that all migration PTEs are point to a page currently being migrated and
so must be locked.
Rather than calling BUG, this patch notes the existance of dangling migration
PTEs in migration_entry_wait() and cleans them up.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/swapops.h | 11 ++++++-----
mm/memory.c | 2 +-
mm/migrate.c | 24 ++++++++++++++++++++----
3 files changed, 27 insertions(+), 10 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index cd42e30..01fba71 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -95,14 +95,15 @@ static inline int is_write_migration_entry(swp_entry_t entry)
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
}
-static inline struct page *migration_entry_to_page(swp_entry_t entry)
+static inline struct page *migration_entry_to_page(swp_entry_t entry,
+ bool lock_required)
{
struct page *p = pfn_to_page(swp_offset(entry));
/*
* Any use of migration entries may only occur while the
* corresponding page is ...On Thu, 22 Apr 2010 10:28:20 +0100 I use similar patch for debugging. In my patch, this when this function founds dangling migration entry, return error code and do_swap_page() returns VM_FAULT_SIGBUS. Hmm..in my test, the case was. Before try_to_unmap: mapcount=1, SwapCache, remap_swapcache=1 After remap mapcount=0, SwapCache, rc=0. So, I think there may be some race in rmap_walk() and vma handling or anon_vma handling. migration_entry isn't found by rmap_walk. Hmm..it seems this kind patch will be required for debug. -Kame --
On Thu, Apr 22, 2010 at 6:46 PM, KAMEZAWA Hiroyuki I looked do_swap_page, again. lock_page is called long after migration_entry_wait. It means lock_page can't close the race. So I think this BUG is possible. -- Kind regards, Minchan Kim --
On Thu, 22 Apr 2010 19:13:12 +0900 I think it's not a problem. When migration-entry-wait is called, enry_wait() does pte_lock(); check migration_pte check it's locked. And after wait_on_page_locked(), it just returns to user and cause page fault again. Thanks, -Kame --
On Thu, 22 Apr 2010 19:31:06 +0900
Ok, here is my patch for _fix_. But still testing...
Running well at least for 30 minutes, where I can see bug in 10minutes.
But this patch is too naive. please think about something better fix.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
At adjust_vma(), vma's start address and pgoff is updated under
write lock of mmap_sem. This means the vma's rmap information
update is atoimic only under read lock of mmap_sem.
Even if it's not atomic, in usual case, try_to_ummap() etc...
just fails to decrease mapcount to be 0. no problem.
But at page migration's rmap_walk(), it requires to know all
migration_entry in page tables and recover mapcount.
So, this race in vma's address is critical. When rmap_walk meet
the race, rmap_walk will mistakenly get -EFAULT and don't call
rmap_one(). This patch adds a lock for vma's rmap information.
But, this is _very slow_.
We need something sophisitcated, light-weight update for this..
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/mm_types.h | 1 +
kernel/fork.c | 1 +
mm/mmap.c | 11 ++++++++++-
mm/rmap.c | 3 +++
4 files changed, 15 insertions(+), 1 deletion(-)
Index: linux-2.6.34-rc4-mm1/include/linux/mm_types.h
===================================================================
--- linux-2.6.34-rc4-mm1.orig/include/linux/mm_types.h
+++ linux-2.6.34-rc4-mm1/include/linux/mm_types.h
@@ -183,6 +183,7 @@ struct vm_area_struct {
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy; /* NUMA policy for the VMA */
#endif
+ spinlock_t adjust_lock;
};
struct core_thread {
Index: linux-2.6.34-rc4-mm1/mm/mmap.c
===================================================================
--- linux-2.6.34-rc4-mm1.orig/mm/mmap.c
+++ linux-2.6.34-rc4-mm1/mm/mmap.c
@@ -584,13 +584,20 @@ again: remove_next = 1 + (end > next->
if (adjust_next)
vma_prio_tree_remove(next, root);
...Ok wow. That is exceptionally well-spotted. This looks like a proper bug
In the event the VMA is backed by a file, the mapping i_mmap_lock is taken for
the duration of the update and is taken elsewhere where the VMA information
is read such as rmap_walk_file()
In the event the VMA is anon, vma_adjust currently talks no locks and your
patch introduces a new one but why not use the anon_vma lock here? Am I
missing something that requires the new lock?
For example;
==== CUT HERE ====
mm: Take the anon_vma lock in vma_adjust()
vma_adjust() is updating anon VMA information without any locks taken.
In constract, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.
This patch takes the anon_vma->lock during vma_adjust to avoid such
races.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/mmap.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ if (vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
--
rmap_walk_anon doesn't hold vma's anon_vma->lock. It holds page->anon_vma->lock. -- Kind regards, Minchan Kim --
Of course, thank you for pointing out my error. With multiple
anon_vma's, the locking is a bit of a mess. We cannot hold spinlocks on
two vma's in the same list at the same time without potentially causing
a livelock. The problem becomes how we can safely drop one anon_vma and
acquire the other without them disappearing from under us.
See the XXX mark in the following incomplete patch for example. It's
incomplete because the list traversal is also not safe once the lock has
been dropped and -EFAULT is returned by vma_address.
==== CUT HERE ====
mm: Take the vma anon_vma lock in vma_adjust and during rmap_walk
vma_adjust() is updating anon VMA information without any locks taken.
In constract, file-backed mappings use the i_mmap_lock. This lack of
locking can result in races with page migration. During rmap_walk(),
vma_address() can return -EFAULT for an address that will soon be valid.
This leaves a dangling migration PTE behind which can later cause a
BUG_ON to trigger when the page is faulted in.
This patch takes the anon_vma->lock during vma_adjust to avoid such
races. During rmap_walk, the page anon_vma is locked but as it walks the
VMA list, it'll lock the VMA->anon_vma if they differ as well.
---
mm/mmap.c | 6 ++++++
mm/rmap.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 46 insertions(+), 8 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61d6f1d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
+
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -620,6 +623,9 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
+ if (vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
+
if (remove_next) {
if (file) {
fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 85f203e..1ea0cae ...Incidentally, I now belatedly see why Kamezawa introduced a new lock. I assume it was to get around this mess. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
There is a simplier alternative I guess. When the vma->anon_vma is difference, try and lock it. If the lock is uncontended, continue. If not, release the pages anon_vma lock and start from the beginning - repeat until the list is walked uncontended. This should avoid livelocking against other walkers. I have the test running now for 30 minutes with no problems but will leave it overnight and see what happens. I tried the approach of having vma_adjust and rmap_walk always seeing the same anon_vma but I couldn't devise a method. It doesn't seem possible but I'm still getting to grips with the anon_vma_chain stuff. Maybe Rik can spot a better way of doing this. ==== CUT HERE ==== mm: Take the vma anon_vma lock in vma_adjust and during rmap_walk vma_adjust() is updating anon VMA information without any locks taken. In constract, file-backed mappings use the i_mmap_lock. This lack of locking can result in races with page migration. During rmap_walk(), vma_address() can return -EFAULT for an address that will soon be valid. This leaves a dangling migration PTE behind which can later cause a BUG_ON to trigger when the page is faulted in. This patch has vma_adjust() take the vma->anon_vma lock. When rmap_walk_anon() is walking the anon_vma_chain list, it can encounter a VMA a different anon_vma. It cannot just take this lock because depending on the order of traversal of anon_vma_chains in other processes, it could cause a livelock. Instead, it releases the pages anon_vma lock it has and starts again from scratch until it can traverse the full list uncontended. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/mmap.c | 6 ++++++ mm/rmap.c | 21 ++++++++++++++++++++- 2 files changed, 26 insertions(+), 1 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index f90ea92..61d6f1d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -578,6 +578,9 @@ again: remove_next = 1 + (end > next->vm_end); } } + if (vma->anon_vma) + spin_lock(&vma->anon_vma->lock); + if (root) ...
Isnt this also a race with reclaim / swap? --
On Thu, 22 Apr 2010 14:40:46 -0500 (CDT) Yes, it's also race in reclaim/swap ... page_referenced() try_to_unmap(). rmap_walk() <==== we hit this case. But above 2 are not considered to be critical. I'm not sure how this race affect KSM. Thanks, -Kame --
I'm not that familiar with KSM but took a look through. Mostly,
accessing the VMA is protected by the mmap_sem with the exception of
rmap_walk_ksm. It needs similar protection for accessing the VMA than
rmap_walk_anon does.
Specifically, this part
list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
vma = vmac->vma;
if (rmap_item->address < vma->vm_start ||
rmap_item->address >= vma->vm_end)
continue;
needs to acquire the vma->anon_vma lock if it differs or in your case
call something similar to vma_address_safe.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
Nice Catch, Kame. :) For further optimization, we can hold vma->adjust_lock if vma_address returns -EFAULT. But I hope we redesigns it without new locking. But I don't have good idea, now. :( -- Kind regards, Minchan Kim --
How about this?
I just merged ideas of Mel and Kame.:)
It just shows the concept, not formal patch.
diff --git a/mm/mmap.c b/mm/mmap.c
index f90ea92..61ea742 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -578,6 +578,8 @@ again: remove_next = 1 + (end > next->vm_end);
}
}
+ if (vma->anon_vma)
+ spin_lock(&vma->anon_vma->lock);
if (root) {
flush_dcache_mmap_lock(mapping);
vma_prio_tree_remove(vma, root);
@@ -619,7 +621,8 @@ again: remove_next = 1 + (end > next->vm_end);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
-
+ if (vma->anon_vma)
+ spin_unlock(&vma->anon_vma->lock);
if (remove_next) {
if (file) {
fput(file);
diff --git a/mm/rmap.c b/mm/rmap.c
index 3a53d9f..8075057 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1359,9 +1359,22 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
spin_lock(&anon_vma->lock);
list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
struct vm_area_struct *vma = avc->vma;
- unsigned long address = vma_address(page, vma);
- if (address == -EFAULT)
+ struct anon_vma *tmp_anon_vma = vma->anon_vma;
+ unsigned long address;
+ int tmp_vma_lock = 0;
+
+ if (tmp_anon_vma != anon_vma) {
+ spin_lock(&tmp_anon_vma->lock);
+ tmp_vma_lock = 1;
+ }
+ address = vma_address(page, vma);
+ if (address == -EFAULT) {
+ if (tmp_vma_lock)
+ spin_unlock(&tmp_anon_vma->lock);
continue;
+ }
+ if (tmp_vma_lock)
+ spin_unlock(&tmp_anon_vma->lock);
ret = rmap_one(page, vma, address, arg);
if (ret != SWAP_AGAIN)
break;
--
1.7.0.5
--
Kind regards,
Minchan Kim
--
heh, I thought of a similar approach at the same time as you but missed this mail until later. However, with this approach I suspect there is a possibility that two walkers of the same anon_vma list could livelock if two locks on the list are held at the same time. Am still thinking of -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Hi Mel, Trying to understand this issue and I've some questions. This vma_adjust and lock inversion troubles with the anon-vma lock in rmap_walk are a new issue introduced by the recent anon-vma changes, right? About swapcache, try_to_unmap just nuke the mappings, establish the swap entry in the pte (not migration entry), and then there's no need to call remove_migration_ptes. So it just need to skip it for swapcache. page_mapped must return zero after try_to_unmap returns before we're allowed to migrate (plus the page count must be just 1 and not 2 or more for gup users!). I don't get what's the problem about swapcache and the races connected to it, the moment I hear migration PTE in context of swapcache migration I'm confused because there's no migration PTE for swapcache. The new page will have no mappings either, it just needs to be part of the swapcache with the same page->index = swapentry, indexed in the radix tree with that page->index, and paga->mapping pointing to swapcache. Then new page faults will bring it in the pageatables. The lookup_swap_cache has to be serialized against some lock, it should be the radix tree lock? So the migration has to happen with that lock hold no? We can't just migrate swapcache without stopping swapcache radix tree lookups no? I didn't digest the full migrate.c yet and I don't see where it happens. Freeing the swapcache while simpler and safer, would be quite bad as it'd create I/O for potentially hot-ram. About the refcounting of anon-vma in migrate.c I think it'd be much simpler if zap_page_range and folks would just wait (like they do if they find a pmd_trans_huge && pmd_trans_splitting pmd), there would be no need of refcounting the anon-vma that way. I assume whatever is added to rmap_walk I also have to add to split_huge_page later when switching to mainline anon-vma code (for now I stick to 2.6.32 anon-vma code to avoid debugging anon-vma-chain, memory compaction, swapcache migration and transparent hugepage at ...
In a manner of speaking. There was no locking going on but prior to the anon_vma changes, there would have been only one anon_vma lock and the fix would be easier - just take the lock on anon_vma->lock while the That would be an alternative for swapcache but it's not necessarily That was a mistake on my part. The race appears to be between vma_adjust changing the details of the VMA while rmap_walk is going on. It mistakenly believes the vma no longer spans the address, gets -EFAULT from vma_address and doesn't clean up the migration PTE. This is later encountered but the page lock is no longer held and it bugs. An alternative would be to clean up the migration PTE of unlocked pages on the assumption it was due to this Think migrate_page_move_mapping() is what you're looking for? It takes I'm not getting what you're suggesting here. The refcount is to make sure the anon_vma doesn't go away after the page mapcount reaches zero. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
So it was very much a bug before too and we could miss to find some pte mapping the page if vm_start was adjusted? Also note, expand_downwards also moves vm_start with only the anon_vma->lock as it has to serialize against other expand_downwards and the rmap_walk. But expand_downwards takes the lock and it was safe before. Also for swapping even if things screwup it's no big deal, because it will just skip, but migration has to find all ptes in remove_migration_ptes and try_to_unmap also has to unmap everything. In the split_huge_page case even the ordering at which newly allocated vmas for the child are queued is critical, they've to be put at the end of the list to be safe (otherwise do_wp_huge_page may not wait and Hmm try_to_unmap already nukes all swap entries without creating any Agreed, it's sure better to close the race... the other may have even more implications. It's good to retain the invariant that when a migration PTE exists the page also still exists and it's locked Causing zap_page-range to Wait the end of migration when it encounters migration ptes instead of skipping them all together by only releasing the rss and doing nothing about them. If the pte can't go away, so the mm so the vma and so the anon-vma. I'm not suggesting to change that, but I guess that's the way I would have done if I would have implemented it, it'd avoid refcounting. Just like I did in split_huge_page I count the number of pmd marked splitting, and I compare it to the number of pmds that are converted from huge to not-huge and I compared that again with the page_mapcount. If the three numbers aren't equal I bug. It simply can't go wrong unnoticed that way. I only can do that because I stop the zapping. --
Well I looked deeper into it myself as I wanted to have this bit (and other bits) sorted out in aa.git, and definitely this is a bug introduced by the newest anon-vma changes in 2.6.34-rc so aa.git cannot be affected as it's using the 2.6.33 anon-vma (and prev) code. vma_adjust already takes the anon_vma->lock and of course I also further verified that trying to apply your snippet to vma_adjust results in immediately deadlock as the very same lock is already taken in my tree as it's the same anon-vma (simpler). So aa.git will be immune from these bugs for now. --
I think you're right. This is a new bug introduced by the anon_vma changes. On Yes, I expected that. Previously, there was only one anon_vma so if you It should be. I expect that's why you have never seen the bugon in swapops. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Correct, I never seen it, and I keep it under very great stress with swap storms of hugepages, lots of I/O and khugepaged at 100% cpu. Also keep in mind expand_downwards which also adjusts vm_start/vm_pgoff the same way (and without mmap_sem write mode). --
Well, to me this is also good because it shows it's not an existing bug in migration or a new bug introduced by compaction either. Previously I hadn't seen this bug either but until relatively recently, the bulk of the testing Will keep it in mind. It's taking the anon_vma lock but once again, there might be more than one anon_vma to worry about and the proper locking still isn't massively clear to me. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
I suggest to test again with aa.git as soon as I make a new release with your v8 code (probably today). I didn't merge the swapcache locked debug patch that tries to recover the thing after the fact. No problem here with swapcache apparently and your v8 and latest numa-aware khugepaged code is under stress for the last 12 hours. Other than full numa awareness in all hugepage allocations and your v8 code, I added a generic document that needs review and I plan to add a config tweak to select the default to be madvise or always for transparent hugepage (never makes no sense, other than for debugging purposes, madvise already provides the guarantee of zero risk of Yes, that's my point, same issue there. --
The locking for the anon_vma_chain->same_vma list is essentially the same as what was used before in mmap and anon_vma_prepare. Either the mmap_sem is held for write, or the mmap_sem is held for reading and the page_table_lock is held. What exactly is the problem that migration is seeing? -- All rights reversed --
There are two problems. Migration isn't holding the mmap_sem for write, for read or the pagetable lock. It locks the page, unmaps it, puts a migration PTE in place that looks like a swap entry, copies it and remaps it under the pagetable lock. At no point does it hold the mmap_sem, but it needs to be sure it finds all the migration pte it created. Because there are multiple anon_vma's, the locking is tricky and unclear. I have one patch that locks the anon_vmas as it finds them but is prepared to start over in the event of contention. The second appears to be migration ptes that get copied during fork(). This is easier to handle. I'm testing two patches at the moment and after 8 hours have seen no problem even though the races are being detected (and handled). If it survives the night, I'll post them. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
split_huge_page has the exact same requirements, except it is more strict and it will stop zap_page_range and count that the same number of pmds it marked as splitting are found again later. Also note migration has the same "ordering" requirements for anon_vma_link during fork, new vmas have to be appended at the end or migration will choke (not going into the details of why, but I can if you want). This should be safe in new anon-vma code as I already pointed out this requirement to Rik for split_huge_page to be safe too. I never tested split_huge_page on the fixed new anon-vma code (before the latest fixes so with rc4 or so, I only know before the latest fixes it was triggering BUG_ON in split_huge_page as I've enough bug-on in there to be sure if split_huge_page doesn't BUG_ON, it's safe). I need to retry with the new anon-vma code... split_huge_page never showed anything wrong with the 2.6.33 code that I'm running on And this is also where the requirement that new vmas are added to the I run again the same kernel as before and I reproduced the crash in migration_entry_wait swapops.h (page not locked) just once when I posted the stack trace and never again. I wanted to compare stack traces and see if it happens again. But that bug in migration_entry_wait can't be related to the new anon-vma code because I've backed it out from aa.git. Still you've to figure out if your patch is fixing a real bug. I'm just pointing out if there's a bug in anon-vma vma_adjust/expand_downards is unrelated to the crash in swapops.h migration_entry_wait. And obviously it's not either a bug in transparent hugepage code, as you also reproduced the same crash without using aa.git only with v8. We need to fix the swapops.h bug with maximum priority... (and of course the anon-vma bug too if it exists). Other than that swapops.h in migrate that you can also reproduce with only mainline + memory compaction v8, I had zero other problems with current aa.git. --
Oh I just got the very crash you're talking about with aa.git with your v8 code. Weird that I never reproduced it before! I think it's because I fixed gcc to be fully backed by hugepages always (without khugepaged) and I was rebuilding a couple of packages, and that now triggers memory compaction much more, but mixed with heavy fork/execve. This is the only instability I managed to reproduce over 24 hours of stress testing and it's clearly not related to transparent hugepage support but it's either a bug in migrate.c (more likely) or memory compaction. Note that I'm running with the 2.6.33 anon-vma code, so it will relieve you to know it's not the anon-vma recent changes causing this (well I can't rule out anon-vma bugs, but if it's anon-vma, it's a longstanding one). kernel BUG at include/linux/swapops.h:105! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/devices/pci0000:00/0000:00:12.0/host0/target0:0:0/0:0:0:0/block/sr0/size CPU 0 Modules linked in: nls_iso8859_1 loop twofish twofish_common tun bridge stp llc bnep sco rfcomm l2cap bluetooth snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss usbhid gspca_pac207 gspca_main videodev v4l1_compat v4l2_compat_ioctl32 snd_hda_codec_realtek ohci_hcd snd_hda_intel ehci_hcd usbcore snd_hda_codec snd_pcm snd_timer snd snd_page_alloc sg psmouse sr_mod pcspkr Pid: 13351, comm: basename Not tainted 2.6.34-rc5 #23 M2A-VM/System Product Name RIP: 0010:[<ffffffff810e66b0>] [<ffffffff810e66b0>] migration_entry_wait+0x170/0x180 RSP: 0000:ffff88009ab6fa58 EFLAGS: 00010246 RAX: ffffea0000000000 RBX: ffffea000234eed8 RCX: ffff8800aaa95298 RDX: 00000000000a168d RSI: ffff88000411ae28 RDI: ffffea00025550a8 RBP: ffffea0002555098 R08: ffff88000411ae28 R09: 0000000000000000 R10: 0000000000000008 R11: 0000000000000009 R12: 00000000aaa95298 R13: 00007ffff8a53000 R14: ffff88000411ae28 R15: ffff88011108a7c0 FS: 00002adf29469b90(0000) GS:ffff880001a00000(0000) knlGS:0000000055700d50 CS: 0010 DS: 0000 ...
The oopses I am getting look very similar. The page is encountered in the stack while copying the arguements in. I don't think it's a coincidence. [17831.496941] ------------[ cut here ]------------ [17831.532517] kernel BUG at include/linux/swapops.h:105! [17831.532517] invalid opcode: 0000 [#1] PREEMPT SMP [17831.532517] last sysfs file: /sys/block/sde/size [17831.532517] CPU 0 [17831.532517] Modules linked in: kvm_amd kvm dm_crypt loop evdev tpm_tis i2c_piix4 shpchp tpm wmi tpm_bios serio_raw i2c_core pci_hotplug processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod sd_mod cdrom ata_generic ahci libahci ide_pci_generic libata r8169 mii ide_core ehci_hcd scsi_mod ohci_hcd floppy thermal fan thermal_sys [17831.532517] [17831.532517] Pid: 31028, comm: make Not tainted 2.6.34-rc4-mm1-fix-swapops #2 GA-MA790GP-UD4H/GA-MA790GP-UD4H [17831.532517] RIP: 0010:[<ffffffff811094fb>] [<ffffffff811094fb>] migration_entry_wait+0xc1/0x129 [17831.532517] RSP: 0018:ffff88007ebfd9d8 EFLAGS: 00010246 [17831.532517] RAX: ffffea0000000000 RBX: ffffea0000199368 RCX: 00000000000389f0 [17831.532517] RDX: 0000000000199368 RSI: ffffffff81826558 RDI: 0000000000e9d63e [17831.532517] RBP: ffff88007ebfda08 R08: ffff88007e334ec0 R09: ffff88007ebfd9e8 [17831.532517] R10: 00000000b411a2ca R11: 0000000000000246 R12: 0000000036f44000 [17831.532517] R13: ffff88007d164088 R14: f8000000000074eb R15: 0000000000e9d63e [17831.532517] FS: 00002abc5c083f90(0000) GS:ffff880002200000(0000) knlGS:0000000000000000 [17831.532517] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [17831.532517] CR2: 00007fff3fb8cd1c CR3: 000000007d12d000 CR4: 00000000000006f0 [17831.532517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [17831.532517] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [17831.532517] Process make (pid: 31028, ...
On Tue, 27 Apr 2010 10:40:02 +0100
Hmm. booby trap aronude here ?
==
static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
{
....
/*
* cover the whole range: [new_start, old_end)
*/
if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
return -ENOMEM;
/*
* move the page tables downwards, on failure we rely on
* process cleanup to remove whatever mess we made.
*/
if (length != move_page_tables(vma, old_start,
vma, new_start, length))
return -ENOMEM;
...
/*
* Shrink the vma to just the new range. Always succeeds.
*/
vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
==
I think we have wrong vma_address() -> "pte"
==
=== (A) ===
vma_adjust(). ---- (*)
=== (B) ===
move_pte().
==
vma_address(page, vma)
=> address = vma->vm_start + ((page->index << shift) - vma->vm_pgoff) << PAGE_SHIFT);
So, vma_address() in zone (A) and vma_address in (B) will return different address.
When pte inludes migration_pte, this seems critical. Because an address pointed
by vma_address() in zone (B) will not contain migration_pte until
move_ptes() ends.
Thanks,
-Kame
--
I think so. I have a debugging patch running at the moment that is Specfically, I have it in move_ptes. If migration entries are being found there, it would be reasonable for exec() to wait on migration to complete Yes. I was expecting that the anon_vma lock in vma_adjust would delay exec This is plausible considering that, like vma_adjust(), move_ptes does not appear to take the anon_vma->lock in the same fashion as i_mmap_lock is taken for files. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Awesome this is happening in the same page and in the same place! Thanks a lot for sharing your oops (I'm running the same kernel again and I never reproduced it again but I didn't apply the "reproducer" stress to it but only plenty of gcc hugepage loads, I think rebuilding gcc with hugepage-gcc is what triggered it the first time, as gcc takes more memory to build and has a lot more of pathologic cases like translate.o taking lots of memory than the kernel or glibc which I rebuilt a lot recently). Reducing the race to this will help tremendously. If I understand correctly patch 1 didn't fix it (and patch 2 can't be the issue for me as I'm running 2.6.33 anon-vma code). I'll check ASAP if patch 1 is needed even if it's not fixing this bug, or if it's unnecessary, and what else could be wrong... Thanks a lot for the help! Andrea --
I thought it was but I was looking at an rc kernel instead of 2.6.33. This looked as if it was safe but it's not any more with the new It either has to find them all or it has to be capable of a lazy-cleanup. I had lazy cleanup patch but it was dropped because we felt it should have been possible to properly lock this. I'm beginning to think it can't because there appears to be a few cases where the VM There might also be a locking snarl there as well then. I confess that the details of transparent hugepage support have fallen back out of my When a mapped swapcache is unmapped, a migration PTE is put in place. If that was not the case, we wouldn't be hitting the bug in the first That would be another way of doing it all right. Count how many migration ptes we created, pass that to rmap_walk. If they don't match, assume a race and do it again before the page is unlocked. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
You could make it atomic through the use of RCU. Create a new vma entry with the changed parameters and then atomically switch to the new vma. Problem is that you have some list_heads in there. --
Indeed, it would be necessary to update -all- pointers to the old vma entry to point to the new vma entry. The question at that point will be "is it OK to switch the pointers over one at a time?" In many situations, the answer is "yes", but it is necessary to check carefully. Thanx, Paul --
Hi, Christoph. That's a good idea if we can do _simply_. That's because there are many confusion anon_vma and vma handling nowadays. (http://thread.gmane.org/gmane.linux.kernel/969907) So I hope we solve the problem without rather complicated rcu locking if it isn't critical path. -- Kind regards, Minchan Kim --
On Wed, 21 Apr 2010 09:30:20 -0500 (CDT) Ah....Can I confirm my understanding ? remap_swapcache == true only when The old page was ANON && it is not mapped. && it is SwapCache. We do above check under lock_page(). So, this SwapCache is never mapped until we release lock_page() on the old page. So, we don't use migration_pte in this case because try_to_unmap() do nothing and don't need to call remove_migration_pte(). If migration_pte is used somewhere...I think it's bug. --
On Thu, Apr 22, 2010 at 8:59 AM, KAMEZAWA Hiroyuki Yes. so I thought what kinds of race happened. Firstly I doubt fork and migration. but It isn't. I can't understand how this bug happens. Apparently, We have been missed something. I will look into this further. -- Kind regards, Minchan Kim --
rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors. The problem is that between the page being
isolated from the LRU and rcu_read_lock() being taken, the mapcount of the
page dropped to 0 and the anon_vma gets freed. This can happen during
memory compaction if pages being migrated belong to a process that exits
before migration completes. Hence, the use-after-free race looks like
1. Page isolated for migration
2. Process exits
3. page_mapcount(page) drops to zero so anon_vma was no longer reliable
4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
4. call try_to_unmap, looks up tha anon_vma and "locks" it but the lock
is garbage.
This patch checks the mapcount after the rcu lock is taken. If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/migrate.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 42a3d24..b114635 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -600,6 +600,17 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
if (PageAnon(page)) {
rcu_read_lock();
rcu_locked = 1;
+
+ /*
+ * If the page has no mappings any more, just bail. An
+ * unmapped anon page is likely to be freed soon but worse,
+ * it's possible its anon_vma disappeared between when
+ * the page was isolated and when we reached here while
+ * the RCU lock was not held
+ */
+ if (!page_mapped(page))
+ goto rcu_unlock;
+
anon_vma = page_anon_vma(page);
atomic_inc(&anon_vma->external_refcount);
}
--
1.6.5
--
| Jesse Barnes | Re: [stable] [BUG][PATCH] cpqphp: fix kernel NULL pointer dereference |
| Greg KH | [003/136] p54usb: add Zcomax XG-705A usbid |
| Magnus Damm | [PATCH 03/07] ARM: Use shared GIC entry macros on Realview |
| Oliver Neukum | Re: [Bug #13682] The webcam stopped working when upgrading from 2.6.29 to 2.6.30 |
| Martin Schwidefsky | Re: [PATCH] optimized ktime_get[_ts] for GENERIC_TIME=y |
git: | |
| Junio C Hamano | Re: Some advanced index playing |
| Jeff King | Re: confusion over the new branch and merge config |
| Robin Rosenberg | Re: cvs2svn conversion directly to git ready for experimentation |
| Linus Torvalds | git binary size... |
| Ævar Arnfjörð Bjarmason | Re: Challenge with Git-Bash |
| Linux Kernel Mailing List | md: move allocation of ->queue from mddev_find to md_probe | <
