Hello.
This patch set is first trial and to describe my rough idea of
"how to remove pgdat".
I would like to confirm "current my idea is good way or not" by this post.
This patch is incomplete and not tested yet, If my idea is good way,
I'll continue to make them and test.
I think pgdat removing is diffcult issue,
because any code doesn't know pgdat will be removed, and access
them without any locking now. But the pgdat remover must wait their access,
because the node may be removed electrically after it soon.
Current my idea is using RCU feature for waiting them.
Because it is the least impact against reader's performance,
and pgdat remover can wait finish of reader's access to pgdat
which is removing by synchronize_sched().
So, I made followings read_lock for accessing pgdat.
- pgdat_remove_read_lock()/unlock()
- pgdat_remove_read_lock_sleepable()/unlock_sleepable()
These definishions use rcu_read_lock and srcu_read_lock().
Writer uses node_set_offline() which uses clear_bit(),
and build_all_zonelists() with stop_machine_run().
There are a few types of pgdat access.
1) via node_online_bitmap.
Many code use for_each_xxx_node(), for_each_zone(), and so on.
These code must be used with pgdat_remove_read_lock/unlock().
2) mempolicy
There are callback interface when memory offline works. mempolicy
must use callbacks for disable removing node.
This patch set includes quite simple (sample) patch to point
what will be required. However more detail specification will be necessary.
(ex, When preffered node of mempolicy is removing, how does kernel should do?)
3) zonelist
alloc_pages access zones via zonelist. However, zone may be removed
by pgdat remover too. It must be check zones might be removed
before accessing zonliest which is guarded between pgdat_remove_read_lock()
and unlock().
4) via NODE_DATA() with node_id.
This type access is called with numa_node_id() in many case.
...This is definition for pgdat_remove_read_lock() & read_lock_sleepable().
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
---
include/linux/memory_hotplug.h | 25 +++++++++++++++++++++++++
mm/memory_hotplug.c | 12 ++++++++++++
2 files changed, 37 insertions(+)
Index: current/include/linux/memory_hotplug.h
===================================================================
--- current.orig/include/linux/memory_hotplug.h 2008-07-29 21:19:13.000000000 +0900
+++ current/include/linux/memory_hotplug.h 2008-07-29 21:19:17.000000000 +0900
@@ -20,6 +20,31 @@ struct mem_section;
#define MIX_SECTION_INFO (-1 - 2)
#define NODE_INFO (-1 - 3)
+#if (defined CONFIG_NUMA && CONFIG_MEMORY_HOTREMOVE)
+/*
+ * pgdat removing lock
+ */
+extern struct srcu_struct pgdat_remove_srcu;
+#define pgdat_remove_read_lock() rcu_read_lock()
+#define pgdat_remove_read_unlock() rcu_read_unlock()
+#define pgdat_remove_read_lock_sleepable() srcu_read_lock(&pgdat_remove_srcu)
+#define pgdat_remove_read_unlock_sleepable(idx) \
+ srcu_read_unlock(&pgdat_remove_srcu, idx)
+#else
+static inline void pgdat_remove_read_lock(void)
+{
+}
+static inline void pgdat_remove_read_unlock(void)
+{
+}
+static inline int pgdat_remove_read_lock_sleepable(void)
+{
+}
+static inline void pgdat_remove_read_unlock_sleepable(int idx)
+{
+}
+#endif
+
/*
* pgdat resizing functions
*/
Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c 2008-07-29 21:19:13.000000000 +0900
+++ current/mm/memory_hotplug.c 2008-07-29 22:17:38.000000000 +0900
@@ -31,6 +31,10 @@
#include "internal.h"
+#if (defined CONFIG_NUMA && CONFIG_MEMORY_HOTREMOVE)
+struct srcu_struct pgdat_remove_srcu;
+#endif
+
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
@@ -850,6 +854,14 @@ failed_removal:
return ret;
}
+
+static int __init ...This is to add pgdat_remove_read_lock()/unlock() for parsing zonelist in
__alloc_pages_internal().
The node might be removed before pgdat_remove_read_lock(),
node_online() must be checked at first. If offlined, don't parse it.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
---
mm/page_alloc.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)
Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c 2008-07-31 19:01:46.000000000 +0900
+++ current/mm/page_alloc.c 2008-07-31 19:19:19.000000000 +0900
@@ -1394,10 +1394,22 @@ get_page_from_freelist(gfp_t gfp_mask, n
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
+ pgdat_remove_read_lock();
+ if (unlikely(!node_online(zonelist_nid))) {
+ /*
+ * Pgdat removing worked before here.
+ * Don't touch pgdat/zone/zonelist any more.
+ */
+ pgdat_remove_read_unlock();
+ return NULL;
+ }
+
(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
&preferred_zone);
- if (!preferred_zone)
+ if (!preferred_zone) {
+ pgdat_remove_read_unlock();
return NULL;
+ }
classzone_idx = zone_idx(preferred_zone);
@@ -1451,6 +1463,7 @@ try_next_zone:
zlc_active = 0;
goto zonelist_scan;
}
+ pgdat_remove_read_unlock();
return page;
}
@@ -1536,10 +1549,21 @@ __alloc_pages_internal(gfp_t gfp_mask, u
return NULL;
restart:
+ pgdat_remove_read_lock();
+ if (unlikely(!node_online(zonelist_nid))) {
+ /*
+ * pgdat removing worked before here.
+ * zone & zonelist can't be touched.
+ */
+ pgdat_remove_read_unlock();
+ goto nopage;
+ }
zonelist = node_zonelist(zonelist_nid, gfp_mask);;
z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */
+ zone = z->zone;
+ pgdat_remove_read_unlock();
- if (unlikely(!z->zone)) {
+ if (unlikely(!zone)) {
/*
* ...Add pgdat_remove_read_lock() and unlock() for parsing
for_each_online_node() (and for_each_node_state()).
(for_each_zone also needs same lock, but I don't implement
it yet.)
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
---
fs/buffer.c | 4 +++-
mm/mempolicy.c | 9 ++++++++-
mm/page-writeback.c | 2 ++
mm/page_alloc.c | 9 ++++++++-
mm/vmscan.c | 2 ++
mm/vmstat.c | 3 +++
6 files changed, 26 insertions(+), 3 deletions(-)
Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c 2008-07-29 21:21:33.000000000 +0900
+++ current/mm/page_alloc.c 2008-07-29 22:17:44.000000000 +0900
@@ -2345,6 +2345,7 @@ static int default_zonelist_order(void)
/* Is there ZONE_NORMAL ? (ex. ppc has only DMA zone..) */
low_kmem_size = 0;
total_size = 0;
+ pgdat_remove_read_lock();
for_each_online_node(nid) {
for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
z = &NODE_DATA(nid)->node_zones[zone_type];
@@ -2355,6 +2356,7 @@ static int default_zonelist_order(void)
}
}
}
+ pgdat_remove_read_unlock();
if (!low_kmem_size || /* there are no DMA area. */
low_kmem_size > total_size/2) /* DMA/DMA32 is big. */
return ZONELIST_ORDER_NODE;
@@ -2365,6 +2367,8 @@ static int default_zonelist_order(void)
*/
average_size = total_size /
(nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
+
+ pgdat_remove_read_lock();
for_each_online_node(nid) {
low_kmem_size = 0;
total_size = 0;
@@ -2378,9 +2382,12 @@ static int default_zonelist_order(void)
}
if (low_kmem_size &&
total_size > average_size && /* ignore small node */
- low_kmem_size > total_size * 70/100)
+ low_kmem_size > total_size * 70/100){
+ pgdat_remove_read_unlock();
return ZONELIST_ORDER_NODE;
+ }
}
+ pgdat_remove_read_unlock();
return ZONELIST_ORDER_ZONE;
}
Index: ...This is preparation patch for the later patch.
Current code passes the pointer of zonelist to __alloc_pages() to
specify which zonelist should be used.
However, its parameter also means which node(pgdat)'s zonelist
should be used for parsing.
This patch change interface from zonelist pointer to node id (zonelist_nid)
which has target zonelist.
Because node id is easy to check node online/offline.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
---
include/linux/gfp.h | 12 +++++------
include/linux/mempolicy.h | 2 -
mm/hugetlb.c | 4 ++-
mm/mempolicy.c | 50 +++++++++++++++++++++++-----------------------
mm/page_alloc.c | 21 ++++++++++++-------
5 files changed, 49 insertions(+), 40 deletions(-)
Index: current/include/linux/gfp.h
===================================================================
--- current.orig/include/linux/gfp.h 2008-07-31 18:54:09.000000000 +0900
+++ current/include/linux/gfp.h 2008-07-31 18:54:18.000000000 +0900
@@ -175,20 +175,20 @@ static inline void arch_alloc_page(struc
struct page *
__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask);
+ int zonelist_nid, nodemask_t *nodemask);
static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist)
+ int zonelist_nid)
{
- return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+ return __alloc_pages_internal(gfp_mask, order, zonelist_nid, NULL);
}
static inline struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask)
+ int zonelist_nid, nodemask_t *nodemask)
{
- return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+ return __alloc_pages_internal(gfp_mask, order, zonelist_nid, nodemask);
}
@@ -202,7 +202,7 @@ static inline struct page *alloc_pages_n
if (nid < 0)
nid = numa_node_id();
...This patch is to make kswapd_stop().
It must be stopped before node removing.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
---
include/linux/swap.h | 3 +++
mm/vmscan.c | 13 +++++++++++++
2 files changed, 16 insertions(+)
Index: current/mm/vmscan.c
===================================================================
--- current.orig/mm/vmscan.c 2008-07-29 22:17:16.000000000 +0900
+++ current/mm/vmscan.c 2008-07-29 22:17:16.000000000 +0900
@@ -1985,6 +1985,9 @@ static int kswapd(void *p)
}
finish_wait(&pgdat->kswapd_wait, &wait);
+ if (kthread_should_stop())
+ break;
+
if (!try_to_freeze()) {
/* We can speed up thawing tasks if we don't call
* balance_pgdat after returning from the refrigerator
@@ -2216,6 +2219,16 @@ int kswapd_run(int nid)
return ret;
}
+#ifdef CONFIG_MEMORY_HOTREMOVE
+void kswapd_stop(int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+
+ if (pgdat->kswapd)
+ kthread_stop(pgdat->kswapd);
+}
+#endif
+
static int __init kswapd_init(void)
{
int nid;
Index: current/include/linux/swap.h
===================================================================
--- current.orig/include/linux/swap.h 2008-07-29 21:20:02.000000000 +0900
+++ current/include/linux/swap.h 2008-07-29 22:17:16.000000000 +0900
@@ -262,6 +262,9 @@ static inline void scan_unevictable_unre
#endif
extern int kswapd_run(int nid);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern void kswapd_stop(int nid);
+#endif
#ifdef CONFIG_MMU
/* linux/mm/shmem.c */
--
Yasunori Goto
--
This patch is very incomplete (includes dummy code),
but I would like to show what mempolicy has to do when node is removed.
Basically, user should change policy before removing node
which is used mempolicy. However, user may not know
mempolicy is used due to automatic setting by software.
The kernel must guarantee removed node will be not used.
There is callback when memory offlining, mempolicy can change
each task's policies.
There are some issues.
- If nodes_weight(pol->v.nodes) will be 0 due to node removing,
Kernel will not be able to allocate any pages.
What does kernel should do? Kill its process?
- If preffered node is removing, then which node should be next
preffered node?
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
---
mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
Index: current/mm/mempolicy.c
===================================================================
--- current.orig/mm/mempolicy.c 2008-07-29 22:17:25.000000000 +0900
+++ current/mm/mempolicy.c 2008-07-29 22:17:29.000000000 +0900
@@ -2345,3 +2345,35 @@ out:
m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
return 0;
}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static int mempolicy_mem_offline_callback(void *arg)
+{
+ int offline_node;
+ struct memory_notify *marg = arg;
+
+ offline_node = marg->status_change_nid;
+
+ /*
+ * If the node still has available memory, we keep policies.
+ */
+ if (offline_node < 0)
+ return 0;
+
+ /*
+ * Disable all offline node's bit for each node mask.
+ */
+ for_each_policy(pol) {
+ switch (pol->mode) {
+ case MPOL_BIND:
+ case MPOL_INTERLEAVE:
+ /* Force disable node bit */
+ node_clear(offline_node, pol->v.nodes);
+ break;
+ case MPOL_PREFFERED:
+ /* TBD */
+ default:
+ break;
+ }
+}
+#endif
--
Yasunori Goto
--
remove_pgdat() is main code for pgdat removing.
remove_pgdat() should be called for node-hotremove, but nothing calls
it. Sysfs interface (or anything else?) will be necessary.
And current offline_pages() has to be update zonelist and N_HIGH_MEMORY
if there is no present_pages on the node, and stop kswapd().
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
---
mm/memory_hotplug.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 84 insertions(+), 1 deletion(-)
Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c 2008-07-29 22:17:24.000000000 +0900
+++ current/mm/memory_hotplug.c 2008-07-29 22:17:32.000000000 +0900
@@ -241,6 +241,82 @@ static int __add_section(struct zone *zo
return register_new_memory(__pfn_to_section(phys_start_pfn));
}
+static int cpus_busy_on_node(int nid)
+{
+ cpumask_t tmp = node_to_cpumask(nid);
+ int cpu, ret;
+
+ for_each_cpu_mask(cpu, tmp) {
+ if (cpu_online(cpu)) {
+ printk(KERN_INFO "cpu %d is busy\n", cpu);
+ ret = 1 ;
+ }
+ }
+ return 0;
+}
+
+static int sections_busy_on_node(struct pglist_data *pgdat)
+{
+ unsigned long section_nr, num, i;
+ int ret = 0;
+
+ section_nr = pfn_to_section_nr(pgdat->node_start_pfn);
+ num = pfn_to_section_nr(pgdat->node_spanned_pages);
+
+ for (i = section_nr; i < num; i++) {
+ if (present_section_nr(i)) {
+ printk(KERN_INFO "section %ld is busy\n", i);
+ ret = 1;
+ }
+ }
+ return ret;
+}
+
+void free_pgdat(int offline_nid, struct pglist_data *pgdat)
+{
+ struct page *page = virt_to_page(pgdat);
+
+ arch_refresh_nodedata(offline_nid, NULL);
+
+ if (PageSlab(page)) {
+ /* This pgdat is allocated on other node via hot-add */
+ arch_free_nodedata(pgdat);
+ return;
+ }
+
+ if (offline_nid != page_to_nid(page)) {
+ /* This pgdat is allocated on other node as memoryless node */
+ put_page_bootmem(page);
+ return;
+ }
+
+ /*
+ * Ok. ...FWIW synchronize_sched() is the wrong function to use here, synchronize_rcu() is the right one. --
Thanks. I'll fix it. -- Yasunori Goto --
When kernel uses NODE_DATA(nid), kernel must check its node is really online
or not. In addition, if numa_node_id() returns offlined node,
it must be bug because cpu offline on the node has to be executed before
node offline.
This patch checks it, and add read locks on some other little parsing
zone/zonelist places.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
---
mm/mempolicy.c | 11 +++++++++--
mm/page_alloc.c | 19 +++++++++++++++++--
mm/quicklist.c | 8 +++++++-
mm/slub.c | 7 ++++++-
mm/vmscan.c | 22 +++++++++++++++++++---
5 files changed, 58 insertions(+), 9 deletions(-)
Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c 2008-07-29 22:06:46.000000000 +0900
+++ current/mm/page_alloc.c 2008-07-29 22:17:16.000000000 +0900
@@ -1884,7 +1884,12 @@ static unsigned int nr_free_zone_pages(i
/* Just pick one node, since fallback list is circular */
unsigned int sum = 0;
- struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
+ struct zonelist *zonelist;
+ int node = numa_node_id();
+
+ pgdat_remove_read_lock();
+ BUG_ON(!node_online(node));
+ zonelist = node_zonelist(node, GFP_KERNEL);
for_each_zone_zonelist(zone, z, zonelist, offset) {
unsigned long size = zone->present_pages;
@@ -1892,6 +1897,7 @@ static unsigned int nr_free_zone_pages(i
if (size > high)
sum += size - high;
}
+ pgdat_remove_read_unlock();
return sum;
}
@@ -1935,7 +1941,14 @@ EXPORT_SYMBOL(si_meminfo);
#ifdef CONFIG_NUMA
void si_meminfo_node(struct sysinfo *val, int nid)
{
- pg_data_t *pgdat = NODE_DATA(nid);
+ pg_data_t *pgdat;
+
+ pgdat_remove_read_lock();
+ if (unlikely(!node_online(nid))) {
+ pgdat_remove_read_unlock();
+ return;
+ }
+ pgdat = NODE_DATA(nid);
val->totalram = pgdat->node_present_pages;
val->freeram = node_page_state(nid, NR_FREE_PAGES);
@@ -1947,6 +1960,8 @@ void si_meminfo_node(struct ...Use stop_machine()? The removal of a zone or node is a pretty rare event after all and it would avoid having to deal with rcu etc etc. --
Agree.
To tell the truth, I tried hackbench with 3rd patch which add rcu_read_lock
in hot-path before this post to make rough estimate its impact.
%hackbench 100 process 2000
without patch.
39.93
with patch
39.99
(Both is 10 times avarage)
I guess this result has effect of disable preemption.
So, throughput looks not so bad, but probably, latency would be worse
as you mind.
Kame-san advised me I should take more other benchmarks which can get memory
I thought it at first, but are there the following worst case?
CPU 0 CPU 1
-------------------------------------------------------
__alloc_pages()
parsing_zonelist()
:
enter page_reclarim()
sleep (and remember zone) :
:
update zonelist and node_online_map
with stop_machine_run()
free pgdat().
remove the Node electrically.
wake up and touch remembered
zone, but it is removed
(Oops!!!)
Anyway, I'm happy if there is better way than my poor idea. :-)
Thanks for your comment.
--
Yasunori Goto
--
Duh. Then the use of RCU would also mean that all of reclaim must be in a rcu period. So reclaim cannot sleep anymore. --
I use srcu_read_lock() (sleepable rcu lock) if kernel must be sleep for
page reclaim. So, my patch basic idea is followings.
CPU 0 CPU 1
-------------------------------------------------------
__alloc_pages()
rcu_read_lock() and check
online bitmap
parsing_zonelist()
rcu_read_unlock()
:
enter page_reclarim()
srcu_read_lock()
parse zone/zonelist.
sleep (and remember zone) :
:
update zonelist and node_online_map
with stop_machine_run()
wake up and touch remembered zone,
srcu_read_unlock()
syncronized_sched().
free_pgdat()
Thanks.
--
Yasunori Goto
--
But that introduces more overhead in __alloc_pages. --
Hmmm. I think SRCU should be used when kernel has to sleep, and sleep time will be bigger than SRCU's overhead..... The followings are results of unixbench and lmbench. I suppose my patch impacts lantency rather than throghput. In these results, 100fd select and page fault latencies of lmbench became worse. So I can't say there is no problem in my patches. Anyway, I'll retry to find other less impact way if there is, and compare benchmark results with this way. Bye. ------------ Unixbench ----- Normal 2.6.27-rc1-mm1 BYTE UNIX Benchmarks (Version 4.1.0) System -- Linux localhost.localdomain 2.6.27-rc1-mm1 #1 SMP Mon Aug 4 16:08:48 JST 2008 ia64 ia64 ia64 GNU/Linux Start Benchmark Run: 2008年 8月 5日 火曜日 10:24:35 JST 1 interactive users. 10:24:35 up 9 min, 1 user, load average: 0.16, 0.08, 0.03 lrwxrwxrwx 1 root root 4 2008-02-25 15:48 /bin/sh -> bash /bin/sh: symbolic link to `bash' /dev/sda5 33792348 18360424 13687672 58% /home Execl Throughput 2954.0 lps (29.8 secs, 3 samples) File Read 1024 bufsize 2000 maxblocks 1211570.0 KBps (30.0 secs, 3 samples) File Write 1024 bufsize 2000 maxblocks 281599.0 KBps (30.0 secs, 3 samples) File Copy 1024 bufsize 2000 maxblocks 218859.0 KBps (30.0 secs, 3 samples) File Read 256 bufsize 500 maxblocks 328725.0 KBps (30.0 secs, 3 samples) File Write 256 bufsize 500 maxblocks 72850.0 KBps (30.0 secs, 3 samples) File Copy 256 bufsize 500 maxblocks 57095.0 KBps (30.0 secs, 3 samples) File Read 4096 bufsize 8000 maxblocks 3883690.0 KBps (30.0 secs, 3 samples) File Write 4096 bufsize 8000 maxblocks 1050752.0 KBps (30.0 secs, 3 samples) File Copy 4096 bufsize 8000 maxblocks 564703.0 KBps (30.0 secs, 3 samples) Pipe Throughput 462027.5 lps (10.0 secs, 10 samples) Pipe-based Context Switching 105824.3 lps (10.0 secs, 10 samples) Process Creation 2242.9 lps (30.0 ...
Maybe I am missing something, but what is wrong with stop_machine during -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Reclaim can sleep while going down a zonelist. There would need to be some form of synchronization to avoid removing a zone from the zonelist that we are just scanning. --
