Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

Previous thread: Potential fix to filemap_fault() by Steven Whitehouse on Thursday, July 31, 2008 - 4:04 am. (2 messages)

Next thread: [PATCH] MN10300: Wire up new system calls by David Howells on Thursday, July 31, 2008 - 5:01 am. (1 message)
From: Yasunori Goto
Date: Thursday, July 31, 2008 - 4:50 am

Hello.

This patch set is first trial and to describe my rough idea of
"how to remove pgdat".

I would like to confirm "current my idea is good way or not" by this post.
This patch is incomplete and not tested yet, If my idea is good way,
I'll continue to make them and test.

I think pgdat removing is diffcult issue,
because any code doesn't know pgdat will be removed, and access
them without any locking now. But the pgdat remover must wait their access,
because the node may be removed electrically after it soon.

Current my idea is using RCU feature for waiting them.
Because it is the least impact against reader's performance,
and pgdat remover can wait finish of reader's access to pgdat
which is removing by synchronize_sched().

So, I made followings read_lock for accessing pgdat.
  - pgdat_remove_read_lock()/unlock()
  - pgdat_remove_read_lock_sleepable()/unlock_sleepable()
These definishions use rcu_read_lock and srcu_read_lock().

Writer uses node_set_offline() which uses clear_bit(),
and build_all_zonelists() with stop_machine_run().


There are a few types of pgdat access.

  1) via node_online_bitmap.
     Many code use for_each_xxx_node(), for_each_zone(), and so on.
     These code must be used with pgdat_remove_read_lock/unlock().

  2) mempolicy
     There are callback interface when memory offline works. mempolicy
     must use callbacks for disable removing node.
     This patch set includes quite simple (sample) patch to point
     what will be required. However more detail specification will be necessary.
     (ex, When preffered node of mempolicy is removing, how does kernel should do?)
  
  3) zonelist
     alloc_pages access zones via zonelist. However, zone may be removed
     by pgdat remover too. It must be check zones might be removed
     before accessing zonliest which is guarded between pgdat_remove_read_lock()
     and unlock().

  4) via NODE_DATA() with node_id.
     This type access is called with numa_node_id() in many case.
     ...
From: Yasunori Goto
Date: Thursday, July 31, 2008 - 4:56 am

This is definition for pgdat_remove_read_lock() & read_lock_sleepable().


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>


---
 include/linux/memory_hotplug.h |   25 +++++++++++++++++++++++++
 mm/memory_hotplug.c            |   12 ++++++++++++
 2 files changed, 37 insertions(+)

Index: current/include/linux/memory_hotplug.h
===================================================================
--- current.orig/include/linux/memory_hotplug.h	2008-07-29 21:19:13.000000000 +0900
+++ current/include/linux/memory_hotplug.h	2008-07-29 21:19:17.000000000 +0900
@@ -20,6 +20,31 @@ struct mem_section;
 #define MIX_SECTION_INFO	(-1 - 2)
 #define NODE_INFO		(-1 - 3)
 
+#if (defined CONFIG_NUMA && CONFIG_MEMORY_HOTREMOVE)
+/*
+ * pgdat removing lock
+ */
+extern struct srcu_struct pgdat_remove_srcu;
+#define pgdat_remove_read_lock() rcu_read_lock()
+#define pgdat_remove_read_unlock() rcu_read_unlock()
+#define pgdat_remove_read_lock_sleepable() srcu_read_lock(&pgdat_remove_srcu)
+#define pgdat_remove_read_unlock_sleepable(idx) \
+	srcu_read_unlock(&pgdat_remove_srcu, idx)
+#else
+static inline void pgdat_remove_read_lock(void)
+{
+}
+static inline void pgdat_remove_read_unlock(void)
+{
+}
+static inline int pgdat_remove_read_lock_sleepable(void)
+{
+}
+static inline void pgdat_remove_read_unlock_sleepable(int idx)
+{
+}
+#endif
+
 /*
  * pgdat resizing functions
  */
Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c	2008-07-29 21:19:13.000000000 +0900
+++ current/mm/memory_hotplug.c	2008-07-29 22:17:38.000000000 +0900
@@ -31,6 +31,10 @@
 
 #include "internal.h"
 
+#if (defined CONFIG_NUMA && CONFIG_MEMORY_HOTREMOVE)
+struct srcu_struct pgdat_remove_srcu;
+#endif
+
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
@@ -850,6 +854,14 @@ failed_removal:
 
 	return ret;
 }
+
+static int __init ...
From: Yasunori Goto
Date: Thursday, July 31, 2008 - 4:58 am

This is to add pgdat_remove_read_lock()/unlock() for parsing zonelist in
__alloc_pages_internal().
The node might be removed before pgdat_remove_read_lock(),
node_online() must be checked at first. If offlined, don't parse it.

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 mm/page_alloc.c |   36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c	2008-07-31 19:01:46.000000000 +0900
+++ current/mm/page_alloc.c	2008-07-31 19:19:19.000000000 +0900
@@ -1394,10 +1394,22 @@ get_page_from_freelist(gfp_t gfp_mask, n
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
+	pgdat_remove_read_lock();
+	if (unlikely(!node_online(zonelist_nid))) {
+		/*
+		 * Pgdat removing worked before here.
+		 * Don't touch pgdat/zone/zonelist any more.
+		 */
+		pgdat_remove_read_unlock();
+		return NULL;
+	}
+
 	(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
 							&preferred_zone);
-	if (!preferred_zone)
+	if (!preferred_zone) {
+		pgdat_remove_read_unlock();
 		return NULL;
+	}
 
 	classzone_idx = zone_idx(preferred_zone);
 
@@ -1451,6 +1463,7 @@ try_next_zone:
 		zlc_active = 0;
 		goto zonelist_scan;
 	}
+	pgdat_remove_read_unlock();
 	return page;
 }
 
@@ -1536,10 +1549,21 @@ __alloc_pages_internal(gfp_t gfp_mask, u
 		return NULL;
 
 restart:
+	pgdat_remove_read_lock();
+	if (unlikely(!node_online(zonelist_nid))) {
+		/*
+		 * pgdat removing worked before here.
+		 * zone & zonelist can't be touched.
+		 */
+		pgdat_remove_read_unlock();
+		goto nopage;
+	}
 	zonelist = node_zonelist(zonelist_nid, gfp_mask);;
 	z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */
+	zone = z->zone;
+	pgdat_remove_read_unlock();
 
-	if (unlikely(!z->zone)) {
+	if (unlikely(!zone)) {
 		/*
 		 * ...
From: Yasunori Goto
Date: Thursday, July 31, 2008 - 5:00 am

Add pgdat_remove_read_lock() and unlock() for parsing 
for_each_online_node() (and for_each_node_state()).

(for_each_zone also needs same lock, but I don't implement
it yet.)


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 fs/buffer.c         |    4 +++-
 mm/mempolicy.c      |    9 ++++++++-
 mm/page-writeback.c |    2 ++
 mm/page_alloc.c     |    9 ++++++++-
 mm/vmscan.c         |    2 ++
 mm/vmstat.c         |    3 +++
 6 files changed, 26 insertions(+), 3 deletions(-)

Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c	2008-07-29 21:21:33.000000000 +0900
+++ current/mm/page_alloc.c	2008-07-29 22:17:44.000000000 +0900
@@ -2345,6 +2345,7 @@ static int default_zonelist_order(void)
 	/* Is there ZONE_NORMAL ? (ex. ppc has only DMA zone..) */
 	low_kmem_size = 0;
 	total_size = 0;
+	pgdat_remove_read_lock();
 	for_each_online_node(nid) {
 		for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
 			z = &NODE_DATA(nid)->node_zones[zone_type];
@@ -2355,6 +2356,7 @@ static int default_zonelist_order(void)
 			}
 		}
 	}
+	pgdat_remove_read_unlock();
 	if (!low_kmem_size ||  /* there are no DMA area. */
 	    low_kmem_size > total_size/2) /* DMA/DMA32 is big. */
 		return ZONELIST_ORDER_NODE;
@@ -2365,6 +2367,8 @@ static int default_zonelist_order(void)
          */
 	average_size = total_size /
 				(nodes_weight(node_states[N_HIGH_MEMORY]) + 1);
+
+	pgdat_remove_read_lock();
 	for_each_online_node(nid) {
 		low_kmem_size = 0;
 		total_size = 0;
@@ -2378,9 +2382,12 @@ static int default_zonelist_order(void)
 		}
 		if (low_kmem_size &&
 		    total_size > average_size && /* ignore small node */
-		    low_kmem_size > total_size * 70/100)
+		    low_kmem_size > total_size * 70/100){
+			pgdat_remove_read_unlock();
 			return ZONELIST_ORDER_NODE;
+		}
 	}
+	pgdat_remove_read_unlock();
 	return ZONELIST_ORDER_ZONE;
 }
 
Index: ...
From: Yasunori Goto
Date: Thursday, July 31, 2008 - 4:55 am

This is preparation patch for the later patch.
Current code passes the pointer of zonelist to __alloc_pages() to
specify which zonelist should be used.
However, its parameter also means which node(pgdat)'s zonelist
should be used for parsing.

This patch change interface from zonelist pointer to node id (zonelist_nid)
which has target zonelist.
Because node id is easy to check node online/offline.


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 include/linux/gfp.h       |   12 +++++------
 include/linux/mempolicy.h |    2 -
 mm/hugetlb.c              |    4 ++-
 mm/mempolicy.c            |   50 +++++++++++++++++++++++-----------------------
 mm/page_alloc.c           |   21 ++++++++++++-------
 5 files changed, 49 insertions(+), 40 deletions(-)

Index: current/include/linux/gfp.h
===================================================================
--- current.orig/include/linux/gfp.h	2008-07-31 18:54:09.000000000 +0900
+++ current/include/linux/gfp.h	2008-07-31 18:54:18.000000000 +0900
@@ -175,20 +175,20 @@ static inline void arch_alloc_page(struc
 
 struct page *
 __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
-		       struct zonelist *zonelist, nodemask_t *nodemask);
+		       int zonelist_nid, nodemask_t *nodemask);
 
 static inline struct page *
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+		int zonelist_nid)
 {
-	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+	return __alloc_pages_internal(gfp_mask, order, zonelist_nid, NULL);
 }
 
 static inline struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist, nodemask_t *nodemask)
+		int zonelist_nid, nodemask_t *nodemask)
 {
-	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+	return __alloc_pages_internal(gfp_mask, order, zonelist_nid, nodemask);
 }
 
 
@@ -202,7 +202,7 @@ static inline struct page *alloc_pages_n
 	if (nid < 0)
 		nid = numa_node_id();
 ...
From: Yasunori Goto
Date: Thursday, July 31, 2008 - 5:02 am

This patch is to make kswapd_stop().
It must be stopped before node removing.


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 include/linux/swap.h |    3 +++
 mm/vmscan.c          |   13 +++++++++++++
 2 files changed, 16 insertions(+)

Index: current/mm/vmscan.c
===================================================================
--- current.orig/mm/vmscan.c	2008-07-29 22:17:16.000000000 +0900
+++ current/mm/vmscan.c	2008-07-29 22:17:16.000000000 +0900
@@ -1985,6 +1985,9 @@ static int kswapd(void *p)
 		}
 		finish_wait(&pgdat->kswapd_wait, &wait);
 
+		if (kthread_should_stop())
+			break;
+
 		if (!try_to_freeze()) {
 			/* We can speed up thawing tasks if we don't call
 			 * balance_pgdat after returning from the refrigerator
@@ -2216,6 +2219,16 @@ int kswapd_run(int nid)
 	return ret;
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+void kswapd_stop(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	if (pgdat->kswapd)
+		kthread_stop(pgdat->kswapd);
+}
+#endif
+
 static int __init kswapd_init(void)
 {
 	int nid;
Index: current/include/linux/swap.h
===================================================================
--- current.orig/include/linux/swap.h	2008-07-29 21:20:02.000000000 +0900
+++ current/include/linux/swap.h	2008-07-29 22:17:16.000000000 +0900
@@ -262,6 +262,9 @@ static inline void scan_unevictable_unre
 #endif
 
 extern int kswapd_run(int nid);
+#ifdef CONFIG_MEMORY_HOTREMOVE
+extern void kswapd_stop(int nid);
+#endif
 
 #ifdef CONFIG_MMU
 /* linux/mm/shmem.c */

-- 
Yasunori Goto 


--

From: Yasunori Goto
Date: Thursday, July 31, 2008 - 5:03 am

This patch is very incomplete (includes dummy code),
but I would like to show what mempolicy has to do when node is removed.

Basically, user should change policy before removing node
which is used mempolicy. However, user may not know
mempolicy is used due to automatic setting by software.
The kernel must guarantee removed node will be not used.

There is callback when memory offlining, mempolicy can change
each task's policies.

There are some issues.
  - If nodes_weight(pol->v.nodes) will be 0 due to node removing,
    Kernel will not be able to allocate any pages.
    What does kernel should do?  Kill its process?
  - If preffered node is removing, then which node should be next
    preffered node?

Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 mm/mempolicy.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

Index: current/mm/mempolicy.c
===================================================================
--- current.orig/mm/mempolicy.c	2008-07-29 22:17:25.000000000 +0900
+++ current/mm/mempolicy.c	2008-07-29 22:17:29.000000000 +0900
@@ -2345,3 +2345,35 @@ out:
 		m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
 	return 0;
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static int mempolicy_mem_offline_callback(void *arg)
+{
+	int offline_node;
+	struct memory_notify *marg = arg;
+
+	offline_node = marg->status_change_nid;
+
+	/*
+	 * If the node still has available memory, we keep policies.
+	 */
+	if (offline_node < 0)
+		return 0;
+
+	/*
+	 * Disable all offline node's bit for each node mask.
+	 */
+	for_each_policy(pol) {
+		switch (pol->mode) {
+		case MPOL_BIND:
+		case MPOL_INTERLEAVE:
+			/* Force disable node bit */
+			node_clear(offline_node, pol->v.nodes);
+			break;
+		case MPOL_PREFFERED:
+			/* TBD */
+		default:
+			break;
+	}
+}
+#endif

-- 
Yasunori Goto 


--

From: Yasunori Goto
Date: Thursday, July 31, 2008 - 5:04 am

remove_pgdat() is main code for pgdat removing.
remove_pgdat() should be called for node-hotremove, but nothing calls
it. Sysfs interface (or anything else?) will be necessary.

And current offline_pages() has to be update zonelist and N_HIGH_MEMORY
if there is no present_pages on the node, and stop kswapd().


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>


---
 mm/memory_hotplug.c |   85 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 1 deletion(-)

Index: current/mm/memory_hotplug.c
===================================================================
--- current.orig/mm/memory_hotplug.c	2008-07-29 22:17:24.000000000 +0900
+++ current/mm/memory_hotplug.c	2008-07-29 22:17:32.000000000 +0900
@@ -241,6 +241,82 @@ static int __add_section(struct zone *zo
 	return register_new_memory(__pfn_to_section(phys_start_pfn));
 }
 
+static int cpus_busy_on_node(int nid)
+{
+	cpumask_t tmp = node_to_cpumask(nid);
+	int cpu, ret;
+
+	for_each_cpu_mask(cpu, tmp) {
+		if (cpu_online(cpu)) {
+			printk(KERN_INFO "cpu %d is busy\n", cpu);
+			ret = 1 ;
+		}
+	}
+	return 0;
+}
+
+static int sections_busy_on_node(struct pglist_data *pgdat)
+{
+	unsigned long section_nr, num, i;
+	int ret = 0;
+
+	section_nr = pfn_to_section_nr(pgdat->node_start_pfn);
+	num = pfn_to_section_nr(pgdat->node_spanned_pages);
+
+	for (i = section_nr; i < num; i++) {
+		if (present_section_nr(i)) {
+			printk(KERN_INFO "section %ld is busy\n", i);
+			ret = 1;
+		}
+	}
+	return ret;
+}
+
+void free_pgdat(int offline_nid, struct pglist_data *pgdat)
+{
+	struct page *page = virt_to_page(pgdat);
+
+	arch_refresh_nodedata(offline_nid, NULL);
+
+	if (PageSlab(page)) {
+		/* This pgdat is allocated on other node via hot-add */
+		arch_free_nodedata(pgdat);
+		return;
+	}
+
+	if (offline_nid != page_to_nid(page)) {
+		/* This pgdat is allocated on other node as memoryless node */
+		put_page_bootmem(page);
+		return;
+	}
+
+	/*
+	 * Ok. ...
From: Peter Zijlstra
Date: Saturday, September 6, 2008 - 7:21 am

FWIW synchronize_sched() is the wrong function to use here,
synchronize_rcu() is the right one.

--

From: Yasunori Goto
Date: Sunday, September 7, 2008 - 8:07 pm

Thanks. I'll fix it.


-- 
Yasunori Goto 


--

From: Yasunori Goto
Date: Thursday, July 31, 2008 - 5:01 am

When kernel uses NODE_DATA(nid), kernel must check its node is really online
or not. In addition, if numa_node_id() returns offlined node,
it must be bug because cpu offline on the node has to be executed before
node offline.
This patch checks it, and add read locks on some other little parsing
zone/zonelist places.


Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>

---
 mm/mempolicy.c  |   11 +++++++++--
 mm/page_alloc.c |   19 +++++++++++++++++--
 mm/quicklist.c  |    8 +++++++-
 mm/slub.c       |    7 ++++++-
 mm/vmscan.c     |   22 +++++++++++++++++++---
 5 files changed, 58 insertions(+), 9 deletions(-)

Index: current/mm/page_alloc.c
===================================================================
--- current.orig/mm/page_alloc.c	2008-07-29 22:06:46.000000000 +0900
+++ current/mm/page_alloc.c	2008-07-29 22:17:16.000000000 +0900
@@ -1884,7 +1884,12 @@ static unsigned int nr_free_zone_pages(i
 	/* Just pick one node, since fallback list is circular */
 	unsigned int sum = 0;
 
-	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
+	struct zonelist *zonelist;
+	int node = numa_node_id();
+
+	pgdat_remove_read_lock();
+	BUG_ON(!node_online(node));
+	zonelist = node_zonelist(node, GFP_KERNEL);
 
 	for_each_zone_zonelist(zone, z, zonelist, offset) {
 		unsigned long size = zone->present_pages;
@@ -1892,6 +1897,7 @@ static unsigned int nr_free_zone_pages(i
 		if (size > high)
 			sum += size - high;
 	}
+	pgdat_remove_read_unlock();
 
 	return sum;
 }
@@ -1935,7 +1941,14 @@ EXPORT_SYMBOL(si_meminfo);
 #ifdef CONFIG_NUMA
 void si_meminfo_node(struct sysinfo *val, int nid)
 {
-	pg_data_t *pgdat = NODE_DATA(nid);
+	pg_data_t *pgdat;
+
+	pgdat_remove_read_lock();
+	if (unlikely(!node_online(nid))) {
+		    pgdat_remove_read_unlock();
+		    return;
+	}
+	pgdat = NODE_DATA(nid);
 
 	val->totalram = pgdat->node_present_pages;
 	val->freeram = node_page_state(nid, NR_FREE_PAGES);
@@ -1947,6 +1960,8 @@ void si_meminfo_node(struct ...
From: Christoph Lameter
Date: Thursday, July 31, 2008 - 7:04 am

Use stop_machine()? The removal of a zone or node is a pretty rare event after all and it would avoid having to deal with rcu etc etc.

--

From: Yasunori Goto
Date: Friday, August 1, 2008 - 2:42 am

Agree.

To tell the truth, I tried hackbench with 3rd patch which add rcu_read_lock
in hot-path before this post to make rough estimate its impact.

%hackbench 100 process 2000

without patch.
  39.93

with patch
  39.99
(Both is 10 times avarage)

I guess this result has effect of disable preemption.
So, throughput looks not so bad, but probably, latency would be worse
as you mind.

Kame-san advised me I should take more other benchmarks which can get memory

I thought it at first, but are there the following worst case?


   CPU 0                                    CPU 1
-------------------------------------------------------
__alloc_pages()
    
    parsing_zonelist()
        :
    enter page_reclarim()
    sleep (and remember zone)                 :
                                              :
                                        update zonelist and node_online_map
                                          with stop_machine_run()
                                        free pgdat().
                                        remove the Node electrically.

    wake up and touch remembered
       zone,  but it is removed
    (Oops!!!)



Anyway, I'm happy if there is better way than my poor idea. :-)

Thanks for your comment.


-- 
Yasunori Goto 


--

From: Christoph Lameter
Date: Friday, August 1, 2008 - 6:51 am

Duh. Then the use of RCU would also mean that all of reclaim must be in a rcu period. So  reclaim cannot sleep anymore.



--

From: Yasunori Goto
Date: Friday, August 1, 2008 - 5:16 pm

I use srcu_read_lock() (sleepable rcu lock) if kernel must be sleep for
page reclaim. So, my patch basic idea is followings.


   CPU 0                                    CPU 1
-------------------------------------------------------
__alloc_pages()
     
    rcu_read_lock() and check 
      online bitmap
        parsing_zonelist()
    rcu_read_unlock()
        :
    enter page_reclarim()
    srcu_read_lock()
      parse zone/zonelist.
      sleep (and remember zone)                 :
                                               :
                                        update zonelist and node_online_map
                                          with stop_machine_run()

    wake up and touch remembered zone,  
   srcu_read_unlock()
                                        syncronized_sched().
                                        free_pgdat()


Thanks.

-- 
Yasunori Goto 


--

From: Christoph Lameter
Date: Monday, August 4, 2008 - 6:25 am

But that introduces more overhead in __alloc_pages.
--

From: Yasunori Goto
Date: Monday, August 4, 2008 - 11:39 pm

Hmmm. I think SRCU should be used when kernel has to sleep, and sleep time
will be bigger than SRCU's overhead.....

The followings are results of unixbench and lmbench.
I suppose my patch impacts lantency rather than throghput.
In these results, 100fd select and page fault latencies of lmbench became worse.
So I can't say there is no problem in my patches.

Anyway, I'll retry to find other less impact way if there is,
and compare benchmark results with this way.

Bye.

------------

Unixbench
-----

Normal 2.6.27-rc1-mm1


  BYTE UNIX Benchmarks (Version 4.1.0)
  System -- Linux localhost.localdomain 2.6.27-rc1-mm1 #1 SMP Mon Aug 4 16:08:48 JST 2008 ia64 ia64 ia64 GNU/Linux
  Start Benchmark Run: 2008年  8月  5日 火曜日 10:24:35 JST
   1 interactive users.
   10:24:35 up 9 min,  1 user,  load average: 0.16, 0.08, 0.03
  lrwxrwxrwx 1 root root 4 2008-02-25 15:48 /bin/sh -> bash
  /bin/sh: symbolic link to `bash'
  /dev/sda5             33792348  18360424  13687672  58% /home
Execl Throughput                           2954.0 lps   (29.8 secs, 3 samples)
File Read 1024 bufsize 2000 maxblocks    1211570.0 KBps  (30.0 secs, 3 samples)
File Write 1024 bufsize 2000 maxblocks   281599.0 KBps  (30.0 secs, 3 samples)
File Copy 1024 bufsize 2000 maxblocks    218859.0 KBps  (30.0 secs, 3 samples)
File Read 256 bufsize 500 maxblocks      328725.0 KBps  (30.0 secs, 3 samples)
File Write 256 bufsize 500 maxblocks      72850.0 KBps  (30.0 secs, 3 samples)
File Copy 256 bufsize 500 maxblocks       57095.0 KBps  (30.0 secs, 3 samples)
File Read 4096 bufsize 8000 maxblocks    3883690.0 KBps  (30.0 secs, 3 samples)
File Write 4096 bufsize 8000 maxblocks   1050752.0 KBps  (30.0 secs, 3 samples)
File Copy 4096 bufsize 8000 maxblocks    564703.0 KBps  (30.0 secs, 3 samples)
Pipe Throughput                          462027.5 lps   (10.0 secs, 10 samples)
Pipe-based Context Switching             105824.3 lps   (10.0 secs, 10 samples)
Process Creation                           2242.9 lps   (30.0 ...
From: Mel Gorman
Date: Tuesday, August 5, 2008 - 4:14 am

Maybe I am missing something, but what is wrong with stop_machine during

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christoph Lameter
Date: Tuesday, August 5, 2008 - 10:08 am

Reclaim can sleep while going down a zonelist. There would need to be some
form of synchronization to avoid removing a zone from the zonelist that we are
just scanning.
--

Previous thread: Potential fix to filemap_fault() by Steven Whitehouse on Thursday, July 31, 2008 - 4:04 am. (2 messages)

Next thread: [PATCH] MN10300: Wire up new system calls by David Howells on Thursday, July 31, 2008 - 5:01 am. (1 message)