login
Header Space

 
 

Re: [PATCH 0/6] cpuset aware writeback

Previous thread: signalfd and thread semantics by Michael Kerrisk on Tuesday, July 17, 2007 - 4:44 pm. (3 messages)

Next thread: [PATCH] smp_call_function_single() should be a macro on UP by Al Viro on Tuesday, July 17, 2007 - 5:29 pm. (5 messages)
To: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Tuesday, July 17, 2007 - 5:23 pm

Perform writeback and dirty throttling with awareness of cpuset mem_allowed.

The theory of operation has two primary elements:

1. Add a nodemask per mapping which indicates the nodes
   which have set PageDirty on any page of the mappings.

2. Add a nodemask argument to wakeup_pdflush() which is
   propagated down to sync_sb_inodes.

This leaves sync_sb_inodes() with two nodemasks. One is passed to it and
specifies the nodes the caller is interested in syncing, and will either
be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the
caller's context.

The second nodemask is attached to the inode's mapping and shows who has
modified data in the inode. sync_sb_inodes() will then skip syncing of
inodes if the nodemask argument does not intersect with the mapping
nodemask.

cpuset_current_mems_allowed will be passed in to pdflush
background_writeout by try_to_free_pages and balance_dirty_pages.
balance_dirty_pages also passes the nodemask in to writeback_inodes
directly when doing active reclaim.

Other callers do not limit inode writeback, passing in a NULL nodemask
pointer.

A final change is to get_dirty_limits. It takes a nodemask argument, and
when it is null there is no change in behavior. If the nodemask is set,
page statistics are accumulated only for specified nodes, and the
background and throttle dirty ratios will be read from a new per-cpuset
ratio feature.

These patches are mostly unchanged from Chris Lameter's original
changelist posted previously to linux-mm.
-
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, September 11, 2007 - 9:32 pm

Perform writeback and dirty throttling with awareness of cpuset mem_allowed.

The theory of operation has two primary elements:

1. Add a nodemask per mapping which indicates the nodes
   which have set PageDirty on any page of the mappings.

2. Add a nodemask argument to wakeup_pdflush() which is
   propagated down to sync_sb_inodes.

This leaves sync_sb_inodes() with two nodemasks. One is passed to it and
specifies the nodes the caller is interested in syncing, and will either
be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the
caller's context.

The second nodemask is attached to the inode's mapping and shows who has
modified data in the inode. sync_sb_inodes() will then skip syncing of
inodes if the nodemask argument does not intersect with the mapping
nodemask.

cpuset_current_mems_allowed will be passed in to pdflush
background_writeout by try_to_free_pages and balance_dirty_pages.
balance_dirty_pages also passes the nodemask in to writeback_inodes
directly when doing active reclaim.

Other callers do not limit inode writeback, passing in a NULL nodemask
pointer.

A final change is to get_dirty_limits. It takes a nodemask argument, and
when it is null there is no change in behavior. If the nodemask is set,
page statistics are accumulated only for specified nodes, and the
background and throttle dirty ratios will be read from a new per-cpuset
ratio feature.

For testing I did a variety of basic tests, verifying individual
features of the test. To verify that it fixes the core problem, I
created a stress test which involved using cpusets and mems_allowed
to split memory so that all daemons had memory set aside for them, and
my memory stress test had a separate set of memory. The stress test was
mmaping 7GB of a very large file on disk. It then scans the entire 7GB
of memory reading and modifying each byte. 7GB is more than the amount
of physical memory made available to the stress test.

Using iostat I can see the initial period of reading from dis...
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, September 11, 2007 - 9:42 pm

Per cpuset dirty ratios

This implements dirty ratios per cpuset. Two new files are added
to the cpuset directories:

background_dirty_ratio	Percentage at which background writeback starts

throttle_dirty_ratio	Percentage at which the application is throttled
			and we start synchrononous writeout.

Both variables are set to -1 by default which means that the global
limits (/proc/sys/vm/vm_dirty_ratio and /proc/sys/vm/dirty_background_ratio)
are used for a cpuset.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 5/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 5/include/linux/cpuset.h	2007-09-11 14:50:48.000000000 -0700
+++ 7/include/linux/cpuset.h	2007-09-11 14:51:12.000000000 -0700
@@ -77,6 +77,7 @@ extern void cpuset_track_online_nodes(vo
 
 extern int current_cpuset_is_being_rebound(void);
 
+extern void cpuset_get_current_ratios(int *background, int *ratio);
 /*
  * We need macros since struct address_space is not defined yet
  */
diff -uprN -X 0/Documentation/dontdiff 5/kernel/cpuset.c 7/kernel/cpuset.c
--- 5/kernel/cpuset.c	2007-09-11 14:50:49.000000000 -0700
+++ 7/kernel/cpuset.c	2007-09-11 14:56:18.000000000 -0700
@@ -51,6 +51,7 @@
 #include &lt;linux/time.h&gt;
 #include &lt;linux/backing-dev.h&gt;
 #include &lt;linux/sort.h&gt;
+#include &lt;linux/writeback.h&gt;
 
 #include &lt;asm/uaccess.h&gt;
 #include &lt;asm/atomic.h&gt;
@@ -92,6 +93,9 @@ struct cpuset {
 	int mems_generation;
 
 	struct fmeter fmeter;		/* memory_pressure filter */
+
+	int background_dirty_ratio;
+	int throttle_dirty_ratio;
 };
 
 /* Retrieve the cpuset for a container */
@@ -169,6 +173,8 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 &lt;&lt; CS_CPU_EXCLUSIVE) | (1 &lt;&lt; CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.background_dirty_ratio = -1,
+	.throttle_dirty_ratio = -1,
 };
 
 ...
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Friday, September 14, 2007 - 7:15 pm

On Tue, 11 Sep 2007 18:42:16 -0700


ditto?


-
To: Andrew Morton <akpm@...>
Cc: Ethan Solomita <solo@...>, <linux-mm@...>, <pj@...>, LKML <linux-kernel@...>
Date: Monday, September 17, 2007 - 3:00 pm

Locking is wrong here. The lock needs to be taken before the cs pointer 

It is required to take the task lock while dereferencing the tasks cpuset 
pointer.
-
To: Christoph Lameter <clameter@...>
Cc: Andrew Morton <akpm@...>, <linux-mm@...>, <pj@...>, LKML <linux-kernel@...>
Date: Tuesday, September 18, 2007 - 8:23 pm

I think we can just remove the callback_mutex lock. Since the change is
coming from an update to a cpuset filesystem file, the cpuset is not
going anywhere since the inode is open. And I don't see that any code

	Agreed.
	-- Ethan
-
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, September 11, 2007 - 9:41 pm

Throttle VM writeout in a cpuset aware way

This bases the vm throttling from the reclaim path on the dirty ratio
of the cpuset. Note that a cpuset is only effective if shrink_zone is called
from direct reclaim.

kswapd has a cpuset context that includes the whole machine. VM throttling
will only work during synchrononous reclaim and not  from kswapd.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h	2007-09-11 14:49:47.000000000 -0700
+++ 5/include/linux/writeback.h	2007-09-11 14:50:52.000000000 -0700
@@ -94,7 +94,7 @@ static inline void inode_sync_wait(struc
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes,gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c	2007-09-11 14:49:47.000000000 -0700
+++ 5/mm/page-writeback.c	2007-09-11 14:50:52.000000000 -0700
@@ -386,7 +386,7 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
 {
 	struct dirty_limits dl;
 
@@ -401,7 +401,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 
 	for ( ; ; ) {
-		get_dirty_limits(&amp;dl, NULL, &amp;node_online_map);
+		get_dirty_limits(&amp;dl, NULL, nodes);
 
 		/*
 		 * Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c	2007-09-11 14:50:41.000000000 -0700
+++ 5/mm/vmscan.c	2007-09-11 14:50:52.000000000 -0700
@@ -1185,7 +1185,...
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, September 11, 2007 - 9:40 pm

Direct reclaim: cpuset aware writeout

During direct reclaim we traverse down a zonelist and are carefully
checking each zone if its a member of the active cpuset. But then we call
pdflush without enforcing the same restrictions. In a larger system this
may have the effect of a massive amount of pages being dirtied and then either

A. No writeout occurs because global dirty limits have not been reached

or

B. Writeout starts randomly for some dirty inode in the system. Pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c	2007-09-11 14:41:56.000000000 -0700
+++ 4/mm/vmscan.c	2007-09-11 14:50:41.000000000 -0700
@@ -1301,7 +1301,8 @@ unsigned long do_try_to_free_pages(struc
 		 */
 		if (total_scanned &gt; sc-&gt;swap_cluster_max +
 					sc-&gt;swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				&amp;cpuset_current_mems_allowed);
 			sc-&gt;may_writepage = 1;
 		}
 

-
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, September 11, 2007 - 9:39 pm

Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset.

If f.e a cpuset contains only 1/10th of available memory then all of the
memory of a cpuset can be dirtied without any writes being triggered.
If all of the cpusets memory is dirty then only 10% of total memory is dirty.
The background writeback threshold is usually set at 10% and the synchrononous
threshold at 40%. So we are still below the global limits while the dirty
ratio in the cpuset is 100%! Writeback throttling and background writeout
do not work at all in such scenarios.

This patch makes dirty writeout cpuset aware. When determining the
dirty limits in get_dirty_limits() we calculate values based on the
nodes that are reachable from the current process (that has been
dirtying the page). Then we can trigger writeout based on the
dirty ratio of the memory in the cpuset.

We trigger writeout in a a cpuset specific way. We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset. If an inode fulfills that requirement then we begin writeout
of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite
an expensive operation (in particular for large cpusets with hundreds of
nodes) compared to just accessing the global counters if we do not have
a cpuset. However, please remember that the global counters were only
introduced recently. Before 2.6.18 we did add up per processor
counters for each processor on each invocation of get_dirty_limits().
We now add per node information which I think is equal or less effort
since there are less nodes than processors.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Signed-off-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c
--- 2/mm/page-writeback.c	2007-09-11 14:39:22.000000000 -0700
+++ 3/mm/page-writeback...
To: Ethan Solomita <solo@...>
Cc: Andrew Morton <akpm@...>, <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, September 11, 2007 - 9:38 pm

pdflush: Allow the passing of a nodemask parameter

If we want to support nodeset specific writeout then we need a way
to communicate the set of nodes that an operation should affect.

So add a nodemask_t parameter to the pdflush functions and also
store the nodemask in the pdflush control structure.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c
--- 1/fs/buffer.c	2007-09-11 14:36:24.000000000 -0700
+++ 2/fs/buffer.c	2007-09-11 14:39:22.000000000 -0700
@@ -372,7 +372,7 @@ static void free_more_memory(void)
 	struct zone **zones;
 	pg_data_t *pgdat;
 
-	wakeup_pdflush(1024);
+	wakeup_pdflush(1024, NULL);
 	yield();
 
 	for_each_online_pgdat(pgdat) {
diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c
--- 1/fs/super.c	2007-09-11 14:36:05.000000000 -0700
+++ 2/fs/super.c	2007-09-11 14:39:22.000000000 -0700
@@ -616,7 +616,7 @@ int do_remount_sb(struct super_block *sb
 	return 0;
 }
 
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
 {
 	struct super_block *sb;
 
@@ -644,7 +644,7 @@ static void do_emergency_remount(unsigne
 
 void emergency_remount(void)
 {
-	pdflush_operation(do_emergency_remount, 0);
+	pdflush_operation(do_emergency_remount, 0, NULL);
 }
 
 /*
diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c
--- 1/fs/sync.c	2007-09-11 14:36:05.000000000 -0700
+++ 2/fs/sync.c	2007-09-11 14:39:22.000000000 -0700
@@ -21,9 +21,9 @@
  * sync everything.  Start out by waking pdflush, because that writes back
  * all queues in parallel.
  */
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
 {
-	wakeup_pdflush(0);
+	wakeup_pdflush(0, NULL);
 	sync_inodes(0);		/* All mappings, inodes and their blockdevs */
 	DQUOT_SYNC(NULL);
 	sync_supers();		/* Write...
To: Ethan Solomita <solo@...>
Cc: Andrew Morton <akpm@...>, <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, September 11, 2007 - 9:36 pm

To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Friday, September 14, 2007 - 7:15 pm

On Tue, 11 Sep 2007 18:36:34 -0700

I get a tremendous number of rejects trying to wedge this stuff on top of
Peter's mm-dirty-balancing-for-tasks changes.  More rejects than I am
prepared to partially-fix so that I can usefully look at these changes in

It'd be nice to see some discussion regarding the memory consumption of

The patch is sprinkled full of this conditional.

  I don't understand why this is being done.  afaict it isn't described
  in a code comment (it should be) nor even in the changelogs?

  Given its overall complexity and its likelihood to change in the
  future, I'd suggest that this conditional be centralised in a single
  place.  Something like

  /*
   * nice comment goes here
   */
  #if MAX_NUMNODES &lt;= BITS_PER_LONG
  #define CPUSET_DIRTY_LIMITS 1
  #else
  #define CPUSET_DIRTY_LIMITS 0
  #endif

  Then use #if CPUSET_DIRTY_LIMITS everywhere else.

  (This is better than #ifdef CPUSET_DIRTY_LIMITS because we'll et a
  warning if someone typos '#if CPUSET_DITRY_LIMITS')

  Even better would be to calculate CPUSET_DIRTY_LIMITS within Kconfig,
  but I suspect you'll need to jump through unfeasible hoops to do that

afacit there is no code comment and no changelog text which explains the
above design decision?  There should be, please.

There is talk of making cpusets available with CONFIG_SMP=n.  Will this new

That comment is a bit terse.  It's always good to be lavish when commenting


hmm, OK, there's a hint as to wghat's going on.

It's unobvious why the break point is at MAX_NUMNODES = BITS_PER_LONG and
we might want to tweak that in the future.  Yet another argument for

Does it have to be atomic?  atomic is weak and can fail.

If some callers can do GFP_KERNEL and some can only do GFP_ATOMIC then we
should at least pass the gfp_t into this function so it can do the stronger

Can this race with cpuset_update_dirty_nodes()?  And with itself?  If not,
-
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, September 18, 2007 - 8:51 pm

This isn't surprising. We're both changing the calculation of dirty
limits. If his code is already into your workspace, then I'll have to do


	I can add something like this. Probably something like:



	I'm not sure how useful it would be in that scenario, but for
consistency we should still be able to specify varying dirty ratios
(from patch 6/6). The above code wouldn't mean anything SMP=n since
there's only the one node. We'd just be indicating whether the inode has


	I'll add a comment to make it more obvious. The point is to only add
one long to the data structure, no matter sizeof(nodemask_t), adding

	I was going to say that sanity would be improved by just allocing the
nodemask at inode alloc time. A failure here could be a problem because
below cpuset_intersects_dirty_nodes() assumes that a NULL nodemask
pointer means that there are no dirty nodes, thus preventing dirty pages
from getting written to disk. i.e. This must never fail.

	Given that we allocate it always at the beginning, I'm leaning towards
just allocating it within mapping no matter its size. It will make the
code much much simpler, and save me writing all the comments we've been
discussing. 8-)

	How disastrous would this be? Is the need to support a 1024 node system
with 1,000,000 open mostly-read-only files thus needing to spend 120MB

	I'll add a comment. Such a race should not be possible. It is called
only from clear_inode() which is used when the inode is being freed
"with extreme prejudice" (from its comments). I can add a check that

-
To: Ethan Solomita <solo@...>
Cc: Andrew Morton <akpm@...>, <linux-mm@...>, LKML <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 1:06 pm

Hmmm. It should assume that there is no tracking thus any node can be 

Consider that a 1024 node system has more than 4TB of memory. If that 
system is running as a fileserver then you get into some issues. But then 
120MB are not that big of a deal. Its more the cache footprint issue I 
would think. Having a NULL there avoids touching a 128 byte nodemask. I 

There is already a comment saying that it cannot happen.
-
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, September 18, 2007 - 10:14 pm

None of this is very nice.  Yes, it would be good to save all that memory
and yes, I_DIRTY_PAGES inodes are very much the uncommon case.

But if a failed GFP_ATOMIC allocation results in data loss then that's a
showstopper.

How hard would it be to handle the allocation failure in a more friendly
manner?  Say, if the allocation failed then point mapping-&gt;dirty_nodes at
some global all-ones nodemask, and then special-case that nodemask in the

Sounds sane.


-
To: Andrew Morton <akpm@...>
Cc: Ethan Solomita <solo@...>, <linux-mm@...>, LKML <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 1:08 pm

Ack. However, the situation dirty_nodes == NULL &amp;&amp; inode dirty then means 
that unknown nodes are dirty. If we are later are successful with the 
alloc and we know that the pages are dirty in the mapping then the initial 
dirty_nodes must be all ones. If this is the first page to be dirtied then 
we can start with a dirty_nodes mask of all zeros like now.

-
To: Andrew Morton <akpm@...>
Cc: Ethan Solomita <solo@...>, <linux-mm@...>, LKML <linux-kernel@...>
Date: Monday, September 17, 2007 - 3:10 pm

On small NUMA system &lt; 64 nodes you basically have an additional word 
added to the address_space structure. On large NUMA we may have to 
allocate very large per node arrays of up to 1024 nodes which may result 
in 128 bytes used. However, the very large systems are rare thus we want 

Hmmmm... There was an extensive discussion on that one early in the year 
and I had it in the overview of the earlier releases. Ethan: Maybe get 

Yes. Cpuset with CONFIG_SMP=n has been run for years on IA64 with Tony 

So 


A pointer has the size of BITS_PER_LONG. If we have less nodes then we 

This is called from __set_page_dirty_nobuffers. There is no allocation 
context available at that point. If the allocation fails then we do not 

You just quoted the comment above that explains the locking requirements. 
Here it is again:

/*
+ * Special functions for NUMA systems with a large number of nodes.
+ * The nodemask is pointed to from the address space structures.
+ * The attachment of the dirty_node mask is protected by the
+ * tree_lock. The nodemask is freed only when the inode is cleared
+ * (and therefore unused, thus no locking necessary).
+ */

-
To: Andrew Morton <akpm@...>
Cc: Ethan Solomita <solo@...>, <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Friday, September 14, 2007 - 7:47 pm

Looks like just an optimization to me ... Ethan wants to economize and not bloat
struct address_space too much.

So, if sizeof(nodemask_t) == sizeof(long), i.e. when:
MAX_NUMNODES &lt;= BITS_PER_LONG, then we'll be adding only sizeof(long)
extra bytes to the struct (by plonking the object itself into it).

But even when MAX_NUMNODES &gt; BITS_PER_LONG, because we're storing
a pointer, and because sizeof(void *) == sizeof(long), so again the maximum
bloat addition to struct address_space would only be sizeof(long) bytes.

I didn't see the original mail, but if the #ifdeffery for this
conditional is too much
as a result of this optimization, Ethan should probably just do away
with all of it
entirely, and simply put a full nodemask_t object (irrespective of MAX_NUMNODES)
into the struct. After all, struct task_struct does the same unconditionally ...
but admittedly, there are several times more address_space struct's resident in
memory at any given time than there are task_struct's, so this optimization does
-
To: Satyam Sharma <satyam.sharma@...>
Cc: Ethan Solomita <solo@...>, <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Friday, September 14, 2007 - 8:07 pm

On Sat, 15 Sep 2007 05:17:48 +0530

yup.

Note that "It's unobvious" != "It's unobvious to me".  I review code for

I think the optimisation is (probably) desirable, but it would be best to
describe the tradeoff in the changelog and to add some suitable
code-commentary for those who read the code in a year's time and to avoid
sprinkling the logic all over the tree.

-
To: Andrew Morton <akpm@...>
Cc: Ethan Solomita <solo@...>, <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Friday, September 14, 2007 - 8:16 pm

True, the other option could be to put the /pointer/ in there unconditionally,
but that would slow down the MAX_NUMNODES &lt;= BITS_PER_LONG case,
which (after grepping through defconfigs) appears to be the common case on
all archs other than ia64. So I think your idea of making that conditional
centralized in the code with an accompanying comment is the way to go here ...


Satyam
-
To: Satyam Sharma <satyam.sharma@...>
Cc: Andrew Morton <akpm@...>, Ethan Solomita <solo@...>, <linux-mm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Monday, September 17, 2007 - 2:37 pm

Satyam Sharma wrote:

It won't be long before arch's other than ia64 will have
MAX_NUMNODES &gt; BITS_PER_LONG. While it won't be the norm,
we should account for it now.

Thanks,
Mike
-
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 4:18 pm

On Tue, 17 Jul 2007 14:23:14 -0700

Thanks for keeping these patches up to date. Add you signoff if you
did modifications to a patch. Also include the description of the tests
in the introduction to the patchset.
-
To: Christoph Lameter <clameter@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 5:30 pm

So switch from an Ack to a signed-off? OK, and I'll add descriptions of
testing. Everyone other than you has been silent on these patches. Does
silence equal consent?
	-- Ethan


-
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, July 23, 2007 - 5:53 pm

Sometimes. Howerver, the audience for NUMA and cpusets is rather limited.
-
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Tuesday, July 17, 2007 - 5:37 pm

Per cpuset dirty ratios

This implements dirty ratios per cpuset. Two new files are added
to the cpuset directories:

background_dirty_ratio	Percentage at which background writeback starts

throttle_dirty_ratio	Percentage at which the application is throttled
			and we start synchrononous writeout.

Both variables are set to -1 by default which means that the global
limits (/proc/sys/vm/vm_dirty_ratio and /proc/sys/vm/dirty_background_ratio)
are used for a cpuset.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.22-rc6-mm1

diff -uprN -X 0/Documentation/dontdiff 6/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 6/include/linux/cpuset.h	2007-07-11 21:17:08.000000000 -0700
+++ 7/include/linux/cpuset.h	2007-07-11 21:17:41.000000000 -0700
@@ -76,6 +76,7 @@ extern void cpuset_track_online_nodes(vo
 
 extern int current_cpuset_is_being_rebound(void);
 
+extern void cpuset_get_current_ratios(int *background, int *ratio);
 /*
  * We need macros since struct address_space is not defined yet
  */
diff -uprN -X 0/Documentation/dontdiff 6/kernel/cpuset.c 7/kernel/cpuset.c
--- 6/kernel/cpuset.c	2007-07-12 12:15:20.000000000 -0700
+++ 7/kernel/cpuset.c	2007-07-12 12:15:34.000000000 -0700
@@ -51,6 +51,7 @@
 #include &lt;linux/time.h&gt;
 #include &lt;linux/backing-dev.h&gt;
 #include &lt;linux/sort.h&gt;
+#include &lt;linux/writeback.h&gt;
 
 #include &lt;asm/uaccess.h&gt;
 #include &lt;asm/atomic.h&gt;
@@ -92,6 +93,9 @@ struct cpuset {
 	int mems_generation;
 
 	struct fmeter fmeter;		/* memory_pressure filter */
+
+	int background_dirty_ratio;
+	int throttle_dirty_ratio;
 };
 
 /* Update the cpuset for a container */
@@ -175,6 +179,8 @@ static struct cpuset top_cpuset = {
 	.flags = ((1 &lt;&lt; CS_CPU_EXCLUSIVE) | (1 &lt;&lt; CS_MEM_EXCLUSIVE)),
 	.cpus_allowed = CPU_MASK_ALL,
 	.mems_allowed = NODE_MASK_ALL,
+	.background_dirty_ratio = -1,
+	.throttle_dirty_ratio = -1,
 };
 
 /*...
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Tuesday, July 17, 2007 - 5:36 pm

Throttle VM writeout in a cpuset aware way

This bases the vm throttling from the reclaim path on the dirty ratio
of the cpuset. Note that a cpuset is only effective if shrink_zone is called
from direct reclaim.

kswapd has a cpuset context that includes the whole machine. VM throttling
will only work during synchrononous reclaim and not  from kswapd.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.22-rc6-mm1

diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h	2007-07-11 21:16:25.000000000 -0700
+++ 5/include/linux/writeback.h	2007-07-11 21:16:50.000000000 -0700
@@ -95,7 +95,7 @@ static inline void inode_sync_wait(struc
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes,gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c	2007-07-16 18:31:13.000000000 -0700
+++ 5/mm/page-writeback.c	2007-07-16 18:32:08.000000000 -0700
@@ -384,7 +384,7 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
 {
 	struct dirty_limits dl;
 
@@ -399,7 +399,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 
 	for ( ; ; ) {
-		get_dirty_limits(&amp;dl, NULL, &amp;node_online_map);
+		get_dirty_limits(&amp;dl, NULL, nodes);
 
 		/*
 		 * Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c	2007-07-11 21:16:26.000000000 -0700
+++ 5/mm/vmscan.c	2007-07-11 21:16:50.000000000 -0700
@@ -1064,7 +1064,...
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Tuesday, July 17, 2007 - 5:35 pm

Direct reclaim: cpuset aware writeout

During direct reclaim we traverse down a zonelist and are carefully
checking each zone if its a member of the active cpuset. But then we call
pdflush without enforcing the same restrictions. In a larger system this
may have the effect of a massive amount of pages being dirtied and then either

A. No writeout occurs because global dirty limits have not been reached

or

B. Writeout starts randomly for some dirty inode in the system. Pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.22-rc6-mm1

diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c	2007-07-11 21:16:14.000000000 -0700
+++ 4/mm/vmscan.c	2007-07-11 21:16:26.000000000 -0700
@@ -1183,7 +1183,8 @@ unsigned long try_to_free_pages(struct z
 		 */
 		if (total_scanned &gt; sc.swap_cluster_max +
 					sc.swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+			wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+				&amp;cpuset_current_mems_allowed);
 			sc.may_writepage = 1;
 		}
 

-
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Tuesday, July 17, 2007 - 5:34 pm

Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset.

If f.e a cpuset contains only 1/10th of available memory then all of the
memory of a cpuset can be dirtied without any writes being triggered.
If all of the cpusets memory is dirty then only 10% of total memory is dirty.
The background writeback threshold is usually set at 10% and the synchrononous
threshold at 40%. So we are still below the global limits while the dirty
ratio in the cpuset is 100%! Writeback throttling and background writeout
do not work at all in such scenarios.

This patch makes dirty writeout cpuset aware. When determining the
dirty limits in get_dirty_limits() we calculate values based on the
nodes that are reachable from the current process (that has been
dirtying the page). Then we can trigger writeout based on the
dirty ratio of the memory in the cpuset.

We trigger writeout in a a cpuset specific way. We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset. If an inode fulfills that requirement then we begin writeout
of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite
an expensive operation (in particular for large cpusets with hundreds of
nodes) compared to just accessing the global counters if we do not have
a cpuset. However, please remember that the global counters were only
introduced recently. Before 2.6.18 we did add up per processor
counters for each processor on each invocation of get_dirty_limits().
We now add per node information which I think is equal or less effort
since there are less nodes than processors.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.22-rc6-mm1

diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c
--- 2/mm/page-writeback.c	2007-07-11 21:15:47.000000000 -0700
+++ 3/mm/page-writeback.c	20...
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Tuesday, July 17, 2007 - 5:33 pm

pdflush: Allow the passing of a nodemask parameter

If we want to support nodeset specific writeout then we need a way
to communicate the set of nodes that an operation should affect.

So add a nodemask_t parameter to the pdflush functions and also
store the nodemask in the pdflush control structure.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.22-rc6-mm1

diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c
--- 1/fs/buffer.c	2007-07-11 21:08:04.000000000 -0700
+++ 2/fs/buffer.c	2007-07-11 21:15:47.000000000 -0700
@@ -359,7 +359,7 @@ static void free_more_memory(void)
 	struct zone **zones;
 	pg_data_t *pgdat;
 
-	wakeup_pdflush(1024);
+	wakeup_pdflush(1024, NULL);
 	yield();
 
 	for_each_online_pgdat(pgdat) {
diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c
--- 1/fs/super.c	2007-07-11 21:07:41.000000000 -0700
+++ 2/fs/super.c	2007-07-11 21:15:47.000000000 -0700
@@ -615,7 +615,7 @@ int do_remount_sb(struct super_block *sb
 	return 0;
 }
 
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
 {
 	struct super_block *sb;
 
@@ -643,7 +643,7 @@ static void do_emergency_remount(unsigne
 
 void emergency_remount(void)
 {
-	pdflush_operation(do_emergency_remount, 0);
+	pdflush_operation(do_emergency_remount, 0, NULL);
 }
 
 /*
diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c
--- 1/fs/sync.c	2007-07-11 21:07:41.000000000 -0700
+++ 2/fs/sync.c	2007-07-11 21:15:47.000000000 -0700
@@ -21,9 +21,9 @@
  * sync everything.  Start out by waking pdflush, because that writes back
  * all queues in parallel.
  */
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
 {
-	wakeup_pdflush(0);
+	wakeup_pdflush(0, NULL);
 	sync_inodes(0);		/* All mappings, inodes and their blockdevs */
 	DQUOT_SYNC(NULL);
 	sync_supers();		/* Write...
To: Ethan Solomita <solo@...>
Cc: <linux-mm@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Tuesday, July 17, 2007 - 5:32 pm

Add a dirty map to struct address_space

In a NUMA system it is helpful to know where the dirty pages of a mapping
are located. That way we will be able to implement writeout for applications
that are constrained to a portion of the memory of the system as required by
cpusets.

This patch implements the management of dirty node maps for an address
space through the following functions:

cpuset_clear_dirty_nodes(mapping)	Clear the map of dirty nodes

cpuset_update_nodes(mapping, page)	Record a node in the dirty nodes map

cpuset_init_dirty_nodes(mapping)	First time init of the map


The dirty map may be stored either directly in the mapping (for NUMA
systems with less then BITS_PER_LONG nodes) or separately allocated
for systems with a large number of nodes (f.e. IA64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large
configurations. Therefore we protect the allocation and setting
of a node in the map through the tree_lock. The tree_lock is
already taken when a page is dirtied so there is no additional
locking overhead if we insert the updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared.
At that point no pages are attached to the inode anymore and therefore it can
be done without any locking. The dirty map therefore records all nodes that
have been used for dirty pages by that inode until the inode is no longer
used.

Signed-off-by: Christoph Lameter &lt;clameter@sgi.com&gt;
Acked-by: Ethan Solomita &lt;solo@google.com&gt;

---

Patch against 2.6.22-rc6-mm1

diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c
--- 0/fs/buffer.c	2007-07-11 20:30:55.000000000 -0700
+++ 1/fs/buffer.c	2007-07-11 21:08:04.000000000 -0700
@@ -41,6 +41,7 @@
 #include &lt;linux/bitops.h&gt;
 #include &lt;linux/mpage.h&gt;
 #include &lt;linux/bit_spinlock.h&gt;
+#include &lt;linux/cpuset.h&gt;
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -710,6 +711,7 @@ static...
Previous thread: signalfd and thread semantics by Michael Kerrisk on Tuesday, July 17, 2007 - 4:44 pm. (3 messages)

Next thread: [PATCH] smp_call_function_single() should be a macro on UP by Al Viro on Tuesday, July 17, 2007 - 5:29 pm. (5 messages)
speck-geostationary