Re: [patch 2/2] cpusets: add interleave_over_allowed option

From: David Rientjes
Date: Thursday, October 25, 2007 - 3:54 pm

Extract a helper function from update_nodemask() to load an array of
mm_struct pointers with references to each task's mm_struct that is
currently attached to a given cpuset.

This will be used later for other purposes where memory policies need to
be rebound for each task attached to a cpuset.

Cc: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 kernel/cpuset.c |  130 ++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 81 insertions(+), 49 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -702,6 +702,79 @@ done:
 	/* Don't kfree(doms) -- partition_sched_domains() does that. */
 }
 
+/*
+ * Loads mmarray with pointers to all the mm_struct's of tasks attached to
+ * cpuset cs.
+ *
+ * The reference count to each mm is incremented before loading it into the
+ * array, so put_cpuset_mm_array() must be called after this function to
+ * decrement each reference count and free the memory allocated for mmarray
+ * via this function.
+ */
+static struct mm_struct **get_cpuset_mm_array(const struct cpuset *cs,
+					      int *ntasks)
+{
+	struct mm_struct **mmarray;
+	struct task_struct *p;
+	struct cgroup_iter it;
+	int count;
+	int fudge;
+
+	*ntasks = 0;
+	fudge = 10;				/* spare mmarray[] slots */
+	fudge += cpus_weight(cs->cpus_allowed);	/* imagine one fork-bomb/cpu */
+	/*
+	 * Allocate mmarray[] to hold mm reference for each task in cpuset cs.
+	 * Can't kmalloc GFP_KERNEL while holding tasklist_lock.  We could use
+	 * GFP_ATOMIC, but with a few more lines of code, we can retry until
+	 * we get a big enough mmarray[] w/o using GFP_ATOMIC.
+	 */
+	while (1) {
+		count = cgroup_task_count(cs->css.cgroup);  /* guess */
+		count += fudge;
+		mmarray = kmalloc(count * sizeof(*mmarray), GFP_KERNEL);
+		if (!mmarray)
+			return ...
From: David Rientjes
Date: Thursday, October 25, 2007 - 3:54 pm

Adds a new 'interleave_over_allowed' option to cpusets.

When a task with an MPOL_INTERLEAVE memory policy is attached to a cpuset
with this option set, the interleaved nodemask becomes the cpuset's
mems_allowed.  When the cpuset's mems_allowed changes, the interleaved
nodemask for all tasks with MPOL_INTERLEAVE memory policies is also
updated to be the new mems_allowed nodemask.

This allows applications to specify that they want to interleave over all
nodes that they are allowed to access.  This set of nodes can be changed
at any time via the cpuset interface and each individual memory policy is
updated to reflect the changes for all attached tasks when this option is
set.

Cc: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cpusets.txt |   30 +++++++++++++++++++-
 include/linux/cpuset.h    |    6 ++++
 kernel/cpuset.c           |   64 +++++++++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c            |    6 ++++
 4 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -20,7 +20,8 @@ CONTENTS:
   1.5 What is memory_pressure ?
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
-  1.8 How do I use cpusets ?
+  1.8 What is interleave_over_allowed ?
+  1.9 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -497,7 +498,32 @@ the cpuset code to update these sched domains, it compares the new
 partition requested with the current, and updates its sched domains,
 removing the old and adding the new, for each change.
 
-1.8 How do I use cpusets ?
+1.8 What is interleave_over_allowed ?
+-------------------------------------
+
+Tasks may specify a memory policy of MPOL_INTERLEAVE with the desired
+result of ...
From: Christoph Lameter
Date: Thursday, October 25, 2007 - 4:37 pm

More interactions between cpusets and memory policies. We have to be 
careful here to keep clean semantics.

Isn't it a bit surprising for an application that has set up a custom
MPOL_INTERLEAVE policy if the nodes suddenly change because of a cpuset or 
mems_allowed change?


-

From: David Rientjes
Date: Thursday, October 25, 2007 - 4:56 pm

Every MPOL_INTERLEAVE policy is a custom policy that the application has 
set up.  If you don't use cpusets at all, the nodemask you pass to
set_mempolicy() with MPOL_INTERLEAVE is static and won't change without 
the application's knowledge.  It has full control over the nodemask that 
it desires to interleave over.

The problem occurs when you add cpusets into the mix and permit the 
allowed nodes to change without the application's knowledge.  Right now,
a simple remap is done so if the cardinality of the set of nodes 
decreases, you're interleaving over a smaller number of nodes.  If the 
cardinality increases, your interleaved nodemask isn't expanded.  That's 
the problem that we're facing.  The remap itself is troublesome because it 
doesn't take into account the user's desire for a custom nodemask to be 
used anyway; it could remap an interleaved policy over several nodes that 
will already be contended with one another.
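
For illustration, the remap described above can be modelled in user space
on 64-bit masks.  This is a sketch, not the kernel's code: remap_mask(),
mask_weight(), ord_to_pos() and pos_to_ord() are names invented for the
example, and the rule (the n-th allowed node maps to the n-th node of the
new set, wrapping modulo the new set's size) is assumed from how the
existing behaviour is described in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits in a 64-node mask. */
static int mask_weight(uint64_t m)
{
	int w = 0;

	for (; m; m &= m - 1)
		w++;
	return w;
}

/* Position of the ord-th (0-based) set bit in m, or -1 if there is none. */
static int ord_to_pos(uint64_t m, int ord)
{
	for (int pos = 0; pos < 64; pos++) {
		if (!(m & (UINT64_C(1) << pos)))
			continue;
		if (ord-- == 0)
			return pos;
	}
	return -1;
}

/* Number of set bits in m below position pos. */
static int pos_to_ord(uint64_t m, int pos)
{
	return mask_weight(m & ((UINT64_C(1) << pos) - 1));
}

/*
 * Remap policy mask pol from old_allowed to new_allowed.  The n-th node of
 * old_allowed maps to the (n mod weight(new_allowed))-th node of
 * new_allowed; nodes of pol outside old_allowed are kept unchanged.
 * This model requires new_allowed to be non-empty.
 */
static uint64_t remap_mask(uint64_t pol, uint64_t old_allowed,
			   uint64_t new_allowed)
{
	int w = mask_weight(new_allowed);
	uint64_t out = 0;

	assert(w > 0);
	for (int pos = 0; pos < 64; pos++) {
		uint64_t bit = UINT64_C(1) << pos;

		if (!(pol & bit))
			continue;
		if (!(old_allowed & bit))
			out |= bit;
		else
			out |= UINT64_C(1) << ord_to_pos(new_allowed,
					pos_to_ord(old_allowed, pos) % w);
	}
	return out;
}
```

With nodes 0-3 allowed and an interleave over all four, shrinking to nodes
0-1 gives remap_mask(0xF, 0xF, 0x3) == 0x3.  Growing from 0-1 back to 0-3
gives remap_mask(0x3, 0x3, 0xF) == 0x3, so the interleave mask is not
expanded.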

Normally, MPOL_INTERLEAVE is used to reduce bus contention to improve the 
throughput of the application.  If you remap the number of nodes to 
interleave over, which is currently how it's done when mems_allowed 
changes, you could actually be increasing latency because you're 
interleaving over the same bus.

This isn't a memory policy problem because all it does is effect a 
specific policy over a set of nodes.  With my change, cpusets are required 
to update the interleaved nodemask if the user specified that they desire 
the feature with interleave_over_allowed.  Cpusets are, after all, the 
ones that changed the mems_allowed in the first place and invalidated our 
custom interleave policy.  We simply can't make inferences about what we 
should do, so we allow the creator of the cpuset to specify it for us.  So 
the proper place to modify an interleaved policy is in cpusets and not 
mempolicy itself.

		David
-

From: Christoph Lameter
Date: Thursday, October 25, 2007 - 5:28 pm

Right. So I think we are fine if the application cannot setup boundaries 

Well you may hit some nodes more than others so a slight performance 

With that MPOL_INTERLEAVE would be context dependent and no longer 
needs translation. Lee had similar ideas. Lee: Could we make 
MPOL_INTERLEAVE generally cpuset context dependent?

-

From: Paul Jackson
Date: Thursday, October 25, 2007 - 6:55 pm

Well ... MPOL_INTERLEAVE already is essentially cpuset relative.

So long as the cpuset size (number of allowed memory nodes) doesn't
change, whatever MPOL_INTERLEAVE you set is remapped whenever the
cpuset's 'mems' changes, preserving the cpuset relative interleaving.

The problem, as David explains, comes when cpusets change sizes.
When the cpuset gets smaller, one can still do a pretty good job,
scrunching down the interleave nodes in proportion.  But when the
cpuset gets larger, it's not clear how to convert a subset of a
smaller set, to an equivalent subset of a larger set.

The existing code handled this last case by saying screw it -- don't
expand the set of interleave nodes when the cpuset 'mems' grows.

David's new code handles this last case by adding a new per-cpuset
Boolean that adds a new alternative, forcing all the tasks using
MPOL_INTERLEAVE in that cpuset, anytime thereafter that the cpuset's
'mems' changes, to get interleaved over the entire cpuset.

Now that I spell it out that way, I am having second thoughts about
this one.  It's another special case palliative, given that we can't
give the user what they really want.

David - could you describe the real world situation in which you
are finding that this new 'interleave_over_allowed' option, aka
'memory_spread_user', is useful?  I'm not always opposed to special
case solutions; but they do usually require special case needs to
justify them ;).

I suspect that the general case solution would require having the user
pass in two nodemasks, call them ALL and SUBSET, requesting that
relative to the ALL nodes, interleave be done on the SUBSET nodes.
That way, even if say the task happened to be running in a cpuset with
a -single- allowed memory node at the moment, it could express its user
memory interleave memory needs for the general case of any number of
nodes.  Then for whatever nodes were currently allowed by the cpuset
to that task at any point, the nodes_remap() logic could be done to
derive from the ...
From: David Rientjes
Date: Thursday, October 25, 2007 - 7:11 pm

Yes, when a task with MPOL_INTERLEAVE has its cpuset mems_allowed expanded 
to include more memory.  The task itself can't access all that memory with 
the memory policy of its choice.

Since the cpuset has changed the mems_allowed of the task without its 
knowledge, it would require a constant get_mempolicy() and set_mempolicy() 
loop in the application to catch these changes.  That's obviously not in 
the best interest of anyone.

So my change allows those tasks that have already expressed the desire to 
interleave their memory with MPOL_INTERLEAVE to always use the full range 
of memory available that is dynamically changing beneath them as a result 
of cpusets.  Keep in mind that it is still possible to request an 
interleave only over a subset of allowed mems: but you must do it when you 
create the interleaved mempolicy after it has been attached to the cpuset.
set_mempolicy() changes are always honored.

The only other way to support such a feature is through a modification to 
mempolicies themselves, which Lee has already proposed.  The problem with 
that is it requires mempolicy support for cpuset cases and modification to 
the set_mempolicy() API.  My solution presents a cpuset fix for a cpuset 

I find it hard to believe that a single cpuset with a single 
memory_spread_user boolean is going to include multiple tasks that request 
interleaved mempolicies over differing nodes within the cpuset's 
mems_allowed.  That, to me, is the special case.

		David
-

From: Paul Jackson
Date: Thursday, October 25, 2007 - 7:29 pm

That much I could have guessed (did guess, actually.)

Are you seeing this in a real world situation?  Can you describe the
situation?  I don't mean just describing how it looks to this kernel
code, but what is going on in the system, what sort of job mix or
applications, what kind of users, ...  In short, a "use case", or brief
approximation thereto.  See further:

  http://en.wikipedia.org/wiki/Use_case

I have no need of a full blown use case; just a three sentence
mini-story should suffice.  But it should (if you can, without
revealing proprietary knowledge) describe a situation you have

Yup, that it does.  Note that it is a special case -- "the full range",
not any application controlled specific subset thereof, short of
reissuing set_mempolicy() calls anytime that the application's cpuset

Do you have a link to what Lee proposed?  I agree that a full general
solution would seem to require a new or changed set_mempolicy API,
which may well be more than we want to do, absent a more compelling

That may well be, to you.  To me, pretty much -all- uses of
set_mempolicy() are special cases ;).  I have no way of telling
whether or not there are users who would require multiple tasks
in the same cpuset to have different interleave masks, but since
the API clearly supports that (except when changing cpuset 'mems'
settings mess things up), I have been presuming that somewhere in
the universe, such users exist or might come to exist.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Thursday, October 25, 2007 - 7:45 pm

Yes, when using cpusets for resource control.  If memory pressure is being 
felt for that cpuset and additional mems are added to alleviate possible 
OOM conditions, it is insufficient to allow tasks within that cpuset to 
continue using memory policies that prohibit them from taking advantage of 
the extra memory.

The best remedy for that situation is to give the cpuset owner the option 
of allowing tasks with MPOL_INTERLEAVE policies to always interleave over 
the entire set of available mems so they can be dynamically expanded and 

http://marc.info/?l=linux-mm&m=118849999128086
-

From: Paul Jackson
Date: Thursday, October 25, 2007 - 8:14 pm

Well ... "resource control" is a tad thin for a decent "use case".

But ok ... that's a little more compelling.

The user space man pages for set_mempolicy(2) are now even more
behind the curve, by not mentioning that MPOL_INTERLEAVE's mask
might mean nothing, if (1) in a cpuset marked memory_spread_user,
(2) after the cpuset has changed 'mems'.

I wonder if there is any way to fix that.  Who does the man pages
for Linux system calls?

Hmmm ... that reminds me ... the period of time between when the
task issues the set_mempolicy(2) MPOL_INTERLEAVE call and when some
cpuset 'mems' change subsequently moves its memory placement is an
anomaly here. During that period of time, the MPOL_INTERLEAVE mask
-does- apply, even if a subset of the 'mems' in the task's cpuset.
This could result in test cases missing some failures.  If they
test with a particular, carefully crafted MPOL_INTERLEAVE mask
that is a proper (strictly less than) subset of the nodes allowed
in the cpuset, they might not notice that their code is broken if
they happen to be in a memory_spread_user cpuset after a 'mems'
change has jammed the entire cpuset's 'mems' into their interleave
mask.

Perhaps we should make it so that doing a set_mempolicy(2) call
to set MPOL_INTERLEAVE immediately changes the memory policy to
the cpuset's mems_allowed.

A key advantage in doing this would be that the set_mempolicy user
documentation could simply state that the MPOL_INTERLEAVE mask is
ignored when in a cpuset marked memory_spread_user, instead interleaving
over all the memory nodes in the cpuset.  This would be quite a bit
simpler and clearer than saying that the cpuset's nodes are used only
after subsequent cpuset 'mems' changes.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Thursday, October 25, 2007 - 8:58 pm

Yeah.  They were already outdated in the sense that they did not specify 
that the interleave nodemask could change as a result of a cpuset mems 


Well, sure, but mempolicies already get overridden by cpusets anyway.  For
example, if you were to attach a task with an MPOL_BIND mempolicy to a 
cpuset with a disjoint set of allowed mems.

The important distinction is that you can still interleave over a subset 
of the mems_allowed if you set your memory policy after being attached to 

No, because that would negate the above.  We still want to be able to 
restrict interleaved memory policies to a subset of allowed mems.  This 

I think that documenting the change in the man page as saying that "the 
nodemask will include all allowed nodes if the mems_allowed of a 
memory_spread_user cpuset is expanded" is better.

I've got a few fixes for my patchset queued so I'll resend it later; it's 
mostly style changes but there is a subtle bug where the task changing the 
value of a cpuset's memory_spread_page is not in the same cpuset.

		David
-

From: Paul Jackson
Date: Thursday, October 25, 2007 - 9:34 pm

Ok.  I'm inclined the other way, but not certain enough of my
position to push the point any further.


Ok.  Good work - thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Lee Schermerhorn
Date: Friday, October 26, 2007 - 8:37 am

Michael Kerrisk, whom I've copied, does.  I recently sent in an update
to all of the mempolicy man pages that describe the behavior as it
currently exists.  [I need to send in an update for
MPOL_F_MEMS_ALLOWED].

One of the things that has bothered me is that there are no cpuset man
pages to reference from the mempolicy man pages.  [I know, we can and do
refer to the kernel source Documentation, but that might not be
available to everyone w/o some digging.  "See Also" refs typically point
at other man pages...].  To get around this, I had to talk about "nodes
allowed in the current context" or some such weasel-wording in my
updates.

Paul:  what do you think about subsetting the cpuset.txt into a man page

Lee

-

From: Paul Jackson
Date: Friday, October 26, 2007 - 10:04 am

Oh dear --- looking back in my work queue I have with my employer, I
see I have a task that is now over a year old, still unfinished, to
provide man pages for cpusets to Michael Kerrisk <mtk-manpages@gmx.net>.

So, yes, I agree this would be a "good thing".  I just haven't gotten a
round to it (http://www.quantumenterprises.co.uk/roundtuit/index.htm)
yet.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Lee Schermerhorn
Date: Friday, October 26, 2007 - 10:28 am

I'm a little backed up myself, right now, or I'd offer to take a cut for
you to review.  Once I get some free time [Hah!], I'll check with you
again.  If you get started before then, I'd be happy to review.

Lee

-

From: Michael Kerrisk
Date: Friday, October 26, 2007 - 1:21 pm

Yes, it would be great to have those pages.  Is there anything I can
do to assist?

Cheers,

Michael

PS Note my new address for man-pages: mtk.manpages@gmail.com
-

From: Paul Jackson
Date: Friday, October 26, 2007 - 1:25 pm

Got any spare round tuit's ;)?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Michael Kerrisk
Date: Friday, October 26, 2007 - 1:33 pm

I ran out quite some time ago unfortunately.

Cheers,

Michael
-

From: Lee Schermerhorn
Date: Friday, October 26, 2007 - 8:30 am

Actually, my patch doesn't change the set_mempolicy() API at all, it
just co-opts a currently unused/illegal value for the nodemask to
indicate "all allowed nodes".  Again, I need to provide a libnuma API to
request this.   Soon come, mon...

Here's a link to the last posting of my patch, as Paul requested:

http://marc.info/?l=linux-mm&m=118849999128086&w=4

A bit out of date, but I'll fix that maybe next week.

Lee
<snip>

-

From: David Rientjes
Date: Friday, October 26, 2007 - 11:46 am

If something that was previously unaccepted is now allowed with a 
newly-introduced semantic, that's an API change.
-

From: Paul Jackson
Date: Friday, October 26, 2007 - 12:00 pm

Without at least this sort of change to MPOL_INTERLEAVE nodemasks,
allowing either empty nodemasks (Lee's proposal) or extending them
outside the current cpuset (what I'm cooking up now), there is no way
for a task that is currently confined to a single node cpuset to say
anything about how it wants to be interleaved in the event that it is
subsequently moved to a larger cpuset.  Currently, such a task is only
allowed to pass exactly one particular nodemask to set_mempolicy
MPOL_INTERLEAVE calls, with exactly the one bit corresponding to its
current node.  No useful information can be passed via an API that only
allows a single legal value.

But you knew that ...

You were just correcting my erroneously unqualified statement.  Good.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Friday, October 26, 2007 - 1:45 pm

Well, passing a single node to set_mempolicy() for MPOL_INTERLEAVE doesn't 
make a whole lot of sense in the first place.  I prefer your solution of 
allowing set_mempolicy(MPOL_INTERLEAVE, NODE_MASK_ALL) to mean "interleave 
me over everything I'm allowed to access."  NODE_MASK_ALL would be stored 
in the struct mempolicy and used later on mpol_rebind_policy().

		David
-

From: Christoph Lameter
Date: Friday, October 26, 2007 - 2:05 pm

So instead of an empty nodemask we would pass a nodemask where all bits 
are set? And they would stay set but the cpuset restrictions would 
effectively limit the interleaving to the allowed set?

rebind could ignore rebinds if all bits are set.


-

From: David Rientjes
Date: Friday, October 26, 2007 - 2:08 pm

You would pass NODE_MASK_ALL if your intent was to interleave over 
everything you have access to, yes.  Otherwise you can pass whatever you 
want access to and your interleaved nodemask becomes 
mpol_rebind_policy()'s newmask formal (the cpuset's new mems_allowed) 
AND'd with pol->passed_nodemask.
-

From: Christoph Lameter
Date: Friday, October 26, 2007 - 2:12 pm

We would need two fields in the policy structure

1. The specified nodemask (generally ignored)

2. The effective nodemask (specified & cpuset_mems_allowed)

If we have these two then it's easy to get a bit further by making
the first nodemask a relative nodemask. The calculation of the effective
nodemask changes somewhat but the logic is then applicable to MPOL_BIND as 
well.

-

From: David Rientjes
Date: Friday, October 26, 2007 - 2:15 pm

Agreed.
-

From: Lee Schermerhorn
Date: Friday, October 26, 2007 - 2:13 pm

You don't need to save the entire mask--just note that NODE_MASK_ALL was
passed--like with my internal MPOL_CONTEXT flag.  This would involve
special casing NODE_MASK_ALL in the error checking, as currently
set_mempolicy() complains loudly if you pass non-allowed nodes--see
"contextualize_policy()".  [mbind() on the other hand, appears to allow
any nodemask, even outside the cpuset.  guess we catch this during
allocation.]  This is pretty much the spirit of my patch w/o the API
change/extension [/improvement :)]
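
A rough user-space sketch of that special-casing, for illustration only.
model_check_nodemask() and MODEL_NODE_MASK_ALL are hypothetical names, the
masks are limited to 64 nodes, and the rules for what gets rejected are
assumptions drawn from this thread rather than a copy of what
contextualize_policy() does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NODE_MASK_ALL UINT64_MAX	/* stand-in for NODE_MASK_ALL */

/*
 * Validate a nodemask passed by the application against the cpuset.
 * Returns 0 if the mask is acceptable and sets *want_all to say whether
 * the all-nodes mask was passed.  Returns -EINVAL for an empty mask or
 * for a mask naming nodes outside mems_allowed.
 */
static int model_check_nodemask(uint64_t passed, uint64_t mems_allowed,
				bool *want_all)
{
	assert(want_all != NULL);
	*want_all = false;
	if (passed == MODEL_NODE_MASK_ALL) {
		*want_all = true;	/* special case: skip the subset check */
		return 0;
	}
	if (passed == 0 || (passed & ~mems_allowed))
		return -EINVAL;
	return 0;
}
```

Only the flag needs to be remembered in this variant; the mask to
interleave over can be taken from the cpuset at rebind time.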

For some systems [not mine], the nodemasks can get quite large.  I have
a patch, that I've tested  atop Mel Gorman's "onezonelist" patches that
replaces the nodemasks embedded in struct mempolicy with pointers to
dynamically allocated ones.  However, it's probably not much of a win,
memorywise, if most of the uses are for interleave and bind
policies--both of which would always need the nodemasks in addition to
the pointers.

Now, if we could replace the 'cpuset_mems_allowed' nodemask with a
pointer to something stable, it might be a win.

Lee


-

From: Christoph Lameter
Date: Friday, October 26, 2007 - 2:17 pm

The memory policies are already shared and have refcounters for that 
purpose.
-

From: Lee Schermerhorn
Date: Friday, October 26, 2007 - 2:26 pm

I must have missed that in the code I'm reading :)

Have a nice weekend.

Lee



-

From: Christoph Lameter
Date: Friday, October 26, 2007 - 2:37 pm

What is the benefit of having pointers to nodemasks? We likely would need 
to have refcounts in those nodemasks too? So we duplicate a lot of 
the characteristics of memory policies?
-

From: Lee Schermerhorn
Date: Monday, October 29, 2007 - 8:00 am

Hi, Christoph:

remoting the nodemasks from the mempolicy and allocating them only when
needed is something that you and Mel and I discussed last month, in the
context of Mel's "one zonelist filtered by nodemask" patches.  I just
put together the dynamic nodemask patch [included below FYI, NOT for
serious consideration] to see what it looked like and whether it helped.
Conclusion:  it's ugly/complex [especially trying to keep the nodemasks
embedded for systems that don't require > a pointer's worth of bits] and
they probably don't help much if most uses of non-default mempolicy
requires a nodemask.

I only brought it up again because now you all are considering another
nodemask per policy.  In fact, I only considered it in the first place
because nodemasks on our [HP's] platform don't require more than a
pointer's worth of bits [today, at least--I don't know about future
plans].  However, since we share an arch--ia64--with SGI and distros
don't want to support special kernels for different vendors, if they can
avoid it, we have 1K-bit nodemasks.   Since this is ia64 we're talking
about, most folks don't care.  Now that you're going to do the same for
x86_64, it might become more visible.  Then again, maybe there are few
enough mempolicy structs that no-one will care anyway.

Note:  I don't [didn't] think I need to ref count the nodemasks
associated with the mempolicies because they are allocated when the
mempolicy is and destroyed when the policy is--not shared.  Just like
the custom zonelist for bind policy, and we have no ref count there.
I.e., they're protected by the mempol's ref.  However, now that you
bring it up, I'm wondering about the effects of policy remapping, and
whether we have the reference counting or indirect protection [mmap_sem,
whatever] correct there in current code.  I'll have to take a look.

Lee
From: Paul Jackson
Date: Monday, October 29, 2007 - 10:33 am

The patch David and I are discussing will replace the
cpuset_mems_allowed nodemask in struct mempolicy, not
add a new nodemask.  In other words, the meaning and
name of that existing nodemask will change, with no
change in the overall structure size.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Lee Schermerhorn
Date: Monday, October 29, 2007 - 10:46 am

Kool!

Lee

-

From: Christoph Lameter
Date: Monday, October 29, 2007 - 1:35 pm

In that case we could just put the nodemask at the end of the mempolicy 
structure and then allocate the size needed? That way we would not need to 
deref an additional pointer?
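
As an illustration of that layout, here is a user-space sketch using a C99
flexible array member.  struct model_mempolicy and the model_* helpers are
made-up names for the example, and calloc() stands in for whatever
allocator the kernel would use; none of this is taken from a posted patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical policy with its nodemask stored inline at the end. */
struct model_mempolicy {
	int policy;		/* stand-in for an MPOL_* value */
	unsigned int nr_nodes;	/* number of node bits that follow */
	unsigned long nodes[];	/* flexible array member: the nodemask */
};

/* Words needed to hold nr_nodes bits. */
static size_t nodemask_words(unsigned int nr_nodes)
{
	return (nr_nodes + BITS_PER_ULONG - 1) / BITS_PER_ULONG;
}

/* One allocation covers the header and a zeroed nodemask. */
static struct model_mempolicy *model_policy_alloc(int policy,
						  unsigned int nr_nodes)
{
	struct model_mempolicy *pol;

	assert(nr_nodes > 0);
	pol = calloc(1, sizeof(*pol) +
			nodemask_words(nr_nodes) * sizeof(unsigned long));
	if (!pol)
		return NULL;
	pol->policy = policy;
	pol->nr_nodes = nr_nodes;
	return pol;
}

/* Set one node bit; no extra pointer dereference is needed. */
static void model_node_set(struct model_mempolicy *pol, unsigned int node)
{
	assert(node < pol->nr_nodes);
	pol->nodes[node / BITS_PER_ULONG] |= 1UL << (node % BITS_PER_ULONG);
}

/* Test one node bit; returns 1 if set, 0 otherwise. */
static int model_node_isset(const struct model_mempolicy *pol,
			    unsigned int node)
{
	assert(node < pol->nr_nodes);
	return (pol->nodes[node / BITS_PER_ULONG] >>
		(node % BITS_PER_ULONG)) & 1;
}
```

A system that needs only a few node bits pays for one word of mask, while
a 1K-bit configuration gets the full mask from the same single allocation.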

-

From: David Rientjes
Date: Friday, October 26, 2007 - 2:18 pm

Not really, because perhaps your application doesn't want to interleave 
over all nodes.  I suggested NODE_MASK_ALL as the way to get access to all 
the memory you are allowed, but it's certainly plausible that an 
application could request to interleave only over a subset.  That's the 
entire reason set_mempolicy(MPOL_INTERLEAVE) takes a nodemask anyway right 
now instead of just using task->mems_allowed on each allocation.

		David
-

From: Lee Schermerhorn
Date: Friday, October 26, 2007 - 2:31 pm

So, you pass the subset, you don't set the flag to indicate you want
interleaving over all available.  You must be thinking of some other use
for saving the subset mask that I'm not seeing here.  Maybe restoring to
the exact nodes requested if they're taken away and then re-added to the
cpuset?


Later,
Lee



-

From: David Rientjes
Date: Friday, October 26, 2007 - 2:39 pm

Paul's motivation for saving the passed nodemask to set_mempolicy() is so 
that the _intent_ of the application is never lost.  That's the biggest 
advantage that this method has and that I totally agree with.  So whenever 
the mems_allowed of a cpuset changes, the MPOL_INTERLEAVE nodemask of all 
attached tasks becomes their intent (pol->passed_nodemask) AND'd with the 
new mems_allowed.  That can be done on mpol_rebind_policy() and shouldn't 
be an extensive change.

So MPOL_INTERLEAVE, and possibly other, mempolicies will always try to 
accommodate the intent of the application but only as far as the task's
cpuset restriction allows them.
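
A minimal user-space model of that rebind, assuming 64-node masks.  The
struct layout, the effective_nodemask field and the model_* function names
are hypothetical; only the idea of keeping the passed nodemask and AND'ing
it with the new mems_allowed comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NODE_MASK_ALL UINT64_MAX	/* stand-in for NODE_MASK_ALL */

/* Hypothetical two-mask policy; field names are illustrative only. */
struct model_policy {
	uint64_t passed_nodemask;	/* what the application asked for */
	uint64_t effective_nodemask;	/* what is interleaved over now */
};

/* Record the application's intent and apply the current restriction. */
static void model_set_interleave(struct model_policy *pol, uint64_t passed,
				 uint64_t mems_allowed)
{
	assert(pol != NULL);
	pol->passed_nodemask = passed;
	pol->effective_nodemask = passed & mems_allowed;
}

/* Called when the cpuset's mems_allowed changes to newmask. */
static void model_rebind(struct model_policy *pol, uint64_t newmask)
{
	assert(pol != NULL);
	pol->effective_nodemask = pol->passed_nodemask & newmask;
}
```

Because the passed mask is never overwritten, a task that asked for nodes
0-3 drops to nodes 0-1 when the cpuset shrinks and returns to nodes 0-3
when they are added back.  A task that passed the all-nodes mask simply
follows whatever the cpuset allows.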

		David
-

From: Paul Jackson
Date: Friday, October 26, 2007 - 6:07 pm

Issue:

    Are the nodes and nodemasks passed into set_mempolicy() to be
    presumed relative to the cpuset or not?  [Careful, this question
    doesn't mean what you might think it means.]

Let's say our system has 100 nodes, numbered 0-99, and we have a task
in a cpuset that includes the twenty nodes 10-29 at the moment.

Currently, if that task does say an MPOL_PREFERRED on node 12, we take
that to mean the 3rd node of its cpuset.  If we move that task to a
cpuset on nodes 40-59, the kernel will change that MPOL_PREFERRED to
node 42.  Similarly for the other MPOL_* policies.

Ok so far ... seems reasonable.  Node numbers passed into the
set_mempolicy call are taken to be absolute node numbers that are to
be mapped relative to the task's current cpuset, perhaps unbeknownst
to the calling task, and remapped if that cpuset changes.
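
The 100-node example above can be checked with a small user-space model.
This is a sketch covering only the case in the example, where the old and
new cpusets have the same number of nodes; model_set_range() and
model_remap_node() are invented names, and returning -1 for a node that
cannot be mapped is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 100

/* Fill mask[] so that exactly nodes first..last (inclusive) are set. */
static void model_set_range(bool mask[MODEL_MAX_NODES], int first, int last)
{
	assert(0 <= first && first <= last && last < MODEL_MAX_NODES);
	for (int n = 0; n < MODEL_MAX_NODES; n++)
		mask[n] = (n >= first && n <= last);
}

/*
 * Map node from its position in old_mems to the node at the same position
 * in new_mems.  Returns -1 if node is not in old_mems or if new_mems has
 * no node at that position.
 */
static int model_remap_node(int node, const bool old_mems[MODEL_MAX_NODES],
			    const bool new_mems[MODEL_MAX_NODES])
{
	int ord = 0;

	if (node < 0 || node >= MODEL_MAX_NODES || !old_mems[node])
		return -1;
	/* Count the allowed nodes below node to get its 0-based position. */
	for (int n = 0; n < node; n++)
		ord += old_mems[n];
	/* Walk the new set to the node at that same position. */
	for (int n = 0; n < MODEL_MAX_NODES; n++) {
		if (new_mems[n] && ord-- == 0)
			return n;
	}
	return -1;
}
```

Node 12 is the third node of the cpuset on nodes 10-29, so after a move to
nodes 40-59 the model returns the third node there, node 42.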

But now imagine that a task happens to be in a cpuset of just two
nodes, and wants to request an MPOL_PREFERRED policy for the fourth
node of its cpuset, anytime there actually is a fourth node.  That
task can't say that using numbering relative to its current cpuset,
because that cpuset only has two nodes.  It could say it relative to
a mask of all possible nodes by asking for the fourth possible node,
likely numbered node 3.

If that task happened to be in a cpuset on nodes 10 and 11, asking
for the fourth node in the system (node 3) would still be rather
unambiguous, as node 3 can't be either of 10 or 11, so must be
relative to all possible nodes, meaning "the fourth available node,
if I'm ever fortunate enough to have that many nodes."

But if that task happened to be in a cpuset on nodes 2 and 3, then
the node number 3 could mean:

Choice A:
    as it does today, the second node in the task's cpuset or it could
    mean

Choice B:
    the fourth node in the cpuset, if available, just as
    it did in the case above involving a cpuset on nodes 10 and 11.
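
The two readings can be put side by side in a small user-space sketch.
Both resolve_choice_a() and resolve_choice_b() are hypothetical helpers on
64-node masks, and returning -1 when the requested node is not available
is an assumption of the sketch, since the thread leaves open what should
happen in that case.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Choice A: nr is an absolute node number.  It is usable only if that
 * node is currently in the cpuset; otherwise -1.
 */
static int resolve_choice_a(int nr, uint64_t mems_allowed)
{
	assert(nr >= 0 && nr < 64);
	return ((mems_allowed >> nr) & 1) ? nr : -1;
}

/*
 * Choice B: nr is a 0-based index into the cpuset's allowed nodes, so
 * nr == 3 asks for the fourth node of the cpuset.  Returns -1 if the
 * cpuset has too few nodes right now.
 */
static int resolve_choice_b(int nr, uint64_t mems_allowed)
{
	assert(nr >= 0 && nr < 64);
	for (int pos = 0; pos < 64; pos++) {
		if (((mems_allowed >> pos) & 1) && nr-- == 0)
			return pos;
	}
	return -1;
}
```

For a cpuset on nodes 2 and 3 (mask 0xC), passing node 3 resolves to node
3 under Choice A and to nothing under Choice B, because there is no fourth
node.  For a cpuset on nodes 10-13, Choice B resolves the same request to
node 13.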

Let me restate this.

Either way, passing in node 3 means node 3, as numbered ...
From: Christoph Lameter
Date: Friday, October 26, 2007 - 6:26 pm

Yes. We should default to Choice B. Add an option MPOL_MF_RELATIVE to 
enable that functionality? A new version of numactl can then enable
that by default for newer applications.
-

From: Paul Jackson
Date: Friday, October 26, 2007 - 7:41 pm

I'm confused.  If B is the default, then we don't need a flag to
enable it, rather we need a flag to go back to the old choice A.

So are you saying that:
 1) Choice A remains the default for the kernel unless
    MPOL_MF_RELATIVE is added, or
 2) that the new default for the kernel is Choice B,
    unless MPOL_MF_RELATIVE is specified, asking to
    revert to the original Choice A behaviour?

Perhaps, either way, whatever compatibility flag we have should be
something that can be forced on an application from the outside,
perhaps as a per-system mode flag in /sys, or a per-cpuset mode flag,
or a per-task operation, by what mechanism is not clear.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Christoph Lameter
Date: Friday, October 26, 2007 - 7:50 pm

Don't we need it for numactl to preserve backward compatibility? numactl
can set that flag by default for newer software. We likely need a new 

libnuma can take care of that. But we need to have that flag for numactl to be
backward compatible.

-

From: Paul Jackson
Date: Friday, October 26, 2007 - 10:16 pm

I'm still confused, Christoph.

Are you saying:
 1) The kernel continues to default to Choice A, unless
    the flag enables Choice B, or
 2) The kernel defaults to the new Choice B, unless the
    flag reverts to the old Choice A?

Alternative (2) breaks libnuma and hence numactl until it is changed
to use the flag, or changed to use choice B (in which case it wouldn't
need the flag.)

So I guess you mean alternative (1) above, since you seem to be taking
the position that we can't break compatibility here.

But I could quote statements from you that seem to clearly state the
exact opposite.

So I remain confused.

Actually, alternative (1) is kinda ugly.  It leaves a permanent wart
on the set_mempolicy API -- two different variants to what the node
numbers and node masks mean, depending on whether this MPOL_MF_RELATIVE
is set on each call.  We'll have to ship out an extra serving of brain
food for most folks looking at this to have much chance that they will
confidently understand the difference between the two options selected
by this flag.

I wonder if there might be some way to avoid that permanent ugly wart
on each and every set/get mempolicy system call forever afterward.

Please try to double check your next reply, Christoph.  I'm beginning
to worry that we might be failing to communicate clearly.  Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Christoph Lameter
Date: Friday, October 26, 2007 - 11:07 pm

2) keeps everything in order. Let everything be as it is today unless


Tough. The API needs to remain stable. We can only change it through an 
additional flag that enables the relativeness and the folding the way you 
want it. libnuma may set the flag on its own without the user having to do 

Hmmm.. The alternative is to add new set/get mempolicy functions.
-

From: Paul Jackson
Date: Saturday, October 27, 2007 - 1:36 am

Good - that I understand.  Your position is clear now.

You have chosen (1) above, which keeps Choice A as the default.


Before I leave this part, there is one more thing I kinda really need,
if you could, Christoph.  Could you describe in your own words what you
think Choices A and B mean?  We seem to be having trouble communicating,
and hence there is some risk right now that we don't mean the same thing
by this new "Choice B".

===


Other alternatives include a per-system, per-cpuset or per-process
flag, in addition to the per-system call flag you suggested earlier
(MPOL_MF_RELATIVE), or whatever you mean by "new set/get mempolicy
functions" ... could you elaborate on that one?

So ... the question becomes this:

  How do we migrate to Choice B, without leaving both Choices
  permanently supported, and an ugly mode flag selecting the
  non-default Choice, while not breaking API's too abruptly?

Thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Christoph Lameter
Date: Saturday, October 27, 2007 - 10:47 am

There can be different defaults for the user space API via libnuma that 
are independent from the kernel API which needs to remain stable. The kernel 

None of those sound appealing. Multiple processes may run in one cpuset. 
Some of those may be linked to older libnumas and therefore depend on old 
behavior.
-

From: Paul Jackson
Date: Saturday, October 27, 2007 - 1:59 pm

Yes - the user level code can have different defaults too.


Well, that would justify keeping this choice per-task.  I tend to
agree with that.

But that doesn't justify having to specify it on each system call.

In another reply David recommends against supporting Choice A at all.
I'm inclined to agree with him.  I'll reply there, with more thoughts.

But if we did support Choice A, as a backwards compatible alternative
to Choice B, I'd suggest a per-task mode, not per-system call mode.
This would reduce the impact on the API of the ugly, unobvious, modal
flag needed to select the optional, non kernel default, Choice B
semantics.

I still have low confidence that you (Christoph) and I have the same
understanding of what these Choice A and B are.  Hopefully you can
address that, perhaps by briefly describing these choices in your words.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Saturday, October 27, 2007 - 10:50 am

I think there's a mixup in the flag name there, but I actually would 
recommend against any flag to effect Choice A.  It's simply going to be 
too complex to describe and is going to be a headache to code and support.  

The MPOL_PREFERRED behavior when constrained by cpusets was previously, to 
my knowledge, undocumented; you're in the position to make the behavior do 
what you want it to do and then release documentation so we'll finally 
have a complete and unambiguous API for it.  Right now it should be 
considered undefined and thus you are free to implement it as you choose.  
Then all callers of set_mempolicy(MPOL_PREFERRED) will standardize on that 
and not have to worry about the machine's 
mpol_preferred_relative_to_cpuset setting.

Then, any task that is attached to a cpuset and expecting the fourth node 
in their set_mempolicy(MPOL_PREFERRED) call to mean system node 3 if 
it's in the cpuset's mems_allowed will be broken.  If you want that, 
you'll need your task to be attached to a cpuset with at least mems 0-3; 
programmers will pick that up quickly enough if it's clearly documented.
I think Choice B is correct and makes more sense in terms of the semantics 
and at least allows mempolicies and cpusets to play nicely together 
without a bidirectional dependency on one another.

		David
-

From: Paul Jackson
Date: Saturday, October 27, 2007 - 4:19 pm

While I am sorely tempted to agree entirely with this, I suspect that
Christoph has a point when he cautions against breaking this kernel API.

Especially for users of the set/get mempolicy calls coming in via
libnuma, we have to be very careful not to break the current behaviour,
whether it is documented API or just an accident of the implementation.

There is a fairly deep and important stack of software, involving a
well known DBMS product whose name begins with 'O', sitting on that
libnuma software stack.  Steering that solution stack is like steering
a giant oil tanker near shore.  You take it slow and easy, and listen
closely to the advice of the ancient harbor master.  The harbor masters

True, which is why I am hoping we can keep this modal flag, if such be,
from having to be used on every set/get mempolicy call.  The ordinary
coder of new code using these calls directly should just see Choice B
behaviour.  However the user of libnuma should continue to see whatever
API libnuma supports, with no change whatsoever, and various versions of
libnuma, including those already shipped years ago, must continue to
behave without any changes in node numbering.

There are two decent looking ways (and some ugly ways) that I can see
to accomplish this:

 1) One could claim that no important use of Oracle over libnuma over
    these memory policy calls is happening on a system using cpusets.
    There would be a fair bit of circumstantial evidence for this
    claim, but I don't know it for a fact, and would not be the
    expert to determine this.  On systems making no use of cpusets,
    these two Choices A and B are identical, and this is a non-issue.
    Those systems will see no API changes whatsoever from any of this.

 2) We have a per-task mode flag selecting whether Choice A or B
    node numbering apply to the masks passed in to set_mempolicy.

    The kernel implementation is fairly easy.  (Yeah, I know, I
    too cringe every time I read that line ;)

    If ...
From: David Rientjes
Date: Sunday, October 28, 2007 - 11:19 am

From a standpoint of the MPOL_PREFERRED memory policy itself, there is no 
documented behavior or standard that specifies its interaction with 
cpusets.  Thus, it's "undefined."  We are completely free to implement an 
undefined behavior as we choose and change it as Linux matures.

Once it is defined, however, we carry the burden of protecting 
applications that are written on that definition.  That's the point where 
we need to get it right and if we don't, we're stuck with it forever; I 
don't believe we're at that point with MPOL_PREFERRED policies under 

Ok, let's take a look at some specific unproprietary examples of tasks 
that use set_mempolicy(MPOL_PREFERRED) for a specific node, intending it 
to be the actual system node offset, that is then assigned to a cpuset 
that doesn't require that offset to be allowed.

I think it's going to become pretty difficult to find an example because 
the whole scenario is pretty lame: you would need to already know which 
nodes you're going to be assigned to in the cpuset to ask for one of them 
as your preferred node.  I don't imagine any application can have that 
type of foresight and, if it does, then we certainly shouldn't support the 
preferred node_remap() when it changes mems.

You're trying to support a scheme, in Choice A, where an application knows 
it's going to be assigned to a range of nodes (for example, 1-3) and wants 
the preferred node to be included (for example, 2).  So now the 
application must have control over both its memory policy and its cpuset 
placement.  Then it must be willing to change its cpuset placement to a 
different set of nodes (with equal or greater cardinality) and have the 
preferred node offset respected.  Why can't it simply then issue another 
set_mempolicy(MPOL_PREFERRED) call for the new preferred node?

See?  The problem is that you're trying to protect applications that know 
its initial cpuset mems [the only way it could ever send a 
set_mempolicy(MPOL_PREFERRED) for the right node ...
From: Paul Jackson
Date: Sunday, October 28, 2007 - 4:46 pm

You state this point clearly, but I have to disagree.

The Linux documentation is not a legal contract.  Anytime we change the
actual behaviour of the code, we have to ask ourselves what will be the
impact of that change on existing users and usages.  The burden is on
us to minimize breaking things (by that I mean, what users would
consider breakage, even if we think it is all for the better and that
their code was the real problem.)  I didn't say no breakage, but
minimum breakage, doing our best to guide users through changes with
minimum disruption to their work.

Linux is gaining market share rapidly because we co-operate with our
users to give us both the best chance of succeeding.

We don't just play gotcha games with the documentation -- ha ha --
we didn't document that detail, so it's your fault for ever depending
on it.  And besides your code sucks.  So there!  Let's leave that game



If that were so, then yes much of your subsequent reasoning would follow.


The above is the hack that allows us to support existing libnuma based
applications (the most significant users of memory policy historically)
with a default of Choice A, while other code and future code defaults

That's not the only sort of application I'm trying to protect.

I'm trying to protect almost any application that uses either
set_mempolicy or mbind, while in a cpuset.

    If a task is in a cpuset on say nodes 16-23, and it wants to issue
    any mbind, or any MPOL_PREFERRED, MPOL_BIND, or MPOL_INTERLEAVE
    mempolicy call, then under Choice A it must issue nodemasks offset
    by 16, relative to what it would issue under Choice B.
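
To make the offset concrete, here is a minimal user-space sketch, not
kernel code, of the two numberings being compared.  It assumes, as
this thread uses the terms, that Choice A takes system-wide node
numbers and Choice B takes cpuset-relative ones.  The helper names
(rel_to_sys_node, node_range) and the 64-bit mask are hypothetical
stand-ins for the kernel's nodemask_t.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the system node number of the rel-th
 * (0-based) node set in mems_allowed, or -1 if the cpuset holds
 * fewer than rel + 1 nodes.
 */
int rel_to_sys_node(uint64_t mems_allowed, int rel)
{
	for (int node = 0; node < 64; node++) {
		if (!(mems_allowed & (UINT64_C(1) << node)))
			continue;	/* node not in this cpuset */
		if (rel-- == 0)
			return node;
	}
	return -1;
}

/* Illustrative only: build a mask of nodes lo..hi inclusive. */
uint64_t node_range(int lo, int hi)
{
	uint64_t mask = 0;

	for (int node = lo; node <= hi; node++)
		mask |= UINT64_C(1) << node;
	return mask;
}
```

For a cpuset on nodes 16-23, cpuset-relative node 3 corresponds to
system node 19, which is the offset of 16 described above.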

Almost any task using memory policies on a system making active use of
cpusets will be affected, even well written ones doing simple things.

I am more concerned that the above hack for libnuma isn't enough,
rather than that it is unnecessary.

I think the above hack covers existing libnuma users rather well,
though I could be wrong even here, as I don't actually ...
From: David Rientjes
Date: Sunday, October 28, 2007 - 6:04 pm

Nobody can show an example of an application that would be broken because 
of this and, given the scenario and sequence of events that it requires to 
be broken when implementing the default as Choice B, I don't think it's as 

So all applications that use the libnuma interface and numactl will have 
different default behavior than those that simply issue 
{get,set}_mempolicy() calls.  libnuma is a collection of higher level 
functions that should be built upon {get,set}_mempolicy() like they 
currently are and not introduce new subtleties like changing the semantics 
of a preferred node argument.  This is going to quickly become a 
documentation nightmare and, in my opinion, isn't worth the time or effort 
to support because we haven't even identified any real-world examples.

Maybe Andi Kleen should weigh in on this topic because, if we go with what 
you're suggesting, we'll never get rid of the two differing behaviors and 
we'll be introducing different semantics to arguments of libnuma functions 

True, but the ordering of that scenario is troublesome.  The correct way 
to implement it is to use set_mempolicy() or a higher level libnuma 
function with the same semantics and _then_ attach the task to a cpuset.  
Then the nodes_remap() takes care of the rest.

The scenario you describe above has a problem because it requires the task 
to have knowledge of the cpuset's mems in which it is attached when, for 
portability, it should have been written so that it is robust to any range 

No, because nodes_remap() takes care of the instances you describe above 
when the task sets its memory policy (usually done when it is started) and 

Supporting two different behaviors is going to be more problematic than 
simply selecting one and going with it and its associated documentation in 

Paul, the changes required to an application that is currently using 
{get,set}_mempolicy() calls to setup the memory policy or the higher level 
functions through libnuma is so easy to use Choice ...
From: Paul Jackson
Date: Sunday, October 28, 2007 - 9:27 pm

Well, neither you nor I have shown an example.  That's different than
"nobody can."

Since it would affect any task setting memory policies while in a
cpuset holding less than all memory nodes, it seems potentially serious
to me.

Actually, I have one example.  The libcpuset library would have some
breakage with Choice B the only Choice.  But I'm in a position to deal

Breaking the libnuma-Oracle solution stack is not an option.

And, unless someone in the know tells us otherwise, I have to assume
that this could break them.  Now, the odds are that they simply don't
run that solution stack on any system making active use of cpusets,
so the odds are this would be no problem for them.  But I don't

We could get rid of Choice A once libnuma and libcpuset have adapted
to Choice B, and any other uses of Choice A that we've subsequently
identified have had sufficient time to adapt.

But dual support is pretty easy so far as the kernel code is concerned.
It's just a few nodes_remap() calls optionally invoked at a few key
spots in mm/mempolicy.c.  Consequently there won't be a big hurry to

There is no "_then_ attach the task to a cpuset."  On systems with
kernels configured with CONFIG_CPUSETS=y, all tasks are in a cpuset
all the time.  Moreover, from a practical point of view, on large
systems managed with cpuset based mechanisms, almost all tasks are in
cpusets that do not include all nodes, for the entire life of the task.

And besides, I can't break existing applications willy-nilly, and
then claim it's their fault, because they should have been coded
differently.  So "correct way" arguments don't hold a lot of weight

David ;)  I make some effort to avoid forcing applications to be

I had to read that a couple of times to make sense of it.  I take that
it means that the node numbering used in each cpuset's 'mems' file has
to be system-wide.  Yes, agreed.

(Well, actually, the node numbering of each cpuset's 'mems' file could
be relative to its parent cpuset's 'mem' ...
From: David Rientjes
Date: Sunday, October 28, 2007 - 9:47 pm

If we can't identify any applications that would be broken by this, what's 
the difference in simply implementing Choice B and then, if we hear 
complaints, add your hack to revert back to Choice A behavior based on the 
get_mempolicy() call you specified is always part of libnuma?

The problem that I see with immediately offering both choices is that we 
don't know if anybody is actually reverting back to Choice A behavior 
because libnuma, by default, would use it.  That's going to make it very 
painful to remove later because we've supported both options and have made 
libnuma and {get,set}_mempolicy() arguments ambiguous.  We should only 
support both choices if they will both be used and there's no hard 

You earlier insisted on an ease of documentation for the MPOL_INTERLEAVE 
case and now this dual support that you're proposing is going to make the 
documentation very difficult to understand for anyone who simply wants to 
use mempolicies.

Others even in this thread have had a hard enough time understanding the 
difference between the two choices and you explained them very thoroughly.  

And that application would need to be implemented to know the nodes that 
it has access to before it issues its set_mempolicy(MPOL_PREFERRED) 
command anyway if it truly uses Choice A behavior.  So unless these tasks 
are looking in /proc/pid/status and parsing Mems_allowed and then 
specifying one as its preferred node or always being guaranteed a certain 
set of nodes that they are always attached to in a cpuset so they have 
such foresight of what node to prefer, Choice A can't possibly be what 


The need I was addressing with my initial patchset was that when a 
cpuset is expanded, any MPOL_INTERLEAVE memory policy of attached tasks 
automatically gets expanded as well.  This discussion has somewhat diverged 
from that, but I hope you still support what we earlier talked about in 
terms of adding a field to struct mempolicy to remember the intended 

You don't actually ...
From: Paul Jackson
Date: Sunday, October 28, 2007 - 10:45 pm

I'll probably reply to other parts of your message later, but this
one catches my eye right now.

"if we hear complaints, add your hack ... back"  -- this doesn't seem
like a good idea to me.  Maybe inside Google you don't see it, but
for those of us shipping computer systems using major distributions
such as SUSE or Red Hat, there can be a year lag between when I send a
feature patch to Andrew, and when my customers send their first
feedback to me resulting from using that new feature.

There are ways to expedite fixes for specific situations, of course,
but in general, this is rather like sending out a deep space probe.
You have to conservatively cover your options pre-launch, because
post-launch repairs are costly, slow and limited.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Monday, October 29, 2007 - 12:00 am

Let's add a Choice C:

	Any nodemask that is passed to set_mempolicy() is saved as the
	intent of the application in struct mempolicy.  All policies
	are effected on a contextualized per-allocation basis.

	Policies such as MPOL_INTERLEAVE always get AND'd with 
	pol->cpuset_mems_allowed.  If that yields numa_no_nodes,
	MPOL_DEFAULT is used instead.

	Policies such as MPOL_PREFERRED are respected if the node is set
	in pol->cpuset_mems_allowed, otherwise MPOL_DEFAULT is used.	

	If an application attempts to setup a memory policy for an
	MPOL_PREFERRED node that it doesn't have access to or an
	MPOL_INTERLEAVE nodemask that is empty when AND'd with
	pol->cpuset_mems_allowed, -EINVAL is returned and no new policy
	is effected.

	If an application gains nodes in pol->cpuset_mems_allowed that
	now include the nodes from MPOL_INTERLEAVE or MPOL_PREFERRED,
	that policy is then effected once again.  Otherwise,
	MPOL_DEFAULT is still used.
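
The rules of Choice C above can be sketched as a small user-space
model.  This is only an illustration under stated assumptions, not the
proposed kernel patch: the enum values are stand-ins for the MPOL_*
modes, struct policy_model and both functions are hypothetical names,
and a 64-bit mask replaces nodemask_t.  An MPOL_PREFERRED policy is
modelled as a mask with a single bit set.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-ins for MPOL_DEFAULT, MPOL_PREFERRED and MPOL_INTERLEAVE. */
enum mode { MODE_DEFAULT, MODE_PREFERRED, MODE_INTERLEAVE };

struct policy_model {
	enum mode mode;
	uint64_t intent;	/* nodes as passed in; never rewritten */
};

/*
 * Model of set_mempolicy() under Choice C: reject a policy that has
 * no node in common with the cpuset's mems_allowed, leaving the old
 * policy in place.
 */
int model_set_policy(struct policy_model *pol, enum mode mode,
		     uint64_t nodes, uint64_t mems_allowed)
{
	if (mode != MODE_DEFAULT && (nodes & mems_allowed) == 0)
		return -EINVAL;
	pol->mode = mode;
	pol->intent = nodes;
	return 0;
}

/*
 * Evaluated per allocation: AND the saved intent with the current
 * mems_allowed.  An empty result falls back to the default mode; the
 * saved intent is untouched, so the policy takes effect again if the
 * nodes come back.
 */
enum mode model_effective(const struct policy_model *pol,
			  uint64_t mems_allowed, uint64_t *nodes)
{
	*nodes = pol->intent & mems_allowed;
	if (pol->mode == MODE_DEFAULT || *nodes == 0) {
		*nodes = 0;
		return MODE_DEFAULT;
	}
	return pol->mode;
}
```

Because the intent is stored separately from the effective mask,
shrinking and later re-expanding mems_allowed needs no extra
bookkeeping in this model.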
-

From: Paul Jackson
Date: Monday, October 29, 2007 - 12:26 am

"contextualized" - I guess that means converted to cpuset
relative numbering - yes.

"per-allocation" - Most of the calculation of nodemasks and





Not issues with Folding.

With folding, an application that laid out an elaborate memory
policy configuration covering say 16 nodes can run in a 4 node
cpuset, where whatever would have been on node N gets folded down
to node N % 4.

With AND'ing, such an application would find 3/4's of its fancy
memory policy configuration replaced with MPOL_DEFAULT and -EINVAL
fallbacks.
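
The arithmetic behind the 16-node and 4-node example can be checked
with a short sketch.  It assumes the cpuset's nodes are numbered
0..w-1, as in the "N % 4" wording above; fold_node and count_nodes are
hypothetical helper names, and a 64-bit mask stands in for nodemask_t.

```c
#include <stdint.h>

/* Folding: node n of the application's layout lands on node n % w. */
int fold_node(int n, int cpuset_nodes)
{
	return n % cpuset_nodes;
}

/* Count the nodes set in a mask, e.g. those that survive an AND. */
int count_nodes(uint64_t mask)
{
	int count = 0;

	for (; mask; mask &= mask - 1)	/* clear lowest set bit */
		count++;
	return count;
}
```

With a policy over nodes 0-15 (mask 0xFFFF) and a cpuset of nodes 0-3
(mask 0x000F), count_nodes(0xFFFF & 0x000F) is 4, so 12 of the 16
nodes named by the policy have no counterpart under AND'ing, whereas
fold_node() maps every one of the 16 onto some allowed node.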


-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Tuesday, October 30, 2007 - 3:53 pm

Missing the point; this is an alternative to the previous choices; Choice 
C explicitly removes all remaps ("folding") from mempolicies.  The 
nodemask passed to set_mempolicy() will always have exactly one meaning: 
the system nodes that the policy is intended for.

Cpusets, which are built upon mempolicies, can obviously take access to 
some of those nodes away.  That's why the existing mempolicies are AND'd 
with the cpuset's mems_allowed to represent the current nodemask that the 
mempolicy is effecting.  If none of them are available because of cpusets, 
the mempolicy is invalidated and MPOL_DEFAULT is used.  If access to some 
nodes from the mempolicy's nodemask become available once again, the 
policy is again effected.

I'm arguing that remapping a policy's nodemask, although that is what 
currently is done, is troublesome because it can use a policy such as 
MPOL_PREFERRED to work on a node for which it was never intended.

		David
-

From: Paul Jackson
Date: Tuesday, October 30, 2007 - 4:17 pm

Ok - that makes the meaning of Choice C clearer to me.  Thank-you.

We've already got two Choices, one released and one in the oven.  Is
there an actual, real world situation, motivating this third Choice?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Tuesday, October 30, 2007 - 4:25 pm

Let's put Choice C into the lower oven, then.

Of course there are actual and real world examples of this, because right 
now we're not meeting the full intent of the application.  Cpusets deal 
with cpus and memory, they don't have anything to do with affinity to 
particular I/O devices; that part is left up to the creator of the cpuset 
to sort out correctly based on their system topology.

If my application does tons of I/O on one particular device to which my 
memory has access, I can use MPOL_PREFERRED to prefer the memory be 
allocated on a node with the best affinity to my device.  If cpusets 
change my access to that node, I'm still using an MPOL_PREFERRED policy 
with a remapped node that no longer has affinity to that device because 
nodes_remap() doesn't take that into account.  My preference would be to 
fallback to MPOL_DEFAULT behavior, since it's certainly plausible that 
other cpusets share the same node, instead of unnecessarily filling up a 
node that I don't even prefer anymore.
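
A simplified user-space model of the remap for a single preferred
node shows why device affinity is not preserved.  This is an
assumption-laden sketch of how the ordinal-based remap behaves as
described in this thread, not the kernel's nodes_remap()
implementation; all function names are hypothetical and a 64-bit mask
stands in for nodemask_t.

```c
#include <stdint.h>

/* Number of nodes set in the mask. */
static int weight(uint64_t mask)
{
	int count = 0;

	for (; mask; mask &= mask - 1)
		count++;
	return count;
}

/* Ordinal of bit pos among the set bits of mask, or -1 if unset. */
static int pos_to_ord(uint64_t mask, int pos)
{
	if (pos < 0 || pos >= 64 || !(mask & (UINT64_C(1) << pos)))
		return -1;
	return weight(mask & ((UINT64_C(1) << pos) - 1));
}

/* Position of the ord-th (0-based) set bit of mask, or -1. */
static int ord_to_pos(uint64_t mask, int ord)
{
	for (int pos = 0; pos < 64; pos++)
		if ((mask & (UINT64_C(1) << pos)) && ord-- == 0)
			return pos;
	return -1;
}

/*
 * Model: the n-th node of the old mems becomes the (n mod w)-th node
 * of the new mems, where w is the number of new nodes.  A node that
 * is not in the old mems is left unchanged.
 */
int model_node_remap(int node, uint64_t oldmems, uint64_t newmems)
{
	int w = weight(newmems);
	int ord = pos_to_ord(oldmems, node);

	if (ord < 0 || w == 0)
		return node;
	return ord_to_pos(newmems, ord % w);
}
```

In this model a task preferring node 2 in a cpuset on nodes 1-3 that
is moved to nodes 5-7 ends up preferring node 6, purely by position;
nothing in the mapping consults which node is close to the device.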

The same situation exists for MPOL_INTERLEAVE policies where my NUMA 
optimization is no longer helpful because I'm interleaving over a set of 
nodes that was simply remapped and their affinity (which isn't guaranteed 
to be uniform) wasn't even taken into account.

But, with Choice C, my intent is still preserved in the mempolicy even 
though it's not effected because my access rights to the node have changed.  
If I get access to that node back later, and I haven't issued subsequent 
set_mempolicy() calls to change my policy, my MPOL_PREFERRED or 
MPOL_INTERLEAVE policy is again effected and I then benefit from my NUMA 
optimization once again.

		David
-

From: Paul Jackson
Date: Tuesday, October 30, 2007 - 5:03 pm

Please describe one, an actual one, not a hypothetical one, of which you
have personal knowledge.

There are many refinements we could add, an endless stream of them.
Each one adds a burden to those who didn't need it.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Paul Jackson
Date: Tuesday, October 30, 2007 - 5:05 pm

Choice B, as I'm coding it, has this property as well.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Paul Jackson
Date: Monday, October 29, 2007 - 12:15 am

Yes, that's a problem.  I would rather end up with both Choices
forever, than breaking stuff because we changed how memory policy

No.  We could only remove Choice A if we had hard evidence that
it wouldn't break things, especially for the libnuma-Oracle stack.

Either way, we obviously have to decide this lacking sufficient
hard evidence.  Changing memory policy node numbering is just way
too likely to break things, in ways that users initially find
difficult to diagnose.  We -can-not- inflict that on our users
in a single, sudden, change.  We must stage it, starting by

Yup - that's a problem.  But it is one that users can control.
If they just continue using memory policies and libnuma as before,
it continues to work as before.  If they need to deal with situations
in which applications using memory policies are being moved around
between larger and smaller cpusets, and they are willing and able to
modify and improve the part of their code that handles memory policies,
then they can read the new section of the documentation about this
improved cpuset-relative node numbering, and give it a try.

Blind siding users with a unilateral change like this will leave
orphaned bits gasping in agony on the computer room floor.  It can
sometimes take months of elapsed time and hundreds of hours of various
people's time across a dozen departments in three to five corporations
to track down the root cause of such a problem, from the point of the
initial failure, back to the desk of someone like you or me.  And then
it can take tens or hundreds more hours of human effort to deliver a
fix.  I refuse to knowingly go down that road.


People do this sort of stuff all the time; they just don't realize
what all is going on beneath the surface of the various tools,
libraries, scripts and magic incantations that they cobble together to
meet their needs.

Choice A is meeting most of our needs.  Not until you brought
up this case of MPOL_INTERLEAVE across the nodes of a job being


Not ...
From: David Rientjes
Date: Tuesday, October 30, 2007 - 4:12 pm

If your argument is that most applications are written to implement 
mempolicies without necessarily thinking too much about its cpuset 
placement or interactions with cpusets, then the requirement of remapping 
nodes when a cpuset changes for effected mempolicies isn't actually that 
important.  In other words, my Choice C with AND'd behavior as opposed to 
remapping behavior could be introduced as a replacement for Choice A.

Those applications that currently rely on the remapping are going to be 
broken anyway because they are unknowingly receiving different nodes than 
they intended; this is the objection to remapping that Lee agreed with.  
The remap doesn't take into account any notion of locality or affinity to 
physical controllers and seems to be merely a convenience of not 
invalidating the entire mempolicy in light of an ever-changing cpuset 

Yes, I know, and my Choice C does _not_ want that folding behavior; it 
wants the AND'd behavior because it fully respects the intent of the 
application with regard to the actual nodes that it specified in its 
memory policies.  A node should only have one definition and policies that 
are effected on a set of nodes, or one node in the preferred case, should 
not change from beneath the application because it was not the intent of 
the implementation.  Doing so is dangerous, regardless of whether or not 
it is currently the mempolicy behavior in HEAD.

		David
-

From: Paul Jackson
Date: Tuesday, October 30, 2007 - 4:44 pm

Just because they didn't think about cpuset remapping when they coded
their mempolicy calls, doesn't mean they wouldn't be broken by changes
in how mempolicy numbers nodes.  Often, it's the other way around:
the less they thought of it, the more likely changing it would break

No - I will not agree to changing the default mempolicy kernel API node
numbering at this time.  Period.  Full stop.  We can add non-default
choices for now, and perhaps in the light of future experience, we

No, they may or may not be broken.  That depends on whether or not they had

If you're running apps that have specific hardware affinity requirements,
then perhaps you shouldn't be moving them about in the first place ;).
And if they did have such needs, aren't they just as likely to be busted
by AND'ing off some of their nodes as they are by remapping those nodes?

I sure wish I knew what real world, actual, not hypothetical, situations
were motivating this.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Tuesday, October 30, 2007 - 4:53 pm

Of course they have specific affinity needs, that's why they used 
mempolicies.  Remapping those policies to a set of nodes that resembles 
the original mempolicy's nodemask in terms of construction but without 
regard for the affinity those nodes have with respect to system topology 

No, because you're interleaving over the set of actual nodes you wanted to 
interleave over in the first place and not some pseudo-random set that 

You're defending the current remap behavior in terms of semantics of 
mempolicies?  My position, and Choice C's position, is that you either get 
the exact (or partially-constructed) policy that you asked for, or you get 
the MPOL_DEFAULT behavior.  What you don't get, even though it's currently 
how we do it, is a completely different set of nodes that you never 
intended to have a specific policy over.

		David
-

From: Paul Jackson
Date: Tuesday, October 30, 2007 - 5:29 pm

No.  Good grief.  If they are just looking for some set of memory
banks, not to other node-specific hardware, then they might not need
a specific node.

Consider for example a multi-threaded, compute bound, long running
scientific computation that has a substantial and fussy memory layout.
Remapping it from one cpuset to another having the same NUMA topology
may well work fine, once its memory caches recover.  Reverting it to
the lowest common denominator MPOL_DEFAULT policy because (Choice C) it
no longer has access to its initial nodes might devastate its
performance.


I'm still wishing ...

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Lee Schermerhorn
Date: Monday, October 29, 2007 - 9:54 am

If most apps use libnuma APIs instead of directly calling the sys calls,
libnuma could query something as simple as an environment variable, or a
new flag to get_mempolicy(), or the value of a file in its current
cpuset--but I'd like to avoid a dependency on libcpuset--to determine

I'd certainly like to hear from Oracle what libnuma features they use

Yeah.  This bothered me about policy remapping when I looked at it a
while back.  Worse, this behavior isn't documented as intended [or not].
I thought at the time that this could be solved by retaining the
original argument nodemask, but 1) I was worried about the size when ~1K
nodes are required to be supported and 2) it still doesn't solve the
problem of ensuring the same locality characteristics w/o a lot of
documentation about the implications of changing cpuset resources or
moving tasks between cpusets in such a way to preserve the locality
characteristics requested by the original mask.

Again, we stumble upon the notion of "intent".  If the intent is just to
spread allocations to share bandwidth, it probably doesn't matter.  If,
on the other hand, the original mask was carefully constructed, taking
into consideration the distances between the memories specified and
other resources [cpus in the cpuset, other memories in the cpuset, IO
adapter connection points, ...], there is a lot more to consider than


In libnuma in numactl-1.0.2 that I recently grabbed off Andi's site,
numa_available() indeed issues this call.  But, I don't see any internal
calls to numa_available() [comments say all other calls undefined when
numa_available() returns an error] nor any other calls to
get_mempolicy() with all null/0 args.  So, you'd be depending on the
application to call numa_available().  However, you could define an
additional MPOL_F_* flag to get_mempolicy() that is issued in library
init code to enable new behavior--again, based on some indication that

Only for apps that use the sys calls directly, right?  This can ...
From: Paul Jackson
Date: Monday, October 29, 2007 - 12:40 pm

The patch I'm working on has a new set of options to get_mempolicy to set
and get the per-task kernel state indicating whether to use the old or
new semantics.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Paul Jackson
Date: Monday, October 29, 2007 - 12:45 pm

Yes - as I noted in an earlier reply, the kernel just provides the
mechanisms.  It's up to user level code and people to decide whether
moving jobs around is a worthwhile activity in their situation.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Paul Jackson
Date: Monday, October 29, 2007 - 12:57 pm

Hmmm ... put your thinking hat on for my next comment ...

I could do one of two things in mm/mempolicy.c:
 B1) continue accepting nodemasks across the set_mempolicy and mbind
     system call APIs that are just like now (only nodes in the current
     task's cpuset matter), but then remember what was passed in, so that
     if the task's cpuset subsequently shrank down and then expanded
     again back to its original size, they would end up with the same
     memory policy placement they first had, or
 B2) accept nodemasks as if relative to the entire system, regardless
     of what cpuset they were in at the moment (all nodes in the system
     matter and can be specified.)

If I did B1, then that's just a subtle change in the API, and what
you agreed to above holds.

If I did B2, then that's a serious change in the way that nodes
are numbered in the nodemasks passed into mbind and set_mempolicy,
from being only nodes that happen to be in the task's current cpuset,
to being nodes relative to all possible nodes on the system.

We need B2, I think.  Otherwise, if a job happens to be running in
a shrunken cpuset, it can't request what memory policy placement
it wants should it end up in a larger cpuset later on.  With B1, we
would continue to have the timing dependencies between when a task
is moved between different size cpusets, and when it happens to issue
mbind/set_mempolicy calls.

But B2 is an across the board change in how we number the nodes
passed into mbind and set_mempolicy.  That is in no way an upward
compatible change.

I am strongly inclined toward B2, but it must be a non-default optional
mode, at least for a while, perhaps a long while.
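
A rough userspace sketch of the timing dependency that separates B1 from
B2.  Everything in it is hypothetical: struct task_policy, set_policy_b1(),
set_policy_b2() and effective_nodes() are names made up for illustration,
nodemasks are squeezed into a 64-bit word, and the plain intersection in
effective_nodes() is a simplifying stand-in for whatever remapping the
kernel would really apply.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task state: the nodemask remembered from the system call. */
struct task_policy {
	uint64_t user_nodes;
};

/*
 * B1: the request is read relative to the current cpuset, so nodes
 * outside that cpuset are dropped before anything is remembered.
 */
static void set_policy_b1(struct task_policy *p, uint64_t requested,
			  uint64_t allowed)
{
	p->user_nodes = requested & allowed;
}

/* B2: the request is read relative to the whole system and kept as passed. */
static void set_policy_b2(struct task_policy *p, uint64_t requested)
{
	p->user_nodes = requested;
}

/*
 * Placement in effect for a given cpuset.  Assumption: a plain
 * intersection; the real code would remap or fold instead.
 */
static uint64_t effective_nodes(const struct task_policy *p, uint64_t allowed)
{
	assert(allowed != 0);
	return p->user_nodes & allowed;
}
```

Under these assumptions, a task sitting in cpuset {0,1} that asks for nodes
{0,1,2,3} remembers 0x3 with B1 and 0xF with B2.  Once the cpuset grows to
{0,1,2,3}, B1 still yields 0x3 while B2 yields 0xF, which is the placement
the task asked for in the first place.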

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Paul Jackson
Date: Monday, October 29, 2007 - 1:02 pm

Aha - good point.  It happened to be the numactl command line utility
that I tested with that issued the get_mempolicy(0,0,0,0,0) call.

Yup - this proposed hack, to have the kernel revert to the original
memory policy nodemask numbering if it sees such a get_mempolicy call
is now officially dead meat.


Yes - I am intending to define such MPOL_F_* flags, to set and get
which behavior applies to the current task.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Saturday, October 27, 2007 - 10:45 am

Thanks for describing the situation with MPOL_PREFERRED so thoroughly.

I prefer Choice B because it does not force mempolicies to have any 
dependence on cpusets with regard to what nodemask is passed.

	[rientjes@xroads ~]$ man set_mempolicy | grep -i cpuset | wc -l
	0

It would be very good to store the passed nodemask to set_mempolicy in 
struct mempolicy, as you've already recommended for MPOL_INTERLEAVE, so 
that you can try to match the intent of the application as much as 
possible.  But since cpusets are built on top of mempolicies, I don't 
think there's any reason why we should respect any nodemask in terms of 
the current cpuset context, whether it's preferred or interleave.

So if you were to pass a nodemask with only the fourth node set for an 
MPOL_PREFERRED mempolicy, the correct behavior would be to prefer the 
fourth node on the system or, if constrained by cpusets, the fourth node 
in the cpuset.  If the cpuset has fewer than four nodes, the behavior 
should be undefined (probably implemented to just cycle the set of 
mems_allowed until you reach the fourth entry).  That's the result of 
constraining a task to a cpuset that obviously wants access to more
nodes -- it's a userspace mistake and abusing cpusets so that the task 
does not get what it expects.

That concept isn't actually new: we already restrict tasks to a certain 
amount of memory by writing to the mems file and just because it happens 
to have access to more memory when unconstrained by cpusets doesn't 
matter.  You've placed it in a cpuset that wasn't prepared to deal with 
what the task was asking for.  At least in the MPOL_PREFERRED case you 
describe above, it'll be dealt with much more pleasantly by at least 
giving it a preferred node as opposed to OOM killing it when a task has 
exhausted its available cpuset-constrained memory.

I'd prefer a solution where mempolicies can always be described and used 
without ever considering cpusets.  Then, a sane implementation will 
configure ...
From: Paul Jackson
Date: Saturday, October 27, 2007 - 2:22 pm

I do intend to implement it as you suggest.  See the lib/bitmap.c
routines bitmap_remap() and bitmap_bitremap(), and the nodemask
wrappers for these, nodes_remap() and node_remap().  They will
define the cycling, or I sometimes call it folding.
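
For readers who want to see the folding rule spelled out, here is a small
userspace sketch.  It is not the lib/bitmap.c code: the helpers weight(),
nth_set_bit(), fold_node() and fold_mask() are made-up names, and masks are
limited to 64 nodes for brevity.  The rule it illustrates is that requested
node n lands on the (n mod w)-th allowed node, where w is the number of
allowed nodes.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits in mask. */
static int weight(uint64_t mask)
{
	int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Position of the n-th (0-based) set bit in mask, or -1 if there is none. */
static int nth_set_bit(uint64_t mask, int n)
{
	for (int bit = 0; bit < 64; bit++) {
		if (mask & (UINT64_C(1) << bit)) {
			if (n == 0)
				return bit;
			n--;
		}
	}
	return -1;
}

/* Fold one system-relative node number onto the allowed nodes. */
static int fold_node(int node, uint64_t allowed)
{
	int w = weight(allowed);

	assert(w > 0 && node >= 0);
	return nth_set_bit(allowed, node % w);
}

/* Fold a whole request mask; each requested node is folded separately. */
static uint64_t fold_mask(uint64_t requested, uint64_t allowed)
{
	uint64_t out = 0;

	for (int node = 0; node < 64; node++)
		if (requested & (UINT64_C(1) << node))
			out |= UINT64_C(1) << fold_node(node, allowed);
	return out;
}
```

With allowed nodes {4,5,6}, a request for the fourth node (node 3) folds to
node 4, since 3 mod 3 is 0 and the first allowed node is 4.  A request for
nodes {0,1,2,3} folds to all of {4,5,6}.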

I would have tended to make this folding a defined part of the API,
though I will grant that the possibility of being lazy and forgetting

Nah - I wouldn't put it that way.  It's no mistake or abuse.  It's just
one more example of a kernel making too few resources look sufficient
by sharing, multiplexing and virtualizing them.  That's what kernels do.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Lee Schermerhorn
Date: Monday, October 29, 2007 - 8:10 am

Maybe it's just me, but I think it's pretty presumptuous to think we can
infer the intent of the application from the nodemask w/o additional
flags such as Christoph proposed [cpuset relative]--especially for
subsets of the cpuset.  E.g., the application could intend the nodemask
to specify memories within a certain distance of a physical resource,
such as where a particular IO adapter or set thereof attach to the
platform.  

And even when the intent is to preserve the cpuset relative positions of
the nodes in the nodemask, this really only makes sense if the original
and modified cpusets have the same physical topology w/rt multi-level
NUMA interconnects.  This is something that has bothered me about
dynamic cpusets and current policy remapping.  We don't do a good job of
explaining the implications of changing cpuset topology on applications,
nor do we handle it very well in the code.  Paul addresses one of my
concerns in a later message in this thread, so I'll comment there.

Later,
Lee

-

From: Paul Jackson
Date: Monday, October 29, 2007 - 11:41 am

Well, yes, we can't presume to know whether some application can move
or not.

But our kernel work is not presuming that.

It's providing mechanisms useful for moving apps.

The people using this decide what and when and if to move.

For example, the particular customers (HPC) I focus on for my job don't
move jobs because they don't want to take the transient performance
hit that would come from blowing out all their memory caches.

I'm guessing that David's situation involves something closer to what you
see with a shared web hosting service, running jobs that are very
independent of hardware particulars.

But in any case, we (the kernel) are just providing the mechanisms.
If they don't fit one's needs, don't use them ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Lee Schermerhorn
Date: Monday, October 29, 2007 - 12:01 pm

I'm with you on this last point!  I was reacting to the notion that we
can infer intent from a nodemask and that preserving the cpuset relative
numbering after changing cpuset resources or moving tasks preserves that
intent--especially if it involves locality and distance considerations.
I can envision sets of such transformations on HP platforms where
locality and distance would be preserved by preserving cpuset-relative
numbering, and many where they would not.  I expect you could do the
same for SGI platforms.  I'm not opposed to what you're trying to do,
modulo complexity concerns.  And I'm not saying that the complexity is
not worth it to customers.  But, given that we are just "providing the
mechanism", I think we need to provide very good documentation on the
implications of these mechanisms vis a vis whatever
characteristics--locality, distance, bandwidth sharing, ...--the
application intends when it installs a policy.

Like you, no doubt, I'm eyeballs deep in a number of things.  At some
point, I'll take a cut at enumerating various "intents" that different
types of applications might have when using mem policies and cpusets.
Others can add to that, or may even beat me to it.   We can then
evaluate how well these scenarios are served by the current mechanisms
and by whatever changes are proposed.

I should note that I really like cpusets--i.e., find them useful--and
I'm painfully aware of the awkward interactions with mempolicy.  On the
other hand, I don't want to sacrifice mem policy capabilities to shoe
horn them into cpusets.  In fact, I want to add additional mechanisms
that may also be awkward in cpusets.  As you say, "if they don't fit
your needs, don't use them."  

Later,
Lee

-

From: David Rientjes
Date: Tuesday, October 30, 2007 - 4:17 pm

The kernel is providing the mechanism to interleave over a set of nodes or 
prefer a single node for allocations, but it also provides for remapping 
those to different nodes, without regard to locality or affinity to 
specific hardware, when the cpuset changes.  That's what Choice C is 
intended to replace: a node means a node so either you get an effective 
mempolicy over the nodemask you asked for, or MPOL_DEFAULT is used because 
you lack sufficient access.
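
One possible reading of Choice C, sketched in userspace C.  The names
(sketch_mode, sketch_policy, choice_c_rebind) are invented for this
illustration, masks are limited to 64 nodes, and treating "insufficient
access" as "none of the requested nodes are allowed" is an assumption; the
text above does not pin that detail down.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_mode { SKETCH_DEFAULT, SKETCH_INTERLEAVE };

struct sketch_policy {
	enum sketch_mode mode;
	uint64_t nodes;		/* nodes actually used; 0 for SKETCH_DEFAULT */
};

/*
 * Recompute the policy when the cpuset changes.  Requested nodes that
 * are no longer allowed are dropped, never replaced by other nodes.
 */
static struct sketch_policy choice_c_rebind(uint64_t requested,
					    uint64_t allowed)
{
	struct sketch_policy pol = { SKETCH_DEFAULT, 0 };
	uint64_t usable = requested & allowed;

	assert(allowed != 0);
	if (usable != 0) {
		pol.mode = SKETCH_INTERLEAVE;
		pol.nodes = usable;
	}
	return pol;
}
```

A request for nodes {2,3} in a cpuset allowing {1,2} interleaves over node 2
only; moved to a cpuset allowing {0,1}, the same request falls back to the
default policy instead of being shifted onto nodes 0 and 1.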

		David
-

From: Paul Jackson
Date: Tuesday, October 30, 2007 - 5:03 pm

Yes, one remaps nodes it can't provide, and the other removes
nodes it can't provide.

Yup - that's a logical difference.  So ... I would think that
the only solution that would be satisfactory to apps that require
specific hardware nodes would be to simply not move them in the
first place.  If you do that, then none of these Choices matter
in the slightest.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Tuesday, October 30, 2007 - 3:57 pm

I agree with your assessment of our current policy remapping with respect 
to the passed nodemask, I think it's troublesome.  Whether we can change 
that now is another question, but the remap certainly doesn't help respect 
the intent of the application and the mempolicies they have set up when 
influenced by an outside entity such as cpusets.

See my new Choice C alternative.

		David
-

From: Paul Jackson
Date: Tuesday, October 30, 2007 - 4:46 pm

... guess that depends on the intent, doesn't it?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Lee Schermerhorn
Date: Friday, October 26, 2007 - 1:43 pm

Well, it's an extension for sure, but a backward compatible one.  It
should not affect any correct existing application--i.e., one that
checks its return status--except maybe the odd test program that needs
to be updated to handle the new semantics.  We're allowed to extend APIs
as long as we don't break correct applications, right?

I mean, it's not like it's a new argument or such.

Lee


-

From: Lee Schermerhorn
Date: Friday, October 26, 2007 - 8:18 am

That's what my "cpuset-independent interleave" patch does.  David
doesn't like the "null node mask" interface because it doesn't work with
libnuma.  I plan to fix that, but I'm chasing other issues.  I should
get back to the mempol work after today.

What I like about the cpuset independent interleave is that the "policy
remap" when cpusets are changed is a NO-OP--no need to change the
policy.  Just as "preferred local" policy chooses the node where the
allocation occurs, my cpuset independent interleave patch interleaves
across the set of nodes available at the time of the allocation.  The
application has to specifically ask for this behavior by the null/empty
nodemask or the TBD libnuma API.  IMO, this is the only reasonable
interleave policy for apps running in dynamic cpusets.
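
To illustrate why no remap is needed, here is a userspace sketch of picking
the next interleave node from whatever nodes are allowed at allocation
time.  It is not taken from the patch: next_allowed_node() is a made-up
helper, masks are limited to 64 nodes, and the caller is assumed to keep
the previously used node between calls.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the next node after prev that is set in allowed, wrapping
 * around at 64.  Pass prev = -1 to start from the lowest allowed node.
 * The allowed mask is sampled on every call, so a cpuset change takes
 * effect on the next allocation with no stored nodemask to rewrite.
 */
static int next_allowed_node(int prev, uint64_t allowed)
{
	assert(allowed != 0 && prev >= -1 && prev < 64);
	for (int step = 1; step <= 64; step++) {
		int node = (prev + step) % 64;

		if (allowed & (UINT64_C(1) << node))
			return node;
	}
	return -1;	/* not reached: allowed is non-empty */
}
```

With nodes {0,1,2,3} allowed, successive calls cycle 0, 1, 2, 3, 0.  If the
cpuset shrinks to {0,1} after node 2 was used, the next call simply returns
node 0.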

An aside:  if David et al [at google] are using cpusets on fake numa for
resource management [I don't know this is the case, but saw some
discussions way back that indicate it might be?], then maybe this
becomes less of an issue when control groups [a.k.a. containers] and
memory resource controls come to fruition?

Lee

-

From: Christoph Lameter
Date: Friday, October 26, 2007 - 10:36 am

But this makes it cpuset dependent. The set of nodes is dependent on the 
cpuset. If it would be independent then interleave could allow any nodes 


Yes very likely.

-

From: David Rientjes
Date: Friday, October 26, 2007 - 11:45 am

Hacking and requiring an updated version of libnuma to allow empty 
nodemasks to be passed is a poor solution; if mempolicies are supposed to 
be independent from cpusets, then what semantics does an empty nodemask 
actually imply when using MPOL_INTERLEAVE?  To me, it means the entire 
set_mempolicy() should be a no-op, and that's exactly how mainline 
currently treats it _as_well_ as libnuma.  So justifying this change in 
the man page is respectable, but passing an empty nodemask just doesn't 

Passing empty nodemasks with MPOL_INTERLEAVE to set_mempolicy() is the 
only reasonable way of specifying you want, at all times, to interleave 
over all available nodes?  I doubt it.

I personally prefer an approach where cpusets take the responsibility for 
determining how policies change (they use set_mempolicy() anyway to effect 
their mems boundaries) because it's cpusets that has changed the available 
nodemask out from beneath the application.  So instead of trying to create 
a solution where cpusets impact mempolicies and mempolicies impact 
cpusets, it should only be in a single direction.  Cpusets change the 
set of available nodes and should update the attached tasks' mempolicies 
at the same time.  That's the same as saying that cpusets should be built 
on top of mempolicies, which they are, and shouldn't have any reverse 

Completely irrelevant; I care about the interaction between cpusets and 
mempolicies in mainline Linux.

		David
-

From: Paul Jackson
Date: Friday, October 26, 2007 - 12:02 pm

Agreed.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: David Rientjes
Date: Saturday, October 27, 2007 - 12:16 pm

Another reason that passing an empty nodemask to set_mempolicy() doesn't 
make sense is that libnuma uses numa_set_interleave_mask(&numa_no_nodes)
to disable interleaving completely.

		David
-

From: Lee Schermerhorn
Date: Monday, October 29, 2007 - 9:23 am

David:  as we discussed when you contacted me off-list about this, the
libnuma API and the system call interface are two quite different APIs.
For example,  numa_set_interleave_mask(&numa_no_nodes) does not pass
MPOL_INTERLEAVE with an empty mask to set_mempolicy().  Rather it
"installs" an MPOL_DEFAULT policy which internally just deletes the
task's mempolicy, allowing fallback to system default policy.  I would
not propose to change this behavior, nor break libnuma in any way.

For others, who weren't involved in the off-list exchange, here's an
excerpt from my response to David:

[
At the libnuma level, I think we need an explicit
"numa_set_interleave_allowed()"--analogous to "numa_set_localalloc()".

The current "numa_alloc_interleaved()" should, I think, allocate on all
*allowed* nodes, rather than all nodes.  It can do this using the sys
call interface as defined.

Independent of cpuset-independent interleave, an application needs to
pass a valid subset of the current mems allowed to
"numa_alloc_interleaved_subset()".   An application can now obtain the
mems_allowed using the MPOL_F_MEMS_ALLOWED flag that I added, but we
need a libnuma wrapper for this as well.  [Yeah, this info can change at
any time, but that's always been the case....]

"numa_interleave_memory()" is essentially mbind(), I think [not looking
at the libnuma source code at this moment].  Maybe provide
"numa_interleave_memory_allowed(void *mem, size_t size)" ???

Finally, I think we need to add a query function:  
"nodemask_t numa_get_mems_allowed()" to return the mask of valid nodes
in the current context [cpuset].  This would just be a wrapper around
get_mempolicy() with the MPOL_F_MEMS_ALLOWED flag.
]

Couple of comments on the above:

1. "the sys call interface as defined" in the 2nd paragraph of the
excerpt refers to my patch that uses null/empty nodemask to indicate "all
allowed".

2.  As this thread progresses, you've discussed relaxing the requirement
that applications pass a valid ...
From: Andi Kleen
Date: Monday, October 29, 2007 - 10:35 am

cpuset support in libnuma/numactl is still incomplete. I'm also
not sure what the best way to handle this is.

Probably there should be a switch for both.

-Andi
-

From: Paul Jackson
Date: Monday, October 29, 2007 - 12:35 pm

The more I have stared at this, the more certain I've become that we
need to make the mbind/mempolicy calls modal -- the default mode
continues to interpret node numbers and masks just as these calls do
now, and the alternative mode provides the so called "Choice B",
which takes node numbers and masks as if the task owned the entire
system, and then the kernel internally and automatically scrunches
those masks down to whatever happens to be the current cpuset of
the task.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Christoph Lameter
Date: Monday, October 29, 2007 - 1:36 pm

Ack.

-

From: Andi Kleen
Date: Monday, October 29, 2007 - 2:08 pm

So the user space asks for 8 nodes because it knows the machine
has that many from /sys and it only gets 4 if a cpuset says so? That's
just bad semantics. And is not likely to make the user programs happy.

I don't think you'll get around teaching user space (or rather libnuma)
about cpusets and letting it handle it.

From the libnuma perspective the machine size would be essentially 
current cpuset size. 

On the syscall level I don't think it makes much sense to change though.

The alternative would be to throw out the complete cpuset concept and go for 
virtual nodes inside containers with virtualized /sys.

-Andi


-

From: Paul Jackson
Date: Monday, October 29, 2007 - 3:48 pm

That's no different than what can happen today -- if a task actually
is in an 8 node cpuset, sets up its mempolicies accordingly, and then
gets shoe horned into a 4 node cpuset.

It's not good or bad; it's just interactions between two mechanisms.

If your app doesn't run well in a small cpuset, don't run it there
(or do run it there, poorly ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Paul Jackson
Date: Tuesday, October 30, 2007 - 12:47 pm

Andi, Christoph, or whomever:

  Are there any good regression tests of mempolicy functionality?

  This patch I'm coding is delicate enough that I probably broke
  something.  It would be nice to catch it sooner rather than later.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Lee Schermerhorn
Date: Tuesday, October 30, 2007 - 1:20 pm

Paul:  Andi has a regression test in the numactl source package.

Try:
	http://freshmeat.net/redir/numactl/62210/url_tgz/numactl-1.0.2.tar.gz

Lee

-

From: Paul Jackson
Date: Tuesday, October 30, 2007 - 1:26 pm

Good - thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401
-

From: Andi Kleen
Date: Tuesday, October 30, 2007 - 1:27 pm

numactl has some basic tests (make test). I think newer LTP 
also has some but I haven't looked at them. And there is Lee's
memtoy which does some things; but I don't think it's very automated.

-Andi
-

From: Paul Jackson
Date: Thursday, October 25, 2007 - 6:13 pm

I'm probably going to be ok with this ... after a bit.

1) First concern - my primary issue:

    One thing I really want to change, the name of the per-cpuset file
    that controls this option.  You call it "interleave_over_allowed".

    Take a look at the existing per-cpuset file names:

	    $ grep 'name = "' kernel/cpuset.c
	    .name = "cpuset",
	    .name = "cpus",
	    .name = "mems",
	    .name = "cpu_exclusive",
	    .name = "mem_exclusive",
	    .name = "sched_load_balance",
	    .name = "memory_migrate",
	    .name = "memory_pressure_enabled",
	    .name = "memory_pressure",
	    .name = "memory_spread_page",
	    .name = "memory_spread_slab",
	    .name = "cpuset",

    The name of every memory related option starts with "mem" or "memory",
    and the name of every memory interleave related option starts with
    "memory_spread_*".

    Can we call this "memory_spread_user" instead, or something else
    matching "memory_spread_*" ?

    The names of things in the public APIs are a big issue of mine.

2) Second concern - lesser code clarity issue:

    The logic surrounding current_cpuset_interleaved_mems() seems a tad
    opaque to me.  It appears on the surface as if the memory policy code,
    in mm/mempolicy.c, is getting a nodemask from the cpuset code by
    calling this routine, as if there were an independent per-cpuset
    nodemask stating over what nodes to interleave for MPOL_INTERLEAVE.

    But all that is returned is either (1) an empty node mask or (2) the
    current task's allowed node mask.  If an empty mask is returned, this
    tells the MPOL_INTERLEAVE code to use the mask the user specified in
    an earlier set_mempolicy MPOL_INTERLEAVE call.  If a non-empty mask
    is returned, then the previous user specified mask is ignored and
    that non-empty mask (just all the current cpuset's allowed nodes) is
    used instead.

    Restating this in pseudo code, from your patch, the mempolicy.c
    MPOL_INTERLEAVE code to rebind ...
From: David Rientjes
Date: Thursday, October 25, 2007 - 6:30 pm

Sounds better.  I was hoping somebody was going to come forward with an 

That sounds reasonable, it will simply be a wrapper around 

For setting current->il_next, both cases work but yours will be better 
balanced for the next interleaved allocation.  I'll apply it to my 
patchset.

Thanks for the review.

		David
-
