Extract a helper function from update_nodemask() to load an array of
mm_struct pointers with references to each task's mm_struct that is
currently attached to a given cpuset.
This will be used later for other purposes where memory policies need to
be rebound for each task attached to a cpuset.
Cc: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
kernel/cpuset.c | 130 ++++++++++++++++++++++++++++++++++---------------------
1 files changed, 81 insertions(+), 49 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -702,6 +702,79 @@ done:
/* Don't kfree(doms) -- partition_sched_domains() does that. */
}
+/*
+ * Loads mmarray with pointers to all the mm_struct's of tasks attached to
+ * cpuset cs.
+ *
+ * The reference count to each mm is incremented before loading it into the
+ * array, so put_cpuset_mm_array() must be called after this function to
+ * decrement each reference count and free the memory allocated for mmarray
+ * via this function.
+ */
+static struct mm_struct **get_cpuset_mm_array(const struct cpuset *cs,
+ int *ntasks)
+{
+ struct mm_struct **mmarray;
+ struct task_struct *p;
+ struct cgroup_iter it;
+ int count;
+ int fudge;
+
+ *ntasks = 0;
+ fudge = 10; /* spare mmarray[] slots */
+ fudge += cpus_weight(cs->cpus_allowed); /* imagine one fork-bomb/cpu */
+ /*
+ * Allocate mmarray[] to hold mm reference for each task in cpuset cs.
+ * Can't kmalloc GFP_KERNEL while holding tasklist_lock. We could use
+ * GFP_ATOMIC, but with a few more lines of code, we can retry until
+ * we get a big enough mmarray[] w/o using GFP_ATOMIC.
+ */
+ while (1) {
+ count = cgroup_task_count(cs->css.cgroup); /* guess */
+ count += fudge;
+ mmarray = kmalloc(count * sizeof(*mmarray), GFP_KERNEL);
+ if (!mmarray)
+ return ...

Adds a new 'interleave_over_allowed' option to cpusets.

When a task with an MPOL_INTERLEAVE memory policy is attached to a cpuset
with this option set, the interleaved nodemask becomes the cpuset's
mems_allowed. When the cpuset's mems_allowed changes, the interleaved
nodemask for all tasks with MPOL_INTERLEAVE memory policies is also updated
to be the new mems_allowed nodemask.

This allows applications to specify that they want to interleave over all
nodes that they are allowed to access. This set of nodes can be changed at
any time via the cpuset interface and each individual memory policy is
updated to reflect the changes for all attached tasks when this option is
set.

Cc: Andi Kleen <ak@suse.de>
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
Documentation/cpusets.txt | 30 +++++++++++++++++++-
include/linux/cpuset.h | 6 ++++
kernel/cpuset.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
mm/mempolicy.c | 6 ++++
4 files changed, 104 insertions(+), 2 deletions(-)
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -20,7 +20,8 @@ CONTENTS:
1.5 What is memory_pressure ?
1.6 What is memory spread ?
1.7 What is sched_load_balance ?
- 1.8 How do I use cpusets ?
+ 1.8 What is interleave_over_allowed ?
+ 1.9 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
@@ -497,7 +498,32 @@ the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.
-1.8 How do I use cpusets ?
+1.8 What is interleave_over_allowed ?
+-------------------------------------
+
+Tasks may specify a memory policy of MPOL_INTERLEAVE with the desired
+result of ...
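As an aside on the first patch: the comment in get_cpuset_mm_array()
describes sizing the array from an unlocked guess plus some spare slots,
allocating before any lock is taken, and retrying if the guess turns out
too small. The user-space sketch below illustrates only that pattern. It
is not the elided remainder of the function; struct toy_group,
toy_count() and toy_snapshot() are hypothetical names, and the locking is
represented by comments alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cpuset's task list; not a kernel structure. */
struct toy_group {
	const int *ids;		/* one id per member */
	size_t nmembers;	/* may change while no lock is held */
};

/* Stand-in for cgroup_task_count(): an unlocked guess at the size. */
size_t toy_count(const struct toy_group *g)
{
	return g->nmembers;
}

/*
 * Copy the member ids into a newly allocated array.  Returns NULL if the
 * allocation fails; otherwise *n holds the number of entries filled and
 * the caller frees the array.
 */
int *toy_snapshot(const struct toy_group *g, size_t fudge, size_t *n)
{
	assert(g != NULL && n != NULL);

	for (;;) {
		size_t cap = toy_count(g) + fudge;	/* guess plus spare slots */
		int *buf;
		size_t i;

		if (cap == 0)
			cap = 1;	/* avoid a zero-sized allocation */

		/* Allocate while unlocked, where sleeping would be allowed. */
		buf = malloc(cap * sizeof(*buf));
		if (!buf)
			return NULL;

		/* A real implementation would take its lock here. */
		if (g->nmembers <= cap) {
			for (i = 0; i < g->nmembers; i++)
				buf[i] = g->ids[i];
			*n = g->nmembers;
			/* ... and drop the lock here. */
			return buf;
		}

		/* Membership outgrew the guess: unlock, free, and retry. */
		free(buf);
	}
}
```

In a single-threaded run the retry branch is never taken; it matters only
when members can be added between the count and the locked walk.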
More interactions between cpusets and memory policies. We have to be
careful here to keep clean semantics.

Isn't it a bit surprising for an application that has set up a custom
MPOL_INTERLEAVE policy if the nodes suddenly change because of a cpuset
or mems_allowed change?
-
Every MPOL_INTERLEAVE policy is a custom policy that the application has
set up. If you don't use cpusets at all, the nodemask you pass to
set_mempolicy() with MPOL_INTERLEAVE is static and won't change without
the application's knowledge. It has full control over the nodemask that
it desires to interleave over.

The problem occurs when you add cpusets into the mix and permit the
allowed nodes to change without the application's knowledge. Right now,
a simple remap is done so if the cardinality of the set of nodes
decreases, you're interleaving over a smaller number of nodes. If the
cardinality increases, your interleaved nodemask isn't expanded. That's
the problem that we're facing.

The remap itself is troublesome because it doesn't take into account the
user's desire for a custom nodemask to be used anyway; it could remap an
interleaved policy over several nodes that will already be contended
with one another.

Normally, MPOL_INTERLEAVE is used to reduce bus contention to improve
the throughput of the application. If you remap the number of nodes to
interleave over, which is currently how it's done when mems_allowed
changes, you could actually be increasing latency because you're
interleaving over the same bus.

This isn't a memory policy problem because all it does is effect a
specific policy over a set of nodes. With my change, cpusets are
required to update the interleaved nodemask if the user specified that
they desire the feature with interleave_over_allowed. Cpusets are, after
all, the ones that changed the mems_allowed in the first place and
invalidated our custom interleave policy. We simply can't make
inferences about what we should do, so we allow the creator of the
cpuset to specify it for us.

So the proper place to modify an interleaved policy is in cpusets and
not mempolicy itself.

David
-
Right. So I think we are fine if the application cannot set up boundaries

Well you may hit some nodes more than others so a slight performance

With that MPOL_INTERLEAVE would be context dependent and no longer needs
translation. Lee had similar ideas. Lee: Could we make MPOL_INTERLEAVE
generally cpuset context dependent?
-
Well ... MPOL_INTERLEAVE already is essentially cpuset relative.

So long as the cpuset size (number of allowed memory nodes) doesn't
change, whatever MPOL_INTERLEAVE you set is remapped whenever the
cpuset's 'mems' changes, preserving the cpuset relative interleaving.

The problem, as David explains, comes when cpusets change sizes.
When the cpuset gets smaller, one can still do a pretty good job,
scrunching down the interleave nodes in proportion. But when the
cpuset gets larger, it's not clear how to convert a subset of a
smaller set, to an equivalent subset of a larger set.

The existing code handled this last case by saying screw it -- don't
expand the set of interleave nodes when the cpuset 'mems' grows.

David's new code handles this last case by adding a new per-cpuset
Boolean that adds a new alternative, forcing all the tasks using
MPOL_INTERLEAVE in that cpuset, anytime thereafter that the cpuset's
'mems' changes, to get interleaved over the entire cpuset.

Now that I spell it out that way, I am having second thoughts about
this one. It's another special case palliative, given that we can't
give the user what they really want.

David - could you describe the real world situation in which you
are finding that this new 'interleave_over_allowed' option, aka
'memory_spread_user', is useful? I'm not always opposed to special
case solutions; but they do usually require special case needs to
justify them ;).

I suspect that the general case solution would require having the user
pass in two nodemasks, call them ALL and SUBSET, requesting that
relative to the ALL nodes, interleave be done on the SUBSET nodes.
That way, even if say the task happened to be running in a cpuset with
a -single- allowed memory node at the moment, it could express its
user memory interleave memory needs for the general case of any number
of nodes. Then for whatever nodes were currently allowed by the cpuset
to that task at any point, the nodes_remap() logic could be done to
derive from the ...
Yes, when a task with MPOL_INTERLEAVE has its cpuset mems_allowed
expanded to include more memory. The task itself can't access all that
memory with the memory policy of its choice.

Since the cpuset has changed the mems_allowed of the task without its
knowledge, it would require a constant get_mempolicy() and
set_mempolicy() loop in the application to catch these changes. That's
obviously not in the best interest of anyone.

So my change allows those tasks that have already expressed the desire
to interleave their memory with MPOL_INTERLEAVE to always use the full
range of memory available that is dynamically changing beneath them as a
result of cpusets.

Keep in mind that it is still possible to request an interleave only
over a subset of allowed mems: but you must do it when you create the
interleaved mempolicy after it has been attached to the cpuset.
set_mempolicy() changes are always honored.

The only other way to support such a feature is through a modification
to mempolicies themselves, which Lee has already proposed. The problem
with that is it requires mempolicy support for cpuset cases and
modification to the set_mempolicy() API. My solution presents a cpuset
fix for a cpuset

I find it hard to believe that a single cpuset with a single
memory_spread_user boolean is going to include multiple tasks that
request interleaved mempolicies over differing nodes within the cpuset's
mems_allowed. That, to me, is the special case.

David
-
That much I could have guessed (did guess, actually.)

Are you seeing this in a real world situation? Can you describe the
situation? I don't mean just describing how it looks to this kernel
code, but what is going on in the system, what sort of job mix or
applications, what kind of users, ... In short, a "use case", or brief
approximation thereto. See further:

http://en.wikipedia.org/wiki/Use_case

I have no need of a full blown use case; just a three sentence
mini-story should suffice. But it should (if you can, without revealing
proprietary knowledge) describe a situation you have

Yup, that it does. Note that it is a special case -- "the full range",
not any application controlled specific subset thereof, short of
reissuing set_mempolicy() calls anytime that the application's cpuset

Do you have a link to what Lee proposed? I agree that a full general
solution would seem to require a new or changed set_mempolicy API,
which may well be more than we want to do, absent a more compelling

That may well be, to you. To me, pretty much -all- uses of
set_mempolicy() are special cases ;). I have no way of telling
whether or not there are users who would require multiple tasks in the
same cpuset to have different interleave masks, but since the API
clearly supports that (except when changing cpuset 'mems' settings mess
things up), I have been presuming that somewhere in the universe, such
users exist or might come to exist.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Yes, when using cpusets for resource control. If memory pressure is
being felt for that cpuset and additional mems are added to alleviate
possible OOM conditions, it is insufficient to allow tasks within that
cpuset to continue using memory policies that prohibit them from taking
advantage of the extra memory.

The best remedy for that situation is to give the cpuset owner the
option of allowing tasks with MPOL_INTERLEAVE policies to always
interleave over the entire set of available mems so they can be
dynamically expanded and

http://marc.info/?l=linux-mm&m=118849999128086
-
Well ... "resource control" is a tad thin for a decent "use case".
But ok ... that's a little more compelling.
The user space man pages for set_mempolicy(2) are now even more
behind the curve, by not mentioning that MPOL_INTERLEAVE's mask
might mean nothing, if (1) in a cpuset marked memory_spread_user,
(2) after the cpuset has changed 'mems'.
I wonder if there is any way to fix that. Who does the man pages
for Linux system calls?
Hmmm ... that reminds me ... the period of time between when the
task issues the set_mempolicy(2) MPOL_INTERLEAVE call and when some
cpuset 'mems' change subsequently moves its memory placement is an
anomaly here. During that period of time, the MPOL_INTERLEAVE mask
-does- apply, even if a subset of the 'mems' in the task's cpuset.
This could result in test cases missing some failures. If they
test with a particular, carefully crafted MPOL_INTERLEAVE mask
that is a proper (strictly less than) subset of the nodes allowed
in the cpuset, they might not notice that their code is broken if
they happen to be in a memory_spread_user cpuset after a 'mems'
change has jammed the entire cpuset's 'mems' into their interleave
mask.
Perhaps we should make it so that doing a set_mempolicy(2) call
to set MPOL_INTERLEAVE immediately changes the memory policy to
the cpuset's mems_allowed.
A key advantage in doing this would be that the set_mempolicy user
documentation could simply state that the MPOL_INTERLEAVE mask is
ignored when in a cpuset marked memory_spread_user, instead interleaving
over all the memory nodes in the cpuset. This would be quite a bit
simpler and clearer than saying that the cpuset's nodes are used only
after subsequent cpuset 'mems' changes.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Yeah. They were already outdated in the sense that they did not specify
that the interleave nodemask could change as a result of a cpuset mems

Well, sure, but mempolicies already get overridden by cpusets anyway.
For example, if you were to attach a task with an MPOL_BIND mempolicy to
a cpuset with a disjoint set of allowed mems.

The important distinction is that you can still interleave over a subset
of the mems_allowed if you set your memory policy after being attached to

No, because that would negate the above. We still want to be able to
restrict interleaved memory policies to a subset of allowed mems. This

I think that documenting the change in the man page as saying that "the
nodemask will include all allowed nodes if the mems_allowed of a
memory_spread_user cpuset is expanded" is better.

I've got a few fixes for my patchset queued so I'll resend it later;
it's mostly style changes but there is a subtle bug where the task
changing the value of a cpuset's memory_spread_page is not in the same
cpuset.

David
-
Ok. I'm inclined the other way, but not certain enough of my
position to push the point any further.
Ok. Good work - thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Michael Kerrisk, whom I've copied, does.

I recently sent in an update to all of the mempolicy man pages that
describe the behavior as it currently exists. [I need to send in an
update for MPOL_F_MEMS_ALLOWED].

One of the things that has bothered me is that there are no cpuset man
pages to reference from the mempolicy man pages. [I know, we can and do
refer to the kernel source Documentation, but that might not be
available to everyone w/o some digging. "See Also" refs typically point
at other man pages...]. To get around this, I had to talk about "nodes
allowed in the current context" or some such weasel-wording in my
updates.

Paul: what do you think about subsetting the cpuset.txt into a man page

Lee
-
Oh dear --- looking back in my work queue I have with my employer, I
see I have a task that is now over a year old, still unfinished, to
provide man pages for cpusets to "Michael Kerrisk" <mtk-manpages@gmx.net>

So, yes, I agree this would be a "good thing". I just haven't
gotten a round to it (http://www.quantumenterprises.co.uk/roundtuit/index.htm)
yet.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
I'm a little backed up myself, right now, or I'd offer to take a cut for you to review. Once I get some free time [Hah!], I'll check with you again. If you get started before then, I'd be happy to review. Lee -
Yes, it would be great to have those pages. Is there anything I can do
to assist?

Cheers,

Michael

PS Note my new address for man-pages: mtk.manpages@gmail.com
-
Got any spare round tuit's ;)?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
I ran out quite some time ago unfortunately. Cheers, Michael -
Actually, my patch doesn't change the set_mempolicy() API at all, it
just co-opts a currently unused/illegal value for the nodemask to
indicate "all allowed nodes". Again, I need to provide a libnuma API to
request this. Soon come, mon...

Here's a link to the last posting of my patch, as Paul requested:

http://marc.info/?l=linux-mm&m=118849999128086&w=4

A bit out of date, but I'll fix that maybe next week.

Lee

<snip>
-
If something that was previously unaccepted is now allowed with a newly-introduced semantic, that's an API change. -
Without at least this sort of change to MPOL_INTERLEAVE nodemasks,
allowing either empty nodemasks (Lee's proposal) or extending them
outside the current cpuset (what I'm cooking up now), there is no way
for a task that is currently confined to a single node cpuset to say
anything about how it wants to be interleaved in the event that it is
subsequently moved to a larger cpuset. Currently, such a task is only
allowed to pass exactly one particular nodemask to set_mempolicy
MPOL_INTERLEAVE calls, with exactly the one bit corresponding to its
current node. No useful information can be passed via an API that only
allows a single legal value.
But you knew that ...
You were just correcting my erroneously unqualified statement. Good.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Well, passing a single node to set_mempolicy() for MPOL_INTERLEAVE doesn't make a whole lot of sense in the first place. I prefer your solution of allowing set_mempolicy(MPOL_INTERLEAVE, NODE_MASK_ALL) to mean "interleave me over everything I'm allowed to access." NODE_MASK_ALL would be stored in the struct mempolicy and used later on mpol_rebind_policy(). David -
So instead of an empty nodemask we would pass a nodemask where all bits are set? And they would stay set but the cpuset restrictions would effectively limit the interleaving to the allowed set? rebind could ignore rebinds if all bits are set. -
You would pass NODE_MASK_ALL if your intent was to interleave over everything you have access to, yes. Otherwise you can pass whatever you want access to and your interleaved nodemask becomes mpol_rebind_policy()'s newmask formal (the cpuset's new mems_allowed) AND'd with pol->passed_nodemask. -
We would need two fields in the policy structure 1. The specified nodemask (generally ignored) 2. The effective nodemask (specified & cpuset_mems_allowed) If we have these two then its easy to get a bit further by making the first nodemask a relative nodemask. The calculation of the effective nodemask changes somewhat but the logic is then applicable to MPOL_BIND as well. -
You don't need to save the entire mask--just note that NODE_MASK_ALL was
passed--like with my internal MPOL_CONTEXT flag. This would involve
special casing NODE_MASK_ALL in the error checking, as currently
set_mempolicy() complains loudly if you pass non-allowed nodes--see
"contextualize_policy()". [mbind() on the other hand, appears to allow
any nodemask, even outside the cpuset. guess we catch this during
allocation.] This is pretty much the spirit of my patch w/o the API
change/extension [/improvement :)]

For some systems [not mine], the nodemasks can get quite large. I have
a patch, that I've tested atop Mel Gorman's "onezonelist" patches that
replaces the nodemasks embedded in struct mempolicy with pointers to
dynamically allocated ones. However, it's probably not much of a win,
memorywise, if most of the uses are for interleave and bind
policies--both of which would always need the nodemasks in addition to
the pointers.

Now, if we could replace the 'cpuset_mems_allowed' nodemask with a
pointer to something stable, it might be a win.

Lee
-
The memory policies are already shared and have refcounters for that purpose. -
I must have missed that in the code I'm reading :) Have a nice weekend. Lee -
What is the benefit of having pointers to nodemasks? We likely would need to have refcounts in those nodemasks too? So we duplicate a lot of the characteristics of memory policies? -
Hi, Christoph:

Remoting the nodemasks from the mempolicy and allocating them only when
needed is something that you and Mel and I discussed last month, in the
context of Mel's "one zonelist filtered by nodemask" patches. I just put
together the dynamic nodemask patch [included below FYI, NOT for serious
consideration] to see what it looked like and whether it helped.
Conclusion: it's ugly/complex [especially trying to keep the nodemasks
embedded for systems that don't require > a pointer's worth of bits] and
they probably don't help much if most uses of non-default mempolicy
require a nodemask.

I only brought it up again because now you all are considering another
nodemask per policy. In fact, I only considered it in the first place
because nodemasks on our [HP's] platform don't require more than a
pointer's worth of bits [today, at least--I don't know about future
plans]. However, since we share an arch--ia64--with SGI and distros
don't want to support special kernels for different vendors, if they can
avoid it, we have 1K-bit nodemasks. Since this is ia64 we're talking
about, most folks don't care. Now that you're going to do the same for
x86_64, it might become more visible. Then again, maybe there are few
enough mempolicy structs that no-one will care anyway.

Note: I don't [didn't] think I need to ref count the nodemasks
associated with the mempolicies because they are allocated when the
mempolicy is and destroyed when the policy is--not shared. Just like the
custom zonelist for bind policy, and we have no ref count there. I.e.,
they're protected by the mempol's ref. However, now that you bring it
up, I'm wondering about the effects of policy remapping, and whether we
have the reference counting or indirect protection [mmap_sem, whatever]
correct there in current code. I'll have to take a look.

Lee
The patch David and I are discussing will replace the
cpuset_mems_allowed nodemask in struct mempolicy, not
add a new nodemask. In other words, the meaning and
name of that existing nodemask will change, with no
change in the overall structure size.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
In that case we could just put the nodemask at the end of the mempolicy structure and then allocate the size needed? That way we would not need to deref an additional pointer? -
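One way to read this suggestion is a C99 flexible array member sized at
allocation time. A user-space sketch, assuming the number of possible
nodes is known when the policy is allocated; struct sketch_mempolicy and
the sketch_* helpers are hypothetical and are not the kernel's struct
mempolicy.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct sketch_mempolicy {
	int mode;
	unsigned int nr_nodes;	/* number of bits valid in nodes[] */
	unsigned long nodes[];	/* nodemask stored inline, sized at allocation */
};

/* Number of longs needed to hold nr_nodes bits. */
size_t sketch_mask_longs(unsigned int nr_nodes)
{
	return (nr_nodes + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}

/* One allocation covers the header and the mask, which starts out empty. */
struct sketch_mempolicy *sketch_policy_alloc(int mode, unsigned int nr_nodes)
{
	struct sketch_mempolicy *pol;

	pol = calloc(1, sizeof(*pol) +
			sketch_mask_longs(nr_nodes) * sizeof(pol->nodes[0]));
	if (!pol)
		return NULL;
	pol->mode = mode;
	pol->nr_nodes = nr_nodes;
	return pol;
}

void sketch_node_set(struct sketch_mempolicy *pol, unsigned int node)
{
	assert(node < pol->nr_nodes);
	pol->nodes[node / SKETCH_BITS_PER_LONG] |=
		1UL << (node % SKETCH_BITS_PER_LONG);
}

int sketch_node_isset(const struct sketch_mempolicy *pol, unsigned int node)
{
	assert(node < pol->nr_nodes);
	return (pol->nodes[node / SKETCH_BITS_PER_LONG] >>
		(node % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

In this layout a 1K-bit mask adds 128 bytes after the header only where
that many nodes are possible, and the mask is reached without an extra
pointer dereference.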
Not really, because perhaps your application doesn't want to interleave over all nodes. I suggested NODE_MASK_ALL as the way to get access to all the memory you are allowed, but it's certainly plausible that an application could request to interleave only over a subset. That's the entire reason set_mempolicy(MPOL_INTERLEAVE) takes a nodemask anyway right now instead of just using task->mems_allowed on each allocation. David -
So, you pass the subset, you don't set the flag to indicate you want interleaving over all available. You must be thinking of some other use for saving the subset mask that I'm not seeing here. Maybe restoring to the exact nodes requested if they're taken away and then re-added to the cpuset? Later, Lee -
Paul's motivation for saving the passed nodemask to set_mempolicy() is
so that the _intent_ of the application is never lost. That's the
biggest advantage that this method has and that I totally agree with.

So whenever the mems_allowed of a cpuset changes, the MPOL_INTERLEAVE
nodemask of all attached tasks becomes their intent
(pol->passed_nodemask) AND'd with the new mems_allowed. That can be done
on mpol_rebind_policy() and shouldn't be an extensive change.

So MPOL_INTERLEAVE, and possibly other, mempolicies will always try to
accommodate the intent of the application but only as far as the task's
cpuset restriction allows them.

David
-
Issue:
Are the nodes and nodemasks passed into set_mempolicy() to be
presumed relative to the cpuset or not? [Careful, this question
doesn't mean what you might think it means.]
Let's say our system has 100 nodes, numbered 0-99, and we have a task
in a cpuset that includes the twenty nodes 10-29 at the moment.
Currently, if that task does say an MPOL_PREFERRED on node 12, we take
that to mean the 3rd node of its cpuset. If we move that task to a
cpuset on nodes 40-59, the kernel will change that MPOL_PREFERRED to
node 42. Similarly for the other MPOL_* policies.
Ok so far ... seems reasonable. Node numbers passed into the
set_mempolicy call are taken to be absolute node numbers that are to
be mapped relative to the task's current cpuset, perhaps unbeknownst
to the calling task, and remapped if that cpuset changes.
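
The remapping described above can be sketched in user space as follows,
assuming node numbers below 64 so that a uint64_t can stand in for a
nodemask. All helper names are hypothetical. The position-preserving
rule (the n-th node of the old cpuset maps to the n-th node of the new
one) follows the example in this message; wrapping around when the new
cpuset is smaller is an assumption of this sketch, not something the
message states.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with bits lo..hi (inclusive) set; requires 0 <= lo <= hi <= 63. */
uint64_t sketch_node_range(int lo, int hi)
{
	uint64_t mask = 0;
	int i;

	assert(lo >= 0 && lo <= hi && hi <= 63);
	for (i = lo; i <= hi; i++)
		mask |= UINT64_C(1) << i;
	return mask;
}

/* Number of nodes in the mask. */
int sketch_weight(uint64_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Zero-based position of node within mask, or -1 if node is not in it. */
int sketch_ordinal(uint64_t mask, int node)
{
	if (node < 0 || node > 63 || !(mask & (UINT64_C(1) << node)))
		return -1;
	return sketch_weight(mask & ((UINT64_C(1) << node) - 1));
}

/* Node number of the ord-th node in mask, or -1 if there is none. */
int sketch_nth_node(uint64_t mask, int ord)
{
	int i;

	for (i = 0; i < 64; i++)
		if ((mask & (UINT64_C(1) << i)) && ord-- == 0)
			return i;
	return -1;
}

/*
 * Map node from its position in old_mems to the same position in
 * new_mems (modulo the size of new_mems).  Nodes outside old_mems, or
 * an empty new_mems, map to themselves.
 */
int sketch_remap_node(int node, uint64_t old_mems, uint64_t new_mems)
{
	int ord = sketch_ordinal(old_mems, node);
	int w = sketch_weight(new_mems);

	if (ord < 0 || w == 0)
		return node;
	return sketch_nth_node(new_mems, ord % w);
}
```

Remapping node 12 from a cpuset on nodes 10-29 to one on nodes 40-59
yields node 42 here, matching the MPOL_PREFERRED example above.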
But now imagine that a task happens to be in a cpuset of just two
nodes, and wants to request an MPOL_PREFERRED policy for the fourth
node of its cpuset, anytime there actually is a fourth node. That
task can't say that using numbering relative to its current cpuset,
because that cpuset only has two nodes. It could say it relative to
a mask of all possible nodes by asking for the fourth possible node,
likely numbered node 3.
If that task happened to be in a cpuset on nodes 10 and 11, asking
for the fourth node in the system (node 3) would still be rather
unambiguous, as node 3 can't be either of 10 or 11, so must be
relative to all possible nodes, meaning "the fourth available node,
if I'm ever fortunate enough to have that many nodes."
But if that task happened to be in a cpuset on nodes 2 and 3, then
the node number 3 could mean:
Choice A:
as it does today, the second node in the task's cpuset or it could
mean
Choice B:
the fourth node in the cpuset, if available, just as
it did in the case above involving a cpuset on nodes 10 and 11.
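
To make the two readings concrete, here is a small sketch, again
assuming fewer than 64 nodes and hypothetical helper names. It only
computes which zero-based position in the cpuset a passed node number
refers to under each choice. What should happen under Choice B while the
cpuset is too small is left open here, as it is in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Number of nodes in the mask. */
int sketch_weight(uint64_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Choice A: node is an absolute node number that must be in the cpuset;
 * the position it refers to is its rank within the cpuset.  Returns -1
 * if node is not in the cpuset (the thread notes that set_mempolicy()
 * currently complains about non-allowed nodes).
 */
int choice_a_position(uint64_t cpuset_mems, int node)
{
	if (node < 0 || node > 63 || !(cpuset_mems & (UINT64_C(1) << node)))
		return -1;
	return sketch_weight(cpuset_mems & ((UINT64_C(1) << node) - 1));
}

/*
 * Choice B: node is numbered relative to all possible nodes, so node N
 * always refers to position N of whatever cpuset the task is in.
 * Returns -1 while the cpuset has too few nodes for that position.
 */
int choice_b_position(uint64_t cpuset_mems, int node)
{
	if (node < 0 || node >= sketch_weight(cpuset_mems))
		return -1;
	return node;
}
```

For a cpuset on nodes 2 and 3 (mask 0xc), passing node 3 gives position
1 (the second node) under Choice A, while under Choice B it asks for
position 3 (the fourth node), which that cpuset does not yet have.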
Let me restate this.
Either way, passing in node 3 means node 3, as numbered ...

Yes. We should default to Choice B. Add an option MPOL_MF_RELATIVE to
enable that functionality? A new version of numactl can then enable
that by default for newer applications.
-
I'm confused. If B is the default, then we don't need a flag to
enable it, rather we need a flag to go back to the old choice A.
So are you saying that:
1) Choice A remains the default for the kernel unless
MPOL_MF_RELATIVE is added, or
2) that the new default for the kernel is Choice B,
unless MPOL_MF_RELATIVE is specified, asking to
revert to the original Choice A behaviour?
Perhaps, either way, whatever compatibility flag we have should be
something that can be forced on an application from the outside,
perhaps as a per-system mode flag in /sys, or a per-cpuset mode flag,
or a per-task operation, by what mechanism is not clear.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Don't we need it for numactl to preserve backward compatibility?
numactl can set that flag by default for newer software. We likely need
a new libnuma that can take care of that. But we need to have that flag
for numactl to be backward compatible.
-
I'm still confused, Christoph.
Are you saying:
1) The kernel continues to default to Choice A, unless
the flag enables Choice B, or
2) The kernel defaults to the new Choice B, unless the
flag reverts to the old Choice A?
Alternative (2) breaks libnuma and hence numactl until it is changed
to use the flag, or changed to use choice B (in which case it wouldn't
need the flag.)
So I guess you mean alternative (1) above, since you seem to be taking
the position that we can't break compatibility here.
But I could quote statements from you that seem to clearly state the
exact opposite.
So I remain confused.
Actually, alternative (1) is kinda ugly. It leaves a permanent wart
on the set_mempolicy API -- two different variants to what the node
numbers and node masks mean, depending on whether this MPOL_MF_RELATIVE
is set on each call. We'll have to ship out an extra serving of brain
food for most folks looking at this to have much chance that they will
confidently understand the difference between the two options selected
by this flag.
I wonder if there might be some way to avoid that permanent ugly wart
on each and every set/get mempolicy system call forever afterward.
Please try to double check your next reply, Christoph. I'm beginning
to worry that we might be failing to communicate clearly. Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
2) keeps everything in order. Let everything be as it is today unless

Tough. The API needs to remain stable. We can only change it through an
additional flag that enables the relativeness and the folding the way
you want it. libnuma may set the flag on its own without the user having
to do

Hmmm.. The alternative is to add new set/get mempolicy functions.
-
Good - that I understand. Your position is clear now.
You have chosen (1) above, which keeps Choice A as the default.
Before I leave this part, there is one more thing I kinda really need,
if you could, Christoph. Could you describe in your own words what you
think Choices A and B mean? We seem to be having trouble communicating,
and hence there is some risk right now that we don't mean the same thing
by this new "Choice B".
===
Other alternatives include a per-system, per-cpuset or per-process
flag, in addition to the per-system call flag you suggested earlier
(MPOL_MF_RELATIVE), or whatever you mean by "new set/get mempolicy
functions" ... could you elaborate on that one?
So ... the question becomes this:
How do we migrate to Choice B, without leaving both Choices
permanently supported, and an ugly mode flag selecting the
non-default Choice, while not breaking API's too abruptly?
Thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
There can be different defaults for the user space API via libnuma that
are independent from the kernel API which needs to remain stable. The
kernel

None of those sound appealing. Multiple processes may run in one cpuset.
Some of those may be linked to older libnumas and therefore depend on
old behavior.
-
Yes - the user level code can have different defaults too.
Well, that would justify keeping this choice per-task. I tend to
agree with that.
But that doesn't justify having to specify it on each system call.
In another reply David recommends against supporting Choice A at all.
I'm inclined to agree with him. I'll reply there, with more thoughts.
But if we did support Choice A, as a backwards compatible alternative
to Choice B, I'd suggest a per-task mode, not per-system call mode.
This would reduce the impact on the API of the ugly, unobvious, modal
flag needed to select the optional, non kernel default, Choice B
semantics.
I still have low confidence that you (Christoph) and I have the same
understanding of what these Choice A and B are. Hopefully you can
address that, perhaps by briefly describing these choices in your words.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
I think there's a mixup in the flag name there, but I actually would
recommend against any flag to effect Choice A. It's simply going to be
too complex to describe and is going to be a headache to code and
support.

The MPOL_PREFERRED behavior when constrained by cpusets was previously,
to my knowledge, undocumented; you're in the position to make the
behavior do what you want it to do and then release documentation so
we'll finally have a complete and unambiguous API for it. Right now it
should be considered undefined and thus you are free to implement it as
you choose. Then all callers of set_mempolicy(MPOL_PREFERRED) will
standardize on that and not have to worry about the machine's
mpol_preferred_relative_to_cpuset setting.

Then, any task that is attached to a cpuset and expecting the fourth
node in their set_mempolicy(MPOL_PREFERRED) call to mean system node 3
if it's in the cpuset's mems_allowed will be broken. If you want that,
you'll need your task to be attached to a cpuset with at least mems 0-3;
programmers will pick that up quickly enough if it's clearly documented.

I think Choice B is correct and makes more sense in terms of the
semantics and at least allows mempolicies and cpusets to play nicely
together without a bidirectional dependency on one another.

David
-
While I am sorely tempted to agree entirely with this, I suspect that
Christoph has a point when he cautions against breaking this kernel API.
Especially for users of the set/get mempolicy calls coming in via
libnuma, we have to be very careful not to break the current behaviour,
whether it is documented API or just an accident of the implementation.
There is a fairly deep and important stack of software, involving a
well known DBMS product whose name begins with 'O', sitting on that
libnuma software stack. Steering that solution stack is like steering
a giant oil tanker near shore. You take it slow and easy, and listen
closely to the advice of the ancient harbor master. The harbor masters
True, which is why I am hoping we can keep this modal flag, if such be,
from having to be used on every set/get mempolicy call. The ordinary
coder of new code using these calls directly should just see Choice B
behaviour. However the user of libnuma should continue to see whatever
API libnuma supports, with no change whatsoever, and various versions of
libnuma, including those already shipped years ago, must continue to
behave without any changes in node numbering.
There are two decent looking ways (and some ugly ways) that I can see
to accomplish this:
1) One could claim that no important use of Oracle over libnuma over
these memory policy calls is happening on a system using cpusets.
There would be a fair bit of circumstantial evidence for this
claim, but I don't know it for a fact, and would not be the
expert to determine this. On systems making no use of cpusets,
these two Choices A and B are identical, and this is a non-issue.
Those systems will see no API changes whatsoever from any of this.
2) We have a per-task mode flag selecting whether Choice A or B
node numbering apply to the masks passed in to set_mempolicy.
The kernel implementation is fairly easy. (Yeah, I know, I
too cringe every time I read that line ;)
If ...From a standpoint of the MPOL_PREFERRED memory policy itself, there is no documented behavior or standard that specifies its interaction with cpusets. Thus, it's "undefined." We are completely free to implement an undefined behavior as we choose and change it as Linux matures. Once it is defined, however, we carry the burden of protecting applications that are written on that definition. That's the point where we need to get it right and if we don't, we're stuck with it forever; I don't believe we're at that point with MPOL_PREFERRED policies under Ok, let's take a look at some specific unproprietary examples of tasks that use set_mempolicy(MPOL_PREFERRED) for a specific node, intending it to be the actual system node offset, that is then assigned to a cpuset that doesn't require that offset to be allowed. I think it's going to become pretty difficult to find an example because the whole scenario is pretty lame: you would need to already know which nodes you're going to be assigned to in the cpuset to ask for one of them as your preferred node. I don't imagine any application can have that type of foresight and, if it does, then we certainly shouldn't support the preferred node_remap() when it changes mems. You're trying to support a scheme, in Choice A, where an application knows it's going to be assigned to a range of nodes (for example, 1-3) and wants the preferred node to be included (for example, 2). So now the application must have control over both its memory policy and its cpuset placement. Then it must be willing to change its cpuset placement to a different set of nodes (with equal or greater cardinality) and have the preferred node offset respected. Why can't it simply then issue another set_mempolicy(MPOL_PREFERRED) call for the new preferred node? See? The problem is that you're trying to protect applications that know its initial cpuset mems [the only way it could ever send a set_mempolicy(MPOL_PREFERRED) for the right node ...
You state this point clearly, but I have to disagree.
The Linux documentation is not a legal contract. Anytime we change the
actual behaviour of the code, we have to ask ourselves what will be the
impact of that change on existing users and usages. The burden is on
us to minimize breaking things (by that I mean, what users would
consider breakage, even if we think it is all for the better and that
their code was the real problem.) I didn't say no breakage, but
minimum breakage, doing our best to guide users through changes with
minimum disruption to their work.
Linux is gaining market share rapidly because we co-operate with our
users to give us both the best chance of succeeding.
We don't just play gotcha games with the documentation -- ha ha --
we didn't document that detail, so it's your fault for ever depending
on it. And besides your code sucks. So there! Let's leave that game
If that were so, then yes much of your subsequent reasoning would follow.
The above is the hack that allows us to support existing libnuma based
applications (the most significant users of memory policy historically)
with a default of Choice A, while other code and future code defaults
That's not the only sort of application I'm trying to protect.
I'm trying to protect almost any application that uses both
set_mempolicy or mbind, while in a cpuset.
If a task is in a cpuset on say nodes 16-23, and it wants to issue
any mbind, or any MPOL_PREFERRED, MPOL_BIND, or MPOL_INTERLEAVE
mempolicy call, then under Choice A it must issue nodemasks offset
by 16, relative to what it would issue under Choice B.
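To make the "offset by 16" concrete, here is a minimal userspace sketch. It assumes a simplified model that is not kernel code: a nodemask is a single 64-bit word, and the names nodemask64, ord_to_node() and relative_to_system() are made up for illustration. It translates a cpuset-relative mask (Choice B numbering) into the system-wide mask (Choice A numbering) that names the same nodes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one bit per memory node, at most 64 nodes. */
typedef uint64_t nodemask64;

/* Number of set bits in mask. */
static int mask_weight(nodemask64 mask)
{
    int w = 0;

    for (; mask; mask &= mask - 1)
        w++;
    return w;
}

/* System node number of the ord-th (0-based) set bit in allowed, or -1. */
static int ord_to_node(nodemask64 allowed, int ord)
{
    for (int node = 0; node < 64; node++) {
        if ((allowed >> node) & 1) {
            if (ord == 0)
                return node;
            ord--;
        }
    }
    return -1;
}

/*
 * Translate a cpuset-relative mask into a system-wide mask.
 * Relative node i names the i-th node of the cpuset's allowed set.
 * This sketch does not fold: every relative node must exist in the cpuset.
 */
static nodemask64 relative_to_system(nodemask64 rel, nodemask64 allowed)
{
    nodemask64 sys = 0;

    assert(mask_weight(allowed) > 0);
    for (int ord = 0; ord < 64; ord++) {
        if ((rel >> ord) & 1) {
            int node = ord_to_node(allowed, ord);

            assert(node >= 0);
            sys |= (nodemask64)1 << node;
        }
    }
    return sys;
}
```

For a cpuset on nodes 16-23, relative nodes 0 and 2 come out as system nodes 16 and 18, which is the mask the same task would have to pass under Choice A.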
Almost any task using memory policies on a system making active use of
cpusets will be affected, even well written ones doing simple things.
I am more concerned that the above hack for libnuma isn't enough,
rather than it is unnecessary.
I think the above hack covers existing libnuma users rather well,
though I could be wrong even here, as I don't actually ...Nobody can show an example of an application that would be broken because
of this and, given the scenario and sequence of events that it requires to
be broken when implementing the default as Choice B, I don't think it's as
So all applications that use the libnuma interface and numactl will have
different default behavior than those that simply issue
{get,set}_mempolicy() calls. libnuma is a collection of higher level
functions that should be built upon {get,set}_mempolicy() like they
currently are and not introduce new subtleties like changing the semantics
of a preferred node argument. This is going to quickly become a
documentation nightmare and, in my opinion, isn't worth the time or effort
to support because we haven't even identified any real-world examples.
Maybe Andi Kleen should weigh in on this topic because, if we go with what
you're suggesting, we'll never get rid of the two differing behaviors and
we'll be introducing different semantics to arguments of libnuma functions
True, but the ordering of that scenario is troublesome. The correct way
to implement it is to use set_mempolicy() or a higher level libnuma
function with the same semantics and _then_ attach the task to a cpuset.
Then the nodes_remap() takes care of the rest.
The scenario you describe above has a problem because it requires the task
to have knowledge of the cpuset's mems in which it is attached when, for
portability, it should have been written so that it is robust to any range
No, because nodes_remap() takes care of the instances you describe above
when the task sets its memory policy (usually done when it is started) and
Supporting two different behaviors is going to be more problematic than
simply selecting one and going with it and its associated documentation in
Paul, the changes required to an application that is currently using
{get,set}_mempolicy() calls to setup the memory policy or the higher level
functions through libnuma is so easy to use Choice ...
Well, neither you nor I have shown an example. That's different than "nobody can." Since it would affect any task setting memory policies while in a cpuset holding less than all memory nodes, it seems potentially serious to me. Actually, I have one example. The libcpuset library would have some breakage with Choice B the only Choice. But I'm in a position to deal
Breaking the libnuma-Oracle solution stack is not an option. And, unless someone in the know tells us otherwise, I have to assume that this could break them. Now, the odds are that they simply don't run that solution stack on any system making active use of cpusets, so the odds are this would be no problem for them. But I don't
We could get rid of Choice A once libnuma and libcpuset have adapted to Choice B, and any other uses of Choice A that we've subsequently identified have had sufficient time to adapt. But dual support is pretty easy so far as the kernel code is concerned. It's just a few nodes_remap() calls optionally invoked at a few key spots in mm/mempolicy.c. Consequently there won't be a big hurry to
There is no "_then_ attach the task to a cpuset." On systems with kernels configured with CONFIG_CPUSETS=y, all tasks are in a cpuset all the time. Moreover, from a practical point of view, on large systems managed with cpuset based mechanisms, almost all tasks are in cpusets that do not include all nodes, for the entire life of the task. And besides, I can't break existing applications willy-nilly, and then claim it's their fault, because they should have been coded differently. So "correct way" arguments don't hold a lot of weight David ;) I make some effort to avoid forcing applications to be
I had to read that a couple of times to make sense of it. I take it that it means that the node numbering used in each cpuset's 'mems' file has to be system-wide. Yes, agreed.
(Well, actually, the node numbering of each cpuset's 'mems' file could be relative to its parent cpuset's 'mem' ...
If we can't identify any applications that would be broken by this, what's
the difference in simply implementing Choice B and then, if we hear
complaints, add your hack to revert back to Choice A behavior based on the
get_mempolicy() call you specified is always part of libnuma?
The problem that I see with immediately offering both choices is that we
don't know if anybody is actually reverting back to Choice A behavior
because libnuma, by default, would use it. That's going to make it very
painful to remove later because we've supported both options and have made
libnuma and {get,set}_mempolicy() arguments ambiguous. We should only
support both choices if they will both be used and there's no hard
You earlier insisted on an ease of documentation for the MPOL_INTERLEAVE
case and now this dual support that you're proposing is going to make the
documentation very difficult to understand for anyone who simply wants to
use mempolicies.
Others even in this thread have had a hard enough time understanding the
difference between the two choices and you explained them very thoroughly.
And that application would need to be implemented to know the nodes that
it has access to before it issues its set_mempolicy(MPOL_PREFERRED)
command anyway if it truly uses Choice A behavior. So unless these tasks
are looking in /proc/pid/status and parsing Mems_allowed and then
specifying one as its preferred node or always being guaranteed a certain
set of nodes that they are always attached to in a cpuset so they have
such foresight of what node to prefer, Choice A can't possibly be what
The need I was addressing with my initial patchset was that when a
cpuset is expanded, any MPOL_INTERLEAVE memory policy of attached tasks
automatically gets expanded as well. This discussion has somewhat diverged
from that, but I hope you still support what we earlier talked about in
terms of adding a field to struct mempolicy to remember the intended
You don't actually ...I'll probably reply to other parts of your message later, but this
one catches my eye right now.
"if we hear complaints, add your hack ... back" -- this doesn't seem
like a good idea to me. Maybe inside Google you don't see it, but
for those of us shipping computer systems using major distributions
such as SUSE or Red Hat, there can be a year lag between when I send a
feature patch to Andrew, and when my customers send their first
feedback to me resulting from using that new feature.
There are ways to expedite fixes for specific situations, of course,
but in general, this is rather like sending out a deep space probe.
You have to conservatively cover your options pre-launch, because
post-launch repairs are costly, slow and limited.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Let's add a Choice C: Any nodemask that is passed to set_mempolicy() is saved as the intent of the application in struct mempolicy. All policies are effected on a contextualized per-allocation basis. Policies such as MPOL_INTERLEAVE always get AND'd with pol->cpuset_mems_allowed. If that yields numa_no_nodes, MPOL_DEFAULT is used instead. Policies such as MPOL_PREFERRED are respected if the node is set in pol->cpuset_mems_allowed, otherwise MPOL_DEFAULT is used. If an application attempts to setup a memory policy for an MPOL_PREFERRED node that it doesn't have access to or an MPOL_INTERLEAVE nodemask that is empty when AND'd with pol->cpuset_mems_allowed, -EINVAL is returned and no new policy is effected. If an application gains nodes in pol->cpuset_mems_allowed that now include the nodes from MPOL_INTERLEAVE or MPOL_PREFERRED, that policy is then effected once again. Otherwise, MPOL_DEFAULT is still used. -
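A rough userspace sketch of the Choice C rules as described above may help. Everything here is an assumption made for illustration: the 64-bit nodemask model, the POL_* enum standing in for the MPOL_* modes, and the names policy_set() and policy_effective() are not from any patch. The stored intent is never rewritten; only the effective policy is recomputed against the cpuset's current mems_allowed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one bit per memory node, at most 64 nodes. */
typedef uint64_t nodemask64;

/* Stand-ins for MPOL_DEFAULT, MPOL_PREFERRED and MPOL_INTERLEAVE. */
enum policy_mode { POL_DEFAULT, POL_PREFERRED, POL_INTERLEAVE };

struct policy_sketch {
    enum policy_mode mode;  /* mode the application asked for */
    nodemask64 intent;      /* system nodes passed in, never rewritten */
};

struct effective_policy {
    enum policy_mode mode;  /* mode actually applied right now */
    nodemask64 nodes;       /* nodes actually used right now */
};

/*
 * Set-time check: refuse a policy that has no node in common with the
 * cpuset. Returns 0 on success, -1 standing in for -EINVAL.
 * A preferred node is modelled as a single-bit mask.
 */
static int policy_set(struct policy_sketch *pol, enum policy_mode mode,
                      nodemask64 nodes, nodemask64 mems_allowed)
{
    if (mode != POL_DEFAULT && (nodes & mems_allowed) == 0)
        return -1;
    pol->mode = mode;
    pol->intent = (mode == POL_DEFAULT) ? 0 : nodes;
    return 0;
}

/* Allocation-time view: intent AND'd with whatever the cpuset allows now. */
static struct effective_policy
policy_effective(const struct policy_sketch *pol, nodemask64 mems_allowed)
{
    struct effective_policy eff = { POL_DEFAULT, 0 };
    nodemask64 usable = pol->intent & mems_allowed;

    if (pol->mode != POL_DEFAULT && usable != 0) {
        eff.mode = pol->mode;
        eff.nodes = usable;
    }
    return eff;
}
```

With an interleave intent over nodes 0-3, shrinking the cpuset to nodes 0-1 interleaves over those two, moving it to nodes 4-5 falls back to the default policy, and restoring nodes 0-3 brings the original interleave back without another set_mempolicy() call.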
"contextualized" - I guess that means converted to cpuset
relative numbering - yes.
"per-allocation" - Most of the calculation of nodemasks and
Not issues with Folding.
With folding, an application that laid out an elaborate memory
policy configuration covering say 16 nodes can run in a 4 node
cpuset, where whatever would have been on node N gets folded down
to node N % 4.
With AND'ing, such an application would find 3/4's of its fancy
memory policy configuration replaced with MPOL_DEFAULT and -EINVAL
fallbacks.
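The arithmetic behind that comparison can be sketched as follows. This is only an illustration under simplifying assumptions: the cpuset is taken to be nodes 0 through cpuset_size - 1, and fold_node(), and_node() and count_lost() are hypothetical helper names.

```c
#include <assert.h>

/* Folding: a placement meant for node 'node' lands on node % cpuset_size. */
static int fold_node(int node, int cpuset_size)
{
    assert(node >= 0 && cpuset_size > 0);
    return node % cpuset_size;
}

/*
 * AND'ing: a placement survives only if its node is inside the cpuset,
 * assumed here to be nodes 0 .. cpuset_size - 1. Returns -1 when the
 * placement is lost and would fall back to the default policy.
 */
static int and_node(int node, int cpuset_size)
{
    assert(node >= 0 && cpuset_size > 0);
    return node < cpuset_size ? node : -1;
}

/* Count how many of layout_nodes placements are lost under AND'ing. */
static int count_lost(int layout_nodes, int cpuset_size)
{
    int lost = 0;

    for (int node = 0; node < layout_nodes; node++)
        if (and_node(node, cpuset_size) < 0)
            lost++;
    return lost;
}
```

For a 16 node layout in a 4 node cpuset, count_lost(16, 4) is 12, which is the 3/4 figure above, while folding keeps every placement (node 13 folds to node 1, for example).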
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Missing the point; this is an alternative to the previous choices; Choice C explicitly removes all remaps ("folding") from mempolicies. The nodemask passed to set_mempolicy() will always have exactly one meaning: the system nodes that the policy is intended for. Cpusets, which are built upon mempolicies, can obviously take access to some of those nodes away. That's why the existing mempolicies are AND'd with the cpuset's mems_allowed to represent the current nodemask that the mempolicy is effecting. If none of them are available because of cpusets, the mempolicy is invalidated and MPOL_DEFAULT is used. If access to some nodes from the mempolicy's nodemask become available once again, the policy is again effected. I'm arguing that remapping a policy's nodemask, although that is what currently is done, is troublesome because it can use a policy such as MPOL_PREFERRED to work on a node for which it was never intended. David -
Ok - that makes the meaning of Choice C clearer to me. Thank-you.
We've already got two Choices, one released and one in the oven. Is
there an actual, real world situation, motivating this third Choice?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Let's put Choice C into the lower oven, then. Of course there's actual and real world examples of this, because right now we're not meeting the full intent of the application. Cpusets deal with cpus and memory, they don't have anything to do with affinity to particular I/O devices; that part is left up to the creator of the cpuset to sort out correctly based on their system topology.
If my application does tons of I/O on one particular device to which my memory has access, I can use MPOL_PREFERRED to prefer the memory be allocated on a node with the best affinity to my device. If cpusets change my access to that node, I'm still using an MPOL_PREFERRED policy with a remapped node that no longer has affinity to that device because nodes_remap() doesn't take that into account. My preference would be to fall back to MPOL_DEFAULT behavior, since it's certainly plausible that other cpusets share the same node, instead of unnecessarily filling up a node that I don't even prefer anymore.
The same situation exists for MPOL_INTERLEAVE policies where my NUMA optimization is no longer helpful because I'm interleaving over a set of nodes that was simply remapped and their affinity (which isn't guaranteed to be uniform) wasn't even taken into account.
But, with Choice C, my intent is still preserved in the mempolicy even though it's not effected because my access rights to the node have changed. If I get access to that node back later, and I haven't issued subsequent set_mempolicy() calls to change my policy, my MPOL_PREFERRED or MPOL_INTERLEAVE policy is again effected and I then benefit from my NUMA optimization once again. David -
Please describe one, an actual one, not a hypothetical one, of which you
have personal knowledge.
There are many refinements we could add, an endless stream of them.
Each one adds a burden to those who didn't need it.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Choice B, as I'm coding it, has this property as well.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Yes, that's a problem. I would rather end up with both Choices forever, than breaking stuff because we changed how memory policy
No. We could only remove Choice A if we had hard evidence that it wouldn't break things, especially for the libnuma-Oracle stack.
Either way, we obviously have to decide this lacking sufficient hard evidence. Changing memory policy node numbering is just way too likely to break things, in ways that users initially find difficult to diagnose. We -can-not- inflict that on our users in a single, sudden, change. We must stage it, starting by
Yup - that's a problem. But it is one that users can control. If they just continue using memory policies and libnuma as before, it continues to work as before. If they need to deal with situations in which applications using memory policies are being moved around between larger and smaller cpusets, and they are willing and able to modify and improve the part of their code that handles memory policies, then they can read the new section of the documentation about this improved cpuset-relative node numbering, and give it a try.
Blindsiding users with a unilateral change like this will leave orphaned bits gasping in agony on the computer room floor. It can sometimes take months of elapsed time and hundreds of hours of various people's time across a dozen departments in three to five corporations to track down the root cause of such a problem, from the point of the initial failure, back to the desk of someone like you or me. And then it can take tens or hundreds more hours of human effort to deliver a fix. I refuse to knowingly go down that road.
People do this sort of stuff all the time; they just don't realize what all is going on beneath the surface of the various tools, libraries, scripts and magic incantations that they cobble together to meet their needs.
Choice A is meeting most of our needs. Not until you brought up this case of MPOL_INTERLEAVE across the nodes of a job being
Not ...
If your argument is that most applications are written to implement mempolicies without necessarily thinking too much about its cpuset placement or interactions with cpusets, then the requirement of remapping nodes when a cpuset changes for effected mempolicies isn't actually that important. In other words, my Choice C with AND'd behavior as opposed to remapping behavior could be introduced as a replacement for Choice A. Those applications that currently rely on the remapping are going to be broken anyway because they are unknowingly receiving different nodes than they intended, this is the objection to remapping that Lee agreed with. The remap doesn't take into account any notion of locality or affinity to physical controllers and seems to be merely a convenience of not invalidating the entire mempolicy in light of an ever-changing cpuset Yes, I know, and my Choice C does _not_ want that folding behavior; it wants the AND'd behavior because it fully respects the intent of the application with regard to the actual nodes that it specified in its memory policies. A node should only have one definition and policies that are effected on a set of nodes, or one node in the preferred case, should not change from beneath the application because it was not the intent of the implementation. Doing so is dangerous, regardless of whether or not it is currently the mempolicy behavior in HEAD. David -
Just because they didn't think about cpuset remapping when they coded
their mempolicy calls, doesn't mean they wouldn't be broken by changes
in how mempolicy numbers nodes. Often, it's the other way around:
the less they thought of it, the more likely changing it would break
No - I will not agree to changing the default mempolicy kernel API node
numbering at this time. Period. Full stop. We can add non-default
choices for now, and perhaps in the light of future experience, we
No, they may or may not be broken. That depends on whether or not they had
If you're running apps that have specific hardware affinity requirements,
then perhaps you shouldn't be moving them about in the first place ;).
And if they did have such needs, aren't they just as likely to be busted
by AND'ing off some of their nodes as they are by remapping those nodes?
I sure wish I knew what real world, actual, not hypothetical, situations
were motivating this.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Of course they have specific affinity needs, that's why they used mempolicies. Remapping those policies to a set of nodes that resembles the original mempolicy's nodemask in terms of construction but without regard for the affinity those nodes have with respect to system topology No, because you're interleaving over the set of actual nodes you wanted to interleave over in the first place and not some pseudo-random set that You're defending the current remap behavior in terms of semantics of mempolicies? My position, and Choice C's position, is that you either get the exact (or partially-constructed) policy that you asked for, or you get the MPOL_DEFAULT behavior. What you don't get, even though it's currently how we do it, is a completely different set of nodes that you never intended to have a specific policy over. David -
No. Good grief. If they are just looking for some set of memory
banks, not to other node-specific hardware, then they might not need
a specific node.
Consider for example a multi-threaded, compute bound, long running
scientific computation that has a substantial and fussy memory layout.
Remapping it from one cpuset to another having the same NUMA topology
may well work fine, once its memory caches recover. Reverting it to
the lowest common denominator MPOL_DEFAULT policy because (Choice C) it
no longer has access to its initial nodes might devastate its
performance.
I'm still wishing ...
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
If most apps use libnuma APIs instead of directly calling the sys calls, libnuma could query something as simple as an environment variable, or a new flag to get_mempolicy(), or the value of a file in its current cpuset--but I'd like to avoid a dependency on libcpuset--to determine
I'd certainly like to hear from Oracle what libnuma features they use
Yeah. This bothered me about policy remapping when I looked at it a while back. Worse, this behavior isn't documented as intended [or not]. I thought at the time that this could be solved by retaining the original argument nodemask, but 1) I was worried about the size when ~1K nodes are required to be supported and 2) it still doesn't solve the problem of ensuring the same locality characteristics w/o a lot of documentation about the implications of changing cpuset resources or moving tasks between cpusets in such a way to preserve the locality characteristics requested by the original mask.
Again, we stumble upon the notion of "intent". If the intent is just to spread allocations to share bandwidth, it probably doesn't matter. If, on the other hand, the original mask was carefully constructed, taking into consideration the distances between the memories specified and other resources [cpus in the cpuset, other memories in the cpuset, IO adapter connection points, ...], there is a lot more to consider than
In libnuma in numactl-1.0.2 that I recently grabbed off Andi's site, numa_available() indeed issues this call. But, I don't see any internal calls to numa_available() [the comments say all other calls are undefined when numa_available() returns an error] nor any other calls to get_mempolicy() with all null/0 args. So, you'd be depending on the application to call numa_available(). However, you could define an additional MPOL_F_* flag to get_mempolicy() that is issued in library init code to enable new behavior--again, based on some indication that
Only for apps that use the sys calls directly, right? This can ...
The patch I'm working on has a new set of options to get_mempolicy to set
and get the per-task kernel state indicating whether to use the old or
new semantics.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Yes - as I noted in an earlier reply, the kernel just provides the
mechanisms. It's up to user level code and people to decide whether
moving jobs around is a worthwhile activity in their situation.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Hmmm ... put your thinking hat on for my next comment ...
I could do one of two things in mm/mempolicy.c:
B1) continue accepting nodemasks across the set_mempolicy and mbind
system call APIs that are just like now (only nodes in the current
task's cpuset matter), but then remember what was passed in, so that
if the task's cpuset subsequently shrank down and then expanded
again back to its original size, they would end up with the same
memory policy placement they first had, or
B2) accept nodemasks as if relative to the entire system, regardless
of what cpuset they were in at the moment (all nodes in the system
matter and can be specified.)
If I did B1, then that's just a subtle change in the API, and what
you agreed to above holds.
If I did B2, then that's a serious change in the way that nodes
are numbered in the nodemasks passed into mbind and set_mempolicy,
from being only nodes that happen to be in the task's current cpuset,
to being nodes relative to all possible nodes on the system.
We need B2, I think. Otherwise, if a job happens to be running in
a shrunken cpuset, it can't request what memory policy placement
it wants should it end up in a larger cpuset later on. With B1, we
would continue to have the timing dependencies between when a task
is moved between different size cpusets, and when it happens to issue
mbind/set_mempolicy calls.
But B2 is an across the board change in how we number the nodes
passed into mbind and set_mempolicy. That is in no way an upward
compatible change.
I am strongly inclined toward B2, but it must be a non-default optional
mode, at least for a while, perhaps a long while.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Aha - good point. It happened to be the numactl command line utility
that I tested with that issued the get_mempolicy(0,0,0,0,0) call.
Yup - this proposed hack, to have the kernel revert to the original
memory policy nodemask numbering if it sees such a get_mempolicy call
is now officially dead meat.
Yes - I am intending to define such MPOL_F_* flags, to set and get
which behavior applies to the current task.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Thanks for describing the situation with MPOL_PREFERRED so thoroughly. I prefer Choice B because it does not force mempolicies to have any dependence on cpusets with regard to what nodemask is passed.
    [rientjes@xroads ~]$ man set_mempolicy | grep -i cpuset | wc -l
    0
It would be very good to store the passed nodemask to set_mempolicy in struct mempolicy, as you've already recommended for MPOL_INTERLEAVE, so that you can try to match the intent of the application as much as possible. But since cpusets are built on top of mempolicies, I don't think there's any reason why we should respect any nodemask in terms of the current cpuset context, whether it's preferred or interleave.
So if you were to pass a nodemask with only the fourth node set for an MPOL_PREFERRED mempolicy, the correct behavior would be to prefer the fourth node on the system or, if constrained by cpusets, the fourth node in the cpuset. If the cpuset has fewer than four nodes, the behavior should be undefined (probably implemented to just cycle the set of mems_allowed until you reach the fourth entry). That's the result of constraining a task to a cpuset that obviously wants access to more nodes -- it's a userspace mistake and abusing cpusets so that the task does not get what it expects.
That concept isn't actually new: we already restrict tasks to a certain amount of memory by writing to the mems file and just because it happens to have access to more memory when unconstrained by cpusets doesn't matter. You've placed it in a cpuset that wasn't prepared to deal with what the task was asking for. At least in the MPOL_PREFERRED case you describe above, it'll be dealt with much more pleasantly by at least giving it a preferred node as opposed to OOM killing it when a task has exhausted its available cpuset-constrained memory.
I'd prefer a solution where mempolicies can always be described and used without ever considering cpusets. Then, a sane implementation will configure ...
I do intend to implement it as you suggest. See the lib/bitmap.c
routines bitmap_remap() and bitmap_bitremap(), and the nodemask
wrappers for these, nodes_remap() and node_remap(). They will
define the cycling, or I sometimes call it folding.
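For readers who don't have lib/bitmap.c to hand, here is a small userspace model of the cycling, as I understand the remap rule: a bit that is the n-th set bit of the old mask moves to the (n mod w)-th set bit of the new mask, where w is the new mask's weight, and a bit not in the old mask is left where it is. This is a sketch on a single 64-bit word with made-up names (remap_sketch() and friends); the real bitmap_remap() works on arrays of unsigned long and should be treated as the authority.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one bit per memory node, at most 64 nodes. */
typedef uint64_t nodemask64;

/* Number of set bits in m. */
static int weight64(nodemask64 m)
{
    int w = 0;

    for (; m; m &= m - 1)
        w++;
    return w;
}

/* Ordinal (0-based) of bit 'pos' among the set bits of mask, or -1. */
static int pos_to_ord(nodemask64 mask, int pos)
{
    if (pos < 0 || pos >= 64 || !((mask >> pos) & 1))
        return -1;
    return weight64(mask & (((nodemask64)1 << pos) - 1));
}

/* Position of the ord-th (0-based) set bit of mask, or -1. */
static int ord_to_pos(nodemask64 mask, int ord)
{
    for (int pos = 0; pos < 64; pos++)
        if (((mask >> pos) & 1) && ord-- == 0)
            return pos;
    return -1;
}

/* Model of the remap: positions in old_mask map onto new_mask, cycling. */
static nodemask64 remap_sketch(nodemask64 src, nodemask64 old_mask,
                               nodemask64 new_mask)
{
    nodemask64 dst = 0;
    int w = weight64(new_mask);

    for (int pos = 0; pos < 64; pos++) {
        if (!((src >> pos) & 1))
            continue;
        int n = pos_to_ord(old_mask, pos);

        if (n < 0 || w == 0) {
            dst |= (nodemask64)1 << pos;        /* identity map */
        } else {
            int target = ord_to_pos(new_mask, n % w);

            assert(target >= 0);
            dst |= (nodemask64)1 << target;     /* folded position */
        }
    }
    return dst;
}
```

Taking nodes 0-7 as the reference numbering, "the fourth node" (bit 3) remaps to node 19 in a cpuset on nodes 16-23, and cycles around to node 5 in a two node cpuset on nodes 4-5.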
I would have tended to make this folding a defined part of the API,
though I will grant that the possibility of being lazy and forgetting
Nah - I wouldn't put it that way. It's no mistake or abuse. It's just
one more example of a kernel making too few resources look sufficient
by sharing, multiplexing and virtualizing them. That's what kernels do.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Maybe it's just me, but I think it's pretty presumptuous to think we can infer the intent of the application from the nodemask w/o additional flags such as Christoph proposed [cpuset relative]--especially for subsets of the cpuset. E.g., the application could intend the nodemask to specify memories within a certain distance of a physical resource, such as where a particular IO adapter or set thereof attach to the platform. And even when the intent is to preserve the cpuset relative positions of the nodes in the nodemask, this really only makes sense if the original and modified cpusets have the same physical topology w/rt multi-level NUMA interconnects. This is something that has bothered me about dynamic cpusets and current policy remapping. We don't do a good job of explaining the implications of changing cpuset topology on applications, nor do we handle it very well in the code. Paul addresses one of my concerns in a later message in this thread, so I'll comment there. Later, Lee -
Well, yes, we can't presume to know whether some application can move
or not.
But our kernel work is not presuming that.
It's providing mechanisms useful for moving apps.
The people using this decide what and when and if to move.
For example, the particular customers (HPC) I focus on for my job don't
move jobs because they don't want to take the transient performance
hit that would come from blowing out all their memory caches.
I'm guessing that David's situation involves something closer to what you
see with a shared web hosting service, running jobs that are very
independent of hardware particulars.
But in any case, we (the kernel) are just providing the mechanisms.
If they don't fit one's needs, don't use them ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
I'm with you on this last point! I was reacting to the notion that we can infer intent from a nodemask and that preserving the cpuset relative numbering after changing cpuset resources or moving tasks preserves that intent--especially if it involves locality and distance considerations. I can envision sets of such transformations on HP platforms where locality and distance would be preserved by preserving cpuset-relative numbering, and many where they would not. I expect you could do the same for SGI platforms. I'm not opposed to what you're trying to do, modulo complexity concerns. And I'm not saying that the complexity is not worth it to customers. But, given that we're just "providing the mechanism", I think we need to provide very good documentation on the implications of these mechanisms vis a vis whatever characteristics--locality, distance, bandwidth sharing, ...--the application intends when it installs a policy. Like you, no doubt, I'm eyeballs deep in a number of things. At some point, I'll take a cut at enumerating various "intents" that different types of applications might have when using mem policies and cpusets. Others can add to that, or may even beat me to it. We can then evaluate how well these scenarios are served by the current mechanisms and by whatever changes are proposed. I should note that I really like cpusets--i.e., find them useful--and I'm painfully aware of the awkward interactions with mempolicy. On the other hand, I don't want to sacrifice mem policy capabilities to shoehorn them into cpusets. In fact, I want to add additional mechanisms that may also be awkward in cpusets. As you say, "if they don't fit your needs, don't use them." Later, Lee -
The kernel is providing the mechanism to interleave over a set of nodes or prefer a single node for allocations, but it also provides for remapping those to different nodes, without regard to locality or affinity to specific hardware, when the cpuset changes. That's what Choice C is intended to replace: a node means a node so either you get an effected mempolicy over the nodemask you asked for, or MPOL_DEFAULT is used because you lack sufficient access. David -
Yes, one remaps nodes it can't provide, and the other removes
nodes it can't provide.
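To make the "removes nodes it can't provide" alternative concrete, here is a small userspace sketch. It is one reading of Choice C, not code from any patch: the names `restrict_policy()`, `struct effective_policy` and the two-value `enum policy_mode` are hypothetical, nodes are bits in a 64-bit mask, and the fallback to the default policy when nothing requested is allowed is an assumption drawn from David's description.

```c
#include <stdint.h>

/* Stand-ins for MPOL_DEFAULT and MPOL_INTERLEAVE in this sketch. */
enum policy_mode { POLICY_DEFAULT, POLICY_INTERLEAVE };

struct effective_policy {
	enum policy_mode mode;
	uint64_t nodes;		/* nodes the policy actually uses */
};

/*
 * A node means a node: keep only the requested nodes that the cpuset
 * allows, with no renumbering.  If none survive, fall back to the
 * default policy (assumption).
 */
static struct effective_policy restrict_policy(uint64_t requested,
					       uint64_t allowed)
{
	struct effective_policy p;

	p.nodes = requested & allowed;
	p.mode = p.nodes ? POLICY_INTERLEAVE : POLICY_DEFAULT;
	return p;
}
```

A task that asked for nodes 0-3 while confined to nodes 2-5 keeps interleaving over nodes 2-3 only; a task that asked for nodes 0-1 while confined to nodes 4-5 ends up with the default policy rather than being remapped.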
Yup - that's a logical difference. So ... I would think that
the only solution that would be satisfactory to apps that require
specific hardware nodes would be to simply not move them in the
first place. If you do that, then none of these Choices matter
in the slightest.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
I agree with your assessment of our current policy remapping with respect to the passed nodemask; I think it's troublesome. Whether we can change that now is another question, but the remap certainly doesn't help respect the intent of the application and the mempolicies they have set up when influenced by an outside entity such as cpusets. See my new Choice C alternative. David -
... guess that depends on the intent, doesn't it?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Well, it's an extension for sure, but a backward compatible one. It should not affect any correct existing application--i.e., one that checks its return status--except maybe the odd test program that needs to be updated to handle the new semantics. We're allowed to extend APIs as long as we don't break correct applications, right? I mean, it's not like it's a new argument or such. Lee -
That's what my "cpuset-independent interleave" patch does. David doesn't like the "null node mask" interface because it doesn't work with libnuma. I plan to fix that, but I'm chasing other issues. I should get back to the mempol work after today. What I like about the cpuset independent interleave is that the "policy remap" when cpusets are changed is a NO-OP--no need to change the policy. Just as "preferred local" policy chooses the node where the allocation occurs, my cpuset independent interleave patch interleaves across the set of nodes available at the time of the allocation. The application has to specifically ask for this behavior by the null/empty nodemask or the TBD libnuma API. IMO, this is the only reasonable interleave policy for apps running in dynamic cpusets. An aside: if David et al [at google] are using cpusets on fake numa for resource management [I don't know this is the case, but saw some discussions way back that indicate it might be?], then maybe this becomes less of an issue when control groups [a.k.a. containers] and memory resource controls come to fruition? Lee -
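The "remap is a NO-OP" property can be illustrated with a userspace sketch. This is not the patch itself: `next_allowed_node()` is a hypothetical name, nodes are bits in a 64-bit mask, and the previous node stands in for the task's interleave cursor. Because the allowed set is consulted at allocation time, the policy stores no nodemask that would need rewriting when the cpuset changes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the first node in 'allowed' strictly after
 * 'prev', wrapping around.  Pass prev == -1 for the first allocation.
 * 'allowed' is whatever the cpuset permits at the time of the call.
 */
static int next_allowed_node(int prev, uint64_t allowed)
{
	int i;

	assert(allowed != 0 && prev >= -1 && prev < 64);
	for (i = 1; i <= 64; i++) {
		int node = (prev + i) % 64;

		if (allowed & (UINT64_C(1) << node))
			return node;
	}
	return -1;	/* not reached when allowed != 0 */
}
```

With nodes 1 and 3 allowed, successive calls return 1, 3, 1, and so on. If the cpuset is then changed to nodes 0 and 2, the very next call returns 0 without any policy update having taken place.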
But this makes it cpuset dependent. The set of nodes is dependent on the cpuset. If it would be independent then interleave could allow any nodes
Yes very likely. -
Hacking and requiring an updated version of libnuma to allow empty nodemasks to be passed is a poor solution; if mempolicies are supposed to be independent from cpusets, then what semantics does an empty nodemask actually imply when using MPOL_INTERLEAVE? To me, it means the entire set_mempolicy() should be a no-op, and that's exactly how mainline currently treats it _as_well_ as libnuma. So justifying this change in the man page is respectable, but passing an empty nodemask just doesn't
Passing empty nodemasks with MPOL_INTERLEAVE to set_mempolicy() is the only reasonable way of specifying you want, at all times, to interleave over all available nodes? I doubt it. I personally prefer an approach where cpusets take the responsibility for determining how policies change (they use set_mempolicy() anyway to effect their mems boundaries) because it's cpusets that have changed the available nodemask out from beneath the application. So instead of trying to create a solution where cpusets impact mempolicies and mempolicies impact cpusets, it should only be in a single direction. Cpusets change the set of available nodes and should update the attached tasks' mempolicies at the same time. That's the same as saying that cpusets should be built on top of mempolicies, which they are, and shouldn't have any reverse
Completely irrelevant; I care about the interaction between cpusets and mempolicies in mainline Linux. David -
Agreed.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Another reason that passing an empty nodemask to set_mempolicy() doesn't make sense is that libnuma uses numa_set_interleave_mask(&numa_no_nodes) to disable interleaving completely. David -
David: as we discussed when you contacted me off-list about this, the libnuma API and the system call interface are two quite different APIs. For example, numa_set_interleave_mask(&numa_no_nodes) does not pass MPOL_INTERLEAVE with an empty mask to set_mempolicy(). Rather it "installs" an MPOL_DEFAULT policy which internally just deletes the task's mempolicy, allowing fallback to system default policy. I would not propose to change this behavior, nor break libnuma in any way. For others, who weren't involved in the off-list exchange, here's an excerpt from my response to David: [ At the libnuma level, I think we need an explicit "numa_set_interleave_allowed()"--analogous to "numa_set_localalloc()". The current "numa_alloc_interleaved()" should, I think, allocate on all *allowed* nodes, rather than all nodes. It can do this using the sys call interface as defined. Independent of cpuset-independent interleave, an application needs to pass a valid subset of the current mems allowed to "numa_alloc_interleaved_subset()". An application can now obtain the mems_allowed using the MPOL_F_MEMS_ALLOWED flag that I added, but we need a libnuma wrapper for this as well. [Yeah, this info can change at any time, but that's always been the case....] "numa_interleave_memory()" is essentially mbind(), I think [not looking at the libnuma source code at this moment]. Maybe provide "numa_interleave_memory_allowed(void *mem, size_t size)" ??? Finally, I think we need to add a query function: "nodemask_t numa_get_mems_allowed()" to return the mask of valid nodes in the current context [cpuset]. This would just be a wrapper around get_mempolicy() with the MPOL_F_MEMS_ALLOWED flag. ] Couple of comments on the above: 1. "the sys call interface as defined" in the 2nd paragraph of the excerpt refers to my patch that uses null/empty nodemask to indicate "all allowed". 2. As this thread progresses, you've discussed relaxing the requirement that applications pass a valid ...
cpuset support in libnuma/numactl is still incomplete. I'm also not sure what the best way to handle this is. Probably there should be a switch for both. -Andi -
The more I have stared at this, the more certain I've become that we
need to make the mbind/mempolicy calls modal -- the default mode
continues to interpret node numbers and masks just as these calls do
now, and the alternative mode provides the so called "Choice B",
which takes node numbers and masks as if the task owned the entire
system, and then the kernel internally and automatically scrunches
those masks down to whatever happens to be the current cpuset of
the task.
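As a rough picture of what "scrunching" a whole mask could look like, here is a userspace sketch. It is an assumption-laden illustration, not the kernel's bitmap_remap(): `scrunch_mask()` is a hypothetical name, masks are limited to 64 nodes, and each requested node n is folded onto the (n mod weight)-th node of the current cpuset.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: interpret 'requested' as if the task owned the
 * whole system, then fold every requested node onto the nodes actually
 * present in 'allowed'.
 */
static uint64_t scrunch_mask(uint64_t requested, uint64_t allowed)
{
	int slots[64];		/* slots[i] = i-th allowed node */
	int w = 0, node;
	uint64_t result = 0;

	for (node = 0; node < 64; node++)
		if (allowed & (UINT64_C(1) << node))
			slots[w++] = node;
	assert(w > 0);		/* a cpuset needs at least one node */

	for (node = 0; node < 64; node++)
		if (requested & (UINT64_C(1) << node))
			result |= UINT64_C(1) << slots[node % w];
	return result;
}
```

A task asking for eight nodes (0xFF) inside a cpuset holding nodes 4-7 ends up with 0xF0; asking for nodes 0-1 in the same cpuset gives nodes 4-5 (0x30).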
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
So the user space asks for 8 nodes because it knows the machine has that many from /sys and it only gets 4 if a cpuset says so? That's just bad semantics. And is not likely to make the user programs happy. I don't think you'll get around to teaching user space (or rather libnuma) about cpusets and let it handle it. From the libnuma perspective the machine size would be essentially current cpuset size. On the syscall level I don't think it makes much sense to change though. The alternative would be to throw out the complete cpuset concept and go for virtual nodes inside containers with virtualized /sys. -Andi -
That's no different than what can happen today -- if a task actually
is in an 8 node cpuset, sets up its mempolicies accordingly, and then
gets shoe horned into a 4 node cpuset.
It's not good or bad; it's just interactions between two mechanisms.
If your app doesn't run well in a small cpuset, don't run it there
(or do run it there, poorly ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Andi, Christoph, or whomever:
Are there any good regression tests of mempolicy functionality?
This patch I'm coding is delicate enough that I probably broke
something. It would be nice to catch it sooner rather than later.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
Paul: Andi has a regression test in the numactl source package. Try: http://freshmeat.net/redir/numactl/62210/url_tgz/numactl-1.0.2.tar.gz Lee -
Good - thanks.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
-
numactl has some basic tests (make test). I think newer LTP also has some but I haven't looked at them. And there is Lee's memtoy which does some things; but I don't think it's very automated. -Andi -
I'm probably going to be ok with this ... after a bit.
1) First concern - my primary issue:
One thing I really want to change, the name of the per-cpuset file
that controls this option. You call it "interleave_over_allowed".
Take a look at the existing per-cpuset file names:
$ grep 'name = "' kernel/cpuset.c
.name = "cpuset",
.name = "cpus",
.name = "mems",
.name = "cpu_exclusive",
.name = "mem_exclusive",
.name = "sched_load_balance",
.name = "memory_migrate",
.name = "memory_pressure_enabled",
.name = "memory_pressure",
.name = "memory_spread_page",
.name = "memory_spread_slab",
.name = "cpuset",
The name of every memory related option starts with "mem" or "memory",
and the name of every memory interleave related option starts with
"memory_spread_*".
Can we call this "memory_spread_user" instead, or something else
matching "memory_spread_*" ?
The names of things in the public API's are a big issue of mine.
2) Second concern - lesser code clarity issue:
The logic surrounding current_cpuset_interleaved_mems() seems a tad
opaque to me. It appears on the surface as if the memory policy code,
in mm/mempolicy.c, is getting a nodemask from the cpuset code by
calling this routine, as if there were an independent per-cpuset
nodemask stating over what nodes to interleave for MPOL_INTERLEAVE.
But all that is returned is either (1) an empty node mask or (2) the
current task's allowed node mask. If an empty mask is returned, this
tells the MPOL_INTERLEAVE code to use the mask the user specified in
an earlier set_mempolicy MPOL_INTERLEAVE call. If a non-empty mask
is returned, then the previous user specified mask is ignored and
that non-empty mask (just all the current cpuset's allowed nodes) is
used instead.
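The selection logic as described in the preceding paragraph can be modeled in a few lines of userspace C. This is a sketch, not the patch: `struct cpuset_model`, `cpuset_interleaved_mems()` and `interleave_mask()` are hypothetical names, and tying the empty return value to the per-cpuset option being off is an assumption based on the patch description.

```c
#include <stdint.h>

/* Minimal stand-in for the cpuset state relevant here. */
struct cpuset_model {
	int option_set;		/* the per-cpuset interleave option */
	uint64_t mems_allowed;	/* nodes the cpuset allows */
};

/* Models the routine under review: empty mask, or all allowed nodes. */
static uint64_t cpuset_interleaved_mems(const struct cpuset_model *cs)
{
	return cs->option_set ? cs->mems_allowed : 0;
}

/*
 * Models the mempolicy side: an empty answer means "keep the mask the
 * user passed earlier"; a non-empty answer overrides it.
 */
static uint64_t interleave_mask(const struct cpuset_model *cs,
				uint64_t user_mask)
{
	uint64_t mems = cpuset_interleaved_mems(cs);

	return mems ? mems : user_mask;
}
```

Written this way, the empty mask is visibly a sentinel for "option off" rather than a real nodemask, which is the source of the opacity noted above.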
Restating this in pseudo code, from your patch, the mempolicy.c
MPOL_INTERLEAVE code to rebind ...
Sounds better. I was hoping somebody was going to come forward with an
That sounds reasonable, it will simply be a wrapper around
For setting current->il_next, both cases work but yours will be better balanced for the next interleaved allocation. I'll apply it to my patchset. Thanks for the review. David -
