Re: [patch 2/2] cpusets: add interleave_over_allowed option

To: Paul Jackson <pj@...>
Cc: <clameter@...>, <Lee.Schermerhorn@...>, <akpm@...>, <ak@...>, <linux-kernel@...>
Date: Sunday, October 28, 2007 - 2:19 pm

On Sat, 27 Oct 2007, Paul Jackson wrote:


From a standpoint of the MPOL_PREFERRED memory policy itself, there is no 
documented behavior or standard that specifies its interaction with 
cpusets.  Thus, it's "undefined."  We are completely free to implement an 
undefined behavior as we choose and change it as Linux matures.

Once it is defined, however, we carry the burden of protecting 
applications that are written on that definition.  That's the point where 
we need to get it right and if we don't, we're stuck with it forever; I 
don't believe we're at that point with MPOL_PREFERRED policies under 
cpusets right now.


Ok, let's take a look at some specific non-proprietary examples of tasks 
that use set_mempolicy(MPOL_PREFERRED) for a specific node, intending it 
to be the actual system node offset, and that are then assigned to a 
cpuset that doesn't require that offset to be allowed.

I think it's going to become pretty difficult to find an example because 
the whole scenario is pretty lame: you would need to already know which 
nodes you're going to be assigned to in the cpuset to ask for one of them 
as your preferred node.  I don't imagine any application can have that 
type of foresight and, if it does, then we certainly shouldn't support the 
preferred node_remap() when it changes mems.

You're trying to support a scheme, in Choice A, where an application knows 
it's going to be assigned to a range of nodes (for example, 1-3) and wants 
the preferred node to be included (for example, 2).  So now the 
application must have control over both its memory policy and its cpuset 
placement.  Then it must be willing to change its cpuset placement to a 
different set of nodes (with equal or greater cardinality) and have the 
preferred node offset respected.  Why can't it simply then issue another 
set_mempolicy(MPOL_PREFERRED) call for the new preferred node?
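
To make that scheme concrete, here is a minimal sketch in C of what 
preserving the preferred node's offset across a mems change amounts to.  
It is a user-space model, not the kernel code: remap_preferred() and the 
mask helpers are hypothetical names, and a single unsigned long stands in 
for a nodemask (bit n set means node n is allowed).  Wrapping the ordinal 
modulo the new cardinality is an assumption for the case where the new 
set is smaller.

```c
#include <assert.h>
#include <limits.h>

#define MASK_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Number of set bits in a one-word node mask (bit n = node n). */
static int mask_weight(unsigned long mask)
{
	int w = 0;

	for (; mask; mask &= mask - 1)
		w++;
	return w;
}

/* Node number of the ord-th set bit (0-based), or -1 if there is none. */
static int mask_ord_to_node(unsigned long mask, int ord)
{
	int node;

	for (node = 0; node < MASK_BITS; node++) {
		if (!(mask & (1UL << node)))
			continue;
		if (ord-- == 0)
			return node;
	}
	return -1;
}

/* Ordinal of node among the set bits of mask, or -1 if node is not set. */
static int mask_node_to_ord(unsigned long mask, int node)
{
	if (node < 0 || node >= MASK_BITS || !(mask & (1UL << node)))
		return -1;
	return mask_weight(mask & ((1UL << node) - 1));
}

/*
 * Hypothetical helper: carry a preferred node from old_mems to new_mems
 * by keeping its ordinal position.  The node is returned unchanged when
 * it was not in old_mems or when new_mems is empty.
 */
static int remap_preferred(int node, unsigned long old_mems,
			   unsigned long new_mems)
{
	int ord = mask_node_to_ord(old_mems, node);
	int w = mask_weight(new_mems);
	int mapped;

	if (ord < 0 || w == 0)
		return node;
	mapped = mask_ord_to_node(new_mems, ord % w);
	assert(mapped >= 0);	/* ord % w is always a valid ordinal */
	return mapped;
}
```

With mems 1-3 (mask 0xE) and preferred node 2, moving to mems 4-7 (mask 
0xF0) gives remap_preferred(2, 0xE, 0xF0) == 5, the second node of the 
new set.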

See?  The problem is that you're trying to protect an application that 
knows its initial cpuset mems [the only way it could ever send a 
set_mempolicy(MPOL_PREFERRED) for the right node in that range in the 
first place] but then seemingly loses control over its cpuset and intends 
for the kernel to fix it up without the burden of issuing another 
set_mempolicy() call.

And you're trying to protect an application that based its 
implementation not on a standard or documentation, but on observed 
behavior.  My bet is that it's going to issue that subsequent 
set_mempolicy(), at least if libnuma returned a numa_preferred() value 
that it wasn't expecting.


I don't see how you can accomplish that.  If the default behavior is 
Choice B, which is different from what is currently implemented in the 
kernel, you're going to either require a modification to the application 
to set a flag asking for Choice A again or make the default kernel 
behavior that of Choice A and set a flag implicitly via libnuma when 
future versions are released.

In the former case, just ask the application to adjust its node numbering 
scheme or check the result of numa_preferred().  In the latter case, we're 
not even talking about changing the kernel default anymore to Choice B.


If you add this per-task mode flag to default to Choice A for preferred 
memory policies, it'll be extremely confusing to document and support.  If 
it's already decided that we should default to Choice B, it's going to 
require an update to the application to write to /proc/pid/i_want_choice_A 
or use the new set_mempolicy() option anyway, so instead of adding that 
hack you should simply fix your node numbering.

And I suspect that if that per-task mode flag is added, it will eventually 
be the subject of a thread with the subject "is this highly specialized 
flag even used anymore?" at which point it will be marked deprecated and 
eventually obsoleted.


Yeah, remapping the nodemask is a bad way to get a preferred node anyway.  
Preferred nodes inherently deal with offsets from node 0.


That still requires a change to the application.  So they should simply 
rethink their node numbering instead and fix their application to follow a 
behavior that will, at that point, be documented.

Any application that doesn't check the return value of 
set_mempolicy(MPOL_PREFERRED) isn't worth supporting anyway.

There are two cases to think about:

 - When the cpuset assignment changes from the root cpuset to a
   user-created cpuset with a subset of system mems and then
   set_mempolicy() is called, and

 - When set_mempolicy() is called and then the cpuset mems change either
   because it was attached to a different cpuset or someone wrote to its
   'mems' file.

In the first case, the new API should return -EINVAL if you ask for a 
preferred node offset that is not smaller than the cardinality of your 
mems_allowed.  That will catch some of these applications that may have 
actually been implemented based on the current undocumented behavior.

In the second case, the first node in the nodemask passed to 
set_mempolicy() was a system node offset anyway and had nothing to do with 
cpusets (it was a member of the root cpuset with access to all mems) so it 
already behaves as Choice B.
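
The -EINVAL check for the first case could be sketched roughly as 
follows.  This is a user-space model and not a proposed patch: 
preferred_offset_to_node() is a hypothetical name, a single unsigned long 
stands in for mems_allowed (bit n set means node n is allowed), and it 
assumes the intent is to reject any offset that does not fit within the 
cardinality of mems_allowed.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

#define MASK_BITS ((int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * Hypothetical check: treat offset as an index into the task's
 * mems_allowed and return the system node it selects, or -EINVAL when
 * the offset does not fit within the number of allowed nodes.
 */
static int preferred_offset_to_node(int offset, unsigned long mems_allowed)
{
	int node;

	if (offset < 0)
		return -EINVAL;
	for (node = 0; node < MASK_BITS; node++) {
		if (!(mems_allowed & (1UL << node)))
			continue;
		if (offset-- == 0)
			return node;	/* offset-th allowed node */
	}
	return -EINVAL;	/* offset >= cardinality of mems_allowed */
}
```

With mems_allowed covering nodes 4-6 (mask 0x70), offset 0 selects node 
4, while an application that still asks for "5" as if it were a system 
node number gets -EINVAL instead of a silently different node.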


I think any application that gets constrained to a subset of nodes in its 
mems_allowed and then bases its preferred node number off that subset to 
create an offset that is intended to be preserved over subsequent mems 
changes without rechecking the result with numa_preferred() or issuing a 
subsequent set_mempolicy() is poorly written.  Especially since that 
behavior was undocumented.

		David