The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND,
and MPOL_INTERLEAVE, are better declared as part of an enum since they
are sequentially numbered and cannot be combined.
The policy member of struct mempolicy is also converted from type short
to type unsigned short. A negative policy does not have any legitimate
meaning, so it is possible to change its type in preparation for adding
optional mode flags later.
The equivalent member of struct shmem_sb_info is also changed from int
to unsigned short.
For compatibility, the policy formal to get_mempolicy() remains as a
pointer to an int:
int get_mempolicy(int *policy, unsigned long *nmask,
unsigned long maxnode, unsigned long addr,
unsigned long flags);
although the only possible values are in the range of type unsigned short.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/mempolicy.h | 19 ++++++++++---------
include/linux/shmem_fs.h | 2 +-
mm/mempolicy.c | 29 +++++++++++++++++------------
mm/shmem.c | 9 +++++----
4 files changed, 33 insertions(+), 26 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -9,12 +9,13 @@
*/
/* Policies */
-#define MPOL_DEFAULT 0
-#define MPOL_PREFERRED 1
-#define MPOL_BIND 2
-#define MPOL_INTERLEAVE 3
-
-#define MPOL_MAX MPOL_INTERLEAVE
+enum {
+ MPOL_DEFAULT,
+ MPOL_PREFERRED,
+ MPOL_BIND,
+ MPOL_INTERLEAVE,
+ MPOL_MAX, /* always last member of enum */
+};
/* Flags for get_mem_policy */
#define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
@@ -62,7 +63,7 @@ struct mm_struct;
*/
struct mempolicy {
atomic_t refcnt;
- short policy; /* See MPOL_* above */
+ unsigned short policy; /* See MPOL_* above */
...With the evolution of mempolicies, it is necessary to support mempolicy
mode flags that specify how the policy shall behave in certain
circumstances. The most immediate need for mode flag support is to
suppress remapping the nodemask of a policy at the time of rebind.
Both the mempolicy mode and flags are passed by the user in the 'int
policy' formal of either the set_mempolicy() or mbind() syscall. A new
constant, MPOL_MODE_FLAGS, represents the union of legal optional flags
that may be passed as part of this int. Mempolicies that include illegal
flags as part of their policy are rejected as invalid.
An additional member to struct mempolicy is added to support the mode
flags:
struct mempolicy {
...
unsigned short policy;
unsigned short flags;
}
The splitting of the 'int' actual passed by the user is done in
sys_set_mempolicy() and sys_mbind() for their respective syscalls. This
is done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the
syscall if there are additional flags, and storing the result in the new
'flags' member of struct mempolicy. The intersection of the actual with
~MPOL_MODE_FLAGS is stored in the 'policy' member of the struct and all
current users of pol->policy remain unchanged.
The union of the policy mode and optional mode flags is passed back to
the user in get_mempolicy().
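As a rough user-space illustration of the split and recombination
described above (not the code in sys_set_mempolicy() or sys_mbind()),
the sketch below uses hypothetical EX_-prefixed names. The single flag
value is an assumption made for the example, and the validity check on
the mode is one plausible way to reject unknown bits.

```c
#include <errno.h>

/* Illustrative stand-ins; names and values are assumptions for this sketch. */
enum {
	EX_MPOL_DEFAULT,
	EX_MPOL_PREFERRED,
	EX_MPOL_BIND,
	EX_MPOL_INTERLEAVE,
	EX_MPOL_MAX,		/* always last member of enum */
};
#define EX_F_STATIC_NODES	(1 << 15)
#define EX_MODE_FLAGS		(EX_F_STATIC_NODES)

struct ex_policy {
	unsigned short policy;	/* mode only */
	unsigned short flags;	/* optional mode flags only */
};

/*
 * Split the user's int into mode and flags. Any bit that is neither a
 * legal flag nor part of a known mode makes the request invalid.
 */
static int ex_split_policy(int user, struct ex_policy *out)
{
	unsigned int flags = (unsigned int)user & EX_MODE_FLAGS;
	unsigned int mode = (unsigned int)user & ~(unsigned int)EX_MODE_FLAGS;

	if (mode >= EX_MPOL_MAX)
		return -EINVAL;
	out->policy = (unsigned short)mode;
	out->flags = (unsigned short)flags;
	return 0;
}

/* Recombine mode and flags into the single int reported back to the user. */
static int ex_join_policy(const struct ex_policy *p)
{
	return p->policy | p->flags;
}
```

Keeping the mode alone in 'policy' is what lets existing comparisons
against MPOL_* constants inside the kernel stay as they are.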
This combination of mode and flags within the same actual does not break
userspace code that relies on get_mempolicy(&policy, ...) and either
switch (policy) {
case MPOL_BIND:
...
case MPOL_INTERLEAVE:
...
};
statements or
if (policy == MPOL_INTERLEAVE) {
...
}
statements. Such applications would need to use optional mode flags when
calling set_mempolicy() or mbind() for these previously implemented
statements to stop working. If an application does start using optional
mode flags, it will need to mask the optional flags off the policy in
switch and conditional statements that only test mode.
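A minimal sketch of the masking an application might do before testing
the mode returned by get_mempolicy(). The EX_-prefixed names and the
flag value are assumptions for illustration only; a real program would
use the definitions exported by the kernel headers.

```c
/* Assumed values for this sketch only. */
enum {
	EX_MPOL_DEFAULT,
	EX_MPOL_PREFERRED,
	EX_MPOL_BIND,
	EX_MPOL_INTERLEAVE,
};
#define EX_MODE_FLAGS	(1 << 15)

/* Return a printable name for the mode, ignoring any optional mode flags. */
static const char *ex_mode_name(int policy)
{
	switch (policy & ~EX_MODE_FLAGS) {
	case EX_MPOL_DEFAULT:
		return "default";
	case EX_MPOL_PREFERRED:
		return "preferred";
	case EX_MPOL_BIND:
		return "bind";
	case EX_MPOL_INTERLEAVE:
		return "interleave";
	default:
		return "unknown";
	}
}
```

Without the mask, a policy carrying a mode flag would fall through to
the default case even though its mode is a known one.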
An additional member is also ...Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses
the node remap when the policy is rebound.
Adds another member to struct mempolicy, nodemask_t user_nodemask, as
part of a union with cpuset_mems_allowed:
struct mempolicy {
...
union {
nodemask_t cpuset_mems_allowed;
nodemask_t user_nodemask;
} w;
}
that stores the nodemask that the user passed when he or she created
the mempolicy via set_mempolicy() or mbind(). When using
MPOL_F_STATIC_NODES, which is passed with any mempolicy mode, the user's
passed nodemask intersected with the VMA or task's allowed nodes is always
used when determining the preferred node, setting the MPOL_BIND zonelist,
or creating the interleave nodemask. This happens whenever the policy is
rebound, including when a task's cpuset assignment changes or the cpuset's
mems are changed.
This creates an interesting side-effect in that it allows the mempolicy
"intent" to lie dormant and without effect until it has access to the node(s)
that it desires. For example, if you currently ask for an interleaved
policy over a set of nodes that you do not have access to, the mempolicy
is not created and the task continues to use the previous policy. With
this change, however, it is possible to create the same mempolicy; it is
only effected when access to nodes in the nodemask is acquired.
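The static behavior can be pictured with a small user-space sketch in
which a nodemask is a 64-bit word. The type and helper names here are
hypothetical and this is not the kernel's nodemask_t code; it only
shows the intersection of the saved user nodemask with the currently
allowed nodes.

```c
#include <stdint.h>

typedef uint64_t ex_nodemask;	/* bit n set means node n is in the mask */

/* Nodes the policy may use right now: the user's mask limited to the allowed nodes. */
static ex_nodemask ex_static_effective(ex_nodemask user, ex_nodemask allowed)
{
	return user & allowed;
}

/* Build a mask covering nodes lo..hi inclusive (0 <= lo <= hi <= 63). */
static ex_nodemask ex_range(unsigned lo, unsigned hi)
{
	ex_nodemask upto_hi = (hi == 63) ? ~(ex_nodemask)0
					 : (((ex_nodemask)1 << (hi + 1)) - 1);
	ex_nodemask below_lo = ((ex_nodemask)1 << lo) - 1;

	return upto_hi & ~below_lo;
}
```

With a user nodemask of 1-3 and allowed nodes 4-7 the intersection is
empty, which corresponds to the dormant case; once the allowed nodes
become 2-5, the same saved nodemask yields nodes 2-3.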
It is also possible to mount tmpfs with the static nodemask behavior when
specifying a node or nodemask. To do this, simply add "=static"
immediately following the mempolicy mode at mount time:
mount -o remount mpol=interleave=static:1-3
Also removes mpol_check_policy() and folds its logic into mpol_new() since
it is now obsoleted. The unused vma_mpol_equal() is also removed.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/mempolicy.h | 11 ++-
...From: Paul Jackson <pj@sgi.com>
The following adds two more bitmap operators, bitmap_onto() and
bitmap_fold(), with the usual cpumask and nodemask wrappers.
The bitmap_onto() operator computes one bitmap relative to
another. If the n-th bit in the origin mask is set, then the
m-th bit of the destination mask will be set, where m is
the position of the n-th set bit in the relative mask.
The bitmap_fold() operator folds a bitmap into a second that
has bit m set iff the input bitmap has some bit n set, where
m == n mod sz, for the specified sz value.
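The two definitions above can be sketched on a single 64-bit word as
follows. This is an illustration only, not the lib/bitmap.c
implementation: the ex_ names are hypothetical, and dropping origin
bits that have no corresponding set bit in the relative mask is an
assumption of the sketch.

```c
#include <stdint.h>

#define EX_BITS 64

/* Position of the n-th (0-based) set bit of mask, or -1 if mask has fewer set bits. */
static int ex_nth_set_bit(uint64_t mask, unsigned n)
{
	for (int pos = 0; pos < EX_BITS; pos++) {
		if (mask & ((uint64_t)1 << pos)) {
			if (n == 0)
				return pos;
			n--;
		}
	}
	return -1;
}

/* onto: bit n of orig maps to the n-th set bit of relmap; unmappable bits are dropped. */
static uint64_t ex_onto(uint64_t orig, uint64_t relmap)
{
	uint64_t dst = 0;

	for (unsigned n = 0; n < EX_BITS; n++) {
		if (orig & ((uint64_t)1 << n)) {
			int m = ex_nth_set_bit(relmap, n);

			if (m >= 0)
				dst |= (uint64_t)1 << m;
		}
	}
	return dst;
}

/* fold: bit n of orig sets bit (n mod sz) of the result; sz must be non-zero. */
static uint64_t ex_fold(uint64_t orig, unsigned sz)
{
	uint64_t dst = 0;

	for (unsigned n = 0; n < EX_BITS; n++)
		if (orig & ((uint64_t)1 << n))
			dst |= (uint64_t)1 << (n % sz);
	return dst;
}
```

For instance, mapping bits 1-3 onto a relative mask of 4-7 selects bits
5-7, and folding bits 1-3 with sz 3 gives bits 0-2.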
There are two substantive changes between this patch and its
predecessor bitmap_relative:
1) Renamed bitmap_relative() to be bitmap_onto().
2) Added bitmap_fold().
The essential motivation for bitmap_onto() is to provide
a mechanism for converting a cpuset-relative CPU or Node
mask to an absolute mask. Cpuset relative masks are written
as if the current task were in a cpuset whose CPUs or
Nodes were just the consecutive ones numbered 0..N-1, for
some N. The bitmap_onto() operator is provided in anticipation
of adding support for the first such cpuset relative mask,
by the mbind() and set_mempolicy() system calls, using a
planned flag of MPOL_F_RELATIVE_NODES. These bitmap operators
(and their nodemask wrappers, in particular) will be used in
code that converts the user specified cpuset relative memory
policy to a specific system node numbered policy, given the
current mems_allowed of the task's cpuset.
Such cpuset relative mempolicies will address two deficiencies
of the existing interface between cpusets and mempolicies:
1) A task cannot at present reliably establish a cpuset
relative mempolicy because there is an essential race
condition, in that the task's cpuset may be changed in
between the time the task can query its cpuset placement,
and the time the task can issue the applicable mbind or
set_mempolicy system call.
2) A task cannot at present establish what cpuset relative
mempolicy ...Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
nodemasks passed via set_mempolicy() or mbind() should be considered
relative to the current task's mems_allowed.
When the mempolicy is created, the passed nodemask is folded and mapped
onto the current task's mems_allowed. For example, consider a task
using set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES
with a nodemask of 1-3. If current's mems_allowed is 4-7, the effected
nodemask is 5-7 (the second, third, and fourth node of mems_allowed).
If the same task is attached to a cpuset, the mempolicy nodemask is
rebound each time the mems are changed. Some possible rebinds and
results are:
mems      result
1-3       1-3
1-7       2-4
1,5-6     1,5-6
1,5-7     5-7
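The results above, which continue the example of a relative nodemask of
1-3, can be reproduced with a short user-space sketch that folds the
passed nodemask to the number of allowed nodes and then maps it onto
them. The function name is hypothetical and a nodemask is simplified to
a 64-bit word; this is not the code added to mm/mempolicy.c.

```c
#include <stdint.h>

/*
 * Map a relative nodemask onto the allowed nodes: relative node n is
 * folded to (n mod weight) and placed on that allowed node, counting
 * allowed nodes from zero.
 */
static uint64_t ex_relative_nodes(uint64_t user, uint64_t allowed)
{
	unsigned weight = 0;
	int nth[64];	/* nth[i] = node number of the i-th allowed node */

	for (int node = 0; node < 64; node++)
		if (allowed & ((uint64_t)1 << node))
			nth[weight++] = node;
	if (weight == 0)
		return 0;	/* nothing to map onto */

	uint64_t result = 0;

	for (unsigned n = 0; n < 64; n++)
		if (user & ((uint64_t)1 << n))
			result |= (uint64_t)1 << nth[n % weight];
	return result;
}
```

The fold step is what makes the mems 1-3 and 1,5-6 rows cover every
allowed node: with only three allowed nodes, relative node 3 wraps
around to the first of them.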
Likewise, the zonelist built for MPOL_BIND acts on the set of zones
assigned to the resultant nodemask from the relative remap.
In the MPOL_PREFERRED case, the preferred node is remapped from the
currently effected nodemask to the relative nodemask.
This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.
Cc: Paul Jackson <pj@sgi.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/mempolicy.h | 3 ++-
mm/mempolicy.c | 33 +++++++++++++++++++++++++++++++--
mm/shmem.c | 6 ++++++
3 files changed, 39 insertions(+), 3 deletions(-)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -25,12 +25,13 @@ enum {
/* Flags for set_mempolicy */
#define MPOL_F_STATIC_NODES (1 << 15)
+#define MPOL_F_RELATIVE_NODES (1 << 14)
/*
* MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
* either set_mempolicy() or mbind().
*/
-#define MPOL_MODE_FLAGS (MPOL_F_STATIC_NODES)
+#define MPOL_MODE_FLAGS (MPOL_F_STATIC_NODES | ...Updates Documentation/vm/numa_memory_policy.txt and
Documentation/filesystems/tmpfs.txt to describe optional mempolicy mode
flags.
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
---
Documentation/filesystems/tmpfs.txt | 12 +++
Documentation/vm/numa_memory_policy.txt | 131 +++++++++++++++++++++++-------
2 files changed, 112 insertions(+), 31 deletions(-)
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -92,6 +92,18 @@
NodeList format is a comma-separated list of decimal numbers and ranges,
a range being two hyphen-separated decimal numbers, the smallest and
largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15
+NUMA memory allocation policies have optional flags that can be used in
+conjunction with their modes. These optional flags can be specified
+when tmpfs is mounted by appending them to the mode before the NodeList.
+See Documentation/vm/numa_memory_policy.txt for a list of all available
+memory allocation policy mode flags.
+
+ =static is equivalent to MPOL_F_STATIC_NODES
+ =relative is equivalent to MPOL_F_RELATIVE_NODES
+
+For example, mpol=bind=static:NodeList, is the equivalent of an
+allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES.
+
Note that trying to mount a tmpfs with an mpol option will fail if the
running kernel does not support NUMA; and will fail if its nodelist
specifies a node which is not online. If your system relies on that
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -135,9 +135,11 @@ most general to most specific:
Components of ...
