On Sat, 27 Oct 2007, Paul Jackson wrote:

From the standpoint of the MPOL_PREFERRED memory policy itself, there is no documented behavior or standard that specifies its interaction with cpusets. Thus, it's "undefined." We are completely free to implement an undefined behavior as we choose and to change it as Linux matures. Once it is defined, however, we carry the burden of protecting applications that are written to that definition. That's the point where we need to get it right and, if we don't, we're stuck with it forever; I don't believe we're at that point with MPOL_PREFERRED policies under cpusets right now.

Ok, let's take a look at some specific non-proprietary examples of tasks that use set_mempolicy(MPOL_PREFERRED) for a specific node, intending it to be the actual system node offset, and that are then assigned to a cpuset that doesn't require that offset to be allowed. I think it's going to be pretty difficult to find an example because the whole scenario is pretty lame: you would need to already know which nodes you're going to be assigned in the cpuset to ask for one of them as your preferred node. I don't imagine any application can have that type of foresight and, if it does, then we certainly shouldn't support the preferred node_remap() when it changes mems.

You're trying to support a scheme, in Choice A, where an application knows it's going to be assigned to a range of nodes (for example, 1-3) and wants the preferred node to be included (for example, 2). So now the application must have control over both its memory policy and its cpuset placement. Then it must be willing to change its cpuset placement to a different set of nodes (with equal or greater cardinality) and have the preferred node offset respected. Why can't it simply then issue another set_mempolicy(MPOL_PREFERRED) call for the new preferred node? See?
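To make the Choice A scheme described above concrete, here is a minimal user-space sketch of "keep the preferred node's position when mems change". It is not kernel code: the names mask_weight(), ordinal_in_mask(), node_at_ordinal() and remap_preferred() are hypothetical, nodemasks are modelled as a 64-bit integer (bit n set means node n is allowed), and both the wrap-around when the new set is smaller and the "leave it alone if the node was not in the old set" rule are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Count the nodes set in a mems-style bitmask (bit n set = node n allowed). */
int mask_weight(uint64_t mask)
{
    int n = 0;
    for (; mask; mask &= mask - 1)   /* clear the lowest set bit each pass */
        n++;
    return n;
}

/* Position of `node` among the set bits of `mask`, or -1 if it is not set. */
int ordinal_in_mask(uint64_t mask, int node)
{
    if (node < 0 || node >= 64 || !(mask & (UINT64_C(1) << node)))
        return -1;
    /* Count only the set bits below `node`. */
    return mask_weight(mask & ((UINT64_C(1) << node) - 1));
}

/* Node number of the ord-th set bit of `mask` (0-based), or -1 if none. */
int node_at_ordinal(uint64_t mask, int ord)
{
    if (ord < 0)
        return -1;
    for (int node = 0; node < 64; node++) {
        if (!(mask & (UINT64_C(1) << node)))
            continue;
        if (ord-- == 0)
            return node;
    }
    return -1;
}

/*
 * Hypothetical Choice A style remap: keep the preferred node at the same
 * relative position when the allowed set changes from old_mems to new_mems.
 * Assumptions of this sketch: the node is returned unchanged if it was not
 * in old_mems or if new_mems is empty, and the position wraps modulo the
 * size of new_mems.
 */
int remap_preferred(int preferred, uint64_t old_mems, uint64_t new_mems)
{
    int ord = ordinal_in_mask(old_mems, preferred);
    int weight = mask_weight(new_mems);
    int node;

    if (ord < 0 || weight == 0)
        return preferred;
    node = node_at_ordinal(new_mems, ord % weight);
    assert(node >= 0);   /* ord % weight always indexes a set bit */
    return node;
}
```

With mems 1-3 (mask 0x0e) and preferred node 2, moving the task to mems 4-6 (mask 0x70) makes remap_preferred(2, 0x0e, 0x70) return 5. An application that knows its new mems could just as well compute that node itself and pass it to a fresh set_mempolicy(MPOL_PREFERRED) call.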
The problem is that you're trying to protect an application that knows its initial cpuset mems [the only way it could ever send a set_mempolicy(MPOL_PREFERRED) for the right node in that range in the first place] but then seemingly loses control over its cpuset and expects the kernel to fix things up for it without the burden of issuing another set_mempolicy() call. And you're trying to protect this application, which based its implementation not on a standard or documentation, but on observed behavior. My bet is that it's going to issue that subsequent set_mempolicy(), at least if libnuma returned a numa_preferred() value that it wasn't expecting.

I don't see how you can accomplish that. If the default behavior is Choice B, which is different from what is currently implemented in the kernel, you're going to either require a modification to the application to set a flag asking for Choice A again, or make the default kernel behavior that of Choice A and set a flag implicitly via libnuma when future versions are released. In the former case, just ask the application to adjust its node numbering scheme or check the result of numa_preferred(). In the latter case, we're not even talking about changing the kernel default to Choice B anymore.

If you add this per-task mode flag to default to Choice A for preferred memory policies, it'll be extremely confusing to document and support. If it's already decided that we should default to Choice B, it's going to require an update to the application to write to /proc/pid/i_want_choice_A or use the new set_mempolicy() option anyway, so instead of adding that hack you should simply fix your node numbering. And I suspect that if that per-task mode flag is added, it will eventually be the subject of a thread titled "is this highly specialized flag even used anymore?", at which point it will be marked deprecated and eventually obsoleted.

Yeah, remapping the nodemask is a bad idea anyway to get a preferred node.
Preferred nodes inherently deal with offsets from node 0 anyway.

That still requires a change to the application. So they should simply rethink their node numbering instead and fix their application to follow a behavior that will, at that point, be documented.

Any application that doesn't respect the return value of set_mempolicy(MPOL_PREFERRED) isn't worth supporting anyway.

There are two cases to think about:

 - When the cpuset assignment changes from the root cpuset to a user-created cpuset with a subset of system mems and then set_mempolicy() is called, and

 - When set_mempolicy() is called and then the cpuset mems change, either because the task was attached to a different cpuset or because someone wrote to its 'mems' file.

In the first case, the new API should return -EINVAL if you ask for a preferred node offset that is not smaller than the cardinality of your mems_allowed. That will catch some of these applications that may have actually been implemented based on the current undocumented behavior.

In the second case, the first node in the nodemask passed to set_mempolicy() was a system node offset anyway and had nothing to do with cpusets (the task was a member of the root cpuset with access to all mems), so it already behaves as Choice B.

I think any application that gets constrained to a subset of nodes in its mems_allowed, and then bases its preferred node number off that subset to create an offset that is intended to be preserved over subsequent mems changes, without rechecking the result with numa_preferred() or issuing a subsequent set_mempolicy(), is poorly written. Especially since that behavior was undocumented.

David
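The -EINVAL check in the first case above can be sketched in user space as follows. This is an illustration under stated assumptions, not the proposed kernel API: mems_weight() and preferred_from_offset() are hypothetical names, mems_allowed is modelled as a 64-bit mask (bit n set means node n is allowed), and the preferred node is treated as a 0-based offset into that mask, which is how Choice B is read here.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Number of nodes set in a mems_allowed-style bitmask. */
int mems_weight(uint64_t mems)
{
    int n = 0;
    for (; mems; mems &= mems - 1)   /* clear the lowest set bit each pass */
        n++;
    return n;
}

/*
 * Hypothetical helper: interpret `offset` as a 0-based position within the
 * task's mems_allowed and return the corresponding system node number.
 * Returns -EINVAL when the offset does not fit, i.e. when it is negative or
 * not smaller than the number of allowed nodes.
 */
int preferred_from_offset(int offset, uint64_t mems_allowed)
{
    if (offset < 0 || offset >= mems_weight(mems_allowed))
        return -EINVAL;

    for (int node = 0; node < 64; node++) {
        if (!(mems_allowed & (UINT64_C(1) << node)))
            continue;
        if (offset-- == 0)
            return node;
    }

    /* The range check above guarantees the loop returns. */
    assert(!"unreachable: offset was checked against the mask weight");
    return -EINVAL;
}
```

For a task confined to mems 4-6 (mask 0x70), preferred_from_offset(2, 0x70) yields node 6, while preferred_from_offset(5, 0x70) yields -EINVAL. The second call is what an application would hit if it passed system node number 5 expecting the old behavior. If the mems later change to 1-2 (mask 0x06), the same offset 1 simply resolves to node 2 with no remapping step.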
