David, responding to pj, wrote:

No -- MPOL_F_STATIC_NODES does not handle my second case. Notice the phrase 'cpuset relative.'

In my second case, nodes are numbered relative to the cpuset. If you say "node 2" then you mean whatever is the third (since numbering is zero based) node in your current cpuset.

This cpuset relative mode that I'm proposing is an entirely different mode of numbering nodes than anything we've seen so far (except for our side discussions with just yourself, Lee and Christoph, last December.) In this mode, "node 2" doesn't mean what the system calls "node 2"; it means the third node in whatever is one's current cpuset placement (if your cpuset even has that many nodes), and mempolicies using this mode are automatically remapped by the kernel, anytime the cpuset placement changes.

This second, cpuset relative, mode is required:

 1) to provide a race-free means for an application to place its memory when the application considers all physical nodes essentially equivalent, and just wants to lay things out relative to whatever cpuset it happens to be running in, and

 2) to provide a practical means, without the need for constantly reprobing one's current cpuset placement, for an application to specify cpuset-relative placement to be applied whenever the application is placed in a larger cpuset, even if the application is presently in a smaller cpuset.

Without it, the application has to first query its current physical cpuset placement, then translate its desired relative placement into terms of the current physical nodes it finds itself on, and then issue various mbind or set_mempolicy calls using those current physical node numbers, praying that it wasn't migrated to other physical nodes at the same time, which would have invalidated the placement assumptions behind its relative-to-physical node number calculations.
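
To make the numbering concrete, here is a minimal userspace sketch of the relative-to-physical translation described above. It is not kernel code and not the proposed patch: the 64-bit mask type and the function names are made up for illustration, and dropping relative nodes that the cpuset is too small to hold is just one possible choice.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-node mask for illustration; the kernel uses nodemask_t. */
typedef uint64_t nodemask64;

/*
 * Translate a cpuset-relative mask into physical node numbers:
 * relative node n maps to the n-th (zero-based) node set in 'allowed'.
 * Relative nodes beyond the number of allowed nodes are dropped
 * (an assumption made for this sketch only).
 */
nodemask64 relative_to_physical(nodemask64 relative, nodemask64 allowed)
{
    nodemask64 physical = 0;
    unsigned int ordinal = 0;   /* position among the allowed nodes */

    for (unsigned int node = 0; node < 64; node++) {
        if (!(allowed & (UINT64_C(1) << node)))
            continue;           /* node is not in the cpuset */
        if (relative & (UINT64_C(1) << ordinal))
            physical |= UINT64_C(1) << node;
        ordinal++;
    }
    return physical;
}

/* Relative "node 2" followed across three cpuset placements. */
void example_remap(void)
{
    nodemask64 rel = UINT64_C(1) << 2;

    /* cpuset = physical nodes 4-7: relative node 2 is physical node 6 */
    assert(relative_to_physical(rel, 0xF0) == (UINT64_C(1) << 6));
    /* migrated to physical nodes 8-11: the same request now means node 10 */
    assert(relative_to_physical(rel, 0xF00) == (UINT64_C(1) << 10));
    /* shrunk to nodes 4-5: there is no third node to map to */
    assert(relative_to_physical(rel, 0x30) == 0);
}
```

An application doing this translation itself has to redo it after every placement change, which is where the race comes from. With a relative mode, the kernel would keep the relative mask and redo the mapping on its own.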

And without it, if an application is currently placed in a smaller cpuset than it would prefer, it cannot specify how it would want its mempolicies laid out should it ever subsequently be moved to a larger cpuset. This leaves such an application with little option other than to constantly reprobe its cpuset placement, in order to know if and when it needs to reissue its mbind and set_mempolicy calls because it gained access to a larger cpuset.

I agree, David, that this present MPOL_F_STATIC_NODES patch handles the case of a growing cpuset (or hotplug added nodes) for the static mapped case (node "2" means physical system node "2", always.)

But this present patch, by design, does not address the case of a growing cpuset for the case where an application actually wants its mempolicies remapped.

My original code remapping mempolicies when cpuset placement changes, which has been in the kernel for a couple of years now, was -supposed- to handle this relative case. But it is flawed, as it fails to meet the requirements I list above. We can't just change the way that mode works now, for compatibility reasons. So we need to add a new cpuset relative mode, just as you're adding a STATIC mode, that addresses the above requirements for a cpuset relative, remapped, numbering of mempolicy nodes.

Ok - I tried to elaborate, above. You (David), Lee and Christoph will perhaps recognize this elaboration, as it essentially repeats things I said in our earlier discussion in December and early January.

Yes - hotplug presents the same problems as cpusets growing larger.

If this makes sense to you, David, and you'd like to include this second, cpuset relative mode, in your patchset, that would be excellent. Given that I have not been very good at explaining this second mode you might choose not to do that; in that case I'll have to follow up, after your second patch shapes up, with a patch of my own, adding this cpuset relative mode.

I'd suggest we let future expansions deal with their own needs.
We don't usually pad internal (not externally visible) data structures in anticipation that someone else might need the space in the future.

At least earlier, Andi Kleen, when he was the primary author and sole active maintainer of this mempolicy code, was always keen to avoid expanding the size of 'struct mempolicy' by another nodemask. I have not done the calculations myself to know how vital it is to keep the size of struct mempolicy to a minimum. It certainly seems worth a bit of effort, however, if adding this union of these two nodemasks doesn't complicate the code too horribly much.

Cool. Thanks. (I'm glad you caved ... ;).

Looking forward in my inbox, I see that Lee has some suggestions on where to handle the conversion between the packed mode and the separate fields. I'm too lazy to think about that more, and will likely acquiesce to whatever you and Lee agree to.

Ah - good. I missed that. Just to be sure I'm reading the code right, I take it that it is the following line, at the end of the mpol_new() routine, that stores the unaltered user nodemask ... right?

    policy->user_nodemask = *nodes;

I would be inclined toward having the classic compatibility mode (no such mode as MPOL_F_STATIC_NODES specified) continue to do as it always has done, which apparently includes failing with EINVAL in some cases due to an empty nodemask intersection with the current cpuset.

But I would also expect that this new MPOL_F_STATIC_NODES would allow specifying any nodes, whether or not any of them were in the task's current cpuset. Looking ahead, I think Lee is saying pretty much the same thing.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
