Fix cpuset sched_relax_domain_level control file
Due to a merge conflict, the sched_relax_domain_level control file was
marked as being handled by cpuset_read/write_u64, but the code to handle it
was actually in cpuset_common_file_read/write.
Since the value being written/read is in fact a signed integer, it
should be treated as such; this patch adds cpuset_read/write_s64
functions, and uses them to handle the sched_relax_domain_level file.
With this patch, the sched_relax_domain_level can be read and written,
and the correct contents seen/updated.
Signed-off-by: Paul Menage <menage@google.com>
---
kernel/cpuset.c | 52 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 40 insertions(+), 12 deletions(-)
Index: cpusetfix-2.6.26-rc1/kernel/cpuset.c
===================================================================
--- cpusetfix-2.6.26-rc1.orig/kernel/cpuset.c
+++ cpusetfix-2.6.26-rc1/kernel/cpuset.c
@@ -1031,11 +1031,9 @@ int current_cpuset_is_being_rebound(void
return task_cs(current) == cpuset_being_rebound;
}
-static int update_relax_domain_level(struct cpuset *cs, char *buf)
+static int update_relax_domain_level(struct cpuset *cs, s64 val)
{
- int val = simple_strtol(buf, NULL, 10);
-
- if (val < 0)
+ if ((int)val < 0)
val = -1;
if (val != cs->relax_domain_level) {
@@ -1280,9 +1278,6 @@ static ssize_t cpuset_common_file_write(
case FILE_MEMLIST:
retval = update_nodemask(cs, buffer);
break;
- case FILE_SCHED_RELAX_DOMAIN_LEVEL:
- retval = update_relax_domain_level(cs, buffer);
- break;
default:
retval = -EINVAL;
goto out2;
@@ -1348,6 +1343,30 @@ static int cpuset_write_u64(struct cgrou
return retval;
}
+static int cpuset_write_s64(struct cgroup *cgrp, struct cftype *cft, s64 val)
+{
+ int retval = 0;
+ struct cpuset *cs = cgroup_cs(cgrp);
+ cpuset_filetype_t type = cft->private;
+
+ cgroup_lock();
+
+ if (cgroup_is_removed(cgrp)) {
+ cgroup_unlock();
+ return -ENODEV;
+ }
+ switch ...Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> --
OK, here we go again. There is a patch in linux-next called "cpuset: system sets" which will conflict with your change. A bit of archeology indicates that this patch was sent out on Feb 27 and received a bit of review feedback, including some serious-looking qualms from Paul. There was no response to that reviewer feedback and volia, the damn code goes into linux-next without, afaict, any alteration. Ingo, for the entyenth time: kernel/cpuset.c is not part of the CPU scheduler. argh. Now what to do? If I merge this patch then I need to drop linux-next and if I drop linux-next I drop half my tree. So I guess I need to merge this patch and then somehow smash linux-next on top of it. Thanks a lot. --
Yeah - I was looking at that patch earlier today, when I was
scratching my head, trying to figure out what I broke with
mmotm.
I still don't like it. The conversation evolved into a couple of
positions, mine and someone elses, I forget whom. From what I
can recall now, neither of those positions endorsed the original
patch, as you have it. Then the conversation died out, as all
parties involved other than perhaps myself prefer to work on stuff
that can result in successful patches, rather than ongoing debates.
What's the easiest way to get from where we are now, to a world
without that patch?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Would it help, Andrew, if I proposed a patch that went into your latest
mmtom stack, right after the three patches:
origin.patch
linux-next.patch
Paul Menage's latest "Fix cpuset sched_relax_domain_level control file"
that reverted the cpuset "system" patch (this being a patch that added
a per-cpuset file called "system", which could be used to do things
such as help manage the assignment of IRQs to CPUs.)
I suspect that some of Ingo, Max Krasnyanskiy, or Peter Zijlstra will
not be happy with my doing this, but I'm pretty sure that the "system"
patch needs more work before we have agreement on the API. I really
don't want to add the API of that current patch "as is" to the kernel.
I've added several of the people who were part of the preceding threads
on this discussion to the CC list.
I'm cooking up such a patch now -- I've gotten to the point that it
applies and builds; now I am about to see how badly it breaks the
remaining 426 patches in the mmtom stack.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Don't worry about it. I sorted out things locally and I expect that Stephen will be able to as well. Hopefully Ingo will toss the whole patch series so we can take another look at it all. --
Ok ...
I hesitate more than I should some times to NAQ patches,
but the more I think about this one, the less willing I
am to have a per-cpuset file called "system".
I'm not sure what "sorted out" means ... I'll wait and see.
Thanks, Andrew.
Sorry, Max, Peter and Ingo ... we really had not arrived at
agreement on this one.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
No problem, I've been meaning to redo this whole series but somehow stuff got in the way and I never got around to it :-/ --
I'm actually totally surprised that it got in. Ingo applied Peter's initial patch to his sched-devel tree but then ignored follow up patches with fixes and stuff from me (I'm assuming that was because we started discussion alternative options). Anyway, my vote goes for reverting these series. Max --
yes, there's been a lot of back and forth. Paul/Peter/Max, what's the current agreed-upon approach to merge these physical resource isolation features into cpusets intelligently while still keeping the whole thing as usable and practical to down-to-earth sysadmins as possible? That is the issue that is blocking this whole none of this is upstream yet (nor is any of this even near to being ready for upstream), so there's nothing to revert. Ingo --
I thought one of the earlier patches (Max's, perhaps) that we considered
in this discussion back in Feb or March -did- end up close to traveling
upstream, via the sched-devel tree going into linux-next, or some such.
However I can't claim to understand what (almost) went down there as
Well, yeah, everyone wants "simple". But that tends to degrade into
each of us insisting that whatever we don't appreciate need for in the
other guys proposal be removed. That way lies not progress.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Yeah, unfortunately we did not make much progress. Partly because of disagreements and party because I was on a longish vacation and did not get a chance to push things forward. Now I'm back. At this point I want to make a step back and redo some of the original patches without using cpusets. At least for now while we do not have clear agreement on how cpuset integration should look like it seems to make sense to simply extend existing interfaces. For the irqs specifically I'm just thinking of adding /proc/irq/default_smp_affinity. I'll send some patches later this week. Max --
Are you sure about the typecast here? If `val' has a value of say 0x0000_ffff_ffff_ffff then I assume the casted value will be negative, only it wasn't? --
It seems like the simplest approach - if it's outside the range of a positive int, set it to -1. Paul --
That's very hard to understand for someone who looks at the code - and
being able to understand the code is much more important than the
number of characters in the source code.
If you'd write something like
if ((val < 0) || (val > INT_MAX))
instead it would be obvious for the reader what's happening here, and
cu
Adrian
BTW: I have seen that even further restrictions are discussed in this
case, my comment is more about the general readability of this
"simplest approach".
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
What he said. Our poor reader now knows what was intended. But he still doesn't know _why_ it was intended. --
I saw this, but I think it's ok. -1 : no request. use system default or follow request of others. 0 : no search. ... ( 5~ : search system wide [on NUMA system]) Or maybe we can restrict the value from -1 to 5 ? --
> Or maybe we can restrict the value from -1 to 5 ?
That might be a good idea. My recollection is that this
is what we tell users to input ... values between -1 and
some small positive integer. So returning -EINVAL on an
input of -2 (for example) should be the least surprising
behaviour for users.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
That sounds fine to me too - I was just trying to get close to the original behaviour. Paul --
An honorable goal.
Li Zefan - would you be interested in generating a patch
that fails -EINVAL for inputs outside the range of [-1 ... N]
for whatever small positive N the kernel recognizes?
This seems like a minor enough difference that I for one
don't have any problem with the current code, remapping
all negative inputs to -1, going in, and then a follow-on
patch changing that going in afterward.
Of course, if you or Seto-san prefer the current behaviour,
it would be easy to persuade me to agree.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I'd like to get Paul's patch into mainline this evening, to give us a chance to get the subsequent mess sorted out in time for next linux-next[*]. So there's no rush on this update. [*] Judging by this: kernel/cpuset.c: In function 'cpuset_common_file_write': kernel/cpuset.c:1374: warning: passing argument 3 of 'update_flag' makes integer from pointer without a cast kernel/cpuset.c: In function 'cpuset_destroy': kernel/cpuset.c:1793: warning: passing argument 3 of 'update_flag' makes integer from pointer without a cast Paul should drop sched-devel. I'm suspecting it contains stale old stuff which wasn't supposed to be there. --
I hope that refers to Paul M.
If it refers to Paul J, he's missing a clue.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I'm building against 2.6.26-rc1, and don't see those warnings (or any useful code at those lines). It's probably more fallout from the cpuset files changes? Paul --
It's the git-sched-devel changes in linux-next. Seems they are unaware of the update_flag() changes we made in mainline last week. --
The following error that Andrew reports seems to be another
side affect of the "system" patch to cpusets.
I see this error too:
kernel/cpuset.c: In function 'cpuset_common_file_write':
kernel/cpuset.c:1374: warning: passing argument 3 of 'update_flag' makes integer from pointer without a cast
kernel/cpuset.c: In function 'cpuset_destroy':
kernel/cpuset.c:1793: warning: passing argument 3 of 'update_flag' makes integer from pointer without a cast
I am building with:
Andrew's latest (last couple hours) mmotm, applied on top of
v2.6.26-rc1, with -just- the first two mmotm patches applied to
v2.6.26-rc1, plus your patch:
origin.patch -- from mmotm
linux-next.patch -- from mmotm
Your recent "Fix cpuset sched_relax_domain_level control file" patch
The warning comes from the type inconsistence between the following
lines of kernel/cpuset.c code. The definition of update_flag() is
expecting an "int turning_on" third argument. The borked "system"
patch (on my wish I could kill it list) is providing a "char *"
argument.
static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
int turning_on)
...
static ssize_t cpuset_common_file_write(struct cgroup *cont,
struct cftype *cft,
struct file *file,
const char __user *userbuf,
size_t nbytes, loff_t *unused_ppos)
{
struct cpuset *cs = cgroup_cs(cont);
cpuset_filetype_t type = cft->private;
char *buffer;
int retval = 0;
...
case FILE_SYSTEM:
retval = update_flag(CS_SYSTEM, cs, buffer);
...
if (!is_system(cs))
update_flag(CS_SYSTEM, cs, "1");
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I prefer to put a limit on it, but as Andrew just said, we don't need to rush for this. But sure I'll post a patch if Seto-san doesn't have an objection on this. --
Good - I agree on all points. Thank-you.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I apologize for the confusion... I don't have any. Thanks, H.Seto --
