Re: Reverting per-cpuset "system" (IRQ affinity) patch

Previous thread: [PATCH] m68knommu: missing sections for linker script by Greg Ungerer on Tuesday, May 6, 2008 - 5:52 pm. (5 messages)

Next thread: [GIT]: Sparc by David Miller on Tuesday, May 6, 2008 - 6:14 pm. (1 message)
From: Paul Menage
Date: Tuesday, May 6, 2008 - 6:08 pm

Fix cpuset sched_relax_domain_level control file

Due to a merge conflict, the sched_relax_domain_level control file was
marked as being handled by cpuset_read/write_u64, but the code to handle it
was actually in cpuset_common_file_read/write.

Since the value being written/read is in fact a signed integer, it
should be treated as such; this patch adds cpuset_read/write_s64
functions, and uses them to handle the sched_relax_domain_level file.

With this patch, the sched_relax_domain_level can be read and written,
and the correct contents seen/updated.

Signed-off-by: Paul Menage <menage@google.com>

---
 kernel/cpuset.c |   52 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 40 insertions(+), 12 deletions(-)

Index: cpusetfix-2.6.26-rc1/kernel/cpuset.c
===================================================================
--- cpusetfix-2.6.26-rc1.orig/kernel/cpuset.c
+++ cpusetfix-2.6.26-rc1/kernel/cpuset.c
@@ -1031,11 +1031,9 @@ int current_cpuset_is_being_rebound(void
 	return task_cs(current) == cpuset_being_rebound;
 }
 
-static int update_relax_domain_level(struct cpuset *cs, char *buf)
+static int update_relax_domain_level(struct cpuset *cs, s64 val)
 {
-	int val = simple_strtol(buf, NULL, 10);
-
-	if (val < 0)
+	if ((int)val < 0)
 		val = -1;
 
 	if (val != cs->relax_domain_level) {
@@ -1280,9 +1278,6 @@ static ssize_t cpuset_common_file_write(
 	case FILE_MEMLIST:
 		retval = update_nodemask(cs, buffer);
 		break;
-	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
-		retval = update_relax_domain_level(cs, buffer);
-		break;
 	default:
 		retval = -EINVAL;
 		goto out2;
@@ -1348,6 +1343,30 @@ static int cpuset_write_u64(struct cgrou
 	return retval;
 }
 
+static int cpuset_write_s64(struct cgroup *cgrp, struct cftype *cft, s64 val)
+{
+	int retval = 0;
+	struct cpuset *cs = cgroup_cs(cgrp);
+	cpuset_filetype_t type = cft->private;
+
+	cgroup_lock();
+
+	if (cgroup_is_removed(cgrp)) {
+		cgroup_unlock();
+		return -ENODEV;
+	}
+	switch ...
From: Li Zefan
Date: Tuesday, May 6, 2008 - 6:21 pm

Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
--

From: Andrew Morton
Date: Tuesday, May 6, 2008 - 6:31 pm

OK, here we go again.

There is a patch in linux-next called "cpuset: system sets" which will
conflict with your change.

A bit of archeology indicates that this patch was sent out on Feb 27 and
received a bit of review feedback, including some serious-looking qualms
from Paul.  There was no response to that reviewer feedback and volia, the
damn code goes into linux-next without, afaict, any alteration.

Ingo, for the entyenth time: kernel/cpuset.c is not part of the CPU
scheduler.

argh.  Now what to do?  If I merge this patch then I need to drop
linux-next and if I drop linux-next I drop half my tree.

So I guess I need to merge this patch and then somehow smash linux-next on
top of it.  Thanks a lot.

--

From: Paul Jackson
Date: Tuesday, May 6, 2008 - 6:40 pm

Yeah - I was looking at that patch earlier today, when I was
scratching my head, trying to figure out what I broke with
mmotm.

I still don't like it.  The conversation evolved into a couple of
positions, mine and someone elses, I forget whom.  From what I
can recall now, neither of those positions endorsed the original
patch, as you have it.  Then the conversation died out, as all
parties involved other than perhaps myself prefer to work on stuff
that can result in successful patches, rather than ongoing debates.

What's the easiest way to get from where we are now, to a world
without that patch?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--


Would it help, Andrew, if I proposed a patch that went into your latest
mmtom stack, right after the three patches:

  origin.patch
  linux-next.patch
  Paul Menage's latest "Fix cpuset sched_relax_domain_level control file"

that reverted the cpuset "system" patch (this being a patch that added
a per-cpuset file called "system", which could be used to do things
such as help manage the assignment of IRQs to CPUs.)

I suspect that some of Ingo, Max Krasnyanskiy, or Peter Zijlstra will
not be happy with my doing this, but I'm pretty sure that the "system"
patch needs more work before we have agreement on the API.  I really
don't want to add the API of that current patch "as is" to the kernel.

I've added several of the people who were part of the preceding threads
on this discussion to the CC list.

I'm cooking up such a patch now -- I've gotten to the point that it
applies and builds; now I am about to see how badly it breaks the
remaining 426 patches in the mmtom stack.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--


Don't worry about it.  I sorted out things locally and I expect that
Stephen will be able to as well.  Hopefully Ingo will toss the whole patch
series so we can take another look at it all.


--


Ok ...

I hesitate more than I should some times to NAQ patches,
but the more I think about this one, the less willing I
am to have a per-cpuset file called "system".

I'm not sure what "sorted out" means ... I'll wait and see.

Thanks, Andrew.

Sorry, Max, Peter and Ingo ... we really had not arrived at
agreement on this one.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--


No problem, I've been meaning to redo this whole series but somehow
stuff got in the way and I never got around to it :-/


--

From: Max Krasnyansky
Date: Thursday, May 8, 2008 - 10:56 am

I'm actually totally surprised that it got in. Ingo applied Peter's initial
patch to his sched-devel tree but then ignored follow up patches with fixes
and stuff from me (I'm assuming that was  because we started discussion
alternative options).
Anyway, my vote goes for reverting these series.

Max
--

From: Ingo Molnar
Date: Friday, May 9, 2008 - 3:22 am

yes, there's been a lot of back and forth.

Paul/Peter/Max, what's the current agreed-upon approach to merge these 
physical resource isolation features into cpusets intelligently while 
still keeping the whole thing as usable and practical to down-to-earth 
sysadmins as possible? That is the issue that is blocking this whole 

none of this is upstream yet (nor is any of this even near to being 
ready for upstream), so there's nothing to revert.

	Ingo
--

From: Paul Jackson
Date: Friday, May 9, 2008 - 4:26 am

I thought one of the earlier patches (Max's, perhaps) that we considered
in this discussion back in Feb or March -did- end up close to traveling
upstream, via the sched-devel tree going into linux-next, or some such.

However I can't claim to understand what (almost) went down there as


Well, yeah, everyone wants "simple".  But that tends to degrade into
each of us insisting that whatever we don't appreciate need for in the
other guys proposal be removed.  That way lies not progress.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Tuesday, May 20, 2008 - 5:46 pm

Yeah, unfortunately we did not make much progress. Partly because of 
disagreements and party because I was on a longish vacation and did not get a 
chance to push things forward. Now I'm back.

At this point I want to make a step back and redo some of the original patches 
without using cpusets. At least for now while we do not have clear agreement 
on how cpuset integration should look like it seems to make sense to simply 
extend existing interfaces. For the irqs specifically I'm just thinking of 
adding /proc/irq/default_smp_affinity. I'll send some patches later this week.

Max







--

From: Andrew Morton
Date: Tuesday, May 6, 2008 - 6:38 pm

Are you sure about the typecast here?  If `val' has a value of say
0x0000_ffff_ffff_ffff then I assume the casted value will be negative, only
it wasn't?

--

From: Paul Menage
Date: Tuesday, May 6, 2008 - 6:41 pm

It seems like the simplest approach - if it's outside the range of a
positive int, set it to -1.

Paul
--

From: Adrian Bunk
Date: Wednesday, May 7, 2008 - 2:48 am

That's very hard to understand for someone who looks at the code - and 
being able to understand the code is much more important than the 
number of characters in the source code.

If you'd write something like

  if ((val < 0) || (val > INT_MAX))

instead it would be obvious for the reader what's happening here, and 

cu
Adrian

BTW: I have seen that even further restrictions are discussed in this 
     case, my comment is more about the general readability of this
     "simplest approach".

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Andrew Morton
Date: Wednesday, May 7, 2008 - 8:08 am

What he said.


Our poor reader now knows what was intended.  But he still doesn't know _why_
it was intended.

--

From: Li Zefan
Date: Tuesday, May 6, 2008 - 6:46 pm

I saw this, but I think it's ok.

  -1  : no request. use system default or follow request of others.
   0  : no search.
  ...
 ( 5~ : search system wide [on NUMA system])

Or maybe we can restrict the value from -1 to 5 ?
--

From: Paul Jackson
Date: Tuesday, May 6, 2008 - 6:49 pm

> Or maybe we can restrict the value from -1 to 5 ?

That might be a good idea.  My recollection is that this
is what we tell users to input ... values between -1 and
some small positive integer.  So returning -EINVAL on an
input of -2 (for example) should be the least surprising
behaviour for users.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Menage
Date: Tuesday, May 6, 2008 - 6:51 pm

That sounds fine to me too - I was just trying to get close to the
original behaviour.

Paul
--

From: Paul Jackson
Date: Tuesday, May 6, 2008 - 6:58 pm

An honorable goal.

Li Zefan - would you be interested in generating a patch
that fails -EINVAL for inputs outside the range of [-1 ... N]
for whatever small positive N the kernel recognizes?

This seems like a minor enough difference that I for one
don't have any problem with the current code, remapping
all negative inputs to -1, going in, and then a follow-on
patch changing that going in afterward.

Of course, if you or Seto-san prefer the current behaviour,
it would be easy to persuade me to agree.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Andrew Morton
Date: Tuesday, May 6, 2008 - 7:08 pm

I'd like to get Paul's patch into mainline this evening, to give us a
chance to get the subsequent mess sorted out in time for next
linux-next[*].  So there's no rush on this update.

[*] Judging by this:

kernel/cpuset.c: In function 'cpuset_common_file_write':
kernel/cpuset.c:1374: warning: passing argument 3 of 'update_flag' makes integer from pointer without a cast
kernel/cpuset.c: In function 'cpuset_destroy':
kernel/cpuset.c:1793: warning: passing argument 3 of 'update_flag' makes integer from pointer without a cast

Paul should drop sched-devel.  I'm suspecting it contains stale old stuff
which wasn't supposed to be there.

--

From: Paul Jackson
Date: Tuesday, May 6, 2008 - 7:11 pm

I hope that refers to Paul M.

If it refers to Paul J, he's missing a clue.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Menage
Date: Tuesday, May 6, 2008 - 7:15 pm

I'm building against 2.6.26-rc1, and don't see those warnings (or any
useful code at those lines). It's probably more fallout from the
cpuset files changes?

Paul
--

From: Andrew Morton
Date: Tuesday, May 6, 2008 - 7:28 pm

It's the git-sched-devel changes in linux-next.  Seems they are unaware of
the update_flag() changes we made in mainline last week.

--

From: Paul Jackson
Date: Tuesday, May 6, 2008 - 7:32 pm

The following error that Andrew reports seems to be another
side affect of the "system" patch to cpusets.


I see this error too:

  kernel/cpuset.c: In function 'cpuset_common_file_write':
  kernel/cpuset.c:1374: warning: passing argument 3 of 'update_flag' makes integer from pointer without a cast
  kernel/cpuset.c: In function 'cpuset_destroy':
  kernel/cpuset.c:1793: warning: passing argument 3 of 'update_flag' makes integer from pointer without a cast

I am building with:

Andrew's latest (last couple hours) mmotm, applied on top of
v2.6.26-rc1, with -just- the first two mmotm patches applied to
v2.6.26-rc1, plus your patch:

  origin.patch -- from mmotm
  linux-next.patch -- from mmotm
  Your recent "Fix cpuset sched_relax_domain_level control file" patch

The warning comes from the type inconsistence between the following
lines of kernel/cpuset.c code.  The definition of update_flag() is
expecting an "int turning_on" third argument.  The borked "system"
patch (on my wish I could kill it list) is providing a "char *"
argument.

static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
                       int turning_on)

...

static ssize_t cpuset_common_file_write(struct cgroup *cont,
                                        struct cftype *cft,
                                        struct file *file,
                                        const char __user *userbuf,
                                        size_t nbytes, loff_t *unused_ppos)
{
        struct cpuset *cs = cgroup_cs(cont);
        cpuset_filetype_t type = cft->private;
        char *buffer;
        int retval = 0;

...

        case FILE_SYSTEM:
                retval = update_flag(CS_SYSTEM, cs, buffer);

...

        if (!is_system(cs))
                update_flag(CS_SYSTEM, cs, "1");


-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Li Zefan
Date: Tuesday, May 6, 2008 - 7:12 pm

I prefer to put a limit on it, but as Andrew just said, we don't
need to rush for this. But sure I'll post a patch if Seto-san
doesn't have an objection on this.
--

From: Paul Jackson
Date: Tuesday, May 6, 2008 - 7:17 pm

Good - I agree on all points.  Thank-you.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Hidetoshi Seto
Date: Tuesday, May 6, 2008 - 7:27 pm

I apologize for the confusion...


I don't have any.

Thanks,
H.Seto
--

Previous thread: [PATCH] m68knommu: missing sections for linker script by Greg Ungerer on Tuesday, May 6, 2008 - 5:52 pm. (5 messages)

Next thread: [GIT]: Sparc by David Miller on Tuesday, May 6, 2008 - 6:14 pm. (1 message)