Re: [RFC/PATCH 0/4] CPUSET driven CPU isolation

Previous thread: [RFC/PATCH 3/4] genirq: system set irq affinities by Peter Zijlstra on Wednesday, February 27, 2008 - 3:21 pm. (3 messages)

Next thread: [RFC/PATCH 4/4] kthread: system set kthread affinities by Peter Zijlstra on Wednesday, February 27, 2008 - 3:21 pm. (1 message)
From: Peter Zijlstra
Date: Wednesday, February 27, 2008 - 3:21 pm

My vision on the direction we should take wrt cpu isolation.

Next on the list would be figuring out a nice solution to the workqueue
flush issue.

--

From: Max Krasnyanskiy
Date: Wednesday, February 27, 2008 - 4:38 pm

General impressions:
- "cpu_system_map" is %100 identical to the "~cpu_isolated_map" as in my 
patches. It's updated from different place but functionally wise it's very 
much the same. I guess you did not like the 'isolated' name ;-). As I 
mentioned before I'm not hung up on the name so it's cool :).

- Updating cpu_system_map from cpusets
There are a couple of things that I do not like about this approach:
1. We lost the ability to isolate CPUs at boot. Which means slower boot times 
for me (ie before I can start my apps I need to create cpuset, etc). Not a big 
deal, I can live with it.

2. We now need another notification mechanism to propagate the updates to the 
cpu_system_map. That by itself is not a big deal. The big deal is that now we 
need to basically audit the kernel and make sure that everything affected must
have proper notifier and react to the mask changes.
For example your current patch does not move the timers and I do not think it 
makes sense to go and add notifier for the timers. I think the better approach 
is to use CPU hotplug here. ie Enforce the rule that cpu_system_map is updated 
  only when CPU is off-line.
By bringing CPU down first we get a lot of features for free. All the kernel 
threads, timers, softirqs, HW irqs, workqueues, etc are properly 
terminated/moved/canceled/etc. This gives us a very clean state when we bring 
the CPU back online with "system" bit cleared (or "isolated" bit set like in 
my patches). I do not see a good reason for reimplementing that functionality 
via system_map notifiers.

Do not forget the "stop machine", or more specifically module loading/unloading.

Max
--

From: Peter Zijlstra
Date: Thursday, February 28, 2008 - 3:19 am

Ah, but you miss that cpu_system_map doesn't do the one thing the
cpu_isolated_map ever did, prevent sched_domains from forming on those
cpus.


I'm sure those few lines in rc.local won't grow your boot time by a
significant amount of time.

That said, we should look into a replacement for the boot time parameter

I'm not convinced cpu hotplug notifiers are the right thing here. Sure
we could easily iterate the old and new system map and call the matching
cpu hotplug notifiers, but they seem overly complex to me.

The audit would be a good idea anyway. If we do indeed end up with a 1:1

No, the full stop machine thing, there are more interesting users than
module loading. But I'm not too interested in solving this particular
problem atm, I have too much on my plate as it is.

--

From: Max Krasnyanskiy
Date: Thursday, February 28, 2008 - 10:33 am

Did you see my reply on how to support "RT balancer" with cpu_isolated_map ?
Anyway, my point was that you could've just told me something like:
    "lets rename cpu_isolated_map and invert it"
That's is it. The gist of the idea is exactly the same. There is a bitmap that 
I was not talking about calling notifiers directly. We can literally bring the 
I was not suggesting that it's up to you to solve it. You made it sounds like 
your patches provide complete solution, just need to fix the workqueues. I was 
merely pointing out that there is also stop machine that needs fixing.

btw You did not comment about the fact that your patch does not move timers.
I was trying it out last night. It's definitely not ready yet.

Max
--

From: Ingo Molnar
Date: Thursday, February 28, 2008 - 12:50 am

nice work Peter, i find this "system sets" extension to cpusets a much 
more elegant (and much more future-proof) solution than the proposed 
spreadout of the limited hack of isolcpus/cpu_isolated_map. It 
concentrates us on a single API and on a single mechanism to handle 
isolation matters. (be that for clustering/supercomputing or real-time 
purposes)

Thanks for insisting on using cpusets for this!

i've queued up your patches in sched-devel.git, and lets make sure this 
has no side-effects on existing functionality. (it shouldnt)

	Ingo
--

From: Paul Jackson
Date: Thursday, February 28, 2008 - 1:08 am

Before this patchset gets too far, I'd like to decide on whether to
adapt my suggestion to call that per-cpuset flag 'cpus_system' (or
anything else with 'cpu' in it, perhaps 'system_cpus' would be more
idiomatic), rather than the tad too generic 'system'.

People doing 'ls /dev/cpuset' should be able to half-way guess
what things do, just from their name.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Ingo Molnar
Date: Thursday, February 28, 2008 - 2:08 am

yeah. In fact i'm not at all sure this is really a "system" thing - it's 
more of a "bootup" default.

once the system has booted up and the user is in a position to create 
cpusets, i believe the distinction and assymetry between any bootup 
cpuset and the other cpusets should vanish. The "bootup" cpuset is just 
a convenience container to handle everything that the box booted up 
with, and then we can shrink it (without having to enumerate every PID 
and every irq and other resource explicitly) to make place for other 
cpusets.

maybe it's even more idomatic to call it "set0" and just create a 
/dev/cpuset/set0/ directory for it and making it an explicit cpuset - 
instead of the hardcoded /dev/cpusets/system thing? Do you have any 

oh, certainly. This is at the earliest v2.6.26 material - but now it at 
least looks clean conceptually, fits more nicely into cpusets instead of 
being a bolted-on thing.

	Ingo
--

From: Paul Jackson
Date: Thursday, February 28, 2008 - 2:17 am

I'm not quite sure of what you're thinking here; rather I'm just
bouncing off the sound of your words.

But your words sound alot like what we at SGI call a 'boot' cpuset.

Our big honkin NUMA customers, who are managing most of the system
either for a few dedicated, very-important jobs, and/or under a
batch scheduler, need to leave a few nodes to run the classic Unix
load such as init, cron, assorted daemons and the admins login shell.

So we provide them some init script mechanisms that make it easy to
set this up, which includes moving every task (not many at the low
numbered init script time this runs) that isn't pinned (doesn't have
a restricted Cpus_allowed) into the boot cpuset, conventionally
named /dev/cpuset/boot.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: David Rientjes
Date: Thursday, February 28, 2008 - 2:32 am

Should the kernel refuse to move some threads, such as the migration 
or watchdog kthreads, out of the root cpuset where the mems can be 
adjusted to disallow access to the cpu to which they are bound?  This is 
a quick way to cause a crash or soft lockup.

		David
--

From: David Rientjes
Date: Thursday, February 28, 2008 - 3:12 am

Something like this?
---
 include/linux/sched.h |    1 +
 kernel/cpuset.c       |    5 ++++-
 kernel/kthread.c      |    1 +
 kernel/sched.c        |    6 ++++++
 4 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1464,6 +1464,7 @@ static inline void put_task_struct(struct task_struct *t)
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB	0x02000000	/* Spread some slab caches over cpuset */
+#define PF_CPU_BOUND	0x04000000	/* Kthread bound to specific cpu */
 #define PF_MEMPOLICY	0x10000000	/* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezeable */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1175,11 +1175,14 @@ static void cpuset_attach(struct cgroup_subsys *ss,
 	struct mm_struct *mm;
 	struct cpuset *cs = cgroup_cs(cont);
 	struct cpuset *oldcs = cgroup_cs(oldcont);
+	int ret;
 
 	mutex_lock(&callback_mutex);
 	guarantee_online_cpus(cs, &cpus);
-	set_cpus_allowed(tsk, cpus);
+	ret = set_cpus_allowed(tsk, cpus);
 	mutex_unlock(&callback_mutex);
+	if (ret < 0)
+		return;
 
 	from = oldcs->mems_allowed;
 	to = cs->mems_allowed;
diff --git a/kernel/kthread.c b/kernel/kthread.c
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -180,6 +180,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
 	wait_task_inactive(k);
 	set_task_cpu(k, cpu);
 	k->cpus_allowed = cpumask_of_cpu(cpu);
+	k->flags |= PF_CPU_BOUND;
 }
 EXPORT_SYMBOL(kthread_bind);
 
diff --git a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5345,6 +5345,12 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
 		goto out;
 	}
 
+	if ...
From: Peter Zijlstra
Date: Thursday, February 28, 2008 - 3:26 am

Indeed, there is a hole in my cpus_match_system() logic in that when the
system set is reduced to a single cpu, the tasks bound to that cpu also
match.

I had wanted to avoid adding PF_ flags (as I remember we're running
short on them), but I think you're right.


--

From: Paul Jackson
Date: Thursday, February 28, 2008 - 10:37 am

I don't have strong opinions either way on this patch; it adds an error
check that makes sense.  I haven't seen much problem not having this check,
nor do I know of any code that depends on doing what this check prohibits.

Except for three details:

 1) +	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
    +	    	     !cpus_equal(p->cpus_allowed, new_mask))) {
    +		ret = -EINVAL;

    The check for equal cpus allowed seems too strong.  Shouldn't you be
    checking that all of task p's cpus_allowed would still be allowed in
    the new mask:

    +	if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
    +	    	     !cpus_subset(p->cpus_allowed, new_mask))) {
    +		ret = -EINVAL;

 2) Doesn't this leave out a check for the flip side -- shrinking
    the cpus allowed by a cpuset so that it no longer contains those
    required by any PF_CPU_BOUND tasks in that cpuset?  I'm not sure
    if this second check is a good idea or not.

 3) Could we call this PF_CPU_PINNED instead?  I tend to use "cpu
    bound" to refer to tasks that consume alot of CPU cycles (which
    these pinned tasks rarely do), and "pinned" to refer to what is
    done to confine a task to a particular subset of all possible CPUs.
    It looks to me like some code in kernel/sched.c already uses the
    word pinned in this same way, so PF_CPU_PINNED would be more
    consistent terminology.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: David Rientjes
Date: Thursday, February 28, 2008 - 2:24 pm

How about moving watchdog/0 to a cpuset with exclusive access to only cpu 

That's a convenient way for a kthread to temporarily expand its set of 
cpus_allowed and then never be able to remove the added cpus again.  Do 

That's why the check in set_cpus_allowed() is

	cpus_equal(p->cpus_allowed, newmask)

since it prevents PF_CPU_BOUND tasks from being moved out of the root 

PF_CPU_BOUND follows the nomenclature of kthread_bind() really well, but 
it could probably be confused with a processor-bound task.  So perhaps 
PF_BOUND_CPU is even better?

		David
--

From: Paul Jackson
Date: Thursday, February 28, 2008 - 3:46 pm

Good question.  Actually, we -normally- have pinned tasks in the top cpuset,
where the top cpuset allows all CPUs, but the pinned task has a cpus_allowed
(in its task struct) of just one or a few CPUs (for node pinning.)

So ... could we allow moving pinned threads to any cpuset that allowed
the CPUs to which it was pinned (my cpus_subset() test, above), but
-not- change the pinned tasks cpus_allowed in its task struct, keeping

I don't think that the cpus_equal() check prevents that (shrinking a
pinned tasks cpuset out from under it.)  Try the following on a freshly
booted system with your proposed patch:

  mkdir /dev/cpuset
  mount -t cpuset cpuset /dev/cpuset
  cd /dev/cpuset
  mkdir a
  cp ???s a
  < tasks sed -un -e p -e 10q > a/tasks

I'll wager you just moved a few pinned tasks into cpuset 'a'.  This
would be allowed, as 'a' has the same cpus as the top cpuset.  But then
one could shrink a (if it had more than 1 CPU in the first place), leaving
some pinned tasks in a cpuset they weren't allowed to run in, essentially

Good point - "BOUND" as the past tense of "BIND".  How about
PF_THREAD_BIND ;)?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: David Rientjes
Date: Thursday, February 28, 2008 - 4:00 pm

Same situation occurs for a task bound to a cpuset that issues a 
sched_setaffinity() call to restrict its cpumask further.  There's nothing 

I'd hesitate to do that unless you can guarantee that restricting 
kthreads mems_allowed via the cpuset interface won't cause any problems 
either.  Is there a benefit to reducing the size of a kthread's 
mems_allowed that doesn't have an adverse effect on the kernel?  What 

You can move them, but you cannot reduce the size of the thread's 
cpus_allowed since it was bound to a specific cpu via kthread_bind().  The 
change to set_cpus_allowed() simply prevents this from happening in the 

Sure.
--

From: Paul Jackson
Date: Thursday, February 28, 2008 - 5:16 pm

I still don't understand ... you must have some context in mind that
I've spaced out ... I can't even tell if that is a statement or a

Well ... I'm suspecting we've got this portion of our discussion wrapped
around the axle one time too many.

Backing up, hopefully unwrapping, you seemed to allow moving bound tasks
only to cpusets with the same cpus (how come you didn't check for the
same memory nodes too?).  If you really needed to move bound tasks at all,
then that seemed like an unnecessarily tight constraint.  It wouldn't hurt
the bound task to move to another cpuset that still allowed the CPUs it was
bound to.

... but after an another iteration of that subthread ... I'm wondering
why you have to move bound tasks at all.  How about PF_THREAD_BIND just
meaning (1) "can't be moved to any other cpuset", and (2) "always
placed in the top cpuset," so we don't have to worry about being unable
to move threads out of child cpusets.

Do you have any situation in which pinned threads have to be moved?

I don't.  Can we just prohibit it?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: David Rientjes
Date: Thursday, February 28, 2008 - 6:05 pm

You said that you weren't aware of any problems that could arise that are 
fixed with this added check in set_cpus_allowed(), so I asked that you 
setup a cpuset that doesn't have access to cpu 0 and move the watchdog/0 

The fix is more general than just the cpusets case, even though it is 
probably the only user in the kernel of set_cpus_allowed() that allows 
bound kthreads to be moved.

The idea is to expressly deny kthreads that have called kthread_bind() 
from changing their cpus_allowed via set_cpus_allowed().  Cpusets should 
handle that gracefully, but my fix is more general: it adds the check to 
set_cpus_allowed() instead of in the cpusets code.

This is a little difficult with the current way that cpusets calls 
set_cpus_allowed() since its attach function is void and does not return 
error or success to the cgroup interface, yet my patch prevents the soft 
lockups and kernel crashes that can occur when cpusets changes the 
cpus_allowed of watchdog or migration threads that is obviously in error.

		David
--

From: Paul Jackson
Date: Thursday, February 28, 2008 - 8:34 pm

Ok, now I understand your question - thanks.

I think your question arises from misreading what I wrote.

I did not say that I wasn't "aware of any problems that could arise"


 - This does not say no (none whatsoever) problem could (ever in the future) arise.

 - This does say not much (just a little) problem had arisen (so far in the past).

Apparently, you thought I was trying to reject the patch on the grounds
that no such problem could ever occur, and you were showing how such a
problem could occur.  I wasn't trying to reject the patch, and I agree
that the check made sense, and I agree that such a problem could occur,
as your example shows.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: David Rientjes
Date: Thursday, February 28, 2008 - 9:00 pm

Ok, good.  We're in agreement that my patch fixes a problem in allowing 
kthreads to move to cpusets that either currently or eventually deny them 
access to the cpu to which they are bound.

Now do you have any preference without extensive cpusets or cgroups 
hacking where we can gracefully handle -EINVAL being returned from 
set_cpus_allowed() so the task doesn't end up in a different cpuset?  The 
return value of set_cpus_allowed() is currently ignored by the cpuset 
implementation and, regardless, cpuset_attach() returns void.

cpuset_can_attach() does some sanity checking, but we need a call to 
set_cpus_allowed() to check the new logic and will actually change 
tsk->cpus_allowed if correctly invoked.

So the cpusets implementation needs a similar check in cpuset_can_attach() 
so that it is expressly denied from moving before we ever call 
cpuset_attach().  It doesn't make sense that a kthread is going to move 
itself through the cpusets interface, so this can simply be done by 
checking for tsk->flags & PF_THREAD_BOUND.



sched: prevent bound kthreads from changing cpus_allowed

Kthreads that have called kthread_bind() are bound to specific cpus, so 
other tasks should not be able to change their cpus_allowed from under 
them.  Otherwise, it is possible to move kthreads, such as the migration 
or watchdog threads, so they are not allowed access to the cpu they work 
on.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/sched.h |    1 +
 kernel/cpuset.c       |    2 ++
 kernel/kthread.c      |    1 +
 kernel/sched.c        |    6 ++++++
 4 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1464,6 +1464,7 @@ static inline void put_task_struct(struct task_struct *t)
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_SPREAD_PAGE	0x01000000	/* Spread page cache over cpuset */
 #define ...
From: Paul Jackson
Date: Thursday, February 28, 2008 - 11:53 pm

I'm ok with this.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Ingo Molnar
Date: Thursday, February 28, 2008 - 3:46 am

yes. Ideally Peter's patchset should turn into something equivalent and 
i very much agree with Peter's arguments. There was never any design 
level problem with cpusets, and the parallel cpu_isolated_map approach 
was misdirected IMO.

There was indeed a problem with the _manageability_ of cpusets in 
certain (rather new) usecases like real-time or virtualization, and how 
they are connected to other system resources like IRQs and how easy it 
is to manage these resources. IRQs should probably be tied to specific 
cpusets and should migrate together with them, were the span of that 
cpuset be changed. (by default they'd be tied to the boot cpuset)

IMO Peter's patchset is a good first step in that it removes the 
cpu_isolated_map API hack, and i think we should try to go the whole way 
and just offer a /dev/cpuset/boot/ default set that can then be 
restricted to isolate the default workloads away from other CPUs.

( an initscripts approach, while i'm sure it works, would always be a
  bit fragile in that it requires precise knowledge about which task is
  what. I think we should make this a turn-key in-kernel solution that
  both the big-honking NUMA-box guys and the real-time guys would be
  happy with. )

	Ingo
--

From: Paul Jackson
Date: Thursday, February 28, 2008 - 10:47 am

Agreed.  In the early days of cpusets, (1) the less I put in the kernel,
the easier it was to get those first patches accepted, and (2) the more
there was that remained for me to code in user space, where my employer
could provide better, leading edge, products for a profit.

Overtime, these user space features that prove their worth, but that
were difficult to code in the most usable and robust manner in user
space, such as boot cpusets here, and, on another thread, cpuset
relative NUMA mempolicies, naturally migrate to the kernel, for wider
availability.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyansky
Date: Thursday, February 28, 2008 - 1:11 pm

I like the concept of the "boot" set. But we still need a separate "system"
flag. Users should not have to move cpu back into the "boot" set to allow for
kernel (irqs, etc) activity on it. And it's just more explicit and clear that way.

Max



--

From: Paul Jackson
Date: Thursday, February 28, 2008 - 1:13 pm

Good point -- a "boot" cpuset might be 4 CPUs out of 256 CPUs, just for running
the classic Unix load (daemons, init, login, ...).  But irq's might need to go
to most CPUs, except for some (dare I use the word) isolated CPUs.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyansky
Date: Thursday, February 28, 2008 - 1:26 pm

btw Can you send me those init scripts that you mentioned. ie Those that
create "boot" set from userspace.

Max
--

From: Paul Jackson
Date: Thursday, February 28, 2008 - 1:27 pm

They aren't open source ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyansky
Date: Thursday, February 28, 2008 - 1:45 pm

Oh, I see. It's probably patented too ;-)

Max
--

From: Max Krasnyansky
Date: Thursday, February 28, 2008 - 1:23 pm

I think that is a separate thing. Bootup default is one thing and being able
to explicitly allow/disallow kernel activity on a CPU(s) is another.

I think "boot" or "set0" makes perfect sense. In fact that was the first thing
I noticed when I started playing with it. ie Even if I just wanted to isolated
 one cpu I now need to create a cpuset for the other cpus and move all the
tasks there explicitly. It'd be very useful if it happens by default.

Max



--

From: Max Krasnyanskiy
Date: Thursday, February 28, 2008 - 10:48 am

Come on Ingo. You make it sounds like it's radically different solution.
At the end of the day we have a bitmap that represents which CPUs can be used 
for the kernel stuff. How is that different ?
I was saying all along that cpusets is a higher level API and was discussing 
Hmm, that was easy. Not a single ack. Even the core part is not complete yet. 
I pointed out several issues. Like the fact that it does not provide full 
isolation because it does not move timers, does not handle workqueues.
I did not even get a chance to test this stuff properly and see if it actually 
solves the usecase I was solving with my patches.
_Obviously_ we could not have taken my tested solution and evolved it in the 
direction people wanted to see it evolve, ie integration with the cpusets :(.

My main concern is that it introduces a whole new set of notifiers that 
perform similar functions to what CPU hotplut already does.

Max
--

From: Andrew Morton
Date: Friday, February 29, 2008 - 1:31 am

It of course lays waste to a series of cgroup patches from Paul Menage
which I already had queued.

So I shall drop git-sched again.

How often do I have to say this?  git-sched is not
git-everything-which-looks-shiny!  It is for the CPU scheduler.

If you had put this patchset into a private branch for private testing, or
even into a separate git-petes-stuff then I wouldn't have to collaterally
drop the entirety of git-sched because of this.

Sigh.

--

From: Andrew Morton
Date: Friday, February 29, 2008 - 1:36 am

And when I do this I get:

***************
*** 8125,8137 ****
  	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
  }
  
- static int cpu_rt_period_write_uint(struct cgroup *cgrp, struct cftype *cftype,
  		u64 rt_period_us)
  {
  	return sched_group_set_rt_period(cgroup_tg(cgrp), rt_period_us);
  }
  
- static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
  {
  	return sched_group_rt_period(cgroup_tg(cgrp));
  }
--- 8125,8137 ----
  	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
  }
  
+ static int cpu_rt_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
  		u64 rt_period_us)
  {
  	return sched_group_set_rt_period(cgroup_tg(cgrp), rt_period_us);
  }
  
+ static u64 cpu_rt_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
  {
  	return sched_group_rt_period(cgroup_tg(cgrp));
  }
***************
*** 8367,8374 ****
  	},
  	{
  		.name = "rt_period_us",
- 		.read_uint = cpu_rt_period_read_uint,
- 		.write_uint = cpu_rt_period_write_uint,
  	},
  #endif
  };
--- 8367,8374 ----
  	},
  	{
  		.name = "rt_period_us",
+ 		.read_u64 = cpu_rt_period_read_u64,
+ 		.write_u64 = cpu_rt_period_write_u64,
  	},
  #endif
  };

and if I then fix that up, and later restore git-sched, Paul's patch is now
broken.

Your trees continue to cause more trouble than anyone else's have ever
done, by a lot.

Let me try yesterday's git-sched.patch.
--

From: Ingo Molnar
Date: Friday, February 29, 2008 - 2:10 am

Andrew, please stop tracking sched-devel.git and track this tree 
instead:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

thanks,

	Ingo
--

From: Max Krasnyanskiy
Date: Friday, February 29, 2008 - 11:06 am

Another option is to have cpusets or cpu isolation tree that I've started 
already. I was saying from the very beginning cpu isolation stuff does not 
imho belong in the scheduler tree. Besides a tiny patch to the sched.c that 
adds/removes the bitmaps there are no scheduler changes needed for this 
specifically.

Peter, Ingo, if you guys are ok with this lets just have this stuff in 
cpuisol-2.6.git. I'm anyway rebasing it ontop of Peter's work. Of course we'll 
go through regular review and stuff and Andrew can track that tree separately.

Just a suggestion. I'm ok with submitting patches via sched-devel. Separate 
tree seems more appropriate though.

Max
--

From: Mark Hounschell
Date: Thursday, February 28, 2008 - 5:12 am

Is it now the intent, not only that I have to enable cpusets in the
kernel but I will also have to use them in userland to take advantage of
this.

And hot-plug too??

Can I predict that in the future that userland sched_setaffinity will be
taken away also and be forced to use cpusets?

And hot-plug too??

Mark
--

From: Max Krasnyansky
Date: Thursday, February 28, 2008 - 12:57 pm

Mark,
I bet you won't get any replies (besides mine). And yes this means that you
will have to enable cpusets if Peter's patches go in (looks like they will).
Hot-plug may not be needed unless I convince people to reuse the hot-plug
instead of introducing new notifiers.
I guess we can make some extensions to expose "system" bit just like I did
with "isolated" bit via sysfs. In which case cpusets may not be needed. We'll see.

Max


--

From: Peter Zijlstra
Date: Friday, February 29, 2008 - 11:55 am

Hi Paul,

How about something like this; along with the in-kernel version
of /cgroup/boot this could also provide the desired semantics.

Another benefit of this approach would be that it no longer requires
PF_THREAD_BIND, as we'd only stick unbounded kthreads into that cgroup.

(compile tested only)
---
Subject: cpuset: cpuset irq affinities

Allow for an association between cpusets and irqs. 

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/irq.h |    9 ++
 kernel/cpuset.c     |  160 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/irq/manage.c |   19 ++++++
 3 files changed, 188 insertions(+)

Index: linux-2.6-2/include/linux/irq.h
===================================================================
--- linux-2.6-2.orig/include/linux/irq.h
+++ linux-2.6-2/include/linux/irq.h
@@ -174,11 +174,20 @@ struct irq_desc {
 #ifdef CONFIG_PROC_FS
 	struct proc_dir_entry	*dir;
 #endif
+#ifdef CONFIG_CPUSETS
+	struct cpuset		*cs;
+#endif
 	const char		*name;
 } ____cacheline_internodealigned_in_smp;
 
 extern struct irq_desc irq_desc[NR_IRQS];
 
+struct irq_iterator {
+	int (*function)(struct irq_iterator *, int, struct irq_desc *);
+};
+
+extern int irq_iterator(struct irq_iterator *);
+
 /*
  * Migration helpers for obsolete names, they will go away:
  */
Index: linux-2.6-2/kernel/cpuset.c
===================================================================
--- linux-2.6-2.orig/kernel/cpuset.c
+++ linux-2.6-2/kernel/cpuset.c
@@ -50,6 +50,9 @@
 #include <linux/time.h>
 #include <linux/backing-dev.h>
 #include <linux/sort.h>
+#ifdef CONFIG_GENERIC_HARDIRQS
+#include <linux/irq.h>
+#endif
 
 #include <asm/uaccess.h>
 #include <asm/atomic.h>
@@ -732,6 +735,44 @@ void cpuset_change_cpumask(struct task_s
 	set_cpus_allowed(tsk, (cgroup_cs(scan->cg))->cpus_allowed);
 }
 
+#ifdef CONFIG_GENERIC_HARDIRQS
+struct cpuset_irq_cpumask {
+	struct irq_iterator v;
+	struct cpuset *cs;
+	cpumask_t mask;
+};
+
+static ...
From: Ingo Molnar
Date: Friday, February 29, 2008 - 12:02 pm

i like this approach - it makes irqs more resource-alike and attaches 
them to a specific resource control group.

So if /cgroup/boot is changed to have less CPUs then the "default" irqs 
move along with it.

but if an isolated RT domain has specific irqs attached to it (say the 
IRQ of some high-speed data capture device), then the irqs would move 
together with that domain.

irqs are no longer a bolted-upon concept, but more explicitly managed.

[ If you boot-test it and if Paul agrees with the general approach then
  i could even apply it to sched-devel.git ;-) ]

	Ingo
--

From: Max Krasnyanskiy
Date: Friday, February 29, 2008 - 1:52 pm

Believe it or not but I like it too :).
Now we're talking different approach compared to the cpu_isolated_map since 
with this patch cpu_system_map is no longer needed.
I've been playing with latest sched-devel tree and while I think we'll endup 
adding a lot more code, doing it with the cpuset is definitely more flexible.
This way we can provide more fine grain control of what part of the "system" 
services are allowed to run on a cpuset. Rather that "catch all" system flag.

Current sched-devel tree does not provide complete isolation at this point. 
There are still many things here and there that need to be added/fixed.
Having finer control here helps.

One concern I have is that this API conflicts with /proc/irq/X/smp_affinity.
ie Setting smp_affinity manually will override affinity set by the cpuset.
In other words I think
	int irq_set_affinity(unsigned int irq, cpumask_t cpumask)
now needs to make sure that cpumask does not have cpus that do not belong to 
the cpuset this irq belongs to. Just like sched_setaffinity() does for the tasks.

Max
--

From: Peter Zijlstra
Date: Friday, February 29, 2008 - 2:03 pm

The patch also needs to handle group destruction too; currently it
leaves cpuset pointers dangling. So it would either have to refuse
removing a group when there are still irqs associated, or move them to
the parent.

But yeah, this was just a quick hack to show the idea, glad you like it.
Will try to flesh it out a bit in the coming week.

--

From: Max Krasnyanskiy
Date: Friday, February 29, 2008 - 2:20 pm

Are you going to add code for "boot" cpuset ?
I wrote user-space code that is does that, but as I understand from previous 
discussions we want to create that in the kernel.

Max

--

From: Peter Zijlstra
Date: Monday, March 3, 2008 - 4:57 am

Yeah, I'll be trying to (lack of cgroup fu for the moment).

I think something like

 /cgroup
 /cgroup/system
 /cgroup/system/boot

 /cgroup/big_honking_app
 /cgroup/rt_domain

Where the system group includes all IRQs and all unbound kernel threads
(by default). The system/boot group will contain all of userspace.

Doing it in this way ought to allow for some weird setups. The system
group can overlap with anything that does need system services. The boot
group must be a subset thereof, and can be shrunk to a small part of the
machine.



--

From: Paul Jackson
Date: Monday, March 3, 2008 - 10:36 am

I suppose IRQs need to overlap like this, but cpusets often can't
overlap like this.

If a system has the cgroup hierarchy you draw:

  /cgroup
  /cgroup/system
  /cgroup/system/boot

  /cgroup/big_honking_app
  /cgroup/rt_domain

this must not force the cpuset hierarchy to be:

  /dev/cpuset
  /dev/cpuset/system
  /dev/cpuset/system/boot

  /dev/cpuset/big_honking_app
  /dev/cpuset/rt_domain

I guess this means IRQs cannot be added to the cpuset subsystem
of cgroups.  Rather they have to be added to some other cgroup
subsystem, perhaps a new one just for IRQs.

In perhaps the most common sort of cpuset hierarchy:

  /dev/cpuset
  /dev/cpuset/boot
  /dev/cpuset/batch_sched
  /dev/cpuset/big_honking_app
  /dev/cpuset/rt_domain

none of boot or its siblings overlap.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Peter Zijlstra
Date: Monday, March 3, 2008 - 10:57 am

Due to CS_CPU_EXCLUSIVE usage?


The trouble is, cgroups are primarily about tasks, whereas IRQs are not.
So we would create a cgroup that does not manage tasks, but rather
associates irqs with sets of cpus - which are not cpusets.


But as long as nobody does CS_CPU_EXCLUSIVE they may overlap, right?

--

From: Paul Jackson
Date: Monday, March 3, 2008 - 11:10 am

> But as long as nobody does CS_CPU_EXCLUSIVE they may overlap, right?

It's a bit stronger than that:

 1) They need non-overlapping cpusets at this level to control
    the sched_domain setup, if they want to avoid load balancing
    across almost all CPUs in the system.  Depending on the kernel
    version, sched_domain partitioning is controlled either by the
    cpuset flag cpu_exclusive, or the cpuset flag sched_load_balance.

 2) They need non-overlapping cpusets at this level to control
    memory placement of some kernel allocations, which are allowed
    outside the current tasks cpuset, to be confined by the nearest
    ancestor cpuset marked 'mem_exclusive'

 3) Some sysadmin tools are likely coded to expect a /dev/cpuset/boot
    cpuset, not a /dev/cpuset/system/boot cpuset, as that has been
    customary for a long time.

(1) and (2) would break the major batch schedulers.  They typically
mark their top cpuset, /dev/cpuset/pbs or /dev/cpuset/lfs or whatever
batch scheduler it is, as cpu_exclusive and mem_exclusive, by way of
expressing their intention to pretty much own those CPUs and memory
nodes.  If we fired them up on a system where that wasn't allowed due
to overlap with /dev/cpuset/system, they'd croak.  Such changes as that
are costly and unappreciated.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Peter Zijlstra
Date: Monday, March 3, 2008 - 11:18 am

OK, understood, I'll try and come up with yet another scheme :-)

--

From: Paul Jackson
Date: Tuesday, March 4, 2008 - 12:35 am

Would your per-cpuset 'irqs' file work if, unlike pids in the 'tasks' file,
we allowed the same irq to be listed in multiple 'irqs' files?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Peter Zijlstra
Date: Tuesday, March 4, 2008 - 4:06 am

I did think of that, but that seems rather awkward. For one, how would
you remove an irq from a cpuset?

Secondly, the beauty of the current solution is that we use
irq_desc->cs->cpus_allowed, if it were in multiple sets, we'd have to
iterate a list, and cpus_or() the bunch.



--

From: Max Krasnyanskiy
Date: Tuesday, March 4, 2008 - 12:52 pm

Yeah, that would definitely be awkward.

Max
--

From: Paul Jackson
Date: Tuesday, March 4, 2008 - 6:11 pm

Yeah - agreed - awkward.

Forget that idea (allowing the same irq in multiple 'irqs' files.)

It seems to me that we get into trouble trying to cram that 'system'
cpuset into the cpuset hierarchy, where that system cpuset is there to
hold a list of irqs, but is only partially a good fit for the existing
cpuset hierarchy.

Could this irq configuration be partly a system-wide configuration
decision (which irqs are 'system' irqs), and partly a per-cpuset
decision -- which cpusets (such as a real-time one) want to disable
the usual system irqs that everyone else gets.

The cpuset portion of this should take only a single per-cpuset Boolean
flag -- which if set True (1), asks the system to "please leave my CPUs
off the list of CPUs receiving the usual system irqs."

Then the list of "usual system irqs" would be established in some /proc
or /sys configuration.  Such irqs would be able to go to any CPUs
except those CPUs which found themselves in a cpuset with the above
per-cpuset Boolean flag set True (1).

How does all this interact with /proc/irq/N/smp_affinity?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Peter Zijlstra
Date: Wednesday, March 5, 2008 - 1:37 am

How about we make this in-kernel boot set, that by default contains all
IRQs, all unbounded kthreads and all of user-space.

To be compatible with your existing clients you only need to move all
the IRQs to the root domain.

(Upgrading a kernel would require distributing some new userspace
anyway, right? - and we could offer a .config option to disable the boot
set for those who do upgrade kernels without upgrading user-space).

Then, once you want to make use of the new features, you have to update
your batch scheduler to only make use of load_balance and not
cpus_exclusive (as they're only interested in sched_domains, right?)

So if you want to do IRQ isolation and batch scheduling on the same
machine (as is not possible now) you need to update userspace as said
before; so that it allows for the overlapping cpuset.

For example, on a 32 cpu machine:

/cgroup/boot 0-1 (kthreads - initial userspace)
/cgroup/irqs 0-27 (most irqs)
/cgroup/batch_A 2-5
/cgroup/batch_B 6-13
/cgroup/another_big_app 14-27
/cgroup/RT-domain 28-31 (my special irq)

So by providing a .config option for strict backward compatibility, a
simple way for runtime compatibility (moving all IRQs to the root) which
should be easy to do if the kernel upgrade is accompanied by a (limited)
user-space upgrade.

And once all the features need to be used together (something that is
now not possible - so new usage) then the code that relies on
cpus_exclusive to create sched_domains needs to be changed to use
load_balance instead.


Much the same way the cpuset cpus_allowed interacts with a task's
cpus_allowed. That is, cs->cpus_allowed is a mask on top of the provided
affinity.

If for some reason the cs->cpus_allowed changes in such a way that the
user-specified mask becomes empty (irq->cpus_allowed & cs->cpus_allowed
== 0), then print a message and set it to the full mask
(irq->cpus_allowed = cs->cpus_allowed).

If for some reason the cs->cpus_allowed changes in such a way that the
mask is ...
From: Ingo Molnar
Date: Wednesday, March 5, 2008 - 1:50 am

/me likes

this looks like the most straightforward and most manageable approach 
proposed so far - i always thought that cpusets should boot up with some 
meaningful default set that people could play with. This would really 
push cpusets into mainstream use i believe.

Any patch i could try out, to see how well this works in practice? ;-)

	Ingo
--

From: Paul Jackson
Date: Wednesday, March 5, 2008 - 5:35 am

They do ... you just have to mount cpusets to see it ;).

If your kernel is configured with CONFIG_CPUSETS, then
there is one 'top' cpuset containing all tasks, all mems,
and all cpus, setup during the kernel init boot code,
unconditionally.

Anyone who does:
	mkdir -p /dev/cpuset
	mount -t cpuset cpuset /dev/cpuset
can see it and play with it.

... perhaps I misunderstood your comment?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Ingo Molnar
Date: Wednesday, March 5, 2008 - 5:43 am

the root cpuset is special as it is the root of the tree. You cannot 
shrink it in practice (in a meaningful way) because then you cannot have 
children outside of its scope.

	Ingo
--

From: Paul Jackson
Date: Wednesday, March 5, 2008 - 10:44 am

Yes ... true ... so?  I guess whatever you meant by "meaningful"
doesn't include the root cpuset.  Oh well <grin>.

To be a tad more serious, it wasn't (still isn't) clear to me
what you did mean here.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyansky
Date: Wednesday, March 5, 2008 - 12:17 pm

I assumed that this exactly what we meant by the boot set all along :).

One thing I wanted to clarify that by all IRQs we literally mean all of them
even those that do not have handlers yet. /proc/irqN/smp_affinity for example
 is not available for the irqs are not active. In other words an IRQ must not
all of the sudden show up in the root set when something does request_irq() on
it.

I think Paul missed my earlier reply where I pointed out that original patch
conflicted with /proc/irq/N/smp_affinity API. The solution is for
irq_set_affinity() to enforce cpus_allowed just like sched_setaffinity does
for the tasks.

Max
--

From: Paul Jackson
Date: Thursday, March 6, 2008 - 6:47 am

If I understood your proposal, the /cgroup/irq cpuset is rather an
odd cpuset -- it seems to be just used to place the 'system' irqs on
the specified CPUs.  This is not the usual use of cpusets, which are
to associate a set of tasks with some resources.  In your example,
the 'tasks' file in /cgroup/irqs is probably empty, right?

And then you have to argue that certain incompatibilities this

No ... we normally do our best not to force apps to change source
code, or where possible not even force them to recompile, to run on

That sort of alternative is useless when dealing with the major
distros.  They choose one config for half or all of their entire
market (perhaps distinguishing personal PC from server, but no more).

Ok - something like that could be done, if we had to, just as
a cpusets 'mems' masks mbind and set_mempolicy nodemasks, and
a cpusets 'cpus' masks sched_setaffinity cpumasks.

Could be ... I suppose ... if we had to ... however ...


I suspect that the reason you had the odd /cgroup/irqs cpuset is that
it wasn't clear what to do with the irqs of cpusets that had both:
 1) overlapping cpus, and
 2) specified different lists of irqs.

In my view, you are trying to:
 A] force the <<irq, cpu>, Boolean> mapping into linear lists, and
 B] then trying to force those linear lists into the cpuset hierarchy.

With each capability that we've added to cpusets, beginning with CPUs
and Memory Nodes themselves, and more recently with sched_domains,
I've strived to keep cpusets fully hierarchical and nestable, supporting
arbitrary combinations of overlapping CPUs and Memory Nodes.

What we have here, in this cpuset-irq proposal, I claim, is another
example of trying to put in a tree what is essentially flat.

No ... that last paragraph is wrong.  It's worse than that.

The mapping of <cpu, irq> pairs to Boolean ("can we direct this irq
to this CPU - yes or no?") is not flat; it's a two-dimensional matrix
of Boolean.

So you're [A] squinting out the left eye, ...
From: Peter Zijlstra
Date: Thursday, March 6, 2008 - 8:21 am

Likely; if for instance you'd want some unbound kernel threads to join
in that overlapping set, then perhaps that name would be badly chosen.

Although I'm not sure which unbound kernel threads would benefit from

Perhaps we're talking about something else here; how bad would it be to
require:

for irq in `cat /cgroup/boot/irqs` ; do echo $irq > /cgroup/irqs; done

be added to rc.local or fully replace your home-brew boot cpuset script?
Its basically an update for that script, giving the exact same semantics

Sure, but your application vendors will need to re-certify their
applications to run on new distros, sometimes even re-compile because
ABI changes and the like. Certainly providing a new script in the new

Assign a map of cpus where irqs will default into, and a way to

So, yes, cgroups are perhaps awkward because they group tasks whereas
the current problem is grouping IRQs.

Because we're mapping them onto CPUs, cpusets came to mind.

The thing we 'need', is to provide named groups of irqs and for each
such a group specify a cpu mask that is appropriate.

Grouping them makes sense in that we want to make a functional division.
Some IRQs serve a system as a whole, others serve a subset. Typical
subsets could be a RT process space bounded to a cpu/mem domain.

Other usable subsets could be limiting the IRQs of node local network
and IO cards to the cpu/mem domain that runs the application that uses
them.

So we group irqs like:

  system_on_nodes_1_2_and_3 (default)
  big_io_app_on_nodes_2_and_3
  rt_app_on_node_4

Where, again, you see a strong similarity to the cpu/mem divisions
already made by cpusets.

I understand its somewhat at odds with the hierarchical nature of the
<task, cpu/mem> mapping you currently have. But its not too far away
either.

Do you see what we want to do, and why our - perhaps misguided - choice
for cpusets? Creating a whole new cgroup controller is also weird
because we don't deal in tasks.

[ Aside from grouping existing ...
From: Paul Jackson
Date: Thursday, March 6, 2008 - 8:40 pm

Helpful reply, Peter.  Thanks.



Essentially, cpusets are like cgroups in this regard.  They group
tasks.  They just happen to be grouping tasks to associate them
with sets of CPUs (and Memory Nodes), which seems relevant somehow
to the present need, to group irqs to associate them with sets of

I haven't gotten my head around what such a script would do yet,
but you are correct in suspecting that we could add a script like
this easily enough in future releases, if that was useful.

I can change init scripts, for each kernel version, much easier than I
can ask the big batch scheduler providers to change their application


Can you spell out how or why /proc/irq/N/smp_affinity doesn't
provide what you need here?

My guess is that it's fairly obvious why /proc/irq/N/smp_affinity is
not well suited for this.  It requires poking lots of settings, one
at a time, which is cumbersome and racey from user space, difficult
to keep in sync with any other changes in placement of RT or other
jobs, and it requires root permissions, without any finer granularity


Cool -- I'm glad now I asked (rather impatiently) what we needed.

That's a helpful reply.

Could we:
 1) name some sets of IRQs
 2) for each cpuset, specify which named IRQ set applied to it
 3) prioritize these sets of IRQs (linear order), so that
    for any given CPU, if it were in multiple cpusets
    specifying conflicting IRQ sets, we could select the
    IRQ set to apply to that CPU.

Given the reliance in (2) of cpusets on these IRQ set names, this
still needs to be part of cpusets.

But rather than (ab)use cpusets to directly accomplish (1), how
about adding some files to the root cpuset to define IRQ sets,
with names such as (for example):

	irqs.0.system
	irqs.1.big_io_apps
	irqs.2.rt

That is, more generally, add one or more "irqs.N.name" files to
the top cpuset, where N is a distinct natural number and "name"
a user space specified name (except that perhaps the first one,
the 'irqs.0.system', with ...
From: Paul Jackson
Date: Thursday, March 6, 2008 - 11:39 pm

I guess this will require adding a line:
	.create = cgroup_create
line to the:
	static struct file_operations cgroup_file_operations = {
initialization in kernel/cgroup.c, and a cgroup_create() routine
in kernel/cgroup.c, that calls an optional per-cgroup-subsystem
create routine, that, in the case of cpusets, is willing to create
one of these irqs.N.name files, in the top cpuset, if all looks
right.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Menage
Date: Friday, March 7, 2008 - 1:47 am

An alternative would be to just have some kind of "irqsets" file in
the top-level cpuset directory and let the user write irq group
definitions into that file.

Paul
--

From: Paul Jackson
Date: Friday, March 7, 2008 - 7:57 am

Yes, exactly, that's the alternative.

I tried that, in my mind, and got stuck on the complicated syntax that
would have been needed to represent an arbitrary length array of named
lists of irqs with each list having a priority attribute.

So I went with the separate "irqs.N.name" files, (ab)using the file
system directory apparatus to handle the "arbitrary length array of
named" entities aspect, burying the priority attribute (N) in the name,
and leaving each individual irqs.N.name file only needing to hold a
single vector of irq numbers.

A security conscious sysadmin can even assign different permissions to
the different irq lists with this.  And updates of one irq list don't
endanger the other irq lists, thanks to the innate and elaborate
capabilities in the kernel vfs code to handle concurrent updates to
separate files correctly and reliably.

This reminds me of the difference between the Windows Registry (one big
awful file) and the Unix /etc directory (2745 files, on the system
nearest to my shell prompt.)  Well, in this irq case, the different is not
-that- dramatic, but it is a small echo of such.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Menage
Date: Monday, March 3, 2008 - 11:41 am

[Empty message]
From: Paul Jackson
Date: Monday, March 3, 2008 - 11:52 am

Could you motivate this suggestion -- who needs it, or why
would is it needed?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Menage
Date: Monday, March 3, 2008 - 10:26 pm

My impression was that Peter wanted to be able to control the
assignments of CPUs to IRQs in a way that could result in overlapping.
One of the arguments that you posted against his proposal was that
this would break due to the memory overlap requirements of
mem_exclusive cpusets. So this appeared to be a case where the fact
that memory and cpu masks are combined in the same cgroups subsystem
is a drawback. (But maybe I'm misunderstanding the discussion).

I'm sure if cpusets were being developed today on top of cgroups,
rather than being its inspiration, there would be no good reason to
have the memory mask assignment and the cpu mask assignment be part of
the same subsystem - they're only together now because there was no
general grouping mechanism in the kernel when cpusets was written.

Paul
--

From: Paul Jackson
Date: Monday, March 3, 2008 - 11:15 pm

There were three concerns I had with his proposal -- (1) it conflicted
with memory placement, (2) it conflicted with cpu placement (sched
domain definition) and (3) it broke the conventional cpuset hierarchy
configuration on which users have come to depend.

Separating cpus and memory as distinct cgroup subsystems wouldn't help
much; there's still (2) and (3).  And in the legacy /dev/cpuset
interface, cpus and memory remain together, whether or not cgroups

Perhaps ... though they work together rather well in practice, perhaps
because CPUs and memory banks usually are physically associated;
selecting which CPUs to use and which memory banks to use really is not
an independent choice, on most hardware.

Now however the shoe is on the other foot.  Is there a good reason to
separate them ... an actual would-be user, or actual problems inflicted
on current users due to the lack of this split?

If there's just a rare occassion such an independently split CPU and
Memory hiearchy, one can still use the current combined cpuset
implementation, just with a less natural cpuset hierarchy (more leaf
node cpusets, representing the cross product of interesting CPU subsets
and interesting memory node subsets.)  If some specified users start
doing this routinely, then that gets sufficiently cumbersome that it's
worth trying to remedy.

As usual, I seem to be counseling resisting adding code, complexity and
incompatible change, without actual need, beyond just increased
conceptual sophistication.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Menage
Date: Monday, March 3, 2008 - 11:21 pm

Except that this isn't currently possible if you're also trying to do
memory hardwalling on those cpusets, since then sibling cpusets can't
share memory nodes.

Having said that, this bit of the problem can be fixed without
splitting cpus/mems, by my other earlier proposal of adding a separate
"mem_hardwall" flag that can enable the hardwall behaviour without the
exclusive behaviour. (i.e. hardwall behaviour occurs if mem_exclusive
|| mem_hardwall)

Paul
--

From: Paul Jackson
Date: Monday, March 3, 2008 - 11:26 pm

Yes, there would be interactions, on both the CPU and Memory side with
some second order mechanisms (hardwalls and sched domains.)

Still, I ask, where's the beef -- the real users?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Menage
Date: Monday, March 3, 2008 - 11:34 pm

I'm one such user who's been forced to add the mem_hardwall flag to
get around the fact that exclusive and hardwall are controlled by the
same flag. I keep meaning to send it in as a patch but haven't yet got
round to it.

Also, if you're using fake numa for memory isolation (which we're
experimenting with) then the correlation between cpu placement and
memory placement is much much weaker, or non-existent.

Paul
--

From: Paul Jackson
Date: Monday, March 3, 2008 - 11:51 pm

I made essentially the same mistake twice in the evolution of cpusets:
 1) overloading the cpu_exclusive flag to define sched domains, and
 2) overloading the mem_exclusive flag to define memory hardwalls.

I eventually reversed (1), with a deliberately incompatible change
(and you know how I resist those ;), creating a new 'sched_load_balance'
flag that controls the sched_domain partitioning, and removing any
affect that the cpu_exclusive flag has on this.

Perhaps the unfortunate interaction of mem_exclusive and hardwall is
destined to go the same path.  Thought the audience that is currently
using mem_exclusive for the purpose of hardwall enforcement of kernel
allocations might be broader than the specialized real-time audience
that was using cpu_exclusive for dynamic sched domain isolation, and so
we might not choose to just break compatibility in one shot, but rather
phase in your new flag, before, perhaps, in a later release, phasing
out the old hardwall overloading of the mem_exclusive flag.

(My primeval mistake was including the cpu_exclusive and mem_exclusive
flags in the original cpuset design; those two flags have given me

That might be a good answer to my asking where the beef was.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Jackson
Date: Friday, February 29, 2008 - 1:55 pm

Like Ingo, I like the approach.

But I am concerned it won't work, as stated.

Unfortunately, my blithering ignorance of how one might want to
distribute irq's across a system is making it difficult for me
to say for sure if this works or not.

The thing about /dev/cpuset that I am afraid will get in the way
with this use of cpusets to place irqs is that we can really only
have a single purpose hierarchy below /dev/cpuset.

For example, lets say we have:

    /dev/cpuset
        boot
	big_special_app
	a_few_isolated_rt_nodes
	batchscheduler
            batch job 1
	    batch job 2
	    ...

I guess, with your "cpuset: cpuset irq affinities" patch, we'd start
off with /dev/cpuset/irqs listing the irqs available, and we could
reasonably decide to move any or all irqs to /dev/cpuset/boot/irqs,
by writing the numbers of those irqs to that file, one irq number
per write(2) system call (as is the cpuset convention.)

Do these irqs have any special hardware affinity?  Or are they
just consumers of CPU cycles that can be jammed onto whatever CPU(s)
we're willing to let be interrupted?

If for reason of desired hardware affinity, or perhaps for some other
reason that I'm not aware of, we wanted to have the combined CPUs in
both the 'boot' and 'big_special_app' handle some irq, then we'd be
screwed.  We can't easily define, using the cpuset interface and its
conventions, a distinct cpuset overlapping boot and big_special_app,
to hold that irq.  Any such combining cpuset would have to be the
common parent of both the combined cpusets, an annoying intrusion on
the expected hierarchy.

If the actual set of CPUs we wanted to handle a particular irq wasn't
even the union of any pre-existing set of cpusets, then we'd be even
more screwed, unable even to force the issue by imposing additional
intermediate combined cpusets to meet the need.

If there is any potential for this to be a problem, then we should
examine the possibility of making irqs their own cgroup, rather ...
From: Peter Zijlstra
Date: Friday, February 29, 2008 - 2:14 pm

I might just be new-fangled, but I have a /cgroup mount.


Depends a bit, the genirq layer seems to allow for irqs that can't be
freely placed. But most of them can be given a free mask - /me looks @

I see the issue. We don't support mv on cgroups, right? To easily create

Hmm, but that would then be another controller based on cpus. Might be a

I'm not sure I know what you're asking. IRQ are hardware notifiers and
do all kinds of things depending on the hardware. Network cards
typically use them to notify the CPU of incoming packets. Video cards

I'm fine with whatever, I saw a ',' in the bitmap stuff, not really sure
how that ended up being a ' ' in the patch I send out... :-)

--

From: Ingo Molnar
Date: Friday, February 29, 2008 - 2:29 pm

yes - and when they cannot be arbitrarily migrated we just dont move 
them (but still keep them attached to that cpuset). The affinity calls 
will just fail in that case. Might want to emit a kernel warning but 
that's all. (if then it's a hardware constraint)

	Ingo
--

From: Ingo Molnar
Date: Friday, February 29, 2008 - 2:32 pm

irq affinity masks can basically be thought of as: "these are the CPUs 
where external hardware events will trigger certain kernel functions and 
cause overhead on those CPUs". An IRQ can have followup effects: softirq 
execution, workqueue execution, etc.

so managing the IRQ masks is very meaningful and just as meaningful as 
managing the affinity masks of tasks. You can think of "IRQ# 123" as 
"special kernel task # 123".

	Ingo
--

From: Max Krasnyanskiy
Date: Friday, February 29, 2008 - 2:42 pm

We should just check the return value from irq_set_affinity(). If it fails we 
I guess there maybe some fancy HW topologies that may be a problem but for 
most cases we should be ok.
Simple cases like unmovable IRQs are easy to handle (ie set_affinity() fails 
Yeah, I'd prefer it to be along with cpusets. As I mentioned will need similar 
mechanisms for other things besides irqs for complete isolation. Creating a 
separate group for each sounds like an overkill.

Max


--

From: Paul Jackson
Date: Friday, February 29, 2008 - 3:00 pm

One can combine cgroups into a single hierarchy.

If we had irqs and these similar mechanisms you have in mind each in
their separate cgroup subsystem, and if you normally wanted to deal
with them all in a single hierarchy, then you can mount those cgroup
subsystems together.

I've manage to avoid learning how to use cgroups, so you might want
to ask Paul Menage (just added to the cc list) how to do that, if the
Documentation/cgroups.txt isn't sufficient.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Jackson
Date: Friday, February 29, 2008 - 2:53 pm

So some of these irqs might have very particular node affinity then,
right?  If my thingamajig board is attached to node 72, then I might
want its interrupts going to a CPU on node 72, right?

In which case, putting that irq in my boot cpuset that only has nodes
0-3 would be harmful to the performance of my thingamajig board, right?

I'm suspecting here that you don't want a small 'boot' cpuset
(usually a small cpuset running legacy Unix stuff) holding the irqs,
but rather you want a big 'system' cpuset, which has -all-but- a few
nodes dedicated to hard real time or other isolated (there's that
word again) purposes.

That way, most irqs can go to most CPUs, depending on their specific
needs.

Unfortunately, I don't think the cpuset hierarchy and conventions admit
of both a big 'system' cpuset (all but a few isolated nodes) and a small

The only mv supported is simple rename, preserving parentage.

And if one could and did a tree reshaping mv near the top of the

All cgroups mount beneath /cgroup.  For backwards compatibility,
one can also "mount -t cpuset cpuset /dev/cpuset", and just get
the cpuset interface, with a couple of legacy hooks to make it behave

Yes - that's another commonly supported form.  If that's a better
presentation, then you'd probably want to rework your code, to take
in and display the entire vector of irq numbers in one line, using
a comma-separated list of irqs and ranges of irqs.

See further bitmap_scnprintf(), bitmap_parse_user(),
bitmap_scnlistprintf() and bitmap_parselist(), in bitmap.c.

Given that you don't have an pre-existing bitmap of irqs (that I know
of) and that you might have a distinct error code for each irq that you
try to attach to a different cpuset, I'm guessing you want to stick
with the single irq per write on input, single irq per line on output,
paradigm, similar to what the 'tasks' file uses for task pids.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
    ...
From: Christoph Hellwig
Date: Saturday, March 1, 2008 - 10:18 pm

linux/irq.h must not be included in generic code, it's actually more
and asm-generic/hw_irq.h.  Please restructure the code so that the
cpuset code calls into an arch interface which will then be implemented
by arch code (which in most cases will be genirq, the other can be left
stubbed out for now)

--

Previous thread: [RFC/PATCH 3/4] genirq: system set irq affinities by Peter Zijlstra on Wednesday, February 27, 2008 - 3:21 pm. (3 messages)

Next thread: [RFC/PATCH 4/4] kthread: system set kthread affinities by Peter Zijlstra on Wednesday, February 27, 2008 - 3:21 pm. (1 message)