My vision on the direction we should take wrt cpu isolation. Next on the list would be figuring out a nice solution to the workqueue flush issue. --
General impressions: - "cpu_system_map" is %100 identical to the "~cpu_isolated_map" as in my patches. It's updated from different place but functionally wise it's very much the same. I guess you did not like the 'isolated' name ;-). As I mentioned before I'm not hung up on the name so it's cool :). - Updating cpu_system_map from cpusets There are a couple of things that I do not like about this approach: 1. We lost the ability to isolate CPUs at boot. Which means slower boot times for me (ie before I can start my apps I need to create cpuset, etc). Not a big deal, I can live with it. 2. We now need another notification mechanism to propagate the updates to the cpu_system_map. That by itself is not a big deal. The big deal is that now we need to basically audit the kernel and make sure that everything affected must have proper notifier and react to the mask changes. For example your current patch does not move the timers and I do not think it makes sense to go and add notifier for the timers. I think the better approach is to use CPU hotplug here. ie Enforce the rule that cpu_system_map is updated only when CPU is off-line. By bringing CPU down first we get a lot of features for free. All the kernel threads, timers, softirqs, HW irqs, workqueues, etc are properly terminated/moved/canceled/etc. This gives us a very clean state when we bring the CPU back online with "system" bit cleared (or "isolated" bit set like in my patches). I do not see a good reason for reimplementing that functionality via system_map notifiers. Do not forget the "stop machine", or more specifically module loading/unloading. Max --
Ah, but you miss that cpu_system_map doesn't do the one thing the cpu_isolated_map ever did, prevent sched_domains from forming on those cpus. I'm sure those few lines in rc.local won't grow your boot time by a significant amount of time. That said, we should look into a replacement for the boot time parameter I'm not convinced cpu hotplug notifiers are the right thing here. Sure we could easily iterate the old and new system map and call the matching cpu hotplug notifiers, but they seem overly complex to me. The audit would be a good idea anyway. If we do indeed end up with a 1:1 No, the full stop machine thing, there are more interesting users than module loading. But I'm not too interested in solving this particular problem atm, I have too much on my plate as it is. --
Did you see my reply on how to support "RT balancer" with cpu_isolated_map ?
Anyway, my point was that you could've just told me something like:
"lets rename cpu_isolated_map and invert it"
That's is it. The gist of the idea is exactly the same. There is a bitmap that
I was not talking about calling notifiers directly. We can literally bring the
I was not suggesting that it's up to you to solve it. You made it sounds like
your patches provide complete solution, just need to fix the workqueues. I was
merely pointing out that there is also stop machine that needs fixing.
btw You did not comment about the fact that your patch does not move timers.
I was trying it out last night. It's definitely not ready yet.
Max
--
nice work Peter, i find this "system sets" extension to cpusets a much more elegant (and much more future-proof) solution than the proposed spreadout of the limited hack of isolcpus/cpu_isolated_map. It concentrates us on a single API and on a single mechanism to handle isolation matters. (be that for clustering/supercomputing or real-time purposes) Thanks for insisting on using cpusets for this! i've queued up your patches in sched-devel.git, and lets make sure this has no side-effects on existing functionality. (it shouldnt) Ingo --
Before this patchset gets too far, I'd like to decide on whether to
adapt my suggestion to call that per-cpuset flag 'cpus_system' (or
anything else with 'cpu' in it, perhaps 'system_cpus' would be more
idiomatic), rather than the tad too generic 'system'.
People doing 'ls /dev/cpuset' should be able to half-way guess
what things do, just from their name.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
yeah. In fact i'm not at all sure this is really a "system" thing - it's more of a "bootup" default. once the system has booted up and the user is in a position to create cpusets, i believe the distinction and assymetry between any bootup cpuset and the other cpusets should vanish. The "bootup" cpuset is just a convenience container to handle everything that the box booted up with, and then we can shrink it (without having to enumerate every PID and every irq and other resource explicitly) to make place for other cpusets. maybe it's even more idomatic to call it "set0" and just create a /dev/cpuset/set0/ directory for it and making it an explicit cpuset - instead of the hardcoded /dev/cpusets/system thing? Do you have any oh, certainly. This is at the earliest v2.6.26 material - but now it at least looks clean conceptually, fits more nicely into cpusets instead of being a bolted-on thing. Ingo --
I'm not quite sure of what you're thinking here; rather I'm just
bouncing off the sound of your words.
But your words sound alot like what we at SGI call a 'boot' cpuset.
Our big honkin NUMA customers, who are managing most of the system
either for a few dedicated, very-important jobs, and/or under a
batch scheduler, need to leave a few nodes to run the classic Unix
load such as init, cron, assorted daemons and the admins login shell.
So we provide them some init script mechanisms that make it easy to
set this up, which includes moving every task (not many at the low
numbered init script time this runs) that isn't pinned (doesn't have
a restricted Cpus_allowed) into the boot cpuset, conventionally
named /dev/cpuset/boot.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Should the kernel refuse to move some threads, such as the migration or watchdog kthreads, out of the root cpuset where the mems can be adjusted to disallow access to the cpu to which they are bound? This is a quick way to cause a crash or soft lockup. David --
Something like this? --- include/linux/sched.h | 1 + kernel/cpuset.c | 5 ++++- kernel/kthread.c | 1 + kernel/sched.c | 6 ++++++ 4 files changed, 12 insertions(+), 1 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1464,6 +1464,7 @@ static inline void put_task_struct(struct task_struct *t) #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */ #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */ +#define PF_CPU_BOUND 0x04000000 /* Kthread bound to specific cpu */ #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */ diff --git a/kernel/cpuset.c b/kernel/cpuset.c --- a/kernel/cpuset.c +++ b/kernel/cpuset.c @@ -1175,11 +1175,14 @@ static void cpuset_attach(struct cgroup_subsys *ss, struct mm_struct *mm; struct cpuset *cs = cgroup_cs(cont); struct cpuset *oldcs = cgroup_cs(oldcont); + int ret; mutex_lock(&callback_mutex); guarantee_online_cpus(cs, &cpus); - set_cpus_allowed(tsk, cpus); + ret = set_cpus_allowed(tsk, cpus); mutex_unlock(&callback_mutex); + if (ret < 0) + return; from = oldcs->mems_allowed; to = cs->mems_allowed; diff --git a/kernel/kthread.c b/kernel/kthread.c --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -180,6 +180,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu) wait_task_inactive(k); set_task_cpu(k, cpu); k->cpus_allowed = cpumask_of_cpu(cpu); + k->flags |= PF_CPU_BOUND; } EXPORT_SYMBOL(kthread_bind); diff --git a/kernel/sched.c b/kernel/sched.c --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5345,6 +5345,12 @@ int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask) goto out; } + if ...
Indeed, there is a hole in my cpus_match_system() logic in that when the system set is reduced to a single cpu, the tasks bound to that cpu also match. I had wanted to avoid adding PF_ flags (as I remember we're running short on them), but I think you're right. --
I don't have strong opinions either way on this patch; it adds an error
check that makes sense. I haven't seen much problem not having this check,
nor do I know of any code that depends on doing what this check prohibits.
Except for three details:
1) + if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
+ !cpus_equal(p->cpus_allowed, new_mask))) {
+ ret = -EINVAL;
The check for equal cpus allowed seems too strong. Shouldn't you be
checking that all of task p's cpus_allowed would still be allowed in
the new mask:
+ if (unlikely((p->flags & PF_CPU_BOUND) && p != current &&
+ !cpus_subset(p->cpus_allowed, new_mask))) {
+ ret = -EINVAL;
2) Doesn't this leave out a check for the flip side -- shrinking
the cpus allowed by a cpuset so that it no longer contains those
required by any PF_CPU_BOUND tasks in that cpuset? I'm not sure
if this second check is a good idea or not.
3) Could we call this PF_CPU_PINNED instead? I tend to use "cpu
bound" to refer to tasks that consume alot of CPU cycles (which
these pinned tasks rarely do), and "pinned" to refer to what is
done to confine a task to a particular subset of all possible CPUs.
It looks to me like some code in kernel/sched.c already uses the
word pinned in this same way, so PF_CPU_PINNED would be more
consistent terminology.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
How about moving watchdog/0 to a cpuset with exclusive access to only cpu That's a convenient way for a kthread to temporarily expand its set of cpus_allowed and then never be able to remove the added cpus again. Do That's why the check in set_cpus_allowed() is cpus_equal(p->cpus_allowed, newmask) since it prevents PF_CPU_BOUND tasks from being moved out of the root PF_CPU_BOUND follows the nomenclature of kthread_bind() really well, but it could probably be confused with a processor-bound task. So perhaps PF_BOUND_CPU is even better? David --
Good question. Actually, we -normally- have pinned tasks in the top cpuset,
where the top cpuset allows all CPUs, but the pinned task has a cpus_allowed
(in its task struct) of just one or a few CPUs (for node pinning.)
So ... could we allow moving pinned threads to any cpuset that allowed
the CPUs to which it was pinned (my cpus_subset() test, above), but
-not- change the pinned tasks cpus_allowed in its task struct, keeping
I don't think that the cpus_equal() check prevents that (shrinking a
pinned tasks cpuset out from under it.) Try the following on a freshly
booted system with your proposed patch:
mkdir /dev/cpuset
mount -t cpuset cpuset /dev/cpuset
cd /dev/cpuset
mkdir a
cp ???s a
< tasks sed -un -e p -e 10q > a/tasks
I'll wager you just moved a few pinned tasks into cpuset 'a'. This
would be allowed, as 'a' has the same cpus as the top cpuset. But then
one could shrink a (if it had more than 1 CPU in the first place), leaving
some pinned tasks in a cpuset they weren't allowed to run in, essentially
Good point - "BOUND" as the past tense of "BIND". How about
PF_THREAD_BIND ;)?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Same situation occurs for a task bound to a cpuset that issues a sched_setaffinity() call to restrict its cpumask further. There's nothing I'd hesitate to do that unless you can guarantee that restricting kthreads mems_allowed via the cpuset interface won't cause any problems either. Is there a benefit to reducing the size of a kthread's mems_allowed that doesn't have an adverse effect on the kernel? What You can move them, but you cannot reduce the size of the thread's cpus_allowed since it was bound to a specific cpu via kthread_bind(). The change to set_cpus_allowed() simply prevents this from happening in the Sure. --
I still don't understand ... you must have some context in mind that
I've spaced out ... I can't even tell if that is a statement or a
Well ... I'm suspecting we've got this portion of our discussion wrapped
around the axle one time too many.
Backing up, hopefully unwrapping, you seemed to allow moving bound tasks
only to cpusets with the same cpus (how come you didn't check for the
same memory nodes too?). If you really needed to move bound tasks at all,
then that seemed like an unnecessarily tight constraint. It wouldn't hurt
the bound task to move to another cpuset that still allowed the CPUs it was
bound to.
... but after an another iteration of that subthread ... I'm wondering
why you have to move bound tasks at all. How about PF_THREAD_BIND just
meaning (1) "can't be moved to any other cpuset", and (2) "always
placed in the top cpuset," so we don't have to worry about being unable
to move threads out of child cpusets.
Do you have any situation in which pinned threads have to be moved?
I don't. Can we just prohibit it?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
You said that you weren't aware of any problems that could arise that are fixed with this added check in set_cpus_allowed(), so I asked that you setup a cpuset that doesn't have access to cpu 0 and move the watchdog/0 The fix is more general than just the cpusets case, even though it is probably the only user in the kernel of set_cpus_allowed() that allows bound kthreads to be moved. The idea is to expressly deny kthreads that have called kthread_bind() from changing their cpus_allowed via set_cpus_allowed(). Cpusets should handle that gracefully, but my fix is more general: it adds the check to set_cpus_allowed() instead of in the cpusets code. This is a little difficult with the current way that cpusets calls set_cpus_allowed() since its attach function is void and does not return error or success to the cgroup interface, yet my patch prevents the soft lockups and kernel crashes that can occur when cpusets changes the cpus_allowed of watchdog or migration threads that is obviously in error. David --
Ok, now I understand your question - thanks.
I think your question arises from misreading what I wrote.
I did not say that I wasn't "aware of any problems that could arise"
- This does not say no (none whatsoever) problem could (ever in the future) arise.
- This does say not much (just a little) problem had arisen (so far in the past).
Apparently, you thought I was trying to reject the patch on the grounds
that no such problem could ever occur, and you were showing how such a
problem could occur. I wasn't trying to reject the patch, and I agree
that the check made sense, and I agree that such a problem could occur,
as your example shows.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Ok, good. We're in agreement that my patch fixes a problem in allowing kthreads to move to cpusets that either currently or eventually deny them access to the cpu to which they are bound. Now do you have any preference without extensive cpusets or cgroups hacking where we can gracefully handle -EINVAL being returned from set_cpus_allowed() so the task doesn't end up in a different cpuset? The return value of set_cpus_allowed() is currently ignored by the cpuset implementation and, regardless, cpuset_attach() returns void. cpuset_can_attach() does some sanity checking, but we need a call to set_cpus_allowed() to check the new logic and will actually change tsk->cpus_allowed if correctly invoked. So the cpusets implementation needs a similar check in cpuset_can_attach() so that it is expressly denied from moving before we ever call cpuset_attach(). It doesn't make sense that a kthread is going to move itself through the cpusets interface, so this can simply be done by checking for tsk->flags & PF_THREAD_BOUND. sched: prevent bound kthreads from changing cpus_allowed Kthreads that have called kthread_bind() are bound to specific cpus, so other tasks should not be able to change their cpus_allowed from under them. Otherwise, it is possible to move kthreads, such as the migration or watchdog threads, so they are not allowed access to the cpu they work on. Signed-off-by: David Rientjes <rientjes@google.com> --- include/linux/sched.h | 1 + kernel/cpuset.c | 2 ++ kernel/kthread.c | 1 + kernel/sched.c | 6 ++++++ 4 files changed, 10 insertions(+), 0 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1464,6 +1464,7 @@ static inline void put_task_struct(struct task_struct *t) #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */ #define ...
I'm ok with this.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
yes. Ideally Peter's patchset should turn into something equivalent and i very much agree with Peter's arguments. There was never any design level problem with cpusets, and the parallel cpu_isolated_map approach was misdirected IMO. There was indeed a problem with the _manageability_ of cpusets in certain (rather new) usecases like real-time or virtualization, and how they are connected to other system resources like IRQs and how easy it is to manage these resources. IRQs should probably be tied to specific cpusets and should migrate together with them, were the span of that cpuset be changed. (by default they'd be tied to the boot cpuset) IMO Peter's patchset is a good first step in that it removes the cpu_isolated_map API hack, and i think we should try to go the whole way and just offer a /dev/cpuset/boot/ default set that can then be restricted to isolate the default workloads away from other CPUs. ( an initscripts approach, while i'm sure it works, would always be a bit fragile in that it requires precise knowledge about which task is what. I think we should make this a turn-key in-kernel solution that both the big-honking NUMA-box guys and the real-time guys would be happy with. ) Ingo --
Agreed. In the early days of cpusets, (1) the less I put in the kernel,
the easier it was to get those first patches accepted, and (2) the more
there was that remained for me to code in user space, where my employer
could provide better, leading edge, products for a profit.
Overtime, these user space features that prove their worth, but that
were difficult to code in the most usable and robust manner in user
space, such as boot cpusets here, and, on another thread, cpuset
relative NUMA mempolicies, naturally migrate to the kernel, for wider
availability.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I like the concept of the "boot" set. But we still need a separate "system" flag. Users should not have to move cpu back into the "boot" set to allow for kernel (irqs, etc) activity on it. And it's just more explicit and clear that way. Max --
Good point -- a "boot" cpuset might be 4 CPUs out of 256 CPUs, just for running
the classic Unix load (daemons, init, login, ...). But irq's might need to go
to most CPUs, except for some (dare I use the word) isolated CPUs.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
btw Can you send me those init scripts that you mentioned. ie Those that create "boot" set from userspace. Max --
They aren't open source ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Oh, I see. It's probably patented too ;-) Max --
I think that is a separate thing. Bootup default is one thing and being able to explicitly allow/disallow kernel activity on a CPU(s) is another. I think "boot" or "set0" makes perfect sense. In fact that was the first thing I noticed when I started playing with it. ie Even if I just wanted to isolated one cpu I now need to create a cpuset for the other cpus and move all the tasks there explicitly. It'd be very useful if it happens by default. Max --
Come on Ingo. You make it sounds like it's radically different solution. At the end of the day we have a bitmap that represents which CPUs can be used for the kernel stuff. How is that different ? I was saying all along that cpusets is a higher level API and was discussing Hmm, that was easy. Not a single ack. Even the core part is not complete yet. I pointed out several issues. Like the fact that it does not provide full isolation because it does not move timers, does not handle workqueues. I did not even get a chance to test this stuff properly and see if it actually solves the usecase I was solving with my patches. _Obviously_ we could not have taken my tested solution and evolved it in the direction people wanted to see it evolve, ie integration with the cpusets :(. My main concern is that it introduces a whole new set of notifiers that perform similar functions to what CPU hotplut already does. Max --
It of course lays waste to a series of cgroup patches from Paul Menage which I already had queued. So I shall drop git-sched again. How often do I have to say this? git-sched is not git-everything-which-looks-shiny! It is for the CPU scheduler. If you had put this patchset into a private branch for private testing, or even into a separate git-petes-stuff then I wouldn't have to collaterally drop the entirety of git-sched because of this. Sigh. --
And when I do this I get:
***************
*** 8125,8137 ****
return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
}
- static int cpu_rt_period_write_uint(struct cgroup *cgrp, struct cftype *cftype,
u64 rt_period_us)
{
return sched_group_set_rt_period(cgroup_tg(cgrp), rt_period_us);
}
- static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
{
return sched_group_rt_period(cgroup_tg(cgrp));
}
--- 8125,8137 ----
return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
}
+ static int cpu_rt_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
u64 rt_period_us)
{
return sched_group_set_rt_period(cgroup_tg(cgrp), rt_period_us);
}
+ static u64 cpu_rt_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
{
return sched_group_rt_period(cgroup_tg(cgrp));
}
***************
*** 8367,8374 ****
},
{
.name = "rt_period_us",
- .read_uint = cpu_rt_period_read_uint,
- .write_uint = cpu_rt_period_write_uint,
},
#endif
};
--- 8367,8374 ----
},
{
.name = "rt_period_us",
+ .read_u64 = cpu_rt_period_read_u64,
+ .write_u64 = cpu_rt_period_write_u64,
},
#endif
};
and if I then fix that up, and later restore git-sched, Paul's patch is now
broken.
Your trees continue to cause more trouble than anyone else's have ever
done, by a lot.
Let me try yesterday's git-sched.patch.
--
Andrew, please stop tracking sched-devel.git and track this tree instead: git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git thanks, Ingo --
Another option is to have cpusets or cpu isolation tree that I've started already. I was saying from the very beginning cpu isolation stuff does not imho belong in the scheduler tree. Besides a tiny patch to the sched.c that adds/removes the bitmaps there are no scheduler changes needed for this specifically. Peter, Ingo, if you guys are ok with this lets just have this stuff in cpuisol-2.6.git. I'm anyway rebasing it ontop of Peter's work. Of course we'll go through regular review and stuff and Andrew can track that tree separately. Just a suggestion. I'm ok with submitting patches via sched-devel. Separate tree seems more appropriate though. Max --
Is it now the intent, not only that I have to enable cpusets in the kernel but I will also have to use them in userland to take advantage of this. And hot-plug too?? Can I predict that in the future that userland sched_setaffinity will be taken away also and be forced to use cpusets? And hot-plug too?? Mark --
Mark, I bet you won't get any replies (besides mine). And yes this means that you will have to enable cpusets if Peter's patches go in (looks like they will). Hot-plug may not be needed unless I convince people to reuse the hot-plug instead of introducing new notifiers. I guess we can make some extensions to expose "system" bit just like I did with "isolated" bit via sysfs. In which case cpusets may not be needed. We'll see. Max --
Hi Paul,
How about something like this; along with the in-kernel version
of /cgroup/boot this could also provide the desired semantics.
Another benefit of this approach would be that it no longer requires
PF_THREAD_BIND, as we'd only stick unbounded kthreads into that cgroup.
(compile tested only)
---
Subject: cpuset: cpuset irq affinities
Allow for an association between cpusets and irqs.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/irq.h | 9 ++
kernel/cpuset.c | 160 ++++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/irq/manage.c | 19 ++++++
3 files changed, 188 insertions(+)
Index: linux-2.6-2/include/linux/irq.h
===================================================================
--- linux-2.6-2.orig/include/linux/irq.h
+++ linux-2.6-2/include/linux/irq.h
@@ -174,11 +174,20 @@ struct irq_desc {
#ifdef CONFIG_PROC_FS
struct proc_dir_entry *dir;
#endif
+#ifdef CONFIG_CPUSETS
+ struct cpuset *cs;
+#endif
const char *name;
} ____cacheline_internodealigned_in_smp;
extern struct irq_desc irq_desc[NR_IRQS];
+struct irq_iterator {
+ int (*function)(struct irq_iterator *, int, struct irq_desc *);
+};
+
+extern int irq_iterator(struct irq_iterator *);
+
/*
* Migration helpers for obsolete names, they will go away:
*/
Index: linux-2.6-2/kernel/cpuset.c
===================================================================
--- linux-2.6-2.orig/kernel/cpuset.c
+++ linux-2.6-2/kernel/cpuset.c
@@ -50,6 +50,9 @@
#include <linux/time.h>
#include <linux/backing-dev.h>
#include <linux/sort.h>
+#ifdef CONFIG_GENERIC_HARDIRQS
+#include <linux/irq.h>
+#endif
#include <asm/uaccess.h>
#include <asm/atomic.h>
@@ -732,6 +735,44 @@ void cpuset_change_cpumask(struct task_s
set_cpus_allowed(tsk, (cgroup_cs(scan->cg))->cpus_allowed);
}
+#ifdef CONFIG_GENERIC_HARDIRQS
+struct cpuset_irq_cpumask {
+ struct irq_iterator v;
+ struct cpuset *cs;
+ cpumask_t mask;
+};
+
+static ...i like this approach - it makes irqs more resource-alike and attaches them to a specific resource control group. So if /cgroup/boot is changed to have less CPUs then the "default" irqs move along with it. but if an isolated RT domain has specific irqs attached to it (say the IRQ of some high-speed data capture device), then the irqs would move together with that domain. irqs are no longer a bolted-upon concept, but more explicitly managed. [ If you boot-test it and if Paul agrees with the general approach then i could even apply it to sched-devel.git ;-) ] Ingo --
Believe it or not but I like it too :). Now we're talking different approach compared to the cpu_isolated_map since with this patch cpu_system_map is no longer needed. I've been playing with latest sched-devel tree and while I think we'll endup adding a lot more code, doing it with the cpuset is definitely more flexible. This way we can provide more fine grain control of what part of the "system" services are allowed to run on a cpuset. Rather that "catch all" system flag. Current sched-devel tree does not provide complete isolation at this point. There are still many things here and there that need to be added/fixed. Having finer control here helps. One concern I have is that this API conflicts with /proc/irq/X/smp_affinity. ie Setting smp_affinity manually will override affinity set by the cpuset. In other words I think int irq_set_affinity(unsigned int irq, cpumask_t cpumask) now needs to make sure that cpumask does not have cpus that do not belong to the cpuset this irq belongs to. Just like sched_setaffinity() does for the tasks. Max --
The patch also needs to handle group destruction too; currently it leaves cpuset pointers dangling. So it would either have to refuse removing a group when there are still irqs associated, or move them to the parent. But yeah, this was just a quick hack to show the idea, glad you like it. Will try to flesh it out a bit in the coming week. --
Are you going to add code for "boot" cpuset ? I wrote user-space code that is does that, but as I understand from previous discussions we want to create that in the kernel. Max --
Yeah, I'll be trying to (lack of cgroup fu for the moment). I think something like /cgroup /cgroup/system /cgroup/system/boot /cgroup/big_honking_app /cgroup/rt_domain Where the system group includes all IRQs and all unbound kernel threads (by default). The system/boot group will contain all of userspace. Doing it in this way ought to allow for some weird setups. The system group can overlap with anything that does need system services. The boot group must be a subset thereof, and can be shrunk to a small part of the machine. --
I suppose IRQs need to overlap like this, but cpusets often can't
overlap like this.
If a system has the cgroup hierarchy you draw:
/cgroup
/cgroup/system
/cgroup/system/boot
/cgroup/big_honking_app
/cgroup/rt_domain
this must not force the cpuset hierarchy to be:
/dev/cpuset
/dev/cpuset/system
/dev/cpuset/system/boot
/dev/cpuset/big_honking_app
/dev/cpuset/rt_domain
I guess this means IRQs cannot be added to the cpuset subsystem
of cgroups. Rather they have to be added to some other cgroup
subsystem, perhaps a new one just for IRQs.
In perhaps the most common sort of cpuset hierarchy:
/dev/cpuset
/dev/cpuset/boot
/dev/cpuset/batch_sched
/dev/cpuset/big_honking_app
/dev/cpuset/rt_domain
none of boot or its siblings overlap.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Due to CS_CPU_EXCLUSIVE usage? The trouble is, cgroups are primarily about tasks, whereas IRQs are not. So we would create a cgroup that does not manage tasks, but rather associates irqs with sets of cpus - which are not cpusets. But as long as nobody does CS_CPU_EXCLUSIVE they may overlap, right? --
> But as long as nobody does CS_CPU_EXCLUSIVE they may overlap, right?
It's a bit stronger than that:
1) They need non-overlapping cpusets at this level to control
the sched_domain setup, if they want to avoid load balancing
across almost all CPUs in the system. Depending on the kernel
version, sched_domain partitioning is controlled either by the
cpuset flag cpu_exclusive, or the cpuset flag sched_load_balance.
2) They need non-overlapping cpusets at this level to control
memory placement of some kernel allocations, which are allowed
outside the current tasks cpuset, to be confined by the nearest
ancestor cpuset marked 'mem_exclusive'
3) Some sysadmin tools are likely coded to expect a /dev/cpuset/boot
cpuset, not a /dev/cpuset/system/boot cpuset, as that has been
customary for a long time.
(1) and (2) would break the major batch schedulers. They typically
mark their top cpuset, /dev/cpuset/pbs or /dev/cpuset/lfs or whatever
batch scheduler it is, as cpu_exclusive and mem_exclusive, by way of
expressing their intention to pretty much own those CPUs and memory
nodes. If we fired them up on a system where that wasn't allowed due
to overlap with /dev/cpuset/system, they'd croak. Such changes as that
are costly and unappreciated.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
OK, understood, I'll try and come up with yet another scheme :-) --
Would your per-cpuset 'irqs' file work if, unlike pids in the 'tasks' file,
we allowed the same irq to be listed in multiple 'irqs' files?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I did think of that, but that seems rather awkward. For one, how would you remove an irq from a cpuset? Secondly, the beauty of the current solution is that we use irq_desc->cs->cpus_allowed, if it were in multiple sets, we'd have to iterate a list, and cpus_or() the bunch. --
Yeah, that would definitely be awkward. Max --
Yeah - agreed - awkward.
Forget that idea (allowing the same irq in multiple 'irqs' files.)
It seems to me that we get into trouble trying to cram that 'system'
cpuset into the cpuset hierarchy, where that system cpuset is there to
hold a list of irqs, but is only partially a good fit for the existing
cpuset hierarchy.
Could this irq configuration be partly a system-wide configuration
decision (which irqs are 'system' irqs), and partly a per-cpuset
decision -- which cpusets (such as a real-time one) want to disable
the usual system irqs that everyone else gets.
The cpuset portion of this should take only a single per-cpuset Boolean
flag -- which if set True (1), asks the system to "please leave my CPUs
off the list of CPUs receiving the usual system irqs."
Then the list of "usual system irqs" would be established in some /proc
or /sys configuration. Such irqs would be able to go to any CPUs
except those CPUs which found themselves in a cpuset with the above
per-cpuset Boolean flag set True (1).
How does all this interact with /proc/irq/N/smp_affinity?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
How about we make this in-kernel boot set, that by default contains all IRQs, all unbounded kthreads and all of user-space. To be compatible with your existing clients you only need to move all the IRQs to the root domain. (Upgrading a kernel would require distributing some new userspace anyway, right? - and we could offer a .config option to disable the boot set for those who do upgrade kernels without upgrading user-space). Then, once you want to make use of the new features, you have to update your batch scheduler to only make use of load_balance and not cpus_exclusive (as they're only interested in sched_domains, right?) So if you want to do IRQ isolation and batch scheduling on the same machine (as is not possible now) you need to update userspace as said before; so that it allows for the overlapping cpuset. For example, on a 32 cpu machine: /cgroup/boot 0-1 (kthreads - initial userspace) /cgroup/irqs 0-27 (most irqs) /cgroup/batch_A 2-5 /cgroup/batch_B 6-13 /cgroup/another_big_app 14-27 /cgroup/RT-domain 28-31 (my special irq) So by providing a .config option for strict backward compatibility, a simple way for runtime compatibility (moving all IRQs to the root) which should be easy to do if the kernel upgrade is accompanied by a (limited) user-space upgrade. And once all the features need to be used together (something that is now not possible - so new usage) then the code that relies on cpus_exclusive to create sched_domains needs to be changed to use load_balance instead. Much the same way the cpuset cpus_allowed interacts with a task's cpus_allowed. That is, cs->cpus_allowed is a mask on top of the provided affinity. If for some reason the cs->cpus_allowed changes in such a way that the user-specified mask becomes empty (irq->cpus_allowed & cs->cpus_allowed == 0), then print a message and set it to the full mask (irq->cpus_allowed = cs->cpus_allowed). If for some reason the cs->cpus_allowed changes in such a way that the mask is ...
/me likes this looks like the most straightforward and most manageable approach proposed so far - i always thought that cpusets should boot up with some meaningful default set that people could play with. This would really push cpusets into mainstream use i believe. Any patch i could try out, to see how well this works in practice? ;-) Ingo --
They do ... you just have to mount cpusets to see it ;).
If your kernel is configured with CONFIG_CPUSETS, then
there is one 'top' cpuset containing all tasks, all mems,
and all cpus, setup during the kernel init boot code,
unconditionally.
Anyone who does:
mkdir -p /dev/cpuset
mount -t cpuset cpuset /dev/cpuset
can see it and play with it.
... perhaps I misunderstood your comment?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
the root cpuset is special as it is the root of the tree. You cannot shrink it in practice (in a meaningful way) because then you cannot have children outside of its scope. Ingo --
Yes ... true ... so? I guess whatever you meant by "meaningful"
doesn't include the root cpuset. Oh well <grin>.
To be a tad more serious, it wasn't (still isn't) clear to me
what you did mean here.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I assumed that this exactly what we meant by the boot set all along :). One thing I wanted to clarify that by all IRQs we literally mean all of them even those that do not have handlers yet. /proc/irqN/smp_affinity for example is not available for the irqs are not active. In other words an IRQ must not all of the sudden show up in the root set when something does request_irq() on it. I think Paul missed my earlier reply where I pointed out that original patch conflicted with /proc/irq/N/smp_affinity API. The solution is for irq_set_affinity() to enforce cpus_allowed just like sched_setaffinity does for the tasks. Max --
If I understood your proposal, the /cgroup/irq cpuset is rather an odd cpuset -- it seems to be just used to place the 'system' irqs on the specified CPUs. This is not the usual use of cpusets, which are to associate a set of tasks with some resources. In your example, the 'tasks' file in /cgroup/irqs is probably empty, right? And then you have to argue that certain incompatibilities this No ... we normally do our best not to force apps to change source code, or where possible not even force them to recompile, to run on That sort of alternative is useless when dealing with the major distros. They choose one config for half or all of their entire market (perhaps distinguishing personal PC from server, but no more). Ok - something like that could be done, if we had to, just as a cpusets 'mems' masks mbind and set_mempolicy nodemasks, and a cpusets 'cpus' masks sched_setaffinity cpumasks. Could be ... I suppose ... if we had to ... however ... I suspect that the reason you had the odd /cgroup/irqs cpuset is that it wasn't clear what to do with the irqs of cpusets that had both: 1) overlapping cpus, and 2) specified different lists of irqs. In my view, you are trying to: A] force the <<irq, cpu>, Boolean> mapping into linear lists, and B] then trying to force those linear lists into the cpuset hierarchy. With each capability that we've added to cpusets, beginning with CPUs and Memory Nodes themselves, and more recently with sched_domains, I've strived to keep cpusets fully hierarchical and nestable, supporting arbitrary combinations of overlapping CPUs and Memory Nodes. What we have here, in this cpuset-irq proposal, I claim, is another example of trying to put in a tree what is essentially flat. No ... that last paragraph is wrong. It's worse than that. The mapping of <cpu, irq> pairs to Boolean ("can we direct this irq to this CPU - yes or no?") is not flat; it's a two-dimensional matrix of Boolean. So you're [A] squinting out the left eye, ...
Likely; if for instance you'd want some unbound kernel threads to join in that overlapping set, then perhaps that name would be badly chosen. Although I'm not sure which unbound kernel threads would benefit from Perhaps we're talking about something else here; how bad would it be to require: for irq in `cat /cgroup/boot/irqs` ; do echo $irq > /cgroup/irqs; done be added to rc.local or fully replace your home-brew boot cpuset script? Its basically an update for that script, giving the exact same semantics Sure, but your application vendors will need to re-certify their applications to run on new distros, sometimes even re-compile because ABI changes and the like. Certainly providing a new script in the new Assign a map of cpus where irqs will default into, and a way to So, yes, cgroups are perhaps awkward because they group tasks whereas the current problem is grouping IRQs. Because we're mapping them onto CPUs, cpusets came to mind. The thing we 'need', is to provide named groups of irqs and for each such a group specify a cpu mask that is appropriate. Grouping them makes sense in that we want to make a functional division. Some IRQs serve a system as a whole, others serve a subset. Typical subsets could be a RT process space bounded to a cpu/mem domain. Other usable subsets could be limiting the IRQs of node local network and IO cards to the cpu/mem domain that runs the application that uses them. So we group irqs like: system_on_nodes_1_2_and_3 (default) big_io_app_on_nodes_2_and_3 rt_app_on_node_4 Where, again, you see a strong similarity to the cpu/mem divisions already made by cpusets. I understand its somewhat at odds with the hierarchical nature of the <task, cpu/mem> mapping you currently have. But its not too far away either. Do you see what we want to do, and why our - perhaps misguided - choice for cpusets? Creating a whole new cgroup controller is also weird because we don't deal in tasks. [ Aside from grouping existing ...
Helpful reply, Peter. Thanks.
Essentially, cpusets are like cgroups in this regard. They group
tasks. They just happen to be grouping tasks to associate them
with sets of CPUs (and Memory Nodes), which seems relevant somehow
to the present need, to group irqs to associate them with sets of
I haven't gotten my head around what such a script would do yet,
but you are correct in suspecting that we could add a script like
this easily enough in future releases, if that was useful.
I can change init scripts, for each kernel version, much easier than I
can ask the big batch scheduler providers to change their application
Can you spell out how or why /proc/irq/N/smp_affinity doesn't
provide what you need here?
My guess is that it's fairly obvious why /proc/irq/N/smp_affinity is
not well suited for this. It requires poking lots of settings, one
at a time, which is cumbersome and racey from user space, difficult
to keep in sync with any other changes in placement of RT or other
jobs, and it requires root permissions, without any finer granularity
Cool -- I'm glad now I asked (rather impatiently) what we needed.
That's a helpful reply.
Could we:
1) name some sets of IRQs
2) for each cpuset, specify which named IRQ set applied to it
3) prioritize these sets of IRQs (linear order), so that
for any given CPU, if it were in multiple cpusets
specifying conflicting IRQ sets, we could select the
IRQ set to apply to that CPU.
Given the reliance in (2) of cpusets on these IRQ set names, this
still needs to be part of cpusets.
But rather than (ab)use cpusets to directly accomplish (1), how
about adding some files to the root cpuset to define IRQ sets,
with names such as (for example):
irqs.0.system
irqs.1.big_io_apps
irqs.2.rt
That is, more generally, add one or more "irqs.N.name" files to
the top cpuset, where N is a distinct natural number and "name"
a user space specified name (except that perhaps the first one,
the 'irqs.0.system', with ...I guess this will require adding a line:
.create = cgroup_create
line to the:
static struct file_operations cgroup_file_operations = {
initialization in kernel/cgroup.c, and a cgroup_create() routine
in kernel/cgroup.c, that calls an optional per-cgroup-subsystem
create routine, that, in the case of cpusets, is willing to create
one of these irqs.N.name files, in the top cpuset, if all looks
right.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
An alternative would be to just have some kind of "irqsets" file in the top-level cpuset directory and let the user write irq group definitions into that file. Paul --
Yes, exactly, that's the alternative.
I tried that, in my mind, and got stuck on the complicated syntax that
would have been needed to represent an arbitrary length array of named
lists of irqs with each list having a priority attribute.
So I went with the separate "irqs.N.name" files, (ab)using the file
system directory apparatus to handle the "arbitrary length array of
named" entities aspect, burying the priority attribute (N) in the name,
and leaving each individual irqs.N.name file only needing to hold a
single vector of irq numbers.
A security conscious sysadmin can even assign different permissions to
the different irq lists with this. And updates of one irq list don't
endanger the other irq lists, thanks to the innate and elaborate
capabilities in the kernel vfs code to handle concurrent updates to
separate files correctly and reliably.
This reminds me of the difference between the Windows Registry (one big
awful file) and the Unix /etc directory (2745 files, on the system
nearest to my shell prompt.) Well, in this irq case, the different is not
-that- dramatic, but it is a small echo of such.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Could you motivate this suggestion -- who needs it, or why
would is it needed?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
My impression was that Peter wanted to be able to control the assignments of CPUs to IRQs in a way that could result in overlapping. One of the arguments that you posted against his proposal was that this would break due to the memory overlap requirements of mem_exclusive cpusets. So this appeared to be a case where the fact that memory and cpu masks are combined in the same cgroups subsystem is a drawback. (But maybe I'm misunderstanding the discussion). I'm sure if cpusets were being developed today on top of cgroups, rather than being its inspiration, there would be no good reason to have the memory mask assignment and the cpu mask assignment be part of the same subsystem - they're only together now because there was no general grouping mechanism in the kernel when cpusets was written. Paul --
There were three concerns I had with his proposal -- (1) it conflicted
with memory placement, (2) it conflicted with cpu placement (sched
domain definition) and (3) it broke the conventional cpuset hierarchy
configuration on which users have come to depend.
Separating cpus and memory as distinct cgroup subsystems wouldn't help
much; there's still (2) and (3). And in the legacy /dev/cpuset
interface, cpus and memory remain together, whether or not cgroups
Perhaps ... though they work together rather well in practice, perhaps
because CPUs and memory banks usually are physically associated;
selecting which CPUs to use and which memory banks to use really is not
an independent choice, on most hardware.
Now however the shoe is on the other foot. Is there a good reason to
separate them ... an actual would-be user, or actual problems inflicted
on current users due to the lack of this split?
If there's just a rare occassion such an independently split CPU and
Memory hiearchy, one can still use the current combined cpuset
implementation, just with a less natural cpuset hierarchy (more leaf
node cpusets, representing the cross product of interesting CPU subsets
and interesting memory node subsets.) If some specified users start
doing this routinely, then that gets sufficiently cumbersome that it's
worth trying to remedy.
As usual, I seem to be counseling resisting adding code, complexity and
incompatible change, without actual need, beyond just increased
conceptual sophistication.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Except that this isn't currently possible if you're also trying to do memory hardwalling on those cpusets, since then sibling cpusets can't share memory nodes. Having said that, this bit of the problem can be fixed without splitting cpus/mems, by my other earlier proposal of adding a separate "mem_hardwall" flag that can enable the hardwall behaviour without the exclusive behaviour. (i.e. hardwall behaviour occurs if mem_exclusive || mem_hardwall) Paul --
Yes, there would be interactions, on both the CPU and Memory side with
some second order mechanisms (hardwalls and sched domains.)
Still, I ask, where's the beef -- the real users?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I'm one such user who's been forced to add the mem_hardwall flag to get around the fact that exclusive and hardwall are controlled by the same flag. I keep meaning to send it in as a patch but haven't yet got round to it. Also, if you're using fake numa for memory isolation (which we're experimenting with) then the correlation between cpu placement and memory placement is much much weaker, or non-existent. Paul --
I made essentially the same mistake twice in the evolution of cpusets:
1) overloading the cpu_exclusive flag to define sched domains, and
2) overloading the mem_exclusive flag to define memory hardwalls.
I eventually reversed (1), with a deliberately incompatible change
(and you know how I resist those ;), creating a new 'sched_load_balance'
flag that controls the sched_domain partitioning, and removing any
affect that the cpu_exclusive flag has on this.
Perhaps the unfortunate interaction of mem_exclusive and hardwall is
destined to go the same path. Thought the audience that is currently
using mem_exclusive for the purpose of hardwall enforcement of kernel
allocations might be broader than the specialized real-time audience
that was using cpu_exclusive for dynamic sched domain isolation, and so
we might not choose to just break compatibility in one shot, but rather
phase in your new flag, before, perhaps, in a later release, phasing
out the old hardwall overloading of the mem_exclusive flag.
(My primeval mistake was including the cpu_exclusive and mem_exclusive
flags in the original cpuset design; those two flags have given me
That might be a good answer to my asking where the beef was.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Like Ingo, I like the approach.
But I am concerned it won't work, as stated.
Unfortunately, my blithering ignorance of how one might want to
distribute irq's across a system is making it difficult for me
to say for sure if this works or not.
The thing about /dev/cpuset that I am afraid will get in the way
with this use of cpusets to place irqs is that we can really only
have a single purpose hierarchy below /dev/cpuset.
For example, lets say we have:
/dev/cpuset
boot
big_special_app
a_few_isolated_rt_nodes
batchscheduler
batch job 1
batch job 2
...
I guess, with your "cpuset: cpuset irq affinities" patch, we'd start
off with /dev/cpuset/irqs listing the irqs available, and we could
reasonably decide to move any or all irqs to /dev/cpuset/boot/irqs,
by writing the numbers of those irqs to that file, one irq number
per write(2) system call (as is the cpuset convention.)
Do these irqs have any special hardware affinity? Or are they
just consumers of CPU cycles that can be jammed onto whatever CPU(s)
we're willing to let be interrupted?
If for reason of desired hardware affinity, or perhaps for some other
reason that I'm not aware of, we wanted to have the combined CPUs in
both the 'boot' and 'big_special_app' handle some irq, then we'd be
screwed. We can't easily define, using the cpuset interface and its
conventions, a distinct cpuset overlapping boot and big_special_app,
to hold that irq. Any such combining cpuset would have to be the
common parent of both the combined cpusets, an annoying intrusion on
the expected hierarchy.
If the actual set of CPUs we wanted to handle a particular irq wasn't
even the union of any pre-existing set of cpusets, then we'd be even
more screwed, unable even to force the issue by imposing additional
intermediate combined cpusets to meet the need.
If there is any potential for this to be a problem, then we should
examine the possibility of making irqs their own cgroup, rather ...I might just be new-fangled, but I have a /cgroup mount. Depends a bit, the genirq layer seems to allow for irqs that can't be freely placed. But most of them can be given a free mask - /me looks @ I see the issue. We don't support mv on cgroups, right? To easily create Hmm, but that would then be another controller based on cpus. Might be a I'm not sure I know what you're asking. IRQ are hardware notifiers and do all kinds of things depending on the hardware. Network cards typically use them to notify the CPU of incoming packets. Video cards I'm fine with whatever, I saw a ',' in the bitmap stuff, not really sure how that ended up being a ' ' in the patch I send out... :-) --
yes - and when they cannot be arbitrarily migrated we just dont move them (but still keep them attached to that cpuset). The affinity calls will just fail in that case. Might want to emit a kernel warning but that's all. (if then it's a hardware constraint) Ingo --
irq affinity masks can basically be thought of as: "these are the CPUs where external hardware events will trigger certain kernel functions and cause overhead on those CPUs". An IRQ can have followup effects: softirq execution, workqueue execution, etc. so managing the IRQ masks is very meaningful and just as meaningful as managing the affinity masks of tasks. You can think of "IRQ# 123" as "special kernel task # 123". Ingo --
We should just check the return value from irq_set_affinity(). If it fails we I guess there maybe some fancy HW topologies that may be a problem but for most cases we should be ok. Simple cases like unmovable IRQs are easy to handle (ie set_affinity() fails Yeah, I'd prefer it to be along with cpusets. As I mentioned will need similar mechanisms for other things besides irqs for complete isolation. Creating a separate group for each sounds like an overkill. Max --
One can combine cgroups into a single hierarchy.
If we had irqs and these similar mechanisms you have in mind each in
their separate cgroup subsystem, and if you normally wanted to deal
with them all in a single hierarchy, then you can mount those cgroup
subsystems together.
I've manage to avoid learning how to use cgroups, so you might want
to ask Paul Menage (just added to the cc list) how to do that, if the
Documentation/cgroups.txt isn't sufficient.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
So some of these irqs might have very particular node affinity then,
right? If my thingamajig board is attached to node 72, then I might
want its interrupts going to a CPU on node 72, right?
In which case, putting that irq in my boot cpuset that only has nodes
0-3 would be harmful to the performance of my thingamajig board, right?
I'm suspecting here that you don't want a small 'boot' cpuset
(usually a small cpuset running legacy Unix stuff) holding the irqs,
but rather you want a big 'system' cpuset, which has -all-but- a few
nodes dedicated to hard real time or other isolated (there's that
word again) purposes.
That way, most irqs can go to most CPUs, depending on their specific
needs.
Unfortunately, I don't think the cpuset hierarchy and conventions admit
of both a big 'system' cpuset (all but a few isolated nodes) and a small
The only mv supported is simple rename, preserving parentage.
And if one could and did a tree reshaping mv near the top of the
All cgroups mount beneath /cgroup. For backwards compatibility,
one can also "mount -t cpuset cpuset /dev/cpuset", and just get
the cpuset interface, with a couple of legacy hooks to make it behave
Yes - that's another commonly supported form. If that's a better
presentation, then you'd probably want to rework your code, to take
in and display the entire vector of irq numbers in one line, using
a comma-separated list of irqs and ranges of irqs.
See further bitmap_scnprintf(), bitmap_parse_user(),
bitmap_scnlistprintf() and bitmap_parselist(), in bitmap.c.
Given that you don't have an pre-existing bitmap of irqs (that I know
of) and that you might have a distinct error code for each irq that you
try to attach to a different cpuset, I'm guessing you want to stick
with the single irq per write on input, single irq per line on output,
paradigm, similar to what the 'tasks' file uses for task pids.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
...linux/irq.h must not be included in generic code, it's actually more and asm-generic/hw_irq.h. Please restructure the code so that the cpuset code calls into an arch interface which will then be implemented by arch code (which in most cases will be genirq, the other can be left stubbed out for now) --
