Ingo and Andrew,
The cfs core has been enhanced since quite sometime now to
understand task-groups and provide fairness to such task-group. What was
needed is an interface for the administrator to define task-groups and
specify group "importance" in terms of its cpu share.
The patch below adds such an interface. Its based on the task containers
feature in -mm tree. I request you to review and include the patch in your
trees for more wider testing from community.
Note that the load balancer needs more work, esp to deal with cases like
2-groups on 4-cpus, one group has 3 tasks and other having 4 tasks.
We are working on some ideas, but nothing to share in the form of a
patch yet. I felt sending this patch out would help folks start
testing the feature and also improve it.
Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>
---
include/linux/container_subsys.h | 5
init/Kconfig | 9 +
kernel/sched.c | 311 +++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 3
4 files changed, 313 insertions(+), 15 deletions(-)
Index: current/include/linux/container_subsys.h
===================================================================
--- current.orig/include/linux/container_subsys.h
+++ current/include/linux/container_subsys.h
@@ -36,3 +36,8 @@ SUBSYS(mem_container)
#endif
/* */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+SUBSYS(cpuctlr)
+#endif
+
+/* */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -326,6 +326,15 @@ config RESOURCE_COUNTERS
infrastructure that works with containers
depends on CONTAINERS
+config FAIR_GROUP_SCHED
+ bool "Fair group scheduler"
+ depends on EXPERIMENTAL && CONTAINERS
+ help
+ This option enables you to group tasks and control CPU resource
+ allocation to such groups.
+
+ ...*shrug* .. I used "cpuctlr" to mean "CPU Controller". Any other short names would do. From your list, cpuctl or cpuctrl both qualifies IMO! Unless folks have strong objection to it, I prefer "cptctlr", the way it is. -- Regards, vatsa -
s/cptctlr/cpuctlr ! -- Regards, vatsa -
Captain Controller to the rescue! -
objection ;) "cpuctlr" isn't memorable. Kernel code is write-rarely, read-often. "cpu_controller", please. The extra typing is worth it ;) -
Ok! Here's the modified patch (against 2.6.23-rc4-mm1).
Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>
---
include/linux/container_subsys.h | 5
init/Kconfig | 9 +
kernel/sched.c | 311 +++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 3
4 files changed, 313 insertions(+), 15 deletions(-)
Index: current/include/linux/container_subsys.h
===================================================================
--- current.orig/include/linux/container_subsys.h
+++ current/include/linux/container_subsys.h
@@ -36,3 +36,8 @@ SUBSYS(mem_container)
#endif
/* */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+SUBSYS(cpu_controller)
+#endif
+
+/* */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -326,6 +326,15 @@ config RESOURCE_COUNTERS
infrastructure that works with containers
depends on CONTAINERS
+config FAIR_GROUP_SCHED
+ bool "Fair group scheduler"
+ depends on EXPERIMENTAL && CONTAINERS
+ help
+ This option enables you to group tasks and control CPU resource
+ allocation to such groups.
+
+ Say N if unsure.
+
config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -179,6 +179,58 @@ struct load_stat {
unsigned long delta_fair, delta_exec, delta_stat;
};
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+#include <linux/container.h>
+
+struct cfs_rq;
+
+/* task group related information */
+struct task_grp {
+ struct container_subsys_state css;
+ /* schedulable entities of this group on each cpu */
+ struct sched_entity **se;
+ /* runqueue "owned" by this group on each cpu */
+ struct cfs_rq **cfs_rq;
+ unsigned long ...I guess, update_rq_clock(rq) should be placed here. humm... do you really need deactivate/activate_task() here? 'rq' and p->se.load.weight stay unchanged so enqueue/dequeue_task() would do a job, no? Although, I might be missing (definitely, as this part is not -- Best regards, Dmitry Adamushko -
Good catch. Here's the updated patch.
Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>
---
include/linux/container_subsys.h | 6
init/Kconfig | 9 +
kernel/sched.c | 312 +++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 3
4 files changed, 315 insertions(+), 15 deletions(-)
Index: current/include/linux/container_subsys.h
===================================================================
--- current.orig/include/linux/container_subsys.h
+++ current/include/linux/container_subsys.h
@@ -36,3 +36,9 @@ SUBSYS(mem_container)
#endif
/* */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+SUBSYS(cpu)
+#endif
+
+/* */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -326,6 +326,15 @@ config RESOURCE_COUNTERS
infrastructure that works with containers
depends on CONTAINERS
+config FAIR_GROUP_SCHED
+ bool "Fair group scheduler"
+ depends on EXPERIMENTAL && CONTAINERS
+ help
+ This option enables you to group tasks and control CPU resource
+ allocation to such groups.
+
+ Say N if unsure.
+
config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -179,6 +179,58 @@ struct load_stat {
unsigned long delta_fair, delta_exec, delta_stat;
};
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+#include <linux/container.h>
+
+struct cfs_rq;
+
+/* task group related information */
+struct task_grp {
+ struct container_subsys_state css;
+ /* schedulable entities of this group on each cpu */
+ struct sched_entity **se;
+ /* runqueue "owned" by this group on each cpu */
+ struct cfs_rq **cfs_rq;
+ unsigned long shares;
+};
+
+/* ...I guess, 'rq->curr == tsk' implies a task was on the 'rq' (before
dequeueing) in this particular case. What's about a minor optimization
like below (plus, let's make use of task_running()):
[ btw., real-time task can be also added to a container, right? I
guess, it can be used at least for group-aware load-balancing of
rt_tasks... otherwise, I guess, cpu-resource controlling is not that
applicable to, say, SCHED_FIFO :-)
Anyway, to this goal rt_task should become aware of group-related bits
of the 'struct sched_entity'... and at the moment, the code below is
effectively just a 'nop' for them.. right? ]
/* change task's runqueue when it moves between groups */
static void sched_move_task(struct container_subsys *ss, struct container *cont,
struct container *old_cont, struct task_struct *tsk)
{
int on_rq, running;
unsigned long flags;
struct rq *rq;
rq = task_rq_lock(tsk, &flags);
update_rq_clock(rq);
running = task_running(rq, tsk);
on_rq = tsk->se.on_rq;
if (on_rq) {
dequeue_task(rq, tsk, 0);
if (unlikely(running) && tsk->sched_class == &fair_sched_class)
tsk->sched_class->put_prev_task(rq, tsk);
}
set_task_cfs_rq(tsk);
if (on_rq) {
enqueue_task(rq, tsk, 0);
if (unlikely(running) && tsk->sched_class == &fair_sched_class)
tsk->sched_class->set_curr_task(rq);
}
task_rq_unlock(rq, &flags);
--
Best regards,
Dmitry Adamushko
-
Hi Dmitry,
The modified code looks very good to me (and neater too!). Updated
I am not sure whether we want to allow that. I have been assuming that
only SCHED_NORMAL tasks will be in containers.
Imagine two SCHED_FIFO tasks being put in two different containers -and-
controlling cpu cycle allocation to both containers. Do we actually timeslice
the two SCHED_FIFO tasks now that are across containers? Also we need a
mechanism to choose a container first to schedule and then the real-time
task in it (similar to how its being done in SCHED_NORMAL case). Which
means we need one rt_rq per container per cpu.
Yes, if we were to allow SCHED_FIFO/RR tasks in containers, then the
load balancer would need to spread rt-tasks of a container across CPUs
Yes. To emphasise this code doesn't support real-time tasks in a container, I
am returning -ENOTSUP when trying to move a rt-task into a container.
Thanks for all your review!
Add interface to control cpu bandwidth allocation to task-groups.
Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>
---
include/linux/container_subsys.h | 6
init/Kconfig | 9 +
kernel/sched.c | 330 +++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 3
4 files changed, 333 insertions(+), 15 deletions(-)
Index: current/include/linux/container_subsys.h
===================================================================
--- current.orig/include/linux/container_subsys.h
+++ current/include/linux/container_subsys.h
@@ -36,3 +36,9 @@ SUBSYS(mem_container)
#endif
/* */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+SUBSYS(cpu)
+#endif
+
+/* */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -326,6 +326,15 @@ config RESOURCE_COUNTERS
infrastructure that works with containers
...s/ENOTSUP/EINVAL -- Regards, vatsa -
Hi Srivatsa, A bit of nit-picking... are you sure, there is no need in non '__' versions of dequeue/enqueu() here (at least, for the sake of update_curr())? Although, I don't have -mm at hand at this very moment and original -rc4 (that I have at hand) doesn't already have -- Best regards, Dmitry Adamushko -
well, there is remote possibility that task could have changed policies
between sched_can_attach() and sched_move_task()! In the updated patch
below, I have put some hooks to deal with that scenario elsewhere also
(in rt_mutex_setprio and __setscheduler). Pls let me know if you see
You are right, we need the dequeue/enqueue_entity here, as the sched
entity load is changing in between the two function calls.
How does the updated patch look? (Thanks again for all your review so far!)
--
Add interface to control cpu bandwidth allocation to task-groups.
Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>
---
include/linux/container_subsys.h | 6
init/Kconfig | 9 +
kernel/sched.c | 344 +++++++++++++++++++++++++++++++++++++--
kernel/sched_fair.c | 3
kernel/sched_idletask.c | 5
kernel/sched_rt.c | 5
6 files changed, 355 insertions(+), 17 deletions(-)
Index: current/include/linux/container_subsys.h
===================================================================
--- current.orig/include/linux/container_subsys.h
+++ current/include/linux/container_subsys.h
@@ -36,3 +36,9 @@ SUBSYS(mem_container)
#endif
/* */
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+SUBSYS(cpu)
+#endif
+
+/* */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -326,6 +326,15 @@ config RESOURCE_COUNTERS
infrastructure that works with containers
depends on CONTAINERS
+config FAIR_GROUP_SCHED
+ bool "Fair group scheduler"
+ depends on EXPERIMENTAL && CONTAINERS
+ help
+ This option enables you to group tasks and control CPU resource
+ allocation to such groups.
+
+ Say N if unsure.
+
config SYSFS_DEPRECATED
bool "Create deprecated sysfs files"
default y
Index: ...btw., just in case it was not obvious, i'll repeat my older assessment of your patch: the general picture approach looks good to me and the code is upstream-worthy. ( suggestion: if you want more people to test it then you might want to do some add-on "put all users into separate groups" patch and .config option - which could be tried without people having to know anything about container setup. ) Ingo -
I do want more people to test it and I agree that hooking onto user-id based groups is the best way to get that done. How do we implement that? I have two choices: 1. Do a kernel patch, as you suggest above, which defines task-groups based on user-id and hook that group definition with group scheduler. We need to provide some means for the admin to tune relative nice-value of each user (perhaps thr' sysctl?). This user-id based grouping will have to be mutually exclusive with task-container based grouping. Hence we need to ensure that only one form of grouping is selected and not both at compile time. 2. Enable only one form of grouping, which is task-container based. Provide a user-space daemon (attached) which can automatically put tasks of different users in different task-containers. The daemon will need to be started at early boot-time. It can also be extended to support a configuration file (ex: inittab) where cpu allocation for different users are specified. The fact that daemon is managing to provide fair allocation to users should be transparent. I hope that task-containers (aka cgroups) will go into 2.6.24, in which case the second option seems to be more attractive to me. I will neverthless try to work out Option 1, just to see how it looks. -- Regards, vatsa -
Oops ..forgot to attach. But I realize that I had already sent the source for the daemon here: http://article.gmane.org/gmane.linux.kernel/553267 -- Regards, vatsa -
as everyone seems to be in a quest for a better name... I think, the obvious one would be just 'group_sched'. -- Best regards, Dmitry Adamushko -
But "sched" on its own could refer to CPU scheduling, I/O scheduling, network scheduling, ... And "group" is more or less implied by the fact that it's in the containers/control groups filesystem. So "group_sched" isn't really all that informative. The name should definitely contain either "cpu" or "cfs". Paul -
"cfs" control group subsystem. "cfs" looks good enough to identify the subsystem. C. -
That looks odd, like it's a filesystem. --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** -
http://en.wikipedia.org/wiki/CFS (scnr) It misses the C...something Filesystem tho. Jan -- -
cfs = completely fair scheduler :) In this thread, we are talking of hooking the cfs cpu scheduler with the task-container framework in -mm tree, so that the scheduler can deal with groups of tasks rather than just tasks, while handling fairness of cpu allocation. I agree "cfs" control subsystem does look odd a bit here. "cpu" control subsystem seems better. -- Regards, vatsa -
Thanks. I agree that using "cpu" is better. I.e., don't tie it to a particular scheduler name. It would just need to change the next time we have a new scheduler. ;) --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** -
That's the main contender for the new name, to replace "task containers" since there were enough objections. (Renamed patches coming out today, I hope). Paul -
I think you mean __read_mostly. :-) Jan -- -
By definition any container (about to be renamed control group) subsystem is some kind of "controller" so that bit seems a bit redundant. Any reason not to just call it "cpu" or "cpu_sched" Paul -
Done (in the latest patch I sent a while back)! -- Regards, vatsa -
On Mon, 10 Sep 2007 22:40:49 +0530 Sorry for very lazy responce.. num_possible_cpus() just returns # of possible cpus. Then it will not return Max-cpu-id to be used. I think just use NR_CPUS here is an easy way. Thanks, -Kame -
Good catch! Yes, you are right, the pointer array needs to be NR_CPUS size. On closer examination, I think I can simply use alloc_percpu() here. Will send a patch after some testing. -- Regards, vatsa -
