Hello Ingo and Andrew,
Here are various patches to fair group scheduler code present in
sched-devel and in 2.6.23-rc7-mm1. These patches should apply against
both sched-devel git tree and 2.6.23-rc7-mm1 (as they are both in sync
now).Pls consider for inclusion after review.
[there is still room for improving group fairness obtained on smp
systems. By sending out these patches, i am hoping more folks can test
and improve that aspect]Ingo,
You may want to avoid picking Patch 5/5 as it makes sense only in
-mm tree atm.--
Regards,
vatsa
-
Enable "cgroup" (formerly containers) based fair group scheduling.
This will let administrator create arbitrary groups of tasks (using
"cgroup" psuedo filesystem) and control their cpu bandwidth usage.Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>---
include/linux/cgroup_subsys.h | 6 ++
init/Kconfig | 24 +++++---
kernel/sched.c | 122 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 145 insertions(+), 7 deletions(-)Index: current/include/linux/cgroup_subsys.h
===================================================================
--- current.orig/include/linux/cgroup_subsys.h
+++ current/include/linux/cgroup_subsys.h
@@ -36,3 +36,9 @@ SUBSYS(mem_cgroup)
#endif/* */
+
+#ifdef CONFIG_FAIR_CGROUP_SCHED
+SUBSYS(cpu_cgroup)
+#endif
+
+/* */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -327,13 +327,6 @@ config FAIR_GROUP_SCHED
This feature lets cpu scheduler recognize task groups and control cpu
bandwidth allocation to such task groups.-config RESOURCE_COUNTERS
- bool "Resource counters"
- help
- This option enables controller independent resource accounting
- infrastructure that works with cgroups
- depends on CGROUPS
-
choice
depends on FAIR_GROUP_SCHED
prompt "Basis for grouping tasks"
@@ -345,8 +338,25 @@ choice
This option will choose userid as the basis for grouping
tasks, thus providing equal cpu bandwidth to each user.+ config FAIR_CGROUP_SCHED
+ bool "Control groups"
+ depends on CGROUPS
+ help
+ This option allows you to create arbitrary task groups
+ using the "cgroup" psuedo filesystem and control
+ the cpu bandwidth allocated to each such task group.
+ Refer to Documentation/cgroups.txt for more information
+ on "cgroup" psuedo filesyst...
Too much indentation.
---
~Randy
Phaedrus says that Quality is about caring.
-
This one was there before in -mm ..I just moved it below in the Kconfig
file. Neverthless will fix the coding-style here as well.Thanks for your reviews!
--
Regards,
vatsa
-
Enable user-id based fair group scheduling. This is usefull for anyone
who wants to test the group scheduler w/o having to enable
CONFIG_CGROUPS.A separate scheduling group (i.e struct task_grp) is automatically created for
every new user added to the system. Upon uid change for a task, it is made to
move to the corresponding scheduling group.A /proc tunable (/proc/root_user_share) is also provided to tune root
user's quota of cpu bandwidth.Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>---
include/linux/sched.h | 4 +++
init/Kconfig | 13 ++++++++++++
kernel/sched.c | 9 ++++++++
kernel/sched_debug.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/user.c | 43 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 121 insertions(+)Index: linux-2.6.23-rc6/include/linux/sched.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/sched.h
+++ linux-2.6.23-rc6/include/linux/sched.h
@@ -596,6 +596,10 @@ struct user_struct {
/* Hash table maintenance information */
struct hlist_node uidhash_node;
uid_t uid;
+
+#ifdef CONFIG_FAIR_USER_SCHED
+ struct task_grp *tg;
+#endif
};extern struct user_struct *find_user(uid_t);
Index: linux-2.6.23-rc6/init/Kconfig
===================================================================
--- linux-2.6.23-rc6.orig/init/Kconfig
+++ linux-2.6.23-rc6/init/Kconfig
@@ -289,6 +289,19 @@ config FAIR_GROUP_SCHED
This feature lets cpu scheduler recognize task groups and control cpu
bandwidth allocation to such task groups.+choice
+ depends on FAIR_GROUP_SCHED
+ prompt "Basis for grouping tasks"
+ default FAIR_USER_SCHED
+
+ config FAIR_USER_SCHED
+ bool "user id"
+ help
+ This option will choose userid as the basis for grouping
+ tasks, thus providing equal cpu bandwidth to each user.
+
+endchoice
+...
or use this oneliner:
-
Looks good. Will fix this in a follow-on.patch.
Thanks!
--
Regards,
vatsa
-
excellent! I'll make this default-enabled.
Ingo
-
s/cpu/CPU/g please.
---
~Randy
Phaedrus says that Quality is about caring.
-
will take care.
--
Regards,
vatsa
-
With the view of supporting user-id based fair scheduling (and not just
container-based fair scheduling), this patch renames several functions
and makes them independent of whether they are being used for container
or user-id based fair scheduling.Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating
less-sized array for tg->cfs_rq[] and tf->se[]).Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>---
include/linux/sched.h | 12 +++
init/Kconfig | 11 +--
kernel/sched.c | 172 ++++++++++++++++++--------------------------------
kernel/sched_fair.c | 5 +
4 files changed, 83 insertions(+), 117 deletions(-)Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -135,6 +135,7 @@ extern unsigned long weighted_cpuload(costruct seq_file;
struct cfs_rq;
+struct task_grp;
#ifdef CONFIG_SCHED_DEBUG
extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m);
extern void proc_sched_set_task(struct task_struct *p);
@@ -1833,6 +1834,17 @@ extern int sched_mc_power_savings, schedextern void normalize_rt_tasks(void);
+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+extern struct task_grp init_task_grp;
+
+extern struct task_grp *sched_create_group(void);
+extern void sched_destroy_group(struct task_grp *tg);
+extern void sched_move_task(struct task_struct *tsk);
+extern int sched_group_set_shares(struct task_grp *tg, unsigned long shares);
+
+#endif
+
#ifdef CONFIG_TASK_XACCT
static inline void add_rchar(struct task_struct *tsk, ssize_t amt)
{
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -282,13 +282,12 @@ config CPUSETS
Say N if unsure.config FAIR_GROUP_SCHED
- bool "Fair group sch...
---
~Randy
Phaedrus says that Quality is about caring.
-
Sounds good. Will add to my follow-on.patch (and will send out after
others have had a chance to comment).--
Regards,
vatsa
-
- Fix a minor bug in yield (seen for CONFIG_FAIR_GROUP_SCHED)
- Print nr_running and load information for cfs_rq in /proc/sched_debug
- Print &rq->cfs statistics as well (usefull for group scheduling)Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>---
kernel/sched_debug.c | 2 ++
kernel/sched_fair.c | 3 ++-
2 files changed, 4 insertions(+), 1 deletion(-)Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -136,6 +136,8 @@ void print_cfs_rq(struct seq_file *m, in
SPLIT_NS(spread0));
SEQ_printf(m, " .%-30s: %ld\n", "spread0",
cfs_rq->nr_sync_min_vruntime);
+ SEQ_printf(m, " .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
+ SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
}static void print_cpu(struct seq_file *m, int cpu)
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -726,7 +726,7 @@ static void dequeue_task_fair(struct rq
*/
static void yield_task_fair(struct rq *rq)
{
- struct cfs_rq *cfs_rq = &rq->cfs;
+ struct cfs_rq *cfs_rq = task_cfs_rq(rq->curr);
struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
struct sched_entity *rightmost, *se = &rq->curr->se;
struct rb_node *parent;
@@ -1025,6 +1025,7 @@ static void print_cfs_stats(struct seq_f
{
struct cfs_rq *cfs_rq;+ print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
}--
Regards,
vatsa
-
[refer http://marc.info/?l=linux-kernel&m=119014449807290]
Dmitry/Ingo,
I am sorry for not having reviewed this change properly, but I
think we need to revert this.Some problems with current sched-devel code introduced by this change:
1. SCHED_NORMAL->SCHED_RT transition of current task
a. we reset current->se.exec_start to 0 (update_stats_curr_end)
and not initialize it again. As a result, update_curr_rt()
can calculate bogus valuesb. (for CONFIG_FAIR_GROUP_SCHED):
i) higher level entities of current task arent put back
on their cfs_rq
ii) their corresponding cfs_rq->curr is not set to NULL
iii) update_stats_wait_start() not called for higher
level entities of current task.These functions are usually accomplished by put_prev_entity()
2. SCHED_RT->SCHED_NORMAL
a. (minor) we don't initialize se->prev_sum_exec_runtime
in enqueue_entity() under if (set_curr) { } block.
As a result the current task may get preempted
almost immediately?b. (for CONFIG_FAIR_GROUP_SCHED):
i) higher level entities of current task arent taken out
of their cfs_rq
ii) their corresponding cfs_rq->curr is not set
iii) update_stats_wait_end() and update_stats_curr_start()
arent called for higher level entities of
current task.These functions are usually accomplished by set_next_entity()
(similar problems exist while changing groups)
In theory its possible to solve these problems w/o reintroducing
set_curr_task(). I tried doing so, but found it clutters dequeue_entity
and enqueue_entity a lot and makes it less readable. It will duplicate what
put_prev_entity() and set_next_entity() are supposed to do. Moreoever
it is slightly inefficient to do all these in dequeue_entity() if we
consider that dequeue_entity can be called on current task for other
reasons as well (like when it is abt to sleep or change its nice value).Considering the...
ah, i was wondering about that already. We can certainly skip that
yeah, it's not worth it. I'd go for keeping the code unified even if
adds a few instructions runtime overhead, as i'd expect most distros to
enable fair-group-scheduling by default in the future. (once all the
containers infrastructure and tools has trickled down to them)Ingo
-
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Mike Travis | [RFC 00/15] x86_64: Optimize percpu accesses |
| Dave Jones | agp / cpufreq. |
| Willy Tarreau | Re: [PATCH] tcp: splice as many packets as possible at once |
| Gerrit Renker | [PATCH 14/37] dccp: Tidy up setsockopt calls |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Natalie Protasevich | [BUG] New Kernel Bugs |
git: | |
