Re: [PATCH 4/5] Add fair-user scheduler

Previous thread: [patch] usb-skeleton leaking locks on open. by Mark Gross on Monday, September 24, 2007 - 12:18 pm. (2 messages)

Next thread: none
To: <akpm@...>, Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:33 pm

Hello Ingo and Andrew,
Here are various patches to fair group scheduler code present in
sched-devel and in 2.6.23-rc7-mm1. These patches should apply against
both sched-devel git tree and 2.6.23-rc7-mm1 (as they are both in sync
now).

Pls consider for inclusion after review.

[there is still room for improving group fairness obtained on smp
systems. By sending out these patches, i am hoping more folks can test
and improve that aspect]

Ingo,
You may want to avoid picking Patch 5/5 as it makes sense only in
-mm tree atm.

--
Regards,
vatsa
-

To: <akpm@...>, Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:41 pm

Enable "cgroup" (formerly containers) based fair group scheduling.
This will let administrator create arbitrary groups of tasks (using
"cgroup" psuedo filesystem) and control their cpu bandwidth usage.

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>

---
include/linux/cgroup_subsys.h | 6 ++
init/Kconfig | 24 +++++---
kernel/sched.c | 122 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 145 insertions(+), 7 deletions(-)

Index: current/include/linux/cgroup_subsys.h
===================================================================
--- current.orig/include/linux/cgroup_subsys.h
+++ current/include/linux/cgroup_subsys.h
@@ -36,3 +36,9 @@ SUBSYS(mem_cgroup)
#endif

/* */
+
+#ifdef CONFIG_FAIR_CGROUP_SCHED
+SUBSYS(cpu_cgroup)
+#endif
+
+/* */
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -327,13 +327,6 @@ config FAIR_GROUP_SCHED
This feature lets cpu scheduler recognize task groups and control cpu
bandwidth allocation to such task groups.

-config RESOURCE_COUNTERS
- bool "Resource counters"
- help
- This option enables controller independent resource accounting
- infrastructure that works with cgroups
- depends on CGROUPS
-
choice
depends on FAIR_GROUP_SCHED
prompt "Basis for grouping tasks"
@@ -345,8 +338,25 @@ choice
This option will choose userid as the basis for grouping
tasks, thus providing equal cpu bandwidth to each user.

+ config FAIR_CGROUP_SCHED
+ bool "Control groups"
+ depends on CGROUPS
+ help
+ This option allows you to create arbitrary task groups
+ using the "cgroup" psuedo filesystem and control
+ the cpu bandwidth allocated to each such task group.
+ Refer to Documentation/cgroups.txt for more information
+ on "cgroup" psuedo filesyst...

To: <vatsa@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:58 pm

Too much indentation.

---
~Randy
Phaedrus says that Quality is about caring.
-

To: Randy Dunlap <randy.dunlap@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 1:18 pm

This one was there before in -mm ..I just moved it below in the Kconfig
file. Neverthless will fix the coding-style here as well.

Thanks for your reviews!

--
Regards,
vatsa
-

To: <akpm@...>, Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:40 pm

Enable user-id based fair group scheduling. This is usefull for anyone
who wants to test the group scheduler w/o having to enable
CONFIG_CGROUPS.

A separate scheduling group (i.e struct task_grp) is automatically created for
every new user added to the system. Upon uid change for a task, it is made to
move to the corresponding scheduling group.

A /proc tunable (/proc/root_user_share) is also provided to tune root
user's quota of cpu bandwidth.

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>

---
include/linux/sched.h | 4 +++
init/Kconfig | 13 ++++++++++++
kernel/sched.c | 9 ++++++++
kernel/sched_debug.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++
kernel/user.c | 43 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 121 insertions(+)

Index: linux-2.6.23-rc6/include/linux/sched.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/sched.h
+++ linux-2.6.23-rc6/include/linux/sched.h
@@ -596,6 +596,10 @@ struct user_struct {
/* Hash table maintenance information */
struct hlist_node uidhash_node;
uid_t uid;
+
+#ifdef CONFIG_FAIR_USER_SCHED
+ struct task_grp *tg;
+#endif
};

extern struct user_struct *find_user(uid_t);
Index: linux-2.6.23-rc6/init/Kconfig
===================================================================
--- linux-2.6.23-rc6.orig/init/Kconfig
+++ linux-2.6.23-rc6/init/Kconfig
@@ -289,6 +289,19 @@ config FAIR_GROUP_SCHED
This feature lets cpu scheduler recognize task groups and control cpu
bandwidth allocation to such task groups.

+choice
+ depends on FAIR_GROUP_SCHED
+ prompt "Basis for grouping tasks"
+ default FAIR_USER_SCHED
+
+ config FAIR_USER_SCHED
+ bool "user id"
+ help
+ This option will choose userid as the basis for grouping
+ tasks, thus providing equal cpu bandwidth to each user.
+
+endchoice
+...

To: <vatsa@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 7:39 pm

or use this oneliner:

-

To: roel <12o3l@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 10:09 pm

Looks good. Will fix this in a follow-on.patch.

Thanks!

--
Regards,
vatsa
-

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <akpm@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 2:01 pm

excellent! I'll make this default-enabled.

Ingo
-

To: <vatsa@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:56 pm

s/cpu/CPU/g please.

---
~Randy
Phaedrus says that Quality is about caring.
-

To: Randy Dunlap <randy.dunlap@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 1:16 pm

will take care.

--
Regards,
vatsa
-

To: <akpm@...>, Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:39 pm

With the view of supporting user-id based fair scheduling (and not just
container-based fair scheduling), this patch renames several functions
and makes them independent of whether they are being used for container
or user-id based fair scheduling.

Also fix a problem reported by KAMEZAWA Hiroyuki (wrt allocating
less-sized array for tg->cfs_rq[] and tf->se[]).

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>

---
include/linux/sched.h | 12 +++
init/Kconfig | 11 +--
kernel/sched.c | 172 ++++++++++++++++++--------------------------------
kernel/sched_fair.c | 5 +
4 files changed, 83 insertions(+), 117 deletions(-)

Index: current/include/linux/sched.h
===================================================================
--- current.orig/include/linux/sched.h
+++ current/include/linux/sched.h
@@ -135,6 +135,7 @@ extern unsigned long weighted_cpuload(co

struct seq_file;
struct cfs_rq;
+struct task_grp;
#ifdef CONFIG_SCHED_DEBUG
extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m);
extern void proc_sched_set_task(struct task_struct *p);
@@ -1833,6 +1834,17 @@ extern int sched_mc_power_savings, sched

extern void normalize_rt_tasks(void);

+#ifdef CONFIG_FAIR_GROUP_SCHED
+
+extern struct task_grp init_task_grp;
+
+extern struct task_grp *sched_create_group(void);
+extern void sched_destroy_group(struct task_grp *tg);
+extern void sched_move_task(struct task_struct *tsk);
+extern int sched_group_set_shares(struct task_grp *tg, unsigned long shares);
+
+#endif
+
#ifdef CONFIG_TASK_XACCT
static inline void add_rchar(struct task_struct *tsk, ssize_t amt)
{
Index: current/init/Kconfig
===================================================================
--- current.orig/init/Kconfig
+++ current/init/Kconfig
@@ -282,13 +282,12 @@ config CPUSETS
Say N if unsure.

config FAIR_GROUP_SCHED
- bool "Fair group sch...

To: <vatsa@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:53 pm

---
~Randy
Phaedrus says that Quality is about caring.
-

To: Randy Dunlap <randy.dunlap@...>
Cc: <akpm@...>, Ingo Molnar <mingo@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 1:13 pm

Sounds good. Will add to my follow-on.patch (and will send out after
others have had a chance to comment).

--
Regards,
vatsa
-

To: <akpm@...>, Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:38 pm

- Fix a minor bug in yield (seen for CONFIG_FAIR_GROUP_SCHED)
- Print nr_running and load information for cfs_rq in /proc/sched_debug
- Print &rq->cfs statistics as well (usefull for group scheduling)

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by : Dhaval Giani <dhaval@linux.vnet.ibm.com>

---
kernel/sched_debug.c | 2 ++
kernel/sched_fair.c | 3 ++-
2 files changed, 4 insertions(+), 1 deletion(-)

Index: current/kernel/sched_debug.c
===================================================================
--- current.orig/kernel/sched_debug.c
+++ current/kernel/sched_debug.c
@@ -136,6 +136,8 @@ void print_cfs_rq(struct seq_file *m, in
SPLIT_NS(spread0));
SEQ_printf(m, " .%-30s: %ld\n", "spread0",
cfs_rq->nr_sync_min_vruntime);
+ SEQ_printf(m, " .%-30s: %ld\n", "nr_running", cfs_rq->nr_running);
+ SEQ_printf(m, " .%-30s: %ld\n", "load", cfs_rq->load.weight);
}

static void print_cpu(struct seq_file *m, int cpu)
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -726,7 +726,7 @@ static void dequeue_task_fair(struct rq
*/
static void yield_task_fair(struct rq *rq)
{
- struct cfs_rq *cfs_rq = &rq->cfs;
+ struct cfs_rq *cfs_rq = task_cfs_rq(rq->curr);
struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
struct sched_entity *rightmost, *se = &rq->curr->se;
struct rb_node *parent;
@@ -1025,6 +1025,7 @@ static void print_cfs_stats(struct seq_f
{
struct cfs_rq *cfs_rq;

+ print_cfs_rq(m, cpu, &cpu_rq(cpu)->cfs);
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
}

--
Regards,
vatsa
-

To: <akpm@...>, Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:36 pm

[refer http://marc.info/?l=linux-kernel&m=119014449807290]

Dmitry/Ingo,
I am sorry for not having reviewed this change properly, but I
think we need to revert this.

Some problems with current sched-devel code introduced by this change:

1. SCHED_NORMAL->SCHED_RT transition of current task
a. we reset current->se.exec_start to 0 (update_stats_curr_end)
and not initialize it again. As a result, update_curr_rt()
can calculate bogus values

b. (for CONFIG_FAIR_GROUP_SCHED):
i) higher level entities of current task arent put back
on their cfs_rq
ii) their corresponding cfs_rq->curr is not set to NULL
iii) update_stats_wait_start() not called for higher
level entities of current task.

These functions are usually accomplished by put_prev_entity()

2. SCHED_RT->SCHED_NORMAL
a. (minor) we don't initialize se->prev_sum_exec_runtime
in enqueue_entity() under if (set_curr) { } block.
As a result the current task may get preempted
almost immediately?

b. (for CONFIG_FAIR_GROUP_SCHED):
i) higher level entities of current task arent taken out
of their cfs_rq
ii) their corresponding cfs_rq->curr is not set
iii) update_stats_wait_end() and update_stats_curr_start()
arent called for higher level entities of
current task.

These functions are usually accomplished by set_next_entity()

(similar problems exist while changing groups)

In theory its possible to solve these problems w/o reintroducing
set_curr_task(). I tried doing so, but found it clutters dequeue_entity
and enqueue_entity a lot and makes it less readable. It will duplicate what
put_prev_entity() and set_next_entity() are supposed to do. Moreoever
it is slightly inefficient to do all these in dequeue_entity() if we
consider that dequeue_entity can be called on current task for other
reasons as well (like when it is abt to sleep or change its nice value).

Considering the...

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <akpm@...>, <dmitry.adamushko@...>, <linux-kernel@...>, <dhaval@...>, <containers@...>, <kamezawa.hiroyu@...>, <menage@...>, <efault@...>
Date: Monday, September 24, 2007 - 12:35 pm

ah, i was wondering about that already. We can certainly skip that

yeah, it's not worth it. I'd go for keeping the code unified even if
adds a few instructions runtime overhead, as i'd expect most distros to
enable fair-group-scheduling by default in the future. (once all the
containers infrastructure and tools has trickled down to them)

Ingo
-

Previous thread: [patch] usb-skeleton leaking locks on open. by Mark Gross on Monday, September 24, 2007 - 12:18 pm. (2 messages)

Next thread: none