Hi Ingo,
Here are two patches related to group cpu scheduling.Patch 1/2 -> Fixes minor bugs and coding style issues
Patch 2/2 -> Improves group fairness on SMP systems.This is probably one of the last big changes related to group scheduler
that I had on my plate. Pls apply if there is no objection from anyone.--
Regards,
vatsa
-
The current load balancing scheme isn't good for group fairness.
For ex: on a 8-cpu system, I created 3 groups as under:
a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.This is what I get with the latest scheduler git tree:
Before this patch:
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------Explanation of o/p:
Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) receive by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60[I can share the test scripts that produce such an o/p if reqd]
The deviation from desired group fairness is as below:
a = +24.47%
b = -9.13%
c = -15.33%which is quite high.
After the patch below is applied, here are the results:
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
----------------------------------------...
i'm leaning towards making this v2.6.25 material, as it affects the
non-group-scheduling bits too and is rather large. When i tested it,
group scheduling worked pretty well - at least for CPU bound tasks - and
on SMP too. Could we live with what we have for now and defer this patch
to v2.6.25? If not, could you split up this patch in a way to defer all
the FAIR_GROUP_SCHED relevant changes to a separate patch which will not
affect the !FAIR_GROUP_SCHED case at all? That will make the case much
clearer.Ingo
-
Hi Ingo,
I would prefer this to go in 2.6.24 if possible. 2.6.24 would be the
first kernel to support a group scheduler in its entirety (user interface +
related support in scheduler) and also that works reasonably well :) It wouldFrom my inspection, here are the changes introduced by this patch
for !CONFIG_FAIR_GROUP_SCHED case:- inc/dec_load() takes a load input instead of task pointer input as their
2nd arg
- inc/dec_nr_running don't call inc/dec_load. Instead,
- enqueue/dequeue_task class callbacks call inc/dec_load
- [Unintended/will-fix change] min/max tunables added in
/proc/sys/kernelAll of above changes (except last, which I will fix) should have zero
functional+runtime effect for !CONFIG_FAIR_GROUP_SCHED case. So I don't see how
I can split Patch 2/2 further.Or do you prefer I introduce #ifdef's such that even these minor changes to
inc/dec_load are avoided for !CONFIG_FAIR_GROUP_SCHED case? That would
make the code slightly ugly I suspect.--
Regards,
vatsa
-
ok, as long as it's NOP for the CONFIG_FAIR_GROUP_SCHED, we could try
it.Ingo
-
Ok ..thx. I was begining to make changes to avoid even the above minor changes
for !CONFIG_FAIR_GROUP_SCHED case, but it doesn't look neat, hence will drop
that effort.I am fixing other problems observed with Patch 1/2 (usage of a mutex to
serialize create/destroy groups) and will resend the series very soon.--
Regards,
vatsa
-
well please try it nevertheless, if the changes are not a NOP for the
!FAIR_GROUP_SCHED case it's harder to determine that they are indeed
harmless for the !FAIR_GROUP_SCHED case.Ingo
-
Here's V3 of the group scheduler related patches, which is mainly addressing
improved fairness of cpu bandwidth allocation for task groups.Patch 1/4 -> coding style cleanup
Patch 2/4 -> Minor group scheduling related bug fixesPatch 3/4 (v1) -> Modifies how cpu load is calculated, such that there is zero
impact on !CONFIG_FAIR_GROUP_SCHED
Patch 3/4 (v2) -> Modifies how cpu load is calculated, such that there is a
small impact on code size (but should have NO impact on
functionality or runtime behavior) for
!CONFIG_FAIR_GROUP_SCHED case. The resulting code however is
much neater since it avoids some #ifdefs. I prefer v2.Patch 4/4 -> Updates load balance logic to provide improved fairness for
task groups.To have zero impact on !CONFIG_FAIR_GROUP_SCHED case, please apply the
following patches:- Patch 1/4
- Patch 2/4
- Patch 3/4 (v1)
- Patch 4/4I personally prefer v2 of Patch 3/4. Even though it has a minor impact
on code size for !CONFIG_FAIR_GROUP_SCHED case, the overall code is much
neater IMHO.Impact on sched.o size:
=======================!CONFIG_FAIR_GROUP_SCHED:
text data bss dec hex filename
36829 2766 48 39643 9adb sched.o-before-nofgs
36829 2766 48 39643 9adb sched.o-after-v1-nofgs (v1 of Patch 3/4)
36843 2766 48 39657 9ae9 sched.o-after-v2-nofgs (v2 of Patch 3/4)CONFIG_FAIR_GROUP_SCHED:
text data bss dec hex filename
39019 3346 336 42701 a6cd sched.o-before-fgs
40303 3482 308 44093 ac3d sched.o-after-v1-fgs (v1 of Patch 3/4)
40303 3482 308 44093 ac3d sched.o-after-v2-fgs (v2 of Patch 3/4)Changes since V2 of this patchset [1]
- Split the patches better and make them pass under checkpatch.pl
script
- Fixed compile issues under different config options and also
a suspend failure (as posted by Ingo at [2])
- Make load_balance_monitor thread run as real-time...
The current load balancing scheme isn't good for group fairness.
For ex: on a 8-cpu system, I created 3 groups as under:
a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.This is what I get with the latest scheduler git tree:
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------Explanation of o/p:
Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60[I can share the test case that produces a similar o/p if reqd]
The deviation from desired group fairness is as below:
a = +24.47%
b = -9.13%
c = -15.33%which is quite high.
After the patch below is applied, here are the results:
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
-----------------------------------------------------------...
this API and locking should be introduced in a separate patch first, to
reduce the size and impact of the 4/4 patch.Ingo
-
the proper comment format is in front of the line and in:
/*
* Comment.
*/Ingo
-
Thanks for spotting this bug and rest of your review comments.
Here's V4 of the patchset, aimed at improving fairness of cpu bandwidth
allocation for task groups.Changes since V3 (http://marc.info/?l=linux-kernel&m=119605252303359):
- Fix bug in setting SCHED_RR priority for load_balance_monitor thread
- Fix coding style related issues
- Separate "introduction of lock_doms_cur() API" into a separate patchI have also tested this patchset against your latest git tree as of
today morning.Please apply if there are no major concerns.
--
Regards,
vatsa
-
thanks, it looks good - but the fact that we are at v4 of the patchset
underlines the point that this is more of a v2.6.25 patchset than a
v2.6.24 one. Group fairness certainly works fine enough in most setups
and we are late into the v2.6.24 -rc stage. We'll merge this patchset
(or any later versions of it) into v2.6.25 for sure, so distros that
happen to base things off v2.6.24 can pick up your patches just fine.Ingo
-
--
Regards,
vatsa
-
no need (unless you have bugfixes) i'm carrying this around in the
scheduler git tree. (it will show up in sched-devel.git)Ingo
-
Cool .. Thx! It will get me some testing results (from those who test
sched-devel tree) and also will save me some maintenance trouble :)--
Regards,
vatsa
-
The current load balancing scheme isn't good for group fairness.
For ex: on a 8-cpu system, I created 3 groups as under:
a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.This is what I get with the latest scheduler git tree:
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------Explanation of o/p:
Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60[I can share the test case that produces a similar o/p if reqd]
The deviation from desired group fairness is as below:
a = +24.47%
b = -9.13%
c = -15.33%which is quite high.
After the patch below is applied, here are the results:
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
-----------------------------------------------------------...
doms_cur[] array represents various scheduling domains which are mutually
exclusive. Currently cpusets code can modify this array (by calling
partition_sched_domains()) as a result of user modifying sched_load_balance
flag for various cpusets.This patch introduces a mutex and corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read the
doms_cur[] array w/o worrying abt concurrent modifications to the array.The fair group scheduler code (introduced in next patch of this series)
makes use of this mutex to walk thr' doms_cur[] array while rebalancing
shares of task groups across cpus.Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -186,6 +186,9 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
*/
static DEFINE_MUTEX(task_group_mutex);+/* doms_cur_mutex serializes access to doms_cur[] array */
+static DEFINE_MUTEX(doms_cur_mutex);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -236,11 +239,23 @@ static inline void unlock_task_group_lis
mutex_unlock(&task_group_mutex);
}+static inline void lock_doms_cur(void)
+{
+ mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+ mutex_unlock(&doms_cur_mutex);
+}
+
#elsestatic inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
static inline void lock_task_group_list(void) { }
static inline void unlock_task_group_list(void) { }
+static inline void lock_doms_cur(void) { }
+static inline void unlock_doms_cur(void) { }#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -6547,6 +6562,8 @@ void partition_sched_domains(int ndoms_n
{
int i, j;+ lock_doms_cur();
+
/* always unregister in case we don't...
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.This version of patch (v2 of Patch 3/5) has a minor impact on code size
(but should have no runtime/functional impact) for !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/5
(because of lesser #ifdefs).I prefer v2 of Patch 3/5.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 27 +++++++++++----------------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 40 insertions(+), 20 deletions(-)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,6 +870,16 @@ iter_move_one_task(struct rq *this_rq, i
struct rq_iterator *iterator);
#endif+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -880,26 +890,14 @@ iter_move_one_task(struct rq *this_rq, i#define sched_class_highest (&rt_sched_class)
-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_sub(&rq->load, p->se.load.weight);
-}
-
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running++;
- inc_load(rq, p);
}static void dec_nr_running(struct task_struct...
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.This version of patch (v1 of Patch 3/5) has zero impact for
!CONFIG_FAIR_GROUP_SCHED case.Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 38 ++++++++++++++++++++++++++++++--------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 59 insertions(+), 12 deletions(-)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,15 +870,25 @@ iter_move_one_task(struct rq *this_rq, i
struct rq_iterator *iterator);
#endif-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
@@ -890,6 +900,18 @@ static inline void dec_load(struct rq *r
update_load_sub(&rq->load, p->se.load.weight);
}+#endif /* CONFIG_FAI...
Minor bug fixes for group scheduler:
- Use a mutex to serialize add/remove of task groups and also when
changing shares of a task group. Use the same mutex when printing cfs_rq
stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
walking task group list)Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 34 ++++++++++++++++++++++++++--------
kernel/sched_fair.c | 4 +++-
2 files changed, 29 insertions(+), 9 deletions(-)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
- /* spinlock to serialize modification to shares */
- spinlock_t lock;
struct rcu_head rcu;
};@@ -182,6 +180,12 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
static struct sched_entity *init_sched_entity_p[NR_CPUS];
static struct cfs_rq *init_cfs_rq_p[NR_CPUS];+/*
+ * task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -222,9 +226,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
}+static inline void lock_task_group_list(void)
+{
+ mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+ mutex_unlock(&task_group_mutex);
+}
+
#elsestatic inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -6747,7 +6763,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group...
Minor cleanups:
- Fix coding style
- remove obsolete commentSigned-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 21 +++------------------
1 files changed, 3 insertions(+), 18 deletions(-)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
};#ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD 2*NICE_0_LOAD
#else
-# define INIT_TASK_GRP_LOAD NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;/* return group to which a task belongs */
static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i#define sched_class_highest (&rt_sched_class)
-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
update_load_add(&rq->load, p->se.load.weight);--
Regards,
vatsa
-
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.This version of patch (v2 of Patch 3/4) has a minor impact on code size
(but should have no runtime/functional impact) for !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/4
(because of lesser #ifdefs).I prefer v2 of Patch 3/4.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 27 +++++++++++----------------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 40 insertions(+), 20 deletions(-)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -869,6 +869,16 @@
struct rq_iterator *iterator);
#endif+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -879,26 +889,14 @@#define sched_class_highest (&rt_sched_class)
-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_sub(&rq->load, p->se.load.weight);
-}
-
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running++;
- inc_load(rq, p);
}static void dec_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running--;
- dec_load(rq, p);
}st...
This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated to it.This version of patch (v1 of Patch 3/4) has zero impact for
!CONFIG_FAIR_GROUP_SCHED case.Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 38 ++++++++++++++++++++++++++++++--------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 59 insertions(+), 12 deletions(-)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -869,15 +869,25 @@
struct rq_iterator *iterator);
#endif-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
@@ -889,6 +899,18 @@
update_load_sub(&rq->load, p->se.load.weight);
}+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#include "sched_stats.h"
+#include "sched_idletask.c"
+#incl...
Minor bug fixes for group scheduler:
- Use a mutex to serialize add/remove of task groups and also when
changing shares of a task group. Use the same mutex when printing cfs_rq
stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when walking task
group list)Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 33 +++++++++++++++++++++++++--------
kernel/sched_fair.c | 4 +++-
2 files changed, 28 insertions(+), 9 deletions(-)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
- /* spinlock to serialize modification to shares */
- spinlock_t lock;
struct rcu_head rcu;
};@@ -182,6 +180,11 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
static struct sched_entity *init_sched_entity_p[NR_CPUS];
static struct cfs_rq *init_cfs_rq_p[NR_CPUS];+/* task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -222,9 +225,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
}+static inline void lock_task_group_list(void)
+{
+ mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+ mutex_unlock(&task_group_mutex);
+}
+
#elsestatic inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }#endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -6747,7 +6762,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.sh...
Minor cleanups:
- Fix coding style
- remove obsolete commentSigned-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 21 +++------------------
1 files changed, 3 insertions(+), 18 deletions(-)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
};#ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD 2*NICE_0_LOAD
#else
-# define INIT_TASK_GRP_LOAD NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;/* return group to which a task belongs */
static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i#define sched_class_highest (&rt_sched_class)
-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
update_load_add(&rq->load, p->se.load.weight);--
Regards,
vatsa
-
Minor scheduler cleanups:
- Fix coding style
- remove obsolete comment
- Use list_for_each_entry_rcu when walking task group listSigned-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
---
kernel/sched.c | 21 +++------------------
kernel/sched_fair.c | 5 ++++-
2 files changed, 7 insertions(+), 19 deletions(-)Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
};#ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD 2*NICE_0_LOAD /* root user's cpu share */
#else
-# define INIT_TASK_GRP_LOAD NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;/* return group to which a task belongs */
static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i#define sched_class_highest (&rt_sched_class)
-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
update_load_add(&rq->load, p->se.load.weight);
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel...
hm, why is this a cleanup?
Ingo
-
Sorry for the wrong subject. It was supposed to include the above bug fix,
related to how we walk the task group list.Thinking abt it now, I realize that print_cfs_rq() can potentially
sleep and hence it cannot be surrounded by rcu_read_lock()/unlock().And as Dipankar just pointed me, sched_create/destroy_group aren't
serialized at all currently, so we need a mutex to protect them. The
same mutex can be then used when walking the list in print_cfs_stats() ..Will send update patches soon ..
--
Regards,
vatsa
-
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| James Bottomley | Re: Integration of SCST in the mainstream Linux kernel |
| Stephen Rothwell | Re: Announce: Linux-next (Or Andrew's dream :-)) |
| Arjan van de Ven | Re: [malware-list] [RFC 0/5] [TALPA] Intro to a linux interfaceforon access scanning |
| Patrick McHardy | Re: [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Eric W. Biederman | Re: namespace support requires network modules to say "GPL" |
git: | |
