Re: [PATCH 2/2] sched: Improve fairness of cpu allocation for task groups

Previous thread: MMC sub-system: SDIO block-mode with increment address issue by Dean Jenkins on Monday, November 19, 2007 - 7:44 am. (7 messages)

Next thread: [PATCH 1/3] tty: Add the new termios2 ioctls to the compatible list. by Heiko Carstens on Monday, November 19, 2007 - 8:52 am. (4 messages)
To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 8:27 am

Hi Ingo,
Here are two patches related to group cpu scheduling.

Patch 1/2 -> Fixes minor bugs and coding style issues
Patch 2/2 -> Improves group fairness on SMP systems.

This is probably one of the last big changes related to group scheduler
that I had on my plate. Pls apply if there is no objection from anyone.

--
Regards,
vatsa
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 8:30 am

The current load balancing scheme isn't good for group fairness.

For ex: on a 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

Before this patch:
--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------

Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) receive by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test scripts that produce such an o/p if reqd]

The deviation from desired group fairness is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
----------------------------------------...

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 9:12 am

i'm leaning towards making this v2.6.25 material, as it affects the
non-group-scheduling bits too and is rather large. When i tested it,
group scheduling worked pretty well - at least for CPU bound tasks - and
on SMP too. Could we live with what we have for now and defer this patch
to v2.6.25? If not, could you split up this patch in a way to defer all
the FAIR_GROUP_SCHED relevant changes to a separate patch which will not
affect the !FAIR_GROUP_SCHED case at all? That will make the case much
clearer.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 11:03 am

Hi Ingo,
I would prefer this to go in 2.6.24 if possible. 2.6.24 would be the
first kernel to support a group scheduler in its entirety (user interface +
related support in scheduler) and also that works reasonably well :) It would

From my inspection, here are the changes introduced by this patch
for !CONFIG_FAIR_GROUP_SCHED case:

- inc/dec_load() takes a load input instead of task pointer input as their
2nd arg
- inc/dec_nr_running don't call inc/dec_load. Instead,
- enqueue/dequeue_task class callbacks call inc/dec_load
- [Unintended/will-fix change] min/max tunables added in
/proc/sys/kernel

All of above changes (except last, which I will fix) should have zero
functional+runtime effect for !CONFIG_FAIR_GROUP_SCHED case. So I don't see how
I can split Patch 2/2 further.

Or do you prefer I introduce #ifdef's such that even these minor changes to
inc/dec_load are avoided for !CONFIG_FAIR_GROUP_SCHED case? That would
make the code slightly ugly I suspect.

--
Regards,
vatsa
-

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 11:22 am

ok, as long as it's NOP for the CONFIG_FAIR_GROUP_SCHED, we could try
it.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 12:06 pm

Ok ..thx. I was begining to make changes to avoid even the above minor changes
for !CONFIG_FAIR_GROUP_SCHED case, but it doesn't look neat, hence will drop
that effort.

I am fixing other problems observed with Patch 1/2 (usage of a mutex to
serialize create/destroy groups) and will resend the series very soon.

--
Regards,
vatsa
-

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 3:00 pm

well please try it nevertheless, if the changes are not a NOP for the
!FAIR_GROUP_SCHED case it's harder to determine that they are indeed
harmless for the !FAIR_GROUP_SCHED case.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Monday, November 26, 2007 - 1:00 am

Here's V3 of the group scheduler related patches, which is mainly addressing
improved fairness of cpu bandwidth allocation for task groups.

Patch 1/4 -> coding style cleanup
Patch 2/4 -> Minor group scheduling related bug fixes

Patch 3/4 (v1) -> Modifies how cpu load is calculated, such that there is zero
impact on !CONFIG_FAIR_GROUP_SCHED
Patch 3/4 (v2) -> Modifies how cpu load is calculated, such that there is a
small impact on code size (but should have NO impact on
functionality or runtime behavior) for
!CONFIG_FAIR_GROUP_SCHED case. The resulting code however is
much neater since it avoids some #ifdefs. I prefer v2.

Patch 4/4 -> Updates load balance logic to provide improved fairness for
task groups.

To have zero impact on !CONFIG_FAIR_GROUP_SCHED case, please apply the
following patches:

- Patch 1/4
- Patch 2/4
- Patch 3/4 (v1)
- Patch 4/4

I personally prefer v2 of Patch 3/4. Even though it has a minor impact
on code size for !CONFIG_FAIR_GROUP_SCHED case, the overall code is much
neater IMHO.

Impact on sched.o size:
=======================

!CONFIG_FAIR_GROUP_SCHED:

text data bss dec hex filename
36829 2766 48 39643 9adb sched.o-before-nofgs
36829 2766 48 39643 9adb sched.o-after-v1-nofgs (v1 of Patch 3/4)
36843 2766 48 39657 9ae9 sched.o-after-v2-nofgs (v2 of Patch 3/4)

CONFIG_FAIR_GROUP_SCHED:

text data bss dec hex filename
39019 3346 336 42701 a6cd sched.o-before-fgs
40303 3482 308 44093 ac3d sched.o-after-v1-fgs (v1 of Patch 3/4)
40303 3482 308 44093 ac3d sched.o-after-v2-fgs (v2 of Patch 3/4)

Changes since V2 of this patchset [1]

- Split the patches better and make them pass under checkpatch.pl
script
- Fixed compile issues under different config options and also
a suspend failure (as posted by Ingo at [2])
- Make load_balance_monitor thread run as real-time...

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Monday, November 26, 2007 - 1:09 am

The current load balancing scheme isn't good for group fairness.

For ex: on a 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------

Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
-----------------------------------------------------------...

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Monday, November 26, 2007 - 4:29 pm

this API and locking should be introduced in a separate patch first, to
reduce the size and impact of the 4/4 patch.

Ingo
-

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Monday, November 26, 2007 - 4:28 pm

the proper comment format is in front of the line and in:

/*
* Comment.
*/

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 1:06 am

Thanks for spotting this bug and rest of your review comments.

Here's V4 of the patchset, aimed at improving fairness of cpu bandwidth
allocation for task groups.

Changes since V3 (http://marc.info/?l=linux-kernel&m=119605252303359):

- Fix bug in setting SCHED_RR priority for load_balance_monitor thread
- Fix coding style related issues
- Separate "introduction of lock_doms_cur() API" into a separate patch

I have also tested this patchset against your latest git tree as of
today morning.

Please apply if there are no major concerns.

--
Regards,
vatsa
-

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 7:09 am

thanks, it looks good - but the fact that we are at v4 of the patchset
underlines the point that this is more of a v2.6.25 patchset than a
v2.6.24 one. Group fairness certainly works fine enough in most setups
and we are late into the v2.6.24 -rc stage. We'll merge this patchset
(or any later versions of it) into v2.6.25 for sure, so distros that
happen to base things off v2.6.24 can pick up your patches just fine.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 7:42 am

--
Regards,
vatsa
-

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 8:53 am

no need (unless you have bugfixes) i'm carrying this around in the
scheduler git tree. (it will show up in sched-devel.git)

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 10:32 am

Cool .. Thx! It will get me some testing results (from those who test
sched-devel tree) and also will save me some maintenance trouble :)

--
Regards,
vatsa
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 1:27 am

The current load balancing scheme isn't good for group fairness.

For ex: on a 8-cpu system, I created 3 groups as under:

a = 8 tasks (cpu.shares = 1024)
b = 4 tasks (cpu.shares = 1024)
c = 3 tasks (cpu.shares = 1024)

a, b and c are task groups that have equal weight. We would expect each
of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.

This is what I get with the latest scheduler git tree:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 277.676 | 57.8% | 54.1% 54.1% 54.1% 54.2% 56.7% 62.2% 62.8% 64.5%
b | 116.108 | 24.2% | 47.4% 48.1% 48.7% 49.3%
c | 86.326 | 18.0% | 47.5% 47.9% 48.5%
--------------------------------------------------------------------------------

Explanation of o/p:

Col1 -> Group name
Col2 -> Cumulative execution time (in seconds) received by all tasks of that
group in a 60sec window across 8 cpus
Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
percentage. Col3 data is derived as:
Col3 = 100 * Col2 / (NR_CPUS * 60)
Col4 -> CPU bandwidth received by each individual task of the group.
Col4 = 100 * cpu_time_recd_by_task / 60

[I can share the test case that produces a similar o/p if reqd]

The deviation from desired group fairness is as below:

a = +24.47%
b = -9.13%
c = -15.33%

which is quite high.

After the patch below is applied, here are the results:

--------------------------------------------------------------------------------
Col1 | Col2 | Col3 | Col4
------|---------|-------|-------------------------------------------------------
a | 163.112 | 34.0% | 33.2% 33.4% 33.5% 33.5% 33.7% 34.4% 34.8% 35.3%
b | 156.220 | 32.5% | 63.3% 64.5% 66.1% 66.5%
c | 160.653 | 33.5% | 85.8% 90.6% 91.4%
-----------------------------------------------------------...

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 1:21 am

doms_cur[] array represents various scheduling domains which are mutually
exclusive. Currently cpusets code can modify this array (by calling
partition_sched_domains()) as a result of user modifying sched_load_balance
flag for various cpusets.

This patch introduces a mutex and corresponding API (only when
CONFIG_FAIR_GROUP_SCHED is defined) which allows a reader to safely read the
doms_cur[] array w/o worrying abt concurrent modifications to the array.

The fair group scheduler code (introduced in next patch of this series)
makes use of this mutex to walk thr' doms_cur[] array while rebalancing
shares of task groups across cpus.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -186,6 +186,9 @@ static struct cfs_rq *init_cfs_rq_p[NR_C
*/
static DEFINE_MUTEX(task_group_mutex);

+/* doms_cur_mutex serializes access to doms_cur[] array */
+static DEFINE_MUTEX(doms_cur_mutex);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -236,11 +239,23 @@ static inline void unlock_task_group_lis
mutex_unlock(&task_group_mutex);
}

+static inline void lock_doms_cur(void)
+{
+ mutex_lock(&doms_cur_mutex);
+}
+
+static inline void unlock_doms_cur(void)
+{
+ mutex_unlock(&doms_cur_mutex);
+}
+
#else

static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
static inline void lock_task_group_list(void) { }
static inline void unlock_task_group_list(void) { }
+static inline void lock_doms_cur(void) { }
+static inline void unlock_doms_cur(void) { }

#endif /* CONFIG_FAIR_GROUP_SCHED */

@@ -6547,6 +6562,8 @@ void partition_sched_domains(int ndoms_n
{
int i, j;

+ lock_doms_cur();
+
/* always unregister in case we don't...

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 1:12 am

This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of patch (v2 of Patch 3/5) has a minor impact on code size
(but should have no runtime/functional impact) for !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/5
(because of lesser #ifdefs).

I prefer v2 of Patch 3/5.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 27 +++++++++++----------------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 40 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,6 +870,16 @@ iter_move_one_task(struct rq *this_rq, i
struct rq_iterator *iterator);
#endif

+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -880,26 +890,14 @@ iter_move_one_task(struct rq *this_rq, i

#define sched_class_highest (&rt_sched_class)

-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_sub(&rq->load, p->se.load.weight);
-}
-
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running++;
- inc_load(rq, p);
}

static void dec_nr_running(struct task_struct...

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 1:11 am

This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of patch (v1 of Patch 3/5) has zero impact for
!CONFIG_FAIR_GROUP_SCHED case.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 38 ++++++++++++++++++++++++++++++--------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 59 insertions(+), 12 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -870,15 +870,25 @@ iter_move_one_task(struct rq *this_rq, i
struct rq_iterator *iterator);
#endif

-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED

-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }

static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
@@ -890,6 +900,18 @@ static inline void dec_load(struct rq *r
update_load_sub(&rq->load, p->se.load.weight);
}

+#endif /* CONFIG_FAI...

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 1:09 am

Minor bug fixes for group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
changing shares of a task group. Use the same mutex when printing cfs_rq
stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
walking task group list)

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 34 ++++++++++++++++++++++++++--------
kernel/sched_fair.c | 4 +++-
2 files changed, 29 insertions(+), 9 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
- /* spinlock to serialize modification to shares */
- spinlock_t lock;
struct rcu_head rcu;
};

@@ -182,6 +180,12 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
static struct sched_entity *init_sched_entity_p[NR_CPUS];
static struct cfs_rq *init_cfs_rq_p[NR_CPUS];

+/*
+ * task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -222,9 +226,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
}

+static inline void lock_task_group_list(void)
+{
+ mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+ mutex_unlock(&task_group_mutex);
+}
+
#else

static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }

#endif /* CONFIG_FAIR_GROUP_SCHED */

@@ -6747,7 +6763,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group...

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Tuesday, November 27, 2007 - 1:08 am

Minor cleanups:

- Fix coding style
- remove obsolete comment

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 21 +++------------------
1 files changed, 3 insertions(+), 18 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
};

#ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD 2*NICE_0_LOAD
#else
-# define INIT_TASK_GRP_LOAD NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif

-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;

/* return group to which a task belongs */
static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i

#define sched_class_highest (&rt_sched_class)

-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
update_load_add(&rq->load, p->se.load.weight);

--
Regards,
vatsa
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Monday, November 26, 2007 - 1:06 am

This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated
to it.

This version of patch (v2 of Patch 3/4) has a minor impact on code size
(but should have no runtime/functional impact) for !CONFIG_FAIR_GROUP_SCHED
case, but the overall code, IMHO, is neater compared to v1 of Patch 3/4
(because of lesser #ifdefs).

I prefer v2 of Patch 3/4.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 27 +++++++++++----------------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 40 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -869,6 +869,16 @@
struct rq_iterator *iterator);
#endif

+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
#include "sched_stats.h"
#include "sched_idletask.c"
#include "sched_fair.c"
@@ -879,26 +889,14 @@

#define sched_class_highest (&rt_sched_class)

-static inline void inc_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_add(&rq->load, p->se.load.weight);
-}
-
-static inline void dec_load(struct rq *rq, const struct task_struct *p)
-{
- update_load_sub(&rq->load, p->se.load.weight);
-}
-
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running++;
- inc_load(rq, p);
}

static void dec_nr_running(struct task_struct *p, struct rq *rq)
{
rq->nr_running--;
- dec_load(rq, p);
}

st...

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Monday, November 26, 2007 - 1:05 am

This patch changes how the cpu load exerted by fair_sched_class tasks
is calculated. Load exerted by fair_sched_class tasks on a cpu is now a
summation of the group weights, rather than summation of task weights.
Weight exerted by a group on a cpu is dependent on the shares allocated to it.

This version of patch (v1 of Patch 3/4) has zero impact for
!CONFIG_FAIR_GROUP_SCHED case.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 38 ++++++++++++++++++++++++++++++--------
kernel/sched_fair.c | 31 +++++++++++++++++++++++++++----
kernel/sched_rt.c | 2 ++
3 files changed, 59 insertions(+), 12 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -869,15 +869,25 @@
struct rq_iterator *iterator);
#endif

-#include "sched_stats.h"
-#include "sched_idletask.c"
-#include "sched_fair.c"
-#include "sched_rt.c"
-#ifdef CONFIG_SCHED_DEBUG
-# include "sched_debug.c"
-#endif
+#ifdef CONFIG_FAIR_GROUP_SCHED

-#define sched_class_highest (&rt_sched_class)
+static inline void inc_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_add(&rq->load, load);
+}
+
+static inline void dec_cpu_load(struct rq *rq, unsigned long load)
+{
+ update_load_sub(&rq->load, load);
+}
+
+static inline void inc_load(struct rq *rq, const struct task_struct *p) { }
+static inline void dec_load(struct rq *rq, const struct task_struct *p) { }
+
+#else /* CONFIG_FAIR_GROUP_SCHED */
+
+static inline void inc_cpu_load(struct rq *rq, unsigned long load) { }
+static inline void dec_cpu_load(struct rq *rq, unsigned long load) { }

static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
@@ -889,6 +899,18 @@
update_load_sub(&rq->load, p->se.load.weight);
}

+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
+#include "sched_stats.h"
+#include "sched_idletask.c"
+#incl...

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Monday, November 26, 2007 - 1:03 am

Minor bug fixes for group scheduler:

- Use a mutex to serialize add/remove of task groups and also when
changing shares of a task group. Use the same mutex when printing cfs_rq
stats for various task groups.
- Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when walking task
group list)

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 33 +++++++++++++++++++++++++--------
kernel/sched_fair.c | 4 +++-
2 files changed, 28 insertions(+), 9 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -169,8 +169,6 @@ struct task_group {
/* runqueue "owned" by this group on each cpu */
struct cfs_rq **cfs_rq;
unsigned long shares;
- /* spinlock to serialize modification to shares */
- spinlock_t lock;
struct rcu_head rcu;
};

@@ -182,6 +180,11 @@ static DEFINE_PER_CPU(struct cfs_rq, ini
static struct sched_entity *init_sched_entity_p[NR_CPUS];
static struct cfs_rq *init_cfs_rq_p[NR_CPUS];

+/* task_group_mutex serializes add/remove of task groups and also changes to
+ * a task group's cpu shares.
+ */
+static DEFINE_MUTEX(task_group_mutex);
+
/* Default task group.
* Every task in system belong to this group at bootup.
*/
@@ -222,9 +225,21 @@ static inline void set_task_cfs_rq(struc
p->se.parent = task_group(p)->se[cpu];
}

+static inline void lock_task_group_list(void)
+{
+ mutex_lock(&task_group_mutex);
+}
+
+static inline void unlock_task_group_list(void)
+{
+ mutex_unlock(&task_group_mutex);
+}
+
#else

static inline void set_task_cfs_rq(struct task_struct *p, unsigned int cpu) { }
+static inline void lock_task_group_list(void) { }
+static inline void unlock_task_group_list(void) { }

#endif /* CONFIG_FAIR_GROUP_SCHED */

@@ -6747,7 +6762,6 @@ void __init sched_init(void)
se->parent = NULL;
}
init_task_group.sh...

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>, Dipankar <dipankar@...>
Date: Monday, November 26, 2007 - 1:02 am

Minor cleanups:

- Fix coding style
- remove obsolete comment

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 21 +++------------------
1 files changed, 3 insertions(+), 18 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
};

#ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD 2*NICE_0_LOAD
#else
-# define INIT_TASK_GRP_LOAD NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif

-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;

/* return group to which a task belongs */
static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i

#define sched_class_highest (&rt_sched_class)

-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
update_load_add(&rq->load, p->se.load.weight);

--
Regards,
vatsa
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 8:28 am

Minor scheduler cleanups:

- Fix coding style
- remove obsolete comment
- Use list_for_each_entry_rcu when walking task group list

Signed-off-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

---
kernel/sched.c | 21 +++------------------
kernel/sched_fair.c | 5 ++++-
2 files changed, 7 insertions(+), 19 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -191,12 +191,12 @@ struct task_group init_task_group = {
};

#ifdef CONFIG_FAIR_USER_SCHED
-# define INIT_TASK_GRP_LOAD 2*NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD 2*NICE_0_LOAD /* root user's cpu share */
#else
-# define INIT_TASK_GRP_LOAD NICE_0_LOAD
+# define INIT_TASK_GROUP_LOAD NICE_0_LOAD
#endif

-static int init_task_group_load = INIT_TASK_GRP_LOAD;
+static int init_task_group_load = INIT_TASK_GROUP_LOAD;

/* return group to which a task belongs */
static inline struct task_group *task_group(struct task_struct *p)
@@ -864,21 +864,6 @@ iter_move_one_task(struct rq *this_rq, i

#define sched_class_highest (&rt_sched_class)

-/*
- * Update delta_exec, delta_fair fields for rq.
- *
- * delta_fair clock advances at a rate inversely proportional to
- * total load (rq->load.weight) on the runqueue, while
- * delta_exec advances at the same rate as wall-clock (provided
- * cpu is not idle).
- *
- * delta_exec / delta_fair is a measure of the (smoothened) load on this
- * runqueue over any given interval. This (smoothened) load is used
- * during load balance.
- *
- * This function is called /before/ updating rq->load
- * and when switching tasks.
- */
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
update_load_add(&rq->load, p->se.load.weight);
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel...

To: Srivatsa Vaddagiri <vatsa@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 9:08 am

hm, why is this a cleanup?

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: <dmitry.adamushko@...>, <a.p.zijlstra@...>, <dhaval@...>, <linux-kernel@...>, <efault@...>, <skumar@...>, Balbir Singh <balbir@...>
Date: Monday, November 19, 2007 - 11:01 am

Sorry for the wrong subject. It was supposed to include the above bug fix,
related to how we walk the task group list.

Thinking abt it now, I realize that print_cfs_rq() can potentially
sleep and hence it cannot be surrounded by rcu_read_lock()/unlock().

And as Dipankar just pointed me, sched_create/destroy_group aren't
serialized at all currently, so we need a mutex to protect them. The
same mutex can be then used when walking the list in print_cfs_stats() ..

Will send update patches soon ..

--
Regards,
vatsa
-

Previous thread: MMC sub-system: SDIO block-mode with increment address issue by Dean Jenkins on Monday, November 19, 2007 - 7:44 am. (7 messages)

Next thread: [PATCH 1/3] tty: Add the new termios2 ioctls to the compatible list. by Heiko Carstens on Monday, November 19, 2007 - 8:52 am. (4 messages)