Re: [patch] sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares.

Previous thread: [PATCH] cirrusfb: remove unused variables by Stephen Rothwell on Tuesday, November 18, 2008 - 11:17 pm. (1 message)

Next thread: [PATCH] Do not free io context when taking recursive faults in do_exit by Nikanth Karthikesan on Tuesday, November 18, 2008 - 11:45 pm. (2 messages)
From: Ken Chen
Date: Tuesday, November 18, 2008 - 11:41 pm

In the update_shares() path leading to tg_shares_up(), the calculation of
per-cpu cfs_rq shares is rather erratic even under moderate task wake up
rate.  The problem is that the per-cpu tg->cfs_rq load weight used in the
sd_rq_weight aggregation and actual redistribution of the cfs_rq->shares
are collected at different time.  Under moderate system load, we've seen
quite a bit of variation on the cfs_rq->shares and ultimately wildly
affects sched_entity's load weight.

This patch caches the result of initial per-cpu load weight when doing the
sum calculation, and then pass it down to update_group_shares_cpu() for
redistributing per-cpu cfs_rq shares.  This allows consistent total cfs_rq
shares across all CPUs. It also simplifies the rounding and zero load
weight check.

Signed-off-by: Ken Chen <kenchen@google.com>


diff --git a/kernel/sched.c b/kernel/sched.c
index 9b1e793..1ff78b6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1473,27 +1473,13 @@ static void
 update_group_shares_cpu(struct task_group *tg, int cpu,
 			unsigned long sd_shares, unsigned long sd_rq_weight)
 {
-	int boost = 0;
 	unsigned long shares;
 	unsigned long rq_weight;

 	if (!tg->se[cpu])
 		return;

-	rq_weight = tg->cfs_rq[cpu]->load.weight;
-
-	/*
-	 * If there are currently no tasks on the cpu pretend there is one of
-	 * average load so that when a new task gets to run here it will not
-	 * get delayed by group starvation.
-	 */
-	if (!rq_weight) {
-		boost = 1;
-		rq_weight = NICE_0_LOAD;
-	}
-
-	if (unlikely(rq_weight > sd_rq_weight))
-		rq_weight = sd_rq_weight;
+	rq_weight = tg->cfs_rq[cpu]->rq_weight;

 	/*
 	 *           \Sum shares * rq_weight
@@ -1501,7 +1487,7 @@ update_group_shares_cpu
 	 *               \Sum rq_weight
 	 *
 	 */
-	shares = (sd_shares * rq_weight) / (sd_rq_weight + 1);
+	shares = (sd_shares * rq_weight) / sd_rq_weight;
 	shares = clamp_t(unsigned long, shares, MIN_SHARES, MAX_SHARES);

 	if (abs(shares - tg->se[cpu]->load.weight) >
@@ ...
From: Peter Zijlstra
Date: Wednesday, November 19, 2008 - 9:47 am

Another thing we could possibly do is put a low-pass filter on the

This does indeed look much better, the cleanup factor alone makes it a
worthwhile patch, he fact that is improves behaviour makes it even
better :-)

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>


--

From: Ingo Molnar
Date: Wednesday, November 19, 2008 - 10:39 am

applied to tip/sched/core (for v2.6.29), thanks guys!

	Ingo
--

From: Peter Zijlstra
Date: Wednesday, November 19, 2008 - 12:19 pm

On Tue, 2008-11-18 at 22:41 -0800, Ken Chen wrote:

It just occurred to me that you lost the boost stuff.

The issue with the boost flag is that we have a different
  tg->cfs_rq[cpu]->shares
than the actual
  tg->cfs_se[cpu]->load.weight

So that the starvation shares don't count towards the total distributed
shares.

I guess this is an overestimate vs underestimate issue, you now

--

Previous thread: [PATCH] cirrusfb: remove unused variables by Stephen Rothwell on Tuesday, November 18, 2008 - 11:17 pm. (1 message)

Next thread: [PATCH] Do not free io context when taking recursive faults in do_exit by Nikanth Karthikesan on Tuesday, November 18, 2008 - 11:45 pm. (2 messages)