Re: [REGRESSION 2.6.30][PATCH v3] sched: update load count only once per cpu in 10 tick update window

Previous thread: Parallel port LCD driver undocumented by Michal Suchanek on Tuesday, April 13, 2010 - 4:13 pm. (3 messages)

Next thread: [GIT PULL] PM / hibernate fix for 2.6.34 by Rafael J. Wysocki on Tuesday, April 13, 2010 - 4:23 pm. (1 message)
From: Chase Douglas
Date: Tuesday, April 13, 2010 - 4:19 pm

There's a period of 10 ticks where calc_load_tasks is updated by all the
cpus for the load avg. Usually all the cpus do this during the first
tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
However, if they wake up calc_load_tasks is not incremented. Thus, if
cpus go idle during the 10 tick period, calc_load_tasks may be
decremented to a non-representative value. This issue can lead to
systems having a load avg of exactly 0, even though the real load avg
could theoretically be up to NR_CPUS.

This change defers calc_load_tasks accounting after each cpu updates the
count until after the 10 tick update window.

A few points:

* A global atomic deferral counter, and not per-cpu vars, is needed
  because a cpu may go NOHZ idle and not be able to update the global
  calc_load_tasks variable for subsequent load calculations.
* It is not enough to add calls to account for the load when a cpu is
  awakened:
  - Load avg calculation must be independent of cpu load.
  - If a cpu is awakend by one tasks, but then has more scheduled before
    the end of the update window, only the first task will be accounted.

BugLink: http://bugs.launchpad.net/bugs/513848

Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
Acked-by: Colin King <colin.king@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
---
 kernel/sched.c |   24 ++++++++++++++++++++++--
 1 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index abb36b1..be348cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3010,6 +3010,7 @@ unsigned long this_cpu_load(void)
 
 /* Variables and functions for calc_load */
 static atomic_long_t calc_load_tasks;
+static atomic_long_t calc_load_tasks_deferred;
 static unsigned long calc_load_update;
 unsigned long avenrun[3];
 EXPORT_SYMBOL(avenrun);
@@ -3064,7 +3065,7 @@ void calc_global_load(void)
  */
 static void calc_load_account_active(struct rq *this_rq)
 {
-	long nr_active, delta;
+	long ...
From: Peter Zijlstra
Date: Monday, April 19, 2010 - 11:52 am

OK, so what you're saying is that because we update calc_load_tasks from
entering idle, we decrease earlier than a regular 10 tick sample
interval would?

Hence you batch these early updates into _deferred and let the next 10
tick sample roll them over?

So the only early updates can come from
pick_next_task_idle()->calc_load_account_active(), so why not specialize
that callchain instead of the below?

Also, since its all NO_HZ, why not stick this in with the ILB? Once
people get around to making that scale better, this can hitch a ride.

Something like the below perhaps? It does run partially from softirq
context, but since there's a distinct lack of synchronization here that
didn't seem like an immediate problem.

---
 kernel/sched.c          |   10 ++++++----
 kernel/sched_fair.c     |    4 +++-
 kernel/sched_idletask.c |    2 --
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 95eaecc..cdd4d8c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2959,6 +2959,11 @@ static void calc_load_account_active(struct rq *this_rq)
 {
 	long nr_active, delta;
 
+	if (!time_after_eq(jiffies, this_rq->calc_load_update))
+		return;
+
+	this_rq->calc_load_update += LOAD_FREQ;
+
 	nr_active = this_rq->nr_running;
 	nr_active += (long) this_rq->nr_uninterruptible;
 
@@ -2998,10 +3003,7 @@ static void update_cpu_load(struct rq *this_rq)
 		this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
 	}
 
-	if (time_after_eq(jiffies, this_rq->calc_load_update)) {
-		this_rq->calc_load_update += LOAD_FREQ;
-		calc_load_account_active(this_rq);
-	}
+	calc_load_account_active(this_rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 88d3053..2c267ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3394,9 +3394,11 @@ static void run_rebalance_domains(struct softirq_action *h)
 			if (need_resched())
 				break;
 
+			rq = ...
From: Peter Zijlstra
Date: Monday, April 19, 2010 - 11:56 am

To clarify, when I wrote that your patch was still below.. ;-)

--

From: Chase Douglas
Date: Monday, April 19, 2010 - 1:16 pm

I understand everything until you move the calc_load_account_active
call to run_rebalance_domains. I take it that when CPUs go NO_HZ idle,
at least one cpu is left to monitor and perform updates as necessary.
Conceptually, it makes sense that this cpu should be handling the load
accounting updates. However, I'm new to this code, so I'm having a
hard time understanding all the cases and timings for when the
scheduler softirq is called. Is it guaranteed to be called during
every 10 tick load update window? If not, then we'll have the issue
where a NO_HZ idle cpu won't be updated to 0 running tasks in time for
the load avg calculation.

Would someone be able to explain how we are guaranteed of the correct
timing for this path?

I also have a concern with run_rebalance_domains: If the designated
no_hz.load_balancer cpu wasn't idle at the last tick or needs
rescheduling, load accounting won't occur for idle cpus. Is it
possible for this to occur every time when called in the 10 tick
update window?

-- Chase
--

From: Peter Zijlstra
Date: Monday, April 19, 2010 - 1:52 pm

(stripped the ubuntu kernel-team list, since it is generating bounces
for each email)



Ah, I overlooked that trigger_load_balance() already has a jiffy delay.
I was ass-uming we triggered the softirq on each tick.

Yes, that needs a bit of a fix to get called at least every 10 ticks,

Right, so I didn't look too closely either, but was more or less going
for the structure than the details. I haven't read through the ILB stuff
current cpu is idle_at_tick, if not it will nominate another cpu to be
ilb -- so I guess it neatly fits together, the !idle cpus fend for
themselves and the idle ones get sorted by the ILB.

Alternatively you could add the calc_load_tasks_deferred thing to the
nohz structure and do it all from trigger_load_balance.

Either approach would work if the ILB were extended to be per node or
something like that (Venki used to work on that, not sure what happened
to that).

--

From: Chase Douglas
Date: Monday, April 19, 2010 - 2:17 pm

I really don't feel comfortable with this code to know how to
implement either of these approaches myself. I have no issue how the
fix is implemented. However, I worry that using the ILB code may be
complex and/or require many little checks to ensure there are no
improper interactions. Will we be sure that further changes to the ILB
won't introduce new issues? In the end, what do we gain by using the
ILB, and is it worth it to introduce that complexity?

-- Chase
--

From: Peter Zijlstra
Date: Thursday, April 22, 2010 - 4:08 am

Ok, so delaying the whole ILB angle for now, the below is a similar
approach to yours but with a more explicit code flow.

Does that work for you?

---
 kernel/sched.c          |   80 +++++++++++++++++++++++++++++++++++++++-------
 kernel/sched_idletask.c |    3 +-
 2 files changed, 68 insertions(+), 15 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 95eaecc..8ac02a9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1815,7 +1815,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
 }
 #endif
 
-static void calc_load_account_active(struct rq *this_rq);
+static void calc_load_account_idle(struct rq *this_rq);
 static void update_sysctl(void);
 static int get_update_sysctl_factor(void);
 
@@ -2907,6 +2907,61 @@ static unsigned long calc_load_update;
 unsigned long avenrun[3];
 EXPORT_SYMBOL(avenrun);
 
+static long calc_load_fold_active(struct rq *this_rq)
+{
+	long nr_active, delta = 0;
+
+	nr_active = this_rq->nr_running;
+	nr_active += (long) this_rq->nr_uninterruptible;
+
+	if (nr_active != this_rq->calc_load_active) {
+		delta = nr_active - this_rq->calc_load_active;
+		this_rq->calc_load_active = nr_active;
+	}
+
+	return delta;
+}
+
+#ifdef CONFIG_NO_HZ
+/*
+ * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
+ *
+ * When making the ILB scale, we should try to pull this in as well.
+ */
+static atomic_long_t calc_load_tasks_idle;
+
+static void calc_load_account_idle(struct rq *this_rq)
+{
+	long delta;
+       
+	delta = calc_load_fold_active(this_rq);
+	if (delta)
+		atomic_long_add(delta, &calc_load_tasks_idle);
+}
+
+static long calc_load_fold_idle(void)
+{
+	long delta = 0;
+
+	/*
+	 * Its got a race, we don't care...
+	 */
+	if (atomic_long_read(&calc_load_tasks_idle))
+		delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+
+	return delta;
+}
+#else
+static void calc_load_account_idle(struct rq *this_rq)
+{
+}
+
+static inline long ...
From: Chase Douglas
Date: Thursday, April 22, 2010 - 6:18 am

This looks good. I'll run my test case to make sure it fixes the
scenario we hit, and then I'll ack it when I've confirmed it works.

Thanks

-- Chase
--

From: Chase Douglas
Date: Thursday, April 22, 2010 - 8:35 am

On Thu, Apr 22, 2010 at 9:18 AM, Chase Douglas

I've run my test case and it seems to push the load avg numbers as expected.

Acked-by: Chase Douglas <chase.douglas@canonical.com>

BTW, I noticed some trailing whitespace, so I ran it through checkpatch.pl:

ERROR: trailing whitespace
#44: FILE: kernel/sched.c:2936:
+       $

Thanks

-- Chase
--

From: tip-bot for Peter Zijlstra
Date: Friday, April 23, 2010 - 3:49 am

Commit-ID:  74f5187ac873042f502227701ed1727e7c5fbfa9
Gitweb:     http://git.kernel.org/tip/74f5187ac873042f502227701ed1727e7c5fbfa9
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Thu, 22 Apr 2010 21:50:19 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 23 Apr 2010 11:02:02 +0200

sched: Cure load average vs NO_HZ woes

Chase reported that due to us decrementing calc_load_task prematurely
(before the next LOAD_FREQ sample), the load average could be scewed
by as much as the number of CPUs in the machine.

This patch, based on Chase's patch, cures the problem by keeping the
delta of the CPU going into NO_HZ idle separately and folding that in
on the next LOAD_FREQ update.

This restores the balance and we get strict LOAD_FREQ period samples.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1271934490.1776.343.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c          |   80 +++++++++++++++++++++++++++++++++++++++-------
 kernel/sched_idletask.c |    3 +-
 2 files changed, 68 insertions(+), 15 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index de0da71..0cc913a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1815,7 +1815,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
 }
 #endif
 
-static void calc_load_account_active(struct rq *this_rq);
+static void calc_load_account_idle(struct rq *this_rq);
 static void update_sysctl(void);
 static int get_update_sysctl_factor(void);
 
@@ -2950,6 +2950,61 @@ static unsigned long calc_load_update;
 unsigned long avenrun[3];
 EXPORT_SYMBOL(avenrun);
 
+static long calc_load_fold_active(struct rq *this_rq)
+{
+	long nr_active, delta = 0;
+
+	nr_active = this_rq->nr_running;
+	nr_active += (long) this_rq->nr_uninterruptible;
+
+	if (nr_active != this_rq->calc_load_active) {
+		delta = nr_active - ...
Previous thread: Parallel port LCD driver undocumented by Michal Suchanek on Tuesday, April 13, 2010 - 4:13 pm. (3 messages)

Next thread: [GIT PULL] PM / hibernate fix for 2.6.34 by Rafael J. Wysocki on Tuesday, April 13, 2010 - 4:23 pm. (1 message)