There's a period of 10 ticks where calc_load_tasks is updated by all the
cpus for the load avg. Usually all the cpus do this during the first
tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
However, if they wake up calc_load_tasks is not incremented. Thus, if
cpus go idle during the 10 tick period, calc_load_tasks may be
decremented to a non-representative value. This issue can lead to
systems having a load avg of exactly 0, even though the real load avg
could theoretically be up to NR_CPUS.
This change defers calc_load_tasks accounting after each cpu updates the
count until after the 10 tick update window.
A few points:
* A global atomic deferral counter, and not per-cpu vars, is needed
because a cpu may go NOHZ idle and not be able to update the global
calc_load_tasks variable for subsequent load calculations.
* It is not enough to add calls to account for the load when a cpu is
awakened:
- Load avg calculation must be independent of cpu load.
- If a cpu is awakend by one tasks, but then has more scheduled before
the end of the update window, only the first task will be accounted.
BugLink: http://bugs.launchpad.net/bugs/513848
Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
Acked-by: Colin King <colin.king@canonical.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
---
kernel/sched.c | 24 ++++++++++++++++++++++--
1 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index abb36b1..be348cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3010,6 +3010,7 @@ unsigned long this_cpu_load(void)
/* Variables and functions for calc_load */
static atomic_long_t calc_load_tasks;
+static atomic_long_t calc_load_tasks_deferred;
static unsigned long calc_load_update;
unsigned long avenrun[3];
EXPORT_SYMBOL(avenrun);
@@ -3064,7 +3065,7 @@ void calc_global_load(void)
*/
static void calc_load_account_active(struct rq *this_rq)
{
- long nr_active, delta;
+ long ...OK, so what you're saying is that because we update calc_load_tasks from
entering idle, we decrease earlier than a regular 10 tick sample
interval would?
Hence you batch these early updates into _deferred and let the next 10
tick sample roll them over?
So the only early updates can come from
pick_next_task_idle()->calc_load_account_active(), so why not specialize
that callchain instead of the below?
Also, since its all NO_HZ, why not stick this in with the ILB? Once
people get around to making that scale better, this can hitch a ride.
Something like the below perhaps? It does run partially from softirq
context, but since there's a distinct lack of synchronization here that
didn't seem like an immediate problem.
---
kernel/sched.c | 10 ++++++----
kernel/sched_fair.c | 4 +++-
kernel/sched_idletask.c | 2 --
3 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 95eaecc..cdd4d8c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2959,6 +2959,11 @@ static void calc_load_account_active(struct rq *this_rq)
{
long nr_active, delta;
+ if (!time_after_eq(jiffies, this_rq->calc_load_update))
+ return;
+
+ this_rq->calc_load_update += LOAD_FREQ;
+
nr_active = this_rq->nr_running;
nr_active += (long) this_rq->nr_uninterruptible;
@@ -2998,10 +3003,7 @@ static void update_cpu_load(struct rq *this_rq)
this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i;
}
- if (time_after_eq(jiffies, this_rq->calc_load_update)) {
- this_rq->calc_load_update += LOAD_FREQ;
- calc_load_account_active(this_rq);
- }
+ calc_load_account_active(this_rq);
}
#ifdef CONFIG_SMP
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 88d3053..2c267ef 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -3394,9 +3394,11 @@ static void run_rebalance_domains(struct softirq_action *h)
if (need_resched())
break;
+ rq = ...To clarify, when I wrote that your patch was still below.. ;-) --
I understand everything until you move the calc_load_account_active call to run_rebalance_domains. I take it that when CPUs go NO_HZ idle, at least one cpu is left to monitor and perform updates as necessary. Conceptually, it makes sense that this cpu should be handling the load accounting updates. However, I'm new to this code, so I'm having a hard time understanding all the cases and timings for when the scheduler softirq is called. Is it guaranteed to be called during every 10 tick load update window? If not, then we'll have the issue where a NO_HZ idle cpu won't be updated to 0 running tasks in time for the load avg calculation. Would someone be able to explain how we are guaranteed of the correct timing for this path? I also have a concern with run_rebalance_domains: If the designated no_hz.load_balancer cpu wasn't idle at the last tick or needs rescheduling, load accounting won't occur for idle cpus. Is it possible for this to occur every time when called in the 10 tick update window? -- Chase --
(stripped the ubuntu kernel-team list, since it is generating bounces for each email) Ah, I overlooked that trigger_load_balance() already has a jiffy delay. I was ass-uming we triggered the softirq on each tick. Yes, that needs a bit of a fix to get called at least every 10 ticks, Right, so I didn't look too closely either, but was more or less going for the structure than the details. I haven't read through the ILB stuff current cpu is idle_at_tick, if not it will nominate another cpu to be ilb -- so I guess it neatly fits together, the !idle cpus fend for themselves and the idle ones get sorted by the ILB. Alternatively you could add the calc_load_tasks_deferred thing to the nohz structure and do it all from trigger_load_balance. Either approach would work if the ILB were extended to be per node or something like that (Venki used to work on that, not sure what happened to that). --
I really don't feel comfortable with this code to know how to implement either of these approaches myself. I have no issue how the fix is implemented. However, I worry that using the ILB code may be complex and/or require many little checks to ensure there are no improper interactions. Will we be sure that further changes to the ILB won't introduce new issues? In the end, what do we gain by using the ILB, and is it worth it to introduce that complexity? -- Chase --
Ok, so delaying the whole ILB angle for now, the below is a similar
approach to yours but with a more explicit code flow.
Does that work for you?
---
kernel/sched.c | 80 +++++++++++++++++++++++++++++++++++++++-------
kernel/sched_idletask.c | 3 +-
2 files changed, 68 insertions(+), 15 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 95eaecc..8ac02a9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1815,7 +1815,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
}
#endif
-static void calc_load_account_active(struct rq *this_rq);
+static void calc_load_account_idle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
@@ -2907,6 +2907,61 @@ static unsigned long calc_load_update;
unsigned long avenrun[3];
EXPORT_SYMBOL(avenrun);
+static long calc_load_fold_active(struct rq *this_rq)
+{
+ long nr_active, delta = 0;
+
+ nr_active = this_rq->nr_running;
+ nr_active += (long) this_rq->nr_uninterruptible;
+
+ if (nr_active != this_rq->calc_load_active) {
+ delta = nr_active - this_rq->calc_load_active;
+ this_rq->calc_load_active = nr_active;
+ }
+
+ return delta;
+}
+
+#ifdef CONFIG_NO_HZ
+/*
+ * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
+ *
+ * When making the ILB scale, we should try to pull this in as well.
+ */
+static atomic_long_t calc_load_tasks_idle;
+
+static void calc_load_account_idle(struct rq *this_rq)
+{
+ long delta;
+
+ delta = calc_load_fold_active(this_rq);
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks_idle);
+}
+
+static long calc_load_fold_idle(void)
+{
+ long delta = 0;
+
+ /*
+ * Its got a race, we don't care...
+ */
+ if (atomic_long_read(&calc_load_tasks_idle))
+ delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+
+ return delta;
+}
+#else
+static void calc_load_account_idle(struct rq *this_rq)
+{
+}
+
+static inline long ...This looks good. I'll run my test case to make sure it fixes the scenario we hit, and then I'll ack it when I've confirmed it works. Thanks -- Chase --
On Thu, Apr 22, 2010 at 9:18 AM, Chase Douglas I've run my test case and it seems to push the load avg numbers as expected. Acked-by: Chase Douglas <chase.douglas@canonical.com> BTW, I noticed some trailing whitespace, so I ran it through checkpatch.pl: ERROR: trailing whitespace #44: FILE: kernel/sched.c:2936: + $ Thanks -- Chase --
Commit-ID: 74f5187ac873042f502227701ed1727e7c5fbfa9 Gitweb: http://git.kernel.org/tip/74f5187ac873042f502227701ed1727e7c5fbfa9 Author: Peter Zijlstra <a.p.zijlstra@chello.nl> AuthorDate: Thu, 22 Apr 2010 21:50:19 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Fri, 23 Apr 2010 11:02:02 +0200 sched: Cure load average vs NO_HZ woes Chase reported that due to us decrementing calc_load_task prematurely (before the next LOAD_FREQ sample), the load average could be scewed by as much as the number of CPUs in the machine. This patch, based on Chase's patch, cures the problem by keeping the delta of the CPU going into NO_HZ idle separately and folding that in on the next LOAD_FREQ update. This restores the balance and we get strict LOAD_FREQ period samples. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1271934490.1776.343.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched.c | 80 +++++++++++++++++++++++++++++++++++++++------- kernel/sched_idletask.c | 3 +- 2 files changed, 68 insertions(+), 15 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index de0da71..0cc913a 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1815,7 +1815,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares) } #endif -static void calc_load_account_active(struct rq *this_rq); +static void calc_load_account_idle(struct rq *this_rq); static void update_sysctl(void); static int get_update_sysctl_factor(void); @@ -2950,6 +2950,61 @@ static unsigned long calc_load_update; unsigned long avenrun[3]; EXPORT_SYMBOL(avenrun); +static long calc_load_fold_active(struct rq *this_rq) +{ + long nr_active, delta = 0; + + nr_active = this_rq->nr_running; + nr_active += (long) this_rq->nr_uninterruptible; + + if (nr_active != this_rq->calc_load_active) { + delta = nr_active - ...
| Jesse Barnes | Re: [stable] [BUG][PATCH] cpqphp: fix kernel NULL pointer dereference |
| Greg KH | [003/136] p54usb: add Zcomax XG-705A usbid |
| Magnus Damm | [PATCH 03/07] ARM: Use shared GIC entry macros on Realview |
| Oliver Neukum | Re: [Bug #13682] The webcam stopped working when upgrading from 2.6.29 to 2.6.30 |
| Martin Schwidefsky | Re: [PATCH] optimized ktime_get[_ts] for GENERIC_TIME=y |
git: | |
| Junio C Hamano | Re: Some advanced index playing |
| Jeff King | Re: confusion over the new branch and merge config |
| Robin Rosenberg | Re: cvs2svn conversion directly to git ready for experimentation |
| Linus Torvalds |
