There's a period of 10 ticks where calc_load_tasks is updated by all the cpus for the load avg. Usually all the cpus do this during the first tick. If any cpus go idle, calc_load_tasks is decremented accordingly. However, if they wake up calc_load_tasks is not incremented. Thus, if cpus go idle during the 10 tick period, calc_load_tasks may be decremented to a non-representative value. This issue can lead to systems having a load avg of exactly 0, even though the real load avg could theoretically be up to NR_CPUS. This change defers calc_load_tasks accounting after each cpu updates the count until after the 10 tick period. BugLink: http://bugs.launchpad.net/bugs/513848 Signed-off-by: Chase Douglas <chase.douglas@canonical.com> --- kernel/sched.c | 16 ++++++++++++++-- 1 files changed, 14 insertions(+), 2 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index 9ab3cd7..c0aedac 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3064,7 +3064,8 @@ void calc_global_load(void) */ static void calc_load_account_active(struct rq *this_rq) { - long nr_active, delta; + static atomic_long_t deferred; + long nr_active, delta, deferred_delta; nr_active = this_rq->nr_running; nr_active += (long) this_rq->nr_uninterruptible; @@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq) if (nr_active != this_rq->calc_load_active) { delta = nr_active - this_rq->calc_load_active; this_rq->calc_load_active = nr_active; + + /* Need to defer idle accounting during load update period: */ + if (unlikely(time_before(jiffies, this_rq->calc_load_update) && + time_after_eq(jiffies, calc_load_update))) { + atomic_long_add(delta, &deferred); + return; + } + + deferred_delta = atomic_long_xchg(&deferred, 0); + delta += deferred_delta; + atomic_long_add(delta, &calc_load_tasks); } } @@ -3106,8 +3118,8 @@ static void update_cpu_load(struct rq *this_rq) } if (time_after_eq(jiffies, this_rq->calc_load_update)) ...
callback from leaving nohz mode, your proposed patch has no such thing. --
I believe what you're implying is that there should be a corresponding call for when a cpu is awakened to counter the accounting when a cpu goes to sleep during the 10 tick window. I don't think this is the correct approach because the load avg should be a snapshot in time. If there were 10 runnable tasks at the beginning of the load calculation window, then we should account for 10 tasks. If they all get serviced and the cpu goes to sleep, then wakes back up with one runnable task, we would account for only one task instead of the original 10. What if 9 more tasks then get added to the cpu's rq during the 10 tick update period? None of them would be accounted for either. -- Chase --
On Mon, 29 Mar 2010 09:41:12 -0400 There was useful information in the [patch 0/1] email, such as the offending commit ID. If possible, it's best to avoid the [patch 0/n] thing altogether - that information either has to be moved into the [patch 1/n] changelog by someone (ie: me), or it just gets ommitted and That seems a sensible way to avoid the gross-looking "10 ticks" thing. What was the reason for "update the avenrun load estimates 10 ticks after the CPUs have updated calc_load_tasks"? Can we do something The global `deferred' is unfortunate from a design and possibly scalability POV. Can it be moved into the `struct rq'? That way it can become a plain old `unsigned long', too. --
The reason was to make sure that all cpus have updated their stuff halfways before we start to calculate and we spread out stuff among cpus to avoid contention on those global atomic variables. Yes, we can do something smarter. First thing is to fix the nr_uninterruptible accounting which can become negative. Peter looked into that already and we really need to fix this first before doing anything smart about that load avg stuff. Thanks, tglx --
I noticed this too, and figured it was some clever accounting :). If this is a real bug, I'd be happy to take a look at it as well. What do you have in mind as a smarter way of fixing this issue, and why is it dependent on fixing the absolute values of each runqueue's nr_uninterruptible count? -- Chase --
Well, the uninterruptible count imbalance is preventing us to calc the load avg per cpu, which is something the power saving folks are interested in. If that is fixed we probably have a good chance to collect the per cpu stuff and build a combined one - have not looked into the gory details of the math yet. Also we need to think about a more clever way than just accounting the number of running and uninterruptible tasks. What we really want is to use the numbers which the scheduler has handy, i.e how many tasks did we run since we did the last loadavg calculation. It was always possible (even before the big loadavg changes) to create a task which consumes 50% of a CPU and never shows up in the loadavg calculations at all. Thanks, tglx --
I'm not sure I follow how knowing how many tasks have been run since the last LOAD_FREQ expiration will help, or is that just an example of the kind of data the scheduler has available that may be useful? -- Chase --
Yes, it's just an example. Look at the following (UP) scenario: -> tick calcs loadavg -> task is woken and runs for 50% of a tick -> task goes to sleep -> tick calcs loadavg So loadavg will say 0 forever, while we in reality know that there was significant work on that CPU - at least when we have a sched_clock which has a better granularity than the tick itself. Thanks, tglx --
The problem with all this per-cpu loadavg is that it would require a steady tick on each cpu to age this loadavg, which is quite in contradiction with power savings as they rather like the nohz feature. No idea yet on what to do about that. --
What if we stored the time at which a cpu goes idle in its rq. Then, when the time comes around to calculate the load avg one cpu should be able to scan the run queues of all the cpus and use their load avg if not idle, or use their load avg + idle timestamp math if idle. -- Chase --
The reason it's global is that any given cpu may go NOHZ idle. If they sleep through the next load accounting, then their deferred value won't get accounted for. By keeping it global and having it checked by every cpu, we ensure that in the worst case where only one cpu is awake to do the accounting, the deferred values from all cpus are taken into account. After a review by Andy Whitcroft on the Ubuntu kernel team, we decided to do a few things to improve performance: 1. Check if values are non-zero before atomically touching the deferred variable (load avg accounting is imprecise anyways, we just need to make sure we don't lose any tasks) 2. Rename the deferred variable to calc_load_tasks_deferred and move it right after calc_load_tasks to hopefully catch it in the same cache line Thomas Gleixner seems to think this isn't the best approach (later email in this thread), so I'm deferring sending it unless someone is still interested in this approach. -- Chase --
Well, it's a good workaround for now I think, I just answered Andrews question whether we can't do any better :) Thanks, tglx --
On Thu, 1 Apr 2010 22:01:59 +0200 (CEST) So... should we merge Chase's patch? --
I have no better solution right now. Peter ? Thanks, tglx --
I've made one small change to it. Checking the atomic .counter variable probably isn't the most portable, and atomic_long_read() should be portable and not add any overhead on sane platforms, so I swapped out the check against .counter with an atomic_long_read(). I'll resend it shortly. -- Chase --
