After further investigation and cross-checking with the thread "PROBLEM: Unusually high load average when idle in 2.6.35, 2.6.35.1 and later", I came to the following results: - the commit 74f5187ac873042f502227701ed1727e7c5fbfa9 isolated by Tim seems to be the culprit; - reverting it solves the problem with 2.6.36-rc7 in NOHZ mode: the load when idle goes down to 0.00 (which it never does with the patch applied) - using nohz=no with the commit reverted still gives correct behaviour (tested just in case) - vanilla 2.6.36-rc7 with this commit applied has the problem (load is around 0.60 when machine idle, sometimes less after several hours of uptime, but never 0.00), and rebooting with nohz=no makes the problem disappear: load goes down to 0.00 quickly after boot process has finished. I hope this answers the questions raised in the Tim's thread. Could someone with knowledge of the commit take a look at the problem? It would be a bit annoying to have this problem in 2.6.36, since Tim's initial report dates back to 2 weeks ago... I can help for further testing if needed. Thanks in advance, -- Damien --
Sorry I haven't been as responsive to this issue as I would have liked. I've been rather busy on other work. My biggest testing concern is that in reality, load on a normal desktop machine (i.e. not some stripped down machine disconnected from network or any other input running nothing but busybox) should not be 0.00. Maybe load on a server doing absolutely nothing could be 0.00, but there's usually something going on that should bump it up to a few hundredths. Watch top, and if you see at least 1% constant cpu usage there, your load average should be at least 0.01. That said, there does seem to be a bug somewhere as a load of 0.60 on an idle machine seems high. -- Chase --
FWIW, I'm using htop which maybe behaves differently than top, but after a few seconds or tens of seconds, values are really stabilizing at 0.00 (displaying 0.01 from time to time, but quite rarely). I also get 0.00 through calling "uptime". These tests have been done on a big desktop machine accessed remotely through ssh, so the usual graphical environment eyecandy and Web browser are not running at all, and I have been careful not to run heavy processes in parallel, so reaching 0.00 doesn't seem abnormal in that case. As you wrote, the "bad" case with the commit applied and nohz=yes really seems wrong because in the same conditions, idle load is several tens orders of magnitude larger. -- Damien --
Right, so I think I figured out what's happening.
We're folding sucessive idles of the same cpu into the total idle
number, which is inflating things.
+/*
+ * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
+ *
+ * When making the ILB scale, we should try to pull this in as well.
+ */
+static atomic_long_t calc_load_tasks_idle;
+
+static void calc_load_account_idle(struct rq *this_rq)
+{
+ long delta;
+
+ delta = calc_load_fold_active(this_rq);
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks_idle);
+}
+
+static long calc_load_fold_idle(void)
+{
+ long delta = 0;
+
+ /*
+ * Its got a race, we don't care...
+ */
+ if (atomic_long_read(&calc_load_tasks_idle))
+ delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+
+ return delta;
+}
If you look at that and imagine CPU1 going idle with 1 task blocked,
then waking up due to unblocking, then going idle with that same task
block, etc.. all before we fold_idle on an active cpu, then we can count
that one task many times over.
I haven't come up with a sane patch yet, hackery below, but that does
let my 24-cpu system idle into load 0.0x instead of the constant 1.x it
had before.
Beware, utter hackery below.. lots of races not sure it matters but it
ain't pretty..
Anybody got a bright idea here?
---
kernel/sched.c | 32 ++++++++++++++++++++++++++++++--
kernel/sched_idletask.c | 1 +
2 files changed, 31 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 91c19db..ac4512d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -521,6 +521,9 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_inactive;
+#endif
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1817,6 +1820,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
...OK, I came up with the below, but its not quite working, load continues
to decrease even though I've got a make -j64 running..
Thomas, Chase, any clue?
---
kernel/sched.c | 31 +++++++++++++++++++++++++------
kernel/sched_idletask.c | 1 +
2 files changed, 26 insertions(+), 6 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3312c64..a56446b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -521,6 +521,10 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_idle;
+ int calc_load_seq;
+#endif
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1817,6 +1821,7 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
#endif
static void calc_load_account_idle(struct rq *this_rq);
+static void calc_load_account_nonidle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);
@@ -2978,14 +2983,25 @@ static long calc_load_fold_active(struct rq *this_rq)
* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;
static void calc_load_account_idle(struct rq *this_rq)
{
- long delta;
+ long idle;
- delta = calc_load_fold_active(this_rq);
- if (delta)
- atomic_long_add(delta, &calc_load_tasks_idle);
+ idle = calc_load_fold_active(this_rq);
+ this_rq->calc_load_idle = idle;
+
+ if (idle) {
+ this_rq->calc_load_seq = atomic_read(&calc_load_seq);
+ atomic_long_add(idle, &calc_load_tasks_idle);
+ }
+}
+
+static void calc_load_account_nonidle(struct rq *this_rq)
+{
+ if (atomic_read(&calc_load_seq) == this_rq->calc_load_seq)
+ atomic_long_sub(this_rq->calc_load_idle, &calc_load_tasks_idle);
}
static long calc_load_fold_idle(void)
@@ -2993,10 +3009,13 @@ static long calc_load_fold_idle(void)
long delta = 0;
...In fact, after several hours of uptime, I also came into a situation of the load being around 0.80 or 0.60 when idle with the commit reverted. So IMHO, just reverting is not a better option than keeping the offending commit, and a real rework of the code is needed to clean up the situation. Should'nt we enlarge the list of CC, because for now, responsivity has been close to 0 and it seems we will get a 2.6.36 with buggy load avg calculation. Even if it is only statistics, many supervision tools rely on the load avg, so for production environments, this is not a good thing. Best, -- Damien --
It already contains all the folks who know the code I'm afraid.. :/ I've been playing with it a bit more today, but haven't actually managed to make it better, just differently worse.. --
Ah, I just remembered Venki recently poked at this code too, maybe he's got a bright idea.. Venki, there are cpu-load issues, the reported issue is that idle load is too high, and I think I can see that happening with the current code (due to 74f5187ac8). The flaw I can see in that commit is that we can go idle multiple times during the LOAD_FREQ window, which will basically inflate the idle contribution. All attempts from me to fix that so far have resulted in curious results.. Would you have a moment to also look at this? --
Just for reference this is my latest patch.. I figured that since its
NOHZ related it should actually be keyed of off the nohz code, not going
idle.
---
include/linux/sched.h | 8 ++++++++
kernel/sched.c | 28 +++++++++++++++++++++-------
kernel/sched_idletask.c | 1 -
kernel/time/tick-sched.c | 2 ++
4 files changed, 31 insertions(+), 8 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0383601..5311ef4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -145,6 +145,14 @@ extern unsigned long this_cpu_load(void);
extern void calc_global_load(void);
+#ifdef CONFIG_NO_HZ
+extern void calc_load_account_idle(void);
+extern void calc_load_account_nonidle(void);
+#else
+static inline void calc_load_account_idle(void) { }
+static inline void calc_load_account_nonidle(void) { }
+#endif
+
extern unsigned long get_parent_ip(unsigned long addr);
struct seq_file;
diff --git a/kernel/sched.c b/kernel/sched.c
index abf8440..79a29e6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -526,6 +526,10 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_inactive;
+ int calc_load_seq;
+#endif
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1833,7 +1837,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
}
#endif
-static void calc_load_account_idle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);
@@ -3111,16 +3114,29 @@ static long calc_load_fold_active(struct rq *this_rq)
* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;
-static void calc_load_account_idle(struct rq *this_rq)
+void calc_load_account_idle(void)
{
+ struct rq *this_rq = this_rq();
long ...So that should read: that atomic_long_sub() Trouble is, load goes down with that patch fixed, it just never goes --
OK, how does this work for people? I find my idle load is still a tad
high, but maybe I'm not patient enough.
---
include/linux/sched.h | 8 ++++++++
kernel/sched.c | 32 ++++++++++++++++++++++++++------
kernel/sched_idletask.c | 1 -
kernel/time/tick-sched.c | 2 ++
4 files changed, 36 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0383601..5311ef4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -145,6 +145,14 @@ extern unsigned long this_cpu_load(void);
extern void calc_global_load(void);
+#ifdef CONFIG_NO_HZ
+extern void calc_load_account_idle(void);
+extern void calc_load_account_nonidle(void);
+#else
+static inline void calc_load_account_idle(void) { }
+static inline void calc_load_account_nonidle(void) { }
+#endif
+
extern unsigned long get_parent_ip(unsigned long addr);
struct seq_file;
diff --git a/kernel/sched.c b/kernel/sched.c
index abf8440..f214222 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -526,6 +526,10 @@ struct rq {
/* calc_load related fields */
unsigned long calc_load_update;
long calc_load_active;
+#ifdef CONFIG_NO_HZ
+ long calc_load_inactive;
+ int calc_load_seq;
+#endif
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
@@ -1833,7 +1837,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares)
}
#endif
-static void calc_load_account_idle(struct rq *this_rq);
static void update_sysctl(void);
static int get_update_sysctl_factor(void);
static void update_cpu_load(struct rq *this_rq);
@@ -3111,16 +3114,36 @@ static long calc_load_fold_active(struct rq *this_rq)
* When making the ILB scale, we should try to pull this in as well.
*/
static atomic_long_t calc_load_tasks_idle;
+static atomic_t calc_load_seq;
-static void calc_load_account_idle(struct rq *this_rq)
+void calc_load_account_idle(void)
{
+ struct rq *this_rq = this_rq();
long delta;
delta = ...Looks quite fine after some basic tests, much saner than without the patch. A bit slow to go down, but I reach 0.00 after enough time being idle. Can't tell about the behavior after hours of uptime, of course, but the values during the first minutes after bootup seems OK; without the patch, they are evidently wrong... Maybe commit this after some more reviews in the next 1 or 2 days, and maybe think about further tweaking during 2.6.37? -- Damien --
I haven't had a chance to keep up with the topic, and I apologize. I'll be testing this as soon as I can finish compiling it. Thank you all for not letting this go unfixed. Tim McGrath
Uhh, problem. This patch does not apply to git checkout 74f5187ac873042f502227701ed1727e7c5fbfa9 which is the version of the kernel that first exhibits this flaw. which version of the kernel does this patch apply cleanly to? Tim McGrath
Try -tip (which includes the scheduler development tree as well): http://people.redhat.com/mingo/tip.git/README Thanks, Ingo --
Tried that, patch still doesn't apply... and I just figured out why. Looks like my email client is screwing the patch up. mutt apparently wants to chew on my mail before I get it. viewing the mail as an attachment and saving it works properly however. Now that I have properly saved the mail, it applies cleanly to tip/master as well as 74f5187ac873042f502227701ed1727e7c5fbfa9 - though in the latter's case it's having to fuzz around a bit. I'll try testing 74f5187ac873042f502227701ed1727e7c5fbfa9 first since it's the one I *know* is flawed, and I want to reduce the amount of changes that I have to test for. I'll build and test it, then let you guys know if there's any noticable difference. Tim McGrath
Now that I've actually had a chance to boot the kernel with the patch applied I'm sorry to say but the load average isn't decaying as fast as it ought to, at the very least. My machine's been idle for the last ten minutes but the one minute average is still at 0.89 and shooting up to 1.5, the 5 min average is 0.9, and the 15 min average is .68 and climbing. Even as I'm writing this the averages are continuing to drop, but *very* slowly. Glacially, almost. The one minute average is continuing to randomly spike high for no reason I can tell as well. I'll let you guys know if this actually bottoms out at some point. Tim McGrath
It did not. When I came home and checked, my load average was a steady 0.7-0.8 across the board on all averages with the machine idle since six hours ago. I guess the patch didn't fix the problem for me. If you want, I'll try building master/tip with the patch applied, but I doubt it'll really be different. On the plus side, the patch did do something - it seems much less erratic than it used to be for whatever reason, and now just has a very steady load average rather than jumping about as it does without the patch applied. I wish I understood the code enough to know what is going wrong here. I have to wonder what impact the original bug was causing. It seems to me that if it only affected a few people it might be worth backing out the patch and rethinking the problem it was meant to fix. On the other hand, if there's some way of diagnosing the problem I'm all for it - is there some kprintfs or something I could put in the code to find out when it's doing unlikely or 'impossible' things? There is serious weirdness going on here and I'd like to figure out the cause of it. I get the impression we're all bumbling about in the dark poking at a gigantic elephant and getting the wrong impressions. Tim McGrath
Ok, so while trying to write a changelog on this patch I got myself terribly confused again.. calc_load_active_fold() is a relative operation and simply gives delta values since the last time it got called. That means that the sum of multiple invocations in a given time interval should be identical to a single invocation. Therefore, the going idle multiple times during LOAD_FREQ hypothesis doesn't really make sense. Even if it became idle but wasn't idle at the LOAD_FREQ turn-over it shouldn't matter, since the calc_load_account_active() call will simply fold the remaining delta with the accrued idle delta and the total should all match up once we fold into the global calc_load_tasks. So afaict its should all have worked and this patch is a big NOP,. except it isn't.. Damn I hate this bug.. ;-) Anybody? --
Yes. Thats what I was thinking trying to understand this code yesterday. Also with sequence number I don't think nr_interruptible would be handled correctly as tasks can move to CPU after it first went idle and may not get accounted later. I somehow feel the problem is with nr_interruptible, which gets accounted multiple times on idle tasks and only once per LOAD_FREQ on busy tasks. However, things are not fully clear to me yet. Have to look at the code a bit more. Thanks, --
I started making small changes to the code, but none of the change helped much. I think the problem with the current code is that, even though idle CPUs update load, the fold only happens when one of the CPU is busy and we end up taking its load into global load. So, I tried to simplify things and doing the updates directly from idle loop. This is only a test patch, and eventually we need to hook it off somewhere else, instead of idle loop and also this is expected work only as x86_64 right now. Peter: Do you think something like this will work? loadavg went quite on two of my test systems after this change (4 cpu and 24 cpu). Thanks, Venki --- arch/x86/kernel/process_64.c | 2 + kernel/sched.c | 67 +++++++++++------------------------------ kernel/sched_idletask.c | 1 - 3 files changed, 20 insertions(+), 50 deletions(-) diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 3d9ea53..aaa8025 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -101,6 +101,7 @@ static inline void play_dead(void) } #endif +void idle_load_update(void); /* * The idle thread. There's no useful work to be * done, so just try to conserve power and have a @@ -140,6 +141,7 @@ void cpu_idle(void) stop_critical_timings(); pm_idle(); start_critical_timings(); + idle_load_update(); trace_power_end(smp_processor_id()); diff --git a/kernel/sched.c b/kernel/sched.c index dc85ceb..6d589c1 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1819,7 +1819,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares) } #endif -static void calc_load_account_idle(struct rq *this_rq); static void update_sysctl(void); static int get_update_sysctl_factor(void); static void update_cpu_load(struct rq *this_rq); @@ -2959,11 +2958,12 @@ static unsigned long calc_load_update; unsigned long avenrun[3]; EXPORT_SYMBOL(avenrun); -static long ...
(Sorry about the subjectless earlier mail) I started making small changes to the code, but none of the change helped much. I think the problem with the current code is that, even though idle CPUs update load, the fold only happens when one of the CPU is busy and we end up taking its load into global load. So, I tried to simplify things and doing the updates directly from idle loop. This is only a test patch, and eventually we need to hook it off somewhere else, instead of idle loop and also this is expected work only as x86_64 right now. Peter: Do you think something like this will work? loadavg went quite on two of my test systems after this change (4 cpu and 24 cpu). Thanks, Venki --- arch/x86/kernel/process_64.c | 2 + kernel/sched.c | 67 +++++++++++------------------------------ kernel/sched_idletask.c | 1 - 3 files changed, 20 insertions(+), 50 deletions(-) diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 3d9ea53..aaa8025 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -101,6 +101,7 @@ static inline void play_dead(void) } #endif +void idle_load_update(void); /* * The idle thread. There's no useful work to be * done, so just try to conserve power and have a @@ -140,6 +141,7 @@ void cpu_idle(void) stop_critical_timings(); pm_idle(); start_critical_timings(); + idle_load_update(); trace_power_end(smp_processor_id()); diff --git a/kernel/sched.c b/kernel/sched.c index dc85ceb..6d589c1 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -1819,7 +1819,6 @@ static void cfs_rq_set_shares(struct cfs_rq *cfs_rq, unsigned long shares) } #endif -static void calc_load_account_idle(struct rq *this_rq); static void update_sysctl(void); static int get_update_sysctl_factor(void); static void update_cpu_load(struct rq *this_rq); @@ -2959,11 +2958,12 @@ static unsigned long calc_load_update; unsigned long avenrun[3]; ...
I'd really like to be able to help test this, but with a 32bit x86 machine I guess this patch won't do anything for me. How hard would it be to mangle this into working for my machine or should I just wait? Tim McGrath
Not really, CPUs can stay idle for _very_ long times (!x86 cpus that don't have crappy timers like HPET which roll around every 2-4 seconds). But all CPUs staying idle for a long time is exactly the scenario you fix before using the decay_load_misses() stuff, except that is for the load-balancer per-cpu load numbers not the global cpu load avg. Won't a similar approach work here? --
Yes. Thought about that. One problem there is that works with nohz_idle_balance, which will not be called if all the CPUs are idle for example. As this is once in 5 secs, probably doing nr_running() and nr_uninterruptible() should be OK even on huge systems. But, that was the original code here, except that it was inside xtime_lock. Thanks, Venki --
The crude patch would be something like the below.. a smarter patch will
try and avoid that loop.
---
include/linux/sched.h | 2 +-
kernel/sched.c | 20 +++++++++-----------
kernel/timer.c | 2 +-
3 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7a6e81f..84c1bf1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);
-extern void calc_global_load(void);
+extern void calc_global_load(int ticks);
extern unsigned long get_parent_ip(unsigned long addr);
diff --git a/kernel/sched.c b/kernel/sched.c
index 41f1869..49a2baf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3171,22 +3171,20 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
* calc_load - update the avenrun load estimates 10 ticks after the
* CPUs have updated calc_load_tasks.
*/
-void calc_global_load(void)
+void calc_global_load(int ticks)
{
- unsigned long upd = calc_load_update + 10;
long active;
- if (time_before(jiffies, upd))
- return;
-
- active = atomic_long_read(&calc_load_tasks);
- active = active > 0 ? active * FIXED_1 : 0;
+ while (!time_before(jiffies, calc_load_update + 10)) {
+ active = atomic_long_read(&calc_load_tasks);
+ active = active > 0 ? active * FIXED_1 : 0;
- avenrun[0] = calc_load(avenrun[0], EXP_1, active);
- avenrun[1] = calc_load(avenrun[1], EXP_5, active);
- avenrun[2] = calc_load(avenrun[2], EXP_15, active);
+ avenrun[0] = calc_load(avenrun[0], EXP_1, active);
+ avenrun[1] = calc_load(avenrun[1], EXP_5, active);
+ avenrun[2] = calc_load(avenrun[2], EXP_15, active);
- calc_load_update += LOAD_FREQ;
+ calc_load_update += LOAD_FREQ;
+ }
}
/*
diff --git a/kernel/timer.c b/kernel/timer.c
index d6ccb90..9f82b2a 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1297,7 +1297,7 @@ void ...a1 = a0 * e + a * (1 - e)
a2 = a1 * e + a * (1 - e)
= (a0 * e + a * (1 - e)) * e + a * (1 - e)
= a0 * e^2 + a * (1 - e) * (1 + e)
a3 = a2 * e + a * (1 - e)
= (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
= a0 * e^3 + a * (1 - e) * (1 + e + e^2)
an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1)
= a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
= a0 * e^n + a * (1 - e^n)
the trouble seems to be that that is a rather slow function, stuffing
that in a table will either give us very large tables or very coarse
decay.
n: an an * 2048
1: 0.99446 2037
2: 0.98895 2025
4: 0.978023 2003
8: 0.956529 1959
16: 0.914947 1874
32: 0.837128 1714
64: 0.700784 1435
128: 0.491098 1006
256: 0.241177 494
512: 0.0581666 119
1024: 0.00338335 7
2048: 1.14471e-05 0
--
not tested other than compile, but how does that look...?
---
include/linux/sched.h | 2 +-
kernel/sched.c | 74 +++++++++++++++++++++++++++++++++++++++++++++---
kernel/timer.c | 2 +-
3 files changed, 71 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b3d07df..f878db3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);
-extern void calc_global_load(void);
+extern void calc_global_load(int ticks);
extern unsigned long get_parent_ip(unsigned long addr);
diff --git a/kernel/sched.c b/kernel/sched.c
index 41f1869..125417e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3168,10 +3168,64 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
}
/*
+ * fixed_power_int - compute: x^n in O(log n) time
+ * @x: base of the power
+ * frac_bits: fractional bits of @x
+ * @n: power to raise @x to.
+ */
+static unsigned long
+fixed_power_int(unsigned long x, unsigned int frac_bits, unsigned int n)
+{
+ unsigned long result = 1UL << frac_bits;
+
+ while (n) {
+ if (n & 1) {
+ result *= x;
+ result += 1UL << (frac_bits - 1);
+ result >>= frac_bits;
+ }
+ n >>= 1;
+ x *= x;
+ x >>= frac_bits;
+ }
+
+ return result;
+}
+
+/*
+ * a1 = a0 * e + a * (1 - e)
+ *
+ * a2 = a1 * e + a * (1 - e)
+ * = (a0 * e + a * (1 - e)) * e + a * (1 - e)
+ * = a0 * e^2 + a * (1 - e) * (1 + e)
+ *
+ * a3 = a2 * e + a * (1 - e)
+ * = (a0 * e^2 + a * (1 - e) * (1 + e)) * e + a * (1 - e)
+ * = a0 * e^3 + a * (1 - e) * (1 + e + e^2)
+ *
+ * an = a0 * e^n + a * (1 - e) * (1 + e + ... + e^n-1)
+ * = a0 * e^n + a * (1 - e) * (1 - e^n)/(1 - e)
+ * = a0 * e^n + a * (1 - e^n)
+ */
+static unsigned long
+calc_load_n(unsigned long load, unsigned long exp,
+ unsigned long active, unsigned int n)
+{
+ unsigned long en = ...Hi Peter, I finally got around to building a test kernel for a user who had been reporting this against F-14 (2.6.35) but they report there's not much improvement. Sorry. :/ <quote> Date: Tue, 09 Nov 2010 12:46:53 -0600 From: Michael Cronenworth <mike@cchtml.com> To: test@lists.fedoraproject.org Subject: Re: Virtually impossible to keep load under 0.5 on idle desktop with F14+kms Running this kernel is not any better. After leaving my system idle for 10 minutes I finally saw it dip below 0.10, but after leaving the computer, and coming back 10 minutes later, I see a load of 0.25. No programs running. Before I hit "send" on this message, I ran uptime again and got this: $ uptime 12:45:24 up 32 min, 3 users, load average: 0.82, 0.43, 0.34 Thunderbird was the only program running. 99.8% idle. </quote> regards, Kyle --
Right, I tested it myself yesterday and got similar results. Does the 'simple' patch from a few posts back work for people? http://lkml.org/lkml/2010/10/26/150 If that does work I fudged the fancy math, if that doesn't work either we're back to square zero. --
I seem to remember someone saying this would only work on x86-64; if this is meant to work with a 32bit intel single core processor, please let me know - I'd like to try it. It would also help if I knew which kernel version this simpler patch is meant to apply to. Thank you, Tim McGrath
Venki's patch was, the one I linked should work on every arch. --
Apparently it doesn't make a large amount of difference. :/ regards, Kyle --
Crud, ok, back to basics it is.. thanks for testing! --
Much like the other tester, this doesn't make any difference at all for me, load fluctuates semi randomly as it did before with an average after 20 mins of around 0.8 with the machine idle. I have to wonder, is there currently a reason NOT to revert this commit? Does this commit that's causing problems for myself and other people actually fix problems for some people that existed before? I just want to know why there seems to be a struggle to fix the current implementation rather than backing out the change that caused the problem I'm dealing with and then trying to work from there. If you've forgotten, I'm talking specifically about this commit: [74f5187ac873042f502227701ed1727e7c5fbfa9] sched: Cure load average vs NO_HZ woes We've gone through two kernel releases with this commit, and I think I'm understandably curious why it's not been reverted if it's causing problems with no solution in sight. Note that I'm currently running 2.6.35.8 with CONFIG_NO_HZ unset with no problems with the load average, so at least that works properly, and proves that I'm not being affected by some other quirk at least... Machine's only been up ten minutes or so, but here's the uptime data anyway for the curious: 00:10:02 up 12 min, 4 users, load average: 0.00, 0.11, 0.14 Tim McGrath
Hi Peter, Do you have a plan about solving this problem in the coming weeks or will it still be present in 2.6.37 final? I am not implying any critic in any way, just want to know a status because compared to other bugs discussed in lkml, this one has seen very little technical proposals towards a resolution (except your attempts). People who signed off the discussed commit did not make any technical analysis about the bug reported and this gives a strange feeling. Earth will not stop rotating if it is not solved, but I still this it is quite important and affects Even with this commit reverted, the load is incorrect in NO_HZ mode, so revert only is not enough. -- Damien Wyart --
Yes I agree, I've put the problem on the back burner for a bit so that I can forget all my tinkering and try again with a fresher mind :-) I'll try and get around to it somewhere this week again. --
How does this work for you? Its hideous but lets start simple.
---
kernel/sched.c | 5 +++++
kernel/timer.c | 14 ++++++++++++--
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..2ebb07c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3223,6 +3223,11 @@ static void calc_load_account_active(struct rq *this_rq)
this_rq->calc_load_update += LOAD_FREQ;
}
+void calc_load_account_this(void)
+{
+ calc_load_account_active(this_rq());
+}
+
/*
* The exact cpuload at various idx values, calculated at every tick would be
* load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load
diff --git a/kernel/timer.c b/kernel/timer.c
index 68a9ae7..af095c0 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1315,11 +1315,21 @@ void run_local_timers(void)
* jiffies is defined in the linker script...
*/
+extern void calc_load_account_this(void);
+
void do_timer(unsigned long ticks)
{
- jiffies_64 += ticks;
+ if (ticks > 1) {
+ while (ticks--) {
+ jiffies_64++;
+ calc_load_account_this();
+ calc_global_load();
+ }
+ } else {
+ jiffies_64 += ticks;
+ calc_global_load();
+ }
update_wall_time();
- calc_global_load();
}
#ifdef __ARCH_WANT_SYS_ALARM
--
At least one user reports noticeably lower idle load average using a kernel with that patch. Thanks, looks promising! Kyle --
Hi, Doesn't give wrong numbers like initial bug and tentative patches, but feels a bit too slow when numbers go up and down. Correct values are reached when waiting long enough, but it feels slow. As I've tested many combinations, maybe this is an impression because I do not remember about "normal" delays for the load to rise and fall, but this still feels slow. Thanks for working on this problem again. -- Damien Wyart --
Umm.. maybe the code still averages the 1/5/15 over that many minute's worth of ticks, which may be longer for a tickless kernel?
You can test this by either booting with nohz=off, or builting with CONFIG_NO_HZ=n and then comparing the result, something like make O=defconfig clean; while sleep 10; do uptime >> load.log; done & make -j32 O=defconfig; kill %1 And comparing the curves between the NO_HZ and !NO_HZ kernels. I'll try and make the patch less hideous ;-) --
I've tested this patch on my own use case, and it seems to work for the most part - it's still not settling as low as the previous implementation used to, nor is it settling as low as CONFIG_NO_HZ=N (that is to say, 0.00 across the board when not being used) however, this is definitely an improvement: 14:26:04 up 9:08, 5 users, load average: 0.05, 0.01, 0.00 This is the result of running uptime on a checked out version of [74f5187ac873042f502227701ed1727e7c5fbfa9] sched: Cure load average vs NO_HZ woes with the patch applied, starting X, and simply letting the machine sit idle for nine hours. For the brief period I spent watching it after boot, it quickly began settling down to a reasonable value, I only let it sit idle this long to verify the loadavg was consistently low. (the loadavg was consistently erratic, anywhere from 0.6 to 1.2 with the machine idle without this patch) Thank you for the hard work, Tim McGrath
Ok, that's good testing.. so its still not quite the same as NO_HZ=n,
how about this one?
(it seems to drop down to 0.00 if I wait a few minutes with top -d5)
---
kernel/sched.c | 5 +++++
kernel/time/tick-sched.c | 4 +++-
kernel/timer.c | 12 ++++++++++++
3 files changed, 20 insertions(+), 1 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 864040c..a859158 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3082,6 +3082,11 @@ static void calc_load_account_active(struct rq *this_rq)
this_rq->calc_load_update += LOAD_FREQ;
}
+void calc_load_account_this(void)
+{
+ calc_load_account_active(this_rq());
+}
+
/*
* The exact cpuload at various idx values, calculated at every tick would be
* load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e216e0..1e6d384 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -41,6 +41,8 @@ struct tick_sched *tick_get_tick_sched(int cpu)
return &per_cpu(tick_cpu_sched, cpu);
}
+extern void do_timer_nohz(unsigned long ticks);
+
/*
* Must be called with interrupts disabled !
*/
@@ -75,7 +77,7 @@ static void tick_do_update_jiffies64(ktime_t now)
last_jiffies_update = ktime_add_ns(last_jiffies_update,
incr * ticks);
}
- do_timer(++ticks);
+ do_timer_nohz(++ticks);
/* Keep the tick_next_period variable up to date */
tick_next_period = ktime_add(last_jiffies_update, tick_period);
diff --git a/kernel/timer.c b/kernel/timer.c
index d6ccb90..eb2646f 100644
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1300,6 +1300,18 @@ void do_timer(unsigned long ticks)
calc_global_load();
}
+extern void calc_load_account_this(void);
+
+void do_timer_nohz(unsigned long ticks)
+{
+ while (ticks--) {
+ jiffies_64++;
+ calc_load_account_this();
+ calc_global_load();
+ }
+ update_wall_time();
+}
+
#ifdef __ARCH_WANT_SYS_ALARM
/*
--
OK, so here's a less crufty patch that gets the same result on my
machine, load drops down to 0.00 after a while.
It seems a bit slower to reach 0.00, but that could be because I
actually changed the load computation for NO_HZ=n as well, I added a
rounding factor in calc_load(), we no longer truncate the division.
If people want to compare, simply remove the third line from
calc_load(): load += 1UL << (FSHIFT - 1), to restore the old behaviour.
---
include/linux/sched.h | 2 +-
kernel/sched.c | 127 ++++++++++++++++++++++++++++++++++++++++++++----
kernel/timer.c | 2 +-
3 files changed, 118 insertions(+), 13 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index de4d68f..a6e0c62 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -143,7 +143,7 @@ extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);
-extern void calc_global_load(void);
+extern void calc_global_load(unsigned long ticks);
extern unsigned long get_parent_ip(unsigned long addr);
diff --git a/kernel/sched.c b/kernel/sched.c
index 864040c..7868a18 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2978,6 +2978,15 @@ static long calc_load_fold_active(struct rq *this_rq)
return delta;
}
+static unsigned long
+calc_load(unsigned long load, unsigned long exp, unsigned long active)
+{
+ load *= exp;
+ load += active * (FIXED_1 - exp);
+ load += 1UL << (FSHIFT - 1);
+ return load >> FSHIFT;
+}
+
#ifdef CONFIG_NO_HZ
/*
* For NO_HZ we delay the active fold to the next LOAD_FREQ update.
@@ -3007,6 +3016,105 @@ static long calc_load_fold_idle(void)
return delta;
}
+
+/**
+ * fixed_power_int - compute: x^n, in O(log n) time
+ *
+ * @x: base of the power
+ * @frac_bits: fractional bits of @x
+ * @n: power to raise @x to.
+ *
+ * By exploiting the relation between the definition of the natural power
+ * function: x^n := x*x*...*x (x multiplied by itself for n ...Thanks Peter, I'll build a couple kernels for the bugzilla reporters and see what they turn up. --
Looks good, one tester says things are promising at least. Thanks so much for all your effort, Peter. cheers, Kyle --
Yes, really feels slower, but in the two configurations as you write. Then this is a matter of convention, do not know which version is more "correct" or "natural", but this is coherent in the two modes (HZ and NO_HZ). Do you plan to push this one to upstream/stable, or do you plan further modifications? Thanks for your patches, -- Damien Wyart --
No this is about it, I've written a bit more comments, and I'll need to come up with a changelog, I'll also wait for Kyle's testers to come back, but yes, then I'll push it upstream/stable. --
For some bizzare reason, this version has a small but noticable amount of jitter and never really seems to hit 0.00 on my machine, tends to jump around at low values between 0.03 to 0.08 on a routine basis: 16:20:42 up 16:31, 4 users, load average: 0.00, 0.01, 0.05 the jitter seems to have no visible reason for it happening; with no networking, disk access or a process waking up and demanding attention from the cpu, it goes back up. Mind this is obviously NOT as horrible as it was originally, but I'd like to find out why it's acting so differently. I'm going to try this variant again with that line you were talking about disabled and see if it reacts differently. I get the feeling if it's the rounding factor - since you say that was changed for BOTH nohz=y and n, that it's not really a problem in the first place, and likely is very low load that wasn't being accurately reported before. Tim McGrath
Indeed, this seems to be the case: 04:50:14 up 5:45, 5 users, load average: 0.00, 0.00, 0.00 the average seems to not be jittery, or at least noticably, and reacts as I have expected it to in the past with that single line disabled; Since you have said that this change would affect all load calculations I have not tested how this patch with the line enabled/disabled reacts with nohz=n, please let me know if you would like me to test that condition anyway. Personally since it changes the previous behavior of the load calculation I'd prefer that the rounding not be done. Tim McGrath
Commit-ID: 0f004f5a696a9434b7214d0d3cbd0525ee77d428 Gitweb: http://git.kernel.org/tip/0f004f5a696a9434b7214d0d3cbd0525ee77d428 Author: Peter Zijlstra <a.p.zijlstra@chello.nl> AuthorDate: Tue, 30 Nov 2010 19:48:45 +0100 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Wed, 8 Dec 2010 20:15:04 +0100 sched: Cure more NO_HZ load average woes There's a long-running regression that proved difficult to fix and which is hitting certain people and is rather annoying in its effects. Damien reported that after 74f5187ac8 (sched: Cure load average vs NO_HZ woes) his load average is unnaturally high, he also noted that even with that patch reverted the load avgerage numbers are not correct. The problem is that the previous patch only solved half the NO_HZ problem, it addressed the part of going into NO_HZ mode, not of comming out of NO_HZ mode. This patch implements that missing half. When comming out of NO_HZ mode there are two important things to take care of: - Folding the pending idle delta into the global active count. - Correctly aging the averages for the idle-duration. So with this patch the NO_HZ interaction should be complete and behaviour between CONFIG_NO_HZ=[yn] should be equivalent. Furthermore, this patch slightly changes the load average computation by adding a rounding term to the fixed point multiplication. Reported-by: Damien Wyart <damien.wyart@free.fr> Reported-by: Tim McGrath <tmhikaru@gmail.com> Tested-by: Damien Wyart <damien.wyart@free.fr> Tested-by: Orion Poplawski <orion@cora.nwra.com> Tested-by: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Cc: Chase Douglas <chase.douglas@canonical.com> LKML-Reference: <1291129145.32004.874.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/sched.h | 2 +- kernel/sched.c | 150 +++++++++++++++++++++++++++++++++++++++++++++---- kernel/timer.c | 2 +- 3 files changed, 141 ...
I haven't had time to test your further patches but THIS works! 14:57:03 up 14:01, 4 users, load average: 0.00, 0.00, 0.00 Load seems to finally be accurate on my machine compared to processes running/whatever else usage. This is again testing vs the original commit that caused the problems for me: [74f5187ac873042f502227701ed1727e7c5fbfa9] sched: Cure load average vs NO_HZ woes so I know I'm testing apples to apples here. As time permits I'll test the later replies you made to yourself. Thank you, Tim McGrath
In fact, timings were very similar between the two configurations, so I guess my feelings were wrong... They were also similar with the next version of the patch. -- Damien Wyart --
