I've been investigating a problem recently, in which N runnable CPU-bound tasks on an N-way machine run on only N-1 CPUs. The remaining CPU is almost 100% idle. I have seen it occur with both the CFS and O(1) schedulers. I've traced this down to what seems to be a quirk in the SMP balancer, whereby a high-priority thread which spends most of its time sleeping can artificially inflate the CPU load average calculated for one processor. Most of the time this CPU is idle (nr_running==0) yet its CPU load average is much higher than that of any other CPU. Please find attached a sample program which demonstrates this behaviour on a 2-way SMP machine. It creates three threads: two are CPU bound and run at the default priority, the third spends most of its time sleeping and runs at an elevated priority. It wakes up frequently (using /dev/rtc) and randomly generates some CPU load. On my machine (2-way Opteron with a vanilla 2.6.23.1 kernel) this test program will reliably put the scheduler into a state where one CPU has both of the busy-looping processes in its runqueue, and the other CPU is usually idle. The usually-idle CPU will have a very high cpu_load, as reported by /proc/sched_debug. Your mileage may vary. On some machines, this test program will only enter the "bad" state for a few seconds. Sometimes we bounce back and forth between good and bad states every few seconds. In all cases, removing the priority elevation fixes the balancing problem. Is this a behaviour any of the scheduler developers are aware of? I would be very greatful if anyone could shed some light on the root cause behind the inflated cpu_load average. If this turns out to be a real bug, I would be happy to work on a patch. Thanks in advance, Micah Dowty
I tried your program on my machine (C2D, 2.6.17, O(1) scheduler). Both CPUs are 100% busy all the time. Each busy-looping thread is running on its own CPU. I've been watching top output for 10 minutes, the spreading is stable and the threads don't bounce at all. greetings Cyrus -
As I said, YMMV. I haven't been able to find a single set of parameters for the demo program which cause the problem to occur 100% of the time on all systems. In general, boosting the MAINTHREAD_PRIORITY even more and increasing the WAKE_HZ should exaggerate the problem. These parameters reproduce the problem very reliably on my system: #define NUM_BUSY_THREADS 2 #define MAINTHREAD_PRIORITY -20 #define MAINTHREAD_WAKE_HZ 1024 #define MAINTHREAD_LOAD_PERCENT 5 #define MAINTHREAD_LOAD_CYCLES 2 It's also possible this problem doesn't occur on 2.6.17. I have only tested this example on 2.6.23.1 and 2.6.20 so far. Thanks, --Micah -
I tested a handful of kernel versions, and it looks like this is indeed the case. As far as I can tell, this problem was introduced between 2.6.19.7 and 2.6.20. Looking at the diff between those versions, it looks like there were some noteworthy changes to the SMP balancer. I'm still working on narrowing this down further. Thanks, --Micah -
First of all, since Ingo Molnar seems to be one of the head scheduler gurus, you might CC him on this. Also added a couple other useful CCs for regression reports. Well from these statistics; if you are requesting wakeups that often then it is probably *not* correct to try to move another thread to that CPU in the mean-time. Essentially the migration cost will likely far outweigh the advantage of letting it run a little bit of extra time, and in addition it will dump out cache from the high- priority thread. As per the description I think that an increased a priority and increased WAKE_HZ will certainly cause the "problem" to occur more, simply because it reduces the time between wakeups of the high-priority process and makes it less helpful to migrate another process over to that CPU during the sleep periods. This will also depend on your hardware and possibly other configuration parameters. I'm not really that much of an expert in this particular area, though, so it's entirely possible that one of the above-mentioned scheduler head-honchos will poke holes in my argument and give a better explanation or a possible patch. Cheers, Kyle Moffett -
The real problem, though, is that the high priority thread only needs
a total of a few percent worth of CPU time. The behaviour which I'm
reporting as a potential bug is that this process which needs very
little CPU time is effectively getting an entire CPU to itself,
despite the fact that other CPU-bound threads could benefit from
having time on that CPU.
One could argue that a thread with a high enough priority *should* get
a CPU all to itself even if it won't use that CPU- but that isn't the
behaviour I want. If this is in fact the intended effect of having a
high-priority thread wake very frequently, I should start a different
discussion about how to solve my specific problem without the use of
elevated priorities :)
I don't have any reason to believe, though, that this behaviour was
intentional. I just finished my "git bisect" run, and I found the
first commit after which I can reproduce the problem:
c9819f4593e8d052b41a89f47140f5c5e7e30582 is first bad commit
commit c9819f4593e8d052b41a89f47140f5c5e7e30582
Author: Christoph Lameter <clameter@sgi.com>
Date: Sun Dec 10 02:20:25 2006 -0800
[PATCH] sched: use softirq for load balancing
Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced
softirq.
We calculate the earliest time for each layer of sched domains to be rescanned
(this is the rescan time for idle) and use the earliest of those to schedule
the softirq via a new field "next_balance" added to struct rq.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Peter Williams <pwil3058@bigpond.net.au>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Thanks! I've also CC'ed ...The kernel crashes? Sounds like your application crashes with a divide by zero? -
Yes, the Python test harness crashes, not the kernel. It's just because on a kernel which exhibits this SMP balancer bug, within a couple of test iterations I'll hit a case where cpu1 was almost totally idle and the test harness divides by zero when calculating the imbalance. --Micah -
I still have some problem understanding what you are complaining about? The patch that you bisected had some issues that were fixed later by Siddha. -
On all kernels I've tested from after your patch was committed, I can reproduce a problem where a single high-priority thread which wakes up very frequently can artificially inflate the SMP balancer's load average for one CPU, causing other tasks to be migrated off that CPU. The result is that this high-priority thread (which may only use a few percent CPU) gets an entire CPU to itself. Even if there are several busy-looping threads running, this CPU will be mostly idle. In the first email I CC'ed you on, I attached a test program that I've been able to use to reproduce the issue. Feel free to ignore the Python program- it's just a harness I was using to test for the problem quickly during my git-bisect run. to reproduce the problem. The relevant code is all in priosched.c. Thanks, --Micah -
I am a bit at a loss as to how this could relate to the patch. This looks like a load balance logic issue that causes the load calculation to go wrong? -
My best guess is that this has something to do with the timing with which we sample the CPU's instantaneous load when calculating the load averages.. but I still understand only the basics of the scheduler and SMP balancer. All I really know for sure at this point regarding your patch is that git-bisect found it for me. It almost seems like the load average algorithm is ignoring the CPU's idle time, and only accounting for the time that CPU spends running processes. One of the symptoms is that the mostly-idle CPU in my test has an instantaneous load which is usually zero, but a very high load average. (9000, 30000, etc.) I want to help get to the bottom of this issue, but I was hoping that someone experienced with the Linux scheduler and SMP balancer would have some insight or some suggestions about what to try next. Thanks, --Micah -
hm, your code uses timeouts for this, right? The CPU load average that is used for SMP load balancing is sampled from the scheduler tick - and has been sampled from the scheduler tick for eons. v2.6.23 defaulted to a different method but v2.6.24 samples it from the tick again. So my guess is, your testcode behave similarly on 2.6.22 too, correct? Ingo -
My sample program (and the VMware code which I originally observed this behaviour with) wakes up frequently using /dev/rtc. I know the problem occurs on 2.6.20 and 2.6.23.1. I haven't tried 2.6.22 yet, but I'll do so as soon as I can. Thanks, --Micah -
Interesting.. here are the kernels I've tested so far, not including the git-bisect run I did between 2.6.19 and 2.6.20: 2.6.17 - 2.6.19 - 2.6.19.7 - 2.6.20 + 2.6.21 + 2.6.22 - 2.6.23.1 + Here a "-" means that the problem does not occur (my test program uses 100% of both CPUs) and a "+" means that the test program leaves one CPU mostly idle. Unless I've made a mistake, 2.6.22 seems like the outlier rather than 2.6.23. Is this inconsistent with the scheduler tick hypothesis? Thanks, --Micah -
yeah... but one of the kernels in question is actually 2.6.23.1, which does use the 'precise' accounting. Micah, could you try to change either : cat /proc/sys/kernel/sched_stat_granularity put it to the value equal to a tick on your system or just remove bit #3 (which is responsible for 8 == 1000) here: cat /proc/sys/kernel/sched_features (this one is enabled by default in 2.6.23.1) anyway, when it comes to calculating rq->cpu_load[], a nice(0) cpu-hog task (on cpu_0) may generate a similar load (contribute to rq->cpu_load[]) as e.g. some negatively reniced task (on cpu_1) which runs only periodically (say, once in a tick for N ms., etc.) [*] The thing is that the higher a prio of the task, the bigger 'weight' it has (prio_to_wait[] table in sched.c) ... and roughly, the load it generates is not only 'proportional' to 'run-time per fixed interval of time' but also to its 'weight'. That's why the [*] above. so you may have a situation : cpu_0 : e.g. a nice(-20) task running periodically every tick and generating, say ~10% cpu load ; cpu_1 : 2-3 nice(0) cpu-hog tasks ; both cpus may be seen with similar rq->load_cpu[]... yeah, one would argue that one of the cpu hogs could be migrated to cpu_0 and consume remaining 'time slots' and it would not "disturb" the nice(-20) task as : it's able to preempt the lower prio task whenever it want (provided, fine-grained kernel preemption) and we don't care that much of trashing of caches here. btw., without the precise load balancing, there can be situations when the nice(-20) (or say, a RT periodic task) can be even not seen (i.e. don't contribute to cpu_load[]) on cpu_0... we do sampling every tick (sched.c :: update_cpu_load()) and consider this_rq->ls.load.weight at this particular moment (that is the sum of 'weights' for all runnable tasks on this rq)... and it may well be that the aforementioned high-priority task is just never (or likely, rarely) runnable at this particular moment (it runs for short ...
Aha. Turning off bit 3 appears to instantly fix my problem while it's occurring in an existing process, and I can't reproduce it on any new Right. I gathered from reading the scheduler source earlier that the load average is intended to be proportional to the priority of the task, but I was really confused by the fairly nondeterministic effect Part of the problem may be that my high-priority task can run much more often than every tick. In my test case and in the VMware code that I originally observed the problem in, the thread can wake up based on /dev/rtc or on a device IRQ. Either of these can happen much When I try this, cpu0 has a cpu_load[] of over 10000 and cpu1 has a Yes, that's the behaviour I expected to see (and what my application Indeed. I think this is the major contributor to the nondeterminism I'm seeing. Thanks much, --Micah -
For consistency, I tested this using /dev/rtc. I set the rtc frequency
to 16 Hz, and replaced the main loop of my high (-19) priority thread
with:
while (1) {
unsigned long data;
for (i = 0; i < 3; i++) {
if (read(rtc, &data, sizeof data) != sizeof data) {
perror("read");
return 1;
}
}
fcntl(rtc, F_SETFL, O_NONBLOCK);
while (read(rtc, &data, sizeof data) < 0);
fcntl(rtc, F_SETFL, 0);
}
Now it's busy-looping for 62ms, and sleeping for three consecutive
62.5ms chunks totalling 187.5ms.
The results aren't quite what I was expecting. I have only observed
this so far in test cases where I have a very high wakeup frequency,
so I wasn't expecting this to work. I did, however, still observe the
problem where occasionally I get into a state where one CPU is mostly
idle.
Qualitatively, this feels a bit different. With the higher clock
frequency it seemed like the CPU would easily get "stuck" in this
state where it's mostly idle, and it would stay there for a long
time. With the low wakeup frequency, I'm seeing it toggle between the
busy and mostly-idle states more quickly.
I tried a similar test using usleep() and gettimeofday() rather than
/dev/rtc:
while (1) {
usleep(300000);
gettimeofday(&t1, NULL);
do {
gettimeofday(&t2, NULL);
} while (t2.tv_usec - t1.tv_usec +
(t2.tv_sec - t1.tv_sec) * 1000000 < 100000);
}
With this test program, I haven't yet seen a CPU imbalance that lasts
longer than a fraction of a second.
--Micah
-
You seem to have a configuration with domains which don't have SD_BALANCE_NEWIDLE on (CONFIG_NUMA?) as there are no events (all zeros above) for CPU_NEWLY_IDLE. this one is being triggered whenever a cpu becomes idle (schedule() --> idle_balance() --> load_balance_newidle()). (this flag is a bit #1 == 2) cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags ? it can be changed with "echo new_value > ..." does a change of it make any difference? moreover, /proc/sys/kernel/sched_domain/cpu1/domain0/newidle_idx seems to be responsible for a source of the load for calculating the busiest group. e.g. with newidle_idx == 0, the current load on the queue is -- Best regards, Dmitry Adamushko -
Hmm. I don't have this file on my system: root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# ls busy_factor busy_idx forkexec_idx idle_idx imbalance_pct max_interval min_interval newidle_idx wake_idx root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# uname -a Linux micah-64 2.6.23.1 #1 SMP Fri Nov 2 12:25:47 PDT 2007 x86_64 GNU/Linux Is there a config option I'm missing? Thanks, --Micah -
I have that one. I even posted the /proc/sched_debug output :) --Micah -
ah ok, try applying this patch on top of 2.6.23.1.
btw., what's your system? If I recall right, SD_BALANCE_NEWIDLE is on
by default for all configs, except for NUMA nodes.
(attached a white-space non-damaged version)
---
--- kernel/sched.c-old 2007-11-20 22:33:22.000000000 +0100
+++ kernel/sched.c 2007-11-20 22:37:07.000000000 +0100
@@ -5306,7 +5306,7 @@ set_table_entry(struct ctl_table *entry,
static struct ctl_table *
sd_alloc_ctl_domain_table(struct sched_domain *sd)
{
- struct ctl_table *table = sd_alloc_ctl_entry(14);
+ struct ctl_table *table = sd_alloc_ctl_entry(12);
set_table_entry(&table[0], "min_interval", &sd->min_interval,
sizeof(long), 0644, proc_doulongvec_minmax);
@@ -5326,10 +5326,10 @@ sd_alloc_ctl_domain_table(struct sched_d
sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct,
sizeof(int), 0644, proc_dointvec_minmax);
- set_table_entry(&table[10], "cache_nice_tries",
+ set_table_entry(&table[9], "cache_nice_tries",
&sd->cache_nice_tries,
sizeof(int), 0644, proc_dointvec_minmax);
- set_table_entry(&table[12], "flags", &sd->flags,
+ set_table_entry(&table[10], "flags", &sd->flags,
sizeof(int), 0644, proc_dointvec_minmax);
return table;
--
Best regards,
Dmitry Adamushko
It's a dual AMD64 Opteron. So, I recompiled my 2.6.23.1 kernel without NUMA support, and with your patch for scheduling domain flags in /proc. It looks like with NUMA disabled, my test case no longer shows the CPU imbalance problem. Cool. With NUMA disabled (and my test running smoothly), the flags show that SD_BALANCE_NEWIDLE is set: root@micah-64:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags 55 Next I turned it off: root@micah-64:~# echo 53 > /proc/sys/kernel/sched_domain/cpu0/domain0/flags root@micah-64:~# echo 53 > /proc/sys/kernel/sched_domain/cpu1/domain0/flags Oddly enough, I still don't observe the CPU imbalance problem. Now I reboot into a kernel which has NUMA re-enabled but which is otherwise identical. I verify that now I can reproduce the CPU imbalance again. root@micah-64:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags 1101 Now I set cpu[10]/domain0/flags to 1099, and the imbalance immediately disappears. I can reliably cause the imbalance again by setting it back to 1101, and remove the imbalance by setting them to 1099. Do these results make sense? I'm not sure I understand how SD_BALANCE_NEWIDLE could be the whole story, since my /proc/schedstat graphs do show that we continuously try to balance on idle, but we can't successfully do so because the idle CPU has a much higher load than the non-idle CPU. I don't understand how the problem I'm seeing could be related to the time at which we run the balancer, rather than being related to the load average calculation. Assuming the CPU imbalance I'm seeing is actually related to SD_BALANCE_NEWIDLE being unset, I have a couple of questions: - Is this intended/expected behaviour for a machine without NEWIDLE set? I'm not familiar with the rationale for disabling this flag on NUMA systems. - Is there a good way to detect, without any kernel debug flags set, whether the current machine has any scheduling domains that are missing the SD_BALANCE_NEWIDLE bit? I'm ...
Look for the 'load_idx' variable in find_busiest_group() and how it's used.
include/linux/topology.h (+ some architectures defines additional
node-definition-structures, but that's primarily for NUMA)
There you can find : 'busy_idx', 'idle_idx' and 'newidle_idx' (amongst
others that affect load-balancing).
Those parameters specify the _source_ of load for determining the
busiest group for !CPU_IDLE, CPU_IDLE and CPU_NEWLY_IDLE cases
respectively.
when 'load_idx == 0' (say, when CPU_NEWLY_IDLE and 'newidle_idx == 0'
or when CPU_IDLE and 'idle_idx == 0') the cpu_load[] table is _not_
taken into account by {source,target}_load(), and the raw load on the
rq is considered instead. In case of CPU_NEWLY_IDLE the raw load is
always 0 (no tasks on the queue).
so for NUMA, newidle_idx = 0 and it also seems to be a generic option
for other variants on 2.6.23.1. Hence, it doesn't matter how big are
for NUMA, 'idle_idx != 0' (i.e. cpu_load[] is taken into account) for
the CPU_IDLE case
and
'newidle_idx == 0' (just because, SD_BALANCE_NEWIDLE is normally not
(1) CPU_IDLE and idle_idx != 0 --> cpu_load[] is considered
(2) CPU_NEWIDLE and newidle_idx == 0 --> raw load is considered (which
As you may see, there are varios node's configs in
include/linux/topology.h that affects load-balancing. Migrations
between NUMA nodes can be more expensive than e.g. between cpus on
'normal' SMP, multi-core, etc. So it should be about being more
conservative when doing load-balancing on some setups.
e.g. in your particular case, what happens is smth like this:
SD_BALANCE_NEWIDLE + newidle_idx == 0
cpu #0: 2 nice(0) cpu-hogs
cpu #1: nice(-15) blocks...
cpu #1: load_balance_newidle() is triggered (*)
cpu #1: newidle_idx == 0 so the busiest group is on cpu #0
cpu #1: nice(0) (which is != current) is pulled over to cpu #1
cpu #1: nice(0) is running untill it's preempted by nice(-15)
cpu #1: nice(-15) is running
...
in the mean time, rebalance_domain event can be triggered on
cpu ...Dmitry, Thank you for the detailed explanation of the scheduler behaviour I've been seeing. Would it be reasonable perhaps to look for the numa_* entries in The application doesn't really depend on the load-balancer's decisions per se, it just happens that this behaviour I'm seeing on NUMA systems is extremely bad for performance. In this context, the application is a virtual machine runtime which is executing either an SMP VM or it's executing a guest which has a virtual GPU. In either case, there are at least three threads: - Two virtual CPU/GPU threads, which are nice(0) and often CPU-bound - A low-latency event handling thread, at nice(-10) The event handling thread performs periodic tasks like delivering timer interrupts and completing I/O operations. It isn't expected to use much CPU, but it does run very frequently. To get the best latency performance, we've been running it at nice(-10). The intention is that this event handling thread should usually be able to preempt either of the CPU-bound threads. The behaviour we expect is that on a system which is otherwise lightly loaded, the two CPU-bound threads each get nearly an entire physical CPU to themselves. The event handling thread periodically wakes up, preempts one of the CPU-bound threads briefly, then goes to sleep and yields the CPU back to one of the CPU-bound threads. The actual behaviour I see when the load balancer makes this unfavorable decision is that the two CPU-bound threads share a single CPU, and the other CPU is mostly idle. Best case, the virtual machine runs half as fast as it could be running. Worst case it runs much slower, because the CPU-bound threads often need to wait for each other to catch up if they don't always run at about the same rate. This involves a lot of extra synchronization overhead. My current workaround for this is to avoid ever boosting our thread priority on a kernel in which this problem could occur. Currently this is any 2.6.20 or later kernel running ...
Are I/O operations initiated by these "virtual CPU/GPU threads"? If so, would it make sense to have per-CPU event handling threads (instead of one global)? They would handle I/O operations initiated from their respective CPUs to (hopefully) achieve better data locality (esp. if the most part of the involved data is per-CPU). Then let the load balancer to evenly distribute the "virtual CPU/GPU threads" or even (at least, as an experiment) fix them to different CPUs as well? sure, the scenario is highly dependent on a nature of those 'events'... and I can just speculate here :-) (but I'd imagine -- Best regards, Dmitry Adamushko -
We already do this when it makes sense to. The high priority "event" thread is mostly used for asynchronous events- timers (driven by /dev/rtc), completion of disk or USB I/O, etc. These are things that happen asynchronously to the virtual CPU, and which may or may not need to interrupt the virtual CPU at some point. Another good example is audio. We have another thread (which ideally is high-priority) which processes audio events asynchronously from the virtual CPUs. The CPUs manipulate some shared memory to queue up audio data, but while the audio is playing there is no direct control path between the two threads. The virtual CPU reads/writes audio registers whenever it needs to emulate accesses to the audio device, and the audio thread wakes up any time the hardware needs more samples to play. The two threads are asynchronous with very little shared synchronization. Hopefully that clarifies our situation a bit. Thanks again, --Micah -
There are a couple of points I would make about your python test
harness. Your program compares real+system jiffies for both cpus; an
ideal result would be 1.00. The measurement is taken over a relatively
short period of approximately a half-second, and you kill the CPU hogs
before taking final measurements, even wait for them to die first. You
repeat this measurement, starting and killing CPU hogs each time. Why
do you do that?
What happens if you start the hogs and take the baseline outside of the
loop?
from __future__ import division
import sys, os, time
def getCpuTimes():
cpu0 = 0
cpu1 = 1
for line in open("/proc/stat"):
tokens = line.split()
if tokens[0] == "cpu0":
cpu0 = int(tokens[1]) + int(tokens[3])
elif tokens[0] == "cpu1":
cpu1 = int(tokens[1]) + int(tokens[3])
return cpu0, cpu1
pid = os.spawnl(os.P_NOWAIT, "./priosched")
baseline = getCpuTimes()
while True:
time.sleep(0.5)
current = getCpuTimes()
print "%.04f" % (current[0] - baseline[0]) / (current[1] -
baseline[1])
-
The Python test harness is fairly artificial, but this is just the best way I found to reliably reproduce the problem in a short amount of time. It was just for convenience while running git-bisect. When running the C program directly, there seems to be a somewhat random chance that it will start up in the "bad" state. Once the single CPU is stuck in this mostly-idle mode, it seems to stay that way for a The problem still occurs then, but killing/restarting the test app seems to trigger the problem more reliably. As I said in the original email about this, left to its own devices this problem will occur seemingly-randomly. In the original VMware code I observed this problem in, the same process would flip between the "good" and "bad" states seemingly randomly, every few seconds. Thanks, --Micah -
