I have a simple test program that takes the number of threads (pthreads) to create as input. Each thread runs a function that is just an infinite while loop. After creating the threads, the main function goes into an infinite loop itself.

My test machine is a Dual Core AMD Opteron(tm) 860 with 8 sockets (non-HT). I run the test program with number of threads == number of CPUs:

    ./loadcpu -t 16

I see 100% CPU utilization on almost all CPUs (via mpstat/htop/vmstat).

While those threads are running, I introduce a few high priority threads:

    nice -n -13 ./loadcpu -t 3

After a short while, a few CPUs become idle at ~0% utilization (the number of CPUs that become idle roughly equals the number of high priority threads, i.e. 3). When I stop the high priority threads, CPU utilization returns to normal, i.e. ~100%.

This is reproducible on the 2.6.32.10 stable kernel with all the recent SMT fixes (I hope), and I think it would be reproducible in current upstream as well. sched_mc_power_savings has always been set to 0.

I spent a while staring at the load balancing and thread migration code, but could not figure out why this is happening. I would appreciate any pointers.

Thanks,

-- Suresh Jayaraman
--
Right, except it's not a severe imbalance as the subject suggests. For some reason it seems to end up in a semi-stable state that is actually quite balanced.

    # eight busy loops at the default nice level
    for ((i=0; i<8; i++)) do while :; do :; done & done
    # three busy loops reniced to -15
    for ((i=0; i<3; i++)) do while :; do :; done & renice -n -15 -p $! ; done

gets me:

    Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu1  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu4  : 99.0%us,  1.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:  16440840k total,  1073672k used, 15367168k free,   105844k buffers
    Swap: 16777212k total,        0k used, 16777212k free,   296504k cached

     PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM   TIME+  COMMAND
    4370 root   5 -15  105m  804  304 R 100.1  0.0 0:45.02  bash
    4374 root   5 -15  105m  804  304 R 100.1  0.0 0:44.95  bash
    4372 root   5 -15  105m  804  304 R  99.1  0.0 0:45.00  bash
    4364 root  20   0  105m  804  304 R  51.0  0.0 0:33.06  bash
    4362 root  20   0  105m  800  300 R  50.0  0.0 0:33.17  bash
    4365 root  20   0  105m  804  304 R  50.0  0.0 0:33.75  bash
    4368 root  20   0  105m  804  304 R  50.0  0.0 0:33.32  bash
    4369 root  20   0  105m  804  304 R  50.0  0.0 0:33.38  bash
    4363 root  20   0  105m  804  304 R  49.1  0.0 0:33.65  bash
    4366 root  20   0  105m  804  304 R  49.1  0.0 0:33.29  bash
    4367 root  20   0  105m  804  304 R  49.1  0.0 0:33.54  bash

So we have the three nice -15 loops on a CPU each, the eight nice 0 loops two to a CPU, and one CPU idle. That is actually quite balanced, 'better' ...
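To put numbers on the placement shown by top above, the following sketch sums CFS load weights per CPU. It is a simplified view and not the balancer's actual computation: it assumes the weights 1024 for nice 0 and 29154 for nice -15 from the kernel's prio_to_weight[] table, ignores sched domains and groups, and all variable names are illustrative.

```shell
#!/bin/sh
# CFS load weights for the two nice levels in the test, as listed in the
# kernel's prio_to_weight[] table (nice 0 -> 1024, nice -15 -> 29154).
W_NICE_0=1024
W_NICE_M15=29154

# Per-CPU weight under the placement shown by top (a simplification that
# ignores sched domains and groups).
niced_cpu=$(( 1 * W_NICE_M15 ))   # one nice -15 loop alone on a CPU
shared_cpu=$(( 2 * W_NICE_0 ))    # two nice 0 loops sharing a CPU
idle_cpu=0                        # the CPU left with nothing to run

# Total weight of 3 niced loops plus 8 nice 0 loops, and the per-CPU
# average over 8 CPUs (integer division).
total=$(( 3 * W_NICE_M15 + 8 * W_NICE_0 ))
avg=$(( total / 8 ))

# CPU share of each nice 0 loop, in tenths of a percent: observed (4 CPUs
# for 8 loops) versus what 5 CPUs would give if the idle one were used.
observed_share=$(( 4 * 1000 / 8 ))
possible_share=$(( 5 * 1000 / 8 ))

echo "niced=$niced_cpu shared=$shared_cpu idle=$idle_cpu avg=$avg"
echo "nice 0 share: observed=$observed_share possible=$possible_share (x0.1%)"
```

By weight, each CPU running a niced loop sits well above the per-CPU average while holding only one runnable task, and each CPU running two nice 0 loops sits well below it.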
It was not intentional. It just happened that I first noticed the bug on

In my reproduction attempt the number of CPUs becoming idle increased with the number of high priority threads. For example:

    3 CPUs (out of 16) become idle when there were 3 high priority threads
    5 CPUs become idle when there were 4 high priority threads
    7 CPUs become idle when there were 5 high priority threads (~40%)

But I am also starting to think that some weird combination of normal priority threads and high priority threads makes the problem better or worse, because with 7 or more threads the utilization becomes smoother again.

Perhaps there is a chance that with more CPUs and a different number of high priority threads the problem could get worse, as I mentioned above..?

Thanks,

-- Suresh Jayaraman
--
One thing that could be happening (triggered by what Igawa-san said, although his case is more complicated because it involves the cgroup stuff) is that f_b_g() (find_busiest_group()) ends up selecting a group that contains these niced tasks, and then f_b_q() (find_busiest_queue()) will not find a suitable source queue because each of them will have but a single runnable task on it, and hence we simply bail.

We'd somehow have to teach update_*_lb_stats() not to consider groups where nr_running <= nr_cpus.

I don't currently have a patch for that, but I think that is the direction you might need to look in.
--
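The suggestion above can be illustrated with a toy model. This is only a sketch of the idea, not kernel code: pick_busiest is a hypothetical helper, the two-CPU groups and their loads are made up for illustration (loads use the weights 29154 for nice -15 and 1024 for nice 0), and the real balancer tracks far more state than a single load number.

```shell
#!/bin/sh
# pick_busiest FILTER NAME LOAD NR_RUNNING NR_CPUS [NAME LOAD NR_RUNNING NR_CPUS]...
# Prints the name of the group with the highest load. With FILTER=yes, groups
# that have no more runnable tasks than CPUs (nr_running <= nr_cpus) are
# skipped, because no task can be pulled from them without idling a CPU.
pick_busiest() {
    filter=$1; shift
    best=none; best_load=-1
    while [ "$#" -ge 4 ]; do
        name=$1; load=$2; nr_running=$3; nr_cpus=$4
        shift 4
        if [ "$filter" = yes ] && [ "$nr_running" -le "$nr_cpus" ]; then
            continue
        fi
        if [ "$load" -gt "$best_load" ]; then
            best=$name; best_load=$load
        fi
    done
    echo "$best"
}

# Hypothetical two-CPU groups: "niced" holds two nice -15 loops (one per CPU,
# load 2 * 29154), "normal" holds four nice 0 loops (two per CPU, 4 * 1024).
pick_busiest no  niced 58308 2 2 normal 4096 4 2   # prints "niced"
pick_busiest yes niced 58308 2 2 normal 4096 4 2   # prints "normal"
```

Without the filter the heaviest group wins even though each of its queues holds a single task, so there is nothing to migrate. With the filter the search falls through to a group that does have spare tasks.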
From: Peter Zijlstra <peterz@infradead.org>
Subject: Re: High priority threads causing severe CPU load imbalances

I made a patch to help my understanding of load_balance()'s behavior. This patch reduced the CPU load imbalance, but not perfectly.

---
    Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu1  : 90.1%us,  0.0%sy,  0.0%ni,  9.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu4  : 98.7%us,  0.3%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu5  : 96.1%us,  1.0%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu6  : 99.0%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
    Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   8032460k total,   807628k used,  7224832k free,    30692k buffers
    Swap:        0k total,        0k used,        0k free,   347308k cached

     PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  P COMMAND
    9872 root  20   0 66128  632  268 R   99  0.0 0:13.69  4 bash
    9876 root  20   0 66128  632  268 R   99  0.0 0:10.31  2 bash
    9877 root  20   0 66128  632  268 R   99  0.0 0:10.79  3 bash
    9871 root  20   0 66128  632  268 R   99  0.0 0:13.70  0 bash
    9873 root  20   0 66128  632  268 R   99  0.0 0:13.68  1 bash
    9874 root  20   0 66128  632  268 R   98  0.0 0:10.00  6 bash
    9875 root  20   0 66128  632  268 R   92  0.0 0:11.22  4 bash
    9878 root  20   0 66128  632  268 R   91  0.0 0:10.03  7 bash
---

Also, this patch caused ping-pong load balancing. The patch regards a sched_group as an idle sched_group if the local sched_group's CPU is CPU_IDLE. But that state is not stable, because active_load_balance() runs in this situation, IIUC.

I'll investigate ...
What's wrong with having the three -15 loops each get a CPU, having six of the remaining nice 0 loops get half a CPU each, and having the last two get their own CPUs? That's less fair but strictly better than the current solution, and nothing bounces.

--Andy
--
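The two placements can be compared by simple counting. The sketch below is only arithmetic on the figures already given in the thread (8 CPUs, three niced loops, eight nice 0 loops); the variable names are illustrative, and CPU time is counted in units of half a CPU so the shell's integer arithmetic suffices.

```shell
#!/bin/sh
# Count busy CPU time in units of half a CPU so the arithmetic stays integral.
FULL=2   # a loop with a CPU to itself
HALF=1   # a loop sharing a CPU with one other loop

# Placement reported earlier in the thread: 3 niced loops alone, 8 nice 0
# loops paired up, 1 CPU idle.
current=$(( 3 * FULL + 8 * HALF ))

# Placement proposed above: 3 niced loops alone, 6 nice 0 loops paired up,
# 2 nice 0 loops alone.
proposed=$(( 3 * FULL + 6 * HALF + 2 * FULL ))

echo "current:  $(( current / 2 )) of 8 CPUs busy"
echo "proposed: $(( proposed / 2 )) of 8 CPUs busy"
```

Under the proposal no loop gets less CPU than it does now and two loops get more, at the cost of the nice 0 loops no longer receiving equal shares.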
The fairness thing, that really matters a lot to some people. I've had enterprise bugs filed over such behaviour as you describe. --
I found a similar (maybe the same) problem using the cgroup cpu subsystem, as follows. My test machine has a quad-core Xeon in each of 2 sockets (non-HT).

    # mount -t cgroup -o cpu none /dev/cgroup-cpu/
    # mkdir -p /dev/cgroup-cpu/204800 /dev/cgroup-cpu/1024
    # echo 204800 > /dev/cgroup-cpu/204800/cpu.shares
    # for ((i=0; i<3; i++)) do while :; do :; done & echo $! > /dev/cgroup-cpu/204800/tasks ; done
    # for ((i=0; i<5; i++)) do while :; do :; done & echo $! > /dev/cgroup-cpu/1024/tasks ; done

gets me:

    Tasks: 190 total,   9 running, 181 sleeping,   0 stopped,   0 zombie
    Cpu0  :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu1  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
    Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   8180292k total,  2430940k used,  5749352k free,   204988k buffers
    Swap:        0k total,        0k used,        0k free,  1931820k cached

      PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  P COMMAND
    30923 root  20   0  5808  540  264 R  100  0.0 2:30.64  3 bash
    30922 root  20   0  5808  540  264 R  100  0.0 2:30.64  2 bash
    30924 root  20   0  5808  540  264 R  100  0.0 2:30.63  6 bash
    30925 root  20   0  5808  540  264 R   42  0.0 1:00.19  7 bash
    30928 root  20   0  5808  540  264 R   41  0.0 0:57.26  5 bash
    30929 root  20   0  5808  540  264 R   40  0.0 0:57.03  7 bash
    30926 root  20   0  5808  540  264 R   39  0.0 0:58.37  7 bash
    30927 root ...
