Re: High priority threads causing severe CPU load imbalances

Previous thread: [PATCH] 460EX on-chip SATA driver<kernel 2.6.33> < resubmission : 02> by Rupjyoti Sarmah on Tuesday, April 6, 2010 - 4:41 am. (5 messages)

Next thread: [PATCH] perf kmem: Fix breakage introduced by 5a0e3ad slab.h script by Arnaldo Carvalho de Melo on Tuesday, April 6, 2010 - 6:37 am. (2 messages)
From: Suresh Jayaraman
Date: Tuesday, April 6, 2010 - 6:12 am

I have a simple test program that accepts the number of threads (pthreads)
to be created as an input. Each of the threads created invokes a
function which is just an infinite while loop. The main function, after
creating those threads, goes into an infinite loop itself.
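
The source of loadcpu is not part of this thread. A rough Python stand-in
might look like the sketch below. It assumes worker processes instead of
pthreads (so that the interpreter lock does not serialize the loops), the
"fork" start method (Linux/Unix), and a stop event so the load can be shut
down cleanly. The names spin, start_spinners and stop_spinners are
hypothetical.

```python
import multiprocessing as mp


def spin(stop):
    """Busy-loop until the shared stop event is set."""
    while not stop.is_set():
        pass


def start_spinners(count):
    """Start `count` busy-looping worker processes.

    Returns the stop event and the list of worker processes.
    Assumes a Unix-like system where the "fork" start method exists.
    """
    ctx = mp.get_context("fork")
    stop = ctx.Event()
    workers = [ctx.Process(target=spin, args=(stop,)) for _ in range(count)]
    for worker in workers:
        worker.start()
    return stop, workers


def stop_spinners(stop, workers):
    """Ask all workers to leave their loops and wait for them to exit."""
    stop.set()
    for worker in workers:
        worker.join()
```

Calling start_spinners(16) would roughly correspond to "./loadcpu -t 16",
and running the same script under "nice -n -13" would correspond to the
high priority case described below.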

My test machine is a Dual Core AMD Opteron(tm) 860 with 8
sockets (non-HT). I run this test program with the number of threads ==
number of CPUs:

   ./loadcpu -t 16

I see 100% CPU utilization on almost all CPUs (via mpstat/htop/vmstat).

While the above threads are running, I introduce a few high priority
threads by doing:

   nice -n -13 ./loadcpu -t 3

After a short while, I see a few CPUs becoming idle at ~0% utilization
(the number of CPUs becoming idle equals roughly the number of high
priority threads i.e. 3). When I stop the high priority threads, the CPU
utilization comes back to normal i.e. ~100%.

This is reproducible on the 2.6.32.10 stable kernel with all the recent
SMT fixes (I hope) and I think it would be reproducible in current
upstream as well.

sched_mc_power_savings has always been set to 0.

I spent a while staring at the load balancing and the thread migration
code, but could not figure out why this is happening. Would appreciate
any pointers.


Thanks,

-- 
Suresh Jayaraman
--

From: Peter Zijlstra
Date: Tuesday, April 6, 2010 - 7:08 am

Right, except it's not a severe imbalance as the subject suggests. For
some reason it seems to end up in a semi-stable state that is actually
quite balanced.

for ((i=0; i<8; i++)) do while :; do :; done & done
for ((i=0; i<3; i++)) do while :; do :; done & renice -n -15 -p $! ;
done

gets me:

Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 99.0%us,  1.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16440840k total,  1073672k used, 15367168k free,   105844k buffers
Swap: 16777212k total,        0k used, 16777212k free,   296504k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4370 root       5 -15  105m  804  304 R 100.1  0.0   0:45.02 bash
 4374 root       5 -15  105m  804  304 R 100.1  0.0   0:44.95 bash
 4372 root       5 -15  105m  804  304 R 99.1  0.0   0:45.00 bash
 4364 root      20   0  105m  804  304 R 51.0  0.0   0:33.06 bash
 4362 root      20   0  105m  800  300 R 50.0  0.0   0:33.17 bash
 4365 root      20   0  105m  804  304 R 50.0  0.0   0:33.75 bash
 4368 root      20   0  105m  804  304 R 50.0  0.0   0:33.32 bash
 4369 root      20   0  105m  804  304 R 50.0  0.0   0:33.38 bash
 4363 root      20   0  105m  804  304 R 49.1  0.0   0:33.65 bash
 4366 root      20   0  105m  804  304 R 49.1  0.0   0:33.29 bash
 4367 root      20   0  105m  804  304 R 49.1  0.0   0:33.54 bash 

So we have the 3 nice -15 loops on a cpu each, the 8 nice 0 loops
running 2 to a cpu, and 1 cpu idle. That is actually quite balanced, 'better' ...
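
To put numbers on the placement above, here is a small Python sketch. It
assumes the load weights from the kernel's prio_to_weight[] table (nice 0
is 1024) and assumes that tasks sharing a runqueue split the CPU in
proportion to their weights. The helper names are hypothetical, and this
is only an illustration of the arithmetic, not of the balancer itself.

```python
# Load weights for the nice levels used in this thread. Assumed to match
# the kernel's prio_to_weight[] table, where nice 0 has weight 1024.
NICE_TO_WEIGHT = {-15: 29154, -13: 18705, 0: 1024}


def runqueue_load(nice_levels):
    """Weighted load of one CPU: the sum of its tasks' weights."""
    return sum(NICE_TO_WEIGHT[n] for n in nice_levels)


def cpu_fractions(nice_levels):
    """Fraction of the CPU each task gets, proportional to its weight."""
    total = runqueue_load(nice_levels)
    return [NICE_TO_WEIGHT[n] / total for n in nice_levels]


# The placement seen in the top output: three CPUs with one nice -15 loop,
# four CPUs with two nice 0 loops each, and one idle CPU.
placement = [[-15], [-15], [-15], [0, 0], [0, 0], [0, 0], [0, 0], []]

loads = [runqueue_load(rq) for rq in placement]
for cpu, rq in enumerate(placement):
    print(f"cpu{cpu}: load={loads[cpu]:6d} fractions={cpu_fractions(rq)}")
```

Under these assumptions each niced CPU carries a weighted load of 29154
while a CPU with two nice 0 loops carries 2048, roughly a factor of 14
apart, even though each nice 0 loop only gets half a CPU.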
From: Suresh Jayaraman
Date: Tuesday, April 6, 2010 - 9:35 am

It was not intentional. It just happened that I first noticed the bug on

In my reproduction attempt the number of CPUs becoming idle increased
with the number of high priority threads. For example:

 3 (out of 16 CPUs) become idle when there were 3 high priority threads
 5 CPUs become idle when there were 4 high priority threads
 7 CPUs become idle when there were 5 high priority threads (~40%)

But I am also starting to think that some weird combination of normal
priority threads and high priority threads makes the problem worse or
better, because with 7 or more threads the utilization becomes smoother
again.


Perhaps there is a chance that with more CPUs and a different number of
high priority threads the problem could get worse, as I mentioned above?


Thanks,

-- 
Suresh Jayaraman
--

From: Peter Zijlstra
Date: Thursday, April 8, 2010 - 9:15 am

One thing that could be happening (triggered by what Igawa-san said,
although his case is more complicated by involving the cgroup stuff) is
that f_b_g() ends up selecting a group that contains these niced tasks
and then f_b_q() will not find a suitable source queue because all of
them will have but a single runnable task on it and hence we simply
bail.

We'd somehow have to teach update_*_lb_stats() not to consider groups
where nr_running <= nr_cpus. I don't currently have a patch for that,
but I think that is the direction you might need to look in.
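
The scenario described above can be sketched with a toy model in Python.
This is not kernel code: groups are plain lists of runqueues, a runqueue
is a list of task weights, and the functions only borrow the names
find_busiest_group and find_busiest_queue for readability. The
skip_underloaded flag is a hypothetical stand-in for the suggested
nr_running <= nr_cpus check.

```python
def find_busiest_group(groups, skip_underloaded=False):
    """Return the name of the group with the highest average load per CPU."""
    best_name, best_load = None, 0
    for name, queues in groups.items():
        nr_running = sum(len(q) for q in queues)
        # Suggested filter: a group with no more tasks than CPUs has
        # nothing it can give away, so do not consider it.
        if skip_underloaded and nr_running <= len(queues):
            continue
        avg_load = sum(sum(q) for q in queues) / len(queues)
        if avg_load > best_load:
            best_name, best_load = name, avg_load
    return best_name


def find_busiest_queue(queues):
    """Return the index of the most loaded queue that has a task to spare."""
    best_idx, best_load = None, 0
    for idx, q in enumerate(queues):
        if len(q) <= 1:  # a single runnable task cannot be pulled away
            continue
        if sum(q) > best_load:
            best_idx, best_load = idx, sum(q)
    return best_idx


def pick_source(groups, skip_underloaded=False):
    """What an idle CPU would pull from, or None if balancing bails."""
    name = find_busiest_group(groups, skip_underloaded)
    if name is None:
        return None
    idx = find_busiest_queue(groups[name])
    return None if idx is None else (name, idx)


# Remote groups as seen from an idle CPU (weights: nice -15 and nice 0).
groups = {
    "niced": [[29154], [29154]],
    "normal": [[1024, 1024], [1024, 1024]],
}
print(pick_source(groups))                         # None
print(pick_source(groups, skip_underloaded=True))  # ('normal', 0)
```

In this toy model the heavily weighted group always wins the group
selection, no queue in it has a second task, and the idle CPU stays idle.
With the filter, the group holding the nice 0 tasks is chosen instead.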



--

From: Masayuki Igawa
Date: Thursday, April 8, 2010 - 7:20 pm

From: Peter Zijlstra <peterz@infradead.org>
Subject: Re: High priority threads causing severe CPU load imbalances

I made a patch to help me understand load_balance()'s behavior.
This patch reduced the CPU load imbalance, but not perfectly.
---
Cpu0  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 90.1%us,  0.0%sy,  0.0%ni,  9.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 98.7%us,  0.3%sy,  0.0%ni,  1.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 96.1%us,  1.0%sy,  0.0%ni,  3.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 99.0%us,  0.7%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8032460k total,   807628k used,  7224832k free,    30692k buffers
Swap:        0k total,        0k used,        0k free,   347308k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND         
 9872 root      20   0 66128  632  268 R   99  0.0   0:13.69 4 bash            
 9876 root      20   0 66128  632  268 R   99  0.0   0:10.31 2 bash            
 9877 root      20   0 66128  632  268 R   99  0.0   0:10.79 3 bash            
 9871 root      20   0 66128  632  268 R   99  0.0   0:13.70 0 bash            
 9873 root      20   0 66128  632  268 R   99  0.0   0:13.68 1 bash            
 9874 root      20   0 66128  632  268 R   98  0.0   0:10.00 6 bash            
 9875 root      20   0 66128  632  268 R   92  0.0   0:11.22 4 bash            
 9878 root      20   0 66128  632  268 R   91  0.0   0:10.03 7 bash            
---
Also, this patch caused ping-pong load balancing.

This patch regards the sched_group as an idle sched_group
if the local sched_group's cpu is CPU_IDLE.

But the state is not stable because active_load_balance() runs in this situation IIUC.


I'll investigate ...
From: Andy Lutomirski
Date: Tuesday, April 6, 2010 - 9:42 pm

What's wrong with having the three -15 loops each get a CPU, having six
of the remaining 0 loops get half a CPU, and the last two get their own
CPUs?  That's less fair but strictly better than the current solution,
and nothing bounces.

--Andy
--

From: Peter Zijlstra
Date: Wednesday, April 7, 2010 - 12:44 am

The fairness thing, that really matters a lot to some people.

I've had enterprise bugs filed over such behaviour as you describe.
--

From: Masayuki Igawa
Date: Tuesday, April 6, 2010 - 10:46 pm

I found a similar (maybe the same) problem using the cgroup cpu subsystem, as follows:

My test machine has 2 sockets of quad-core Xeon (non-HT).
# mount -t cgroup -o cpu none /dev/cgroup-cpu/
# mkdir -p /dev/cgroup-cpu/204800 /dev/cgroup-cpu/1024
# echo 204800 > /dev/cgroup-cpu/204800/cpu.shares
# for ((i=0; i<3; i++)) do while :; do :; done & echo $! > /dev/cgroup-cpu/204800/tasks ; done
# for ((i=0; i<5; i++)) do while :; do :; done & echo $! > /dev/cgroup-cpu/1024/tasks ; done


gets me:

Tasks: 190 total,   9 running, 181 sleeping,   0 stopped,   0 zombie
Cpu0  :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8180292k total,  2430940k used,  5749352k free,   204988k buffers
Swap:        0k total,        0k used,        0k free,  1931820k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND          
30923 root      20   0  5808  540  264 R  100  0.0   2:30.64 3 bash             
30922 root      20   0  5808  540  264 R  100  0.0   2:30.64 2 bash             
30924 root      20   0  5808  540  264 R  100  0.0   2:30.63 6 bash             
30925 root      20   0  5808  540  264 R   42  0.0   1:00.19 7 bash             
30928 root      20   0  5808  540  264 R   41  0.0   0:57.26 5 bash             
30929 root      20   0  5808  540  264 R   40  0.0   0:57.03 7 bash             
30926 root      20   0  5808  540  264 R   39  0.0   0:58.37 7 bash             
30927 root   ...
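
For comparison with the output above, the following Python sketch
computes what a share-proportional split of the 8 CPUs would look like if
each busy loop could use at most one CPU and leftover capacity were handed
to the other group. That is an idealized weighted max-min calculation
under stated assumptions, not a description of how the scheduler
distributes group shares per CPU. The function name fair_allocation is
hypothetical.

```python
def fair_allocation(capacity, weights, demands):
    """Split `capacity` CPUs by weight, capping each group at its demand.

    Groups whose proportional share meets or exceeds their demand are
    capped, and the capacity they cannot use is redistributed.
    """
    alloc = {}
    active = set(weights)
    remaining = capacity
    while active:
        total = sum(weights[name] for name in active)
        capped = [name for name in active
                  if remaining * weights[name] / total >= demands[name]]
        if not capped:
            # Nobody is limited by demand: plain proportional split.
            for name in active:
                alloc[name] = remaining * weights[name] / total
            break
        for name in capped:
            alloc[name] = demands[name]
        remaining -= sum(demands[name] for name in capped)
        active -= set(capped)
    return alloc


# cpu.shares of the two cgroups and the number of busy loops in each,
# taken from the commands above (one loop can use at most one CPU).
shares = {"204800": 204800, "1024": 1024}
loops = {"204800": 3, "1024": 5}
print(fair_allocation(8, shares, loops))
```

Under these assumptions the 204800 group would be capped at 3 CPUs and the
remaining 5 CPUs would go to the 5 loops in the 1024 group, whereas the
output above shows three idle CPUs.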