Re: unpredictability in scheduler test results -- still present

From: Chris Friesen
Date: Thursday, September 18, 2008 - 3:45 pm

I was running some tests with the "fairtest" testcase and noticed that 
successive runs could give wildly different results.

I was originally using the tip/master tree as of Sep 16, but I also 
confirmed the behaviour with Linus' tree as of Sep 14 (with the 
__load_balance_iterator() fix applied).  The same behaviour is present 
in both cases.

I'm using the test config listed at the bottom.  It's pretty 
straightforward.

The first run gave the following results.  As expected, the system 
picked a static task distribution and didn't migrate tasks during the test.

group   actual(%)            expected(%)   avg latency(ms)   max_latency(ms)
    1   33.31(33.33/33.2     30.00         23/23             37/37
    2   36.29                40.00         5                 25
    3   30.40(27.40/33.40)   30.00         22/23             60/40



On the second run, the task distribution is almost perfect, but the 
system was only using one of the two cpus as seen by the difference 
between actual and expected cpu time.

Warning, actual cpu time different than expected. actual: 10033.011108, 
expected: 20000.000000
group   actual(%)            expected(%)   avg latency(ms)   max_latency(ms)
    1   30.24(30.59/29.88)   30.00         26/27             68/58
    2   39.87                40.00         20                36
    3   29.89(29.87/29.91)   30.00         28/27             47/60
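
The warning above comes down to simple arithmetic. Here is a minimal sketch, assuming the reported figures are in milliseconds and that the expected value is just the number of cpus times the test duration. The function name is hypothetical and not part of fairtest.

```python
def busy_cpus(actual_ms, duration_s):
    """Return how many cpus' worth of time the hogs actually consumed.

    Assumes actual_ms is in milliseconds and duration_s is in seconds.
    """
    return actual_ms / (duration_s * 1000.0)


actual_ms = 10033.011108   # "actual" figure from the warning
duration_s = 10            # "#duration (secs)" in the test config
num_cpus = 2               # the two-cpu machine described in this thread

expected_ms = num_cpus * duration_s * 1000.0
used = busy_cpus(actual_ms, duration_s)
print(f"expected {expected_ms:.0f} ms, used about {used:.2f} of {num_cpus} cpus")
```

With these numbers the hogs consumed roughly one cpu's worth of time, which matches the observation that only one of the two cpus was in use.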


Any ideas what's going on?

Chris



test config file:
#delay (secs)
1

#duration (secs)
10

#groupname,share,numhogs
1,750,n
2,1000,1
3,750,n
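
For reference, the expected(%) column in the results is consistent with each group getting cpu time in proportion to its share. A minimal sketch under that assumption follows; the function name is hypothetical and this is not how fairtest itself necessarily computes it.

```python
def expected_percent(shares):
    """Map each group to its share of the total, as a percentage."""
    total = sum(shares.values())
    return {group: 100.0 * share / total for group, share in shares.items()}


# Shares taken from the "#groupname,share,numhogs" lines of the config.
shares = {"1": 750, "2": 1000, "3": 750}

for group, pct in expected_percent(shares).items():
    print(f"group {group}: {pct:.2f}%")
```

The shares sum to 2500, so groups 1 and 3 come out at 30% each and group 2 at 40%, the same values as the expected(%) column.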


--

From: Chris Friesen
Date: Wednesday, September 24, 2008 - 8:19 am

This behaviour (that load balancing is messed up) is now almost 
continuous with both current tip/master and current Linus git.  On the 
first test after booting, it seems to work okay (although there are 
still issues with fairness).  On every subsequent test, fairness is good 
but it only uses one of the two cpus.

Also, building a kernel with "-j10" results in one cpu being mostly idle 
while the other one is 100% busy. It used to be both 100% busy--if I get 
time today I may try bisecting it.

Chris
--

From: Chris Friesen
Date: Wednesday, September 24, 2008 - 4:37 pm

It turns out that disabling CONFIG_DYNAMIC_FTRACE makes the load 
balancing problem go away and causes all cpus to be used.

With this option enabled, the problem seems to be present as far back as 
2.6.27-rc2.  (2.6.27-rc1 doesn't compile on my machine, and 2.6.26 
doesn't have ftrace).

I have no idea why turning on dynamic ftrace would affect load balancing 
behaviour, but it's very repeatable.  The very first test run after 
booting works fine, and all successive runs fail to balance properly.

Chris
--

From: Ingo Molnar
Date: Saturday, September 27, 2008 - 1:04 pm

very weird. Would be very nice to figure it out.

and in tip/master we don't have the 'ftraced' kernel-patching kernel 
thread anymore, so ftrace should be passive by all means.

OTOH, what does 'turning on dftrace' exactly mean? Just enabling it in 
the .config, or also activating it via /debug/tracing/current_tracer?

	Ingo
--

From: Chris Friesen
Date: Monday, September 29, 2008 - 8:43 am

Just enabling it in the .config is enough to trigger the behaviour 
change.  I'm not explicitly activating any traces.
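
The distinction between "built in" and "activated" can be checked mechanically. This is a rough sketch only: the sample .config text is made up for illustration, the helper names are hypothetical, and the tracer path is the one mentioned earlier in this thread, which depends on where debugfs is mounted.

```python
import re


def config_option_state(config_text, option):
    """Return the value of a set option (e.g. 'y' or 'm'), or None if it is
    commented out ("# CONFIG_X is not set") or absent."""
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


def current_tracer(path="/debug/tracing/current_tracer"):
    """Return the active tracer name, or None if the file cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


# Illustrative .config fragment, not taken from the actual test machine.
sample = """\
CONFIG_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
# CONFIG_SCHED_TRACER is not set
"""

print("CONFIG_DYNAMIC_FTRACE:", config_option_state(sample, "CONFIG_DYNAMIC_FTRACE"))
print("current tracer:", current_tracer())
```

In the case described here, the first check would report the option as set, while no tracer has been selected at runtime.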

Chris
--

From: Ingo Molnar
Date: Tuesday, September 30, 2008 - 4:12 am

ok, that would be a clear ftrace bug i guess?

	Ingo
--

From: Chris Friesen
Date: Tuesday, September 30, 2008 - 2:14 pm

It's either an ftrace bug or a fragile load balancer bug.  I wonder if 
it's related somehow to the stop_machine() call in ftrace_dynamic_init()?

Chris
--
