Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-lived threads

Previous thread: [PATCH] SUNPRC: Fix printk format warning by Roland Dreier on Friday, February 8, 2008 - 5:02 pm. (2 messages)

Next thread: scsi/arm/fas216.c compile error by Adrian Bunk on Friday, February 8, 2008 - 5:04 pm. (13 messages)
From: Olof Johansson
Date: Friday, February 8, 2008 - 5:04 pm

Hi,

I ended up with a customer benchmark in my lap this week that doesn't
do well on recent kernels. :(

After cutting it down to a simple testcase/microbenchmark, it seems like
recent kernels don't do as well with short-lived threads competing
with the thread it's cloned off of. The CFS scheduler changes come to
mind, but I suppose it could be caused by something else as well.

The pared-down testcase is included below. Reported runtime for the
testcase has increased almost 3x between 2.6.22 and 2.6.24:

2.6.22: 3332 ms
2.6.23: 4397 ms
2.6.24: 8953 ms
2.6.24-git19: 8986 ms

While running, it'll fork off a bunch of threads, each doing just a little
work, then busy-waiting on the original thread to finish as well. Yes,
it's incredibly stupidly coded but that's not my point here.

During run, (runtime 10s on my 1.5GHz Core2 Duo laptop), vmstat 2  shows:

 0  0      0 115196 364748 2248396    0    0     0     0  163   89  0  0 100  0
 2  0      0 115172 364748 2248396    0    0     0     0  270  178 24  0 76  0
 2  0      0 115172 364748 2248396    0    0     0     0  402  283 52  0 48  0
 2  0      0 115180 364748 2248396    0    0     0     0  402  281 50  0 50  0
 2  0      0 115180 364764 2248396    0    0     0    22  403  295 51  0 48  1
 2  0      0 115056 364764 2248396    0    0     0     0  399  280 52  0 48  0
 0  0      0 115196 364764 2248396    0    0     0     0  241  141 17  0 83  0
 0  0      0 115196 364768 2248396    0    0     0     2  155   67  0  0 100  0
 0  0      0 115196 364768 2248396    0    0     0     0  148   62  0  0 100  0

I.e. runqueue is 2, but only one cpu is busy. However, this still seems
true on the kernel that runs the testcase in more reasonable time.

Also, 'time' reports real and user time roughly the same on all kernels,
so it's not that the older kernels are better at spreading out the load
between the two cores (either that or it doesn't account for stuff right).

I've included the config files, runtime output and vmstat ...
From: Ingo Molnar
Date: Friday, February 8, 2008 - 5:08 pm

if you enable SCHED_DEBUG, and subtract 4 from the value of 
/proc/sys/kernel/sched_features, does it get any better?

if not, does writing 0 into /proc/sys/kernel/sched_features have any 
impact?

	Ingo
--

From: Olof Johansson
Date: Friday, February 8, 2008 - 5:32 pm

Doesn't seem to make a difference, either of them. I can unfortunately
not retest on x86 at the moment, only powerpc. But it made no noticable
difference in runtime w/ pasemi_defconfig and current mainline.


-Olof
--

From: Mike Galbraith
Date: Saturday, February 9, 2008 - 12:58 am

My 3GHz P4 shows disjointed results.

2.6.22.17-smp
time 798 ms
time 780 ms
time 702 ms

2.6.22.17-cfs-v24.1-smp
time 562 ms
time 551 ms
time 551 ms

2.6.23.15-smp
time 254 ms
time 254 ms
time 293 ms

2.6.23.15-cfs-v24-smp
time 764 ms
time 791 ms
time 780 ms

2.6.24.1-smp
time 815 ms
time 820 ms
time 771 ms

2.6.25-smp (git today)
time 29 ms
time 61 ms
time 72 ms


--

From: Willy Tarreau
Date: Saturday, February 9, 2008 - 1:03 am

Hi Mike,


How many CPUs do you have ? It's impressive to see such important variations.
I would guess from the numbers that you sometimes have 1 or 2 CPUs doing
nothing. I've already observed strange CPU usage patterns while building
kernels (bound to 50% usage on a dual-athlon), but I can't say for sure
that it was only on 2.6, so it may be related to make dependencies instead

These ones look rather strange. What type of workload is it ? Can you
publish the program for others to test it ?

Regards,
Willy

--

From: Mike Galbraith
Date: Saturday, February 9, 2008 - 3:58 am

It's the proglet posted in this thread.

	-Mike

--

From: Willy Tarreau
Date: Saturday, February 9, 2008 - 4:40 am

OK sorry, I did not notice it when I first read the report.

Regards,
Willy

--

From: Mike Galbraith
Date: Saturday, February 9, 2008 - 6:37 am

Hm.  The 2.6.25-smp kernel is the only one that looks like it's doing
what proggy wants to do, massive context switching.  Bump threads to
larger number so you can watch: the supposedly good kernel (22) is doing
everything on one CPU.  Everybody else sucks differently (idleness), and
the clear throughput winner, via mad over-schedule (!?!), is git today.

	-Mike

--

From: Willy Tarreau
Date: Saturday, February 9, 2008 - 9:19 am

For me, 2.6.25-smp gives pretty irregular results :

time 6548 ms
time 7272 ms
time 1188 ms
time 3772 ms

The CPU usage is quite irregular too and never goes beyond 50% (this is a
dual-athlon). If I start two of these processes, 100% of the CPU is used,
the context switch rate is more regular (about 700/s) and the total time
is more regular too (between 14.8 and 18.5 seconds).

Increasing the parallel run time of the two threads by changing the upper
limit of the for(j) loop correctly saturates both processors. I think that
this program simply does not have enough work to do for each thread to run
for a full timeslice, thus showing a random behaviour.

However, I fail to understand the goal of the reproducer. Granted it shows
irregularities in the scheduler under such conditions, but what *real*
workload would spend its time sequentially creating then immediately killing
threads, never using more than 2 at a time ?

If this could be turned into a DoS, I could understand, but here it looks
a bit pointless :-/

Regards,
Willy

--

From: Mike Galbraith
Date: Saturday, February 9, 2008 - 10:33 am

(me too, I was trying to figure out what could be expected)

--

From: Olof Johansson
Date: Saturday, February 9, 2008 - 10:29 pm

Right. I should have tinkered a bit more with it before I posted it, the
version posted had too little going on in the first loop and thus got
hung up on the second busywait loop instead.

I did a bunch of runs with various loop sizes. Basically, what seems to
happen is that the older kernels are quicker at rebalancing a new thread
over to the other cpu, while newer kernels let them share the same cpu
longer (and thus increases wall clock runtime).

All of these are built with gcc without optimization, larger loop size
and an added sched_yield() in the busy-wait loop at the end to take that
out as a factor. As you've seen yourself, runtimes can be quite noisy
but the trends are quite clear anyway. All of these numbers were
collected with default scheduler runtime options, same kernels and
configs as previously posted.

Loop to 1M:
2.6.22		time 4015 ms
2.6.23		time 4581 ms
2.6.24		time 10765 ms
2.6.24-git19	time 8286 ms

2M:
2.6.22		time 7574 ms
2.6.23		time 9031 ms
2.6.24		time 12844 ms
2.6.24-git19	time 10959 ms

3M:
2.6.22		time 8015 ms
2.6.23		time 13053 ms
2.6.24		time 16204 ms
2.6.24-git19	time 14984 ms

4M:
2.6.22		time 10045 ms
2.6.23		time 16642 ms
2.6.24		time 16910 ms
2.6.24-git19	time 16468 ms

5M:
2.6.22		time 12055 ms
2.6.23		time 21024 ms
<missed 2.6.24 here>
2.6.24-git19	time 16040 ms

10M:
2.6.22		time 24030 ms
2.6.23		time 33082 ms
2.6.24		time 34139 ms
2.6.24-git19	time 33724 ms

20M:
2.6.22		time 50015 ms
2.6.23		time 63963 ms
2.6.24		time 65100 ms
2.6.24-git19	time 63092 ms

40M:
2.6.22		time 94315 ms
2.6.23		time 107930 ms
2.6.24		time 113291 ms
2.6.24-git19	time 110360 ms

So with more work per thread, the differences become less but they're
still there. At the 40M loop, with 500 threads it's quite a bit of

It seems generally unfortunate that it takes longer for a new thread to
move over to the second cpu even when the first is busy with the original
thread. I can certainly see cases where this causes suboptimal ...
From: Willy Tarreau
Date: Saturday, February 9, 2008 - 11:15 pm

No, it's really nothing. I had to push the loop to 1 billion to make the load
noticeable. You don't have 500 threads, you have 2 threads and that load is
113291/500 = 227 microseconds/loop. This is still very low compared to the
smallest timeslice you would have (1 ms at HZ=1000).

So your threads are still completing *before* the scheduler has to preempt

In fact, I don't think it takes longer, I think it does not do it at their
creation, but will do it immediately after the first slice is consumed. This
would explain the important differences here. I don't know how we could ensure
that the new thread is created on the second CPU from the start, though.

I tried inserting a sched_yield() at the top of the busy loop (1M loops).
By default, it did not change a thing. Then I simply set sched_compat_yield
to 1, and the two threads then ran simultaneously with a stable low time
(2700 ms instead of 10-12 seconds).

Doing so with 10k loops (initial test) shows times in the range 240-300 ms
only instead of 2200-6500 ms.

Ingo, would it be possible (and wise) to ensure that a new thread being
created gets immediately rebalanced in order to emulate what is done here
with sched_compat_yield=1 and sched_yield() in both threads just after the
thread creation ? I don't expect any performance difference doing this,
but maybe some shell scripts reliying on short-lived pipes would get faster
on SMP.

In fact, right now the following command uses only 1 CPU :

$ dd if=/dev/urandom of=rnd bs=1M count=1
$ (time bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9)<rnd >/dev/null

real    0m6.977s
user    0m6.908s
sys     0m0.072s

Maybe this is due to the block size in bzip2 though, because gzip does not
exhibit exactly the same behaviour :

$ (time gzipi -1|gzip -1|gzip -1|gzip -1|gzip -1|gzip -1)<rnd >/dev/null

real    0m5.300s
user    0m7.392s

I'm well aware of that. What's important is to be able to explain what is
causing the difference and why the test case does not ...
From: Olof Johansson
Date: Sunday, February 10, 2008 - 12:00 am

Hmm? I get that to be 227ms per loop, which is way more than a full
timeslice. Running the program took in the range of 2 minutes, so it's

The math doesn't add up for me. Even if it rebalanced at the end of
the first slice (i.e. after 1ms), that would be a 1ms penalty per

Right, likely because the long-running cases got stuck at the busy loop
at the end, which would end up aborting quicker if the other thread got
scheduled for just a bit. It was a mistake to post that variant of the
testcase, it's not as relevant and doesn't mimic the original workload I

There's always the tradeoff of losing cache warmth whenever a thread is
moved, so I'm not sure if it's a good idea to always migrate it at

I believe it started out as a simple attempt to parallelize a workload
that sliced the problem too low, instead of slicing it in larger chunks
and have each thread do more work at a time. It did well on 2.6.22 with
almost a 2x speedup, but did worse than the single-treaded testcase on a
2.6.24 kernel.

So yes, it can clearly be handled through explanations and education
and fixen the broken testcase, but I'm still not sure the new behaviour
is desired.


-Olof
--

From: Willy Tarreau
Date: Sunday, February 10, 2008 - 12:58 am

Damn you're right! I don't know why I assumed that the reported time was


agreed, but what's important is not to change the workload, but to see

yes I know. That should not prevent us from experimenting though. If
thread-CPU affinity is too strong and causes the second CPU to be

While I could accept the first slice of the second thread being consumed 
on the same CPU as the first one, it definitely must migrate quickly if
both threads are competing for CPU.

regards,
willy

--

From: Mike Galbraith
Date: Monday, February 11, 2008 - 1:15 am

Piddling around with your testcase, it still looks to me like things
improved considerably in latest greatest git.  Hopefully that means
happiness is in the pipe for the real workload... synthetic load is
definitely happier here as burst is shortened.

#!/bin/sh

uname -r >> results
./threadtest 10 1000000 >> results
./threadtest 100 100000 >> results
./threadtest 1000 10000 >> results
./threadtest 10000 1000 >> results
./threadtest 100000 100 >> results
echo >> results

(threadtest <iterations> <burn_time>)

results:
2.6.22.17-smp
time: 10525370 (us)  work: 20000000  wait: 181000  idx: 1.90
time: 13514232 (us)  work: 20000001  wait: 2666366  idx: 1.48
time: 36280953 (us)  work: 20000008  wait: 21156077  idx: 0.55
time: 196374337 (us)  work: 20000058  wait: 177141620  idx: 0.10
time: 128721968 (us)  work: 20000099  wait: 115052705  idx: 0.16

2.6.22.17-cfs-v24.1-smp
time: 10579591 (us)  work: 20000000  wait: 203659  idx: 1.89
time: 11170784 (us)  work: 20000000  wait: 471961  idx: 1.79
time: 11650138 (us)  work: 20000000  wait: 1406224  idx: 1.72
time: 21447616 (us)  work: 20000007  wait: 10689242  idx: 0.93
time: 106792791 (us)  work: 20000060  wait: 92098132  idx: 0.19

2.6.23.15-smp
time: 10507122 (us)  work: 20000000  wait: 159809  idx: 1.90
time: 10545417 (us)  work: 20000000  wait: 263833  idx: 1.90
time: 11337770 (us)  work: 20000012  wait: 1069588  idx: 1.76
time: 15969860 (us)  work: 20000000  wait: 5577750  idx: 1.25
time: 54029726 (us)  work: 20000027  wait: 41734789  idx: 0.37

2.6.23.15-cfs-v24-smp
time: 10528972 (us)  work: 20000000  wait: 217060  idx: 1.90
time: 10697159 (us)  work: 20000000  wait: 447224  idx: 1.87
time: 12242250 (us)  work: 20000000  wait: 1930175  idx: 1.63
time: 26364658 (us)  work: 20000011  wait: 15468447  idx: 0.76
time: 158338977 (us)  work: 20000084  wait: 144048265  idx: 0.13

2.6.24.1-smp
time: 10570521 (us)  work: 20000000  wait: 208947  idx: 1.89
time: 10699224 (us)  work: 20000000  wait: 404644  idx: 1.87
time: ...
From: Olof Johansson
Date: Monday, February 11, 2008 - 10:26 am

The real workload doesn't see much of an improvement. The changes I did
when tinkering yesterday seem like they're better at modelling just
what's going on with that one.

I've added the new testcase I'm using for the numbers I posted last
night at http://lixom.net/~olof/threadtest/new/, including numbers for
the various kernels. I also included the binaries I built. (with "gcc
-DLOOPS=<size> testcase.c -lpthread").

I tried graphing the numbers as well, it looks like for larger workloads
that 2.6.22 has a fixed benefit over newer kernels. I.e. it seems quicker
at rebalancing, but once things balance out they're obviously doing ok
independent of kernel version.

Graph at http://lixom.net/olof/threadtest/new/schedgraph.pdf. I couldn't
figure out how to make the X axis linear, so it obviously looks odd with
the current powers-of-two at the end instead of linear, but the
differences can still be seen clearly.


-Olof

--

From: Mike Galbraith
Date: Monday, February 11, 2008 - 12:58 pm

So the real application is trying to yield?  If so, you could try
prodding /proc/sys/kernel/sched_compat_yield.

It shouldn't matter if you yield or not really, that should reduce the
number of non-work spin cycles wasted awaiting preemption as threads
execute in series (the problem), and should improve your performance
numbers, but not beyond single threaded.

If I plugged a yield into the busy wait, I would expect to see a large
behavioral difference due to yield implementation changes, but that
would only be a symptom in this case, no?  Yield should be a noop.

	-Mike

--

From: Olof Johansson
Date: Monday, February 11, 2008 - 1:31 pm

The real application wasn't using yield, but it also didn't get hung up
on the busy-wait loop, that was a mistake by me on Friday. Practically
all the time of that application was spent in the actual workload
loop. I tried adding a yield at the bottom loop, but it didn't make a

Exactly. It made a big impact on the first testcase from Friday, where
the spin-off thread spent the bulk of the time in the busy-wait loop,
with a very small initial workload loop. Thus the yield passed the cpu
over to the other thread who got a chance to run the small workload,
followed by a quick finish by both of them. The better model spends the
bulk of the time in the first workload loop, so yielding doesn't gain
at all the same amount.

I still added it to rule out that it was a factor in the time
differences.


-Olof
--

From: Mike Galbraith
Date: Tuesday, February 12, 2008 - 2:23 am

There is a strong dependency on execution order in this testcase.

Between cpu affinity and giving the child a little head start to reduce
the chance (100% if child wakes on same CPU and doesn't preempt parent)
of busy wait, modified testcase behaves.  I don't think I should need
the CPU affinity, but I do.

If you plunk a usleep(1) in prior to calling thread_func() does your
testcase performance change radically?  If so, I wonder if the real
application has the same kind of dependency.

	-Mike
From: Mike Galbraith
Date: Tuesday, February 12, 2008 - 10:49 pm

The answer is yes for 2.6.22, and no for 2.6.24, which surprised me.

2.6.22.17-smp
homer:..local/tmp # ./threadtest2
10000000 loops time 16215 ms
10000000 loops time 16268 ms
10000000 loops time 16246 ms

homer:..local/tmp # ./threadtest3 (with usleep(1))
10000000 loops time 13938 ms
10000000 loops time 13921 ms
10000000 loops time 13898 ms

2.6.24.1-smp
homer:..local/tmp # ./threadtest2
10000000 loops time 14663 ms
10000000 loops time 14523 ms
10000000 loops time 14466 ms

homer:..local/tmp # ./threadtest3
10000000 loops time 14513 ms
10000000 loops time 14500 ms
10000000 loops time 14464 ms

echo 0 >  /proc/sys/kernel/sched_child_runs_first

homer:..local/tmp # ./threadtest2
10000000 loops time 14157 ms
10000000 loops time 14097 ms
10000000 loops time 14153 ms

homer:..local/tmp # ./threadtest3
10000000 loops time 14065 ms
10000000 loops time 14075 ms
10000000 loops time 14018 ms



--

From: Bill Davidsen
Date: Monday, February 11, 2008 - 2:45 pm

I think the moving to another CPU gets really dependent on the CPU type. 
On a P4+HT the caches are shared, and moving costs almost nothing for 
cache hits, while on CPUs which have other cache layouts the migration 
cost is higher. Obviously multi-core should be cheaper than 
multi-socket, by avoiding using the system memory bus, but it still can 
get ugly.

I have an IPC test around which showed that, it ran like hell on HT, and 
progressively worse as cache because less shared. I wonder why the 
latest git works so much better?

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
--

From: Mike Galbraith
Date: Monday, February 11, 2008 - 9:30 pm

Yes, I'm wondering the same.  With latest git, ~400 usec work units
suffice to achieve overlap (on my P4/HT), whereas all other kernels
tested require several milliseconds.

	-Mike

--

Previous thread: [PATCH] SUNPRC: Fix printk format warning by Roland Dreier on Friday, February 8, 2008 - 5:02 pm. (2 messages)

Next thread: scsi/arm/fas216.c compile error by Adrian Bunk on Friday, February 8, 2008 - 5:04 pm. (13 messages)