Hi, I ended up with a customer benchmark in my lap this week that doesn't do well on recent kernels. :( After cutting it down to a simple testcase/microbenchmark, it seems like recent kernels don't do as well with short-lived threads competing with the thread it's cloned off of. The CFS scheduler changes come to mind, but I suppose it could be caused by something else as well. The pared-down testcase is included below. Reported runtime for the testcase has increased almost 3x between 2.6.22 and 2.6.24: 2.6.22: 3332 ms 2.6.23: 4397 ms 2.6.24: 8953 ms 2.6.24-git19: 8986 ms While running, it'll fork off a bunch of threads, each doing just a little work, then busy-waiting on the original thread to finish as well. Yes, it's incredibly stupidly coded but that's not my point here. During run, (runtime 10s on my 1.5GHz Core2 Duo laptop), vmstat 2 shows: 0 0 0 115196 364748 2248396 0 0 0 0 163 89 0 0 100 0 2 0 0 115172 364748 2248396 0 0 0 0 270 178 24 0 76 0 2 0 0 115172 364748 2248396 0 0 0 0 402 283 52 0 48 0 2 0 0 115180 364748 2248396 0 0 0 0 402 281 50 0 50 0 2 0 0 115180 364764 2248396 0 0 0 22 403 295 51 0 48 1 2 0 0 115056 364764 2248396 0 0 0 0 399 280 52 0 48 0 0 0 0 115196 364764 2248396 0 0 0 0 241 141 17 0 83 0 0 0 0 115196 364768 2248396 0 0 0 2 155 67 0 0 100 0 0 0 0 115196 364768 2248396 0 0 0 0 148 62 0 0 100 0 I.e. runqueue is 2, but only one cpu is busy. However, this still seems true on the kernel that runs the testcase in more reasonable time. Also, 'time' reports real and user time roughly the same on all kernels, so it's not that the older kernels are better at spreading out the load between the two cores (either that or it doesn't account for stuff right). I've included the config files, runtime output and vmstat ...
if you enable SCHED_DEBUG, and subtract 4 from the value of /proc/sys/kernel/sched_features, does it get any better? if not, does writing 0 into /proc/sys/kernel/sched_features have any impact? Ingo --
Doesn't seem to make a difference, either of them. I can unfortunately not retest on x86 at the moment, only powerpc. But it made no noticable difference in runtime w/ pasemi_defconfig and current mainline. -Olof --
My 3GHz P4 shows disjointed results. 2.6.22.17-smp time 798 ms time 780 ms time 702 ms 2.6.22.17-cfs-v24.1-smp time 562 ms time 551 ms time 551 ms 2.6.23.15-smp time 254 ms time 254 ms time 293 ms 2.6.23.15-cfs-v24-smp time 764 ms time 791 ms time 780 ms 2.6.24.1-smp time 815 ms time 820 ms time 771 ms 2.6.25-smp (git today) time 29 ms time 61 ms time 72 ms --
Hi Mike, How many CPUs do you have ? It's impressive to see such important variations. I would guess from the numbers that you sometimes have 1 or 2 CPUs doing nothing. I've already observed strange CPU usage patterns while building kernels (bound to 50% usage on a dual-athlon), but I can't say for sure that it was only on 2.6, so it may be related to make dependencies instead These ones look rather strange. What type of workload is it ? Can you publish the program for others to test it ? Regards, Willy --
It's the proglet posted in this thread. -Mike --
OK sorry, I did not notice it when I first read the report. Regards, Willy --
Hm. The 2.6.25-smp kernel is the only one that looks like it's doing what proggy wants to do, massive context switching. Bump threads to larger number so you can watch: the supposedly good kernel (22) is doing everything on one CPU. Everybody else sucks differently (idleness), and the clear throughput winner, via mad over-schedule (!?!), is git today. -Mike --
For me, 2.6.25-smp gives pretty irregular results : time 6548 ms time 7272 ms time 1188 ms time 3772 ms The CPU usage is quite irregular too and never goes beyond 50% (this is a dual-athlon). If I start two of these processes, 100% of the CPU is used, the context switch rate is more regular (about 700/s) and the total time is more regular too (between 14.8 and 18.5 seconds). Increasing the parallel run time of the two threads by changing the upper limit of the for(j) loop correctly saturates both processors. I think that this program simply does not have enough work to do for each thread to run for a full timeslice, thus showing a random behaviour. However, I fail to understand the goal of the reproducer. Granted it shows irregularities in the scheduler under such conditions, but what *real* workload would spend its time sequentially creating then immediately killing threads, never using more than 2 at a time ? If this could be turned into a DoS, I could understand, but here it looks a bit pointless :-/ Regards, Willy --
(me too, I was trying to figure out what could be expected) --
Right. I should have tinkered a bit more with it before I posted it, the version posted had too little going on in the first loop and thus got hung up on the second busywait loop instead. I did a bunch of runs with various loop sizes. Basically, what seems to happen is that the older kernels are quicker at rebalancing a new thread over to the other cpu, while newer kernels let them share the same cpu longer (and thus increases wall clock runtime). All of these are built with gcc without optimization, larger loop size and an added sched_yield() in the busy-wait loop at the end to take that out as a factor. As you've seen yourself, runtimes can be quite noisy but the trends are quite clear anyway. All of these numbers were collected with default scheduler runtime options, same kernels and configs as previously posted. Loop to 1M: 2.6.22 time 4015 ms 2.6.23 time 4581 ms 2.6.24 time 10765 ms 2.6.24-git19 time 8286 ms 2M: 2.6.22 time 7574 ms 2.6.23 time 9031 ms 2.6.24 time 12844 ms 2.6.24-git19 time 10959 ms 3M: 2.6.22 time 8015 ms 2.6.23 time 13053 ms 2.6.24 time 16204 ms 2.6.24-git19 time 14984 ms 4M: 2.6.22 time 10045 ms 2.6.23 time 16642 ms 2.6.24 time 16910 ms 2.6.24-git19 time 16468 ms 5M: 2.6.22 time 12055 ms 2.6.23 time 21024 ms <missed 2.6.24 here> 2.6.24-git19 time 16040 ms 10M: 2.6.22 time 24030 ms 2.6.23 time 33082 ms 2.6.24 time 34139 ms 2.6.24-git19 time 33724 ms 20M: 2.6.22 time 50015 ms 2.6.23 time 63963 ms 2.6.24 time 65100 ms 2.6.24-git19 time 63092 ms 40M: 2.6.22 time 94315 ms 2.6.23 time 107930 ms 2.6.24 time 113291 ms 2.6.24-git19 time 110360 ms So with more work per thread, the differences become less but they're still there. At the 40M loop, with 500 threads it's quite a bit of It seems generally unfortunate that it takes longer for a new thread to move over to the second cpu even when the first is busy with the original thread. I can certainly see cases where this causes suboptimal ...
No, it's really nothing. I had to push the loop to 1 billion to make the load noticeable. You don't have 500 threads, you have 2 threads and that load is 113291/500 = 227 microseconds/loop. This is still very low compared to the smallest timeslice you would have (1 ms at HZ=1000). So your threads are still completing *before* the scheduler has to preempt In fact, I don't think it takes longer, I think it does not do it at their creation, but will do it immediately after the first slice is consumed. This would explain the important differences here. I don't know how we could ensure that the new thread is created on the second CPU from the start, though. I tried inserting a sched_yield() at the top of the busy loop (1M loops). By default, it did not change a thing. Then I simply set sched_compat_yield to 1, and the two threads then ran simultaneously with a stable low time (2700 ms instead of 10-12 seconds). Doing so with 10k loops (initial test) shows times in the range 240-300 ms only instead of 2200-6500 ms. Ingo, would it be possible (and wise) to ensure that a new thread being created gets immediately rebalanced in order to emulate what is done here with sched_compat_yield=1 and sched_yield() in both threads just after the thread creation ? I don't expect any performance difference doing this, but maybe some shell scripts reliying on short-lived pipes would get faster on SMP. In fact, right now the following command uses only 1 CPU : $ dd if=/dev/urandom of=rnd bs=1M count=1 $ (time bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9)<rnd >/dev/null real 0m6.977s user 0m6.908s sys 0m0.072s Maybe this is due to the block size in bzip2 though, because gzip does not exhibit exactly the same behaviour : $ (time gzipi -1|gzip -1|gzip -1|gzip -1|gzip -1|gzip -1)<rnd >/dev/null real 0m5.300s user 0m7.392s I'm well aware of that. What's important is to be able to explain what is causing the difference and why the test case does not ...
Hmm? I get that to be 227ms per loop, which is way more than a full timeslice. Running the program took in the range of 2 minutes, so it's The math doesn't add up for me. Even if it rebalanced at the end of the first slice (i.e. after 1ms), that would be a 1ms penalty per Right, likely because the long-running cases got stuck at the busy loop at the end, which would end up aborting quicker if the other thread got scheduled for just a bit. It was a mistake to post that variant of the testcase, it's not as relevant and doesn't mimic the original workload I There's always the tradeoff of losing cache warmth whenever a thread is moved, so I'm not sure if it's a good idea to always migrate it at I believe it started out as a simple attempt to parallelize a workload that sliced the problem too low, instead of slicing it in larger chunks and have each thread do more work at a time. It did well on 2.6.22 with almost a 2x speedup, but did worse than the single-treaded testcase on a 2.6.24 kernel. So yes, it can clearly be handled through explanations and education and fixen the broken testcase, but I'm still not sure the new behaviour is desired. -Olof --
Damn you're right! I don't know why I assumed that the reported time was agreed, but what's important is not to change the workload, but to see yes I know. That should not prevent us from experimenting though. If thread-CPU affinity is too strong and causes the second CPU to be While I could accept the first slice of the second thread being consumed on the same CPU as the first one, it definitely must migrate quickly if both threads are competing for CPU. regards, willy --
Piddling around with your testcase, it still looks to me like things improved considerably in latest greatest git. Hopefully that means happiness is in the pipe for the real workload... synthetic load is definitely happier here as burst is shortened. #!/bin/sh uname -r >> results ./threadtest 10 1000000 >> results ./threadtest 100 100000 >> results ./threadtest 1000 10000 >> results ./threadtest 10000 1000 >> results ./threadtest 100000 100 >> results echo >> results (threadtest <iterations> <burn_time>) results: 2.6.22.17-smp time: 10525370 (us) work: 20000000 wait: 181000 idx: 1.90 time: 13514232 (us) work: 20000001 wait: 2666366 idx: 1.48 time: 36280953 (us) work: 20000008 wait: 21156077 idx: 0.55 time: 196374337 (us) work: 20000058 wait: 177141620 idx: 0.10 time: 128721968 (us) work: 20000099 wait: 115052705 idx: 0.16 2.6.22.17-cfs-v24.1-smp time: 10579591 (us) work: 20000000 wait: 203659 idx: 1.89 time: 11170784 (us) work: 20000000 wait: 471961 idx: 1.79 time: 11650138 (us) work: 20000000 wait: 1406224 idx: 1.72 time: 21447616 (us) work: 20000007 wait: 10689242 idx: 0.93 time: 106792791 (us) work: 20000060 wait: 92098132 idx: 0.19 2.6.23.15-smp time: 10507122 (us) work: 20000000 wait: 159809 idx: 1.90 time: 10545417 (us) work: 20000000 wait: 263833 idx: 1.90 time: 11337770 (us) work: 20000012 wait: 1069588 idx: 1.76 time: 15969860 (us) work: 20000000 wait: 5577750 idx: 1.25 time: 54029726 (us) work: 20000027 wait: 41734789 idx: 0.37 2.6.23.15-cfs-v24-smp time: 10528972 (us) work: 20000000 wait: 217060 idx: 1.90 time: 10697159 (us) work: 20000000 wait: 447224 idx: 1.87 time: 12242250 (us) work: 20000000 wait: 1930175 idx: 1.63 time: 26364658 (us) work: 20000011 wait: 15468447 idx: 0.76 time: 158338977 (us) work: 20000084 wait: 144048265 idx: 0.13 2.6.24.1-smp time: 10570521 (us) work: 20000000 wait: 208947 idx: 1.89 time: 10699224 (us) work: 20000000 wait: 404644 idx: 1.87 time: ...
The real workload doesn't see much of an improvement. The changes I did when tinkering yesterday seem like they're better at modelling just what's going on with that one. I've added the new testcase I'm using for the numbers I posted last night at http://lixom.net/~olof/threadtest/new/, including numbers for the various kernels. I also included the binaries I built. (with "gcc -DLOOPS=<size> testcase.c -lpthread"). I tried graphing the numbers as well, it looks like for larger workloads that 2.6.22 has a fixed benefit over newer kernels. I.e. it seems quicker at rebalancing, but once things balance out they're obviously doing ok independent of kernel version. Graph at http://lixom.net/olof/threadtest/new/schedgraph.pdf. I couldn't figure out how to make the X axis linear, so it obviously looks odd with the current powers-of-two at the end instead of linear, but the differences can still be seen clearly. -Olof --
So the real application is trying to yield? If so, you could try prodding /proc/sys/kernel/sched_compat_yield. It shouldn't matter if you yield or not really, that should reduce the number of non-work spin cycles wasted awaiting preemption as threads execute in series (the problem), and should improve your performance numbers, but not beyond single threaded. If I plugged a yield into the busy wait, I would expect to see a large behavioral difference due to yield implementation changes, but that would only be a symptom in this case, no? Yield should be a noop. -Mike --
The real application wasn't using yield, but it also didn't get hung up on the busy-wait loop, that was a mistake by me on Friday. Practically all the time of that application was spent in the actual workload loop. I tried adding a yield at the bottom loop, but it didn't make a Exactly. It made a big impact on the first testcase from Friday, where the spin-off thread spent the bulk of the time in the busy-wait loop, with a very small initial workload loop. Thus the yield passed the cpu over to the other thread who got a chance to run the small workload, followed by a quick finish by both of them. The better model spends the bulk of the time in the first workload loop, so yielding doesn't gain at all the same amount. I still added it to rule out that it was a factor in the time differences. -Olof --
There is a strong dependency on execution order in this testcase. Between cpu affinity and giving the child a little head start to reduce the chance (100% if child wakes on same CPU and doesn't preempt parent) of busy wait, modified testcase behaves. I don't think I should need the CPU affinity, but I do. If you plunk a usleep(1) in prior to calling thread_func() does your testcase performance change radically? If so, I wonder if the real application has the same kind of dependency. -Mike
The answer is yes for 2.6.22, and no for 2.6.24, which surprised me. 2.6.22.17-smp homer:..local/tmp # ./threadtest2 10000000 loops time 16215 ms 10000000 loops time 16268 ms 10000000 loops time 16246 ms homer:..local/tmp # ./threadtest3 (with usleep(1)) 10000000 loops time 13938 ms 10000000 loops time 13921 ms 10000000 loops time 13898 ms 2.6.24.1-smp homer:..local/tmp # ./threadtest2 10000000 loops time 14663 ms 10000000 loops time 14523 ms 10000000 loops time 14466 ms homer:..local/tmp # ./threadtest3 10000000 loops time 14513 ms 10000000 loops time 14500 ms 10000000 loops time 14464 ms echo 0 > /proc/sys/kernel/sched_child_runs_first homer:..local/tmp # ./threadtest2 10000000 loops time 14157 ms 10000000 loops time 14097 ms 10000000 loops time 14153 ms homer:..local/tmp # ./threadtest3 10000000 loops time 14065 ms 10000000 loops time 14075 ms 10000000 loops time 14018 ms --
I think the moving to another CPU gets really dependent on the CPU type. On a P4+HT the caches are shared, and moving costs almost nothing for cache hits, while on CPUs which have other cache layouts the migration cost is higher. Obviously multi-core should be cheaper than multi-socket, by avoiding using the system memory bus, but it still can get ugly. I have an IPC test around which showed that, it ran like hell on HT, and progressively worse as cache because less shared. I wonder why the latest git works so much better? -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
Yes, I'm wondering the same. With latest git, ~400 usec work units suffice to achieve overlap (on my P4/HT), whereas all other kernels tested require several milliseconds. -Mike --
