Nick Piggin used 'git bisect' to track an lmbench regression to the main CFS commit, leading to an interesting discussion between Nick and Ingo Molnar. Ultimately the regression was traced to the temporary configurability of the scheduler while it is being tuned for optimal performance: "one reason for the extra overhead is the current tunability of CFS, but that is not fundamental, it's caused by the many knobs that CFS has at the moment." The solution, already coded but not yet merged into the mainline kernel, "changes those knobs to constants, allowing the compiler to optimize the math better and reduce code size," and as a result, "CFS can be faster at micro-context-switching than 2.6.22."
Ingo described the lmbench configuration in question as a "micro-benchmark", and noted that with a macro-benchmark the improvement was more pronounced, "because with CFS the _quality_ of scheduling decisions has increased. So even if we had increased micro-costs (which we wont have once the current tuning period is over and we cast the CFS parameters into constants), the quality of macro-scheduling can offset that, and not only on the desktop!" He summarized, "that's why our main focus in CFS was on the macro-properties of scheduling _first_, and then the micro-properties are adjusted to the macro-constraints as a second layer."
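To make the knobs-versus-constants point concrete, here is a minimal sketch of why a run-time tunable can cost more than a constant. It is an illustration only and is not taken from the const-tuning patch: the names `sketch_granularity_ns`, `SKETCH_GRANULARITY_NS`, `periods_tunable` and `periods_const`, and the value 4000000, are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical run-time knob (think: a value a sysctl could rewrite).
 * It is a plain global, so the compiler has to load it on each use. */
unsigned int sketch_granularity_ns = 4000000;

/* The same value cast into a compile-time constant. */
#define SKETCH_GRANULARITY_NS 4000000u

/* Divisor only known at run time: a load plus a real division is needed. */
uint64_t periods_tunable(uint64_t delta_ns)
{
	return delta_ns / sketch_granularity_ns;
}

/* Divisor is a constant: the compiler is free to replace the division
 * with cheaper multiply-and-shift code and to drop the memory load. */
uint64_t periods_const(uint64_t delta_ns)
{
	return delta_ns / SKETCH_GRANULARITY_NS;
}
```

Both functions return the same result; only the code the compiler is allowed to generate differs, which is the kind of saving the quoted explanation refers to.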
From: Nick Piggin [email blocked]
Subject: lmbench ctxsw regression with CFS
Date: Thu, 2 Aug 2007 04:15:25 +0200

Hi,

I didn't follow all of the scheduler debates and flamewars, so apologies if this was already covered. Anyway.

lmbench 3 lat_ctx context switching time with 2 processes bound to a single core increases by between 25%-35% on my Core2 system (didn't do enough runs to get more significance, but it is around 30%). The problem bisected to the main CFS commit.

I was really hoping that a smaller runqueue data structure could actually increase performance with the common case of small numbers of tasks :(

I assume this was a known issue before CFS was merged. Do you know what is causing the slowdown? Any plans to fix it?

Thanks,
Nick

From: Linus Torvalds [email blocked]
Subject: Re: lmbench ctxsw regression with CFS
Date: Wed, 1 Aug 2007 19:31:26 -0700 (PDT)

On Thu, 2 Aug 2007, Nick Piggin wrote:
>
> lmbench 3 lat_ctx context switching time with 2 processes bound to a
> single core increases by between 25%-35% on my Core2 system (didn't do
> enough runs to get more significance, but it is around 30%). The problem
> bisected to the main CFS commit.

One thing to check out is whether the lmbench numbers are "correct". Especially on SMP systems, the lmbench numbers are actually *best* when the two processes run on the same CPU, even though that's not really at all the best scheduling - it's just that it artificially improves lmbench numbers because of the close cache affinity for the pipe data structures.

So when running the lmbench scheduling benchmarks on SMP, it actually makes sense to run them *pinned* to one CPU, because then you see the true scheduler performance. Otherwise you easily get noise due to balancing issues, and a clearly better scheduler can in fact generate worse numbers for lmbench.

Did you do that? It's at least worth testing. I'm not saying it's the case here, but it's one reason why lmbench3 has the option to either keep processes on the same CPU or force them to spread out (and both cases are very interesting for scheduler testing, and tell different things: the "pin them to the same CPU" shows the latency on one runqueue, while the "pin them to different CPU's" shows the latency of a remote wakeup).

IOW, while we used the lmbench scheduling benchmark pretty extensively in early scheduler tuning, if you select the defaults ("let the system just schedule processes on any CPU") the end result really isn't necessarily a very meaningful value: getting the best lmbench numbers actually requires you to do things that tend to be actively *bad* in real life.

Of course, a perfect scheduler would notice when two tasks are *so* closely related and only do synchronous wakups, that it would keep them on the same core, and get the best possible scores for lmbench, while not doing that for other real-life situations. So with a *really* smart scheduler, lmbench numbers would always be optimal, but I'm not sure aiming for that kind of perfection is even worth it!

Linus

From: Nick Piggin [email blocked]
Subject: Re: lmbench ctxsw regression with CFS
Date: Thu, 2 Aug 2007 04:41:32 +0200

On Wed, Aug 01, 2007 at 07:31:26PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 2 Aug 2007, Nick Piggin wrote:
> >
> > lmbench 3 lat_ctx context switching time with 2 processes bound to a
> > single core increases by between 25%-35% on my Core2 system (didn't do
> > enough runs to get more significance, but it is around 30%). The problem
> > bisected to the main CFS commit.
>
> One thing to check out is whether the lmbench numbers are "correct".
> Especially on SMP systems, the lmbench numbers are actually *best* when
> the two processes run on the same CPU, even though that's not really at
> all the best scheduling - it's just that it artificially improves lmbench
> numbers because of the close cache affinity for the pipe data structures.

Yes, I bound them to a single core.

> So when running the lmbench scheduling benchmarks on SMP, it actually
> makes sense to run them *pinned* to one CPU, because then you see the true
> scheduler performance. Otherwise you easily get noise due to balancing
> issues, and a clearly better scheduler can in fact generate worse
> numbers for lmbench.
>
> Did you do that? It's at least worth testing. I'm not saying it's the case
> here, but it's one reason why lmbench3 has the option to either keep
> processes on the same CPU or force them to spread out (and both cases are
> very interesting for scheduler testing, and tell different things: the
> "pin them to the same CPU" shows the latency on one runqueue, while the
> "pin them to different CPU's" shows the latency of a remote wakeup).
>
> IOW, while we used the lmbench scheduling benchmark pretty extensively in
> early scheduler tuning, if you select the defaults ("let the system just
> schedule processes on any CPU") the end result really isn't necessarily a
> very meaningful value: getting the best lmbench numbers actually requires
> you to do things that tend to be actively *bad* in real life.
>
> Of course, a perfect scheduler would notice when two tasks are *so*
> closely related and only do synchronous wakups, that it would keep them on
> the same core, and get the best possible scores for lmbench, while not
> doing that for other real-life situations. So with a *really* smart
> scheduler, lmbench numbers would always be optimal, but I'm not sure
> aiming for that kind of perfection is even worth it!

Agreed with all your comments on multiprocessor balancing, but that was eliminated in these tests. I remote wakeup latency is another thing I want to test, but it isn't so interesting until the serial regression is fixed.

From: Ingo Molnar [email blocked]
Subject: Re: lmbench ctxsw regression with CFS
Date: Thu, 2 Aug 2007 09:19:56 +0200

* Nick Piggin [email blocked] wrote:

> > One thing to check out is whether the lmbench numbers are "correct".
> > Especially on SMP systems, the lmbench numbers are actually *best*
> > when the two processes run on the same CPU, even though that's not
> > really at all the best scheduling - it's just that it artificially
> > improves lmbench numbers because of the close cache affinity for the
> > pipe data structures.
>
> Yes, I bound them to a single core.

could you send me the .config you used?

    Ingo

From: Nick Piggin [email blocked]
Subject: Re: lmbench ctxsw regression with CFS
Date: Thu, 2 Aug 2007 09:31:23 +0200

On Thu, Aug 02, 2007 at 09:19:56AM +0200, Ingo Molnar wrote:
>
> * Nick Piggin [email blocked] wrote:
>
> > > One thing to check out is whether the lmbench numbers are "correct".
> > > Especially on SMP systems, the lmbench numbers are actually *best*
> > > when the two processes run on the same CPU, even though that's not
> > > really at all the best scheduling - it's just that it artificially
> > > improves lmbench numbers because of the close cache affinity for the
> > > pipe data structures.
> >
> > Yes, I bound them to a single core.
>
> could you send me the .config you used?

Sure, attached...

You don't see a regression? If not, then can you send me the .config you used? Also what CPU architecture (when I tested an older CFS on a P4 IIRC the regression was much bigger like 100% more costly).

From: Ingo Molnar [email blocked]
Subject: Re: lmbench ctxsw regression with CFS
Date: Thu, 2 Aug 2007 17:44:47 +0200

* Nick Piggin [email blocked] wrote:

> > > > One thing to check out is whether the lmbench numbers are
> > > > "correct". Especially on SMP systems, the lmbench numbers are
> > > > actually *best* when the two processes run on the same CPU, even
> > > > though that's not really at all the best scheduling - it's just
> > > > that it artificially improves lmbench numbers because of the
> > > > close cache affinity for the pipe data structures.
> > >
> > > Yes, I bound them to a single core.
> >
> > could you send me the .config you used?
>
> Sure, attached...
>
> You don't see a regression? If not, then can you send me the .config
> you used? [...]

i used your config to get a few numbers and to see what happens. Here's the numbers of 10 consecutive "lat_ctx -s 0 2" runs:

          [ time in micro-seconds, smaller is better ]

          v2.6.22     v2.6.23-git        v2.6.23-git+const-param
          -------     -----------        -----------------------
           1.30          1.60                1.19
           1.30          1.36                1.18
           1.14          1.50                1.01
           1.26          1.27                1.23
           1.22          1.40                1.04
           1.13          1.34                1.09
           1.27          1.39                1.05
           1.20          1.30                1.16
           1.20          1.17                1.16
           1.25          1.33                1.01
    -------------------------------------------------------------
    avg:   1.22          1.36 (+11.3%)       1.11 (-10.3%)
    min:   1.13          1.17 ( +3.5%)       1.01 (-11.8%)
    max:   1.27          1.60 (+26.0%)       1.23 ( -3.2%)

one reason for the extra overhead is the current tunability of CFS, but that is not fundamental, it's caused by the many knobs that CFS has at the moment. The const-tuning patch (attached below, results in the rightmost column) changes those knobs to constants, allowing the compiler to optimize the math better and reduce code size. (the code movement in the patch makes up for most of its size, the change that it does is simple otherwise.)

so CFS can be faster at micro-context-switching than 2.6.22. But, at this point i'd also like to warn against putting _too_ much emphasis on lat_ctx numbers in general.

lat_ctx prints a 'derived' micro-benchmark number. It uses a pair of pipes to context-switch between tasks but only prints the delta overhead that context-switching causes. The 'full' latency of the pipe operations can be seen via the following pipe-test.c code:

    http://redhat.com/~mingo/cfs-scheduler/tools/pipe-test.c

run it to see the full cost:

    neptune:~> ./pipe-test
    4.67 usecs/loop.
    4.41 usecs/loop.
    4.46 usecs/loop.
    4.46 usecs/loop.
    4.44 usecs/loop.
    4.41 usecs/loop.

so the _full_ cost, of even this micro-benchmark, is 4-5 microseconds, not 1 microsecond. So even this artificial micro-benchmark sees an actual slowdown of only 2.8%.

if you check a macro-benchmark like "hackbench 50":

          [ time in seconds, smaller is better ]

          v2.6.22     v2.6.23-cfs
          -------     -----------
           3.019         2.842
           2.994         2.878
           2.977         2.882
           3.012         2.864
           2.996         2.882

then the difference is even starker because with CFS the _quality_ of scheduling decisions has increased. So even if we had increased micro-costs (which we wont have once the current tuning period is over and we cast the CFS parameters into constants), the quality of macro-scheduling can offset that, and not only on the desktop!

so that's why our main focus in CFS was on the macro-properties of scheduling _first_, and then the micro-properties are adjusted to the macro-constraints as a second layer.

    Ingo

----------------------------->
---
 include/linux/sched.h |    2
 kernel/sched.c        |  143 +++++++++++++++++++++++++-------------------------
 kernel/sched_fair.c   |   27 +++++----
 kernel/sched_rt.c     |   10 ---
 4 files changed, 92 insertions(+), 90 deletions(-)

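For readers who want to see what a "full" pipe round trip looks like in code, the following is a rough sketch in the same spirit as the pipe test Ingo mentions. It is not the pipe-test.c linked above: the function name `pingpong_usecs_per_loop` is made up for this illustration, and the sketch does not pin the two processes to a CPU, so absolute numbers will vary with placement.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Bounce one byte parent -> child -> parent `loops` times over two pipes.
 * Returns the average microseconds per round trip, or -1.0 on error. */
double pingpong_usecs_per_loop(long loops)
{
	int to_child[2], to_parent[2];
	struct timespec t0, t1;
	char byte = 'x';
	long done = 0;
	pid_t pid;

	if (loops <= 0 || pipe(to_child) < 0)
		return -1.0;
	if (pipe(to_parent) < 0) {
		close(to_child[0]);
		close(to_child[1]);
		return -1.0;
	}

	pid = fork();
	if (pid < 0) {
		close(to_child[0]);
		close(to_child[1]);
		close(to_parent[0]);
		close(to_parent[1]);
		return -1.0;
	}
	if (pid == 0) {
		/* Child: echo every byte back until the parent closes its end. */
		close(to_child[1]);
		close(to_parent[0]);
		while (read(to_child[0], &byte, 1) == 1 &&
		       write(to_parent[1], &byte, 1) == 1)
			;
		_exit(0);
	}

	/* Parent: keep only the ends it uses. */
	close(to_child[0]);
	close(to_parent[1]);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (done < loops &&
	       write(to_child[1], &byte, 1) == 1 &&
	       read(to_parent[0], &byte, 1) == 1)
		done++;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(to_child[1]);	/* EOF makes the child leave its loop */
	close(to_parent[0]);
	waitpid(pid, NULL, 0);

	if (done != loops)
		return -1.0;
	return ((t1.tv_sec - t0.tv_sec) * 1e6 +
		(t1.tv_nsec - t0.tv_nsec) / 1e3) / loops;
}
```

Calling `pingpong_usecs_per_loop(100000)` and printing the result gives a "usecs/loop" style figure that includes the pipe reads and writes themselves, not just the context-switch delta that lat_ctx reports.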
From: Nick Piggin [email blocked]
Subject: Re: lmbench ctxsw regression with CFS
Date: Fri, 3 Aug 2007 02:14:47 +0200

On Thu, Aug 02, 2007 at 05:44:47PM +0200, Ingo Molnar wrote:
>
> * Nick Piggin [email blocked] wrote:
>
> > > > > One thing to check out is whether the lmbench numbers are
> > > > > "correct". Especially on SMP systems, the lmbench numbers are
> > > > > actually *best* when the two processes run on the same CPU, even
> > > > > though that's not really at all the best scheduling - it's just
> > > > > that it artificially improves lmbench numbers because of the
> > > > > close cache affinity for the pipe data structures.
> > > >
> > > > Yes, I bound them to a single core.
> > >
> > > could you send me the .config you used?
> >
> > Sure, attached...
> >
> > You don't see a regression? If not, then can you send me the .config
> > you used? [...]
>
> i used your config to get a few numbers and to see what happens. Here's
> the numbers of 10 consecutive "lat_ctx -s 0 2" runs:
>
>           [ time in micro-seconds, smaller is better ]
>
>           v2.6.22     v2.6.23-git        v2.6.23-git+const-param
>           -------     -----------        -----------------------
>            1.30          1.60                1.19
>            1.30          1.36                1.18
>            1.14          1.50                1.01
>            1.26          1.27                1.23
>            1.22          1.40                1.04
>            1.13          1.34                1.09
>            1.27          1.39                1.05
>            1.20          1.30                1.16
>            1.20          1.17                1.16
>            1.25          1.33                1.01
>     -------------------------------------------------------------
>     avg:   1.22          1.36 (+11.3%)       1.11 (-10.3%)
>     min:   1.13          1.17 ( +3.5%)       1.01 (-11.8%)
>     max:   1.27          1.60 (+26.0%)       1.23 ( -3.2%)
>
> one reason for the extra overhead is the current tunability of CFS, but
> that is not fundamental, it's caused by the many knobs that CFS has at
> the moment. The const-tuning patch (attached below, results in the
> rightmost column) changes those knobs to constants, allowing the
> compiler to optimize the math better and reduce code size. (the code
> movement in the patch makes up for most of its size, the change that it
> does is simple otherwise.)

[...]

Oh good. Thanks for getting to the bottom of it. We have normally disliked too much runtime tunables in the scheduler, so I assume these are mostly going away or under a CONFIG option for 2.6.23? Or...?

What CPU did you get these numbers on? Do the indirect calls hurt much on those without an indirect predictor? (I'll try running some tests).

I must say that I don't really like the indirect calls a great deal, and they could be eliminated just with a couple of branches and direct calls.

From: Ingo Molnar [email blocked]
Subject: Re: lmbench ctxsw regression with CFS
Date: Sat, 4 Aug 2007 08:50:37 +0200

* Nick Piggin [email blocked] wrote:

> Oh good. Thanks for getting to the bottom of it. We have normally
> disliked too much runtime tunables in the scheduler, so I assume these
> are mostly going away or under a CONFIG option for 2.6.23? Or...?

yeah, they are all already under CONFIG_SCHED_DEBUG. (it's just that the add-on optimization is not upstream yet - the tunings are still being tested) Btw., with SCHED_DEBUG we now also have your domain-tree sysctl patch upstream, which has been in -mm for a near eternity.

> What CPU did you get these numbers on? Do the indirect calls hurt much
> on those without an indirect predictor? (I'll try running some tests).

it was on an older Athlon64 X2. I never saw indirect calls really hurting on modern x86 CPUs - dont both CPU makers optimize them pretty efficiently? (as long as the target function is always the same - which it is here.)

> I must say that I don't really like the indirect calls a great deal,
> and they could be eliminated just with a couple of branches and direct
> calls.

yeah - i'll try that too. We can make the indirect call the uncommon case and a NULL pointer be the common case, combined with a 'default', direct function call. But i doubt it makes a big (or even measurable) difference.

    Ingo

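The last idea in this exchange, a NULL pointer as the common case with a direct call to a default and the indirect call kept only for the uncommon case, can be sketched as follows. This is an illustration of the pattern only, not scheduler code: every name in it (`sketch_task`, `sketch_class`, `sketch_default_hook`, `sketch_double_hook`, `sketch_run_hook`) is hypothetical.

```c
#include <stddef.h>

struct sketch_task {
	int weight;
};

/* Hypothetical hook type; none of these names come from the kernel. */
typedef int (*sketch_hook_fn)(const struct sketch_task *);

struct sketch_class {
	sketch_hook_fn hook;	/* NULL means "use the default" */
};

/* Default behaviour, reached through a direct call. */
int sketch_default_hook(const struct sketch_task *t)
{
	return t->weight;
}

/* An uncommon override, reached through the function pointer. */
int sketch_double_hook(const struct sketch_task *t)
{
	return 2 * t->weight;
}

int sketch_run_hook(const struct sketch_class *c, const struct sketch_task *t)
{
	if (c->hook == NULL)
		return sketch_default_hook(t);	/* common case: branch + direct call */
	return c->hook(t);			/* uncommon case: indirect call */
}
```

The trade-off is one extra, well-predicted branch in exchange for a call whose target the compiler knows and may even inline. As Ingo notes, whether that is measurable on a given CPU is an open question.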
Indirect calls
Hi,
Could anybody explain a bit more about the indirect calls? I didn't catch the point.
Thanks in advance
Calls whose function pointer isn't known at compile-time, e.g. callbacks.
Yep
Also method invocations for virtual methods in C++. Basically, nearly everywhere you have function pointers (explicit, such as callbacks, or implied, such as method invocations) that the compiler can't resolve at compile time.
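As a small illustration of the answers above, here is the difference in plain C. The function names are invented for this example.

```c
/* Two ordinary functions with the same signature. */
int add_one(int x) { return x + 1; }
int twice(int x)   { return 2 * x; }

/* Direct call: the target is fixed when this is compiled. */
int direct_example(int x)
{
	return add_one(x);
}

/* Indirect call: the target is whatever `cb` points to at run time,
 * so the CPU has to predict or wait for the destination address. */
int indirect_example(int (*cb)(int), int x)
{
	return cb(x);
}
```

`indirect_example(add_one, 3)` and `direct_example(3)` compute the same value, but only the second tells the compiler which function will run.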
Thanks to both of you!