Measuring Process Scheduler Performance

Submitted by Jeremy on October 10, 2007 - 9:02am.
Linux news

"As far as my testsystem goes, v2.6.23 beats v2.6.22.9 in sysbench," explained Ingo Molnar in response to a posting showing the opposite results. He referred to his own testing results and explained:

"As you can see it in the graph, v2.6.23 schedules much more consistently too. [ v2.6.22 has a small (but potentially statistically insignificant) edge at 4-6 clients, and CFS has a slightly better peak (which is statistically insignificant). ]"

Ingo noted that he was unable to find information on how the other benchmark was generated, "there are no .configs or other testing details at or around that URL that i could use to reproduce their result precisely, so at least a minimal bugreport would be nice." He then offered some tips on how sysbench works and some suggested tunings, "sysbench is a pretty 'batched' workload: it benefits most from batchy scheduling: the client doing as much work as it can, then server doing as much work as it can - and so on. The longer the client can work the more cache-efficient the workload is. Any round-trip to the server due to pesky preemption only blows up the cache footprint of the workload and gives lower throughput."


From: Nicholas Miell <nmiell@...>
Subject: Re: Linux 2.6.23
Date: Oct 10, 2:12 am 2007

On Tue, 2007-10-09 at 13:54 -0700, Linus Torvalds wrote:
> Finally.
> 
> Yeah, it got delayed, not because of any huge issues, but because of 
> various bugfixes trickling in and causing me to reset my "release clock" 
> all the time. But it's out there now, and hopefully better for the wait.
> 
> Not a whole lot of changes since -rc9, although there's a few updates to 
> mips, sparc64 and blackfin in there.  Ignoring those arch updates, there's 
> basically a number of mostly one-liners (mostly in drivers, but there's 
> some networking fixes and some VFS/VM fixes there too).
> 
> Shortlog and diffstat appended (both relative to -rc9, of course - the 
> full log from 2.6.22 is on kernel.org as usual).
> 
> I want this to be what people look at for a few days, but expect the x86 
> merge to go ahead after that. So far, all indications are still that it's 
> going to be all smooth sailing, but hey, those indicators seem to always 
> say that, and only after the fact do people notice any problems ;)
> 
> 		Linus

Does CFS still generate the following sysbench graphs with 2.6.23, or
did that get fixed?

http://people.freebsd.org/~kris/scaling/linux-pgsql.png
http://people.freebsd.org/~kris/scaling/linux-mysql.png

(There's also some interesting FreeBSD vs. Linux graphs in
http://people.freebsd.org/~kris/scaling/Scalability%20Update.pdf , but
AFAIK those comparisons are more indicative of glibc malloc performance
than Linux performance.)

-- 
Nicholas Miell <nmiell@comcast.net>

-

From: Ingo Molnar <mingo@...>
Subject: Re: Linux 2.6.23
Date: Oct 10, 6:14 am 2007

* Nicholas Miell <nmiell@comcast.net> wrote:

> Does CFS still generate the following sysbench graphs with 2.6.23, or
> did that get fixed?
> 
> http://people.freebsd.org/~kris/scaling/linux-pgsql.png
> http://people.freebsd.org/~kris/scaling/linux-mysql.png

as far as my testsystem goes, v2.6.23 beats v2.6.22.9 in sysbench:

   http://redhat.com/~mingo/misc/sysbench.jpg

As you can see it in the graph, v2.6.23 schedules much more
consistently too. [ v2.6.22 has a small (but potentially statistically
insignificant) edge at 4-6 clients, and CFS has a slightly better peak
(which is statistically insignificant). ]

( Config is at http://redhat.com/~mingo/misc/config, system is Core2Duo
  1.83 GHz, mysql-5.0.45, glibc-2.6. Nothing fancy either in the config
  nor in the setup - everything is pretty close to the defaults. )

i'm aware of a 2.6.21 vs. 2.6.23 sysbench regression report, and it
apparently got resolved after various changes to the test environment:

   http://jeffr-tech.livejournal.com/10103.html

   " [<CFS>] has virtually no dropoff and performs better under load
     than the default 2.6.21 scheduler. " (paraphrased)

(The new link you posted, just a few hours after the release of
v2.6.23, has not been reported to lkml before AFAICS - when did you
become aware of it? If you learned about it before v2.6.23 it might
have been useful to report it to the v2.6.23 regression list.)

At a quick glance there are no .configs or other testing details at or
around that URL that i could use to reproduce their result precisely,
so at least a minimal bugreport would be nice.

In any case, here are a few general comments about sysbench numbers:

Sysbench is a pretty 'batched' workload: it benefits most from batchy
scheduling: the client doing as much work as it can, then server doing
as much work as it can - and so on. The longer the client can work the
more cache-efficient the workload is. Any round-trip to the server due
to pesky preemption only blows up the cache footprint of the workload
and gives lower throughput.

This kind of workload would probably run best on DOS or Windows 3.11,
with no preemptive scheduling done at all. In other words: run both
mysqld and the client as SCHED_FIFO to get the best performance out of
it. So in that sense the workload is a bit similar to dbench.

The other thing is that mysqld does _tons_ of sys_time() calls, so GTOD
differences between .22 and .23 might cause extra overhead - especially
with 8 CPUs/cores. Does the sys_time() scalability patch below improve
sysbench performance for you? (i'm not sure about psqld)

If it's indeed due to batched vs. well-spread-out scheduling behavior
(which is possible), there are a few things you could do to make
scheduling more batched:

1) start the DB daemon up as SCHED_BATCH:

     schedtool -B -e service mysqld restart

   (and do the same with the client-side commands as well)

   or:

     schedtool -B $$

   to mark the parent shell as SCHED_BATCH - then start up the DB and
   start the client workload. (All other tasks not started from this
   shell will still be SCHED_OTHER, so only your mysql workload will be
   affected.) For example "beagled" already runs under SCHED_BATCH by
   default.

   SCHED_BATCH will cause the scheduler to batch up the workload more.
   You basically tell the scheduler: "this workload really wants
   throughput above all", and the scheduler takes that hint and acts
   upon it. (it's still not as drastic as SCHED_FIFO, it's somewhere
   between SCHED_OTHER and SCHED_FIFO, in terms of batching. Start up
   your DB and your client as SCHED_FIFO via "schedtool -F -p 10 ..."
   to establish the best-case batching win.)

2) check out the v22 CFS backport patch which has the latest & greatest
   scheduler code, from http://people.redhat.com/mingo/cfs-scheduler/ .
   Does performance go up for you with it? It's somewhat less
   preemption-eager, which might as well make the crucial difference
   for sysbench.

3) if it's enabled, disable CONFIG_PREEMPT=y. CONFIG_PREEMPT can cause
   unwanted overscheduling and cache-trashing under overload.

hope this helps, and i'm definitely interested in more feedback about
this,

	Ingo

Index: linux/kernel/time.c
===================================================================
--- linux.orig/kernel/time.c
+++ linux/kernel/time.c
@@ -57,11 +57,7 @@ EXPORT_SYMBOL(sys_tz);
  */
 asmlinkage long sys_time(time_t __user * tloc)
 {
-	time_t i;
-	struct timespec tv;
-
-	getnstimeofday(&tv);
-	i = tv.tv_sec;
+	time_t i = get_seconds();
 
 	if (tloc) {
 		if (put_user(i,tloc))
Index: linux/kernel/time/timekeeping.c
===================================================================
--- linux.orig/kernel/time/timekeeping.c
+++ linux/kernel/time/timekeeping.c
@@ -49,19 +49,12 @@ struct timespec wall_to_monotonic __attr
 static unsigned long total_sleep_time;		/* seconds */
 EXPORT_SYMBOL(xtime);
 
-
-#ifdef CONFIG_NO_HZ
 static struct timespec xtime_cache __attribute__ ((aligned (16)));
 static inline void update_xtime_cache(u64 nsec)
 {
 	xtime_cache = xtime;
 	timespec_add_ns(&xtime_cache, nsec);
 }
-#else
-#define xtime_cache xtime
-/* We do *not* want to evaluate the argument for this case */
-#define update_xtime_cache(n) do { } while (0)
-#endif
 
 static struct clocksource *clock; /* pointer to current clocksource */
 
-


more to come

October 10, 2007 - 10:09am
Anonymous (not verified)

http://people.redhat.com/mingo/private/sysbench-sched-devel.jpg

Those patches (and maybe more?) will get merged for 2.6.24

more

October 10, 2007 - 12:08pm
jospoortvliet (not verified)

I'd love to hear more when more info becomes available...

Hmm

October 10, 2007 - 12:26pm
Fred Flinta (not verified)

Doesn't seem better in all aspects.

Would like to see it against -ck too.

FreeBSD seems to have the upper hand.

I'm the author of FreeBSD's

October 10, 2007 - 3:48pm
Jeff Roberson (not verified)

I'm the author of FreeBSD's ULE scheduler and have been intimately involved with the work to improve mysql scaling on FreeBSD. I'm also the author of the jeffr-tech journal quoted above.

There are a few flaws with Ingo's email. First, he's comparing dual core test results to 4x2 core test results. Scheduling two cores that share cache is a significantly different problem from scheduling 8 cores.

Secondly, he quotes a journal entry where I point out that CFS has basically the same performance as our 4BSD scheduler. That should read the same bad performance as the 4BSD scheduler. It didn't have the considerable dropoff under high load but it also was about 25-30% slower than O(1) at peak. Also in that very same journal there are configs and specific information about reproducing the test results.

ULE and O(1) used to have the same peak but further optimizations to FreeBSD have helped us pull further away. Not all of them are scheduler related. The + adaptive libthr line is an early implementation of PTHREAD_MUTEX_ADAPTIVE that mysql uses in Linux to help improve behavior with contention on userspace locks. Unfortunately that will not make it into 7.0, however it might be merged in for 7.1 from -CURRENT.

Furthermore, I documented the significant slowdown with CFS in my journal on June 26th. I did contact some linux developers. I did not contact Ingo directly because previous emails to him have gone unanswered. No ill will there, just hasn't been fruitful.

I'll continue to report my findings in my journal.

There are a few flaws with

October 10, 2007 - 8:28pm
Anonymous (not verified)

There are a few flaws with Ingo's email. First, he's comparing dual core test results to 4x2 core test results.

That's a blatant distortion of what Ingo said. He clearly qualified his results with:

as far as my testsystem goes

He described his test-system and then he asked for more info about the test environment. If you read his mail, nowhere does he claim that the measurements are not valid. Compounded by:

I did not contact Ingo directly because previous emails to him have gone unanswered.

How convenient ...

Moreover, he didn't compare BSD

October 10, 2007 - 8:48pm
Anonymous (not verified)

Moreover, he didn't compare BSD with Linux. He talked about the CFS vs. O(1) schedulers only.

A lot of smart people get

October 11, 2007 - 8:55am
Anonymous (not verified)

A lot of smart people get turned away by the Scheduler Mafia. Con, Roman, and now this BSD Scheduler guru. Something is wrong.

At least not many smart

October 11, 2007 - 2:20pm
Anonymous (not verified)

At least not many smart people put on tin foil hats and start making assumptions.

Working on it

October 11, 2007 - 9:34am
Anonymous (not verified)

Apparently Ingo Molnar is already working on this to fix the problem or at least locate it:

http://kerneltrap.org/mailarchive/linux-kernel/2007/10/11/335077

Re: I'm the author of FreeBSD's

October 11, 2007 - 1:43pm

Thank you for all your hard work! I've been testing 7.0-CURRENT extensively for the future migration of our Linux servers upon release of 7.0 (or 7.1, depending on more testing results). The performance is phenomenal, along with the reliability we've come to expect from BSD's.
