"As far as my testsystem goes, v2.6.23 beats v2.6.22.9 in sysbench," explained Ingo Molnar in response to a posting showing the opposite results. He referred to his own testing results and explained:
"As you can see it in the graph, v2.6.23 schedules much more consistently too. [ v2.6.22 has a small (but potentially statistically insignificant) edge at 4-6 clients, and CFS has a slightly better peak (which is statistically insignificant). ]"
Ingo noted that he was unable to find information on how the other benchmark was generated: "there are no .configs or other testing details at or around that URL that i could use to reproduce their result precisely, so at least a minimal bugreport would be nice." He then offered some tips on how sysbench works, along with some suggested tunings: "sysbench is a pretty 'batched' workload: it benefits most from batchy scheduling: the client doing as much work as it can, then server doing as much work as it can - and so on. The longer the client can work the more cache-efficient the workload is. Any round-trip to the server due to pesky preemption only blows up the cache footprint of the workload and gives lower throughput."
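The "batchy scheduling" argument can be made concrete with a toy model. The sketch below is only an illustration under stated assumptions, not a measurement: two tasks alternate, every switch costs a fixed cache-refill penalty, and one unit of work takes one unit of time. The function name `throughput` and all of its parameters are hypothetical.

```python
import math


def throughput(total_work, slice_len, refill_cost):
    """Work units completed per time unit when two tasks alternate.

    total_work  -- work units to complete overall (hypothetical number)
    slice_len   -- work units a task runs before the other task is switched in
    refill_cost -- time units lost re-warming the cache after each switch
    """
    # Every slice ends with a switch to the other task.
    switches = math.ceil(total_work / slice_len)
    # One time unit per work unit, plus the penalty paid on each switch.
    total_time = total_work + switches * refill_cost
    return total_work / total_time


if __name__ == "__main__":
    # Longer uninterrupted runs mean fewer switches and higher throughput.
    for slice_len in (1, 10, 100, 1000):
        print(slice_len, round(throughput(10_000, slice_len, refill_cost=5), 3))
```

In this model throughput approaches 1.0 as the slice length grows, which is the effect Ingo describes: the longer each side can work undisturbed, the less time is lost to cache refills.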
From: Nicholas Miell <nmiell@...>
Subject: Re: Linux 2.6.23
Date: Oct 10, 2:12 am 2007
On Tue, 2007-10-09 at 13:54 -0700, Linus Torvalds wrote:
> Finally.
>
> Yeah, it got delayed, not because of any huge issues, but because of
> various bugfixes trickling in and causing me to reset my "release clock"
> all the time. But it's out there now, and hopefully better for the wait.
>
> Not a whole lot of changes since -rc9, although there's a few updates to
> mips, sparc64 and blackfin in there. Ignoring those arch updates, there's
> basically a number of mostly one-liners (mostly in drivers, but there's
> some networking fixes and soem VFS/VM fixes there too).
>
> Shortlog and diffstat appended (both relative to -rc9, of course - the
> full log from 2.6.22 is on kernel.org as usual).
>
> I want this to be what people look at for a few days, but expect the x86
> merge to go ahead after that. So far, all indications are still that it's
> going to be all smooth sailing, but hey, those indicators seem to always
> say that, and only after the fact do people notice any problems ;)
>
> Linus
Does CFS still generate the following sysbench graphs with 2.6.23, or
did that get fixed?
http://people.freebsd.org/~kris/scaling/linux-pgsql.png
http://people.freebsd.org/~kris/scaling/linux-mysql.png
(There's also some interesting FreeBSD vs. Linux graphs in
http://people.freebsd.org/~kris/scaling/Scalability%20Update.pdf , but
AFAIK those comparisons are more indicative of glibc malloc performance
than Linux performance.)
--
Nicholas Miell <nmiell@comcast.net>
-
From: Ingo Molnar <mingo@...>
Subject: Re: Linux 2.6.23
Date: Oct 10, 6:14 am 2007
* Nicholas Miell <nmiell@comcast.net> wrote:
> Does CFS still generate the following sysbench graphs with 2.6.23, or
> did that get fixed?
>
> http://people.freebsd.org/~kris/scaling/linux-pgsql.png
> http://people.freebsd.org/~kris/scaling/linux-mysql.png
as far as my testsystem goes, v2.6.23 beats v2.6.22.9 in sysbench:
http://redhat.com/~mingo/misc/sysbench.jpg
As you can see it in the graph, v2.6.23 schedules much more consistently
too. [ v2.6.22 has a small (but potentially statistically insignificant)
edge at 4-6 clients, and CFS has a slightly better peak (which is
statistically insignificant). ]
( Config is at http://redhat.com/~mingo/misc/config, system is Core2Duo
1.83 GHz, mysql-5.0.45, glibc-2.6. Nothing fancy either in the config
nor in the setup - everything is pretty close to the defaults. )
i'm aware of a 2.6.21 vs. 2.6.23 sysbench regression report, and it
apparently got resolved after various changes to the test environment:
http://jeffr-tech.livejournal.com/10103.html
" [<CFS>] has virtually no dropoff and performs better under load than
the default 2.6.21 scheduler. " (paraphrased)
(The new link you posted, just a few hours after the release of v2.6.23,
has not been reported to lkml before AFAICS - when did you become aware
of it? If you learned about it before v2.6.23 it might have been useful
to report it to the v2.6.23 regression list.)
At a quick glance there are no .configs or other testing details at or
around that URL that i could use to reproduce their result precisely, so
at least a minimal bugreport would be nice.
In any case, here are a few general comments about sysbench numbers:
Sysbench is a pretty 'batched' workload: it benefits most from batchy
scheduling: the client doing as much work as it can, then server doing
as much work as it can - and so on. The longer the client can work the
more cache-efficient the workload is. Any round-trip to the server due
to pesky preemption only blows up the cache footprint of the workload
and gives lower throughput.
This kind of workload would probably run best on DOS or Windows 3.11,
with no preemptive scheduling done at all. In other words: run both
mysqld and the client as SCHED_FIFO to get the best performance out of
it. So in that sense the workload is a bit similar to dbench.
The other thing is that mysqld does _tons_ of sys_time() calls, so GTOD
differences between .22 and .23 might cause extra overhead - especially
with 8 CPUs/cores. Does the sys_time() scalability patch below improve
sysbench performance for you? (i'm not sure about psqld)
If it's indeed due to batched vs. well-spread-out scheduling behavior
(which is possible), there are a few things you could do to make
scheduling more batched:
1) start the DB daemon up as SCHED_BATCH:
schedtool -B -e service mysqld restart
(and do the same with the client-side commands as well)
or:
schedtool -B $$
to mark the parent shell as SCHED_BATCH - then start up the DB and
start the client workload. (All other tasks not started from this
shell will still be SCHED_OTHER, so only your mysql workload will be
affected.) For example "beagled" already runs under SCHED_BATCH by
default.
SCHED_BATCH will cause the scheduler to batch up the workload more.
You basically tell the scheduler: "this workload really wants
throughput above all", and the scheduler takes that hint and acts
upon it. (it's still not as drastic as SCHED_FIFO, it's somewhere
between SCHED_OTHER and SCHED_FIFO, in terms of batching. Start up
your DB and your client as SCHED_FIFO via "schedtool -F -p 10 ..." to
establish the best-case batching win.)
2) check out the v22 CFS backport patch which has the latest & greatest
scheduler code, from http://people.redhat.com/mingo/cfs-scheduler/ .
Does performance go up for you with it? It's somewhat less
preemption-eager, which might as well make the crucial difference for
sysbench.
3) if it's enabled, disable CONFIG_PREEMPT=y. CONFIG_PREEMPT can cause
unwanted overscheduling and cache-trashing under overload.
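As an illustration that is not part of the original mail: the schedtool commands in item 1 can also be expressed at the API level. The sketch below uses Python's os.sched_setscheduler() on Linux and assumes the calling process is allowed to change its policy; SCHED_FIFO normally needs root or CAP_SYS_NICE. The helper names policy_name, make_batch and make_fifo are hypothetical.

```python
import os


def policy_name(policy):
    """Map a scheduling policy constant to a readable name."""
    names = {
        os.SCHED_OTHER: "SCHED_OTHER",
        os.SCHED_BATCH: "SCHED_BATCH",
        os.SCHED_FIFO: "SCHED_FIFO",
    }
    return names.get(policy, "unknown(%d)" % policy)


def make_batch(pid=0):
    """Switch `pid` (0 = calling process) to SCHED_BATCH, like `schedtool -B`."""
    # SCHED_BATCH takes a static priority of 0, the same as SCHED_OTHER.
    os.sched_setscheduler(pid, os.SCHED_BATCH, os.sched_param(0))
    return os.sched_getscheduler(pid)


def make_fifo(pid=0, priority=10):
    """Switch `pid` to SCHED_FIFO, like `schedtool -F -p 10`; usually needs root."""
    os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))
    return os.sched_getscheduler(pid)


if __name__ == "__main__":
    print("before:", policy_name(os.sched_getscheduler(0)))
    print("after: ", policy_name(make_batch()))
    # Processes started from here on inherit SCHED_BATCH, so launching the
    # DB and the client from this process affects only that workload.
```

Calling make_batch() before launching the workload mirrors the "schedtool -B $$" approach: only processes started from the marked parent are affected.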
hope this helps, and i'm definitely interested in more feedback about
this,
Ingo
Index: linux/kernel/time.c
===================================================================
--- linux.orig/kernel/time.c
+++ linux/kernel/time.c
@@ -57,11 +57,7 @@ EXPORT_SYMBOL(sys_tz);
*/
asmlinkage long sys_time(time_t __user * tloc)
{
- time_t i;
- struct timespec tv;
-
- getnstimeofday(&tv);
- i = tv.tv_sec;
+ time_t i = get_seconds();
if (tloc) {
if (put_user(i,tloc))
Index: linux/kernel/time/timekeeping.c
===================================================================
--- linux.orig/kernel/time/timekeeping.c
+++ linux/kernel/time/timekeeping.c
@@ -49,19 +49,12 @@ struct timespec wall_to_monotonic __attr
static unsigned long total_sleep_time; /* seconds */
EXPORT_SYMBOL(xtime);
-
-#ifdef CONFIG_NO_HZ
static struct timespec xtime_cache __attribute__ ((aligned (16)));
static inline void update_xtime_cache(u64 nsec)
{
xtime_cache = xtime;
timespec_add_ns(&xtime_cache, nsec);
}
-#else
-#define xtime_cache xtime
-/* We do *not* want to evaluate the argument for this case */
-#define update_xtime_cache(n) do { } while (0)
-#endif
static struct clocksource *clock; /* pointer to current clocksource */
-
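The patch above replaces getnstimeofday() with get_seconds() in sys_time() and keeps xtime_cache maintained unconditionally. A toy user-space model can show why that helps a workload that calls time() very often. This is a simplified sketch, not the kernel implementation: the class ToyTimekeeper and its methods are hypothetical, and the "clocksource" is just a Python callable.

```python
import time


class ToyTimekeeper:
    """User-space toy model; names echo the patch, behaviour is simplified."""

    NSEC_PER_SEC = 1_000_000_000

    def __init__(self, clock_ns):
        self._clock_ns = clock_ns   # callable standing in for the clocksource
        self.clock_reads = 0        # how often the "hardware" was consulted
        self.xtime_cache_sec = self._read_clock() // self.NSEC_PER_SEC

    def _read_clock(self):
        self.clock_reads += 1
        return self._clock_ns()

    def tick(self):
        """Periodic update, standing in for the timer tick."""
        self.xtime_cache_sec = self._read_clock() // self.NSEC_PER_SEC

    def time_old(self):
        """Old path: full-precision read, then the nanoseconds are discarded."""
        return self._read_clock() // self.NSEC_PER_SEC

    def time_new(self):
        """Patched path: return the cached seconds without reading the clock."""
        return self.xtime_cache_sec


if __name__ == "__main__":
    tk = ToyTimekeeper(time.time_ns)
    for _ in range(1000):
        tk.time_old()
    print("clock reads, old path:", tk.clock_reads - 1)
    before = tk.clock_reads
    for _ in range(1000):
        tk.time_new()
    print("clock reads, new path:", tk.clock_reads - before)
```

In this model the cached value can lag behind the clock until the next tick(), which is acceptable for a call that only reports whole seconds.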
more to come
http://people.redhat.com/mingo/private/sysbench-sched-devel.jpg
Those patches (and maybe more?) will get merged for 2.6.24
more
I'd love to hear more when more info becomes available...
Hmm
Doesn't seem better in all aspects.
Would like to see it against -ck too.
FreeBSD seems to have the upper hand.
I'm the author of FreeBSD's
I'm the author of FreeBSD's ULE scheduler and have been intimately involved with the work to improve mysql scaling on FreeBSD. I'm also the author of the jeffr-tech journal quoted above.
There are a few flaws with Ingo's email. First, he's comparing dual core test results to 4x2 core test results. Scheduling two cores that share cache is a significantly different problem from scheduling 8 cores.
Secondly, he quotes a journal entry where I point out that CFS has basically the same performance as our 4BSD scheduler. That should read the same bad performance as the 4BSD scheduler. It didn't have the considerable dropoff under high load but it also was about 25-30% slower than O(1) at peak. Also in that very same journal there are configs and specific information about reproducing the test results.
ULE and O(1) used to have the same peak but further optimizations to FreeBSD have helped us pull further away. Not all of them are scheduler related. The + adaptive libthr line is an early implementation of PTHREAD_MUTEX_ADAPTIVE that mysql uses in Linux to help improve behavior with contention on userspace locks. Unfortunately that will not make it into 7.0; however, it might be merged in for 7.1 from -CURRENT.
Furthermore, I documented the significant slowdown with CFS in my journal on June 26th. I did contact some linux developers. I did not contact Ingo directly because previous emails to him have gone unanswered. No ill will there, just hasn't been fruitful.
I'll continue to report my findings in my journal.
There are a few flaws with
That's a blatant distortion of what Ingo said. He clearly qualified his results with:
He described his test-system and then he asked for more info about the test environment. If you read his mail, nowhere does he claim that the measurements are not valid. Compounded by:
How convenient ...
More he didn't compare BSD
Moreover, he didn't compare BSD with Linux. He talked only about CFS vs. the O(1) scheduler.
A lot of smart people get
A lot of smart people get turned away by the Scheduler Mafia. Con, Roman, and now this BSD Scheduler guru. Something is wrong.
At least not many smart
At least not many smart people put on tin foil hats and start making assumptions.
Working on it
Apparently Ingo Molnar is already working on this to fix the problem or at least locate it:
http://kerneltrap.org/mailarchive/linux-kernel/2007/10/11/335077
Re: I'm the author of FreeBSD's
Thank you for all your hard work! I've been testing 7.0-CURRENT extensively for the future migration of our Linux servers upon release of 7.0 (or 7.1, depending on more testing results). The performance is phenomenal, along with the reliability we've come to expect from the BSDs.