Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler

Previous thread: 2.6.20, Java, unresponsive userspace by Alistair John Strachan on Saturday, March 3, 2007 - 9:07 pm. (1 message)

Next thread: [PATCH] [RSDL 1/6] lists: add list splice tail by Con Kolivas on Sunday, March 4, 2007 - 12:02 am. (4 messages)
From: Con Kolivas
Date: Sunday, March 4, 2007 - 12:00 am

This message is to announce the first general public release of the "Rotating 
Staircase DeadLine" cpu scheduler. 

Based on previous work from the staircase cpu scheduler I set out to design, 
from scratch, a new scheduling policy design which satisfies every 
requirement for SCHED_NORMAL (otherwise known as SCHED_OTHER) task management.

Available for download are:

 A full rollup of the patch for 2.6.20:
http://ck.kolivas.org/patches/staircase-deadline/sched-rsdl-0.26.patch

 Split patches for 2.6.20(which will follow this email):
http://ck.kolivas.org/patches/staircase-deadline/split-out/

 The readme (which will also constitute the rest of this email):
http://ck.kolivas.org/patches/staircase-deadline/rsdl_scheduler.readme


The following readme is also included as documentation in 
Documentation/sched-design.txt


Rotating Staircase Deadline cpu scheduler policy
================================================

Design summary
==============

A novel design which incorporates a foreground-background descending priority
system (the staircase) with runqueue managed minor and major epochs (rotation
and deadline).


Features
========

A starvation free, strict fairness O(1) scalable design with interactivity
as good as the above restrictions can provide. There is no interactivity
estimator, no sleep/run measurements and only simple fixed accounting.
The design has strict enough a design and accounting that task behaviour
can be modelled and maximum scheduling latencies can be predicted by
the virtual deadline mechanism that manages runqueues. The prime concern
in this design is to maintain fairness at all costs determined by nice level,
yet to maintain as good interactivity as can be allowed within the
constraints of strict fairness.


Design description
==================

RSDL works off the principle of providing each task a quota of runtime that
it is allowed to run at each priority level equal to its static priority
(ie. its nice level) and every ...
From: Con Kolivas
Date: Sunday, March 4, 2007 - 12:45 am

It's probably worth mentioning that this scheduler shows a not insignificant 
improvement in the mysql test case that recently received a lot of publicity. 
Why exactly that's the case I'm not sure but it may help track down what is 
actually responsible for the performance drop off as well. Note that the SMP 
balancing of this cpu scheduler is essentially unchanged from the mainline 
one.

-- 
-ck
-

From: Con Kolivas
Date: Sunday, March 4, 2007 - 7:04 am

Only preliminary other benchmarking has been done so far on RSDL, and so far 
the results are equal to or slightly better than mainline on scalability 
grounds.

These are the sysbench graphs just for completion that Nishanth Aravamudan 
performed. We were actually trying to track down the cause of the mysql 
performance problem and I suggested trying different cpu schedulers as well. 
Needless to say I'm not unhappy with the results although they haven't told 
us exactly where the problem lies but it certainly does add evidence that 
perhaps it is a scheduling issue.

-- 
-ck
From: Gene Heskett
Date: Sunday, March 4, 2007 - 4:08 am

I assume to test this, we select the deadline scheduler?

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
-

From: Con Kolivas
Date: Sunday, March 4, 2007 - 4:47 am

No, only the "deadline" in the name is shared. This is a cpu process scheduler 
whereas the deadline scheduler you're thinking of is an I/O scheduler. To 
test this you just patch it in and it replaces the current mainline cpu 
scheduler (the same way the staircase cpu scheduler in -ck replaces it).

-- 
-ck
-

From: Gene Heskett
Date: Sunday, March 4, 2007 - 5:24 am

Oh, then I tried to put the -ck1 patch on top of that, and blew the tree 
up.  I'd built it the first time with the deadline scheduler as the 
default while I was waiting on your reply.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
-

From: Con Kolivas
Date: Sunday, March 4, 2007 - 5:46 am

Yes, sorry. This is mutually exclusive with the -ck1 patch. It is a standalone 
piece of code.

-- 
-ck
-

From: Gene Heskett
Date: Sunday, March 4, 2007 - 6:25 am

I just this instant got it booted, what with building a driver for nvidia 
and all.  I'll let you know what I think in a few hours after I've gotten 
a feel for it.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
-

From: Con Kolivas
Date: Sunday, March 4, 2007 - 6:49 am

Great, thanks.

Just to make it clear. The purpose of this scheduler is at all costs to 
maintain absolute fairness no matter what type of load it is put under. This 
means that if you heavily load up your machine without the use of 'nice' then 
your interactive tasks _will_ slow down proportionately to the amount of cpu 
you use. So doing make -j4 for example will make any other task started in 
taht presence run precisely 1/5th speed, but they will still be responsive, 
have low latency (and audio shouldn't skip for example). 

There will be times when the mainline scheduler feels more interactive than 
this scheduler, and that is because it has significant unfairness granted 
towards interactive tasks. This degree of unfairness in an effort to maintain 
interactivity has been criticised and causes problems in certain environments 
with both loss of fairness, relative starvation and is not entirely 
predictable. 

This was designed to be robust for any application since linux demands a 
general purpose scheduler design, while preserving interactivity, instead of 
optimising for one particular end use.

-- 
-ck
-

From: Gene Heskett
Date: Sunday, March 4, 2007 - 7:11 am

Well, in 20 minutes of playing, I am so far VERY impressed, the kmail 
composer is typing to the screen in unison to my keystrokes, where with 
the older version it often went away for 10 or more seconds at a time, 
then displayed the last sentence I had typed in one (or 2 sometimes) 
swell foop.  Now I can back up and correct a typo in real time whereas 
before, it was often faster if the typo was half a line back, and the key 
repeat so darned slow it was often over a second per character cell 
moved, to go grab the mouse, position the cursor on the typo, click, wait 
for the indicator to move, and then fix it.  Typing is now pleasurable 
again.  Key repeats seem to remain at the set in the bios key repeat 
speed.  Amazing.  I do believe you have given me back my machine.

I may find something to squawk about in due time, I think that Murphy guys 
laws may have a corollary on that subject, but it sure feels good right 

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
-

From: Zwane Mwaikambo
Date: Sunday, March 4, 2007 - 7:31 pm

What are the specs on your hardware?

-

From: Gene Heskett
Date: Sunday, March 4, 2007 - 8:16 pm

XP-2800 Athlon on a biostar board whose model number I've forgotten. 1 GB 
of dual channel 400 mhz ram running at 333 because this athlon falls over 
at 400mhz fsb settings.  480GB of drives in 3 pieces.

Running a jaton 3dForce 6200-256 video card with nvidia driver, which does 
the opengl stuff better than the ati 9200se I took out a week ago, but 
this nvidia card is several times harder on the cpu when tvtime is 
running than the ati was.  There are other cards in this box, 1394, nic, 
usb expander, but that is the main list of power burners.

Its dmesg clock is just short of 2100. and:
Calibrating delay using timer specific routine.. 4177.82 BogoMIPS 
(lpj=2088910)

In comparison to an XP-1400 I had before, this is nowhere near 2x faster.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
-

From: Willy Tarreau
Date: Sunday, March 4, 2007 - 7:36 am

Hi Con !


Well, I haven't tested it yet, but your design choices please me. As you
know, I've been one of those encountering big starvation problems with
the original scheduler, making 2.6 unusable for me in many situations. I
welcome your work and want to thank you for the time you spend trying to
fix it.

Keep up the good work,
Willy

PS: I've looked at your graphs, I hope you're on the way to something really
    better than the 21 first 2.6 releases !

-

From: jos poortvliet
Date: Sunday, March 4, 2007 - 9:08 am

Well, imho his current staircase scheduler already does a better job compar=
ed=20
to mainline, but it won't make it in (or at least, it's not likely). So we=
=20
can hope this WILL make it into mainline, but I wouldn't count on it.

grtz

Jos



=2D-=20
Disclaimer:

Alles wat ik doe denk en zeg is gebaseerd op het wereldbeeld wat ik nu heb.=
=20
Ik ben niet verantwoordelijk voor wijzigingen van de wereld, of het beeld w=
at=20
ik daarvan heb, noch voor de daaruit voortvloeiende gedragingen van mezelf.=
=20
Alles wat ik zeg is aardig bedoeld, tenzij expliciet vermeld.
From: Bill Davidsen
Date: Monday, March 5, 2007 - 4:05 pm

Wrong problem, what is really needed is to get CPU scheduler choice into 
mainline, just as i/o scheduler finally did. Con has noted that for some 
loads this will present suboptimal performance, as will his -ck patches, 
as will the default scheduler. Instead of trying to make ANY one size 
fit all, we should have a means to select, at runtime, between any of 
the schedulers, and preferably to define an interface by which a user 
can insert a new scheduler in the kernel (compile in, I don't mean 
plugable) with clear and well defined rules for how that can be done.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
-

From: Con Kolivas
Date: Monday, March 5, 2007 - 5:18 pm

Been there, done that. Wli wrote the infrastructure for plugsched; I took his 
code and got it booting and ported 3 or so different scheduler designs. It 
allowed you to build as few or as many different schedulers into the kernel 
and either boot the only one you built into your kernel, or choose a 
scheduler at boot time. That code got permavetoed by both Ingo and Linus. 
After that I gave up on that code and handed it over to Peter Williams who 
still maintains it. So please note that I pushed the plugsched barrow 
previously and still don't think it's a bad idea, but the maintainers think 
it's the wrong approach.

RSDL is my attempt to create a cpu scheduler to try to do everything well 
instead of some things perfectly, to be best fit when trying to shoehorn it 
into any environment. 

-- 
-ck
-

From: Willy Tarreau
Date: Monday, March 5, 2007 - 9:41 pm

In a way, I think they are right. Let me explain. Pluggable schedulers are
useful when you want to switch away from the default one. This is very useful
during development of a new scheduler, as well as when you're not satisfied
with the default scheduler. Having this feature will incitate many people to
develop their own scheduler for their very specific workload, and nothing
generic. It's a bit what happened after all : you, Peter, Nick, and Mike
have worked a lot trying to provide alternative solutions.

But when you think about it, there are other OSes which have only one scheduler
and which behave very well with tens of thousands of tasks and scale very well
with lots of CPUs (eg: solaris). So there is a real challenge here to try to
provide something at least as good and universal because we know that it can
exist. And this is what you finally did : work on a scheduler which ought to be
good with any workload.

Then, when we have a generic, good enough scheduler for most situations, I
think that it could be good to get the plugsched for very specific usages.
People working in HPC may prefer to allocate ressource differently for
instance. There may also be people refusing to mix tasks from different users
on two different siblings of one CPU for security reasons, etc... All those
would justify a plugable scheduler. But it should not be an excuse to provide
a set of bad schedulers and no good one.

The CPU scheduler is often compared to the I/O schedulers while in fact this
is a completely different story. The I/O schedulers are needed because the
hardware and filesystems may lead to very different behaviours, and the
workload may vary a lot (eg: news server, ftp server, cache, desktop, real
time streaming, ...). But at least, the default I/O scheduler was good enough
for most usages, and alternative ones are here to provide optimal solutions
to specific needs.

Regards,
Willy

-

From: Nicholas Miell
Date: Monday, March 5, 2007 - 10:39 pm

Solaris has a pluggable scheduler framework (each policy --
OTHER/FIFO/RR/etc. -- is it's own separate component).

-- 
Nicholas Miell <nmiell@comcast.net>

-

From: jos poortvliet
Date: Tuesday, March 6, 2007 - 12:04 pm

Did that happen for I/O? There are a few schedulers, eg some for servers,=20

I can imagine a desktop can work optimally with another scheduler than a ti=
ny=20
embedded OS in a phone or an 8-core system serving a website, or a=20
distributed 512 core system doing heavy scientific calculations?!?

Optimizing for all at the same time involves some compromises, and thus lim=
its=20

CFQ does pretty well at most workloads, that's why it's default, right? But=
=20
there is choice, which is a good thing. And the current mainline CPU=20

Ok, for I/O, the diff could be pretty big. But still, there are workloads=20
which could be improved by a certain scheduler, right?

And wouldn't it make sense then to have a choice in the default kernel at=20
boottime? If that wouldn't hurt performance, it would be an improvement for=
=20
desktop distributions like (K)ubuntu who can set staircase by default, and=
=20
server distro's offering RSDL...

At least having a desktop/interactivity optimized scheduler like staircase =
and=20
a fair, throughput-optimized scheduler like RSDL sounds sane. RSDL does=20
better at the msql testcase, staircase is better on the desktop... We're no=
t=20
talking about huge amounts of code, or 10 schedulers, and the diff of a few=
=20
percent and better scaling on many cpu's and processes versus better=20
From: Bill Davidsen
Date: Tuesday, March 6, 2007 - 2:37 pm

The problem is not with "any workload," because that's not the issue, 
the issue is the definition of "good" matching the administrator's 
policy. And that's where the problem comes in. We have the default 
scheduler, which favors interactive jobs. We have Con's staircase 
scheduler which is part of an interactivity package. We have the 
absolutely fair scheduler which is, well... fair, and keeps things 
smooth and under reasonable load crisp.

There are other schedulers in the pluggable package, I did a doorknob 
scheduler for 2.2 (everybody gets a turn, special case of round-robin). 
I'm sure people have quietly hacked many more, which have never been 
presented to public view.

The point is that no one CPU scheduler will satisfy the policy needs of 
all users, any more than one i/o scheduler does so. We have realtime 
scheduling, preempt both voluntary and involuntary, why should we not 
have multiple CPU schedulers. If Linus has an objection to plugable 
schedulers, then let's identify what the problem is and address it. If 
that means one scheduler or the other must be compiled in, or all 
Unless you force the the definition of "good" to "what the default 
scheduler does," there can be no "one" good one. Choice is good, no one 
is calling for bizarre niche implementations, but we have at minimum 
three CPU schedulers which as "best" for a large number of users. 
And multiple schedulers are needed because the type of load, mix of 
loads, and admin preference all require decisions at the policy which 
can't be covered by a single solution. Or at least none of the existing 
solutions, and I think letting people tune the guts of scheduler policy 
is more dangerous than giving a selection of solutions. Linux has been 
about choice all along, I hope it's nearly time for a solution better 
than patches to be presented.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

-

From: Willy Tarreau
Date: Tuesday, March 6, 2007 - 2:54 pm

Hi Bill,

On Tue, Mar 06, 2007 at 04:37:37PM -0500, Bill Davidsen wrote:

I'm not in Linus' head, but I think that he wanted the recurrent scheduler
problems to be addressed first for most users before going further. Too
much choice is often dangerous for quality. For instance, look at all the
netfilter modules. Many of them were completely bogus in their early stages,
and some of them even do mostly the same jobs, and many of them have never
left the "extra" stage. Choice is good to detect users' needs, it's good
for global evolution, but it's not as good when you want to have something

By "good", I mean a scheduler that is not trivially DoSable, and which
does not cause unexpected long pauses to some processes without any reason
(processes which cannot get any time slice for tens of seconds, or ssh
daemons which freeze under system load, to the point of totally preventing

There's a difference between CPU and I/O scheduler though. With the CPU
scheduler, you've always had the choice to assign per-process priorities 
with "nice". Don't get me wrong, I'm all for pluggable schedulers, as I'm
an ever unsatisfied optimizer. It's just that I think it has been good to
encourage people to focus on real issues before dispersing efforts on
different needs. I hope that Con's work will eventually get merged and
that the door will be opened towards pluggable schedulers.

Best regards,
Willy

-

From: Con Kolivas
Date: Monday, March 5, 2007 - 2:52 pm

This patch has been queued at test.kernel.org (thanks Andy Whitcroft).

Here is a summary of the first few benchmarks from those tests rsdl 0.26 vs 
mainline 2.6.20 (However much it is that you value these particular 
benchmarks...) on amd64 4x:


Kernbench (lower elapsed is better);
=========
2.6.20:
Elapsed: 103.058s User: 351.974s System: 36.474s CPU: 376.6%
2.6.20-rsdl:
Elapsed: 102.848s User: 359.186s System: 36.666s CPU: 384.6%


tbench (higher is better):
======
2.6.20:
Throughput 165.5 MB/sec 1 procs
2.6.20-rsdl:
Throughput 261.418 MB/sec 1 procs


reaim (higher is better):
=====
2.6.20:
Max Jobs per Minute 500727.27
2.6.20-rsdl:
Max Jobs per Minute 612000.00

-- 
-ck
-

From: Ingo Molnar
Date: Thursday, March 8, 2007 - 1:53 am

cool! I like this even more than i liked your original staircase 
scheduler from 2 years ago :) Lets try what we did back then: put it 
into -mm and see what breaks (if anything). But in general, it is 
becoming increasingly clear that the interactivity estimator is a more 
fragile concept than the built-in quota mechanism of the staircase 
scheduler, so if it works in practice i'm quite in favor of it, even if 
it regresses /some/ workloads.

	Ingo
-

From: Con Kolivas
Date: Thursday, March 8, 2007 - 3:07 am

Great! Thanks for your support. 

After futzing around for all that time I've become sure that an approach 
without an interactive estimator is our only way forward. So far the 
throughput benchmarks are encouraging too so I suspect the estimator may be 
causing harm there too.

Ensuring the different arches and cpuidle work properly I likely will need 
help with, though, so I'd appreciate any help from people if they see 
something obvious and can get a grip of my code.

Thanks!

-- 
-ck
-

From: Fabio Comolli
Date: Thursday, March 8, 2007 - 1:25 pm

Hi Con
It would be nice if you could rebase this patch to latest git or at
least to 2.6.21-rc3.
Regards,
Fabio




-

From: Con Kolivas
Date: Thursday, March 8, 2007 - 1:57 pm

Check in http://ck.kolivas.org/patches/staircase-deadline/
There's an -rc3 patch there.

-- 
-ck
-

From: Fabio Comolli
Date: Thursday, March 8, 2007 - 2:31 pm

Well, downloaded - compiled - booted: initng measures 17.369 seconds
to complete the boot process; without the patch the same kernel booted
in 21.553 seconds.

Very impressive.
Many thanks for your work.

Fabio






-

From: Rodney Gordon II
Date: Friday, March 9, 2007 - 2:11 pm

Con, you've really outdone yourself this time ! :D

As a long time user of the -ck patchset, RSDL is a welcome change, and a great 
piece of code to play around with, and USE!

Booted up on my system perfectly, Pentium-D 830 3GHz, 1.5GB RAM.

No problems whatsoever so far, using 0.26. I can launch up a bunch of encode 
jobs, in SCHED_NORMAL even, and still have low latency on my desktop (I know 
it's not low latency _specific_ code, but it works very well).

I guess all I can say is.. wow. This code isn't "prime time" ready, yet.. But 
it can be, and would be a great addition to mainline.

Hell, a little tuning and merging this with a few current ck patches could 
make a damn fine kernel, and probably beat out the original staircase in 
desktops. :)

Keep up the good work !
-r

-- 
Rodney "meff" Gordon II -*- meff@pobox.com
Systems Administrator / Coder Geek -*- Open yourself to OpenSource
-

Previous thread: 2.6.20, Java, unresponsive userspace by Alistair John Strachan on Saturday, March 3, 2007 - 9:07 pm. (1 message)

Next thread: [PATCH] [RSDL 1/6] lists: add list splice tail by Con Kolivas on Sunday, March 4, 2007 - 12:02 am. (4 messages)