Linux: Staircase Process Scheduler

Submitted by Jeremy
on March 25, 2004 - 6:37am

Australian doctor and Linux kernel interactivity specialist Con Kolivas [interview] announced the release of a new experimental process scheduler [journal]. Con describes his "staircase scheduler" as, "a descending multilevel single runqueue per cpu with deadline elevation of priorities." It was designed with interactivity in mind, makes renicing processes affect CPU distribution, scales well, and offers low scheduling latency for normal policy tasks, all with low overhead.

Though still marked as experimental, Con has included some benchmarking results with his announcement. He summarizes, "in most cases smp performance is equivalent and occasionally better, while maintaining interactivity and improving responsiveness." As to what you might expect when testing the new scheduler, Con says, "for the end desktop user they will find this performs much like the mainline 2.6 scheduler over a wide range of loads apart from the fact that applications start faster". He goes on to explain:

"Some applications that misbehave with 2.6 mainline will behave better using less cpu time. At extreme loads the stock 2.6 scheduler feels better by being too unfair on cpu bound tasks which this one does not. Nice has more predictable and larger effects on cpu distribution (nice 0 gets 20 times what nice +20 gets)."


From: Con Kolivas [email blocked]
To:  linux-kernel
Subject: [PATCH]Staircase scheduler - experimental
Date: Thu, 25 Mar 2004 11:27:10 +1100

This is a rewrite of the scheduler policy for 2.6, based on the current
O(1) scheduler.

Aims:

 - Making renicing processes actually matter for CPU distribution
   (nice 0 gets 20 times what nice +20 gets)
 - Interactive by design rather than have complicated interactivity algorithm
   bolted onto an existing design
 - Good scalability.
 - Simple predictable design.
 - Maintain appropriate cpu distribution and fairness.
 - Low scheduling latency for normal policy tasks.
 - Low overhead.

Summary of design:
 A descending multilevel single runqueue per cpu with deadline elevation of
 priorities.

Details:
 This patch takes advantage of the existing infrastructure but has no expired
 array. Real time tasks are treated the same as the current fixed priority &
 timeslice system. The details are in the management of normal policy tasks.

 Normal policy tasks have a dynamic priority that drops by one every
 RR_INTERVAL which equals 10ms * num_online_cpus(). Once they drop to zero
 they are requeued with 2 intervals at a lower priority and then drop back to
 one interval. If they drop to zero they are requeued with 3 intervals lower
 priority and so on. Every time a task sleeps it moves back up one priority.
 The sub-jiffy case is handled specially to prevent it fooling this system.

Use a fixed font to see clearly:

ie
RR_INTERVALs nice 0 task
20<-------------->40 PRI
11111111111111111111
02111111111111111111
00311111111111111111
and so on

nice +10 task
30<---->40 PRI
1111111111
0211111111
0031111111
and so on

 This is how cpu distribution is kept proportional while optimising latency
 for interactive tasks.

 The patch was made to be added to the sched_domains infrastructure since this
 is likely to be merged with mainline so all comparisons are made to a kernel
 with sched_domains patched in.

Subjective Performance:

 For the end desktop user they will find this performs much like the
 mainline 2.6 scheduler over a wide range of loads apart from the fact that
 applications start faster with this. 

 Some applications that misbehave with 2.6 mainline will behave better using
 less cpu time. At extreme loads the stock 2.6 scheduler feels better by
 being too unfair on cpu bound tasks which this one does not. Nice has more
 predictable and larger effects on cpu distribution (nice 0 gets 20 times
 what nice +20 gets).

Objective Performance:

 Note that there has been no "tuning" put into this scheduler at all; the cpu
 balancing on smp and so on is that used in mainline/sched_domains.

 The executive summary is that in most cases smp performance is equivalent and
 occasionally better, while maintaining interactivity and improving
 responsiveness. Detailed summary of benchmarks found at the end of the mail.

Known Problems:
 There is one minor interactivity issue I encountered during testing that I
 need to examine and adddress when time permits. There are no known bugs 
 per se.

Future Direction:
 I will be departing shortly for extended leave and will be unable to do any
 further coding till the end of May. This release, therefore, is to make the
 project known and to receive some testing in the interim.

Download:
 http://ck.kolivas.org/patches/2.6/2.6.4/experimental/staircase/

 Three patches are currently available:
 2.6.4-staircase5:
 A full patch against 2.6.4 which includes current sched_domains 
 2.6.4.domains2-staircase5:
 A patch against sched_domains which shows more clearly only my changes
 2.6.5-rc2-mm2-stair
 An incremental patch against 2.6.5-rc2-mm2.

Testing:
 Please feel free to test and use this patch extensively. I will be able
 to respond to emails only intermittently while away but unable to do any
 coding.

Thanks:
 Zwane Mwaikambo and William Lee Irwin III for help and ideas.


Con Kolivas
25th March 2004

-----------
Detailed benchmark results:
 2.6.4 is 2.6.4 patched with latest sched-domains
 2.6.4-s is 2.6.4 with sched-domains patched with staircase deadline sched

Reaim 8x (higher is better)
---------------------------

2.6.4 http://khack.osdl.org/stp/290462/
Peak load Test: Maximum Jobs per Minute 7482.37 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 7351.78 (average of 3 runs)

2.6.4-s http://khack.osdl.org/stp/290536/
Peak load Test: Maximum Jobs per Minute 9492.30 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 9037.34 (average of 3 runs)

kernbench 16x (lower is better)
-------------------------------

2.6.4 http://www.osdl.org/projects/kernbench/results/results.2.6.4-domains
Average Half Load Run:
Elapsed Time 112.084
Average Optimum Load Run:
Elapsed Time 79.07
Average Maximum Load Run:
Elapsed Time 80.926

2.6.4-s http://www.osdl.org/projects/kernbench/results/results.2.6.4-s4.2
Average Half Load -j 8 Run:
Elapsed Time 106.59
Average Optimal -j 64 Load Run:
Elapsed Time 78.6866
Average Maximal -j Load Run:
Elapsed Time 83.2134

Hackbench 8x (lower is better)
------------------------------

2.6.4 http://khack.osdl.org/stp/290460/
# Processes | Ave Time(sec)
20 		1.85
40 		2.7742
60 		3.754
80 		4.8018
100 		5.8758

2.6.4-s http://khack.osdl.org/stp/290542/
# Processes | Ave Time(sec)
20 		1.8894
40 		2.6808
60 		3.51853333333333
80 		4.40575
100 		5.365


Contest 1x (generally lower is better see http://contest.kolivas.org)
---------------------------------------------------------------------
 The osdl automated contest runs have been playing up and the averages posted
 here are wrong so I've tried to extract the meaningful runs from the logs
 and distill them here:

2.6.4 http://khack.osdl.org/stp/290453/
no_load:
Kernel[runs]	Time	CPU%	Loads	LCPU%	Ratio
2.6.4      4	103	96.1	0.0	0.0	1.00
cacherun:
2.6.4      4	100	99.0	0.0	0.0	0.97
process_load:
2.6.4      4	166	59.6	140.2	38.0	1.61
ctar_load:
2.6.4      4	148	68.2	17.2	17.6	1.44
xtar_load:
2.6.4      4	131	78.6	9.0	15.2	1.27
io_load:
2.6.4      4	147	70.7	36.2	20.4	1.43
io_other:
2.6.4      4	143	72.0	31.3	17.5	1.39
read_load:
2.6.4      3	140	XX	98.3	XX	1.36
list_load:
2.6.4      3	120	XX	2	XX	1.16
mem_load:
2.6.4      3	153	XX	121	XX	1.49

2.6.4-s http://khack.osdl.org/stp/290538/
no_load:
Kernel[runs]	Time	CPU%	Loads	LCPU%	Ratio
2.6.4      4	103	96.1	0.0	0.0	1.00
cacherun:
2.6.4      4	100	99.0	0.0	0.0	0.97
process_load:
2.6.4      4	103	97.1	3.8	1.0	1.00
ctar_load:
2.6.4      4	126	80.2	5.2	6.3	1.22
xtar_load:
2.6.4      4	111	91.9	2.0	2.7	1.08
io_load:
2.6.4      4	112	90.2	7.8	4.5	1.09
io_other:
2.6.4      4	117	86.3	8.3	5.1	1.14
read_load:
2.6.4      3	131	XX	76.3	XX	1.27
list_load:
2.6.4      3	113	XX	1	XX	1.10
mem_load:
2.6.4      3	105	XX	44.3	XX	1.02

-------------------------------------------------------

Related Links:

impatience

Anonymous
on
March 25, 2004 - 9:11am

anyone have any idea how long it might take for this to work its way into the kernel at least as an option?

seems like a very stable, predictable, and friendly scheduler, but i don't have a linux box i'm willing to risk patching the kernel on

Re: impatience

sleight
on
March 25, 2004 - 9:36am

I am already running a kernel with this scheduler at the moment and it seems to run just fine. I merged it to 2.6.5-rc2-mm3 and my desktop box is running smoothly.
It seems stable enough to just patch in and test, really, try it :)

-Rik

Sounds great

Hiryu
on
March 25, 2004 - 10:44am

However, I'm still waiting for support for my SATA ICH5R raid controller under 2.6. I have it working under 2.4 right now and so I can't really switch. Otherwise I'd be compiling a new kernel right now. This new scheduler sounds promising.

Con or anyone, are there any apps in particular you've found this to give a boost on? I'm assuming all or most apps start faster (which is good).

Thanks.

Apps startup

Con Kolivas
on
March 25, 2004 - 7:13pm

They all start a little faster, and more so under load. It's no huge effect but it is noticable.

faster

Jeremy
on
March 26, 2004 - 5:30am

I'm running your scheduler now (finally), and sure enough, it _does_ feel like apps start faster. Some apps are quite significant, too. Others I don't notice it as much.

On the whole, quite impressive!

Con Zepplin

Anonymous
on
March 25, 2004 - 2:43pm

We should call this the Staircase to Heaven...

also, BSD is dead, and this proves it.

From now on.... (Thanks!!!!)

doctorture
on
March 25, 2004 - 10:00pm

Wow!!! You are _really_ talented!!!
I hope that you enjoyed doing that.

Con, What about to give a try to:

I/O Scheduler
MemoryManagement
Reiserfs

Do you like anyone of those topics?

Thank you from an admirator!!! :))

Irony?

Mr_Z
on
March 26, 2004 - 3:09pm

Is it just me that finds it ironic that an anesthesiologist is actively admired by someone with the handle "doctorture"? :-)

Staircase Process Scheduler: very good for mplayer

Anonymous
on
March 27, 2004 - 3:14pm

I have been watching movie files with mplayer on 2.4.22,
2.6.4 (with default scheduler) and now with Staircase scheduler on 2.6.4 and only with Staircase scheduler I get smooth playback. With default scheduler on 2.6.4 the frame rate seems to be "unstable" and that gives a "jumpy" feeling. But with Staircase scheduler the movies allways run very smooth.
(1xPentium III)

Few problems

jarlin@jabber.org
on
March 28, 2004 - 12:30am

Hmm, I have some slugginesh. When I grab window, for example Mozilla, with ALT and left mouse button and shake mouse rapidly, my music player start to "jump". And when there are a lot of windows in different workspaces my mouse jumps (not easily noticable but still) when I switch between them. Also compiling software S1 and S2 at the same time it takes very long time to finishing compiling S2 although it's _much_ smaller than S1.
I have tried renice my X but "problems" are still there.
And haven't got these problems with default scheduler ...

Feedback

Con Kolivas
on
March 28, 2004 - 1:55am

Finally! Some meaningful feedback. Ok a few points, but most importantly, remember, this is experimental so this is simply the first stable working version. At this stage this is a proof of concept that such a design can have decent performance even before any tuning or tweaking is put into it.

What hardware? What audio application/driver etc. I found the performance of one audio app vs another vary by 10 times the cpu usage at times. The arts server has serious issues with some hardware and 2.6 drivers. Check the cpu usage of your audio app/sound server during the testing. Try fiddling with buffers etc as well. Oh yeah and don't forget checking dma isn't changed.

X will not be as "aggressive" as it is in mainline so it will suffer under increasing load much more, and will be smooth at light levels of load but will not hold the cpu to do big sudden shifts in cpu usage as much (like shifting desktops).

Compiling things with a jobserver (make -j x) serves almost no useful purpose on a uniprocessor machine that is being used for something else at the same time. That said, this scheduler starts things more rapidly so an i/o bound compile will consume more resources as this is the same thing that makes applications start faster (it's a compromise one way or the other).

Renicing X -ve will never be the correct solution. Renicing +ve the compiles is the correct solution if this causes slowdowns.

I wont be posting any changes for a while but I do appreciate the feedback.

Wierd Bug....

yokem_55
on
March 28, 2004 - 8:54pm

I'm not sure this is due to the staircase scheduler, but parallel init under gentoo breaks with this kernel, causing my machine to hang during bootup. Disabling parallel init allows the machine to boot properly. Parallel init works just fine under other kernels though....

Parallel init

Con Kolivas
on
March 28, 2004 - 11:31pm

I assume parallel init is the one meant to speed up bootup by running services in parallel? The staircase scheduler makes some things start faster so perhaps there are services starting in parallel earlier than they were in mainline and this is breaking the init process. This could happen if some service is actually dependent on something else which normally takes less time to complete; ie it shouldn't be in parallel but series and was lucky enough to complete slower than the thing it depends on in mainline.

Yes same things here.....

Anonymous
on
March 29, 2004 - 5:52am

I'm having the same problems, when I invoke let's say "cat /var/log/wtmp" the xterm window starts moving it's self around the screen, when the `cat` reaches EOF xterm crashes *poof* and the mouse starts shaking rapidly. I'm used to _not_ shutdown my pc, I'm running 2.6.4-ck2 for 2 days now. When I turned on my monitor in the morning I found out that the PC has rebooted somehow, it was frozen hanging in tty1. Cause? Unknown to me. If anyone would like some more information please ask, I'm willing to do what I can to reproduce this (bug?).

Arvind-NL

More feedback

doctorture
on
March 29, 2004 - 8:03am

Ok, I also got a chance to test it here.

Yes, the apps do start soon than mainline scheduler, very good!!!

When playing a video with mplayer, that uses only 25% of CPU normally,
and trying to move a window (grabbing with ALT + left click) a konsole,
the mplayer does not skip, it _stops_, and CPU goes to 100%. Notice that
I wasn't using swap and memory was full with others apps.

Maybe the fact that I am using CFQ scheduler is affecting this?
(mainline is Anticipatory Scheduler scheduler...)

Thank for your help, Con!!!

Comments

Con Kolivas
on
March 29, 2004 - 8:55am

I doubt cfq is having any effect there.

There is a priority inversion problem with some video drivers and mplayer which I've seen exhibited and you're probably getting that happening there, exhibited in another way. I can't fix that without altering the mplayer sources. I quote from a tester "It boils down to the usage of XFlush() in libvo/vo_xv.c . The problem is that when synchronizing mplayer video buffer with X display, this function can't ensure that X server will be given timeslice soon enough." Changing from XFlush to just XSync fixed it for him. Here is what the patch looks like:

diff -u -r1.151 vo_xv.c
--- vo_xv.c	30 Nov 2003 16:36:10 -0000	1.151
+++ vo_xv.c	14 Mar 2004 15:31:33 -0000
@@ -273,7 +273,6 @@
     
     if ( vo_gc != None ) XFreeGC( mDisplay,vo_gc );
     vo_gc = XCreateGC(mDisplay, vo_window, 0L, &xgcv);
-    XFlush(mDisplay);
     XSync(mDisplay, False);
 #ifdef HAVE_XF86VM
     if ( vm )
@@ -403,8 +402,7 @@
    free(xvimage[foo]->data);
   }
  XFree(xvimage[foo]);
- 
- XFlush( mDisplay );
+
  XSync(mDisplay, False);
  return;
 }
@@ -465,11 +463,9 @@
          0, 0,  image_width, image_height,
          drwX-(vo_panscan_x>>1),drwY-(vo_panscan_y>>1),vo_dwidth+vo_panscan_x,(vo_fs?vo_dheight - 1:vo_dheight)+vo_panscan_y);
   }
- if (num_buffers>1){
+ if (num_buffers>1)
     current_buf=vo_directrendering?0:((current_buf+1)%num_buffers);
-    XFlush(mDisplay);
- } else
-    XSync(mDisplay, False);   
+ XSync(mDisplay, False);
  return;
 }
 

I will try this and report later

doctorture
on
March 29, 2004 - 9:32am

Thanks Con, I will give this a try and continue to test
the new scheduler, trying to get some more feedback.

Thanks for your help!

Response

doctorture
on
March 30, 2004 - 9:20am

Con,

I have not get a chance to test this patch yet. As soon as I get to
test I will report.

I tested it with xine instead mplayer and it works fine. Seems like
a priority inversion problem.

I will test patched mplayer on this new tweaked scheduler 5.2v.

Thanks again!!!

v5.2

Con Kolivas
on
March 30, 2004 - 6:09am

I've posted two updates to this patch on my last night here before leaving, bringing the version up to v5.2. This one addresses the one issue I alluded to in my announcement email. This performs a lot better than the first posted one; recommend anyone trying this scheduler update now.

Microoptimizations/Cleanups

Anonymous
on
March 31, 2004 - 4:35am

Hi, Con!

I made a patch to your scheduler that does some microoptimizations.
Could you check it out?
I will be sending the complete file to you on your email:
ckmb@hotmail.com

Thank you!!!

--- kernel/sched.c 2004-03-31 08:14:51.454531696 -0300
+++ /tmp/testekernel/linux-2.6.4/kernel/sched.c 2004-03-31 03:11:46.000000000 -0300
@@ -234,17 +234,20 @@
// deadline - the best deadline rank a task can have.
static inline unsigned int deadline(task_t *p)
{
+ unsigned int deadline;
if (unlikely(rt_task(p)))
return p->deadline;
- return 40 - TASK_USER_PRIO(p);
+ deadline = 40 - TASK_USER_PRIO(p);
+ return deadline;
}

// slice - the duration a task runs before losing a deadline rank.
static inline unsigned int slice(task_t *p)
{
+ unsigned int slice = RR_INTERVAL;
if (likely(!rt_task(p)))
- return RR_INTERVAL * deadline(p);
- return RR_INTERVAL;
+ slice *= deadline(p);
+ return slice;
}

// effective_prio - dynamic priority dependent on deadline rank.
@@ -258,8 +261,9 @@
prio = MAX_PRIO - 1;
else
prio = MAX_PRIO - 2 - p->deadline;
-
- return (prio < MAX_RT_PRIO) ? MAX_RT_PRIO : prio;
+ if (prio < MAX_RT_PRIO)
+ prio = MAX_RT_PRIO;
+ return prio;
}

/*
@@ -270,9 +274,10 @@
*/
static inline unsigned int first_time_slice(task_t *p)
{
+ unsigned int time_slice = RR_INTERVAL;
if (likely(!rt_task(p)))
- return RR_INTERVAL* ((deadline(p) - p->deadline) + 1);
- return RR_INTERVAL;
+ time_slice *= ((deadline(p) - p->deadline) + 1);
+ return time_slice;
}

static inline int apparent_prio(task_t *p)
@@ -286,22 +291,22 @@
static inline void activate_task(task_t *p, runqueue_t *rq)
{
unsigned long long now = sched_clock();
+ unsigned long run_time, sleep_time;
unsigned int full_slice = slice(p);

p->slice = full_slice;
if (likely(!rt_task(p) && !batch_task(p))) {
- unsigned long run_time, sleep_time;
/*
* This ensures that tasks running for less than one tick are
* treated the same as longer running tasks.
*/
sleep_time = now - p->timestamp;
- if (sleep_time >= p->runtime)
+ if (sleep_time > p->runtime)
p->runtime = 0;
else
p->runtime -= sleep_time;
run_time = NS_TO_JIFFIES(p->runtime);
-
+
if (unlikely(run_time >= full_slice)) {
if (p->deadline)
p->deadline--;
@@ -465,14 +470,16 @@
*/
static inline unsigned long get_low_cpu_load(int cpu, int update)
{
+ runqueue_t *rq = cpu_rq(cpu);
runqueue_t *this_rq = this_rq();
- unsigned long nr = cpu_rq(cpu)->nr_running << SCHED_LOAD_SHIFT;
+ unsigned long nr = rq->nr_running << SCHED_LOAD_SHIFT;
unsigned long load = this_rq->cpu_load[cpu];
+ unsigned long ret = min(nr, load);

if (update)
this_rq->cpu_load[cpu] = (nr + load) / 2;

- return min(nr, load);
+ return ret;
}

static inline unsigned long get_high_cpu_load(int cpu, int update)
@@ -547,8 +554,8 @@
{
unsigned long flags;
int success = 0;
- long old_state = p->state;
- runqueue_t *rq = task_rq_lock(p, &flags);
+ long old_state;
+ runqueue_t *rq;
int cpu, this_cpu;
#ifdef CONFIG_SMP
unsigned long long now;
@@ -558,6 +565,8 @@
runqueue_t *this_rq;
#endif

+ rq = task_rq_lock(p, &flags);
+ old_state = p->state;
if (!(old_state & state))
goto out;

@@ -691,15 +700,15 @@
* resulting in more scheduling fairness.
*/
local_irq_disable();
- current->time_slice >>= 1;
- current->slice >>= 1;
- p->time_slice = current->time_slice + 1;
- p->slice = current->slice + 1;
+ p->time_slice = (current->time_slice + 1) >> 1;
+ p->slice = (current->slice + 1) >> 1;
/*
* The remainder of the first timeslice might be recovered by
* the parent if the child exits early enough.
*/
p->first_time_slice = 1;
+ current->time_slice >>= 1;
+ current->slice >>= 1;
p->timestamp = sched_clock();
if (!current->time_slice) {
/*
@@ -756,16 +765,17 @@
void fastcall sched_exit(task_t * p)
{
unsigned long flags;
- task_t *parent = p->parent;
+ runqueue_t *rq;

local_irq_save(flags);
if (p->first_time_slice) {
- parent->time_slice += p->time_slice;
- if (unlikely(parent->time_slice > slice(parent)))
- parent->time_slice = slice(parent);
+ p->parent->time_slice += p->time_slice;
+ if (unlikely(p->parent->time_slice > slice(p->parent)))
+ p->parent->time_slice = slice(p->parent);
}
local_irq_restore(flags);
- task_rq_unlock(task_rq_lock(parent, &flags), &flags);
+ rq = task_rq_lock(p->parent, &flags);
+ task_rq_unlock(rq, &flags);
}

/**
@@ -906,12 +916,14 @@
{
if (rq1 == rq2)
spin_lock(&rq1->lock);
- else if (rq1 < rq2) {
- spin_lock(&rq1->lock);
- spin_lock(&rq2->lock);
- } else {
- spin_lock(&rq2->lock);
- spin_lock(&rq1->lock);
+ else {
+ if (rq1 < rq2) {
+ spin_lock(&rq1->lock);
+ spin_lock(&rq2->lock);
+ } else {
+ spin_lock(&rq2->lock);
+ spin_lock(&rq1->lock);
+ }
}
}

@@ -952,12 +964,6 @@
rq = task_rq_lock(p, &flags);
if (!cpu_isset(dest_cpu, p->cpus_allowed))
goto out;
- /* force the process onto the specified CPU */
- if ((!cpu_isset(dest_cpu, p->cpus_allowed))
- || (!migrate_task(p, dest_cpu, &req))) {
- task_rq_unlock(rq, &flags);
- return;
- }

/* force the process onto the specified CPU */
if (migrate_task(p, dest_cpu, &req)) {
@@ -965,6 +971,10 @@
task_rq_unlock(rq, &flags);
wake_up_process(rq->migration_thread);
wait_for_completion(&req.done);
+ return;
+ }
+out:
+ task_rq_unlock(rq, &flags);
}

/*
@@ -999,6 +1009,7 @@
void sched_balance_exec(void)
{
struct sched_domain *domain = this_sched_domain();
+ int new_cpu;
int this_cpu = smp_processor_id();
if (numnodes == 1)
return;
@@ -1006,14 +1017,13 @@
if (this_rq()->nr_running <= 1)
return;

- while (domain->parent) {
- if (domain->flags & SD_FLAG_EXEC) {
- int new_cpu = sched_best_cpu(current, domain);
- if (new_cpu != this_cpu)
- sched_migrate_task(current, new_cpu);
- return;
- }
+ while (domain->parent && !(domain->flags & SD_FLAG_EXEC))
domain = domain->parent;
+
+ if (domain->flags & SD_FLAG_EXEC) {
+ new_cpu = sched_best_cpu(current, domain);
+ if (new_cpu != this_cpu)
+ sched_migrate_task(current, new_cpu);
}
}
#endif /* CONFIG_NUMA */
@@ -1102,18 +1112,17 @@
task_t *tmp;

if (max_nr_move <= 0 || busiest->nr_running <= 1)
- return 0;
+ goto out;
array = &busiest->array;
dst_array = &this_rq->array;

/* Start searching at priority 0: */
idx = 0;
skip_bitmap:
- if (idx)
- idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
- else
+ if (!idx)
idx = sched_find_first_bit(array->bitmap);
-
+ else
+ idx = find_next_bit(array->bitmap, MAX_PRIO, idx);
if (idx >= MAX_PRIO)
goto out;

@@ -1331,7 +1340,7 @@
struct sched_group *group;
runqueue_t *busiest = NULL;
unsigned long imbalance;
- int balanced = 0;
+ int balanced = 0, failed = 0;
int nr_moved = 0;

spin_lock(&this_rq->lock);
@@ -1357,23 +1366,27 @@
out:
spin_unlock(&this_rq->lock);

- if (!balanced && nr_moved == 0) {
- if (busiest &&
- domain->nr_balance_failed > domain->cache_nice_tries) {
- int wake = 0;
-
- spin_lock(&busiest->lock);
- if (!busiest->active_balance) {
- busiest->active_balance = 1;
- busiest->push_cpu = this_cpu;
- wake = 1;
- }
- spin_unlock(&busiest->lock);
- if (wake)
- wake_up_process(busiest->migration_thread);
+ if (!balanced && nr_moved == 0)
+ failed = 1;
+
+ if (failed && busiest &&
+ domain->nr_balance_failed > domain->cache_nice_tries) {
+ int wake = 0;
+
+ spin_lock(&busiest->lock);
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->push_cpu = this_cpu;
+ wake = 1;
}
+ spin_unlock(&busiest->lock);
+ if (wake)
+ wake_up_process(busiest->migration_thread);
+ }
+
+ if (failed)
domain->nr_balance_failed++;
- } else
+ else
domain->nr_balance_failed = 0;

if (balanced) {
@@ -1399,15 +1412,15 @@
struct sched_group *group;
runqueue_t *busiest = NULL;
unsigned long imbalance;
- int nr_moved;
+ int nr_moved = 0;

group = find_busiest_group(domain, this_cpu, &imbalance, NEWLY_IDLE);
if (!group)
- return 0;
+ goto out;

busiest = find_busiest_queue(group);
if (!busiest || busiest == this_rq)
- return 0;
+ goto out;

/* Attempt to move tasks */
double_lock_balance(this_rq, busiest);
@@ -1417,6 +1430,7 @@

spin_unlock(&busiest->lock);

+out:
return nr_moved;
}

@@ -1543,7 +1557,7 @@
interval *= domain->busy_factor;

/* scale ms to jiffies */
- interval *= HZ / 1000;
+ interval = interval * HZ / 1000;
if (unlikely(interval == 0))
interval = 1;

@@ -1726,7 +1740,6 @@
for_each_cpu_mask(i, sibling_map) {
runqueue_t *smt_rq;
task_t *smt_curr;
- unsigned int gain_factor = (100 - sd->per_cpu_gain) / 100);

smt_rq = cpu_rq(i);
smt_curr = smt_rq->curr;
@@ -1739,7 +1752,7 @@
* task from using an unfair proportion of the
* physical cpu's resources. -ck
*/
- if ((smt_curr->slice * gain_factor >
+ if (((smt_curr->slice * (100 - sd->per_cpu_gain) / 100) >
slice(p) || rt_task(smt_curr) || batch_task(p)) &&
p->mm && smt_curr->mm && !rt_task(p)&&
!batch_task(smt_curr))
@@ -1750,7 +1763,7 @@
* or wake it up if it has been put to sleep for priority
* reasons.
*/
- if (((p->slice * gain_factor >
+ if ((((p->slice * (100 - sd->per_cpu_gain) / 100) >
slice(smt_curr) || rt_task(p) || batch_task(smt_curr)) &&
smt_curr->mm && p->mm && !rt_task(smt_curr) &&
!batch_task(p)) ||
@@ -1845,7 +1858,7 @@
// Keeps track of tasks that run less than one tick
prev->runtime += run_time;
else {
- if (run_time >= prev->runtime)
+ if (run_time > prev->runtime)
prev->runtime = 0;
else
prev->runtime -= run_time;
@@ -1911,7 +1924,8 @@

int default_wake_function(wait_queue_t *curr, unsigned mode, int sync)
{
- return try_to_wake_up(curr->task, mode, sync);
+ task_t *p = curr->task;
+ return try_to_wake_up(p, mode, sync);
}

EXPORT_SYMBOL(default_wake_function);
@@ -1988,8 +2002,10 @@
return;

spin_lock_irqsave(&q->lock, flags);
- /* We avoid a branch here */
- __wake_up_common(q, mode, nr_exclusive, likely(nr_exclusive) != 0);
+ if (likely(nr_exclusive))
+ __wake_up_common(q, mode, nr_exclusive, 1);
+ else
+ __wake_up_common(q, mode, nr_exclusive, 0);
spin_unlock_irqrestore(&q->lock, flags);
}

@@ -2172,7 +2188,7 @@
new_prio = NICE_TO_PRIO(nice);
delta = new_prio - old_prio;
p->static_prio = NICE_TO_PRIO(nice);
- if (delta >= p->deadline)
+ if (delta > p->deadline)
p->deadline = 0;
else
p->deadline -= delta;
@@ -2218,13 +2234,14 @@
return -EPERM;
if (increment < -40)
increment = -40;
- } else if (increment > 40)
+ }
+ if (increment > 40)
increment = 40;

nice = PRIO_TO_NICE(current->static_prio) + increment;
if (nice < -20)
nice = -20;
- else if (nice > 19)
+ if (nice > 19)
nice = 19;

retval = security_task_setnice(current, nice);
@@ -2287,7 +2304,7 @@
static int setscheduler(pid_t pid, int policy, struct sched_param __user *param)
{
struct sched_param lp;
- int retval;
+ int retval = -EINVAL;
int oldprio;
prio_array_t* array;
unsigned long flags;
@@ -2295,10 +2312,11 @@
task_t *p;

if (!param || pid < 0)
- return -EINVAL;
+ goto out_nounlock;

+ retval = -EFAULT;
if (copy_from_user(&lp, param, sizeof(struct sched_param)))
- return -EFAULT;
+ goto out_nounlock;

/*
* We play safe to avoid deadlocks.
@@ -2389,6 +2407,7 @@
out_unlock_tasklist:
read_unlock_irq(&tasklist_lock);

+out_nounlock:
return retval;
}

@@ -2420,11 +2439,11 @@
*/
asmlinkage long sys_sched_getscheduler(pid_t pid)
{
- int retval;
+ int retval = -EINVAL;
task_t *p;

if (pid < 0)
- return -EINVAL;
+ goto out_nounlock;

retval = -ESRCH;
read_lock(&tasklist_lock);
@@ -2436,6 +2455,7 @@
}
read_unlock(&tasklist_lock);

+out_nounlock:
return retval;
}

@@ -2447,11 +2467,11 @@
asmlinkage long sys_sched_getparam(pid_t pid, struct sched_param __user *param)
{
struct sched_param lp;
- int retval;
+ int retval = -EINVAL;
task_t *p;

if (!param || pid < 0)
- return -EINVAL;
+ goto out_nounlock;

read_lock(&tasklist_lock);
p = find_process_by_pid(pid);
@@ -2469,7 +2489,10 @@
/*
* This one might sleep, we cannot do it with a spinlock held ...
*/
- return copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0;
+ retval = copy_to_user(param, &lp, sizeof(*param)) ? -EFAULT : 0;
+
+out_nounlock:
+ return retval;

out_unlock:
read_unlock(&tasklist_lock);
@@ -2532,11 +2555,13 @@
asmlinkage long sys_sched_getaffinity(pid_t pid, unsigned int len,
unsigned long __user *user_mask_ptr)
{
+ unsigned int real_len;
cpumask_t mask;
int retval;
task_t *p;

- if (len < sizeof(mask))
+ real_len = sizeof(mask);
+ if (len < real_len)
return -EINVAL;

read_lock(&tasklist_lock);
@@ -2553,9 +2578,9 @@
read_unlock(&tasklist_lock);
if (retval)
return retval;
- if (copy_to_user(user_mask_ptr, &mask, sizeof(mask)))
+ if (copy_to_user(user_mask_ptr, &mask, real_len))
return -EFAULT;
- return sizeof(mask);
+ return real_len;
}

/**
@@ -2653,17 +2678,20 @@
*/
asmlinkage long sys_sched_get_priority_max(int policy)
{
+ int ret = -EINVAL;
+
switch (policy) {
case SCHED_FIFO:
case SCHED_RR:
- return MAX_USER_RT_PRIO-1;
+ ret = MAX_USER_RT_PRIO-1;
+ break;
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_ISO:
- return 0;
- default:
- return -EINVAL;
+ ret = 0;
+ break;
}
+ return ret;
}

/**
@@ -2675,17 +2703,19 @@
*/
asmlinkage long sys_sched_get_priority_min(int policy)
{
+ int ret = -EINVAL;
+
switch (policy) {
case SCHED_FIFO:
case SCHED_RR:
- return 1;
+ ret = 1;
+ break;
case SCHED_NORMAL:
case SCHED_BATCH:
case SCHED_ISO:
- return 0;
- default:
- return -EINVAL;
+ ret = 0;
}
+ return ret;
}

/**
@@ -2699,12 +2729,12 @@
asmlinkage
long sys_sched_rr_get_interval(pid_t pid, struct timespec __user *interval)
{
- int retval;
+ int retval = -EINVAL;
struct timespec t;
task_t *p;

if (pid < 0)
- return -EINVAL;
+ goto out_nounlock;

retval = -ESRCH;
read_lock(&tasklist_lock);
@@ -2719,8 +2749,9 @@
jiffies_to_timespec(p->policy & SCHED_FIFO ?
0 : slice(p), &t);
read_unlock(&tasklist_lock);
-
- return copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
+ retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
+out_nounlock:
+ return retval;
out_unlock:
read_unlock(&tasklist_lock);
return retval;
@@ -2880,6 +2911,7 @@
int set_cpus_allowed(task_t *p, cpumask_t new_mask)
{
unsigned long flags;
+ int ret = 0;
migration_req_t req;
runqueue_t *rq;

@@ -2892,7 +2924,7 @@
p->cpus_allowed = new_mask;
/* Can the task run on the task's current CPU? If so, we're done */
if (cpu_isset(task_cpu(p), new_mask))
- return 0;
+ goto out;

if (migrate_task(p, any_online_cpu(new_mask), &req)) {
/* Need help from migration thread: drop lock and wait. */
@@ -2901,9 +2933,9 @@
wait_for_completion(&req.done);
return 0;
}
-
+out:
task_rq_unlock(rq, &flags);
- return 0;
+ return ret;
}

EXPORT_SYMBOL_GPL(set_cpus_allowed);
@@ -2958,9 +2990,10 @@
struct sched_param __user param = { .sched_priority = MAX_RT_PRIO-1 };
runqueue_t *rq;
int cpu = (long)data;
+ int ret;

BUG_ON(smp_processor_id() != cpu);
- setscheduler(0, SCHED_FIFO, &param);
+ ret = setscheduler(0, SCHED_FIFO, &param);

rq = this_rq();
BUG_ON(rq->migration_thread != current);
@@ -3013,14 +3046,13 @@
return NOTIFY_BAD;
kthread_bind(p, cpu);
cpu_rq(cpu)->migration_thread = p;
- return NOTIFY_OK;
+ break;
case CPU_ONLINE:
/* Strictly unneccessary, as first user will wake it. */
wake_up_process(cpu_rq(cpu)->migration_thread);
- return NOTIFY_OK;
- default:
- return NOTIFY_OK;
+ break;
}
+ return NOTIFY_OK;
}

/*
@@ -3265,10 +3297,9 @@
void __init sched_init(void)
{
runqueue_t *rq;
- int i;
+ int i, k;

for (i = 0; i < NR_CPUS; i++) {
- int k;
prio_array_t* array;
#ifdef CONFIG_SMP
struct sched_domain *domain;

Email

Con Kolivas
on
April 8, 2004 - 6:27am

Looks interesting. I'd appreciate you sending it to my real email: kernel at kolivas dot org. However I'm away overseas till the end of May so I won't be able to do anything till then.

VMS Scheduler

Anonymous
on
April 3, 2004 - 1:54pm

Hi Con,

Yor scheduler reminds me somewhat the VAX/VMS scheduler which
by the way gave excellent results on various RT, BATCH and INTERACTIVE task mixes.

The NON RT tasks there had a dynamic priority which was decreased each time that task used up it's time slice. When task was awakened from a sleep or on io completion it was given a priority boost.

The amount of priority boost was some function of duration of the wait, device type and a user (programmer) specified parameter.

The RT tasks had fixed priorities and the scehuduler did a round- robin between the tasks of the highest priority.

Vadim at 7chips dot com

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.