login
Header Space

 
 

Linux: The Really Fair Scheduler

August 30, 2007 - 11:07pm
Submitted by Jeremy on August 30, 2007 - 11:07pm.
Linux news

During the many threads discussing Ingo Molnar's recently merged Completely Fair Scheduler, Roman Zippel has repeatedly questioned the complexity of the new process scheduler. In a recent posting to the Linux Kernel mailing list he offered a simpler scheduler named the 'Really Fair Scheduler' saying, "as I already tried to explain previously CFS has a considerable algorithmic and computational complexity. This patch should now make it clearer, why I could so easily skip over Ingo's long explanation of all the tricks CFS uses to keep the computational overhead low - I simply don't need them." He offered a mathematical overview of how his new scheduler works, included some benchmarks, and reflected back to earlier discussions on the lkml asking, "Ingo, from this point now I need your help, you have to explain to me, what is missing now relative to CFS. I tried to ask questions, but that wasn't very successful..." He also noted:

"The basic idea of this scheduler is somewhat different than CFS. Where the old scheduler maintains fixed time slices, CFS still maintains a dynamic per task time slice. This model does away with it completely, instead it puts the task on a virtual (normalized) time line, where only the relative distance between any two task is relevant."


From:	Roman Zippel [email blocked]
Subject: [ANNOUNCE/RFC] Really Fair Scheduler
Date:	Fri, 31 Aug 2007 04:05:12 +0200 (CEST)

Hi,

I'm glad to announce a working prototype of the basic algorithm I
already suggested last time.
As I already tried to explain previously CFS has a considerable
algorithmic and computational complexity. This patch should now make it
clearer, why I could so easily skip over Ingo's long explanation of all
the tricks CFS uses to keep the computational overhead low - I simply
don't need them. The following numbers are based on a 2.6.23-rc3-git1 UP
kernel, the first 3 runs are without patch, the last 3 runs are with the
patch:

Xeon 1.86GHz:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
suse      Linux 2.6.23- 0.9500 1.3900 1.1700 1.6500 1.1700 1.68000 1.22000
suse      Linux 2.6.23- 0.9700 1.3600 1.1400 1.6100 1.1500 1.67000 1.18000
suse      Linux 2.6.23- 0.9800 1.4000 1.1500 1.6600 1.1400 1.70000 1.19000
suse      Linux 2.6.23- 0.7500 1.2000 1.0000 1.4800 0.9800 1.52000 1.06000
suse      Linux 2.6.23- 0.7800 1.2200 0.9800 1.4600 1.0100 1.53000 1.05000
suse      Linux 2.6.23- 0.8300 1.2200 1.0000 1.4800 0.9800 1.52000 1.05000

Celeron 2.66GHz:

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
devel6    Linux 2.6.23- 3.1300 3.3800 5.9000 5.8300   18.1 9.33000 18.0
devel6    Linux 2.6.23- 3.1400 3.2800 7.0700 6.0900   17.6 9.18000 17.8
devel6    Linux 2.6.23- 3.1400 3.6000 3.5400 6.1300   17.2 9.02000 17.2
devel6    Linux 2.6.23- 2.7400 3.0300 5.5200 6.4900   16.6 8.70000 17.0
devel6    Linux 2.6.23- 2.7400 3.2700 8.7700 6.8700   17.1 8.79000 17.4
devel6    Linux 2.6.23- 2.7500 3.3100 5.5600 6.1700   16.8 9.07000 17.2


Besides these numbers I can also provide a mathematical foundation for
it, I tried the same for CFS, but IMHO it's not really sanely possible.
This model is far more accurate than CFS is and doesn't add an error
over time, thus there are no more underflow/overflow anymore within the
described limits.
The small example program also demonstrates how it can be easily scaled
down to millisecond resolution to completely get rid of the 64bit math.
This may be interesting for a few archs even if they have a finer clock
resolution as most scheduling decisions are done on a millisecond basis.

The basic idea of this scheduler is somewhat different than CFS. Where
the old scheduler maintains fixed time slices, CFS still maintains a
dynamic per task time slice. This model does away with it completely,
instead it puts the task on a virtual (normalized) time line, where only
the relative distance between any two task is relevant.

So here all the mathematical details necessary to understand what the
scheduler does, so anyone can judge for himself how solid this design
is. First some basics:

(1)	time = sum_{t}^{T}(time_{t})
(2)	weight_sum = sum_{t}^{T}(weight_{t})

Time per task is calculated with:

(3)	time_{t} = time * weight_{t} / weight_sum

This can be also written as: 

(4)	time_{t} / weight_{t} = time / weight_sum

This way we have the normalized time:

(5)	time_norm = time / weight_sum
(6)	time_norm_{t} = time_{t} / weight_{t}

If every task got its share they are all same. Using time_norm one can
calculate the time tasks should get based on their weight:

(7)	sum_{t}^{T}(time_{t}) = sum_{t}^{T}(round(time / weight_sum) * weight_{t})

This is bascially what CFS currently does and it demonstrates the basic
problem it faces. It rounds the normalized time and the rounding error
is also distributed to the time a task gets, so there is a difference
between the time which is distributed to the tasks and the time consumed
by them. On the upside the error is distributed equally to all tasks
relatively to their weight (so it isn't immediately visible via top). On
the downside the error itself is weighted too and so a small error can
be become quite large and the higher the weight the more it contributes
to the error and the more likely it hits one of the limits. Once it hits
a limit, the overflow/underflow time is simply thrown away and is lost
for accounting and the task doesn't get the time it's supposed to get.

An alternative approach is to not use time_norm at all to distribute the
time. Any task can be used to calculate the time any other task needs
relative to it. For this the normalized time per task is maintained
based on (6):

(8)	time_norm_{t} * 2^16 = time_{t} * round(2^16 / weight_{t})

Using the difference between the normalized times of two tasks, one can
calculate the time it needs to equalize the normalized time.
This has the advantage that round(2^16 / weight_{t}) is constant (unless
reniced) and thus also the error due to the rounding. The time one task
gets relative to another is based on these constants. As only the delta
of these times are needed, the absolute value can simply overflow and
the limit of maximum time delta is:

(9)	time_delta_max = KCLOCK_MAX / (2^16 / weight_min)


The global normalized time is still needed and useful (e.g. waking
tasks) and thus this faces the same issue as CFS right now - managing
the rounding error. This means one can't directly use the real
weight_{t} value anymore without producing new errors, so either one
uses this approximate weight:

(10)	weight_app_{t} = 2^16 / round(2^16 / weight_{t})

or even better would be to get rid of this completely.

Based on (5) and (6) one can calculate the global normalized time as:

(11)	time_norm = sum_{t}^{T}(time_{t}) / sum_{t}^{T}(weight_app_{t})
	          = sum_{t}^{T}(time_norm_{t} * weight_app_{t}) / sum_{t}^{T}(weight_app_{t})

This is now a weighted average and provides the possibility to get rid
of weight_app_{t} by simply replacing it:

(12)	time_norm_app = sum_{t}^{T}(time_norm_{t} * weight_{t}) / sum_{t}^{T}(weight_{t})

This produces only a approximate normalized time, but if all
time_norm_{t} are equal (i.e. all tasks got their share), the result is
the same, thus the error is only temporary. If one already approximates
this value this means value other replacements are possible too. In the
previous example program I simply used 1:

(13)	time_norm_app = sum_{t}^{T}(time_norm_{t}) / T

Another approximation is to use a shift:

(14)	time_norm_app = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) / sum_{t}^{T}(2^weight_shift_{t})

This helps to avoid a possible full 64 bit multiply and makes other
operations elsewhere simpler too and the result should be close enough.
So by maintaining these two sums one can calculate an approximate
normalized time value:

(15)	time_sum_app = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t})
(16)	weight_sum_app = sum_{t}^{T}(2^weight_shift_{t})

Using (6) the dividend can also be written as:

(17)	time_sum_app * 2^16 = sum_{t}^{T}(time_{t} * round(2^16 / weight_{t}) * 2^weight_shift_{t})

The multiplier "round(2^16 / weight_{t}) * 2^weight_shift_{t}" is off at
most by 1.42 (or e(.5*l(2))), so this approximate time sum value grows
almost linearly with the real time and thus the maximum of this sum is
reached approximately after:

(18)	time_max = KCLOCK_MAX / 2^16 / 1.42

The problem now is to avoid this overflow, for this the sum is regularly
updated by splitting it:

(19)	time_sum_app = time_sum_base + time_sum_off

Using this (14) can be written as:

(20)	time_norm_app = (time_sum_base + time_sum_off) / weight_sum_app
	              = time_sum_base / weight_sum_app + time_sum_off / weight_sum_app
	              = time_norm_base + time_sum_off / weight_sum_app

time_norm_base is maintained incrementally be defining this increment:

(21)	time_norm_inc = time_sum_max / weight_sum_app

Everytime time_sum_off exceeds time_sum_max, time_sum_off and
time_norm_base are adjusted appropriately. time_sum_max is scaled so
that the required update frequency is reduced to a minimum, but also so
that time_sum_off can be easily scaled down to a 32 bit value when
needed.

This basically describes the static system, but to allow for sleeping
and waking these sums need adjustments to preserve a proper average:

(22)	weight_app_sum += 2^weight_shift_{new}
(23)	time_sum_max += time_norm_inc * 2^weight_shift_{new}
(24)	time_sum_off += (time_norm_{new} - time_norm_base) * 2^weight_shift_{new}

The last one is a little less obvious, it can be derived from (15) and
(19):

(25)	time_sum_base + time_sum_off = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t})
	time_sum_off = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) - time_sum_base
	             = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) - time_norm_base * weight_sum_app
	             = sum_{t}^{T}(time_norm_{t} * 2^weight_shift_{t}) - sum_{t}^{T}(time_norm_base * 2^weight_shift_{t})
	             = sum_{t}^{T}((time_norm_{t} - time_norm_base) * 2^weight_shift_{t})

As one can see here time_norm_base is only interesting relative to the
normalized task time and so the same limits apply.

The average from (20) can now be used to calculate the normalized time
for the new task in (24). It can be a given a bonus relative to the
other task or it might be still within a certain limit, as it hasn't
slept long enough. The limit (9) still applies here, so a simple
generation counter may still be needed for long sleeps.
The time_sum_off value used to calculate the average can be scaled down
as mentioned above. As it contains far more resolution than needed for
short time scheduling decisions, the lower bits can be thrown away to
get a 32 bit value. The scaling of time_sum_max makes sure that one
knows the location of the most significant bit, so that the 32 bit is
used as much as possible.


Finally a few more notes about the patch. The current task is not kept
in the tree (just this saves a lot of tree updates), so I faced a
similiar problem as FAIR_GROUP_SCHED, that enqueue_task/dequeue_task can
be called for the current task, so maintaining the current task pointer
for the class is interesting. Instead of adding a set_curr_task it would
be IMO simpler to further reduce the number of indirect calls, e.g. the
deactivate_task/put_prev_task sequence can be replaced with a single
call (and I don't need the sleep/wake arguments anymore, so it can be
reused for that).
I disabled the usage of cpu load as its calculation is also rather 64bit
heavy and I suspect it could be easily scaled down, but this way it's
not my immediate concern.

Ingo, from this point now I need your help, you have to explain to me,
what is missing now relative to CFS. I tried to ask questions, but that
wasn't very successful...

bye, Roman

---
 include/linux/sched.h |   21 -
 kernel/sched.c        |  158 +++++----
 kernel/sched_debug.c  |   26 -
 kernel/sched_norm.c   |  812 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 910 insertions(+), 107 deletions(-)



Related Links:

We have so many schedulers

August 31, 2007 - 5:04am
Anonymous (not verified)

We have so many schedulers now that there is a good probability that a few of them will be able to handle downloads *and* play MP3s at the same time!

Most already can (apparently

August 31, 2007 - 5:55am
Jospoortvliet (not verified)

Most already can (apparently unlike the NT scheduler) ;-)

But yeah, it's getting cosy in the scheduler area. I wonder if this one is better than Ingo. I hope Kerneltrap keeps us posted!!!

Maybe one of them will

August 31, 2007 - 8:16am
Anonymous (not verified)

Maybe one of them will actually work :o)

Every so often...

August 31, 2007 - 11:39am

Every so often, the kernel developers remind me of something like this. "Scheduler! Oooh SHINY!" ;-) :-)

(Tongue planted firmly in cheek.)

--
Program Intellivision and play Space Patrol!

win! <3

August 31, 2007 - 12:10pm
Baczek (not verified)

win!

<3

Make the choice of scheduler a config option?

September 2, 2007 - 2:53am
Sivasankar Chander (not verified)

Is it possible to make all the schedulers available as a .config option, perhaps by guarding the relevant code fragments of each scheduler with appropriate $ifdefs? If so, anybody can choose the scheduler he wants to compile into the kernel and check out the results directly on his own load. This would allow room for all of the old scheduler, CK, CFS and RFS to be present and developed independently, and get rid of some the wasted energy on politics.

This is exactly what Con had

September 2, 2007 - 11:42am
Anonymous (not verified)

This is exactly what Con had in mind: a pluggable scheduler. But not only through a compile-time configuration, you could say at boot-time which scheduler you want to use. Linus (and others) denied.

No issues with that

September 2, 2007 - 6:25am
Anonymous (not verified)

The problem is that we don't have a sound system that can play more that one sound stream at a time without requiring kernel hacks, having a huge latency, or skipping when you drag a window around. The scheduler is fine, it's the rest that sucks.

Actually, alsa will play

September 2, 2007 - 8:47am
Anonymous (not verified)

Actually, alsa will play more than one sound at the same time _if_ your hardware supports it.

If your hardware does not, it can do software emulation if configured. I think current alsa versions will do software emulation automatically (i.e. configuration already done).

Actually I am amazed by the low latency in Linux (even with dmix). I've tried long and hard to lower the sound latency in Windows and couldn't get as low as the default latency in Linux (both in wave and in midi). Linux is sooo superior in this field in my experience that it's not even funny.

For some numbers (full round trip: playing and recording via a loop cable)
- Latency with default drivers in Windows: ~600ms.
- Lowest latency with ASIO4ALL that does not jump: ~50ms.
- Lowest latency configurable with ASIO4ALL: 32ms.
- Same test on Linux with wine and wineasio: 20ms.

That is before we talk about the unstableness of midi on Windows (disconnect the device, and the program will not be able to know, even if it manually queries the os), where I programmed a 50 lines daemon on Linux that I no longer have to think about it - just replug and play.

Your latency sucks!

September 3, 2007 - 5:18am
Anonymous (not verified)

Dude, you're using an ASIO emulation layer. Stop doing that! I'm happily running at 3ms on an old crummy Athlon XP and a Delta 1010 (using a real ASIO driver, of course).

Naturally, your sound card

September 6, 2007 - 4:22pm
Anonymous (not verified)

Naturally, your sound card supports it.
Mine is an cheap onboard type.

Warning! Warning!

August 31, 2007 - 8:04am
Anonymous (not verified)

binutils-2.18.tar.{gz|bz2} from gnu/sourceware is ya GPLv3!

binutils-2.17.90.tar.bz2 from gnu/sourceware is still GPLv3!

The latest GPLv2 is binutils-2.17.50.tar.bz2 from gnu/sourceware
or binutils-2.17.50.0.18 from kernel.org

Remember, Linus Torvalds wann't implications of GPLv3 in his kernel sourcecode.

Warning for what? Binutils

August 31, 2007 - 11:56am
Anonymous (not verified)

Warning for what? Binutils is used to build the kernel, not a part of it. Just like gcc etc.

Warning! GPLv3 virality!

August 31, 2007 - 1:13pm
Anonymous (not verified)

binutils-2.17.90.tar.bz2 from gnu/sourceware is GPLv2, not GPLv3, i'm sorry, there is not COPYING3 in binutils-2.17.90.

binutils-2.18.tar.{gz|bz2} from gnu/sourceware is the only software that contains GPLv3, see COPYING3

The latest GPLv2 is binutils-2.17.90.tar.bz2 from gnu/sourceware.

In what way is the GPLv3 any

August 31, 2007 - 3:13pm
Anonymous (not verified)

In what way is the GPLv3 any different to the GPLv2 with respect to code passing through binutils? Or is this just an excuse to jump up and down with a "look at me" sign?

Sign

September 1, 2007 - 7:06pm
Anonymous (not verified)

Here's his sign. (With a nod to Bill Engvall.)

Sounds interesting

August 31, 2007 - 8:26am
Fred Flinta (not verified)

This sounds interesting.
I wish to see some benchmarks where it is compared against old scheduler, CFS and CK.
In a graphic diagram benchmark.

Just because it is better than CFS in some situations/conditions does not mean it is better under all situations/conditions.

If it is better, then I hope it will be integrated.

I hope to hear the thoughts/opinions of some high-profile kernel developers such as Linus, Ingo and Andrew, etc.

The funny part is that the

August 31, 2007 - 8:50am
Anonymous (not verified)

The funny part is that the only thing he really has documented, is the shorter context switch time...

The context switching (on that machine, without going tickless, which makes it even more seldom) is run every millisecond (+ a little more on userspace yield() calls), so his 0.3 microsecond improvement is a 0.03 % improvement in CPU usage.

Since we have no idea how his scheduler makes the system behave, this very minor improvement could come at a huge cost to system responsiveness...

CTX switch

August 31, 2007 - 5:18pm
Anonymous (not verified)

Actually, no. If we could only context switch once per millisecond we'd be pretty awful. The context switch can happen much more frequently. Potentially modern hardware with current kernels can do about a million context switches per second per cpu. It starts adding up under those workloads. Luckily most workloads don't do that but it all adds up.

Not really

August 31, 2007 - 8:49pm
Anonymous (not verified)

On a non-tickless kernel, a context switch will not happen between ticks (there's a tick every millisecond on x86), unless either the running process yields, or there's a device that needs attention (interrupts).

Oh, or a timer makes a *ding*.

Is that true?

August 31, 2007 - 10:03pm

I thought a context switch could happen whenever the currently active process makes a syscall that sleeps, or a higher priority process became "ready." Thus, one could really quickly end up w/ thousands or millions of context switches per second in that context. Consider, for example, two processes joined by a pipe on an otherwise idle system.

Process A reads from the pipe, but there's no data available so Process A sleeps. Meanwhile, Process B (who was previously blocked by the pipe being full) wakes up and writes to the pipe. Process B goes to sleep when either (1) its timeslice expires, or (2) the pipe fills up and the write blocks.

Are you trying to say that I/O driven blocking and wakeups are still constrained to occur on timer tick boundaries? Somehow I don't think it's so.

Back when HZ == 100, I've quite often seen the total number of context switches go well above 100 per second. On a non-tickless system, HZ only indicates the number of forced context switches (aka. preemption) that can occur, due to the timer enforcing "concurrency." Otherwise, you still potentially have tons of context switches that occur due to blocking combined with processes coming ready because the syscalls that caused them to sleep came ready.

--
Program Intellivision and play Space Patrol!

Indeed, synchronous

September 1, 2007 - 5:55pm
Anonymous (not verified)

Indeed, synchronous communication between threads/processes always makes a context switch. I don't remember what kind of application I was running, but I've actually observed this with Wine: some NT kernel-provided services are emulated with the separate wineserver process. As making/returning from a syscall on a native OS is always synchronous, this means that wineserver communication cannot be batched or made asynchronous.

If I recall correctly, this workload caused 200k-300k context switches per second and was obviously bottlenecked by the context switch time.

However, programmers normally design programs to avoid these situations by batching and asynchronous protocols, so it's not a common workload. I certainly agree that interactivity and fairness are far more important than context switch time.

Indeed, synchronous

September 1, 2007 - 5:55pm
Anonymous (not verified)

Indeed, synchronous communication between threads/processes always makes a context switch. I don't remember what kind of application I was running, but I've actually observed this with Wine: some NT kernel-provided services are emulated with the separate wineserver process. As making/returning from a syscall on a native OS is always synchronous, this means that wineserver communication cannot be batched or made asynchronous.

If I recall correctly, this workload caused 200k-300k context switches per second and was obviously bottlenecked by the context switch time.

However, programmers normally design programs to avoid these situations by batching and using asynchronous protocols, so it's not a common workload. I certainly agree that interactivity and fairness are far more important than context switch time.

If only you were right...

September 2, 2007 - 5:01am
David Kastrup (not verified)

Process A reads from the pipe, but there's no data available so Process A sleeps. Meanwhile, Process B (who was previously blocked by the pipe being full) wakes up and writes to the pipe. Process B goes to sleep when either (1) its timeslice expires, or (2) the pipe fills up and the write blocks.

Unfortunately, this is not what happens. Process B goes immediately to sleep after doing a single write to the pipe. In the quest for low latency, Linux kernel programmers have turned pipes into a mockery that are used almost exclusively for synchronous process communication unless you have multiple processors.

The cost incurred in the kernel itself is low: it can switch contexts rather easily. However, processing data in smallest size chunks usually makes the receiving process less efficient, and it causes unnecessary cache poisoning and paging. When we are talking about something like an xterm, drastically so.

Things like

time xterm -e "hexdump -v /dev/zero|dd count=50 obs=1"

real    0m5.553s
user    0m0.712s
sys     0m0.480s

as opposed to

while true;do :; done & time xterm -e "hexdump -v /dev/zero|dd count=50 obs=1";kill %1
[1] 22578

real    0m2.166s
user    0m0.308s
sys     0m0.308s

are not really funny: the task gets finished much faster if the single CPU gets congested with an empty loop in parallel.

I don't really mind whether we get CFS or staircase or RFS or whatever other random scheduler: any scheduler should be better with regards to pipes than the current switch-contexts-whenever-possible scheduler stupidity which makes pipes under normal loads on single-processor machines into ineffective synchronous devices.

any scheduler should be

September 2, 2007 - 7:17am
Anonymous (not verified)

any scheduler should be better with regards to pipes than the current switch-contexts-whenever-possible scheduler stupidity which makes pipes under normal loads on single-processor machines into ineffective synchronous devices.

Tried your test on a CFS-patched kernel:

# time xterm -e "hexdump -v /dev/zero|dd count=500 obs=1"

real    0m0.514s
user    0m0.174s
sys     0m0.166s

# while true;do :; done & time xterm -e "hexdump -v /dev/zero|dd count=500 obs=1"

real    0m0.830s
user    0m0.182s
sys     0m0.176s

CFS appears to do the trick and gives a fair result.

What kernel does this?

September 2, 2007 - 11:41am

Running Debian's 2.6.21-2-k7 kernel, I get the following numbers:

time xterm -e "hexdump -v /dev/zero|dd count=50 obs=1"
real 0m4.694s
real 0m4.655s
real 0m5.133s

while true;do :; done & time xterm -e "hexdump -v /dev/zero|dd count=50 obs=1";kill %1
real 0m7.305s
real 0m9.008s
real 0m7.105s

Which kernel are you running? My results aren't because of CFS:
$ zgrep -i CFS /boot/System.map-2.6.21-2-k7
c035a4cd t init_timer_list_procfs
c035a560 t init_tstats_procfs
c0376cec t __initcall_init_timer_list_procfs6
c0376cf0 t __initcall_init_tstats_procfs6
$ zgrep -i CFS /boot/config-2.6.21-2-k7
CONFIG_ACPI_PROCFS=y
CONFIG_SND_VERBOSE_PROCFS=y
CONFIG_OCFS2_FS=m
CONFIG_OCFS2_DEBUG_MASKLOG=y

Here's an easier to grok example:

September 2, 2007 - 12:20pm

It wasn't clear to me what your test case was accomplishing. If nothing else, it's too short. I decided on a simpler one:

$ time dd if=/dev/zero count=1M bs=1 | cat > /dev/null
1048576+0 records in
1048576+0 records out
1048576 bytes (1.0 MB) copied, 3.75104 seconds, 280 kB/s

real    0m3.755s
user    0m0.148s
sys     0m4.236s

$ time dd if=/dev/zero count=1k bs=1k | cat > /dev/null
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.005469 seconds, 192 MB/s

real    0m0.009s
user    0m0.004s
sys     0m0.012s

In parallel, I ran a vmstat 1. For the first dd I see this:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0 924472  69756 1010120 1207044    0    0     0    12 1416  691  0  0 100  0
 1  0 924472  69524 1010120 1207044    0    0     0     0 1284 271586  1 28 71  0
 1  0 924472  69428 1010120 1207044    0    0     0    36 1708 346897  3 34 63  0
 1  0 924472  69912 1010120 1207044    0    0     0     0 1360 370034  1 35 64  0
 0  0 924472  69680 1010120 1207044    0    0     0     0 1420 240304  1 23 76  0
 0  0 924472  69680 1010120 1207044    0    0     0     0 1480 1038  0  0 100  0

That's 270k to 370k context switches per second. Yow! Are we having context switching fun yet?

For the second dd, it barely makes a blip:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0 924472  69024 1010184 1207020    0    0     0     0 1268  734  0  0 100  0
 0  0 924472  68776 1010188 1207016    0    0     0    12 1432 3354  0  0 99  0
 0  0 924472  69188 1010188 1207020    0    0     0     0 1288  752  0  0 100  0

I guess from a context-switch overhead perspective, this isn't the greatest choice. I do recall Linus et al playing around with "wake receiver on write to pipe" policy some years back, to find the right tradeoff between latency, bandwidth and context switches. Linux's cheap context switch overhead certainly biases the decision.

--
Program Intellivision and play Space Patrol!

Of course you are absolutely

September 3, 2007 - 6:29pm
Anonymous (not verified)

Of course you are absolutely right, I forgot all about the syscalls.

Still shouldn't hit the schedulers too hard, though, unless you have A LOT of I/O intensive work.

Context switch rates

September 3, 2007 - 7:57pm
Bill Davidsen (not verified)

I you want to see your switching rate, run

vmstat 1

and look at the "cs" value. Try it at idle, downloading complex web pages, building a kernel, whatever. Now imagine that under the heaviest load you could save a few ns on each switch. With a server load on a dual core or hyperthreaded machine I can get 12000 cs/sec really trying. So that would save 9-10ms of CPU time.


 Boy that sounds good!


Now look at idle and wio (wait i/o) time. If either is non-zero then you have gained nothing, because the system is not really CPU bound.


I'm guessing that most users will not reach 5000 cs/sec with a "real load" although some benchmark like the one Roman used may be able to get much higher. I can get 780k/sec with a benchmark, with 60% system, but I still have idle...

But if Roman's code (I have

September 2, 2007 - 11:47am
Anonymous (not verified)

But if Roman's code (I have not looked at it) is simpler, this is also a gain.

Yes, but:

September 2, 2007 - 1:16pm

Code should be as simple as possible, but no simpler.

I am also not involved with scheduler code, but CFS has some nice features that are going to play out in the future. More and more we're looking at virtualized futures. We can imagine the day when java-esque sandboxes are replaced with virtualized machines for users or even processes. Sun's new processors sport 8 logic units each supporting 8 simultaneous threads.

Algorithms for maintining the right scheduler sense with CPU affinity, etc, are going to seem unnecessarily complicated. And 'nice' doesn't make much sense the way it's implemented today. But these are all necessary evolutions in modern computing. The ability to renice not just processes but entire users or containers with fair scheduling within each domain will become ever more important, even on "the desktop".

I would expect embedded people to complain the most about this sort of complexity, but even those devices are becoming more and more powerful and, more importantly, network aware. I'm sure that VM386 mode seemed unnecessarily complex in the beginning, but its benefits were obvious. These sorts of coming changes may not be as obvious, but I predict will be none the less as revolutionary.

Ingo's reply

August 31, 2007 - 8:48am
Anonymous (not verified)

Much of this appears to have already been done for the CFS: http://lkml.org/lkml/2007/8/31/97

And Roman in his reply seems

August 31, 2007 - 11:09am
Anonymous (not verified)

And Roman in his reply seems upset that most of his work's already done and that Ingo suggests they work together on the remaining parts

- Peder

He is always extremly

August 31, 2007 - 12:06pm
Anonymous (not verified)

He is always extremly aggressive towards Ingo. Not a nice person to work with - bummer for the wasted potential.

Roman Zippel is not only

September 2, 2007 - 6:45am
Anonymous (not verified)

Roman Zippel is not only agressive with Ingo - he got into frequent fights with other top developers too. A few weeks ago he got into a flamewar with Jonathan Corbet (editor of LWN) and Andrew Morton (maintainer of 2.6) about hres timers!

Roman is the maintainer of an obsolete architecture (m68k/Amiga/Atari) that no one cares about anymore. That might explain his hostility against code that works better on new hardware (CFS for example, but also hres timers). He also often resists changes that he as an arch maintainer would have to adopt to. The sneaky, biased and mean-spirited way in which he conducts discussions, trying to hide his opinion in technical arguments and innuendo is still disgusting, especially if you are like a long-time lkml reader like me.

Nobody seems to be able to work with him. If he continues like that he will swiftly join the ranks of Jeff Merkey and other (long forgotten) flamers on linux-kernel.

Amiga people...

September 2, 2007 - 10:51am
Anonymous (not verified)

Oh wow, a m68k (Amiga, Atari and perhaps some old macs?) maintainer who (according to fellow Anonymous here) acts like a rabid Amiga weenie straight out of comp.sys.amiga.advocacy. Who's surprised?

Grain of salt time here people. Don't believe everything you read. Getting into flamewars, even the destructive ones, is commonly for the better especially compared to shit like passive-aggressive behaviour. I believe that one that gave rise to Git is a pretty good example.

Also, perhaps it is necessary to behave like a rabid animal (if that is indeed the case, and I'm not saying I know anything about it besides fellow Anon's rumour) to avoid the m68k stuff being stepped on and walked over. Unlike other minority platforms, there's no Sun (Sparc) or IBM (S/390) or IBM (PowerPC) or IBM (POWER) or Hewlett-Packard (HPPA) to back good old m68k up.

Roman's reply

August 31, 2007 - 12:05pm
Anonymous (not verified)

roman appears to disagree: http://lkml.org/lkml/2007/8/31/126
Even if they are end results are similar mathematically after all, this struck me as an attempt to simplify the computation mechanisms.

Tough to be Ingo

August 31, 2007 - 12:38pm
Anonymous (not verified)

Now, I didn't take the time to fully understand Roman's math and don't know that much about scheduling other than it's a hard concept to get right, but I found Ingo's reply to be fairly level headed. Looks as though Roman never even looked at Peter Zijlstra's CFS improvements and how they relate to his concepts. Instead, he looks to be getting agitated and on the verge of an all out flame fest. Even _after_ Mike Galbraith points out that his scheduler doesn't even work well for a simple fairtest run.

It's gotta be tough to be Ingo and put up with all this.

The Roman's Idea is dangerous!!!

August 31, 2007 - 2:14pm
Anonymous (not verified)

Imagine one case of long timed FastCGI task or long uptimed inetd/swapd/X daemon.

The Roman's scheduler model for this above case fails precipitatously.

It's better to use the Ingo's scheduler instead Roman's scheduler.

Ingo has it coming

August 31, 2007 - 4:25pm
Anonymous (not verified)

Seems to me Ingo has it coming after he screwed over Con's efforts. At the very least the merge of CFS into the main kernel branch was premature based on grotesque complexity alone.

Ingo has it coming alright!

August 31, 2007 - 5:45pm

I suggest you read more carefully what has actually been done: Ingo has not only incorporated ideas and code fom Con, but also given him credit in several places.

Yes, Ingo has it coming alright!

What is coming to Ingo, is increased respect for his technical judgement, _AND_ fortitude in putting up with people like you, who can't be bothered to actually understand what has happened!

I think it is really great, that other people are attempting to improve scheduling with their own ideas - especially if they are willing to cooperate. It is not a question, thankfully, that we can only chose one winner - as via the GPL, we can combine code - hence we can all benefit from having an improved scheduler.

IBM had a team developing a vastly improved way of handling threads in the Linux kernel, but what got in was an somebody else's that was simply much better than theirs. IBM benefits from the other thread handling code, and they didn't display 'sour grapes' at not getting in 'their' code into the kernel!!!

Disagreeing with Ingo is valid, but it should be based on factual analysis and done with courtesy, rather than on emotional blind reaction.

-Nivag

Way to go...

September 1, 2007 - 6:39am
Anonymous (not verified)

Roman obviously knew about Peter changes to CFS as he commented on them: "It addresses the conversion error between the different units I was mentioning in an earlier mail, but the value is still rounded.".

This all looks like clever politics: "Thanks but had you taken the time to talk to us you'd have realized it's already been done.". While in fact it looks like claims are made about Roman's work without understanding all of it.

Roman replied and clarified

September 1, 2007 - 2:40pm
Anonymous (not verified)

Roman replied and clarified his post. He obviously knew Peter's patches.

What really puzzles me is that Ingo replied 9 hours later, as Roman pointed out.

Had Ingo not replied on the

September 2, 2007 - 6:27am
Anonymous (not verified)

Had Ingo not replied on the same day he would have been accused of ignoring Roman's patch and taking too much time to reply!

The most recent mails in the thread describe that Roman added not a single line of comment to his implementation (only the formulas in his emails which are not linked to the code) - a pure miracle Ingo was able to reply in substance at all!

But the most important question: is that you again, Mr. Ballmer? Got any chairs? :-D

Coders with a bit of

September 2, 2007 - 11:15am
Anonymous (not verified)

Coders with a bit of experience write code thats readable without needing any, or almost any, comments. Its a myth amongst inexperienced programmers that code without comments are impossible to understand. Usually, its the opposite which is true: Code without, or with very few, comments, doesn't have comments because the code is self-explainable.

You pretty obviously have

September 2, 2007 - 3:03pm
Anonymous (not verified)

You pretty obviously have not read the fine patch in question. Take a look. I'm not kidding you! Just do it man.

The theoretical explanations alone were over 25 mathematical equations.

Tell me what this does:

+		cfs_rq->time_sum_max -= cfs_rq->time_norm_inc << se->weight_shift;
+		while (cfs_rq->time_sum_max < (MAX_TIMESUM >> 1)) {
+			cfs_rq->time_norm_inc <<= 1;
+			cfs_rq->time_sum_max <<= 1;
+			cfs_rq->inc_shift--;
+		}

I have seen my share of complex code (including some kernel code) but I dont have the slightest clue what half of this new code does.

Renormalization

September 2, 2007 - 8:37pm

Looks to me like it's renormalizing an accumulator and the thing that increments it, so as to keep the maximum possible precision. I'm saying that looking only at the snippet you've provided. It's "integer floating point", if you will.

--
Program Intellivision and play Space Patrol!

Coders with a bit of

September 3, 2007 - 1:52pm
Anonymous (not verified)

Coders with a bit of experience write code thats readable without needing any, or almost any, comments. Its a myth amongst inexperienced programmers that code without comments are impossible to understand. Usually, its the opposite which is true: Code without, or with very few, comments, doesn't have comments because the code is self-explainable.

Do experienced programmers need/use comments on every line, like "add 1 to x"? Of course not. But "experienced programmers" who disregard the need of others to read and understand their code by not documenting their assumptions and tradeoffs (not just what the tradeoff is (which often you can see from the code - though not always) but why the tradeoff choice was made. Similarly with handling (or not) of edge conditions, etc. Way too many experienced programmers make it into an ego thing - "if they can't follow it, well ha! They aren't as good as I am!" Ok, maybe they aren't - but why make their jobs harder? It seems as if this ego bit is more common among the young (no surprise), kernel maintainers (the only significant reason I can see here is history - the original V6/V7 Unix kernels were close to undocumented - and perhaps linux is better than the last time I looked at it, but I don't think so in general), and to a lesser extent some parts of the open-source community (lot of variance there, though).

I see my job not just to code something, but also to make it possible/easy for others to understand and extend what I did. If they have to spend hours puzzling over something, or if they miss the reason for a decision (or mis-understand the reason), it could waste huge amounts of time in the long run. Hell, it helps me when I come back to something to have that there.

I've done a lot of maintenance and extension in my career, and I've learned well-documented code is a hell of a lot easier to maintain - both when I come to it, and when someone else (often with less experience/skill) has to deal with code I've written. Writing the code itself so it's easy to understand (as much as possible) is also important - it's part of "documenting", and it's an important skill, but it's often not sufficient.

There was a good thread of comments on this topic at embedded.com a few (3? 4? 5?) months ago.

From (12) to (13) he assumes

August 31, 2007 - 10:15pm
Anonymous (not verified)

From (12) to (13) he assumes all tasks used up their timeslice. This is almost never the case in desktop usage. So from (13) onwards his math is flawed.

Glib

September 1, 2007 - 6:45pm
Anonymous (not verified)

Glib remark. Just oozing with explanation... not.

Please do tell us what the assumption should be and more importantly whether it can be enforced. I just love people talking about theoretical ideal observations - which are too expensive/intensive to enforce in real-world scenarios.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
speck-geostationary