CFS Development Tree Backported

Submitted by Jeremy on September 26, 2007 - 8:19am.
Linux news

"By popular demand, here is release -v22 of the CFS scheduler. It is a full backport of the latest & greatest sched-devel.git code to v2.6.23-rc8, v2.6.22.8, v2.6.21.7 and v2.6.20.20," announced Ingo Molnar. He added, "this is the first time the development version of the scheduler has been fed back into the stable backport series, so there's many changes since v20.5". Ingo went on to explain, "even if CFS v20.5 worked well for you, please try this release too, with a good focus on interactivity testing - because, unless some major showstopper is found, this codebase is intended for a v2.6.24 upstream merge." He then summarized some of the changes:

"The changes in v22 consist of lots of mostly small enhancements, speedups, interactivity improvements, debug enhancements and tidy-ups - many of which can be user-visible. (These enhancements have been contributed by many people - see the changelog below and the git tree for detailed credits.)

"The biggest individual new feature is per UID group scheduling, written by Srivatsa Vaddagiri, which can be enabled via the CONFIG_FAIR_USER_SCHED=y .config option. With this feature enabled, each user gets a fair share of the CPU time, regardless of how many tasks each user is running."


From: Ingo Molnar <mingo@...>
Subject: [patch/backport] CFS scheduler, -v22, for v2.6.23-rc8, v2.6.22.8, v2.6.21.7, v2.6.20.20
Date: Sep 26, 7:13 am 2007

By popular demand, here is release -v22 of the CFS scheduler. It is a 
full backport of the latest & greatest sched-devel.git code to 
v2.6.23-rc8, v2.6.22.8, v2.6.21.7 and v2.6.20.20. The patches can be 
downloaded from the usual place:

    http://people.redhat.com/mingo/cfs-scheduler/

This is the first time the development version of the scheduler has been 
fed back into the stable backport series, so there's many changes since 
v20.5:

 15 files changed, 1103 insertions(+), 840 deletions(-)

Even if CFS v20.5 worked well for you, please try this release too, with 
a good focus on interactivity testing - because, unless some major 
showstopper is found, this codebase is intended for a v2.6.24 upstream 
merge.

( Even a quick, subjective report of: "checked this patch, it didn't
  crash and it feels like v20.5" or "laggier than v20.5" or "feels 
  better than v20.5" is useful to us and enables us to judge the general 
  direction of interactivity. )

The changes in v22 consist of lots of mostly small enhancements, 
speedups, interactivity improvements, debug enhancements and tidy-ups - 
many of which can be user-visible. (These enhancements have been 
contributed by many people - see the changelog below and the git tree 
for detailed credits.)

The biggest individual new feature is per UID group scheduling, written 
by Srivatsa Vaddagiri, which can be enabled via the 
CONFIG_FAIR_USER_SCHED=y .config option. With this feature enabled, each 
user gets a fair share of the CPU time, regardless of how many tasks 
each user is running.

For example, it took me 0.1 seconds to log in over ssh as root on a 
testbox that was running a kernel with per UID group scheduling enabled:

  $ time ssh root@testbox /bin/true

  real    0m0.125s
  user    0m0.013s
  sys     0m0.011s

Which testbox had a system load of 1000.17 at this time, due to a rogue 
runaway workload of one thousand (!) non-reniced infinite loops:

  top - 14:34:05 up 30 min,  3 users,  load average: 1000.17, 839.23, 444.57
  Tasks: 1131 total, 1002 running, 129 sleeping,   0 stopped,   0 zombie
  Cpu(s): 30.8%us,  0.2%sy,  0.0%ni, 68.2%id,  0.8%wa,  0.0%hi,  0.0%si
  Mem:   2048992k total,   157688k used,  1891304k free,    18308k buffers
  Swap:  4096564k total,        0k used,  4096564k free,    25464k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  3633 root      20   0  2892 1576  724 R    7  0.1   0:00.06 top
  2427 mingo     20   0  1576  244  196 R    2  0.0   0:01.14 loop
  2429 mingo     20   0  1576  244  196 R    2  0.0   0:01.14 loop

To the root user, the box was fully usable and interactivity was 
excellent - i was easily able to kill off those runaway tasks.

( The /proc/root_user_cpu_share tunable also allows the root uid to have
  higher weight than other uids. Unit of the tunable is 0.1%, a weight
  of 100% is 1024, the default weight of the root uid is 200%. )
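
As a rough illustration of the weight arithmetic described above, the sketch below converts a percentage into a tunable value and splits the CPU between two busy uids. It assumes that CPU time is divided in proportion to each uid's weight, which the announcement does not spell out; `percent_to_weight` and `cpu_shares` are hypothetical helper names, not kernel interfaces.

```python
from fractions import Fraction

FULL_WEIGHT = 1024  # per the announcement: a weight of 100% is 1024


def percent_to_weight(percent):
    """Convert a percentage (100 = a normal uid's weight) to a tunable value."""
    return FULL_WEIGHT * percent // 100


def cpu_shares(weights):
    """Split the CPU among busy uids in proportion to weight (assumed model)."""
    total = sum(weights.values())
    return {uid: Fraction(w, total) for uid, w in weights.items()}


root_weight = percent_to_weight(200)  # the stated default for the root uid
shares = cpu_shares({"root": root_weight, "mingo": percent_to_weight(100)})
print(root_weight)     # 2048
print(shares["root"])  # 2/3
```

Under that assumption, root at the default 200% competing with a single busy user at 100% would receive two thirds of the CPU.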

See the detailed shortlog below for a description of the other changes, 
or pull the sched-devel.git tree for all the 83 commits:

  git-pull git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git

Also, as usual, any sort of feedback, bugreport, fix and suggestion is 
more than welcome!

	Ingo

------------------>
Dmitry Adamushko (9):
      sched: clean up struct load_stat
      sched: clean up schedstat block in dequeue_entity()
      sched: sched_setscheduler() fix
      sched: add set_curr_task() calls
      sched: do not keep current in the tree and get rid of sched_entity::fair_key
      sched: optimize task_new_fair()
      sched: simplify sched_class::yield_task()
      sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()
      sched: yield fix

Hiroshi Shimamoto (1):
      sched: clean up sched_fork()

Matthias Kaehlcke (1):
      sched: use list_for_each_entry_safe() in __wake_up_common()

Mike Galbraith (2):
      sched: fix SMP migration latencies
      sched: fix formatting of /proc/sched_debug

Peter Zijlstra (12):
      sched: simplify SCHED_FEAT_* code
      sched: new task placement for vruntime
      sched: simplify adaptive latency
      sched: clean up new task placement
      sched: add tree based averages
      sched: handle vruntime overflow
      sched: better min_vruntime tracking
      sched: add vslice
      sched debug: check spread
      sched: max_vruntime() simplification
      sched: clean up min_vruntime use
      sched: speed up and simplify vslice calculations

S.Caglar Onur (1):
      sched debug: BKL usage statistics, fix

Srivatsa Vaddagiri (12):
      sched: group-scheduler core
      sched: revert recent removal of set_curr_task()
      sched: fix minor bug in yield
      sched: print nr_running and load in /proc/sched_debug
      sched: print &rq->cfs stats
      sched: clean up code under CONFIG_FAIR_GROUP_SCHED
      sched: add fair-user scheduler
      sched: group scheduler wakeup latency fix
      sched: group scheduler SMP migration fix
      sched: group scheduler, fix coding style issues
      sched: group scheduler, fix bloat
      sched: group scheduler, fix latency

Ingo Molnar (44):
      sched: fix new-task method
      sched: resched task in task_new_fair()
      sched: small sched_debug cleanup
      sched: debug: track maximum 'slice'
      sched: uniform tunings
      sched: use constants if !CONFIG_SCHED_DEBUG
      sched: remove stat_gran
      sched: remove precise CPU load
      sched: remove precise CPU load calculations #2
      sched: track cfs_rq->curr on !group-scheduling too
      sched: cleanup: simplify cfs_rq_curr() methods
      sched: uninline __enqueue_entity()/__dequeue_entity()
      sched: speed up update_load_add/_sub()
      sched: clean up calc_weighted()
      sched: introduce se->vruntime
      sched: move sched_feat() definitions
      sched: optimize vruntime based scheduling
      sched: simplify check_preempt() methods
      sched: wakeup granularity fix
      sched: add se->vruntime debugging
      sched: add more vruntime statistics
      sched: debug: update exec_clock only when SCHED_DEBUG
      sched: remove wait_runtime limit
      sched: remove wait_runtime fields and features
      sched: x86: allow single-depth wchan output
      sched: fix delay accounting performance regression
      sched: prettify /proc/sched_debug output
      sched: enhance debug output
      sched: kernel/sched_fair.c whitespace cleanups
      sched: fair-group sched, cleanups
      sched: enable CONFIG_FAIR_GROUP_SCHED=y by default
      sched debug: BKL usage statistics
      sched: remove unneeded tunables
      sched debug: print settings
      sched debug: more width for parameter printouts
      sched: entity_key() fix
      sched: remove condition from set_task_cpu()
      sched: remove last_min_vruntime effect
      sched: undo some of the recent changes
      sched: fix place_entity()
      sched: fix sched_fork()
      sched: remove set_leftmost()
      sched: clean up schedstats, cnt -> count
      sched: cleanup, remove stale comment

 arch/i386/Kconfig       |   11 
 fs/proc/base.c          |    2 
 include/linux/sched.h   |   56 ++-
 init/Kconfig            |   21 +
 kernel/delayacct.c      |    2 
 kernel/sched.c          |  577 ++++++++++++++++++++++++-------------
 kernel/sched_debug.c    |  246 ++++++++++------
 kernel/sched_fair.c     |  733 ++++++++++++++++++------------------------------
 kernel/sched_idletask.c |    5 
 kernel/sched_rt.c       |   12 
 kernel/sched_stats.h    |   28 -
 kernel/sysctl.c         |   31 --
 kernel/user.c           |   43 ++
 13 files changed, 963 insertions(+), 804 deletions(-)


good job

September 26, 2007 - 8:33am
Anonymous (not verified)

load avg 1000 and the system is still usable? holy cow, that's crazy!
Looking forward to using this scheduler on my systems.

Looks nifty. The user

September 26, 2007 - 8:45am
Anonymous (not verified)

Looks nifty. The user running 1000 tasks won't have good interactivity, but other users
and root are not punished.

But that idle of 60% does

September 26, 2007 - 1:23pm
Anonymous (not verified)

But that idle of 60% does look somewhat bad.

That is odd...

September 26, 2007 - 2:59pm

...although idle time isn't calculated by the kernel, and could be subject to rounding error on such a (relatively) short sample interval as top's.

Could it be rounding error in top? Top computes idle time by adding up all the CPU-seconds taken since the last refresh. With 1000 tasks all getting darn near identical amounts of CPU time--slightly less than 0.1% of the CPU--it's not hard to imagine many of the usage %ages getting rounded down, thereby compromising the computation.

The real way to see if the idle's at 60% would be to look at total CPU seconds reported by ps after, say, letting things run for 20 minutes or so (1200 seconds), so that each of the 1000 tasks gets at least a full CPU-second. (1.2 CPU seconds, ideally.)

--
Program Intellivision and play Space Patrol!

Nonsense

September 27, 2007 - 4:17am
Alistair Strachan (not verified)

The kernel does calculate idle time. Please read Documentation/filesystems/proc.txt wrt /proc/stat, which has a per-CPU field for this.
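
For readers who want to check the idle figure themselves, here is a minimal sketch of how an idle percentage can be derived from two samples of the aggregate "cpu" line in /proc/stat. The field order (user, nice, system, idle, iowait, irq, softirq, steal) follows proc.txt; the sample values are made up for illustration, and `parse_cpu_line` and `idle_percent` are hypothetical helper names.

```python
def parse_cpu_line(line):
    """Return the first eight tick counters from an aggregate 'cpu' line."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Later fields (guest time) are already counted in user time, so skip them.
    return [int(v) for v in fields[1:9]]


def idle_percent(before, after):
    """Idle share of the interval between two samples, in percent."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [y - x for x, y in zip(first, second)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[3] / total  # index 3 is the idle counter


# Made-up sample values, for illustration only.
sample_1 = "cpu 1000 0 200 5000 50 0 0"
sample_2 = "cpu 1300 0 210 5680 60 0 0"
print(round(idle_percent(sample_1, sample_2), 1))  # 68.0
```

On a live system the two samples would be the first line of /proc/stat read a few seconds apart; a longer interval reduces the effect of rounding.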

Well whaddaya know!

September 27, 2007 - 11:55am

It sure does, apparently in units of HZ. This certainly seems to be a rounding error issue at some point in the system, unless the CPU really is 60% idle (which seems unlikely). I wonder if this file is up to date on the situation. Back in February, I guess it was, but a lot's changed in the last 7 months.

As for "top" computing the idle: I guess I was thinking more along the lines of how old school top (which read /dev/kmem) seemed to work. That was a looooong time ago, and I could even be misremembering. It looks like there is definitely some amusing code in the procps-top to handle older Linux ("SMP kernels (as of pre-2.4 era) can report idle time going backwards"), so who knows.


Edit: This is interesting. According to this file, idle time is reported as the sum of kernel and user space time that "init" was given. (Makes sense, since that's what "runs" when everything else sleeps.) Hmmm.... maybe there is something to this?

--
Program Intellivision and play Space Patrol!

what's the default for CONFIG_FAIR_USER_SCHED ?

September 26, 2007 - 10:13am
Anonymous (not verified)

But will CONFIG_FAIR_USER_SCHED=y be the default?
if it isn't the default, it wouldn't be all that useful, imo, because it's a real obscure flag deep in the bowels of the scheduler, and vast majority of users will run without it.
if it really works as advertised, i think it should be the default.
just my ¢ 1.75

It doesn't really matter, as

September 26, 2007 - 10:29am
Anonymous (not verified)

It doesn't really matter, as it's what the dists decide to use that'll be the norm.

There were (are) problems

September 26, 2007 - 10:44am
Anonymous (not verified)

There were (are) problems with this setting and SMP. I'm sure it will be ironed out by 2.6.24.

It is enabled by default on the development branch. Ultimately, the default will be decided by the different distros.

There were (are) problems

September 26, 2007 - 10:59am
Anonymous (not verified)

There were (are) problems with this setting and SMP. I'm sure it will be ironed out by 2.6.24.

I believe those problems were fixed prior to the CFS-v22 release, and the bug reporter confirmed it too. See this post:

I piddled around with fair users this morning, and it worked well. With Xorg and Gforce as one user (X and Gforce are synchronous ATM), and a make -j30 as another, I could barely tell the make was running. Watching a dvd, I couldn't tell. Latencies were pretty darn good throughout three hours of testing this and that.

Somehow, I think the

September 26, 2007 - 10:44am
Anonymous (not verified)

Somehow, I think the /proc/root_user_cpu_share is a private case. Allowing @staff to have more CPU than @students (or @guests) would be much better.

And why adding more stuff to /proc?

And why adding more stuff

September 26, 2007 - 11:15am
Anonymous (not verified)

And why adding more stuff to /proc?

Where should it be instead?

why not /sys?

September 26, 2007 - 11:21am
Anonymous (not verified)

why not /sys?

Allowing @staff to have more

September 27, 2007 - 2:19am
Anonymous (not verified)

Allowing @staff to have more CPU than @students (or @guests) would be much better.

Just wait for the process containers (now "control groups"?) patch; this will let you do that.

Just tried out the patch

September 27, 2007 - 6:14am

Just tried out the patch with 2.6.23-rc8.

My system feels really quite snappy now, possibly faster than with the ck series.

I did have an issue with tearing with the GeForce driver, but that seems to be fixed now.

Can't wait for this to be in the kernel as default.

can you also "nice" users?

September 27, 2007 - 9:10pm
Anonymous (not verified)

Giving each user a fair share of CPU time is a fine addition, but can you also "nice" a user up or down, i.e., give him/her a larger or smaller share than other users?

Yes with process groups due in 2.6.24

September 28, 2007 - 1:24am
Anonymous (not verified)

Yes you can with the process groups feature due in 2.6.24

The idea is that you can group processes together, and each group gets a fair share. By default each user's processes are a group, but the sysadmin can set up other groups. For example in a university environment, you might put the students in a different group from the staff. Each group can be configured to get an unequal share.

The main driver for this is to prevent users from abusing the scheduler by running lots of processes in parallel, as otherwise one user could run a compile with five concurrent threads and get five times as much CPU time as someone who is running a single threaded application.
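
The difference can be sketched with a toy model. This is only an illustration of the arithmetic, not the scheduler's actual algorithm; the user names and task counts are hypothetical, and all tasks are assumed to be CPU-bound with equal weight.

```python
from fractions import Fraction


def per_task_shares(tasks_per_user):
    """Each task gets an equal slice, so a user's share grows with task count."""
    total = sum(tasks_per_user.values())
    return {user: Fraction(n, total) for user, n in tasks_per_user.items()}


def per_user_shares(tasks_per_user):
    """Each user with runnable tasks gets an equal slice, whatever the count."""
    busy = [user for user, n in tasks_per_user.items() if n > 0]
    return {user: Fraction(1, len(busy)) for user in busy}


workload = {"alice": 5, "bob": 1}  # hypothetical users and task counts
print(per_task_shares(workload)["alice"])  # 5/6
print(per_user_shares(workload)["alice"])  # 1/2
```

With per-task fairness the five-task user ends up with five times the CPU of the single-task user (5/6 against 1/6); with per-user grouping both get half, and the five tasks share that half among themselves.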
