"We are working [on] a new I/O scheduler based on CFQ, aiming at improved predictability and fairness of the service, while maintaining the high throughput it already provides," began Fabio Checconi, announcing the BFQ I/O scheduler. "The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin scheduling policy of time slices into a fair queuing scheduling of sector budgets," he continued, "more precisely, each task is assigned a budget measured in number of sectors instead of amount of time, and budgets are scheduled using a slightly modified version of WF2Q+. The budget assigned to each task varies over time as a function of its behaviour. However, one can set the maximum value of the budget that BFQ can assign to any task." Fabio went on to explain:
"The time-based allocation of the disk service in CFQ, while having the desirable effect of implicitly charging each application for the seek time it incurs, suffers from unfairness problems also towards processes making the best possible use of the disk bandwidth. In fact, even if the same time slice is assigned to two processes, they may get a different throughput each, as a function of the positions on the disk of their requests. On the contrary, BFQ can provide strong guarantees on bandwidth distribution because the assigned budgets are measured in number of sectors. Moreover, due to its Round Robin policy, CFQ is characterized by an O(N) worst-case delay (jitter) in request completion time, where N is the number of tasks competing for the disk. On the contrary, given the accurate service distribution of the internal WF2Q+ scheduler, BFQ exhibits O(1) delay."
Jens Axboe reacted favorably, "Fabio, I've merged the scheduler for some testing. Overall the code looks great, you've done a good job!" He noted that the scheduler should soon appear in the -mm tree, and that it was worth considering merging the two I/O schedulers together.
Ingo Molnar posted a merge request for the latest git scheduler tree summarizing, "it contains various enhancements to the scheduler - find the full shortlog below. 96 commits from 19 authors - scheduler developers have been busy again. :-/" He added, "the scheduling behavior of the kernel to normal users should not change over v2.6.24, but there are a good number of new features and enhancements under the hood." Ingo went on to list a number of these new features, including:
"Various instrumentation and debugging enhancements from Arjan van de Ven; Peter Zijlstra's RT time limit and RT throttling code for the RT scheduling class; Paul E. McKenney's preemptible RCU code; refcount based CPU-hotplug rework by Gautham R Shenoy; there's serious interest in running RT tasks on enterprise-class hardware, so Steven Rostedt and Gregory Haskins wrote a large number of enhancements to the RT scheduling class and load-balancer; Peter Zijlstra's high-resolution scheduler tick code; [...] and a good number of other, smaller enhancements."
Ingo Molnar announced that version 24 of his Completely Fair Scheduler patch is now available, backported to the 2.6.24-rc3 kernel and several earlier stable kernels. He noted that there have been significant changes since the last backport, "36 files changed, 2359 insertions(+), 1082 deletions(-). That's 187 individual commits from 32 authors." Ingo added, "99% of these changes are already upstream in Linus's git tree and they will be released as part of v2.6.24. (there are 4 pending commits that are in the small 2.6.24-rc3-v24 patch.)" He also highlighted some of the more significant improvements:
"Improved interactivity via Peter Zijlstra's 'virtual slices' feature. As load increases, the scheduler shortens the virtual timeslices that tasks get, so that applications observe the same constant latency for getting on the CPU. (This goes on until the slices reach a minimum granularity value).
"CONFIG_FAIR_USER_SCHED is now available across all backported kernels and the per user weights are configurable via /sys/kernel/uids/. Group scheduling got refined all around."
Ingo Molnar sent a merge request to Linus Torvalds for the latest CFS fixes. CFS, the Completely Fair Scheduler, was merged into the mainline Linux kernel in July of 2007. It was first included in the 2.6.23 kernel, released in October of 2007. The scheduler appears to be quickly stabilizing, visible in the minimal assortment of fixes contained in the latest source code push. Ingo Molnar summarized the changes:
"There are two cross-subsystem groups of fixes: three commits that resolve a KVM build fix on !SMP - acked by Avi to go in via the scheduler git tree because it changes a central include file. The other one is a powerpc CPU time accounting regression fix from Paul Mackerras.
"The remaining 14 commits: one crash fix (only triggerable via the new control-groups filesystem), a delay-accounting regression fix, two performance regression fixes, a latency fix, two small scheduling-behavior regression fixes and seven cleanups."
"Currently in mainline the balancing of multiple RT threads is quite broken. That is to say that a high priority thread that is scheduled on a CPU with a higher priority thread may need to unnecessarily wait while it can easily run on another CPU that's running a lower priority thread," began Steven Rostedt, describing his patchset to introduce improved real time task balancing. He explained:
"Balancing (or migrating) tasks in general is an art. Lots of considerations must be taken into account. Cache lines, NUMA and more. This is true with general processes which expect high throughput and migration can be done in batch. But when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing. Right now an RT task can wait several milliseconds before it gets scheduled to run. And perhaps even longer. The migration thread is not fast enough to take care of RT tasks."
Steven described his test cases and numerous issues he noticed with the current balancing code, noting, "to solve this issue, I've changed the RT task balancing from a passive method (migration thread) to an active method. This new method is to actively push or pull RT tasks when they are woken up or scheduled."
Ken Chen submitted a patch to reduce the memory footprint of schedstat in a thread titled, "schedstat needs a diet". He explained, "schedstat is useful in investigating CPU scheduler behavior. Ideally, I think it is beneficial to have it on all the time. However, the cost of turning it on in production system is quite high, largely due to number of events it collects and also due to its large memory footprint." His patch converted numerous unsigned long variables to unsigned int, "most of the fields probably don't need to be [a] full 64-bits on 64-bit [architectures]. Rolling over 4 billion events will most likely take a long time and user space tools can be made to accommodate that."
Ingo Molnar merged the patch into his scheduler development tree suggesting there were further conversions that could be made, "note that current -git has a whole bunch of new schedstats fields in /proc/<pid>/sched which can be used to track the exact balancing behavior of tasks. It can be cleared via echoing 0 to the file - so overflow is not an issue. Most of those new fields should probably be unsigned int too. (they are u64 right now.)"
"It contains lots of scheduler updates from lots of people - hopefully the last big one for quite some time," began Ingo Molnar, describing his merge request for the linux-2.6-sched git tree. He continued, "most of the focus was on performance (both micro-performance and scalability/balancing), but there's the fair-scheduling feature now Kconfig selectable too. Find the shortlog below." Ingo noted, "code that is touched outside of the scheduler: the KVM bits were acked by Avi, the net/unix change is trivial and only affects sync wakeups, ditto the fs/pipe.c changes - but i can push those separately if it needs an ack from David first." He then concluded:
"Testing status: the changes are chronological and all the interactivity-impacting changes are near the head of the queue and most of them were done weeks ago, and were thus part of the CFS-v22 backport series - which was tested by many people. There are no known regressions at the moment. It's all fully bisectable."
"As you can see it in the graph, v2.6.23 schedules much more consistently too. [ v2.6.22 has a small (but potentially statistically insignificant) edge at 4-6 clients, and CFS has a slightly better peak (which is statistically insignificant). ]"
Ingo noted that he was unable to find information as to how the other benchmark was generated, "there are no .configs or other testing details at or around that URL that i could use to reproduce their result precisely, so at least a minimal bugreport would be nice." He then offered some tips on how sysbench works and some suggested tunings, "sysbench is a pretty 'batched' workload: it benefits most from batchy scheduling: the client doing as much work as it can, then server doing as much work as it can - and so on. The longer the client can work the more cache-efficient the workload is. Any round-trip to the server due to pesky preemption only blows up the cache footprint of the workload and gives lower throughput."
In a continuing discussion about the difference between pluggable security and pluggable schedulers, Linus Torvalds quoted himself:
"Another difference is that when it comes to schedulers, I feel like I actually can make an informed decision. Which means that I'm perfectly happy to just make that decision, and take the flak that I get for it. And I do (both decide, and get flak). That's my job."
He added, "which you seem to not have read or understood (neither did apparently anybody on slashdot)". Linus continued, "the arguments that 'servers' have a different profile than 'desktop' is pure and utter garbage, and is perpetuated by people who don't know what they are talking about." He then asked and answered his own question, "really: tell me what the difference is between 'desktop' and 'server' scheduling. There is absolutely *none*," going on to explain:
"Yes, there are differences in tuning, but those have nothing to do with the basic algorithm. They have to do with goals and trade-offs, and most of the time we should aim for those things to auto-tune (we do have the things in /proc/sys/kernel/, but I really hope very few people use them other than for testing or for some extreme benchmarking - at least I don't personally consider them meant primarily for 'production' use)."
"Really, i have never seen a _single_ mainstream app where the use of sched_yield() was the right choice," stated Ingo Molnar during a continuing discussion about the Completely Fair Scheduler. He went on to ask if anyone could point to specific code that illustrates the proper usage of sched_yield(). In response to a theory of how it could potentially optimize userland locking, Ingo challenged, "these are generic statements, but I'm _really_ interested in the specifics. Real, specific code that i can look at. The typical Linux distro consists of in excess of 500 millions of lines of code, in tens of thousands of apps, so there really must be some good, valid and 'right' use of sched_yield() somewhere in there, in some mainstream app, right? (because, as you might have guessed it, in the past decade of sched_yield() existence i _have_ seen my share of sched_yield() utilizing user-space code, and at the moment i'm not really impressed by those examples.)" Ingo went on to explain:
"sched_yield() has been around for a decade (about three times longer than futexes were around), so if it's useful, it sure should have grown some 'crown jewel' app that uses it and shows off its advantages, compared to other locking approaches, right?
"For example, if you asked me whether pipes are the best thing for certain apps, i could immediately show you tons of examples where they are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or just about any other core kernel concept or API. Your notion that showing a good example of an API would be 'difficult' because it's hard to determine 'smart' use is not tenable i believe and does not adequately refute my pretty plain-meaning 'it does not exist' assertion."
"Lots of scheduler updates in the past few days, done by many people," noted Ingo Molnar, going on to describe the more significant updates. "Most importantly, the SMP latency problems reported and debugged by Mike Galbraith should be fixed for good now." Ingo noted that the current code base was looking stable and was likely to be merged into the upcoming 2.6.24 kernel, "so please give it a good workout and let us know if there's anything bad going on. (If this works out fine then i'll propagate these changes back into the CFS backport, for wider testing.)" He went on to describe the other main changes in the development branch of the process scheduler:
"I've also included the latest and greatest group-fairness scheduling patch from Srivatsa Vaddagiri, which can now be used without containers as well (in a simplified, each-uid-gets-its-fair-share mode). This feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.
"Peter Zijlstra has been busy enhancing the math of the scheduler: we've got the new 'vslice' forked-task code that should enable snappier shell commands during load while still keeping kbuild workloads in check."
"sched_yield() is not - and should not be - about 'recalculating the position in the scheduler queue' like you do now in CFS," Linus Torvalds stated in a discussion with Completely Fair Scheduler author Ingo Molnar, pointing to the man pages to back up his argument that sched_yield should instead move a thread to the end of its queue, adding, "quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the 'right place' in the rbtree. It should move it *last*."
Ingo described how it worked with the pre-2.6.23 scheduler, "the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task." He went on to compare this to the new process scheduler, "so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. In CFS we don't have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_." He also noted that this would change the behavior for some desktop applications that call sched_yield(), "there will be lots of regression reports about lost interactivity during load."
"After posting some benchmarks involving cfs, I got some feedback, so I decided to do a follow-up that'll hopefully fill in the gaps many people wanted to see filled," Rob Hussey began. He added, "this time around I've done the benchmarks against 2.6.21, 2.6.22-ck1, and 2.6.23-rc6-cfs-devel (latest git as of 12 hours ago)." Rob briefly summarized, "the only analysis I'll offer is that both sd and cfs are improvements, and I'm glad that there is a lot of work being done in this area of linux development. Much respect to Con Kolivas, Ingo Molnar, and Roman Zippel, as well as all the others who have contributed."
Referring to a chart in which the blue line represented the CFS process scheduler, and the green line represented the SD "staircase" process scheduler, Ingo Molnar noted, "heh - am i the only one impressed by the consistency of the blue line in this graph? :-) [ and the green line looks a bit like a .. staircase? ]" He acknowledged some slowdown in CFS compared to SD in one of the benchmarks, "-ck1 is 0.8% faster in this particular test." Ingo then explained, "many things happened between 2.6.22-ck1 and 2.6.23-cfs-devel that could affect performance of this test. My initial guess would be sched_clock() overhead." In further testing he applied a low-res-sched-clock that resulted in better performance for CFS leading him to conclude, "the performance difference between -ck and -cfs-devel seems to be mostly down to the more precise (but slower) sched_clock() introduced in v2.6.23 and to the startup penalty of freshly created tasks." When asked if the low-res-sched-clock was likely to be merged, Ingo replied:
"I don't think so - we want precise/accurate scheduling before performance. (otherwise tasks working off the timer tick could steal away cycles without being accounted for them fairly, and could starve out all other tasks.) Unless the difference was really huge in real life - but it isn't."
"Looking at these graphs (and the fixed one from your second email), it sure looks a lot like CFS is doing at *least* as well as the old scheduler in every single test, and doing much better in most of them (in addition it's much more consistent between runs)," Kyle Moffett noted regarding recent benchmarks run against the Completely Fair Scheduler by Rob Hussey. Kyle continued:
"This seems to jive with all the other benchmarks and overall empirical testing that everyone has been doing. Overall I have to say a job well done for Ingo, Peter, Con, and all the other major contributors to this impressive endeavor."
Having recently returned from the Linux kernel summit, Ingo Molnar and Peter Zijlstra sent out some performance updates to the Completely Fair Scheduler:
"Our main focus has been on simplifications and performance - and as part of that we've also picked up some ideas from Roman Zippel's 'Really Fair Scheduler' patch as well and integrated them into CFS. We'd like to ask people to give these patches a good workout, especially with an eye on any interactivity regressions."
He noted that some of the changes included removing features that had proved unnecessary, "while keeping the things that worked out fine, like sleeper fairness." Ingo posted some results from the lmbench benchmark noting around a 16% speedup on both the 32-bit and 64-bit x86 architectures. He added, "we are now a bit faster than the O(1) scheduler was under v2.6.22 - even on 32-bit. The main speedup comes from the avoidance of divisions (or shifts) in the wakeup and context-switch fastpaths."