In a June 1992 posting to the linux-activists mailing list, Linus Torvalds described the original Linux scheduler, noting, "the scheduler in linux is pretty simple, but does a reasonably good job at giving good IO response while not being too unfair against cpu-bound processes." A year later, Linus posted a more detailed description of the scheduler, noting, "the linux scheduling algorithm is one of the simplest ones possible". Comments in the original 254-line sched.c file read, "'schedule()' is the scheduler function. This is GOOD CODE! There probably won't be any reason to change this, as it should work well in all circumstances (ie gives IO-bound processes good response etc). The one thing you might take a look at is the signal-handler code here."
Comments in the current 6,709-line sched.c file show the first changes being made in 1996 by Dave Grothe, "to fix bugs in semaphores and make semaphores SMP safe". Two years later, Andrea Arcangeli is credited with implementing "schedule_timeout() and related stuff". It was not until 2002, ten years after Linus' original code was written, that the scheduler received a complete rewrite: "new ultra-scalable O(1) scheduler by Ingo Molnar: hybrid priority-list and round-robin design with an array-switch method of distributing timeslices and per-CPU runqueues." Con Kolivas is credited with "interactivity tuning" in 2003, and Nick Piggin added "scheduler domains" in 2004. A more recent rewrite of the scheduler happened in April, again by Ingo Molnar, this time with his Completely Fair Scheduler.
A university student studying operating systems asked why the Linux kernel uses two chained lists in its LRU (least recently used) page replacement algorithm. Andrea Arcangeli, whose virtual memory subsystem was merged into the 2.4.10 kernel, explained, "back then I designed it with two lru lists because by splitting the active from the inactive cache allows to detect the cache pollution before it starts discarding the working set." He went on to add, "a page in the inactive list will be collected much more quickly than a page in the active list, so the pollution will be collected more quickly than the working set. Then the VM while freeing cache tries to keep a balance between the size of the two lists to avoid being too unfair, obviously at some point the active list have to be de-activated too."
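The behavior Andrea describes can be illustrated with a toy model. What follows is a minimal Python sketch, not the kernel's implementation: the class name TwoListLRU, the rule that a page is promoted on its second reference, and the cap that holds the active list to half of the cache are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy page cache with an inactive and an active LRU list (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()  # pages referenced once; oldest entry first
        self.active = OrderedDict()    # pages referenced again; oldest entry first

    def touch(self, page):
        """Record a reference to `page` and return the evicted page, or None."""
        if page in self.active:
            self.active.move_to_end(page)   # still hot: refresh its position
            return None
        if page in self.inactive:
            del self.inactive[page]         # second reference: promote it
            self.active[page] = True
            self._rebalance()
            return None
        self.inactive[page] = True          # first reference: start out inactive
        return self._reclaim()

    def _rebalance(self):
        # Assumed policy: deactivate the oldest active pages once the active
        # list exceeds half of the cache, so it cannot grow without bound.
        while len(self.active) > self.capacity // 2:
            old, _ = self.active.popitem(last=False)
            self.inactive[old] = True

    def _reclaim(self):
        # Evict only when over capacity, and always from the inactive list.
        if len(self.active) + len(self.inactive) <= self.capacity:
            return None
        victim, _ = self.inactive.popitem(last=False)
        return victim


if __name__ == "__main__":
    cache = TwoListLRU(capacity=4)
    for page in ["a", "b", "a", "b"]:          # working set, referenced twice
        cache.touch(page)
    for page in ["s1", "s2", "s3", "s4", "s5", "s6"]:  # one-shot streaming scan
        cache.touch(page)
    print("active:", list(cache.active))
    print("inactive:", list(cache.inactive))
```

In this model a streaming scan of pages that are touched only once is reclaimed entirely from the inactive list, so the twice-referenced working set stays on the active list, which is the "pollution is collected before the working set" effect in miniature.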
Rik van Riel, author of the reverse mapping virtual memory code that was merged into the 2.5 kernel, noted, "since memory size has increased a lot more than disk speed over the last decade (and this is likely to continue for the next decades), the quality of page replacement algorithms is likely to become more and more important over time." In response to a proposal to split the LRU into two parts, one for the page cache and the other for mapped pages, Nick Piggin replied, "I actually had patches to do 'split active lists' a while back. They worked by lazily moving the page at reclaim-time, based on whether or not it is mapped. This isn't too much worse than the kernel's current idea of what a mapped page is." Rik offered some ideas on how to tune it further: "for each list we keep track of: 1) the size of the list 2) the rate at which we scan the list 3) the fraction of (non new) pages that get referenced. That way we can determine which list has the largest fraction of 'idle' pages sitting around and consequently which list should be scanned more aggressively."
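Rik's bookkeeping can be sketched in a few lines of Python. This is only one possible reading of his description: the names ListStats and pick_scan_target, the choice to compute the idle fraction as one minus the referenced fraction, and the handling of a list that has not been scanned yet are all assumptions, not anything taken from kernel code.

```python
from dataclasses import dataclass


@dataclass
class ListStats:
    """Per-list counters, following the three items in the quote above."""
    name: str
    size: int        # 1) number of pages on the list
    scanned: int     # 2) pages scanned during the last interval
    referenced: int  # 3) scanned (non new) pages that were found referenced

    @property
    def idle_fraction(self):
        # Assumption: a list with no scan data is treated as having no
        # known idle pages, so it is never preferred on that basis alone.
        if self.scanned == 0:
            return 0.0
        return 1.0 - self.referenced / self.scanned


def pick_scan_target(lists):
    """Return the list with the largest fraction of idle pages."""
    return max(lists, key=lambda stats: stats.idle_fraction)


if __name__ == "__main__":
    page_cache = ListStats("page cache", size=1000, scanned=200, referenced=20)
    mapped = ListStats("mapped", size=500, scanned=100, referenced=80)
    print(pick_scan_target([page_cache, mapped]).name)
```

With the hypothetical numbers above, most scanned page cache pages were not referenced, so that list would be the one to scan more aggressively.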
A thread on the lkml began with a query about using O_DIRECT when opening a file. An early white paper written by Andrea Arcangeli to describe the O_DIRECT patch before it was merged into the 2.4 kernel explains, "with O_DIRECT the kernel will do DMA directly from/to the physical memory pointed [to] by the userspace buffer passed as [a] parameter to the read/write syscalls. So there will be no CPU and memory bandwidth spent in the copies between userspace memory and kernel cache, and there will be no CPU time spent in kernel in the management of the cache (like cache lookups, per-page locks etc..)." Linux creator Linus Torvalds was quick to reply that, despite all the claims, there is no good reason for opening files with O_DIRECT, and suggested that interfaces like madvise() and posix_fadvise() be used instead: "there really is no valid reason for EVER using O_DIRECT. You need a buffer whatever IO you do, and it might as well be the page cache. There are better ways to control the page cache than play games and think that a page cache isn't necessary."
Linus went on to explain, "the only reason O_DIRECT exists is because database people are too used to it, because other OS's haven't had enough taste to tell them to do it right, so they've historically hacked their OS to get out of the way. As a result, our madvise and/or posix_fadvise interfaces may not be all that strong, because people sadly don't use them that much. It's a sad example of a totally broken interface (O_DIRECT) resulting in better interfaces not getting used, and then not getting as much development effort put into them." To further underscore his point, he humorously added:
"The whole notion of "direct IO" is totally brain damaged. Just say no.
This is your brain: O
This is your brain on O_DIRECT: ."
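For readers unfamiliar with the alternative Linus points to, here is a hedged sketch of advising the page cache instead of bypassing it. It uses Python's os.posix_fadvise wrapper, which is available on Unix systems only. The function name read_with_hints and the particular hints chosen (sequential access before reading, "don't need" afterwards) are assumptions for illustration and are not taken from the thread.

```python
import os


def read_with_hints(path, chunk_size=1 << 20):
    """Read `path` through the page cache, advising the kernel along the way.

    Returns the number of bytes read.
    """
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        # Hint: the file will be read front to back (a length of 0 means
        # "from the offset to the end of the file").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        # Hint: these pages will not be needed again soon, so the kernel
        # is free to drop them from the page cache.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

The calls are advisory: the kernel may act on them or ignore them, and the data still passes through the page cache, which is the trade-off being argued over in the thread.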
Andrea Arcangeli is well known for having completely rewritten and stabilized the virtual memory subsystem in the 2.4 Linux kernel. Many were surprised when Linus Torvalds merged Andrea's VM into 2.4.10, but the new memory subsystem has long since proved itself. Andrea is a 27-year-old Linux kernel hacker living in Italy and working for SUSE.