"The problem with swap over network is the generic swap problem: needing memory to free memory. Normally this is solved using mempools, as can be seen in the BIO layer," explained Peter Zijlstra. "Swap over network has the problem that the network subsystem does not use fixed sized allocations, but heavily relies on kmalloc(). This makes mempools unusable."
The first fifteen patches set up a generic framework for reserving memory. Patches 16-23 actually put the framework to use on the network stack. Peter noted, "a network write back completion [involves] receiving packets, which when there is no memory, is rather hard. And even when there is memory there is no guarantee that the required packet comes in in the window that that memory buys us." He went on to explain, "the solution to this problem is found in the fact that network is to be assumed lossy. Even now, when there is no memory to receive packets the network card will have to discard packets. What we do is move this into the network stack." Patches 24-26 set up an infrastructure for swapping to a filesystem instead of a block device, which is then utilized by the final patches, "finally, convert NFS to make use of the new network and vm infrastructure to provide swap over NFS." When the usefulness of these patches were questioned, Peter noted, "There is a large corporate demand for this, which is why I'm doing this. The typical usage scenarios are: 1) cluster/blades, where having local disks is a cost issue (maintenance of failures, heat, etc) 2) virtualisation, where dumping the storage on a networked storage unit makes for trivial migration and what not.."
A recent report on the lkml suggested improved IO/writeback performance in the recently released 2.6.24-rc1 kernel compared to the earlier 2.6.19.2 and 2.6.22.6 kernels. Credit was given to some patches by Peter Zijlstra. Ingo Molnar replied, "wow, really nice results! Peter does know how to make stuff fast :) Now lets pick up some of Peter's other, previously discarded patches as well :-)" He pointed to several patches "as a starter", then quipped, "I think the MM should get out of deep-feature-freeze mode - there's tons of room to improve :-/"
Andrew Morton replied, "kidding. We merged about 265 MM patches in 2.6.24-rc1: 482 files changed, 8071 insertions(+), 5142 deletions(-)". He added, "a lot of that was new functionality. That's easier to add than things which change long-standing functionality." Of the patches Ingo pointed to, Peter noted he was currently working on polishing the swap-over-NFS patch, "will post that one again, soonish.... Esp. after Linus professed liking to have swap over NFS." Rik van Riel also replied regarding rewriting the page replacement code, "at the moment I only have the basic 'plumbing' of the split VM working and am fixing some bugs in that. Expect a patch series with that soon, so you guys can review that code and tell me where to beat it into shape some more :)"
"Lots of scheduler updates in the past few days, done by many people," noted Ingo Molnar, going on to describe the more significant updates. "Most importantly, the SMP latency problems reported and debugged by Mike Galbraith should be fixed for good now." Ingo noted that the current code base was looking stable and was likely to be merged into the upcoming 2.6.24 kernel, "so please give it a good workout and let us know if there's anything bad going on. (If this works out fine then i'll propagate these changes back into the CFS backport, for wider testing.)" He went on to describe the other main changes in the development branch of the process scheduler:
"I've also included the latest and greatest group-fairness scheduling patch from Srivatsa Vaddagiri, which can now be used without containers as well (in a simplified, each-uid-gets-its-fair-share mode). This feature (CONFIG_FAIR_USER_SCHED) is now default-enabled.
"Peter Zijlstra has been busy enhancing the math of the scheduler: we've got the new 'vslice' forked-task code that should enable snappier shell commands during load while still keeping kbuild workloads in check."
Having recently returned from the Linux kernel summit, Ingo Molnar and Peter Zijlstra sent out some performance updates to the Completely Fair Scheduler:
"Our main focus has been on simplifications and performance - and as part of that we've also picked up some ideas from Roman Zippel's 'Really Fair Scheduler' patch as well and integrated them into CFS. We'd like to ask people go give these patches a good workout, especially with an eye on any interactivity regressions."
He noted that some of the changes included removing features that had proved unecessary. "while keeping the things that worked out fine, like sleeper fairness." Ingo posted some results from the lmbench benchmark noting around a 16% speedup on both the 32-bit and 64-bit x86 architectures. He added, "we are now a bit faster than the O(1) scheduler was under v2.6.22 - even on 32-bit. The main speedup comes from the avoidance of divisions (or shifts) in the wakeup and context-switch fastpaths."
Ingo Molnar reviewed Roman Zippel's Really Fair Scheduler code, suggesting that much of the work was similar to that which was being done by Peter Zijlstra, "all in one, we don't disagree, this is an incremental improvement we are thinking about for 2.6.24. We do disagree with this being positioned as something fundamentally different though - it's just the same thing mathematically, expressed without a "/weight" divisor, resulting in no change in scheduling behavior. (except for a small shift of CPU utilization for a synthetic corner-case)"
Roman was not impressed with Ingo's review, asking, "did you even try to understand what I wrote?" He continued, "while Peter's patches are interesting, they are only a small step to what I'm trying to achieve." Regarding performance and code-quality concerns with his patch he added, "I explicitly said that my patch is only a prototype, so I haven't done any cleanups and tuning in this direction yet, so making any conclusions based on code size comparisons is quite ridiculous at this point. The whole point of this patch was to demonstrate the algorithmic changes, not to demonstrate a final and perfectly tuned scheduler."