"If web browsers, office suites and mail clients on Windows have certain kinds of vulnerabilities, it is safe to assume that the same programs on Linux will have similar problems."
"You can't play fast and loose with data integrity."
"Exposing bugs is good for development, bad for business."
A recent report on the lkml suggested improved IO/writeback performance in the recently released 2.6.24-rc1 kernel compared to the earlier 18.104.22.168 and 22.214.171.124 kernels. Credit was given to some patches by Peter Zijlstra. Ingo Molnar replied, "wow, really nice results! Peter does know how to make stuff fast :) Now lets pick up some of Peter's other, previously discarded patches as well :-)" He pointed to several patches "as a starter", then quipped, "I think the MM should get out of deep-feature-freeze mode - there's tons of room to improve :-/"
Andrew Morton replied, "kidding. We merged about 265 MM patches in 2.6.24-rc1:
482 files changed, 8071 insertions(+), 5142 deletions(-)". He added, "a lot of that was new functionality. That's easier to add than things which change long-standing functionality." Of the patches Ingo pointed to, Peter noted he was currently working on polishing the swap-over-NFS patch, "will post that one again, soonish.... Esp. after Linus professed liking to have swap over NFS." Rik van Riel also replied regarding rewriting the page replacement code, "at the moment I only have the basic 'plumbing' of the split VM working and am fixing some bugs in that. Expect a patch series with that soon, so you guys can review that code and tell me where to beat it into shape some more :)"
"That's very nice, but ... what is it? It would be really helpful if each patch series came with some kind of description of what problem the patches try to solve and how it solves it :)"
The previous 2.4 Linux kernel maintainer, Marcelo Tosatti, resurrected a discussion on adding support for out of memory notifications to the Linux kernel. He explained, "AIX contains the SIGDANGER signal to notify applications to free up some unused cached memory," noting, "there have been a few discussions on implementing such an idea on Linux, but nothing concrete has been achieved." In a request for discussion, Marcelo added, "on the kernel side Rik suggested two notification points: 'about to swap' (for desktop scenarios) and 'about to OOM' (for embedded-like scenarios)." Rik van Riel explained:
"The first threshold - 'we are about to swap' - means the application frees memory that it can. Eg. free()d memory that glibc has not yet given back to the kernel, or JVM running the garbage collector, or ...
"The second threshold - 'we are out of memory' - means that the first approach has failed and the system needs to do something else. On an embedded system, I would expect some application to exit or maybe restart itself."
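The two notification points Rik describes can be pictured with a small user-space sketch. This is only an illustration of the idea, not the proposed kernel interface: the names `Pressure`, `classify_pressure`, and `react`, and both threshold parameters, are hypothetical.

```python
from enum import Enum


class Pressure(Enum):
    NONE = "none"
    ABOUT_TO_SWAP = "about_to_swap"
    ABOUT_TO_OOM = "about_to_oom"


def classify_pressure(free_pages: int, swap_threshold: int, oom_threshold: int) -> Pressure:
    """Map a free-page count to one of the two proposed notification levels.

    Both thresholds are hypothetical tunables; oom_threshold must be the lower one.
    """
    if oom_threshold >= swap_threshold:
        raise ValueError("oom_threshold must be below swap_threshold")
    if free_pages <= oom_threshold:
        return Pressure.ABOUT_TO_OOM
    if free_pages <= swap_threshold:
        return Pressure.ABOUT_TO_SWAP
    return Pressure.NONE


def react(level: Pressure, cache: dict) -> str:
    """Toy application-side handler: drop caches first, exit as a last resort."""
    if level is Pressure.ABOUT_TO_SWAP:
        cache.clear()  # release memory the application can cheaply rebuild
        return "freed cache"
    if level is Pressure.ABOUT_TO_OOM:
        return "exit or restart"
    return "nothing to do"
```

On a desktop the first level would typically be the only one an application sees; an embedded system might act only on the second.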
"The kernel newbies community often gets inquiries from CS students who need a project for their studies and would like to do something with the Linux kernel, but would also like their code to be useful to the community afterwards," explained Rik van Riel in a posting titled "WANTED: kernel projects for CS students". He offered a link to a Kernel Newbies wiki page titled "KernelProjects", adding, "if you have ideas on what projects would be useful, please add them to this page (or email me)". Rik explained that he was assembling a list of projects on that page that meet the following criteria:
"Are self contained enough that the students can implement the project by themselves, since that is often a university requirement; are self contained enough that Linux could merge the code (maybe with additional changes) after the student has been working on it for a few months; are large enough to qualify as a student project, luckily there is flexibility here since we get inquiries for anything from 6 week projects to 6 month projects."
"The aim of these four patches is to introduce Virtual Machine time accounting," began Laurent Vivier. He described the first two patches as:
"1) As recent CPUs introduce a third running state, after 'user' and 'system', we need a new field, 'guest', in cpustat to store the time used by the CPU to run virtual CPU. Modify /proc/stat to display this new field.
"2) Like for cpustat, introduce the 'gtime' (guest time of the task) and 'cgtime' (guest time of the task children) fields for the tasks. Modify signal_struct and task_struct. Modify /proc/<pid>/stat to display these new fields."
Both Ingo Molnar and Rik van Riel responded favorably to the patches. Ingo replied, "the concept certainly looks sane to me," adding, "I'd suggest inclusion into 2.6.24." Regarding concerns that the new information at the end of the line could break utilities such as ps, Rik assured that it would not, "we have added numbers to the cpu lines in /proc/stat since early 2.6. All the programs parsing /proc/stat should just scan for a number of numbers from the start of the line, without trying to scan for the terminating newline."
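Rik's advice about parsing can be sketched as follows. This is a minimal illustration, not code from any real utility: `parse_cpu_line` is a hypothetical helper, and the sample numbers in the usage note are made up. The field order follows the documented layout of the cpu lines in /proc/stat, with 'guest' as the ninth value.

```python
CPU_FIELDS = (
    "user", "nice", "system", "idle", "iowait",
    "irq", "softirq", "steal", "guest",
)


def parse_cpu_line(line: str) -> dict:
    """Read known counters from the start of a cpu line, ignoring anything extra.

    zip() stops at the shorter sequence, so a line from an older kernel yields
    fewer keys and a line from a newer kernel has its trailing values ignored.
    """
    parts = line.split()
    if not parts or not parts[0].startswith("cpu"):
        raise ValueError("not a cpu line")
    return {name: int(token) for name, token in zip(CPU_FIELDS, parts[1:])}
```

For example, `parse_cpu_line("cpu  4705 150 1120 16250 520 30 45 0 175")["guest"]` picks out the ninth counter, and the same call keeps working if a later kernel appends further values to the line.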
"The current VM can get itself into trouble fairly easily on systems with a small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory," Rik van Riel said, explaining a small patch to vmscan.c. He continued, "on one side, page_alloc() will allocate down to zone->pages_low, while on the other side, kswapd() and balance_pgdat() will try to free memory from every zone, until every zone has more free pages than zone->pages_high." He noted that highmem could be filled up with "page tables, ramfs, vmalloc allocations and other unswappable things quite easily and without many bad side effects, since we still have a huge ZONE_NORMAL to do future allocations from. However, as long as the number of free pages in the highmem zone is below zone->pages_high, kswapd will continue swapping things out from ZONE_NORMAL, too! Sami Farin managed to get his system into a stage where kswapd had freed about 700MB of low memory and was still 'going strong'." He described his patch:
"The attached patch will make kswapd stop paging out data from zones when there is more than enough memory free. We do go above zone->pages_high in order to keep pressure between zones equal in normal circumstances, but the patch should prevent the kind of excesses that made Sami's computer totally unusable."
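The runaway behavior and the effect of the fix can be reproduced in a toy model. This is a simulation under stated assumptions, not the kernel's balance_pgdat(): the `Zone` class, the `balance` function, the batch size, and especially the `stop_factor` cutoff are hypothetical stand-ins for "more than enough memory free".

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    free: int         # currently free pages
    pages_high: int   # watermark the reclaim loop tries to exceed
    reclaimable: int  # pages that could still be freed from this zone


def balance(zones, stop_factor=None, batch=32):
    """Reclaim in batches until every zone is above pages_high or nothing is left.

    stop_factor is a hypothetical knob: a zone whose free count already exceeds
    stop_factor * pages_high is skipped. None models the unpatched behavior.
    Returns the number of pages reclaimed from each zone.
    """
    reclaimed = {z.name: 0 for z in zones}
    while any(z.free <= z.pages_high for z in zones):
        progress = False
        for z in zones:
            if stop_factor is not None and z.free > stop_factor * z.pages_high:
                continue  # more than enough free here; leave this zone alone
            take = min(batch, z.reclaimable)
            if take:
                z.free += take
                z.reclaimable -= take
                reclaimed[z.name] += take
                progress = True
        if not progress:
            break  # nothing left to reclaim anywhere we are allowed to look
    return reclaimed
```

With a highmem zone that is below its watermark and holds nothing reclaimable, the unpatched loop keeps draining the normal zone until it has nothing left to give, while the capped variant stops once the normal zone is comfortably above its own watermark.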
Someone asked on the lkml whether memory allocated by kmalloc and vmalloc is swappable. Rik van Riel offered a clear explanation as to why it is not, "unswappable kernel memory is simpler and faster," adding, "there really is no good reason for swapping kernel memory nowadays." He went on to explain:
"Over the last 15 years, the memory requirements of the Linux kernel have grown maybe a factor 10, while the memory of computers has grown by a factor of 1000.
"The data structures that grow with memory (mostly the mem_map array of page structs) has actually gotten smaller since the 2.4 kernel and now takes under 1% of memory even on x86-64."
Rik van Riel posted some thoughts on the page replacement requirements of the Linux VM, noting that the same kinds of bugs have been getting fixed and reintroduced over the past few years, "this has convinced me that it is time to take a look at the actual requirements of a page replacement mechanism, so we can try to fix things without reintroducing other bugs. Understanding what is going on should also help us deal better with really large memory systems." He added his thoughts from this email to the linux-mm wiki, which he plans to update as new requirements surface.
The initial requirements shortlist included seven items: "1) must select good pages for eviction; must not submit too much I/O at once. Submitting too much I/O at once can kill latency and even lead to deadlocks when bounce buffers (highmem) are involved. Note that submitting sequential I/O is a good thing; 2) must be able to efficiently evict the pages on which pageout I/O completed; 3) must be able to deal with multiple memory zones efficiently; 4) must always have some pages ready to evict. Scanning 32GB of 'recently referenced' memory is not an option when memory gets tight; 5) must be able to process pages in batches, to reduce SMP lock contention; 6) a bad decision should have bounded consequences. The VM needs to be resilient against its own heuristics going bad; 7) low overhead of execution." He went on to discuss the various requirements in more depth.
A university student studying operating systems asked why the Linux kernel uses two chained lists in its LRU (least recently used) page replacement algorithm. Andrea Arcangeli, whose virtual memory subsystem was merged into the 2.4.10 kernel, explained, "back then I designed it with two lru lists because by splitting the active from the inactive cache allows to detect the cache pollution before it starts discarding the working set." He went on to add, "a page in the inactive list will be collected much more quickly than a page in the active list, so the pollution will be collected more quickly than the working set. Then the VM while freeing cache tries to keep a balance between the size of the two lists to avoid being too unfair, obviously at some point the active list have to be de-activated too."
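The two-list idea Andrea describes can be sketched in a few lines. This is a toy cache written for illustration, not the kernel's data structures: the class name `TwoListLRU`, the promote-on-second-reference rule, and the `max_active` cap used to keep the lists balanced are all assumptions.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy active/inactive page cache; oldest entries sit at the front of each list."""

    def __init__(self, capacity: int, max_active: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.max_active = max_active   # hypothetical cap that keeps the lists balanced
        self.inactive = OrderedDict()
        self.active = OrderedDict()

    def touch(self, page) -> None:
        if page in self.active:
            self.active.move_to_end(page)        # still hot
        elif page in self.inactive:
            del self.inactive[page]              # second reference: promote
            self.active[page] = True
            self._rebalance()
        else:
            if len(self.active) + len(self.inactive) >= self.capacity:
                self._evict()
            self.inactive[page] = True           # new pages start on the inactive list

    def _rebalance(self) -> None:
        while len(self.active) > self.max_active:
            page, _ = self.active.popitem(last=False)  # coldest active page
            self.inactive[page] = True                 # demoted, gets one more chance

    def _evict(self) -> None:
        if not self.inactive:
            page, _ = self.active.popitem(last=False)  # de-activate if nothing else is left
            self.inactive[page] = True
        self.inactive.popitem(last=False)              # drop the oldest inactive page
```

A stream of pages that are each touched once only ever churns the inactive list, so pages that were referenced twice stay on the active list: the pollution is collected before the working set.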
Rik van Riel, author of the reverse mapping virtual memory code that was merged into the 2.5 kernel, noted, "since memory size has increased a lot more than disk speed over the last decade (and this is likely to continue for the next decades), the quality of page replacement algorithms is likely to become more and more important over time." In response to a proposal to split the LRU into two parts, one for the page cache and the other for mapped pages, Nick Piggin replied, "I actually had patches to do 'split active lists' a while back. They worked by lazily moving the page at reclaim-time, based on whether or not it is mapped. This isn't too much worse than the kernel's current idea of what a mapped page is." Rik offered some ideas on how to further tune it, "for each list we keep track of: 1) the size of the list 2) the rate at which we scan the list 3) the fraction of (non new) pages that get referenced. That way we can determine which list has the largest fraction of 'idle' pages sitting around and consequently which list should be scanned more aggressively."
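Rik's bookkeeping suggestion can be illustrated with a short sketch. The weighting rule here, which splits a scan budget in proportion to each list's idle fraction, is an assumption made for the example and not something stated in the thread; `ListStats` and `scan_shares` are hypothetical names.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class ListStats:
    name: str
    size: int        # pages on the list (tracked, but unused by this simple rule)
    scanned: int     # non-new pages scanned in the last interval
    referenced: int  # of those, how many had been referenced

    def idle_fraction(self) -> Fraction:
        if self.scanned == 0:
            return Fraction(0)  # no data yet; assumption: treat the list as not idle
        return Fraction(self.scanned - self.referenced, self.scanned)


def scan_shares(lists, budget: int) -> dict:
    """Split a scan budget across lists in proportion to their idle fractions."""
    weights = {l.name: l.idle_fraction() for l in lists}
    total = sum(weights.values())
    if total == 0:
        return {name: budget // len(weights) for name in weights}
    return {name: int(budget * w / total) for name, w in weights.items()}
```

If nine in ten scanned page cache pages were idle but only one in ten mapped pages were, this rule directs most of the scanning at the page cache list.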
Avi Kivity suggested that combining KVM, the Kernel-based Virtual Machine, with the dyntick patch could improve overall KVM performance. He noted that it would likely improve performance of both the host by "avoiding expensive vmexits due to useless timer interrupts," as well as on the guest by "reducing the load on the host when the guest is idling (currently an idle guest consumes a few percent cpu)". Ingo Molnar pointed out that KVM with his -rt kernel already works with dynticks enabled on both the host and the guest, "using the dynticks code from the -rt kernel makes the overhead of an idle guest go down by a factor of 10-15". Ingo added that he hopes the dyntick patch will be ready to be merged into the upcoming mainline 2.6.21 kernel.
Rik van Riel noted that there were other ways to reduce the load of the guest when it's idling, "you do not need dynticks for this actually. Simple no-tick-on-idle like Xen has works well enough." Ingo explained, "s390 (and more recently Xen too) uses a next_timer_interrupt() based method to stop the guest tick - which works in terms of reducing guest load, but it doesnt stop the host-side interrupt. The highest quality approach is to have dynticks on both the host and the guest, and this also gives high-resolution timers and a modernized time/timer-events subsystem for both the host and the guest."
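The difference between a periodic tick and sleeping until the next pending timer can be shown with simple arithmetic. This is a back-of-the-envelope sketch, not a model of any kernel's timer code: both function names and the example values are hypothetical.

```python
def periodic_wakeups(idle_ms: int, hz: int) -> int:
    """Timer interrupts taken during an idle period with a fixed tick rate."""
    return idle_ms * hz // 1000


def tickless_wakeups(idle_ms: int, timer_expiries_ms) -> int:
    """Wakeups when the CPU sleeps until the next pending timer.

    timer_expiries_ms holds expiry times measured from the start of the idle
    period; only timers that expire within the period cause a wakeup.
    """
    return sum(1 for t in timer_expiries_ms if 0 < t <= idle_ms)
```

Under these assumptions, one idle second at a tick rate of 250 costs 250 wakeups with a periodic tick, while a guest with pending timers at 300 ms and 900 ms wakes only twice.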
At the July 2004 kernel summit, it was decided that there was no need to fork a 2.7 kernel to introduce new functionality into the Linux kernel. Instead, the decision was made that it was possible for Andrew Morton and Linus Torvalds to continue working together to first merge things into Andrew's -mm tree, and then after testing the changes to merge them into Linus' mainline tree. This of course led to discussion, with some confusion as to how the 2.6 kernel could be considered stable while new features were still being merged in. During another short discussion nine months after this decision, Rik van Riel offered some insight into why the new development model works:
"Things get merged one change at a time, and stabilised one change at a time. This is a big change from the even/odd numbered kernel series, where sometimes a bug crops up without anybody knowing exactly what change introduced it. The current development model seems to go much smoother than anything I've seen before."