Andrea Arcangeli is well known for having completely rewritten and stabilized the virtual memory subsystem in the 2.4 Linux kernel. Many were surprised when Linus Torvalds merged Andrea's VM into 2.4.10, but the new memory subsystem has long since proved itself. Andrea is a 27 year old Linux kernel hacker living in Italy and working for SUSE.
In this interview, Andrea reflects back on his first exposure to Linux, his first contributions to the kernel, and the meaning of the GPL. He describes his VM work in great detail, both that which is in the 2.4 mainline kernel, as well as his object-based reverse mapping work that can be found in his 2.6-aa patchset. He also discusses his plans for the 2.7 development kernel, describes a possible alternative to kernel preemption that could reduce average latency, talks about his non-kernel hobbies, and much more.
Jeremy Andrews: Please share a little about yourself and your background.
Andrea Arcangeli: I live in Imola, Italy. Most Europeans may know the name of my city because it hosts one of the Formula 1 Grand Prix races. I work from the top floor of my parents' house. I'm 27 years old. I currently live with my parents and my brother. Both my parents work in the public school system.
I studied first at the ITIS electronics/telecommunications high school in Imola, and later at the University of Bologna (first in the Informatical Engineering faculty and then in the Computer Science faculty).
JA: What degree did you come away from school with?
Andrea Arcangeli: I have no degree yet; the only piece of paper I have is my high school diploma. I wasn't that bad at school: for instance, I surprisingly got a 60/60 score on the diploma, and top marks at University too for all the software related exams, and I loved studying physics and electronics as well (not only computers). Most professors I met along the way have been great, and I learned an invaluable technical background at both high school and University. But I definitely didn't go to University with the intention of getting a piece of paper; my only intention was to learn everything I could related to computers and electronics, like I had done at high school in the previous 5 years, and that worked out very well: the quality of the lessons at Bologna University is excellent. I'm exceptionally happy to have changed from Informatical Engineering to Computer Science at some point, so I could get the best classes from both faculties. When I started working for SUSE I was still attending University, but over time I got more interested in the work I was doing on linux, plus I could make a living thanks to linux, while I didn't get any money by studying more years at University ;).
Linux and the GPL:
JA: When did you get started with Linux?
Andrea Arcangeli: Everything happened shortly after the internet arrived in my city. I connected to the internet for the first time from my computer around July 1996. The internet connectivity was a convincing excuse to trigger an upgrade from my very old IBM PS/2 10mhz 286 with 4M of ram and a 30M harddisk to a new 586 with a 2G hd, 166mhz cpu and 32M of ram, so I could surf the internet "comfortably". The very first time I heard about Linux I mistook it for a freeware program. I couldn't imagine the source was available for everything, including the kernel. So the first time I heard about linux I was anything but excited. One day on the train to Bologna, while going to University, I happened to meet a smart linux user and advocate, a friend of a friend, and a few days later he helped me install linux on my 586 computer so I could get some practice with it (I was curious because I had already noticed myself that the slashes were all reversed on the internet, and that some ftp sites had those weird /etc /lib directories, and I wanted to know more about that stuff). I was still very skeptical, because I still had the freeware idea in my mind and I was missing the whole picture. A few days later I started asking him about the crypt(3) algorithm and the unix user/password management. Since I was asking for too many details, he just told me I'd better go read the source and be done with it. I was stunned. After I heard the little detail that besides the binaries being free, the source was also available for the whole system (most of it under the GPL, which I already knew as a license thanks to some great hp48 software), it became more than obvious I had to destroy everything else I had on my 2G harddisk to make free space for linux binaries and source code.
By the way, I think when I started with linux I could understand the C language a lot better than the English language ;). Anyway, I think it's a very good thing to learn more than one language well, regardless of which one is your primary.
JA: You mention the GPL, or GNU General Public License. How important is it to you that the Linux kernel is available under the GPL, instead of another license?
Andrea Arcangeli: The GPL is fundamental; it may be underestimated these days, but it is still very fundamental. The primary thing the GPL does, from my point of view, is to protect the contributors from abuse. The early growth of Linux happened in the spare time of people, and you'll never spend your spare time on something if you believe you can be abused. Even now the collaboration of different big corporations on the same project is still possible thanks to the properties of the GPL.
To give an example, using the BSD license is, to me, like working for Microsoft for free. Working for Microsoft may not sound appealing to a linux kernel developer, but how does working for Microsoft for free sound? So it's no surprise that Microsoft has nothing against the BSD license and is only against the GPL.
JA: There was recently a lengthy discussion on the lkml about a binary kernel module that was intentionally tricking the kernel into believing it was a GPL'd module to prevent the displaying of a potentially confusing message for the end user. What perspective can you offer on this incident?
Andrea Arcangeli: I'm not a lawyer, but as far as I know hiding a kernel message with a technical trick should not have any legal implications; even the fact that you'll be able to use EXPORT_SYMBOL_GPL symbols has no legal implications. It's perfectly legal to work around EXPORT_SYMBOL_GPL with a wrapper exported with EXPORT_SYMBOL to non-GPL modules too. The only thing with legal implications is the "derivative work" argument. EXPORT_SYMBOL_GPL is only a legal "hint": if you need a symbol exported only to GPL modules, then likely your kernel module has to be GPL'd too, because likely the "derivative work" argument would apply to your kernel module.
The above is the legal part. That said, if the kernel module has binary-only parts, hiding the confusing message is absolutely the wrong thing to do (even if theoretically legal), and such a module should be banned.
It's not only about the confusing message itself: we need to know if the kernel was tainted when we get an Oops. So I strongly recommend vendors not to make their kernel modules look GPL'd if they're not; that's wrong, and they will likely get complaints from the linux community if they do it. Furthermore, if they mark the module as GPL to get easy access to the symbols exported only to GPL modules (with EXPORT_SYMBOL_GPL), that's a huge red flag that such a kernel module is going to be truly illegal too, no matter what license they advertise to the kernel.
JA: You mention banning certain modules. How would you do this?
Andrea Arcangeli: A module blacklist would be trivial to add if really necessary, but I hope those mistakes of registering non-GPL modules as GPL will be fixed without the need of taking action on the mainline kernel side.
JA: Many seem confused by the suggestion that there is no point in debugging a tainted kernel. Can you explain why this is the case?
Andrea Arcangeli: If a binary-only non-GPL driver has a bug and the machine crashes, there may be no way for kernel developers to understand for sure what's wrong if they're missing the source of the driver. This is because if the binary module just overwrites some memory randomly, either directly or via DMA, the kernel can crash in very weird ways, which may confuse the kernel developers trying to debug the problem. The "tainted" information exactly prevents kernel developers from wasting time on weird kernel crashes that may happen with a binary kernel module loaded. If the bug is obvious we can understand it and fix it even on a tainted kernel, but if the crash looks very weird and the kernel is tainted, we ask the user to try again on an untainted kernel first.
JA: Regarding licensing, you've been fairly vocal in the past regarding Linus' use of BitKeeper. Are you still vehemently opposed to this?
Andrea Arcangeli: There's just one problem with that, and it's not a technical one, it's a legal one with the bitkeeper licence. It's not that bitkeeper isn't good enough technically; the problem is that bitkeeper isn't even a freeware program (I don't even care about the fact that it's not open source). As a matter of fact, I'm personally not legally allowed to use bitkeeper, because I want to still be able to fix bugs in CVS or SVN, and I also want to still be allowed to occasionally contribute to ARCH.
I'll quote Linus's words from a recent interview he did with NewsForge. Linus said: `it [bitkeeper] also allows me to literally give people more control. That makes it much easier to pass the maintenance around more, which is, after all, what it's all about.'.
I personally do not find it reasonable that such "more control" is given to people only if they agree to stop contributing to CVS, SVN, ARCH or any other source control software (or, alternatively, if they pay a not-so-cheap "tax" to bitmover for a commercial bitkeeper licence). That said, being second-class citizens in the short term is ok now that we have a CVS gateway. We have our own open source tools, like the great "cvsps", to look into the now fully open kernel data. The function of the CVS gateway is to export the kernel data in an open format (the original bitkeeper format is proprietary, I mean proprietary almost like a .doc word document). This way we also don't risk losing the critical kernel data, and we can import it into any other version control system, even if we're not allowed to run the bitkeeper binary ourselves.
JA: You've brought up similar arguments on the lkml, to which Linus used some choice words to point out a few problems. Specifically, he said that none of these free alternatives offer even remotely the functionality that BitKeeper offers. Are you suggesting that they have since added this functionality, or that the need to use free software is most important of all?
Andrea Arcangeli: These days ARCH would already be good enough in practice too. In terms of distributed development it's as powerful as bitkeeper, and it could improve rapidly in the other areas too. So if I had to choose a replacement today I would go with ARCH, keeping in mind there's still huge room for improvement, especially in terms of the space and time of the database.
In the old days netscape was by far the best web browser for linux; nearly everybody using linux in the last five years used some version of netscape at some point. Imagine if netscape had had a licence that prevented you from contributing to konqueror or mozilla. Would you like to be stuck with netscape today?
Free software isn't the issue; a proprietary binary-only freeware would be enough in the short term. I use a freeware binary-only program myself for the AMR modem kernel driver in my laptop, but that doesn't prevent me from writing or contributing to an open source alternative to that driver. And using the driver on another architecture isn't an issue, since the other archs are unlikely to have that modem device in the first place.
JA: Why is it so important to you that Linus doesn't use BitKeeper, when he has made it clear that he will accept patches from people that don't use BitKeeper, and when a gateway to CVS has been provided?
Andrea Arcangeli: Clearly, after the CVS gateway was provided, it's not critical anymore to move away from bitkeeper. The reason I think it's still important to move away from bitkeeper in the long run is that the bitkeeper licence hurts the open source version control systems like ARCH and SVN, and the assumption that nothing will ever be as good as bitkeeper is flawed. You can argue with all my other points, but think about this: if no open source project could ever beat bitkeeper in the long run, Larry wouldn't need this weird licence in the first place.
Early Linux efforts:
JA: Before we talk about your 2.4 and 2.6 VM work, I'm curious to know what other areas of the Linux kernel have you worked on?
Andrea Arcangeli: I started with random bugfixes for problems that troubled my own usage of some 2.0.x and 2.1.x kernels. Then I started focusing on the parallel port, for the parport driver, during 2.1.x. The parallel port was the device I knew best in the PC hardware, because for the final exam at high school we developed an autoranging (from a few millivolts to thousands of volts) digital oscilloscope (designing both hardware and software), and I was acquiring the data from the 8-bit output of the ADC using the parallel port in nibble mode with a multiplexer, with x86 inline assembly in a C++ program. Knowing the hardware specifications of the parallel port from memory allowed me to contribute to the parport kernel project more quickly. The first big thing I contributed to was making parport sharing work, together with Philip Blundell and Tim Waugh, so you could print while using the zip drive at the same time on the same parallel port. Other operating systems were crashing instantly if you tried to access the zip drive while printing. I recall Alan Cox helping and providing motivation in the parport development too.
After some years of spare time work I had the luck of starting to work for SUSE in mid 1999, which allowed me to concentrate a lot more time on the kernel. So I worked on some networking TCP stuff; I worked on the process scheduler; I co-authored the first bigmem design in 2.3.x (later renamed to highmem), introducing the kmap() functionality (only atomic at that time) that allowed breaking the 800M RAM limit on x86, raising the barrier to 4G; I identified a source of starvation in the elevator in the 2.2 kernels, and I implemented the first latency-aware linux I/O scheduler in kernel 2.3.x; I added the first page-lru, with lru_cache_add and lru_lock, in linux 2.3.x (previously, in 2.2, we were walking over the whole mem_map_t to free the cache); I did the first port of linux to the wildfire alpha architecture; I ran the first linux printk on the AMD64 port in 64bit long mode in the simulator; and I made lots of other random contributions, especially random bugfixes.
JA: Is any of your original parport work still included in the 2.6 kernel?
Andrea Arcangeli: Sure, it's almost unchanged. Just a few months ago Alexander Viro found a few locking troubles in the almost 6 year old code (when we wrote it, the big kernel lock and cli() + sleep_on() were still the best way to serialize against irqs ;).
JA: The 2.4.10 Linux kernel was released in September of 2001 and included a new VM that you wrote, replacing Rik van Riel's efforts. What prompted you to write the new VM?
Andrea Arcangeli: There were lots of bug reports to the kernel list, the VM obviously wasn't doing well, it was often generating swap storms, oom deadlocks and similar vm related problems.
JA: How long had you been working on your replacement VM before Linus merged it?
Andrea Arcangeli: From memory, I recall it took me some weeks of full time work before I had a kernel that could be compiled successfully with the new VM. After it compiled successfully, it worked mostly right at runtime on the first try, so I shipped it in my 2.4-aa tree to get more testing, since it was working well enough for me and it was simple enough to be fixable and predictable code at runtime. Before I could ship a second update to my kernel.org area, the new VM was already in mainline, together with around another forty patches from my tree. I'm still amazed how fast Linus could learn the new VM code; he probably didn't immediately know every single line from memory when he applied it, but he surely looked at the overall design, and he sent me emails that showed he very quickly had a very deep understanding of the changes. Probably the most serious issue introduced by my changes was a pagecache aliasing problem in the blkdev-pagecache patch, which was unrelated to the real VM changes. In practice it was hurting some bootloaders with some filesystem blocksizes. This was fixed a few releases later, thanks to great help from Linus and Alexander Viro.
The full vma merging with rbtree and the O_DIRECT support were other important features in my tree that some people were asking for and that were merged at the same time.
Linus, Andrew Morton, Alexander Viro, Marcelo Tosatti and many others helped fix the bugs after the code was merged into mainline. It would have been impossible for me to iron out all the issues so quickly without the support and the help of the kernel community.
JA: It almost sounds like the VM you contributed to the 2.4 kernel wasn't something that you and Linus discussed beforehand?
Andrea Arcangeli: I discussed VM issues with Linus during 2.3, and a lot during 2.4.0-test too.
JA: Were you surprised when Linus merged your VM into the mainline 2.4 kernel?
Andrea Arcangeli: I was especially surprised that he could understand it so deeply so quickly, and for my part the dominant sentiment was excitement; the surprise was secondary ;).
JA: How stable is the VM that's found in the current 2.4.26 kernel?
Andrea Arcangeli: It should be rock solid. Marcelo recently merged the remaining VM fixes to best support the very-highmem hardware too.
Only in 2.4-aa, I recently added an objrmap-hybrid for the pte unmapping, I changed the writepage in the paging with a logic that I normally call "can_writepage", and, last but not least, I changed the shm writepage to start the swapout I/O immediately. These three changes are needed to allow swapping shm efficiently on some big iron machines with >=16G of ram and with some dozens of gigs constantly in swap. But this isn't a matter of stability, it's a matter of performance, though the improvement you can get in 2.4-aa thanks to objrmap and the can_writepage tweak can be dramatic while swapping tons of shm on the big iron. This is probably the only relevant VM difference between 2.4-aa and mainline; for most normal machines it won't make any difference, and in fact it's not obvious that the can_writepage logic won't hurt the lowmem boxes. This is also why in some production kernel rpms I had to keep these three VM changes under a sysctl called vm_shmem_swap, disabled by default. In 2.4-aa I run with these three changes on by default instead; they cannot be turned on and off via sysctl in 2.4-aa, so they're being tested everywhere.
JA: Are you still actively maintaining the -aa patchset for the 2.4 kernel, or have all of your efforts there already been merged into the mainline 2.4 kernel?
Andrea Arcangeli: The most important stuff has been merged into mainline already. There are enterprise features like the O(1) scheduler, aio, largepages, and dynamic-hz that aren't going to be merged into mainline 2.4 anytime soon; for those features 2.4-aa is still sort of maintained. But I don't have enough resources to follow mainline 2.4 anymore; I'm only taking care of merging the major security or stability bugfixes. The focus is entirely on 2.6 these days. I plan to upload a new 2.4-aa update soon, again against 2.4.23, only to apply some of the recent 2.4 fixes I queued.
JA: Why is your 2.4-aa patchset against 2.4.23 instead of the current release, 2.4.26?
Andrea Arcangeli: I don't have enough resources to follow 2.4 development anymore, because the focus is only on 2.6. I only follow the major 2.4 bugs, so I only release further 2.4-aa updates against 2.4.23. I would get a flood of rejects if I ported it to 2.4.26, and fixing lots of rejects takes lots of time that is better spent on 2.6.x.
JA: Does your -aa patchset against the 2.4.23 kernel include the various security fixes that made it into the mainline kernel in 2.4.24, 2.4.25 and 2.4.26?
Andrea Arcangeli: It doesn't include the latest ones from 2.4.26 yet, but I plan to merge those in the next update, along with other relevant fixes.
2.6 -aa patchset, object-based reverse mapping:
JA: Your -aa patchset for the 2.6 kernel is again focused on the VM. Your current efforts seem to be directed toward improving rmap by replacing the page-based implementation with one that is object-based. Can you explain how this works, and what are the advantages?
Andrea Arcangeli: The pte (page table entry) is a 4 (or 8) byte piece of memory that the cpu needs in order to resolve a certain virtual address to a physical address. All software runs on virtual addresses, so the cpu needs these pagetables to be constantly maintained by the kernel to reach the right RAM (the physical memory) starting from any virtual address. The pte contains exactly the "physical address" information relative to a certain virtual address. There may be more than one pte pointing to the same physical address/page (in that case we call it a "shared" page). The problem is that when we swapout/pageout a certain physical page to disk, we must first clear all the ptes that may be pointing to that physical page; this way (after we clear the ptes) the cpu will stop accessing that physical page from under us. Only after the cpu stops accessing the physical page can we effectively free the page and release memory (and relocate the contents of that physical page of RAM to the harddisk).
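The lookup a pte performs can be modeled in a few lines (a toy Python model for illustration only; real x86 pagetables are multi-level, hardware-defined structures):

```python
PAGE_SIZE = 4096

# Toy page table: virtual page number -> physical frame number.
page_table = {0: 7, 1: 42, 2: 7}   # vpns 0 and 2 share frame 7

def virt_to_phys(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table[vpn]         # this lookup is what a pte resolves
    return frame * PAGE_SIZE + offset

# Two virtual addresses can resolve to the same physical page:
assert virt_to_phys(0x10) == virt_to_phys(2 * PAGE_SIZE + 0x10)

# Before frame 7 can be freed (swapped out), the kernel must find
# and clear *all* ptes mapping it -- the reverse lookup problem:
mappers = [vpn for vpn, frame in page_table.items() if frame == 7]
print(mappers)                      # -> [0, 2]
```

The forward direction (virtual to physical) is cheap; the reverse direction (physical page to all its ptes) is the hard part discussed below.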
What 2.4 does to reach the interesting page tables is, more or less, to scan _all_ the pagetables in the system to find the interesting ones. But on a big iron machine you may have a thousand processes all mapping the same shared memory. If each process keeps 2G of shared memory mapped and the PAGE_SIZE is 4k, this means you will have to scan 2*1024*1024*1024/4096 = 524288 ptes for each of the thousand processes to find all the ptes that may map a certain physical page. So if you want to move a page of shared memory from ram to swap-disk in such a scenario with a pagetable scan, this translates into scanning 1000*2*1024*1024*1024/4096 = 524288000 ptes for every 4k physical page that we want to swap out. Even if the pagetables were organized in an array (and not in a radix tree), it clearly would be way too expensive to read 524288000*8 bytes = 4194304000 bytes = 3.9 gigabytes of ram for every 4k physical page that we want to swap. We definitely cannot loop over 3.9 gigabytes of memory for every 4k swapout.
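The arithmetic above can be checked mechanically (using the same illustrative workload numbers from the example):

```python
PAGE_SIZE = 4096
PTE_SIZE = 8                       # bytes per pte with PAE enabled
processes = 1000
shm_mapped = 2 * 1024**3           # 2G of shared memory per process

# ptes per process covering the shared mapping:
ptes_per_process = shm_mapped // PAGE_SIZE
print(ptes_per_process)            # 524288

# ptes scanned, the 2.4 way, to unmap ONE shared physical page:
ptes_scanned = processes * ptes_per_process
print(ptes_scanned)                # 524288000

# memory touched per 4k page swapped out:
bytes_touched = ptes_scanned * PTE_SIZE
print(bytes_touched / 1024**3)     # 3.90625 -> ~3.9 gigabytes
```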
The above is even a conservative scenario; especially on 64bit systems it can be very realistic to have much more than 2G mapped in each process, and many more than 1000 processes too.
Above I made an example using shared memory because it's a fairly common real life test case, but the same unmapping problem applies to all kinds of memory.
As I acknowledged in my talk at the kernel summit 2002, for swapping efficiently on big systems with several gigabytes of ram, the 2.4 clock scan algorithm isn't going to perform well. On the other hand, I've never liked the reverse mapping implementation in the 2.4-rmap patches that was later included in 2.5 and that in turn is now in 2.6 mainline too.
The rmap design (and by rmap I don't mean the generic words "reverse mapping" but the rmap patches floating around for years and included in all 2.6 kernels out there to date) maintains an array of pointers for every physical page. The pte_chains (the core of the rmap patches) are literally arrays of "pte_t *" (each pte pointer is 4 bytes with PAE disabled and 8 bytes with PAE enabled with highmem64G=y). In short this means that with PAE enabled, rmap wastes 8 bytes of ram for every pte mapping a certain physical page; it practically doubles the pte memory overhead, since the pte itself also takes 8 bytes with PAE (and 4 bytes without PAE). On the 32bit architectures these pte_chain arrays also have to be allocated in low memory (unlike pagetables, which can freely go into highmemory). In short, having rmap in the VM is like disabling pte-highmem and allocating all pagetables in lowmemory. So with various workloads on stock 2.6 it's very easy to run out of lowmemory (lowmemory means RAM below ~800M) even though tons of highmemory (highmemory means RAM above ~800M) may still be available. On 64bit architectures this specific lowmemory issue doesn't exist, since the whole memory is "lowmemory" from the point of view of a 64bit system. But the memory wasted in pte_chains can still be tremendous on the high end systems (on average, as much as the pagetable memory overhead). Most importantly, the amount of cpu wasted to keep these pte_chains coherent is quite huge (a readprofile on any 2.6 mainline kernel will normally show the two functions page_add_rmap and page_remove_rmap at the very top of the profile). Especially for shared libraries and executables, having to lock the same pages to serialize cpus against cpus while setting up the pte_chains is quite an smp scalability hit.
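A quick sketch of the overhead being described; the sizes are those stated above for x86 with PAE, and the lowmem figure is the approximate ~800M limit he mentions:

```python
# Per-pte cost in bytes on x86 with PAE (highmem64G=y):
pte_size = 8            # the hardware pte itself
pte_chain_entry = 8     # one "pte_t *" in the page's pte_chain

# rmap adds one chain entry per pte -> pte memory cost doubles:
print(pte_chain_entry / pte_size)   # 1.0, i.e. +100% overhead

# On 32bit, the chains must live in lowmem (roughly the first 800M),
# while the pagetables themselves could have gone to highmem:
lowmem = 800 * 1024**2
max_chained_ptes = lowmem // pte_chain_entry
print(max_chained_ptes)  # chain entries alone could exhaust lowmem
```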
All kernels out there, 2.2 included, already provide a means of reaching all the ptes given a certain physical page. Reaching pagetables given a certain physical page has always been necessary for vmtruncate() to work (one of the kernel operations executed during the truncate syscall). David Miller's original objrmap attempt and the current objrmap patch from Dave McCracken use exactly this already-implemented vmtruncate method for the paging/swapping too, replacing and obsoleting the 2.6 rmap methods. The reason objrmap is nearly zero-cost (in both cpu and memory terms) for the fast paths like page faults, fork and execve is that it uses the vmas to track the ptes. To find the ptes with objrmap we use the page->mapping pointer, which basically points to the inode that the physical pagecache page belongs to. In the inode we then have a list of vmas where that physical page can be mapped. Finally, from the vma we reach the pte that we need to clear before freeing the physical page. So you have 1 vma object tracking not 1 single pte (like the pte_chains do) but tons of ptes, as many ptes as the vma is large. This clearly saves tons of memory, because on average there are tons of ptes mapped by each vma. You can see the size of the vmas in your processes with `cat /proc/*/maps`.
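The page -> mapping -> vma -> pte walk he outlines can be sketched with a toy model (illustrative Python; the names loosely mirror the kernel structures, but none of this is real kernel code):

```python
class Vma:
    """One mapped region: a single object covering many pages."""
    def __init__(self, owner, pgoff, npages):
        self.owner = owner      # which process mm this vma belongs to
        self.pgoff = pgoff      # start offset into the file, in pages
        self.npages = npages

class Inode:
    """Stands in for the inode behind page->mapping."""
    def __init__(self):
        self.i_mmap = []        # vmas mapping this file

def vmas_mapping_page(inode, page_index):
    # objrmap idea: walk the inode's vma list and keep only the
    # vmas whose range covers this pagecache page.
    return [vma for vma in inode.i_mmap
            if vma.pgoff <= page_index < vma.pgoff + vma.npages]

inode = Inode()
inode.i_mmap.append(Vma("proc-a", 0, 512))    # maps pages 0..511
inode.i_mmap.append(Vma("proc-b", 256, 512))  # maps pages 256..767

# One vma tracks hundreds of ptes; no per-pte chain is needed.
print([v.owner for v in vmas_mapping_page(inode, 300)])  # both match
print([v.owner for v in vmas_mapping_page(inode, 700)])  # only proc-b
```

Note that the linear scan over i_mmap in this sketch is exactly the weak spot of plain objrmap: vmas outside the interesting range still get visited, which is what a range-indexed structure like the prio-tree avoids.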
There were two main problems with the objrmap patches: 1) the objrmap information only covers inode-backed pages (I addressed this with the "anon-vma" patch, which creates an anon-vma object tracking granularly which vmas to check for every given anonymous page); 2) the list of vmas in the inode can contain vmas that aren't in the right range for our physical page, so in the inode there may be more vmas than we're interested in, and the quadratic complexity of the objrmap algorithm during paging could waste lots of cpu (this has been addressed by Rajesh Venkatasubramanian with the prio-tree, which works in O(log(N)) for all operations, including the range-lookups we execute during paging in objrmap; this way we only check the vmas that will for sure contain a pte that we must clear before swapping the page, and we won't waste any time on the other vmas belonging to the inode).
JA: Are there any disadvantages to moving to an object-based rmap versus what's currently in 2.6, on large or small computers?
Andrea Arcangeli: objrmap alone has the two problems I mentioned above, and both are solved by applying anon-vma and prio-tree on top of the objrmap patch, like I do in my 2.6-aa tree.
However, the additional memory footprint generated by rmap in general is mostly visible on large computers, so while there are no disadvantages, probably not many obvious benefits will be visible on desktop systems. This is because you need a big amount of memory allocated in pagetables to see these detrimental rmap effects. You can see the amount of memory allocated in pagetables in /proc/meminfo in the "PageTables:" line. Right now mine says 2556 kbytes; that means if I were using 2.6 mainline instead of 2.6-aa I would be wasting around 2Mbytes of ram, so I could hardly notice the rmap memory overhead with the workloads I run on my desktop. In the 1000 process workload that I described above, the amount of memory allocated in pagetables would instead be 1000*2*1024*1024*1024/4096*8/1024/1024/1024 = 3.9Gbytes. That means that even ignoring the lowmem/highmem issues (let's assume we're working on a 64bit architecture), rmap would waste 3.9Gbytes of ram on such a workload. Wasting 3.9Gbytes very much matters. With 2.6 mainline with rmap you would need a total of 2+3.9*2 = 9.8Gbytes of ram to run such a workload. With 2.6-aa you could run the very same workload with only 5.9G of ram installed in the machine.
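Spelling out that comparison (the same illustrative 64bit workload, ignoring lowmem/highmem issues):

```python
GB = 1024**3
PAGE_SIZE = 4096
PTE_SIZE = 8

processes = 1000
shm = 2 * GB                      # shared memory mapped by everyone

ptes = processes * (shm // PAGE_SIZE)
pagetables = ptes * PTE_SIZE      # unavoidable hardware pagetable cost
pte_chains = ptes * PTE_SIZE      # rmap's extra pointer per pte

with_rmap = (shm + pagetables + pte_chains) / GB
without_rmap = (shm + pagetables) / GB
print(with_rmap, without_rmap)    # 9.8125 5.90625 -> ~9.8G vs ~5.9G
```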
In terms of performance (not memory footprint), the benefits of using objrmap instead of rmap are noticeable regardless of the memory allocated in pagetables, so replacing rmap with objrmap is beneficial on all systems (though it's most noticeable on big smp).
JA: Are there benchmarks available that show this increase in performance?
Andrea Arcangeli: Sure, Martin Bligh posted some on linux-kernel. Here is a relevant post:
For those workloads rmap was a problem only in terms of performance, not in terms of memory footprint. The performance boost of removing rmap is as high as +20% in some workloads. That has been measured on a big ccNUMA machine with many cpus, of course.
JA: Do the changes you're making in 2.6-aa have any effect on the functionality or performance of swap?
Andrea Arcangeli: Intensive swapping workloads are normally I/O dominated. The cpu cost difference between dereferencing two pointers to reach the pagetables with rmap, or walking some complex data structure with objrmap+anon-vma+prio-tree, is not significant during swapping. All that matters during paging are the complexity issues of the data structure that we look up to find the pagetables and, secondly, the smp scalability of the algorithm.
JA: How much of your 2.6 -aa patchset is code that you have written versus code that others have contributed?
Andrea Arcangeli: The original objrmap design is from David Miller, and the current implementation is from Dave McCracken. The prio-tree design and implementation are from Rajesh Venkatasubramanian.
I only collected the stuff, showed it was possible to remove the pte_chains completely while still providing good paging behavior on big systems (before my anon-vma, even Hugh's original anobjrmap patches were leaving pte_chains in the page_t for mremap), fixed some objrmap issues (especially with 2.6.5), and designed and implemented anon-vma. I shared lots of ideas with the anobjrmap patches from Hugh Dickins while writing anon-vma: there is a "preparation" stage sharable between anon-vma and anobjrmap. So lots of credit for various parts of anon-vma goes to Hugh Dickins too, even if I may have implemented it slightly differently. For example, we both had to stop using page->index to store the swap entry of the swapcache; I found the solution to this in Hugh's anobjrmap patches, that is, to use page->private, so I didn't need to reinvent the wheel ;). The reason I couldn't use the plain anobjrmap patches is that the design they use is simpler but it breaks mremap, or rather, they're forced either to retain the pte_chains only for mremap, or to break the COWs during mremap, and at least for the short term I needed a design that wouldn't risk slowing down userspace by breaking the COWs. Hugh's right that on the other hand I reduce the vma-merging possibilities slightly with anon-vma, but that's something I'm very comfortable with and it's purely theoretical, while generating copy-on-write faults during mremap doesn't sound as comfortable to ship in production, since I cannot predict which open source or proprietary commercial application might be bitten by this change at runtime.
I was also not comfortable sharing the anobjrmap preparation code since I didn't want to remove the swapper_space address space that simulates a sort of address space for the swapcache. I like to retain the swapper_space, which for example already paid off with the backing-dev-unplug patch that adds a callback into the address space that must be invoked for the swapcache too, to provide granular unplugging. Last but not least, anon-vma cleans up the locking a lot compared to anobjrmap/anonmm: I removed the page_table_lock from all vma operations since I now only serialize the fast-paths with granular anon-vma locks, and I could now even remove the mm-wide page_table_lock (and make it granular in the vma) without altering the paging design at all.
Finally Hugh, Rajesh, Andrew, Linus, Martin, WLI and others helped a lot in auditing the code, finding bugs, fixing bugs and suggesting improvements. The occasional, very reasonable criticism from the rmap advocates has been very productive too, since it resulted in more options being explored and overall better solutions being found.
JA: You mention that you reduce the vma-merging possibility slightly with anon-vma. Can you further explain what this means?
Andrea Arcangeli: Yes: with anon_vma, if you create two vmas with mmap(MAP_FIXED|MAP_ANONYMOUS) and you have a hole in between the two anonymous vmas, and you try to fill the hole with a third mmap(MAP_FIXED|MAP_ANONYMOUS), the hole will be filled fine, but on the kernel side there will still be two vmas backing the now contiguous virtual area, instead of just one covering the whole virtually contiguous area. With anonmm/anobjrmap you would get only one vma instead. There are other corner cases like this. In practice it's not important, and having different anon_vma objects for every anonymous vma brings some more smp scalability to the kernel during paging and during stack growsdown page faults, plus it avoids taking the page_table_lock during vma manipulations.
JA: What is a copy-on-write fault during mremap, and what is the effect?
Andrea Arcangeli: A copy-on-write fault on a shared read-only anonymous page is an operation that allocates a new anonymous page and memcpys the content of the old page into the new page. Normally this copy-on-write only happens if the userspace application writes to the memory. But without anon_vma and without rmap, the kernel would be forced to trigger this copy-on-write fault during some mremaps too, not only if userspace happens to later write to those read-only anonymous shared pages.
The effect is that those potentially superfluous copy-on-write faults during mremap could lead to an increased memory footprint for the application, and secondly they could lead to slower performance due to the memory copies after the new pages of memory have been allocated. The copy-on-write would need to swap in the pages too if they're swapped out.
In short without anon-vma and without rmap, you cannot allow the same physical anonymous page to be mapped read-only in two different virtual addresses in different address spaces (i.e. in different processes). So if mremap wants to move the virtual address of a shared anonymous page, it has to un-share it first with a copy-on-write fault that wastes memory and cpu.
With the anon-vma design I could instead even map the same anonymous physical page multiple times in the _same_ address space (at different addresses obviously), if a userspace API existed for asking the kernel to do that.
JA: I'm unfamiliar with the swapper_space address space, and the backing-dev-unplug patch that you mentioned. Can you describe what both of these are?
Andrea Arcangeli: The swapcache is part of the pagecache. All pagecache (swapcache included) is associated with an address_space (the address space represents an inode) that knows how and where to write that page to disk. The swapcache address space is the "swapper_space", so the swapper_space knows how and where to write the swapcache to the swap partition or the swap file. Retaining a swapper_space address space makes the swapcache still look like regular pagecache; it avoids explicit checks for PageSwapCache in the pagecache operations.
The swapper_space paid off with the backing-dev-unplug patch already, since that patch moves the disk unplugging callback ("disk unplugging" here means "start of the pending I/O") into the address space data structure. Previously the disk unplugging was sort of global: we were unplugging more disks than necessary before waiting for I/O; now we only unplug the disks we're interested in. So having a "dummy" address space for the swapcache made it very easy to handle the fine grained unplugging of the swap disks too, without having to add any special case in the common pagecache code.
JA: While we're talking about swap, what requirements for swap does the 2.6 kernel have? How are these different from the requirements of the 2.4 kernel?
Andrea Arcangeli: The swapcache handling is the same in 2.6 and 2.4, and that in turn means the swap requirements are the same too.
JA: You also mentioned cleanup that resulted in removing some locks. By removing these locks, are you aiming for better performance, or decreased complexity?
Andrea Arcangeli: Both ;).
JA: Have you received any feedback from Andrew or Linus about getting your -aa vm work merged into the mainline 2.6 kernel?
Andrea Arcangeli: Andrew didn't look much into this yet, but he seems positive; Linus gave some positive comments over time too.
What I can tell for sure is that after the practical demonstrations that rmap can be removed completely without paging complexity downsides (and the first kernel ever doing this has been 2.6-aa), it's almost certain rmap will be removed from 2.6 mainline soon too. The downsides of rmap are too huge to leave it there.
JA: How stable is your 2.6-aa patchset?
Andrea Arcangeli: Stability is excellent. There are only two minor fixes I need to push into my kernel.org area, but I haven't gotten any bug reports for several weeks, and the exact same VM code in 2.6-aa (plus the two minor pending fixes not yet on kernel.org) is being used in production in SUSE Linux 9.1, and it will be included in all the coming SUSE Linux Server 9 products too.
2.4 VM changes versus 2.6 VM changes:
JA: As we've already discussed, fairly late in the development of the 2.4 stable kernel we saw rather significant changes to the memory subsystem. At first glance, it seems that this is happening again with the 2.6 kernel. Are the changes you're currently making in 2.6 on a similar scale, or are you "improving upon" more than "changing"?
Andrea Arcangeli: This is truly "changing": no sign of rmap and pte_chains remains in my tree. The way objrmap works has nothing in common with rmap, and the only thing they share is the API to call these methods. But the unmapping method is only a part of the VM, and the whole objrmap+anon-vma+prio-tree changes are a black and white thing; in 2.6-aa I'm not touching the VM heuristics at all. The VM at large is a big heuristic, and there's no perfect formula you can use to tell which page it's time to swap out to disk and when, nor can you exactly predict how well the swapping will behave at runtime until you test or simulate it; that is the really hard part of the VM. In 2.4 the big changes were in the heuristics; the methods to unmap were almost unchanged in the 2.4 VM rewrite. Here in 2.6 we instead only change the unmap methods and we don't touch the VM heuristics, and these changes to the unmapping methods are easily predictable with simple math.
I believe there are other bits to fix in the 2.6 VM though; at least the oom handling doesn't seem to work safely without swap. The oom handling is less of a black and white thing, for example, but I'm leaving those for much later.
JA: Does the fact that you're not modifying the 2.6 heuristics suggest that they are an improvement over what was in 2.4?
Andrea Arcangeli: The lru design doesn't look very different from 2.4, but various parts of the heuristic that uses the lru information changed in 2.6, and they're supposedly an improvement over 2.4. Looking at the lkml, somebody thinks it swaps too much though, and I'm not completely sure if it has all the highmem fixes that 2.4 has; we'll find out soon. Certainly it looks good so far.
Handling out of memory situations:
JA: What plans do you have for the oom handling in 2.6?
Andrea Arcangeli: Measuring the per-task allocation rate is an old idea that I'd like to implement.
JA: Can you explain how this works to gracefully handle out of memory situations?
Andrea Arcangeli: The idea behind this logic is that the task that allocated the most memory in the last units of time is the one that will likely generate an out of memory condition at a later time, regardless of what we kill right now, and so we'd better kill it first.
Let's assume we run on a machine with 1G of total virtual memory, all in ram. If one task allocates memory at a rate of 100mbyte/sec, it will take at most 10 seconds before generating an out of memory condition. If all the other tasks allocate memory at a rate much lower than 100mbyte/sec, and you kill one of those other tasks to resolve the first oom condition, in less than 10 seconds you'll surely run out of memory again and again, until you finally kill the task allocating memory at a pace of 100mbyte/sec.
The current oom killer instead takes especially into account things like the size of the task. That works well on desktop systems where all the regularly working apps have a tiny amount of memory allocated each, but on servers the regularly working apps may have several gigabytes allocated per process. If a relatively tiny application hits a software bug and this software bug triggers a flood of allocations, the current oom killer can kill one of the big real apps that allocate memory at a slow pace, instead of the relatively small application that hit the bug and triggered the oom condition by allocating memory in a loop. That is one of the biggest problems of the current oom killer.
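The selection policy being described can be sketched in a few lines of C. This is a hypothetical illustration, not kernel code: the `struct task` fields and `pick_oom_victim` name are invented for the example, which picks the task with the highest recent allocation rate rather than the biggest resident size.

```c
#include <stddef.h>

struct task {
    const char   *name;
    unsigned long rss_kb;        /* resident size: what the current killer favors */
    unsigned long alloc_rate_kb; /* kbytes allocated in the last unit of time */
};

/* Hypothetical victim selection by per-task allocation rate:
 * the recent flood allocator, not the long-running big app, gets killed. */
size_t pick_oom_victim(const struct task *t, size_t n)
{
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (t[i].alloc_rate_kb > t[victim].alloc_rate_kb)
            victim = i;
    return victim;
}
```

With a 4G database allocating 4kbyte/sec and a small buggy app allocating 100mbyte/sec, this policy targets the buggy app; a size-based policy would target the database.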
JA: Removing the oom killer doesn't remove oom situations. With the oom killer gone, what happens now in 2.4 if a process tries to allocate more memory but none is available?
Andrea Arcangeli: What happens without an oom killer is that the first task that fails to allocate memory during a page fault is killed. This is also the safest approach from a deadlock standpoint. If an allocation fails during a page fault we can run do_exit() immediately and the task will go away synchronously and reliably. Sending signals to other tasks like the oom killer does is less reliable: for example, if the task is in the D un-interruptible state it may not be able to handle the sigkill signal.
Probabilistically the behavior without the oom killer is the same as the "per-task allocation rate measurement" oom-killer idea mentioned above. The main difference is that the "per-task allocation rate measurement" is reliable and you won't risk hitting the low probability case of killing an innocent task that allocated memory for the first time and happened to trigger the oom first. With the "allocation rate measurement" we'll notice that such an unlucky task allocates less than 4kbyte/sec and we'll target some other bigger offender with a much higher allocation rate per second.
JA: In the 2.4 kernel you provided patches that completely removed the infamous oom killer. Have there been any problems as a result of the oom killer's removal?
Andrea Arcangeli: Some people (especially with desktop workloads where the current oom killer algorithm will almost never make mistakes) preferred to still have the oom killer, and so Marcelo made it dependent on a sysctl, so you can choose whether to leave it disabled or to enable it; that's very nice. However I warn that if you enable the oom killer you'd better have swap enabled or it may not work well, plus the oom killer may make quite huge mistakes in some corner cases even with swap enabled. That's why I had to disable it for servers.
JA: How does selecting the process to kill based on measuring per-task allocation rate avoid trying to kill a task in an un-interruptible state?
Andrea Arcangeli: You got it: mathematically speaking it cannot. "Selecting the task to kill" and "how to avoid hanging trying to kill a task in D state" are two separate problems. The pure "per-task allocation rate" calculation addresses only the former problem (just like the current oom killer only addresses the former problem). To address the latter problem a new logic (not available in any kernel out there) would be necessary on top of the oom killer. However, probabilistically the selection made by the "per-task allocation rate" algorithm should normally pick the current task (which is trivial to kill reliably by calling do_exit synchronously), and most importantly a task that allocated memory in a flood is not likely to stay stuck in D state for a long time. So in practice the risk of hanging while trying to kill a task stuck in uninterruptible state sounds very low when using the "per-task allocation rate" method, as opposed to selecting the task to kill as a function of more static information like the size of the task. So maybe in the short term we can only focus on replacing the oom-selection algorithm, leaving the theoretical oom deadlock issues for later.
JA: You recently had an interesting discussion on the Linux kernel mailing list with Ingo Molnar about the overhead of his 4:4 VM patch. Can you summarize the problems you have with his 4:4 implementation?
Andrea Arcangeli: Numbers were posted to the linux kernel mailing list: especially for a threaded app like mysql, the slowdown of 4:4 was something like -30%.
I definitely agree with Ingo and Martin that 4:4 is a nice thing to have on a 64G x86 32bit machine, but up to and including 32G of ram that's just not needed, and it only risks slowing down the performance of the applications and rendering the hardware less usable for generic use.
All the above assumes that you use 2.6-aa: everything changes if you have rmap. If you have a kernel with rmap like 2.6 mainline, then your lowmem memory waste could be so huge in some workloads that 4:4 may be needed to avoid running out of the lowmem zone even on machines a lot smaller than 64G of ram.
Some users also want as much user address space as they can get in a single process; 4:4 provides them 4G of user address space instead of 3G, plus number crunching is the optimal workload for 4:4. I could measure a slowdown with 4:4 with number crunching, but only with a working set size close to the l2 cache size; for bigger or smaller working sets the slowdown of 4:4 for userspace number crunching is close to zero. Overall, going 64bit is the way to go for those kinds of users (especially today that it has become affordable): if they run out of memory at 3G today, 4G will soon not be enough either.
JA: What alternative do you prefer?
Andrea Arcangeli: The best alternative to 4:4 on the 64G systems would be the page clustering design, implementing a soft PAGE_SIZE, but it seems too late and too invasive for 2.6, and it isn't worth doing so much breakage only for the 64G systems (as said above, up to 32G can be handled fine with 2.6-aa). Page clustering is a nice solution since it should speed up 64bit systems with a 4k page-size too.
Plans for 2.7:
JA: What plans do you have for the upcoming 2.7 development kernel?
Andrea Arcangeli: For 2.7, at the moment my own interest is set towards various topics including page clustering, a basic process migration API, and trusted computing. Then it also depends on the plans of SUSE/Novell, but I can freely spend a percentage of my work time on my own areas of interest, in which case it will be the above topics.
JA: What is page clustering?
Andrea Arcangeli: Page clustering consists of defining a software page size bigger than the hardware page size. If the hardware page size is 4kbytes, the software page size could be set to 8 or 16 or 32 kbytes at your own option. With a softpagesize of 8kbytes the mem_map_t array would already be halved in size. You know, for every page the kernel has to allocate some dozens of bytes for a "struct page"; a bigger page size means fewer "struct page" data structures allocated into the mem_map_t array. On a 64G machine that means saving hundreds of mbytes of low memory. Besides saving memory on the high end, it would provide the advantage of reducing the number of minor page faults too. During paging from disk we always read in big chunks with readahead anyway, to get good disk bandwidth by reducing the number of dma commands we send to the storage, but we only map 4kbytes from the cache at every page fault.
Page clustering is especially interesting because it would save memory and improve performance on 64bit architectures too: these days a hardware page size of 4kbytes is tiny, but it's still required as a _hardware_ page size to avoid breaking 32bit compatibility. The 32bit ABI requires coherent MAP_SHARED at file offsets that are multiples of 4kbytes, and MAP_FIXED at virtual memory addresses that are multiples of 4kbytes too. Getting the compatibility cases right is probably the brainier part of the page clustering patch.
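The mem_map savings can be checked with simple arithmetic. In this sketch the 64-byte `struct page` size is an assumption for illustration (the real size varies by kernel version and config), and `mem_map_bytes` is an invented helper: one `struct page` is allocated per software page, so doubling the soft page size halves the array.

```c
/* Rough mem_map sizing: one struct page per software page.
 * struct_page_size (e.g. 64 bytes) is an assumed, illustrative value. */
unsigned long long mem_map_bytes(unsigned long long ram_bytes,
                                 unsigned long soft_page_size,
                                 unsigned long struct_page_size)
{
    return (ram_bytes / soft_page_size) * struct_page_size;
}
```

With 64G of ram, a 4k soft page size and a 64-byte struct page, mem_map takes 1G of low memory; an 8k soft page size cuts that to 512M, matching the "halved in size" point above.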
JA: How much work has been done so far to get page clustering working with Linux?
Andrea Arcangeli: The page clustering patch is being maintained by WLI; he releases a new version once every few months. Not much work has happened on it recently though, probably because it's not 2.6 material.
JA: Can you describe what you mean by a basic process migration API?
Andrea Arcangeli: I would like a kernel internal API that you could use to freeze the state of a task, and either store it to disk and restore it later, or send it to another machine in a cluster. This would be useful for taking snapshots of workloads too, so they can be restarted transparently if the machine crashes, and it would be useful for applications like kde too, which would theoretically be allowed to restore your bash shell in the same state that you left it when you logged out.
I believe with the network approaching 10gigabit it'll get more and more wasteful not to _transparently_ take advantage of other cpus in the intranet for computations.
The OpenMosix and OpenSSI projects are the most advanced in this area so far. I would like to see tiny bits of them merged and digested and elaborated by the kernel community over time. The risk of making the kernel over-complex by going in this direction is high, but certainly I would find this more worthwhile than things like preempt, which makes the kernel more complex without having a chance to decrease the worst case scheduler latency of the kernel.
JA: Are you suggesting that kernel preemption is not worth the cost? Might this feature be removed in a future Linux kernel?
Andrea Arcangeli: I believe preempt is a fascinating exercise but at the same time I believe it isn't worth the complexity it introduces.
Takashi Iwai measured in various critical workloads that the worst case scheduler latency is not affected by preempt. In terms of complexity I especially dislike pieces of code in the scheduler that disable interrupts a few lines early to avoid calling an explicit preempt_disable.
Something preempt can do is decrease the average scheduler latency, but the most important thing normally is the worst-case latency, not the average one. Something I proposed several years ago on the linux-kernel mailing list, before preempt was written, is to implement the reverse of preempt. My idea is to enable preemption on demand (not to disable it on demand like now) only for some specific pieces of code that are cpu-intensive and capable of rescheduling (like the copy-user functions), in order to decrease the average latency. That would be very reasonable and it doesn't increase the complexity of the kernel at all, nor would it bloat the spinlocks or other locking operations, nor would it require cli()s to be moved up a few instructions, nor would it require disabling preemption around per-cpu data structures in smp. And it should provide similar average-latency benefits, though it has never been tested.
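The idea can be modeled as a toy in plain C: a cpu-intensive loop (think of a copy-user path) voluntarily polls a need-resched flag at safe points, rather than every kernel code path being preemptible. All the names here are illustrative; the counter parameter stands in for a call to the scheduler.

```c
/* Toy model of "preemption on demand": the long-running loop checks a
 * need_resched flag every check_interval iterations and yields only at
 * those safe points. (*reschedules)++ stands in for calling schedule(). */
long copy_chunks(long total_chunks, long check_interval,
                 int *need_resched, int *reschedules)
{
    long done = 0;
    for (long i = 0; i < total_chunks; i++) {
        done++;                          /* one chunk of real work */
        if ((i % check_interval) == 0 && *need_resched) {
            (*reschedules)++;            /* yield at a known-safe point */
            *need_resched = 0;
        }
    }
    return done;
}
```

Because preemption happens only at explicit points the loop chooses, no spinlock or per-cpu data elsewhere in the kernel needs preemption guards; the trade-off is that latency is bounded by the check interval rather than being immediate.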
JA: How difficult would this be to implement? Is it something you've considered working on for 2.7?
Andrea Arcangeli: The implementation is at least simpler than preempt itself. The trickiest part is the modifications to the entry.S asm to reverse the preempt logic. If somebody is looking for an exercise to understand the kernel preemption bits while entering/exiting the kernel, this could be a good didactic project to get some practice. The closest I got to getting it running myself was a few days after proposing it, but I didn't complete it after I ran into various troubles with the entry.S code. I don't plan to work on this myself during 2.7, but I certainly would be very curious to compare it with preempt in terms of average latency.
JA: What work do you intend to get involved in that's related to trusted computing?
Andrea Arcangeli: See the TCG benefits at the https://www.trustedcomputinggroup.org/ website; that's what the hardware feature can provide. It should be a generic hardware feature, and there should be no technical reason why we couldn't support it in linux too.
Actually in my spare time I had an idea for one revolutionary and ambitious project I can build on top of trusted computing capable hardware (that project has nothing to do with linux by the way, but for it to run on linux too, linux would need to provide some basic trusted computing support). That's something I wanted to build for a long time, but it was never feasible until they added trusted computing to the hardware and filled the gap to make my idea possible, so I'm quite happy about these new hardware features (despite the fact that they can clearly be misused for some annoying things too).
JA: Are you willing to share more about what this idea is and how it would work?
Andrea Arcangeli: Not yet, sorry, it's excessively complex to discuss within an interview, so you will have to trust me that you can do really useful things on top of that new hardware technology.
JA: How do you enjoy spending time when you're not hacking on the Linux kernel?
Andrea Arcangeli: Playing tennis, listening to lots of music (though often I do that even while I work on the kernel), reading news and magazines, going out with my girlfriend and/or friends, often going to the sea on weekends, visiting new places, watching cinema, sometimes theatre, some snowboarding in the winter, doing various stock market research and trading.
JA: What sorts of music do you listen to?
Andrea Arcangeli: All, mostly metal and progressive rock, and sometimes electronic and some classical too, though I'm very selective. I enjoy live jazz and blues too. When I was a teenager I spent lots of my time playing guitar; I used to be able to play most music I listened to, including classical (I was playing the violin part on guitar with a very fine distortion effect; those experiments really sounded great). I started studying piano when I was 8 years old or so; I can still play piano decently.
JA: Are you a good snowboarder?
Andrea Arcangeli: I'm probably about average, not too good but not too bad. I was a lot better at skateboarding than at snowboarding, but I feel too old for skateboarding these days ;), and falling on the snow is a lot more comfortable... I tried snowboarding for the first time around four years ago at the first and last "kernel hacking skiing party" ;).
I'm not one for speed; I don't take risks and I normally prefer the red slopes. This year I learned to turn fakie; I find that a nice exercise to get real full control of the board no matter which way you're going. For instance, my stance angle is +9/-9 (a quite unusual angle for most snowboarders).
JA: Are you mainly interested in technology stocks, or others?
Andrea Arcangeli: I'm not focused on any particular industry or sector. I tend to follow Benjamin Graham's principles; that's a good starting point, but the reality is much more complicated than his simple principles. Strictly following his rule of buying companies trading below net working capital is often too limiting, and in my very limited experience fundamentals don't mean a thing if you don't catch the right entry point, plus his picks are often not going concerns for long, but that's a secondary issue. I play with little money and I never go on margin and I never short, so I'm fully safe, and in turn I can attempt terribly risky stuff without much worry.
JA: It sounds like you keep yourself plenty busy! How much time do you have for your hobbies outside of kernel hacking?
Andrea Arcangeli: Probably one hour per day of spare time on average for all the hobbies, but it mostly depends on whether there are urgent issues to solve; I always give priority to the real work, of course.
JA: What advice would you offer to those that are just beginning to get interested in kernel development?
Andrea Arcangeli: Three pieces of advice. Relax and don't be scared by the source even if you can't understand it; it takes some time, and knowing the concepts in advance may help a lot, but the devil will remain in the details. While coding, always be strict and never write a line of code that cannot make any difference at runtime; if something is like that, convert it into a comment. Have fun and enjoy the technology. Hope this helps ;).
JA: Thanks for all your time answering my questions, and even more for all the effort you've put into improving the Linux kernel!
Andrea Arcangeli: You're welcome, my pleasure ;).