Con Kolivas, a practicing doctor in Australia, has written a benchmarking tool called ConTest, designed to compare the performance of different versions of the Linux kernel, which has proven tremendously useful to kernel developers. He was kind enough to speak with us, explaining how he got started on this project, what makes his benchmark unique, and how to interpret the resulting output. Comparing the 2.5 development kernel to the 2.4 stable kernel, Con says, "a good 2.5 kernel (and that's not all of them) feels faster than 2.4 in most ways and this bodes well for 2.6." The results from his frequent benchmarks back up this statement.
Con also describes his high performance patchset for the 2.4 stable kernel, currently at version 2.4.19-ck9. This patchset adds a number of performance boosting patches ideal for a desktop environment, such as the O(1) scheduler, kernel preemption, low latency and compressed caching. Read on for the full interview...
JA: Please share a little about yourself and your background...
Con Kolivas: I'm 32 years old, live and grew up in Melbourne Australia, am very happily married and have a 9 month old son. I'm a little embarrassed people get me confused for a kernel hacker, as my real profession is very remote from IT. I'm a doctor; a specialist in anaesthesia.
JA: How and when did you get started with Linux?
Con Kolivas: I grew up with computers (geek) but did not work with or study them in any formal manner. While studying medicine and then specialising I went a long time without computers at all (the post amiga days). When I finally got back to computers in about '97 I was incredibly frustrated with the microsoft based machines, the only ones I had to work with, after being so happy with the performance and flexibility of much lower spec amiga machines. A friend introduced me to linux in 1997 but, being too far removed from computers at the time, I found it difficult to get started with it. In 1999 I decided to try again and got quite addicted (bordering on the obsessive) in a very short space of time. Six months later I gave up on other OSs as I noticed linux had a momentum that would make it unstoppable, even if it definitely wasn't (and still isn't) the best tool for all tasks. I've used numerous distributions in the past but Mandrake gets me up and running with more things working with less fuss so I tend to stick with it.
JA: When did you first start reading kernel code?
Con Kolivas: 2.4.18 when I started trying to merge O1, preempt, low latency and compressed cache. After applying each patch I had to sort out the problems with each merge and found that looking at the code it made a lot of sense to me and I could sort out the problems - mind you I can't program in c at all. Look at the code for long enough and you start understanding what it is doing.
JA: You've recently been doing quite a lot of work on a benchmarking tool called Contest. How is this tool different than other benchmarks?
Con Kolivas: Long story to explain this. When I started merging interesting kernel patches for 2.4.18 that were known for improving system response, initially people just gave me small amounts of positive feedback. When I posted that I had merged the patches for 2.4.19, for some reason it attracted a lot more feedback. This time I had people repeatedly asking me if I had benchmarked these patches; could I substantiate my claims that they made the system more responsive? I used the excellent resources of the open source development lab (http://www.osdl.org) to benchmark my kernels and got the results I expected - virtually unchanged from a vanilla kernel. At about the same time Rik van Riel had been repeatedly defending his -rmap patches on lkml and the #kernelnewbies channel, arguing that although benchmarks didn't show any improvement in performance, users had found that they made a difference. Many lkml threads followed about how one thing benchmarked after another was not a real measure of system responsiveness. None of the standard benchmarks available at the time would tell you that. I was quoted as saying "a good anecdote is worth a thousand benchmarks". We all know that if you start a cpu intensive process in the background, linux won't bat an eyelid, with no noticeable slowdown in system response. Do a big file write or untar a file, then try to do anything else, and be prepared to go make yourself a coffee while waiting. On IRC, Rik encouraged me and others to "do something" about this. Repeatedly on lkml Linus was quoted as saying "if we can't measure it it doesn't exist" and Rik said "If we don't measure it our method of development will ensure it won't exist". Even though my c programming skills are, shall we say, bordering on the /dev/zero, I had been thinking about this very thing and knew I could do it with a simple script.
Contest (pun and name courtesy of Rik van Riel) takes an easily reproducible thing to do - compile a kernel - which represents a whole swag of things a user may do and notice a slowdown on the machine (heavy cpu use, file IO etc.), and times it in different settings. It is run on as fresh a machine as possible in single user mode to eliminate the influence of other activity on the results. It first flushes the memory and all the swap so the benchmark always starts up "cold". Then it times a kernel compilation by itself, and in the presence of a number of different loads: a heavy context switching load (process_load), a heavy file write (IO_load), a file read (read_load), memory grabbing (mem_load), extracting a tar (xtar_load), creating a tar (ctar_load) and a huge directory listing (list_load). The idea is that running the load for the duration of the kernel compile increases the signal to noise ratio of the test and picks up slowdowns that we may only momentarily notice when trying to do things on our machines. This was quite a departure from the "throughput" approach to benchmarking, and appears to more realistically represent what happens in the real world.
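As a rough illustration of the idea Con describes (this is not contest's own code), the sketch below times a fixed task by itself and again while a background load runs, counting how many "loads" complete in that time. Everything here is an assumption made for the example: the names `busy_work`, `cpu_load` and `time_task` are hypothetical, the task is a small arithmetic loop rather than a kernel compile, and the load runs in a thread rather than as a separate process.

```python
import threading
import time


def busy_work(n=200_000):
    """Stand-in for the kernel compile: a fixed amount of CPU work."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def cpu_load():
    """Hypothetical background load: one small burst of arithmetic."""
    sum(i * i for i in range(10_000))


def time_task(task, load=None):
    """Time `task`, optionally while `load` repeats in a background thread.

    Returns (elapsed_seconds, loads_completed).
    """
    stop = threading.Event()
    loads_done = 0

    def run_load():
        nonlocal loads_done
        while not stop.is_set():
            load()
            loads_done += 1

    worker = None
    if load is not None:
        worker = threading.Thread(target=run_load)
        worker.start()

    start = time.perf_counter()
    task()
    elapsed = time.perf_counter() - start

    if worker is not None:
        stop.set()      # tell the load to finish its current iteration
        worker.join()
    return elapsed, loads_done


if __name__ == "__main__":
    baseline, _ = time_task(busy_work)
    loaded, loads = time_task(busy_work, load=cpu_load)
    print(f"no_load:  {baseline:.3f}s")
    print(f"cpu_load: {loaded:.3f}s  loads={loads}  ratio={loaded / baseline:.2f}")
```

The two numbers returned mirror the trade-off discussed in the interview: the elapsed time stands for responsiveness, and the load count stands for how much background work still got done.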
JA: For what sorts of benchmarking is Contest best suited?
Con Kolivas: Contest is a very specific tool for kernel comparisons. Because the tools (c compiler etc) and hardware do not change between benchmarks it can only note a difference between kernels. As I was watching kernel development for 2.5 head toward the heavy iron I, and many others, were concerned that the desktop was taking a back seat and that it would suffer and be worse by 2.6. A few threads on lkml went so far as saying this [change or that] would benefit NumaQ machines and be only a small detriment to ordinary machines. Contest is good at picking up changes that would cause real slowdowns on ordinary machines under stress. In fact, the pickup on these machines with contest is greater than on big machines, which tend to have hardware that compensates in one way or another. As it turns out, though, these slowdowns that affect smaller machines, if corrected, benefit machines across the board.
JA: For what sorts of benchmarking is Contest not well suited?
Con Kolivas: For just about everything else. Unfortunately although it's an easy to use script for any user, the results from a user's point of view are not useful - kernel development was the intended audience.
Comparisons between hardware - even with minor changes - are meaningless. It's almost impossible to pick up what has caused the difference. A faster hard disk for example can speed up the benchmark by taking some of the load off the machine OR it can slow it down by being so busy writing that the cpu chokes and doesn't get a chance to do anything else.
The traditional benchmark measures of throughput, iterations, data processed etc are not measured to any great extent. The background load is the only thing measured in that way. Let's say you start writing a file in the background - you want the machine to remain responsive, but you also want the file to be written as fast as possible. Contest gives a measure of the responsiveness in the kernel compilation time, and a measure of the file write as the number of "loads" in the result. Ideally time will be low first and loads high second but the balance can swing, and contest will demonstrate this.
JA: Have you received much feedback on your benchmarking tool and the results you regularly post to the Linux kernel mailing list?
Con Kolivas: I've received a lot of feedback. The most reassuring thing is that I've been getting feedback from the people who actually have the ability to act on the information themselves. This has prompted me to spend an unbelievable amount of time developing contest, as I didn't really have the skills to create it in the first place - just the idea - and their feedback has helped develop it. Andrew Morton has given me probably the best comments. Many of his ideas for loads have been incorporated into contest and he has the ability and knowledge of the workings to best interpret the data. Some of the recent feedback has suggested that people _with_ programming skills (unlike me) are developing other benchmarks based on the ideas I used in contest. This is a good thing. Although I am clear on most of the limitations of contest, I have to learn the skills to work around them. While it's good fun, I don't want biased or inaccurate results swaying kernel development the wrong way. I suspect it won't be long before someone realizes I'm a fraud and displaces me.
JA: What are some benchmarks that are beginning to use ideas originated with contest, and what aspects are being used?
Con Kolivas: Bill Davidsen, whose sig reads "Doing interesting things with little computers since 1979", is developing a responsiveness benchmark which he is calling "resp1 trivial response time test", and he has just started posting his results to lkml. In his words "The benchmark runs a noload case, then forks a load process (or processes) in the background, pauses long enough to simulate user interaction (and get swapped out if the system is memory stressed), then reports the time it took to complete, including the ratio of loaded to noload time." The concept being used is that of trying to perform tasks in the presence of different loads much like contest does. It seems to be a very interesting benchmark and if he develops it I think contest will already have a successor.
JA: Based on your benchmarking tests, what observations can you make about the 2.5 development kernel, especially compared to the 2.4 stable kernel?
Con Kolivas: As most of the contest results show changes in scheduling, IO and VM, I can't really comment on any other area. It surprised me when I first started looking at the VM code just how restrained some of the 2.4 stuff was and how many changes have gone into 2.5. Despite this, there are heaps of great ideas for the VM that just won't make it into 2.6, and may not even be ready for 2.8 because of how stepwise the development needs to be. The results I've obtained show that sometimes the most minor changes can have enormous implications for one part of contest or the other. There is no doubt that file writing is a _lot_ faster in 2.5 and is so fast it can all but kill the machine. Secondly the kernel has become very swap happy. This has upset a few people and AKPM is heavily working on a vm_swappiness feature that still needs quite a lot of tuning to not choke the machine in either direction. However a good 2.5 kernel (and that's not all of them) feels faster than 2.4 in most ways and this bodes well for 2.6. What I have noticed with contest is that great code untested in one way or another is not necessarily a speedup unless it's tuned. Hopefully contest and any successor benchmarks can help that.
JA: What are some recent changes in 2.5 that have had significant impact on kernel performance?
Con Kolivas: When rmap was being integrated into 2.5 the results were less than great. At the same time Andrew Morton was busy putting together his mm patch series, which showed substantial improvements with contest (and other benchmarks), and his patches have gradually been incorporated into 2.5. His (and those contributing to his) vm patches have made enormous measurable gains. The introduction of the deadline scheduler significantly prevented kernel choking. His vm swappiness feature, which still needs some tuning, can be either very good or very bad for different contest results depending on whether it is set to low or high - he is modifying and tuning it for this to be tamed and the results are already improving dramatically. I came into 2.5 too late to know exactly where the gains prior to this occurred, but almost across the board 2.5 outperforms 2.4 in contest.
JA: What do you mean when you say that file writing is so fast in 2.5 it can "all but kill the machine"?
Con Kolivas: When the IO scheduler was updated in 2.5 it was proudly announced that 2.5 can keep 60 spindles saturated with data. Well this may have been true, but what happened during this heavy file writing is nicely demonstrated in contest - the kernel could not do anything else at the same time. IO load would show enormous amounts of background load being performed, but a simple kernel compile could take an hour during this load instead of one minute!
JA: What improvements do you still have planned for contest?
Con Kolivas: I would like to find a way to measure the load cpu% during each kernel compilation more accurately. At the moment the load starts 5 seconds before each kernel compilation and finishes a variable length of time after the compile. The load cpu% is simply that returned by the time command at the end of the load running - this overestimates the load. Also the method used to estimate the number of loads is not perfect.
Other people have been sending me patches to clean up what c code is in there.
I'm greatly at the mercy of the kernel hackers themselves as to the direction it will take from here, as they have been suggesting the loads to add. At the moment I have no other loads planned that are reasonable to include. Beyond this my plans are for it to be deprecated in favour of a better benchmark.
JA: When you post your contest results to the lkml, they are divided between several types of loads. What follows, as an example, is the output of the 'io_load' section taken from a recent test you've posted. Can you explain what the individual columns mean, and how I can interpret these numbers?
io_load:
Kernel [runs]   Time    CPU%   Loads   LCPU%   Ratio
2.4.19          492.6   14     38      10      7.33
2.4.19-ck7      174.6   41     8       8       2.60
2.4.19-ck9      140.6   49     5       5       2.09
Con Kolivas: The first column shows the name of the kernel and in parentheses how many times this particular test was conducted for that kernel. The Time column shows how long it took to perform a simple kernel compilation in the presence of (in this case) io load. CPU% shows the average cpu% the kernel compilation was using when this load was running. Loads is an internal number showing the amount of work the background load performed during this test - the absolute number is not meaningful in any other context. LCPU% shows the average cpu% the load was using while it was running. Finally Ratio is a simple ratio of Time compared to the reference (in this case 2.4.19 with no load).
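The Ratio column can be reproduced with a single division. The sketch below is illustrative only: the reference time for 2.4.19 with no load does not appear in the cropped excerpt, so the value of 67.2 seconds used here is an assumption, back-calculated from 492.6 / 7.33, and the names `REFERENCE_TIME` and `ratio` are hypothetical.

```python
# Times (in seconds) taken from the io_load excerpt above.
io_load_times = {
    "2.4.19": 492.6,
    "2.4.19-ck7": 174.6,
    "2.4.19-ck9": 140.6,
}

# Assumption: the no-load reference time is not shown in the excerpt.
# 67.2 s is back-calculated from the first row (492.6 / 7.33).
REFERENCE_TIME = 67.2


def ratio(time_under_load, reference_time=REFERENCE_TIME):
    """Compile time under load divided by the no-load reference time."""
    return round(time_under_load / reference_time, 2)


for kernel, seconds in io_load_times.items():
    print(f"{kernel:12s} {seconds:7.1f} {ratio(seconds):5.2f}")
```

With that assumed reference, the computed ratios agree with the Ratio column of the excerpt for all three kernels.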
What we want is the following, in this order of importance:
1. Time needs to be as low as possible (and therefore ratio as close to 1)
2. CPU% needs to be as high as possible
3. Loads needs to be as high as possible
4. LCPU% needs to be as high as possible
Using your cropped example, you can see that -ck9 manages to take the least time to compile a kernel, and in that time the loads and load cpu% are lower than the others, but given that time is the most important, the other numbers take a back seat. If for example time was the same between kernels but one had a higher load, then the one with the higher load was performing better, and so on. This is a good example because it shows that when the kernel is busy writing large files (IO_load), the cpu is not being used to its fullest as the cpu% + lcpu% is not even close to 100% - i.e. the cpu is idle waiting for the kernel to give it something to do. You can see interpreting this data can be difficult if many factors change at once.
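To make this reading of the columns concrete, here is a minimal sketch that ranks the three kernels from the excerpt and adds up CPU% and LCPU%. It is our interpretation, not part of contest: treating "order of importance" as a strict tie-breaking order is an assumption, and the names `Result`, `rank_key` and `unused_cpu` are hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    kernel: str
    time: float   # seconds to compile under io_load
    cpu: int      # CPU% used by the kernel compile
    loads: int    # work done by the background load
    lcpu: int     # CPU% used by the background load


# Values copied from the io_load excerpt in the interview.
results = [
    Result("2.4.19", 492.6, 14, 38, 10),
    Result("2.4.19-ck7", 174.6, 41, 8, 8),
    Result("2.4.19-ck9", 140.6, 49, 5, 5),
]


def rank_key(r):
    """Lower Time first; ties broken by higher CPU%, Loads, then LCPU%."""
    return (r.time, -r.cpu, -r.loads, -r.lcpu)


def unused_cpu(r):
    """Share of the cpu that neither the compile nor the load was using."""
    return 100 - (r.cpu + r.lcpu)


ranked = sorted(results, key=rank_key)
for r in ranked:
    print(f"{r.kernel:12s} time={r.time:6.1f} unused cpu={unused_cpu(r)}%")
```

Sorted this way, -ck9 comes first and vanilla 2.4.19 last, and even the best result leaves nearly half of the cpu unaccounted for, which is the point Con makes about the cpu sitting idle during heavy file writes.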
JA: I've been happily using your -ck7 patchset since it was released. Are you still maintaining these patchsets?
Con Kolivas: My idea for the -ck patchsets was to put together great patches that make a real difference to performance into a stable package without fluff. My original idea was to merge O(1), preempt, low latency and the compressed cache patches. ck7 doesn't have the compressed cache, but does have either the aa vm or rmap vm as well, and the feedback I've received so far shows it to be as stable as vanilla 2.4.19, which is the way I'd like to keep it. I have had requests for one feature or another, and I've sent out some custom patches for people when they've asked but have not made them part of the default ck patchset. I actually do plan to release another -ck for 2.4.19 though, and this is to add compressed caching. I've been battling with it for some time and found a problem when mixing this with the -aa vm (it is virtually incompatible with rmap at this time). After contesting it I've decided that to release ck9 (as it will be) I'll have to back out the aa vm or rmap option (ck7 will still be available with these). All my testing has shown this combination to be as stable as ck7 but to perform better. This is what I use on my home machine and I thought I'd make sure it was stable before releasing it. I am interested in feedback though ;). I'll continue with ck for 2.4 wherever possible until 2.6 comes out as I don't think any of these patches will make it into mainstream 2.4.
JA: What happened to -ck8?
Con Kolivas: Should have guessed you'd ask that. A ck7-ck8 diff was posted on my homepage but not recommended to be used. This added compressed caching to ck7. When I used it with the added vm changes from the -aa kernels the machine would experience strange prolonged pauses. Backing out the -aa changes fixed it. Others have described this problem with these patches even without the compressed caching but I had not seen it. So I eventually gave in and removed the -aa addons and released it as ck9, as it outperformed just those addons. Plus I had many requests for ALSA and XFS so I added those too.
JA: Do you intend to merge any additional patches into your -ck patchset?
Con Kolivas: Not unless I get specific requests ;-) I'm not aware of any other major performance boosts that are stable. I've contested a few small patches that are around and look like they may be useful, and the results have not been significant. I intend to bring out a separated version of ck9 so people can add just the patches they want. I also plan to bring out a version of ck9 (with xfs and alsa) suitable for smp; i.e. with -aa vm instead of -cc.
JA: How stable do you feel the compressed caching code is?
Con Kolivas: Very. The author (R. De Castro) gives no guarantee of its stability (he calls it v0.24pre4) but I've not had any lockups on the machines I've run it on (longest has been 3 weeks). Mind you, compressed caching offers nothing to machines with heaps of ram, and is not compatible with SMP, but my target audience was originally the desktop.
JA: What other projects are you currently working on?
Con Kolivas: I hover from one little thing to the next. I do have a full and very busy lifestyle and I already spend way too much time on linux (just ask my wife).
JA: Can you describe your development environment, including the hardware and software tools you typically use?
Con Kolivas: Not being a software developer myself I never learned to use real tools. I use kate (kde advanced text editor), patch and diff and that's about it. I guess pico at the console for the least fuss. I do most of my work on a modest laptop at the moment (see my benchmark results for specs).
JA: Have you worked with any other open source kernels?
Con Kolivas: I don't have the time or background in computer systems for this to be possible.
JA: What do you enjoy doing in your non-Linux time?
Con Kolivas: Spending time with my fabulous family. I also enjoy high performance motor vehicles, classical music, science fiction novels, audiophile hifi and playing nintendo (gamecube - now there's a platform we can learn something from - turn it on and it just works tm.)
JA: Is there anything else you'd like to add?
Con Kolivas: Yes, after having looked at some of the kernel code I feel IMHO that something must happen. Magic numbers must die. There are heaps of occasions where a value must be put to a variable, and that value will be chosen depending on benchmarking. These numbers are always a compromise. I submitted an idea for an autoregulated number for vm_swappiness to AKPM. I had no intention of the code actually being used (I only put it through some brief testing), but I was trying to demonstrate what these numbers need. Magic numbers should be autoregulated by the kernel. They should be variable and work from some sort of feedback loop. My experience with human physiology has shown me that there are an almost infinite number of autoregulated feedback loop type control systems in the human body that give it incredible flexibility to cope under all sorts of situations. I believe, and hope, that this approach could add to the flexibility of the linux kernel.
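The kind of feedback loop Con has in mind can be sketched in a few lines. This is a toy illustration only, not the code he submitted and not kernel code: the proportional rule, the gain, the `toy_pressure` model and all of the numbers are assumptions chosen so the example settles on a value, and the names `regulate`, `toy_pressure` and `settle` are hypothetical.

```python
def regulate(value, measured, target, gain=0.5, low=0.0, high=100.0):
    """One step of a proportional feedback loop for a bounded tunable.

    If the measured signal is above the target the tunable is raised,
    if it is below the target the tunable is lowered, and the result
    is clamped to the range [low, high].
    """
    adjusted = value + gain * (measured - target)
    return max(low, min(high, adjusted))


def toy_pressure(value):
    """Hypothetical system response: the signal falls as the tunable rises."""
    return 70.0 - 0.5 * value


def settle(start, target, steps=50):
    """Repeatedly measure and adjust, starting from `start`."""
    value = start
    for _ in range(steps):
        value = regulate(value, toy_pressure(value), target)
    return value


if __name__ == "__main__":
    # From either extreme the tunable drifts to the same operating point.
    print(round(settle(0.0, target=50.0), 2))
    print(round(settle(100.0, target=50.0), 2))
```

In this toy model no fixed number has to be chosen in advance: the tunable ends up wherever the feedback signal meets its target, which is the behaviour Con contrasts with a hand-picked magic number.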
JA: What sorts of feedback did you receive from Andrew Morton regarding your vm_swappiness patch?
Con Kolivas: He said it could work but that he didn't want to get too complex in there. I assume he meant he didn't want an algorithm included in vmscan (even though it's very basic). He also wanted to offer people a swappiness dial that had simple to understand effects they could manipulate, meaning they could set vm_swappiness to the number they wanted rather than have the machine choose it. I suggested letting them set the maximum vm_swappiness instead but the discussion thread trailed off at that point. He is still putting in a lot of work in other areas of vm_swappiness. Currently the default is set at 60 (range 0-100).
JA: Thanks for taking the time to answer my questions! Your -ck patchsets are greatly appreciated daily on my desktop system, and I'm certain your contest benchmarks are ultimately going to result in a better 2.6 kernel.
Con Kolivas: You're most welcome. I'm a verbose kind of person so I didn't need much prompting. I'm glad you enjoy ck as much as I did putting it together. I greatly look forward to 2.6 myself as well and hope I'm helping in whatever small way I can.