Nick Piggin, a college student living in Canberra Australia, has been working on an anticipatory I/O scheduler for the Linux kernel [story].
When a process reads data from a disk, the default "deadline" I/O scheduler can offer poor performance if a streamed write is happening at the same time. The reason is that many read operations require multiple reads, each reporting a result back before the next can be scheduled. Thus, each of these reads has to wait behind a queue of writes, resulting in the aforementioned performance problem. The anticipatory scheduler solves this problem nicely by pausing for a few milliseconds after each read, "anticipating" the next read request [story].
In this interview, Nick offers much more detail behind the operation of the anticipatory scheduler. His goal is to stablize and tune [story] the new scheduler, aiming utimately for inclusion into the 2.5 development kernel tree as the default Linux I/O scheduler [story]. The latest version of Nick's anticipatory scheduler can be found here in Andrew Morton's [interview] -mm kernel branch.
Nick Piggin: OK, I'm Australian, am 21, currently live in Canberra Australia with my parents (still!). I work doing boring website stuff for a technical college (TAFE), and am struggling through a degree at ANU.
JA: What kind of "website stuff" are you doing?
Nick Piggin: Ah, I developed a PHP+SQL website which is used by "practice firms" aka "virtual businesses" to buy and sell online, as well as do online banking.
JA: What is the degree you're working toward?
Nick Piggin: Software Engineering.
JA: How much more do you have to go on your degree?
Nick Piggin: 3 years
JA: How and when did you get started working with Linux?
Nick Piggin: First installed Debian 2.0 which was using about a 2.0.34 kernel at the time IIRC so I could probably work it out from there. Maybe 4 years?
JA: Have you been involved in any kernel projects before you started working on the anticipatory scheduler?
Nick Piggin: Hmm, this would be my first real one. A few random patches to the O(1) scheduler, Rik's rmap VM, and I did make a crappy driver for my soundblaster 16 a while ago.
JA: When did you first start working on the anticipatory scheduler? Is the effort based on any previous code?
Nick Piggin: Started work on it about January 15. I had spent a few months trying to improve the deadline scheduler (some of my deadline stuff is in Linus' tree now). And the reason for _that_ was just out of interest, and trying to figure out why an ftp server (lots of streaming reads) workload hits a wall at a large number of clients. The reason is that memory pressure causes readahead to be discarded before its used. AS solves this problem happily.
Andrew Morton suggested implementing anticipatory scheduling, and I did, and it worked well for where it was supposed to, so the rest of the time (excluding a LOT of debugging) has basically been making it better where it is bad.
The framework for the scheduler is based on Jens' deadline scheduler. If you set antic_expire = 0 (disable anticipation), the scheduling should be similar to deadline, although even that is quite divergent these days.
The idea (but no code) for an anticipatory scheduler goes to this paper, though I have built on his ideas. Basically the hardest thing in turning it from theory to practice is you have to make it work well for the worst cases. Although we have also come to get a lot of performance improvements that the origional paper hadn't found or mentioned.
JA: What changes did you make to the deadline scheduler that got merged into Linus' tree?
Nick Piggin: Can't remember to be honest! Quite a few small correctness things though. I removed the writes starved parameter and replaced with read and write batches. Yeah a few other things. They did make a measureable difference in my test cases.
JA: How does the anticipatory scheduler differ from the default scheduler in the 2.5 development kernel?
Nick Piggin: It _does_ have a lot of changes, but basically AS is the default scheduler which "sometimes pauses for a couple of milliseconds after a read". Sounds simple when I put it like that!
JA: Can you explain more about why the scheduler pauses after a read?
Nick Piggin: Imagine a process "R" which is reading, and another "W" which is writing. You would hope that they each get a good share of the disk.
W quickly fills up say 50MB of cache with writes, which is pushed to the scheduler as quickly as it will be taken.
R submits its first (say 4KB) read, and must now wait for it to complete.
Imagine a scheduler which processes requests as quickly as possible. It will run say 5MB of requests for W, then run R's request. R is woken up due to the completed read and immediately submits another, however between this time, the scheduler doesn't have any more requests from R, and goes on to process another 5MB of requests from W.
A scheduler which pauses a little for R overcomes this problem. It anticipates that R will submit another read, and in this case it pays off. R will be able to run 5MB of reads.
JA: What are some of the other differences between the anticipatory scheduler and the deadline scheduler?
Nick Piggin: OK, well AS will seek in both directions, deadline only one way, AS accounts by time instead of number of requests, they are the fundamental differences in scheduling.
JA: What do you mean by the AS seeking in both directions?
Nick Piggin: Both schedulers keep track of the disk head position. Deadline will service the lowest request _above_ the head, and when there is none, it will sweep back to the start of the disk.
AS does basically the same thing, but it can pick up close requests below the head as well.
JA: Is there additional overhead added by accounting in time instead of the number of requests?
Nick Piggin: No. It isn't much overhead either way, but accounting by time is probably slightly cheaper if you could work it out.
JA: How does the anticipatory scheduler differ from Jens Axboe's CFQ scheduler?
Nick Piggin: Jens Axboe's CFQ scheduler is completely different from deadline scheduler. It basically keeps a queue of requests for _each process_ and round robins between them. Deadline keeps a global queue, and uses an expiry time on each request to give some fairness (it actually seems to work very well in practice).
JA: Have you thought of combining your anticipatory work with Jens CFQ scheduler work?
Nick Piggin: Yes, there are lots of possibilities. The main priority now though is getting AS working well and included in 2.5. It would be unproductive to try to debug and tune two different concepts at the same time.
JA: Is there any chance of the anticipatory scheduler being merged into the main 2.5 development tree?
Nick Piggin: A reliable source tells me yes. There is a good chance, but it does still depend on how things go.
JA: What workloads benefit the most from using the anticipatory scheduler?
Nick Piggin: Just about any read load that is somewhat localised - when other IO is present. Loads which really do well are random but very localised reads in the presence of a big streaming read or write. Sometimes you can see a 50x improvement.
Desktop performance probably is helped along a little bit in general, but I guess is only really improved in more restricted cases when you're copying a big file or making a backup or something - though those cases can benefit a _lot_.
JA: What's an example of an application that results in reads like the one you've described getting as much as a 50x improvement?
Nick Piggin: grep -r "blah" linux-2.5/* - of course, even that would not benefit _at all_ if it is the only source of IO in the system.
JA: Do you know of any workloads that suffer when using the anticipatory scheduler?
Nick Piggin: Yes, just about anything using big TCQ windows, and synchonous writes (ie. database) workloads, although some database workloads can actually improve. Depends. Most of those tests do a lot of writing which is probably a lot more than most production databases. TCQ is a problem though.
JA: What is TCQ?
Nick Piggin: Tagged command queueing. Higher end disk systems can accept multiple (some 256 or more) requests for IO. The disks may and do reorder these requests for "optimal" scheduling. Unfortunately this is often not what we want to happen.
JA: Are you working to resolve this issue with big TCQ windows?
Nick Piggin: Yeah. A lack of SCSI hardware is slowing progress, but yes, it will be done.
JA: How do you intend to solve for the issues with TCQ?
Nick Piggin: Try to schedule the requests better out of the elevator so we don't need TCQ for good performance. Then limit the amount of writes that can be outstanding.
JA: Why does the anticipatory scheduler perform poorly with some database workloads?
Nick Piggin: Often it is a case of the heuristics not working properly and we pause (anticipate) when it will not pay off. For the TCQ loads it is because the anticipatory scheduler naturally disallows a lot of outstanding reads that the drive can reorder and work with.
JA: How do you intend to solve for this?
Nick Piggin: Heuristics is just monitoring and tuning and maybe coming up with something new.
For the TCQ loads its a matter of improving AS's scheduling of requests, so it is nearly optimal to begin with.
JA: How stable and complete do you consider your scheduler to currently be?
Nick Piggin: Well, I have never had a data corruption bug, nor have I seen a report of one. Had a lot of crashes and hangs, but none for a month or so (in the mm tree). So, if you use 2.5, I'd say there's nothing wrong with using AS (should have testing in mainline though before 2.6).
It is pretty complete. There's always more that can be done, but for a 2.6 scheduler, it just has to have a few performance hiccups sorted out. 2.7 work can easily be backported when it proves itself.
JA: What kind of feedback have you gotten regarding your scheduler work?
Nick Piggin: Generally positive and interested. Larry McVoy reported that it sped up some time consuming bk operation by a good amount for example.
Andrew Morton has been instrumental in AS coming this far. He has given me a lot of help, and is quite hopeful of AS doing well. Its probably by far the longest patch he has had in his tree. Its something that is very difficult to optimise just by myself though, and is not yet appropriate for it to be in the main tree when the mm tree is available.
Other than Andrew though, most discussion I have about it is on kerneltrap!
Con Kolivas with Contest, random Andrew Morton tests, Joel Becker with OraSim and WimMark, random Randy Whron tests, and my own tests are basically all. They have been keeping me quite busy though.
JA: Is there a port of the anticipatory scheduler available as a patch for the 2.4 stable kernel?
Nick Piggin: No.
JA: Is this something you ever intend to work on? As it would likely bring in new testers, perhaps it's worth the time it would take?
Nick Piggin: Yeah it could be a possibility. At the moment I still have stuff to get fixed up, but once it is in the 2.5 tree, I might look at doing a backport. I don't think 2.6 is _too_ far away though, and I don't think vendors will wait as long to ship it as they did 2.4.
JA: How close do you think we are to seeing a 2.6 tree?
Nick Piggin: I really don't know. Andrew's todo list probably gives a good idea of what needs doing. At a guess I'd say end of July.
JA: Have you worked with other open source kernels, other than Linux?
Nick Piggin: Nope. Nor closed source.
JA: How do you enjoy spending your time when you're not working on Linux?
Nick Piggin: I'd have to say cars, my motorbike, hanging out with mates. I don't have any other recreational software projects at the moment because I seem to have less hours in the day than everyone else as it is!
JA: Thank you very much for taking the time to answer my questions!
Nick Piggin: No problem!