Jens Axboe provided a patch that makes the various IO schedulers fully modular, allowing people to switch between them on the fly. With the patch applied, on a running system you can switch between Nick Piggin [interview]'s anticipatory IO scheduler [story], Jens' deadline IO scheduler [story], Jens' CFQ IO scheduler [story], and the noop IO scheduler.
Jens points out that, "with the io schedulers fully modular, it is possible to add a new io scheduler type without touching the core kernel at all." He also notes that at least one scheduler must be compiled into the kernel, while the rest can be built as kernel modules. Read on for further details, including information on how to switch between schedulers.
From: Jens Axboe [email blocked] To: Linux Kernel [email blocked] Subject: [PATCH] modular io schedulers with online switching Date: Thu, 16 Sep 2004 16:59:38 +0200 Hi, This patch implements fully modular io schedulers and enables online switching between them. It adds a new sysfs entry in the block path that shows the current one in brackets and the other ones along side it: bart:/sys/block/sda/queue # cat scheduler anticipatory deadline [cfq] cfq is selected here. I built noop as a module here, if I load it: bart:/sys/block/sda/queue # modprobe noop-iosched bart:/sys/block/sda/queue # cat scheduler anticipatory deadline [cfq] noop it shows up as well. Switching is just as simple, you simply echo the desired scheduler into that file: bart:/sys/block/sda/queue # echo anticipatory > scheduler bart:/sys/block/sda/queue # cat scheduler [anticipatory] deadline cfq noop Most of the patch is making q->elevator allocated instead of embedded inside the queue, and splitting elevator_t into two parts - one for the elvator_type (noop, deadline, etc) and elevator_queue that's an instance of that associated with a given queue. With the io schedulers fully modular, it is possible to add a new io scheduler type without touching the core kernel at all. elv_register() and elv_unregister() register a new io scheduler with the elevator core. Remember to build at least 1 io scheduler into the kernel, they can't all be modules :-) Patch is against current BK. [patch] -- Jens Axboe
From: Jens Axboe [email blocked] Subject: [PATCH] modular io schedulers with online switching, #2 Date: Fri, 17 Sep 2004 11:44:36 +0200 Hi, Changes since yesterday: - Don't allow AS to be unloaded, it has open io contexts attached to processes so we can't remove the functions that will be called on task exit. AS could be made unloadable by iterating each task in the system, calling exit_io_context() on tsk->io_context and clearing it. AS is still online switchable, you just cannot remove the module after it has been loaded. - Add module license to each scheduler - Move the chosen_elevator init and setup into elevator.c. Also sanity check given elevator= boot parameter and fall back to a default if it's invalid. This also gets rid of the ugly 'printed' crap in blk_init_queue() [patch] -- Jens Axboe
This way scheduling & performance can be further improved
I may be wrong (I'm don't have much operating system development experience) but this seems to be good idea because:
7) In today's computers, you
7) In today's computers, you really want to use different io schedulers in different io devices. A typical hard disk wants deadline/AS, but a USB stick does want a noop scheduler. In the future I guess each device will asign the best io scheduler for them.
excellent
Sounds like a very clever observation to me. Now, i'm no kernel expert, so - any competent one cares to comment?
That's indeed one of the reas
That's indeed one of the reasons for the online switching. Right now it's only possible to change io schedulers on a per-driver basis by editing the source. And if the same driver drives more than one type of hardware, you have to add code to handle that. Besides - it's a policy issue, so it belongs in user space.
make it a mount option?
Extending your line of reasoning that the IO scheduling policy should be device specific, would it make sense (and be possible) to make the policy a mount option?
If no policy is specified for mount, it could fall back to the built-in scheduler to maintain backward compatibility.
a IO scheduler is there to op
a IO scheduler is there to optimize performance for a IO device. Hence, it needs to be per-device, not per-mount, I think.
We do not always mount devices
You do not always want to mount a block device. The scheduler can be set as simply as echo "name" > /sys/...., so a boot-up script is really enough to accomplish that.
scheduler choice
Im not familiar with schedulers. any link to differentiate between them, why is deadline/AS good fo hdisk, noop for flash devices...
ok, here's it briefly
This is just an executive summary. I don't know the details.
Normally a disk scheduler tries to balance between these goals:
- fairness (let every process have its share of the access to disk)
- performance (try to serve requests close to current disk head position first, because seeking there is fastest)
- realtime (guarantee that a request is serviced in a given time)
A scheduler gets as input the list of sectors that processes have recently requested and which haven't yet given. It knows where the disk head currently is, and whatever other data it "remembers" from the past requests.
A typical scheduler is called cyclic elevator, where disk heads move from beginning of disk to the end of disk in one direction. Assume that you get some requests in a sorted queue, and you're going to satisfy them. So, you'll first filter the list of requests by deciding that you will not satisfy requests for sectors below your current head position. Then you'll simply go to the sector closest to your current position in your sorted list. So, crunch crunch crunch, you move from start of disk to the end of disk. When you reach the highest unsatisfied sector number, your cycle is over and you seek to the lowest unsatisfied sector. Rinse and repeat.
Linux 2.2 at least used the cyclic elevator if I remember right.
So, with these approximate goals in mind, we can look at the different schedulers available.
- deadline scheduler: this is a cyclic elevator but with a twist: requests are given a deadline by which they get served. When a request starts to look like it's going to expire, the kernel will skip intermediate sectors and move to that request, thus giving some realtime behaviour.
- AS scheduler: a cyclic elevator with waiting policy: after you service a request that looks like it might have future requests coming nearby, you pause even if there's more sectors in your work queue. The anticipatory scheduler literally anticipates more requests to follow on this track or very close by. How AS decides whether to anticipate is basically just lot of guesswork based on typical access patterns.
- cfq scheduler: different sort of stab at fairness. It's different from either of these two, and doesn't use cyclic elevator and has realtime guarantees and aggressively avoids starvation. It could be a good scheduler for multiuser systems.
- noop scheduler: just service next request in the queue without any algorithm to prefer this or that request.
You want to use noop scheduler on devices where there are no seeking penalty, such as flash drives. That's why USB stick wants noop. Unfortunately, harddisks are very mechanial beasts and their performance is highly controlled by their seeking abilities. All these schedulers above are really trying to figure out how to extract maximum performance off the harddisk without causing bad behaviour in other cases.
RE: ok, here's it briefly
Thank you for a very clear, well written explanation.
You also might want noop for
You also might want noop for very large RAID arrays.
Perhaps.
But I'd want to benchmark that. However, I doubt it. Here's why:
It depends on the RAID level.
On RAID5, you need sector from all minus one drives when you wish to read a sector from the array. That means that random IO causes a lot of seeking on all the drives, so there at least IO scheduler ought to be beneficial as ever. (Incidentally, the "very large RAID arrays" are probably some kind of RAID5 arrays.)
On the other hand, on RAID0, you get your data interleaved on the disks. I would assume that regular cyclic elevator variant would still be good here, because it's not that different from having just one disk, or is it? You want a random sector, you need one harddisk to seek to random position. Sure, you can do more seeks with more drives, but you'll still do the same kind of seeking.
That leaves RAID1 for me to consider. And it's different, because here the RAID driver can choose which disk to pass read requests on. (It would be very smart for Linux's schedulers to account for RAID1 arrays and try to optimize IO specifically for them. This comes from guy who has RAID1 setups on most computers he has.) If I understood correctly, the RAID driver is based on "the optimal scheduler": choose the drive to satisfy read request on which has the head position closest to the requested sector.
In writing, RAID arrays, especially RAID5, will have to do a _lot_ of work, and I'm pretty sure all of them would benefit from scheduler.
My thoughts on the subject, anyway. Please correct if you spot I made mistakes.
oh yeah, hardware RAID is different altogether
Addendum: those beasts could be similar to SCSI drives & controllers, and may well know how to optimize IO for themselves. Could be Linux wouldn't be helping in trying to do it for them.
Sad as it is, all the layers of abstraction involved in the IO path probably demolish good chunk of the performance we could get if we did stuff the way the disks would like: a lot of sequential IO, very little seeking.
RAID1 and seeking
Maybe for the first couple requests. Once you get a large quantity of requests, it seems like you'd want to start striping those across both disks, regardless of who's closest.
For an extreme example, suppose I read block 0 on disk 0, and block N (at the end of the device) on disk 1. Disk 0's head is near the start of the disk, and disk 1's head is near the end. Now I read 1GB near the end of the disk. Wouldn't you want to read about 512MB from disk 0 and 512MB from disk 1? Basically, you need a notion of "send the request to our second choice if the 'nearest' disk is busy."
As far as the comment of "noop for a large RAID": Consider a logical partition that spans several physical drives (concatenated). Now stripe and mirror that. Predicting which disk will service the I/O for a given block is a crap shoot. :-)
Now if it's a hardware RAID, I'd imagine it looks like one disk to Linux, and so it makes sense for Linux to just send requests to that RAID as fast as it can. The hardware will partition the requests and should "Do the Right Thing." A noop scheduler here is paramount: With a smarter scheduler, Linux would spend all it's time sorting and delaying and reordering requests for some model of disk access that has nothing to do with the underlying media. Once the requests get at the RAID, it'll just ignore all of the hard work Linux did and just reschedule further. Worse, the RAID will have lost all context regarding the order in which the requests were originally made, and so the RAID controller may make decisions that unfairly penalize some I/O.
If it's a software RAID, the device-mapper should intelligently figure out how to partition the requests among the disks, and the I/O scheduler for each disk should do something intelligent for its slice of the overall set of requests. Here, it doesn't make sense to have a noop scheduler, at least once the requests are partitioned among the disks.
I'm not sure exactly how Linux's device-mapper works, but I imagine it must be similar to what I describe above. I/O scheduling makes sense for the final block devices, not for the RAID virtual drive.
--Joe
Schedulers don't need to be simple
Most block devices (read: hard disks) are dog slow compared to cpu and memory speed. So it is worth the effort to optimize disk access.
The AS for example actually pauses after a disk access, hoping that another access to the region the heads are floating now is coming up soon. In practice this is actually faster than servicing the next request, which may be far away on the disk.
This is also the task of the filesystem. It should try hard to cluster files together for faster access.
It may sound like a dirty hack, but this is how windows xp minimized boot times: the OS actually records what disk sectors are accessed during boot, and places them together. During the next boot, the timings are different, so it records the accesses again, optimzes again, until the boottime reaches a minimum.
Why do you call cluster reorganisation an ugly hack?
Clustering file together for faster access is not a dirty hack IMHO. On the contrary it is a very intelligent choice: loose a little CPU (which you already have plenty) to gain time on the slow HDD..
I'm wondering if BeOS did the same disk sector reorganisation?
Anyway it was much, much faster to boot than WindowsXP (or Linux).
OSX does the same, afaik.
OSX does the same, afaik.
Re: Why do you call cluster reorganisation an ugly hack?
Because it benefits only boot time and potentially sacrifies runtime performance (due to fragmentation) and might even risk stability (do you really want the OS to shuffle its vital files around?).
That might be well worth on a desktop you boot several times a day, but is contraproductive on a host on-line several months at a time.
There are lower hanging fruits on typical Unix machines to improve boot time (if necessary): many of the services started in /etc/rc{2,3}.d/ could be started simultaneously, but aren't at this time (I think there is at least one project at Debian targeting this). Often silly network (DNS) or device time-outs are awaited with neither disk nor CPU busy.
IMHO Linux boots fast enough on a desktop and certainly on a 'server'. However, for particular demands of some embedded devices or hand-held computers, further work is needed (but they probably wont use harddisks).
just my 2c
Guenther
...
Seems, that all these points may be refused. I see only two nice points here:
(1) automatically selecting may be needed by someone. But I have no clue what kind of tasks should be that it needs such a switching.
We can say that existing schedulers may be devided onto two parts: those, which may be used on desktop and those which may be used on server. That is if there is a task, which needs switch them, this may mean, that machine is not fully desktop and/or server.
(2) this is possible in priciple, so, it should be done :)
--
umka
imranj / how do i specify one........at boot time
I want to use various Scheduler for testing to see which one is best for me, how and where do i specify for it which file and wht commands options
Pls advice.
how do you specify kernel options
Kernel accepts key=value pairs on its boot prompt. If you are using LILO, the boot prompt is the one that says approximately like this:
LILO:
you must give a kernel name here, (press tab to see a list) and then the options. If vmlinuz is the name of your kernel image, then:
LILO: vmlinuz elevator=deadline
shall choose deadline scheduler. The other option is cfq. The default is as. You are free to experiment with them, but I could tell you that the AS is still probably the best of them for you.
If you are a grub user, you'll choose your kernel version from the list, then instead of booting it, you'll press e to edit the commands, find the line that says "kernel" at the beginning, press e again and write elevator=cfq or something on the line.
If you are happy with a particular setting, then feel free to modify your LILO/GRUB configuration to always include that setting. Most people use append="foo" to give default kernel parameters for all defined images. In GRUB, you'll edit the line defning the kernel image in the configuration. For instance, like this:
title Debian GNU/Linux, kernel 2.6.8-1-k7
root (hd0,0)
kernel /boot/vmlinuz-2.6.8-1-k7 root=/dev/md0 ro elevator=deadline
initrd /boot/initrd.img-2.6.8-1-k7
savedefault
boot
That's literally my setup right now as I'm testing deadline scheduler with RAID1...
Lilo Scheduler
Well can u specify in detail......i am using LILO
i want to use Anticipatory Scheduler
i tried but it didnt work?
as is the default on Linux 2.
as is the default on Linux 2.6.x (except if you are using ck patches, for which cfq is default). If you want to change, you need to add a paramter to the append= line. Someting like elevator=as or elevator=cfq for example.
Don't forget...
That the whole point of this article is that the schedulers are online switchable now. Editing your boot loader config should not be something you would need to do to switch options. Boot time only options are a hassle.
Jens
Same
I agree but i am not using this patch , mine is Yoper's build kernels?
got it.