Linux: Fair Queuing Disk Schedulers

Submitted by Jeremy
on February 12, 2003 - 4:21am

Following the recent release of an anticipatory IO scheduler [story], Andrea Arcangeli started a lengthy thread in which he proposed an SFQ (Stochastic Fair Queuing) disk scheduler. The idea was picked up by Jens Axboe who had evidently worked on a similar idea earlier. Jens quickly posted two different disk schedulers utilizing "fair queuing" algorithms, more commonly used in handling network traffic. When someone suggested he was reinventing the wheel, Jens replied, "There's no wheel reinventing here, just applying the goodies from network scheduling to disk scheduling."

Jens' first disk scheduler utilizes SFQ, or Stochastic Fair Queuing. Fair queuing would allow many processes demanding large levels of disk IO to each get fair access to the device, preventing any one process from denying the others. SFQ is one of the simpler and less accurate fair queuing algorithms that works well on average, its primary benefit being that it requires very minimal overhead. Essentially, SFQ works by dividing IO requests among a large number of queues using a frequently changing hash algorithm, then serving the requests round robin. The term 'stochastic' is used as each process does not get its own queue, leaving some of the derived benefit to random chance. In other words, the queues are stored in a hash table, and it is possible for processes requesting IO to hash to the same bucket, or collide, thus sharing a queue.

The second disk scheduler Jens posted utilizes CFQ, or as he defines it Complete Fair Queuing. Jens' prefers his CFQ implemenation, which builds upon his earlier SFQ patch. This scheduler creates additional queues as needed, avoiding the random nature of the stochastic scheduler by entirely preventing collisions. Recent benchmarking suggests that this new scheduler still needs a fair amount of tuning. Read on for complete details and to view the actual patches.


From: Jens Axboe
To: Linux Kernel Mailing List
Subject: [PATCH] SFQ disk scheduler
Date: Mon, 10 Feb 2003 15:50:01 +0100

Hi,

Here's a simple stochastic fairness queueing disk scheduler, for current
2.5.59-BK. It has known limitations right now, mainly because I didn't
bother making it complete. But it should suffice for some rudimentary
testing, at least.

I'm not going to go into great detail about how it works, see Andrea's
initial post of the paper referenced. This version may not be completely
true to the SFQ concept, but should be close enough I think. It divides
traffic into a fixed number of buckets (64 per default), and perturbs
the hash every 5 seconds (hash shamelessly borrowed from networking atm,
see comment).

To avoid too many disk seeks, when it's time to dispatch requests to the
driver, we round robin all non-empty buckets and grab a single request
from each. These requests are sorted into the dispatch queue.

For performance reasons, io scheduler request merging is still a
per-queue function (and not per-bucket).

In closing, let me stress that this version has not really been tested
all that much. It passes simple SCSI and IDE testing, should work on any
hardware basically.

[patch]

-- 
Jens Axboe


From: Andrea Arcangeli Subject: Re: [PATCH] SFQ disk scheduler Date: Mon, 10 Feb 2003 16:11:44 +0100 On Mon, Feb 10, 2003 at 03:50:01PM +0100, Jens Axboe wrote: > Hi, > > Here's a simple stochastic fairness queueing disk scheduler, for current > 2.5.59-BK. It has known limitations right now, mainly because I didn't > bother making it complete. But it should suffice for some rudimentary > testing, at least. Cool, that was fast! ;) > > I'm not going to go into great detail about how it works, see Andrea's > initial post of the paper referenced. This version may not be completely > true to the SFQ concept, but should be close enough I think. It divides > traffic into a fixed number of buckets (64 per default), and perturbs > the hash every 5 seconds (hash shamelessly borrowed from networking atm, > see comment). I tend to think 5 seconds is too small, 30 sec would be better IMHO (it should be tested at bit). > To avoid too many disk seeks, when it's time to dispatch requests to the > driver, we round robin all non-empty buckets and grab a single request > from each. These requests are sorted into the dispatch queue. > > For performance reasons, io scheduler request merging is still a > per-queue function (and not per-bucket). Unsure if it worth, but it probably it won't make that much difference, likely different workloads are working on different part of the disk anyways. > In closing, let me stress that this version has not really been tested > all that much. It passes simple SCSI and IDE testing, should work on any > hardware basically. How does it feel? Andrea
From: Jens Axboe Subject: Re: [PATCH] SFQ disk scheduler Date: Mon, 10 Feb 2003 16:22:26 +0100 On Mon, Feb 10 2003, Andrea Arcangeli wrote: > On Mon, Feb 10, 2003 at 03:50:01PM +0100, Jens Axboe wrote: > > Hi, > > > > Here's a simple stochastic fairness queueing disk scheduler, for current > > 2.5.59-BK. It has known limitations right now, mainly because I didn't > > bother making it complete. But it should suffice for some rudimentary > > testing, at least. > > Cool, that was fast! ;) It's pretty easy to do in 2.5 :-). A 2.4 backport is of course feasible, but requires a bit more work (and possibly abstracting some of the elevator stuff there). > > I'm not going to go into great detail about how it works, see Andrea's > > initial post of the paper referenced. This version may not be completely > > true to the SFQ concept, but should be close enough I think. It divides > > traffic into a fixed number of buckets (64 per default), and perturbs > > the hash every 5 seconds (hash shamelessly borrowed from networking atm, > > see comment). > > I tend to think 5 seconds is too small, 30 sec would be better IMHO (it > should be tested at bit). It probably is too small, testing will show. I don't see too many collisions from a dbench 32 or 64, so... > > To avoid too many disk seeks, when it's time to dispatch requests to the > > driver, we round robin all non-empty buckets and grab a single request > > from each. These requests are sorted into the dispatch queue. > > > > For performance reasons, io scheduler request merging is still a > > per-queue function (and not per-bucket). > > Unsure if it worth, but it probably it won't make that much difference, > likely different workloads are working on different part of the disk > anyways. The rate of unrelated merging is typically quite low, so no it probably doesn't provide much of a performance benefit. However, it also keeps the code simpler to simply have a single merge hash per queue. > > In closing, let me stress that this version has not really been tested > > all that much. It passes simple SCSI and IDE testing, should work on any > > hardware basically. > > How does it feel? I don't know yet, haven't booted it on my work station yet. Will do so soon :) -- Jens Axboe
From: Jens Axboe Subject: CFQ disk scheduler (was Re: [PATCH] SFQ disk scheduler) Date: Tue, 11 Feb 2003 10:06:44 +0100 Hi, In the nature of taking this concept to the extreme, here's a CFQ disk scheduler (it should be obvious by now, that I'm simply making up a TLA as I see fit :-), or Complete Fair Queueing. It never suffers from queue collisions. So how does it work? As with SFQ, a hash of busy queues is maintained. If a queue for a given queue doesn't exist, one is simply allocated. The actual queueing of requests works like the SFQ scheduler I sent out yesterday, with little twist: we try to put at least cfq_quantum number of requests on the dispatch queue. If only a small number of processes are waiting for io, then this significantly helps throughput by minimizing the time spent between finishing one request and starting a new one. Other changes/fixes from SFQ: - Leave request mergeable even while on the dispatch list, to make the merge window as big as possible. - The dispatch_sort() would sometimes not order requests correctly. - Various small fixes. Interestingly, dbench results show little variance between runs with this CFQ scheduler. Another point of interesting to folks may be that it would be trivial to add process io priorities on top of CFQ (or SFQ for that matter, but I consider CFQ to be the superiour scheduler). If you play with this, let me know how it fares. Patch is against 2.5.60. [patch] -- Jens Axboe

Related Links:

io fairness vs. VM writeback

wli
on
February 12, 2003 - 5:37am

One of the issues I've been concerned about is VM writeback either
being starved or starving userspace io. IMHO this merits special
treatment w/in the io scheduler in order to prevent either underutilizing bandwidth available for VM writeback or the monopolization of io bandwidth by the VM. Are there any plots afoot
to address this issue now that fairness is being actively addressed?

Yep

Nick
on
February 12, 2003 - 6:30am

Andrew Morton et al really know the ins and outs of this.
Let me get the definitions straight first, VM being starved means a process doing a lot of IO prevents memory allocation. Processes being starved means heavy memory pressure prevents IO. Right?

I don't think it is any business of the IO scheduler, however for two reasons:

* a write is a write is a write from a memory cleansing point of view.
* a read won't be performed anyway when there isn't enough memory -
it will have to wait until memory is freed no matter what.

Remember most of the VM work is just async writeout on behalf of the process, so in that case the VM and process are working together. It only comes down to VM vs. Process when there is big swap pressure and I think this would come under load/thrashing control.

Nowadays that isn't a big talking point but I would like to see it being done one day (even a dumb implementation) just to help when a rogue process or unexpected server overload comes by.

write it down

Anonymous
on
February 14, 2003 - 8:36pm

Your explanations here are quite enlightening, Nick. In fact, I would like to see them combined to a chapter in "understanding the linux kernel" or some similar book. Just a suggestion, your work, your decision. :)

maybe

Nick
on
February 15, 2003 - 10:13am

I would if I get a little more familiar with the kernel. We'll see. I will definately write a good administrator's guide to tuning the io scheduler based on all my experimentation, benchmarks, etc, first!

nice values

Anonymous
on
February 12, 2003 - 4:20pm

If each process gets its own que, could the IO scheduler use the nice values for IO priority just like the CPU scheduler does.

Could this concept help the VM subsystem also?

Jamie Burns.

re: nice values

Anonymous
on
February 12, 2003 - 4:31pm

Yes that would indeed be possible. Instead of doing plain round-robin of all processes, you could simply bias your request selection based on the (io) priority of a given queue.

I've been waiting years for that

Anonymous
on
February 16, 2003 - 9:14am

You have no idea how much I would like that feature :)
I especially like the idea of not adding another interface
and just using the nice scheduling values.

order here

order here (not verified)
on
October 30, 2006 - 4:15pm

Hi, just popped in here through a random link. Hi, firstly I'd like to say your site is great and very impressive. Enjoyed the reading.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.