[sorry for reposting, wrong subject]
Hi,
we are working to a new I/O scheduler based on CFQ, aiming at
improved predictability and fairness of the service, while maintaining
the high throughput it already provides.
The patchset, too big for lkml posting, is available here:
http://feanor.sssup.it/~fabio/linux/bfq/patches/
The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin
scheduling policy of time slices into a fair queueing scheduling
of sector budgets. More precisely, each task is assigned a budget
measured in number of sectors instead of amount of time, and budgets
are scheduled using a slightly modified version of WF2Q+. The
budget assigned to each task varies over time as a function of its
behaviour. However, one can set the maximum value of the budget
that BFQ can assign to any task.
The time-based allocation of the disk service in CFQ, while having
the desirable effect of implicitly charging each application for
the seek time it incurs, suffers from unfairness problems also
towards processes making the best possible use of the disk bandwidth.
In fact, even if the same time slice is assigned to two processes,
they may get a different throughput each, as a function of the
positions on the disk of their requests. On the contrary, BFQ can
provide strong guarantees on bandwidth distribution because the
assigned budgets are measured in number of sectors. Moreover, due
to its Round Robin policy, CFQ is characterized by an O(N) worst-case
delay (jitter) in request completion time, where N is the number
of tasks competing for the disk. On the contrary, given the accurate
service distribution of the internal WF2Q+ scheduler, BFQ exhibits
O(1) delay.
We made several tests to measure the aggregate throughput, long-term
bandwidth distribution and single-request completion time guaranteed
by CFQ and BFQ; what we present here was obtained with an outdated
version of the code, we are in the process of collecting data for
the current one (see ...Fabio, I've merged the scheduler for some testing. Overall the code looks great, you've done a good job! I didn't touch patches 2 and 3, but I rewrote #1 somewhat. See the result here: http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=973a02c4ea1c324c41e45b69c07... I'm sure you'll agree with the hlist_sched_*() functions. I also killed the ->bfq_ioprio_changed modification, what exactly was the purpose of that? The code is now in the 'bfq' branch of the block git repo. -- Jens Axboe --
of course the hlist_sched_*() functions are much better than what was in the patch (the idea behind the patch was to not touch at all cfq code). the ->bfq_ioprio_changed was there to avoid that a process/ioc doing i/o on multiple devices managed by cfq and bfq would see ioprio changes only for one of the two schedulers. cfq_ioc_set_ioprio() (and its bfq counterpart bfq_ioc_set_ioprio()) unconditionally assign 0 to (bfq_)ioprio_changed, so the first scheduler that sees the ioprio change eats the new priority values. of course I may be wrong, but I think it (or some better mechanism of course we'll be glad to help in testing in any way you can find useful. thank you! --
As long as the changes at that point are straight forward and 'obviously correct', there's no harm done. Have you thought about merging bfq and I see. If we can and will merge bfq and cfq, then it's not really an issue. Otherwise, I'd suggest using bits 0 of ioprio_changed for cfq and I'll push it to the for-akpm branch as well, so it should show up in the next -mm and get some testing there. -- Jens Axboe --
Well, I'm maintaining bfq as a modified version of cfq, and I use a script to generate the bfq-iosched.c file. It allows keeping the common code in sync. I've posted this version to allow comparisons between the two schedulers and because I consider cfq a reference, more tested/stable scheduler. If you are interested in it I can clean up the cfq patch in the next few days and Yes, that's a better solution, at least for now. [I'm sorry I cannot post a patch correcting it right now because I don't Ok, thank you very much. --
Hi, I'm Paolo Valente and I worked with Fabio on bfq. I'm very happy about your interest in it. Thanks a lot :) --
Here are some microbenchmark results. Test setup is a 2-way IA64 with a
single 15k RPM 73GiB SCSI disk with TCQ depth set to 1. Workloads are
generated with FIO: 128 processes issuing synchronous, O_DIRECT, 16KiB
block size requests.
Figures are quoted as average (stdev). CFQ (i=0) means idle window
disabled. All other tunables are default.
==================================x8=======================================
Random Readers
-----------------------------------------------
Latency (ms) Bandwidth (KiB/s)
-----------------------------------------------
CFQ 841.788 (4070.3) 2428.032 (23.1)
CFQ (i=0) 536.728 (216.9) 3841.024 (8.5)
BFQ 884.4 (8816.0) 2439.04 (1375.0)
Sequential 1MiB Readers
-----------------------------------------------
Latency (ms) Bandwidth (KiB/s)
-----------------------------------------------
CFQ 2865.331 (737.2) 46866.048 (103.1)
CFQ (i=0) 2544.618 (1047.2) 52685.952 (294.2)
BFQ 2860.795 (419.1) 46850.944 (81.5)
Clearly BFQ suffers from the same idle window problems as CFQ, but otherwise
the performance seems comparable in bandwidth terms. I'm guessing variability
in random workload service is due to max budget being too large compared to
CFQ's default time-slice. Sequential access looks nice and consistent, though.
--
Hmm, 4k sectors is ~40 seconds worst case, no? That's quite long... -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
A process with such a low throughput would be marked as seeky from the heuristics implemented in cfq/bfq. Seeky processes are not treated in the same way as sequential ones and they should not get their full slice allocated, since they idle only for very short periods. BTW looking at the code they can get a full slice, if they always reissue requests fast enough - within BFQ_MIN_TT - and this is definitely an issue/error in the current implementation (and we didn't notice it when converting the code from time-based to service-based allocation :) ). An easy solution (without changing the nature of bfq) would be to use shorter slices for seeky queues, with the same mechanism we already use for the async ones. --
Actually, in the worst case among our tests, the aggregate throughput with 4k sectors was ~ 20 MB/s, hence the time for 4k sectors ~ 4k * 512 / 20M = 100 ms. About these test cases we repeated several measures of the aggregate throughput with simultaneous file reads from 2 to 5 (and other varying parameters). The lowest aggregate throughput for 4k sectors (~ 20 MB/s) was achieved in case of 2, 128 MB long, files, lying on two slices at the maximum distance from each other (more details on the experiments, testbed and so on at http://algo.ing.unimo.it/people/paolo/disk_sched/). I hope I didn't misunderstand your point. --
That's not worse case, it is pretty close to BEST case. Worst case is 4k of sectors, with each being a 512b IO and causing a full stroke seek. For that type of workload, even a modern sata hard drive will be doing 500kb/sec or less. That's rougly a thousand sectors per seconds, so ~4 seconds worst case for 4k sectors. -- Jens Axboe --
One seek is still 10msec on modern drive, right? So 4k seeks = 40seconds, no? 4seconds would correspond to 1msec per seek, which seems low. writes with O_SYNC could force full seek on each request, right? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
I actually meant 4k ios, not 512b at that isn't really realistic. With 512b full device seeks, you are looking at probably 30kb/sec on a normal 7200rpm drive and that would be around a minute worst case time. The 4kb number of 500kb/sec may even be a bit too high, doing a quick test here shows a little less than 300kb/sec on this drive. So more than 4 seconds Writes generally work somewhat better due to caching, but doing O_DIRECT 512 byte reads all over the drive would exhibit worst case behaviour easily. -- Jens Axboe --
Yes. 100 ms is just the worst case among our tests with 4k, but these In my opinion, the time-slice approach of cfq is definitely better suited than the (sector) budget approach for this type of workloads. On the opposite end, the price of time-slices is unfairness towards, e.g., threads doing sequential accesses. In bfq we were mainly thinking about file copy, ftp, video streaming and so on. I was not able to find a good solution for both types of workloads. BTW, there is also another possibility. The internal scheduler of bfq may be used to schedule time-slices instead of budgets. By doing so, the O(1) instead of O(N) delay/jitter would still be guaranteed (as it is probably already clear, bfq is obtained from cfq by just turning slices into budgets, and the Round Robin-like scheduling policy into a Weighted Fair Queueuing one). --
Which is fine, nothing wrong with a scheduler tuned for a specific I was thinking about that too. Generally I've been opposed to doing scheduling decisions on anything but time, since that is always relevant. When to hand out slices and to what process, that algorithm is really basic in CFQ and could do with an improvement. -- Jens Axboe --
Maybe there is also another middle-ground solution. I'll try to sketch it out: . use sectors instead of time . impose a penalty to each thread in proportion to the distance between its disk requests . reduce the maximum budget of each thread as a function of this seek penalty so as to prevent the thread from stealing more than a given time slice (the simple mechanism to limit per-thread budget is already implemented in bfq). By doing so, both fairness and time isolation should be guaranteed. Finally, this policy should be safe in that, given the maximum time used by a seeky thread to consume its maximum budget on a reference disk, the time used on any faster disk should be shorter. Does it seem reasonable? --
Not for CFQ, that will stay time based. The problem with #2 above is that it then quickly turns to various heuristics, which is just impossible to tune for general behaviour. Or it just falls apart for other real life situations. -- Jens Axboe --
Like SSD or hardware RAID. Time-slices have the nice property of fairness irrespective of the underlying hardware characteristics. --
Exactly. We can cater to that somewhat by adding some simple hardware profiles, so that the IO schedulers if seeks etc are costly or not. But it's a good example none the less. -- Jens Axboe --
Ok.
After a brief offline discussion with paolo, here it is what we can do:
o propose a patch for discussion that uses our WF2Q+ variant to
schedule timeslices in cfq. The resulting scheduler would be
quite close to the EEVDF scheduler discussed some time ago for
the cpu.
o Introduce a timeout in bfq to give an upper time limit to the
slices. Since we have not experimented with that mixed approach
before[*], we will need to do some tests with relevant workloads to
see if/how it can work.
I fear that it will take some time, as we're both travelling in this
week.
[*] Anyway it is quite close to how cfq handles async queues, with their
slice_async and slice_async_rq, so it's definitely not something new.
--
Jumping in at random, does "process" here mean task or mms_struct? If the former, doesn't that mean that a 100-thread process can starve out a single-threaded process? Perhaps we need hierarchical io scheduling, like cfs has for the cpu. -- error compiling committee.c: too many arguments to function --
Hierarchical would simplify isolating groups of threads or processes. However, some simple solution is already available with bfq. For example, if you have to fairly share the disk bandwidth between the above 100 threads and another important thread, you get it by just assigning weight 1 to each of these 100 threads, and weight 100 to the important one. Paolo -- ----------------------------------------------------------- | Paolo Valente | | | Algogroup | | | Dip. Ing. Informazione | tel: +39 059 2056318 | | Via Vignolese 905/b | fax: +39 059 2056199 | | 41100 Modena | | | home: http://algo.ing.unimo.it/people/paolo/ | ----------------------------------------------------------- --
Doesn't work. If the 100-thread process wants to use just on thread for issuing I/O, it will be starved by the single-threaded process. [my example has process A with 100 threads, and process B with 1 thread, not a 101-thread process with one important thread] -- error compiling committee.c: too many arguments to function --
Right. I was thinking only about the case where all the 101 threads concurrently access the disk, and I just wanted to say that weights may offer more help than priorities in simple cases as this one. Apart from this, automatically recomputing weights as needed is most certainly a worse solution than hierarchical scheduling. Paolo --
task_struct. It would also be nice to do per-user I/O scheduling.. -- Aaron --
How do you figure that? This is a situation where time-slices work nicely, because they implicitly account for the performance penalty of poor access patterns. The sequential-accessing processes (and the system overall) ends up with higher throughput. -- Aaron --
The unfairness is not WRT tasks generating poor access patterns. If you have two tasks doing sequential accesses on two different regions of the disk the exact amount of service they receive in the same amount of time depends on the transfer rate of the disk on that regions, and, depending on the media, it is not always the same. We showed some example of that in the original post and it is quite easy to try it out if you put two partitions at the ends of a disk and you try to read from them concurrently. --
Ok... you're talking about ZBR. I'm not convinced this should be treated differently to, say, random vs. sequential workloads. You still end up with reduced global throughput as you've shown in the ``Short-term time guarantees'' table. It is an interesting case though... since the lower performance is not though fault of the process it doesn't seem fair to ``punish'' it. -- Aaron --
It is indeed a valid observation, but I think we are getting into details still. CFQ wants to provide fair access to the drive, it doesn't claim to be 100% fair wrt throughput or transfer sums at all costs. This is where fairness and real life for an all round scheduler divert somewhat. So while it IS true that you could have 40mb/sec at one end and 65mb/sec at the other and thus give the process at the start an 'unfair' share of bandwidth, it's honestly mostly a theoretical problem. I can envision some valid concerns for media streaming filling the entire drive, but then my solution would be to just bump the time slice if you are not meeting deadlines. I've never heard anyone complain about this issue. -- Jens Axboe --
Just a note about that table. The lower aggregate throughput of bfq is due to the fact that, because of the higher number of movies being read, a higher percentage of not-that-profitable accesses is being performed under bfq wrt to cfq. As shown in the complete logs of the aggregate throughput in the raw results, the aggregate throughput with bfq and cfq is practically the same when the number of movies is the same. The figure in the "Aggregate throughput" subsection is probably best suited for a comparison of the performance of the two schedulers with sequential accesses under the same conditions (the figure refers to the 2, 128 MB long, files, but we got virtually the same results in all the other tests). I do agree on that these experiments should be repeated with different (faster) devices. --
