[sorry for reposting, wrong subject]
Hi,
we are working to a new I/O scheduler based on CFQ, aiming at
improved predictability and fairness of the service, while maintaining
the high throughput it already provides.The patchset, too big for lkml posting, is available here:
http://feanor.sssup.it/~fabio/linux/bfq/patches/The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin
scheduling policy of time slices into a fair queueing scheduling
of sector budgets. More precisely, each task is assigned a budget
measured in number of sectors instead of amount of time, and budgets
are scheduled using a slightly modified version of WF2Q+. The
budget assigned to each task varies over time as a function of its
behaviour. However, one can set the maximum value of the budget
that BFQ can assign to any task.The time-based allocation of the disk service in CFQ, while having
the desirable effect of implicitly charging each application for
the seek time it incurs, suffers from unfairness problems also
towards processes making the best possible use of the disk bandwidth.
In fact, even if the same time slice is assigned to two processes,
they may get a different throughput each, as a function of the
positions on the disk of their requests. On the contrary, BFQ can
provide strong guarantees on bandwidth distribution because the
assigned budgets are measured in number of sectors. Moreover, due
to its Round Robin policy, CFQ is characterized by an O(N) worst-case
delay (jitter) in request completion time, where N is the number
of tasks competing for the disk. On the contrary, given the accurate
service distribution of the internal WF2Q+ scheduler, BFQ exhibits
O(1) delay.We made several tests to measure the aggregate throughput, long-term
bandwidth distribution and single-request completion time guaranteed
by CFQ and BFQ; what we present here was obtained with an outdated
version of the code, we are in the process of collecting data for
the current one (see [1])....
Hmm, 4k sectors is ~40 seconds worst case, no? That's quite long...
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Actually, in the worst case among our tests, the aggregate throughput
with 4k sectors was ~ 20 MB/s, hence the time for 4k sectors ~ 4k * 512
/ 20M = 100 ms.
About these test cases we repeated several measures of the aggregate
throughput with simultaneous file reads from 2 to 5 (and other varying
parameters). The lowest aggregate throughput for 4k sectors (~ 20 MB/s)
was achieved in case of 2, 128 MB long, files, lying on two slices at
the maximum distance from each other (more details on the experiments,
testbed and so on at http://algo.ing.unimo.it/people/paolo/disk_sched/).
I hope I didn't misunderstand your point.--
That's not worse case, it is pretty close to BEST case. Worst case is 4k
of sectors, with each being a 512b IO and causing a full stroke seek.
For that type of workload, even a modern sata hard drive will be doing
500kb/sec or less. That's rougly a thousand sectors per seconds, so ~4
seconds worst case for 4k sectors.--
Jens Axboe--
Yes. 100 ms is just the worst case among our tests with 4k, but these
In my opinion, the time-slice approach of cfq is definitely better
suited than the (sector) budget approach for this type of workloads. On
the opposite end, the price of time-slices is unfairness towards, e.g.,
threads doing sequential accesses. In bfq we were mainly thinking about
file copy, ftp, video streaming and so on. I was not able to find a good
solution for both types of workloads.BTW, there is also another possibility. The internal scheduler of bfq
may be used to schedule time-slices instead of budgets. By doing so, the
O(1) instead of O(N) delay/jitter would still be guaranteed (as it is
probably already clear, bfq is obtained from cfq by just turning slices
into budgets, and the Round Robin-like scheduling policy into a Weighted
Fair Queueuing one).--
How do you figure that? This is a situation where time-slices work nicely,
because they implicitly account for the performance penalty of poor access
patterns. The sequential-accessing processes (and the system overall) ends
up with higher throughput.-- Aaron
--
The unfairness is not WRT tasks generating poor access patterns.
If you have two tasks doing sequential accesses on two different
regions of the disk the exact amount of service they receive in the
same amount of time depends on the transfer rate of the disk on
that regions, and, depending on the media, it is not always the same.We showed some example of that in the original post and it is quite
easy to try it out if you put two partitions at the ends of a disk
and you try to read from them concurrently.
--
Ok... you're talking about ZBR.
I'm not convinced this should be treated differently to, say, random vs.
sequential workloads. You still end up with reduced global throughput as
you've shown in the ``Short-term time guarantees'' table. It is an
interesting case though... since the lower performance is not though fault
of the process it doesn't seem fair to ``punish'' it.-- Aaron
--
Just a note about that table. The lower aggregate throughput of bfq is
due to the fact that, because of the higher number of movies being read,
a higher percentage of not-that-profitable accesses is being performed
under bfq wrt to cfq. As shown in the complete logs of the aggregate
throughput in the raw results, the aggregate throughput with bfq and cfq
is practically the same when the number of movies is the same.
The figure in the "Aggregate throughput" subsection is probably best
suited for a comparison of the performance of the two schedulers with
sequential accesses under the same conditions (the figure refers to the
2, 128 MB long, files, but we got virtually the same results in all the
other tests).
I do agree on that these experiments should be repeated with different
(faster) devices.--
It is indeed a valid observation, but I think we are getting into
details still. CFQ wants to provide fair access to the drive, it doesn't
claim to be 100% fair wrt throughput or transfer sums at all costs. This
is where fairness and real life for an all round scheduler divert
somewhat.So while it IS true that you could have 40mb/sec at one end and 65mb/sec
at the other and thus give the process at the start an 'unfair' share of
bandwidth, it's honestly mostly a theoretical problem. I can envision
some valid concerns for media streaming filling the entire drive, but
then my solution would be to just bump the time slice if you are not
meeting deadlines. I've never heard anyone complain about this issue.--
Jens Axboe--
Which is fine, nothing wrong with a scheduler tuned for a specific
I was thinking about that too. Generally I've been opposed to doing
scheduling decisions on anything but time, since that is always
relevant. When to hand out slices and to what process, that algorithm is
really basic in CFQ and could do with an improvement.--
Jens Axboe--
Jumping in at random, does "process" here mean task or mms_struct? If
the former, doesn't that mean that a 100-thread process can starve out a
single-threaded process?Perhaps we need hierarchical io scheduling, like cfs has for the cpu.
--
error compiling committee.c: too many arguments to function--
task_struct. It would also be nice to do per-user I/O scheduling..
-- Aaron
--
Hierarchical would simplify isolating groups of threads or processes.
However, some simple solution is already available with bfq. For
example, if you have to fairly share the disk bandwidth between the
above 100 threads and another important thread, you get it by just
assigning weight 1 to each of these 100 threads, and weight 100 to the
important one.Paolo
--
-----------------------------------------------------------
| Paolo Valente | |
| Algogroup | |
| Dip. Ing. Informazione | tel: +39 059 2056318 |
| Via Vignolese 905/b | fax: +39 059 2056199 |
| 41100 Modena | |
| home: http://algo.ing.unimo.it/people/paolo/ |
-------------------------------------------------------------
Doesn't work. If the 100-thread process wants to use just on thread for
issuing I/O, it will be starved by the single-threaded process.[my example has process A with 100 threads, and process B with 1 thread,
not a 101-thread process with one important thread]--
error compiling committee.c: too many arguments to function--
Right. I was thinking only about the case where all the 101 threads
concurrently access the disk, and I just wanted to say that weights may
offer more help than priorities in simple cases as this one.
Apart from this, automatically recomputing weights as needed is most
certainly a worse solution than hierarchical scheduling.Paolo
--
Maybe there is also another middle-ground solution. I'll try to sketch
it out:
. use sectors instead of time
. impose a penalty to each thread in proportion to the distance between
its disk requests
. reduce the maximum budget of each thread as a function of this seek
penalty so as to prevent the thread from stealing more than a given time
slice (the simple mechanism to limit per-thread budget is already
implemented in bfq).By doing so, both fairness and time isolation should be guaranteed.
Finally, this policy should be safe in that, given the maximum time used
by a seeky thread to consume its maximum budget on a reference disk, the
time used on any faster disk should be shorter.Does it seem reasonable?
--
Not for CFQ, that will stay time based. The problem with #2 above is
that it then quickly turns to various heuristics, which is just
impossible to tune for general behaviour. Or it just falls apart for
other real life situations.--
Jens Axboe--
Ok.
After a brief offline discussion with paolo, here it is what we can do:
o propose a patch for discussion that uses our WF2Q+ variant to
schedule timeslices in cfq. The resulting scheduler would be
quite close to the EEVDF scheduler discussed some time ago for
the cpu.o Introduce a timeout in bfq to give an upper time limit to the
slices. Since we have not experimented with that mixed approach
before[*], we will need to do some tests with relevant workloads to
see if/how it can work.I fear that it will take some time, as we're both travelling in this
week.[*] Anyway it is quite close to how cfq handles async queues, with their
slice_async and slice_async_rq, so it's definitely not something new.
--
Like SSD or hardware RAID. Time-slices have the nice property of fairness
irrespective of the underlying hardware characteristics.
--
Exactly. We can cater to that somewhat by adding some simple hardware
profiles, so that the IO schedulers if seeks etc are costly or not. But
it's a good example none the less.--
Jens Axboe--
One seek is still 10msec on modern drive, right? So 4k seeks =
40seconds, no? 4seconds would correspond to 1msec per seek, which
seems low.writes with O_SYNC could force full seek on each request, right?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
I actually meant 4k ios, not 512b at that isn't really realistic. With
512b full device seeks, you are looking at probably 30kb/sec on a normal
7200rpm drive and that would be around a minute worst case time. The 4kb
number of 500kb/sec may even be a bit too high, doing a quick test here
shows a little less than 300kb/sec on this drive. So more than 4 secondsWrites generally work somewhat better due to caching, but doing O_DIRECT
512 byte reads all over the drive would exhibit worst case behaviour
easily.--
Jens Axboe--
A process with such a low throughput would be marked as seeky from
the heuristics implemented in cfq/bfq. Seeky processes are not
treated in the same way as sequential ones and they should not get
their full slice allocated, since they idle only for very short
periods.BTW looking at the code they can get a full slice, if they always
reissue requests fast enough - within BFQ_MIN_TT - and this is
definitely an issue/error in the current implementation (and we
didn't notice it when converting the code from time-based to
service-based allocation :) ).An easy solution (without changing the nature of bfq) would be
to use shorter slices for seeky queues, with the same mechanism
we already use for the async ones.
--
Fabio, I've merged the scheduler for some testing. Overall the code
looks great, you've done a good job!I didn't touch patches 2 and 3, but I rewrote #1 somewhat. See the
result here:http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=973a02c4ea1c3...
I'm sure you'll agree with the hlist_sched_*() functions. I also killed
the ->bfq_ioprio_changed modification, what exactly was the purpose of
that?The code is now in the 'bfq' branch of the block git repo.
--
Jens Axboe--
Here are some microbenchmark results. Test setup is a 2-way IA64 with a
single 15k RPM 73GiB SCSI disk with TCQ depth set to 1. Workloads are
generated with FIO: 128 processes issuing synchronous, O_DIRECT, 16KiB
block size requests.Figures are quoted as average (stdev). CFQ (i=0) means idle window
disabled. All other tunables are default.==================================x8=======================================
Random Readers
-----------------------------------------------
Latency (ms) Bandwidth (KiB/s)
-----------------------------------------------
CFQ 841.788 (4070.3) 2428.032 (23.1)
CFQ (i=0) 536.728 (216.9) 3841.024 (8.5)
BFQ 884.4 (8816.0) 2439.04 (1375.0)Sequential 1MiB Readers
-----------------------------------------------
Latency (ms) Bandwidth (KiB/s)
-----------------------------------------------
CFQ 2865.331 (737.2) 46866.048 (103.1)
CFQ (i=0) 2544.618 (1047.2) 52685.952 (294.2)
BFQ 2860.795 (419.1) 46850.944 (81.5)Clearly BFQ suffers from the same idle window problems as CFQ, but otherwise
the performance seems comparable in bandwidth terms. I'm guessing variability
in random workload service is due to max budget being too large compared to
CFQ's default time-slice. Sequential access looks nice and consistent, though.--
Hi, I'm Paolo Valente and I worked with Fabio on bfq. I'm very happy
about your interest in it. Thanks a lot :)--
of course the hlist_sched_*() functions are much better than what was
in the patch (the idea behind the patch was to not touch at all cfq code).the ->bfq_ioprio_changed was there to avoid that a process/ioc doing
i/o on multiple devices managed by cfq and bfq would see ioprio
changes only for one of the two schedulers.cfq_ioc_set_ioprio() (and its bfq counterpart bfq_ioc_set_ioprio())
unconditionally assign 0 to (bfq_)ioprio_changed, so the first
scheduler that sees the ioprio change eats the new priority values.
of course I may be wrong, but I think it (or some better mechanismof course we'll be glad to help in testing in any way you can find useful.
thank you!
--
As long as the changes at that point are straight forward and 'obviously
correct', there's no harm done. Have you thought about merging bfq andI see. If we can and will merge bfq and cfq, then it's not really an
issue. Otherwise, I'd suggest using bits 0 of ioprio_changed for cfq andI'll push it to the for-akpm branch as well, so it should show up in the
next -mm and get some testing there.--
Jens Axboe--
Well, I'm maintaining bfq as a modified version of cfq, and I use a
script to generate the bfq-iosched.c file. It allows keeping the common
code in sync.I've posted this version to allow comparisons between the two schedulers
and because I consider cfq a reference, more tested/stable scheduler. If you
are interested in it I can clean up the cfq patch in the next few days andYes, that's a better solution, at least for now.
[I'm sorry I cannot post a patch correcting it right now because I don'tOk, thank you very much.
--
| Hiten Pandya | Re: up? (emacs docbook xml ide) |
| Greg Kroah-Hartman | [PATCH 004/196] Chinese: add translation of SubmittingPatches |
| debian developer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Jan Engelhardt | intel iommu (Re: -mm merge plans for 2.6.23) |
git: | |
| Gerrit Renker | [PATCH 03/37] dccp: List management for new feature negotiation |
| Ingo Molnar | iwlwifi: fix build bug in "iwlwifi: fix LED stall" |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
