Re: [RESEND][RFC] BFQ I/O Scheduler

Previous thread: none

Next thread: how to debug: processes with -D (uninterruptible sleep) by Ram on Tuesday, April 1, 2008 - 8:38 am. (1 message)
From: Fabio Checconi
Date: Tuesday, April 1, 2008 - 8:29 am

[sorry for reposting, wrong subject]

Hi,
    we are working to a new I/O scheduler based on CFQ, aiming at
improved predictability and fairness of the service, while maintaining
the high throughput it already provides.

The patchset, too big for lkml posting, is available here:
    http://feanor.sssup.it/~fabio/linux/bfq/patches/

The Budget Fair Queueing (BFQ) scheduler turns the CFQ Round-Robin
scheduling policy of time slices into a fair queueing scheduling
of sector budgets.  More precisely, each task is assigned a budget
measured in number of sectors instead of amount of time, and budgets
are scheduled using a slightly modified version of WF2Q+.  The
budget assigned to each task varies over time as a function of its
behaviour.  However, one can set the maximum value of the budget
that BFQ can assign to any task.

The time-based allocation of the disk service in CFQ, while having
the desirable effect of implicitly charging each application for
the seek time it incurs, suffers from unfairness problems also
towards processes making the best possible use of the disk bandwidth.
In fact, even if the same time slice is assigned to two processes,
they may get a different throughput each, as a function of the
positions on the disk of their requests.  On the contrary, BFQ can
provide strong guarantees on bandwidth distribution because the
assigned budgets are measured in number of sectors.  Moreover, due
to its Round Robin policy, CFQ is characterized by an O(N) worst-case
delay (jitter) in request completion time, where N is the number
of tasks competing for the disk.  On the contrary, given the accurate
service distribution of the internal WF2Q+ scheduler, BFQ exhibits
O(1) delay.

We made several tests to measure the aggregate throughput, long-term
bandwidth distribution and single-request completion time guaranteed
by CFQ and BFQ; what we present here was obtained with an outdated
version of the code, we are in the process of collecting data for
the current one (see ...
From: Jens Axboe
Date: Tuesday, April 15, 2008 - 1:22 am

Fabio, I've merged the scheduler for some testing. Overall the code
looks great, you've done a good job!

I didn't touch patches 2 and 3, but I rewrote #1 somewhat. See the
result here:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=973a02c4ea1c324c41e45b69c07...

I'm sure you'll agree with the hlist_sched_*() functions. I also killed
the ->bfq_ioprio_changed modification, what exactly was the purpose of
that?

The code is now in the 'bfq' branch of the block git repo.

-- 
Jens Axboe

--

From: Fabio Checconi
Date: Tuesday, April 15, 2008 - 2:11 am

of course the hlist_sched_*() functions are much better than what was
in the patch (the idea behind the patch was to not touch at all cfq code).

the ->bfq_ioprio_changed was there to avoid that a process/ioc doing
i/o on multiple devices managed by cfq and bfq would see ioprio
changes only for one of the two schedulers.

cfq_ioc_set_ioprio() (and its bfq counterpart bfq_ioc_set_ioprio())
unconditionally assign 0 to (bfq_)ioprio_changed, so the first
scheduler that sees the ioprio change eats the new priority values.
of course I may be wrong, but I think it (or some better mechanism

of course we'll be glad to help in testing in any way you can find useful.

thank you!
--

From: Jens Axboe
Date: Tuesday, April 15, 2008 - 5:42 am

As long as the changes at that point are straight forward and 'obviously
correct', there's no harm done. Have you thought about merging bfq and

I see. If we can and will merge bfq and cfq, then it's not really an
issue. Otherwise, I'd suggest using bits 0 of ioprio_changed for cfq and

I'll push it to the for-akpm branch as well, so it should show up in the
next -mm and get some testing there.

-- 
Jens Axboe

--

From: Fabio Checconi
Date: Tuesday, April 15, 2008 - 11:08 am

Well, I'm maintaining bfq as a modified version of cfq, and I use a
script to generate the bfq-iosched.c file.  It allows keeping the common
code in sync.

I've posted this version to allow comparisons between the two schedulers
and because I consider cfq a reference, more tested/stable scheduler.  If you
are interested in it I can clean up the cfq patch in the next few days and

Yes, that's a better solution, at least for now.
[I'm sorry I cannot post a patch correcting it right now because I don't

Ok, thank you very much.
--

From: Paolo Valente
Date: Tuesday, April 15, 2008 - 11:48 pm

Hi, I'm Paolo Valente and I worked with Fabio on bfq. I'm very happy 
about your interest in it. Thanks a lot :)

--

From: Aaron Carroll
Date: Thursday, April 17, 2008 - 6:26 pm

Here are some microbenchmark results.  Test setup is a 2-way IA64 with a
single 15k RPM 73GiB SCSI disk with TCQ depth set to 1.  Workloads are
generated with FIO: 128 processes issuing synchronous, O_DIRECT, 16KiB
block size requests.

Figures are quoted as average (stdev).  CFQ (i=0) means idle window
disabled.  All other tunables are default.

==================================x8=======================================

                Random Readers
-----------------------------------------------
           Latency (ms)       Bandwidth (KiB/s)
-----------------------------------------------
CFQ        841.788 (4070.3)   2428.032 (23.1)
CFQ (i=0)  536.728 (216.9)    3841.024 (8.5)
BFQ        884.4 (8816.0)     2439.04 (1375.0)


            Sequential 1MiB Readers
-----------------------------------------------
           Latency (ms)       Bandwidth (KiB/s)
-----------------------------------------------
CFQ        2865.331 (737.2)   46866.048 (103.1)
CFQ (i=0)  2544.618 (1047.2)  52685.952 (294.2)
BFQ        2860.795 (419.1)   46850.944 (81.5)


Clearly BFQ suffers from the same idle window problems as CFQ, but otherwise
the performance seems comparable in bandwidth terms.  I'm guessing variability
in random workload service is due to max budget being too large compared to
CFQ's default time-slice.  Sequential access looks nice and consistent, though.

--

From: Pavel Machek
Date: Wednesday, April 16, 2008 - 11:44 am

Hmm, 4k sectors is ~40 seconds worst case, no? That's quite long...
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Fabio Checconi
Date: Thursday, April 17, 2008 - 2:14 am

A process with such a low throughput would be marked as seeky from
the heuristics implemented in cfq/bfq.  Seeky processes are not
treated in the same way as sequential ones and they should not get
their full slice allocated, since they idle only for very short
periods.

BTW looking at the code they can get a full slice, if they always
reissue requests fast enough - within BFQ_MIN_TT - and this is
definitely an issue/error in the current implementation (and we
didn't notice it when converting the code from time-based to
service-based allocation :) ).

An easy solution (without changing the nature of bfq) would be
to use shorter slices for seeky queues, with the same mechanism
we already use for the async ones.
--

From: Paolo Valente
Date: Wednesday, April 16, 2008 - 11:14 pm

Actually, in the worst case among our tests, the aggregate throughput 
with 4k sectors was ~ 20 MB/s, hence the time for 4k sectors ~ 4k * 512 
/ 20M = 100 ms.
About these test cases we repeated several measures of the aggregate 
throughput with simultaneous file reads from 2 to 5 (and other varying 
parameters). The lowest aggregate throughput for 4k sectors (~ 20 MB/s) 
was achieved in case of 2, 128 MB long, files, lying on two slices at 
the maximum distance from each other (more details on the experiments, 
testbed and so on at http://algo.ing.unimo.it/people/paolo/disk_sched/).
I hope I didn't misunderstand your point.

--

From: Jens Axboe
Date: Thursday, April 17, 2008 - 12:10 am

That's not worse case, it is pretty close to BEST case. Worst case is 4k
of sectors, with each being a 512b IO and causing a full stroke seek.
For that type of workload, even a modern sata hard drive will be doing
500kb/sec or less. That's rougly a thousand sectors per seconds, so ~4
seconds worst case for 4k sectors.

-- 
Jens Axboe

--

From: Pavel Machek
Date: Thursday, April 17, 2008 - 1:48 am

One seek is still 10msec on modern drive, right? So 4k seeks =
40seconds, no? 4seconds would correspond to 1msec per seek, which
seems low.

writes with O_SYNC could force full seek on each request, right?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Jens Axboe
Date: Thursday, April 17, 2008 - 1:57 am

I actually meant 4k ios, not 512b at that isn't really realistic. With
512b full device seeks, you are looking at probably 30kb/sec on a normal
7200rpm drive and that would be around a minute worst case time. The 4kb
number of 500kb/sec may even be a bit too high, doing a quick test here
shows a little less than 300kb/sec on this drive. So more than 4 seconds

Writes generally work somewhat better due to caching, but doing O_DIRECT
512 byte reads all over the drive would exhibit worst case behaviour
easily.

-- 
Jens Axboe

--

From: Paolo Valente
Date: Thursday, April 17, 2008 - 1:26 am

Yes. 100 ms is just the worst case among our tests with 4k, but these 
In my opinion, the time-slice approach of cfq is definitely better 
suited than the (sector) budget approach for this type of workloads. On 
the opposite end, the price of time-slices is unfairness towards, e.g., 
threads doing sequential accesses. In bfq we were mainly thinking about 
file copy, ftp, video streaming and so on. I was not able to find a good 
solution for both types of workloads.

BTW, there is also another possibility. The internal scheduler of bfq 
may be used to schedule time-slices instead of budgets. By doing so, the 
O(1) instead of O(N) delay/jitter would still be guaranteed (as it is 
probably already clear, bfq is obtained from cfq by just turning slices 
into budgets, and the Round Robin-like scheduling policy into a Weighted 
Fair Queueuing one).


--

From: Jens Axboe
Date: Thursday, April 17, 2008 - 1:30 am

Which is fine, nothing wrong with a scheduler tuned for a specific

I was thinking about that too. Generally I've been opposed to doing
scheduling decisions on anything but time, since that is always
relevant. When to hand out slices and to what process, that algorithm is
really basic in CFQ and could do with an improvement.

-- 
Jens Axboe

--

From: Paolo Valente
Date: Thursday, April 17, 2008 - 2:24 am

Maybe there is also another middle-ground solution. I'll try to sketch 
it out:
. use sectors instead of time
. impose a penalty to each thread in proportion to the distance between 
its disk requests
. reduce the maximum budget of each thread as a function of this seek 
penalty so as to prevent the thread from stealing more than a given time 
slice (the simple mechanism to limit per-thread budget is already 
implemented in bfq).

By doing so, both fairness and time isolation should be guaranteed.
Finally, this policy should be safe in that, given the maximum time used 
by a seeky thread to consume its maximum budget on a reference disk, the 
time used on any faster disk should be shorter.

Does it seem reasonable?

--

From: Jens Axboe
Date: Thursday, April 17, 2008 - 2:27 am

Not for CFQ, that will stay time based. The problem with #2 above is
that it then quickly turns to various heuristics, which is just
impossible to tune for general behaviour. Or it just falls apart for
other real life situations.

-- 
Jens Axboe

--

From: Aaron Carroll
Date: Thursday, April 17, 2008 - 3:19 am

Like SSD or hardware RAID.  Time-slices have the nice property of fairness
irrespective of the underlying hardware characteristics.
--

From: Jens Axboe
Date: Thursday, April 17, 2008 - 3:21 am

Exactly. We can cater to that somewhat by adding some simple hardware
profiles, so that the IO schedulers if seeks etc are costly or not. But
it's a good example none the less.

-- 
Jens Axboe

--

From: Fabio Checconi
Date: Thursday, April 17, 2008 - 4:30 am

Ok.

After a brief offline discussion with paolo, here it is what we can do:

    o propose a patch for discussion that uses our WF2Q+ variant to
      schedule timeslices in cfq.  The resulting scheduler would be
      quite close to the EEVDF scheduler discussed some time ago for
      the cpu.

    o Introduce a timeout in bfq to give an upper time limit to the
      slices.  Since we have not experimented with that mixed approach
      before[*], we will need to do some tests with relevant workloads to
      see if/how it can work.

I fear that it will take some time, as we're both travelling in this
week.

[*] Anyway it is quite close to how cfq handles async queues, with their
slice_async and slice_async_rq, so it's definitely not something new.
--

From: Avi Kivity
Date: Thursday, April 17, 2008 - 8:19 am

Jumping in at random, does "process" here mean task or mms_struct?  If 
the former, doesn't that mean that a 100-thread process can starve out a 
single-threaded process?

Perhaps we need hierarchical io scheduling, like cfs has for the cpu.

-- 
error compiling committee.c: too many arguments to function

--

From: Paolo Valente
Date: Thursday, April 17, 2008 - 8:47 am

Hierarchical would simplify isolating groups of threads or processes.
However, some simple solution is already available with bfq. For 
example, if you have to fairly share the disk bandwidth between the 
above 100 threads and another important thread, you get it by just 
assigning weight 1 to each of these 100 threads, and weight 100 to the 
important one.

Paolo

-- 
-----------------------------------------------------------
| Paolo Valente              |                            |
| Algogroup                  |                            |
| Dip. Ing. Informazione     | tel:   +39 059 2056318     |
| Via Vignolese 905/b	     | fax:   +39 059 2056199     |
| 41100 Modena               | 				  |
|     home:  http://algo.ing.unimo.it/people/paolo/       |
-----------------------------------------------------------

--

From: Avi Kivity
Date: Thursday, April 17, 2008 - 8:51 am

Doesn't work.  If the 100-thread process wants to use just on thread for 
issuing I/O, it will be starved by the single-threaded process.

[my example has process A with 100 threads, and process B with 1 thread, 
not a 101-thread process with one important thread]


-- 
error compiling committee.c: too many arguments to function

--

From: Paolo Valente
Date: Thursday, April 17, 2008 - 11:12 am

Right. I was thinking only about the case where all the 101 threads 
concurrently access the disk, and I just wanted to say that weights may 
offer more help than priorities in simple cases as this one.
Apart from this, automatically recomputing weights as needed is most 
certainly a worse solution than hierarchical scheduling.

Paolo

--

From: Aaron Carroll
Date: Thursday, April 17, 2008 - 4:44 pm

task_struct.  It would also be nice to do per-user I/O scheduling..

 -- Aaron
--

From: Aaron Carroll
Date: Thursday, April 17, 2008 - 3:24 am

How do you figure that?  This is a situation where time-slices work nicely,
because they implicitly account for the performance penalty of poor access
patterns.  The sequential-accessing processes (and the system overall) ends
up with higher throughput.
 
 -- Aaron

--

From: Fabio Checconi
Date: Thursday, April 17, 2008 - 4:14 am

The unfairness is not WRT tasks generating poor access patterns.
If you have two tasks doing sequential accesses on two different
regions of the disk the exact amount of service they receive in the
same amount of time depends on the transfer rate of the disk on
that regions, and, depending on the media, it is not always the same.

We showed some example of that in the original post and it is quite
easy to try it out if you put two partitions at the ends of a disk
and you try to read from them concurrently.
--

From: Aaron Carroll
Date: Thursday, April 17, 2008 - 5:14 am

Ok... you're talking about ZBR.

I'm not convinced this should be treated differently to, say, random vs.
sequential workloads.  You still end up with reduced global throughput as
you've shown in the ``Short-term time guarantees'' table.  It is an
interesting case though... since the lower performance is not though fault
of the process it doesn't seem fair to ``punish'' it.

 -- Aaron

--

From: Jens Axboe
Date: Thursday, April 17, 2008 - 6:54 am

It is indeed a valid observation, but I think we are getting into
details still. CFQ wants to provide fair access to the drive, it doesn't
claim to be 100% fair wrt throughput or transfer sums at all costs. This
is where fairness and real life for an all round scheduler divert
somewhat.

So while it IS true that you could have 40mb/sec at one end and 65mb/sec
at the other and thus give the process at the start an 'unfair' share of
bandwidth, it's honestly mostly a theoretical problem. I can envision
some valid concerns for media streaming filling the entire drive, but
then my solution would be to just bump the time slice if you are not
meeting deadlines. I've never heard anyone complain about this issue.

-- 
Jens Axboe

--

From: Paolo Valente
Date: Thursday, April 17, 2008 - 8:18 am

Just a note about that table. The lower aggregate throughput of bfq is 
due to the fact that, because of the higher number of movies being read, 
a higher percentage of not-that-profitable accesses is being performed 
under bfq wrt to cfq. As shown in the complete logs of the aggregate 
throughput in the raw results, the aggregate throughput with bfq and cfq 
is practically the same when the number of movies is the same.
The figure in the "Aggregate throughput" subsection is probably best 
suited for a comparison of the performance of the two schedulers with 
sequential accesses under the same conditions (the figure refers to the 
2, 128 MB long, files, but we got virtually the same results in all the 
other tests).
I do agree on that these experiments should be repeated with different 
(faster) devices.



--

Previous thread: none

Next thread: how to debug: processes with -D (uninterruptible sleep) by Ram on Tuesday, April 1, 2008 - 8:38 am. (1 message)