RE: I/O scheduler (aka dsched)

Previous thread: Re: bugtracker switch by Oliver Fromme on Monday, March 29, 2010 - 6:30 am. (5 messages)

Next thread: Re: I/O scheduler (aka dsched) by Matthew Dillon on Monday, March 29, 2010 - 8:48 pm. (2 messages)
From: Alex Hornung
Date: Monday, March 29, 2010 - 7:58 pm

Hi,

Over the past months I've been working intermittently on an I/O scheduler
framework and a fair queuing policy for DragonFly BSD. First off I want to
note that I'm not a Computer Scientist and that my code is probably
sub-optimal, completely off from what is considered an I/O scheduler in
scientific papers, etc.

The FQ policy should serve mainly as a proof of concept of what the
framework can do. It seems to work pretty decently on its own, as some
rather naive benchmarks[1] I did confirm. I really want to emphasize that
it's suboptimal: it has some problems that limit, for example, overall write
performance more than they should. Yet overall it should solve our extreme
interactivity issues. As the graphs[1] show, read performance with
ongoing writes has increased by about a factor of 3.

--

At this point I would like to make my work public and see some testing and
especially some reviews. You can either fetch my iosched-current branch on
leaf or just apply my patch[2] to the current master as of this writing;
although it probably also applies to older kernels.

The work basically consists of 4 parts:
- General system interfacing
- I/O scheduler framework (dsched)
- I/O scheduler fair queuing policy (dsched_fq or fq)
- userland tools (dschedctl and ionice)

--

After applying the patch you still won't notice any difference, as the
default scheduler is the so-called noop (no operation) scheduler; it
emulates our current behaviour. This can be confirmed by dschedctl -l:
# dschedctl -l
cd0     =>      noop
da0     =>      noop
acd0    =>      noop
ad0     =>      noop
fd0     =>      noop
md0     =>      noop
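
A small sketch for filtering that listing, assuming the output format is
exactly the "disk => policy" lines shown above. disks_with_policy is a
hypothetical helper name, not part of dsched:

```shell
# disks_with_policy POLICY
# Reads "disk => policy" lines on stdin (the format shown by dschedctl -l
# above) and prints the name of every disk that uses POLICY.
# Hypothetical helper for illustration only.
disks_with_policy() {
    awk -v want="$1" '$2 == "=>" && $3 == want { print $1 }'
}
```

It could be used as "dschedctl -l | disks_with_policy noop" to see which
disks are still on the default policy.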

--

To enable the fq policy on a disk you have two options:

1) set scheduler_{diskname}="fq" in /boot/loader.conf; e.g. if it should be
enabled for da0, then scheduler_da0="fq". Certain wildcards are also
understood, e.g. scheduler_da* or scheduler_*. Note that disk serial numbers
(sernos) are not supported (yet).

2) use dschedctl:
# dschedctl -s fq -d ...
From: Magnus Eriksson
Date: Tuesday, March 30, 2010 - 3:06 am

I think the principle of least surprise would suggest that it should work 
exactly like nice, rather than flip it around, or people will get confused 
why the misbehaving program continues to eat all IO even after they 
reniced it.  :-)


Magnus

From: Chris Turner
Date: Wednesday, March 31, 2010 - 5:08 pm

nice is actually intuitive - the higher the number, the 'nicer' the 
processes are to the rest of the system..

negative nice => "mean"

it's just *system* relative - not process relative..

perhaps a commentary on our modern individualistic nature, that we get 
this wrong..

or something..

shutting up now.

From: Alex Hornung
Date: Thursday, April 1, 2010 - 6:12 am

Right now this is the least of my concerns. If someone wants to use dsched
right now, he should know that he's using something experimental and hence
should read up on whatever documentation is available, including the nice
level inversion.

Eventually I might change it to be similar to the normal nice levels, but
even then it would cover only one direction (i.e. -10 to 0, not -20 to 20).


--


On another note, I really want to emphasize that if you want to try
dsched, don't use the patch that was attached to the first email; instead,
pull from my iosched-current branch on leaf, which is up to date.

I've made quite a few improvements since the original patch, mainly:

- changing the algorithm to estimate the disk usage percentage. Now it's
done right, by measuring the time the disk spends idle in one balancing
period. (thanks to corecode for the idea)

- due to the previous change, I have also been able to add a feedback
mechanism that tries to dispatch more requests if the disk becomes idle,
raising the rate limit if needed, even if all processes have already
reached their limit.

- moving the heavier balancing calculations out of the fq_balance thread and
into the context of the processes/threads that do I/O, as far as this is
possible. Some of the heavy balancing calculations will still occur in the
dispatch thread instead of the issuing context. (thanks to Aggelos for the
idea)

- ironing out a few bugs related to int32_t overflow.

- general cleanup & refactoring
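
As a rough illustration of the first two points, here is the arithmetic
in shell form. This is only a sketch under stated assumptions, not the
dsched_fq code: the function names, the microsecond units, the 90%
threshold and the 25% step are all hypothetical.

```shell
# busy_percent PERIOD_US IDLE_US
# Prints the integer percentage of one balancing period during which the
# disk was busy, derived from the measured idle time.
busy_percent() {
    period=$1
    idle=$2
    if [ "$period" -le 0 ]; then
        echo "busy_percent: period must be positive" >&2
        return 1
    fi
    # Clamp the idle time into [0, period] so the result stays in 0..100.
    if [ "$idle" -gt "$period" ]; then idle=$period; fi
    if [ "$idle" -lt 0 ]; then idle=0; fi
    echo $(( (period - idle) * 100 / period ))
}

# next_rate_limit CURRENT_LIMIT BUSY_PCT
# Feedback step: if the disk was not busy enough, raise the per-process
# rate limit so more requests can be dispatched; otherwise keep it.
# The 90% threshold and the 25% (+1) step are made-up example values.
next_rate_limit() {
    limit=$1
    busy=$2
    if [ "$busy" -lt 90 ]; then
        echo $(( limit + limit / 4 + 1 ))
    else
        echo "$limit"
    fi
}
```

For example, a 1000 us period with 250 us of idle time gives a busy
value of 75, which under these example values would raise a limit of
100 requests to 126.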


--


I also forgot to mention in my original email that there are some other
interesting tools/settings, mainly:

sysctl kern.dsched_debug: the higher the level, the more debug output you'll
get. By default no debug output is printed. At level 4, only the disk busy-%
is printed, and at 7 all details about the balancing are shown.
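
A minimal sketch of setting that level from a script. The wrapper name
set_dsched_debug is hypothetical; the only thing taken from the above is
the kern.dsched_debug sysctl itself, and the input check is just a guard
against typos:

```shell
# set_dsched_debug LEVEL
# Sets kern.dsched_debug after checking that LEVEL is a non-negative
# integer. Hypothetical wrapper; must be run as root on a dsched kernel.
set_dsched_debug() {
    level=$1
    case "$level" in
        ''|*[!0-9]*)
            echo "usage: set_dsched_debug <non-negative integer>" >&2
            return 1
            ;;
    esac
    sysctl kern.dsched_debug="$level"
}
```

Calling it with 4 would then show only the disk busy-%, and with 7 the
full balancing details, as described above.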

test/dsched_fq: If you build fqstats (just using 'make' in this directory),
you'll be able to read some of the statistics that dsched_fq keeps track of,
such as ...
From: Antonio Huete Jimenez
Date: Sunday, April 11, 2010 - 1:09 pm

Hi,

I've checked out the branch and run some tests over the past two days.
Although I didn't go further than some basic tests (music playing under
heavy I/O, recursive finds, X program startup during high I/O
activity, etc.) and a few more intensive ones (bonnie++ and, more
importantly, hammer reblocking), I'd like to share my subjective good
feelings about it.

For example, during a hammer cleanup music playback isn't jumpy anymore,
and applications like firefox are quite a bit happier than they were
before under certain loads. On the other hand, I've had some unexpected
freezes (machine not responding to any input from the mouse or
keyboard), but I think my hardware can take the blame for that,
because I have not been able to reproduce them on another box. Also, if you
switch the scheduler 5 or 6 times between fq and noop, a panic is
triggered, related to TAILQ handling on cleanup.

Besides that I would like to thank Alex for the great effort he's doing
and also encourage people to give the scheduler a try. I would really
like to see this in master soon, disabled by default, so people who are
on the bleeding edge can benefit from it if they want.

For those who would like to try and don't know exactly how to do it,
find below some quick instructions:

a) Switch to the master branch, make sure you don't have local changes
before pulling, and update master.
# cd /usr/src                   
# git checkout master
# git pull

b) Add alexh's personal repo to your remotes and update it.
# git remote add leaf_alexh git://leaf.dragonflybsd.org/~alexh/dragonfly.git
# git remote update leaf_alexh       

c) Check out the main scheduler branch and also bring in the latest changes from master
# git checkout -b iosched-current --no-track leaf_alexh/iosched-current
# git rebase master

d) Build and install everything
# make -j2 buildworld && make -j2 buildkernel KERNCONF=GENERIC
# sudo -E make installkernel KERNCONF=GENERIC && sudo make installworld \
    && sudo make upgrade

Hope you ...
From: Alex Hornung
Date: Tuesday, April 13, 2010 - 12:30 am

Any testing is welcome! I would have expected some more interest (and hence
more testers) as this addresses an issue that many people have experienced,

I've not been able to reproduce any system freezes and I'm not sure why they
would happen in the first place.

On the other hand, I've finally been able to solve the policy switching
issue, and this fix is now committed to my branch (commit
0e9144bec7970967edcd917909472d2dde8db23a). I've tried it by switching about

IMHO this should be the way to go, as the impact is virtually non-existent
with the 'noop' policy, which is currently default. Bringing in the code to
master would allow for a wider testing base, as more people might feel
adventurous enough to switch the policy to 'fq' for at least a while.
Just to reiterate: the 'noop' policy just passes the requests through, as
has happened so far, so no breakage is to be expected from just using the

Thanks for providing some instructions!


Cheers,
Alex Hornung

From: Petr Janda
Date: Tuesday, April 13, 2010 - 3:44 am

Hi Alex,

Thanks for dsched. So far it seems to work well. Haven't got a panic. It feels
maybe slightly more responsive using fq than noop. I don't know, I haven't
really "stress tested" it. One thing that I did notice though is that setting
scheduler_*="fq" doesn't work, but scheduler_da0="fq" does.

Petr
From: Alex Hornung
Date: Tuesday, April 13, 2010 - 11:51 am

Thanks! Didn't notice that the loader doesn't like '*'. I've changed the
tunables to the following:
dsched_pol_da0 = "foo" (replaces scheduler_da0 = "foo")
dsched_pol_da = "foo" (replaces scheduler_da* = "foo")
dsched_pol = "bar" (replaces scheduler_* = "bar")
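
To make the naming scheme concrete, here is a sketch that lists the
tunables that could apply to a given disk. The helper name is
hypothetical, and the ordering (per-disk first, then per-driver, then
the global default) is an assumption about precedence that is not
stated above:

```shell
# dsched_tunable_candidates DISK
# Prints the tunable names that could apply to DISK, one per line,
# assuming the most specific name is meant to win.
# Hypothetical helper for illustration only.
dsched_tunable_candidates() {
    disk=$1
    # Strip the trailing unit number to get the driver: da0 -> da.
    driver=$(printf '%s\n' "$disk" | sed 's/[0-9]*$//')
    printf '%s\n' "dsched_pol_${disk}" "dsched_pol_${driver}" "dsched_pol"
}
```

For da0 this prints dsched_pol_da0, dsched_pol_da and dsched_pol, which
matches the three replacements listed above.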

Cheers,
Alex Hornung

