"I'm happy to announce that I've implemented a Block I/O bandwidth controller," began Ryo Tsuruta, explaining that it was intended to be used in a cgroup or virtual machine environment, implemented as a device-mapper driver. He detailed a token-based implementation in which dm-band passes out to the various groups, "a group passes on I/O requests that its job issues to the underlying layer so long as it has tokens left, while requests are blocked if there aren't any tokens left in the group. One token is consumed each time the group passes on a request. Dm-band will refill groups with tokens once all of groups that have requests on a given physical device use up their tokens." Ryo explained:
"Dm-band is an I/O bandwidth controller implemented as a device-mapper driver. Several jobs using the same physical device have to share the bandwidth of the device. Dm-band gives bandwidth to each job according to its weight, which each job can set its own value to. At this time, a job is a group of processes with the same pid or pgrp or uid. There is also a plan to make it support cgroup. A job can also be a virtual machine such as KVM or Xen."
From: Ryo Tsuruta <ryov@...>
Subject: [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview
Date: Jan 23, 8:53 am 2008
Hi everyone,
I'm happy to announce that I've implemented a Block I/O bandwidth controller.
The controller is designed to be of use in a cgroup or virtual machine
environment. The current approach is that the controller is implemented as
a device-mapper driver.
What's dm-band all about?
========================
Dm-band is an I/O bandwidth controller implemented as a device-mapper driver.
Several jobs using the same physical device have to share the bandwidth of
the device. Dm-band gives bandwidth to each job according to its weight,
which each job can set its own value to.
At this time, a job is a group of processes with the same pid or pgrp or uid.
There is also a plan to make it support cgroup. A job can also be a virtual
machine such as KVM or Xen.
+------+ +------+ +------+   +------+ +------+ +------+
|cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
|  A   | |  B   | |others|   |  X   | |  Y   | |others|
+--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+
+--V----+---V---+----V---+   +--V----+---V---+----V---+
| group | group | default|   | group | group | default|  band groups
|       |       |  group |   |       |       |  group |
+-------+-------+--------+   +-------+-------+--------+
|         band1          |   |         band2          |  band devices
+-----------|------------+   +-----------|------------+
+-----------V--------------+-------------V------------+
|                          |                          |
|          sdb1            |           sdb2           |  physical devices
+--------------------------+--------------------------+
How dm-band works.
========================
Every band device has one band group, which by default is called the default
group.
Band devices can also have extra band groups in them. Each band group
has a job to support and a weight. Proportional to the weight, dm-band gives
tokens to the group.
A group passes on I/O requests that its job issues to the underlying
layer so long as it has tokens left, while requests are blocked
if there aren't any tokens left in the group. One token is consumed each
time the group passes on a request. Dm-band will refill groups with tokens
once all of groups that have requests on a given physical device use up their
tokens.
With this approach, a job running on a band group with large weight is
guaranteed to be able to issue a large number of I/O requests.
Getting started
=============
The following is a brief description how to control the I/O bandwidth of
disks. In this description, we'll take one disk with two partitions as an
example target.
You can also check the manual at Documentation/device-mapper/band.txt of the
linux kernel source tree for more information.
Create and map band devices
---------------------------
Create two band devices "band1" and "band2" and map them to "/dev/sda1"
and "/dev/sda2" respectively.
# echo "0 `blockdev --getsize /dev/sda1` band /dev/sda1 1" | dmsetup create band1
# echo "0 `blockdev --getsize /dev/sda2` band /dev/sda2 1" | dmsetup create band2
If the commands are successful then the device files "/dev/mapper/band1"
and "/dev/mapper/band2" will have been created.
Bandwidth control
----------------
In this example weights of 40 and 10 will be assigned to "band1" and
"band2" respectively. This is done using the following commands:
# dmsetup message band1 0 weight 40
# dmsetup message band2 0 weight 10
After these commands, "band1" can use 80% --- 40/(40+10)*100 --- of the
bandwidth of the physical disk "/dev/sda" while "band2" can use 20%.
Additional bandwidth control
---------------------------
In this example two extra band groups are created on "band1".
The first group consists of all the processes with user-id 1000 and the
second group consists of all the processes with user-id 2000. Their
weights are 30 and 20 respectively.
Firstly the band group type of "band1" is set to "user".
Then, the user-id 1000 and 2000 groups are attached to "band1".
Finally, weights are assigned to the user-id 1000 and 2000 groups.
# dmsetup message band1 0 type user
# dmsetup message band1 0 attach 1000
# dmsetup message band1 0 attach 2000
# dmsetup message band1 0 weight 1000:30
# dmsetup message band1 0 weight 2000:20
Now the processes in the user-id 1000 group can use 30% ---
30/(30+20+40+10)*100 --- of the bandwidth of the physical disk.
Band Device    Band Group                        Weight
band1          user id 1000                      30
band1          user id 2000                      20
band1          default group(the other users)    40
band2          default group                     10
Remove band devices
-------------------
Remove the band devices when no longer used.
# dmsetup remove band1
# dmsetup remove band2
TODO
========================
- Cgroup support.
- Control read and write requests separately.
- Support WRITE_BARRIER.
- Optimization.
- More configuration tools. Or is the dmsetup command sufficient?
- Other policies to schedule BIOs. Or is the weight policy sufficient?
Thanks,
Ryo Tsuruta
--
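The percentages quoted in the mail follow from dividing each weight by the sum of all weights on the same physical disk. A small Python check of that arithmetic, where the function name bandwidth_shares and the group labels are made up for illustration:

```python
def bandwidth_shares(weights):
    """Return each group's share of the disk, in percent, from its weight."""
    total = sum(weights.values())
    return {name: 100.0 * w / total for name, w in weights.items()}


# First example from the mail: two band devices, default groups only.
two_devices = bandwidth_shares({"band1": 40, "band2": 10})

# Second example: two user groups attached to band1.
with_user_groups = bandwidth_shares({
    "band1/uid 1000": 30,
    "band1/uid 2000": 20,
    "band1/default": 40,
    "band2/default": 10,
})

print(two_devices)
print(with_user_groups)
```

The first call gives 80% and 20%, and the second gives 30% for the user-id 1000 group, matching the figures in the mail.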
From: Peter Zijlstra <a.p.zijlstra@...>
Subject: Re: [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview
Date: Jan 23, 10:32 am 2008
On Wed, 2008-01-23 at 21:53 +0900, Ryo Tsuruta wrote:
> Hi everyone,
>
> I'm happy to announce that I've implemented a Block I/O bandwidth controller.
> The controller is designed to be of use in a cgroup or virtual machine
> environment. The current approach is that the controller is implemented as
> a device-mapper driver.
What definition of bandwidth does it use? Does it for example account
for seek latency?
--
From: Ryo Tsuruta <ryov@...>
Subject: Re: [PATCH 0/2] dm-band: The I/O bandwidth controller: Overview
Date: Jan 23, 1:25 pm 2008
Hi Peter,
> What definition of bandwidth does it use? Does it for example account
> for seek latency?
The bandwidth in dm-band is determined by the proportion of the
processing time of each device's tokens(I/Os) to the processing time
of all device's tokens(I/Os).
The processing time of one token(I/O) is determined by one I/O cycle
include seek latency, interrupt latency, etc...
Thanks,
Ryo Tsuruta <ryov@valinux.co.jp>
--
Someone explain in layman's terms
Could someone please explain in layman's terms to a noob how this is good and useful in a real-world scenario?
Sometimes, when you have several processes (programs) competing for the same block device (a block device is anything that reads or writes data in blocks, for example hard drives and DVD drives), the processes have to wait for each other.
If you have one process that can easily wait (like a search indexer), and one process you want to complete as fast as possible (say, saving your just-finished budget, so you can get to smashing tanks in Scorched 3D), the bandwidth controller can help you with that by only letting the search indexer write a little bit of data.
On a server, it could be the company's main database getting priority over programs started by the administrator, so he doesn't have to worry about cleaning up logfiles in the middle of the busiest hours.
That said, I don't know HOW MUCH it can do about hard drive access. Since the drive has to move its head back and forth even to read or write a little bit, small accesses can still disturb other accesses.
Oh, okay
Is this only useful for hard disks and DVD drives, or is it useful for solid-state disks too?
Because in a couple of years, hard disks might be obsolete due to solid-state disks.
I think it'll be even MORE useful for SSDs, as you don't have the problem with the moving head.
The quoted mail notes that it calculates the bandwidth as a proportion of the total amount of IO requests that succeed on the device. Presumably it calculates a smoothed average, and the expectation is that future load will be similar to past load, not an unreasonable assumption.
In case the assumption turns out to be wrong, the new IO bandwidth should eventually correct the average, and thus set a new expectation. The comments are a little bit thin on the details, but regardless, some tolerance of seeking must be built in.
I'd expect some unfairness with wildly different loads applied to two band devices on the same physical media. The extreme case is random reads all over one band device with sequential reads on the other. It could penalize the sequential side in favour of the seeking side, because the count of blocks readable sequentially is so much higher than the count of blocks readable with seeking per time window. Sequential operations tend to succeed very quickly, so merely counting IO requests might make it look like time is up for the sequential load, and then the majority of the time would be spent on the seek task.
I would appreciate simple time-based sharing. Regardless of progress made on the task or number of IO requests, one simply allocates, say, 200 ms for one task and 800 ms for the other per second. I'd consider it only fair that a task which does plenty of seek-style load would make proportionally less progress than tasks that make better use of the physical media's inherent properties. If the time granule is large enough, say, one second, you can be virtually certain that the performance of the split-up device would closely approximate the allocated slice percentages, in this case 20% and 80%. If the time granule is too small, seeks to and fro between the tasks would tend to disturb the accuracy of the split.
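A back-of-the-envelope sketch of the concern raised above, in Python. The per-request service times are assumptions (8 ms for a seeking request, 0.1 ms for a sequential one), as are the function names; nothing here is measured or taken from dm-band.

```python
# Assumed per-request service times in microseconds (illustrative only).
cost_us = {"seeky": 8000, "sequential": 100}


def device_time_split(tokens, cost_us):
    """Fraction of device time each load occupies during one refill cycle
    when every request consumes exactly one token."""
    busy = {name: tokens[name] * cost_us[name] for name in tokens}
    total = sum(busy.values())
    return {name: t / total for name, t in busy.items()}


def requests_per_slice(slice_us, cost_us):
    """Requests each load completes if it is given a fixed time slice instead."""
    return {name: slice_us[name] // cost_us[name] for name in slice_us}


# Equal token counts: the seeking load ends up holding almost all the device time.
split = device_time_split({"seeky": 100, "sequential": 100}, cost_us)

# Time-based sharing: 200 ms and 800 ms out of every second.
per_second = requests_per_slice({"seeky": 200_000, "sequential": 800_000}, cost_us)

print(split)
print(per_second)
```

Under these assumed costs, equal token counts leave the seeking load with more than 98% of the device time, while the 200 ms / 800 ms split lets it complete 25 requests per second against 8000 for the sequential load.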
Can't you compensate, though?
If you have a seeky load in parallel with a streaming load, couldn't you give the seeky load a correspondingly lower token allocation based on its expected relative cost-per-I/O?
I guess if you have a job that alternates between streaming and seeking, it could be hard to allocate statically. (ISO mastering, perhaps, as it shifts between metadata gathering, packaging large files, and packaging small files?) That is, the allocation won't evolve as I/O patterns shift during a job. That's the real place where the bandwidth allocation doesn't meet the workload.
I guess it's time for measurement of actual device utilization. Your conjecture is definitely one worth testing. The hypothesis is that this mechanism might actually prefer seeking over streaming, because all counting is in terms of I/O requests and so the two are given equal weight. You might succeed in ratioing the available bandwidth while also decreasing the available bandwidth. Example: Without this mechanism, a streamer won't block as much as a seeker (per I/O), getting more bandwidth as a result of "good behavior." With this mechanism, if you give them an equal number of tokens, the streamer can go no faster than the seeker (in terms of I/O request rate), and overall device utilization potentially drops, since it's not clear that the seeker's speed improved as a result.
One question though: Do large linear reads/writes come down as small numbers of large requests, or large numbers of small requests that happen to be sequential? If linear requests come through as small numbers of large requests, then perhaps this is accounted for already at higher levels.
Speaking of time-based ratioing: A 1000ms time granule like you suggest might be great for utilization, but wouldn't that totally kill responsiveness? That might work for a batch processing machine, but not for anything remotely interactive. You could literally be telling processes "wait a second." Maybe a 1000ms averaging window makes sense since seeks are on the order of 8ms, but you probably want to pivot more often than that, say every 100ms tops. On the flip side, you probably do want to send as many I/Os to the drive as possible so it can reorder to minimize seek penalty based on its intrinsic knowledge of the drive layout.
I/O is hard. Let's go shopping. ;-)
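The compensation idea from this comment can be sketched in Python: pick token weights proportional to the desired share of device time divided by the expected cost per request. The costs (8 ms per seeking request, 0.1 ms per sequential request) and the function name compensated_weights are assumptions for illustration, and the result is only as good as the static cost estimate, which is exactly the weakness the comment points out.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def compensated_weights(target_share, cost_us):
    """Smallest integer token weights that give each load its target share
    of device time, given an assumed fixed cost per request."""
    # Tokens needed are proportional to (time share) / (time per request).
    raw = {n: Fraction(target_share[n], cost_us[n]) for n in target_share}
    # Scale the fractions up to integers using the LCM of their denominators.
    denom_lcm = reduce(lambda a, b: a * b // gcd(a, b),
                       (f.denominator for f in raw.values()), 1)
    ints = {n: int(f * denom_lcm) for n, f in raw.items()}
    # Reduce to the smallest equivalent integer ratio.
    common = reduce(gcd, ints.values())
    return {n: v // common for n, v in ints.items()}


cost_us = {"seeky": 8000, "sequential": 100}    # assumed service times
target_share = {"seeky": 20, "sequential": 80}  # desired split of device time

weights = compensated_weights(target_share, cost_us)
print(weights)
```

With these numbers the seeking load would get 1 token for every 320 given to the sequential load, which works out to 8 ms of seeking for every 32 ms of streaming, the intended 20/80 split of device time.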
Looks like a typo in the link. When I follow it, I end up on a thread with a heading like this: [PATCH 01/20 -v5] printk - dont wakeup klogd with interrupts disabled
Multipathing options
Hello, I was just wondering if this will interface well with multipathing/RAID. For instance, can I stack this on top of a RAID 5 (over iSCSI or ATAoE) which has short-stroked bands in the partition scheme? Is this implementation thread safe, or should I worry about that at the file system level?
I've been forced to do a manual implementation of this concept since this didn't exist when I set up my storage backend, so I've got the structure, but adding the throttling capabilities on top of it would make my life a lot easier. I could lay this on top of my existing /dev/mdXs and throttle IO as it goes to each device over the wire, instead of tweaking the disk scheduler on the back end server to deal with whatever is thrown at it in a generic way. Well done, keep up the good work!