Hi everyone,

I'm happy to announce that I've implemented a Block I/O bandwidth controller.
The controller is designed to be of use in a cgroup or virtual machine
environment. The current approach is that the controller is implemented as
a device-mapper driver.

What's dm-band all about?
========================
Dm-band is an I/O bandwidth controller implemented as a device-mapper driver.
Several jobs using the same physical device have to share the bandwidth of
the device. Dm-band gives bandwidth to each job according to its weight,
which each job can set its own value to.

At this time, a job is a group of processes with the same pid or pgrp or uid.
There is also a plan to make it support cgroup. A job can also be a virtual
machine such as KVM or Xen.

  +------+ +------+ +------+   +------+ +------+ +------+
  |cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
  |  A   | |  B   | |others|   |  X   | |  Y   | |others|
  +--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+
  +--V----+---V---+----V---+   +--V----+---V---+----V---+
  | group | group | default|   | group | group | default|  band groups
  |       |       |  group |   |       |       |  group |
  +-------+-------+--------+   +-------+-------+--------+
  |         band1          |   |         band2          |  band devices
  +-----------|------------+   +-----------|------------+
  +-----------V--------------+-------------V------------+
  |                          |                          |
  |          sdb1            |           sdb2           |  physical devices
  +--------------------------+--------------------------+

How dm-band works.
========================
Every band device has one band group, which by default is called the default
group.

Band devices can also have extra band groups in them. Each band group
has a job to support and a weight. Proportional to the weight, dm-band gives
tokens to the group.

A group passes on I/O requests that its job issues to the ...
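As a rough illustration of the weight-proportional token idea described above, here is a minimal userspace sketch. It is not taken from the patch: the names `band_group`, `refill_tokens` and `try_issue_io` are hypothetical, and the rule that passing one I/O request costs exactly one token is an assumption, since the description above is cut off before it says how tokens are consumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical band group; field names are illustrative, not from the patch. */
struct band_group {
	unsigned int weight;	/* relative share configured for the group */
	unsigned int tokens;	/* tokens left in the current round */
};

/* Hand out `total` tokens in proportion to each group's weight. */
void refill_tokens(struct band_group *g, size_t n, unsigned int total)
{
	unsigned long long sum = 0;
	size_t i;

	assert(g != NULL || n == 0);
	for (i = 0; i < n; i++)
		sum += g[i].weight;
	for (i = 0; i < n; i++)
		g[i].tokens = sum ?
			(unsigned int)((unsigned long long)total * g[i].weight / sum) : 0;
}

/*
 * Assumption: passing one I/O request costs one token.
 * Returns 1 if the request may pass, 0 if the group is out of tokens.
 */
int try_issue_io(struct band_group *g)
{
	assert(g != NULL);
	if (g->tokens == 0)
		return 0;
	g->tokens--;
	return 1;
}
```

With three groups weighted 40, 20 and 10 and a round of 70 tokens, `refill_tokens()` hands out 40, 20 and 10 tokens, so a group with a larger weight can pass more requests before it has to wait.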
Here is the patch of dm-band.
Based on 2.6.23.14
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
diff -uprN linux-2.6.23.14.orig/drivers/md/Kconfig linux-2.6.23.14/drivers/md/Kconfig
--- linux-2.6.23.14.orig/drivers/md/Kconfig 2008-01-15 05:49:56.000000000 +0900
+++ linux-2.6.23.14/drivers/md/Kconfig 2008-01-21 16:09:41.000000000 +0900
@@ -276,4 +276,13 @@ config DM_DELAY
If unsure, say N.
+config DM_BAND
+ tristate "I/O band width control "
+ depends on BLK_DEV_DM
+ ---help---
+ Any processes or cgroups can use the same storage
+ with its band-width fairly shared.
+
+ If unsure, say N.
+
endif # MD
diff -uprN linux-2.6.23.14.orig/drivers/md/Makefile linux-2.6.23.14/drivers/md/Makefile
--- linux-2.6.23.14.orig/drivers/md/Makefile 2008-01-15 05:49:56.000000000 +0900
+++ linux-2.6.23.14/drivers/md/Makefile 2008-01-21 20:45:03.000000000 +0900
@@ -8,6 +8,7 @@ dm-multipath-objs := dm-hw-handler.o dm-
dm-snapshot-objs := dm-snap.o dm-exception-store.o
dm-mirror-objs := dm-log.o dm-raid1.o
dm-rdac-objs := dm-mpath-rdac.o
+dm-band-objs := dm-bandctl.o dm-band-policy.o dm-band-type.o
md-mod-objs := md.o bitmap.o
raid456-objs := raid5.o raid6algos.o raid6recov.o raid6tables.o \
raid6int1.o raid6int2.o raid6int4.o \
@@ -39,6 +40,7 @@ obj-$(CONFIG_DM_MULTIPATH_RDAC) += dm-rd
obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o
obj-$(CONFIG_DM_MIRROR) += dm-mirror.o
obj-$(CONFIG_DM_ZERO) += dm-zero.o
+obj-$(CONFIG_DM_BAND) += dm-band.o
quiet_cmd_unroll = UNROLL $@
cmd_unroll = $(PERL) $(srctree)/$(src)/unroll.pl $(UNROLL) \
diff -uprN linux-2.6.23.14.orig/drivers/md/dm-band-policy.c linux-2.6.23.14/drivers/md/dm-band-policy.c
--- linux-2.6.23.14.orig/drivers/md/dm-band-policy.c 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.23.14/drivers/md/dm-band-policy.c 2008-01-21 20:31:14.000000000 +0900
@@ -0,0 +1,185 @@
+/*
+ * Copyright (C) 2008 VA Linux Systems Japan K.K.
+ *
+ * I/O bandwidth ...

Hi,

I'm not qualified to comment on the code, but here are some suggestions on
the config option and comments.

Cheers,
FJP

s/band width/bandwidth/

s/band-width/bandwidth/

The help should probably be a bit more verbose, as this does not tell much
to anybody who has not already read the documentation. Maybe something like:
<snip>
  This device-mapper target allows you to define how the available
  bandwidth of a storage device should be shared between processes or
  cgroups.

  Information on how to use dm-band is available in:
  Documentation/device-mapper/band.txt

s/when there exist some BIOs blocked/if some BIOs exist that are blocked/ ?

"none of them can't" : the double negative looks incorrect (and should be
s/have/has/)

"has to do something" : that's rather vague...
--
Thank you for your suggestions. I will correct those mistakes. -- Ryo Tsuruta <ryov@valinux.co.jp> --
I just see in other Kconfig files that the last line should be: <file:Documentation/device-mapper/band.txt>. Cheers, FJP --
Here is the document of dm-band.
Based on 2.6.23.14
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
diff -uprN linux-2.6.23.14.orig/Documentation/device-mapper/band.txt linux-2.6.23.14/Documentation/device-mapper/band.txt
--- linux-2.6.23.14.orig/Documentation/device-mapper/band.txt 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.23.14/Documentation/device-mapper/band.txt 2008-01-23 21:48:46.000000000 +0900
@@ -0,0 +1,431 @@
+====================
+Document for dm-band
+====================
+
+Contents:
+  What's dm-band all about?
+  How dm-band works
+  Setup and Installation
+  Command Reference
+  TODO
+
+
+What's dm-band all about?
+========================
+Dm-band is an I/O bandwidth controller implemented as a device-mapper driver.
+Several jobs using the same physical device have to share the bandwidth of
+the device. Dm-band gives bandwidth to each job according to its weight,
+which each job can set its own value to.
+
+At this time, a job is a group of processes with the same pid or pgrp or uid.
+There is also a plan to make it support cgroup. A job can also be a virtual
+machine such as KVM or Xen.
+
+  +------+ +------+ +------+   +------+ +------+ +------+
+  |cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
+  |  A   | |  B   | |others|   |  X   | |  Y   | |others|
+  +--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+
+  +--V----+---V---+----V---+   +--V----+---V---+----V---+
+  | group | group | default|   | group | group | default|  band groups
+  |       |       |  group |   |       |       |  group |
+  +-------+-------+--------+   +-------+-------+--------+
+  |         band1          |   |         band2          |  band devices
+  +-----------|------------+   +-----------|------------+
+  +-----------V--------------+-------------V------------+
+  |                          |                          |
+  |          sdb1            |           sdb2           | ...
Could you please address in the document how the intended use cases/feature set etc. differs from CFQ2 io priorities? Thanks, -Andi --
Thank you for your suggestion, I'll do that step by step. Thanks, Ryo Tsuruta --
What definition of bandwidth does it use? Does it for example account for seek latency? --
The bandwidth in dm-band is determined by the proportion of the processing time of each device's tokens (I/Os) to the processing time of all devices' tokens (I/Os). The processing time of one token (I/O) is the time taken by one I/O cycle, including seek latency, interrupt latency, etc. Thanks, Ryo Tsuruta <ryov@valinux.co.jp> --
It seems to rely on 'current' to classify bios and doesn't do it until the map function is called, possibly in a different process context, so it won't always identify the original source of the I/O correctly: people need to take this into account when designing their group configuration and so this should be mentioned in the documentation. I've uploaded it here while we consider ways we might refine the architecture and interfaces etc.: http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-add-band-targ... Alasdair -- agk@redhat.com --
Yes, this should be mentioned in the document with the current implementation as you pointed out. By the way, I think once a memory controller of cgroup is introduced, it will Thank you, Hirokazu Takahashi. --
do you mean to make this a part of the memory subsystem? YAMAMOTO Takashi --
I just think that if the memory subsystem is in front of us, we don't need to reinvent the wheel. But I don't yet have a concrete picture of how the interface between dm-band and the memory subsystem should be designed. I'd appreciate it if some of the cgroup developers could give some ideas about it. Thanks, --
the current implementation of memory subsystem associates pages to cgroups directly, rather than via tasks. so it isn't straightforward to use the information for other classification mechanisms like yours which might not share the view of "hierarchy" with the memory subsystem. --
Hi,
Here are the results of the dm-band bandwidth control test I ran yesterday.
The results are really good and show that dm-band works as I expected. I made
several band-groups on several disk partitions and gave them heavy I/O loads.
Hardware Spec.
==============
DELL Dimension E521:
Linux kappa.local.valinux.co.jp 2.6.23.14 #1 SMP
Thu Jan 24 17:24:59 JST 2008 i686 athlon i386 GNU/Linux
Detected 2004.217 MHz processor.
CPU0: AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ stepping 02
Memory: 966240k/981888k available (2102k kernel code, 14932k reserved,
890k data, 216k init, 64384k highmem)
scsi 2:0:0:0: Direct-Access ATA ST3250620AS 3.AA PQ: 0 ANSI: 5
sd 2:0:0:0: [sdb] 488397168 512-byte hardware sectors (250059 MB)
sd 2:0:0:0: [sdb] Write Protect is off
sd 2:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sdb] Write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
sdb: sdb1 sdb2 < sdb5 sdb6 sdb7 sdb8 sdb9 sdb10 sdb11 sdb12 sdb13 sdb14
sdb15 >
The results of bandwidth control test on partitions
===================================================
The configuration of test #1:
o Prepare three partitions sdb5, sdb6 and sdb7.
o Give weights of 40, 20 and 10 to sdb5, sdb6 and sdb7 respectively.
o Run 128 processes issuing random read/write direct I/O with 4KB data
on each device at the same time.
o Count up the number of I/Os and sectors which have completed in 60 seconds.
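For reference, the share each partition should get if bandwidth follows the weights exactly can be worked out with a small helper. This is only a sketch for checking the arithmetic; the function name `weight_share_percent` is made up here and does not appear in dm-band.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Percentage of the total that weights[idx] represents.
 * Returns 0.0 if all weights are zero.
 */
double weight_share_percent(const unsigned int *weights, size_t n, size_t idx)
{
	unsigned long sum = 0;
	size_t i;

	assert(weights != NULL && idx < n);
	for (i = 0; i < n; i++)
		sum += weights[i];
	return sum ? 100.0 * weights[idx] / sum : 0.0;
}
```

For the weights 40, 20 and 10 this gives about 57.1%, 28.6% and 14.3%. Passing the measured I/O counts instead of the weights gives the share each device actually achieved, which can then be compared with these figures.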
The result of the test #1
 ---------------------------------------------------------------------------
|     device      |       sdb5        |       sdb6        |      sdb7       |
|     weight      |    40 (57.0%)     |    20 (29.0%)     |   10 (14.0%)    |
|-----------------+-------------------+-------------------+-----------------|
| I/Os (r/w)      |  6640( 3272/ 3368)|  3434( 1719/ 1715)|  1689( 857/ 832)|
| sectors (r/w)   | 53120(26176/26944)| 27472(13752/13720)| 13512(6856/6656)|
| ratio to total ...

Hi,

you mean that you run 128 processes on each user-device pair?
Namely, I guess that

  user1: 128 processes on sdb5,
  user2: 128 processes on sdb5,
  another: 128 processes on sdb5,

The second preliminary studies might be:
- What if you use a different I/O size on each device (or device-user pair)?
- What if you use a different number of processes on each device (or
  device-user pair)?

And my impression is that it's natural for dm-band to be in device-mapper,
separate from the I/O scheduler. Because bandwidth control and I/O scheduling
are two different things, it may be simpler to implement them in different
layers.

Regards,
--
"User-device pairs" means "band groups", right? What I actually did is the
following:

  user1: 128 processes on sdb5,
  user2: 128 processes on sdb5,
  user3: 128 processes on sdb5,

There are other ideas for controlling bandwidth: limiting bytes-per-sec,
latency time or something else. I think it is possible to implement them if
a lot of people really require it. I feel there wouldn't be a single correct
answer for this issue. Posting good ideas how it should work

I would like to know how dm-band works on various configurations on various
types of hardware. I'll try running dm-band with other configurations. Any
reports or impressions of dm-band on your machines are also welcome.

Thanks,
Ryo Tsuruta
--
