Hi everyone,
This series of dm-ioband patches now includes "The bio tracking mechanism,"
which has previously been posted separately to this mailing list.
This makes it easy for anybody to control the I/O bandwidth even for
delayed-write requests.
Have fun!
This series of patches consists of two parts:
1. dm-ioband
Dm-ioband is an I/O bandwidth controller implemented as a
device-mapper driver, which gives specified bandwidth to each job
running on the same physical device. A job is a group of processes
with the same pid or pgrp or uid or a virtual machine such as KVM
or Xen. A job can also be a cgroup by applying the bio-cgroup patch.
2. bio-cgroup
Bio-cgroup is a BIO tracking mechanism, which is implemented on
the cgroup memory subsystem. With this mechanism, it is possible to
determine which cgroup each bio belongs to, even when the bio
is a delayed-write request issued from a kernel thread
such as pdflush.
Until now, the above two parts have been posted separately to this mailing
list, but from now on we will release them together.
[PATCH 1/7] dm-ioband: Patch of device-mapper driver
[PATCH 2/7] dm-ioband: Documentation of design overview, installation,
command reference and examples.
[PATCH 3/7] bio-cgroup: Introduction
[PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
[PATCH 5/7] bio-cgroup: Remove a lot of "#ifdef"s
[PATCH 6/7] bio-cgroup: Implement the bio-cgroup
[PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband
Please see the following site for more information:
Linux Block I/O Bandwidth Control Project
http://people.valinux.co.jp/~ryov/bwctl/
Thanks,
Ryo Tsuruta
--
This is the dm-ioband version 1.4.0 release.
Dm-ioband is an I/O bandwidth controller implemented as a device-mapper
driver, which gives specified bandwidth to each job running on the same
physical device.
- Can be applied to the kernel 2.6.27-rc1-mm1.
- Changes from 1.3.0 (posted on July 11, 2008):
- Fix the problem of processing urgent I/O requests.
Dm-ioband gives priority to I/O requests whose pages have the PG_reclaim
flag set. We thought this situation only happened on write requests,
but it also happened on read requests, which caused urgent I/O
requests to be mishandled. We have not yet determined whether this
is correct behavior.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/Kconfig linux-2.6.27-rc1-mm1/drivers/md/Kconfig
--- linux-2.6.27-rc1-mm1.orig/drivers/md/Kconfig 2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/Kconfig 2008-08-01 16:44:02.000000000 +0900
@@ -275,4 +275,17 @@ config DM_UEVENT
---help---
Generate udev events for DM events.
+config DM_IOBAND
+ tristate "I/O bandwidth control (EXPERIMENTAL)"
+ depends on BLK_DEV_DM && EXPERIMENTAL
+ ---help---
+ This device-mapper target allows to define how the
+ available bandwidth of a storage device should be
+ shared between processes, cgroups, the partitions or the LUNs.
+
+ Information on how to use dm-ioband is available in:
+ <file:Documentation/device-mapper/ioband.txt>.
+
+ If unsure, say N.
+
endif # MD
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/Makefile linux-2.6.27-rc1-mm1/drivers/md/Makefile
--- linux-2.6.27-rc1-mm1.orig/drivers/md/Makefile 2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/Makefile 2008-08-01 16:44:02.000000000 +0900
@@ -7,6 +7,7 @@ dm-mod-objs := dm.o dm-table.o dm-target
dm-multipath-objs := dm-path-selector.o dm-mpath.o
dm-snapshot-objs := dm-snap.o ...
Here is the documentation of design overview, installation, command
reference and examples.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
diff -uprN linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt
--- linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt 2008-08-01 16:44:02.000000000 +0900
@@ -0,0 +1,937 @@
+ Block I/O bandwidth control: dm-ioband
+
+ -------------------------------------------------------
+
+ Table of Contents
+
+ [1]What's dm-ioband all about?
+
+ [2]Differences from the CFQ I/O scheduler
+
+ [3]How dm-ioband works.
+
+ [4]Setup and Installation
+
+ [5]Getting started
+
+ [6]Command Reference
+
+ [7]Examples
+
+What's dm-ioband all about?
+
+ dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+ driver. Several jobs using the same physical device have to share the
+ bandwidth of the device. dm-ioband gives bandwidth to each job according
+ to its weight, which each job can set its own value to.
+
+ A job is a group of processes with the same pid or pgrp or uid or a
+ virtual machine such as KVM or Xen. A job can also be a cgroup by applying
+ the bio-cgroup patch, which can be found at
+ http://people.valinux.co.jp/~ryov/bio-cgroup/.
+
+     +------+ +------+ +------+   +------+ +------+ +------+
+     |cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
+     |  A   | |  B   | |others|   |  X   | |  Y   | |others|
+     +--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+
+     +--V----+---V---+----V---+   +--V----+---V---+----V---+
+     | group | group | default|   | group | group | default|  ioband groups
+     |       |       |  group |   |       |       |  group |
+ ...
With this series of bio-cgroup patches, you can determine the owners of any
type of I/O, which enables dm-ioband (an I/O bandwidth controller) to
control block I/O bandwidth even when it accepts delayed-write requests.
Dm-ioband can find the owner cgroup of each request.
It is also possible that other people working on I/O bandwidth throttling
could use this functionality to control asynchronous I/O with a little
enhancement.
You have to apply the dm-ioband v1.4.0 patch before applying this series
of patches.
You also have to select the following config options when compiling the
kernel:
CONFIG_CGROUPS=y
CONFIG_CGROUP_BIO=y
I also recommend selecting the options for the cgroup memory subsystem,
because that makes it possible to give some I/O bandwidth and some memory
to a certain cgroup to control delayed-write requests, and the processes in
the cgroup will be able to make pages dirty only inside the cgroup even when
the given bandwidth is narrow.
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_MEM_RES_CTLR=y
This code is based on part of the cgroup memory subsystem, and I don't think
the accuracy and overhead of the subsystem can be ignored at this time, so we
need to keep tuning it.
--------------------------------------------------------
The following shows how to use dm-ioband with cgroups.
Assume that you want to make two cgroups, which we call "bio cgroups" here,
to track block I/Os and assign them to the ioband device "ioband1".
First, mount the bio cgroup filesystem.
# mount -t cgroup -o bio none /cgroup/bio
Then, make new bio cgroups and put some processes in them.
# mkdir /cgroup/bio/bgroup1
# mkdir /cgroup/bio/bgroup2
# echo 1234 > /cgroup/bio/bgroup1/tasks
# echo 5678 > /cgroup/bio/bgroup1/tasks
Now, check the ID of each bio cgroup that was just created.
# cat /cgroup/bio/bgroup1/bio.id
1
# cat /cgroup/bio/bgroup2/bio.id
2
Finally, attach the cgroups to "ioband1" and assign them ...
This patch splits the cgroup memory subsystem into two parts.
One is for tracking pages to find out their owners. The other is
for controlling how much memory should be assigned to
each cgroup.
With this patch, you can use the page tracking mechanism even if
the memory subsystem is off.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
diff -Ndupr linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h
--- linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h 2008-08-01 12:18:28.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h 2008-08-01 19:03:21.000000000 +0900
@@ -20,12 +20,62 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
+#include <linux/rcupdate.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+
struct mem_cgroup;
struct page_cgroup;
struct page;
struct mm_struct;
+#ifdef CONFIG_CGROUP_PAGE
+/*
+ * We use the lower bit of the page->page_cgroup pointer as a bit spin
+ * lock. We need to ensure that page->page_cgroup is at least two
+ * byte aligned (based on comments from Nick Piggin). But since
+ * bit_spin_lock doesn't actually set that lock bit in a non-debug
+ * uniprocessor kernel, we should avoid setting it here too.
+ */
+#define PAGE_CGROUP_LOCK_BIT 0x0
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define PAGE_CGROUP_LOCK (1 << PAGE_CGROUP_LOCK_BIT)
+#else
+#define PAGE_CGROUP_LOCK 0x0
+#endif
+
+/*
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ */
+struct page_cgroup {
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ struct list_head lru; /* per cgroup LRU list */
+ struct mem_cgroup *mem_cgroup;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+ struct page *page;
+ int flags;
+};
+#define PAGE_CGROUP_FLAG_CACHE (0x1) /* ...
This patch cleans up the code of the cgroup memory subsystem
by removing a lot of "#ifdef"s.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
diff -Ndupr linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c 2008-08-01 19:48:55.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c 2008-08-01 19:49:38.000000000 +0900
@@ -228,6 +228,47 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
}
+static inline void get_mem_cgroup(struct mem_cgroup *mem)
+{
+ css_get(&mem->css);
+}
+
+static inline void put_mem_cgroup(struct mem_cgroup *mem)
+{
+ css_put(&mem->css);
+}
+
+static inline void set_mem_cgroup(struct page_cgroup *pc,
+ struct mem_cgroup *mem)
+{
+ pc->mem_cgroup = mem;
+}
+
+static inline void clear_mem_cgroup(struct page_cgroup *pc)
+{
+ struct mem_cgroup *mem = pc->mem_cgroup;
+ res_counter_uncharge(&mem->res, PAGE_SIZE);
+ pc->mem_cgroup = NULL;
+ put_mem_cgroup(mem);
+}
+
+static inline struct mem_cgroup *get_mem_page_cgroup(struct page_cgroup *pc)
+{
+ struct mem_cgroup *mem = pc->mem_cgroup;
+ css_get(&mem->css);
+ return mem;
+}
+
+/* This should be called in an RCU-protected section. */
+static inline struct mem_cgroup *mm_get_mem_cgroup(struct mm_struct *mm)
+{
+ struct mem_cgroup *mem;
+
+ mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ get_mem_cgroup(mem);
+ return mem;
+}
+
static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
{
@@ -297,6 +338,26 @@ static void __mem_cgroup_move_lists(stru
list_move(&pc->lru, &mz->lists[lru]);
}
+static inline void mem_cgroup_add_page(struct page_cgroup *pc)
+{
+ struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+ unsigned long flags;
+
+ spin_lock_irqsave(&mz->lru_lock, flags);
+ __mem_cgroup_add_list(mz, ...
This patch implements the bio cgroup on the memory cgroup.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c
--- linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c 2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c 2008-08-01 19:18:38.000000000 +0900
@@ -84,24 +84,28 @@ void exit_io_context(void)
}
}
+void init_io_context(struct io_context *ioc)
+{
+ atomic_set(&ioc->refcount, 1);
+ atomic_set(&ioc->nr_tasks, 1);
+ spin_lock_init(&ioc->lock);
+ ioc->ioprio_changed = 0;
+ ioc->ioprio = 0;
+ ioc->last_waited = jiffies; /* doesn't matter... */
+ ioc->nr_batch_requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+ INIT_HLIST_HEAD(&ioc->cic_list);
+ ioc->ioc_data = NULL;
+}
+
struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
{
struct io_context *ret;
ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
- if (ret) {
- atomic_set(&ret->refcount, 1);
- atomic_set(&ret->nr_tasks, 1);
- spin_lock_init(&ret->lock);
- ret->ioprio_changed = 0;
- ret->ioprio = 0;
- ret->last_waited = jiffies; /* doesn't matter... */
- ret->nr_batch_requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
- INIT_HLIST_HEAD(&ret->cic_list);
- ret->ioc_data = NULL;
- }
+ if (ret)
+ init_io_context(ret);
return ret;
}
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h 2008-08-01 19:21:56.000000000 +0900
@@ -0,0 +1,159 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include ...
With this patch, dm-ioband can work with the bio cgroup.
Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
diff -Ndupr linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c
--- linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c 2008-08-01 16:53:57.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c 2008-08-01 19:44:36.000000000 +0900
@@ -6,6 +6,7 @@
* This file is released under the GPL.
*/
#include <linux/bio.h>
+#include <linux/biocontrol.h>
#include "dm.h"
#include "dm-bio-list.h"
#include "dm-ioband.h"
@@ -53,13 +54,13 @@ static int ioband_node(struct bio *bio)
static int ioband_cgroup(struct bio *bio)
{
- /*
- * This function should return the ID of the cgroup which issued "bio".
- * The ID of the cgroup which the current process belongs to won't be
- * suitable ID for this purpose, since some BIOs will be handled by kernel
- * threads like aio or pdflush on behalf of the process requesting the BIOs.
- */
- return 0; /* not implemented yet */
+ struct io_context *ioc = get_bio_cgroup_iocontext(bio);
+ int id = 0;
+ if (ioc) {
+ id = ioc->id;
+ put_io_context(ioc);
+ }
+ return id;
}
struct group_type dm_ioband_group_type[] = {
--
Is this function fully implemented?
I tried to put a process into a group by writing to
"/cgroup/bio/BGROUP/tasks" but failed.
Without an "attach" function, it is difficult to check the effectiveness of
block I/O tracking.
Thanks,
- Takuya Yoshikawa
--
This function can be simplified further; there is some unnecessary code.
Could you tell me what you actually did? I will try the same thing.
-- Ryo Tsuruta <ryov@valinux.co.jp>
--
Hi Tsuruta-san,
I wanted to test my own scheduler, which uses bio tracking information.
So I tried your patch, especially get_bio_cgroup_iocontext(), to get the
io_context from a bio.
In my test, I made some threads with certain I/O priorities run concurrently.
To schedule these threads based on their I/O priorities, I made a BGROUP
directory for each I/O priority,
e.g. /cgroup/bio/be0 ... /cgroup/bio/be7
Then, I tried to attach the processes to the appropriate groups.
But the processes stayed in the original group (id=0).
...
I am sorry, but I have to leave now and I cannot come here next week.
--> I will take summer holidays.
I will reply to you later.
Thanks,
- Takuya Yoshikawa
--
In the current implementation, when a process moves to another cgroup:
- Memory that is already allocated does not move to the new cgroup; it
  remains in the old one.
- Only memory allocated after the move belongs to the new cgroup.
This behavior follows the memory controller. Memory does not move between
cgroups because moving it is such a heavy operation, but it might be
worthwhile under some conditions.
Could you try to move a process between cgroups in the following way?
# echo $$ > /cgroup/bio/be0
# run_your_program
# echo $$ > /cgroup/bio/be1
# run_your_program
Have a nice vacation!
Thanks,
Ryo Tsuruta
--
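The ownership rule described in the message above (pages keep the cgroup they were charged to; only allocations made after a move go to the new cgroup) can be modelled in a few lines of user-space C. This is only an illustrative sketch: the `track_*` names, the fixed-size page pool and the integer cgroup IDs are assumptions made for the example and do not appear in the bio-cgroup patches.

```c
#include <assert.h>
#include <stddef.h>

#define TRACK_NPAGES 8

/* Hypothetical user-space model; these types are not kernel structures. */
struct track_task {
    int cgroup_id;              /* cgroup the task currently belongs to */
};

struct track_page {
    int owner_id;               /* cgroup charged at allocation, -1 if free */
};

static struct track_page track_pages[TRACK_NPAGES];
static size_t track_next_free;

/* Mark every page in the pool as free. */
void track_reset(void)
{
    for (size_t i = 0; i < TRACK_NPAGES; i++)
        track_pages[i].owner_id = -1;
    track_next_free = 0;
}

/* Allocate one page and charge it to the task's current cgroup. */
struct track_page *track_alloc_page(const struct track_task *t)
{
    assert(track_next_free < TRACK_NPAGES);
    struct track_page *p = &track_pages[track_next_free++];
    p->owner_id = t->cgroup_id;
    return p;
}

/* Move the task: existing pages keep their owner, future ones follow. */
void track_move_task(struct track_task *t, int new_cgroup_id)
{
    t->cgroup_id = new_cgroup_id;
}
```

In this model, a page allocated before `track_move_task()` still reports the old cgroup, which is why a test program has to be started after the move for its I/O to be attributed to the new group.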
You can remove some ifdefs by doing:
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
if (likely(!memcg)) {
rcu_read_lock();
mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
/*
* For every charge from the cgroup, increment reference count
*/
css_get(&mem->css);
rcu_read_unlock();
} else {
mem = memcg;
css_get(&memcg->css);
}
while (res_counter_charge(&mem->res, PAGE_SIZE)) {
if (!(gfp_mask & __GFP_WAIT))
goto out;
if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
continue;
/*
* try_to_free_mem_cgroup_pages() might not give us a full
* picture of reclaim. Some pages are reclaimed and might be
* moved to swap cache or just unmapped from the cgroup.
* Check the limit again to see if the reclaim reduced the
* current usage of the cgroup before giving up
*/
if (res_counter_check_under_limit(&mem->res))
continue;
if (!nr_retries--) {
mem_cgroup_out_of_memory(mem, gfp_mask);
goto out;
}
}
pc->mem_cgroup = mem;
#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
--
I think you don't have to care about this much, since one of the following --
On Mon, 04 Aug 2008 17:57:48 +0900 (JST)
Please CC me or Balbir or Pavel (see the maintainer list) when you try this ;)
After this patch, the total structure is
page <-> page_cgroup <-> bio_cgroup.
(multiple bio_cgroups can be attached to a page_cgroup)
Will this pointer chain add
- a significant performance regression or
- new race conditions?
I would prefer a looser relationship between them.
For example, adding a simple function:
==
int get_page_io_id(struct page *)
- returns an I/O cgroup ID for this page. If no ID is found, -1 is returned.
  The ID is not guaranteed to be a valid value. (The ID can be obsolete.)
==
And just store the cgroup ID in page_cgroup at page allocation.
Then, make bio_cgroup independent from page_cgroup, get the ID if it is
available, and avoid too much pointer walking.
Thanks,
--
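The looser coupling proposed above can be sketched as a user-space model in which the per-page tracking structure stores only an integer ID. The `ioid_*` names and struct layouts below are hypothetical stand-ins for `struct page` and `struct page_cgroup`; only the -1 "not found" convention and the possibly-stale ID come from the proposal itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page_cgroup: holds an ID, not a pointer. */
struct ioid_page_cgroup {
    int io_id;                      /* I/O cgroup ID recorded at allocation */
};

/* Hypothetical stand-in for struct page. */
struct ioid_page {
    struct ioid_page_cgroup *pc;    /* NULL when the page is not tracked */
};

/* Record the allocating cgroup's ID; no link to a bio_cgroup is kept. */
void ioid_set_page_io_id(struct ioid_page_cgroup *pc, int io_id)
{
    assert(pc != NULL);
    assert(io_id >= 0);
    pc->io_id = io_id;
}

/*
 * Return the I/O cgroup ID for this page, or -1 if none is recorded.
 * The ID may be stale: the cgroup could have been removed since it was
 * stored, so the caller has to tolerate an ID that no longer resolves.
 */
int ioid_get_page_io_id(const struct ioid_page *page)
{
    if (page == NULL || page->pc == NULL)
        return -1;
    return page->pc->io_id;
}
```

The trade-off in this sketch is that no reference count pins the cgroup, so lookups are cheap but the consumer must treat an unknown ID as "no owner".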
I don't think it will cause significant performance loss, because the link
between a page and a page_cgroup, which the memory resource controller
set up, already exists. Bio_cgroup uses this as it is and does not change
it.
And the link between page_cgroup and bio_cgroup isn't protected by any
additional spin-locks, since the associated bio_cgroup is guaranteed to
exist as long as the bio_cgroup owns pages.
I've just noticed that most of the overhead comes from the spin-locks taken
when reclaiming the pages inside mem_cgroups and the spin-locks that protect
the links between pages and page_cgroups.
The latter overhead comes from the policy your team has chosen that
page_cgroup structures are allocated on demand. I still feel this approach
doesn't make sense, because the Linux kernel tries to make use of as many
pages as it can, so most of them have to be assigned a related
page_cgroup. It would make us happy
I don't think there are any differences between a pointer and an ID.
--
Hmm, I think page_cgroup's cost is visible when
1. a page is changed to the in-use state. (fault or radix-tree-insert)
2. a page is changed to the out-of-use state. (fault or radix-tree-removal)
3. memcg hits its limit or global LRU reclaim runs.
"1" and "2" can be observed as a 5% loss of exec throughput.
"3" is not measured (because the LRU walk itself is heavy.)
What new chances to access page_cgroup will you add?
The overhead of the page <-> page_cgroup lock cannot be caught by lock_stat
now. Do you have numbers?
Now, multi-sized-page-cache has been discussed for a long time. If it's our
An ID can be obsolete; a pointer cannot. Does the memory cgroup have to take
care of bio cgroup's race conditions?
(About race conditions, it's already complicated enough.)
To be honest, I think adding a new (4 or 8 bytes) member to struct page and
recording the bio-control information there is a more straightforward
approach. But as you might think, "there is no room".
Thanks,
-Kame
--
I haven't added any at this moment, but I think some people may want to
move some pages in the page cache from one cgroup to another cgroup.
When that time comes, I'll try to minimize the cost; I will
probably only update the link between a page_cgroup and
The problem is that every time the lock is held, the associated
I don't think I can agree to this.
When multi-sized-page-cache is introduced, some data structures will be
allocated to manage multi-sized pages. I think page_cgroups should be
allocated at the same time. This approach will make things simple.
It seems like the on-demand allocation approach leads not only to overhead
but also to complexity and a lot of race conditions.
If you allocate page_cgroups when allocating page structures, you can get
rid of most of the locks and you don't have to care about allocation errors
of page_cgroups anymore.
And it will also give us flexibility that memcg related data can be
Bio-cgroup just expects that the call-backs bio-cgroup prepares are called
But only if everyone allows me to add some new members into "struct page."
I think the same thing goes for the memcg you're working on.
Thank you,
Hirokazu Takahashi.
--
On Thu, 07 Aug 2008 16:25:12 +0900 (JST)
I think "page" and "page_cgroup" are not such heavily shared objects in the
fast path.
Footprint is also important here.
But it's not good for systems with small "NORMAL" pages.
This discussion should be done again when more users of page_cgroup appear
and its overhead is obvious.
Thanks,
-Kame
--
Even when it happens to be a system with small "NORMAL" pages, if you want
to use the memcg feature, you have to allocate page_cgroups for most of the
pages in the system. It's impossible to avoid the allocation as far
Thanks,
Hirokazu Takahashi.
--
During the Containers mini-summit at OLS, it was mentioned that there are at
least *FOUR* of these I/O controllers floating around. Have you talked to
the other authors? (I've cc'd at least one of them).
We obviously can't come to any kind of real consensus with people just
tossing the same patches back and forth.
-- Dave
--
Dave, thanks for this email first of all. I've talked with Satoshi (cc-ed)
about his solution "Yet another I/O bandwidth controlling subsystem for
CGroups based on CFQ".
I did some experiments trying to implement minimum bandwidth requirements
for my io-throttle controller, mapping the requirements to CFQ prio and
using Satoshi's controller. But this needs additional work and testing
right now, so I've not posted anything yet, just informed Satoshi about this.
Unfortunately I've not talked to Ryo yet. I've continued my work using a
quite different approach, because the dm-ioband solution didn't work with
delayed-write requests. Now the bio tracking feature seems really promising
and I would like to do some tests ASAP, and review the patch as well.
But I'm not yet convinced that limiting the IO writes at the device mapper
layer is the best solution. IMHO it would be better to throttle
applications' writes when they're dirtying pages in the page cache (the
io-throttle way), because when the IO requests arrive at the device mapper
it's too late (we would only have a lot of dirty pages that are waiting to
be flushed to the limited block devices, and maybe this could lead to OOM
conditions). IOW dm-ioband is doing this at the wrong level (at least for my
requirements). Ryo, correct me if I'm wrong or if I've not understood the
dm-ioband approach.
Another thing I prefer is to directly define bandwidth limiting rules,
instead of using priorities/weights (i.e. 10MiB/s for /dev/sda), but this
seems to be in the dm-ioband TODO list, so maybe we can merge the work I did
in io-throttle to define such rules.
Anyway, I still need to look at the dm-ioband and bio-cgroup code in detail,
so probably all I said above is totally wrong...
-Andrea
--
The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if you
look at this in combination with the memory controller, they would make a
great team.
The memory controller keeps you from dirtying more than your limit of pages
(and pinning too much memory) even if the dm layer is doing the throttling
and itself can't throttle the memory usage.
I also don't think this is any different from the problems we have in the
regular VM these days. Right now, people can dirty lots of pages on devices
that are slow. The only thing dm-ioband would add is a change in how
those devices *got* slow. :)
-- Dave
--
mmh... but in this way we would just move the OOM inside the cgroup, which
is a nice improvement, but the main problem is not resolved...
A safer approach IMHO is to force the tasks to wait synchronously on each
operation that directly or indirectly generates i/o.
In particular, the solution used by the io-throttle controller to limit the
dirty-ratio in memory is to impose a sleep via schedule_timeout_killable()
in balance_dirty_pages() when a generic process exceeds the limits defined
for the cgroup it belongs to.
Limiting read operations is a lot easier, because they're always
synchronized with i/o requests.
-Andrea
--
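One way to picture the "impose a sleep when the limit is exceeded" idea from the message above is a helper that computes how long a task would have to wait for its average dirtying rate to fall back to the limit. This is a sketch under assumptions, not the io-throttle code: the function name, its parameters and the simple averaging rule are all invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how long should a task sleep so that `bytes`
 * dirtied over `elapsed_ms` does not exceed `limit_bps` (bytes/second)?
 * Returns 0 when the task is within its limit.
 *
 * Note: bytes * 1000 can overflow for extremely large byte counts; a real
 * implementation would have to guard against that.
 */
uint64_t dirty_throttle_sleep_ms(uint64_t bytes, uint64_t elapsed_ms,
                                 uint64_t limit_bps)
{
    assert(limit_bps > 0);

    /* Time the work would have taken at exactly the limit rate. */
    uint64_t min_ms = bytes * 1000 / limit_bps;

    return min_ms > elapsed_ms ? min_ms - elapsed_ms : 0;
}
```

For instance, with a limit of 10 MiB/s, a task that dirtied 20 MiB in 500 ms would be asked to sleep for 1500 ms in this model, while a task below the limit would not sleep at all.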
Fine in theory, hard in practice. :)
I think the best we can hope for is to keep parity with what happens in the
rest of the kernel. We already have a problem today with people mmap()'ing
lots of memory and dirtying it all at once. Adding an i/o bandwidth
controller or a memory controller isn't really going to fix that. I think
it is outside the scope of the i/o (and memory) controllers until we solve
it generically, first.
-- Dave
--
Yes, that's right. This should be solved.
But there is a good thing when you use a memory controller: a problem that
occurs in a certain cgroup will be confined to that cgroup. I think this is
a great point, don't you think?
Thank you,
Hirokazu Takahashi.
--
I think that you're conflating two issues:
- controlling how much dirty memory a cgroup can have at any given time
  (since dirty memory is much harder/slower to reclaim than clean memory)
- controlling how much effect a cgroup can have on a given I/O device.
By controlling the rate at which a task can generate dirty pages, you're not
really limiting either of these. I think you'd have to set your I/O limits
artificially low to prevent a case of a process writing a large data file
and then doing fsync() on it, which would then hit the disk with the entire
file at once, and blow away any QoS guarantees for other groups.
As Dave suggested, I think it would make more sense to have your
page-dirtying throttle points hook into the memory controller instead, and
allow the memory controller to track/limit dirty pages for a cgroup, and
potentially do throttling as part of that.
Paul
--
Yes, that would be nicer. The IO controller should control both reads and
writes, and dirty pages are mostly related to writes.
-- Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
Anyway, the dirty pages ratio is directly proportional to the IO that will
be performed on the real device, isn't it? This wouldn't prevent IO bursts,
as you correctly say, but IMHO it is a simple and quite effective way to
measure the IO write activity of each cgroup on each affected device.
To prevent the IO peaks I usually reduce the vm_dirty_ratio, but, ok, this
is a workaround, not the solution to the problem either.
IMHO, based on the dirty-page rate measurement, we should apply both
limiting methods: throttle the dirty-pages ratio to prevent too many dirty
pages in the system (hard to reclaim and generating
unpredictable/unpleasant/unresponsive behaviour), and throttle the
dispatching of IO requests at the device-mapper/IO-scheduler layer to smooth
IO peaks/bursts generated by fsync() and similar scenarios.
Another, different approach could be to implement the measurement in the
elevator, looking at the time elapsed between when the IO request is issued
to the drive and when the request is served. So, look at the start time T1
and the completion time T2, take the difference (T2 - T1) and say: cgroup C1
consumed an amount of IO of (T2 - T1), and also use a token-bucket policy to
fill/reduce the "credits" of each IO cgroup in terms of IO time slots. This
would be a more precise measurement, instead of trying to predict how
expensive the IO operation will be by only looking at the dirty-page ratio.
Then throttle both the dirty-page ratio *and* the dispatching of the IO
requests submitted by the cgroup that exceeds the
Yes, implementing page-dirty throttling in the memory controller seems
absolutely reasonable. I can try to move in this direction, merge the
page-dirty throttling into the memory controller and also post the RFC.
Thanks,
-Andrea
--
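The token-bucket idea from the message above, where each cgroup is charged the measured service time (T2 - T1) of its requests, might look roughly like the following user-space sketch. The `tb_*` names, the microsecond unit, the per-tick refill and the rule that credits may go negative are assumptions chosen for the example; no such code appears in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-cgroup bucket; credits are microseconds of device time. */
struct tb_bucket {
    int64_t credits_us;     /* may go negative after an expensive request */
    int64_t capacity_us;    /* upper bound on saved-up credits */
    int64_t refill_us;      /* credits added per refill tick */
};

/* Start with a full bucket. */
void tb_init(struct tb_bucket *b, int64_t capacity_us, int64_t refill_us)
{
    assert(capacity_us > 0 && refill_us > 0);
    b->credits_us = capacity_us;
    b->capacity_us = capacity_us;
    b->refill_us = refill_us;
}

/* Called once per tick: add credits, capped at the bucket capacity. */
void tb_refill(struct tb_bucket *b)
{
    b->credits_us += b->refill_us;
    if (b->credits_us > b->capacity_us)
        b->credits_us = b->capacity_us;
}

/* Charge a completed request with its measured service time T2 - T1. */
void tb_charge(struct tb_bucket *b, int64_t t1_us, int64_t t2_us)
{
    assert(t2_us >= t1_us);
    b->credits_us -= t2_us - t1_us;
}

/* A cgroup may dispatch new requests only while it still has credits. */
int tb_may_dispatch(const struct tb_bucket *b)
{
    return b->credits_us > 0;
}
```

Charging after completion means the cost is known exactly rather than predicted, at the price of letting a single slow request push the bucket into debt that later refills have to pay off.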
Yeah, I'm sure we're going to have to get to setting the dirty ratio
$ cat /proc/sys/vm/dirty_ratio
40
on a per-container basis at *some* point. We might as well do it
earlier rather than later.
-- Dave
--
Hi, Andrea,
The concept of dm-ioband is that it should be used with the cgroup memory
controller as well as the bio cgroup. The memory controller is supposed to
control memory allocation and the dirty-page ratio inside each cgroup.
Some members of the cgroup memory controller team have just started to
implement the latter mechanism. They are trying to give each cgroup a
threshold that limits the number of dirty pages in the group.
I guess it would make the memory controller team happier if you could help
them design their dirty-page ratio controlling functionality to be much
cooler and more generic. I think their goal is almost the same
Thank you,
Hirokazu Takahashi.
--
Interesting. Did they also post a patch or RFC?
-Andrea
--
You can take a look at the thread starting at
http://www.ussg.iu.edu/hypermail/linux/kernel/0807.1/0472.html,
whose subject is "[PATCH][RFC] dirty balancing for cgroups."
This project has just started, so it would be a good time to
discuss it with them.
Thanks,
Hirokazu Takahashi.
--
Hi, Andrea.
I participated in Containers Mini-summit.
And, I talked with Mr. Andrew Morton in The Linux Foundation Japan
Symposium BoF, Japan, July 10th.
Currently, several I/O controller patches are being posted to the ML, and
each of them keeps being reposted as an improved version.
Neither we nor the maintainers like this situation.
We wanted to resolve this situation at the Mini-summit, but unfortunately,
no other developers participated.
(I couldn't give an opinion, because my English skill is low)
Mr. Naveen presented his approach at the Linux Symposium, and we discussed
I/O control a little after his presentation.
Mr. Andrew advised me that we "should discuss the design more and more".
And, at the Containers Mini-summit (and Linux Symposium 2008 in Ottawa),
Paul said that what we need to do first is to decide on the requirements.
So, we must discuss requirements and design.
My requirement is
* to be able to distribute performance moderately.
(* to be able to isolate each group(environment)).
I guess (it may be wrong)
Naveen's requirement is
* to be able to handle latency.
(high priority always takes precedence in handling I/O)
(Only share isn't just given priority to, like CFQ.)
* to be able to distribute performance moderately.
Andrea's requirement is
* to be able to set and control by absolute(direct) performance.
Ryo's requirement is
* to be able to distribute performance moderately.
* to be able to set and control I/Os at flexible range
(multi device such as LVM).
I think that most solutions control I/O performance moderately
(by using weight/priority/percentage/etc. rather than absolute values)
because disk I/O performance is not constant and is affected by the
situation (such as the application, file(data) balance, and so on).
So, it is difficult to guarantee performance that is specified as an
absolute bandwidth.
If devices have constant performance, it will be good to control by
absolute bandwidth.
And, when guaranteeing it by the low ability, it'll be ...
* improve IO performance predictability of each cgroup
It would be probably the best place to evaluate the "cost" of each
Agree. At least, maybe we should consider if an IO controller could be
I'll collect some numbers and keep you informed.
-Andrea
--
Hi Andrea, Satoshi and all,
We've implemented dm-ioband and bio-cgroup to meet the following requirements:
* Assign some bandwidth to each group on the same device.
A group is a set of processes, which may be a cgroup.
* Assign some bandwidth to each partition on the same device.
It can work with the process group based bandwidth control.
ex) With this feature, you can assign 40% of the bandwidth of a
disk to /root and 60% of it to /usr.
* It can work with virtual machines such as Xen and KVM.
I/O requests issued from virtual machines have to be controlled.
* It should work with any type of I/O scheduler, including ones which
will be released in the future.
* Support multiple devices which share the same bandwidth such as
raid disks and LVM.
* Handle asynchronous I/O requests such as AIO requests and delayed
write requests.
- This can be done with bio-cgroup, which uses the page-tracking
mechanism the cgroup memory controller has.
* Control dirty page ratio.
- This can be done with the cgroup memory controller in the near
future. It would be great if you could also use, together with
dm-ioband, the other features the memory controller is going to have.
* Make it easy to enhance.
- The current implementation of dm-ioband has an interface to
add a new policy to control I/O requests. You can easily add
I/O throttling policy if you want.
* Fine grained bandwidth control.
* Keep I/O throughput.
* Make it scalable.
* It should work correctly if the I/O load is quite high,
I don't have any documentation besides what is on the website.
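
The 40%/60% example in the list above comes down to a weighted split.
A minimal sketch of that arithmetic follows, assuming the bandwidth is
handed out as some abstract per-period budget. The struct, the function
name and the "budget" unit are hypothetical and are not taken from the
dm-ioband patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types and names for illustration; not dm-ioband code. */
struct io_group {
	const char *name;     /* label only, e.g. a mount point */
	unsigned int weight;  /* relative share, e.g. 40 and 60 */
	unsigned int budget;  /* filled in by split_budget() */
};

/*
 * Split `total` abstract I/O units among `n` groups in proportion to
 * their weights. Integer division rounds each budget down.
 * Returns 0 on success, -1 if every weight is zero.
 */
int split_budget(struct io_group *groups, size_t n, unsigned int total)
{
	unsigned long long sum = 0;
	size_t i;

	assert(groups != NULL || n == 0);

	for (i = 0; i < n; i++)
		sum += groups[i].weight;
	if (sum == 0)
		return -1;

	for (i = 0; i < n; i++)
		groups[i].budget = (unsigned int)
			((unsigned long long)total * groups[i].weight / sum);
	return 0;
}
```

Because the weights are relative, 40/60 and 2/3 give the same split;
only the ratio matters in this sketch.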
Thanks,
Ryo Tsuruta
--
Isn't this one of the core points that we keep going back and forth
over? It seems like people are arguing in circles over this:

Do we:
1. control potential memory usage by throttling I/O
or
2. Throttle I/O when memory is full

I might lean toward (1) if we didn't already have a memory controller.
But, we have one, and it works. Also, we *already* do (2) in the
kernel, so it would seem to graft well onto existing mechanisms that
we have.

I/O controllers should not worry about memory. They're going to have a
hard enough time getting the I/O part right. :)

Or, am I over-simplifying this?

-- Dave
--
On Tue, 05 Aug 2008 09:20:18 -0700

memcg has more problems now ;(

The only difficult thing about limiting dirty-ratio in memcg is how to
count dirty pages. If the I/O controller's hook helps, that's good.

My small concern is "what happens if we throttle I/O bandwidth too low
under some memcg?" In such a cgroup, we may see more OOMs because I/O
will not finish in time. A system admin has to find some way to avoid
this.

But please do I/O control first. Dirty-page control is a related
problem, but it belongs to a different layer, I think.

Thanks,
--
Yes, please solve the I/O control problem first. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I/O controllers are just supposed to emulate a slow device from the
point of view of the processes in a certain cgroup or something.
I think the memory management layer and the memory controller are the
ones which should be able to handle these, which might be as slow as

Yup.

Thanks,
Hirokazu Takahashi.
--
Yes, this is one of the problems the Linux kernel still has, which
should be solved. But I believe this should be done in the Linux memory
management layer, including the cgroup memory controller, which has to
work correctly on any type of device with various access speeds.

I think it's better that I/O controllers focus only on the flow of I/O
requests. This approach will keep the implementation of linux

Thank you,
Hirokazu Takahashi.
--
Ryo and Andrea - Naveen and Satoshi met up at OLS and discussed their approach. It would be really nice to see an RFC, I know Andrea did work on this and compared the approaches. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
yes, I wrote down something about the comparison of priority-based vs bandwidth shaping solutions in terms of performance predictability. And other considerations, like the one I cited before, about dirty-ratio throttling in memory, AIO handling, etc. Something is also reported in the io-throttle documentation: http://marc.info/?l=linux-kernel&m=121780176907686&w=2 But ok, I agree with Balbir, I can try to put the things together (in a better form in particular) and try to post an RFC together with Ryo. Ryo, do you have other documentation besides the info reported in the dm-ioband website? Thanks, -Andrea --
Hi Dave,

I have been tracking the memory controller patches for a while, which
spurred my interest in cgroups and prompted me to start working on I/O
bandwidth controlling mechanisms. This year I have had several
opportunities to discuss the design challenges of I/O controllers with
the NEC and VALinux Japan teams (CCed), most recently last month during
the Linux Foundation Japan Linux Symposium, where we took advantage of
Andrew Morton's visit to Japan to do some brainstorming on this topic.
I will try to summarize what was discussed there (and in the Linux
Storage & Filesystem Workshop earlier this year) and propose a
hopefully acceptable way to proceed and try to get things started.

This RFC ended up being a bit longer than I had originally intended,
but hopefully it will serve as the start of a fruitful discussion.

As you pointed out, it seems that there is not much consensus building
going on, but that does not mean there is a lack of interest. To get
the ball rolling it is probably a good idea to clarify the state of
things and try to establish what we are trying to accomplish.

*** State of things in the mainstream kernel

The kernel has had somewhat advanced I/O control capabilities for quite
some time now: CFQ. But the current CFQ has some problems:
- I/O priority can be set by PID, PGRP, or UID, but...
- ...all the processes that fall within the same class/priority are
scheduled together and arbitrary groupings are not possible.
- Buffered I/O is not handled properly.
- CFQ's IO priority is an attribute of a process that affects all
devices it sends I/O requests to. In other words, with the current
implementation it is not possible to assign per-device IO priorities
to a task.

*** Goals

1. Cgroups-aware I/O scheduling (being able to define arbitrary
groupings of processes and treat each group as a single scheduling
entity).
2. Being able to perform I/O bandwidth control independently on each
device.
3. I/O bandwidth shaping.
4. ...
I'd like to add the following item to the goals.

7. Selectable from multiple bandwidth control policy (proportion,

I agree with your plan.
We will keep improving bio-cgroup and porting it to the latest kernel.

Thanks,
Ryo Tsuruta
--
Having more users of bio-cgroup would probably help to get it merged, so we'll certainly send patches as soon as we get our cfq prototype in shape. Regards, Fernando --
I'm confused. Are these two of the competing controllers? Or are they complementary in some way? -- Dave --
Sorry, I did not explain myself correctly. I was not referring to a new
IO _controller_, I was just trying to say that the traditional IO
_schedulers_ already present in the mainstream kernel would benefit
from proper IO tracking too.

As an example, the current implementation of CFQ assumes that all IO is
generated in the IO context of the current task, which is only true in
the synchronous path. This renders CFQ almost unusable for controlling
asynchronous and buffered IO. Of course CFQ is not to blame here, since
it has no way to tell who the real originator of the IO was (CFQ just
sees IO requests coming from pdflush and other kernel threads).

However, once we have a working IO tracking infrastructure in place,
the existing IO _schedulers_ could be modified so that they use it to
determine the real owner/originator of asynchronous and buffered IO.
This can be done without turning IO schedulers into IO resource
controllers. If we can demonstrate that an IO tracking infrastructure
would also be beneficial outside the cgroups arena, it should be easier
to get it merged.
--
Would you like to split up IO into read and write IO? We know that
reads can be very latency sensitive when compared to writes. Should we
consider them

Won't that get too complex? What if the user has thousands of disks
with several

Are you suggesting that the IO and memory controller should always be
bound

Yes, I agree with this step as being the first step. Maybe extending
the

Very nice summary

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
I'd just suggest doing what is simplest and can be done in the smallest
amount of code. As long as it is functional in some way and can be

I think what Fernando is suggesting is that we *allow* each disk to be
treated separately, not that we actually separate them out. I agree
that with large disk count systems, it would get a bit nutty to deal
with I/O limits on each of them. It would also probably be nutty for
some dude with two disks in his system to have to set (or care about)
individual limits.

I guess I'm just arguing that we should allow pretty arbitrary grouping
of block devices into these resource pools.

-- Dave
--
As Dave pointed out, I just think that we should allow each disk to be
treated separately. To avoid the administration nightmare you mention,
adding block device grouping capabilities should suffice to solve most

That is a really good question. The I/O tracking patches split the
memory controller in two functional parts: (1) page tracking and (2)
memory accounting/cgroup policy enforcement. By doing so the memory
controller specific code can be separated from the rest, which,
admittedly, will not benefit the memory controller a great deal but,
hopefully, we can get cleaner code that is easier to maintain.

The important thing, though, is that with this separation the page
tracking bits can be easily reused by any subsystem that needs to keep
track of pages, and the I/O controller is certainly one such candidate.
Synchronous I/O is easy to deal with because everything is done in the
context of the task that generated the I/O, but buffered I/O and
asynchronous I/O are problematic. However, with the observation that
the owner of an I/O request happens to be the owner of the pages the
I/O buffers of that request reside in, it becomes clear that pdflush
and friends could use that information to determine who the originator
of the I/O is and handle the I/O request accordingly.

Going back to your question, with the current I/O tracking patches the
I/O controller would be bound to the page tracking functionality of
cgroups (page_cgroup), not the memory controller. We would not even
need to compile the memory controller. The dependency on cgroups would
still be there though.

As an aside, I guess that with some effort we could get rid of this
dependency by providing some basic tracking capabilities even when the
cgroups infrastructure is not being used. By doing so traditional I/O
schedulers such as CFQ could benefit from proper I/O tracking
capabilities without using cgroups.
Of course if the kernel has cgroups support compiled in the cgroups I/O tracking would be used instead (this idea was ...
Oops, I somehow ended up leaving your first question unanswered. Sorry.

I do not think we should consider them separately, as long as there is
a proper IO tracking infrastructure in place. As you mentioned, reads
can be very latency sensitive, but the read case could be treated as a
special case by the IO controller/IO tracking subsystem. There
certainly are optimization opportunities. For example, in the
synchronous I/O path we could mark bios with the iocontext of the
current task, because it will happen to be the originator of that IO.
By effectively caching the ownership information in the bio we can
avoid all the accesses to struct page, page_cgroup, etc., and reads
would definitely benefit from that.
--
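
The ownership-caching idea above can be sketched in user space as
follows. This is only an illustration under stated assumptions: the
toy_page and toy_bio structs, the integer owner id and all function
names are hypothetical stand-ins, not the kernel's struct page, struct
bio, page_cgroup or io_context.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NO_OWNER (-1)

/* Hypothetical stand-in for a tracked page: remembers who owns it. */
struct toy_page {
	int owner_cgroup;
};

/* Hypothetical stand-in for a bio: one page plus a cached owner. */
struct toy_bio {
	const struct toy_page *page;
	int cached_owner;	/* TOY_NO_OWNER until marked */
};

void toy_bio_init(struct toy_bio *bio, const struct toy_page *page)
{
	assert(bio != NULL);
	bio->page = page;
	bio->cached_owner = TOY_NO_OWNER;
}

/* Synchronous path: the submitting task is the originator, so record it. */
void toy_bio_mark_sync(struct toy_bio *bio, int submitter_cgroup)
{
	bio->cached_owner = submitter_cgroup;
}

/*
 * Resolve the owner: use the cached value when present, otherwise fall
 * back to the owner of the page (the buffered/asynchronous case).
 */
int toy_bio_owner(const struct toy_bio *bio)
{
	if (bio->cached_owner != TOY_NO_OWNER)
		return bio->cached_owner;
	if (bio->page != NULL)
		return bio->page->owner_cgroup;
	return TOY_NO_OWNER;
}
```

In this sketch the slow lookup through the page is only taken when
nothing was cached, which is the saving described above for reads.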
FYI, we should also take special care of pages being reclaimed, since the free memory of the cgroup these pages belong to may be really low. Dm-ioband is doing this. Thanks, Hirokazu Takahashi. --
Thank you for the heads-up. - Fernando --
Fernando

Nice summary. My comments are inline.

-Naveen

I/O limiting can be a special case of proportional bandwidth
scheduling. A process/process group can use its share of bandwidth,
and if there is spare bandwidth it is allowed to use that too. And if
we want to absolutely restrict it, we add another flag which specifies
that the specified proportion is exact and has an upper bound.
Let's say the ideal b/w for a device is 100MB/s and process 1 is
assigned b/w of 20%. When we say that the proportion is strict, the b/w
for process 1 will be 20% of the max b/w (which may

It can be argued that any scheduling decision wrt i/o belongs to
elevators. Till now they have been used to improve performance. But
with new requirements to isolate i/o based on process or cgroup, we
need to change the elevators.

If we add another layer of i/o scheduling (block layer I/O controller)
above elevators
1) It builds another layer of i/o scheduling (bandwidth or priority)
2) This new layer can have decisions for i/o scheduling which conflict
with the underlying elevator. e.g. If we decide to do b/w scheduling in
this new layer, there is no way a priority based elevator could work
underneath it.

If a custom make_request_fn is defined (which means the said device is
not using the existing elevator), it could build its own scheduling
rather than asking the kernel to add another layer at the time of i/o
--
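
Naveen's 100MB/s, 20% example can be worked through with a small
helper. This is a sketch only: the function name, the idea of passing
the spare bandwidth in as a number, and the cap at the device maximum
are assumptions made for illustration, not part of any posted patch.

```c
#include <assert.h>

/*
 * Bandwidth (in MB/s) a group may use under proportional scheduling.
 *   max_bw    - ideal bandwidth of the device
 *   share_pct - the group's assigned proportion, 0..100
 *   spare_bw  - bandwidth currently left unused by other groups
 *   strict    - non-zero: the proportion is exact and acts as a ceiling
 */
unsigned int allowed_bw(unsigned int max_bw, unsigned int share_pct,
			unsigned int spare_bw, int strict)
{
	unsigned long long base, total;

	assert(share_pct <= 100);

	base = (unsigned long long)max_bw * share_pct / 100;
	if (strict)
		return (unsigned int)base;	/* never exceeds the share */

	total = base + spare_bw;		/* may borrow idle bandwidth */
	if (total > max_bw)
		total = max_bw;			/* cannot exceed the device */
	return (unsigned int)total;
}
```

With max_bw = 100 and share_pct = 20, the strict case always yields 20,
while the non-strict case yields 20 under full contention and more when
other groups are idle.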
It seems like the same goes for the current Linux kernel
implementation: if processes issue a lot of I/O requests and the
io-request queue of a disk overflows, all subsequent I/O requests will
be blocked and their priorities become meaningless. In other words, it
won't work if it receives more requests than the ability/bandwidth of
the disk can handle.

It doesn't seem so weird if it won't work if a cgroup issues lots of

Thanks,
Hirokazu Takahashi
--
Hi Naveen,

I essentially agree with you. The nice thing about proportional
bandwidth scheduling is that we get bandwidth guarantees when there is
contention for the block device, but still get the benefits of
statistical multiplexing in the non-contended case. With strict IO

I have the impression there is a tendency to conflate two different
issues when discussing I/O schedulers and resource controllers, so let
me elaborate on this point.

On the one hand, we have the problem of feeding physical devices with
IO requests in such a way that we squeeze the maximum performance out
of them. Of course in some cases we may want to prioritize
responsiveness over throughput. In either case the kernel has to
perform the same basic operations: merging and dispatching IO requests.
There is no question that this is the elevator's job and that the
elevator should take into account the physical characteristics of the
device.

On the other hand, there is the problem of sharing an IO resource, i.e.
block device, between multiple tasks or groups of tasks. There are many
ways of sharing an IO resource depending on what we are trying to
accomplish: proportional bandwidth scheduling, priority-based
scheduling, etc. But to implement these sharing algorithms the kernel
has to determine the task whose IO will be submitted. In a sense, we
are scheduling tasks (and groups of tasks), not IO requests (which has
much in common with CPU scheduling). Besides, the sharing problem is
not directly related to the characteristics of the underlying device,
which means it does not need to be implemented at the elevator layer.

Traditional elevators limit themselves to scheduling IO requests to
disk with no regard to where they came from. However, new IO schedulers
such as CFQ combine this with IO prioritization capabilities. This
means that the elevator decides the application whose IO will be
dispatched next.
The problem is that at this layer there is not enough information to make such decisions in an accurate way, because, ...
Hello Fernando

What if we pass the task specific information to the elevator? We do
this for CFQ (where we pass the priority). And if we need any
additional information to be passed we could add that in a similar
manner.

I really liked your initial suggestion where step 1 would be to add the
I/O tracking patches, and then use this in CFQ and AS to do resource
sharing. And if we see any shortcoming with this approach, let's see

Is it possible to send the topology information to the elevators? And

Another possible approach, if the top layer cannot pass topology info
to the underlying block device elevators: we could use FIFO for the
underlying block devices, effectively disabling them. The top layer
will make its scheduling decision in a custom __make_request and the

I agree that we shouldn't be reinventing things for every RAID driver.
We could have a generic algorithm which everyone plugs into. If that is
not possible, we always have the option to create one in

-Naveen
--
A minor sidebar: 2008/8/7 Fernando Luis V
The same above also for IO operations/sec (bandwidth intended not only
in terms of bytes/sec), plus:

7. Optimal bandwidth usage: allow exceeding the IO limits to take
advantage of free/unused IO resources (i.e. allow "bursts" when the
whole physical bandwidth for a block device is not fully used and then
"throttle" again when IO from unlimited cgroups comes into place)

8. "fair throttling": avoid always throttling the same task within a
cgroup, but try to distribute the throttling among all the tasks

What about using major,minor numbers to identify each device and
account IO statistics? If a device is unplugged we could reset IO
statistics and/or remove IO limitations for that device from userspace
(i.e. by a daemon), but plugging/unplugging the device would not be
blocked/affected

Using deadline-based IO scheduling could be an interesting path to be
explored as well, IMHO, to try to guarantee per-cgroup minimum
bandwidth

Very nice RFC.

-Andrea
--
Hi Andrea!

Thank you for the ideas!

By the way, point "3." above (I/O bandwidth shaping) refers to IO
scheduling algorithms in general. When I wrote the RFC I thought that
once we have the IO tracking and accounting mechanisms in place,
choosing and implementing an algorithm (fair throttling, proportional
bandwidth scheduling, etc) would be easy in comparison, and that is the
reason a list was not included. Once I get more feedback from all of
you I will resend a more detailed

If a resource we want to control (a block device in this case) is
hot-plugged/unplugged, the corresponding cgroup-related structures
inside the kernel need to be allocated/freed dynamically, respectively.
The problem is that this is not always possible. For example, with the
current implementation of cgroups it is not possible to treat each
block device as a different cgroup subsystem/resource controlled,
because

Please note that the only thing we can do is to guarantee a minimum
bandwidth requirement when there is contention for an IO resource,
which is precisely what a proportional bandwidth scheduler does. Am I
missing something?
--
The whole subsystem is created at compile time, but controller data
structures are allocated dynamically (i.e. see struct mem_cgroup for
the memory controller). So, identifying each device with a name, or a
key like major,minor, instead of a reference/pointer to a struct could
help to handle this in userspace. I mean, if a device is unplugged a
userspace daemon can just handle the event and delete the controller
data structures allocated for this device, asynchronously, via a
userspace->kernel interface, and without holding a reference to that
particular block device in the kernel. Anyway, implementing a generic
interface that would allow defining hooks for hot-pluggable devices (or

Correct. Proportional bandwidth automatically allows guaranteeing min
requirements (unlike the IO limiting approach, which needs additional
mechanisms to achieve this).

In any case there's no guarantee for a cgroup/application to sustain
i.e. 10MB/s on a certain device, but this is a hard problem anyway, and
the best we can do is to try to satisfy "soft" constraints.

-Andrea
--
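
The idea of keying per-device state by major,minor rather than by a
pointer to the device can be sketched as a small lookup table. This is
a user-space illustration under stated assumptions: the fixed-size
table, the struct layout and every function name are hypothetical, and
dev_forget() merely stands in for whatever a userspace daemon would
trigger when it sees an unplug event.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEVS 8

/* Per-device accounting, identified only by its (major, minor) key. */
struct dev_stats {
	unsigned int dev_major, dev_minor;
	unsigned long long bytes;
	int in_use;
};

struct dev_table {
	struct dev_stats slot[MAX_DEVS];
};

void dev_table_init(struct dev_table *t)
{
	assert(t != NULL);
	memset(t, 0, sizeof(*t));
}

/* Return the entry for (dev_major, dev_minor), or NULL if none exists. */
struct dev_stats *dev_find(struct dev_table *t, unsigned int dev_major,
			   unsigned int dev_minor)
{
	size_t i;

	for (i = 0; i < MAX_DEVS; i++)
		if (t->slot[i].in_use &&
		    t->slot[i].dev_major == dev_major &&
		    t->slot[i].dev_minor == dev_minor)
			return &t->slot[i];
	return NULL;
}

/* Account `bytes` of I/O, creating the entry on first use. */
int dev_account(struct dev_table *t, unsigned int dev_major,
		unsigned int dev_minor, unsigned long long bytes)
{
	struct dev_stats *s = dev_find(t, dev_major, dev_minor);
	size_t i;

	for (i = 0; s == NULL && i < MAX_DEVS; i++)
		if (!t->slot[i].in_use) {
			s = &t->slot[i];
			s->dev_major = dev_major;
			s->dev_minor = dev_minor;
			s->bytes = 0;
			s->in_use = 1;
		}
	if (s == NULL)
		return -1;	/* table full */
	s->bytes += bytes;
	return 0;
}

/* Drop the entry for an unplugged device; no device reference needed. */
int dev_forget(struct dev_table *t, unsigned int dev_major,
	       unsigned int dev_minor)
{
	struct dev_stats *s = dev_find(t, dev_major, dev_minor);

	if (s == NULL)
		return -1;
	memset(s, 0, sizeof(*s));
	return 0;
}
```

Since nothing here holds a pointer to the device itself, removing the
entry can happen at any time after the unplug, which is the property
argued for above.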
Hi, Fernando,

The current implementation of bio-cgroup is quite basic: a certain page
is owned by the cgroup that allocated the page, which is the same way
the memory controller does it. In most cases this is enough and it
helps minimize the overhead.

I think you may want to add some feature to change the owner of a page.
It will be OK if we implement it step by step. I know there will be
some tradeoff between the overhead and the accuracy of tracking pages.

We also try to reduce the overhead of the tracking, whose code comes
from the memory controller though. We all should help the memory

I doubt the maximum size of I/O requests is a problem. You can't avoid
this problem as long as you use device mapper modules in such a bad
manner, even if the controller is implemented as a stand-alone
controller. There is no limitation if you only use dm-ioband without
any other device mapper modules.

And I think the device mapper team just started designing barrier
support. I guess it won't take long. Right, Alasdair?
We should know it is logically impossible to support barriers on some
types of device mapper modules such as LVM. You can't avoid the barrier
problem when you use this kind of multiple devices even if you
implement the controller in the block layer.

But I think a stand-alone implementation will have the merit that it
makes it easier to set up the configuration than dm-ioband does.
From this point of view, it would be good if you moved the algorithm of
dm-ioband into the block layer.
On the other hand, we should know it will make it impossible to use
--
The following is the part of the source code where the limitation comes from.
dm-table.c: dm_set_device_limits()

	/*
	 * Check if merge fn is supported.
	 * If not we'll force DM to use PAGE_SIZE or
	 * smaller I/O, just to be safe.
	 */
	if (q->merge_bvec_fn && !ti->type->merge)
		rs->max_sectors =
			min_not_zero(rs->max_sectors,
				     (unsigned int) (PAGE_SIZE >> 9));

As far as I can find, in 2.6.27-rc1-mm1 only some software RAID
drivers and the pktcdvd driver define merge_bvec_fn().
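
To make the size of the limitation concrete, here is a user-space
sketch of the clamp in the excerpt above. It assumes 512-byte sectors
(hence the shift by 9) and takes the page size as a parameter;
min_not_zero_u() is a plain-function stand-in for the kernel's
min_not_zero() macro, and clamp_max_sectors() is a hypothetical name.

```c
#include <assert.h>

/* Smaller of a and b, where 0 means "no limit set" and is ignored. */
unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/*
 * Clamp max_sectors to at most one page worth of 512-byte sectors,
 * mirroring the fallback taken when no merge function is available.
 */
unsigned int clamp_max_sectors(unsigned int max_sectors,
			       unsigned int page_size)
{
	assert(page_size >= 512);
	return min_not_zero_u(max_sectors, page_size >> 9);
}
```

Assuming a 4096-byte page, 4096 >> 9 is 8, so a request is capped at 8
sectors (4 KiB) whenever this path is taken.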
Thanks,
Ryo Tsuruta
--
Yup, exactly. The implication of this is that we may see a drop in performance in some RAID configurations. --
The current device-mapper introduces a bvec merge function for device mapper devices. IMHO, the limitation goes away once we implement this in dm-ioband. Am I right, Alasdair? Thanks, Ryo Tsuruta --
Ryo told me this isn't true anymore. The dm infrastructure introduced a
new feature to support multiple page-sized I/O requests, which was just
merged into the current Linux tree. So you and I don't need to worry
about this stuff anymore.

Ryo said he was going to make dm-ioband support this new feature and

Thanks,
Hirokazu Takahashi.
--
