Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts

From: Ryo Tsuruta
Date: Monday, August 4, 2008 - 1:51 am

Hi everyone,

This series of patches of dm-ioband now includes "The bio tracking mechanism,"
which has been posted individually to this mailing list.
This makes it easy for anybody to control the I/O bandwidth even when
the I/O is a delayed-write request.
Have fun!

This series of patches consists of two parts:
  1. dm-ioband
    Dm-ioband is an I/O bandwidth controller implemented as a
    device-mapper driver, which gives specified bandwidth to each job
    running on the same physical device. A job is a group of processes
    with the same pid or pgrp or uid or a virtual machine such as KVM
    or Xen. A job can also be a cgroup by applying the bio-cgroup patch.
  2. bio-cgroup
    Bio-cgroup is a BIO tracking mechanism, which is implemented on
    the cgroup memory subsystem. With this mechanism, it is possible to
    determine which cgroup each bio belongs to, even when the bio
    is one of the delayed-write requests issued from a kernel thread
    such as pdflush.
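
To illustrate the weight-based sharing described in item 1, here is a
minimal user-space sketch in C. It is not taken from the dm-ioband patch:
the struct, the function name and the idea of handing out "tokens" per
period are assumptions made only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one job; not the dm-ioband data structure. */
struct job {
	unsigned int weight;	/* relative share requested by the job */
	unsigned int tokens;	/* I/O tokens granted for the next period */
};

/*
 * Split total_tokens among the jobs in proportion to their weights.
 * Integer division rounds down, so a few tokens may stay unassigned.
 */
void share_tokens(struct job *jobs, size_t n, unsigned int total_tokens)
{
	unsigned long long sum = 0;
	size_t i;

	assert(jobs != NULL || n == 0);

	for (i = 0; i < n; i++)
		sum += jobs[i].weight;

	for (i = 0; i < n; i++) {
		if (sum == 0)
			jobs[i].tokens = 0;	/* no weights set: grant nothing */
		else
			jobs[i].tokens = (unsigned int)
				((unsigned long long)total_tokens *
				 jobs[i].weight / sum);
	}
}
```

With weights 40, 40 and 20 and 1000 tokens per period, the three jobs
would receive 400, 400 and 200 tokens respectively.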

The above two parts have been posted individually to this mailing list
until now, but from now on we will release them together.

  [PATCH 1/7] dm-ioband: Patch of device-mapper driver
  [PATCH 2/7] dm-ioband: Documentation of design overview, installation,
                         command reference and examples.
  [PATCH 3/7] bio-cgroup: Introduction
  [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  [PATCH 5/7] bio-cgroup: Remove a lot of "#ifdef"s
  [PATCH 6/7] bio-cgroup: Implement the bio-cgroup
  [PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband

Please see the following site for more information:
  Linux Block I/O Bandwidth Control Project
  http://people.valinux.co.jp/~ryov/bwctl/

Thanks,
Ryo Tsuruta
--

From: Ryo Tsuruta
Date: Monday, August 4, 2008 - 1:52 am

This is the dm-ioband version 1.4.0 release.

Dm-ioband is an I/O bandwidth controller implemented as a device-mapper
driver, which gives specified bandwidth to each job running on the same
physical device.

- Can be applied to the kernel 2.6.27-rc1-mm1.
- Changes from 1.3.0 (posted on July 11, 2008):
  - Fix the problem of processing urgent I/O requests.
    Dm-ioband gives priority to I/O requests whose pages have the
    PG_reclaim flag set. We thought this situation only happened on
    write requests, but it also happened on read requests, and it caused
    mishandling of urgent I/O requests. We have not yet determined
    whether this is proper behavior.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/Kconfig linux-2.6.27-rc1-mm1/drivers/md/Kconfig
--- linux-2.6.27-rc1-mm1.orig/drivers/md/Kconfig	2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/Kconfig	2008-08-01 16:44:02.000000000 +0900
@@ -275,4 +275,17 @@ config DM_UEVENT
 	---help---
 	Generate udev events for DM events.
 
+config DM_IOBAND
+	tristate "I/O bandwidth control (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	---help---
+	This device-mapper target allows to define how the
+	available bandwidth of a storage device should be
+	shared between processes, cgroups, the partitions or the LUNs.
+
+	Information on how to use dm-ioband is available in:
+	   <file:Documentation/device-mapper/ioband.txt>.
+
+	If unsure, say N.
+
 endif # MD
diff -uprN linux-2.6.27-rc1-mm1.orig/drivers/md/Makefile linux-2.6.27-rc1-mm1/drivers/md/Makefile
--- linux-2.6.27-rc1-mm1.orig/drivers/md/Makefile	2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1/drivers/md/Makefile	2008-08-01 16:44:02.000000000 +0900
@@ -7,6 +7,7 @@ dm-mod-objs	:= dm.o dm-table.o dm-target
 dm-multipath-objs := dm-path-selector.o dm-mpath.o
 dm-snapshot-objs := dm-snap.o ...
From: Ryo Tsuruta
Date: Monday, August 4, 2008 - 1:52 am

Here is the documentation of design overview, installation, command
reference and examples.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -uprN linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt
--- linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt	2008-08-01 16:44:02.000000000 +0900
@@ -0,0 +1,937 @@
+                     Block I/O bandwidth control: dm-ioband
+
+            -------------------------------------------------------
+
+   Table of Contents
+
+   [1]What's dm-ioband all about?
+
+   [2]Differences from the CFQ I/O scheduler
+
+   [3]How dm-ioband works.
+
+   [4]Setup and Installation
+
+   [5]Getting started
+
+   [6]Command Reference
+
+   [7]Examples
+
+What's dm-ioband all about?
+
+     dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+   driver. Several jobs using the same physical device have to share the
+   bandwidth of the device. dm-ioband gives bandwidth to each job according
+   to its weight, which each job can set its own value to.
+
+     A job is a group of processes with the same pid or pgrp or uid or a
+   virtual machine such as KVM or Xen. A job can also be a cgroup by applying
+   the bio-cgroup patch, which can be found at
+   http://people.valinux.co.jp/~ryov/bio-cgroup/.
+
+     +------+ +------+ +------+   +------+ +------+ +------+
+     |cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
+     |  A   | |  B   | |others|   |  X   | |  Y   | |others|
+     +--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+
+     +--V----+---V---+----V---+   +--V----+---V---+----V---+
+     | group | group | default|   | group | group | default|  ioband groups
+     |       |       |  group |   |       |       |  group |
+     ...
From: Ryo Tsuruta
Date: Monday, August 4, 2008 - 1:57 am

With this series of bio-cgroup patches, you can determine the owner of any
type of I/O, which enables dm-ioband, the I/O bandwidth controller,
to control block I/O bandwidth even when it accepts
delayed-write requests.
Dm-ioband can find the owner cgroup of each request.
Other people working on I/O bandwidth throttling could also use this
functionality to control asynchronous I/O with a little enhancement.

You have to apply the patch dm-ioband v1.4.0 before applying this series
of patches.

You also have to select the following config options when compiling the kernel:
  CONFIG_CGROUPS=y
  CONFIG_CGROUP_BIO=y
I also recommend selecting the options for the cgroup memory
subsystem, because this makes it possible to give some I/O bandwidth
and some memory to a certain cgroup to control delayed-write requests;
the processes in the cgroup can then dirty pages only
inside the cgroup, even when the given bandwidth is narrow.
  CONFIG_RESOURCE_COUNTERS=y
  CONFIG_CGROUP_MEM_RES_CTLR=y

This code is based on part of the cgroup memory subsystem.
I don't think the accuracy and overhead of the subsystem can be ignored
at this time, so we need to keep tuning it.

 --------------------------------------------------------

The following shows how to use dm-ioband with cgroups.
Suppose that you want to make two cgroups, which we call "bio cgroups"
here, to track block I/Os and assign them to the ioband device "ioband1".

First, mount the bio cgroup filesystem.

 # mount -t cgroup -o bio none /cgroup/bio

Then, make new bio cgroups and put some processes in them.

 # mkdir /cgroup/bio/bgroup1
 # mkdir /cgroup/bio/bgroup2
 # echo 1234 > /cgroup/bio/bgroup1/tasks
 # echo 5678 > /cgroup/bio/bgroup2/tasks

Now, check the ID of each bio cgroup that was just created.

 # cat /cgroup/bio/bgroup1/bio.id
   1
 # cat /cgroup/bio/bgroup2/bio.id
   2

Finally, attach the cgroups to "ioband1" and assign them ...
From: Ryo Tsuruta
Date: Monday, August 4, 2008 - 1:57 am

This patch splits the cgroup memory subsystem into two parts.
One is for tracking pages to find out the owners. The other is
for controlling how much memory should be assigned to
each cgroup.

With this patch, you can use the page tracking mechanism even if
the memory subsystem is off.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -Ndupr linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h
--- linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h	2008-08-01 12:18:28.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h	2008-08-01 19:03:21.000000000 +0900
@@ -20,12 +20,62 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
 
+#include <linux/rcupdate.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+#ifdef CONFIG_CGROUP_PAGE
+/*
+ * We use the lower bit of the page->page_cgroup pointer as a bit spin
+ * lock.  We need to ensure that page->page_cgroup is at least two
+ * byte aligned (based on comments from Nick Piggin).  But since
+ * bit_spin_lock doesn't actually set that lock bit in a non-debug
+ * uniprocessor kernel, we should avoid setting it here too.
+ */
+#define PAGE_CGROUP_LOCK_BIT    0x0
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define PAGE_CGROUP_LOCK        (1 << PAGE_CGROUP_LOCK_BIT)
+#else
+#define PAGE_CGROUP_LOCK        0x0
+#endif
+
+/*
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ */
+struct page_cgroup {
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct list_head lru;		/* per cgroup LRU list */
+	struct mem_cgroup *mem_cgroup;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+	struct page *page;
+	int flags;
+};
+#define PAGE_CGROUP_FLAG_CACHE	(0x1)	/* ...
From: Ryo Tsuruta
Date: Monday, August 4, 2008 - 1:59 am

This patch is for cleaning up the code of the cgroup memory subsystem
to remove a lot of "#ifdef"s.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -Ndupr linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c	2008-08-01 19:48:55.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c	2008-08-01 19:49:38.000000000 +0900
@@ -228,6 +228,47 @@ struct mem_cgroup *mem_cgroup_from_task(
 				struct mem_cgroup, css);
 }
 
+static inline void get_mem_cgroup(struct mem_cgroup *mem)
+{
+	css_get(&mem->css);
+}
+
+static inline void put_mem_cgroup(struct mem_cgroup *mem)
+{
+	css_put(&mem->css);
+}
+
+static inline void set_mem_cgroup(struct page_cgroup *pc,
+					struct mem_cgroup *mem)
+{
+	pc->mem_cgroup = mem;
+}
+
+static inline void clear_mem_cgroup(struct page_cgroup *pc)
+{
+	struct mem_cgroup *mem = pc->mem_cgroup;
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	pc->mem_cgroup = NULL;
+	put_mem_cgroup(mem);
+}
+
+static inline struct mem_cgroup *get_mem_page_cgroup(struct page_cgroup *pc)
+{
+	struct mem_cgroup *mem = pc->mem_cgroup;
+	css_get(&mem->css);
+	return mem;
+}
+
+/* This should be called in an RCU-protected section. */
+static inline struct mem_cgroup *mm_get_mem_cgroup(struct mm_struct *mm)
+{
+	struct mem_cgroup *mem;
+
+	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	get_mem_cgroup(mem);
+	return mem;
+}
+
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
 			struct page_cgroup *pc)
 {
@@ -297,6 +338,26 @@ static void __mem_cgroup_move_lists(stru
 	list_move(&pc->lru, &mz->lists[lru]);
 }
 
+static inline void mem_cgroup_add_page(struct page_cgroup *pc)
+{
+	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	__mem_cgroup_add_list(mz, ...
From: Ryo Tsuruta
Date: Monday, August 4, 2008 - 2:00 am

This patch implements the bio cgroup on the memory cgroup.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -Ndupr linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c
--- linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c	2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c	2008-08-01 19:18:38.000000000 +0900
@@ -84,24 +84,28 @@ void exit_io_context(void)
 	}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
 	struct io_context *ret;
 
 	ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
 	return ret;
 }
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h	1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h	2008-08-01 19:21:56.000000000 +0900
@@ -0,0 +1,159 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include ...
From: Ryo Tsuruta
Date: Monday, August 4, 2008 - 2:01 am

With this patch, dm-ioband can work with the bio cgroup.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>

diff -Ndupr linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c
--- linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c	2008-08-01 16:53:57.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c	2008-08-01 19:44:36.000000000 +0900
@@ -6,6 +6,7 @@
  * This file is released under the GPL.
  */
 #include <linux/bio.h>
+#include <linux/biocontrol.h>
 #include "dm.h"
 #include "dm-bio-list.h"
 #include "dm-ioband.h"
@@ -53,13 +54,13 @@ static int ioband_node(struct bio *bio)
 
 static int ioband_cgroup(struct bio *bio)
 {
-  /*
-   * This function should return the ID of the cgroup which issued "bio".
-   * The ID of the cgroup which the current process belongs to won't be
-   * suitable ID for this purpose, since some BIOs will be handled by kernel
-   * threads like aio or pdflush on behalf of the process requesting the BIOs.
-   */
-	return 0;	/* not implemented yet */
+	struct io_context *ioc = get_bio_cgroup_iocontext(bio);
+	int id = 0;
+	if (ioc) {
+		id = ioc->id;
+		put_io_context(ioc);
+	}
+	return id;
 }
 
 struct group_type dm_ioband_group_type[] = {
--

From: Takuya Yoshikawa
Date: Friday, August 8, 2008 - 12:10 am

Is this function fully implemented?
I tried to put a process into a group by writing to 
"/cgroup/bio/BGROUP/tasks" but failed.


Without "attach" function, it is difficult to check
the effectiveness of block I/O tracking.

Thanks,
- Takuya Yoshikawa
--

From: Ryo Tsuruta
Date: Friday, August 8, 2008 - 1:30 am

This function can be simplified further; there is some unnecessary code

Could you tell me what you actually did? I will try the same thing.

--
Ryo Tsuruta <ryov@valinux.co.jp>
--

From: Takuya Yoshikawa
Date: Friday, August 8, 2008 - 2:42 am

Hi Tsuruta-san,



I wanted to test my own scheduler which uses bio tracking information.
So I tried your patch, especially get_bio_cgroup_iocontext(), to get
the io_context from bio.

In my test, I made some threads with certain iopriorities run
concurrently. To schedule these threads based on their iopriorities,
I made a BGROUP directory for each iopriority.
e.g. /cgroup/bio/be0 ... /cgroup/bio/be7
Then, I tried to attach the processes to the appropriate groups.

But the processes stayed in the original group(id=0).
...

I am sorry but I have to leave now and I cannot come here next week.
--> I will take summer holidays.

I will reply to you later.

Thanks,
- Takuya Yoshikawa
--

From: Ryo Tsuruta
Date: Friday, August 8, 2008 - 4:41 am

In the current implementation, when a process moves to another cgroup:
  - Memory that is already allocated does not move to the new cgroup;
    it remains in the old one.
  - Only memory allocated after the move belongs to the new cgroup.
This behavior follows the memory controller.

Memory does not move between cgroups because that is such a heavy operation,
but it might be worthwhile under certain conditions.
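
The behavior described above can be modeled in a few lines of user-space C.
This is only a sketch under stated assumptions: the structs and function
names below are hypothetical stand-ins, not the kernel's page_cgroup code.
It shows that ownership is recorded once, at allocation time, and that
moving a task only affects pages allocated afterwards.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PAGES 8

/* Hypothetical stand-ins; these are not the kernel structures. */
struct task {
	int cgroup_id;		/* cgroup the task currently belongs to */
};

struct page_owner {
	int cgroup_id;		/* cgroup recorded when the page was allocated */
	int in_use;
};

struct page_owner pages[NR_PAGES];

/* Allocate a page and record the allocating task's current cgroup. */
int alloc_page_for(const struct task *t)
{
	int i;

	assert(t != NULL);
	for (i = 0; i < NR_PAGES; i++) {
		if (!pages[i].in_use) {
			pages[i].in_use = 1;
			pages[i].cgroup_id = t->cgroup_id;
			return i;
		}
	}
	return -1;	/* no free page */
}

/* Moving a task only changes where its future allocations are recorded. */
void move_task(struct task *t, int new_cgroup_id)
{
	assert(t != NULL);
	t->cgroup_id = new_cgroup_id;
}
```

A page allocated while the task is in cgroup 1 stays owned by cgroup 1
after move_task(&t, 2); only the next allocation is recorded for cgroup 2.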

Could you try to move a process between cgroups in the following way?

   # echo $$ > /cgroup/bio/be0/tasks
   # run_your_program
   # echo $$ > /cgroup/bio/be1/tasks
   # run_your_program

Have a nice vacation!

Thanks,
Ryo Tsuruta
--

From: Andrea Righi
Date: Tuesday, August 5, 2008 - 3:25 am

you can remove some ifdefs doing:

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
	if (likely(!memcg)) {
		rcu_read_lock();
		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
		/*
		 * For every charge from the cgroup, increment reference count
		 */
		css_get(&mem->css);
		rcu_read_unlock();
	} else {
		mem = memcg;
		css_get(&memcg->css);
	}
	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
		if (!(gfp_mask & __GFP_WAIT))
			goto out;

		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
			continue;

		/*
		 * try_to_free_mem_cgroup_pages() might not give us a full
		 * picture of reclaim. Some pages are reclaimed and might be
		 * moved to swap cache or just unmapped from the cgroup.
		 * Check the limit again to see if the reclaim reduced the
		 * current usage of the cgroup before giving up
		 */
		if (res_counter_check_under_limit(&mem->res))
			continue;

		if (!nr_retries--) {
			mem_cgroup_out_of_memory(mem, gfp_mask);
			goto out;
		}
	}
	pc->mem_cgroup = mem;
#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
--

From: Hirokazu Takahashi
Date: Tuesday, August 5, 2008 - 3:35 am

I think you don't have to care about this much, since one of the following
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, August 6, 2008 - 12:54 am

On Mon, 04 Aug 2008 17:57:48 +0900 (JST)

Please CC me or Balbir or Pavel (See Maintainer list) when you try this ;)

After this patch, the total structure is

 page <-> page_cgroup <-> bio_cgroup.
 (multiple bio_cgroup can be attached to page_cgroup)

Will this pointer chain add
  - a significant performance regression or
  - new race conditions
?

I would prefer a looser relationship between them.

For example, adding a simple function.
==
int get_page_io_id(struct page *)
 - returns an I/O cgroup ID for this page. If no ID is found, -1 is returned.
   The ID is not guaranteed to be a valid value. (The ID can be obsolete.)
==
Then just store the cgroup ID in page_cgroup at page allocation.
That makes bio_cgroup independent of page_cgroup: get the ID
if it is available, and avoid too much pointer walking.
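
A minimal user-space sketch of the proposal above might look like this.
Only the name get_page_io_id() and the "-1 if not found" rule come from
the text; the simplified structs, set_page_io_id() and IO_ID_NONE are
assumptions for illustration and do not match the real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define IO_ID_NONE (-1)

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct page_cgroup {
	int io_id;	/* I/O cgroup ID stored at allocation time */
};

struct page {
	struct page_cgroup *pc;	/* NULL if nothing is tracked for this page */
};

/* Record the I/O cgroup ID when the page is allocated. */
void set_page_io_id(struct page *page, struct page_cgroup *pc, int io_id)
{
	assert(page != NULL && pc != NULL);
	pc->io_id = io_id;
	page->pc = pc;
}

/*
 * Return the I/O cgroup ID for this page, or -1 if none is found.
 * The ID may be stale: the cgroup it names can already be gone, so the
 * caller has to validate it before use.
 */
int get_page_io_id(const struct page *page)
{
	if (page == NULL || page->pc == NULL)
		return IO_ID_NONE;
	return page->pc->io_id;
}
```

The point of the design is that the caller gets a plain integer and never
follows a pointer into bio_cgroup, at the price of having to tolerate an
obsolete ID.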

Thanks,

--

From: Hirokazu Takahashi
Date: Wednesday, August 6, 2008 - 4:43 am

I don't think it will cause significant performance loss, because
the link between a page and a page_cgroup already exists; the
memory resource controller prepared it. Bio_cgroup uses this link as it is
and adds nothing to it.

And the link between page_cgroup and bio_cgroup isn't protected
by any additional spin-locks, since the associated bio_cgroup is
guaranteed to exist as long as the bio_cgroup owns pages.

I've just noticed that most of the overhead comes from the spin-locks
taken when reclaiming the pages inside mem_cgroups and the spin-locks that
protect the links between pages and page_cgroups.
The latter overhead comes from the policy your team has chosen,
that page_cgroup structures are allocated on demand. I still feel
this approach doesn't make any sense, because the Linux kernel tries to
make use of most of the pages as far as it can, so most of them
have to be assigned their related page_cgroup. It would make us happy

I don't think there are any differences between a pointer and an ID.


--

From: kamezawa.hiroyu
Date: Wednesday, August 6, 2008 - 6:45 am

Hmm, I think page_cgroup's cost is visible when
1. a page is changed to be in-use state. (fault or radix-tree-insert)
2. a page is changed to be out-of-use state (fault or radix-tree-removal)
3. memcg hit its limit or global LRU reclaim runs.

"1" and "2" can be seen as a 5% loss of exec throughput.
"3" has not been measured (because the LRU walk itself is heavy).

What new accesses to page_cgroup will you add?
The overhead of the page <-> page_cgroup lock cannot be caught by
lock_stat now. Do you have numbers?
Now, multi-sized-page-cache has been discussed for a long time. If it's our
An ID can be obsolete; a pointer cannot. Does the memory cgroup have to take
care of bio cgroup's race conditions? (The race conditions are already
complicated enough.)

To be honest, I think adding a new (4 or 8 byte) member to struct page to
record bio-control information is a more straightforward approach. But, as
you might think, "there is no room".

Thanks,
-Kame

--

From: Hirokazu Takahashi
Date: Thursday, August 7, 2008 - 12:25 am

I haven't added any at this moment, but I think some people may want
to move some pages in the page cache from one cgroup to another cgroup.
When that time comes, I'll try to minimize the cost;
I will probably only update the link between a page_cgroup and

The problem is that every time the lock is held, the associated

I don't think I can agree to this.
When multi-sized-page-cache is introduced, some data structures will be
allocated to manage multi-sized-pages. I think page_cgroups should be
allocated at the same time. This approach will make things simple.

It seems that the on-demand allocation approach leads not only to
overhead but also to complexity and a lot of race conditions.
If you allocate page_cgroups when allocating page structures,
you can get rid of most of the locks and you no longer have to care about
page_cgroup allocation errors.
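
As a rough user-space sketch of the "allocate everything up front" idea:
the table type and the pc_table_* / pc_lookup names below are hypothetical,
and the real kernel would index by page frame within its own memory model.
The only point is that, after start-up, a lookup is plain indexing with no
lock and no allocation that could fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct page_cgroup. */
struct page_cgroup {
	int owner_id;	/* 0 means "not assigned to any cgroup" */
};

/* One table covering every page frame, allocated once at start-up. */
struct pc_table {
	struct page_cgroup *slots;
	size_t nr_pages;
};

/* Allocate all page_cgroups up front; this is the only step that can fail. */
int pc_table_init(struct pc_table *t, size_t nr_pages)
{
	assert(t != NULL && nr_pages > 0);
	t->slots = calloc(nr_pages, sizeof(*t->slots));
	if (t->slots == NULL)
		return -1;
	t->nr_pages = nr_pages;
	return 0;
}

/* Lookup is plain indexing by page frame number: no lock, no allocation. */
struct page_cgroup *pc_lookup(struct pc_table *t, size_t pfn)
{
	assert(t != NULL && pfn < t->nr_pages);
	return &t->slots[pfn];
}

/* Release the table. */
void pc_table_destroy(struct pc_table *t)
{
	assert(t != NULL);
	free(t->slots);
	t->slots = NULL;
	t->nr_pages = 0;
}
```

The cost of this design is the memory footprint: every page frame gets a
slot whether or not it is ever tracked.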

And it will also give us flexibility that memcg related data can be

Bio-cgroup just expects that the call-backs bio-cgroup prepares are called

But only if everyone allows me to add some new members into "struct page."
I think the same thing goes with memcg you're working on.


Thank you,
Hirokazu Takahashi.

--

From: KAMEZAWA Hiroyuki
Date: Thursday, August 7, 2008 - 1:21 am

On Thu, 07 Aug 2008 16:25:12 +0900 (JST)
I think "page" and "page_cgroup" are not such heavily shared objects in the fast path.
Footprint is also important here.
But it's not good for systems with small "NORMAL" pages.
This discussion should be held again when more users of page_cgroup appear and
its overhead is obvious.

Thanks,
-Kame



--

From: Hirokazu Takahashi
Date: Thursday, August 7, 2008 - 1:45 am

Even on a system with small "NORMAL" pages, if you
want to use the memcg feature, you have to allocate page_cgroups for most of
the pages in the system. It's impossible to avoid the allocation as far

Thanks,
Hirokazu Takahashi.
--

From: Dave Hansen
Date: Monday, August 4, 2008 - 10:20 am

During the Containers mini-summit at OLS, it was mentioned that there
are at least *FOUR* of these I/O controllers floating around.  Have you
talked to the other authors?  (I've cc'd at least one of them).

We obviously can't come to any kind of real consensus with people just
tossing the same patches back and forth.

-- Dave

--

From: Andrea Righi
Date: Monday, August 4, 2008 - 11:22 am

Dave,

thanks for this email first of all. I've talked with Satoshi (cc-ed)
about his solution "Yet another I/O bandwidth controlling subsystem for
CGroups based on CFQ".

I did some experiments trying to implement minimum bandwidth requirements
for my io-throttle controller, mapping the requirements to CFQ prio and
using Satoshi's controller. But this needs additional work and
testing right now, so I've not posted anything yet, just informed
Satoshi about this.

Unfortunately I've not talked to Ryo yet. I've continued my work using a
quite different approach, because the dm-ioband solution didn't work
with delayed-write requests. Now the bio tracking feature seems really
promising and I would like to do some tests ASAP, and review the patch
as well.

But I'm not yet convinced that limiting the IO writes at the device
mapper layer is the best solution. IMHO it would be better to throttle
applications' writes when they're dirtying pages in the page cache (the
io-throttle way), because when the IO requests arrive at the device
mapper it's too late (we would only have a lot of dirty pages that are
waiting to be flushed to the limited block devices, and maybe this could
lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
(at least for my requirements). Ryo, correct me if I'm wrong or if I've
not understood the dm-ioband approach.

Another thing I prefer is to directly define bandwidth limiting rules,
instead of using priorities/weights (i.e. 10MiB/s for /dev/sda), but
this seems to be in the dm-ioband TODO list, so maybe we can merge the
work I did in io-throttle to define such rules.

Anyway, I still need to look at the dm-ioband and bio-cgroup code in
detail, so probably all I said above is totally wrong...

-Andrea
--

From: Dave Hansen
Date: Monday, August 4, 2008 - 12:02 pm

The avoid-lots-of-page-dirtying problem sounds like a hard one.  But, if
you look at this in combination with the memory controller, they would
make a great team.

The memory controller keeps you from dirtying more than your limit of
pages (and pinning too much memory) even if the dm layer is doing the
throttling and itself can't throttle the memory usage.

I also don't think this is any different from the problems we have in
the regular VM these days.  Right now, people can dirty lots of pages on
devices that are slow.  The only thing dm-ioband would add is
changing how those devices *got* slow. :)

-- Dave

--

From: Andrea Righi
Date: Monday, August 4, 2008 - 1:44 pm

mmh... but in this way we would just move the OOM inside the cgroup,
that is a nice improvement, but the main problem is not resolved...

A safer approach IMHO is to force the tasks to wait synchronously on
each operation that directly or indirectly generates i/o.

In particular the solution used by the io-throttle controller to limit
the dirty-ratio in memory is to impose a sleep via
schedule_timeout_killable() in balance_dirty_pages() when a generic
process exceeds the limits defined for the cgroup it belongs to.
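
The arithmetic behind such a sleep can be sketched as follows. This is
not the io-throttle code: the function name, its parameters and the use of
milliseconds are assumptions for illustration. In the kernel the result
would be converted to jiffies and passed to schedule_timeout_killable().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how long (in milliseconds) a task should sleep so
 * that `bytes` written over `elapsed_ms` average out to at most
 * `limit_bytes_per_sec`. Returns 0 when the task is within its limit.
 * The multiplication can overflow for extremely large byte counts; a
 * real implementation would have to guard against that.
 */
uint64_t throttle_sleep_ms(uint64_t bytes, uint64_t elapsed_ms,
			   uint64_t limit_bytes_per_sec)
{
	uint64_t needed_ms;

	assert(limit_bytes_per_sec > 0);

	/* Time the transfer would have taken at exactly the limit rate. */
	needed_ms = bytes * 1000 / limit_bytes_per_sec;

	return needed_ms > elapsed_ms ? needed_ms - elapsed_ms : 0;
}
```

For example, a task that dirtied 20 MiB in one second under a 10 MiB/s
limit would be asked to sleep for another second.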

Limiting read operations is a lot easier, because they're always
synchronized with i/o requests.

-Andrea
--

From: Dave Hansen
Date: Monday, August 4, 2008 - 1:50 pm

Fine in theory, hard in practice. :)

I think the best we can hope for is to keep parity with what happens in
the rest of the kernel.  We already have a problem today with people
mmap()'ing lots of memory and dirtying it all at once.  Adding an i/o
bandwidth controller or a memory controller isn't really going to fix
that.  I think it is outside the scope of the i/o (and memory)
controllers until we solve it generically, first.

-- Dave

--

From: Hirokazu Takahashi
Date: Monday, August 4, 2008 - 11:28 pm

Yes, that's right. This should be solved.

But there is a good thing when you use a memory controller.
A problem that occurs in a certain cgroup will be confined to that cgroup.
I think this is a great point, don't you think?

Thank you,
Hirokazu Takahashi.




--

From: Paul Menage
Date: Monday, August 4, 2008 - 10:55 pm

I think that you're conflating two issues:

- controlling how much dirty memory a cgroup can have at any given
time (since dirty memory is much harder/slower to reclaim than clean
memory)

- controlling how much effect a cgroup can have on a given I/O device.

By controlling the rate at which a task can generate dirty pages,
you're not really limiting either of these. I think you'd have to set
your I/O limits artificially low to prevent a case of a process
writing a large data file and then doing fsync() on it, which would
then hit the disk with the entire file at once, and blow away any QoS
guarantees for other groups.

As Dave suggested, I think it would make more sense to have your
page-dirtying throttle points hook into the memory controller instead,
and allow the memory controller to track/limit dirty pages for a
cgroup, and potentially do throttling as part of that.

Paul
--

From: Balbir Singh
Date: Monday, August 4, 2008 - 11:03 pm

Yes, that would be nicer. The IO controller should control both reads and writes,
and dirty pages are mostly related to writes.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Andrea Righi
Date: Tuesday, August 5, 2008 - 2:27 am

Anyway, the dirty-page ratio is directly proportional to the IO that will
be performed on the real device, isn't it? This wouldn't prevent IO
bursts as you correctly say, but IMHO it is a simple and quite effective
way to measure the IO write activity of each cgroup on each affected
device.

To prevent the IO peaks I usually reduce the vm_dirty_ratio, but, ok,
this is a workaround, not the solution to the problem either.

IMHO, based on the dirty-page rate measurement, we should apply both
limiting methods: throttle the dirty-page ratio to prevent too many dirty
pages in the system (hard to reclaim and causing
unpredictable/unpleasant/unresponsive behaviour), and throttle the
dispatching of IO requests at the device-mapper/IO-scheduler layer to
smooth IO peaks/bursts, generated by fsync() and similar scenarios.

Another approach could be to implement the measurement in the
elevator, looking at the time elapsed between when the IO request is issued
to the drive and when the request is served. So, look at the start time T1,
completion time T2, take the difference (T2 - T1) and say: cgroup C1
consumed an amount of IO of (T2 - T1), and also use a token-bucket
policy to fill/reduce the "credits" of each IO cgroup in terms of IO
time slots. This would be a more precise measurement, instead of trying
to predict how expensive the IO operation will be, only looking at the
dirty-page ratio. Then throttle both dirty-page ratio *and* the
dispatching of the IO requests submitted by the cgroup that exceeds the

Yes, implementing page-dirty throttling in the memory controller seems
absolutely reasonable. I can try to move in this direction, merge the
page-dirty throttling in memory controller and also post the RFC.

Thanks,
-Andrea
--

From: Dave Hansen
Date: Tuesday, August 5, 2008 - 9:25 am

Yeah, I'm sure we're going to have to get to setting the dirty ratio
        
        $ cat /proc/sys/vm/dirty_ratio
        40

on a per-container basis at *some* point.  We might as well do it
earlier rather than later.
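
One way to picture a per-container dirty ratio is to apply the percentage
to the container's own memory limit instead of to system memory. The
helper below is a sketch under that assumption only; the function name and
the choice of the memory limit as the base are hypothetical and not part
of any posted patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of pages a container may dirty before its
 * writers are throttled, taken as dirty_ratio percent of the container's
 * memory limit in pages.
 */
uint64_t container_dirty_limit(uint64_t limit_pages, unsigned int dirty_ratio)
{
	assert(dirty_ratio <= 100);
	return limit_pages * dirty_ratio / 100;
}
```

With a limit of 25600 pages and the ratio of 40 shown above, the container
could dirty 10240 pages.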

-- Dave

--

From: Hirokazu Takahashi
Date: Monday, August 4, 2008 - 11:16 pm

Hi, Andrea,


The concept of dm-ioband assumes that it is used with the cgroup memory
controller as well as the bio cgroup. The memory controller is supposed
to control memory allocation and the dirty-page ratio inside each cgroup.

Some members of the cgroup memory controller team have just started to
implement the latter mechanism. They are trying to give each cgroup a
threshold that limits the number of dirty pages in the group.


I guess it would make the memory controller team happier if you
could help them make their dirty-page ratio control functionality
much cooler and more generic. I think their goal is almost the same

Thank you,
Hirokazu Takahashi.
--

From: Andrea Righi
Date: Tuesday, August 5, 2008 - 2:31 am

Interesting, did they also post a patch or RFC?

-Andrea
--

From: Hirokazu Takahashi
Date: Tuesday, August 5, 2008 - 3:01 am

You can take a look at the thread starting from
http://www.ussg.iu.edu/hypermail/linux/kernel/0807.1/0472.html,
whose subject is "[PATCH][RFC] dirty balancing for cgroups."

This project has just started, so it would be a good time to
discuss it with them.

Thanks,
Hirokazu Takahashi.

--

From: Satoshi UCHIDA
Date: Monday, August 4, 2008 - 7:50 pm

Hi, Andrea.

I participated in Containers Mini-summit.
And, I talked with Mr. Andrew Morton in The Linux Foundation Japan
Symposium BoF, Japan, July 10th.

Currently, several I/O controller patches have been posted to the ML, and
each of them keeps being resent in improved versions.
Neither we nor the maintainers like this situation.
We wanted to resolve it at the Mini-summit, but unfortunately
no other developers participated.
(I couldn't give an opinion, because my English skill is low.)
Mr. Naveen presented his approach at the Linux Symposium, and we discussed
I/O control for a little while after his presentation.


Mr. Andrew advised me that we "should discuss the design more and more".
And, at the Containers Mini-summit (and Linux Symposium 2008 in Ottawa),
Paul said that what we need is to decide on the requirements first.
So, we must discuss requirements and design.

My requirement is
 * to be able to distribute performance moderately.
 (* to be able to isolate each group(environment)). 

I guess (it may be wrong)
 Naveen's requirement is
   * to be able to handle latency.
      (high priority is always precede in handling I/O)
   (Only share isn't just given priority to, like CFQ.)
   * to be able to distribute performance moderately.
 Andrea's requirement is
   * to be able to set and control by absolute(direct) performance.
 Ryo's requirement is
   * to be able to distribute performance moderately.
   * to be able to set and control I/Os at flexible range
         (multi device such as LVM).

I think that most solutions control I/O performance moderately
(by using weight/priority/percentage/etc. rather than absolute values)
because disk I/O performance is not constant and is affected by the
situation (such as application, file(data) balance, and so on).
So, it is difficult to guarantee performance that is set as an
absolute bandwidth.
If devices had constant performance, it would be good to control by
absolute bandwidth.
And, when guaranteeing it by the low ability, it'll be ...
From: Andrea Righi
Date: Tuesday, August 5, 2008 - 2:28 am

* improve IO performance predictability of each cgroup

It would be probably the best place to evaluate the "cost" of each

Agree. At least, maybe we should consider if an IO controller could be

I'll collect some numbers and keep you informed.

-Andrea
--

From: Ryo Tsuruta
Date: Tuesday, August 5, 2008 - 6:17 am

Hi Andrea, Satoshi and all,


We've implemented dm-ioband and bio-cgroup to meet the following requirements:
    * Assign some bandwidth to each group on the same device.
      A group is a set of processes, which may be a cgroup.
    * Assign some bandwidth to each partition on the same device.
      It can work with the process group based bandwidth control.
        ex) With this feature, you can assign 40% of the bandwidth of a
	    disk to /root and 60% of it to /usr.
    * It can work with virtual machines such as Xen and KVM.
      I/O requests issued from virtual machines have to be controlled.
    * It should work with any type of I/O scheduler, including ones which
      will be released in the future.
    * Support multiple devices which share the same bandwidth such as
      raid disks and LVM.   
    * Handle asynchronous I/O requests such as AIO request and delayed 
      write requests.
        - This can be done with bio-cgroup, which uses the page-tracking
	  mechanism the cgroup memory controller has.
    * Control dirty page ratio.
        - This can be done with the cgroup memory controller in the near
	  future. It would be great if you could also use other features
	  the memory controller is going to have with dm-ioband.
    * Make it easy to enhance.
        - The current implementation of dm-ioband has an interface to
	  add a new policy to control I/O requests. You can easily add
	  I/O throttling policy if you want.
    * Fine grained bandwidth control.
    * Keep I/O throughput.
    * Make it scalable.
    * It should work correctly if the I/O load is quite high,

I don't have any documentation besides what is on the website.

Thanks,
Ryo Tsuruta
--

From: Dave Hansen
Date: Tuesday, August 5, 2008 - 9:20 am

Isn't this one of the core points that we keep going back and forth
over?  It seems like people are arguing in circles over this:

Do we:
	1. control potential memory usage by throttling I/O
or
	2. Throttle I/O when memory is full

I might lean toward (1) if we didn't already have a memory controller.
But, we have one, and it works.  Also, we *already* do (2) in the
kernel, so it would seem to graft well onto existing mechanisms that we
have.

I/O controllers should not worry about memory.  They're going to have a
hard enough time getting the I/O part right. :)

Or, am I over-simplifying this?

-- Dave


--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 5, 2008 - 7:44 pm

On Tue, 05 Aug 2008 09:20:18 -0700
memcg has more problems now ;(

The only difficult thing about limiting the dirty-ratio in memcg is how to
count dirty pages. If the I/O controller's hook helps, that's good.

My small concern is "What happens if we throttle I/O bandwidth too low
under some memcg?" In such a cgroup, we may see more OOMs because I/O will
not finish in time.
A system admin has to find some way to avoid this.

But please do I/O control first. Dirty-page control is related but different
layer's problem, I think.

Thanks,



--

From: Balbir Singh
Date: Tuesday, August 5, 2008 - 8:30 pm

Yes, please solve the I/O control problem first.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--

From: Hirokazu Takahashi
Date: Tuesday, August 5, 2008 - 11:48 pm

I/O controllers are just supposed to emulate a slow device from the point
of view of the processes in a certain cgroup or something. I think
the memory management layer and the memory controller are the ones
which should be able to handle these, which might be as slow as

Yup.

Thanks,
Hirokazu Takahashi.
--

From: Hirokazu Takahashi
Date: Tuesday, August 5, 2008 - 5:01 am

Yes, this is one of the problems the Linux kernel still has, which should
be solved.

But I believe this should be done in the linux memory management layer
including the cgroup memory controller, which has to work correctly
on any type of device with various access speeds.

I think it's better that I/O controllers should only focus on flow of
I/O requests. This approach will keep the implementation of linux

Thank you,
Hirokazu Takahashi.

--

From: Balbir Singh
Date: Monday, August 4, 2008 - 11:34 am

Ryo and Andrea - Naveen and Satoshi met up at OLS and discussed their approach.
It would be really nice to see an RFC, I know Andrea did work on this and
compared the approaches.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Andrea Righi
Date: Monday, August 4, 2008 - 1:42 pm

Yes, I wrote down something about the comparison of priority-based vs
bandwidth shaping solutions in terms of performance predictability.  And
other considerations, like the one I cited before, about dirty-ratio
throttling in memory, AIO handling, etc.

Something is also reported in the io-throttle documentation:

http://marc.info/?l=linux-kernel&m=121780176907686&w=2

But ok, I agree with Balbir, I can try to put the things together (in a
better form in particular) and try to post an RFC together with Ryo.

Ryo, do you have other documentation besides the info reported in the
dm-ioband website?

Thanks,
-Andrea
--

From: Fernando Luis
Date: Tuesday, August 5, 2008 - 6:13 pm

Hi Dave,

I have been tracking the memory controller patches for a while which
spurred my interest in cgroups and prompted me to start working on I/O
bandwidth controlling mechanisms. This year I have had several
opportunities to discuss the design challenges of i/o controllers with
the NEC and VALinux Japan teams (CCed), most recently last month during
the Linux Foundation Japan Linux Symposium, where we took advantage of
Andrew Morton's visit to Japan to do some brainstorming on this topic. I
will try to summarize what was discussed there (and in the Linux Storage
& Filesystem Workshop earlier this year) and propose a hopefully
acceptable way to proceed and try to get things started.

This RFC ended up being a bit longer than I had originally intended, but
hopefully it will serve as the start of a fruitful discussion.

As you pointed out, it seems that there is not much consensus building
going on, but that does not mean there is a lack of interest. To get the
ball rolling it is probably a good idea to clarify the state of things
and try to establish what we are trying to accomplish.

*** State of things in the mainstream kernel
The kernel has had somewhat advanced I/O control capabilities for quite
some time now: CFQ. But the current CFQ has some problems:
  - I/O priority can be set by PID, PGRP, or UID, but...
  - ...all the processes that fall within the same class/priority are
scheduled together and arbitrary groupings are not possible.
  - Buffered I/O is not handled properly.
  - CFQ's IO priority is an attribute of a process that affects all
devices it sends I/O requests to. In other words, with the current
implementation it is not possible to assign per-device IO priorities to
a task.

*** Goals
  1. Cgroups-aware I/O scheduling (being able to define arbitrary
groupings of processes and treat each group as a single scheduling
entity).
  2. Being able to perform I/O bandwidth control independently on each
device.
  3. I/O bandwidth shaping.
  4. ...
From: Ryo Tsuruta
Date: Tuesday, August 5, 2008 - 11:18 pm

I'd like to add the following item to the goals.

  7. Selectable from multiple bandwidth control policies (proportion,

I agree with your plan.
We will keep improving bio-cgroup and porting it to the latest kernel.

Thanks,
Ryo Tsuruta
--

From: Fernando Luis
Date: Tuesday, August 5, 2008 - 11:41 pm

Having more users of bio-cgroup would probably help to get it merged, so
we'll certainly send patches as soon as we get our cfq prototype in
shape.

Regards,

Fernando

--

From: Dave Hansen
Date: Wednesday, August 6, 2008 - 8:48 am

I'm confused.  Are these two of the competing controllers?  Or are they
complementary in some way?

-- Dave

--

From: Fernando Luis
Date: Wednesday, August 6, 2008 - 9:38 pm

Sorry, I did not explain myself correctly. I was not referring to a new
IO _controller_, I was just trying to say that the traditional IO
_schedulers_ already present in the mainstream kernel would benefit from
proper IO tracking too. As an example, the current implementation of CFQ
assumes that all IO is generated in the IO context of the current task,
which is only true in the synchronous path. This renders CFQ almost
unusable for controlling asynchronous and buffered IO. Of course CFQ
is not to blame here, since it has no way to tell who the real
originator of the IO was (CFQ just sees IO requests coming from pdflush
and other kernel threads).

However, once we have a working IO tracking infrastructure in place, the
existing IO _schedulers_ could be modified so that they use it to
determine the real owner/originator of asynchronous and buffered IO.
This can be done without turning IO schedulers into IO resource
controllers. If we can demonstrate that an IO tracking infrastructure
would also be beneficial outside the cgroups arena, it should be easier
to get it merged.

--

From: Balbir Singh
Date: Wednesday, August 6, 2008 - 9:42 am

Would you like to split up IO into read and write IO? We know that reads can be
very latency sensitive when compared to writes. Should we consider them

Won't that get too complex? What if the user has thousands of disks with several

Are you suggesting that the IO and memory controller should always be bound

Yes, I agree with this step as being the first step. Maybe extending the

Very nice summary

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--

From: Dave Hansen
Date: Wednesday, August 6, 2008 - 11:00 am

I'd just suggest doing what is simplest and can be done in the smallest
amount of code.  As long as it is functional in some way and can be

I think what Fernando is suggesting is that we *allow* each disk to be
treated separately, not that we actually separate them out.  I agree
that with large disk count systems, it would get a bit nutty to deal
with I/O limits on each of them.  It would also probably be nutty for
some dude with two disks in his system to have to set (or care about)
individual limits.

I guess I'm just arguing that we should allow pretty arbitrary grouping
of block devices into these resource pools.

-- Dave

--

From: Fernando Luis
Date: Wednesday, August 6, 2008 - 7:44 pm

As Dave pointed out I just think that we should allow each disk to be
treated separately. To avoid the administration nightmare you mention
adding block device grouping capabilities should suffice to solve most
That is a really good question. The I/O tracking patches split the
memory controller in two functional parts: (1) page tracking and (2)
memory accounting/cgroup policy enforcement. By doing so the memory
controller specific code can be separated from the rest, which
admittedly, will not benefit the memory controller a great deal but,
hopefully, we can get cleaner code that is easier to maintain.

The important thing, though, is that with this separation the page
tracking bits can be easily reused by any subsystem that needs to keep
track of pages, and the I/O controller is certainly one such candidate.
Synchronous I/O is easy to deal with because everything is done in the
context of the task that generated the I/O, but buffered I/O and
asynchronous I/O are problematic. However, with the observation that the
owner of an I/O request happens to be the owner of the pages the I/O
buffers of that request reside in, it becomes clear that pdflush and
friends could use that information to determine who the originator of
the I/O is and handle the I/O request accordingly.

Going back to your question, with the current I/O tracking patches I/O
controller would be bound to the page tracking functionality of cgroups
(page_cgroup) not the memory controller. We would not even need to
compile the memory controller. The dependency on cgroups would still be
there though.

As an aside, I guess that with some effort we could get rid of this
dependency by providing some basic tracking capabilities even when the
cgroups infrastructure is not being used. By doing so traditional I/O
schedulers such as CFQ could benefit from proper I/O tracking
capabilities without using cgroups. Of course if the kernel has cgroups
support compiled in the cgroups I/O tracking would be used instead (this
idea was ...
From: Fernando Luis
Date: Wednesday, August 6, 2008 - 8:01 pm

Oops, I somehow ended up leaving your first question unanswered. Sorry.

I do not think we should consider them separately, as long as there is a
proper IO tracking infrastructure in place. As you mentioned, reads can
be very latency sensitive, but the read case could be treated as a
special case by the IO controller/IO tracking subsystem. There certainly are
optimization opportunities. For example, in the synchronous I/O path we
could mark bios with the iocontext of the current task, because it will
happen to be the originator of that IO. By effectively caching the ownership
information in the bio we can avoid all the accesses to struct page,
page_cgroup, etc, and reads would definitely benefit from that.
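
To make the caching idea concrete, here is a rough userspace sketch. The
types below are simplified stand-ins and not the kernel's struct bio,
struct page or page_cgroup; the field and function names are hypothetical.
The synchronous path records the owner in the bio up front, while delayed
writes fall back to the slower page-based lookup.

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* Stand-in for per-page ownership tracking (hypothetical). */
struct page_sketch {
	int owner_id;
};

/* Stand-in for a bio that can cache its owner (hypothetical). */
struct bio_sketch {
	const struct page_sketch *page;
	int cached_owner;           /* NO_OWNER until someone fills it in */
};

static int page_lookups;            /* counts slow-path lookups, for illustration */

/* Synchronous path: the current task is the originator, so cache it. */
static void bio_mark_sync(struct bio_sketch *bio, int current_owner)
{
	bio->cached_owner = current_owner;
}

/* Return the owner, touching the page only when nothing is cached. */
static int bio_owner(const struct bio_sketch *bio)
{
	if (bio->cached_owner != NO_OWNER)
		return bio->cached_owner;
	assert(bio->page != NULL);
	page_lookups++;
	return bio->page->owner_id;
}
```

With this layout a synchronous read never touches the page tracking data,
which is the saving described above.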

--

From: Hirokazu Takahashi
Date: Friday, August 8, 2008 - 4:39 am

FYI, we should also take special care of pages being reclaimed, the free
memory of the cgroup these pages belong to may be really low.
Dm-ioband is doing this.

Thanks,
Hirokazu Takahashi.

--

From: Fernando Luis
Date: Monday, August 11, 2008 - 10:35 pm

Thank you for the heads-up.

- Fernando

--

From: Naveen Gupta
Date: Wednesday, August 6, 2008 - 12:37 pm

Fernando

Nice summary. My comments are inline.

-Naveen


I/O limiting can be a special case of proportional bandwidth
scheduling. A process/process group can use its share of
bandwidth and, if there is spare bandwidth, it can be allowed to use that
too. And if we want to absolutely restrict it, we add another flag which
specifies that the specified proportion is exact and has an upper
bound.
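
A small sketch of this rule, under the assumption that the scheduler knows
the device's ideal bandwidth and how much of it is currently unused. The
struct, the flag name and the units (MB/s) are made up for illustration.

```c
#include <assert.h>

/* Hypothetical group descriptor. */
struct bw_group {
	unsigned int share_pct;  /* assigned proportion, 0..100 */
	int strict;              /* nonzero: the proportion is also an upper bound */
};

/*
 * Bandwidth a group may use, in MB/s.
 * max_bw:   ideal bandwidth of the device
 * spare_bw: bandwidth currently left unused by the other groups
 */
static unsigned int bw_allowed(const struct bw_group *g,
			       unsigned int max_bw, unsigned int spare_bw)
{
	unsigned int base, total;

	assert(g->share_pct <= 100);
	base = max_bw * g->share_pct / 100;
	if (g->strict)
		return base;             /* exact proportion, never more */
	total = base + spare_bw;         /* may borrow whatever is idle */
	return total > max_bw ? max_bw : total;
}
```

For a 100MB/s device and a 20% share, a strict group gets 20MB/s no matter
what, while a non-strict group with 50MB/s idle could use up to 70MB/s.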

Let's say the ideal b/w for a device is 100MB/s

And process 1 is assigned b/w of 20%. When we say that the proportion
is strict, the b/w for process 1 will be 20% of the max b/w (which may

It can be argued that any scheduling decision with respect to i/o belongs to
elevators. Till now they have been used to improve performance. But
with new requirements to isolate i/o based on process or cgroup, we
need to change the elevators.

If we add another layer of i/o scheduling (block layer I/O controller)
above elevators
1) It builds another layer of i/o scheduling (bandwidth or priority)
2) This new layer can have decisions for i/o scheduling which conflict
with underlying elevator. e.g. If we decide to do b/w scheduling in
this new layer, there is no way a priority based elevator could work
underneath it.

If a custom make_request_fn is defined (which means the said device is
not using the existing elevator), it could build its own scheduling
rather than asking the kernel to add another layer at the time of i/o
--

From: Hirokazu Takahashi
Date: Thursday, August 7, 2008 - 1:30 am

It seems like the same goes for the current Linux kernel implementation:
if processes issue a lot of I/O requests and the io-request queue
of a disk overflows, all subsequent I/O requests will be blocked
and their priorities become meaningless.
In other words, it won't work if it receives more requests than
the ability/bandwidth of a disk can handle.

It doesn't seem so weird if it won't work if a cgroup issues lots of

Thanks,
Hirokazu Takahashi

--

From: Fernando Luis
Date: Thursday, August 7, 2008 - 6:17 am

Hi Naveen,

I essentially agree with you. The nice thing about proportional
bandwidth scheduling is that we get bandwidth guarantees when there is
contention for the block device, but still get the benefits of
statistical multiplexing in the non-contended case. With strict IO
I have the impression there is a tendency to conflate two different
issues when discussing I/O schedulers and resource controllers, so let
me elaborate on this point.

On the one hand, we have the problem of feeding physical devices with IO
requests in such a way that we squeeze the maximum performance out of
them. Of course in some cases we may want to prioritize responsiveness
over throughput. In either case the kernel has to perform the same basic
operations: merging and dispatching IO requests. There is no discussion
this is the elevator's job and the elevator should take into account the
physical characteristics of the device.

On the other hand, there is the problem of sharing an IO resource, i.e.
block device, between multiple tasks or groups of tasks. There are many
ways of sharing an IO resource depending on what we are trying to
accomplish: proportional bandwidth scheduling, priority-based
scheduling, etc. But to implement these sharing algorithms the kernel has
to determine the task whose IO will be submitted. In a sense, we are
scheduling tasks (and groups of tasks) not IO requests (which has much
in common with CPU scheduling). Besides, the sharing problem is not
directly related to the characteristics of the underlying device, which
means it does not need to be implemented at the elevator layer.

Traditional elevators limit themselves to scheduling IO requests to disk
with no regard to where they came from. However, new IO schedulers such as
CFQ combine this with IO prioritization capabilities. This means that
the elevator decides the application whose IO will be dispatched next.
The problem is that at this layer there is not enough information to
make such decisions in an accurate way, because, ...
From: Naveen Gupta
Date: Monday, August 11, 2008 - 11:18 am

Hello Fernando



What if we pass the task specific information to the elevator. We do
this for CFQ (where we pass the priority). And if we need any
additional information to be passed we could add that in a similar
manner.

I really liked your initial suggestion where step 1 would be to add
I/O tracking patches. And then use this in CFQ and AS to do resource
sharing. And if we see any shortcoming with this approach. Let's see

Is it possible to send the topology information to the elevators. And

Another possible approach, if the top layer cannot pass topology info
to the underlying block device elevators: we could use FIFO for the
underlying block devices, effectively disabling them. The top layer
will make its scheduling decision in custom __make_request and the
I agree that we shouldn't be reinventing things for every RAID driver.
We could have a generic algorithm which everyone plugs into. If
that is not possible, we always have the option to create one in


-Naveen
--

From: David Collier-Brown
Date: Monday, August 11, 2008 - 9:35 am

A minor sidebar:
2008/8/7 Fernando Luis V
From: Andrea Righi
Date: Thursday, August 7, 2008 - 12:46 am

The same above also for IO operations/sec (bandwidth intended not only
in terms of bytes/sec), plus:

7. Optimal bandwidth usage: allow cgroups to exceed the IO limits to take
advantage of free/unused IO resources (i.e. allow "bursts" when the
whole physical bandwidth for a block device is not fully used and then
"throttle" again when IO from unlimited cgroups comes into place)

8. "fair throttling": avoid always throttling the same task within a
cgroup, but try to distribute the throttling among all the tasks

What about using major,minor numbers to identify each device and account
IO statistics? If a device is unplugged we could reset IO statistics
and/or remove IO limitations for that device from userspace (i.e. by a
daemon), but plugging/unplugging the device would not be blocked/affected

Using deadline-based IO scheduling could be an interesting path to
explore as well, IMHO, to try to guarantee per-cgroup minimum bandwidth

Very nice RFC.

-Andrea
--

From: Fernando Luis
Date: Thursday, August 7, 2008 - 6:59 am

Hi Andrea!


Thank you for the ideas!

By the way, point "3." above (I/O bandwidth shaping) refers to IO
scheduling algorithms in general. When I wrote the RFC I thought that
once we have the IO tracking and accounting mechanisms in place choosing
and implementing an algorithm (fair throttling, proportional bandwidth
scheduling, etc) would be easy in comparison, and that is the reason a
list was not included.

Once I get more feedback from all of you I will resend a more detailed
If a resource we want to control (a block device in this case) is
hot-plugged/unplugged the corresponding cgroup-related structures inside
the kernel need to be allocated/freed dynamically, respectively. The
problem is that this is not always possible. For example, with the
current implementation of cgroups it is not possible to treat each block
device as a different cgroup subsystem/resource controller, because
Please note that the only thing we can do is to guarantee minimum
bandwidth requirement when there is contention for an IO resource, which
is precisely what a proportional bandwidth scheduler does. Am I missing
something?

--

From: Andrea Righi
Date: Monday, August 11, 2008 - 1:52 pm

The whole subsystem is created at compile time, but controller data
structures are allocated dynamically (i.e. see struct mem_cgroup for
memory controller). So, identifying each device with a name, or a key
like major,minor, instead of a reference/pointer to a struct could help
to handle this in userspace. I mean, if a device is unplugged a
userspace daemon can just handle the event and delete the controller
data structures allocated for this device, asynchronously, via
userspace->kernel interface. And without holding a reference to that
particular block device in the kernel. Anyway, implementing a generic
interface that would allow to define hooks for hot-pluggable devices (or
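
As a rough illustration of keying per-device controller data by
major,minor instead of by a pointer to the block device: the table below is
a userspace sketch with hypothetical names and a fixed size, not the
io-throttle implementation. Removing an entry is what a userspace daemon
would request after an unplug event.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS 8  /* arbitrary table size for the sketch */

/* Hypothetical per-device limit record, keyed by device number. */
struct dev_limit {
	int in_use;
	unsigned int dev_major, dev_minor;
	unsigned long bytes_per_sec;
	unsigned long bytes_done;       /* IO statistics for this device */
};

struct dev_table {
	struct dev_limit slot[MAX_DEVS];
};

/* Look up a device by its key; no reference to the device is held. */
static struct dev_limit *dev_find(struct dev_table *t,
				  unsigned int dev_major, unsigned int dev_minor)
{
	size_t i;

	for (i = 0; i < MAX_DEVS; i++) {
		struct dev_limit *d = &t->slot[i];

		if (d->in_use && d->dev_major == dev_major &&
		    d->dev_minor == dev_minor)
			return d;
	}
	return NULL;
}

/* Create or update the limit for a device; returns NULL if the table is full. */
static struct dev_limit *dev_set_limit(struct dev_table *t,
				       unsigned int dev_major,
				       unsigned int dev_minor,
				       unsigned long bytes_per_sec)
{
	struct dev_limit *d = dev_find(t, dev_major, dev_minor);
	size_t i;

	assert(bytes_per_sec > 0);
	for (i = 0; d == NULL && i < MAX_DEVS; i++)
		if (!t->slot[i].in_use)
			d = &t->slot[i];
	if (d == NULL)
		return NULL;
	if (!d->in_use) {
		d->in_use = 1;
		d->dev_major = dev_major;
		d->dev_minor = dev_minor;
		d->bytes_done = 0;
	}
	d->bytes_per_sec = bytes_per_sec;
	return d;
}

/* Drop limit and statistics, e.g. on an unplug event; returns 1 if removed. */
static int dev_remove(struct dev_table *t,
		      unsigned int dev_major, unsigned int dev_minor)
{
	struct dev_limit *d = dev_find(t, dev_major, dev_minor);

	if (d == NULL)
		return 0;
	d->in_use = 0;
	return 1;
}
```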

Correct. Proportional bandwidth automatically makes it possible to guarantee
minimum requirements (unlike the IO limiting approach, which needs additional
mechanisms to achieve this).

In any case there's no guarantee that a cgroup/application can sustain,
e.g., 10MB/s on a certain device, but this is a hard problem anyway, and
the best we can do is to try to satisfy "soft" constraints.

-Andrea
--

From: Hirokazu Takahashi
Date: Thursday, August 7, 2008 - 11:21 pm

Hi, Fernando,


The current implementation of bio-cgroup is quite basic: a
page is owned by the cgroup that allocated it, which is the same
way the memory controller works. In most cases this is enough and
it helps minimize the overhead.

I think you may want to add some feature to change the owner of a page.
It will be OK if we implement it step by step. I know there will be some
tradeoff between the overhead and the accuracy of page tracking.

We also try to reduce the overhead of the tracking, whose code comes
from the memory controller though. We all should help the memory


I have doubts about the maximum I/O request size problem. You can't avoid
this problem as long as you use device mapper modules in such a bad
manner, even if the controller is implemented as a stand-alone controller.
There is no limitation if you only use dm-ioband without any other device
mapper modules.

And I think the device mapper team just started designing barriers support.
I guess it won't take long. Right, Alasdair?
We should know it is logically impossible to support barriers on some
types of device mapper modules such as LVM. You can't avoid the barrier
problem when you use this kind of multiple devices even if you implement
the controller in the block layer.

But I think a stand-alone implementation would have the merit that it
makes the configuration easier to set up than with dm-ioband.
From this point of view, it would be good if you moved the algorithm
of dm-ioband into the block layer.
On the other hand, we should know it will make it impossible to use
--

From: Ryo Tsuruta
Date: Friday, August 8, 2008 - 12:20 am

The following is a part of source code where the limitation comes from.

dm-table.c: dm_set_device_limits()
        /*
         * Check if merge fn is supported.
         * If not we'll force DM to use PAGE_SIZE or
         * smaller I/O, just to be safe.
         */

        if (q->merge_bvec_fn && !ti->type->merge)
                rs->max_sectors =
                        min_not_zero(rs->max_sectors,
                                     (unsigned int) (PAGE_SIZE >> 9));

As far as I can find, in 2.6.27-rc1-mm1 only some software RAID
drivers and the pktcdvd driver define merge_bvec_fn().
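
To show what the clamp in the excerpt above amounts to numerically, here is
a small userspace sketch. It assumes 4 KiB pages and 512-byte sectors, and
the helper names are made up; only the arithmetic mirrors the excerpt. The
helper treats zero as "no limit set", which is how min_not_zero() is used
there.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */
#define SKETCH_SECTOR_SHIFT 9   /* 512-byte sectors */

/* Minimum of two limits, where 0 means "no limit set". */
static unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/*
 * Mirror of the check in the excerpt: if the underlying queue has a merge
 * function but the target type does not, fall back to page-sized I/O.
 */
static unsigned int clamp_max_sectors(unsigned int max_sectors,
				      int queue_has_merge_bvec_fn,
				      int target_has_merge)
{
	if (queue_has_merge_bvec_fn && !target_has_merge)
		return min_not_zero_u(max_sectors,
				      SKETCH_PAGE_SIZE >> SKETCH_SECTOR_SHIFT);
	return max_sectors;
}
```

With these assumptions PAGE_SIZE >> 9 is 8 sectors, so a 1024-sector limit
drops to 8 sectors (4 KiB) per request on an affected device.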

Thanks,
Ryo Tsuruta
--

From: Fernando Luis
Date: Friday, August 8, 2008 - 1:10 am

Yup, exactly. The implication of this is that we may see a drop in
performance in some RAID configurations.

--

From: Ryo Tsuruta
Date: Friday, August 8, 2008 - 3:05 am

The current device-mapper introduces a bvec merge function for device
mapper devices. IMHO, the limitation goes away once we implement this
in dm-ioband. Am I right, Alasdair?

Thanks,
Ryo Tsuruta
--

From: Hirokazu Takahashi
Date: Friday, August 8, 2008 - 7:31 am

Ryo told me this isn't true anymore. The dm infrastructure introduced
a new feature to support multiple page-sized I/O requests, which was
just merged into the current Linux tree. So you and I don't need to
worry about this stuff anymore.

Ryo said he was going to make dm-ioband support this new feature and

Thanks,
Hirokazu Takahashi.
--
