[RFC][patch 7/11][CFQ-cgroup] Control cfq_data per driver

To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, 'Tomoyoshi Sugawara' <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:22 am

This patchset introduces "Yet Another" I/O bandwidth controlling
subsystem for cgroups based on CFQ (called 2 layer CFQ).

The idea of 2 layer CFQ is to build per-group fairness control on top of the existing CFQ control.
We add a new data structure called CFQ meta-data on top of
cfqd in order to control I/O bandwidth for cgroups.
The CFQ meta-data controls the cfq_datas with a service tree (rb-tree) and
the same algorithm that CFQ uses for synchronous I/O.
An active cfqd controls its cfq queues with its own service tree.
Namely, the CFQ meta-data controls the traditional CFQ data, and
the CFQ data runs conventionally.

           cfqmd     cfqmd     (cfqmd = cfq meta-data)
            |          |
  cfqc  -- cfqd ----- cfqd     (cfqd = cfq data,
            |          |        cfqc = cfq cgroup data)
  cfqc  --[cfqd]----- cfqd
            ↑
     conventional control.


This patchset is against 2.6.25-rc2-mm1.


Last week, we found a patchset from Vasily Tarasov (OpenVZ) that
was posted to LKML.
   [RFC][PATCH 0/9] cgroups: block: cfq: I/O bandwidth controlling subsystem for CGroups based on CFQ
  http://lwn.net/Articles/274652/

Our subsystem and Vasily's are similar in that both modify
the CFQ subsystem, but they differ in the layer at which they are
implemented. Vasily's subsystem adds a new layer for cgroups between
cfqd and cfqq, but our subsystem adds a new layer for cgroups on top
of cfqd.

The differences in implementation from OpenVZ's are:
   * the top layer algorithm is also based on a service tree, and
   * the top layer program is stored in a different file (block/cfq-cgroup.c).

We hope to discuss here not which implementation is better, but what is the
best way to implement I/O bandwidth control based on CFQ.

Please give us your comments, questions and suggestions.



Finally, we describe how to use our implementation.

* Preparation for using 2 layer CFQ

 1. Apply this patchset to kernel 2.6.25-rc2-mm1.

 2. Build kernel with CFQ-CGROUP option.

 3. Restart new kernel...
To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:42 am

This patch controls whether a cfq_data is active or not.
When a cfq_data is not active and an active cfq_queue is inserted into it,
the cfq_data is activated.
When a cfq_data is active and no active cfq_queue exists any more,
the cfq_data is deactivated.

The new cfq optional operations:
The "cfq_add_cfqq_opt_fn" defines a function that runs additional processing
when an active queue is inserted into a cfq_data.

The "cfq_del_cfqq_opt_fn" defines a function that runs additional processing
when an active queue is removed from a cfq_data.
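
As an illustration only (this is not code from the patch), the activation rule described above can be modelled in user-space C. The struct and function names below are hypothetical stand-ins for cfq_meta_data and cfq_data, and assert() plays the role that BUG_ON() plays in kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for cfq_meta_data: counts active cfq_data. */
struct act_meta {
	unsigned int busy_data;
};

/* Hypothetical stand-in for cfq_data. */
struct act_data {
	struct act_meta *meta;
	unsigned int busy_queues;	/* active queues held by this data */
	bool on_rr;			/* models the on_rr state flag */
};

/* An active queue is inserted: activate the data if it was idle. */
static void act_add_queue(struct act_data *d)
{
	d->busy_queues++;
	if (!d->on_rr) {
		d->on_rr = true;
		d->meta->busy_data++;
	}
}

/* An active queue is removed: deactivate the data if none remain. */
static void act_del_queue(struct act_data *d)
{
	assert(d->busy_queues > 0);
	d->busy_queues--;
	if (d->busy_queues == 0) {
		assert(d->on_rr);
		d->on_rr = false;
		assert(d->meta->busy_data > 0);
		d->meta->busy_data--;
	}
}
```

In this sketch the busy counter of the meta-data changes only on the first insertion and on the last removal, which is the behaviour the description asks for.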

   Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index f040c98..46f3635 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -741,6 +741,32 @@ static int cfq_cgroup_active_data_check(struct cfq_data *cfqd)
 	return (cfqd->cfqmd->active_data == cfqd);
 }
 
+static void cfq_cgroup_add_cfqd_rr(struct cfq_data *cfqd)
+{
+	if (!cfq_cfqd_on_rr(cfqd)) {
+		cfq_mark_cfqd_on_rr(cfqd);
+		cfqd->cfqmd->busy_data++;
+
+		cfq_cgroup_service_tree_add(cfqd, 0);
+	}
+}
+
+
+static void cfq_cgroup_del_cfqd_rr(struct cfq_data *cfqd)
+{
+	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb)) {
+		struct cfq_meta_data *cfqdd = cfqd->cfqmd;
+		BUG_ON(!cfq_cfqd_on_rr(cfqd));
+		cfq_clear_cfqd_on_rr(cfqd);
+		if (!RB_EMPTY_NODE(&cfqd->rb_node)) {
+			cfq_rb_erase(&cfqd->rb_node,
+				     &cfqdd->service_tree);
+		}
+		BUG_ON(!cfqdd->busy_data);
+		cfqdd->busy_data--;
+	}	
+}
+
 struct cfq_ops opt = {
 	.cfq_init_queue_fn = __cfq_cgroup_init_queue,
 	.cfq_exit_queue_fn = __cfq_cgroup_exit_data,
@@ -749,4 +775,6 @@ struct cfq_ops opt = {
 	.cfq_completed_request_after_fn = cfq_cgroup_completed_request_after,
 	.cfq_empty_fn = cfq_cgroup_queue_empty,
 	.cfq_active_check_fn = cfq_cgroup_active_data_check,
+	.cfq_add_cfqq_opt_fn =  cfq_cgroup_add_cfqd_rr,
+	.cfq_del_cfqq_opt_fn =  cfq_cgroup_del_cfqd_rr,
 };
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
ind...
To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:41 am

This patch introduces control of cfq_data.
Its algorithm is similar to the one CFQ uses for synchronous I/O.

The new cfq optional operations:
The "cfq_dispatch_requests_fn" defines a function which implements
the request dispatching algorithm.
This becomes the main function for fairness.

The "cfq_completed_request_after_fn" defines a function which wraps up
processing after an I/O completes.

The "cfq_active_check_fn" defines a function which checks whether the selected cfq_data is the active cfq_data.

The "cfq_empty_fn" defines a function which checks whether active data exists.

   Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 568e433..f040c98 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -15,9 +15,35 @@
 #include <linux/cgroup.h>
 #include <linux/cfq-iosched.h>
 
+
 #define CFQ_CGROUP_SLICE_SCALE		(5)
 #define CFQ_CGROUP_MAX_IOPRIO		(8)
 
+static const int cfq_cgroup_slice = HZ / 10;
+
+enum cfqd_state_flags {
+	CFQ_CFQD_FLAG_on_rr = 0,	/* on round-robin busy list */
+	CFQ_CFQD_FLAG_slice_new,	/* no requests dispatched in slice */
+};
+
+#define CFQ_CFQD_FNS(name)						\
+static inline void cfq_mark_cfqd_##name(struct cfq_data *cfqd)		\
+{									\
+	(cfqd)->flags |= (1 << CFQ_CFQD_FLAG_##name);			\
+}									\
+static inline void cfq_clear_cfqd_##name(struct cfq_data *cfqd)	\
+{									\
+	(cfqd)->flags &= ~(1 << CFQ_CFQD_FLAG_##name);			\
+}									\
+static inline int cfq_cfqd_##name(const struct cfq_data *cfqd)		\
+{									\
+	return ((cfqd)->flags & (1 << CFQ_CFQD_FLAG_##name)) != 0;	\
+}
+
+CFQ_CFQD_FNS(on_rr);
+CFQ_CFQD_FNS(slice_new);
+#undef CFQ_CFQD_FNS
+
 static const int cfq_cgroup_slice_idle = HZ / 125;
 
 struct cfq_cgroup {
@@ -45,6 +71,7 @@ static inline struct cfq_cgroup *task_to_cfq_cgroup(struct task_struct *tsk)
  * Add device or cgroup data functions.
  */
 struct cfq_data *__cfq_cgroup_init_queue(struct request_queue...
To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:40 am

This patch makes it possible to select the cfq_data corresponding to a task's group.
This is used for merging, merge checks, queue checks and queue setup.

The new cfq optional operations:
The "cfq_search_data_fn" defines a function that selects the correct cfq_data when
the cfq_queue and requests are not connected yet.
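
The lookup can be pictured with a small user-space sketch, offered as an illustration under stated assumptions rather than as the patch's code. It uses a plain unbalanced binary search tree instead of the kernel rb-tree, keys each node by a pointer value (as the patch keys cfq_data by its cfq_meta_data pointer), and all names (ptr_node, ptr_tree_insert, ptr_tree_search) are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree node keyed by an opaque pointer value. */
struct ptr_node {
	const void *key;
	struct ptr_node *left;
	struct ptr_node *right;
};

/* Insert n below root; smaller keys go left, all others go right. */
static struct ptr_node *ptr_tree_insert(struct ptr_node *root,
					struct ptr_node *n)
{
	struct ptr_node **p = &root;

	while (*p) {
		if ((uintptr_t)n->key < (uintptr_t)(*p)->key)
			p = &(*p)->left;
		else
			p = &(*p)->right;
	}
	n->left = NULL;
	n->right = NULL;
	*p = n;
	return root;
}

/* Return the node whose key equals the given pointer, or NULL. */
static struct ptr_node *ptr_tree_search(struct ptr_node *root,
					const void *key)
{
	struct ptr_node *p = root;
	uintptr_t k = (uintptr_t)key;

	while (p) {
		if (k < (uintptr_t)p->key)
			p = p->left;
		else if (k > (uintptr_t)p->key)
			p = p->right;
		else
			return p;
	}
	return NULL;
}
```

The casts to uintptr_t are a choice of this sketch: ISO C does not define relational comparison between pointers to unrelated objects, so the example compares their integer values instead.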

   Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 4110ab7..568e433 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -35,6 +35,12 @@ static inline struct cfq_cgroup *cgroup_to_cfq_cgroup(struct cgroup *cont)
 			    struct cfq_cgroup, css);
 }
 
+static inline struct cfq_cgroup *task_to_cfq_cgroup(struct task_struct *tsk)
+{
+	return container_of(task_subsys_state(tsk, cfq_cgroup_subsys_id),
+			    struct cfq_cgroup, css);
+}
+
 /*
  * Add device or cgroup data functions.
  */
@@ -392,6 +398,32 @@ struct cgroup_subsys cfq_cgroup_subsys = {
 };
 
 
+struct cfq_data *cfq_cgroup_search_data(void *data,
+					struct task_struct *tsk)
+{
+	struct cfq_data *cfqd = (struct cfq_data *)data;
+	struct cfq_meta_data *cfqmd = cfqd->cfqmd;
+	struct cfq_cgroup *cont = task_to_cfq_cgroup(tsk);
+	struct rb_node *p = cont->sibling_tree.rb_node;
+
+	while (p) {
+		struct cfq_data *__cfqd;
+		__cfqd = rb_entry(p, struct cfq_data, group_node);
+
+
+		if (cfqmd < __cfqd->cfqmd) {
+			p = p->rb_left;
+		} else if (cfqmd > __cfqd->cfqmd) {
+			p = p->rb_right;
+		} else {
+			return __cfqd;
+		}
+	}
+
+	return NULL;
+}
+
+
 struct cfq_ops opt = {
 	.cfq_init_queue_fn = __cfq_cgroup_init_queue,
 	.cfq_exit_queue_fn = __cfq_cgroup_exit_data,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b1757bc..3aa320a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -623,6 +623,9 @@ static int cfq_merge(struct request_queue *q, struct request **req,
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct request *__rq;
 
+	if (opt.cfq_searc...
To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:38 am

This patch expands the cfq cgroup data to handle multiple cfq_data per group.
They are managed with an rb_tree whose key is a pointer to cfq_meta_data.

   Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 55090dc..4110ab7 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -23,6 +23,9 @@ static const int cfq_cgroup_slice_idle = HZ / 125;
 struct cfq_cgroup {
 	struct cgroup_subsys_state css;
 	unsigned int ioprio;
+
+	struct rb_root sibling_tree;
+	unsigned int siblings;
 };
 
 
@@ -35,6 +38,8 @@ static inline struct cfq_cgroup *cgroup_to_cfq_cgroup(struct cgroup *cont)
 /*
  * Add device or cgroup data functions.
  */
+struct cfq_data *__cfq_cgroup_init_queue(struct request_queue *q, void *data);
+
 static struct cfq_meta_data *cfq_cgroup_init_meta_data(struct cfq_data *cfqd, struct request_queue *q)
 {
 	struct cfq_meta_data *cfqmd;
@@ -90,16 +95,75 @@ static void cfq_meta_data_sibling_tree_add(struct cfq_meta_data *cfqmd,
 	cfqmd->siblings++;
 	cfqd->cfqmd = cfqmd;
 }
- 
+
+static void cfq_cgroup_sibling_tree_add(struct cfq_cgroup *cfqc,
+					struct cfq_data *cfqd)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(!RB_EMPTY_NODE(&cfqd->group_node));
+
+	p = &cfqc->sibling_tree.rb_node;
+
+	while (*p) {
+		struct cfq_data *__cfqd;
+		struct rb_node **n;
+
+		parent = *p;
+		__cfqd = rb_entry(parent, struct cfq_data, group_node);
+
+		if (cfqd->cfqmd < __cfqd->cfqmd) {
+			n = &(*p)->rb_left;
+		} else {
+			n = &(*p)->rb_right;
+		}
+		p = n;
+	}
+		
+	rb_link_node(&cfqd->group_node, parent, p);
+	rb_insert_color(&cfqd->group_node, &cfqc->sibling_tree);
+	cfqc->siblings++;
+	cfqd->cfqc = cfqc;
+}
+
+static void *cfq_cgroup_init_cfq_data(struct cfq_cgroup *cfqc, struct cfq_data *cfqd)
+{
+	struct cgroup *child;
+
+	/* setting cfq_data for cfq_cgroup */
+	if (!cfqc) {
+		cfqc = cgroup_to_...
To: Satoshi UCHIDA <s-uchida@...>
Cc: <linux-kernel@...>, <containers@...>, <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Thursday, April 3, 2008 - 11:35 am

Rather than adding get_root_subsys(), can't you just keep a reference
locally to your root subsystem state?

Paul
--
To: 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Friday, April 4, 2008 - 2:20 am

If the cfqc has no specific address, namely cfqc is NULL,
the cfq_cgroup_init_cfq_data function was called because a new device was plugged in.
At this time, the new cfq_data needs to be related to the top cfq_cgroup data.
Probably, the running program at that time will be "insmod".
However, that program may not be in the root group.

Supposing that only the current interface is used,
the top cgroup can be obtained from the current process by

   task_cgroup(current, cfq_subsys_id)->top_cgroup .

Therefore the cfq_cgroup data of the top cgroup is obtained by

   cgroup_to_cfq_cgroup(task_cgroup(current, cfq_subsys_id)->top_cgroup) .


However, it would be bad to use the "current" variable in order to
obtain the cfq_cgroup data of the top cgroup,
because the relationship between "obtaining the cfq_cgroup data of the top cgroup" and
the "running (current) task" is weak.


As you say, a reference to the root subsystem state could be kept locally.
For example, create a variable for the root subsystem state and
store the pointer when the subsystem state is first created.
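
A minimal user-space sketch of that alternative, purely illustrative and not taken from the patches: the names group_state, group_state_create and group_state_root are hypothetical, and calloc() stands in for the kernel allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-group state; parent is NULL for the root group. */
struct group_state {
	struct group_state *parent;
};

/* Locally kept reference to the root state, set on first creation. */
static struct group_state *root_state;

static struct group_state *group_state_create(struct group_state *parent)
{
	struct group_state *gs = calloc(1, sizeof(*gs));

	if (!gs)
		return NULL;
	gs->parent = parent;
	/* Remember the first parentless state as the root. */
	if (!parent && !root_state)
		root_state = gs;
	return gs;
}

/* Return the cached root state without consulting any task. */
static struct group_state *group_state_root(void)
{
	return root_state;
}
```

With such a cached pointer the root state can be found without going through "current".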

However, I think that it is smarter to look up the root group of the subsystem
when its information is needed.
Does the current code have any problem?


Satoshi UCHIDA.

--
To: Satoshi UCHIDA <s-uchida@...>
Cc: <linux-kernel@...>, <containers@...>, <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Friday, April 4, 2008 - 5:00 am

I don't see why it's better to go through the cgroup subsystem to
retrieve the reference.

Paul
--
To: 'Paul Menage' <menage@...>
Cc: <linux-kernel@...>, <containers@...>, <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Friday, April 4, 2008 - 5:46 am

If a program holds a temporary copy of data, it may cause inconsistency.
I think that information should be obtained from its origin (the control mechanism).

Information about cgroups is controlled by the cgroup subsystem (and cgroupfs_root).
So, I think that it's better to query the cgroup subsystem.

For now, the cgroup subsystem does not change the root group.
Therefore, the system would also run correctly with your suggestion.

I will consider this issue.

Thanks,
 Satoshi UCHIDA.


--
To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:37 am

This patch expands cfq_meta_data to handle multiple cfq_data.
They are managed with an rb_tree whose key is a pointer to cfq_data.

   Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 34894d9..55090dc 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -54,9 +54,42 @@ static struct cfq_meta_data *cfq_cgroup_init_meta_data(struct cfq_data *cfqd, st
 	cfqmd->cfq_driv_d.idle_slice_timer.data = (unsigned long) cfqd;
 	cfqmd->cfq_driv_d.cfq_slice_idle = cfq_cgroup_slice_idle;
 
+	cfqmd->sibling_tree = RB_ROOT;
+	cfqmd->siblings = 0;
+
 	return cfqmd;
 }
 
+static void cfq_meta_data_sibling_tree_add(struct cfq_meta_data *cfqmd,
+					     struct cfq_data *cfqd)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(!RB_EMPTY_NODE(&cfqd->sib_node));
+
+	p = &cfqmd->sibling_tree.rb_node;
+
+	while (*p) {
+		struct cfq_data *__cfqd;
+		struct rb_node **n;
+
+		parent = *p;
+		__cfqd = rb_entry(parent, struct cfq_data, sib_node);
+
+		if (cfqd < __cfqd) {
+			n = &(*p)->rb_left;
+		} else {
+			n = &(*p)->rb_right;
+		}
+		p = n;
+	}
+		
+	rb_link_node(&cfqd->sib_node, parent, p);
+	rb_insert_color(&cfqd->sib_node, &cfqmd->sibling_tree);
+	cfqmd->siblings++;
+	cfqd->cfqmd = cfqmd;
+}
  
 struct cfq_data *__cfq_cgroup_init_queue(struct request_queue *q, void *data)
 {
@@ -66,6 +99,8 @@ struct cfq_data *__cfq_cgroup_init_queue(struct request_queue *q, void *data)
 	if (!cfqd)
 		return NULL;
 
+	RB_CLEAR_NODE(&cfqd->sib_node);
+
 	if (!cfqmd) {
        		cfqmd = cfq_cgroup_init_meta_data(cfqd, q);
 		if (!cfqmd) {
@@ -73,6 +108,7 @@ struct cfq_data *__cfq_cgroup_init_queue(struct request_queue *q, void *data)
 			return NULL;
 		}
 	}
+	cfq_meta_data_sibling_tree_add(cfqmd, cfqd);
 
 	return cfqd;
 }
@@ -102,11 +138,35 @@ cfq_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 /...
To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:36 am

This patch introduces CFQ meta data (cfq_meta_data).
This creates a new control data layer over the traditional control data (cfq_data).

The new cfq optional operations:
The "cfq_init_queue_fn" defines a function that runs when a new device is plugged in, namely
when a new I/O queue is created.

The "cfq_exit_queue_fn" defines a function that runs when a device is unplugged, namely
when an I/O queue is removed.

   Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 95663f9..34894d9 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -18,6 +18,8 @@
 #define CFQ_CGROUP_SLICE_SCALE		(5)
 #define CFQ_CGROUP_MAX_IOPRIO		(8)
 
+static const int cfq_cgroup_slice_idle = HZ / 125;
+
 struct cfq_cgroup {
 	struct cgroup_subsys_state css;
 	unsigned int ioprio;
@@ -30,6 +32,52 @@ static inline struct cfq_cgroup *cgroup_to_cfq_cgroup(struct cgroup *cont)
 			    struct cfq_cgroup, css);
 }
 
+/*
+ * Add device or cgroup data functions.
+ */
+static struct cfq_meta_data *cfq_cgroup_init_meta_data(struct cfq_data *cfqd, struct request_queue *q)
+{
+	struct cfq_meta_data *cfqmd;
+	
+	cfqmd = kmalloc_node(sizeof(*cfqmd), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (!cfqmd) {
+		return NULL;
+	}
+	cfqmd->elv_data = cfqd;
+
+	cfqmd->cfq_driv_d.queue = q;
+	INIT_WORK(&cfqmd->cfq_driv_d.unplug_work, cfq_kick_queue);
+	cfqmd->cfq_driv_d.last_end_request = jiffies;
+       
+	init_timer(&cfqmd->cfq_driv_d.idle_slice_timer);
+	cfqmd->cfq_driv_d.idle_slice_timer.function = cfq_idle_slice_timer;
+	cfqmd->cfq_driv_d.idle_slice_timer.data = (unsigned long) cfqd;
+	cfqmd->cfq_driv_d.cfq_slice_idle = cfq_cgroup_slice_idle;
+
+	return cfqmd;
+}
+
+ 
+struct cfq_data *__cfq_cgroup_init_queue(struct request_queue *q, void *data)
+{
+	struct cfq_meta_data *cfqmd = (struct cfq_meta_data *)data;
+	struct cfq_data *cfqd = __cfq_init_cfq_data(q);
+
+	if (!cfqd)
+		return NULL;
+
+	if (!cfqmd) {
+       		...
To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:35 am

This patch creates a framework of optional cfq operations.
This framework defines the specific functions used for extending CFQ.
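
To make the idea concrete, here is a hedged user-space sketch of an optional-operations table: a struct of function pointers in which an unset (NULL) hook means "use the default behaviour". The names sched_ops, run_active_check, default_active_check and only_even_active are hypothetical and do not appear in the patch; only the pattern is intended to match.

```c
#include <stddef.h>

/* Hypothetical table of optional hooks; NULL means "not provided". */
struct sched_ops {
	int (*active_check_fn)(int id);
};

/* Default behaviour of the base scheduler: everything is active. */
static int default_active_check(int id)
{
	(void)id;
	return 1;
}

/* Example extension hook: only even ids count as active. */
static int only_even_active(int id)
{
	return id % 2 == 0;
}

/* Call the hook when an extension installed one, else the default. */
static int run_active_check(const struct sched_ops *ops, int id)
{
	if (ops->active_check_fn)
		return ops->active_check_fn(id);
	return default_active_check(id);
}
```

An empty table, like the empty "struct cfq_ops opt" in the diff below, leaves the base behaviour untouched, so an extension only fills in the hooks it needs.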

    Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index b5303d9..95663f9 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -151,3 +151,7 @@ struct cgroup_subsys cfq_cgroup_subsys = {
 	.subsys_id = cfq_cgroup_subsys_id,
 	.populate = cfq_cgroup_populate,
 };
+
+
+struct cfq_ops opt = {
+};
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aaf5d7e..245c252 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2233,6 +2233,11 @@ static void __exit cfq_exit(void)
 module_init(cfq_init);
 module_exit(cfq_exit);
 
+#ifndef CONFIG_CGROUP_CFQ
+struct cfq_ops opt = {
+};
+#endif
+
 MODULE_AUTHOR("Jens Axboe");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Completely Fair Queueing IO scheduler");
diff --git a/include/linux/cfq-iosched.h b/include/linux/cfq-iosched.h
index 035bfc4..9287da1 100644
--- a/include/linux/cfq-iosched.h
+++ b/include/linux/cfq-iosched.h
@@ -87,4 +87,10 @@ static inline struct request_queue * __cfq_container_of_queue(struct work_struct
 	return cfqd->cfq_driv_d.queue;
 };
 
+struct cfq_ops
+{
+};
+
+extern struct cfq_ops opt;
+
 #endif  /* _LINUX_CFQ_IOSCHED_H */

--
To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:33 am

This patch extracts driver unique data into a new structure (cfq_driver_data)
in order to move it to the top control layer (the cfq_meta_data layer
in the next patch).

The CFQ_DRV_UNIQ_DATA macro selects the control data in the top control layer.
In one layer CFQ, the macro selects the cfq_driver_data in cfq_data.
In two layer CFQ, the macro selects the cfq_driver_data in cfq_meta_data
(in the [6/11] patch).
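
The macro's definition is not visible in the truncated diff, so the following is only a guess at its shape, written as a self-contained user-space sketch. The switch TWO_LAYER_MODEL and every type and function name are hypothetical; only the member name cfq_driv_d is borrowed from the patches.

```c
/* Hypothetical stand-in for cfq_driver_data. */
struct drv_driver_data {
	int rq_in_driver;
};

/* Hypothetical stand-in for cfq_meta_data (two layer case). */
struct drv_meta {
	struct drv_driver_data cfq_driv_d;
};

/* Hypothetical stand-in for cfq_data. */
struct drv_data {
#ifdef TWO_LAYER_MODEL
	struct drv_meta *cfqmd;			/* driver data lives in the meta-data */
#else
	struct drv_driver_data cfq_driv_d;	/* driver data lives in the data itself */
#endif
};

/* One accessor for both layouts, so callers need no #ifdef. */
#ifdef TWO_LAYER_MODEL
#define DRV_UNIQ_DATA_MODEL(d)	((d)->cfqmd->cfq_driv_d)
#else
#define DRV_UNIQ_DATA_MODEL(d)	((d)->cfq_driv_d)
#endif

/* Caller written once against the macro, as in the diff below. */
static void drv_activate_request(struct drv_data *d)
{
	DRV_UNIQ_DATA_MODEL(d).rq_in_driver++;
}
```

Because the macro expands to an lvalue, call sites can read, assign and take the address of members, which is how the diff uses it.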

   Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c1f9da9..aaf5d7e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -177,7 +177,7 @@ static inline int cfq_bio_sync(struct bio *bio)
 static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
 {
 	if (cfqd->busy_queues)
-		kblockd_schedule_work(&cfqd->unplug_work);
+		kblockd_schedule_work(&CFQ_DRV_UNIQ_DATA(cfqd).unplug_work);
 }
 
 static int cfq_queue_empty(struct request_queue *q)
@@ -260,7 +260,7 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 	s1 = rq1->sector;
 	s2 = rq2->sector;
 
-	last = cfqd->last_position;
+	last = CFQ_DRV_UNIQ_DATA(cfqd).last_position;
 
 	/*
 	 * by definition, 1KiB is 2 sectors
@@ -535,7 +535,7 @@ static void cfq_add_rq_rb(struct request *rq)
 	 * if that happens, put the alias on the dispatch list
 	 */
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
-		cfq_dispatch_insert(cfqd->queue, __alias);
+		cfq_dispatch_insert(CFQ_DRV_UNIQ_DATA(cfqd).queue, __alias);
 
 	if (!cfq_cfqq_on_rr(cfqq))
 		cfq_add_cfqq_rr(cfqd, cfqq);
@@ -579,7 +579,7 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
+	CFQ_DRV_UNIQ_DATA(cfqd).rq_in_driver++;
 
 	/*
 	 * If the depth is larger 1, it really could be queueing. But lets
@@ -587,18 +587,18 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 	 * low queueing, and a low qu...
To: <linux-kernel@...>, <containers@...>, <axboe@...>, <menage@...>
Cc: <s-uchida@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Tuesday, April 1, 2008 - 5:32 am

This patch introduces a simple cgroup subsystem.
The new cgroup subsystem is called cfq_cgroup.

   Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..ea07b46 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_AS)	+= as-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_CGROUP_CFQ)	+= cfq-cgroup.o
 
 obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
new file mode 100644
index 0000000..cea2b92
--- /dev/null
+++ b/block/cfq-cgroup.c
@@ -0,0 +1,57 @@
+/*
+ *  CFQ CGROUP disk scheduler.
+ *
+ *     This program is a wrapper program that is
+ *     extend CFQ disk scheduler for handling
+ *     cgroup subsystem. 
+ *
+ *     This program is based on original CFQ code.
+ * 
+ *  Copyright (C) 2008 Satoshi UCHIDA <s-uchida@ap.jp.nec.com>
+ *   and NEC Corp.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/cgroup.h>
+#include <linux/cfq-iosched.h>
+
+struct cfq_cgroup {
+	struct cgroup_subsys_state css;
+};
+
+
+static inline struct cfq_cgroup *cgroup_to_cfq_cgroup(struct cgroup *cont)
+{
+	return container_of(cgroup_subsys_state(cont, cfq_cgroup_subsys_id),
+			    struct cfq_cgroup, css);
+}
+
+static struct cgroup_subsys_state *
+cfq_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct cfq_cgroup *cfqc;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (!cgroup_is_descendant(cont))
+		return ERR_PTR(-EPERM);
+
+	cfqc = kzalloc(sizeof(struct cfq_cgroup), GFP_KERNEL);
+	if (unlikely(!cfqc))
+		return ERR_PTR(-ENOMEM);
+
+	return &cfqc->css;	
+}
+
+static void cfq_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	kfree(cgroup_to_cfq_cgroup(cont));
+}
+
+st...
To: Satoshi UCHIDA <s-uchida@...>
Cc: <linux-kernel@...>, <containers@...>, <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Wednesday, April 2, 2008 - 6:41 pm

What are these checks for? Cgroups already provides filesystem
permissions to control directory creation, and the "descendant" check

To fit with the convention for other subsystems, simply "cfq" would be
a better name than "cfq_cgroup". (Clearly it's a cgroup subsystem from
context).

Is this subsystem meant to allow you to control any device that uses
CFQ, or is it specific to disks? It would be nice to be able to allow
different groups to have different guarantees on different disks.

Paul
--
To: 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>
Date: Thursday, April 3, 2008 - 3:09 am

This patchset changes the name of the subsystem (from "cfq_cgroup" to "cfq")
and the checks in the create function.


To: <s-uchida@...>, <vtaras@...>
Cc: <linux-kernel@...>, <containers@...>, <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>, <devel@...>
Date: Friday, April 25, 2008 - 5:54 am

Hi, 

I report benchmark results of the following I/O bandwidth controllers.

  From:	Vasily Tarasov <vtaras@openvz.org>
  Subject: [RFC][PATCH 0/9] cgroups: block: cfq: I/O bandwidth
           controlling subsystem for CGroups based on CFQ
  Date: Fri, 15 Feb 2008 01:53:34 -0500

  From: "Satoshi UCHIDA" <s-uchida@ap.jp.nec.com>
  Subject: [RFC][v2][patch 0/12][CFQ-cgroup]Yet another I/O bandwidth
           controlling subsystem for CGroups based on CFQ
  Date: Thu, 3 Apr 2008 16:09:12 +0900

The test procedure is as follows:
  o Prepare 3 partitions sdc2, sdc3 and sdc4.
  o Run 100 processes issuing random direct I/O with 4KB data on each
    partition.
  o Run 3 tests:
    #1 issuing read I/O only.
    #2 issuing write I/O only.
    #3 sdc2 and sdc3 are read, sdc4 is write.
  o Count the number of I/Os completed in 60 seconds.

Unfortunately, neither bandwidth controller worked as I expected.
In test #3, the write I/O ate up the bandwidth regardless of the
specified priority level.

                          Vasily's scheduler
               The number of I/Os (percentage to total I/Os)
   ---------------------------------------------------------------------
  | partition     |     sdc2     |     sdc3     |     sdc4     | total  |
  | priority      |  7(highest)  |      4       |  0(lowest)   |  I/Os  |
  |---------------+--------------+--------------+--------------|--------|
  | #1 read       |  3620(35.6%) |  3474(34.2%) |  3065(30.2%) |  10159 |
  | #2 write      | 21985(36.6%) | 19274(32.1%) | 18856(31.4%) |  60115 |
  | #3 read&write |  5571( 7.5%) |  3253( 4.4%) | 64977(88.0%) |  73801 |
   ---------------------------------------------------------------------

                          Satoshi's scheduler
               The number of I/Os (percentage to total I/O)
   ---------------------------------------------------------------------
  | partition     |     sdc2     |     sdc3     |     sdc4     | total  |
  | priority      |...
To: 'Ryo Tsuruta' <ryov@...>, <vtaras@...>
Cc: <linux-kernel@...>, <containers@...>, <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>, <devel@...>
Date: Friday, May 9, 2008 - 6:17 am

Hi, Ryo-San.
Thank you for your test results.

In tests #2 and #3, did you use direct write?
I guess you used non-direct write I/O (using the cache).

Both my controller and Vasily's extend the CFQ I/O scheduler, so both controllers inherit the features of CFQ.

The current CFQ I/O scheduler cannot control non-direct write I/Os.
The main cause is the cache system.
For write I/O that goes through the cache, the bio data is created by special daemon processes such as pdflush or kswapd.
Therefore, many non-direct write I/Os will belong to a single cgroup (perhaps the root cgroup).

We consider that this problem should be resolved by fixing the cache system.
Specifically, I/Os created by collecting cache pages should belong to the I/O context of the task which wrote the data to the cache.
This resolution has a problem:
 * Who is the owner of a cache page?
     The cache is reused by many tasks.
     Therefore, it is difficult to decide the owner.

In test #3, it seems that the system could control I/Os only among the reads (sdc2 and sdc3).
Therefore, your test shows that our controller can control I/O when the above problem is absent.


Meanwhile, I'm very interested in the result of your test #2.
With non-direct write I/O, performance is influenced by task scheduling and by the order in which pages are written out.
Therefore, non-direct write I/O should be fair under the current default task scheduler.
However, your result shows almost fair shares with Vasily's controller, but unfair ones with ours.
I'm just wondering if this is an accidental result or a usual result.


Thanks,

--
To: <s-uchida@...>
Cc: <vtaras@...>, <linux-kernel@...>, <containers@...>, <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>, <devel@...>
Date: Sunday, May 11, 2008 - 11:10 pm

Hi, Uchida-san,

Yes, I did. I used direct write in all tests.
I would appreciate it if you would try a test like I did.

--
Ryo Tsuruta <ryov@valinux.co.jp>
--
To: <s-uchida@...>
Cc: <axboe@...>, <containers@...>, <linux-kernel@...>, <vtaras@...>, <tom-sugawara@...>
Date: Monday, May 12, 2008 - 11:33 am

And I'll retest and report back to you.

--
Ryo Tsuruta <ryov@valinux.co.jp>
--
To: <s-uchida@...>
Cc: <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>
Date: Thursday, May 22, 2008 - 9:04 am

Hi Uchida-san,

I realized that the benchmark results which I posted on Apr 25 had
some problems with the testing environment.

  From: Ryo Tsuruta <ryov@valinux.co.jp>
  Subject: Re: [RFC][v2][patch 0/12][CFQ-cgroup]Yet another I/O
  bandwidth controlling subsystem for CGroups based on CFQ
  Date: Fri, 25 Apr 2008 18:54:44 +0900 (JST)


I answered "Yes," but actually I did not use direct write I/O, because
I ran these tests on Xen-HVM. Xen-HVM backend driver doesn't use direct
I/O for actual disk operations even though guest OS uses direct I/O.

Another problem was that the CPU time was used up during the tests.

So, I retested with the new testing environment and got good results. 
The number of I/Os is proportioned according to the priority levels.

Details of the tests are as follows:

Environment:
  Linux version 2.6.25-rc2-mm1 based.
  CPU0: Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz stepping 06
  CPU1: Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz stepping 06
  Memory: 2063568k/2088576k available (2085k kernel code, 23684k
  reserved, 911k data, 240k init, 1171072k highmem)
  scsi 1:0:0:0: Direct-Access     ATA      WDC WD2500JS-55N 10.0 PQ: 0  ANSI: 5
  sd 1:0:0:0: [sdb] 488397168 512-byte hardware sectors (250059 MB)
  sd 1:0:0:0: [sdb] Write Protect is off
  sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
  sd 1:0:0:0: [sdb] Write cache: disabled, read cache: enabled,
  doesn't support DPO or FUA
  sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 sdb6 sdb7 sdb8 sdb9 sdb10 sdb11
  sdb12 sdb13 sdb14 sdb15 >

Procedures:
  o Prepare 3 partitions sdb5, sdb6 and sdb7.
  o Run 100 processes issuing random direct I/O with 4KB data on each
    partition.
  o Run 3 tests:
    #1 issuing read I/O only.
    #2 issuing write I/O only.
    #3 sdb5 and sdb6 are read, sdb7 is write.
  o Count the number of I/Os completed in 60 seconds.

Results:
                          Vasily's scheduler
               The number of I/Os (percentage to total I/Os)
   ------...
To: 'Ryo Tsuruta' <ryov@...>
Cc: <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>
Date: Thursday, May 22, 2008 - 10:53 pm

Hi, Tsuruta-san,


Where did you build the extended CFQ schedulers?
I guess that the schedulers can control I/Os if they are built on the guest OS,
but not if they are built on Dom0.

Ok.
I'm testing both systems and getting similar results.

Thanks,
Satoshi UCHIDA.


--
To: <s-uchida@...>
Cc: <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>
Date: Sunday, May 25, 2008 - 10:46 pm

Hi Uchida-san,



I'm looking forward to your report.

Thanks,
Ryo Tsuruta
--
To: 'Ryo Tsuruta' <ryov@...>
Cc: <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>
Date: Tuesday, May 27, 2008 - 7:32 am

Here is a report on my tests.

My tests show the following features.

 o The degree of guarantee varies more widely for write I/Os than for read I/Os in each environment.
 o Vasily's scheduler can guarantee I/O control.
    However, the range of its guarantee is narrow
      (in particular, at low priority).
 o Satoshi's scheduler can guarantee I/O control.
   However, the degree of guarantee is too small in the write-only, low-priority case.
 o Write I/Os are faster than read I/Os,
   and the CFQ scheduler controls I/Os by time slice.
   So, at the request level, the achieved degree of guarantee differs from the estimated one depending on the mix of read and write I/Os.

I'll continue testing in various configurations.
I hope to improve the I/O control scheduler through many tests.


Details of the tests are as follows:

Environment:
   Linux version 2.6.25-rc5-mm1 based.
     4 types:
         kernel with Vasily's scheduler
         kernel with Satoshi's scheduler
         Native kernel
         Native kernel, using the ionice command on each process.

   CPU0: Intel(R) Core(TM)2 CPU          6700  @ 2.66GHz stepping 6
   CPU1: Intel(R) Core(TM)2 CPU          6700  @ 2.66GHz stepping 6
   Memory: 4060180k/5242880k available (2653k kernel code, 132264k reserved, 1412k data, 356k init)
   scsi 3:0:0:0: Direct-Access     ATA      WDC WD2500JS-19N 10.0 PQ: 0 ANSI: 5
   sd 3:0:0:0: [sdb] 488282256 512-byte hardware sectors (250001 MB)
   sd 3:0:0:0: [sdb] Write Protect is off
   sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
   sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
   sd 3:0:0:0: [sdb] 488282256 512-byte hardware sectors (250001 MB)
   sd 3:0:0:0: [sdb] Write Protect is off
   sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
   sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
   sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 >


Test 1:
 Procedures:
   o Prepare 200 files of 250MB each on 1 partition, sdb3
   o Create 3 groups with priority 0, 4 and 7.
       Es...
To: <s-uchida@...>
Cc: <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>
Date: Tuesday, June 3, 2008 - 4:15 am

I did a similar test to yours. I increased the number of I/Os
which are issued simultaneously up to 100 per cgroup.

  Procedures:
    o Prepare 300 files of 250MB each on 1 partition, sdb3
    o Create three groups with priority 0, 4 and 7.
    o Run many processes issuing random direct I/O with 4KB data on each
      file in three groups.
          #1 Run  25 processes issuing read I/O only per group.
          #2 Run 100 processes issuing read I/O only per group.
    o Count the number of I/Os completed in 10 minutes.

               The number of I/Os (percentage to total I/O)
     --------------------------------------------------------------
    | group       |  group 1   |  group 2   |  group 3   |  total  |
    | priority    | 0(highest) |     4      |  7(lowest) |  I/Os   |
    |-------------+------------+------------+------------+---------|
    | Estimate    |            |            |            |         |
    | Performance |    61.5%   |    30.8%   |    7.7%    |         |
    |-------------+------------+------------+------------+---------|
    | #1  25procs | 52763(57%) | 30811(33%) |  9575(10%) |  93149  |
    | #2 100procs | 24949(40%) | 21325(34%) | 16508(26%) |  62782  |
     --------------------------------------------------------------

The result of test #1 is close to your estimation, but the result
of test #2 is not; the gap between the estimation and the result
increased.
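
The "Estimate Performance" row is consistent with weighting each group by
(8 - priority), that is 8:4:1 for priorities 0, 4 and 7. The sketch below
reproduces those figures under that assumption. The weight formula is inferred
from the numbers in the table and from CFQ_CGROUP_MAX_IOPRIO being 8 in the
patch set, not taken from the scheduler code, and prio_weight() and
expected_share() are hypothetical helper names.

```c
#include <stddef.h>

#define CFQ_CGROUP_MAX_IOPRIO 8	/* value used in the patch set */

/* Assumed weight: a smaller priority number gets a larger weight. */
static int prio_weight(int ioprio)
{
	return CFQ_CGROUP_MAX_IOPRIO - ioprio;
}

/* Expected share, in percent, of group idx among n competing groups. */
static double expected_share(const int *prios, size_t n, size_t idx)
{
	int total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += prio_weight(prios[i]);
	return 100.0 * prio_weight(prios[idx]) / total;
}
```

With priorities {0, 4, 7} this gives about 61.5%, 30.8% and 7.7%, which
matches the estimate row of the table.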

In addition, I got the following message during test #2. Program
"ioload", our benchmark program, was blocked for more than 120 seconds.
Do you see any problems?

INFO: task ioload:8456 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ioload        D 00000008  2772  8456   8419
       f72eb740 00200082 c34862c0 00000008 c3565170 c35653c0 c2009d80
       00000001
       c1d1bea0 00200046 ffffffff f6ee039c 00000000 00000000 00000000
       c2009d80
       018db000 00000000 f71a6a00 c0604fb6 00000000 f71a6bc8 ...
To: 'Ryo Tsuruta' <ryov@...>
Cc: <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>, <m-takahashi@...>
Date: Thursday, June 26, 2008 - 12:49 am

No.
I tested in environments that run from 1 to 200 processes
per group.

In my test above, the gap between the estimation and the result
increases as the number of processes increases.

In native CFQ with the ionice command, the situation is similar.
This behavior appears when the total number of processes exceeds 200.

I'll continue to investigate this problem.


Thanks,

--
To: Satoshi UCHIDA <s-uchida@...>
Cc: 'Ryo Tsuruta' <ryov@...>, <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>
Date: Friday, May 30, 2008 - 6:37 am

Hi Satoshi,

I'm testing your patch against the latest Linus git and I've got the
following bug. It can be easily reproduced by creating a cgroup, switching
the i/o scheduler from cfq to any other and switching back to cfq again.

-Andrea

BUG: unable to handle kernel paging request at ffffffeb
IP: [<c0212dc6>] cfq_cgroup_sibling_tree_add+0x36/0x90
Oops: 0000 [#1] SMP
Modules linked in: i2c_piix4 ne2k_pci 8390 i2c_core

Pid: 3543, comm: bash Not tainted (2.6.26-rc4 #1)
EIP: 0060:[<c0212dc6>] EFLAGS: 00010286 CPU: 0
EIP is at cfq_cgroup_sibling_tree_add+0x36/0x90
EAX: 00000003 EBX: c7704c90 ECX: ffffffff EDX: c7102180
ESI: c7102240 EDI: c7704c80 EBP: c7afbe94 ESP: c7afbe80
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bash (pid: 3543, ti=c7afa000 task=c7aeda00 task.ti=c7afa000)
Stack: c7704c90 c71022d0 c7ace078 c7704c80 c71020c0 c7afbea8 c021363c
c7ace078
       c7803460 c71020c0 c7afbebc c021360a c7102184 c71020c0 c7102180
c7afbee4
       c02134e0 00000000 c7102184 c72b8ab0 00000001 c7102140 c72b8ab0
c04b8ac0
Call Trace:
 [<c021363c>] ? cfq_cgroup_init_cfq_data+0x7c/0x80
 [<c021360a>] ? cfq_cgroup_init_cfq_data+0x4a/0x80
 [<c02134e0>] ? __cfq_cgroup_init_queue+0x100/0x1e0
 [<c021097b>] ? cfq_init_queue+0xb/0x10
 [<c0204ff8>] ? elevator_init_queue+0x8/0x10
 [<c0205cd0>] ? elv_iosched_store+0x80/0x2b0
 [<c0209379>] ? queue_attr_store+0x49/0x70
 [<c01c488b>] ? sysfs_write_file+0xbb/0x110
 [<c0186276>] ? vfs_write+0x96/0x160
 [<c01c47d0>] ? sysfs_write_file+0x0/0x110
 [<c018696d>] ? sys_write+0x3d/0x70
 [<c0104267>] ? sysenter_past_esp+0x78/0xd1
 =======================
Code: ec 08 8b 82 90 00 00 00 83 e0 fc 89 45 f0 8d 82 90 00 00 00 39 45
f0 75 5f
8d 47 10 89 45 ec 89 c3 31 c0 eb 11 8b 56 7c 8d 41 04 <3b> 51 ec 8d 59
08 0f 43
d8 89 c8 8b 0b 85 c9 75 e9 89 86 90 00
EIP: [<c0212dc6>] cfq_cgroup_sibling_tree_add+0x36/0x90 SS:ESP
0068:c7afbe80
---[ end trace 9701f4859bb53d27 ]---...
To: <righi.andrea@...>
Cc: 'Ryo Tsuruta' <ryov@...>, <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>
Date: Wednesday, June 18, 2008 - 5:48 am

Hi, Andrea.

Thanks for the bug report.
I have fixed this problem.

It is caused by walking the list of child groups incorrectly.
Please apply and test this patch.

If it is OK, I will include this fix when I release the new patch set.


Regards,
 Satoshi Uchida.


diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index f868f4f..64561f5 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -183,7 +184,7 @@ static void *cfq_cgroup_init_cfq_data(struct cfq_cgroup *cfqc, struct cfq_data *
 
 	/* check and create cfq_data for children */
 	if (cfqc->css.cgroup)
-		list_for_each_entry(child, &cfqc->css.cgroup->children, children){
+		list_for_each_entry(child, &cfqc->css.cgroup->children, sibling){
 			cfq_cgroup_init_cfq_data(cgroup_to_cfq_cgroup(child), cfqd);
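
Why the member name matters can be shown with a small user-space sketch.
In an intrusive list, the parent's "children" field is only the list head;
each child is linked through its own "sibling" field, so the iterator must
convert a list node back to its container using the offset of "sibling".
Using the offset of "children" instead yields a pointer that is not the
start of any child. struct group and the helpers below are simplified
stand-ins written for this illustration, not the kernel's definitions.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-in for struct cgroup: only the two list fields matter. */
struct group {
	int id;
	struct list_head sibling;	/* links this group into its parent's list */
	struct list_head children;	/* head of the list of this group's children */
};

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void group_init(struct group *g, int id, struct group *parent)
{
	g->id = id;
	list_init(&g->sibling);
	list_init(&g->children);
	if (parent)
		list_add_tail(&g->sibling, &parent->children);
}

/* Sum the ids of the direct children, converting nodes via "sibling". */
static int sum_child_ids(struct group *parent)
{
	struct list_head *pos;
	int sum = 0;

	for (pos = parent->children.next; pos != &parent->children; pos = pos->next)
		sum += container_of(pos, struct group, sibling)->id;
	return sum;
}
```
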
To: Satoshi UCHIDA <s-uchida@...>
Cc: 'Ryo Tsuruta' <ryov@...>, <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>
Date: Sunday, June 22, 2008 - 1:04 pm

OK, I can confirm the fix resolves the problem.

--
To: Satoshi UCHIDA <s-uchida@...>
Cc: 'Ryo Tsuruta' <ryov@...>, <axboe@...>, <vtaras@...>, <containers@...>, <tom-sugawara@...>, <linux-kernel@...>
Date: Wednesday, June 18, 2008 - 6:33 pm

Thanks Satoshi, I'll test the fix this weekend and I'll let you know.

--
To: Ryo Tsuruta <ryov@...>
Cc: <s-uchida@...>, <vtaras@...>, <axboe@...>, <m-takahashi@...>, <containers@...>, <linux-kernel@...>, <tom-sugawara@...>, <devel@...>
Date: Friday, April 25, 2008 - 5:37 pm

Ryo Tsuruta <ryov@valinux.co.jp> wrote:

Here are a few results. IO is issued in 4k chunks,
using O_DIRECT. Each process issues both reads
and writes. There are 60 such processes in each cgroup (except
where noted). Numbers given show the total count of io requests
(read and write) completed in 60 seconds. All processes use
the same partition, fs is ext3.

Vasily's scheduler:
------------------------------------------------------
| cgroup | s0                 | s1             |total |
|priority|  4                 |  4             |I/Os  |
------------------------------------------------------
|        | 24953              | 24062          | 49015|
|        | 29558(60 processes)| 14639 (30 proc)| 44197|
-------------------------------------------------------
|priority|    0               |  4             |      |
|        | 24221              | 24047          | 48268|
|priority|    1               |  4             |      |
|        | 24897              | 24509          | 49406|
|priority|    2               |  4             |      |
|        | 23295              | 23622          | 46917|
|priority|    0               |  7             |      |
|        | 22301              | 23373          | 45674|
-------------------------------------------------------

Satoshi's scheduler:
-------------------------------------------------------
| cgroup | s0                 | s1             |total |
|priority|  3                 |  3             |I/Os  |
|        | 25175              | 26463          | 51638|
|        | 26944 (60)         | 26698 (30)     | 53642|
-------------------------------------------------------
|priority|   0                |  3             |      |
|        | 60821              | 19846          | 80667|
|priority|   1                |  3             |      |
|        | 50608              | 25994          | 76602|
|priority|   2                |  3             |      |
|        | 32132              | 26641          | 58773|
|priority|   7                |  0 ...
To: <fw@...>
Cc: <s-uchida@...>, <vtaras@...>, <axboe@...>, <m-takahashi@...>, <containers@...>, <linux-kernel@...>, <tom-sugawara@...>, <devel@...>
Date: Monday, April 28, 2008 - 8:44 pm

In my previous results, Satoshi's scheduler has about twice the I/O
count on the write test, but the I/O counts on the read test are
nearly the same.

Here are some more results. The test procedure is as follows:
  o Prepare 3 partitions sdc2, sdc3 and sdc4.
  o Run 50 processes for sdc2, 100 processes for sdc3 and 150 processes
    for sdc4. Apply I/O loads in inverse proportion to the priority level.
  o Each process issues random read/write direct I/O with 4KB data.
  o Count the number of I/Os completed in 60 seconds.

               The number of I/Os (percentage to total I/Os)
   -------------------------------------------------------------------
  | partition       |    sdc2     |    sdc3     |    sdc4     | total |
  | processes       |     50      |    100      |    150      |  I/Os |
  |-----------------+-------------+-------------+-------------+-------|
  | Normal          |  2281(18%)  |  4287(34%)  |  6005(48%)  | 12573 |
  |-----------------+-------------+-------------+-------------+-------|
  | Vasily's sched. |             |             |             |       |
  | cgroup priority |  7(highest) |     4       |  0(lowest)  |       |
  |                 |  3713(24%)  |  6587(42%)  |  5354(34%)  | 15654 |
  |-----------------+-------------+-------------+-------------+-------|
  | Satoshi's sched.|             |             |             |       |
  | cgroup priority |  0(highest) |     4       |  7(lowest)  |       |
  |                 |  5399(42%)  |  6506(50%)  |  1034( 8%)  | 12939 |
   -------------------------------------------------------------------

Satoshi's scheduler suppressed the I/O to the lowest priority
partition better than Vasily's one, but Vasily's scheduler got the
highest total I/Os.

Thanks,
Ryo Tsuruta
--
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:18 am

This patch controls whether cfq_data is active or not.
When cfq_data is not active and an active cfq_queue is inserted into it,
cfq_data is activated.
When cfq_data is active and no active cfq_queue remains,
cfq_data is deactivated.

The new cfq optional operations:
The "cfq_add_cfqq_opt_fn" defines a function that performs additional processing
when an active queue is inserted into cfq_data.

The "cfq_del_cfqq_opt_fn" defines a function that performs additional processing
when an active queue is removed from cfq_data.
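
The activation rule above can be sketched in user space as follows. This is
a simplified model, not the patch itself: struct meta and struct data are
stand-ins for cfq_meta_data and cfq_data, a plain counter replaces the
per-cfq_data service tree, and queue_added() and queue_removed() are
hypothetical names for the two hooks.

```c
#include <assert.h>
#include <stdbool.h>

struct meta {
	unsigned int busy_data;		/* number of active data entries */
};

struct data {
	struct meta *md;
	unsigned int busy_queues;	/* stands in for the service tree */
	bool on_rr;			/* true while this entry is active */
};

/* An active queue was inserted: activate the entry if it was idle. */
static void queue_added(struct data *d)
{
	d->busy_queues++;
	if (!d->on_rr) {
		d->on_rr = true;
		d->md->busy_data++;
	}
}

/* An active queue was removed: deactivate the entry once none remain. */
static void queue_removed(struct data *d)
{
	assert(d->busy_queues > 0);
	if (--d->busy_queues == 0) {
		assert(d->on_rr && d->md->busy_data > 0);
		d->on_rr = false;
		d->md->busy_data--;
	}
}
```

The entry is counted in busy_data exactly once, no matter how many queues
it holds, which mirrors the on_rr check in cfq_cgroup_add_cfqd_rr().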

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-cgroup.c          |   28 ++++++++++++++++++++++++++++
 block/cfq-iosched.c         |    6 ++++++
 include/linux/cfq-iosched.h |    4 ++++
 3 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 27a9a7a..f868f4f 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -741,6 +741,32 @@ static int cfq_cgroup_active_data_check(struct cfq_data *cfqd)
 	return (cfqd->cfqmd->active_data == cfqd);
 }
 
+static void cfq_cgroup_add_cfqd_rr(struct cfq_data *cfqd)
+{
+	if (!cfq_cfqd_on_rr(cfqd)) {
+		cfq_mark_cfqd_on_rr(cfqd);
+		cfqd->cfqmd->busy_data++;
+
+		cfq_cgroup_service_tree_add(cfqd, 0);
+	}
+}
+
+
+static void cfq_cgroup_del_cfqd_rr(struct cfq_data *cfqd)
+{
+	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb)) {
+		struct cfq_meta_data *cfqdd = cfqd->cfqmd;
+		BUG_ON(!cfq_cfqd_on_rr(cfqd));
+		cfq_clear_cfqd_on_rr(cfqd);
+		if (!RB_EMPTY_NODE(&cfqd->rb_node)) {
+			cfq_rb_erase(&cfqd->rb_node,
+				     &cfqdd->service_tree);
+		}
+		BUG_ON(!cfqdd->busy_data);
+		cfqdd->busy_data--;
+	}
+}
+
 struct cfq_ops opt = {
 	.cfq_init_queue_fn = __cfq_cgroup_init_queue,
 	.cfq_exit_queue_fn = __cfq_cgroup_exit_data,
@@ -749,4 +775,6 @@ struct cfq_ops opt = {
 	.cfq_completed_request_after_fn = cfq_cgroup_completed_request_after,
 	.cfq_empty_fn = cfq_cgroup_queue_empty,
 	.cfq_a...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:17 am

This patch introduces control of cfq_data.
Its algorithm is similar to the one CFQ uses for synchronous I/O.

The new cfq optional operations:
The "cfq_dispatch_requests_fn" defines a function that implements the
request dispatching algorithm.
This is the main function for fairness.

The "cfq_completed_request_after_fn" defines a function that finishes up
after an I/O request completes.

The "cfq_active_check_fn" defines a function that checks whether the selected cfq_data is the active cfq_data.

The "cfq_empty_fn" defines a function that checks whether any active data exists.

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-cgroup.c          |  326 ++++++++++++++++++++++++++++++++++++++++++-
 block/cfq-iosched.c         |   89 +++++++++---
 include/linux/cfq-iosched.h |   41 ++++++-
 3 files changed, 434 insertions(+), 22 deletions(-)

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 6a8a219..27a9a7a 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -15,9 +15,35 @@
 #include <linux/cgroup.h>
 #include <linux/cfq-iosched.h>
 
+
 #define CFQ_CGROUP_SLICE_SCALE		(5)
 #define CFQ_CGROUP_MAX_IOPRIO		(8)
 
+static const int cfq_cgroup_slice = HZ / 10;
+
+enum cfqd_state_flags {
+	CFQ_CFQD_FLAG_on_rr = 0,	/* on round-robin busy list */
+	CFQ_CFQD_FLAG_slice_new,	/* no requests dispatched in slice */
+};
+
+#define CFQ_CFQD_FNS(name)						\
+static inline void cfq_mark_cfqd_##name(struct cfq_data *cfqd)		\
+{									\
+	(cfqd)->flags |= (1 << CFQ_CFQD_FLAG_##name);			\
+}									\
+static inline void cfq_clear_cfqd_##name(struct cfq_data *cfqd)	\
+{									\
+	(cfqd)->flags &= ~(1 << CFQ_CFQD_FLAG_##name);			\
+}									\
+static inline int cfq_cfqd_##name(const struct cfq_data *cfqd)		\
+{									\
+	return ((cfqd)->flags & (1 << CFQ_CFQD_FLAG_##name)) != 0;	\
+}
+
+CFQ_CFQD_FNS(on_rr);
+CFQ_CFQD_FNS(slice_new);
+#undef CFQ_CFQD_FNS
+
 static const int cfq_cgroup_slice_idle = HZ...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:16 am

This patch makes it possible to select the cfq_data that corresponds to a task's group.
This is used for merging, merge checks, queue checks and queue setup.

The new cfq optional operations:
The "cfq_search_data_fn" defines a function that selects the correct cfq_data when
the cfq_queue and the request are not connected yet.

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-cgroup.c          |   32 ++++++++++++++++++++++++++++++++
 block/cfq-iosched.c         |   12 ++++++++++++
 include/linux/cfq-iosched.h |    2 ++
 3 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 1ad9d33..6a8a219 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -35,6 +35,12 @@ static inline struct cfq_cgroup *cgroup_to_cfq_cgroup(struct cgroup *cont)
 			    struct cfq_cgroup, css);
 }
 
+static inline struct cfq_cgroup *task_to_cfq_cgroup(struct task_struct *tsk)
+{
+	return container_of(task_subsys_state(tsk, cfq_subsys_id),
+			    struct cfq_cgroup, css);
+}
+
 /*
  * Add device or cgroup data functions.
  */
@@ -392,6 +398,32 @@ struct cgroup_subsys cfq_subsys = {
 };
 
 
+struct cfq_data *cfq_cgroup_search_data(void *data,
+					struct task_struct *tsk)
+{
+	struct cfq_data *cfqd = (struct cfq_data *)data;
+	struct cfq_meta_data *cfqmd = cfqd->cfqmd;
+	struct cfq_cgroup *cont = task_to_cfq_cgroup(tsk);
+	struct rb_node *p = cont->sibling_tree.rb_node;
+
+	while (p) {
+		struct cfq_data *__cfqd;
+		__cfqd = rb_entry(p, struct cfq_data, group_node);
+
+
+		if (cfqmd < __cfqd->cfqmd) {
+			p = p->rb_left;
+		} else if (cfqmd > __cfqd->cfqmd) {
+			p = p->rb_right;
+		} else {
+			return __cfqd;
+		}
+	}
+
+	return NULL;
+}
+
+
 struct cfq_ops opt = {
 	.cfq_init_queue_fn = __cfq_cgroup_init_queue,
 	.cfq_exit_queue_fn = __cfq_cgroup_exit_data,
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index b1757bc..3aa320a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:16 am

This patch extends the cfq data to handle multiple cfq_data in a group.
They are managed with an rb_tree whose key is the pointer to cfq_meta_data.
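
As an illustration of a tree keyed by a pointer value, here is a minimal
user-space sketch. It uses a plain unbalanced binary search tree in place
of the kernel rb_tree (so there is no rebalancing step), and struct node,
tree_insert() and tree_lookup() are hypothetical names chosen for this
example. The key plays the role of the cfq_meta_data pointer and the value
plays the role of the cfq_data found for it.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
	struct node *left, *right;
	const void *key;	/* address used as the ordering key */
	int value;		/* payload associated with the key */
};

/* Insert n into the tree rooted at *root, ordered by key address. */
static void tree_insert(struct node **root, struct node *n)
{
	struct node **p = root;

	n->left = NULL;
	n->right = NULL;
	while (*p) {
		if ((uintptr_t)n->key < (uintptr_t)(*p)->key)
			p = &(*p)->left;
		else
			p = &(*p)->right;
	}
	*p = n;
}

/* Return the node whose key equals the given address, or NULL. */
static struct node *tree_lookup(struct node *root, const void *key)
{
	struct node *p = root;

	while (p) {
		if ((uintptr_t)key < (uintptr_t)p->key)
			p = p->left;
		else if ((uintptr_t)key > (uintptr_t)p->key)
			p = p->right;
		else
			return p;
	}
	return NULL;
}
```

The addresses are compared as integers through uintptr_t, because ordering
pointers to unrelated objects with < is not defined by the C standard.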

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-cgroup.c          |  121 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/cfq-iosched.h |    6 ++
 include/linux/cgroup.h      |    1 +
 kernel/cgroup.c             |    6 ++
 4 files changed, 133 insertions(+), 1 deletions(-)

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index ba0f3db..1ad9d33 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -23,6 +23,9 @@ static const int cfq_cgroup_slice_idle = HZ / 125;
 struct cfq_cgroup {
 	struct cgroup_subsys_state css;
 	unsigned int ioprio;
+
+	struct rb_root sibling_tree;
+	unsigned int siblings;
 };
 
 
@@ -35,6 +38,8 @@ static inline struct cfq_cgroup *cgroup_to_cfq_cgroup(struct cgroup *cont)
 /*
  * Add device or cgroup data functions.
  */
+struct cfq_data *__cfq_cgroup_init_queue(struct request_queue *q, void *data);
+
 static struct cfq_meta_data *cfq_cgroup_init_meta_data(struct cfq_data *cfqd, struct request_queue *q)
 {
 	struct cfq_meta_data *cfqmd;
@@ -90,16 +95,75 @@ static void cfq_meta_data_sibling_tree_add(struct cfq_meta_data *cfqmd,
 	cfqmd->siblings++;
 	cfqd->cfqmd = cfqmd;
 }
- 
+
+static void cfq_cgroup_sibling_tree_add(struct cfq_cgroup *cfqc,
+					struct cfq_data *cfqd)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(!RB_EMPTY_NODE(&cfqd->group_node));
+
+	p = &cfqc->sibling_tree.rb_node;
+
+	while (*p) {
+		struct cfq_data *__cfqd;
+		struct rb_node **n;
+
+		parent = *p;
+		__cfqd = rb_entry(parent, struct cfq_data, group_node);
+
+		if (cfqd->cfqmd < __cfqd->cfqmd) {
+			n = &(*p)->rb_left;
+		} else {
+			n = &(*p)->rb_right;
+		}
+		p = n;
+	}
+		
+	rb_link_node(&cfqd->group_node, parent, p);
+	rb_insert_color(&cfqd->group_node, &cfqc->siblin...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:15 am

This patch extends cfq_meta_data to handle multiple cfq_data.
They are managed with an rb_tree whose key is the pointer to cfq_data.

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-cgroup.c          |   62 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/cfq-iosched.h |    7 +++++
 2 files changed, 68 insertions(+), 1 deletions(-)

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 68336b1..ba0f3db 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -54,9 +54,42 @@ static struct cfq_meta_data *cfq_cgroup_init_meta_data(struct cfq_data *cfqd, st
 	cfqmd->cfq_driv_d.idle_slice_timer.data = (unsigned long) cfqd;
 	cfqmd->cfq_driv_d.cfq_slice_idle = cfq_cgroup_slice_idle;
 
+	cfqmd->sibling_tree = RB_ROOT;
+	cfqmd->siblings = 0;
+
 	return cfqmd;
 }
 
+static void cfq_meta_data_sibling_tree_add(struct cfq_meta_data *cfqmd,
+					     struct cfq_data *cfqd)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+
+	BUG_ON(!RB_EMPTY_NODE(&cfqd->sib_node));
+
+	p = &cfqmd->sibling_tree.rb_node;
+
+	while (*p) {
+		struct cfq_data *__cfqd;
+		struct rb_node **n;
+
+		parent = *p;
+		__cfqd = rb_entry(parent, struct cfq_data, sib_node);
+
+		if (cfqd < __cfqd) {
+			n = &(*p)->rb_left;
+		} else {
+			n = &(*p)->rb_right;
+		}
+		p = n;
+	}
+		
+	rb_link_node(&cfqd->sib_node, parent, p);
+	rb_insert_color(&cfqd->sib_node, &cfqmd->sibling_tree);
+	cfqmd->siblings++;
+	cfqd->cfqmd = cfqmd;
+}
  
 struct cfq_data *__cfq_cgroup_init_queue(struct request_queue *q, void *data)
 {
@@ -66,6 +99,8 @@ struct cfq_data *__cfq_cgroup_init_queue(struct request_queue *q, void *data)
 	if (!cfqd)
 		return NULL;
 
+	RB_CLEAR_NODE(&cfqd->sib_node);
+
 	if (!cfqmd) {
        		cfqmd = cfq_cgroup_init_meta_data(cfqd, q);
 		if (!cfqmd) {
@@ -73,6 +108,7 @@ struct cfq_data *__cfq_cgroup_init_queue(struct request_queue *q, void *data)
 			...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:15 am

This patch introduces CFQ meta data (cfq_meta_data).
This creates a new control data layer on top of the traditional control data (cfq_data).

The new cfq optional operations:
The "cfq_init_queue_fn" defines a function that runs when a new device is plugged in, namely
when a new I/O queue is created.

The "cfq_exit_queue_fn" defines a function that runs when a device is unplugged, namely
when an I/O queue is removed.

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-cgroup.c          |   62 +++++++++++++++++++++++++++++++++++++++
 block/cfq-iosched.c         |   67 ++++++++++++++++++++++++++++++++-----------
 include/linux/cfq-iosched.h |   42 ++++++++++++++++++++++++++-
 3 files changed, 153 insertions(+), 18 deletions(-)

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index bcb55c8..68336b1 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -18,6 +18,8 @@
 #define CFQ_CGROUP_SLICE_SCALE		(5)
 #define CFQ_CGROUP_MAX_IOPRIO		(8)
 
+static const int cfq_cgroup_slice_idle = HZ / 125;
+
 struct cfq_cgroup {
 	struct cgroup_subsys_state css;
 	unsigned int ioprio;
@@ -30,6 +32,52 @@ static inline struct cfq_cgroup *cgroup_to_cfq_cgroup(struct cgroup *cont)
 			    struct cfq_cgroup, css);
 }
 
+/*
+ * Add device or cgroup data functions.
+ */
+static struct cfq_meta_data *cfq_cgroup_init_meta_data(struct cfq_data *cfqd, struct request_queue *q)
+{
+	struct cfq_meta_data *cfqmd;
+	
+	cfqmd = kmalloc_node(sizeof(*cfqmd), GFP_KERNEL | __GFP_ZERO, q->node);
+	if (!cfqmd) {
+		return NULL;
+	}
+	cfqmd->elv_data = cfqd;
+
+	cfqmd->cfq_driv_d.queue = q;
+	INIT_WORK(&cfqmd->cfq_driv_d.unplug_work, cfq_kick_queue);
+	cfqmd->cfq_driv_d.last_end_request = jiffies;
+
+	init_timer(&cfqmd->cfq_driv_d.idle_slice_timer);
+	cfqmd->cfq_driv_d.idle_slice_timer.function = cfq_idle_slice_timer;
+	cfqmd->cfq_driv_d.idle_slice_timer.data = (unsigned long) cfqd;
+	cfqmd->cfq_driv_d.cfq_slice_idle = cfq_cgroup_slice_idle;
+
+	r...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:14 am

This patch creates a framework of optional cfq operations.
This framework defines specific functions for extending CFQ.
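
One common way to use such a table of optional operations is sketched
below. How cfq-iosched.c actually consults "opt" is not shown in this
excerpt, so the NULL check and fallback here are an assumption, and
struct sched_ops, sched_empty(), default_empty() and never_empty() are
hypothetical names for illustration only.

```c
#include <stddef.h>

/* Table of optional hooks; a NULL entry means "use the default". */
struct sched_ops {
	int (*empty_fn)(void *data);
};

/* Default behaviour: data points to a count of pending requests. */
static int default_empty(void *data)
{
	return *(int *)data == 0;
}

/* Example hook an extension might register: always report "not empty". */
static int never_empty(void *data)
{
	(void)data;
	return 0;
}

/* Call the hook if one is registered, otherwise fall back to the default. */
static int sched_empty(const struct sched_ops *ops, void *data)
{
	if (ops->empty_fn)
		return ops->empty_fn(data);
	return default_empty(data);
}
```

An empty table, like the "struct cfq_ops opt = { };" in this patch, leaves
every hook unset, so the core keeps its original behaviour.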

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-cgroup.c          |    4 ++++
 block/cfq-iosched.c         |    5 +++++
 include/linux/cfq-iosched.h |    6 ++++++
 3 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index 378a23d..bcb55c8 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -151,3 +151,7 @@ struct cgroup_subsys cfq_subsys = {
 	.subsys_id = cfq_subsys_id,
 	.populate = cfq_cgroup_populate,
 };
+
+
+struct cfq_ops opt = {
+};
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aaf5d7e..245c252 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2233,6 +2233,11 @@ static void __exit cfq_exit(void)
 module_init(cfq_init);
 module_exit(cfq_exit);
 
+#ifndef CONFIG_CGROUP_CFQ
+struct cfq_ops opt = {
+};
+#endif
+
 MODULE_AUTHOR("Jens Axboe");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("Completely Fair Queueing IO scheduler");
diff --git a/include/linux/cfq-iosched.h b/include/linux/cfq-iosched.h
index 035bfc4..9287da1 100644
--- a/include/linux/cfq-iosched.h
+++ b/include/linux/cfq-iosched.h
@@ -87,4 +87,10 @@ static inline struct request_queue * __cfq_container_of_queue(struct work_struct
 	return cfqd->cfq_driv_d.queue;
 };
 
+struct cfq_ops
+{
+};
+
+extern struct cfq_ops opt;
+
 #endif  /* _LINUX_CFQ_IOSCHED_H */
-- 
1.5.4.1


--
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:14 am

This patch extracts driver-unique data into a new structure (cfq_driver_data)
so that it can be moved to the top control layer (the cfq_meta_data layer
in the next patch).

The CFQ_DRV_UNIQ_DATA macro selects the control data in the top control layer.
In one-layer CFQ, the macro selects the cfq_driver_data in cfq_data.
In two-layer CFQ, the macro selects the cfq_driver_data in cfq_meta_data
(in the [7/12] patch).
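
The idea of such a selector macro can be sketched as follows. The real
definition of CFQ_DRV_UNIQ_DATA is not included in this excerpt, so the
macro body, the TWO_LAYER switch and all structure names below are
assumptions made for illustration.

```c
struct driver_data {
	int rq_in_driver;
};

struct meta_data {
	struct driver_data drv;		/* shared by all entries of one device */
};

struct sched_data {
	struct meta_data *md;		/* used in the two-layer configuration */
	struct driver_data drv;		/* used in the one-layer configuration */
};

#ifndef TWO_LAYER
#define TWO_LAYER 1			/* hypothetical compile-time switch */
#endif

#if TWO_LAYER
#define DRV_DATA(sd) ((sd)->md->drv)
#else
#define DRV_DATA(sd) ((sd)->drv)
#endif

/* Callers are written once and work in either configuration. */
static void activate_request(struct sched_data *sd)
{
	DRV_DATA(sd).rq_in_driver++;
}
```

With the two-layer setting, several sched_data entries that point to the
same meta_data update a single shared counter.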

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-iosched.c         |  138 +++++++++++++++++++++----------------------
 include/linux/cfq-iosched.h |   48 +++++++++++----
 2 files changed, 102 insertions(+), 84 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c1f9da9..aaf5d7e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -177,7 +177,7 @@ static inline int cfq_bio_sync(struct bio *bio)
 static inline void cfq_schedule_dispatch(struct cfq_data *cfqd)
 {
 	if (cfqd-&gt;busy_queues)
-		kblockd_schedule_work(&cfqd->unplug_work);
+		kblockd_schedule_work(&CFQ_DRV_UNIQ_DATA(cfqd).unplug_work);
 }
 
 static int cfq_queue_empty(struct request_queue *q)
@@ -260,7 +260,7 @@ cfq_choose_req(struct cfq_data *cfqd, struct request *rq1, struct request *rq2)
 	s1 = rq1-&gt;sector;
 	s2 = rq2-&gt;sector;
 
-	last = cfqd->last_position;
+	last = CFQ_DRV_UNIQ_DATA(cfqd).last_position;
 
 	/*
 	 * by definition, 1KiB is 2 sectors
@@ -535,7 +535,7 @@ static void cfq_add_rq_rb(struct request *rq)
 	 * if that happens, put the alias on the dispatch list
 	 */
 	while ((__alias = elv_rb_add(&cfqq->sort_list, rq)) != NULL)
-		cfq_dispatch_insert(cfqd->queue, __alias);
+		cfq_dispatch_insert(CFQ_DRV_UNIQ_DATA(cfqd).queue, __alias);
 
 	if (!cfq_cfqq_on_rr(cfqq))
 		cfq_add_cfqq_rr(cfqd, cfqq);
@@ -579,7 +579,7 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
+	CFQ_DRV_UNIQ_DATA(cfqd).rq_in_driver++;
 
 	/*
 	 *...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:13 am

This patch adds an "ioprio" entry to cfq_cgroup.
The "ioprio" entry shows the I/O priority of the group.
To change the priority, write to this entry.
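
The write handler is truncated in this excerpt, so the exact validation it
performs is not visible. A minimal user-space sketch of parsing and
range-checking a written priority, assuming valid values are 0 up to
CFQ_CGROUP_MAX_IOPRIO - 1, could look like this; parse_ioprio() is a
hypothetical helper, not a function from the patch.

```c
#include <errno.h>
#include <stdlib.h>

#define CFQ_CGROUP_MAX_IOPRIO 8	/* value used in the patch set */

/*
 * Parse a decimal priority from buf, allowing one trailing newline.
 * Returns 0 and stores the value on success, -EINVAL otherwise.
 */
static int parse_ioprio(const char *buf, unsigned int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0' || v < 0 || v >= CFQ_CGROUP_MAX_IOPRIO)
		return -EINVAL;
	*out = (unsigned int)v;
	return 0;
}
```

The stored value is left untouched when the input is rejected, so a bad
write cannot change the group's current priority.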

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-cgroup.c |   96 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 96 insertions(+), 0 deletions(-)

diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
index de00a0d..378a23d 100644
--- a/block/cfq-cgroup.c
+++ b/block/cfq-cgroup.c
@@ -15,8 +15,12 @@
 #include <linux/cgroup.h>
 #include <linux/cfq-iosched.h>
 
+#define CFQ_CGROUP_SLICE_SCALE		(5)
+#define CFQ_CGROUP_MAX_IOPRIO		(8)
+
 struct cfq_cgroup {
 	struct cgroup_subsys_state css;
+	unsigned int ioprio;
 };
 
 
@@ -41,6 +45,8 @@ cfq_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	if (unlikely(!cfqc))
 		return ERR_PTR(-ENOMEM);
 
+	cfqc->ioprio = 3;
+
 	return &cfqc->css;	
 }
 
@@ -49,9 +55,99 @@ static void cfq_cgroup_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
 	kfree(cgroup_to_cfq_cgroup(cont));
 }
 
+static ssize_t cfq_cgroup_read(struct cgroup *cont, struct cftype *cft,
+			       struct file *file, char __user *userbuf,
+			       size_t nbytes, loff_t *ppos)
+{
+	struct cfq_cgroup *cfqc;
+	char *page;
+	ssize_t ret;
+
+	page = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!page)
+		return -ENOMEM;
+
+	cgroup_lock();
+	if (cgroup_is_removed(cont)) {
+		cgroup_unlock();
+		ret = -ENODEV;
+		goto out;
+	}
+
+	cfqc = cgroup_to_cfq_cgroup(cont);
+
+	cgroup_unlock();
+
+	/* print priority */
+	ret = sprintf(page, " priority: %d \n", cfqc->ioprio);
+
+	ret = simple_read_from_buffer(userbuf, nbytes, ppos, page, ret);
+
+out:
+	free_page((unsigned long)page);
+	return ret;
+}
+
+static ssize_t cfq_cgroup_write(struct cgroup *cont, struct cftype *cft,
+				struct file *file, const char __user *userbuf,
+				size_t nbytes, loff_t *ppos)
+{
+	struct cfq_cgroup *cfqc;
+	ssize_t ret;
+	int new...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:12 am

This patch introduces a simple cgroup subsystem.
The new cgroup subsystem is called cfq_cgroup.
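
The subsystem keeps its per-group state in a structure that embeds a struct cgroup_subsys_state, and recovers the outer structure from the embedded member with container_of(). A minimal user-space sketch of that pattern follows; the macro is a simplified stand-in for the kernel's, and the types css_stub and group_stub and the helper css_to_group() are hypothetical, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct cgroup_subsys_state. */
struct css_stub {
	int refcnt;
};

/* Hypothetical stand-in for struct cfq_cgroup: generic state is embedded. */
struct group_stub {
	unsigned int ioprio;
	struct css_stub css;
};

/* Recover the enclosing group from a pointer to its embedded css member. */
static struct group_stub *css_to_group(struct css_stub *css)
{
	assert(css != NULL);
	return container_of(css, struct group_stub, css);
}
```

The generic cgroup code only ever handles the embedded member, so this conversion is what lets the subsystem attach its own fields to each group without a separate lookup table.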

   Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/Makefile                |    1 +
 block/cfq-cgroup.c            |   57 +++++++++++++++++++++++++++++++++++++++++
 include/linux/cgroup_subsys.h |    6 ++++
 3 files changed, 64 insertions(+), 0 deletions(-)
 create mode 100644 block/cfq-cgroup.c

diff --git a/block/Makefile b/block/Makefile
index 5a43c7d..ea07b46 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_IOSCHED_NOOP)	+= noop-iosched.o
 obj-$(CONFIG_IOSCHED_AS)	+= as-iosched.o
 obj-$(CONFIG_IOSCHED_DEADLINE)	+= deadline-iosched.o
 obj-$(CONFIG_IOSCHED_CFQ)	+= cfq-iosched.o
+obj-$(CONFIG_CGROUP_CFQ)	+= cfq-cgroup.o
 
 obj-$(CONFIG_BLK_DEV_IO_TRACE)	+= blktrace.o
 obj-$(CONFIG_BLOCK_COMPAT)	+= compat_ioctl.o
diff --git a/block/cfq-cgroup.c b/block/cfq-cgroup.c
new file mode 100644
index 0000000..de00a0d
--- /dev/null
+++ b/block/cfq-cgroup.c
@@ -0,0 +1,57 @@
+/*
+ *  CFQ CGROUP disk scheduler.
+ *
+ *     This program is a wrapper that extends the
+ *     CFQ disk scheduler to handle the
+ *     cgroup subsystem.
+ *
+ *     This program is based on original CFQ code.
+ * 
+ *  Copyright (C) 2008 Satoshi UCHIDA <s-uchida@ap.jp.nec.com>
+ *   and NEC Corp.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/cgroup.h>
+#include <linux/cfq-iosched.h>
+
+struct cfq_cgroup {
+	struct cgroup_subsys_state css;
+};
+
+
+static inline struct cfq_cgroup *cgroup_to_cfq_cgroup(struct cgroup *cont)
+{
+	return container_of(cgroup_subsys_state(cont, cfq_subsys_id),
+			    struct cfq_cgroup, css);
+}
+
+static struct cgroup_subsys_state *
+cfq_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+	struct cfq_cgroup *cfqc;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (!cgroup_is_descendant(cont))
+		return ERR_PTR(-EPERM);
+
+	cfqc = kzalloc(sizeof(s...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:12 am

This patch moves some data structures into a header file
(include/linux/cfq-iosched.h).
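
One of the moved structures, cfq_rb_root, pairs a tree root with a cached pointer to the leftmost node so that min extraction does not have to walk down the tree. The sketch below shows the same idea on a plain unbalanced binary search tree instead of the kernel's red-black tree; the names node, cached_root, cached_insert() and cached_first() are hypothetical and only illustrate the caching rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type; the kernel uses struct rb_node instead. */
struct node {
	unsigned long key;
	struct node *left, *right;
};

/* Analogue of struct cfq_rb_root: tree root plus cached leftmost node. */
struct cached_root {
	struct node *root;
	struct node *leftmost;
};

/* Insert into a plain BST and update the cache during the descent. */
static void cached_insert(struct cached_root *tree, struct node *n)
{
	struct node **link;
	int is_leftmost = 1;

	assert(tree != NULL && n != NULL);
	n->left = n->right = NULL;
	link = &tree->root;
	while (*link) {
		if (n->key < (*link)->key) {
			link = &(*link)->left;
		} else {
			link = &(*link)->right;
			is_leftmost = 0;	/* went right once: not the minimum */
		}
	}
	*link = n;
	if (is_leftmost)
		tree->leftmost = n;
}

/* Minimum lookup is a single pointer read, with no walk down the tree. */
static struct node *cached_first(const struct cached_root *tree)
{
	return tree->leftmost;
}
```

Removal is left out of the sketch; a full version would also have to advance the cached pointer to the in-order successor when the leftmost node is erased.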

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/cfq-iosched.c         |   60 +------------------------------------
 include/linux/cfq-iosched.h |   70 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+), 59 deletions(-)
 create mode 100644 include/linux/cfq-iosched.h

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 0f962ec..c1f9da9 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -11,6 +11,7 @@
 #include <linux/elevator.h>
 #include <linux/rbtree.h>
 #include <linux/ioprio.h>
+#include <linux/cfq-iosched.h>
 
 /*
  * tunables
@@ -58,65 +59,6 @@ static struct completion *ioc_gone;
 
 #define sample_valid(samples)	((samples) > 80)
 
-/*
- * Most of our rbtree usage is for sorting with min extraction, so
- * if we cache the leftmost node we don't have to walk down the tree
- * to find it. Idea borrowed from Ingo Molnars CFS scheduler. We should
- * move this into the elevator for the rq sorting as well.
- */
-struct cfq_rb_root {
-	struct rb_root rb;
-	struct rb_node *left;
-};
-#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, }
-
-/*
- * Per block device queue structure
- */
-struct cfq_data {
-	struct request_queue *queue;
-
-	/*
-	 * rr list of queues with requests and the count of them
-	 */
-	struct cfq_rb_root service_tree;
-	unsigned int busy_queues;
-
-	int rq_in_driver;
-	int sync_flight;
-	int hw_tag;
-
-	/*
-	 * idle window management
-	 */
-	struct timer_list idle_slice_timer;
-	struct work_struct unplug_work;
-
-	struct cfq_queue *active_queue;
-	struct cfq_io_context *active_cic;
-
-	/*
-	 * async queue for each priority case
-	 */
-	struct cfq_queue *async_cfqq[2][IOPRIO_BE_NR];
-	struct cfq_queue *async_idle_cfqq;
-
-	sector_t last_position;
-	unsigned long last_end_request;
-
-	/*
-	 * tunables, see top of file
-	 */
-	unsig...
To: 'Satoshi UCHIDA' <s-uchida@...>, 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>
Date: Thursday, April 3, 2008 - 3:11 am

This patch adds a configuration entry to block/Kconfig.iosched.

      Signed-off-by: Satoshi UCHIDA <uchida@ap.jp.nec.com>

---
 block/Kconfig.iosched |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 96a01b3..25fa6bb 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -25,6 +25,15 @@ config IOSCHED_CFQ
 
 	  If unsure, say Y.
 
+config CGROUP_CFQ
+	bool "handling cgroup in CFQ"
+	default n
+	depends on IOSCHED_CFQ && CGROUPS
+	---help---
+          This option extends CFQ to handle cgroups.
+          CFQ becomes a two-layer control:
+          a per-cgroup layer and a per-task layer.
+	
 config IOSCHED_AS
 	tristate "Anticipatory I/O scheduler"
 	default y
-- 
1.5.4.1


--
To: 'Paul Menage' <menage@...>, <linux-kernel@...>, <containers@...>
Cc: <axboe@...>, <tom-sugawara@...>, <m-takahashi@...>