Stephane posted a patchset to add perf_cgroup subsystem, so perf can be used to monitor all threads belonging to a cgroup. But if you already mounted a cgroup hierarchy but without perf_cgroup and the hierarchy has sub-cgroups, you can't bind perf_cgroup to it, and thus you're not able to use per-cgroup perf feature. This patchset alleviates the pain, and then a subsytem can be bound/unbound to/from a hierarchy which has sub-cgroups in it. Some subsystems still can't take advantage of this patchset, memcgroup and cpuset in specific. For cpuset, if a hierarchy has a sub-cgroup and the cgroup has tasks, we can't decide sub-cgroup's cpuset.mems and cpuset.cpus automatically if we try to bind cpuset to this hierarchy. For memcgroup, memcgroup uses css_get/put(), and due to some complexity, for now bindable subsystems should not use css_get/put(). Usage: # mount -t cgroup -o cpuset xxx /mnt # mkdir /mnt/tmp # echo $$ > /mnt/tmp/tasks (add cpuacct to the hierarchy) # mount -o remount,cpuset,cpuacct xxx /mnt (remove it from the hierarchy) # mount -o remount,cpuset xxx /mnt There's another limitation, cpuacct should not be bound to any mounted hierarchy before the above operation. But that's not a problem, as you can remove it from a hierarchy and bind it to another one. --- Documentation/cgroups/cgroups.txt | 26 ++- include/linux/cgroup.h | 25 +++- kernel/cgroup.c | 305 +++++++++++++++++++++++++++++++------ kernel/cgroup_freezer.c | 19 ++- kernel/sched.c | 1 + net/sched/cls_cgroup.c | 1 + security/device_cgroup.c | 1 + 7 files changed, 311 insertions(+), 67 deletions(-) --
On x86_32, sizeof(struct cgroup_subsys) shrinks from 276 bytes
to 264.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
---
include/linux/cgroup.h | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed4ba11..e23ded6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -481,14 +481,16 @@ struct cgroup_subsys {
void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
int subsys_id;
- int active;
- int disabled;
- int early_init;
+
+ unsigned int active:1;
+ unsigned int disabled:1;
+ unsigned int early_init:1;
/*
* True if this subsys uses ID. ID is not available before cgroup_init()
* (not available in early_init time.)
*/
- bool use_id;
+ unsigned int use_id:1;
+
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;
--
1.7.0.1
--
Acked-by: Paul Menage <menage@google.com> Maybe use "bool" here? --
(Sorry for the delayed reply) Do you mean: bool active; bool disabled; ... ? With alignment 5-8 bool values == 8 bytes in 64-bit machine, compared to --
I meant specifying it as: bool active:1; bool disabled:1; i.e. keeping the bit-sized flags but also keeping the bool semantics. Having said that, I'm not really sure why saving 12 bytes per subsystem is worth a patch. Paul --
Every thing that reduces code size (without sacrifice readability and maintain maintainability) should be worth. This is one of the reasons we accept patches that replacing kmalloc+memset with kzalloc, which just saves 8 bytes in my box. --
Are you sure? I don't have a buildable kernel tree at the moment, but
the following fragment compiled fine for me (with gcc 4.4.3):
struct foo {
_Bool b1:1;
_Bool b2:1;
};
and was sized at one byte. And "bool" is just a typedef of _Bool in
Agreed, within reason. But this patch doesn't reduce code size - it
makes the code fractionally more complicated and reduces the *binary*
Replacing two function calls with one function call is a code
simplification and hence (generally) a good thing - the minuscule
reduction in binary size reduction that comes with it is just noise.
Paul
--
It's a commonly used skill in kernel code, so I can't say it makes code more complicated. That said, I'll happily drop this patch. It just came to me when I started to add new bool values to the structure. Or if you prefer --
bool xxx:1 is fine with me - I think it's worth keeping the semantics of these being boolean flags as obvious as possible. I just wouldn't invest much effort in similar patches in the future (given that there are only likely to be, what, <20 instances of cgroup_subsys in the kernel?). Shrinking a structure that had potentially very many instances, such as css_set or cg_cgroup_link, would be different of course. Paul --
Stephane posted a patchset to add perf_cgroup subsystem, so perf can
be used to monitor all threads belonging to a cgroup.
But if you already mounted a cgroup hierarchy but without perf_cgroup
and the hierarchy has sub-cgroups, you can't bind perf_cgroup to it,
and thus you're not able to use per-cgroup perf feature.
This patchset alleviates the pain, and then a subsytem can be bind/unbind
to/from a hierarchy which has sub-cgroups in it.
For a cgroup subsystem to become bindable, the can_bind flag of
struct cgroup_subsys should be set, and provide ->bind() callback
if necessary.
But for some constraints, not all subsystems can take advantage of
this patch. For example, we can't decide a cgroup's cpuset.mems and
cpuset.cpus automatically, so cpuset is not bindable.
Usage:
# mount -t cgroup -o cpuset xxx /mnt
# mkdir /mnt/tmp
# echo $$ > /mnt/tmp/tasks
(assume cpuacct is bindable, and we add cpuacct to the hierarchy)
# mount -o remount,cpuset,cpuacct xxx /mnt
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
---
include/linux/cgroup.h | 5 +
kernel/cgroup.c | 225 ++++++++++++++++++++++++++++++++++++++---------
2 files changed, 187 insertions(+), 43 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index e23ded6..49369ff 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -490,6 +490,11 @@ struct cgroup_subsys {
* (not available in early_init time.)
*/
unsigned int use_id:1;
+ /*
+ * Indicate if this subsystem can be bound/unbound to/from a cgroup
+ * hierarchy which has child cgroups.
+ */
+ unsigned int can_bind:1;
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6c36750..46df5f8 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,7 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/eventfd.h>
#include <linux/poll.h>
+#include <linux/bitops.h>
#include ...You mean to say that you cannot add cpuset to an existing hierarchy right? Not that you cannot add perf/cpuacct to an existing cpuset hierarchy? --
You could just say it's a depth-first walk except we process the parent before Is there something wrong with the indentation here? I can't see the closing brace for the "if (root->number_of_cgroups > 1)" I think you could avoid the can_bind flag field entirely and do: if (!subsys[i]->bind) Also, if we're going with my "split out unbind" suggestion I think the part here would be: for_each_set_bit(i, &removed_bits, CGROUP_SUBSYS_COUNT) if (!subsys[i]->unbind) return -EBUSY; <snip> Cheers, -Matt Helsley --
Nope. For some subsystems we just set the flag and need not to --
The standard term for that is "pre-order traversal". You shouldn't need a diagram. Paul --
Maybe call this "bindable"? Basic idea looks great, it could do with a bunch more comments, and maybe locking rules. Is there any chance of a lock inversion between dir->i_mutex and cgroup_lock in hierarchy_popuiate_dir() ? Paul --
The lock order is:
mutex_lock(&dir->i_mutex)
mutex_lock(&cgroup_mutex)
mutex_lock(&dir->i_mutex, I_MUTEX_CHILD)
it should be safe.
--
This allows us to unbind a cgroup subsystem from a hierarchy
which has sub-cgroups in it.
Due to some complexity, for now subsytems that use css_get/put()
can't be unbound from such a hierarchy.
Usage:
# mount -t cgroup -o cpuset,cpuacct xxx /mnt
# mkdir /mnt/tmp
# echo $$ > /mnt/tmp/tasks
(remove it from the hierarchy)
# mount -o remount,cpuset xxx /mnt
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
---
kernel/cgroup.c | 76 ++++++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 61 insertions(+), 15 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 46df5f8..9ce3fdb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1036,19 +1036,42 @@ static int hierarchy_attach_css(struct cgroup *cgrp, void *data)
return ret;
}
-static int hierarchy_update_css_sets(struct cgroup *cgrp, void *data)
+static int hierarchy_remove_css(struct cgroup *cgrp, void *data)
{
- unsigned long added_bits = (unsigned long)data;
+ unsigned long removed_bits = (unsigned long)data;
int i;
+
+ /*
+ * There might be some pointers to the cgroup_subsys_state
+ * that we are going to destroy.
+ */
+ synchronize_rcu();
+
+ for_each_set_bit(i, &removed_bits, CGROUP_SUBSYS_COUNT) {
+ subsys[i]->destroy(subsys[i], cgrp);
+ cgrp->subsys[i] = NULL;
+ }
+
+ return 0;
+}
+
+static void hierarchy_update_css_sets(struct cgroup *cgrp,
+ unsigned long bits, bool add)
+{
struct cg_cgroup_link *link;
+ int i;
write_lock(&css_set_lock);
list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) {
struct css_set *cg = link->cg;
struct hlist_head *hhead;
- for_each_set_bit(i, &added_bits, CGROUP_SUBSYS_COUNT)
- cg->subsys[i] = cgrp->subsys[i];
+ for_each_set_bit(i, &bits, CGROUP_SUBSYS_COUNT) {
+ if (add)
+ cg->subsys[i] = cgrp->subsys[i];
+ else
+ cg->subsys[i] = dummytop->subsys[i];
+ }
/* rehash */
hlist_del(&cg->hlist);
@@ -1056,7 +1079,21 @@ static int ...Should these technically be rcu_assign_pointer() ? Paul --
For those subsystems (debug, cpuacct, net_cls and devices),
setting the can_bind flag is sufficient.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
---
kernel/cgroup.c | 1 +
kernel/sched.c | 1 +
net/sched/cls_cgroup.c | 1 +
security/device_cgroup.c | 1 +
4 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9ce3fdb..6364bb5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5124,5 +5124,6 @@ struct cgroup_subsys debug_subsys = {
.destroy = debug_destroy,
.populate = debug_populate,
.subsys_id = debug_subsys_id,
+ .can_bind = 1,
};
#endif /* CONFIG_CGROUP_DEBUG */
diff --git a/kernel/sched.c b/kernel/sched.c
index 51944e8..cae104f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9329,6 +9329,7 @@ struct cgroup_subsys cpuacct_subsys = {
.destroy = cpuacct_destroy,
.populate = cpuacct_populate,
.subsys_id = cpuacct_subsys_id,
+ .can_bind = 1,
};
#endif /* CONFIG_CGROUP_CPUACCT */
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 37dff78..020ddfe 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -38,6 +38,7 @@ struct cgroup_subsys net_cls_subsys = {
#define net_cls_subsys_id net_cls_subsys.subsys_id
#endif
.module = THIS_MODULE,
+ .can_bind = 1,
};
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 8d9c48f..b8136fc 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -473,6 +473,7 @@ struct cgroup_subsys devices_subsys = {
.destroy = devcgroup_destroy,
.populate = devcgroup_populate,
.subsys_id = devices_subsys_id,
+ .can_bind = 1,
};
int devcgroup_inode_permission(struct inode *inode, int mask)
--
1.7.0.1
--
To make it bindable, we need to thaw all processes when unbinding
the freezer subsystem from a cgroup hierarchy.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
---
include/linux/cgroup.h | 3 ++-
kernel/cgroup.c | 22 ++++++++++++++++++++--
kernel/cgroup_freezer.c | 19 +++++++++++++++++--
3 files changed, 39 insertions(+), 5 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 49369ff..1e4e1df 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -478,7 +478,8 @@ struct cgroup_subsys {
int (*populate)(struct cgroup_subsys *ss,
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
- void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
+ void (*bind)(struct cgroup_subsys *ss, struct cgroup *cgrp,
+ bool unbind);
int subsys_id;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6364bb5..12c1f7c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1105,6 +1105,22 @@ static int hierarchy_populate_dir(struct cgroup *cgrp, void *data)
return 0;
}
+static int hierarchy_subsys_bind(struct cgroup *cgrp, void *data)
+{
+ struct cgroup_subsys *ss = data;
+
+ ss->bind(ss, cgrp, false);
+ return 0;
+}
+
+static int hierarchy_subsys_unbind(struct cgroup *cgrp, void *data)
+{
+ struct cgroup_subsys *ss = data;;
+
+ ss->bind(ss, cgrp, true);
+ return 0;
+}
+
/*
* Call with cgroup_mutex held. Drops reference counts on modules, including
* any duplicate ones that parse_cgroupfs_options took. If this function
@@ -1172,7 +1188,8 @@ static int rebind_subsystems(struct cgroupfs_root *root,
list_move(&ss->sibling, &root->subsys_list);
ss->root = root;
if (ss->bind)
- ss->bind(ss, cgrp);
+ cgroup_walk_hierarchy(hierarchy_subsys_bind,
+ ss, cgrp);
mutex_unlock(&ss->hierarchy_mutex);
/* refcount was already taken, and we're keeping it */
} else if (bit & removed_bits) {
@@ -1180,7 +1197,8 @@ static int ...Based on experience using cgroups and questions we've fielded in the past on IRC I'd say users will really appreciate this. We're planning to use the freezer in checkpoint/restart. Since checkpoint requires the tasks to remain frozen for the duration of the syscall we add a kernel-internal freezer subsystem interface which prevents the cgroup from thawing. So we'll need some way to "block" unbinding for that as well. Perhaps the bind op should be able to return an error when unbind == true? Although that raises the question of what to do if only one of multiple unbinds fails.. Alternately you could split the bind/unbind op function pointers and get rid of the boolean argument. Then just don't fill in the freezer's unbind op and refuse to unbind subsystems that lack the unbind op. That seems a bit cleaner for now at least. Cheers, --
Or alternately to the alternate :): add a can_unbind:1 field and keep the boolean bind op argument. That might be best -- avoids an extra function pointer and still splits the bind/unbind-ability of a subsystem. Cheers, -Matt Helsley --
I think I'd vote for the separate bind() / unbind() callbacks. Paul --
And in case you're curious here are the portions of the checkpoint/restart tree which implement the above: http://www.linux-cr.org/git/?p=linux-cr.git;a=commitdiff;h=a584c1c6b729a544c76cbb173ce... http://www.linux-cr.org/git/?p=linux-cr.git;a=commitdiff;h=34869146ecd662352c0024e0769... Cheers, -Matt Helsley --
How about, we allow unbinding only when all the cgroups' state is We can use bindable:1 along with a callback can_bind(). For some subsystems, we just set bindable to true. For freezer subsystem, we set it to true and provide freezer_can_bind(), and return true only if no cgroup's state is FROZEN. --
For now bindable subsystems should not use css_get/put(), so warn
on this misuse.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
---
include/linux/cgroup.h | 7 +++++--
kernel/cgroup.c | 3 +++
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1e4e1df..10e4e02 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -80,13 +80,15 @@ struct cgroup_subsys_state {
/* bits in struct cgroup_subsys_state flags field */
enum {
- CSS_ROOT, /* This CSS is the root of the subsystem */
- CSS_REMOVED, /* This CSS is dead */
+ CSS_ROOT, /* This CSS is the root of the subsystem */
+ CSS_REMOVED, /* This CSS is dead */
+ CSS_NO_GET, /* Forbid calling css_get/put() */
};
/* Caller must verify that the css is not for root cgroup */
static inline void __css_get(struct cgroup_subsys_state *css, int count)
{
+ BUG_ON(test_bit(CSS_NO_GET, &css->flags));
atomic_add(count, &css->refcnt);
}
@@ -119,6 +121,7 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
{
if (test_bit(CSS_ROOT, &css->flags))
return true;
+ BUG_ON(test_bit(CSS_NO_GET, &css->flags));
while (!atomic_inc_not_zero(&css->refcnt)) {
if (test_bit(CSS_REMOVED, &css->flags))
return false;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 12c1f7c..88e265d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -936,6 +936,9 @@ static void init_cgroup_css(struct cgroup_subsys_state *css,
css->cgroup = cgrp;
atomic_set(&css->refcnt, 1);
css->flags = 0;
+ /* For now, a bindable subsystem should not call css_get/put(). */
+ if (ss->can_bind)
+ set_bit(CSS_NO_GET, &css->flags);
css->id = NULL;
if (cgrp == dummytop)
set_bit(CSS_ROOT, &css->flags);
--
1.7.0.1
--
Maybe re-title this patch to BUG() if a bindable subsystem calls css_get() ? Paul --
Provide a usage example, and update the bind() callback API. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> --- Documentation/cgroups/cgroups.txt | 26 +++++++++++++++++--------- 1 files changed, 17 insertions(+), 9 deletions(-) diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index 190018b..5b5382a 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt @@ -363,17 +363,23 @@ Note this will add ns to the hierarchy but won't remove memory or cpuset, because the new options are appended to the old ones: # mount -o remount,ns /dev/cgroup +For some subsystems you can bind them to a mounted hierarchy or +remove them from it, even if there're sub-cgroups in it: +# mount -t cgroup -o freezer hier1 /dev/cgroup +# echo $$ > /dev/cgroup/my_cgroup +# mount -o freezer,cpuset hier1 /dev/cgroup +(failed) +# mount -o freezer,cpuacct hier1 /dev/cgroup +# mount -o cpuacct hier1 /dev/cgroup + +Note cpuacct should be sit in the default hierarchy before remount. + To Specify a hierarchy's release_agent: # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \ xxx /dev/cgroup Note that specifying 'release_agent' more than once will return failure. -Note that changing the set of subsystems is currently only supported -when the hierarchy consists of a single (root) cgroup. Supporting -the ability to arbitrarily bind/unbind subsystems from an existing -cgroup hierarchy is intended to be implemented in the future. - Then under /dev/cgroup you can find a tree that corresponds to the tree of the cgroups in the system. For instance, /dev/cgroup is the cgroup that holds the whole system. @@ -623,13 +629,15 @@ initialization which might be required before a task could attach. For example in cpusets, no task may attach before 'cpus' and 'mems' are set up. -void bind(struct cgroup_subsys *ss, struct cgroup *root) +void bind(struct cgroup_subsys *ss, struct cgroup *cgrp, bool unbind) ...
On Fri, 22 Oct 2010 16:12:25 +0800 Can this operation be used with noop cgroup ? (IOW, perf cgroup can be attached to noop cgroup ?) Thanks, -Kame --
But, wait. It stuck when I moved a task from a sub-cgroup to the root. Need to dig more.. --
This is a bit vague. How about: For non-bindable subsystems, this will only involve movement between the default hierarchy (which never has sub-cgroups) and a hierarchy that is being created/destroyed (and hence has no sub-cgroups). For binadable subsystems, this may also involve movement between the default hierarchy and a mounted hierarchy that's populated with sub-cgroups. Also, the docs should mention that a cgroup setting the can_bind flag has to be able to support side-effect free movement of a task into any just-created cgroup, and into the root cgroup at any time. i.e. it's not suitable for any subsystem where can_attach() might return false for the root cgroup or a newly-created cgroup, or attach() might have side-effects for those same cases. Actually, perhaps we should forbid the combination of having both an attach() callback and can_bind=true ? Also, post_clone() doesn't get called when creating the css hierarchy during binding. Paul --
Oops, ignore that - I noticed that it does. Paul --
This is much better. :) Documentation often causes my headache due to my limited English skill. --
Right, so the only remaining problem I see with this approach is that you cannot profile two different hierarchies at the same time, but I can't really think of a solution to that problem (nor do I care very much). Seems like a nice approach, Thanks Li! --
Paul had a patch to allow some subsystems to be added to multi-hierarchies, which may help. But it forbids accessing t->cgroups, which makes this feature of limited use. Thanks! --
I more or less dropped it since I couldn't see a big use for it - all the examples I had were more easily implemented as basic cgroup Not exactly - it meant that you couldn't do a fast dereference via t->cgroups to get to your cgroup_subsys_state, since there was no longer a unique css for each task for that subsystem - there would be one per mounted instance of the subsystem. Instead it became necessary to follow a chain of pointers to find the appropriate css for the hierarchy in question. In general this also means that it's harder for the multi-bound subsystem have any real-world meaning - it's mostly useful for stateless subsystems, or ones where the cgroup_subsys_state contents have no direct connection with machine-wide data. Paul --
