Re: [RFC] Control Groups Roadmap ideas

From: Paul Menage
Date: Tuesday, April 8, 2008 - 2:14 pm

This is a list of some of the sub-projects that I'm planning for
Control Groups, or that I know others are planning on or working on.
Any comments or suggestions are welcome.


1) Stateless subsystems
-----

This was motivated by the recent "freezer" subsystem proposal, which
included a facility for sending signals to all members of a cgroup.
This wasn't specifically freezer-related, and wasn't even something
that needed particular per-cgroup state - its only state is that set
of processes, which is already tracked by cgroups. So it could
theoretically be mounted on multiple hierarchies at once, and wouldn't
need an entry in the css_set array.

This would require a few internal plumbing changes in cgroups, in particular:

- hashing css_set objects based on their cgroups rather than their css pointers
- allowing stateless subsystems to be in multiple hierarchies
- changing the way hierarchy ids are calculated - simply ORing
together the subsystem bits would no longer work since that could
result in duplicates
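
To illustrate the last point, here is a minimal user-space sketch, not
kernel code. It assumes the hierarchy id is simply the bitmask of bound
subsystems; the enum values, struct hierarchy, or_based_id() and
ids_collide() are all hypothetical names made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subsystem numbering, for illustration only. */
enum { SUBSYS_CPU = 0, SUBSYS_MEMORY = 1, SUBSYS_SIGNAL = 2 /* stateless */ };

/* Hypothetical model of a hierarchy: just the set of bound subsystems. */
struct hierarchy {
	unsigned long subsys_bits;
};

/* Assumed id scheme: OR of the subsystem bits, i.e. the bitmask itself. */
static unsigned long or_based_id(const struct hierarchy *h)
{
	return h->subsys_bits;
}

/* Return true if any two hierarchies in the array share an OR-based id. */
static bool ids_collide(const struct hierarchy *h, size_t n)
{
	assert(h != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		for (size_t j = i + 1; j < n; j++)
			if (or_based_id(&h[i]) == or_based_id(&h[j]))
				return true;
	return false;
}
```

While every subsystem is bound to at most one hierarchy, the bitmasks
are disjoint and the ids are unique. Two hierarchies that each carry
only the same stateless subsystem get the same id under this scheme.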


2) More flexible binding/unbinding/rebinding
-----

Currently you can only add/remove subsystems to a hierarchy when it
has just a single (root) cgroup. This is a bit inflexible, so I'm
planning to support:

- adding a subsystem to an existing hierarchy by automatically
creating a subsys state object for the new subsystem for each existing
cgroup in the hierarchy and doing the appropriate
can_attach()/attach_tasks() callbacks for all tasks in the system

- removing a subsystem from an existing hierarchy by moving all tasks
to that subsystem's root cgroup and destroying the child subsystem
state objects

- merging two existing hierarchies that have identical cgroup trees

- (maybe) splitting one hierarchy into two separate hierarchies

Whether all these operations should be forced through the mount()
system call, or whether they should be done via operations on cgroup
control files, is something I've not figured out yet.
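
As a rough user-space sketch of the first two operations (not kernel
code): binding walks the existing tree and creates a state object per
cgroup, and unbinding destroys the state of every non-root cgroup. The
can_attach()/attach_tasks() callbacks and the task movement are left
out, and struct toy_cgroup, toy_bind() and toy_unbind() are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_SUBSYS 8	/* hypothetical limit for this sketch */

struct toy_cgroup {
	struct toy_cgroup *child;	/* first child, or NULL */
	struct toy_cgroup *sibling;	/* next sibling, or NULL */
	void *state[TOY_MAX_SUBSYS];	/* per-subsystem state, NULL if unbound */
};

/*
 * Create a state object for subsystem `id` on `cg`, its siblings and all
 * of their descendants. Returns 0 on success or -1 on allocation failure;
 * on failure the caller is expected to clean up the partial state.
 */
static int toy_bind(struct toy_cgroup *cg, int id, size_t state_size)
{
	assert(id >= 0 && id < TOY_MAX_SUBSYS);

	for (; cg; cg = cg->sibling) {
		if (!cg->state[id]) {
			cg->state[id] = calloc(1, state_size);
			if (!cg->state[id])
				return -1;
		}
		if (toy_bind(cg->child, id, state_size))
			return -1;
	}
	return 0;
}

/* Free the state for subsystem `id` on `cg`, its siblings and descendants. */
static void toy_free_subtree(struct toy_cgroup *cg, int id)
{
	for (; cg; cg = cg->sibling) {
		free(cg->state[id]);
		cg->state[id] = NULL;
		toy_free_subtree(cg->child, id);
	}
}

/* Unbind: destroy the child state objects but keep the root's state. */
static void toy_unbind(struct toy_cgroup *root, int id)
{
	assert(id >= 0 && id < TOY_MAX_SUBSYS);
	toy_free_subtree(root->child, id);
}
```

In this model the root keeps its state object after toy_unbind(),
matching the idea that all tasks end up in the subsystem's root cgroup.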


3) Subsystem ...
From: Li Zefan
Date: Tuesday, April 8, 2008 - 7:28 pm

Sounds good, and I quickly wrote a prototype:

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index a6a6035..091bc21 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -254,6 +254,7 @@ struct cgroup_subsys {
 			struct cgroup *cgrp);
 	void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
+	int (*can_mount)(struct cgroup_subsys *ss, unsigned long subsys_bits);
 	int subsys_id;
 	int active;
 	int disabled;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 62f1a52..3d43ff2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -824,6 +824,25 @@ static int parse_cgroupfs_options(char *data,
 	return 0;
 }
 
+static int check_mount(unsigned long subsys_bits)
+{
+	int i;
+	int ret;
+	struct cgroup_subsys *ss;
+
+	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+		ss = subsys[i];
+
+		if (test_bit(i, &subsys_bits) && ss->can_mount) {
+			ret = ss->can_mount(ss, subsys_bits);
+			if (ret)
+				return ret;
+		}
+	}
+
+	return 0;
+}
+
 static int cgroup_remount(struct super_block *sb, int *flags, char *data)
 {
 	int ret = 0;
@@ -839,6 +858,10 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
 	if (ret)
 		goto out_unlock;
 
+	ret = check_mount(opts.subsys_bits);
+	if (ret)
+		goto out_unlock;
+
 	/* Don't allow flags to change at remount */
 	if (opts.flags != root->flags) {
 		ret = -EINVAL;
@@ -959,6 +982,13 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
 		return ret;
 	}
 
+	ret = check_mount(opts.subsys_bits);
+	if (ret) {
+		if (opts.release_agent)
+			kfree(opts.release_agent);
+		return ret;
+	}
+
 	root = kzalloc(sizeof(*root), GFP_KERNEL);
 	if (!root) {
 		if (opts.release_agent)
-------

For the example about the swap controller and the memory controller:

static int swap_cgroup_can_mount(struct cgroup_subsys *ss,
				unsigned long subsys_bits)
{
	if (!test_bit(mem_cgroup_subsys_id, ...
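
The example above is cut off in the archive. As a stand-alone
user-space sketch of the same idea (not the original code), the model
below has a swap subsystem refuse to mount unless the memory subsystem
is in the same mount request. All toy_* names are hypothetical, and the
-EINVAL return value is an assumption, since the original return code
is not shown.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical subsystem ids for this sketch. */
enum { TOY_MEM_ID = 0, TOY_SWAP_ID = 1, TOY_SUBSYS_COUNT = 2 };

struct toy_subsys {
	/* Optional veto hook: return 0 to allow the mount, <0 to refuse. */
	int (*can_mount)(struct toy_subsys *ss, unsigned long subsys_bits);
};

/* Swap depends on memory: refuse if the memory bit is not requested. */
static int toy_swap_can_mount(struct toy_subsys *ss, unsigned long subsys_bits)
{
	assert(ss != NULL);

	if (!(subsys_bits & (1UL << TOY_MEM_ID)))
		return -EINVAL;	/* assumed error code */
	return 0;
}

static struct toy_subsys toy_subsystems[TOY_SUBSYS_COUNT] = {
	[TOY_MEM_ID]  = { .can_mount = NULL },
	[TOY_SWAP_ID] = { .can_mount = toy_swap_can_mount },
};

/* Ask every requested subsystem that has a hook whether it accepts the set. */
static int toy_check_mount(unsigned long subsys_bits)
{
	for (int i = 0; i < TOY_SUBSYS_COUNT; i++) {
		struct toy_subsys *ss = &toy_subsystems[i];

		if ((subsys_bits & (1UL << i)) && ss->can_mount) {
			int ret = ss->can_mount(ss, subsys_bits);

			if (ret)
				return ret;
		}
	}
	return 0;
}
```

The loop mirrors check_mount() in the patch: only subsystems that are
part of the request and that provide a hook get a say.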
From: Paul Menage
Date: Thursday, April 10, 2008 - 1:10 pm

Yes, that's pretty much what I was envisaging, thanks.

--

From: Serge E. Hallyn
Date: Friday, April 11, 2008 - 7:48 am

I'm tempted to ask what the use case is for this (I assume you have one,
you don't generally introduce features for no good reason), but it
doesn't sound like this would have any performance effect on the general
case, so it sounds good.

I'd stick with mount semantics.  Just
	mount -t cgroup -o remount,devices,cpu none /devwh

I guess I'm hoping that if libcg goes well then a userspace daemon can
do all we need.  Of course the use case I envision is having a container
which is locked to some amount of ram, wherein the container admin wants
to lock some daemon to a subset of that ram.  If the host admin lets the
container admin edit a config file (or talk to a daemon through some
sock designated for the container) that will only create a child of the
container's cgroup, that's probably great.


I'm slooowly trying to whip together a swapfile namespace - not a
cgroup - which ties a swapfns to a list of swapfiles (where each
swapfile belongs to only one swapfns).  So I also need an mm->task
pointer of some kind.  I've got my own in my patches right now but
sure do hope to make use of Balbir's mm owner field.

-serge
--

From: Balbir Singh
Date: Friday, April 11, 2008 - 10:10 pm

I thought of doing something like this in libcg (having a daemon and a
client socket interface), but dropped the idea later. When all
controllers support multi-levels well, the plan is to create a
sub-directory in the cgroup hierarchy and give subtree ownership to

If you do have any specific requirements, we can cater to them right
now. Please do let us know. The biggest challenge right now is getting

Yes, it can be easily handled by libcg. I think this is an important

I have version 9 out. It has all the review comments incorporated. If

Serge, do you have any specific requirements for the mm owner field?
Will the current patch meet your requirements (including
mm_owner_changed field callbacks)?

Balbir
--

From: Serge E. Hallyn
Date: Sunday, April 13, 2008 - 9:11 am

It sounds like what you're talking about should suffice - the container
can only write to its own subdirectory, and the control files therein
should not allow the container to escape the bounds set for it, only to
partition it.

The only thing that worries me is how subtle it may turn out to be to
properly set up a container this way.  I.e. you'll need to
	mount --bind /etc/cgroups/mycontainer /vps/container1/etc/cgroups
before the container is off and running and be able to then prevent
the cgroup from mounting the host's /etc any other way.

As in so many other cases it shouldn't be too difficult with selinux,
otherwise I suppose one thing you could do is to put the host's
/etc/cgroup (or really the host's /) on partitionN, mount
/etc/cgroup/container from another partitionM, and use the device
whitelist (eventually, device namespaces) to allow the container to
mount partitionM but not partitionN.

So that's the one place where kernel support might be kind of seductive,
but I suspect it would just lead to either an unsafe, an inflexible, or
just a hokey "solution".  So let's stick with libcg for now.  A daemon
can always be written on top of it if people want, and if at some point
we see a real need for kernel support we can talk about it then.


I'm behind in versions, but the last time I took a look it looked great.

thanks,
-serge
--

From: Balbir Singh
Date: Monday, April 14, 2008 - 7:31 am

Sounds fair to me. We intend to provide the basis for building a good daemon if

Thanks, that would be nice. I've just asked Andrew to include it, if there are


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Paul Menage
Date: Sunday, April 13, 2008 - 10:24 pm

Back during the early versions of control groups, Paul Jackson
proposed a bind/unbind API that would let you affect the subsystems on
an active hierarchy, and it was always a goal of mine to implement
that - the current inflexibility is something that I've never been that
keen on, but it was OK for the first big release and could be extended
later.

One of the potential scenarios was that you might want to have a very
early boot script set up cpusets and node isolation for a set of
system daemons, and then bind other subsystems on to the same

Yes, probably - particularly if we restrict it to adding/removing
subsystems from an existing tree, rather than splitting and merging

That's a different issue, and one that I left out of the roadmap
email. We can have a virtualization subsystem that controls what
subset of a given hierarchy you can see - if the virtualization
subsystem is bound to a given hierarchy, and a cgroup is marked as
virtualized, then a mount of that hierarchy by a process in the
virtualized cgroup will see that cgroup as the root of the hierarchy.
It would be a bit like doing a bind mount of a subtree of the main

This would be to allow virtual servers to mount their own swapfiles?
Presumably there'd still be a use for a swap cgroup for job systems
that want to isolate swap usage without virtualization or requiring
jobs to mount their own swapfiles?

Paul
--

From: Serge E. Hallyn
Date: Monday, April 14, 2008 - 7:11 am

That seems to work.  Now we don't necessarily want that for every group
composed with the virtualized subsystem right?  I.e. if I do

mount -t cgroup -o ns,cpuset,virt none /containers

then all tasks are mapped under /containers.  If login does a
clone(CLONE_NEWNS) for hallyn's login to give him a private /tmp,
then hallyn ends up under /containers/node_xyz, but we don't want him
to be virtualized under there.  So I assume we'd want a virt.lock file
or something like that so that, when I create a container, my
start_container script can echo 1 > /containers/node_abc/virt.lock

I assume the container will also have to remount a fresh copy of the
cgroup composition so it can have the dentry for /containers/node_abc
as the root dentry for /containers?
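
The view being discussed can be sketched in user space as a path
translation: a task whose cgroup has been marked as virtualized (for
instance through something like the virt.lock file suggested above)
sees that cgroup as "/" and cannot name anything outside it. This is
only an illustration of the idea, not how the kernel would do it with
dentries; toy_virt_path() and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Translate `real`, a path inside the full hierarchy, into the path seen
 * from a mount whose root is the virtualized cgroup `vroot`.
 * Returns 0 and fills `out` on success; returns -1 if `real` lies outside
 * `vroot` or `out` is too small.
 */
static int toy_virt_path(const char *vroot, const char *real,
			 char *out, size_t outlen)
{
	size_t n;
	const char *rest;

	assert(vroot && real && out);

	n = strlen(vroot);
	if (n == 1 && vroot[0] == '/')
		n = 0;		/* root of the hierarchy: nothing to strip */

	if (strncmp(real, vroot, n) != 0)
		return -1;	/* not under the virtualized root */
	if (real[n] != '\0' && real[n] != '/')
		return -1;	/* e.g. "/node_abcd" is not under "/node_abc" */

	rest = real[n] ? real + n : "/";
	if (strlen(rest) + 1 > outlen)
		return -1;
	strcpy(out, rest);
	return 0;
}
```

With vroot set to "/node_abc", the real path "/node_abc/job1" is seen
as "/job1", and a sibling such as "/node_xyz" is not visible at all.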


Yes.  Main reason for having this would be so that a container which
you're going to migrate could have its own swapfile which can move
with it (or live on network fs).

-serge
--

From: Paul Menage
Date: Monday, April 14, 2008 - 8:03 am

Yes.

Paul
--
