Re: Re: [RFC] How to handle the rules engine for cgroups

From: Vivek Goyal
Date: Tuesday, July 1, 2008 - 12:11 pm

Hi,

While development is going on for cgroup and various controllers, we also
need a facility so that an admin/user can specify the group creation and
also specify the rules based on which tasks should be placed in respective
groups. Group creation part will be handled by libcg which is already
under development. We still need to tackle the issue of how to specify
the rules and how these rules are enforced (rules engine).

I have gathered a few views on how the rules engine could be
implemented, and I am listing them below.

Proposal 1
==========
Let a user space daemon handle all that. The daemon will open a netlink
socket and receive notifications for various kernel events. It will
also parse the appropriate admin-specified rules config file and place
processes in the right cgroup based on the rules as and when events happen.

I have written a prototype user space program which does that. Program 
can be found here. Currently it is in very crude shape.

http://people.redhat.com/vgoyal/misc/rules-engine-daemon/user-id-based-namespaces.patch

Various people have raised two main issues with this approach.

- netlink is not a reliable protocol.
	- Messages can be dropped and one can lose messages. That means a
	  newly forked process might never go into the right group as meant.

- How to handle delays in rule execution?
	- For example, if an "exec" happens, then by the time the process is
	  moved to the right group, it might have forked off a few more
	  processes or done a fair amount of memory allocation, which will
	  be charged to the wrong group. Or, the newly exec'ed process might
	  get killed in the existing cgroup because of lack of memory
	  (despite the fact that the destination cgroup has sufficient memory).
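
A minimal sketch of the Proposal 1 flow, assuming rules keyed on uid, gid
and executable name, with the first matching rule winning. The Rule and
Event types, the matching semantics and the helper names are all
hypothetical and are not taken from the prototype. The only kernel
interface relied on is that writing a pid to a cgroup's "tasks" file moves
that task. The netlink receive loop is left out.

```python
import os
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a field left as None matches anything."""
    cgroup: str                  # destination, relative to the cgroup mount
    uid: Optional[int] = None
    gid: Optional[int] = None
    exe: Optional[str] = None    # basename of the executable


@dataclass(frozen=True)
class Event:
    """The subset of a process event that the rules look at."""
    pid: int
    uid: int
    gid: int
    exe: str


def match_rule(rules: Sequence[Rule], ev: Event) -> Optional[Rule]:
    """Return the first rule whose non-None fields all match the event."""
    for rule in rules:
        if rule.uid is not None and rule.uid != ev.uid:
            continue
        if rule.gid is not None and rule.gid != ev.gid:
            continue
        if rule.exe is not None and rule.exe != os.path.basename(ev.exe):
            continue
        return rule
    return None


def place_task(mount: str, rules: Sequence[Rule], ev: Event) -> Optional[str]:
    """Move ev.pid into the matching cgroup by writing to its tasks file.

    Returns the tasks-file path written to, or None if no rule matched.
    """
    rule = match_rule(rules, ev)
    if rule is None:
        return None
    tasks = os.path.join(mount, rule.cgroup, "tasks")
    with open(tasks, "w") as f:
        f.write(str(ev.pid))
    return tasks
```

Everything the task does between the event being generated and
place_task() completing is still charged to the old group, which is the
delay issue listed above.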

Proposal 2
==========
Implement one or more kernel modules which will implement the rule engine.
User space program can parse the config files and pass it to module.
Kernel will be patched only on select points to look for the rules (as
provided by modules). Very ...
From: Kazunaga Ikeno
Date: Wednesday, July 2, 2008 - 2:33 am

right.

I think it is necessary to avoid these issues.
IMO, the second one in particular (delays in handling).

I'd agree with your opinion.
Strict movement of tasks is indispensable in enterprise environments.


Regards, Kazunaga Ikeno

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, July 2, 2008 - 6:19 pm

On Tue, 1 Jul 2008 15:11:26 -0400
Hmm, can't we rework the process event connector to use some reliable
fast interface besides netlink ? (I mean an interface like eventpoll.)
(Or enhance netlink ? ;)

Because the "a child inherits its parent's cgroup" rule is very strong, I
think the number of events we have to check is much smaller than the number
reported to us. Can't we add some filter/assumption here ?

BTW, isn't the placement of proc_exec_connector() too late ? It seems memory
for creating the exec-image is charged to the original group...

Thanks,
-Kame

--

From: Vivek Goyal
Date: Thursday, July 3, 2008 - 8:54 am

I see following text in netlink man page.

"However, reliable transmissions from kernel to user are impossible in
 any case. The kernel can’t send a netlink message if the socket buffer
 is full: the message will be dropped and the kernel and  the userspace
 process will no longer have the same view of kernel state. It is up to
 the application to detect when this  happens  (via  the  ENOBUFS error
 returned by recvmsg(2)) and resynchronize."

So at the end of the day, it looks like unreliability comes from the
fact that we can not allocate memory currently so we will discard the
packet.

Are there alternatives to dropping packets?

- Let the sender cache the packet and retry later. So maybe the netlink
  layer can return an error if the packet can not be queued, and the
  connector can cache the event and try sending it later. (Hopefully the
  memory situation will have improved by then, because of OOM or some
  process exiting or something else...).

  This looks like a band-aid for temporary congestion kinds of
  problems. It will not help if the consumer is inherently slow and
  event generation is faster.

This probably can be one possible enhancement to connector, but at the end
of the day, any kind of user space daemon will have to accept the fact
that packets can be dropped, leading to lost events. Detect that situation
(using ENOBUFS) and then let admin know about it (logging). I am not sure
what admin is supposed to do after that.
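
A rough sketch of the "detect ENOBUFS, log and resynchronize" loop
described above. The resync callback is hypothetical: one plausible choice
would be to rescan /proc and reclassify every task, but the thread does not
settle on what recovery should look like. The sock argument only needs a
recv() method, so the same loop can be driven by a real netlink socket or
by a stand-in.

```python
import errno


def drain_events(sock, handle_event, resync, max_reads):
    """Read up to max_reads messages; on ENOBUFS, count the loss and resync.

    `resync` is a hypothetical callback that rebuilds the daemon's view
    from scratch. Returns the number of overruns seen.
    """
    overruns = 0
    for _ in range(max_reads):
        try:
            data = sock.recv(65536)
        except OSError as e:
            if e.errno != errno.ENOBUFS:
                raise
            # The kernel dropped at least one message: our view is stale.
            overruns += 1
            resync()
            continue
        handle_event(data)
    return overruns
```

The loop can tell that something was lost, but not what, so any task that
forked or exec'ed during the overrun stays misplaced until the resync
reaches it.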

I am CCing Thomas Graf. He might have a better idea of netlink limitations

I am not sure if proc connector currently allows filtering of various
events like fork, exec, exit etc. In a quick look it looks like it
does not. But probably that can be worked out. Even then, it will just
help reduce the number of messages queued for user space on that socket
but will not take away the fact that messages can be dropped under

As of today it should happen because newly execed process will run into
same cgroup as parent.  But that's what probably we ...
From: KAMEZAWA Hiroyuki
Date: Thursday, July 3, 2008 - 5:34 pm

On Thu, 3 Jul 2008 11:54:46 -0400
If it's just problem of memory allocation, preallocate socket buffer and
use it later, like radix_tree_preload().
==
   foo() {
	if (preallocate())
		return -ENOBUFS;

	.......
	proc_xxxx_connector();
   }
==
(this means setuid() will return -ENOBUFS, undocumented error code.)

But the af_netlink layer has other causes of dropped packets:
 1. copying skb at broadcast.
 2. recv buffer overrun.


Thanks,

--

From: Li Zefan
Date: Thursday, July 3, 2008 - 8:17 pm

Proc connector doesn't support event filtering. We can easily add a
global event mask, but adding a per-socket event mask is not
straightforward, if not impossible.
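
Until the connector can filter, a consumer can only discard unwanted
events after receiving them. A small sketch of such a mask check follows;
the bit values are the event types from <linux/cn_proc.h>, while the
(type, pid) tuple shape and the function names are made up for
illustration. Note that this does nothing to reduce the number of messages
queued on the socket.

```python
# Event type bits as defined in <linux/cn_proc.h>.
PROC_EVENT_FORK = 0x00000001
PROC_EVENT_EXEC = 0x00000002
PROC_EVENT_UID = 0x00000004
PROC_EVENT_GID = 0x00000040
PROC_EVENT_EXIT = 0x80000000


def wanted(event_type, mask):
    """True if the event's type bit is set in the mask."""
    return bool(event_type & mask)


def filter_events(events, mask):
    """Keep only the (event_type, pid) tuples selected by the mask."""
    return [ev for ev in events if wanted(ev[0], mask)]
```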

--

From: Balbir Singh
Date: Tuesday, July 8, 2008 - 2:35 am

One thing we did with the delay accounting framework was to add the ability for
clients to listen on a per-cpu basis, that helped us scale well (user space

CKRM had a kernel module for rule based classification - called rule based
classification engine (rbce). We should consider a simple cgroups client that
can share a database from user space and use the fork callback for classification.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

--

From: Vivek Goyal
Date: Tuesday, July 8, 2008 - 6:45 am

Ok, I will look into it. But another key question still remains that if we
do it in user space, then there is no easy way of avoiding delay in execution

Hmm..., had a quick look and CKRM implemented its rule based engine as a
kernel module.

Initially I thought of providing rules based on uid, gid and executable name.
So basically policies enforced upon setuid and exec related calls. I am
wondering if the rules engine can be split in two parts. The set of rules
which can bear delay can live in user space, and those which can not bear
delay can live in the kernel. Something like, moving of tasks from one cgroup
to another can probably go in user space, or fork notification related rules
can live in user space.

Thanks
Vivek
--

From: Paul Menage
Date: Thursday, July 10, 2008 - 2:23 am

I think that I'm a little skeptical that anyone would ever want to do that.

Wouldn't it be a simpler mechanism for the admin to simply have
wrappers around the "firefox" and "oracle" binaries that move the
process into the "browser" or "database" cgroup before running the

I can help there. :-) At Google we have two approaches:

- grid jobs, which are moved into the appropriate cgroup (actually,
currently cpuset) by the grid daemon when it starts the job

- ssh logins, which are moved into the appropriate cpuset by a
forced-command script specified in the sshd config.

I don't see the rule-based approach being all that useful for our needs.

It's all very well coming up with theoretical cases that a fancy new
mechanism solves. But it carries more weight if someone can stand up
and say "Yes, I want to use this on my real cluster of machines". (Or
even "Yes, if this is implemented I *will* use it on my desktop" would
be a start)

Paul
--

From: Vivek Goyal
Date: Thursday, July 10, 2008 - 7:30 am

Well, that would mean first wrappers need to be created around all the
applications which need to be controlled. Then the wrapper needs to
synchronize with the classification daemon (have I been put into
the right cgroup, and can I go ahead with launching the real binary, etc).
This sounds ugly and putting wrappers around all the applications does

So grid daemon probably first forks off, determines the right cpuset

So it boils down to.

1) Can we bear the delay in task classification (especially on exec)? If yes,
  then all the classification can take place in userspace.

2) If no,
	a) Then either we need to implement a rule based engine to let
	  the kernel do classification.

	b) or we need to do various things in user space as you suggested.
		- Put wrappers around applications.
		- Job launcher (ex. Grid daemon) is modified to determine
		  the right cgroup and place application there before
		  actually launching the job.

Balbir and other people, any more thoughts on this? How exactly does this
need to be used in your work environment?

I am a little skeptical of option 2b working in most of the scenarios.

Thanks
Vivek
--

From: Dhaval Giani
Date: Thursday, July 10, 2008 - 8:42 am

I like this approach. The whole classification should really be done by
userspace. Let the wrapper move into the correct group and then start the
task. The kernel really is not the right place for the classification.

And you can have a default group for tasks that really don't care about
where they are placed.

-- 
regards,
Dhaval
--

From: Paul Menage
Date: Thursday, July 10, 2008 - 9:51 am

I was suggesting that you wouldn't need a classification daemon in
this case. The logic of which cgroup to enter would be in the

Pretty much, yes. Most jobs have their own cpuset that's created for
them dynamically when the job starts on the machine.

Paul
--

From: Rik van Riel
Date: Thursday, July 10, 2008 - 7:48 am

On Thu, 10 Jul 2008 02:23:52 -0700

Agreed, there really is no need for a rule-based approach in kernel space.

There are basically three different cases:

1) daemons get started up in their own process groups, this can
   be handled by the initscripts

2) user sessions (ssh, etc) start in their own process groups,
   this can be handled by PAM

3) users fork processes that should go into special process
   groups - this could be handled by having a small ruleset
   in userspace handle things, right before calling exec(),
   it can even be hidden from the application by hooking into
   the exec() call

If a user overrides the rules for their own processes, at worst
s/he takes away resources from him/herself.  No security problem.

Is there any reason at all to push for a kernel side rule-based
engine, except "I want to make my patch set unmergeable?"

-- 
All Rights Reversed
--

From: Vivek Goyal
Date: Thursday, July 10, 2008 - 8:40 am

That means application launcher (ex, shell) is aware of the right cgroup
targeted application should go in and then move forked pid to right

This means hooking into libc. So libc will parse rules file, determine
the right cgroup, place application there and then call exec?

CCing, Ulrich also in case he has some thoughts.

Thanks
Vivek
--

From: Ulrich Drepper
Date: Thursday, July 10, 2008 - 8:56 am



As with any "solution" based on userlevel code, the problem is overhead
and interfaces.

Such a rules file would be a real file, I assume, and as such we'd have
to read it every time an exec call is made.  At least we'd have to check
using a stat() call that nothing changed.  That's always a big overhead.
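
For illustration, the stat()-based freshness check could look roughly like
the sketch below. The RuleCache class and its parse callback are
hypothetical. The cache lives inside one process, so it only saves the
re-read and re-parse for repeated exec attempts from that same process;
the stat() call itself is still paid every time.

```python
import os


class RuleCache:
    """Re-parse a rules file only when stat() shows it has changed."""

    def __init__(self, path, parse):
        self.path = path
        self.parse = parse        # hypothetical parser: text -> rules
        self._stamp = None        # (mtime_ns, size, inode) at last parse
        self._rules = None
        self.reloads = 0          # how many times the file was re-read

    def get(self):
        st = os.stat(self.path)   # the per-exec check mentioned above
        stamp = (st.st_mtime_ns, st.st_size, st.st_ino)
        if stamp != self._stamp:
            with open(self.path) as f:
                self._rules = self.parse(f.read())
            self._stamp = stamp
            self.reloads += 1
        return self._rules
```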

Once the information is available, how is it used?  We'd have to pass
additional information to the exec syscalls.  And it has to happen so
that if the exec call fails the original process is not affected (i.e.,
premature changing isn't an option).  The method also must be
thread-safe in a limited way: executing failing exec syscalls in
multiple threads mustn't disturb the process.

There is one set of problems which I don't care about but others likely
will: what happens if some program uses the syscalls directly?  And what
happens with old libcs and old statically linked programs?  It's exactly
the kind of problem that makes me tell people never to link statically, but
some people don't listen.


The additional file update check hurts performance, but since I hope
that we will get an inotify-like interface that doesn't need normal file
descriptors (or any file descriptors) I think I can live with it.
Somebody would "just" have to implement, e.g., the anonfd functionality
discussed some time ago.  (Make sure to talk to Al Viro who already
mentioned to me that it'll be "fun").

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--

From: Rik van Riel
Date: Thursday, July 10, 2008 - 10:25 am

On Thu, 10 Jul 2008 08:56:25 -0700

One easy way is to have a "migrate on exec" option added to the
process group code.  Instead of moving yourself to a new process
group before exec, you do the same invocation but with a "migrate
me lazily at exec time" flag.

At exec time, your current resources will be subtracted from the
old process group (most of it automatically in exit_mmap) and your 
new resources will be added to the new process group on the other 
side of exec.


Those people will have to move their processes around between
process groups manually (or with shell scripts).  Having per
program process groups is essentially bonus functionality
over the "start daemon in own process group" and "start user
in own process group" functionalities.

Whether and how we want to implement this is open for discussion.

Personally I suspect that a kernel side rule-based engine with
user loadable rules may not be the best idea :)

-- 
All Rights Reversed
--

From: Ulrich Drepper
Date: Thursday, July 10, 2008 - 10:39 am



That's going to be ugly because the exec functions are signal-safe.
I.e., they can happen at any time.  This would mean that one always has
to set the migration policy before every exec call and that there must
be a way to retrieve the currently selected policy so that it can
potentially be restored.  This policy must be a thread property, not a
process property.

Sticky information like this is IMO always hairy at best.  We had the
same discussion at the time of the sys_indirect discussion.  This new
syscall proposal was the result of sticky information not being suitable
and it could very well be used for the exec syscalls, too.

Again, this is all about failing exec calls of which there can be
arbitrarily many.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--

From: Vivek Goyal
Date: Thursday, July 10, 2008 - 11:41 am

Sorry, I did not understand exactly what's the problem with signal
safe exec function. Before exec, we should be able to determine the
migration policy related to process/thread (either by reading file or
something else etc). Set the policy through cgroup file system. If exec
fails for some reason, we just need to go back to cgroup file system to
undo the effect of setting migration policy previously set for that thread.
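
The set-then-undo sequence can be sketched as below, assuming the "policy"
is simply the move itself, done by writing the pid to the destination's
"tasks" file. The function names are hypothetical. On success the call
never returns, since the process image is replaced.

```python
import os


def write_pid(cgroup_dir, pid):
    """Move a task by writing its pid to the cgroup's tasks file."""
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(str(pid))


def exec_in_cgroup(old_cgroup, new_cgroup, argv):
    """Move into new_cgroup, then exec; move back if the exec fails.

    Does not return on success. On failure the original exception is
    re-raised after the task has been written back to old_cgroup.
    """
    pid = os.getpid()
    write_pid(new_cgroup, pid)
    try:
        os.execvp(argv[0], argv)
    except OSError:
        write_pid(old_cgroup, pid)   # undo the premature move
        raise
```

Between the two writes the caller is charged to the new group, and two
threads doing this at once would race on the saved state. That costs two
filesystem operations per attempt on top of the exec itself.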

Thanks
Vivek
--

From: Ulrich Drepper
Date: Thursday, July 10, 2008 - 3:29 pm



That's what I said.  It would be necessary to get the old state and
reset it if necessary.

As for the interface: I hope nobody honestly thinks that it is doable to
perform a whole bunch of filesystem operations for every exec.

And more: reading a rule file, interpreting the rules to find the best
match, etc is also too expensive.  Every process would have to read the
rule file again.  If this is non-trivial or the rule file is large, the
cost of an exec could easily be overshadowed by the cost of this
preparation.  Unlike the kernel, the userlevel runtime cannot in general
amortize the cost over several exec calls.  Handling all this in the
kernel wouldn't have any of these problems.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--

From: KAMEZAWA Hiroyuki
Date: Thursday, July 10, 2008 - 5:55 pm

On Thu, 10 Jul 2008 11:40:35 -0400

Hmm, as I wrote, the rule that a child inherits its parent's cgroup is a very
strong rule. (Most cases can be handled by this.) So, what I think of is

1. support a new command (in libcg.)
  - /bin/change_group_exec ..... reads /etc/cgroup/config, moves to the
                                 cgroup and calls exec.
2. and libc function
  - if necessary.

1. is enough because admin/user can write a wrapper script for their
applications if "child inherits parent's" isn't suitable.

no ?

Thanks,
-Kame

--

From: Vivek Goyal
Date: Monday, July 14, 2008 - 6:57 am

If admin has decided to group applications and has written the rules for
it then applications should not know anything about grouping. So I think
an application writing a script for being placed into the right group should
be out of the question. Now how does an admin write a wrapper around an
existing application without breaking anything else?

One thing could be creating soft links where an admin-created alias points
to the wrapper and the wrapper in turn invokes the executable. But this will
not solve the problem if some user decides to invoke the application
executable directly and not use the admin-created alias.

Did you have something else in mind when it came to creating wrappers
around applications?

Thanks
Vivek
--

From: David Collier-Brown
Date: Monday, July 14, 2008 - 7:44 am

In the Solaris world, processes are placed into cgroups (projects) by
one of two mechanisms:

1) inheritance, with everything I create in my existing project.
   To get this started, there is a mechanism under login/getty/whatever 
   reading the /etc/projects file and, for example, tossing user davecb 
   into a "user.davecb" project.

2) explicit placement with newtask, which starts a program or moves
   a process into a project/cgroup

I have a "bg" project which I use for limiting resource consumption of
background jobs, and a background command which either starts or moves
jobs, thusly:

 case "$1" in
 [0-9]*) # It's a pid
         newtask -p bg -c "$1"
         ;;
 *)      # It's a command-line
         newtask -p bg "$@" &
         ;;
 esac

A rules engine would be more useful for managing workloads once
they're assigned, as IBM does on the mainframe with WLM and goal-directed
resource management. (They're brilliant in this area, by the way, so
I'd be inclined to steal ideas from them  (;-))

--dave
-- 
David Collier-Brown            | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
davecb@sun.com                 |                      -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
--

From: Vivek Goyal
Date: Monday, July 14, 2008 - 8:21 am

Placing the login sessions in right cgroup based on uid/gid rules is  
probably easy as check needs to be placed only on system entry upon login
(Pam plugin should do).  And after that any job started by the user


Ok, this is moving of tasks from one cgroup to other based on pid. This
is really easy to do through cgroup file system. Just a matter of writing

So here a user explicitly invokes the wrapper, passing it the targeted
cgroup and the application to be launched in that cgroup. This should work
if there is a facility for a user to create his own cgroups (let's say
under a user controlled cgroup dir in the hierarchy) and the user explicitly
wants to control the resources of applications under his dir. For example,

 		/mnt/cgroup
		|	|
		gid1	gid2
		|  |	|  |
	      uid1 uid2	uid3 uid4
	     |  |
	 proj1  proj2

Here probably admin can write the rules for how users are allocated the
resources and give ability to users to create subdirs under their cgroups
where users can create more cgroups and can do their own resource
management based on application tasks and place applications in the right
cgroup by writing wrappers as mentioned by you "newtask".

But here there is no discrimination of application type by admin. Admin
controls resource divisions only based on uid/gid. And users can manage
applications within their user groups. In fact I am having a hard time
thinking of scenarios in which there is a need for an admin to control
resources based on application type. Do we really need setups like, on
a system databases should get network bandwidth of 30%? If yes, then
it becomes tricky, as the admin needs to write a wrapper to place the task
in the right cgroup without the application/user knowing it.

Thanks
Vivek
--

From: Kazunaga Ikeno
Date: Thursday, July 17, 2008 - 12:05 am

I think a wrapper (which moves to the right group and calls exec) will be run by the user, not by the admin.
With explicit placement, the user knows what type of application he/she is launching.

 		/mnt/cgroup
		|	|
		gid1	gid2
		|  |	|  |
	      uid1 uid2	uid3 uid4
	     |  |
	 proj1  proj2

[uid1/gid1]% newtask.sh proj1app
... proj1app run under /mnt/cgroup/gid1/uid1

[uid1/gid1]% newtask.sh --type proj1type proj1app
... proj1app run under /mnt/cgroup/gid1/uid1/proj1
 
In this case, admin sets up limitation of proj1type.
Also I guess proj1type has ownership (ex: proj1type allows gid1).
Isn't this enough?
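
To make the two newtask.sh invocations above concrete, here is a sketch of
how such a wrapper might resolve its destination. The function name, the
types table and its layout (type name mapped to a subgroup and the gids
allowed to use it) are assumptions made for illustration; newtask.sh itself
is not shown in this thread.

```python
import posixpath


def destination(mount, gid, uid, types, app_type=None):
    """Resolve the cgroup directory a wrapper should enter.

    `types` is a hypothetical admin-provided table:
        {type_name: (subgroup, set_of_allowed_gids)}
    Without a type, the task stays in the user's own group.
    """
    base = posixpath.join(mount, gid, uid)
    if app_type is None:
        return base
    subgroup, allowed_gids = types[app_type]
    if gid not in allowed_gids:
        raise PermissionError("%s may not use type %s" % (gid, app_type))
    return posixpath.join(base, subgroup)
```

With types = {"proj1type": ("proj1", {"gid1"})}, a call for gid1/uid1
without a type resolves to /mnt/cgroup/gid1/uid1, and the same call with
app_type="proj1type" resolves to /mnt/cgroup/gid1/uid1/proj1.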

Thanks,
Kazunaga Ikeno

--

From: Vivek Goyal
Date: Thursday, July 17, 2008 - 6:47 am

This is the easy to handle situation and I am hoping it will work in many
of the cases.

Currently I am writing a patch for libcg which allows querying the
destination cgroup based on uid/gid, and libcg will also migrate the
application there. I am also writing a PAM plugin which will move
all the login sessions to their respective cgroups (as specified by the
rule file). I will also modify "init" so that all the system services go
into the cgroup belonging to root.

Once user is logged in and running into his resource group, he can manage
further subgroups at his own based on his application requirements (as you

Yes, so if a user does not specifically launch an application targetted
for a particular cgroup, then it will run into default group for that

IOW, probably a user can say.


I think admin should set up the limits only up to /mnt/cgroup/gid1/uid1.
After that, how resources allocated to uid1 are subdivided between various
user applications should be controlled by the user. So resources under

I think to begin with and to get some kind of simple functionality
going it might be good. I am sure others will target for more complex
configurations and usages.

Thanks
Vivek
--

From: Andrea Righi
Date: Sunday, August 17, 2008 - 3:33 am

The problem of placing tasks in respective cgroups seems to be correctly
addressed by userspace lib wrappers or classifier daemons [1].

However, this is an attempt to implement an in-kernel classifier.

[ I wrote this patch for a "special purpose" environment, where a lot of
short-lived processes belonging to different users are spawned by
different daemons, so the main goal here would be to remove the delay
needed by userspace classification and place the tasks in the right
cgroup at the time they're created. This is just an ugly hack for now
and it works only for uid-based rules; gid-based rules could be
implemented in a similar way. ]

UID:cgroup associations are stored in a RCU-protected hash list.

The kernel<->userspace interface works as follows:
 - the file "uids" is added in the cgroup filesystem
 - a UID can be placed only in a single cgroup
 - a cgroup can have multiple UIDs
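
The two constraints above (one cgroup per UID, many UIDs per cgroup) can be
modelled in user space as a plain mapping. This is only a model of the
stated invariants, not the patch's RCU-protected hash list, and it assumes
that assigning an already-mapped UID moves it to the new cgroup; the patch
might reject the write instead.

```python
class UidMap:
    """User-space model: each UID maps to at most one cgroup."""

    def __init__(self):
        self._cgroup_of = {}

    def assign(self, uid, cgroup):
        # Overwriting the entry drops the old association (assumed).
        self._cgroup_of[uid] = cgroup

    def uid_to_cgroup(self, uid):
        """Return the cgroup for uid, or None if there is no rule."""
        return self._cgroup_of.get(uid)

    def uids(self, cgroup):
        """All UIDs currently associated with the cgroup, sorted."""
        return sorted(u for u, c in self._cgroup_of.items() if c == cgroup)
```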

Compared with the userspace solution (e.g. classifier daemon), this solution
has the advantage of removing the delay in task classification: each task
always runs in the appropriate cgroup from the time it is created
(fork, exec) or when its uid changes (setuid).

OTOH the disadvantage is to introduce the complexity in the kernel.

[1] http://lkml.org/lkml/2008/7/1/391

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 include/linux/cgroup.h |    9 +++
 kernel/cgroup.c        |  141 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sys.c           |    6 ++-
 3 files changed, 154 insertions(+), 2 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 30934e4..243819a 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -393,6 +393,7 @@ struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
+struct cgroup *uid_to_cgroup(uid_t uid);
 
 ...
From: Vivek Goyal
Date: Monday, August 18, 2008 - 5:35 am

Hi  Andrea,

Recently I introduced the infrastructure in libcgroup to handle
the task placement issue based on uid and gid rules. This is what I did.

- Introduced two new APIs in libcgroup to place the task in right cgroup.
	- cgroup_change_cgroup_uid_gid
		Places the task in destination cgroup based on uid/gid
		rules specified in /etc/cgrules.conf
	- cgroup_change_cgroup_path
		Puts the task into the cgroup specified by caller

- Provided two command line tools (cgexec and cgclassify) to perform
  various process placement related tasks.
	- cgexec
		A tool to launch a task in user specified cgroup
	- cgclassify
		A tool to re-classify already running tasks.

- Wrote a pam plugin so that tasks are placed in right user groups upon
  login or reception of other services which take pam's help.

- Currently work is in progress for a user space daemon which will 
  automatically place the tasks based on notifications.

For your environment, where delay is unbearable, I think you can modify
the daemon to use libcgroup to place the forked task in right cgroup
before actually executing it. Once the task has been placed in right
cgroup, exec() will be called.

We have been doing all the user space development on following mailing
list.

https://lists.sourceforge.net/lists/listinfo/libcg-devel

Latest patches which got merged in libcgroup, are here.

http://sourceforge.net/mailarchive/forum.php?thread_name=20080813171720.108005557%40re...

It is accompanied with a decent README file for design details and for
how to use it.

I think modifying the daemon to make use of libcgroup is a better way
to handle this issue than duplicating the infrastructure in user space
as well as kernel space.

Thanks
Vivek
--

From: righi.andrea
Date: Tuesday, August 19, 2008 - 7:35 am

yep! I'm having some troubles with my internet connection, and it seems
my previous reply is lost.. :( resending it, sorry for the noise if

That's interesting. All the daemons that provide access to a system
should be PAM-aware, so with the PAM plugin I should be able to handle all
the cases. Unfortunately I don't have many details about those

The daemons should all use the exec() + setuid() way. If PAM doesn't
help I'll try to wrap setuid(), using a wrapper lib or something


Totally agree in perspective (obviously when it's possible/reasonable in
terms of efforts to change the userspace daemon).

Thanks,
-Andrea
--

From: Paul Menage
Date: Monday, August 18, 2008 - 2:05 pm

What kinds of daemons are these? Is it not possible to add some
libcgroup calls to these daemons?

I'm reluctant to add features like this to the kernel side of cgroups
due to their "magical" nature - any task that does a setuid() now
risks being swept off into a different cgroup.

Having the cgroup attachment done explicitly e.g. by a PAM library at
login time is much less likely to cause unexpected behaviour.

Maybe if we had a way to control which tasks the magical setuid
switching occurs for, it might be more acceptable. (Perhaps base it on
the cgroup of the task that's doing the setuid as well?)

Other thoughts:

- what about other uids (euid, fsuid)?

- what about multiple hierarchies?

- if the attach fails, userspace gets no notification.

Paul
--

From: Vivek Goyal
Date: Tuesday, August 19, 2008 - 5:57 am

Hi Paul,

Same thing will happen if we implement the daemon in user space. A task
that does seteuid() can be swept away to a different cgroup based on
rules specified in /etc/cgrules.conf.

What do you mean by risk? This is the policy set up by the system admin, and
the behaviour would be consistent with the policy. If an admin decides
that tasks of user "apache" should run in the /container/cpu/apache cgroup
and a "root" task does seteuid(apache), then it makes sense to move the task
to /container/cpu/apache.

Exactly what kind of scenario do you have in mind when you want the policy
to be enforced selectively based on task (tid)?

Thanks
Vivek
--

From: Paul Menage
Date: Monday, August 25, 2008 - 5:54 pm

Yes, I'm not so keen on a daemon magically pulling things into a
cgroup based on uid either, for the same reasons.

But a user-space based solution can be much more flexible (e.g. easier

The kind of unexpected behaviour I was imagining was when some other
daemon (e.g. ftpd?) unexpectedly does a setuid to one of the
magically-controlled users, and results in that daemon being pulled
into the specified cgroup. For something like cpu maybe that's mostly
benign (but what moves it back into its original group after it
switches back to root?) but for other subsystems it could be more

I was thinking of something like possibly a per-cgroup file (that also
affected child cgroups) rather than a global file. That would also
automatically handle multiple hierarchies.

Paul
--

From: Vivek Goyal
Date: Tuesday, August 26, 2008 - 6:41 am

Once ftpd does seteuid() or setreuid() again to switch effective user to
"root", it will be moved back to original group (root's group).

So the basic question is: if a program changes its effective user id
temporarily to user B, then should all the resource consumption take place
from the resources of user B, or should it continue to take place from the
original cgroup?

I would think that we should move the task temporarily to B's cgroup and
bring back again upon identity change.

At the same time I can also understand that this behavior can probably
be considered over-intrusive and some people might want to avoid that.

Two things come to my mind.

- Users who find it too intrusive, can just shut down the rules based
  daemon.

- Or, we can implement selective movement of tasks by daemon as suggested by
  you. This will make system more complex but provides more flexibility
  in the sense users can keep daemon running at the same time control

So there can be two kinds of controls.

- Create a per cgroup file say "group_pinned", where if 1 is written to
  "group_pinned" that means daemon will not move tasks from this cgroup upon
  effective uid/gid changes.

- Provide more fine grained control where task movement is not controlled
  per cgroup, rather per thread id. In that case every cgroup will contain
  another file "tasks_pinned" which will contain all the tids which cannot
  be moved from this cgroup by daemon. By default this file will be empty
  and all the tids are movable.

I think initially we can keep things simple and implement "group_pinned" 
which provides coarse control on the whole group and pins all the tasks
in that cgroup.
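
To make the coarse "group_pinned" idea concrete, here is a minimal sketch of the check the daemon might do before moving a task. No such file exists in the kernel today; the file name is the one proposed above, and the helper names and the rule that a missing file means "not pinned" are assumptions made only for illustration.

```python
import os

def is_group_pinned(cgroup_dir):
    """Return True if the hypothetical 'group_pinned' file contains 1."""
    path = os.path.join(cgroup_dir, "group_pinned")
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except FileNotFoundError:
        # Assumption: no file means the group is not pinned.
        return False

def should_move(current_cgroup_dir, target_cgroup_dir):
    """Decide whether the daemon may move a task after a uid/gid change."""
    same = os.path.realpath(current_cgroup_dir) == os.path.realpath(target_cgroup_dir)
    if same:
        return False  # already where the rules want it
    # A pinned source group keeps all of its tasks.
    return not is_group_pinned(current_cgroup_dir)
```

The per-tid "tasks_pinned" variant would replace the single flag with a set of tids read from that file, at the cost of one more file to keep consistent.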

Thoughts?

Thanks
Vivek
--

From: Balbir Singh
Date: Tuesday, August 26, 2008 - 7:35 am

Yes, I would say administrators should do that. Classification via setuid()
does make a lot of sense, but at the same time it might be too aggressive if an

Applications that really care about moving should use cgroup_attach_task* and
move back otherwise with cgrules parsing turned off.

I see control as a two level hierarchy, automatic and controlled, how do we make

Hmm... I wonder if we are providing too many knobs. Can't we do something simpler?

-- 
	Balbir
--

From: David Collier-Brown
Date: Tuesday, August 26, 2008 - 8:04 am

Solaris doesn't try to change cgroup ("project") on a setuid call, assuming
the program is in the proper cgroup initially.  For most cases this is
trivially true under the very simple default rules, and for the rest one
can create a rule or a startup script that sets it with "newtask".

The Sun default is
	$ cat /etc/project
	system:0::::
	user.root:1::::
	noproject:2::::
	default:3::::
	group.staff:10::::

Which means that root users are distinguished from users in
the staff group, and they are distinguished from daemons
and everyone else.

Personally, I add
	user.davecb:101::davecb::
	bg:100:Background jobs:davecb::
which puts me in a separate cgroup, and provides another one
for me to put background tasks into.  The latter allows
me to keep them from reducing the interactive performance of
my laptop. 

  In practice, this looks like:

$ prstat -J
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   695 davecb     52M   38M sleep    1    0   0:01:41 2.4% Xsun/1
  1025 davecb    150M   88M sleep   59    0   0:04:25 1.9% mozilla-bin/5
   926 davecb     73M   16M sleep   33    0   0:00:11 1.3% gnome-terminal/2
  1067 davecb   6232K 5224K cpu0    54    0   0:00:00 0.3% prstat/1
   918 davecb     66M   15M sleep   59    0   0:00:15 0.2% metacity/1
   956 davecb     67M   13M sleep   59    0   0:00:04 0.1% gnome-netstatus/1
   958 davecb     66M   12M sleep   59    0   0:00:02 0.1% mixer_applet2/1
   931 root     2112K 1240K sleep   59    0   0:00:01 0.0% rpc.rstatd/1
   954 davecb     68M   15M sleep   57    0   0:00:06 0.0% wnck-applet/1
   920 davecb     71M   17M sleep   59    0   0:00:04 0.0% gnome-panel/1
   943 davecb   1408K 1136K sleep   57    0   0:00:00 0.0% ksh/1
   871 davecb   3984K 2656K sleep   59    0   0:00:01 0.0% xscreensaver/1
   916 davecb     10M 4936K sleep   59    0   0:00:01 0.0% gnome-smproxy/1
   924 davecb     67M   13M sleep   59    0   0:00:01 0.0% gnome-perfmeter/1
   116 root     4352K 1168K sleep   59    0   0:00:00 ...
--

From: Vivek Goyal
Date: Tuesday, August 26, 2008 - 9:00 am

Who executes default rules? IOW, how do you make sure tasks of user.davecb

Now Linux will also allow the admin to specify simple rules in
/etc/cgrules.conf. The rules are based on the premise that users/groups
own resources in a particular cgroup, so one can specify which cgroup
a task should run in. For example:

#john          cpu              usergroup/faculty/john/
#@student      cpu,memory       usergroup/student/
#@root          *               admingroup/
#*              *               default/

Each rule simply says which user's/group's tasks should run in which cgroup
for which controller. (There are some wildcards also.) For details, you can
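
A rough sketch of how such rules could be matched against a task's effective user and groups. The three-column layout and the "@group" and "*" notation come from the example above; the function names and the first-match-wins ordering are assumptions of this sketch, not a description of the actual libcg parser.

```python
RULES_TEXT = """\
john          cpu              usergroup/faculty/john/
@student      cpu,memory       usergroup/student/
@root         *                admingroup/
*             *                default/
"""

def parse_rules(text):
    """Parse '<user|@group|*> <controllers> <destination>' lines."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        who, controllers, dest = line.split()
        rules.append((who, controllers.split(","), dest))
    return rules

def classify(rules, user, groups):
    """Return (controllers, destination) of the first matching rule."""
    for who, controllers, dest in rules:
        if who == "*" or who == user:
            return controllers, dest
        if who.startswith("@") and who[1:] in groups:
            return controllers, dest
    return None  # no rule matched: leave the task where it is
```

With the rules above, classify(parse_rules(RULES_TEXT), "john", {"faculty"}) selects the cpu controller and usergroup/faculty/john/, while a user with no specific rule falls through to default/.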

So by default all the tasks of user.davecb will run in project 101 until
user davecb decides to launch some background jobs in project 100 using
newtask?

"newtask" like functionality is being provided by a new command line tool
"cgexec" which will allow launching of a new task in specific cgroup
(project).
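
As an illustration only, the core of a "launch this command in a given cgroup" tool can be sketched as follows: the child writes its own pid to the target group's "tasks" file before it execs, so the new program never runs in the wrong group. This is not the cgexec implementation; the function names are made up for this sketch and the path of the tasks file is whatever the caller passes in.

```python
import os
import subprocess

def attach_self(tasks_file):
    """Move the calling task by writing its pid to a cgroup 'tasks' file."""
    with open(tasks_file, "w") as f:
        f.write(str(os.getpid()))

def run_in_cgroup(tasks_file, argv):
    """Run argv so that it is already in the target cgroup when it execs."""
    # preexec_fn runs in the child after fork() and before exec().
    proc = subprocess.Popen(argv, preexec_fn=lambda: attach_self(tasks_file))
    return proc.wait()
```

Because the move happens before exec, there is no window in which the new program is charged to the launching group.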

Thanks
Vivek
--

From: David Collier-Brown
Date: Tuesday, August 26, 2008 - 9:32 am

A classifier at login/connect starts each new process off in the correct group.

That's right, and the cgexec-like "newtask" is what I use
to script things: for example, my background script says

        case "$1" in
        [0-9]*) # It's a pid
                newtask -p bg -c "$1"
                ;;
        *) # It's a command-line
                newtask -p bg "$@" &
                ;;
        esac

There's also an -F option to put a process into a cgroup
and never let it newtask itself or its children to another one,
so that software from Dr Evil, Inc. can't do privilege 
escalation (;-))

--dave
-- 
David Collier-Brown            | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
davecb@sun.com                 |                      -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
--

From: Vivek Goyal
Date: Tuesday, August 26, 2008 - 9:08 am

Just a minor clarification. Right now all the classification is being done
based on effective uid and effective gid.


I also fear that we are probably providing too many knobs. Until we get
a strong use case, to keep things simple I recommend that for the time
being we stick to a simple user space daemon, which the user can turn on
or off based on their needs (whether the user wants a cgroup change upon
seteuid()-related events). No controls based on group_pinned or tasks_pinned
etc. It is all or none.

Thanks
Vivek
--

From: Paul Menage
Date: Thursday, September 4, 2008 - 11:25 am

I don't think you'd necessarily need a cgroup file for that - it could
be part of the daemon configuration.

Paul
--

From: righi.andrea
Date: Tuesday, August 19, 2008 - 8:12 am

unfortunately I don't have too many details for now, so I was just
looking for the most generic solution. The PAM lib approach seems
reasonable for each daemon that represents an entry point to the
system, and, to a large degree, I like the userspace solution (e.g.
the libcgroup as reported by Vivek). It seems to be the right way to


If the admin configures so, moving tasks that do setuid() in different

do you mean create a cgroup subsystem to handle different per-cgroup

good points.

For the last one we could just return an error code from cgroup_fork()
and goto bad_fork_cleanup_cgroup (in this way the fork/exec would

Thanks,
-Andrea
--

From: Paul Menage
Date: Monday, August 25, 2008 - 5:55 pm

Is the sysadmin aware of all the places in all system daemons that do
setuid() calls?

Paul
--

From: kamezawa.hiroyu
Date: Monday, July 14, 2008 - 8:07 am

I have no strong idea around this but now it seems

 - handling complicated rules in the kernel will get a fair number of NACKs.
   (and it seems to add some latency.)
 - We cannot avoid the problem discussed here if we handle the rules in a
   userland daemon/process-event-connector.

So, I wonder whether adding some limitations may make things simpler.

  - application under wrapper will be executed under a group defined by admin.
  - application without wrapper will be executed under a group where exec()
    called.

A point is that application-without-wrapper is also under Admin's control
because it's executed under a group which calls exec.

But this is not strict control..this is very loose ;)

Thanks,
-Kame

--

From: Paul Menage
Date: Thursday, July 10, 2008 - 2:07 am

Hi Vivek,


One way that you could avoid the unreliability would be to not use
netlink, but instead use cgroups itself.

What we're looking for is a way to easily distinguish between
processes that are in the right cgroups, and processes that might be
in the wrong cgroups. Additionally, we want the children of such
processes to inherit the same status until we've dealt with them, and
not be able to change their status themselves.

That sounds a bit like a cgroup. How about the following?

- create a cgroup subsystem called "setuid".

- have a uid_changed() hook called by sys_setuid() and friends; this
hook would simply attach current to the root cgroup in the "setuid"
hierarchy if it wasn't already in that cgroup (which can be determined
with a couple of dereferences from current and no locking, so not
slowing down the normal case).

- userspace uses this by:

mount the setuid hierarchy, e.g. at /mnt/setuid
create a child cgroup /mnt/setuid/processed
while true:
  wait for /mnt/setuid/tasks to be non-empty
  read a pid from /mnt/setuid/tasks
  move that pid to the appropriate cgroups in memory/cpu/etc
    hierarchies if necessary
  move that pid to /mnt/setuid/processed/tasks

i.e. any pid in the root cgroup of the setuid hierarchy is one that
needs attention and may need to be moved to different cgroups
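
The loop above could look roughly like this in a classifier daemon. It is a sketch under stated assumptions: the "setuid" subsystem is only a proposal, the classify callback (returning the destination "tasks" files for a pid) is hypothetical, and the sleep-based polling stands in for the wait primitive that does not exist yet.

```python
import os
import time

def read_pids(tasks_path):
    """Read the pids currently listed in a cgroup 'tasks' file."""
    with open(tasks_path) as f:
        return [int(tok) for tok in f.read().split()]

def drain_once(setuid_root, classify):
    """Handle every pid currently sitting in the root 'setuid' cgroup."""
    root_tasks = os.path.join(setuid_root, "tasks")
    processed = os.path.join(setuid_root, "processed", "tasks")
    handled = []
    for pid in read_pids(root_tasks):
        try:
            # Move the pid into the right memory/cpu/etc. cgroups first.
            for dest_tasks in classify(pid):
                with open(dest_tasks, "w") as f:
                    f.write(str(pid))
            # Then mark it as dealt with.
            with open(processed, "w") as f:
                f.write(str(pid))
        except OSError:
            continue  # the task may have exited in the meantime
        handled.append(pid)
    return handled

def run(setuid_root, classify, interval=0.1):
    """Poll forever; sleep only when there was nothing to do."""
    while True:
        if not drain_once(setuid_root, classify):
            time.sleep(interval)
```

Nothing is lost if the daemon falls behind, because unprocessed pids simply stay in the root cgroup until the next pass.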

A couple of enhancements to make this more usable might include:

- adding an API (via a new syscall or an eventfd?) to wait for a
cgroup to be non-empty, to avoid having to poll /mnt/setuid/tasks more
than necessary

- allow the user to designate certain processes and their children as
uninteresting so that their setuid calls don't trigger them being
moved back to the root (perhaps indicated via membership of an
"ignored" cgroup in the setuid hierarchy?)

This should be more reliable than netlink since it doesn't involve
userspace having to keep up with a stream of events - we're not
queuing up events, we're just shifting process group memberships.

Similar ...
--

From: Vivek Goyal
Date: Thursday, July 10, 2008 - 7:06 am

This looks interesting. So the above method should solve at least the
reliability issue of event transport to user space. I have a few thoughts.

- Hopefully the number of hierarchies will not explode, as we will be
  mounting one hierarchy per event type (uid change, gid change,
  exec, maybe fork etc.).

- IIUC, it does not solve the concern of delay. So after setuid or exec,
  the task continues to run in its existing cgroup until the user space daemon
  processes the event and moves the task into the right cgroup. More on this
  in reply to your other mail.

Thanks
Vivek
--

From: Paul Menage
Date: Thursday, July 10, 2008 - 9:41 am

In what circumstances would you want to reclassify processes to a
different cgroup on a fork?

Paul
--

From: Vivek Goyal
Date: Thursday, July 10, 2008 - 10:19 am

I don't know. Balbir had mentioned in one of the mails in this thread
regarding getting notification on fork.

Thanks
Vivek
--

From: Dhaval Giani
Date: Thursday, July 10, 2008 - 10:27 am

fork or exec? I believe reclassifications would happen only on exec.

-- 
regards,
Dhaval
--

From: Vivek Goyal
Date: Thursday, July 10, 2008 - 7:33 am

We also need to do something to track all the children forked after
the setuid, setgid or exec until the original parent's event has been
classified; the children need to get the same treatment.

Thanks
Vivek
--

From: Paul Menage
Date: Thursday, July 10, 2008 - 9:46 am

You'd get that automatically, since children of the task moved to the
root cgroup (indicating "needs attention") would also end up in that
cgroup since cgroups are inherited across fork.

Paul
--

From: Dhaval Giani
Date: Thursday, July 10, 2008 - 10:18 am

I am sorry, I seem to be missing something, but who moves the forked
children (which got forked during the time between the parent getting
classified into the right group and the fork itself) into the correct
group?

-- 
regards,
Dhaval
--

From: Paul Menage
Date: Thursday, July 10, 2008 - 10:30 am

On Thu, Jul 10, 2008 at 10:18 AM, Dhaval Giani

The classifier daemon would have to do that - my point was that it
would be very clear exactly which processes needed this attention,
since they'd end up in the root cgroup too.

Paul
--

From: Dhaval Giani
Date: Thursday, July 10, 2008 - 10:44 am

It still would not solve the problem of the correct group getting
charged. Say for something like cpu, it would get far more cpu time than
it should get. It's a step in the right direction, but I am
not sure if it is the solution. I was thinking of having a sandbox
cgroup at each level, but then I am not very sure of this "correct
cgroup getting charged" problem.

-- 
regards,
Dhaval
--

From: Dhaval Giani
Date: Thursday, July 10, 2008 - 8:49 am

Where I see complications is handling forks happening in that time. It
will take us a long time to ensure that a fork bomb goes into the
correct cgroup as an example.

Also another issue: where does the pid reside in the memory/cpu hierarchy?
If it is not in the correct cgroup at the time of exec, or soon after
exec, the wrong cgroup is getting charged.

I liked the other idea you posted about in the other mail, having
wrappers around. I believe that can be done at distro level, which
should not really be too tough.

Or maybe we can use something like selinux (ok, this really is a shot in
the dark, I should read up before opening my mouth here.)

Thanks,
-- 
regards,
Dhaval
--

From: KAMEZAWA Hiroyuki
Date: Friday, July 18, 2008 - 2:52 am

On Tue, 1 Jul 2008 15:11:26 -0400

A different topic.
 
Recently I'm interested in "How to write userland daemon program
to control group subsystem." To implement that effectively, we need
some notifier between user <-> kernel.

Can we use "inotify" to catch changes in cgroup (by daemon program) ?

For example, create a new file under memory cgroup
==
  /opt/memory_cgroup/group_A/notify_at_memory_reach_limit
==
And a user watches the file by inotify.
The kernel modifies the modified-time of the notify_at_memory_reach_limit file
and calls fs/notify_user.c::notify_change() against this inode. He can catch
the event by inotify.
(I think he can also catch removal of this file, etc...)
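
For illustration, the userland side of such a watch might look like the sketch below. Python has no inotify module in its standard library, so this calls the Linux libc functions inotify_init() and inotify_add_watch() through ctypes. The helper names are invented for this sketch, and whether a cgroup control file would actually generate these events is exactly the open question here.

```python
import ctypes
import os
import struct

# Event bits from <sys/inotify.h>.
IN_MODIFY = 0x00000002
IN_ATTRIB = 0x00000004
IN_DELETE_SELF = 0x00000400

# Linux only: resolve libc symbols from the running process.
_libc = ctypes.CDLL(None, use_errno=True)

def watch(path, mask=IN_MODIFY | IN_ATTRIB | IN_DELETE_SELF):
    """Return an inotify fd that is watching a single file."""
    fd = _libc.inotify_init()
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    wd = _libc.inotify_add_watch(fd, os.fsencode(path), mask)
    if wd < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, os.strerror(err))
    return fd

def read_event_masks(fd):
    """Block until events arrive and return their mask values."""
    buf = os.read(fd, 4096)
    masks = []
    offset = 0
    while offset < len(buf):
        # struct inotify_event: int wd; u32 mask; u32 cookie; u32 len; name[]
        _wd, mask, _cookie, name_len = struct.unpack_from("iIII", buf, offset)
        masks.append(mask)
        offset += 16 + name_len
    return masks
```

A daemon would call watch() on the notification file once and then block in read_event_masks(), or hand the fd to select/epoll together with its other descriptors.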

Is there some difficulty or problem ? (I'm sorry if we can do this now.)

Thanks,

--

From: Paul Menage
Date: Friday, July 18, 2008 - 8:46 am

On Fri, Jul 18, 2008 at 2:52 AM, KAMEZAWA Hiroyuki

We've been doing something like this to handle OOMs in userspace, with
pretty good success. The approach that we used so far was a custom
control file tied to a wait queue, that gets woken when a cgroup
triggers OOM, but it's a bit hacky. I've been considering some kind of
more generic approach that could be reused by different subsystems for
other notifications, maybe using eventfd or maybe netlink.

inotify would be an option too, but that seems like it might be
forcing ourselves into filesystem semantics too much.

Paul
--

From: kamezawa.hiroyu
Date: Friday, July 18, 2008 - 4:05 pm

Hmm, eventfd is AIO's one ?
At quick glance, Inotify's good points are

- can be used for any file. For example, even changes in the "tasks" file can
  be caught if they update the modified-time.
- It can be queued.
- It supports ONESHOT, NONBLOCK, etc...
- All memory allocation is done by the waiter (the user).

But yes, we cannot notify other events than "there is some change".

Thanks,

--

From: Balbir Singh
Date: Friday, July 18, 2008 - 9:39 am

Won't the time latency be an issue (time between exceeding the limit and the
user space being notified?). Since the notification does not use user memory at
the moment (it will not stress the limits further :)), provided the notification
handler is not running under the group that has exceeded its limit. Do we expect
the user space application to ACK that it's seen the notification? We could use
a netlink channel as well (in the case that we need two way communication).




-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Vivek Goyal
Date: Friday, July 18, 2008 - 11:55 am

Does not look like it will be an issue. Of course, the faster the notification
the better, but there will be some latency. So if we get notified on
memory.failcnt then we will probably try to increase the memory limit, and
even if it takes some time it should be fine. Anyway, there is no way to avoid
latency and hopefully we are not looking at real time notifications and

Can't think of a reason why user space needs to send an ACK to the kernel
after seeing the event. If we are not using netlink and resorting to
inotify coupled with epoll then we should not lose any events and the kernel
need not be ACKed back.

Given the fact that netlink can drop packets, I am not sure how good an
option netlink is for cgroup notifications. Is it too hard to stick to
filesystem semantics for notifications? 

Thanks
Vivek
--

From: kamezawa.hiroyu
Date: Friday, July 18, 2008 - 4:10 pm

Maybe we need some technique "How to run a daemon in proper way."
(use special daemon cgroup etc...)
I don't think user space has to ACK to the kernel. User space
can modify a control file when it gets events, but that's all it can do, anyway.

Thanks,
-Kame 

--
