Implement a cgroup using the LSM interface to enforce mknod and open on device files. This implements a simple device access whitelist. A whitelist entry has 4 fields. 'type' is a (all), c (char), or b (block). 'all' means it applies to all types, all major numbers, and all minor numbers. Major and minor are obvious. Access is a composition of r (read), w (write), and m (mknod). The root devcgroup starts with rwm to 'all'. A child devcg gets a copy of the parent. Admins can then add and remove devices to the whitelist. Once CAP_HOST_ADMIN is introduced it will be needed to add entries as well or remove entries from another cgroup, though just CAP_SYS_ADMIN will suffice to remove entries for your own group. An entry is added by doing "echo <type> <maj> <min> <access>" > devcg.allow, for instance: echo b 7 0 mrw > /cgroups/1/devcg.allow An entry is removed by doing likewise into devcg.deny. Since this is a pure whitelist, not acls, you can only remove entries which exist in the whitelist. You must explicitly echo a 0 0 mrw > /cgroups/1/devcg.deny to remove the "allow all" entry which is automatically inherited from the root cgroup. While composing this with the ns_cgroup may seem logical, it is not the right thing to do, because updates to /cg/cg1/devcg.deny are not reflected in /cg/cg1/cg2/devcg.allow. A task may only be moved to another devcgroup if it is moving to a direct descendent of its current devcgroup. CAP_NS_OVERRIDE is defined as the capability needed to cross namespaces. A task needs both CAP_NS_OVERRIDE and CAP_SYS_ADMIN to create a new devcgroup, update a devcgroup's access, or move a task to a new devcgroup. CONFIG_COMMONCAP is defined whenever security/commoncap.c should be compiled, so that the decision of whether to show the option for FILE_CAPABILITIES can be a bit cleaner. Pavel, do you have any experience with what sorts of rules a typical customer would get? I wasn't sure whether it was worth having some sort of hash ...
For lower overall complexity, why not just extend the capability LSM to include the devcgroup_ perms if CONFIG_CGROUP_DEV ? - James -- James Morris <jmorris@namei.org> --
That does make for a simpler implementation at this point, however if any other such LSMs come along (as Casey seemed to think they would) the end result could end up being horrible spaghetti code of dependencies and interrelated configs. But OTOH we went years with no such changes, so that's probably not a particularly practical concern unless someone can cite specific upcoming examples. So if noone objects I'll try that approach. thanks, -serge --
That can be addressed as the need arises. A basic tenet of kernel -- James Morris <jmorris@namei.org> --
True, but while this change simplifies the code a bit, the semantics seem more muddled - devcg will be enforcing when CONFIG_CGROUP_DEV=y and: SECURITY=n or rootplug is enabled capabilities is enabled smack is enabled selinux+capabilities is enabled It will not be enforcing when dummy is loaded only selinux (and not capabilities) is loaded If that's ok with people then I'm fine with it. I suppose it should be explained in the CONFIG_CGROUP_DEV help section, which it isn't in this version I'm about to set. Patch hitting the wire in a minute. thanks, --
Well, this is how real systems are going to be deployed. It becomes confusing, IMHO, if you have to change which secondary LSM you stack with SELinux to enable a cgroup feature. -- James Morris <jmorris@namei.org> --
So you're saying selinux without capabilities should still be able to use dev_cgroup? (Just making sure I understand right) -serge --
Yes. All Fedora, RHEL, CentOS etc. ship with SELinux+capabilities. I can't Nope, SELinux always stacks with capabilities, so havng the cgroup hooks in capabilities makes sense (rather than having us change the secondary stacking LSM just to enable a feature). -- James Morris <jmorris@namei.org> --
Oh, ok. Will let the patch stand until Pavel and Greg comment then. thanks, -serge --
My main question was why was that file in the kernel/ directory? Shouldn't that also be in the security/ directory? And to be honest, I didn't really look at it at all other than the diffstat to make sure you weren't messing with the kobj_map stuff anymore :) thanks, greg k-h --
I'm using cgroups to track the tasks which should have their device --
Ah. Guess it could all go under security/. Should it still go there even if --
Sure it can - security/ is in obj-y regardless of whether the --
There is the precedent of the security/keys directory (security-related, but not using LSM - aside from calling LSM hooks for access checks and labeling of keys). -- Stephen Smalley National Security Agency --
Well, I saw your previous patch, that was implemented as just another LSM module and I liked it except for the LSM dependency. --
James and Stephen agree with your LSM qualms. I suppose we could add cgroups next to the lsm hooks. I suspect Paul Menage would complain about that (Paul?), and I do think it's silly as they are security questions, not group tracking questions, but if it's what people want In an earlier version I asked whether you had any experience with usual # rules per container. Do you have an idea? Right now the whitelist is a straight list we search through linearly. If # rules is generally tiny then I'm inclined to keep it that way... thanks, -serge --
Depends on what you mean by "add cgroups to the LSM hooks". Could you expand on that? Paul --
cgroup hooks next to the lsm hooks. So in fs/namei.c where there are security_inode_permission() hooks, there would also be cgroup_inode_permission() hooks to let the devices cgroup mediate the access. Well, in permission(), probably not in exec_permission_lite() since that's probalby not a device access :) So far it looks like everyone likes that, so as long as you don't nack I guess that'll be the way to go. thanks, -serge --
This would just be a device cgroup-specific thing, right? Nothing to do with the generic framework? If so, then that sounds fine (to me, at least). Paul --
The way I see this is: cgroups provide a common way to group tasks and an API for general configuration - that's the controller "face", and it's up to the controller to decide where he turns his "back", IOW where the hooks are placed. For the memory controller - they are injected directly into the mm code. For this controller, I think it The # of rules usually has a linear dependency on the number of containers (each of then has to have an access to /dev/null,zero,random at least), so having 100 containers we will have to scan through a 300-entries list. I'd vote for a hash table or a radix/binary/rb tree for that. Or any other way --
Oh no, the rules are stored per-container, so it sounds like you're I'm fine with that, but not for 3 rules :) -serge --
So am I :) Anyway - if someday this will grow up to tens of entries turning --
That's what I was getting at. When the next feature comes along are we going to stuff it into capabilities, too? Maybe we'll cram it into audit or CIPSO instead, but how long can this go on? Eventually we need a mechanism that allows more or less general mix-and-match, maybe with a few rules like "don't mix plaids and stripes" to keep things sane or these lesser facilities have no chance. Seems like we're still making LSM too hard to use. Unless I take an aggressive approach to adding them to Smack. Hmm, might be a way to make a buck or two that way. (Only kidding)(Well, mostly only kidding) And yes, I understand that there are still those about who don't like LSM in any form, much less in a useful one. Casey Schaufler casey@schaufler-ca.com --
Maybe you should follow up the tree to ensure that all parent groups have access to the device too? Or alternatively, cache the results of But this isn't necessarily crossing namespaces. It could be used for device control in the same namespace (e.g. allowing a job to access a raw disk for its data storage rather than going through the filesystem). Paul --
Yes, I considered that. Alternatively additions to a parent cgroup's .deny could be propagated to all its descendents (but not additions to the .allow). To prevent it escaping to laxer device permissions, which of course only Yeah it should be renamed. I want to use the same cap which we would use for user namespaces though. CAP_NS_CONT(ainer)? Even though there really is no such thing as a 'container'. But that would tie together any such privileges for cgroups and namespaces. thanks, -serge --
That makes it impossible for a root process to enter a child cgroup, do something, and then go back to its own cgroup. Why aren't the existing cgroup security semantics sufficient? Paul --
Yes, but it can fire off a child in the child cgroup to do something, Because the point of this is to provide some restrictions to otherwise privileged users, and cgroups only provides dac-based permissions. But that doesn't mean that I'm not doing too much. I could just add a CAP_SYS_ADMIN or CAP_CONT_OVERRIDE+CAP_SYS_ADMIN check, and not restrict which cgroups a task can move to. Does that sound good? -serge --
In keeping with the naming convention for control groups, "devices" would be better than "devcg". Paul --
